<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enabling Self-Service BI on Document Stores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed L. Chouder</string-name>
          <email>m_chouder@esi.dz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Rizzi</string-name>
          <email>stefano.rizzi@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachid Chalal</string-name>
          <email>r_chalal@esi.dz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISI, University of Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LMCS, ESI</institution>
          ,
          <addr-line>Algiers</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The growing use of document stores has resulted in vast amounts of semi-structured data holding precious information, which could be profitably integrated into existing BI systems. Unfortunately, due to their schemaless nature, document stores are hardly accessible via direct OLAP querying. In this paper we propose an interactive, schema-on-read approach for finding multidimensional structures in document stores aimed at enabling OLAP querying in the context of self-service BI. Our approach works in three steps: multidimensional enrichment, querying, and OLAP enabling; the validation of user queries and the detection of multidimensional structures is based on the mining of approximate functional dependencies from data. The efficiency of our approach is discussed with reference to real datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Over the past decade, companies have been adopting NoSQL
databases to deal with the huge volumes of data
manipulated by modern applications. NoSQL systems have emerged
as an alternative to relational database management
systems in many implementations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They can be
classified based on their data model, the most popular categories
being key-value, wide-column, graph-based, and
document-oriented; document-oriented databases (briefly, document
stores) are probably the most widely adopted so far. The
common features of document stores are horizontal
scalability on commodity hardware and the lack of an explicit
schema (according to the "data first, schema later or never"
paradigm). This schemaless nature provides a flexible data
model that has attracted developers seeking to avoid the
restrictions posed by the relational model.
      </p>
      <p>Document stores usually collect nested, denormalized, and
hierarchical documents in the form of collections; the
documents within the same collection may present a structural
variety due to schema flexibility and data evolution.
Documents are self-describing and mainly encoded using the
semi-structured data format JSON (JavaScript Object
Notation), which is also popular as an exchange format and
widely adopted in modern web APIs.</p>
      <p>
        The growing use of document stores and the dominance
of JSON have resulted in vast amounts of semi-structured
data holding precious information, which could be profitably
integrated into existing business intelligence (BI) systems
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Indeed, multidimensional modeling and analysis have
been recognized to be an effective way for conducting
analytics over big NoSQL data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Unfortunately, due to their
schemaless nature, document stores are hardly accessible via
direct OLAP querying. Recent efforts in this area propose
SQL-like query interfaces (e.g., Apache Drill, Spark SQL,
Presto), which provide access to schemaless and nested data
while offering the possibility of using traditional BI tools.
However, none of these solutions supports multidimensional
querying or OLAP over document stores.
      </p>
      <p>
        In this paper we propose an interactive, schema-on-read
approach for finding multidimensional structures in
document stores aimed at enabling OLAP querying in the
context of self-service BI. Differently from a schema-on-write
approach, which would force a fixed structure on data and
load them into a data warehouse, a schema-on-read
approach leaves data unchanged in their structure until they
are accessed by the user [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. We claim that, when querying
document stores for self-service BI, a schema-on-read
approach should be preferred to a schema-on-write approach.
The main reason is that document stores handle high
volumes of data with varied structure, which questions, on
the one hand, the ability of traditional ETL to process such
data and, on the other hand, the possibility of storing them
into a data warehouse with a fixed schema.
      </p>
      <p>To discover the information necessary for identifying
multidimensional concepts despite the lack of structure, we
resort to data distributions, i.e., we mine approximate
functional dependencies (AFDs) and use them to automate the
design process. More specifically, our approach takes as
input a collection of nested documents and operates in three
phases.
1. Multidimensional Enrichment. In this phase, a
schema is extracted from the collection and enriched
with some basic multidimensional knowledge. This is
done by tentatively classifying attributes into
dimensions and measures, and by relating each measure to
the subset of dimensions that determine its
granularity, so as to create a draft multidimensional schema.
2. Querying. Now the user can formulate a
multidimensional query by picking dimensions and measures
of interest. This query is checked by accessing data to
ensure its multidimensional validity and correct
summarization. If the query is found to be well-formed, it
is executed. Otherwise, the multidimensional schema
is refined and some possible querying alternatives are
proposed to the user.
3. OLAP Enabling. This iterative phase enables the
user to further explore data by running an OLAP
session. To this end, local portions of multidimensional
hierarchies (consisting of roll-up and drill-down
relationships for each level involved in the user query) are
built by mining AFDs from data. The user can now
apply an OLAP operator (e.g., roll-up or drill-down)
to iteratively create a new query on the collection.
The advantages of the proposed approach are twofold. First,
it enables decision makers to formulate multidimensional
queries and create reports on-the-fly on document stores in
a self-service fashion, i.e., without any support from ICT
people. This reduces the time and effort spent in traditional
BI settings. Second, our approach also works when nested
structures are not present within the input data (e.g., for
flat documents), which is relevant because a large number
of non-nested datasets are available in JSON format on the
web (e.g., more than 11000 collections on www.data.gov).
The approach has been implemented on top of MongoDB,
one of the most popular document stores.</p>
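      <p>The AFD mining that underpins phases 1 and 3 can be sketched as follows. This is a minimal illustration with names of our own choosing, assuming the common g3-style error measure (the fraction of records that would have to be discarded for the dependency to hold exactly); it is not the exact algorithm of the implementation:</p>
      <preformat>
```python
from collections import Counter, defaultdict
from operator import le  # le(a, b) means "a is at most b"

def afd_error(records, lhs, rhs):
    """g3-style error of the dependency lhs -&gt; rhs: the minimum fraction of
    records to discard so that each lhs value determines a single rhs value."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[rec[lhs]][rec[rhs]] += 1
    # For each lhs value, keep only the records agreeing on its most frequent rhs.
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return 1.0 - kept / len(records)

def holds_approximately(records, lhs, rhs, threshold=0.05):
    # Accept the dependency as an AFD when its error is at most the threshold.
    return le(afd_error(records, lhs, rhs), threshold)
```
      </preformat>
      <p>For instance, over slightly noisy documents a dependency such as city to country is still accepted as an AFD at a permissive threshold, while an exact FD check would reject it.</p>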
      <p>The rest of this paper is organized as follows. Section 2
discusses the related work and Section 3 presents document
stores along with the working example. Our approach is
detailed in Section 4 and experimentally evaluated in
Section 5. Section 6 concludes the paper and gives some future
directions for research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Supply-driven multidimensional design. The
automation of multidimensional modeling has been widely
explored in the area of data warehouse design. Supply-driven
approaches automate the discovery of multidimensional
concepts by a thorough analysis of the source data. The first
approaches proposed algorithms to build multidimensional
schemata starting from Entity/Relationship diagrams or
relational schemata [
        <xref ref-type="bibr" rid="ref12 ref19 ref8">8, 19, 12</xref>
        ], while further approaches
considered more expressive conceptual sources such as UML
class diagrams [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and ontologies [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Furthermore, [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
discussed the use of FDs for automating multidimensional
modeling from ontologies. In particular, similarly to the
querying phase of our approach, [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] proposes an algorithm
to check the multidimensional validity of an SQL cube query
and to derive the underlying multidimensional schema. This
algorithm identifies multidimensional concepts from the
structure of the query itself and by following foreign keys in the
relational schema.
      </p>
      <p>In these approaches, multidimensional modeling is done at
design time, mainly based on FDs expressed in the source
schemata as foreign keys or many-to-one relationships.
Conversely, our approach is meant to be used at query time
and automates the discovery of multidimensional concepts
by mining FDs from data, as required by the schemaless
context.</p>
      <p>
        Multidimensional design from non-relational data.
Our approach closely relates to previous approaches for
multidimensional modeling from semi-structured data [
        <xref ref-type="bibr" rid="ref13 ref28 ref9">9, 13, 28</xref>
        ]
in XML format, which is similar to JSON. These approaches
take as input DTDs or XML schemata that provide rich
information about XML documents (e.g., multiplicities, data
types), so they cannot operate directly on XML data not
having a schema specification. In particular, the work in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
builds multidimensional schemata starting from an XML
schema but, in some cases, data is accessed to discover FDs
that are not expressed in the schema. Similarly, starry vault
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a recent approach that mines FDs for multidimensional
modeling from data vaults. These are databases
characterized by a specific data model tailored to provide historical
storage in presence of schema evolutions. The main idea is
to mine approximate and temporal FDs to cope with the
issue of noisy and time-varying data, which may result in
hidden FDs.
      </p>
      <p>All the above-mentioned approaches are similar in that
they define multidimensional schemata at design time using
structural and additional information extracted from data
(i.e., FDs). In contrast, our approach operates at query
time and mostly relies on data distributions, while structural
information is only used (when available) to intelligently
reduce the search space.</p>
      <p>
        OLAP on linked data. Recent works in this space
propose to directly perform OLAP-like analysis over semantic
web data. Exploratory OLAP has been defined as the
process that discovers, collects, integrates, and analyzes these
external data on the fly [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Some works in this direction
address the problem of building multidimensional hierarchies
from linked data [
        <xref ref-type="bibr" rid="ref21 ref27">21, 27</xref>
        ], which is closely related to the
third phase of our approach. Specifically, iMOLD [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is
an instance-based approach that operates at query time for
exploratory OLAP on linked data; it aims at finding
multidimensional patterns representing roll-up relationships, in
order to enable users to extend corporate data cubes with
new hierarchies extracted from linked data. Similarly, [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
proposes an algorithm for discovering hierarchies from
dimensions present in statistical linked data represented
using the RDF Data Cube (QB) vocabulary (www.w3.org/TR/
vocab-data-cube/). Like starry vault, this algorithm mines
AFDs (there called quasi-FDs) to cope with the presence
of imperfect and partial data.
      </p>
      <p>The algorithm that we propose for building hierarchies
differs from these works in that, to ensure good performance,
it looks for local portions of hierarchies for the levels involved
in the user queries, thus acting in an on-demand fashion.</p>
      <p>SQL on schemaless data. Several solutions have emerged
in the industry to enable SQL querying of schemaless data
for analytic purposes. These solutions can be classified into
two categories: (1) relational systems enabling the storage
and management of schemaless data, and (2) systems
designed as an SQL interface for schemaless data.</p>
      <p>
        Solutions in the first category include RDBMSs that
support storage and querying of schemaless data to be used
along with relational data in one system [
        <xref ref-type="bibr" rid="ref17 ref5">5, 17</xref>
        ] (e.g.,
Oracle, IBM DB2, and PostgreSQL). Current systems do not
impose a fixed relational schema to store and query data,
but derive and maintain a dynamic schema to be used for
schema-on-read SQL/JSON querying instead [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Other
solutions that fall into this category are virtual adapters that
expose a relational view of the data in a document store, to
be used in common BI tools; an example is the MongoDB
BI connector (www.mongodb.com/products/bi-connector).
      </p>
      <p>
        The second line of solutions comprises SQL-like interfaces
that offer extensions to query schemaless data persisted in
document stores or as JSON files (e.g., in Hadoop). Spark
SQL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] has been designed for relational data processing
of native Spark distributed datasets in addition to diverse
data sources and formats (including JSON). In order to run
SQL queries on schemaless data in Spark SQL, a schema
is automatically inferred beforehand. Apache Drill (drill.
apache.org) is a distributed SQL engine that can join data
from multiple data sources, including Hadoop and NoSQL
databases. It has built-in support for JSON with
columnar in-memory execution. Drill is also able to dynamically
discover a schema during query processing, which allows it
to handle schemaless data without defining a schema upfront.
Finally, Presto (prestodb.io) is a distributed SQL engine,
developed at Facebook for interactive queries, that can also
combine data from multiple kinds of data sources,
including relational and non-relational databases. When querying
schemaless data, Presto tries to discover data types, but it
may need a schema to be defined manually in order to run
queries. All the above-mentioned engines differ in their
architecture, support for ANSI SQL, and extensions to SQL to
handle schemaless and nested data. These engines can
connect to MongoDB, which gave us various architectural
options for implementation. However, none of them supports
multidimensional or OLAP querying over document stores.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. DOCUMENT STORES</title>
      <p>
        A document store holds a set of databases, each organizing
the storage of documents in the form of collections in a way
similar to rows and tables in relational databases. A
document, also called object, consists of a set of key-value pairs
(or elds) mainly encoded in JSON format (www.json.org).
JSON has many variants that are used for storage and
optimization purposes such as BSON (www.bsonspec.org) in
MongoDB and, very recently, OSON in the Oracle DBMS
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Keys are always strings, while values can have the
following types: primitive (number, string, Boolean), object (or
sub-document), and array of atomic values or objects.
      </p>
      <p>
        From a conceptual point of view, a typical collection
consists of a set of business entities connected through two kinds
of relationships: nesting and references. Nesting consists of
embedding objects arbitrarily within other objects, which
corresponds to two types of relationships: to-one in the case
of a nested object and to-many in the case of an array of
nested objects. On the other hand, references are similar
to foreign keys in relational databases and can be expressed
manually or using a specific mechanism such as $ref in
MongoDB. However, when references are used, getting data
requires joins, which are not supported in document stores; so,
nesting is most often used instead. Since a document store
has a flexible schema, there are a variety of ways to represent
data; nevertheless, the way data is modeled may
affect performance and data retrieval patterns.
<preformat>
games:
{ "_id" : "52f2a70eddbd75540aba7c06",
  "date" : ISODate("1989-03-27"),
  "teams" :
  [ { "name" : "Detroit Pistons",
      "abbreviation" : "DET",
      "score" : 90,
      "home" : true,
      "won" : 1,
      "results" : { "points" : 90, ... },
      "players" :
      [ { "player" : "Isiah Thomas",
          "points" : 30, ... },
        ... ] },
    { "name" : "Dallas Mavericks",
      "abbreviation" : "DAL",
      "score" : 77,
      "home" : false,
      "won" : 0,
      "results" : { "points" : 77, ... },
      "players" :
      [ { "player" : "Rolando Blackman",
          "points" : 23, ... },
        ... ] } ] }
(a)

games:
{ "_id" : ...,
  "date" : ...,
  "hometeam" : { ... },
  "guestteam" :
  { "name" : "Dallas Mavericks",
    "abbreviation" : "DAL",
    "score" : 77,
    "home" : false,
    "won" : 0,
    "results" : { "points" : 77, ... },
    "players" :
    [ { "player" : "Rolando Blackman",
        "points" : 23, ... },
      ... ] } }
(b)
</preformat>
Example 1. In the scope of this paper, we focus on
collections of JSON documents that logically represent the same
business entities, while expecting that their structure may
vary. In particular, we will use as a working example a
collection that models NBA games (adapted from [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]), where a
document represents one game between two teams along with
players and team results. In the sample document shown in
Figure 1a the main object is games, which holds the array
of objects teams (to-many), which in turn holds the object
results (to-one) and the array of objects players (to-many).
Figure 1b shows an alternative JSON representation of the
same domain; the relationship between game and teams is
modeled using two separate sub-documents (home and guest
team) instead of an array. Note that a simple query that
computes the average score by team would be much easier to
implement using the first representation, and it would also
perform better, since in the second representation it would
require two separate queries.
      </p>
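      <p>To make the difference concrete, the average-score query can be sketched in plain Python standing in for a MongoDB aggregation (the documents are abridged versions of those in Figure 1, with a second game added, and the function names are ours). With representation (a) a single unwind of the teams array suffices, corresponding to an $unwind plus $group pipeline, while representation (b) forces two traversals whose partial results must be unioned:</p>
      <preformat>
```python
from collections import defaultdict

# Abridged documents in the representation of Figure 1a (array of teams).
games_a = [
    {"date": "1989-03-27",
     "teams": [{"name": "Detroit Pistons", "score": 90},
               {"name": "Dallas Mavericks", "score": 77}]},
    {"date": "1989-03-29",
     "teams": [{"name": "Detroit Pistons", "score": 102},
               {"name": "Boston Celtics", "score": 95}]},
]

def avg_score_by_team_array(games):
    # One pass: unwind the teams array and group by team name.
    totals = defaultdict(lambda: [0, 0])
    for game in games:
        for team in game["teams"]:
            totals[team["name"]][0] += team["score"]
            totals[team["name"]][1] += 1
    return {name: s / n for name, (s, n) in totals.items()}

def avg_score_by_team_split(games):
    # Representation of Figure 1b: both sub-documents must be visited
    # and their partial results unioned.
    totals = defaultdict(lambda: [0, 0])
    for game in games:
        for key in ("hometeam", "guestteam"):
            team = game[key]
            totals[team["name"]][0] += team["score"]
            totals[team["name"]][1] += 1
    return {name: s / n for name, (s, n) in totals.items()}
```
      </preformat>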
    </sec>
    <sec id="sec-4">
      <title>4. APPROACH</title>
      <p>As already discussed, our approach for enabling self-service
BI on document stores includes three phases:
multidimensional enrichment, querying, and OLAP enabling; these phases
are discussed in detail in the following subsections.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Multidimensional Enrichment</title>
      <p>
        The aim of this phase is to define a draft multidimensional
schema on which the user will be able to formulate queries.
To this end, we extract the schema of the input collection
and enrich it with basic multidimensional knowledge.
Several works have addressed schema extraction and discovery
from document stores [
        <xref ref-type="bibr" rid="ref14 ref25 ref29">25, 29, 14</xref>
        ]. These works focus on the
structural variety of documents within the same collection
caused by their schemaless nature and by the evolution of
data; for instance, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposes to extract a JSON schema
(www.json-schema.org) similar to a JSON document and
infer attribute and type occurrence frequencies to
emphasize the structural variety. In [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], all the schema versions
in a collection are extracted and stored in a repository to be
used within a schema management framework. In order to
have a single view of the distinct schemata in a collection,
the authors propose a relaxed form of schema, called
skeleton, that better captures the core and prominent fields and
filters out infrequent ones. This would be useful for us since
we rely on a single view of the collection; however, the
computational cost for building skeletons is high, making them
unsuitable in real-time contexts.
      </p>
      <p>
        Since schema discovery is not the focus of this paper, we
adopt a simple algorithm that builds a tree-like schema
including all the fields that appear in the documents [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
algorithm works in a single pass over the data to infer a tree
structure of fields and their corresponding types (fields with
varying types are generalized to String).
      </p>
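      <p>A single-pass extractor in this spirit can be sketched as follows (an illustrative reconstruction rather than the exact cited algorithm; arrays of atomic values and null values are ignored for brevity):</p>
      <preformat>
```python
def infer_cschema(documents):
    """Single pass over the collection: merge every document's fields into
    one tree mapping each key either to a type name or to a nested sub-tree."""
    schema = {}
    for doc in documents:
        merge_into(schema, doc)
    return schema

def merge_into(node, obj):
    for key, value in obj.items():
        if isinstance(value, dict):            # nested object: recurse
            sub = node.setdefault(key, {})
            if isinstance(sub, dict):
                merge_into(sub, value)
            else:                              # previously seen as primitive
                node[key] = "String"
        elif isinstance(value, list):          # array: merge its nested objects
            sub = node.setdefault(key, {})
            if isinstance(sub, dict):
                for item in value:
                    if isinstance(item, dict):
                        merge_into(sub, item)
        else:                                  # primitive value
            type_name = type(value).__name__
            if node.setdefault(key, type_name) != type_name:
                node[key] = "String"           # conflicting types: generalize
```
      </preformat>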
      <p>The tree-like collection schema that will be the input for
the subsequent steps of our approach is defined as follows.</p>
      <p>Definition 1 (Collection Schema). Given a
collection D, its schema (briefly, c-schema) is a tree S = (K, E)
where:</p>
      <sec id="sec-5-1">
        <title>K is the union of the sets of keys appearing in the documents of D;</title>
        <p>r ∈ K is the distinguished root node and is labelled with
the collection name;</p>
        <sec id="sec-5-1-1">
          <title>E is a set of arcs that includes (i) an arc r → k for</title>
          <p>each key/value pair in the root r, and (ii) an arc k1 →
k2 iff k2 appears as a key in an array of nested objects
having key k1.</p>
          <p>Figure 2 shows the c-schema to which the document of
Figure 1a conforms. A path from the root key to a leaf
key is called an attribute. In denoting attributes we use
the dot notation, omitting the root key (since a single
collection is involved) but including the keys corresponding to
nested objects; so, for instance, the c-schema in Figure 2
contains attributes id, date, teams.name, teams.results.points,
teams.players.player, teams.players.points, etc. The depth of
an attribute a is the number of arcs included in its path,
and is denoted by depth(a) (e.g., depth(id) = 1); the set of
attributes at depth ℓ is denoted by Attrs(ℓ).</p>
          <p>[Figure 2: the c-schema of the games collection. Root games with children id, date, and teams; teams with children name, abbreviation, home, score, results.points, and players; players with children player and points.]</p>
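          <p>Given the extracted tree, the attributes of Definition 1 and their depths can be enumerated with a short sketch (names are ours; we assume the c-schema is held as a nested dictionary in which only arrays of nested objects appear as sub-trees, while to-one nested objects have been flattened into dotted leaf keys such as results.points, consistently with Figure 2):</p>
          <preformat>
```python
def attributes(schema, prefix="", depth=1):
    """Map each attribute (its dotted path, root key omitted) to its depth.
    Assumes to-one nested objects were flattened into dotted leaf keys, so
    only arrays of nested objects open a deeper level (one more arc)."""
    result = {}
    for key, value in schema.items():
        if prefix:
            path = prefix + "." + key
        else:
            path = key
        if isinstance(value, dict):   # an array of nested objects
            result.update(attributes(value, path, depth + 1))
        else:                         # a leaf field with a primitive type
            result[path] = depth
    return result
```
          </preformat>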
          <p>After the c-schema S of a collection has been defined, a
corresponding multidimensional schema has to be derived
from it. Since measures at different granularities can
possibly be included in the documents (e.g., in our example,
points at the player level and score at the team level), the
definition we provide relates each single measure to a specific
set of dimensions.</p>
          <p>Definition 2 (Md-Schema). A multidimensional
schema (briefly, md-schema) is a triple M = (H, M, g)
where:</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>H is a finite set of hierarchies; each hierarchy hi ∈ H</title>
          <p>is associated with a set Li of categorical levels and a
roll-up partial order ⪰i on Li. Each element G ∈ 2^L, where
L = ∪i Li and at most one level in G is taken from
each hierarchy hi, is called a group-by set of H.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>M is a finite set of numerical measures.</title>
        <p>g is a function relating each measure in M to its
dimensions, i.e., to the set of levels that determine its
granularity: g : M → G, where G is the set of all
group-by sets of H.</p>
        <p>At first, a draft md-schema Mdraft is built from S by
tentatively labelling all attributes of types date, string, and
Boolean as dimensions (i.e., as levels of different hierarchies),
and all attributes of types integer and decimal as measures.
Date dimensions are then decomposed into different
categorical levels giving rise to standard temporal hierarchies (e.g.,
date ⪰ month ⪰ year). Note that no further attempt to
discover hierarchies is made at this stage, since discovering the
FDs involving all attributes would be computationally too
expensive. As a consequence, in Mdraft two levels l and l′
might be erroneously modeled as two separate dimensions,
while they should actually be part of the same hierarchy
because l ⪰ l′.</p>
        <p>The user can contribute to this step by manually changing
the label of some attributes, since in some cases a numeric
attribute can be used as a dimension, and a non-numeric
attribute can be used as a measure. The first situation has
no impact on the following steps, since a group-by set can
also include numeric attributes. Conversely, the second
situation requires that the values of the non-numeric attribute
be transformed into numerical values so as to be supported
by aggregation functions.</p>
        <p>
          To complete the definition of Mdraft, the granularity
mapping g must be built. We recall from [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] that an md-schema
should comply with the multidimensional space arrangement
constraint, stating that each instance of a measure is related
to one instance of each dimension. In the c-schema S,
attributes in Attrs(ℓ) are related by a to-many multiplicity to
those in Attrs(ℓ+1), so a measure cannot be related to
levels having a higher depth. Therefore, g(m) is set by
connecting each measure m to the group-by set G that contains
all levels l ∈ L such that depth(m) ≥ depth(l).
        </p>
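        <p>Following this rule, the granularity mapping can be sketched directly from the attribute depths (illustrative names; we assume attributes have already been split into dimensions and measures by type):</p>
        <preformat>
```python
from operator import ge  # ge(a, b) means "a is at least b"

def granularity_mapping(measures, dimensions, depth):
    """Relate each measure m to the group-by set containing every level l
    whose depth does not exceed depth(m)."""
    g = {}
    for m in measures:
        g[m] = {l for l in dimensions if ge(depth[m], depth[l])}
    return g
```
        </preformat>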
        <p>Example 2. In this example and in the following ones
we will refer to the representation in Figure 1a. Attribute
teams.score is numerical, so it is assumed to be a
measure. It is depth(teams.score) = 2, therefore teams.score
is tentatively associated in Mdraft with group-by set G =
{teams.name, teams.abbreviation, teams.home, id, date}. All
the measures that have the same depth as teams.score
(e.g., teams.won and teams.results.points) are also
associated with G. Similarly, measure teams.players.points
has depth 3, so it is associated with group-by set
G′ = {teams.players.player, teams.name, teams.abbreviation,
teams.home, id, date}.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4.2 Querying</title>
      <p>This phase is aimed at supporting a non-expert user in
formulating a well-formed multidimensional query. In a
schema-on-write approach, queries are formulated on a complete
md-schema whose correctness is ensured by the designer.
Conversely, in a schema-on-read approach (our case), the
md-schema is defined at read time by the query itself. Though
this approach gives more flexibility, because each user can
"force" her own multidimensional view onto the data, it
requires a further check to ensure that the FDs implied by the
query are not contradicted by data; so, querying becomes
an iterative process where the underlying md-schema is
progressively refined together with the query.</p>
      <p>Definition 3 (Md-Query). A multidimensional query
(briefly, md-query) q on md-schema M = (H, M, g) is a
triple q = (Gq, Mq, α) where</p>
      <sec id="sec-6-1">
        <title>Gq ∈ G is the md-query group-by set;</title>
      </sec>
      <sec id="sec-6-2">
        <title>Mq  M is the set of required measures;</title>
        <p>α is a function that associates each measure m ∈ Mq
with an aggregation operator.</p>
        <p>
          Starting from the draft md-schema Mdraft, the user
formulates an md-query q by choosing one to three dimensions
(Gq), one or more measures of interest (Mq), and an
aggregation operator for each measure. However, to be
considered well-formed, q (and the md-schema q is
formulated on) should comply with the following constraints [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]:
(1) The base integrity constraint, stating that the levels
in the group-by set are orthogonal, i.e., functionally
independent of each other.
(2) The summarization integrity constraint, which requires
disjointness (the measure instances to be aggregated
are partitioned by the group-by instances),
completeness (the union of these partitions constitutes the
entire set), and compatibility (the aggregation operator
chosen for each measure is compatible with the type
of that measure) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>How to carry out these validity checks is discussed in the
following subsections.</p>
        <sec id="sec-6-2-1">
          <title>Example 3</title>
          <p>A possible md-query on the draft md-schema described in Example 2 is the one characterized by Gq = {teams.players.player, date}, Mq = {teams.score}, and Sum as the aggregation operator. Unfortunately, aggregating on teams.score would result in double counting, since each instance of teams.score is related to multiple instances of teams.players.player (a team has several players). So, this query violates disjointness and is not well-formed.</p>
        </sec>
        <sec id="sec-6-2-2">
          <p>The pseudo-code of the querying phase is sketched in
Algorithm 1. It starts from the md-query q and takes as input
both the c-schema and the md-schema. The output of the
algorithm is a valid md-query, based on a refined md-schema,
to be used in the next phase. Algorithm 1 works as
follows. Firstly, procedure CheckMeasures drops from Mq the
measures, if any, that are not compatible with Gq based on
their granularity (Line 1). If no measure is left, some
alternative (compatible) measures are suggested by
RecommendMeasures (Lines 2-5). If some measures Mrec are found,
the user interacts to approve their inclusion in the md-query
(Line 5). Then, if the md-query group-by set includes either
2 or 3 levels (Lines 7-11), procedure CheckGroupBy is called
to drop from Gq the levels, if any, that are not compliant
with the base integrity constraint and to update M
accordingly (Line 8). If just one level is left in Gq, procedure
RecommendLevels is called to look for additional group-by
levels (Line 10) and possibly include them in the md-query
(Line 11). Finally, procedure CheckSummarization checks
the aggregation operators (Line 12). We allow the resulting
md-query to be executed with a single group-by level, but
not without measures (Lines 6 and 13); in the latter case,
the user is asked to choose a new group-by set (Line 15).</p>
          <p>4.2.1 Checking Measures</p>
          <p>The goal of this procedure is to ensure that all the
measures in Mq can be correctly aggregated at group-by set Gq;
specifically, our goal is to avoid that for some m ∈ Mq there
is a to-many relationship between m and a level l ∈ Gq,
because this would violate disjointness and lead to double
counting when executing q. The measure check is based on the
g component of M, which expresses the granularity of each
measure.</p>
          <p>To explain how this is done, we observe that the roll-up partial orders on the hierarchies H of M induce a partial order ⪰H on the set G of the group-by sets of H, defined as the product order of the single roll-up orders (see Footnote 1). Measure m is compatible with Gq iff g(m) ⪰H Gq; indeed, if this condition is satisfied, the granularity expressed by the group-by set is coarser than the one at which m is defined, meaning that m can be safely aggregated at Gq. The measures that are found not to be compatible with Gq are removed from Mq.</p>
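<p>To make the compatibility test concrete, here is a minimal Python sketch (not the paper's implementation; the hierarchy contents are illustrative), where each hierarchy is given as a list of levels ordered from finest to coarsest:</p>

```python
# Illustrative sketch of the measure-compatibility test g(m) >=_H Gq,
# based on the product order of the per-hierarchy roll-up orders.
# Each hierarchy is a list of levels ordered from finest to coarsest.
HIERARCHIES = {
    "game": ["id", "date", "month", "year"],
    "team": ["teams.players.player", "teams.name"],
}

def is_compatible(granularity, group_by):
    """granularity and group_by map a hierarchy name to one of its levels.
    m is compatible with Gq iff, in every hierarchy constrained by Gq,
    the group-by level is at least as coarse as the measure's level."""
    for hierarchy, gb_level in group_by.items():
        m_level = granularity.get(hierarchy)
        if m_level is None:
            continue  # the measure poses no constraint on this hierarchy
        levels = HIERARCHIES[hierarchy]
        if levels.index(gb_level) < levels.index(m_level):
            return False  # Gq asks for a finer grain than m provides
    return True

# Example 4 revisited: teams.score is defined per game and per team,
# so it cannot be grouped by the finer teams.players.player level.
g_score = {"game": "id", "team": "teams.name"}
```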
          <p>Example 4. Considering again the query in Example 3, it is not the case that g(teams.score) ⪰H Gq (because teams.players.player is not in g(teams.score)). Then, q is not well-formed and teams.score is dropped from Mq. Footnote 1: The product order of n total orders is a partial order on the Cartesian product of the n totally ordered sets, such that (x1, …, xn) ⪰ (y1, …, yn) iff xi ⪰ yi for i = 1, …, n.</p>
          <p>
            This check is performed at the schema level, so some exceptions may arise when the relationship between a measure m and Gq is hidden in the data. This may happen when arrays are not used to model to-many relationships, or when m depends either on a level not present in Gq or on a subset of the levels in Gq [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. In these cases, m may be non-additive or even non-aggregable on Gq, so the user's knowledge of the application domain is required to manually fix summarization (see Section 4.2.5).
          </p>
          <p>4.2.2 Checking Group-by Set</p>
          <p>This process is aimed at ensuring completeness and base
integrity.</p>
          <p>
            The completeness condition is violated when, for some level l ∈ Gq, there is no instance corresponding to one or more instances of a measure in Mq; from a conceptual point of view, l is classified as optional [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. In a schema-on-write approach, this problem is fixed by aggregating all "dangling" measures into an ad-hoc group of l. Similarly, in our schema-on-read approach, these instances are grouped into a null instance of l, thus restoring the completeness condition; for instance, the $ifNull operator of MongoDB could be used to replace null values with an ad-hoc one when executing q.
          </p>
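<p>The null-instance repair can be sketched in plain Python as follows (illustrative only; the journal/cites attributes are made up for the example, and in MongoDB the same effect could be obtained with $ifNull at query time):</p>

```python
# Illustrative sketch of the completeness repair: documents where the
# group-by level is missing are sent to an ad-hoc "(null)" instance,
# so that no measure instance dangles.
from collections import defaultdict

def group_with_null_instance(documents, level, measure):
    groups = defaultdict(float)
    for doc in documents:
        groups[doc.get(level, "(null)")] += doc[measure]
    return dict(groups)

docs = [
    {"journal": "J1", "cites": 10},
    {"journal": "J1", "cites": 5},
    {"cites": 3},  # dangling: no journal value
]
```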
          <p>Base integrity requires that the levels in Gq are mutually orthogonal. So, from this point of view, md-query q is valid only if, for each pair of levels l, l' ∈ Gq, there is no FD between l and l', i.e., there is a many-to-many relationship between them. An FD is a to-one relationship, which we will denote with l → l' to emphasize that the values of l functionally determine the values of l'. Checking base integrity requires accessing the data in order to retrieve AFDs between the levels in Gq that are not related by a to-many multiplicity in the c-schema (see Section 4.3 for more details about AFDs).</p>
          <p>We recall that 1 ≤ |Gq| ≤ 3:</p>
          <p>1. If |Gq| = 1, orthogonality trivially holds.</p>
          <p>2. If |Gq| = 2, the relationship between the two levels in Gq is checked. If it turns out to be many-to-many (e.g., if Gq = {date, teams.name}), the base integrity constraint is met and both levels remain in Gq. If it turns out to be many-to-one (e.g., if Gq = {id, date}), then there is a roll-up relationship between the two levels in Gq. Only the level at the "many" side (id) remains in Gq, and the md-schema is modified by placing these levels in the same hierarchy (id ⪰ date). A special case is when the relationship is one-to-one (e.g., if Gq = {teams.name, teams.abbreviation}); here, one of the levels is kept in Gq while the other is considered as a descriptive attribute.</p>
          <p>
            3. If |Gq| = 3, there are three possibilities: (a) If many-to-many relationships are found between each pair of levels, then the constraint is met and all levels remain in Gq (e.g., if Gq = {date, teams.name, teams.players.player}). (b) Let many-to-many relationships be found between two pairs of levels (l, l'), (l, l''), and a many-to-one relationship be found between the remaining pair (l', l'') (e.g., if Gq = {teams.name, id, date}, since id → date). In this case, l' and l'' are placed in the same hierarchy in the md-schema (l' ⪰ l''), and l'' is removed from Gq. (c) Finally, when l' → l and l'' → l: (i) if the relationship between l' and l'' is many-to-many, they are both retained in Gq; conceptually, l is a convergence [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] and the md-schema must be updated by adding l' ⪰ l and l'' ⪰ l; (ii) if the relationship between l' and l'' is many-to-one, then the three levels in Gq belong to the same hierarchy (l' ⪰ l'' ⪰ l) and only the level with the highest cardinality, l', is kept in Gq.
          </p>
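<p>All the pairwise tests above reduce to classifying the multiplicity between two levels from the data, which can be sketched as follows (an illustrative Python fragment, not the paper's implementation):</p>

```python
# Illustrative sketch: classify the multiplicity between two levels
# from the data, as needed by the base-integrity check.
def classify(documents, l1, l2):
    pairs = {(d[l1], d[l2]) for d in documents}
    fd_12 = len({a for a, _ in pairs}) == len(pairs)  # l1 -> l2 holds
    fd_21 = len({b for _, b in pairs}) == len(pairs)  # l2 -> l1 holds
    if fd_12 and fd_21:
        return "one-to-one"     # one level becomes a descriptive attribute
    if fd_12:
        return "many-to-one"    # roll-up: l1 is the "many" side
    if fd_21:
        return "one-to-many"
    return "many-to-many"       # orthogonal: both levels stay in Gq

games = [
    {"id": 1, "date": "d1", "team": "A"},
    {"id": 1, "date": "d1", "team": "B"},
    {"id": 2, "date": "d1", "team": "A"},
]
```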
          <p>4.2.3 Recommending Measures</p>
          <p>When there is no compatible measure left in Mq, this procedure looks for alternative measures. Specifically, it returns all the measures in the md-schema that are compatible with Gq, i.e., all m ∈ M such that g(m) ⪰H Gq.</p>
        </sec>
        <sec id="sec-6-2-3">
          <title>Example 5</title>
          <p>With reference to Example 4, procedure RecommendMeasures returns teams.players.points, since g(teams.players.points) ⪰H Gq.</p>
          <p>4.2.4 Recommending Levels</p>
          <p>Additional group-by levels are recommended when the user had originally selected two or three group-by levels, but only one of them was left in Gq after checking the group-by set. For a given candidate level l, this requires checking that the base integrity constraint is met between l and some other level in M, and that the measure check is not violated. The first level found is proposed to the user, who can choose to use it or to keep looking for other levels (up to a maximum of three levels in the group-by set).</p>
          <p>Example 6. Let Gq = {id, date} and Mq = {teams.players.points}. The relationship between the levels in Gq is many-to-one, so date is removed. In this case, to recommend levels we may look for a many-to-many relationship between id and teams.name, teams.abbreviation, teams.home, and teams.players.player. Since teams.name does not violate the measure check and has a many-to-many relationship with id, it is recommended to the user as a possible group-by level.</p>
          <p>4.2.5 Checking Summarization</p>
          <p>
            The summarization condition states that measures cannot be aggregated along hierarchies using arbitrary aggregation operators. In principle, checking summarization would require knowing the measure category (flow, stock, or value-per-unit), the type of the group-by levels (temporal or non-temporal), and the aggregation operator (Sum, Avg, Min, Max, Count, etc.) [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
Knowing the measure category and the type of the group-by levels could be used to recommend some aggregation operators, but the user would still have to choose among a number of potential options; besides, correctly classifying a measure into flow, stock, or value-per-unit may be hard for non-skilled users. Therefore, we prefer to let users directly pick an aggregation operator for each measure in Mq.
          </p>
          <p>Summarization is violated when a measure m ∈ Mq has finer granularity than another measure m' ∈ Mq, since m would be double-counted when executing q. In this case, the user is warned that she will get erroneous results, and she may choose to change the aggregation operator for m (e.g., using Min, Max, or Avg instead of Sum will give the correct result) or to drop m from Mq.</p>
        </sec>
        <sec id="sec-6-2-4">
          <title>Example 7</title>
          <p>Let Mq = {teams.score, teams.players.pts}. These measures have different granularities, so if the user has chosen to aggregate teams.score (which has finer granularity) using Sum, she will get double counting. Then she can either change the aggregation operator for teams.score to Avg or drop teams.score from Mq.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4.3 OLAP Enabling</title>
      <p>The goal of this phase is to refine the md-schema by discovering some hierarchies, so that the user is enabled to interact with the data in an OLAP fashion. Completely building all the hierarchies would require mining all FDs between levels, which would be computationally too expensive. For this reason, we only build local portions of hierarchies for the levels in the group-by set of the previously-formulated md-query q. Besides, again for complexity reasons, we only mine simple FDs (i.e., those relating single attributes rather than attribute sets), which are the most common in multidimensional schemata.</p>
      <p>Specifically, the idea is to mine, for each level l ∈ Gq for which no roll-up relationships have been discovered yet, the FDs of either type l → l' (to enable a roll-up of l) or l' → l (to enable a drill-down of l). Then, if the user applies a roll-up or drill-down, a new md-query q' is formed and the process is iterated to further extend the hierarchies. Remarkably, using FDs to discover hierarchies guarantees that q' satisfies the conditions discussed earlier, specifically disjointness and completeness.</p>
      <p>
        FDs are not explicitly modeled in a document store, so we must resort to the data. Since document stores host large amounts of data, we can reasonably assume that the mined FDs are representative enough of the application domain. However, schemaless data commonly present errors and missing values which may hide some FDs. The tool we adopt to cope with this issue are approximate FDs (AFDs) [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], which "almost hold" on the data, instead of traditional ones. Unfortunately, this entails an approximate satisfaction of the disjointness and completeness conditions, possibly leading to non-disjoint and incomplete hierarchies.
      </p>
      <p>
        Definition 4 (Approximate FD). Given two levels l and l', let strength(l, l') denote the ratio between the number of unique values of l and the number of unique value pairs of (l, l'). We will say that AFD l ⇝ l' holds if strength(l, l') is not lower than a user-defined threshold [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <sec id="sec-7-1">
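<p>Definition 4 can be computed directly, as in the following minimal sketch (illustrative; the 0.95 threshold is an arbitrary example value, not one prescribed by the paper):</p>

```python
# Direct computation of Definition 4: strength(l, l') is the number of
# distinct l values over the number of distinct (l, l') pairs.
def strength(documents, l, l1):
    distinct_l = {d[l] for d in documents}
    distinct_pairs = {(d[l], d[l1]) for d in documents}
    return len(distinct_l) / len(distinct_pairs)

def afd_holds(documents, l, l1, threshold=0.95):
    return strength(documents, l, l1) >= threshold

# An exact FD gives strength 1.0; a single dirty document lowers it.
clean = [{"id": i, "date": "d%d" % (i // 2)} for i in range(20)]
dirty = clean + [{"id": 0, "date": "dX"}]
```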
        <p>At a given time during a user's session, let M be the current version of the md-schema and q be the last md-query formulated. Then, the search space for mining AFDs includes the levels that are not in Gq and are not involved in roll-up relationships in M. To reduce the search complexity we take into account the structural clues provided by the c-schema. Specifically, let l and l' be two levels such that depth(l) &lt; depth(l'); as already stated, the attributes at the depth of l are related by a to-many multiplicity to those at the depth of l', so l ⇝ l' cannot hold. For this reason, checking whether l and l' should be placed in the same hierarchy only requires checking whether l' ⇝ l. In addition, we avoid checking trivial and transitive AFDs, since we explore one hierarchy at a time. Finally, the result of all the AFD checks performed is stored in a meta-data repository, to be used to avoid checking the same AFDs twice.</p>
        <p>In the following subsections, roll-up and drill-down discovery are detailed.</p>
        <p>4.3.1 Roll-up Discovery</p>
        <p>Let l ∈ Gq; discovering possible roll-ups for l requires mining the AFDs of type l ⇝ l', with l' ∈ L∖Gq and depth(l') ≤ depth(l). Let R be the set of all right-hand sides for these AFDs. To avoid useless checks, the AFDs whose right-hand side l' has higher cardinality than l are not checked, since they clearly cannot hold. One exception is when |l| = |l'| or |l| ≈ |l'| (due to approximation): in this case both l' ⇝ l and l ⇝ l' must be checked, since the relationship between l and l' may be one-to-one (so that l' is a descriptive attribute for l). These bidirectional checks are made at the end of the process to prioritize the discovery of "true" hierarchies. The AFDs in R are checked until one is found to hold, say l ⇝ l'. Then, the md-schema M is updated by placing these levels in the same hierarchy (l ⪰ l') and the user is notified of the possibility of performing a roll-up from l.</p>
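<p>The candidate-pruning loop just described can be sketched as follows (illustrative Python; card and strength stand for count-distinct queries on the store, and the eps threshold is an example value):</p>

```python
# Illustrative sketch of the candidate-pruning loop of roll-up discovery.
def card(docs, a):
    return len({d[a] for d in docs})

def strength(docs, l, l1):
    return len({d[l] for d in docs}) / len({(d[l], d[l1]) for d in docs})

def discover_rollup(docs, l, candidates, eps=1.0):
    deferred = []
    for l1 in candidates:
        if card(docs, l1) > card(docs, l):
            continue                 # l ~> l1 cannot hold: skip the check
        if card(docs, l1) == card(docs, l):
            deferred.append(l1)      # possible one-to-one: check later
        elif strength(docs, l, l1) >= eps:
            return ("rollup", l1)    # place l above l1 in the md-schema
    for l1 in deferred:              # bidirectional checks, done last
        if strength(docs, l, l1) >= eps and strength(docs, l1, l) >= eps:
            return ("descriptive", l1)
    return None

games = [{"id": i, "date": "d%d" % (i // 2),
          "name": "t%d" % (i % 4), "abbr": "a%d" % (i % 4)} for i in range(8)]
```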
        <p>Example 8. Let Gq = {id, teams.name} and Mq = {teams.score, teams.won}. The search space for level id only includes date, since it is the only level for which depth(date) ≤ depth(id). Since |id| &gt; |date|, id ⇝ date is checked. This is found to be true, so the md-schema is updated with id ⪰ date. The search space for teams.name consists of teams.abbreviation and teams.home; by querying the data we find that |teams.name| = |teams.abbreviation| &gt; |teams.home|. Firstly, teams.name ⇝ teams.home is checked, and found not to hold. Then, teams.name ⇝ teams.abbreviation and teams.abbreviation ⇝ teams.name are checked; both are found to hold, giving rise to a descriptive attribute.</p>
        <p>4.3.2 Drill-Down Discovery</p>
        <p>Here the set of candidate levels for checking AFDs of type l' ⇝ l includes the levels in L∖Gq whose depth is greater than or equal to that of l, i.e., such that depth(l') ≥ depth(l). As in roll-up discovery, an AFD whose left-hand side l' has lower cardinality than l is not checked, since it cannot hold. If l' ⇝ l is found to hold, l' ⪰ l is added to M and the user is notified of the possibility of drilling down from l.</p>
        <p>Note that drilling down from l to l' could produce a new md-query q' whose group-by set, Gq' = (Gq ∖ {l}) ∪ {l'}, is not compatible with some of the measures in Mq' = Mq. Since checking compatibility is computationally cheaper than checking AFDs (as explained in Section 4.2.1, it can be done based on the partial order of group-by sets, without accessing the data), the AFD for a candidate level l' is checked only if Gq' is found to be compatible with at least one of the measures in Mq'. Of course, if l' ⇝ l is found to hold and the user decides to drill down to l', the incompatible measures in Mq' must be dropped.</p>
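<p>The compatibility-first filtering of drill-down candidates can be sketched as follows (illustrative; is_compatible and card are stand-ins for the measure check of Section 4.2.1 and for a count-distinct query):</p>

```python
# Illustrative sketch: cheap filters applied to drill-down candidates
# before the expensive AFD check.
def drilldown_candidates(l, candidates, group_by, measures,
                         is_compatible, card):
    kept = []
    for l1 in candidates:
        if card(l1) < card(l):
            continue  # l1 ~> l cannot hold
        new_gb = (group_by - {l}) | {l1}
        # run the (expensive) AFD check only if some measure survives
        if any(is_compatible(m, new_gb) for m in measures):
            kept.append(l1)
    return kept

cards = {"teams.name": 30, "teams.home": 2, "teams.players.player": 400}
gb = {"id", "teams.name"}
```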
        <sec id="sec-7-1-1">
          <title>Example 9</title>
          <p>Consider again Example 8. The search space for id includes date, teams.home, and teams.players.player (we recall that teams.abbreviation has already been added as a descriptive attribute). Since id has the highest cardinality, there are no AFDs to check. On the other hand, the search space for teams.name consists of teams.home and teams.players.player, where |teams.home| &lt; |teams.name| &lt; |teams.players.player|. AFD teams.home ⇝ teams.name is not checked because of its cardinality; the same holds for teams.players.player ⇝ teams.name, since Gq' = {id, teams.players.player} is not compatible with the measures in Mq'.</p>
          <p>
            The md-schema resulting after OLAP enabling in Examples 8 and 9 is shown in DFM notation [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] in Figure 3. Note that, at the end of the user's session, md-schemata can be stored for reuse and sharing.
          </p>
          <p>[Figure 3: the resulting md-schema in DFM notation, with fact Game, measures Teams.score and Teams.won, hierarchy id ⪰ Date ⪰ Month ⪰ Year, and level Teams.name with descriptive attribute Teams.abbreviation.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTAL EVALUATION</title>
      <p>We evaluate the performance of our approach using two real-world datasets:</p>
      <p>
        Games. The working-example dataset has been collected by Sports Reference LLC [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. It contains around 32K nested documents representing NBA games in the period 1985–2013. Each document represents a game between two teams with at least 11 players each. It contains 46 attributes; 40 of them are numeric and represent team and player results.
      </p>
      <p>DBLP. This dataset contains 5.4M documents scraped from DBLP (dblp.uni-trier.de/xml/) in XML format and converted into JSON. Documents are flat and represent eight kinds of publications, including conference proceedings, journal articles, books, theses, etc. So, they have common attributes such as title, author, type, and year, and uncommon ones such as journal and booktitle.</p>
      <p>Implementation. Each dataset has been loaded as a collection in a local MongoDB instance running on a machine with an Intel Core i5 CPU at 3.3 GHz, 4 GB of RAM, and Ubuntu 14.04. We have focused on the parts of our approach that require accessing the data to retrieve the multiplicities of inter-level relationships by means of count-distinct queries, since the purely algorithmic parts have no relevant computational complexity. Query execution can be delegated either to the document store (MongoDB in our architecture) or to query engines such as Spark SQL, Drill, and Presto. Based on the tests we performed, these engines are significantly slower than native querying, plausibly because of the different query execution plans. In addition, these engines sometimes return incorrect results when it comes to count-distinct queries involving nested attributes. Therefore, we defined two queries based on the native query language of MongoDB: 1. The first one (Q_AFD) checks the AFD between a pair of levels and returns its approximation using the formula of Definition 4. 2. The second one (Q_card) returns the cardinality of a level.</p>
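<p>As an illustration, a query like Q_AFD could be phrased as a MongoDB aggregation pipeline along the following lines (a hypothetical sketch, not necessarily the authors' actual query; which array to unwind depends on the nesting of the checked levels):</p>

```python
# Hypothetical MongoDB formulation of an AFD-strength query: one pipeline
# returning #distinct l values / #distinct (l, l1) pairs (Definition 4).
def afd_pipeline(l, l1, unwind=None):
    stages = []
    if unwind:
        # flatten a nested array (e.g. "teams") before grouping
        stages.append({"$unwind": "$" + unwind})
    stages += [
        {"$group": {"_id": {"l": "$" + l, "r": "$" + l1}}},   # distinct pairs
        {"$group": {"_id": None,
                    "pairs": {"$sum": 1},
                    "lefts": {"$addToSet": "$_id.l"}}},
        {"$project": {"strength":
                      {"$divide": [{"$size": "$lefts"}, "$pairs"]}}},
    ]
    return stages

pipeline = afd_pipeline("teams.name", "teams.abbreviation", unwind="teams")
# e.g. db.games.aggregate(pipeline) with pymongo
```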
      <p>Methodology. In the previous section, we showed how our approach validates md-queries and iteratively derives an md-schema that satisfies the multidimensional constraints. In this section, we consider five md-queries on the Games dataset, q1 to q5, and four on the DBLP dataset, q6 to q9 (the group-by sets of all queries are shown in Table 1). Then, we measure the time required by the querying and OLAP-enabling phases. We also discuss the efficiency of our algorithm based on the number of performed and skipped checks in each phase. The results obtained are presented below.</p>
    </sec>
    <sec id="sec-9">
      <title>5.1 Querying</title>
      <p>To evaluate the efficiency of the validation of an md-query, we measure the time of the group-by check, since it is the only step that requires accessing the data (see Footnote 2). In Table 1 we show, for our nine md-queries, the number of checks avoided using the c-schema (#Avoided) and the total time for validating the md-query (t_querying). In order to measure t_querying, the AFD checks required by a single md-query are executed in parallel. We can easily conclude that (i) our approach effectively reduces the number of checks by relying on the dataset structure, and (ii) the md-query validation time depends on the number of levels in the group-by set and on their depths. Remarkably, the md-query validation time is still reasonable (max 50 secs) even with a large dataset, which makes our approach suitable for real-time usage.</p>
      <p>[Table 1: for each md-query q, its group-by set Gq, the number of avoided checks (#Avoided), and its validation time t_querying. Games: q1 = {id, teams.name}, #Avoided 1/2; q2 = {date, teams.name}, 1/2; q3 = {id, teams.name, teams.home}, 2/6; q4 = {teams.players.player, date, teams.name}, 3/6; q5 = {teams.players.player, teams.name, teams.home}, 2/6. DBLP: q6 = {journal, year}, 0/2; q7 = {year, type}, 0/2; q8 = {author, year}, 1/2; q9 = {type, year, author}, 2/6.]</p>
      <p>For OLAP enabling, we measure the time to check each single AFD and the time to retrieve the cardinality of each level. Tables 2 and 3 show, for each AFD checked between a pair of levels, the number of processed documents, the AFD strength, and the checking time. The difference in the number of documents (or the data size) for a given dataset is due to the flattening operation that is performed when executing Q_AFD. The results clearly show that the querying time t_AFD depends on the levels and their depths. This is apparent for the AFDs involving the author attribute in Table 3, which have a higher execution time. This is due to the amount of memory allowed by MongoDB for group-by queries, which is limited to 100 MB; when a query does not fit in memory, temporary files must be written to disk, which dramatically slows execution down.</p>
      <p>Footnote 2: The time for actually executing each md-query is not considered here, since the optimization of md-queries is out of the scope of this paper.</p>
      <p>[Tables 2 and 3: AFDs checked during OLAP enabling. Table 2 (Games) covers the AFDs among id, date, teams.name, teams.abbreviation, teams.home, and teams.players.player (e.g., id ⇝ date, teams.name ⇝ teams.abbreviation, teams.players.player ⇝ teams.name), over 32K to 644K processed documents. Table 3 (DBLP) covers the AFDs among year, type, journal, booktitle, and author (e.g., year ⇝ type, author ⇝ year), over 5.4M to 12M processed documents. For each AFD, the tables report the number of processed documents, the AFD strength, and the checking time t_AFD.]</p>
      <p>CONCLUSION</p>
      <p>In this paper we have proposed an interactive schema-on-read approach for enabling OLAP on document stores. To this end, AFDs are mined and used to automate the discovery of multidimensional structures. The user interaction is limited to the selection of multidimensional concepts and to a proper choice of the aggregation operators. After validating user queries from the multidimensional point of view and refining the underlying multidimensional schema, we adopt a smart strategy to efficiently build local portions of hierarchies, aimed at enabling OLAP-style user interaction in the form of roll-ups and drill-downs.</p>
      <p>Overall, the experiments we conducted show that the performance of our approach is in line with the requirements of real-time user interaction. However, some relevant issues still need to be explored and are part of our future work. To improve performance, we plan to further optimize the proposed algorithms. Besides, to increase effectiveness, the evolution of data and schemata in a collection must be considered; we intend to address these issues by searching for temporal FDs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Darmont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Etcheverry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-N.</given-names>
            <surname>Mazon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trujillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Vossen</surname>
          </string-name>
          .
          <article-title>Fusion cubes: towards self-service business intelligence</article-title>
          .
          <source>International Journal of Data Warehousing and Mining</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ):
          <volume>66</volume>
          –
          <fpage>88</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Berlanga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nebot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Aramburu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          .
          <article-title>Using semantic web technologies for exploratory OLAP: A survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <volume>571</volume>
          –
          <fpage>588</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaftan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          .
          <article-title>Spark SQL: Relational data processing in Spark</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <volume>1383</volume>
          –
          <fpage>1394</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cattell</surname>
          </string-name>
          .
          <article-title>Scalable SQL and NoSQL data stores</article-title>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <volume>12</volume>
          –
          <fpage>27</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chasseur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Patel</surname>
          </string-name>
          .
          <article-title>Enabling JSON document stores in relational systems</article-title>
          .
          <source>In Proc. WebDB</source>
          , pages
          <volume>1</volume>
          –
          <issue>6</issue>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuzzocrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellatreche</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Data warehousing and OLAP over big data: current challenges and future research directions</article-title>
          .
          <source>Proc. DOLAP</source>
          , pages
          <volume>67</volume>
          –
          <fpage>70</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Graziani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>Starry vault: Automating multidimensional modeling from data vaults</article-title>
          .
          <source>In Proc. ADBIS</source>
          , pages
          <volume>137</volume>
          –
          <fpage>151</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>The dimensional fact model: A conceptual model for data warehouses</article-title>
          .
          <source>International Journal of Cooperative Information Systems</source>
          ,
          <volume>7</volume>
          (
          <issue>2-3</issue>
          ):
          <volume>215</volume>
          –
          <fpage>247</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Vrdoljak</surname>
          </string-name>
          .
          <article-title>Data warehouse design from XML sources</article-title>
          .
          <source>In Proc. DOLAP</source>
          , pages
          <volume>40</volume>
          –
          <fpage>47</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huhtala</surname>
          </string-name>
          , J. Karkkainen, P. Porkka, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          .
          <article-title>TANE: An efficient algorithm for discovering functional and approximate dependencies</article-title>
          .
          <source>Computer Journal</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ):
          <fpage>100</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Markl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brown</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Aboulnaga</surname>
          </string-name>
          .
          <article-title>CORDS: Automatic discovery of correlations and soft functional dependencies</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>647</fpage>
          -
          <lpage>658</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Holmgren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>Discovering multidimensional structure in relational data</article-title>
          .
          <source>In Proc. DaWaK</source>
          , pages
          <fpage>138</fpage>
          -
          <lpage>148</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Møller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>Specifying OLAP cubes on XML data</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          ,
          <volume>17</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>255</fpage>
          -
          <lpage>280</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klettke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Störl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <article-title>Schema extraction and structural outlier detection for JSON-based NoSQL data stores</article-title>
          .
          <source>In Proc. BTW</source>
          , pages
          <fpage>425</fpage>
          -
          <lpage>444</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Lenz</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Shoshani</surname>
          </string-name>
          .
          <article-title>Summarizability in OLAP and statistical databases</article-title>
          .
          <source>In Proc. SSDBM</source>
          , pages
          <fpage>132</fpage>
          -
          <lpage>143</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z. H.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gawlick</surname>
          </string-name>
          .
          <article-title>Management of flexible schema data in RDBMSs - opportunities and limitations for NoSQL</article-title>
          .
          <source>In Proc. CIDR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z. H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammerschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McMahon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Closing the functional and performance gap between SQL and NoSQL</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>227</fpage>
          -
          <lpage>238</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Mazón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lechtenbörger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Trujillo</surname>
          </string-name>
          .
          <article-title>A survey on summarizability issues in multidimensional modeling</article-title>
          .
          <source>Data and Knowledge Engineering</source>
          ,
          <volume>68</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1452</fpage>
          -
          <lpage>1469</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Phipps</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Automating data warehouse conceptual schema design and evaluation</article-title>
          .
          <source>In Proc. DMDW</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Prat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Akoka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Comyn-Wattiau</surname>
          </string-name>
          .
          <article-title>A UML-based data warehouse design method</article-title>
          .
          <source>Decision Support Systems</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1449</fpage>
          -
          <lpage>1473</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <article-title>Towards exploratory OLAP on linked data</article-title>
          .
          <source>In Proc. SEBD</source>
          , pages
          <fpage>86</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <article-title>Multidimensional design by examples</article-title>
          .
          <source>In Proc. DaWaK</source>
          , pages
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <article-title>Automating multidimensional design from ontologies</article-title>
          .
          <source>In Proc. DOLAP</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Muro</surname>
          </string-name>
          .
          <article-title>Discovering functional dependencies for multidimensional design</article-title>
          .
          <source>In Proc. DOLAP</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Morales</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>García Molina</surname>
          </string-name>
          .
          <article-title>Inferring versioned schemas from NoSQL databases and its applications</article-title>
          .
          <source>In Proc. ER</source>
          , pages
          <fpage>467</fpage>
          -
          <lpage>480</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Valeri</surname>
          </string-name>
          .
          <uri>https://thecodebarbarian.wordpress.com</uri>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaisman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Etcheverry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomsen</surname>
          </string-name>
          .
          <article-title>Dimensional enrichment of statistical linked open data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>40</volume>
          :
          <fpage>22</fpage>
          -
          <lpage>51</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vrdoljak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Banek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>Designing web warehouses from XML schemas</article-title>
          .
          <source>In Proc. DaWaK</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Schema management for document stores</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>8</volume>
          (
          <issue>9</issue>
          ):
          <fpage>922</fpage>
          -
          <lpage>933</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>