
Italian Symposium on Advanced Database Systems, June 19-22, 2022

Describing Multidimensional Data Through Highlights

(Discussion Paper)

Matteo Francia

Enrico Gallinucci

Matteo Golfarelli

Patrick Marcel

Verónika Peralta

Stefano Rizzi

DISI, University of Bologna, Italy

LIFAT, University of Tours, France

The Intentional Analytics Model (IAM) is a new paradigm to couple OLAP and analytics. It relies on two ideas: (i) letting the user explore data by expressing his/her analysis intentions rather than the data (s)he needs, and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of model components (e.g., clusters). In this paper we propose a proof-of-concept for the IAM vision by delivering an end-to-end implementation of describe, one of the five intention operators introduced by IAM.

Keywords: OLAP, OLAM, Analytics, Multidimensional data, Data exploration

1. Introduction

Data warehousing and OLAP (On-Line Analytical Processing) have been progressively gaining a leading role in enabling business analyses over enterprise data since the early 90's. Recently, it has become more and more evident that the OLAP paradigm alone is no longer sufficient, since the enormous success of machine learning techniques has consistently shifted the interest of corporate users towards sophisticated analytical applications.

The Intentional Analytics Model (IAM) has been envisioned as a way to tightly couple OLAP and analytics [ 1 ]. IAM relies on two major cornerstones: (i) the users explore the data space by expressing their analysis intentions rather than by explicitly stating what data they need, and (ii) in return they receive both multidimensional data and knowledge insights in the form of annotations of interesting subsets of data. As to (i), five intention operators have been proposed, namely, describe [ 2 ], assess [ 3 ], explain, predict, and suggest. As to (ii), first-class citizens of the IAM are enhanced cubes, defined as multidimensional cubes coupled with highlights, i.e., sets of cube cells associated with interesting components of models automatically extracted from cubes [ 1 ]. An overview of the process is given in Figure 1.

The goal of this paper is to provide a proof-of-concept for the IAM vision by delivering an end-to-end implementation of the describe operator, which aims at describing one or more cube measures, possibly focused on one or more level members.

[Figure 1: overview of the IAM approach; the intention operators (describe, assess, explain, predict, suggest) are applied to data and return an enhanced cube.]

Example 1. Let a SALES cube be given, and let the user's intention be: with SALES describe quantity for month = '1997-04' by type using outliers. Firstly, the subset of cells for April 1997 is selected from the SALES cube, aggregated by product type, and projected on measure quantity (in OLAP terms, a slice-and-dice and a roll-up operator are applied). Then, the outliers are found in these cells based on the values of quantity. Finally, a measure of interestingness is computed for the two components obtained (the outlier cells and the non-outlier ones), and the cells belonging to the component with maximum interestingness (the outlier cells) are highlighted in the results shown to the user (see Figure 1). □

After introducing a formalism to manipulate cubes and queries in Section 2, in Section 3 we introduce models, components, and enhanced cubes. Then, in Section 4 we show how an intention is transformed into an execution plan, and in Section 5 we explain how enhanced cubes are visualized. Finally, in Section 6 we discuss the related literature and draw the conclusion.

2. Formalities

In this section we introduce the formal notations we will use in the paper to manipulate cubes. We start by defining cube schemata.

Definition 1 (Hierarchy and Cube Schema). A hierarchy is a couple $h = (L_h, \succeq_h)$ where: (i) $(L_h, \succeq_h)$ is a roll-up total order of categorical levels; (ii) each level $l \in L_h$ is coupled with a domain $Dom(l)$ including a set of members. The top level of $\succeq_h$ is called dimension. A cube schema is a couple $\mathcal{C} = (H, M)$ where $H$ is a set of hierarchies and $M$ is a set of numerical measures, with each measure $m \in M$ coupled with one aggregation operator $op(m) \in \{\mathsf{sum}, \mathsf{avg}, \ldots\}$.

Example 2. For our working example it is $\mathsf{SALES} = (H, M)$, where $H = \{h_{Date}, h_{Customer}, h_{Product}, h_{Store}\}$, $M = \{\mathsf{quantity}, \mathsf{storeSales}, \mathsf{storeCost}\}$, $\mathsf{date} \succeq \mathsf{month} \succeq \mathsf{year}$, $\mathsf{customer} \succeq \mathsf{gender}$, $\mathsf{product} \succeq \mathsf{type} \succeq \mathsf{category}$, $\mathsf{store} \succeq \mathsf{city} \succeq \mathsf{country}$, $op(\mathsf{quantity}) = op(\mathsf{storeSales}) = op(\mathsf{storeCost}) = \mathsf{sum}$. □

Aggregation is the basic mechanism to query cubes, and it is captured by the following definition of group-by set.

Definition 2 (Group-by Set and Coordinate). Given cube schema $\mathcal{C} = (H, M)$, a group-by set of $\mathcal{C}$ is a set of levels, at most one from each hierarchy of $H$. A coordinate of a group-by set $G$ is a tuple of members, one for each level of $G$.

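Example 3. Two group-by sets of SALES are $G_1 = \{\mathsf{date}, \mathsf{type}, \mathsf{country}\}$ and $G_2 = \{\mathsf{month}, \mathsf{category}\}$. Examples of coordinates of these group-by sets are, respectively, $\gamma_1 = \langle \text{1997-04-15}, \text{Fresh Fruit}, \text{Italy} \rangle$ and $\gamma_2 = \langle \text{1997-04}, \text{Fruit} \rangle$. □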
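As a minimal sketch of Definitions 1 and 2, the SALES schema of Example 2 can be encoded as plain Python data and used to check whether a set of levels is a valid group-by set. The names `HIERARCHIES` and `is_group_by_set` are hypothetical and are not taken from the prototype.

```python
# Hierarchies of the SALES schema (Example 2), each listed from the
# dimension (finest level) to the coarsest level.
HIERARCHIES = {
    "Date": ["date", "month", "year"],
    "Customer": ["customer", "gender"],
    "Product": ["product", "type", "category"],
    "Store": ["store", "city", "country"],
}


def is_group_by_set(levels):
    """Return True if `levels` only contains levels of the schema and
    takes at most one level from each hierarchy (Definition 2)."""
    levels = set(levels)
    known = {level for h in HIERARCHIES.values() for level in h}
    if not levels <= known:
        return False
    return all(len(levels & set(h)) <= 1 for h in HIERARCHIES.values())


print(is_group_by_set({"date", "type", "country"}))  # G1 of Example 3: True
print(is_group_by_set({"month", "category"}))        # G2 of Example 3: True
print(is_group_by_set({"month", "year"}))            # two levels of Date: False
```

The empty set also passes the check, which matches the definition: a group-by set may take no level at all from a hierarchy.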

The instances of a cube schema are called cubes and are defined as follows:

Definition 3 (Cube). A cube over $\mathcal{C}$ is a tuple $C = (G_C, M_C, \omega_C)$ where: (i) $G_C$ is a group-by set of $\mathcal{C}$; (ii) $M_C \subseteq M$; (iii) $\omega_C$ is a partial function that maps some coordinates of $G_C$ to a numerical value for each measure $m \in M_C$.

Each coordinate $\gamma$ that participates in $\omega_C$, with its associated tuple $v$ of measure values, is called a cell of $C$ and denoted $\langle \gamma, v \rangle$. A cube whose group-by set $G_C$ includes all and only the dimensions of the hierarchies in $H$ and such that $M_C = M$ is called a base cube; the others are called derived cubes. In OLAP terms, a derived cube is the result of either a roll-up, a slice-and-dice, or a projection made over a base cube; this is formalized as follows.

Definition 4 (Cube Query). A query over cube schema $\mathcal{C}$ is a triple $q = (G_q, P_q, M_q)$ where: (i) $G_q$ is a group-by set of $H$; (ii) $P_q$ is a (possibly empty) set of selection predicates, each expressed over one level of $H$; (iii) $M_q \subseteq M$.

Example 4. The cube query over SALES used in Example 1 is $q = (G_q, P_q, M_q)$ where $G_q = \{\mathsf{type}\}$, $P_q = \{\mathsf{month} = \text{'1997-04'}\}$, and $M_q = \{\mathsf{quantity}\}$. A cell of the resulting cube $q(\mathsf{SALES}_0)$ (where $\mathsf{SALES}_0$ is the base cube) is $\langle \text{Canned Fruit} \rangle$ with associated value 138 for quantity. □
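The query of Example 4 can be sketched over a toy base cube as follows. This is only an illustration under stated assumptions: the rows are invented (chosen so that Canned Fruit totals the value 138 quoted above), predicates are restricted to equalities, the aggregation operator is sum as in Example 2, and `run_query` is a hypothetical helper, not the engine used by the prototype.

```python
from collections import defaultdict

# Toy fragment of a base cube, already joined with the coarser levels
# of each hierarchy. All rows are invented for illustration.
BASE_CUBE = [
    {"date": "1997-04-03", "month": "1997-04", "type": "Canned Fruit", "quantity": 80},
    {"date": "1997-04-15", "month": "1997-04", "type": "Canned Fruit", "quantity": 58},
    {"date": "1997-04-15", "month": "1997-04", "type": "Beer", "quantity": 40},
    {"date": "1997-05-02", "month": "1997-05", "type": "Beer", "quantity": 25},
]


def run_query(rows, group_by, predicates, measures):
    """Evaluate q = (G_q, P_q, M_q) using sum as aggregation operator.

    `predicates` maps a level to the member it must equal, so only
    equality predicates such as month = '1997-04' are covered.
    """
    cells = defaultdict(lambda: dict.fromkeys(measures, 0))
    for row in rows:
        # Slice-and-dice: keep the rows that satisfy every predicate.
        if all(row[level] == member for level, member in predicates.items()):
            # Roll-up: the coordinate has one member per group-by level.
            coordinate = tuple(row[level] for level in group_by)
            # Projection: only the requested measures are aggregated.
            for m in measures:
                cells[coordinate][m] += row[m]
    return dict(cells)


result = run_query(BASE_CUBE, ["type"], {"month": "1997-04"}, ["quantity"])
print(result[("Canned Fruit",)])  # {'quantity': 138}
```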

3. Enhancing cubes with models

Models are concise, information-rich knowledge artifacts [ 4 ] that represent relationships hiding in the cube cells. The possible models range from simple functions and measure correlations to more elaborate techniques such as decision trees, clusterings, etc. A model is bound to (i.e., is computed over the levels/measures of) one cube, and is made of a set of components (e.g., a clustering model is made of a set of clusters). In the IAM, a relevant role is taken by data-to-model mappings. Indeed, a model partitions the cube on which it is computed into two or more subsets of cells, one for each component (e.g., the subsets of cells belonging to each cluster).

Definition 5 (Model and Component). A model is a tuple $\mathcal{M} = (t, alg, C, In, Out, \mu)$ where: (i) $t$ is the model type; (ii) $alg$ is the algorithm used to compute $Out$; (iii) $C$ is the cube to which $\mathcal{M}$ is bound; (iv) $In$ is the tuple of levels/measures of $C$ and parameter values supplied to $alg$ to compute $\mathcal{M}$; (v) $Out$ is the set of components that make up the model; (vi) $\mu$ is a function mapping each coordinate of $C$ to one component of $Out$.

Each model component is a tuple of a component identifier plus a variable number of properties that describe that component.

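In the scope of this work, it is $t \in \{\text{top-k}, \text{bottom-k}, \text{skyline}, \text{outliers}, \text{clustering}\}$. For instance, for $t = \text{clustering}$, each component is a cluster and is described by its centroid.

Example 5. A possible model over the derived cube $q(\mathsf{SALES}_0)$ in Example 4 is characterized by $t = \text{clustering}$, $alg = \text{K-Means}$, $C = q(\mathsf{SALES}_0)$, $In = \langle \mathsf{quantity}, k = 3, \mathit{seed} = 0 \rangle$, $Out = \{c_1, c_2, c_3\}$, $\mu(\langle \text{Bagels} \rangle) = c_1$, $\mu(\langle \text{Beer} \rangle) = c_1$, $\mu(\langle \text{Bologna} \rangle) = c_2$, $\ldots$, where $k$ is the desired number of clusters and $\mathit{seed}$ is the seed to be used by the k-means algorithm to randomly generate the 3 seed clusters. Component $c_1$ is characterized by property $\mathit{centroid}$ with value 76. □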
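To make the model tuple more concrete, the following sketch builds a clustering model over a one-measure cube with a small, pure-Python k-means. It is an illustration only: the prototype relies on Scikit-Learn instead, the `Model` dataclass and `kmeans_model` function are hypothetical names, and all quantities except the value 138 for Canned Fruit are invented.

```python
import random
from dataclasses import dataclass


@dataclass
class Model:
    t: str        # model type
    alg: str      # algorithm used to compute the components
    cube: dict    # cube the model is bound to: coordinate -> value
    inputs: dict  # measure and parameter values supplied to alg
    out: dict     # components: identifier -> properties
    mu: dict      # mapping: coordinate -> component identifier


def kmeans_model(cube, measure, k, seed):
    """One-dimensional k-means (Lloyd's iterations) returning a Model."""
    rng = random.Random(seed)
    coords = sorted(cube)
    # Seed clusters: k cells drawn at random, reproducibly for a given seed.
    centroids = [cube[c] for c in rng.sample(coords, k)]
    mu = {}
    for _ in range(100):
        # Assign each cell to the nearest centroid.
        new_mu = {c: min(range(k), key=lambda i: abs(cube[c] - centroids[i]))
                  for c in coords}
        if new_mu == mu:
            break  # assignments are stable, so the centroids are final
        mu = new_mu
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [cube[c] for c in coords if mu[c] == i]
            if members:
                centroids[i] = sum(members) / len(members)
    out = {f"c{i + 1}": {"centroid": centroids[i]} for i in range(k)}
    return Model("clustering", "K-Means", cube,
                 {"measure": measure, "k": k, "seed": seed},
                 out, {c: f"c{mu[c] + 1}" for c in coords})


# Toy cube by product type; values are invented except Canned Fruit.
CUBE = {("Bagels",): 70, ("Beer",): 82, ("Bologna",): 15,
        ("Canned Fruit",): 138, ("Deli Meats",): 20, ("Fresh Fruit",): 150}

model = kmeans_model(CUBE, "quantity", k=3, seed=0)
print(model.out)
print(model.mu)
```

Each component carries a `centroid` property, and `mu` plays the role of the data-to-model mapping: it is what later allows the cells of a component to be emphasized.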

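As the last step in the IAM approach, cube $C$ is enhanced by associating it with a set of models bound to $C$ and with a highlight, i.e., with the subset of cells corresponding to the most interesting component of the model; these cells are determined via function $\mu$.

Definition 6. An enhanced cube $E$ is a triple of a cube $C$, a set of models $\{\mathcal{M}_1, \ldots, \mathcal{M}_r\}$ bound to $C$, and a highlight $\mathit{highlight} = \mathrm{argmax}_{c \in \bigcup_{i=1}^{r} Out_i}(int(c))$, where $Out_i$ is the set of components of $\mathcal{M}_i$ and $int(c)$ is the interestingness of component $c$.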

How to estimate the interestingness of component $c$, $int(c)$, is explained in detail in [2]. Here we just mention that we consider three facets of interestingness identified in [5], namely, novelty, peculiarity, and surprise.
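Selecting the highlight amounts to an argmax over the components of all models, followed by a lookup of the cells mapped to the winning component. The sketch below assumes that the interestingness scores are already available; the scores, the two models, and the name `find_highlight` are invented for illustration, since the actual computation is the one described in [2].

```python
# Component-to-cell mappings (function mu) of two hypothetical models
# bound to the same cube, keyed by coordinate.
MU = {
    "outliers": {("Beer",): "outlier", ("Bagels",): "regular", ("Wine",): "regular"},
    "top-k": {("Beer",): "top", ("Bagels",): "others", ("Wine",): "top"},
}

# Interestingness of each component; the scores are invented.
INTERESTINGNESS = {
    ("outliers", "outlier"): 0.9,
    ("outliers", "regular"): 0.2,
    ("top-k", "top"): 0.6,
    ("top-k", "others"): 0.3,
}


def find_highlight(mu_by_model, interestingness):
    """Return the most interesting component and the cells mapped to it."""
    model, component = max(interestingness, key=interestingness.get)
    cells = sorted(c for c, cid in mu_by_model[model].items() if cid == component)
    return (model, component), cells


print(find_highlight(MU, INTERESTINGNESS))
# (('outliers', 'outlier'), [('Beer',)])
```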

4. Execution plans for describe intentions

The describe operator provides an answer to the user asking “show me my business” by describing one or more cube measures, possibly focused on one or more level members, at some given granularity [1]. The cube is enhanced by showing either the top/bottom-k cells, the skyline, the outliers, or clusters of cells. Let $C_0$ be a base cube over cube schema $\mathcal{C} = (H, M)$; the syntax for describe is

with $C_0$ describe $m_1, \ldots, m_z$ [ for $P$ ] [ by $l_1, \ldots, l_n$ ] [ using $t_1$ [ size $k_1$ ], $\ldots$, $t_r$ [ size $k_r$ ] ]

(optional parts are in brackets) where $m_1, \ldots, m_z \in M$ are measures of $\mathcal{C}$, $P$ is a set of selection predicates each over one level of $H$, $\{l_1, \ldots, l_n\}$ denote a group-by set of $H$, $t_1, \ldots, t_r$ are model types, and the $k_i$'s are the desired sizes to be applied to the models returned as explained in point 2 below.

The plan corresponding to a fully-specified intention, i.e., one where all optional clauses have been specified, is:

1. Execute query $q = (G_q, P_q, M_q)$, where $G_q = \{l_1, \ldots, l_n\}$, $P_q = P$, and $M_q = \{m_1, \ldots, m_z\}$. Let $C = q(C_0)$.
2. For $1 \leq i \leq r$, compute model $\mathcal{M}_i = (t_i, alg_i, C, In_i, Out_i, \mu_i)$ and, for each $c \in Out_i$, compute $int(c)$. Size $k_i$ is used for clustering to determine the number of clusters to be computed, for top-k and bottom-k to determine the number of cells to be returned, for outliers to determine the number of outliers; it is neglected for the skyline.
3. Find the highlight $\mathit{highlight} = \mathrm{argmax}_{c \in \bigcup_{i=1}^{r} Out_i}(int(c))$.
4. Return the enhanced cube $E$ consisting of $C$, $\{\mathcal{M}_1, \ldots, \mathcal{M}_r\}$, and $\mathit{highlight}$.
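How the size parameter of step 2 is interpreted by the simpler model types can be sketched as follows. The outlier criterion used here (largest absolute deviation from the median) is only a stand-in chosen for brevity, since this paper does not state which detector the prototype uses; `select_cells` and the toy values are likewise hypothetical.

```python
import statistics


def select_cells(cube, model_type, k):
    """Return the k coordinates singled out by a model of the given type.

    `cube` maps each coordinate to the value of a single measure.
    """
    if model_type == "top-k":
        return sorted(cube, key=cube.get, reverse=True)[:k]
    if model_type == "bottom-k":
        return sorted(cube, key=cube.get)[:k]
    if model_type == "outliers":
        # Assumption: the k cells farthest from the median are outliers.
        median = statistics.median(cube.values())
        return sorted(cube, key=lambda c: abs(cube[c] - median), reverse=True)[:k]
    raise ValueError(f"size is not handled here for model type {model_type!r}")


# Toy cube with invented values.
TOY = {("A",): 10, ("B",): 12, ("C",): 11, ("D",): 95, ("E",): 9}

print(select_cells(TOY, "top-k", 2))     # [('D',), ('B',)]
print(select_cells(TOY, "bottom-k", 2))  # [('E',), ('A',)]
print(select_cells(TOY, "outliers", 1))  # [('D',)]
```

Clustering and skyline are left out on purpose: for clustering the size is the number of clusters rather than a number of cells, and for the skyline it is neglected.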

Partially-specified intentions are interpreted as follows:

• If the for clause has not been specified, no selection is applied, i.e., we consider $P = \mathit{TRUE}$.
• If the by clause has not been specified, we consider $G_q = \emptyset$.
• If the using $t_1, \ldots, t_r$ clause has not been specified, all model types listed in Section 3 are computed over $C$ (the skyline is computed only if $z > 1$, i.e., at least two measures have been specified).
• If the size clause has not been specified for one or more models, the value of $k_i$ is determined automatically through the Elbow method.

Example 6. Consider the following session on the SALES cube:

with SALES describe quantity for month = '1997-04' by type
with SALES describe quantity by category using clustering size 3

The models computed for the first intention are top-k, bottom-k, clustering, and outliers (computing the skyline for a single measure makes no sense). For the second intention, a clustering producing 3 clusters is computed. □
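The describe syntax and the rules for partially-specified intentions can be sketched with a small regular-expression parser. This is a hedged illustration rather than the prototype's parser: `parse_intention` is a hypothetical name, predicates are assumed to be separated by the keyword `and` (the syntax above does not say how a set of predicates is written), and a size of `None` stands for "to be determined through the Elbow method".

```python
import re

ALL_MODEL_TYPES = ["top-k", "bottom-k", "skyline", "outliers", "clustering"]

INTENTION = re.compile(
    r"^with\s+(?P<cube>\w+)\s+describe\s+(?P<measures>.+?)"
    r"(?:\s+for\s+(?P<predicates>.+?))?"
    r"(?:\s+by\s+(?P<levels>.+?))?"
    r"(?:\s+using\s+(?P<models>.+?))?$"
)
MODEL = re.compile(r"^(?P<type>[\w-]+)(?:\s+size\s+(?P<size>\d+))?$")


def parse_intention(text):
    """Parse a describe intention and fill in the missing clauses."""
    match = INTENTION.match(text.strip())
    if match is None:
        raise ValueError(f"not a describe intention: {text!r}")
    measures = [m.strip() for m in match["measures"].split(",")]
    # Missing for clause: no selection predicate.
    predicates = re.split(r"\s+and\s+", match["predicates"]) if match["predicates"] else []
    # Missing by clause: empty group-by set.
    levels = [l.strip() for l in match["levels"].split(",")] if match["levels"] else []
    if match["models"]:
        models = {}
        for item in match["models"].split(","):
            m = MODEL.match(item.strip())
            if m is None:
                raise ValueError(f"malformed using clause: {item!r}")
            models[m["type"]] = int(m["size"]) if m["size"] else None
    else:
        # Missing using clause: all model types, skyline only with 2+ measures.
        models = {t: None for t in ALL_MODEL_TYPES
                  if t != "skyline" or len(measures) > 1}
    return {"cube": match["cube"], "measures": measures,
            "predicates": predicates, "levels": levels, "models": models}


first = parse_intention("with SALES describe quantity for month = '1997-04' by type")
second = parse_intention("with SALES describe quantity by category using clustering size 3")
print(first)
print(second)
```

On the two intentions of Example 6, the first one yields the four model types other than the skyline, each with an unspecified size, while the second yields a single clustering model of size 3.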

5. Visualizing enhanced cubes

To provide an effective description of an enhanced cube we couple text-based and graphical representations with an ad-hoc interaction paradigm. Specifically, the visualization includes three distinct but inter-related areas: a table area that shows the cube cells using a pivot table; a chart area that complements the table area by representing the cube cells through one or more charts; a component area that shows a list of model components sorted by their interestingness. The guidelines adopted to select the charts are detailed in [2]. The interaction paradigm we adopt is component-driven. Specifically, clicking on one component $c$ in the component area emphasizes the corresponding cube cells (i.e., those that map to $c$ via function $\mu$) both in the table area and in the chart area. The highlight is the top component in the list and is selected by default. Following the details-on-demand paradigm [6], interaction is enhanced using a tooltip that, when the mouse is positioned on a data point, shows its coordinate, its measure value(s), and the component(s) it belongs to.

Example 7. Figure 2 shows the visualization obtained when the following intention is formulated: with SALES describe storeCost by month, category. On the top-left, the table area; on the right, the chart area; on the bottom-left, the component area. Here a heat map and a bubble chart have been selected. The top-interestingness component is a cluster, so a color has been assigned to each component of clustering (i.e., to each cluster) and is uniformly used in all three areas. The highlight (in green) is currently selected and is emphasized using a thicker border in all areas. A tooltip with all the details about a single cell is also shown (in yellow). □

6. Related work and conclusion

The idea of coupling data and analytical models was born in the 90's with inductive databases, where data were coupled with patterns meant as generalizations of the data. Later on, data-to-model unification was addressed in MauveDB [7], which provides a language for specifying model-based views of data using common statistical models.

Coupling the OLAP paradigm with data mining, so that concise patterns are extracted from multidimensional data for the user's evaluation, was the goal of some approaches commonly labeled as OLAM [8]. In this context, k-means clustering is used by [9] to dynamically create semantically-rich aggregates of facts other than those statically provided by dimension hierarchies. Similarly, the shrink operator is proposed by [10] to compute small-size approximations of a cube via agglomerative clustering. Other operators that enrich data with knowledge extraction results are DIFF [11], which returns a set of tuples that most successfully describe the difference of values between two cells of a cube, and RELAX [12], which verifies whether a pattern observed at a certain level of detail is also present at a coarser level of detail. Finally, [13] reuse the OLAP paradigm to explore prediction cubes, i.e., cubes where each cell summarizes a predictive model trained on the data corresponding to that cell. The IAM approach can be regarded as OLAM since, like the approaches mentioned above, it relies on mining techniques to enhance the cube resulting from an OLAP query. However, while each of the approaches above uses one single technique (e.g., clustering) to this end, the IAM leans on multiple mining techniques to give users a wider variety of insights, using the interestingness measure to select the most relevant ones.

To the best of our knowledge, though some tools (e.g., Spotfire and Tableau) integrate OLAP and analytics capabilities in the same environment, none of them allows users to formulate queries at a higher level of abstraction than OLAP (as done in the IAM using intentions), nor do they support the automated out-of-the-box enrichment of cubes with insights obtained by analytics (as done in the IAM through enhanced cubes).

In this paper we have given a proof-of-concept for the IAM vision by delivering an implementation of the describe operator, relying on a visual metaphor to display enhanced cubes. Our implementation uses a simple multidimensional engine [ 14, 15 ] that relies on the Oracle 11g DBMS to execute queries on a star schema; the mining models are imported from the Scikit-Learn Python library. The web-based visualization is implemented in JavaScript and uses the D3 library. The prototype can be accessed at http://semantic.csr.unibo.it/describe/.

In [2], we have shown that our approach diminishes the effort for formulating complex analyses while ensuring that performance is compatible with the near-real-time requirements of interactive sessions. Specifically, using the ASCII character length as an approximation for the effort it takes to craft a query, we evaluated the saving in user's effort when writing a describe intention over the one necessary to obtain the same result using plain SQL and Python. We considered a simple session including three intentions, where the by clause is progressively enlarged and all the models are computed. Remarkably, it turned out that the total formulation effort using SQL+Python is about two orders of magnitude larger than using describe intentions (on average, about 5400 vs. 55 chars).

For the efficiency test we used the FoodMart data (github.com/julianhyde/foodmart-data-mysql) and the same session mentioned above. Table 1 shows the total execution time and its breakdown into the times necessary to query the base cube, to compute the models, to measure the interestingness, and to generate the pivot table returned to the browser. Remarkably, it turns out that at most 18 seconds are necessary to retrieve and visualize an enhanced cube of more than 86000 cells, which is perfectly compatible with the execution time of a standard OLAP query.

The main directions for future research we wish to pursue are: (i) evaluate the usability of the approach by conducting tests with real users, and (ii) extend the approach to operate with dashboards of enhanced cubes.

[1] Vassiliadis, Marcel, Rizzi, Beyond roll-up's and drill-down's: An intentional analytics model to reinvent OLAP, Inf. Syst. 85 (2019) 68-91.

[2] Francia, Marcel, Peralta, Rizzi, Enhancing cubes with models to describe multidimensional data, Inf. Syst. Frontiers 24 (2022) 31-48.

[3] Francia, Golfarelli, Marcel, Rizzi, Vassiliadis, Assess queries for interactive analysis of data cubes, in: Proc. of EDBT, 2021, pp. 121-132.

[4] Terrovitis, Vassiliadis, Skiadopoulos, Bertino, Catania, Maddalena, Rizzi, Modeling and language support for the management of pattern-bases, Data Knowl. Eng. 62 (2007) 368-397.

[5] Marcel, Peralta, Vassiliadis, A framework for learning cell interestingness from cube explorations, in: Proc. of ADBIS, 2019.

[6] Shneiderman, The eyes have it: A task by data type taxonomy for information visualizations, in: Proc. of IEEE Symp. on Visual Languages, 1996, pp. 336-343.

[7] Deshpande, S. Madden, MauveDB: supporting model-based user views in database systems, in: Proc. of SIGMOD, 2006, pp. 73-84.

[8] Han, OLAP mining: Integration of OLAP with data mining, in: Proc. of Working Conf. on Database Semantics, 1997, pp. 3-20.

[9] Bentayeb, C. Favre, RoK: Roll-up with the k-means clustering method for recommending OLAP queries, in: Proc. of DEXA, 2009, pp. 501-515.

[10] Golfarelli, Graziani, Rizzi, Shrink: An OLAP operation for balancing precision and size of pivot tables, Data Knowl. Eng. 93 (2014) 19-41.

[11] Sarawagi, Explaining differences in multidimensional aggregates, in: Proc. of VLDB, 1999, pp. 42-53.

[12] Sathe, Sarawagi, Intelligent rollups in multidimensional OLAP data, in: Proc. of VLDB, 2001, pp. 531-540.

[13] Chen, Lin, Ramakrishnan, Prediction cubes, in: Proc. of VLDB, 2005, pp. 982-993.

[14] Francia, E. Gallinucci, Golfarelli, Towards conversational OLAP, in: Proc. of DOLAP, 2020, pp. 6-15.

[15] Francia, E. Gallinucci, M. Golfarelli, COOL: A framework for conversational OLAP, Inf. Syst. 104 (2022) 101752.