<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Similarity Function for Multi-Level and Multi-Dimensional Itemsets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Francia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Golfarelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Rizzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISI, University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>The key objective of frequent itemsets (FIs) mining is uncovering relevant patterns from a transactional dataset. In particular we are interested in multi-dimensional and multi-level transactions, i.e., ones that include different points of view about the same event and are described at different levels of detail. In the context of a work aimed at devising original techniques for summarizing and visualizing this kind of itemsets, in this paper we extend the definition of itemset containment to the multi-dimensional and multi-level scenario, and we propose a new similarity function for itemsets, enabling a more effective grouping. The most innovative aspect of our similarity function is that it takes into account both the extensional and intensional natures of itemsets.</p>
      </abstract>
      <kwd-group>
        <kwd>Frequent itemset mining</kwd>
        <kwd>Itemset summaries</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The key objective of frequent itemsets (FIs) mining is uncovering relevant patterns from a transactional dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Since its initial formulation on transactions of uniform and flat items, where a transaction corresponds to a set of products bought together by a customer, FIs mining has been applied to different types of data. In particular, in this work we consider multi-dimensional and multi-level data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A multi-dimensional transaction represents an event from different points of view, which we call features. In a multi-level transaction, feature values are described using hierarchies with different levels of detail. A typical application is user profiling: each transaction describes a user by means of several features related, for instance, to where she lives, where she works, and how much she earns; each feature value can be described at different, hierarchically-organized levels (e.g., she lives close to Macy's, which is in the Garment district, which is part of Manhattan). In this context, a FI describes a profile of a group of people sharing the same features/behavior.
      </p>
      <p>
        The exponential nature of FIs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] makes it difficult for data scientists and domain experts to visualize and explore their information content. Increasing the frequency threshold (i.e., the minimum itemset support) just decreases the number of FIs, possibly leading to missing useful information; so, it is recognized that a more effective approach is to provide FI summaries that assist decision makers in getting insights over data [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Summarization differs from FIs mining since the latter searches, within a set of transactions, for those itemsets that are frequently present, disregarding redundancy. Conversely, summarization is an optimization problem that addresses the extraction of the minimum number of FIs that represent an entire population while maximizing the overall diversity of the representatives. The two techniques are complementary in enhancing and simplifying FIs analysis: the adoption of summarization on top of FIs mining makes the choice of a frequency threshold less critical and enables the discovery of both specific and general patterns. Though several FI summarization approaches have been proposed in the literature (e.g., [
        <xref ref-type="bibr" rid="ref1 ref12 ref5">1,12,5</xref>
        ]), they do not consider the multi-level and multi-dimensional natures of FIs. A third approach to make FI analysis more effective is the use of advanced visualizations. Interactive visual interfaces and visual data mining approaches can unveil hidden information, simplify the process of understanding, and allow users to focus their attention on what is important. Although some visual representations for FIs have been proposed in the literature [
        <xref ref-type="bibr" rid="ref13 ref4 ref9">13,9,4</xref>
        ], to the best of our knowledge no approaches for visualizing FI summaries have been proposed so far.
      </p>
      <p>[Figure 1: overview of the framework: hierarchies and FIs produced by FIs Mining feed the FI Summarization and FI Visualization components.]</p>
      <p>
        To fill this gap, we are currently working on a framework addressing the summarization and visualization of multi-level and multi-dimensional FIs. As shown in Figure 1, our approach is independent of the algorithm applied for generating the FIs taken in input [
        <xref ref-type="bibr" rid="ref11 ref3">11,3</xref>
        ]. The summarization and visualization components work jointly to give users the relevant information; the user can iteratively create and visualize new summaries that better meet her needs by tuning a set of parameters. In the context of this framework, here we extend the definitions of FIs and itemset containment to the multi-dimensional and multi-level scenario, and we propose a new similarity function for FIs that enables more effective groupings. The most innovative aspect of our similarity function is that it takes into account both the extensional and intensional natures of FIs. The intensional nature is considered in feature-based similarity: the higher the number of features (i.e., semantics) shared by two FIs, the higher their similarity; the extensional nature is considered in support-based similarity: the higher the percentage of transactions supporting both FIs, the higher their similarity. Adopting this two-faceted similarity function, agglomerative clustering algorithms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can then be leveraged to summarize FIs.
      </p>
      <p>The paper is organized as follows. After providing the formal definitions of multi-dimensional and multi-level FIs in Section 2, in Section 3 we propose our similarity function. Finally, in Section 4 we discuss the research perspectives.</p>
    </sec>
    <sec id="sec-2">
      <title>Itemsets</title>
      <p>
        The itemsets we consider are multi-level, which implies the presence of a hierarchy of concepts. The hierarchies we consider are those defined in classic multi-dimensional modeling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Definition 1 (Hierarchy). A hierarchy H is defined by (i) a set L_H of categorical levels, (ii) a domain Dom(l) including a set of values for each level l ∈ L_H (all domains are disjoint), (iii) a roll-up partial order ⪯_H of L_H, and (iv) a part-of partial order ≤_H of ⋃_{l ∈ L_H} Dom(l). Exactly one level dim(H) ∈ L_H, called dimension, is such that dim(H) ⪯_H l for each other l ∈ L_H. The part-of partial order is such that, for each pair of levels l and l′ with l ⪯_H l′, for each value v ∈ Dom(l) there is exactly one value v′ ∈ Dom(l′) such that v ≤_H v′.</p>
      <p>The itemsets we consider are also multi-dimensional, i.e., they refer to different features (e.g., worksIn), each related to a specific hierarchy (e.g., Location). A feature defines the semantics carried by an item at a specific hierarchical level. This can be formalized as follows:
Definition 2 (Domain Schema). A domain schema is a triple D = (H, F, μ) where: (1) H is a set of hierarchies; (2) F is a set of features; and (3) μ is a function mapping each feature onto one hierarchy.</p>
      <p>Example 1. As a working example we will use the Profiling domain schema, which describes the customers who regularly visit a mall and features the two hierarchies depicted in Figure 2. The first one is rooted in the Location dimension and has two branches: the first describes locations from the geographical point of view (with reference to New York City), the second classifies them based on their features. In the roll-up partial order we have, for instance, Neighborhood ⪯_Location Borough; in the part-of partial order, we have Harlem ≤_Location Manhattan. The second hierarchy describes incomes in terms of their ranges. The features of Profiling are worksIn, frequents, and earns; specifically,</p>
      <p>μ(worksIn) = μ(frequents) = Location; μ(earns) = Income</p>
      <p>Itemsets are non-redundant sets of items, i.e., two items in an itemset cannot be defined on values related in the part-of partial order (e.g., GreenwichVillage and Manhattan). Finally, transactions are itemsets whose items are all defined on dimension values (e.g., Macy's).</p>
      <p>Definition 3 (Itemset and Transaction). Given a domain schema D = (H, F, μ), an item of D is a couple i = (f, v) where f ∈ F, v ∈ Dom(l), and l is a level of hierarchy μ(f). An itemset I of D is a set of distinct items of D where, for each i, i′ ∈ I with i = (f, v) and i′ = (f, v′), it is v ≰_{μ(f)} v′ and v′ ≰_{μ(f)} v. A transaction is an itemset only including items defined over dimensions of H.
Example 2. Examples of an itemset I and a transaction T of Profiling are
I = {(worksIn, Harlem), (frequents, Museum), (earns, High)}
T = {(worksIn, CityCollege), (frequents, WhitneyMuseum), (earns, 35to60)}</p>
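      <p>To make Definition 3 concrete, the non-redundancy condition can be sketched in a few lines of Python. This is only an illustrative sketch: the part-of map below is a simplified fragment with one parent per value, whereas in Figure 2 each location also has a typological parent (e.g., CityCollege is also part-of College).</p>
      <preformat>
```python
# Simplified part-of parent map for a fragment of the Location hierarchy
# (illustrative; one parent per value).
PART_OF = {
    "CityCollege": "Harlem", "Harlem": "Manhattan",
    "Macy's": "GreenwichVillage", "GreenwichVillage": "Manhattan",
}

def ancestors(v):
    """All values reachable from v via the part-of partial order."""
    out = set()
    while v in PART_OF:
        v = PART_OF[v]
        out.add(v)
    return out

def is_itemset(items):
    """Non-redundancy check of Definition 3: no two items on the same
    feature may be related in the part-of partial order."""
    for (f, v) in items:
        for (g, w) in items:
            if f == g and v != w and (w in ancestors(v) or v in ancestors(w)):
                return False
    return True

print(is_itemset({("worksIn", "GreenwichVillage"), ("worksIn", "Manhattan")}))  # False
print(is_itemset({("worksIn", "Harlem"), ("frequents", "Manhattan")}))          # True
```
      </preformat>
      <p>The second call succeeds because the two items carry different features, so the part-of relation between their values is irrelevant.</p>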
      <p>[Figure 2: the two hierarchies of Profiling. The Location hierarchy rolls up dimension values such as CityCollege, WhitneyMuseum, MuncanFood, Macy's, and QueensZoo along a geographical branch (Neighborhood: Harlem, GreenwichVillage, Astoria, Flushing; Borough: Manhattan, Queens) and a typological branch (Type: College, Store, Museum, Zoo; Category: Amenity, Tourism). The Income hierarchy describes income ranges.]</p>
      <p>Let I denote the set of all items of a domain schema. It can be easily verified that the containment relationship ⊑ is reflexive, antisymmetric, and transitive, and that for each pair of itemsets over I there are a least upper bound and a greatest lower bound; so ⊑ induces a lattice on 2^I. The top element of the lattice is the empty itemset, the bottom element is I. Given two itemsets I and I′, we denote with lub(I, I′) and glb(I, I′) their least upper bound and greatest lower bound.
Example 3. Figure 3 shows a small portion of the containment lattice for Profiling; for simplicity we restrict to features frequents and earns and denote items by their value only. For instance, for frequents it is {Amenity} ⊑ {College, Store} and {Amenity} ⊑ {College}. Besides, it is
lub({CityCollege}, {College, Store}) = {CityCollege, Store}
glb({CityCollege}, {College, Store}) = {College}</p>
      <p>Transaction T is said to support itemset I iff I ⊑ T. With reference to Example 2, T supports I. Given a set of transactions T, the set of transactions that support I is denoted by T_I ⊆ T. This allows us to introduce a relevant numerical property of itemsets, namely, their support.</p>
      <p>Definition 5 (Itemset Support). Given itemset I, its support sup(I) within a set of transactions T is defined as sup(I) = |T_I| / |T|.</p>
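      <p>Itemset support is a simple ratio once a containment test is available. The sketch below takes the supports predicate as a parameter standing for the test I ⊑ T (not reproduced here); the subset-based predicate used in the example is only the flat special case, without hierarchies.</p>
      <preformat>
```python
def support(itemset, transactions, supports):
    """sup(I) = |T_I| / |T|: fraction of transactions supporting itemset I."""
    t_i = [t for t in transactions if supports(itemset, t)]
    return len(t_i) / len(transactions)

# Flat special case: a transaction supports an itemset iff it is a superset.
flat = lambda i, t: set(i).issubset(t)

transactions = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
print(support({"a", "b"}, transactions, flat))  # 0.5
```
      </preformat>
      <p>In the multi-level setting, supports would instead check that every item of the itemset generalizes (via the part-of partial order) some item of the transaction.</p>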
      <p>[Figure 3: a small portion of the containment lattice for Profiling, ranging from singleton itemsets such as {Manhattan}, {Queens}, {Amenity}, and {Tourism} down to the full set of items {CityCollege, Macy's, MuncanFood, WhitneyMuseum, QueensZoo, below10, 10to35, 35to60, over60}.]</p>
      <p>Itemset I is said to be frequent if its support is greater than or equal to a given threshold. Note that, since the containment relationship induces a lattice on the set 2^I of all possible itemsets, it also induces a partial order over the set F ⊆ 2^I of FIs. Thus, from now on we will say that F is a POS (Partially Ordered Set).</p>
    </sec>
    <sec id="sec-3">
      <title>Itemset Similarity</title>
      <p>We argue that itemset similarity is a two-faceted concept: (1) according to feature-based similarity, the higher the number of features (i.e., semantics) shared by two FIs, the higher their similarity; and (2) since feature-based similarity is useless if two FIs are supported by two distinct groups of transactions, it is complemented by support-based similarity, in which the higher the percentage of transactions supporting both FIs, the higher their similarity. These two aspects of similarity are not necessarily correlated; for example, support-based similarity can be low even if feature-based similarity is high, when non-shared features are rare and supported by a small fraction of transactions.</p>
      <p>In a multi-level and multi-dimensional domain, computing feature-based similarity is not just a matter of finding the subset of common items between two FIs, but it is also related to the informative value they carry in terms of level of detail. Intuitively, we consider an FI to be more relevant than another if it includes a larger number of distinct features; in turn, the relevance of a feature increases with the level of detail at which it is expressed.</p>
      <p>Definition 6 (Itemset Relevance). Given itemset I, its relevance is defined as
rel(I) = Σ_{f ∈ Feat(I)} rel(f) · ( Σ_{l ∈ Lev_f(I)} rel(l) )
where Feat(I) is the set of distinct features of the items in I, Lev_f(I) is the set of levels of the values coupled with feature f in the items of I, rel(f) is the relevance of f, and rel(l) is the relevance of level l. Conventionally, rel(∅) = 0.</p>
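      <p>Reading Definition 6 as a double sum, rel(I) = Σ_{f ∈ Feat(I)} rel(f) · Σ_{l ∈ Lev_f(I)} rel(l), it can be sketched in Python as follows. The weights mirror the assumptions of Example 4 (rel(f) = 1 for every feature, level relevance growing by 0.1 with the level of detail), but the concrete level values and the level_of map are illustrative assumptions.</p>
      <preformat>
```python
# Illustrative relevance weights (assumed, in the spirit of Example 4).
REL_F = {"worksIn": 1.0, "frequents": 1.0, "earns": 1.0}
REL_L = {"Borough": 1.0, "Neighborhood": 1.1, "Location": 1.2,
         "Category": 1.0, "Type": 1.1}

def relevance(itemset, level_of):
    """rel(I): sum over the distinct features f of rel(f) times the summed
    relevance of the distinct levels used by f's values in I."""
    feats = {f for (f, v) in itemset}
    total = 0.0
    for f in feats:
        levels = {level_of[v] for (g, v) in itemset if g == f}
        total += REL_F[f] * sum(REL_L[l] for l in levels)
    return total

level_of = {"College": "Type", "Store": "Type", "Harlem": "Neighborhood"}
I = {("frequents", "College"), ("worksIn", "Store")}
print(relevance(I, level_of))  # 2.2
```
      </preformat>
      <p>Here each of the two features contributes rel(f) · rel(Type) = 1 · 1.1, so rel(I) = 2.2.</p>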
      <p>We finally introduce the similarity between two FIs as a linear combination of a support-based and a feature-based similarity.</p>
      <p>Definition 7 (Itemset Similarity). Given a set of transactions T, a POS of FIs F, two FIs I and I′ supported by T, and a coefficient α ∈ [0..1], the similarity of I and I′ is defined as
sim(I, I′) = α · simsup(I, I′) + (1 − α) · simrel(I, I′)
where
simsup(I, I′) = sup(lub(I, I′)) / (sup(I) + sup(I′) − sup(lub(I, I′)))
simrel(I, I′) = rel(glb(I, I′)) / rel(lub(I, I′)) if lub(I, I′), glb(I, I′) ∈ F; 0 otherwise.</p>
      <p>Both simsup and simrel range in [0..1] and can be intuitively explained as follows: simsup is the ratio between the number of transactions supporting both FIs I and I′ and the number of transactions supporting either I or I′; simrel is the ratio between the relevance of the features common to I and I′ and the relevance of the union of the features of I and I′. Clearly, since the lub and glb operators are commutative, it is always sim(I, I′) = sim(I′, I).</p>
      <p>Example 4. With reference to the hierarchies defined in Figure 2, we assume that (i) all features are equally relevant (rel(f) = 1 for all f), and (ii) relevance increases by 0.1 for each level of detail. Given FIs I = {(frequents, College), (worksIn, Store)} and I′ = {(frequents, College), (frequents, Store)}, it is</p>
      <p>L = lub(I, I′) = {(frequents, College), (frequents, Store), (worksIn, Store)}
G = glb(I, I′) = {(frequents, College)}
Assuming for instance that sup(I) = 0.3, sup(I′) = 0.4, sup(L) = 0.2, sup(G) = 0.5, and α = 0.5, then sim(I, I′) = 0.44.</p>
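      <p>On precomputed supports and relevances, Definition 7 reduces to a few arithmetic operations. The sketch below assumes, per the intuitive reading of simsup, that the transactions supporting both FIs are those supporting their lub (the most specific common specialization); all function names and numeric arguments are illustrative, not taken from Example 4.</p>
      <preformat>
```python
def sim(alpha, sup_i, sup_j, sup_lub, rel_glb, rel_lub, in_pos=True):
    """alpha-weighted combination of support-based and relevance-based
    similarity (Definition 7), on precomputed sup/rel values."""
    sim_sup = sup_lub / (sup_i + sup_j - sup_lub)       # Jaccard-like ratio
    sim_rel = (rel_glb / rel_lub) if in_pos else 0.0    # 0 if lub/glb not in F
    return alpha * sim_sup + (1 - alpha) * sim_rel

# Illustrative values: sim_sup = 0.15/0.45, sim_rel = 1.2/3.0
print(sim(0.6, 0.25, 0.35, 0.15, 1.2, 3.0))  # ~0.36
```
      </preformat>
      <p>Both components stay in [0..1], so the combined similarity does too for any α ∈ [0..1].</p>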
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>In this paper we have proposed an original similarity measure for multi-dimensional and multi-level FIs, to be used for enabling the creation of concise and valuable summaries of sets of FIs. Though for space reasons we cannot fully detail how summaries are defined and visualized in our approach, in this section we give an informal explanation.</p>
      <p>First of all, since our goal is to support interactive exploration and navigation of FIs, we organize summaries in a hierarchical fashion so that they can be analyzed at different levels of detail. So we build a hierarchical clustering of the set of all FIs using an agglomerative algorithm; any (complete and disjoint) "cut" of the resulting dendrogram is a summary.</p>
      <p>
        In a summary each cluster is represented by a single FI; specifically, the representative of cluster c is the most specific FI in c, i.e., such that I ⊑ rep(c), ∀I ∈ c. With reference to this, we note that the strategy commonly used in the literature picks as a cluster representative its most general FI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; however, this strategy often fails to properly characterize clusters, since it easily yields very low relevance as the cluster cohesion decreases, which may cause one or more features appearing in some of the cluster FIs to be missing from the representative. Conversely, when using our strategy, all the features appearing in at least one FI of the cluster are included in the representative.
      </p>
      <p>As to the agglomerative clustering algorithm we adopt, it is basically a greedy algorithm that progressively merges couples of clusters, starting from singletons, until one single cluster is obtained. Remarkably, the POS of FIs induced by our definition of itemset containment allows the search space to be significantly pruned. Indeed, our preliminary tests show that the POS obtained when a feature is described using a linear hierarchy of n levels is several orders of magnitude smaller than the one we would get if those n levels were flat. Another relevant improvement in terms of computational complexity depends on an interesting property of our similarity function. It can be proved that, in domains where the relevance of the levels of a hierarchy increases with the level of detail (i.e., rel(l) ≥ rel(l′) if l ⪯_H l′), similarity is antimonotonic along the itemset containment relationship. As a consequence, at each step of our agglomerative algorithm we can just estimate the similarity between FIs that are directly contained in one another, thus avoiding a large number of useless computations.</p>
      <p>
        Finally, to visualize summaries we adopt treemaps, a popular method for visualizing large hierarchical data sets by mapping hierarchical concepts onto 2D areas [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Figure 4 shows a treemap in which the visualization area is partitioned into nested rectangles, each corresponding to a cluster of FIs, whose area is proportional to the cluster cardinality; colors code both the predominant feature (i.e., the one with the highest relevance within the cluster FIs) of the cluster representative (hue) and its support (saturation). So, for instance, the top right pink rectangle describes a cluster that (i) includes 97 FIs with support ranging from 0.14 to 0.58 and relevance ranging from 1.00 to 3.60; (ii) has 10 child clusters in the dendrogram; and (iii) has a representative (namely, {(earns, avg30), (livesIn, close), (livesIn, collina BO), (worksIn, Bologna), (worksIn, close)}) whose predominant feature is livesIn. On this visualization the user can then apply classical OLAP operators (roll-up, drill-down, slice-and-dice) to navigate the dendrogram so as to flexibly explore the set of FIs at different abstraction levels, focusing on the more relevant FIs.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Afrati</surname>
            ,
            <given-names>F.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gionis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
          </string-name>
          , H.:
          <article-title>Approximating a collection of frequent sets</article-title>
          .
          <source>In: Proc. SIGKDD</source>
          . pp.
          <volume>12</volume>
          {
          <fpage>19</fpage>
          . Seattle, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikant</surname>
          </string-name>
          , R.:
          <article-title>Fast algorithms for mining association rules in large databases</article-title>
          .
          <source>In: Proc. VLDB</source>
          . pp.
          <volume>487</volume>
          {
          <fpage>499</fpage>
          .
          <string-name>
            <surname>Santiago</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chile</surname>
          </string-name>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baralis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cagliero</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cerquitelli</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Elia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garza</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Support driven opportunistic aggregation for generalized itemset extraction</article-title>
          .
          <source>In: Proc. IS</source>
          . pp.
          <volume>102</volume>
          {
          <fpage>107</fpage>
          . London, UK (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bothorel</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serrurier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hurter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Visualization of frequent itemsets with nested circular layout and bundling algorithm</article-title>
          .
          <source>In: Proc. ISVC</source>
          . pp.
          <volume>396</volume>
          {
          <fpage>405</fpage>
          .
          <string-name>
            <surname>Rethymnon</surname>
          </string-name>
          ,
          <string-name>
            <surname>Greece</surname>
          </string-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chandola</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Summarization | compressing data into an informative representation</article-title>
          .
          <source>Knowl. Inf. Syst</source>
          .
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <volume>355</volume>
          {
          <fpage>378</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Golfarelli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rizzi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Data warehouse design: Modern principles and methodologies</article-title>
          .
          <string-name>
            <surname>McGraw-Hill</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rastogi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shim</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>CURE: an e cient clustering algorithm for large databases</article-title>
          .
          <source>In: Proc. SIGMOD</source>
          . pp.
          <volume>73</volume>
          {
          <fpage>84</fpage>
          . Washington, USA. (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Han,
          <string-name>
            <given-names>J</given-names>
            .,
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Mining multiple-level association rules in large databases</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>11</volume>
          (
          <issue>5</issue>
          ),
          <volume>798</volume>
          {
          <fpage>804</fpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Leung</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmichael</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>FpViz: a visualizer for frequent pattern mining</article-title>
          .
          <source>In: Proc. SIGKDD</source>
          . pp.
          <volume>30</volume>
          {
          <fpage>39</fpage>
          . Paris, France (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Shneiderman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Tree visualization with tree-maps: 2-d space- lling approach</article-title>
          .
          <source>ACM Trans. Graph</source>
          .
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <volume>92</volume>
          {
          <fpage>99</fpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Srikant</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
          </string-name>
          , R.:
          <article-title>Mining generalized association rules</article-title>
          .
          <source>In: Proc. VLDB</source>
          . pp.
          <volume>407</volume>
          {
          <fpage>419</fpage>
          .
          <string-name>
            <surname>Zurich</surname>
          </string-name>
          ,
          <string-name>
            <surname>Switzerland</surname>
          </string-name>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
          </string-name>
          , G.:
          <article-title>On e ciently summarizing categorical databases</article-title>
          .
          <source>Knowl. Inf. Syst</source>
          .
          <volume>9</volume>
          (
          <issue>1</issue>
          ),
          <volume>19</volume>
          {
          <fpage>37</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yan</surname>
            , X., Cheng, H., Han,
            <given-names>J</given-names>
          </string-name>
          .,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Summarizing itemset patterns: a pro le-based approach</article-title>
          .
          <source>In: Proc. SIGKDD</source>
          . pp.
          <volume>314</volume>
          {
          <fpage>323</fpage>
          . Chicago, USA (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
          </string-name>
          , J.:
          <article-title>Summary queries for frequent itemsets mining</article-title>
          .
          <source>Journal of Systems and Software</source>
          <volume>83</volume>
          (
          <issue>3</issue>
          ),
          <volume>405</volume>
          {
          <fpage>411</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>