
Avoiding the itemset closure computation "pitfall"

Tarek Hamrouni, Sadok Ben Yahia, Yahya Slimani

Département des Sciences de l'Informatique, Faculté des Sciences de Tunis, Campus Universitaire, 1060 Tunis, Tunisie

Extracting generic bases of association rules is a promising way to present users with informative, compact, added-value knowledge. However, extracting generic bases requires partially ordering itemset closures that are costly to compute. To avoid the prohibitive cost of itemset closure computation, especially for sparse contexts, we introduce an algorithm, called Prince, for extracting generic bases of association rules. Its main originality is that the partial order is maintained between frequent minimal generators rather than between frequent closed itemsets. A structure called the minimal generator lattice is then built, from which the derivation of itemset closures and generic association rules becomes straightforward. An intensive experimental evaluation, carried out on benchmark and "worst case" datasets, showed that Prince largely outperforms the pioneering algorithms Close, A-Close and Titanic.

1 Introduction

It is widely known that frequent itemset based algorithms suffer from the generation of a very large number of frequent itemsets and hence of association rules. This prohibitive generation reduces not only the efficiency but also the effectiveness of the mined knowledge: users have to rummage tediously through an overwhelmingly large number of mined association rules [1]. In this context, the approach based on the extraction of frequent closed itemsets [2] showed a clear promise to reduce the frequent itemset extraction cost and, mainly, to offer users irreducible nuclei of association rules, commonly known as "generic bases" of association rules. This approach, relying on the mathematical background of Formal Concept Analysis [3], proposes to reduce the search space by detecting intrinsic structural properties. Therefore, the problem of mining association rules can be reformulated, from the (frequent) closed itemset discovery point of view, as follows [4]:

1. Discover both distinct "closure systems", i.e., sets of sets which are closed under the intersection operator, namely the set of closed itemsets and the set of minimal generators. Also, the upper cover (Cov^u) of each closed itemset should be available.

2. From all the information discovered during the first step, i.e., both closure systems and the upper cover sets, derive generic bases of association rules (from which all remaining association rules can be derived).

After an overview of the state of the art of frequent closed itemset based algorithms (e.g., [1, 2, 5–9]), the essential findings can be summarized as follows:

1. These algorithms mainly concentrate on the first task, i.e., reducing the computation time of the frequent itemset extraction step. Their performance is good on dense contexts but modest on sparse ones. Indeed, computing itemset closures in sparse contexts heavily handicaps these algorithms, since the search space of frequent closed itemsets tends to overlap that of frequent itemsets.

2. The frequent closed itemset based algorithms neglect the second task, i.e., extracting generic association rule bases. Indeed, none of them maintains the covering (order) relationship between frequent closed itemsets.

In this paper, we propose a new algorithm, called Prince, that extracts generic bases of association rules. Prince performs a level-wise traversal of the search space. Its main originality is that it is the only algorithm that bears the cost of building the partial order. To amortize this prohibitive cost, the partial order is maintained between frequent minimal generators rather than between frequent closed itemsets. The resulting partially ordered structure is called the minimal generator lattice [10], in which each equivalence class is reduced to the corresponding set of frequent minimal generators. Hence, itemset closures are not computed but derived while Prince performs a simple sweep of the minimal generator lattice to derive generic bases of association rules. The practical performance of Prince has been compared to that of well-known level-wise algorithms, i.e., Close [2], A-Close [5] and Titanic [6]. Our experiments were carried out on benchmark datasets (dense and sparse) and on "worst case" datasets. The results are very encouraging: although our algorithm also performs the partial order construction task, it largely outperforms Close, A-Close and Titanic. Due to space limits, in addition to the "worst case" datasets, we report our results on only two benchmark datasets that are frequently used for evaluating data mining algorithms.

Note that we do not compare Prince to more recent algorithms, e.g., LCM [8] and DCI-Closed [9], for two reasons:

1. Close, A-Close and Titanic determine at least the "key" information provided by the frequent minimal generator set.

2. We claim that focusing on the fast enumeration of frequent closed itemsets alone is of no interest and brings no added-value knowledge to end-users, since not all the required information is extracted.

The remainder of the paper is organized as follows. Section 2 sketches the generic association rule basis extraction problem. Section 3 is dedicated to the presentation of the Prince algorithm. Experimental results showing the utility of the proposed approach are reported in Section 4. The conclusion and future work are presented in Section 5.

2 Generic association rule basis extraction

Since the introduction of the approach based on the extraction of frequent closed itemsets [2], several generic association rule bases have been proposed, among which those of Bastide et al. [11], which are defined as follows:

1. The generic basis for exact association rules:

Definition 1. Let FCI_K be the set of frequent closed itemsets extracted from the extraction context K. For each entry f in FCI_K, let MG_f be the set of its minimal generators. The generic basis for exact association rules GB is given by: GB = {R: g ⇒ (f − g) | f ∈ FCI_K and g ∈ MG_f and g ≠ f (1)}.

2. The transitive reduction of the informative basis [11], which is a basis for all approximate association rules (2):

Definition 2. Let FMG_K be the set of frequent minimal generators extracted from the extraction context K. The transitive reduction RI is given by: RI = {R: g ⇒ (f − g) | f ∈ FCI_K and g ∈ FMG_K and g″ ≺ f (3) and Conf(R) ≥ minconf}.

In the remainder of the paper, we refer to the generic association rules formed by the couple (GB, RI). This couple is informative, sound and lossless [11, 12], and the association rules forming it are referred to as informative association rules. Thus, given an Iceberg Galois lattice in which each frequent closed itemset is decorated with its list of minimal generators, the derivation of these association rules is straightforward. Indeed, approximate generic association rules represent "inter-node" implications, annotated with the confidence measure, between two adjacent comparable equivalence classes, i.e., from a frequent closed itemset to another frequent closed itemset immediately covering it. For example, referring to the Iceberg Galois lattice depicted by Figure 1 (Right), the approximate generic association rule C ⇒ ADEF, of confidence 0.75, is generated from the two equivalence classes topped respectively by the frequent closed itemsets "CE" and "ACDEF". Conversely, exact generic association rules are "intra-node" implications, with a confidence equal to 1, extracted from each node of the partially ordered structure. For example, from the closed itemset "ACDEFG", three exact generic association rules are obtained: AG ⇒ CDEF, DG ⇒ ACEF and FG ⇒ ACDE.

(1) The condition g ≠ f ensures that non-informative association rules of the form g ⇒ ∅ are discarded.
(2) The closure operator is denoted by ″.
(3) The notation g″ ≺ f indicates that f covers g″ in the Iceberg Galois lattice.
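The two bases can be illustrated with a small brute-force sketch. The five transactions below are an assumption: they were reconstructed so as to agree with the supports and rules quoted in this paper, and the order of the rows is arbitrary. All function names are hypothetical. The code enumerates every itemset, which is only practical on a toy context, and it is not how Prince proceeds.

```python
from itertools import combinations

# Assumed toy context, reconstructed from the supports quoted in the paper.
CONTEXT = [set("ACDEFG"), set("ACDEFG"), set("ABCDEF"), set("BCEG"), set("BDE")]
ITEMS = sorted(set().union(*CONTEXT))


def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in CONTEXT if set(itemset) <= t)


def closure(itemset):
    """Intersection of all transactions containing `itemset`."""
    covering = [t for t in CONTEXT if set(itemset) <= t]
    return frozenset(set.intersection(*covering)) if covering else frozenset(ITEMS)


def frequent_minimal_generators(minsup):
    """Frequent itemsets whose support is strictly lower than that of each (k-1)-subset."""
    result = []
    for k in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            g = frozenset(combo)
            s = support(g)
            if s >= minsup and all(support(g - {i}) > s for i in g):
                result.append(g)
    return result


def exact_basis(minsup):
    """GB: rules g => closure(g) - g for each frequent minimal generator g != closure(g)."""
    return {(g, closure(g) - g) for g in frequent_minimal_generators(minsup)
            if closure(g) != g}


def approximate_basis(minsup, minconf):
    """RI: rules g => f - g where the frequent closed itemset f covers closure(g)."""
    gens = frequent_minimal_generators(minsup)
    closed = {closure(g) for g in gens}
    rules = {}
    for g in gens:
        above = [f for f in closed if closure(g) < f]
        # f covers closure(g) if no other closed itemset lies strictly in between.
        covers = [f for f in above if not any(h < f for h in above)]
        for f in covers:
            conf = support(f) / support(g)
            if conf >= minconf:
                rules[(g, f - g)] = conf
    return rules
```

With minsup = 2 and minconf = 0.5, `exact_basis(2)` contains the three rules quoted above for "ACDEFG", and `approximate_basis(2, 0.5)` maps the rule C ⇒ ADEF to a confidence of 0.75.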

Fig. 1. Left: extraction context K (items A to G, transactions 1 to 5). Center: minimal generator lattice. Right: Iceberg Galois lattice.

3 Prince algorithm

To overcome the shortcomings of frequent closed itemset based algorithms, i.e., the cost of closure computation and the neglect of the partial order construction, we introduce a new algorithm called Prince. Prince greatly reduces the cost of closure computation and builds the partially ordered structure, which enables it to extract generic association rule bases directly, without being coupled with another algorithm.

Prince takes as input an extraction context K in which the items are sorted in lexicographic order, the minimum support threshold minsup and the minimum confidence threshold minconf. It outputs the list of frequent closed itemsets and their associated minimal generators, as well as the informative association rules formed by the couple (GB, RI). Prince operates in three successive steps: (i) minimal generator determination, (ii) partial order construction and (iii) generic association rule basis extraction.

3.1 Minimal generator determination

Following the "Test-and-generate" technique, Prince traverses the search space level by level to determine the set of frequent minimal generators FMG_K, sorted by decreasing support. FMG_K is viewed as divided into several subsets, each of which corresponds to a given support value: each time a frequent minimal generator is determined, it is added to the subset corresponding to its support. Prince also keeps track of the negative border of minimal generators GBd⁻ (4) [13]. In the second step, the set of frequent minimal generators serves as a backbone to construct the minimal generator lattice. As shown by the following property, the union of FMG_K and GBd⁻ is used, in the second step, as a concise lossless representation of frequent itemsets:

Property 1. [13] Let X be an itemset. If ∃ Z ∈ GBd⁻ such that Z ⊆ X, then X is infrequent. Otherwise, X is frequent and Supp(X) = min {Supp(g) | g ∈ FMG_K and g ⊆ X}.

In this step, Prince uses the same pruning strategies as Titanic, namely minsup, the ideal order of the frequent minimal generator set and the estimated support. A trie is used to store the minimal generator set in order to speed up the extraction of the information that is needed later. The path from the root to each node represents a minimal generator.

(4) An itemset belongs to GBd⁻ if it is an infrequent minimal generator and all its subsets are frequent minimal generators.

3.2 Partial order construction

In this step, the frequent minimal generator set FMG_K is organized into a minimal generator lattice, without any access to the extraction context. The main question is how to construct the partial order without computing itemset closures, i.e., how to infer the subsumption relation by comparing minimal generators only. To achieve this goal, the list of immediate successors (5) of each equivalence class is updated iteratively. The frequent minimal generators are processed in the order imposed in the first step (i.e., by decreasing support). Each frequent minimal generator g of size k (k ≥ 1) is introduced into the minimal generator lattice by comparing it to the immediate successors of its (k-1)-subsets (6). This is based on the isotony property of the closure operator [14]. Indeed, let g1, a (k-1)-itemset, be one of the subsets of g; then g1 ⊂ g ⇒ g1″ ⊆ g″. Thus, the equivalence class of g is a successor (not necessarily an immediate one) of the equivalence class of g1.

While comparing g to the immediate successor list of g1, denoted L, two cases are to be distinguished. If L is empty, then g is added to L. Otherwise, g is compared to the elements already belonging to L (cf. Proposition 1). Thanks to the order imposed in the first step, only the two cases of Proposition 1 need to be distinguished, where the frequent minimal generators X and Y are replaced by g and one of the elements of L, respectively.

Proposition 1. [15] Let X, Y ∈ FMG_K, and let C_X and C_Y be their respective equivalence classes:
a. If Supp(X) = Supp(Y) = Supp(X ∪ Y), then X and Y belong to the same equivalence class.
b. If Supp(X) < Supp(Y) and Supp(X) = Supp(X ∪ Y), then C_X (resp. C_Y) is a successor (resp. predecessor) of C_Y (resp. C_X).

The support of (X ∪ Y) is obtained directly if (X ∪ Y) belongs to FMG_K ∪ GBd⁻; C_X and C_Y are then incomparable. Otherwise, Property 1 is applied. The support computation then stops as soon as a minimal generator is found that is included in (X ∪ Y) and has a support strictly lower than both that of X and that of Y; C_X and C_Y are then incomparable.
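The comparison described above can be sketched as follows, using only the sets FMG_K and GBd⁻ listed in Example 1 (minsup = 2). The function names and the string encoding of itemsets are assumptions made for illustration. The sketch always scans all generators rather than stopping early as Prince does.

```python
# Frequent minimal generators and negative border, as listed in Example 1.
FMG = {"": 5, "C": 4, "D": 4, "A": 3, "B": 3, "F": 3, "G": 3, "CD": 3,
       "AG": 2, "BC": 2, "BD": 2, "DG": 2, "FG": 2}
GBD_MINUS = {"AB": 1, "BF": 1, "BG": 1, "BCD": 1}


def derived_support(itemset):
    """Property 1: None if the itemset is infrequent, otherwise the minimum
    support among the frequent minimal generators it contains."""
    x = set(itemset)
    if any(set(z) <= x for z in GBD_MINUS):
        return None
    return min(s for g, s in FMG.items() if set(g) <= x)


def compare(x, y):
    """Proposition 1 for two frequent minimal generators with Supp(x) <= Supp(y)."""
    sx, sy = FMG[x], FMG[y]
    su = derived_support(set(x) | set(y))
    if sx == sy == su:
        return "same class"
    if sx < sy and sx == su:
        return "successor"  # C_x is a successor of C_y
    return "incomparable"
```

For instance, `compare("A", "C")` yields "successor", `compare("F", "A")` yields "same class", and `compare("D", "C")` yields "incomparable" because CD is itself a minimal generator of support 3.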

(5) By the term "immediate successor", we mean a frequent minimal generator, unless otherwise specified.
(6) In the first step and for each k-candidate, links to its (k-1)-subsets are stored during the check of the ideal order property.

During these comparisons, and to avoid redundant closure computations, Prince introduces two complementary functions. These functions make it possible to maintain the notion of equivalence class throughout the processing. To this end, each equivalence class C is characterized by a representative itemset, which is the first frequent minimal generator of C introduced into the minimal generator lattice. Both functions are described below:

1. Manage-Equiv-Class: This function is used when a frequent minimal generator, say g, is compared to the representative itemset of its equivalence class, say R. It replaces all occurrences of g by R in the immediate successor lists to which g was added. The comparisons that remain to be carried out with g are then made with R. Thus, for each equivalence class, only its representative itemset appears in the lists of immediate successors.

2. Representative: This function finds, for each frequent minimal generator g, the representative R of its equivalence class in order to complete the immediate successor list of C_g. This makes it possible to manage a single immediate successor list for all frequent minimal generators belonging to the same equivalence class.
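A minimal sketch of the bookkeeping performed by these two functions is given below. The dictionary-based data layout (`rep_of`, `immediate_succs`) and the function signatures are assumptions. The paper only specifies the behaviour, not the data structures.

```python
def representative(rep_of, g):
    """Return the representative of the equivalence class of g
    (g itself when no other representative has been recorded)."""
    return rep_of.get(g, g)


def manage_equiv_class(rep_of, immediate_succs, g, r):
    """Record r as the representative of g and replace every occurrence of g
    by r in the immediate successor lists."""
    rep_of[g] = r
    for owner, succs in immediate_succs.items():
        if g in succs:
            replaced = [r if s == g else s for s in succs]
            # Keep a single occurrence of each representative, preserving order.
            immediate_succs[owner] = list(dict.fromkeys(replaced))


# Hypothetical state in which CD had already been inserted below D.
rep_of = {}
immediate_succs = {"": ["C", "D", "B"], "C": ["A", "G"], "D": ["A", "CD"]}
manage_equiv_class(rep_of, immediate_succs, "CD", "A")
```

After the call, only the representative A remains in the successor list of D, and `representative(rep_of, "CD")` returns "A".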

The pseudo-code of the second step is given by the Gen-Order procedure (Algorithm 1). Each entry, say g, in FMG_K is composed of the following fields: (i) support: the support of g; (ii) direct-subsets: the list of the (k-1)-subsets of g; (iii) immediate-succs: the list of immediate successors of g. At the end of the execution of the Gen-Order procedure, g.immediate-succs is empty if g is not the representative itemset of its equivalence class or if g belongs to a maximal equivalence class, i.e., one that is not subsumed by any other equivalence class. Otherwise, this list contains only representative frequent minimal generators.

Algorithm 1 Gen-Order

Require: FMG_K.
Ensure: the elements of FMG_K partially ordered in the form of a minimal generator lattice.

for all (g ∈ FMG_K) do
    for all (g1 ∈ g.direct-subsets) do
        R = Representative(g1);
        for all (g2 ∈ R.immediate-succs) do
            if (g.support = g2.support = Supp(g ∪ g2)) then
                Manage-Equiv-Class(g, g2); /* g, g2 ∈ C_g and g2 is the representative of C_g */
            else if (g.support < g2.support and g.support = Supp(g ∪ g2)) then
                g is compared with g2.immediate-succs;
                /* For the remaining elements of R.immediate-succs, g is compared only with each g3 | g3.support > g.support; */
            end if
        end for
        if (∀ g2 ∈ R.immediate-succs, C_g and C_g2 are incomparable) then
            R.immediate-succs = R.immediate-succs ∪ {g};
        end if
    end for
end for

3.3 Generic association rule basis extraction

In this step, Prince extracts the valid informative association rules. For this purpose and using Proposition 2, Prince finds the frequent closed itemset corresponding to each equivalence class.

Proposition 2. [15] Let f and f1 be two closed itemsets such that f covers f1 in the Galois lattice L_{C_K}. Let MG_f be the set of minimal generators of f. The closed itemset f can be composed as follows: f = ∪{g | g ∈ MG_f} ∪ f1.

The minimal generator lattice is traversed in an ascending manner from the equivalence class whose frequent minimal generator is the empty set (7) (denoted C_∅) to the non-subsumed equivalence class(es). If the closure of the empty set is not empty, the exact generic association rule between the empty set and its closure is extracted. Once the partially ordered structure is built, Prince extracts the valid approximate generic association rules between the empty set and the frequent closed itemsets of the upper cover of C_∅. These closures are found, by applying Proposition 2, using the minimal generators of each equivalence class and the closure of the empty set. The equivalence classes forming the upper cover of C_∅ are stored, which makes it possible to apply the same process to them. In the same manner, Prince processes the higher levels of the minimal generator lattice until reaching the maximal equivalence class(es).
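The closure derivation of Proposition 2 amounts to a plain set union, as the following sketch suggests. The function name is hypothetical, and the generator sets and the closure E of the empty set are those of the running example (Example 1).

```python
def closure_from_generators(minimal_generators, predecessor_closure):
    """Proposition 2: union of the minimal generators of a class and of the
    closed itemset of one of its immediate predecessors."""
    closed = set(predecessor_closure)
    for g in minimal_generators:
        closed |= set(g)
    return frozenset(closed)


bottom = frozenset("E")                                 # closure of the empty set
ce = closure_from_generators(["C"], bottom)             # class of C, covering the bottom class
acdef = closure_from_generators(["A", "F", "CD"], ce)   # class of A, F and CD, covering the class of C
```

Here `ce` is the closed itemset CE and `acdef` is ACDEF, obtained without any access to the extraction context.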

The pseudo-code of this step is given by the Gen-GRB procedure (Algorithm 2). We use the same notations as in the Gen-Order procedure, to which we add the field FCI for each element of FMG_K. For each frequent minimal generator g, this field stores the frequent closed itemset corresponding to C_g if g is its representative. In the Gen-GRB procedure, L1 denotes the list of equivalence classes from which the valid informative association rules are extracted, and L2 denotes the list of equivalence classes which cover those forming L1 (8).

(7) This class is called the Bottom element of the lattice [16]. The corresponding closure is calculated in the first step by collecting the items appearing in all transactions of the extraction context.
(8) A test is carried out to check that an equivalence class does not already belong to L2. This test consists in checking whether the corresponding frequent closed itemset has already been computed (Line 11 in Algorithm 2).

Example 1. Let us consider the extraction context K given by Figure 1 (Left) for minsup = 2 and minconf = 0.5. The first step determines the closure of the empty set, the sorted set FMG_K and the negative border of minimal generators GBd⁻. Thus, ∅″ = E, FMG_K = {(∅,5), (C,4), (D,4), (A,3), (B,3), (F,3), (G,3), (CD,3), (AG,2), (BC,2), (BD,2), (DG,2), (FG,2)} and GBd⁻ = {(AB,1), (BF,1), (BG,1), (BCD,1)}.

During the second step, Prince processes the elements of FMG_K by comparing each frequent minimal generator g of size k (k ≥ 1) with the immediate successor lists of its (k-1)-subsets. Since the list of immediate successors of the empty set is empty, C is added to ∅.immediate-succs. Then, D is compared to C. Since CD is a minimal generator, C_C and C_D are incomparable and D is added to ∅.immediate-succs. A is then compared to this list. By comparing A to C, A.support < C.support and A.support = Supp(AC), so C_A is a successor of C_C. A is added to C.immediate-succs without any comparison since this list is still empty. A is also added to D.immediate-succs since A.support < D.support and A.support = Supp(AD). At this point, ∅.immediate-succs = {C, D}, and B is added to this list since C_B is incomparable with C_C (BC is a minimal generator) and with C_D (BD is also a minimal generator).

F is then introduced into the minimal generator lattice by comparing it with the immediate successor list of its unique 0-subset, i.e., the empty set. By comparing F to C, F.support < C.support and F.support = Supp(CF), so C_F is a successor of C_C. F is then compared to C.immediate-succs, which contains A. F.support = A.support = Supp(AF) and thus F ∈ C_A, whose representative is A. The Manage-Equiv-Class function is then applied: it replaces the occurrences of F in the immediate successor lists by A (in this case, there is no occurrence) and continues the comparisons with A instead of F (in this case, there are no more comparisons to make).

G is then compared to ∅.immediate-succs, equal to {C, D, B}. C_G is a successor of C_C since G.support < C.support and G.support = Supp(CG). After comparing G with C.immediate-succs, which only contains A, G is added to C.immediate-succs since C_G is incomparable with C_A (AG is a minimal generator). By comparing G to D (resp. B), C_G is incomparable with C_D (resp. C_B) since DG (resp. BG) is a minimal generator.

Then, CD is compared to the immediate successor lists of its 1-subsets, i.e., C and D. C_C has C_A and C_G as immediate successors. By comparing CD and A, CD is assigned to C_A since CD.support = A.support = Supp(ACD). The Manage-Equiv-Class function is then applied. In particular, the comparisons to carry out with CD are made with A. A is then compared to the immediate successor list of the second 1-subset of CD, i.e., D. However, D.immediate-succs contains only A and the comparison process stops. The remainder of FMG_K is processed in the same way.

Once the minimal generator lattice is built (cf. Figure 1 (Center)), an ascending sweep is carried out from C_∅. As ∅″ = E, the exact generic association rule ∅ ⇒ E is extracted. ∅.immediate-succs = {C, D, B}. The frequent closed itemset associated with C_C is then found and is equal to CE. The approximate generic association rule ∅ ⇒ CE, with a support equal to 4 and a confidence equal to 0.8, is extracted. The same holds for C_D and C_B. Using the same process from C_C, C_D and C_B, the minimal generator lattice is traversed in an ascending way until all valid informative association rules are extracted. The resulting generic association rule bases are given in Figure 2.
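The outcome of Example 1 can be reproduced with the naive sketch below. It is not the Gen-Order procedure: instead of comparing each generator only with the successors of its (k-1)-subsets, it compares all pairs of class representatives, which is quadratic but easier to check. It relies only on the supports listed in the example (through Property 1 and Propositions 1 and 2) and never reads the extraction context. All names are hypothetical.

```python
FMG = {"": 5, "C": 4, "D": 4, "A": 3, "B": 3, "F": 3, "G": 3, "CD": 3,
       "AG": 2, "BC": 2, "BD": 2, "DG": 2, "FG": 2}
GBD_MINUS = {"AB": 1, "BF": 1, "BG": 1, "BCD": 1}


def derived_support(itemset):
    """Property 1: None if infrequent, else minimum support of the included generators."""
    x = set(itemset)
    if any(set(z) <= x for z in GBD_MINUS):
        return None
    return min(s for g, s in FMG.items() if set(g) <= x)


def build_lattice():
    """Group the generators into equivalence classes, then link each class
    representative to the representatives of its immediate successors."""
    members = {}
    for g in sorted(FMG, key=lambda x: -FMG[x]):       # decreasing support
        for r in members:
            if FMG[g] == FMG[r] == derived_support(set(g) | set(r)):
                members[r].append(g)                    # Proposition 1.a
                break
        else:
            members[g] = [g]                            # g opens a new class
    reps = list(members)

    def is_successor(x, y):                             # Proposition 1.b
        return FMG[x] < FMG[y] and FMG[x] == derived_support(set(x) | set(y))

    succs = {}
    for y in reps:
        above = [x for x in reps if is_successor(x, y)]
        succs[y] = [x for x in above
                    if not any(is_successor(x, z) for z in above)]
    return members, succs


def derive_closures(members, succs, bottom_closure):
    """Ascending sweep from the class of the empty set, applying Proposition 2."""
    closed = {"": frozenset(bottom_closure)}
    level = [""]
    while level:
        next_level = []
        for r in level:
            for s in succs[r]:
                if s not in closed:                     # closure not derived yet
                    closed[s] = frozenset(set(closed[r]).union(*map(set, members[s])))
                    next_level.append(s)
        level = next_level
    return closed


members, succs = build_lattice()
closed = derive_closures(members, succs, "E")
```

On this data, `succs[""]` is `["C", "D", "B"]`, the class of A gathers A, F and CD, and `closed["AG"]` is the closed itemset ACDEFG, in agreement with the example.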

Exact generic association rules (GB):
R1: ∅ ⇒ E
R2: C ⇒ E
R3: D ⇒ E
R4: B ⇒ E
R5: A ⇒ CDEF
R6: F ⇒ ACDE
R7: CD ⇒ AEF
R8: G ⇒ CE
R9: BC ⇒ E
R10: BD ⇒ E
R11: AG ⇒ CDEF
R12: DG ⇒ ACEF
R13: FG ⇒ ACDE

Approximate generic association rules (RI), with their confidence in parentheses:
R14: ∅ ⇒ CE (0.8)
R15: ∅ ⇒ DE (0.8)
R16: ∅ ⇒ BE (0.6)
R17: C ⇒ ADEF (0.75)
R18: C ⇒ GE (0.75)
R19: C ⇒ BE (0.5)
R20: D ⇒ ACEF (0.75)
R21: D ⇒ BE (0.5)
R22: B ⇒ CE (0.66)
R23: B ⇒ DE (0.66)
R24: A ⇒ CDEFG (0.66)
R25: F ⇒ ACDEG (0.66)
R26: CD ⇒ AEFG (0.66)
R27: G ⇒ ACDEF (0.66)

Fig. 2. The GB basis (exact rules) and the RI basis (approximate rules).

3.4 Correctness and computational cost

In this section, we prove the correctness of the Prince algorithm and evaluate its computational cost in the worst case.

Theorem 1. (correctness) The Prince algorithm extracts all frequent minimal generators and derives all frequent closed itemsets and all valid informative association rules.

Proof. During the first step, a minimal generator candidate c is pruned only if its estimated support is equal to its actual support or if it does not satisfy the ideal order of minimal generators. Otherwise, c is a minimal generator and, by comparing its actual support to minsup, Prince adds it either to the frequent minimal generator set FMG_K or to the negative border of minimal generators GBd⁻. Thus, at the end of the first step of Prince, all frequent minimal generators are extracted, in addition to the negative border of minimal generators.
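The estimated-support test mentioned in this proof can be sketched as follows, taking the estimated support of a k-candidate to be the minimum support of its (k-1)-subsets, as in Titanic. The function names are hypothetical, and the supports are those of Example 1.

```python
from itertools import combinations


def estimated_support(candidate, supports):
    """Minimum support among the (k-1)-subsets of a k-candidate."""
    k = len(candidate)
    return min(supports[frozenset(sub)]
               for sub in combinations(sorted(candidate), k - 1))


def is_minimal_generator(candidate, actual_support, supports):
    """A candidate is kept only if its actual support is strictly lower
    than its estimated support."""
    return actual_support < estimated_support(candidate, supports)


# Supports of the subsets involved, taken from Example 1.
supports = {frozenset(): 5, frozenset("A"): 3, frozenset("C"): 4, frozenset("D"): 4}
cd_kept = is_minimal_generator("CD", 3, supports)   # Supp(CD) = 3 < min(4, 4)
ac_kept = is_minimal_generator("AC", 3, supports)   # Supp(AC) = 3 = min(3, 4)
```

CD is therefore kept as a minimal generator, whereas AC is pruned because its support equals that of its subset A.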

During the second step, Prince introduces all frequent minimal generators into the minimal generator lattice. Indeed, a frequent minimal generator g is compared to the immediate successor lists of all its (k-1)-subsets. The Representative function finds the representative itemset of the equivalence class of a (k-1)-subset of g. Once the representative is found, Proposition 1 covers both possible cases. The Manage-Equiv-Class function is used only if g is compared to the representative of C_g. At the end of this step, the minimal generator lattice is completely built.

During the third step, all equivalence classes are taken into consideration when deriving frequent closed itemsets and valid informative association rules. Indeed, each equivalence class C, except C_∅, has at least one immediate predecessor. Hence, the representative of C belongs to at least one immediate successor list of another equivalence class, say C1. When C1 is processed, the frequent closed itemset of C is derived and C is added to the list of equivalence classes from which valid informative association rules will be derived in the next iteration. Thus, at the end of this step, all frequent closed itemsets and all valid informative association rules are derived.

Proposition 3. (computational cost) In the worst case, the time complexity of Prince is O((n^3 + m) × 2^n), where n (resp. m) is the number of distinct items (resp. transactions) in the extraction context.

Proof. The worst case is obtained when each extracted frequent itemset is a frequent closed minimal generator. Thus, the frequent itemset lattice exactly overlaps both the Iceberg Galois lattice and the minimal generator lattice. The number of frequent closed minimal generators is then equal to 2^n. We consider that each transaction contains the n distinct items.

During the first step, Prince performs two main tasks. The first task consists in computing the supports of the candidates and is of order O(m × 2^n). The second task consists in trying to prune non-minimal generator candidates and is of order O(n^2 × 2^n). The cost of the first step is then of order O((n^2 + m) × 2^n).

During the second step, and for each frequent minimal generator g of size k, Prince performs, in the worst case, O(k × (n − k)) comparisons (k × (n − k) is over-estimated by n^2). Indeed, the number of its (k-1)-subsets is equal to k, and each (k-1)-subset g1 has, in the worst case, (n − k) immediate successors when g is compared with g1.immediate-succs. Each comparison is performed by making the union of g with an element of g1.immediate-succs. The cost of the union is O(n). Searching for the support of the itemset resulting from this union costs O(n) since it is a minimal generator. The cost of the second step is then O((n + n) × n^2 × 2^n), i.e., O(n^3 × 2^n).

During the third step, and for each equivalence class C, Prince performs two main tasks. The first task consists in deriving the corresponding frequent closed itemset f. This is carried out by performing the union of the set of frequent minimal generators of f, which contains only one element, and a frequent closed itemset f1, which is an immediate predecessor of f. The first task then costs O(n). The second task consists in deriving the valid informative association rules. As each frequent minimal generator is also closed, there are no exact generic association rules. However, by fixing minconf to 0, there are k approximate generic association rules for an equivalence class whose frequent closed minimal generator is of size k. To derive each approximate generic association rule, Prince computes the difference between the frequent closed itemset f and the corresponding premise, which costs O(n). The second task then costs O(k × n) (k is over-estimated by n). Hence, the cost of the third step is O((n + n^2) × 2^n), i.e., O(n^2 × 2^n).

Thus, in the worst case, the time complexity of Prince is the sum of the costs of its three steps and is of order O((n^3 + m) × 2^n).

It is important to mention that, although Prince constructs the partial order, its running time remains of the same order of magnitude as that of algorithms dedicated to the extraction of frequent closed itemsets [17].

4 Experimental results

In this section, we compare the performance of Prince with that of the Close, A-Close and Titanic algorithms. Prince was implemented in the C language using gcc version 3.3.1. All experiments were carried out on a PC with a 2.4 GHz Pentium IV and 512 MB of main memory (with 2 GB of swap), running SuSE Linux 9.0.

In all our experiments, all reported times are real times, including system and user times. The experiments were carried out on benchmark datasets (dense and sparse (9)) and on "worst case" datasets. Figure 3 (Left) summarizes the characteristics of the benchmark datasets. A "worst case" context is defined as follows:

Definition 3. A "worst case" context is a context K = (O, A, R) where O is a finite set of objects (or transactions) of size (n+1), A is a finite set of attributes (or items) of size n and R is a binary (incidence) relation (i.e., R ⊆ O × A). Each of the first n objects has (n-1) distinct attributes. The last object has all the attributes. Each attribute belongs to n distinct objects.

Thus, in a "worst case" dataset, each closed itemset is equal to its (minimal) generator. Hence, from a "worst case" dataset of dimension (n+1) × n, 2^n frequent closed itemsets can be extracted when minsup is fixed to 1 transaction. Figure 3 (Right) presents an example of a "worst case" dataset for n = 4.

(9) All these datasets are downloadable at the following address: http://fimi.cs.helsinki.fi/data.

Dataset       Type     # items   Avg. tr. size   # transactions
Mushroom      dense    119       23              8124
T40I10D100K   sparse   1000      40              100000

Fig. 3. (Left) Benchmark dataset characteristics. (Right) A "worst case" dataset for n = 4: five transactions over four items, where transactions 1 to 4 each contain three of the items and transaction 5 contains all four.
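Definition 3 can be checked on small values of n with the sketch below, which builds a "worst case" context and counts its closed itemsets by brute force. The function names and the integer encoding of the attributes are assumptions made for illustration.

```python
from itertools import combinations


def worst_case_context(n):
    """n objects, each missing one distinct attribute, followed by one object
    holding all n attributes."""
    attributes = list(range(n))
    rows = [frozenset(a for a in attributes if a != missing) for missing in attributes]
    rows.append(frozenset(attributes))
    return rows


def closed_itemsets(rows, n):
    """An itemset is closed if it equals the intersection of the rows containing it."""
    closed = set()
    for k in range(n + 1):
        for combo in combinations(range(n), k):
            x = frozenset(combo)
            covering = [r for r in rows if x <= r]
            if covering and frozenset.intersection(*covering) == x:
                closed.add(x)
    return closed


rows = worst_case_context(4)
count = len(closed_itemsets(rows, 4))
```

For n = 4, the context has five objects and `count` equals 16, i.e., 2^4: every itemset is closed and has a support of at least 1.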

Figure 4 shows the execution times of the Prince algorithm (10) compared to those of the Close, A-Close and Titanic algorithms.

- Mushroom: On the Mushroom dataset, Prince performs better than Close, A-Close and Titanic for all minsup values, thanks to the important role played by the equivalence class management functions. Indeed, for minsup = 0.1%, the number of frequent minimal generators (360,166) is almost 2.2 times the number of frequent closed itemsets (164,117). The performance of Titanic decreases significantly because of the extension attempts carried out for each frequent minimal generator. Indeed, for minsup = 0.1%, 116 items out of 119 are frequent, while the maximum size of a frequent minimal generator is only 10 items.

- T40I10D100K: On this dataset, Prince performs much better than Close, A-Close and Titanic for all minsup values. Close and A-Close are handicapped by the large average transaction size (40 items). Likewise, the performance of Titanic degrades considerably for the reasons given above. The gap between Prince and the other algorithms is explained by the fact that, for a frequent minimal generator, the comparisons made by Prince are far cheaper than the intersection operations performed in Close and A-Close and the extension attempts performed in Titanic.

- "Worst case" datasets: For these experiments, minsup was fixed to 1 transaction. We tested 26 datasets, with n varying from 1 to 26. The execution times of the four algorithms become distinguishable only from n = 15. Prince remains faster than Close, A-Close and Titanic. The executions of Close and Titanic stop at n = 24 for lack of memory space. The same happens to A-Close at n = 25 and to Prince at n = 26. It is important to mention that the partial order construction requires storing much more information than is needed when only frequent closed itemsets are extracted. Thus, using a single trie to store the information about all minimal generators, instead of several tries as in Close, A-Close and Titanic (11), is an attempt to reduce the memory requirements of Prince.

(10) The minconf value is set to 0.
(11) Indeed, in these three algorithms, a separate trie is used to store the information about each set of (frequent) minimal generators of size k.

Fig. 4. Execution times of Prince, Close, A-Close and Titanic (panel: Mushroom).

5 Conclusion

In this paper, we proposed a new algorithm, called Prince, for the efficient extraction of frequent closed itemsets and their respective minimal generators, as well as of the generic association rule bases. To this end, and contrary to the existing algorithms, Prince builds the partial order. A main characteristic of Prince is that it relies only on minimal generators to build the underlying partial order. The experiments showed that Prince largely outperforms the existing "Test-and-generate" algorithms of the literature on both benchmark and "worst case" contexts. In the near future, we plan to tackle two issues. Firstly, we plan to study the possibility of integrating the work of Calders et al. [18] into the first step of Prince. Indeed, this work can be applied to any set satisfying the ideal order property, such as the set of frequent minimal generators in our case. Secondly, we propose to add constraints [19], so that the number of generic association rules is reduced while the most interesting ones for the user are kept.

Acknowledgements

The authors are deeply grateful to Yves Bastide, who kindly agreed to provide the source code of the Close, A-Close and Titanic algorithms.

1. Pei, J., Han, J., Mao, R., Nishio, S., Tang, S., Yang, D.: Closet: An efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM-SIGMOD DMKD'00, Dallas, Texas, USA. (2000) 21-30

2. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient Mining of Association Rules Using Closed Itemset Lattices. Journal of Information Systems 24 (1999) 25-46

3. Wille, R.: Restructuring lattice theory: An approach based on hierarchies of concepts. In Rival, I., ed.: Ordered Sets, Dordrecht-Boston (1982) 445-470

4. Ben Yahia, S., Nguifo, E.M.: Approches d'extraction de règles d'association basées sur la correspondance de Galois. In Boulicaut, J.F., Crémilleux, B., eds.: Revue d'Ingénierie des Systèmes d'Information (ISI), Hermès-Lavoisier. Volume 9. (2004) 23-55

5. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets. In Beeri, C., Buneman, P., eds.: Proceedings of 7th International Conference on Database Theory (ICDT'99), LNCS, volume 1540, Springer-Verlag, Jerusalem, Israel. (1999) 398-416

6. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., Lakhal, L.: Computing iceberg concept lattices with Titanic. Journal on Knowledge and Data Engineering (KDE) 2 (2002) 189-222

7. Zaki, M.J., Hsiao, C.J.: Charm: An efficient algorithm for closed itemset mining. In: Proceedings of the 2nd SIAM International Conference on Data Mining, Arlington, Virginia, USA. (2002) 34-43

8. Uno, T., Kiyomi, M., Arimura, H.: LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In Goethals, B., Zaki, M.J., Bayardo, R., eds.: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'04). Volume 126 of CEUR Workshop Proceedings, Brighton, UK. (2004)

9. Lucchese, C., Orlando, S., Perego, R.: DCI-Closed: a fast and memory efficient algorithm to mine frequent closed itemsets. In Goethals, B., Zaki, M.J., Bayardo, R., eds.: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'04). Volume 126 of CEUR Workshop Proceedings, Brighton, UK. (2004)

10. Ben Yahia, S., Cherif, C.L., Mineau, G., Jaoua, A.: Découverte des règles associatives non redondantes : application aux corpus textuels. Revue d'Intelligence Artificielle (special issue of Intl. Conference of Journées francophones d'Extraction et Gestion des Connaissances (EGC'2003)), Lyon, France 17 (2003) 131-143

11. Bastide, Y., Pasquier, N., Taouil, R., Lakhal, L., Stumme, G.: Mining minimal non-redundant association rules using frequent closed itemsets. In: Proceedings of the International Conference DOOD'2000, LNAI, volume 1861, Springer-Verlag, London, UK. (2000) 972-986

12. Kryszkiewicz, M.: Concise representations of association rules. In Hand, D.J., Adams, N., Bolton, R., eds.: Proceedings of Exploratory Workshop on Pattern Detection and Discovery in Data Mining (ESF), 2002, LNAI, volume 2447, Springer-Verlag, London, UK. (2002) 92-109

13. Kryszkiewicz, M.: Concise representation of frequent patterns based on disjunction-free generators. In: Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM), San Jose, California, USA. (2001) 305-312

14. Davey, B., Priestley, H.: Introduction to Lattices and Order. Cambridge University Press (2002)

15. Hamrouni, T., Ben Yahia, S., Slimani, Y.: Prince : Extraction optimisée des bases génériques de règles sans calcul de fermetures. In: Proceedings of the 23rd International Conference INFORSID, Inforsid Editions, Grenoble, France. (2005) 353-368

16. Ganter, B., Wille, R.: Formal Concept Analysis. Springer-Verlag (1999)

17. Pasquier, N.: Datamining: Algorithmes d'extraction et de réduction des règles d'association dans les bases de données. Thèse de doctorat, Ecole Doctorale Sciences pour l'Ingénieur de Clermont Ferrand, Université Clermont Ferrand II, France (2000)

18. Calders, T., Goethals, B.: Mining all non-derivable frequent itemsets. In Elomaa, T., Mannila, H., Toivonen, H., eds.: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD 2002, LNCS, volume 2431, Springer-Verlag, Helsinki, Finland. (2002) 74-85

19. Bonchi, F., Lucchese, C.: On closed constrained frequent pattern mining. In: Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), Brighton, UK. (2004) 35-42