=Paper=
{{Paper
|id=Vol-126/paper-2
|storemode=property
|title=ABS: Adaptive Borders Search of frequent itemsets
|pdfUrl=https://ceur-ws.org/Vol-126/flouvat.pdf
|volume=Vol-126
|dblpUrl=https://dblp.org/rec/conf/fimi/FlouvatMP04
}}
==ABS: Adaptive Borders Search of frequent itemsets==

Frédéric Flouvat¹, Fabien De Marchi², Jean-Marc Petit¹

¹ Laboratoire LIMOS, UMR CNRS 6158
Université Blaise Pascal - Clermont-Ferrand II
24 avenue des Landais, 63 177 Aubière cedex, France
flouvat@isima.fr, jmpetit@math.univ-bpclermont.fr

² Laboratoire LIRIS, FRE CNRS 2672
Université Claude Bernard Lyon 1
8, boulevard Niels Bohr, 69 622 Villeurbanne cedex, France
fabien.demarchi@liris.cnrs.fr
Abstract

In this paper, we present an ongoing work to discover maximal frequent itemsets in a transactional database. We propose an algorithm called ABS, for Adaptive Borders Search, which is in the spirit of algorithms based on the concept of dualization. From an abstract point of view, our contribution can be seen as an improvement of the basic APRIORI algorithm for mining maximal frequent itemsets. The key point is to decide dynamically at which iteration, if any, the dualization has to be made to avoid the enumeration of all subsets of large maximal itemsets. Once the first dualization has been done from the current negative border, APRIORI is no longer used and instead, another dualization is carried out from the positive border known so far. The process is repeated until no change occurs anymore in the positive border under construction.

Experiments have been done on FIMI datasets, from which tradeoffs on adaptive behavior have been proposed to guess the best iteration for the first dualization. Far from being the best implementation wrt the FIMI'03 contributions, performance evaluations of ABS exhibit better performance than IBE, the only public implementation based on the concept of dualization.

1 Introduction

In this paper, we present an ongoing work to discover maximal frequent itemsets in a transactional database. We propose to adapt an algorithm originally devised for mining inclusion dependencies in databases [8]. This algorithm is called ABS, for Adaptive Borders Search, and is in the spirit of algorithms based on the concept of dualization [14, 21].

The basic idea of our proposition is to combine the strengths of both the levelwise algorithm [1, 18] and the Dualize and Advance algorithm [14] in such a way that:

• "small" maximal frequent itemsets are efficiently generated with levelwise strategies;

• "large" maximal frequent itemsets may be found efficiently by dualization.

The dualization performed is quite similar to that proposed in the Dualize and Advance algorithm. Nevertheless, instead of starting from some subset of maximal frequent itemsets as the Dualize and Advance algorithm does, we use infrequent itemsets to perform the dualization. As a consequence, we obtain the so-called optimistic positive border of maximal frequent itemsets. The set of such candidates corresponds exactly to the k-uniform hypergraph cliques proposed in [22]. As a consequence, our proposition contributes to clarify some related contributions [22, 17, 3, 12, 7], since it gives an exact characterization of the optimistic positive border of maximal frequent itemsets from some subset of infrequent itemsets.

From an abstract point of view, our contribution can be seen as an improvement of the basic APRIORI algorithm for mining maximal frequent itemsets. The key point is to decide dynamically at which iteration, if any, the dualization has to be made to avoid the enumeration of all subsets of large maximal itemsets. Once the first dualization has been done from the current negative border available at that
iteration, APRIORI is no longer used and instead, another dualization is carried out from the positive border known so far. The process is repeated until no change occurs anymore in the positive border under construction.

Experiments have been done on FIMI datasets [10]. The adaptive behavior of our algorithm has been tuned from results gathered from these experiments. For the tested datasets, we were able to guess dynamically the best iteration for the first dualization, a key parameter of our algorithm.

Far from being the best implementation wrt the FIMI'03 contributions [11], performance evaluations of ABS exhibit better performance than IBE [21], the only public implementation based on the concept of dualization.

2 Preliminaries

Let R be a set of symbols called items; a line is a subset of R, and a binary relation r over R is a multiset of lines. We suppose the reader is familiar with the notions of itemsets and support, and with the main aspects of the frequent itemsets mining problem in a binary relation, given a threshold minsup (see e.g. [1] for details). We recall the notion of borders of a set of itemsets [18]. Given a set F of itemsets over R, the positive border of F, denoted by Bd+(F), is defined by Bd+(F) = max⊆ {X ∈ F}. The negative border of F is defined by Bd−(F) = min⊆ {Y ⊆ R | ∀X ∈ F, Y ⊈ X}. If FI is the set of all itemsets frequent in r, then Bd+(FI) is called the set of maximal frequent itemsets in r.

We will use the concepts of hypergraph and minimal transversal of a hypergraph, whose definitions are recalled here (see for example [4] for more details). Let V be a finite set of elements. A set E of subsets of V defines a hypergraph H = (V, E), where the elements of V are called vertices of H and the elements of E edges of H. A transversal T of H = (V, E) is a subset of V that intersects all the elements of E. T is minimal if no other transversal of H is included in T. The set of all minimal transversals of H is noted Tr(H).

The relationship between the notion of borders and the minimal transversals of a hypergraph has been exhibited in [18]. Indeed, any set of itemsets can be seen as a hypergraph; if FI is the set of frequent itemsets in a binary relation r, we have: Bd−(FI) = Tr({R − X | X ∈ FI}).

3 Method description

3.1 Starting with a levelwise algorithm

The algorithm Apriori [1] was initially devoted to frequent itemset mining; nevertheless, it has been proved to be still competitive for maximal frequent itemsets mining in many cases [11], when the size of the elements to discover remains small.

Our goal is to exploit the efficiency of Apriori, but to automatically detect when it will fall into trouble and stop its execution. Then we propose to exploit the knowledge mined so far to initialize a different search, based on the concept of dualization between positive and negative borders; each border is updated and used to compute the corresponding dual border, until a fix point is reached.

3.2 From negative to positive border

In the sequel, let r be a binary database over a set of items R, minsup a minimum support, and FI the set of frequent itemsets in r. After the levelwise part, our method is still iterative; at each iteration i, new elements of the positive and negative borders are expected to be discovered. We denote by Bd+_i (resp. Bd−_i) the subset of Bd+(FI) (resp. Bd−(FI)) discovered until the i-th iteration. In other words, ∀i < j, Bd+_i ⊆ Bd+_j and Bd−_i ⊆ Bd−_j. Roughly speaking, candidates for Bd+_i are obtained from elements of Bd−_i, and candidates for Bd−_{i+1} are obtained from elements of Bd+_i.

The following definitions and results have been proposed in [8] for the inclusion dependency discovery problem in relational databases. We recall them in the context of maximal frequent itemsets mining; only the proofs are omitted.

We first define the notion of optimistic positive border.

Definition 1 (Optimistic positive border) Given a set F of itemsets, the optimistic positive border of F is: Fopt(F) = max⊆ {X ⊆ R | ∀Y ∈ F, Y ⊈ X}.

The next theorem gives a constructive characterization of Fopt(F).

Theorem 1 [8] Fopt(F) = {R − T | T ∈ Tr(F)}.

Therefore, the idea is to compute the optimistic positive border for Bd−_i to obtain exactly the largest itemsets which do not contain any infrequent itemset discovered so far.

Proposition 1 Let X ∈ Fopt(Bd−_i). If sup(X) ≥ minsup, then X ∈ Bd+(FI).

Proof Since X is maximal in the definition of Fopt(Bd−_i), each of its supersets contains at least one element of Bd−_i, and is infrequent by anti-monotonicity. □

Then, Bd+_i is exactly made up of all the frequent itemsets in Fopt(Bd−_i).

3.3 From positive to negative border

In a dual way, the set Bd+_i is then used to compute its negative border Bd−(Bd+_i), to finally update the negative border under construction and obtain Bd−_{i+1}.
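The borders, the minimal transversals and the characterizations above can be checked mechanically on a toy example. The following is a brute-force sketch, not the authors' implementation; the family FI and the helper names are ours:

```python
from itertools import combinations

R = set("ABCD")

def subsets(s):
    return (frozenset(c) for r in range(len(s) + 1)
            for c in combinations(sorted(s), r))

def bd_pos(F):
    # Bd+(F): maximal elements of F wrt set inclusion
    return {X for X in F if not any(X < Y for Y in F)}

def bd_neg(F):
    # Bd-(F): minimal itemsets over R included in no element of F
    out = {Y for Y in subsets(R) if not any(Y <= X for X in F)}
    return {Y for Y in out if not any(Z < Y for Z in out)}

def tr(H):
    # Tr(H): minimal transversals of the hypergraph H (brute force over P(R))
    ts = {T for T in subsets(R) if all(T & e for e in H)}
    return {T for T in ts if not any(S < T for S in ts)}

# a downward-closed family of "frequent" itemsets over R
FI = set(subsets(set("ABC"))) | {frozenset("D"), frozenset("AD")}

# duality of [18]: Bd-(FI) = Tr({R - X | X in FI})
comp = {frozenset(R - X) for X in FI}
assert bd_neg(FI) == tr(comp) == {frozenset("BD"), frozenset("CD")}

# Theorem 1: Fopt(F) is the set of complements of the minimal transversals of F
fopt = {frozenset(R - T) for T in tr(bd_neg(FI))}
assert fopt == bd_pos(FI) == {frozenset("ABC"), frozenset("AD")}
```

Here Fopt(Bd−(FI)) coincides with Bd+(FI) because every element of the optimistic border happens to be frequent; in general, Fopt only bounds the positive border from above.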
The next theorem gives a constructive characterization of Bd−(F), for any set F of frequent itemsets.

Theorem 2 [18] Bd−(F) = Tr({R − X | X ∈ F}).

Proposition 2 Let X ∈ Bd−(Bd+_i). If sup(X) < minsup, then X ∈ Bd−(FI).

Proof Let X be an element of Bd−(Bd+_i). By the definition of the negative border of a set, each proper subset of X is included in an element of Bd+(FI), and then is frequent. □

Then, Bd−_{i+1} is exactly made up of all the infrequent itemsets in Bd−(Bd+_i).

3.4 The Algorithm ABS

Algorithm 1 computes the positive and negative borders of frequent itemsets in a given binary database. Within the framework of levelwise algorithms, ABS decides at each level whether or not the levelwise approach has to be stopped. In that case, the levelwise approach is halted, and the two borders are incrementally updated as described previously. The functions GenPosBorder and GenNegBorder compute respectively the optimistic positive and the negative borders, using the characterizations of theorems 1 and 2. The algorithm terminates when all elements of the optimistic positive border currently computed are frequent. It is worth noting that no dualization may occur at all: in this case, ABS reduces to APRIORI. Proposition 3 ensures the correctness of ABS.

Algorithm 1 ABS: Adaptive Borders Search
Require: a binary database r, an integer minsup
Ensure: Bd+(FI) and Bd−(FI)
 1: F1 = {A ∈ R | sup(A) ≥ minsup}
 2: C2 = AprioriGen(F1)
 3: i = 2; Bd−_1 = R − F1; Bd+_0 = ∅
 4: while Ci ≠ ∅ do
 5:   Fi = {X ∈ Ci | sup(X) ≥ minsup}
 6:   Bd−_i = Bd−_{i−1} ∪ (Ci − Fi)
 7:   Bd+_{i−1} = Bd+_{i−2} ∪ {X ∈ F_{i−1} | ∀Y ∈ Fi, X ⊈ Y}
 8:   if IsDualizationRelevant(i, |Bd−_i|, |Fi|, |Ci|) = TRUE then
 9:     Bd+_i = {X ∈ GenPosBorder(Bd−_i) | sup(X) ≥ minsup}
10:     while Bd+_i ≠ Bd+_{i−1} do
11:       Bd−_i = {X ∈ GenNegBorder(Bd+_i) | sup(X) < minsup}
12:       Bd+_{i+1} = {X ∈ GenPosBorder(Bd−_i) | sup(X) ≥ minsup}
13:       i = i + 1
14:     end while
15:     Return Bd+_i and Bd−_{i−1} and exit
16:   end if
17:   Ci+1 = AprioriGen(Fi)
18:   i = i + 1
19: end while
20: Bd+_{i−1} = Bd+_{i−2} ∪ F_{i−1}
21: Return Bd+_{i−1} and Bd−_{i−1}

The behavior of the function IsDualizationRelevant is described in section 3.5.

Proposition 3 The algorithm ABS returns Bd+(FI) and Bd−(FI).

Proof If the test performed by IsDualizationRelevant() is never true, the demonstration is obvious. If not, at line 15, from propositions 1 and 2, we have Bd+_i ⊆ Bd+(FI) and Bd−_{i−1} ⊆ Bd−(FI). Moreover, the termination condition ensures that Bd+_i = GenPosBorder(Bd−_{i−1}); all elements in Bd+_i are frequent and all elements in Bd−_{i−1} are infrequent. Suppose that ∃X ∈ Bd−(FI) such that X ∉ Bd−_{i−1}. Then:

• if ∃Y ∈ Bd−_{i−1} such that Y ⊂ X, then since Y is infrequent, X ∉ Bd−(FI), a contradiction;

• if no Y ∈ Bd−_{i−1} satisfies Y ⊆ X, then from the definition of the optimistic positive border ∃Z ∈ Bd+_i such that X ⊆ Z, which contradicts the fact that X is infrequent.

Thus Bd−_{i−1} = Bd−(FI). An identical reasoning leads to Bd+_i = Bd+(FI). □

3.5 Adaptive aspects of ABS

The main adaptive aspect of ABS is conveyed by the function IsDualizationRelevant, line 8 of algorithm 1. As mentioned, its goal is to estimate whether it is interesting to dualize the current negative border to the optimistic positive border.

We have identified four parameters specific to a given iteration of the levelwise algorithm, which can be obtained dynamically without any overhead:

• The current level i. No jump is allowed until a given integer threshold; we set the threshold equal to 4, since Apriori is very efficient in practice to explore the levels 1 to 4. In our experiments, dualizing before this level incurs no improvement.

• |Bd−_i|, the size of the current negative border. A simple remark can be made here: if this parameter is very large (more than 100000), the minimal transversals computation becomes prohibitive. We are not aware of existing implementations of minimal transversals
computation able to handle such input hypergraphs¹. Moreover, such cases are likely to correspond to a best-case scenario for Apriori.

• |Fi|, the number of frequent i-itemsets, and |Bd−_i| have to be compared. Indeed, a small value of |Bd−_i| wrt |Fi| is likely to give a successful dualization.

• |Fi| and |Ci|, the number of candidates at level i, can also be compared. If |Fi|/|Ci| is close to 1, we can suspect to be in a "dense" part of the search space, and thus the levelwise search should be stopped.

¹ Experiments conducted in [16, 2] only consider hypergraphs with no more than 32000 edges.

3.6 Practical aspects

3.6.1 Candidate generation from the current positive border

From [18], candidate generation of a levelwise algorithm for a problem representable as sets can be formulated using dualization: at the i-th iteration, we have

Ci+1 = Tr({R − X | X ∈ ∪_{j≤i} Fj}) − ∪_{j≤i} Cj

It is shown in [18] that the candidate itemsets of Ci+1 are exactly of size i + 1, which allows to improve candidate generation.

In the setting of this paper, we can see Ci+1 as the set Bd−_{i+1} − Bd−_i, and thus we get:

Ci+1 = Tr({R − X | X ∈ Bd+_i}) − ∪_{j≤i} Cj

Here, the major difference with a pure levelwise approach is that Bd+_i may contain some elements of size greater than i + 1.

One may wonder about the size of the largest elements of Ci+1: do there exist elements of size strictly greater than i + 1? The answer is yes, as shown in the following non-trivial example.

Example 1 Let r be the binary relation over the schema R = {A, B, C, D, E, F, G, H, I} represented in Table 1. For minsup equal to 1, the borders of frequent itemsets in r are Bd− = {AE, BF, CG, DH, ABCDI} and Bd+ = {ABCHI, ABDGI, ABGHI, ACDFI, ACFHI, ADFGI, AFGHI, BCDEI, BCEHI, BDEGI, BEGHI, CDEFI, CEFHI, DEFGI, EFGHI, ABCD}.

    A B C D E F G H I
    1 1 1 . . . . 1 1   (ABCHI)
    1 1 . 1 . . 1 . 1   (ABDGI)
    1 1 . . . . 1 1 1   (ABGHI)
    1 . 1 1 . 1 . . 1   (ACDFI)
    1 . 1 . . 1 . 1 1   (ACFHI)
    1 . . 1 . 1 1 . 1   (ADFGI)
    1 . . . . 1 1 1 1   (AFGHI)
    . 1 1 1 1 . . . 1   (BCDEI)
    . 1 1 . 1 . . 1 1   (BCEHI)
    . 1 . 1 1 . 1 . 1   (BDEGI)
    . 1 . . 1 . 1 1 1   (BEGHI)
    . . 1 1 1 1 . . 1   (CDEFI)
    . . 1 . 1 1 . 1 1   (CEFHI)
    . . . 1 1 1 1 . 1   (DEFGI)
    . . . . 1 1 1 1 1   (EFGHI)
    1 1 1 1 . . . . .   (ABCD)

Table 1. Example database (each row is one line of r; a 1 marks that the item belongs to the line)

After a levelwise pass until level two, the four infrequent itemsets of size two have been discovered, i.e. Bd−_2 = {AE, BF, CG, DH}. Suppose the algorithm decides here to stop the pure levelwise search. Then, these sets are used to compute the optimistic positive border from level 2. It is made up of 16 itemsets of size 5, among which the only infrequent itemset is ABCDI. Thus, at this time, Bd+_2 = {ABCHI, ABDGI, ABGHI, ACDFI, ACFHI, ADFGI, AFGHI, BCDEI, BCEHI, BDEGI, BEGHI, CDEFI, CEFHI, DEFGI, EFGHI}. We obtain the new element ABCD ∈ Bd−(Bd+_2), of size 4, it being understood that no new element of size 3 exists.

In our first implementation, computing the set Bd−(Bd+_i) using minimal transversals had a quite prohibitive cost on large hypergraph instances. Therefore, we made the choice to restrict Bd−(Bd+_i) to its (i + 1)-itemsets for efficiency reasons. This choice has no effect on the correctness of the algorithm, since the termination condition is always the same².

² We suspect the algorithm PincerSearch [17] not to be complete. Indeed, the search strategy of PincerSearch is very close to our proposition: if we only consider (i + 1)-itemsets in Bd−(Bd+_i), they correspond exactly to the candidate set Ci+1 of PincerSearch. Since PincerSearch stops as soon as Ci+1 = ∅, some elements could be forgotten. In example 1, after level 2, C3 is empty, and therefore the maximal set ABCD seems to never be generated by PincerSearch.

3.6.2 Dealing with "almost frequent" large candidate itemsets

Let us consider the case of a candidate itemset obtained after a dualization from the current negative border. Let X be this candidate. Two main cases exist: either X is frequent, or X is infrequent. In the latter case, we propose to estimate a degree of error in order to "qualify the jump". Given a new user-defined threshold δ and a minimal support minsup, an error measure, noted
error(X, minsup), can be defined from the ratio between the support of the infrequent itemset X and minsup, i.e. error(X, minsup) = 1 − support(X)/minsup.

Two sub-cases are worth considering:

• either error(X, minsup) ≤ δ: the "jump" was not successful, but solutions should exist among the nearest subsets of X;

• or error(X, minsup) > δ: in that case, the jump was over-optimistic and probably no solution exists among the nearest generalizations of X.

Note that this error measure is decreasing, i.e. X ⊂ Y ⇒ error(X, minsup) ≤ error(Y, minsup).

In our current implementation, these almost frequent large itemsets are first considered as frequent to enable more pruning in subsequent passes. Afterward, they are considered at the very end of our algorithm. A pure top-down levelwise approach has been implemented to find out those of their subsets which can be maximal frequent itemsets.

4 Implementation details and experimental results

4.1 Implementation details

An implementation of the algorithm has been performed in C++/STL. Two existing implementations available from the FIMI repository website [10] were borrowed: the Apriori code of C. Borgelt [5] and the prefix-tree implementation of B. Goethals using C++/STL [10]. To keep coherence with this implementation, we use a similar data structure for the new parts of the algorithm. The itemsets and the transactions are stored in a prefix-tree [1, 6].

Minimal transversals computation For the minimal transversals computation, we implemented the algorithm proposed in [9] using a prefix-tree in order to handle relatively large hypergraph instances. Its incremental aspect is very interesting in our case, since the negative border is itself incremental. Note that improvements have been performed by exploiting the knowledge of previous dualizations. We do not give more details here.

4.2 Experimental results

We conducted experiments on a Pentium 4 3GHz processor, with 1GB of memory. The operating system was Redhat Linux 7.3 and we used gcc 2.96 for the compilation. We used four datasets available on the FIMI'03 repository.

We first evaluate the influence of the level from which the levelwise approach is stopped on the performances of ABS. Then, the impact of "almost frequent" large itemsets is studied for different threshold values for the error measure. Finally, we compare ABS with four maximal frequent itemsets mining algorithms: Apriori and Eclat [12] implemented by C. Borgelt [5], Fpmax [13] based on FP-trees [15], and IBE [21].

[Figure 1. Forcing the first dualization at level k for connect (top, panel "Connect Minsup 60%") and pumsb* (bottom, panel "Pumsb* Minsup 30%"); x-axis: level (3 to 10), y-axis: total time (sec).]

In figure 1, we forced the first dualization at different levels (from 3 to 8), on the connect dataset with a minsup of 80% and the pumsb* dataset with a minsup of 30%. The results confirm the necessity to fix this parameter dynamically, and thus justify an adaptive approach. Second, for all tested datasets, our function IsDualizationRelevant dynamically determined the best level to begin the dualization.

The optimization based on the error measure is evaluated in figure 2. On the pumsb dataset (top), this optimization appears to be interesting with a threshold value near 0.002. Nevertheless, on the connect dataset (bottom), no improvement is achieved. This comes from the fact that the proposed error measure is not strictly decreasing, and the equivalence classes induced by closed frequent itemsets are large. Our top-down levelwise approach is prone to fail on this kind of databases.
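The error measure just discussed is straightforward to compute. The following minimal sketch illustrates the two sub-cases of section 3.6.2; δ = 0.002 is the threshold value reported as interesting on pumsb above, while the absolute minsup of 1000 is a hypothetical value of ours:

```python
def error(support_X, minsup):
    # error(X, minsup) = 1 - support(X) / minsup, in (0, 1] for an infrequent X
    return 1.0 - support_X / minsup

delta = 0.002    # threshold near the best value observed on pumsb
minsup = 1000    # hypothetical absolute support threshold

# near miss: the jump failed, but maximal frequent itemsets should exist
# among the nearest subsets of X
assert error(999, minsup) <= delta

# over-optimistic jump: no solution expected among the nearest
# generalizations of X
assert error(500, minsup) > delta
```

Since the measure is decreasing along set inclusion, a near-miss candidate localizes the search for maximal frequent itemsets just below it, which is what the top-down pass at the end of the algorithm exploits.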
[Figure 2. Execution times for pumsb (top; minsup 60%, 65%, 70%, 75%) and connect (bottom; minsup 50%, 60%, 70%) wrt different error measure thresholds; x-axis: error threshold (0 to 0.06), y-axis: total time (sec).]

From figures 3, 5 and 6, ABS is far from competing with the best known implementations, but tends to outperform IBE in most of our experiments. Recall that IBE is the unique implementation based on the concept of dualization available from FIMI'03. We believe that this is due to the number of dualizations performed by IBE, which is in the size of the positive border.

[Figure 3. Execution times for database Pumsb (Apriori, Eclat, FPmax, IBE, ABS); x-axis: minimum support from 95% down to 50%, y-axis: total time (sec).]

From figure 4, IBE exhibits better performances than ABS for low support thresholds (less than 20%). This is due to the fact that, while the size of the positive border remains small (less than 5000 elements), the size of the negative border exceeds 10^6 elements, where some elements appear to have a very large size. This seems to be the worst case for ABS.

[Figure 4. Execution times for database Pumsb* (Apriori, Eclat, FPmax, IBE, ABS); x-axis: minimum support from 50% down to 0%, y-axis: total time (sec).]

From figure 5, ABS behaves like Apriori, as expected. Indeed, the positive border of retail is made up of "small" itemsets, and Apriori turns out to be the best implementation for this kind of datasets.

[Figure 5. Execution times for database Retail (Apriori, Eclat, FPmax, IBE, ABS); x-axis: minimum support from 0.1% down to 0%, y-axis: total time (sec).]

From figure 6, ABS is not as efficient as the best known implementations (e.g. FPmax), but improves on Apriori by a factor of ten and beats Eclat and IBE.
[Figure 6. Execution times for database Connect (Apriori, Eclat, FPmax, IBE, ABS); x-axis: minimum support from 100% down to 0%, y-axis: total time (sec).]

To sum up, two main reasons explain our mitigated results: 1) the cost of dualization remains high on very large hypergraph instances, and 2) candidate generation and support counting seem not to be efficient enough in our current implementation.

The main parameter influencing the performance of ABS turns out to be the negative border. If, for a given minsup, the negative border does not become too huge and its largest element remains "small" with respect to the largest maximal frequent itemset, ABS should have good performance.

5 Related works

Several algorithms exist for mining maximal frequent itemsets in a transactional database (see FIMI'03). The goal is always to avoid an exponential search by characterizing as fast as possible the largest frequent itemsets without exploring their subsets. MaxMiner [3] uses a levelwise approach to explore the candidate itemsets, using Rymon's enumeration system [20], in which itemsets are arranged in a non-redundant tree. But when a candidate X is counted over the database, the greatest candidate in the subtree of X is also counted; if it is frequent, then all the subtree can be pruned by anti-monotonicity of the "is frequent" property. Jumps done by MaxMiner depend on the ordering of items used to build the tree and are therefore quite different from the jumps proposed in this paper. The algorithms Mafia [7] and GenMax [12] use the same principle as MaxMiner with efficient optimizations, e.g. vertical bitmaps.

The Pincer-Search algorithm [17] uses a search strategy very close to ours. After a levelwise initialization, the principle is also to look at the largest not yet eliminated candidates. However, these large candidates are not characterized in a formal way.

In [14], the authors propose the Dualize and Advance algorithm. In their approach, the positive border under construction is always a subset of the positive border to be discovered. At each step, from some elements of the positive border already discovered, they generate the corresponding negative border. If one element of the negative border appears to be satisfied, they generate a specialization of it which belongs to the positive border, and they re-iterate the process until each element of the negative border is indeed not satisfied. An implementation of a variant of Dualize and Advance has been proposed in [21], with an irredundant dualization. Their code is available from the FIMI'03 website.

Some algorithms like Mafia [7] or DCI [19] can adapt themselves to mine frequent itemsets with respect to the dataset density and some architectural characteristics (e.g. available memory). Even if these aspects improve performances, they only concern choices of data structures; the mentioned algorithms do not really adapt their strategy to explore the search space.

6 Conclusion and future works

In this paper, we have presented an ongoing effort toward the discovery of maximal frequent itemsets. Our contribution takes its roots from the algorithm ZigZag, devised for inclusion dependency discovery in databases. Even if these two data mining problems fit into the same theoretical framework [18], they widely differ in practice, which is not a surprise. Indeed, while ZigZag performed very well in our experiments³, ABS does not exhibit such a good behavior for maximal frequent itemsets mining. Many reasons explain this result, for instance the availability of public datasets allowing thorough experimentations and the intrinsic properties of each problem; maybe the most important reason lies in the cost of a database access, in-memory resident data vs. data stored in a DBMS.

³ Note that dynamic parameters were quite different, e.g. the first dualization was always performed at the second level.

Many improvements can be brought to our current implementation. Some are specific to our algorithm, like for instance minimal transversal computation on large hypergraph instances, or taking into account large equivalence classes induced by closed frequent itemsets during candidate generation from "almost frequent" itemsets. Some others belong to the state of the art of maximal frequent itemsets implementation: managing huge sets of sets, support counting... Complexity issues also need to be addressed.

To end up, we want to quote a personal note on the main objective of the FIMI workshop. We believe that frequent, closed and maximal itemsets mining are key data mining tasks, since algorithms devised to solve these tasks are likely to be used in other contexts under some conditions [18]. Roughly speaking, for every problem representable as sets
with an anti-monotone predicate, as for instance functional dependency inference or simply anti-monotone predicates on itemsets other than "is frequent", the algorithms devised for FIMI should be useful to answer these tasks. Nevertheless, it seems rather optimistic to envision the application of many FIMI'03 [11] implementations to another data mining problem representable as sets. Indeed, even if the development of efficient data structures for managing huge sets of sets is definitely useful, loading the database in main memory using sophisticated data structures specially devised for the anti-monotone predicate to be mined turns out to give very efficient algorithms, but does not serve other data mining tasks.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, International Conference on Very Large Data Bases (VLDB'94), Santiago de Chile, Chile, pages 487-499. Morgan Kaufmann, 1994.
[2] J. Bailey, T. Manoukian, and K. Ramamohanarao. A fast algorithm for computing hypergraph transversals and its application in mining emerging patterns. In International Conference on Data Mining (ICDM'03), Florida, USA, pages 485-488, 2003.
[3] R. J. Bayardo. Efficiently mining long patterns from databases. In L. M. Haas and A. Tiwary, editors, ACM SIGMOD Conference, Seattle, USA, pages 85-93, 1998.
[4] C. Berge. Graphs and Hypergraphs. North Holland, Amsterdam, 1973.
[5] C. Borgelt. Efficient implementations of apriori and eclat. In FIMI'03 Workshop on Frequent Itemset Mining Implementations, November 2003.
[6] C. Borgelt and R. Kruse. Induction of association rules: Apriori implementation. 2002.
[7] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In International Conference on Data Engineering (ICDE'01), Heidelberg, Germany, pages 443-452. IEEE CS, 2001.
[8] F. De Marchi and J.-M. Petit. Zigzag: a new algorithm for discovering large inclusion dependencies in relational databases. In International Conference on Data Mining (ICDM'03), Melbourne, Florida, USA, pages 27-34. IEEE Computer Society, 2003.
[9] J. Demetrovics and V. Thi. Some remarks on generating armstrong and inferring functional dependencies relation. Acta Cybernetica, 12(2):167-180, 1995.
[10] B. Goethals. Frequent itemset mining implementations repository, http://fimi.cs.helsinki.fi/, 2003.
[11] B. Goethals and M. Zaki, editors. Workshop on Frequent Itemset Mining Implementations. CEUR Workshop Proceedings, 2003.
[12] K. Gouda and M. J. Zaki. Efficiently mining of maximal frequent itemsets. In N. Cercone, T. Y. Lin, and X. Wu, editors, International Conference on Data Mining (ICDM'01), San Jose, USA. IEEE Computer Society, 2001.
[13] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In FIMI'03 Workshop on Frequent Itemset Mining Implementations, November 2003.
[14] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharma. Discovering all most specific sentences. ACM Transactions on Database Systems, 28(2):140-174, 2003.
[15] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD'00, Dallas, Texas, USA, 2000.
[16] D. J. Kavvadias and E. C. Stavropoulos. Evaluation of an algorithm for the transversal hypergraph problem. In J. S. Vitter and C. D. Zaroliagis, editors, Algorithm Engineering, International Workshop, WAE'99, London, UK, volume 1668, 1999.
[17] D.-I. Lin and Z. M. Kedem. Pincer search: A new algorithm for discovering the maximum frequent set. In H.-J. Schek, F. Saltor, I. Ramos, and G. Alonso, editors, Extending Database Technology (EDBT'98), Valencia, Spain, volume 1377 of Lecture Notes in Computer Science, pages 105-119. Springer, 1998.
[18] H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(1):241-258, 1997.
[19] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and resource-aware mining of frequent sets. In International Conference on Data Mining (ICDM'02), Maebashi City, Japan, pages 338-345, 2002.
[20] R. Rymon. Search through systematic set enumeration. In B. Nebel, C. Rich, and W. R. Swartout, editors, International Conference on Principles of Knowledge Representation and Reasoning (KR'92), Cambridge, USA, pages 539-550. Morgan Kaufmann, 1992.
[21] T. Uno and K. Satoh. Detailed description of an algorithm for enumeration of maximal frequent sets with irredundant dualization. In B. Goethals and M. Zaki, editors, ICDM 2003 Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings, 2003.
[22] M.-J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, USA, pages 283-286. AAAI Press, 1997.