ABS: Adaptive Borders Search of frequent itemsets

Frédéric Flouvat (1), Fabien De Marchi (2), Jean-Marc Petit (1)

(1) Laboratoire LIMOS, UMR CNRS 6158, Université Blaise Pascal - Clermont-Ferrand II, 24 avenue des Landais, 63 177 Aubière cedex, France. flouvat@isima.fr, jmpetit@math.univ-bpclermont.fr
(2) Laboratoire LIRIS, FRE CNRS 2672, Université Claude Bernard Lyon 1, 8 boulevard Niels Bohr, 69 622 Villeurbanne cedex, France. fabien.demarchi@liris.cnrs.fr

Abstract

In this paper, we present an ongoing work to discover maximal frequent itemsets in a transactional database. We propose an algorithm called ABS, for Adaptive Borders Search, which is in the spirit of algorithms based on the concept of dualization. From an abstract point of view, our contribution can be seen as an improvement of the basic APRIORI algorithm for mining maximal frequent itemsets. The key point is to decide dynamically at which iteration, if any, the dualization has to be made to avoid the enumeration of all subsets of large maximal itemsets. Once the first dualization has been done from the current negative border, APRIORI is no longer used; instead, another dualization is carried out from the positive border known so far. The process is repeated until no change occurs anymore in the positive border under construction.

Experiments have been done on FIMI datasets, from which tradeoffs on adaptive behavior have been proposed to guess the best iteration for the first dualization. Far from being the best implementation wrt FIMI'03 contributions, performance evaluations of ABS exhibit better performance than IBE, the only public implementation based on the concept of dualization.

1 Introduction

In this paper, we present an ongoing work to discover maximal frequent itemsets in a transactional database. We propose to adapt an algorithm originally devised for mining inclusion dependencies in databases [8]. This algorithm is called ABS, for Adaptive Borders Search, and is in the spirit of algorithms based on the concept of dualization [14, 21].

The basic idea of our proposition is to combine the strengths of both the levelwise algorithm [1, 18] and the Dualize and Advance algorithm [14] in such a way that:

• "small" maximal frequent itemsets are efficiently generated with levelwise strategies;

• "large" maximal frequent itemsets may be found efficiently by dualization.

The dualization performed is quite similar to that proposed in the Dualize and Advance algorithm. Nevertheless, instead of starting from some subset of maximal frequent itemsets as Dualize and Advance does, we use infrequent itemsets to perform the dualization. As a consequence, we obtain the so-called optimistic positive border of maximal frequent itemsets. The set of such candidates corresponds exactly to the k-uniform hypergraph cliques proposed in [22]. As a consequence, our proposition contributes to clarify some related contributions [22, 17, 3, 12, 7], since it gives an exact characterization of the optimistic positive border of maximal frequent itemsets from some subset of infrequent itemsets.

From an abstract point of view, our contribution can be seen as an improvement of the basic APRIORI algorithm for mining maximal frequent itemsets. The key point is to decide dynamically at which iteration, if any, the dualization has to be made to avoid the enumeration of all subsets of large maximal itemsets. Once the first dualization has been done from the current negative border available at that iteration, APRIORI is no longer used; instead, another dualization is carried out from the positive border known so far. The process is repeated until no change occurs anymore in the positive border under construction.

Experiments have been done on FIMI datasets [10]. The adaptive behavior of our algorithm has been tuned from results gathered from these experiments. For the tested datasets, we were able to guess dynamically the best iteration for the first dualization, a key parameter of our algorithm.

Far from being the best implementation wrt FIMI'03 contributions [11], performance evaluations of ABS exhibit better performance than IBE [21], the only public implementation based on the concept of dualization.

2 Preliminaries

Let R be a set of symbols called items; a line is a subset of R, and a binary relation r over R is a multiset of lines. We suppose the reader is familiar with the notions of itemsets, support, and with the main aspects of the frequent itemsets mining problem in a binary relation, given a threshold minsup (see e.g. [1] for details).

We recall the notion of borders of a set of itemsets [18]. Given F a set of itemsets over R, the positive border of F, denoted by Bd+(F), is defined by Bd+(F) = max_⊆ {X ∈ F}. The negative border of F is defined by Bd−(F) = min_⊆ {Y ⊆ R | ∀X ∈ F, Y ⊄ X}. If FI is the set of all itemsets frequent in r, then Bd+(FI) is called the set of maximal frequent itemsets in r.

We will use the concepts of hypergraph and minimal transversal of a hypergraph, whose definitions are recalled here (see for example [4] for more details). Let V be a finite set of elements. A set E of subsets of V defines a hypergraph H = (V, E), where the elements of V are called the vertices of H and the elements of E its edges. A transversal T of H = (V, E) is a subset of V that intersects all the elements of E. T is minimal if no other transversal of H is included in T. The set of all minimal transversals of H is denoted Tr(H).

The relationship between the notion of borders and minimal transversals of a hypergraph has been exhibited in [18]. Indeed, any set of itemsets can be seen as a hypergraph; if FI is the set of frequent itemsets in a binary relation r, we have Tr(co(FI)) = Bd−(FI), where co(G) = {R − X | X ∈ G} denotes the set of complements of the itemsets in G.

3 Method description

3.1 Starting with a levelwise algorithm

The algorithm Apriori [1] was initially devoted to frequent itemsets mining; nevertheless, it has been proved to be still competitive for maximal frequent itemsets mining in many cases [11], when the size of the elements to discover remains small.

Our goal is to exploit the efficiency of Apriori, but to automatically detect when it will fall into trouble and stop its execution. Then we propose to exploit the knowledge mined so far to initialize a different search, based on the concept of dualization between positive and negative borders; each border is updated and used to compute the corresponding dual border, until a fix point is reached.

3.2 From negative to positive border

In the sequel, let r be a binary database over a set of items R, minsup a minimum support, and FI the set of frequent itemsets in r. After the levelwise part, our method is still iterative; at each iteration i, new elements of the positive and negative borders are expected to be discovered. We denote by Bd+_i (resp. Bd−_i) the subset of Bd+(FI) (resp. Bd−(FI)) discovered until the i-th iteration. In other words, ∀i < j, Bd+_i ⊆ Bd+_j and Bd−_i ⊆ Bd−_j. Roughly speaking, candidates for Bd+_i are obtained from elements of Bd−_i, and candidates for Bd−_{i+1} are obtained from elements of Bd+_i.

The following definitions and results have been proposed in [8] for the inclusion dependency discovery problem in relational databases. We recall them in the context of maximal frequent itemsets mining; only the proofs are omitted.

We first define the notion of optimistic positive border.

Definition 1 (Optimistic positive border) Given a set F of itemsets, the optimistic positive border of F is: Fopt(F) = max_⊆ {X ⊆ R | ∀Y ∈ F, Y ⊄ X}.

The next theorem gives a constructive characterization of Fopt(F).

Theorem 1 [8] Fopt(F) = co(Tr(F)).

Therefore, the idea is to compute the optimistic positive border of Bd−_i to obtain exactly the largest itemsets which do not contain any infrequent itemset discovered so far.

Proposition 1 Let X ∈ Fopt(Bd−_i). If sup(X) ≥ minsup, then X ∈ Bd+(FI).

Proof. Since X is maximal in the definition of Fopt(Bd−_i), each of its supersets contains at least one element of Bd−_i, and is thus infrequent by anti-monotonicity. □

Then, Bd+_i is exactly made up of all the frequent itemsets in Fopt(Bd−_i).
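The definitions above lend themselves to a direct, brute-force illustration. The following Python sketch (ours, for illustration only; it is not the paper's C++ implementation and only scales to toy universes, and all names are hypothetical) computes minimal transversals by enumeration and derives the optimistic positive border both via Theorem 1, as complements of the minimal transversals of F, and directly from Definition 1:

```python
from itertools import combinations

def minimal(sets):
    """Keep only the subset-minimal sets (min with respect to inclusion)."""
    return [s for s in sets if not any(t < s for t in sets)]

def maximal(sets):
    """Keep only the subset-maximal sets (max with respect to inclusion)."""
    return [s for s in sets if not any(s < t for t in sets)]

def minimal_transversals(V, edges):
    """Tr(H): all minimal sets of vertices meeting every edge (brute force)."""
    hits = [set(c) for k in range(len(V) + 1)
            for c in combinations(sorted(V), k)
            if all(set(c) & e for e in edges)]
    return minimal(hits)

def fopt(R, F):
    """Theorem 1: Fopt(F) is the set of complements of Tr(F)."""
    return [R - T for T in minimal_transversals(R, F)]

def fopt_direct(R, F):
    """Definition 1: maximal X subset of R containing no Y in F."""
    candidates = [set(c) for k in range(len(R) + 1)
                  for c in combinations(sorted(R), k)
                  if not any(Y <= set(c) for Y in F)]
    return maximal(candidates)
```

On R = {A, B, C, D} with F = {AB, C}, both routes yield {AD, BD}: the largest itemsets containing neither AB nor C.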
3.3 From positive to negative border

In a dual way, the set Bd+_i is then used to compute its negative border Bd−(Bd+_i), to finally update the negative border under construction and obtain Bd−_{i+1}.

The next theorem gives a constructive characterization of Bd−(F), for any set F of frequent itemsets.

Theorem 2 [18] Bd−(F) = Tr(co(F)), where co(F) = {R − X | X ∈ F}.

Proposition 2 Let X ∈ Bd−(Bd+_i). If sup(X) < minsup, then X ∈ Bd−(FI).

Proof. Let X be an element of Bd−(Bd+_i). By the definition of the negative border of a set, each proper subset of X is included in an element of Bd+(FI), and then is frequent. □

Then, Bd−_{i+1} is exactly made up of all the infrequent itemsets in Bd−(Bd+_i).

3.4 The algorithm ABS

Algorithm 1 computes the positive and negative borders of frequent itemsets in a given binary database. Within the framework of levelwise algorithms, ABS decides at each level whether or not the levelwise approach has to be stopped. In that case, the levelwise approach is halted, and the two borders are incrementally updated as described previously. The functions GenPosBorder and GenNegBorder compute respectively the optimistic positive and negative borders, using the characterizations of Theorems 1 and 2. The algorithm terminates when all elements of the optimistic positive border currently computed are frequent. It is worth noting that no dualization may occur at all: in this case, ABS reduces to APRIORI. Proposition 3 ensures the correctness of ABS. The behavior of the function IsDualizationRelevant is described in Section 3.5.

Algorithm 1 ABS: Adaptive Borders Search
Require: a binary database r, an integer minsup
Ensure: Bd+(FI) and Bd−(FI)
1: F1 = {A ∈ R | sup(A) ≥ minsup}
2: C2 = AprioriGen(F1)
3: i = 2; Bd−_1 = R − F1; Bd+_0 = ∅
4: while Ci ≠ ∅ do
5:   Fi = {X ∈ Ci | sup(X) ≥ minsup}
6:   Bd−_i = Bd−_{i−1} ∪ (Ci − Fi)
7:   Bd+_{i−1} = Bd+_{i−2} ∪ {X ∈ F_{i−1} | ∀Y ∈ Fi, X ⊄ Y}
8:   if IsDualizationRelevant(i, |Bd−_i|, |Fi|, |Ci|) = TRUE then
9:     Bd+_i = {X ∈ GenPosBorder(Bd−_i) | sup(X) ≥ minsup}
10:    while Bd+_i ≠ Bd+_{i−1} do
11:      Bd−_i = {X ∈ GenNegBorder(Bd+_i) | sup(X) < minsup}
12:      Bd+_{i+1} = {X ∈ GenPosBorder(Bd−_i) | sup(X) ≥ minsup}
13:      i = i + 1
14:    end while
15:    return Bd+_i and Bd−_{i−1} and exit
16:  end if
17:  Ci+1 = AprioriGen(Fi)
18:  i = i + 1
19: end while
20: Bd+_{i−1} = Bd+_{i−2} ∪ F_{i−1}
21: return Bd+_{i−1} and Bd−_{i−1}

Proposition 3 The algorithm ABS returns Bd+(FI) and Bd−(FI).

Proof. If the test performed by IsDualizationRelevant() is never true, the demonstration is obvious. If not, at line 15, from Propositions 1 and 2, we have Bd+_i ⊆ Bd+(FI) and Bd−_{i−1} ⊆ Bd−(FI). Moreover, the termination condition ensures that Bd+_i = GenPosBorder(Bd−_{i−1}); all elements in Bd+_i are frequent and all elements in Bd−_{i−1} are infrequent. Suppose that there exists X ∈ Bd−(FI) with X ∉ Bd−_{i−1}. Then:

• if there exists Y ∈ Bd−_{i−1} with Y ⊂ X, then since Y is infrequent, X ∉ Bd−(FI), a contradiction;

• if there exists no Y ∈ Bd−_{i−1} with Y ⊆ X, then from the definition of the optimistic positive border, there exists Z ∈ Bd+_i with X ⊆ Z, which contradicts the fact that X is infrequent.

Thus Bd−_{i−1} = Bd−(FI). An identical reasoning leads to Bd+_i = Bd+(FI). □

3.5 Adaptive aspects of ABS

The main adaptive aspect of ABS is conveyed by the function IsDualizationRelevant, line 8 of Algorithm 1. As mentioned, its goal is to estimate whether it is interesting to dualize the current negative border to the optimistic positive border.

We have identified four parameters, specific to a given iteration of the levelwise algorithm, which can be obtained dynamically without any overhead:

• The current level i. No jump is allowed until a given integer threshold; we set the threshold to 4, since Apriori is very efficient in practice to explore levels 1 to 4. In our experiments, dualizing before this level incurs no improvement.

• |Bd−_i|, the size of the current negative border. A simple remark can be made here: if this parameter is very large (more than 100000), the minimal transversals computation becomes prohibitive. We are not aware of existing implementations of minimal transversals computation able to handle such input hypergraphs (experiments conducted in [16, 2] only consider hypergraphs with no more than 32000 edges). Moreover, such cases are likely to correspond to the best scenario for Apriori.

• |Fi|, the number of frequent i-itemsets, and |Bd−_i| have to be compared. Indeed, a small value of |Bd−_i| wrt |Fi| is likely to give a successful dualization.

• |Fi| and |Ci|, the number of candidates at level i, can also be compared. If |Fi|/|Ci| is close to 1, we can suspect being in a "dense" part of the search space, and thus the levelwise search should be stopped.

Table 1. Example database

3.6 Practical aspects

3.6.1 Candidate generation from the current positive border

From [18], candidate generation of a levelwise algorithm for a problem representable as sets can be formulated using dualization: at the i-th iteration, we have

Ci+1 = Tr(co(∪_{j≤i} Fj)) − ∪_{j≤i} Cj
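Dually, Theorem 2 can be checked with the same brute-force machinery. The sketch below (illustrative Python, not the paper's implementation; the toy positive border is an assumption of ours) computes Bd−(F) as the minimal transversals of the complemented positive border co(Bd+):

```python
from itertools import combinations

def minimal_transversals(V, edges):
    """Tr(H) by brute-force enumeration (toy-sized V only)."""
    hits = [set(c) for k in range(len(V) + 1)
            for c in combinations(sorted(V), k)
            if all(set(c) & e for e in edges)]
    return [t for t in hits if not any(u < t for u in hits)]

# Toy positive border over R = {A, B, C, D}: the maximal frequent
# itemsets are assumed to be ABC and ABD.
R = set("ABCD")
bd_pos = [set("ABC"), set("ABD")]

# Theorem 2: Bd-(F) = Tr(co(F)), with co(F) = {R - X | X in F}.
co_bd_pos = [R - X for X in bd_pos]        # complements: {D} and {C}
bd_neg = minimal_transversals(R, co_bd_pos)
```

Here Bd− = {CD}: C and D are each frequent (each lies in a maximal itemset), but CD is contained in neither ABC nor ABD, and it is the unique minimal such itemset.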
It is shown in [18] that candidate itemsets of Ci+1 are exactly of size i + 1, which allows to improve candidate generation.

In the setting of this paper, we can see Ci+1 as the set Bd−_{i+1} − Bd−_i, and thus we get:

Ci+1 = Tr(co(Bd+_i)) − ∪_{j≤i} Cj

Here, the major difference with a pure levelwise approach is that Bd+_i may contain some elements of size greater than i + 1. One may wonder about the size of the largest elements of Ci+1: do there exist elements of size strictly greater than i + 1? The answer is yes, as shown in the following non-trivial example.

Example 1 Let r be the binary relation over a schema R = {A, B, C, D, E, F, G, H, I} represented in Table 1. For a minsup equal to 1, the borders of frequent itemsets in r are Bd− = {AE, BF, CG, DH, ABCDI} and Bd+ = {ABCHI, ABDGI, ABGHI, ACDFI, ACFHI, ADFGI, AFGHI, BCDEI, BCEHI, BDEGI, BEGHI, CDEFI, CEFHI, DEFGI, EFGHI, ABCD}.

After a levelwise pass until level two, the four non-frequent itemsets of size two have been discovered, i.e. Bd−_2 = {AE, BF, CG, DH}. Suppose the algorithm decides here to stop the pure levelwise search. Then, these sets are used to compute the optimistic positive border from level 2. It is made up of 16 itemsets of size 5, among which the only non-frequent itemset is ABCDI. Thus, at this time, Bd+_2 = {ABCHI, ABDGI, ABGHI, ACDFI, ACFHI, ADFGI, AFGHI, BCDEI, BCEHI, BDEGI, BEGHI, CDEFI, CEFHI, DEFGI, EFGHI}. We obtain Bd−(Bd+_2) = {ABCD}, of size 4, it being understood that no element of size 3 exists.

In our first implementation, computing the set Bd−(Bd+_i) using minimal transversals had a quite prohibitive cost on large hypergraph instances. Therefore, we made the choice to restrict Bd−(Bd+_i) to its (i + 1)-itemsets for efficiency reasons. This choice has no effect on the correctness of the algorithm, since the termination condition remains the same. (As an aside, we suspect the algorithm PincerSearch [17] not to be complete. Indeed, the search strategy of PincerSearch is very close to our proposition: if we only consider (i + 1)-itemsets in Bd−(Bd+_i), they correspond exactly to the candidate set Ci+1 of PincerSearch. Since PincerSearch stops as soon as Ci+1 = ∅, some elements could be forgotten. From Example 1, after level 2, C3 is empty, and therefore the maximal set ABCD seems to be never generated by PincerSearch.)

3.6.2 Dealing with "almost frequent" large candidate itemsets

Let us consider the case of a candidate itemset obtained after a dualization from the current negative border. Let X be this candidate. Two main cases exist: either X is frequent, or X is infrequent. In the latter case, we propose to estimate a degree of error in order to "qualify the jump".

Given a new user-defined threshold δ and the minimal support minsup, an error measure, noted error(X, minsup), can be defined from the ratio between the support of the infrequent itemset X and minsup, i.e. error(X, minsup) = 1 − support(X)/minsup.

Two sub-cases are worth considering:

• either error(X, minsup) ≤ δ: the "jump" was not successful, but solutions should exist among the nearest subsets of X;

• or error(X, minsup) > δ: in that case, the jump was over-optimistic and, probably, no solution exists among the nearest generalizations of X.

Note that this error measure is decreasing, i.e. X ⊂ Y ⇒ error(X, minsup) ≤ error(Y, minsup).
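The error measure and the δ test can be sketched in a few lines of Python (the support values below are hypothetical and the function names are ours):

```python
def error(support_x, minsup):
    """error(X, minsup) = 1 - support(X)/minsup, which lies in (0, 1]
    for an infrequent candidate X (i.e. support(X) < minsup)."""
    return 1.0 - support_x / minsup

def qualify_jump(support_x, minsup, delta):
    """The two sub-cases of Section 3.6.2: a small error suggests maximal
    frequent itemsets among the nearest subsets of X; a large error
    indicates an over-optimistic jump."""
    if error(support_x, minsup) <= delta:
        return "explore nearest subsets of X"
    return "over-optimistic jump"
```

Since support is anti-monotone (X ⊂ Y implies support(X) ≥ support(Y)), we get error(X, minsup) ≤ error(Y, minsup), which is exactly the decreasing property noted above.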
In our current implementation, these almost frequent large itemsets are first considered as frequent, to enable more pruning in subsequent passes. Afterward, they are considered at the very end of our algorithm: a pure top-down levelwise approach has been implemented to find out their subsets which can be maximal frequent itemsets.

4 Implementation details and experimental results

4.1 Implementation details

An implementation of the algorithm has been performed in C++/STL. Two existing implementations available from the FIMI repository website [10] were borrowed: the Apriori code of C. Borgelt [5] and the prefix-tree implementation of B. Goethals using C++/STL [10]. To keep coherence with this implementation, we use a similar data structure for the new parts of the algorithm. The itemsets and the transactions are stored in a prefix-tree [1, 6].

Minimal transversals computation. For the minimal transversals computation, we implemented the algorithm proposed in [9] using a prefix-tree, in order to handle relatively large hypergraph instances. Its incremental aspect is very interesting in our case, since the negative border is itself incremental. Note that improvements have been performed by exploiting the knowledge of previous dualizations. We do not give more details here.

4.2 Experimental results

We conducted experiments on a Pentium 4 3GHz processor, with 1GB of memory. The operating system was Redhat Linux 7.3 and we used gcc 2.96 for the compilation. We used four datasets available on the FIMI'03 repository.

We first evaluate the influence, on the performances of ABS, of the level from which the levelwise approach is stopped. Then, the impact of "almost frequent" large itemsets is studied for different threshold values of the error measure. Finally, we compare ABS with four maximal frequent itemsets mining algorithms: Apriori and Eclat [12] implemented by C. Borgelt [5], FPmax [13] based on FP-trees [15], and IBE [21].

In Figure 1, we forced the first dualization at different levels (from 3 to 8) on the connect dataset with a minsup of 80% and the pumsb* dataset with a minsup of 30%. The results confirm the necessity to fix this parameter dynamically, and thus justify an adaptive approach. Second, for all tested datasets, our function IsDualizationRelevant has dynamically determined the best level to begin the dualization.

Figure 1. Forcing the first dualization at level k for connect (top) and pumsb* (bottom)

The optimization based on the error measure is evaluated in Figure 2. On the pumsb dataset (top), this optimization appears to be interesting with a threshold value near 0.002. Nevertheless, on the connect dataset (bottom), no improvement is achieved. This comes from the fact that the proposed error measure is not strictly decreasing, and the equivalence classes induced by closed frequent itemsets are large. Our top-down levelwise approach is prone to fail on this kind of databases.

Figure 2. Execution times for pumsb (top) and connect (bottom) wrt different error measure thresholds

From Figures 3, 5 and 6, ABS is far from competing with the best known implementations, but tends to outperform IBE for most of our experimentations. Recall that IBE is the unique implementation based on the concept of dualization available from FIMI'03. We believe that this is due to the number of dualizations performed by IBE, which is in the size of the positive border.

Figure 3. Execution times for database Pumsb

From Figure 4, IBE exhibits better performances than ABS for low support thresholds (less than 20%). This is due to the fact that, while the size of the positive border remains small (less than 5000 elements), the size of the negative border exceeds 10^6 elements, where some elements appear to have a very large size. This seems to be the worst case for ABS.

Figure 4. Execution times for database Pumsb*

From Figure 5, ABS behaves like Apriori, as expected. Indeed, the positive border of retail is made up of "small" itemsets, and Apriori turns out to be the best implementation for this kind of datasets.

Figure 5. Execution times for database Retail

From Figure 6, ABS is not as efficient as the best known implementations (e.g. FPmax), but improves Apriori by a factor of ten and beats Eclat and IBE.

Figure 6. Execution times for database Connect

To sum up, two main reasons explain our mitigated results: 1) the cost of dualization remains high on very large hypergraph instances, and 2) candidate generation and support counting seem to be not efficient enough in our current implementation.

The main parameter influencing the performance of ABS turns out to be the negative border. If, for a given minsup, the negative border does not become too huge and its largest element remains "small" with respect to the largest maximal frequent itemset, ABS should have good performance.

5 Related works

Several algorithms exist for discovering maximal frequent itemsets in a transactional database (see FIMI'03). The goal is always to avoid an exponential search by characterizing as fast as possible the largest frequent itemsets without exploring their subsets.

MaxMiner [3] uses a levelwise approach to explore the candidate itemsets, using Rymon's enumeration system [20], in which itemsets are arranged in a non-redundant tree. But when a candidate X is counted over the database, the greatest candidate in the subtree of X is also counted; if it is frequent, then all the subtree can be pruned by anti-monotonicity of the "is frequent" property. Jumps done by MaxMiner depend on the ordering of items used to build the tree and are therefore quite different from the jumps proposed in this paper. The algorithms Mafia [7] and GenMax [12] use the same principle as MaxMiner with efficient optimizations, e.g. vertical bitmaps.

The Pincer-Search algorithm [17] uses a search strategy very close to ours. After a levelwise initialization, the principle is also to look at the largest not yet eliminated candidates. However, these large candidates are not characterized in a formal way.

In [14], the authors propose the Dualize and Advance algorithm. In their approach, the positive border under construction is always a subset of the positive border to be discovered. At each step, from some elements of the positive border already discovered, they generate the corresponding negative border. If one element of the negative border appears to be satisfied, they generate a specialization of it which belongs to the positive border, and they re-iterate the process until each element of the negative border is indeed not satisfied. An implementation of a variant of Dualize and Advance has been proposed in [21], with an irredundant dualization. Their code is available from the FIMI'03 website.

Some algorithms like Mafia [7] or DCI [19] can adapt themselves to mine frequent itemsets, with respect to the dataset density and some architectural characteristics (e.g. available memory). Even if these aspects improve performances, they only concern choices for data structures; the mentioned algorithms do not really adapt their strategy to explore the search space.

6 Conclusion and future works

In this paper, we have proposed an ongoing effort toward the discovery of maximal frequent itemsets. Our contribution takes its roots from the algorithm ZigZag devised for inclusion dependency discovery in databases. Even if these two data mining problems fit into the same theoretical framework [18], they widely differ in practice, which is not a surprise. Indeed, while ZigZag performed very well in our experiments (note that the dynamic parameters were quite different, e.g. the first dualization was always performed at the second level), ABS does not exhibit such a good behavior for maximal frequent itemsets mining. Many reasons explain this result, for instance the availability of public datasets allowing thorough experimentations, the intrinsic properties of each problem, and maybe the most important reason lies in the cost of a database access: in-memory resident data vs data stored in a DBMS.

Many improvements can be brought to our current implementation. Some are specific to our algorithm, like for instance minimal transversal computation on large hypergraph instances, or taking into account large equivalence classes induced by closed frequent itemsets during candidate generation from "almost frequent" itemsets. Some others belong to the "state of the art" of maximal frequent itemsets implementation: managing huge sets of sets, support counting... Complexity issues also need to be addressed.

To end up, we want to quote a personal note on the main objective of the FIMI workshop. We believe that frequent, closed and maximal itemsets mining are key data mining tasks, since algorithms devised to solve these tasks are likely to be used in other contexts under some conditions [18]. Roughly speaking, for every problem representable as sets
with an anti-monotone predicate, as for instance functional dependency inference, or simply anti-monotone predicates on itemsets other than "is frequent", the algorithms devised for FIMI should be useful to answer these tasks. Nevertheless, it seems rather optimistic to envision the application of many FIMI'03 [11] implementations to another data mining problem representable as sets. Indeed, even if the development of efficient data structures for managing huge sets of sets is definitely useful, loading the database in main memory using sophisticated data structures specially devised for the anti-monotone predicate to be mined turns out to give very efficient algorithms, but hardly serves other data mining tasks.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, International Conference on Very Large Data Bases (VLDB'94), Santiago de Chile, Chile, pages 487–499. Morgan Kaufmann, 1994.
[2] J. Bailey, T. Manoukian, and K. Ramamohanarao. A fast algorithm for computing hypergraph transversals and its application in mining emerging patterns. In International Conference on Data Mining (ICDM'03), Florida, USA, pages 485–488, 2003.
[3] R. J. Bayardo. Efficiently mining long patterns from databases. In L. M. Haas and A. Tiwary, editors, ACM SIGMOD Conference, Seattle, USA, pages 85–93, 1998.
[4] C. Berge. Graphs and Hypergraphs. North Holland, Amsterdam, 1973.
[5] C. Borgelt. Efficient implementations of apriori and eclat. In FIMI'03 Workshop on Frequent Itemset Mining Implementations, November 2003.
[6] C. Borgelt and R. Kruse. Induction of association rules: Apriori implementation. 2002.
[7] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In International Conference on Data Engineering (ICDE'01), Heidelberg, Germany, pages 443–452. IEEE CS, 2001.
[8] F. De Marchi and J.-M. Petit. Zigzag: a new algorithm for discovering large inclusion dependencies in relational databases. In International Conference on Data Mining (ICDM'03), Melbourne, Florida, USA, pages 27–34. IEEE Computer Society, 2003.
[9] J. Demetrovics and V. Thi. Some remarks on generating armstrong and inferring functional dependencies relation. Acta Cybernetica, 12(2):167–180, 1995.
[10] B. Goethals. Frequent itemset mining implementations repository, http://fimi.cs.helsinki.fi/, 2003.
[11] B. Goethals and M. Zaki, editors. Workshop on Frequent Itemset Mining Implementations. CEUR Workshop Proceedings, 2003.
[12] K. Gouda and M. J. Zaki. Efficiently mining of maximal frequent itemsets. In N. Cercone, T. Y. Lin, and X. Wu, editors, International Conference on Data Mining (ICDM'01), San Jose, USA. IEEE Computer Society, 2001.
[13] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In FIMI'03 Workshop on Frequent Itemset Mining Implementations, November 2003.
[14] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharma. Discovering all most specific sentences. ACM Transactions on Database Systems, 28(2):140–174, 2003.
[15] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD'00, Dallas, Texas, USA, 2000.
[16] D. J. Kavvadias and E. C. Stavropoulos. Evaluation of an algorithm for the transversal hypergraph problem. In J. S. Vitter and C. D. Zaroliagis, editors, Algorithm Engineering, International Workshop, WAE '99, London, UK, volume 1668, 1999.
[17] D.-I. Lin and Z. M. Kedem. Pincer search: A new algorithm for discovering the maximum frequent set. In H.-J. Schek, F. Saltor, I. Ramos, and G. Alonso, editors, Extending Database Technology (EDBT'98), Valencia, Spain, volume 1377 of Lecture Notes in Computer Science, pages 105–119. Springer, 1998.
[18] H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(1):241–258, 1997.
[19] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and resource-aware mining of frequent sets. In International Conference on Data Mining (ICDM'02), Maebashi City, Japan, pages 338–345, 2002.
[20] R. Rymon. Search through systematic set enumeration. In B. Nebel, C. Rich, and W. R. Swartout, editors, International Conference on Principles of Knowledge Representation and Reasoning (KR'92), Cambridge, USA, pages 539–550. Morgan Kaufmann, 1992.
[21] T. Uno and K. Satoh. Detailed description of an algorithm for enumeration of maximal frequent sets with irredundant dualization. In B. Goethals and M. Zaki, editors, ICDM 2003 Workshop on Frequent Itemset Mining Implementations (FIMI '03), Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings, 2003.
[22] M.-J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, USA, pages 283–286. AAAI Press, 1997.