<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ABS: Adaptive Borders Search of frequent itemsets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frédéric Flouvat</string-name>
          <email>ouvat@isima.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabien De Marchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Marc Petit</string-name>
          <email>jmpetit@math.univ-bpclermont.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laboratoire LIMOS</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>UMR CNRS</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire LIRIS, FRE CNRS 2672 Université Claude</institution>
          <addr-line>Bernard Lyon 1 8, boulevard Niels Bohr, 69 622 Villeurbanne cedex</addr-line>
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université Blaise Pascal - Clermont-Ferrand II</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present an ongoing work to discover maximal frequent itemsets in a transactional database. We propose an algorithm called ABS for Adaptive Borders Search, which is in the spirit of algorithms based on the concept of dualization. From an abstract point of view, our contribution can be seen as an improvement of the basic APRIORI algorithm for mining maximal frequent itemsets. The key point is to decide dynamically at which iteration, if any, the dualization has to be made to avoid the enumeration of all subsets of large maximal itemsets. Once the first dualization has been done from the current negative border, APRIORI is no longer used and instead, another dualization is carried out from the positive border known so far. The process is repeated until no change occurs anymore in the positive border in construction. Experiments have been done on FIMI datasets from which tradeoffs on adaptive behavior have been proposed to guess the best iteration for the first dualization. Far from being the best implementation wrt FIMI'03 contributions, performance evaluations of ABS exhibit better performance than IBE, the only public implementation based on the concept of dualization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        In this paper, we present an ongoing work to discover
maximal frequent itemsets in a transactional database. We
propose to adapt an algorithm originally devised for mining
inclusion dependencies in databases [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This algorithm is
called ABS for Adaptive Borders Search, and is in the spirit
of algorithms based on the concept of dualization [
        <xref ref-type="bibr" rid="ref14 ref21">14, 21</xref>
        ].
      </p>
      <p>
        The basic idea of our proposition is to combine the
strength of both levelwise algorithm [
        <xref ref-type="bibr" rid="ref1 ref18">1, 18</xref>
        ] and Dualize and
Advance algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in such a way that:
• "small" maximal frequent itemsets are efficiently
generated with levelwise strategies.
• "large" maximal frequent itemsets may be found
efficiently by dualization.
      </p>
      <p>
        The dualization performed is quite similar to that
proposed in the Dualize and Advance algorithm. Nevertheless,
instead of starting from some subset of maximal frequent
itemsets as Dualize and Advance algorithm does, we use
infrequent itemsets to perform the dualization. As a
consequence, we obtain the so-called optimistic positive border of
maximal frequent itemsets. The set of such candidates
corresponds exactly to k-uniform hypergraph clique proposed
in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. As a consequence, our proposition contributes to
clarify some related contributions [
        <xref ref-type="bibr" rid="ref12 ref17 ref22 ref3 ref7">22, 17, 3, 12, 7</xref>
        ]) since
it gives an exact characterization of the optimistic positive
border of maximal frequent itemsets from some subset of
infrequent itemsets.
      </p>
      <p>From an abstract point of view, our contribution can be
seen as an improvement of the basic APRIORI algorithm
for mining maximal frequent itemsets. The key point is to
decide dynamically at which iteration, if any, the
dualization has to be made to avoid the enumeration of all subsets
of large maximal itemsets. Once the first dualization has
been done from the current negative border available at that
iteration, APRIORI is no longer used and instead, another
dualization is carried out from the positive border known so
far. The process is repeated until no change occurs anymore
in the positive border in construction.</p>
      <p>
        Experiments have been done on FIMI datasets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
adaptive behavior of our algorithm has been tuned from
results gathered from these experiments. For the tested
dataset, we were able to guess dynamically the best iteration
for the first dualization, a key parameter of our algorithm.
      </p>
      <p>
        Far from being the best implementation wrt FIMI’03
contributions [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], performance evaluations of ABS exhibit
better performance than IBE [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], the only public
implementation based on the concept of dualization.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>
        Let R be a set of symbols called items; a line is a subset
of R, and a binary relation r over R is a multiset of lines.
We suppose the reader is familiar with the notions of
itemsets, support, and with the main aspects of frequent
itemsets mining problem in a binary relation, given a threshold
minsup (see e.g. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for details). We recall the notion of
borders of a set of itemsets [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Given F a set of
itemsets over R, the positive border of F denoted by Bd +(F ) is
defined by Bd+(F ) = max⊆{X ∈ F }. The negative
border of F is defined by Bd−(F ) = min⊆{Y ⊆ R | ∀X ∈
F, Y ⊆ X }. If F I is the set of all itemsets frequent in r,
then Bd+(F I) is called the set of maximal frequent itemsets
in r.
      </p>
      <p>
        We will use the concepts of hypergraph and minimal
transversal of a hypergraph, whose definition is pointed out
here (see for example [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for more details). Given V a
finite set of elements. A subset E of V defines a
hypergraph H = (V, E), where elements of V are called
vertices of H and elements of E edges of H. A transversal T
of H = (V, E) is a subset of V that intersect all the
elements of E. T is minimal if no other transversal of H are
included in T. The set of all minimal transversals of H is
noted T r(H).
      </p>
      <p>
        The relationship between the notion of borders and
minimal transversals of hypergraph has been exhibited in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
Indeed, any set of itemsets can be seen as a hypergraph; if
F I is the set of frequent itemsets in a binary relation r, we
have: T r(F I) = Bd−(F I), where F I = {R − X | X ∈
F I}.
3
3.1
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method description</title>
    </sec>
    <sec id="sec-4">
      <title>Starting with a levelwise algorithm</title>
      <p>
        The algorithm Apriori [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was initially devoted to
frequent itemset mining; Nevertheless, it has been proved to
be still competitive for maximal frequent itemsets mining
in many cases [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], when the size of elements to discover
remain small.
      </p>
      <p>Our goal is to exploit the efficiency of Apriori, but to
automatically detect when it will fall into troubles and stop
its execution. Then we propose to exploit the knowledge
mined so far to initialize a different search, based on the
concept of dualization between positive and negative
borders; each border is updated and used to compute the
corresponding dual border, until a fix point is reached.
3.2</p>
    </sec>
    <sec id="sec-5">
      <title>From negative to positive border</title>
      <p>In the sequel, let r be a binary database over a set of
items R, minsup a minimum support, and F I the set of
frequent itemsets in r. After the levelwise part, our method
is still iterative; at each iteration i, new elements of the
positive and negative borders are expected to be discovered. We
denote by Bdi+ (resp. Bdi−) the subset of Bd+(F I) (resp.
Bd−(F I)) discovered until the ith iteration. In other words,
∀i &lt; j, Bdi+ ⊆ Bdj+ and Bdi− ⊆ Bdj−. Roughly speaking,
candidates for Bdi+ are obtained from elements of Bdi−, and
candidates for Bdi−+1 are obtained from elements of Bdi+.</p>
      <p>
        The following definitions and results have been proposed
in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for inclusion dependency discovery problem in
relational databases. We recall them in the context of maximal
frequent itemsets mining, only the proofs are omitted.
      </p>
      <p>We first define the notion of Optimistic positive border.
Definition 1(Optimistic positive border) Given a set F of
itemsets, the optimistic positive border of F is: F opt(F ) =
max⊆{X ⊆ R | ∀Y ∈ F, Y ⊆ X }.</p>
      <p>The next theorem gives a constructive characterization
of F opt(F ).</p>
      <p>
        Theorem 1[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] F opt(F ) = T r(F )
      </p>
      <p>Therefore, the idea is to compute the optimistic positive
border for Bdi− to obtain exactly the largest itemsets which
do not contain any infrequent itemset discovered so far.</p>
      <sec id="sec-5-1">
        <title>Proposition 1 Let X ∈</title>
        <p>minsup, X ∈ Bd+(F I).</p>
        <p>F opt(Bdi−).</p>
        <p>If sup(X ) ≥
Proof Since X is maximal in the definition of F opt(Bdi−),
each of its superset contains at least one element of Bdi−,
and is infrequent by anti-monotonicity.</p>
        <p>Then, Bdi+ is exactly made up of all the frequent itemsets
in F opt(Bdi−).
3.3</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>From positive to negative border</title>
      <p>In a dual way, the set Bdi+ is then used to compute its
negative border Bd−(Bdi+), to finally update the negative
border in construction and obtain Bd i−+1.</p>
      <p>The next theorem gives a constructive characterization
of Bd−(F ), for any set F of frequent itemsets.</p>
      <p>
        Theorem 2[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] Bd−(F ) = T r(F )
      </p>
      <p>Bd−(Bdi+).</p>
      <p>If sup(X ) &lt;</p>
      <sec id="sec-6-1">
        <title>Proposition 2 Let X ∈</title>
        <p>minsup, X ∈ Bd−(F I).</p>
        <p>Proof Let X be an element of Bd−(Bdi+). By the definition
of the negative cover of a set, each subset of X is included
in an element of Bd+(F I) and then is frequent.</p>
        <p>Then, Bdi−+1 is exactly made up of all the infrequent
itemsets in Bd−(Bdi+).
3.4</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>The Algorithm ABS</title>
      <p>Algorithm 1 computes the positive and negative borders
of frequent itemsets in a given binary database. Within
the framework of levelwise algorithms, ABS decides at
each level whether or not the levelwise approach has to
be stopped. In that case, the levelwise approach is halted,
and the two borders are incrementally updated as
described previously. The functions GenP osBorder and
GenN egBorder compute respectively the optimistic
positive and negative borders, using characterizations in
theorems 1 and 2. The algorithm terminates when all elements
of the optimistic positive border currently computed are
frequent. It is worth noting that no dualization may occur at all:
in this case, ABS is reduced to AP RIORI. The
proposition 3 ensures the correctness of ABS.</p>
      <p>The behavior of the function IsDualizationRelevant
is described in section 3.5.</p>
      <p>Proposition 3 The algorithm ABS returns Bd+(F I) and
Bd−(F I).</p>
      <p>Proof If the test performed by IsDualizationRelevant()
is never true, the demonstration is obvious.</p>
      <p>If not, in line 15, from propositions 1 and 2, we have Bd i+ ⊆
Bd+(F I) and Bdi−−1 ⊆ Bd−(F I)
Moreover, the termination condition ensures that Bdi+ =
GenP osBorder(Bdi−−1); all elements in Bdi+ are frequent
and all elements in Bdi−−1 are infrequent. Suppose that
∃X ∈ Bd−(F I) | X ∈ Bdi−−1. Then:
• if ∃Y ∈ Bdi−−1 | Y ⊂ X , since Y is infrequent, X ∈</p>
      <p>Bd−(F I) and there is a contradiction
• if ∃Y ∈ Bdi−−1 | Y ⊆ X , then from the definition of
the optimistic positive border ∃Z ∈ Bdi+ | X ⊆ Z,
which contradict the fact that X is infrequent.</p>
      <p>Thus Bdi−−1 = Bd−(F I). An identical reasoning leads to
Bdi+ = Bd+(F I).</p>
      <p>The main adaptive aspect of ABS is conveyed by the
function IsDualizationRelevant, line 8 of algorithm 1.
As mentioned, its goal is to estimate if it is interesting to
dualize the current negative border to the optimistic positive
border.</p>
      <p>We have identified four parameters specific to a given
iteration of the levelwise algorithm, which can be obtained
dynamically without any overhead:
• The current level i. No jump is allowed until a given
integer threshold; we set the threshold equal to 4, since
Apriori is very efficient in practice to explore the
levels 1 to 4. In our experiments, dualizing before this
level incurs no improvement.
• |Bdi−|, the size of the current negative border. A
simple remark can be made here: if this parameter is very
large (more than 100000) the minimal transversals
computation become prohibitive. We are not aware
of existing implementations of minimal transversals
computation able to handle such input hypergraphs 1.
Moreover, such cases are likely to correspond to best
scenario for Apriori.
• |Fi|, the number of frequent i-itemsets and |Bd i−| have
to be compared. Indeed, a small value of |Bd i−| wrt
|Fi| is likely to give a successful dualization.
• |Fi| and |Ci|, the number of candidates in level i, can
also be compared. If |Fi|/|Ci| is close to 1, we can
suspect to be in a "dense" part of the search space, and
thus the levelwise search should be stopped.
3.6</p>
    </sec>
    <sec id="sec-8">
      <title>Practical aspects</title>
      <p>3.6.1</p>
      <p>
        Candidate generation from the current positive
border
From [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], candidate generation of a levelwise algorithm
for a problem representable as sets can be formulated using
dualization: At the ith iteration, we have
      </p>
      <p>
        Ci+1 = T r(∪j≤iFj ) − ∪j≤iCj
It is shown in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] that candidate itemsets of Ci+1 are
exactly of size i + 1, which allows to improve candidate
generation.
      </p>
      <p>In the setting of this paper, we can see Ci+1 as the set
Bdi−+1 − Bdi−, and thus we get:</p>
      <p>Ci+1 = T r(Bdi+) − ∪j≤iCj</p>
      <p>Here, the major difference with a pure levelwise
approach is that Bdi+ may contain some elements of size
greater than i + 1.</p>
      <p>One may question about the size of the largest elements
of Ci+1: does there exist elements of size strictly greater
than i + 1 ? The answer is yes as shown in the following
non trivial example.</p>
      <p>Example 1
Let r be the binary relation over a schema R =
{A, B, C, D, E, F, G, H, I} represented in Table 1. For a
minsup equals to 1, the borders of frequent itemsets in r
are Bd− = {AE, BF, CG, DH, ABCDI} and Bd+ =
{ABCHI, ABDGI, ABGHI, ACDF I, ACF HI,
ADF GI, AF GHI, BCDEI, BCEHI, BDEGI,
BEGHI, CDEF I, CEF HI, DEF GI, EF GHI,
ABCD}.</p>
      <p>
        After a levelwise pass until level two, the four NFI
of size two have been discovered, i.e. Bd2− =
{AE, BF, CG, DH}. Suppose the algorithm decides here
to stop the pure levelwise search. Then, these sets are used
1Experiments conducted in [
        <xref ref-type="bibr" rid="ref16 ref2">16, 2</xref>
        ] only consider hypergraphs with not
more than 32000 edges.
to compute the optimistic positive border from level 2. It
is made up of 16 itemsets of size 5, among which the only
non frequent itemset is ABCDI. Thus, at this time, Bd2+ =
{ABCHI, ABDGI, ABGHI, ACDF I, ACF HI,
ADF GI, AF GHI, BCDEI, BCEHI, BDEGI,
BEGHI, CDEF I, CEF HI, DEF GI, EF GHI}. We
obtain Bd−(Bd+) = {ABCD} of size 4, being understood
2
that no elements of size 3 does exist.
      </p>
      <p>In our first implementation, computing the set
Bd−(Bdi+) using minimal transversals had a quite
prohibitive cost on large hypergraph instances. Therefore,
we made the choice to restrict Bd−(Bdi+) to its (i +
1)itemsets for efficiency reasons. This choice has no effect
on the correctness of the algorithm, since the termination
condition is always the same2.
Let us consider the case of a candidate itemset obtained
after a dualization from the current negative border. Let X
be this candidate. Two main cases do exist: either X is
frequent, or X is infrequent. In that case, we propose to
estimate a degree of error in order to "qualify the jump".</p>
      <p>
        Given a new user-defined threshold δ, and a
minimal support minsup, an error measure, noted
2We suspect the algorithm P incerSearch [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to be not complete.
Indeed, the search strategy of P incerSearch is very close to our
proposition: if we only consider (i + 1)-itemsets in Bd−(Bdi+), they
correspond exactly to the candidate set Ci+1 of P incerSearch. Since
P incerSearch stops as soon as Ci+1 = ∅, some elements could be
forgotten. From the example 1, after the level 2, C3 is empty, and therefore
the maximal set ABCD seems to be never generated by P incerSearch.
error(X, minsup), can be defined as the ratio between the
minsup minsup and the support of the infrequent itemset
X , i.e. error(X, minsup) = 1 − sumppinorstu(pX) .
      </p>
      <p>Two sub-cases are worth considering:
• either error(X, minsup) ≤ δ : the "jump" was not
successful but solutions should exist among the nearest
subsets of X .
• or error(X, minsup) &gt; δ : In that case, the jump
was over-optimistic and probably, no solution does
exist among the nearest generalizations of X .</p>
      <p>Note that this error measure is decreasing, i.e. X ⊂
Y ⇒ error(X, minsup) ≤ error(Y, minsup)</p>
      <p>In our current implementation, these almost frequent
large itemsets are first considered as frequent to enable more
pruning in subsequent passes. Afterward, they are
considered at the very end of our algorithm. A pure top-down
levelwise approach has been implemented to find out their
subsets which can be maximal frequent itemsets.
4 Implementation details and experimental
results
4.1</p>
    </sec>
    <sec id="sec-9">
      <title>Implementation details</title>
      <p>
        An implementation of the algorithm has been performed
in C++/STL. Two existing implementations available from
the FIMI repository website [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] were borrowed: the
Apriori code of C. Borgelt [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and the prefix-tree
implementation of B. Goethals using C++/STL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        To keep coherence with this implementation, we use a
similar data structure for the new parts of the algorithm.
The itemsets and the transactions are stored in a prefix-tree
[
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ].
      </p>
      <p>
        Minimal Transversals computation For the minimal
transversals computation, we implemented the algorithm
proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] using a prefix-tree in order to handle
relatively large hypergraph instances. Its incremental aspect
is very interesting in our case, since the negative border is
itself incremental. Note that improvements have been
performed by exploiting the knowledge of previous
dualizations. We do not give more details here.
4.2
      </p>
    </sec>
    <sec id="sec-10">
      <title>Experimental results</title>
      <p>We conducted experiments on a pentium 4.3GHz
Processor, with 1Go of memory. The operating system was
Redhat Linux 7.3 and we used gcc 2.96 for the compilation.
We used four datasets available on the FIMI’03 repository.</p>
      <p>
        We first evaluate the influence of the level from which
the levelwise approach is stopped on the performances of
ABS. Then, the impact of "almost frequent" large itemsets
is studied for different threshold values for the error
measure. Finally, we compare ABS with four maximal frequent
itemsets mining algorithms: Apriori and Eclat [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
implemented by C.Borgelt [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], F pmax [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] based on F P −trees
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and IBE [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>Connect Minsup 60%
180
160
140
)c 120
(se
ieTm 100
ltoT 80
a
60
40
In figure 1, we forced the first dualization for different
levels (from 3 to 8), on the connect dataset with a minsup
of 80 % and the pumsb* dataset with a minsup of 30%.
The results confirm the necessity to fix dynamically this
parameter, and then justify an adaptive approach. Second, for
all tested datasets, our function IsDualizationRelevant
has dynamically determined the best level to begin
dualization.</p>
      <p>The optimization based on the error measure is
evaluated on figure 2. From pumsb dataset (on the top), this
optimization appears to be interesting with a threshold value
near 0.002. Nevertheless, on the connect dataset (bottom)
no improvements is achieved. This comes from the fact that
the proposed error measure is not strictly decreasing; and
the equivalence classes induced by closed frequent itemsets
are large. Our top down levelwise approach is prone to fail
on this kind of databases .
0.195</p>
      <p>From figures 3, 5 and 6, ABS is far to compete with
best known implementations but tends to outperform IBE
for most of our experimentations. Recall that IBE is the
unique implementation based on the concept of dualization
available from FIMI’03. We believe that this is due to the
number of dualization performed by IBE, which is in the
size of the positive border.</p>
      <p>From figure 4, IBE exhibits better performances than
ABS for low support thresholds (less than 20%). This is
due to the fact that while the size of the positive border
remains small (less than 5000 elements) the size of the
negative border exceeds 106 elements, where some elements
appear to have a very large size. This seems to be the worst
case for ABS.</p>
      <p>From figure 5, ABS behaves like Apriori as expected.
Indeed, the positive border of retail is made up of "small"
itemsets, and Apriori turns out to be the best
implementation for this kind of datasets.
Minsup 50%
Minsup 60%
Minsup 70%</p>
      <p>Apriori
Eclat
FPmax</p>
      <p>IBE
ABS
1000
100
)
sce
(
ieTm 10
lt
a
o
T
1
From figure 6, ABS is not as efficient as best known
implementations (e.g. f pmax), but improves Apriori by a
factor of ten and beats Eclat and IBE.
80
60 40
Minimum Support (%)
20
0</p>
      <p>To sum up, two main reasons explain our mitigate
results: 1) the cost of dualization remains high on very large
hypergraph instances and 2) candidate generation and
support counting seems to be not enough efficient in our current
implementation.</p>
      <p>The main parameter influencing performance of ABS
turns out to be around the negative border. If for a given
minsup, the negative border does not become too huge
and its largest element remains "small" with respect to the
largest maximal frequent itemset, ABS should have good
performance.
5</p>
    </sec>
    <sec id="sec-11">
      <title>Related works</title>
      <p>
        Several algorithms exist for discovering maximal
frequent itemsets mining in a transactional database (see
FIMI’03). The goal is always to avoid an exponential search
by characterizing as fast as possible largest frequent
itemsets without exploring their subsets. M axM iner [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] uses a
levelwise approach to explore the candidate itemsets, using
the Rymon’s enumeration system [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] - in which itemsets
are arranged in a non redundant tree. But when a
candidate X is counted over the database, the greatest candidate
in the subtree of X is also counted; if it is frequent, then
all the subtree can be pruned by anti-monotony of the "is
frequent" property. Jumps done by M axM iner depend on
the ordering of items used to build the tree and are therefore
quiet different from jumps proposed in this paper. The
algorithms M af ia [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and GenM ax [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] use the same
principle as M axM iner with efficient optimizations, e.g.
vertical bitmaps.
      </p>
      <p>
        The P incer − Search Algorithm [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] uses a search
strategy very close to ours. After a levelwise initialization,
the principle is also to look at the largest not yet eliminated
candidates. However, these large candidates are not
characterized in a formal way.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors propose the Dualize and Advance
algorithm. In their approach, the positive border in
construction is always a subset of the positive border to be
discovered. At each step, from some elements of the positive
border already discovered, they generate the
corresponding negative border. If one element of the negative border
appears to be satisfied, they generate a specialization of it
which belongs to the positive border and they re-iterate the
process until each element of the negative border is indeed
not satisfied. An implementation of a variant of Dualize and
Advance has been proposed in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] with an irredundant
dualization. Their code is available from the FIMI’03 website.
      </p>
      <p>
        Some algorithms like M af ia [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or DCI [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] can adapt
themselves to mine frequent itemsets, with respect to the
dataset density and some architectural characteristics (e.g.
available memory). Even if these aspects improve
performances, it only concerns choices for data structures; the
mentioned algorithms do not really adapt their strategy to
explore the search space.
6
      </p>
    </sec>
    <sec id="sec-12">
      <title>Conclusion and future works</title>
      <p>
        In this paper, we have proposed an ongoing effort
toward the discovery of maximal frequent itemsets. Our
contribution takes its roots from the algorithm ZigZag devised
for inclusion dependency discovery in databases. Even if
this two data mining problems fit into the same theoretical
framework [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], they widely differ in practice which is not
a surprise. Indeed, while ZigZag performed very well in
our experiments3, ABS does not exhibit such a good
behavior for maximal frequent itemsets mining. Many reasons
explain this result, for instance the availability of public
datasets allowing thorough experimentations, the intrinsic
properties of each problem, and may be the more
important reason lies in the cost of a database access, in-memory
resident data vs data stored into a DBMS.
      </p>
      <p>Many improvements can be brought to our current
implementation. Some are specific to our algorithm like for
instance minimal transversal computation on large
hypergraph instances or taking into account large equivalence
classes induced by closed frequent itemsets during
candidate generation from "almost frequent" itemsets. Some
others belong to "the state of the art" of maximal frequent
itemsets implementation : managing huge set of set, support
counting... Complexity issues need also to be addressed.</p>
      <p>
        To end up, we want to quote a personal note on the main
objective of the FIMI workshop. We believe that frequent,
closed and maximal itemsets mining are key data mining
tasks since algorithms devised to solve these tasks are likely
to be used in other contexts under some conditions [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
Roughly speaking, for every problem representable as sets
3Note that dynamic parameters were quiet different, e.g. the first
dualization was always performed at the second level.
with an anti-monotone predicate as for instance with
functional dependency inference or simply anti-monotone
predicates on itemsets other than "is frequent", the algorithms
devised for FIMI should be useful to answer these tasks.
Nevertheless, it seems rather optimistic to envision the
application of many FIMI’03 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] implementations to another
data mining problem representable as sets. Indeed, even if
the development of efficient data structures for managing
huge sets of set is definitely useful, loading the database in
main memory using sophisticated data structure specially
devised for the anti-monotone predicate to be mined turns
out to give very efficient algorithms but deserve other data
mining tasks.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Fast algorithms for mining association rules in large databases</article-title>
          .
          <source>In J. B</source>
          .
          <string-name>
            <surname>Bocca</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Jarke</surname>
          </string-name>
          , and C. Zaniolo, editors,
          <source>International Conference on Very Large Data Bases (VLDB'94)</source>
          , Santiago de Chile, Chile, pages
          <fpage>487</fpage>
          -
          <lpage>499</lpage>
          . Morgan Kaufmann,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Manoukian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramamohanarao</surname>
          </string-name>
          .
          <article-title>A fast algorithm for computing hypergraph transversals and its application in mining emerging patterns</article-title>
          .
          <source>In International Conference on Data Mining (ICDM'03)</source>
          , Floride, USA, pages
          <fpage>485</fpage>
          -
          <lpage>488</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Bayardo</surname>
          </string-name>
          .
          <article-title>Efficiently mining long patterns from databases</article-title>
          . In L. M.
          <article-title>Haas and A</article-title>
          . Tiwary, editors,
          <source>ACM SIGMOD Conference</source>
          , Seattle, USA, pages
          <fpage>85</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Berge</surname>
          </string-name>
          . Graphs and Hypergraphs. North Holland, Amsterdam,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Borgelt</surname>
          </string-name>
          .
          <article-title>Efficient implementations of apriori and eclat</article-title>
          .
          <source>In FIMI'03 Workshop on Frequent Itemset Mining Implementations</source>
          ,
          <year>November 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Borgelt</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kruse</surname>
          </string-name>
          .
          <article-title>Induction of association rules : Apriori implementation</article-title>
          .
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Burdick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Calimlim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          .
          <article-title>Mafia: A maximal frequent itemset algorithm for transactional databases</article-title>
          .
          <source>In International Conference on Data Engineering (ICDE'01)</source>
          , Heidelberg, Germany, pages
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          . IEEE CS,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>De Marchi and J.-M. Petit</surname>
          </string-name>
          .
          <article-title>Zigzag : a new algorithm for discovering large inclusion dependencies in relational databases</article-title>
          .
          <source>In International Conference on Data Mining (ICDM'03)</source>
          , Melbourne, Florida, USA, pages
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          . IEEE Computer Society,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demetrovics</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Thi</surname>
          </string-name>
          .
          <article-title>Some remarks on generating armstrong and inferring functional dependencies relation</article-title>
          .
          <source>Acta Cybernetica</source>
          ,
          <volume>12</volume>
          (
          <issue>2</issue>
          ):
          <fpage>167</fpage>
          -
          <lpage>180</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          .
          <article-title>Frequent itemset mining implementations repository</article-title>
          , http://fimi.cs.helsinki.fi/,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          and M. Zaki, editors.
          <source>Workshop on Frequent Itemset Mining Implementations. CEUR Workshop Proceedings</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gouda</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          .
          <article-title>Efficiently mining of maximal frequent itemsets</article-title>
          . In N. Cercone,
          <string-name>
            <given-names>T. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <surname>X</surname>
          </string-name>
          . Wu, editors,
          <source>International Conference on Data Mining (ICDM'01)</source>
          , San Jose, USA. IEEE Computer Society,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Efficiently using prefix-trees in mining frequent itemsets</article-title>
          .
          <source>In FIMI'03 Workshop on Frequent Itemset Mining Implementations</source>
          ,
          <year>November 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunopulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khardon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saluja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          .
          <article-title>Discovering all most specific sentences</article-title>
          .
          <source>ACM Transaction on Database System</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ):
          <fpage>140</fpage>
          -
          <lpage>174</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>J</surname>
          </string-name>
          . Pei, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Mining frequent patterns without candidate generation</article-title>
          .
          <source>In ACM SIGMOD'00</source>
          , Dallas, Texas, USA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Kavvadias</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Stavropoulos</surname>
          </string-name>
          .
          <article-title>Evaluation of an algorithm for the transversal hypergraph problem</article-title>
          . In J. S. Vitter and C. D. Zaroliagis, editors, Algorithm Engineering, International Workshop, WAE '
          <fpage>99</fpage>
          , London, UK, volume
          <volume>1668</volume>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.-I.</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z. M.</given-names>
            <surname>Kedem</surname>
          </string-name>
          .
          <article-title>Pincer search: A new algorithm for discovering the maximum frequent set</article-title>
          . In H.
          <article-title>-</article-title>
          <string-name>
            <surname>J. Schek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Saltor</surname>
            ,
            <given-names>I. Ramos</given-names>
          </string-name>
          , and G. Alonso, editors,
          <source>Extending Database Technology (EDBT'98)</source>
          , Valencia, Spain, volume
          <volume>1377</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>105</fpage>
          -
          <lpage>119</lpage>
          . Springer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          .
          <article-title>Levelwise Search and Borders of Theories in Knowledge Discovery</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>241</fpage>
          -
          <lpage>258</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Palmerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Adaptive and resource-aware mining of frequent sets</article-title>
          .
          <source>In International Conference on Data Mining (ICDM'02)</source>
          , Maebashi City, Japan, pages
          <fpage>338</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rymon</surname>
          </string-name>
          .
          <article-title>Search through systematic set enumeration</article-title>
          . In B. Nebel, C. Rich, and W. R. Swartout, editors,
          <source>International Conference on Principles of Knowledge Representation and Reasoning (KR'92)</source>
          , Cambridge, USA, pages
          <fpage>539</fpage>
          -
          <lpage>550</lpage>
          . Morgan Kaufmann,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Uno</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Satoh</surname>
          </string-name>
          .
          <article-title>Detailed description of an algorithm for enumeration of maximal frequent sets with irredundant dualization</article-title>
          . In B. Goethals and M. Zaki, editors,
          <source>ICDM 2003 Workshop on Frequent Itemset Mining Implementations (FIMI '03)</source>
          , Melbourne, Florida, USA, volume
          <volume>90</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>M.-J. Zaki</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Parthasarathy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ogihara</surname>
            , and
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>New algorithms for fast discovery of association rules</article-title>
          .
          <source>In International Conference on Knowledge Discovery and Data Mining (KDD-97)</source>
          , Newport Beach, California, USA, pages
          <fpage>283</fpage>
          -
          <lpage>286</lpage>
          . AAAI Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>