<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finding minimal rare itemsets in a depth- rst manner</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laszlo Szathmary</string-name>
          <email>Szathmary.L@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petko Valtchev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amedeo Napoli</string-name>
          <email>napoli@loria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Godin</string-name>
          <email>godin.robertg@uqam.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. d'Informatique UQAM</institution>
          ,
          <addr-line>C.P. 8888</addr-line>
          ,
          <institution>Succ. Centre-Ville</institution>
          ,
          <addr-line>Montreal H3C 3P8</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LORIA UMR 7503</institution>
          ,
          <addr-line>B.P. 239, 54506 Vand uvre-les-Nancy Cedex</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Debrecen, Faculty of Informatics, Department of IT</institution>
          ,
          <addr-line>H-4010 Debrecen, Pf. 12</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Rare itemsets are an important sort of patterns that have a wide range of practical applications. Although mining rare patterns poses speci c algorithmic problems, it is yet insu ciently studied. In a previous work, we proposed a levelwise approach for rare itemset mining. Here, we examine the bene ts of depth- rst methods for that task as such methods are known to outperform the levelwise ones in many practical cases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Introduction
Pattern mining is a basic data mining task whose aim is to uncover the hidden
regularities in a set of data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As a simplifying hypothesis, the overwhelming
majority of pattern miners chose to look exclusively on item combinations that
are su ciently frequent, i.e., observed in a large enough proportion of the
transactions. Yet such a hypothesis fails to re ect the entire variety of situations in
data mining practice [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In some speci c situations, frequency may be the exact
opposite of pattern interestingness. The reason behind is that in these cases, the
most typical item combinations from the data correspond to widely-known and
well-understood phenomena. In contrast, less frequently occurring patterns may
point to unknown or poorly studied aspects of the underlying domain [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In a previous paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we proposed a bottom-up, levelwise approach that
traverses the frequent zone of the search space. In this paper we are looking for
a more e cient manner for traversing the frequent part of the Boolean lattice,
using a depth- rst method. Indeed, depth- rst frequent pattern miners have
been shown to outperform breadth- rst ones on a number of datasets.
      </p>
      <p>Consider a set of objects or transactions O = fo1; o2; : : : ; omg, a set of
attributes or items A = fa1; a2; : : : ; ang, and a relation R O A. A set of items
is called an itemset. Each transaction has a unique identi er (tid), and a set of
transactions is called a tidset. The tidset of all transactions sharing a given
itemset X is its image, denoted by t(X). The length of an itemset X is jXj, whereas
an itemset of length i is called an i-itemset. The (absolute) support of an itemset
X, denoted by supp(X), is the size of its image, i.e. supp(X) = jt(X)j.</p>
      <p>The lattice is separated into two segments or zones through a user-provided
\minimum support" threshold, denoted by min supp. Thus, given an itemset X,
if supp(X) min supp, then it is called frequent, otherwise it is called rare (or
infrequent ). In the lattice in Figure 1, the two zones corresponding to a support
threshold of 2 are separated by a solid line. The rare itemset family and the
corresponding lattice zone is the target structure of our study.</p>
    </sec>
    <sec id="sec-2">
      <title>De nition 1. X subsumes Z, i X</title>
      <p>
        Z and supp(X) = supp(Z) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>De nition 2. An itemset Z is generator if it has no proper subset with the
same support.</p>
      <p>
        Property 1. Given X A, if X is a generator, then 8Y X, Y is a generator,
whereas if X is not a generator, 8Z X, Z is not a generator [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Proposition 1. An itemset X is a generator i supp(X) 6= mini2X (supp(X n
fig)) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Each of the frequent and rare zones is delimited by two subsets, the
maximal elements and the minimal ones, respectively. The above intuitive ideas are
formalized in the notion of a border introduced by Mannila and Toivonen in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
According to their de nition, the maximal frequent itemsets constitute the
positive border of the frequent zone1 whereas the minimal rare itemsets form the
negative border of the same zone.
      </p>
      <p>De nition 3. An itemset is a maximal frequent itemset (MFI) if it is frequent
but all its proper supersets are rare.</p>
      <p>De nition 4. An itemset is a minimal rare itemset (mRI) if it is rare but all
its proper subsets are frequent.</p>
      <p>
        The levelwise search yields as a by-product all mRIs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Hence we prefer
a di erent optimization strategy that still yields mRIs while traversing only
a subset of the frequent zone of the Boolean lattice. It exploits the minimal
generator status of the mRIs. By Property 1, frequent generators (FGs) can
be traversed in a levelwise manner while yielding their negative border as a
by-product. It is enough to observe that mRIs are in fact generators:
Proposition 2. All minimal rare itemsets are generators [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
1 The frequent zone contains the set of frequent itemsets.
As pointed out by Mannila and Toivonen [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the easiest way to reach the negative
border of the frequent itemset zone, i.e., the mRIs, is to use a levelwise algorithm
such as Apriori. Indeed, albeit a frequent itemset miner, Apriori yields the mRIs
as a by-product.
      </p>
      <p>
        Apriori-Rare [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a slightly modi ed version of Apriori that retains the
mRIs. Thus, whenever an i-long candidate survives the frequent i 1 subset
test, but proves to be rare, it is kept as an mRI.
      </p>
      <p>
        MRG-Exp [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] produces the same output as Apriori-Rare but in a more e
cient way. Following Proposition 2, MRG-Exp avoids exploring all frequent
itemsets: instead, it looks after frequent generators only. In this case mRIs, which are
rare generators as well, can be ltered among the negative border of the frequent
generators. The output of MRG-Exp is identical to the output of Apriori-Rare,
i.e. both algorithms nd the set of mRIs.
3
      </p>
      <p>
        Finding Minimal Rare Itemsets in a Depth-First
Manner
Eclat [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was the rst FI-miner to combine the vertical encoding with a
depthrst traversal of a tree structure, called IT-tree, whose nodes are X t(X) pairs.
Eclat traverses the IT-tree in a depth- rst manner in a pre-order way, from
left-to-right [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (see Figure 2).
Talky-G [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a vertical FG-miner following a depth- rst traversal of the
ITtree and a right-to-left order on sibling nodes. Talky-G applies an
inclusioncompatible traversal: it goes down the IT-tree while listing sibling nodes from
right-to-left and not the other way round as in Eclat.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] explored that order for mining calling it reverse pre-order.
They observed that for any itemset X its subsets appear in the IT-tree in nodes
that lay either higher on the same branch as (X; t(X)) or on branches to the
right of it. Hence, depth- rst processing of the branches from right-to-left would
perfectly match set inclusion, i.e., all subsets of X are met before X itself. While
the algorithm in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] extracts the so-called non-derivable itemsets, Talky-G uses
this traversal to nd the set of frequent generators. See Figure 2 for a comparison
of Eclat and its \reversed" version.
3.2
      </p>
    </sec>
    <sec id="sec-3">
      <title>Walky-G</title>
      <p>Since Walky-G is an extension of Talky-G, we also present the latter algorithm
at the same time. Walky-G, in addition to Talky-G, retains rare itemsets and
checks them for minimality.</p>
      <p>Hash structure. In Walky-G a hash structure is used for storing the already
found frequent generators. This hash, called f gM ap, is a simple dictionary with
key/value pairs, where the key is an itemset (a frequent generator) and the value
is the itemset's support.2 The usefulness of this hash is twofold. First, it allows a
quick look-up of the proper subsets of an itemset with the same support, thus the
generator status of a frequent itemset can be tested easily (see Proposition 1).
Second, this hash is also used to look-up the proper subsets of a minimal rare
candidate. This way rare but non-minimal itemsets can be detected e ciently.
Pseudo code. Algorithm 1 provides the main block of Walky-G. First, the
IT-tree is initialized, which involves the creation of the root node, representing
2 In our implementation we used the java.util.HashMap class for f gM ap.
Algorithm 1 (main block of Walky-G):
1) // for quick look-up of (1) proper subsets with the same support
2) // and (2) one-size smaller subsets:
3) f gM ap ; // key: itemset (frequent generator); value: support
4)
5) root:itemset ; // root is an IT-node whose itemset is empty
6) // the empty set is included in every transaction:
7) root:tidset fall transaction IDsg
8) f gM ap:put(;; jOj) // the empty set is an FG with support 100%
9) loop over the vertical representation of the dataset (attr) f
10) if (min supp attr:supp &lt; jOj) f
11) // jOj is the total number of objects in the dataset
12) root:addChild(attr) // attr is frequent and generator
13) g
14) if (0 &lt; attr:supp &lt; min supp) f
15) saveMri(attr) // attr is a minimal rare itemset
16) g
17) g
18) loop over the children of root from right-to-left (child) f
19) saveFg(child) // the direct children of root are FGs
20) extend(child) // discover the subtree below child
21) g
the empty set (of 100% support, by construction). Walky-G then transforms
the layout of the dataset in vertical format, and inserts below the root node all
1-long frequent itemsets. Such a set is an FG whenever its support is less than
100%. Rare attributes (whose support is less than min supp) are minimal rare
itemsets since all their subsets (in this case, the empty set) are frequent. Rare
attributes with support 0 are not considered.</p>
      <p>The saveMri procedure processes the given minimal rare itemset by storing
it in a database, by printing it to the standard output, etc. At this point, the
dataset is no more needed since larger itemsets can be obtained as unions of
smaller ones while for the images intersection must be used.</p>
      <p>The addChild procedure inserts an IT-node under a node. The saveFg
procedure stores a given FG with its support value in the hash structure f gM ap.</p>
      <p>In the core processing, the extend procedure (see Algorithm 2) is called
recursively for each child of the root in a right-to-left order. At the end, the
ITtree contains all FGs. Rare itemsets are veri ed during the construction of the
IT-tree and minimal rare itemsets are retained. The extend procedure discovers
all FGs in the subtree of a node. First, new FGs are tentatively generated from
the right siblings of the current node. Then, certi ed FGs are added below the
current node and later on extended recursively in a right-to-left order.</p>
      <p>The getNextGenerator function (see Algorithm 3) takes two nodes and
returns a new FG, or \null" if no FG can be produced from the input nodes. In
addition, this method tests rare itemsets and retains the minimal ones. First,
a candidate node is created by taking the union of both itemsets and the
intersection of their respective images. The input nodes are thus the candidate's
Algorithm 2 (\extend" procedure):
Method: extend an IT-node recursively (discover FGs in its subtree)
Input: an IT-node (curr)
1) loop over the right siblings of curr from left-to-right (other) f
2) generator getNextGenerator(curr; other)
3) if (generator 6= null) then curr:addChild(generator)
4) g
5) loop over the children of curr from right-to-left (child) f
6) saveFg(child) // child is a frequent generator
7) extend(child) // discover the subtree below child
8) g
parents. Then, the candidate undergoes a frequency test (test 1). If the test fails
then the candidate is rare. In this case, the minimality of the rare itemset cand
is tested. If all its one-size smaller subsets are present in f gM ap then cand is
a minimal rare generator since all its subsets are FGs (see Property 1). From
Proposition 2 it follows that an mRG is an mRI too, thus cand is processed
by the saveMri procedure. If the frequency test was successful, the candidate
is compared to its parents (test 2): if its tidset is equivalent to a parent tidset,
then the candidate cannot be a generator. Even with both outcomes positive, an
itemset may still not be a generator as a subsumed subset may lay elsewhere in
the IT-tree. Due to the traversal strategy in Walky-G, all generator subsets of
the current candidate are already detected and the algorithm has stored them
in f gM ap (see the saveFg procedure). Thus, the ultimate test (test 3) checks
whether the candidate has a proper subset with the same support in f gM ap. A
positive outcome disquali es the candidate.</p>
      <p>This last test (test 3) is done in Algorithm 4. First, one-size smaller subsets
of cand are collected in a list. The two parents of cand can be excluded since
cand was already compared to them in test 2 in Algorithm 3. If the support
value of one of these subsets is equal to the support of cand, then cand cannot
be a generator. Note that when the one-size smaller subsets are looked up in
f gM ap, it can be possible that a subset is missing from the hash. It means that
the missing subset was tested before and turned out to subsume an FG, thus the
subset was not added to f gM ap. In this case cand has a non-FG subset, thus
cand cannot be a generator either (by Property 1).</p>
      <p>Candidates surviving the nal test in Algorithm 3 are declared FG and added
to the IT-tree. An unsuccessful candidate X is discarded which ultimately
prevents any itemset Y having X as a pre x to be generated as candidate and hence
substantially reduces the overall search space. When the algorithm stops, all
frequent generators (and only frequent generators) are inserted in the IT-tree and
in the f gM ap structure. Furthermore, upon the termination of the algorithm,
all minimal rare itemsets have been found. For a running example, see Figure 3.
Algorithm 3 (\getNextGenerator" function):
Method: create a new frequent generator and lter minimal rare itemsets
Input: two IT-nodes (curr and other)
Output: a frequent generator or null
We presented an approach for rare itemset mining from a dataset that splits
the problem into two tasks. Our new algorithm, Walky-G, limits the traversal
of the frequent zone to frequent generators only. Our approach breaks with the
dominant levelwise algorithmic schema since the traversal is achieved through a
depth- rst strategy.
Method: verify if cand subsumes an already found FG
Input: an IT-node (cand)
1) subsets fone-size smaller subsets of cand minus the two parentsg
2) loop over the elements of subsets (ss) f
3) if (ss is stored in f gM ap) f
4) stored support f gM ap:get(ss) // get the support of ss
5) if (stored support = cand:support) f
6) return true // case 1: cand subsumes an FG
7) g
8) g
9) else // if ss is not present in fgMap
10) f // case 2: cand has a non-FG subset ) cand is not an FG either
11) return true
12) g
13) g
14) return false // if we get here then cand is an FG</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikant</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verkamo</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          :
          <article-title>Fast discovery of association rules</article-title>
          . In:
          <article-title>Advances in knowledge discovery and data mining</article-title>
          .
          <source>American Association for Arti cial Intelligence</source>
          (
          <year>1996</year>
          )
          <volume>307</volume>
          {
          <fpage>328</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Weiss</surname>
          </string-name>
          , G.:
          <article-title>Mining with rarity: a unifying framework</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>6</volume>
          (
          <issue>1</issue>
          ) (
          <year>2004</year>
          )
          <volume>7</volume>
          {
          <fpage>19</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Szathmary</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valtchev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Towards Rare Itemset Mining</article-title>
          .
          <source>In: Proceedings of the 19th IEEE International Conference on Tools with Arti cial Intelligence (ICTAI '07)</source>
          . Volume
          <volume>1</volume>
          ., Patras, Greece (Oct
          <year>2007</year>
          )
          <volume>305</volume>
          {
          <fpage>312</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsiao</surname>
            ,
            <given-names>C.J.: CHARM</given-names>
          </string-name>
          :
          <article-title>An E cient Algorithm for Closed Itemset Mining</article-title>
          .
          <source>In: SIAM International Conference on Data Mining (SDM' 02)</source>
          .
          <source>(Apr</source>
          <year>2002</year>
          )
          <volume>33</volume>
          {
          <fpage>43</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kryszkiewicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Concise Representations of Association Rules</article-title>
          .
          <source>In: Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery</source>
          . (
          <year>2002</year>
          )
          <volume>92</volume>
          {
          <fpage>109</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bastide</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taouil</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasquier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lakhal</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Mining Frequent Patterns with Counting Inference</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>2</volume>
          (
          <issue>2</issue>
          ) (
          <year>2000</year>
          )
          <volume>66</volume>
          {
          <fpage>75</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
          </string-name>
          , H.:
          <article-title>Levelwise Search and Borders of Theories in Knowledge Discovery</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>1</volume>
          (
          <issue>3</issue>
          ) (
          <year>1997</year>
          )
          <volume>241</volume>
          {
          <fpage>258</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parthasarathy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogihara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>New Algorithms for Fast Discovery of Association Rules</article-title>
          .
          <source>In: Proceedings of the 3rd International Conference on Knowledge Discovery in Databases. (August</source>
          <year>1997</year>
          )
          <volume>283</volume>
          {
          <fpage>286</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Szathmary</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valtchev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Godin</surname>
          </string-name>
          , R.:
          <article-title>E cient Vertical Mining of Frequent Closures and Generators</article-title>
          .
          <source>In: Proc. of the 8th Intl. Symposium on Intelligent Data Analysis (IDA '09)</source>
          . Volume 5772 of LNCS.,
          <string-name>
            <surname>Lyon</surname>
          </string-name>
          , France, Springer (
          <year>2009</year>
          )
          <volume>393</volume>
          {
          <fpage>404</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Calders</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goethals</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Depth- rst non-derivable itemset mining</article-title>
          .
          <source>In: Proceedings of the SIAM International Conference on Data Mining (SDM '05)</source>
          , Newport Beach, USA. (
          <year>Apr 2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>