<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Support Estimation in Frequent Itemset Mining by Locality Sensitive Hashing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Annika Pick</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>s Horv</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Wrobel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dep. of Computer Science, University of Bonn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer Center for Machine Learning</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The main computational eort in generating all frequent itemsets in a transactional database is in the step of deciding whether an itemset is frequent, or not. We present a method for estimating itemset supports with two-sided error. In a preprocessing step our algorithm rst partitions the database into groups of similar transactions by using locality sensitive hashing and calculates a summary for each of these groups. The support of a query itemset is then estimated by means of these summaries. Our preliminary empirical results indicate that the proposed method results in a speed-up of up to a factor of 50 on large datasets. The F-measure of the output patterns varies between 0.83 and 0.99.</p>
      </abstract>
      <kwd-group>
        <kwd>frequent itemset mining</kwd>
        <kwd>locality sensitive hashing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Frequent itemset mining, the rst step of listing all association rules (see, e.g., [
        <xref ref-type="bibr" rid="ref2 ref9">2,
9</xref>
        ]) is one of the historical core problems of data mining and knowledge discovery.
The most time-consuming step of all frequent itemset mining algorithms is to
decide the frequency of itemsets, i.e., whether an itemset is supported by at least
a certain number of transactions in the input database, or not. In the case that
the mining algorithm has access to the transaction database only via queries
of the form Is itemset X frequent? , all deterministic algorithms solving the
frequent itemset mining problem correctly are required to check all itemsets in
the border of frequent itemsets for frequency [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Checking if an itemset is frequent is often treated as a query to an oracle
(or black box), focusing only on the answer rather than on its computational
aspects. In contrast, we concentrate on the algorithmic issues of deciding whether
an itemset is frequent, or not. In particular, in this work-in-progress paper we
propose an algorithm that decides itemset frequency in a runtime that is
guaranteed to be faster by at least a constant factor than the brute-force linear time
algorithm solving this problem. Although our algorithm remains linear in the
combined size of the input database and the query itemset, the speed-up (up to
a factor of 50) obtained on large benchmark datasets is a remarkable
improvement.</p>
      <p>
        To obtain this improvement, we give up correctness and only require the
(binary valued) answer returned to be correct with high probability. We achieve
this goal by utilizing the advantageous algorithmic properties of locality sensitive
hashing (LSH) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Its main feature is that for a given metric space, elements
with a small distance have a high probability of being hashed to the same bucket,
while elements with a large distance have a low probability of collision. In the
algorithm proposed in this work, transactions are grouped by using LSH based
on the Hamming distance; we utilize the basic fact that the characteristic
vectors of the subsets of a universe of cardinality d can be regarded as the vertices
of the d-dimensional Hamming cube. Using the standard probability
amplication techniques for LSH [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], in a preprocessing step we devise a data structure
partitioning the input transaction database into blocks. With high probability,
transactions with a small Hamming distance will be hashed to the same block,
while those with a large Hamming distance to dierent ones. We remove all
blocks of size less than c, where c is some user specied parameter, and calculate
a weighted summary itemset for each remaining block. In this way, we compress
the transaction database by a factor of at least c. For a query itemset X we then
decide its frequency by matching X with all such summaries. Our algorithm
estimates itemset support with two-sided error in this way. The error probability
can, however, be controlled by standard amplication techniques [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Our preliminary empirical results indicate that this technique outperforms
the fastest exact method in runtime on four out of six benchmark datasets,
reaching a speed-up factor of up to 50. It is slower only on two (out of six)
datasets; the slow-down is, however, marginal. In the long version we
experimentally demonstrate that our method is especially suitable for datasets that
are large and have a relatively small Hamming distance on average; these
properties can quickly be checked (the latter e.g. by sampling) before running the
algorithm. Regarding the estimation quality of our algorithm, we obtained an
F-score of at least 80% on all datasets. Our algorithm is governed by dierent
parameters; we also propose some simple heuristics for the choice of their values.</p>
      <p>
        Finally we note that LSH has already been applied to frequent pattern
discovery, however, in entirely dierent contexts. In particular, it was used in reducing
the computational eort in the candidate generation step of the Apriori
algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This application uses LSH to examine only pairs of elements that are
likely to be similar.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Support Estimation using LSH</title>
      <p>Our LSH-based algorithm for rapid frequency query evaluation consists of two
main parts: In a preprocessing step we rst compress the input database by
partitioning it into blocks containing similar transactions and compute a summary
for each of these blocks. In the query answering step, we estimate the support
of the itemsets at hand by means of these summaries.</p>
      <p>
        The Preprocessing Step. The preprocessing routine takes as input a transactional
database D over I = f1; : : : ; mg and lters out infrequent items so they do not
inuence the hashing procedure. It then builds L data structures, corresponding
to the OR-amplication in LSH [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For each such data structure, we rst select k
functions hi1 ; : : : ; hik independently and uniformly at random, where hi returns
the ith entry (1 i m), and assemble them into a function g as follows:
g hashes two transactions X; Y I into the same bucket i hij (X) = hij (Y )
(i.e., either ij 2 X \ Y or ij 2= X [ Y ) for all j = 1; : : : ; k. Thus, we enforce two
transactions in the same bucket to agree on all of the items in fi1; : : : ; ikg. Notice
that two transactions in the same bucket can agree on even more items if the
items are dependent, i. e., occur frequently together; this is precisely the case we
are interested in. This construction of g corresponds to the AND-amplication
in LSH [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>In our empirical evaluation we found that there were many singleton buckets,
which clearly do not represent any interesting pattern. In general, we discard all
buckets with less than c elements, where c is an input parameter. Its choice is
crucial for the speed-up. For each remaining bucket, we only store one
representative transaction. Our heuristic to dene it is to include only those items that
are contained in a high fraction of the transactions in the bucket. This is
controlled by parameter d: An item i 2 f1; : : : ; wg is included in the representative
transaction of bucket b if at least a fraction d of the transactions hashed into
b contain i. A heuristic for the choice of parameters c and d is proposed in the
next section.</p>
      <p>
        Since storing only the representative transactions and ignoring the bucket
size is insucient, we need to weight the summaries. A straightforward choice
would be just using the bucket size, but we empirically validated that it is better
to use the lower bound for how often the items in the representative transaction
must have occurred together. For space limitations we omit these details. When
this bound is positive, we use it as a weight and add the weighted representative
transaction to the output database. For further speed-up, the output databases
are stored in vertical form in a way similar to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Answering Frequency Queries. Using the compressed databases constructed
above, we evaluate frequency queries as follows. We compensate the
information loss in the compressed databases by using a smaller frequency threshold
r = for some 2 (0; 1]. An itemset is then regarded frequent in the original
dataset w.r.t. if it is frequent in at least one of the L databases w.r.t. r.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Empirical Evaluation</title>
      <p>
        For the experimental evaluation, we compared the runtime of our method with
that of Eclat [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which was the fastest algorithm we considered for reference.
      </p>
      <p>Preproc. (s) Query ( s) Query LSH Scores (%)
Dataset</p>
      <p>Transact. Thr.</p>
      <p>Eclat LSH Eclat LSH Speed-up Prec. Rec. F</p>
      <p>
        Regarding the datasets, we used four benchmark datasets [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ] to compare our
algorithm with other approaches, as well as two newly generated synthetic datasets
to examine the impact of database length on runtime. For the latter, we used
the synthetic data generator contained in the ARtool project [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and, except for
the length, a xed set of parameters. We call the resulting sets AR-tool100k
(100;000 transactions) and AR-tool1M (one million transactions). The frequency
parameter was set in the same magnitude as in [
        <xref ref-type="bibr" rid="ref10 ref6 ref8">6, 8, 10</xref>
        ], except for Mushroom.
For this dataset we had to double the threshold used in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], as the original
threshold leads to an enormous output set.
      </p>
      <p>Our algorithm depends on dierent parameters. For their choice, we randomly
selected one of the datasets ( AR-tool100k) and optimized the parameter choice
for minimal information loss (parameter k) and maximal speed-up while
maintaining an F-score above 80 %. As a result, we suggest the following heuristics
to choose parameters: For parameters k and c, take a small sample of the new
dataset and evaluate the number of deleted item occurrences for dierent values.
As starting points, we recommend setting k to half of the number of frequent
single items and c to 0:01% of the number of transactions in the dataset. Setting
d = 0:9 and L = 25, for take the relative amount of item occurrences which
have not been deleted. This corresponds to around = 0:35 on AR-tool100k.
Though even these simple heuristics lead to remarkable experimental results,
in the long version of this paper we are going to develop some theoretically
well-founded, more sophisticated strategies for the parameter choice.</p>
      <p>
        For the experimental evaluation , we considered only itemsets with more than
one element that were either frequent or minimal infrequent, as only these types
of itemsets are checked for frequency by most frequent itemset mining algorithms.
As mentioned above, we used the support counting method of Eclat [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as a
reference. Our rst results with the six datasets are summarized in Table 1. Our
approach is signicantly faster on three datasets and is comparably fast on the
other ones, all with an F-score of above 80 %.
There are three properties inuencing the performance of our algorithm. Firstly,
(very) large datasets benet more from our approach; this can be observed on
the two ARtool datasets, which were generated with the same parameters except
for the length. The dataset with one million transactions reaches a speed-up
factor of nearly 50 vs. only factor 2.3 for the one with 100;000 transactions.
Secondly, the average distance between two transactions in the database is also
crucial because merging transactions in a bucket works better with more similar
transactions. Thirdly, itemsets returned as infrequent slow down the algorithm,
as all L compressed databases have to be checked. Overall, checking the three
properties enables us to assess if a dataset is suitable for our algorithm.
      </p>
      <p>Our algorithm can further be accelerated. In particular, for infrequent
itemsets, all L compressed databases must be checked. Instead, we could break earlier,
e.g., when a query itemset is not contained at all in the rst few databases or
its frequency is far below the threshold in them. Regarding the quality of the
output, it would be interesting to understand the combinatorial properties of the
transactions within a block that result in more accurate weighted summaries.
Another challenge is to develop some theoretically well-founded method able to
nd the optimal parameter values for the dataset at hand in practically
feasible time. Finally, for datasets over large sets of items it would be interesting to
modify our algorithm to min-hashing by considering Jaccard similarity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Frequent itemset mining dataset repository</article-title>
          . http://mi.ua.ac.be/data/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikant</surname>
          </string-name>
          , R.:
          <article-title>Fast algorithms for mining association rules</article-title>
          .
          <source>In: Proceedings of the 20th VLDB Conference</source>
          . vol.
          <volume>1215</volume>
          , pp.
          <volume>487499</volume>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pratap</surname>
          </string-name>
          , R.:
          <article-title>Frequent-itemset mining using locality-sensitive hashing</article-title>
          .
          <source>In: International Computing and Combinatorics Conf</source>
          . pp.
          <fpage>143155</fpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cristofor</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The ARtool project</article-title>
          . https://www.cs.umb.edu/ laur/ARtool/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. De Bie,
          <string-name>
            <surname>T.</surname>
          </string-name>
          :
          <article-title>An information theoretic framework for data mining</article-title>
          .
          <source>In: Proc. of the 17th ACM Int. Conf. on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>564572</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Goethals</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Survey on frequent pattern mining</article-title>
          .
          <source>Univ. of Helsinki 19</source>
          ,
          <issue>840852</issue>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Indyk</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Approximate nearest neighbors: towards removing the curse of dimensionality</article-title>
          .
          <source>In: Proceedings of the thirtieth annual ACM symposium on Theory of computing</source>
          . pp.
          <fpage>604613</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          :
          <article-title>Incrementally updating the discovered sequential patterns based on pre-large concept</article-title>
          .
          <source>Intelligent Data Analysis</source>
          <volume>19</volume>
          (
          <issue>5</issue>
          ),
          <volume>10711089</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
          </string-name>
          , H.:
          <article-title>Levelwise search and borders of theories in knowledge discovery. Data mining and knowledge discovery 1(3</article-title>
          ),
          <volume>241258</volume>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aksehirli</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goethals</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Frequent itemset mining for big data</article-title>
          .
          <source>In: BigData Conference</source>
          . pp.
          <volume>111118</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parthasarathy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogihara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Parallel algorithms for discovery of association rules. Data mining and knowledge discovery 1(4</article-title>
          ),
          <volume>343373</volume>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>