<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Iterator-based Algorithms in Self-Tuning Discovery of Partial Implications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jose L. Balcazar</string-name>
          <email>jose.luis.balcazar@upc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego García-Saiz</string-name>
          <email>garciasad@unican.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier de la Dehesa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LSI Department, UPC</institution>
          ,
          <addr-line>Campus Nord, Barcelona</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mathematics, Statistics and Computation Department, University of Cantabria</institution>
          <addr-line>Avda. de los Castros s/n, Santander</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>14</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>We describe the internal algorithmics of our recent implementation of a closure-based self-tuning associator: yacaree. This system is designed so as not to request the user to specify any threshold. In order to avoid the need of a support threshold, we introduce an algorithm that constructs closed sets in order of decreasing support; we are not aware of any similar previous algorithm. In order not to overwhelm the user with large quantities of partial implications, our system filters the output according to a recently studied lattice-closure-based notion of confidence boost, and self-adjusts the threshold for that rule quality measure as well. As a consequence, the necessary algorithmics interact in complicated ways. In order to control this interaction, we have resorted to a well-known, powerful conceptual tool, called Iterators: this notion allows us to distribute control among the various algorithms at play in a relatively simple manner, leading to a fully operative, open-source, efficient system for discovery of partial implications in relational data.</p>
      </abstract>
      <kwd-group>
        <kwd>Association mining</kwd>
        <kwd>parameter-free mining</kwd>
        <kwd>iterators</kwd>
        <kwd>Python</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The task of identifying which implications hold in a given dataset has received
already a great deal of attention [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], also the problem of identifying
partial implications has been considered. A major impulse came with the
proposal of "mining association rules", a very closely related concept. A majority
of existing association mining programs follow a well-established scheme [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
according to which the user provides a dataset, a support constraint, a confidence
constraint, and, optionally, in most modern implementations, further constraints
on other rule quality evaluation measures such as lift or leverage (a survey of
quality evaluation measures for partial implications is [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). A wealth of
algorithms, of which the most famous is apriori, have been proposed to perform
association mining.
      </p>
      <p>Besides helping the algorithm to focus on hopefully useful partial
implications, the support constraint has an additional role: by restricting the process
to frequent (or frequent closed) itemsets, the antimonotonicity property of the
support threshold defines a limited search space for exploration and avoids the
often too wide space of the whole powerset of items.</p>
      <p>Instead, however, the price becomes a burden on the user, who must supply
thresholds on rule evaluation measures and on support. Rule measure
thresholds may be difficult to set correctly, but at least they often offer a "semantic"
interpretation that guides the choice; for instance, confidence is (the frequentist
approximation to) the conditional probability of the consequent of the rule, given
the antecedent, whereas lift and leverage refer to the (multiplicative or additive,
respectively) deviation from independence of antecedent and consequent. But
support thresholds are known to be very difficult to set right. Some smallish
datasets are so dense that any exploration below 95% support, on our current
technology, leads to a not always graceful breakdown of the associator program,
whereas other, large but sparse datasets hardly yield any association rule
unless the support is set at quantities as low as 0.1%, spanning a factor of almost
one thousand; and, in order to set the "right" support threshold (whatever that
means), no intuitive guidance is currently known, except for the rather trivial
one of trying various supports and monitoring the number of resulting rules and
the running time and memory needed.</p>
      <p>
        The Weka apriori implementation partially automates the process, as follows:
it explores repeatedly at several support levels, reducing the threshold from one
run to the next by a "delta" parameter (to be set as well by the user), until a given
number of rules has been gathered. Inspired by this idea, but keeping our focus on
avoiding user-set parameters, we are developing an alternative association miner.
It includes an algorithm that explores closed itemsets in order of decreasing
support. This algorithm is similar in spirit to ChARM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], except that some of
the accelerations of that algorithm require ordering some itemsets by increasing
support, which becomes inapplicable in our case. Additionally, our algorithm
keeps adjusting the support bound automatically, as necessary, so as to be able
to proceed with the exploration within the available memory resources. This
is, of course, more expensive in computation time, compared to a traditional
exploration with the "right" support threshold, as the number of closed frequent
sets that can be filtered out right away is much smaller; on the other hand,
no one can tell ahead of time which is the "right" support threshold, and our
alternative spares the user the need of guessing it. To our knowledge, this is the
first algorithm available for mining closed sets in order of descending support
and without employing a user-fixed support threshold.
      </p>
      <p>
        Similarly, in order to spare the user the choice of rule measure thresholds,
we employ a somewhat complex (and slightly slower to evaluate) measure, the
closure-based confidence boost, for which our previous work has led to useful,
implementable bounds as well as to a specific form of self-tuning [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It can be
proved that this quantity is bounded by a related, easy-to-compute quantity:
namely, the closure-based confidence boost is always less than or equal to the
support ratio, introduced (with a less pronounceable name) in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and defined
below; this bound allows us to "push" into the closure mining process a constraint
on the support ratio that spares computation of rules that will fail the rule
measure threshold. We do this by postponing the consideration of the closed
sets that, upon processing, would give rise only to partial implications below the
confidence boost threshold.
      </p>
      <p>As indicated, our algorithm self-tunes this threshold, which starts at a
somewhat selective level, by lowering it in case the output rules show it appropriate.
Then, the support ratio in the closure miner is to follow suit: the constraint is to
be pushed into the closure mining process with the new value. This may mean
that previously discarded closures are now to be considered. Therefore, we must
reconcile four processes: one of mining closed frequent sets in order of
decreasing support, filtering them according to their support ratio; two further ones
that change, along the way, respectively, the support threshold and the support
ratio threshold; and the one of obtaining the rules themselves from the closed
itemsets. Unfortunately, these processes interfere very heavily with each other.
Closed sets are the first objects to be mined from the dataset, and are to be
processed in order of decreasing support to obtain rules from them, but they are
to be processed only if they have both high enough support, and high enough
support ratio. Closed sets of high support and low support ratio, however, cannot
be simply discarded: a future decrease of the self-adjusting rule measure bound
may require us to "fish" them back in, as a consequence of evaluations made
"at the end" of the process upon evaluating rules; likewise, rules of low
closure-based confidence boost need to be kept on hold instead of discarded, so as to
be produced if, later, they turn out to clear the threshold after adjusting it to a
lower level. The picture gains an additional complication from the fact that
constructing partial implications requires not only the list of frequent closures, but
also the Hasse edges that constitute the corresponding Formal Concept Lattice.</p>
      <p>
        As a consequence, the varying thresholds make it difficult to organize the
architecture of the software system in the traditional form of, first, mining the
lattice of frequent closures and, then, extracting rules from them. We describe
here how iterators offer a simple and efficient solution for the organization of
our partial implication miner yacaree, available at SourceForge and shown at
the demo track of a recent conference [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The details of the implementation are
described here for the first time.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Concepts, Notation, and Overview</title>
      <p>In our technological context (pure Python), "generators" constitute one of the
ways of obtaining iterators. An iterator constructed in this way is a method (in
the object-oriented sense) containing, anywhere inside, the "yield" instruction;
most often, this instruction is inside some loop. This instruction acts as a
"return" instruction for the iterator, except that its whole state, including values
of local variables and program counter, is stored, and put back into place at the
next call to the method. Thus, we obtain a "lazy" method that gives us, one
by one, a sequence of values, but only computes one more value whenever it is
called from the "consumer" that needs these values.</p>
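      <p>As a minimal illustration (our own toy example, not code from yacaree), a Python generator suspends at each "yield" and resumes with its full local state on the next request:</p>

```python
def countdown(n):
    """Lazily yield n, n-1, ..., 1; local state survives across yields."""
    while n:
        yield n   # execution pauses here until the consumer asks for more
        n -= 1

g = countdown(3)   # nothing is computed yet
first = next(g)    # resumes the body; first is 3
rest = list(g)     # drains the remaining values: [2, 1]
```

      <p>The body only runs when the consumer calls next(); this is exactly the "lazy" behavior exploited by the iterators described below.</p>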
      <p>Generators as a comfortable way of constructing iterators are available only
in a handful of platforms: several quite specialized lazy functional programming
languages offer them, but, among the most common programming languages,
only Python and C# include generators. Java or C++ offer a mere "iterator"
interface that simply states that classes implementing iterators must offer, with
specific names, the natural operations to iterate over them, but the notion of
generators to program them easily is not available.</p>
      <p>We move on to describe the essentials of our system, and the way iterators
defined by means of generators allow us to organize, in a clear and simple way,
the various processes involved.</p>
      <p>A given set of available items U is assumed; its subsets are called itemsets.
We will denote itemsets by capital letters from the end of the alphabet, and use
juxtaposition to denote union of itemsets, as in XY. The inclusion sign, as in
X ⊂ Y, denotes proper subset, whereas improper inclusion is denoted X ⊆ Y.
For a given dataset D, consisting of n transactions, each of which is an itemset
labeled with a unique transaction identifier, we define the support sup(X) of an
itemset X as the cardinality of the set of transactions that contain X. Sometimes,
the support is measured "normalized" by dividing by the dataset size; then, it
is an empirical approximation to the probability of the event that the itemset
appears in a "random" transaction. Except where explicitly indicated, all our
uses of support will take the form of ratios, and, therefore, it does not matter at
all whether they come absolute or normalized.</p>
      <p>An association rule is an ordered pair of itemsets, often written X → Y.
The confidence c(X → Y) of rule X → Y is sup(XY)/sup(X). We will refer
occasionally below to a popular measure of deviation from independence, often
named lift: assuming X ∩ Y = ∅, the lift of X → Y is</p>
      <p>sup(XY) / (sup(X) · sup(Y)),
where all three supports are assumed normalized (if they are not, then the
dataset size must of course appear as an extra factor in the numerator).</p>
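      <p>As a sketch of these definitions (function names are ours, not yacaree's API), both measures can be computed directly from a list of transactions:</p>

```python
def sup(itemset, transactions):
    """Absolute support: the number of transactions containing the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

def confidence(X, Y, transactions):
    """c(X -> Y) = sup(XY) / sup(X)."""
    return sup(set(X) | set(Y), transactions) / sup(X, transactions)

def lift(X, Y, transactions):
    """sup(XY) / (sup(X) * sup(Y)), with all three supports normalized."""
    n = len(transactions)
    return (sup(set(X) | set(Y), transactions) / n) / (
        (sup(X, transactions) / n) * (sup(Y, transactions) / n))
```

      <p>For instance, on the four transactions ab, a, b, ab, the confidence of a → b is 2/3, and its lift is (2/4) / ((3/4)(3/4)) = 8/9.</p>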
      <p>An itemset X is called frequent if its support is greater than or equal to
some user-defined threshold τ: sup(X) ≥ τ. We often assume that τ is known; no
support bound is implemented by setting τ = 0. Our algorithms will attempt
to self-tune it to an appropriate value without concourse of the user. Given
an itemset X ⊆ U, its closure X̄ is the maximal set (with respect to set
inclusion) Y ⊆ U such that X ⊆ Y and sup(X) = sup(Y). It is easy to see that X̄
is unique. An itemset X is closed if X = X̄. Closure operators are characterized
by the three properties of monotonicity, idempotency, and extensivity.</p>
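      <p>A brute-force sketch of this closure operator (ours, for illustration): on a transactional dataset, the closure of a supported itemset X is the intersection of all transactions that contain X:</p>

```python
def closure(X, transactions):
    """Maximal superset of X with the same support: the intersection
    of the transactions covering X (assumes sup(X) is positive)."""
    covering = [set(t) for t in transactions if set(X).issubset(t)]
    result = covering[0]
    for t in covering[1:]:
        result = result.intersection(t)
    return result
```

      <p>For example, over the transactions abe, ab, ae, the closure of b is ab: every transaction containing b also contains a, so both itemsets have the same support.</p>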
      <p>
        The support ratio was essentially employed first, to our knowledge, in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
where, together with other similar quotients, it was introduced with the aim
of providing a faster algorithm for computing representative rules. The support
ratio of an association rule X → Y is that of the itemset XY, defined as follows:
σ(X → Y) = σ(XY) = sup(XY) / max{ sup(Z) | sup(Z) > τ, XY ⊂ Z }.
      </p>
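      <p>The following brute-force sketch (our own, exponential in the number of items and for illustration only) computes the support ratio of an itemset by examining every strict superset:</p>

```python
from itertools import combinations

def sup(itemset, transactions):
    """Absolute support of an itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

def support_ratio(XY, transactions, tau=0):
    """sigma(XY) = sup(XY) / max{ sup(Z) : sup(Z) > tau, XY proper subset of Z }."""
    universe = set()
    for t in transactions:
        universe = universe | set(t)
    extra = sorted(universe - set(XY))
    best = 0
    for r in range(1, len(extra) + 1):
        for add in combinations(extra, r):
            s = sup(set(XY) | set(add), transactions)
            if s > tau:
                best = max(best, s)
    return sup(XY, transactions) / best if best else float("inf")
```

      <p>On the 24-transaction example developed later in the paper, this gives σ(a) = 12/6 = 2 and σ(cde) = 4/3, matching the values used there.</p>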
      <p>For many quality measures for partial implications, including support,
confidence, and closure-based confidence boost (to be defined momentarily), the
relevant supports turn out to be the support of the antecedent and the support
of the union of antecedent and consequent. As these are captured by the
corresponding closures, we deem inequivalent two rules X → Y and X′ → Y′ exactly
when they are not "mutually redundant" with respect to the closure space
defined by the dataset: either the closure of X differs from that of X′, or the
closure of XY differs from that of X′Y′. We denote that fact as (X → Y) ≢ (X′ → Y′).</p>
      <p>We now assume sup(XY) > τ. As indicated, our system keeps a varying
threshold on the following rule evaluation measure:
β(X → Y) = c(X → Y) / max{ c(X′ → Y′) | (X → Y) ≢ (X′ → Y′), sup(X′Y′) > τ, X′ ⊆ X, Y ⊆ X′Y′ }.</p>
      <p>
        This notion, known as "closure-based confidence boost", as well as the "plain
confidence boost", which is a simpler variant where the closure operator reduces
to the identity, are studied in depth in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Intuitively, this is a relative, instead
of absolute, form of confidence: we are less interested in a partial implication
having very similar confidence to that of a simpler one. A related formula
measures relative confidence with respect to logically stronger partial implications
(confidence width, see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]); the formula just given seems to work better in
practice. For the value of this measure to be nontrivial, XY must be a closed set;
the following inequality holds:
      </p>
      <p>Proposition 1. β(X → Y) ≤ σ(X → Y).</p>
      <p>
        The threshold on β(X → Y) is self-adjusted along the mining process, on the
basis of several properties, such as coincidence with lift under certain conditions;
all these details and properties are described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Architecture of yacaree</title>
        <p>The diagram in Figure 1 shows the essentials of the class structure, for easier
reference along the description of the iterators. For simplicity, a number of
additional classes existing in the system are not shown. A couple of them, added
recently, find minimal generators via standard means and implement a plain
confidence boost version appropriate for full-confidence implications; their
algorithmics are not novel, pose no challenge, and are omitted here. We are also
omitting discussion of classes like the Dataset class, some heap-like auxiliary
data structures, user interfaces, and a class capturing a few static values, as
their role in our description is minor or easy to understand (or both).
We give a brief explanation of the roles of the classes given in the diagram.
Details about their main methods (the corresponding iterators) come below.</p>
        <p>Class ItemSet keeps the information and methods to prettyprint itemsets,
including information such as support; it inherits from sets all set-theoretic
operations. Class Rule keeps both antecedent and consequent (technically, it keeps
the antecedent and the union of antecedent and consequent, as in this case the
latter is always closed, which allows for more efficient processing), and is able to
provide rule evaluation measures such as confidence or lift.</p>
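        <p>A minimal sketch of the idea behind Rule (ours; the actual class in yacaree carries more machinery): storing the antecedent and the full closed itemset suffices to recover the consequent and the confidence:</p>

```python
class Rule:
    """Keeps antecedent X and the (closed) union XY, with their supports."""
    def __init__(self, antecedent, ant_support, total, tot_support):
        self.antecedent = frozenset(antecedent)  # X
        self.total = frozenset(total)            # closure of XY
        self.ant_support = ant_support
        self.tot_support = tot_support

    @property
    def consequent(self):
        return self.total - self.antecedent

    def confidence(self):
        return self.tot_support / self.ant_support
```

        <p>For instance, a rule with antecedent be (support 5) and closed union bcde (support 3) reports consequent cd and confidence 3/5 = 0.6.</p>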
        <p>Class ClMiner runs the actual closure mining, with some auxiliary methods
to handle all the details. Its main method is the iterator mine_closures() (described
below), which yields, one by one and upon being called, all closed sets having
support above the threshold, in order of decreasing support. This "decreasing
support" condition allows us to increase the support threshold, if necessary, to
continue the exploration. As explained below, when the internal data structures
of the closure miner are about to overflow, the support threshold is increased in
such a way that half the closures found so far and still pending consideration
are discarded.</p>
        <p>
          Class Lattice runs its own iterator, candidate_closures(), which, in turn, calls
mine_closures() as new closed sets become needed. Its main task is to call
methods that implement the algorithms from [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and actually build the lattice of
closed sets, so that further iterators can expect to receive closures for which the
immediate predecessors have been identified. Version 1.0 of yacaree employed
the Border algorithm, but in version 1.1 we have implemented the faster
algorithm iPred, and indeed obtained around a 10% acceleration. The fact that
iPred could be employed in enumerations of closures by decreasing support was
proved in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. See [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for further discussions.
        </p>
        <p>Additionally, the support ratio of each closed set is also computed here, and
the class offers yet another iterator that provides, for each closure, all the
predecessor closures having support above a local, additional support threshold that
can be specified at call time. In this way, we obtain all the candidate antecedents
for a given closed set as potential consequent. This internal iterator amounts to
a plain depth-first search, so that we do not discuss it further here.</p>
        <p>Within class Lattice, two heap-structured lists keep, respectively, the closures
that are ready to be passed on, as they clear both the support and the support
ratio thresholds (Lattice.ready), and the closures that clear the support threshold
but fail the support ratio threshold (Lattice.freezer); these will be recovered in
case a decrease of the confidence boost bound is to affect the support ratio
pruning.</p>
        <p>Finally, class RuleMiner is in charge of offering the system an iterator over all
the association rules passing the current thresholds of support and closure-based
confidence boost: mine_rules(). Its usage allows one to include easily further
checks of confidence, lift, or any other such quantity.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Details</title>
      <p>This section provides details of the main iterators and their combined use to
attain our goals.</p>
      <sec id="sec-3-1">
        <title>ClMiner.mine_closures()</title>
        <p>The closure miner is the iterator that supports the whole scheme; it follows a
"best-first" strategy, where here "best" means "highest support". We maintain a
heap containing closures already generated, but which have not been expanded
yet to generate further closures after them. The heap can provide the one of
highest support in logarithmic time, as this is the closure that comes next. Then,
as this closure is passed on to the lattice constructor, items (rather, closures of
singletons) are added to it in all possible ways, and closure operations are applied
in order to generate its closed successors, which are added to the heap unless
they were already in it. The decreasing support condition ensures that they were
never visited before. For simplicity, we omit discussing the particular case of the
empty set, which, if closed, is to be traversed first, separately.
1: identify closures of singletons
2: organize them into a maxheap according to support
3: while heap nonempty do
4:   consider increasing the support threshold by monitoring the available memory
5:   if the support threshold must be raised then
6:     kill from the heap pending closures of support below the new threshold, which is chosen so that the size of the heap halves
7:   end if
8:   pop from heap the max-support itemset
9:   yield it
10:  try to extend it with all singleton closures
11:  for such extensions with sufficient support do
12:    if their closure is new then
13:      add it to the heap
14:    end if
15:  end for
16: end while</p>
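        <p>The pseudocode above can be rendered as a Python generator along the following lines (a self-contained sketch under simplifying assumptions of ours: the memory test is replaced by a fixed cap on the heap size, and supports and closures are recomputed rather than cached):</p>

```python
import heapq

def mine_closures(transactions, max_heap_size=1000):
    """Yield (closed set, support) pairs in order of decreasing support."""
    def sup(X):
        return sum(1 for t in transactions if X.issubset(t))

    def closure(X):  # intersection of the transactions covering X
        covering = [set(t) for t in transactions if X.issubset(t)]
        out = covering[0]
        for t in covering[1:]:
            out = out.intersection(t)
        return frozenset(out)

    threshold = 1
    singletons = {closure(frozenset([i])) for t in transactions for i in t}
    # maxheap via negated supports; sorted item list as deterministic tie-break
    heap = [(-sup(c), sorted(c), c) for c in singletons]
    heapq.heapify(heap)
    seen = {c for _, _, c in heap}
    while heap:
        if len(heap) > max_heap_size:
            # raise the threshold so that roughly half the pending
            # closures are discarded (the "memory about to overflow" case)
            supports = sorted(-s for s, _, _ in heap)
            threshold = supports[len(supports) // 2] + 1
            heap = [e for e in heap if -e[0] >= threshold]
            heapq.heapify(heap)
        neg, _, c = heapq.heappop(heap)
        yield c, -neg
        for s in singletons:          # extend with all singleton closures
            u = c | s
            if u == c or threshold > sup(u):
                continue
            ext = closure(u)
            if ext not in seen:
                seen.add(ext)
                heapq.heappush(heap, (-sup(ext), sorted(ext), ext))
```

        <p>On the 24-transaction example developed next, this generator yields a/12, b/11, cd/10, e/9, and so on, down to abcde/1, with supports never increasing.</p>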
        <p>In order to clarify how this algorithm works, we develop the following
example. Consider a dataset with 24 transactions over the universe U = {a, b, c, d, e}
of 5 items: {abcde, bcde ×2, abe, cde, be, ae ×3, ab ×4, cd ×6, b ×2, a ×3}. For
this dataset, there are 12 closed sets, which we enumerate here with their
corresponding supports: ∅/24, a/12, b/11, cd/10, e/9, ab/6, ae/5, be/5, cde/4, bcde/3,
abe/2, abcde/1. The empty set is treated first, separately, as indicated. Then, the
four next closures correspond to closures of singletons (the closures of c and d
coincide) and form an initial heap, containing: [a/12, b/11, cd/10, e/9].</p>
        <p>The heap provides a as the next closure in descending support; it is passed
on to further processing at the "yield" instruction, and it is expanded with
singleton closures in all possible ways, enlarging the heap into containing six
pending closures: [b/11, cd/10, e/9, ab/6, ae/5, abcde/1]; each of the new sets in the
heap is obtained by adding to a the closure of a singleton, and closing the
result. The next closure is b, which goes into the "yield" instruction and,
subsequently, generates two further closures to be added to the heap, which becomes:
[cd/10, e/9, ab/6, ae/5, be/5, bcde/3, abcde/1]. The closure ab generated from b is
omitted, as it is repeated: it was already generated from a.</p>
        <p>For illustration purposes, we assume now that the length of the heap,
currently 7, is deemed too large. Of course, in a toy example like this one there is
no need of moving up the support threshold, but let's do it anyway: assume that
the test indicates that the heap is occupying too much memory, incurring a
risk of soon-coming overflow. Then, the support is raised as much as necessary
so as to halve the length of the heap. Pending closures of support 5 or less would
be discarded from the heap, the support threshold would be set at 6, and only
three closures would remain in the heap: [cd/10, e/9, ab/6]. Each of them would be
processed in turn, given onwards by the "yield" instruction, and expanded with
all closures of singletons; in all cases, we will find that expanding any of them
with a singleton closure leads to a closure of support below 6, which is therefore
omitted as it does not clear the threshold. Eventually, these three closures in the
heap are passed on, and the iterator will have traversed all closures of support
6 or higher.</p>
        <p>As a different run, assume now that we accept the growth of the heap, so that
it is not reduced. The traversal of closures would go on yielding cd, which would
add cde to the heap; adding either a or b to cd leads to abcde, which is already in
the heap. The next closure e adds nothing new to the heap, and the next is ab,
which adds abe; at this point the heap is [ae/5, be/5, cde/4, bcde/3, abe/2, abcde/1].
All further extensions only lead to repeated closures, hence nothing further is
added to the heap, and, as it is emptied, the traversal of all the closures is
completed.</p>
        <p>The main property that has to be argued here is the following:
Proposition 2. As the support threshold rises, all the closed sets delivered so
far by the iterator to the next phase are still correct, that is, have support at least
the new value of the threshold.</p>
        <p>Proof. We prove this statement by arguing the following invariants: first, the
new threshold is bounded above by the highest support of a closure in the heap;
second, all the closed sets provided so far up to any "yield" statement have
support bounded below by all the supports currently in the heap. These invariants
are maintained as we extract the closure C of maximum support in the heap,
and also when we add to it extensions of C: indeed, C having been freshly taken
off the heap, all previous deliveries have at least the same support, whereas all
extensions that are to enter the heap are closed supersets of C and must have
lower support, because C is closed.</p>
        <p>Hence, all previous deliveries have support higher than the maximum support
in the heap, which, in turn, is also higher than the new threshold; transitivity
now proves the statement.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Lattice.candidate_closures()</title>
        <p>
          In order to actually mine rules from the closures traversed by the loop described
in the previous section, further information is necessary: data structures to allow
for traversing predecessors, namely, the Hasse edges, that is, the immediate,
nontransitive neighbors of each closure. These come from a second iterator that
implements the iPred algorithm [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Additionally, we wish to push into the closure mining the confidence boost
constraint. The way to do it is to compute the support ratio of each closure, and
only pass it on to mine rules from it if this support ratio is above the confidence
boost threshold; indeed, Proposition 1 tells us that, if the support ratio is below
the threshold, the confidence boost will be too.</p>
        <p>Due to the condition of decreasing support, we know that the closed superset
that defines the support ratio is exactly the first successor to appear from the
closure mining iterator. As soon as one successor of C appears, if the support
ratio is high enough, we can yield C, as the Hasse edges to its own predecessors
are guaranteed to have been set before. If the support ratio is not high enough, it is
kept in a "freezer" (again a heap-like structure) from where it might be "fished
back in" if the confidence boost threshold decreases later on.</p>
        <p>One has to be careful: the same closed set, say C, may be the first
successor of more than one predecessor. As we set the predecessors C′ of C, we
move to the "ready" heap those that have C as first successor, if their support
ratio is high enough; then we yield them all. Additionally, as we shall see, it may
happen that RuleMiner.mine_rules() moves closures from "freezer" to "ready".
We explain this below.
1: for each closed set C yielded by ClMiner.mine_closures() do
2:   apply a Hasse edges algorithm (namely iPred) to set up the lattice edges connecting C to its predecessors
3:   for each unprocessed predecessor C′ do
4:     compute the support ratio of C′
5:     if the support ratio is over the rule evaluation threshold then
6:       add C′ to the "ready" heap
7:     else
8:       add C′ to the "freezer" heap
9:     end if
10:  end for
11:  for each closure in the "ready" heap do
12:    yield it
13:  end for
14: end for</p>
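        <p>The ready/freezer routing in this loop can be sketched as follows (a simplification of ours: the lattice construction is omitted, the support ratio is supplied as a callable, and the varying confidence boost threshold is also read through a callable so that it can change between calls):</p>

```python
import heapq

def candidate_closures(closures, support_ratio, boost_threshold):
    """Yield closures whose support ratio clears the current threshold;
    park the rest in a freezer, from where they may be fished back in."""
    freezer = []  # max-heap on support ratio, via negation
    for c in closures:
        r = support_ratio(c)
        if r > boost_threshold():
            yield c
        else:
            heapq.heappush(freezer, (-r, sorted(c), c))
        # if the threshold has decreased, frozen closures may now pass
        while freezer and -freezer[0][0] > boost_threshold():
            yield heapq.heappop(freezer)[2]
```

        <p>With the running example's support ratios (2 for a, 4/3 for cde) and a threshold fixed at 1.4, a passes at once while cde waits in the freezer; lowering the threshold to 1.15 lets cde through.</p>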
        <p>We observe here that we are not guaranteeing decreasing support order in this
iterator, as the changes to the support ratio threshold may swap closures with
respect to the order in which they were obtained. What we do need is that the
most basic iterator, ClMiner.mine_closures(), does provide them in decreasing
support order: first, to ensure that the support threshold can be raised if
necessary, and, second, to make sure that the support ratio is correctly computed
from the first successor found for each closure.</p>
        <p>Along the same example as before, consider, for instance, what happens when
ClMiner.mine_closures() yields the closure ab to Lattice.candidate_closures().
The iPred algorithm identifies a and b as immediate predecessors of ab, and
the corresponding Hasse edges are stored. Then, both a and b are identified as
closures whose first successor (in decreasing support) has just appeared; indeed,
other successors have less support than ab. The support ratios of a and b, namely,
12/6 = 2 and 11/6, are seen to be higher than the confidence boost threshold
(which starts at 1.15 by default), and both a and b are moved to the "ready"
heap and yielded to the subsequent rule mining phase. On the other hand, if the
confidence boost threshold was, say, at 1.4, upon processing bcde we would find
4/3 &lt; 1.4 as the support ratio of cde, and this closure would wait in the freezer heap
until (if at all) a revised lower value of the confidence boost threshold would let
it through, by moving it from the freezer queue to the ready queue.</p>
      </sec>
      <sec id="sec-3-3">
        <title>RuleMiner.mine_rules()</title>
        <p>
          Each closure makes available an iterator over its predecessors in the closure
lattice (closed proper subsets), up to a given support level that we can specify
upon calling it. For instance, at the closure bcde, of support 3, and assuming
a confidence threshold of 0.6, we would explore the predecessors be and cde, which
lead to the rules be → cd and cde → b. The confidence boost has to be checked,
but part of the task is already done, since the very fact that the closure bcde
arrived here implies that its support ratio is over the confidence boost threshold.
In this case, the support ratio of closure bcde is 3. We must test confidences with
smaller antecedents (see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]). As the confidences of b → cd and e → cd are low
enough, the rule be → cd becomes indeed reported; cde → b does as well, after
checking how low the confidences of cd → b and e → b are.
        </p>
        <p>
          The revision of the closure-based confidence boost threshold can be done
in a number of ways. The current implementation keeps computing the lift of
those rules whose antecedent is a singleton, as the condition on support ratio
ensures that, in this case, it will coincide with the confidence boost [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; these lift
values enter a weighted average with the current threshold, and, if the average is
su ciently smaller, the threshold is decreased. Only a partial justi cation exists
so far for this choice.
        </p>
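        <p>A minimal Python sketch of such a weighted-average update follows; the weight and the decrease margin are assumed values for illustration, since the text leaves the exact choice open.</p>
        <p>
```python
def revise_threshold(threshold, lifts, weight=0.9, min_gap=0.05):
    # Blend each observed lift (equal, for singleton-antecedent rules,
    # to the closure-based confidence boost) into the current threshold;
    # lower the threshold only when the blended value is sufficiently
    # smaller. The weight and min_gap values are illustrative.
    for lift in lifts:
        candidate = weight * threshold + (1 - weight) * lift
        if candidate < threshold - min_gap:
            threshold = candidate
    return threshold
```
        </p>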
        <p>When the threshold for confidence boost decreases, closures whose support
ratio was too low may now become high enough; thus, the freezer is explored and
closures whose support ratio is now above the new confidence boost threshold
are moved into the ready queue (lines 6 and 7), to be processed subsequently.</p>
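        <p>A possible Python rendering of this promotion step, with illustrative heap entries (the actual yacaree data structures may differ):</p>
        <p>
```python
import heapq

def promote(freezer, ready, new_threshold):
    # Heap entries are (-support, support_ratio, closure), so heappop
    # returns the pending closure of highest support first.
    kept = []
    while freezer:
        entry = heapq.heappop(freezer)
        if entry[1] >= new_threshold:
            heapq.heappush(ready, entry)    # now clears the threshold
        else:
            kept.append(entry)              # stays frozen
    for entry in kept:
        heapq.heappush(freezer, entry)
```
        </p>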
      </sec>
      <sec id="sec-3-4">
        <title>System</title>
        <p>The main program simply traverses all rules, as obtained from the iterator
mine_rules() in the class RuleMiner:
1: for each rule in RuleMiner.mine_rules() do
2: account for it
3: end for</p>
        <p>What is to be done with each rule depends on the instructions from the user
interface, but usually we count how many of them are obtained and we write
them all to disk, possibly up to a fixed limit on the number of rules (which can
be modified by editing the source code). In this case, we report those of highest
closure-based confidence boost.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A Second Implementation</title>
      <p>With a view to offering this system in a more widespread manner, we have
developed a joint project with KNIME GmbH, a small company that develops the
open-source data mining suite KNIME. This data mining suite is implemented
in Java. Hence, we have constructed a second implementation in Java.</p>
      <p>However, the issue is not fully trivial, for two main reasons. The first
is that the notion of iterator in Java is different from that in Python, and is not
obtained from generators: the "yield" instruction, which saves the state of an
iteration at the time it is invoked, does not exist in Java, which simply declares
that hasNext() and next() methods must be made available: respectively, to
know whether there are more elements to process and to get the next element.
A second significant change is that the memory control ensuring that the list
of pending closures does not overflow has to be made in terms of the memory
management API of KNIME, and requires one extra loop to check whether the
decrease in memory usage was sufficient.</p>
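      <p>The contrast can be shown in a few lines: a Python generator keeps its iteration state implicitly, whereas a Java-style iterator must keep that state in explicit fields behind hasNext() and next(). The example below is deliberately tiny and unrelated to the miner itself.</p>
      <p>
```python
def countdown(n):
    # Python: "yield" suspends the function and saves its state for us.
    while n > 0:
        yield n
        n -= 1

class Countdown:
    # Java-style equivalent: the state that "yield" would keep implicitly
    # must live in explicit fields behind hasNext() and next().
    def __init__(self, n):
        self.n = n

    def hasNext(self):
        return self.n > 0

    def next(self):
        value = self.n
        self.n -= 1
        return value
```
      </p>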
      <p>Therefore, we have to use the Iterator interface to "copy", to the extent possible,
the "yield" behavior, saving all the information necessary to continue in queues and
lists. The three methods most affected by this issue are, of course, mine_rules(),
candidate_closures() and mine_closures(). We describe here only mine_rules().</p>
      <p>In this case, a queue, called ready_rules, is needed in order to store the
rules that are built from the current closure among the candidates and have
met the support, confidence, and confidence boost requirements. Rules that
do not clear these thresholds are stored in another queue, reserved_rules, as in
the Python implementation. The code is shown next:</p>
      <p>In Lattices.candidate_closures(), the candidate closures are likewise stored in
a list called candidate_closures_list so that the mine_rules() method can obtain
them. The program is constructed in the same way as the one just described,
and is omitted here.</p>
      <p>The last method that needs a change in the translation from Python to Java
and KNIME is clminer.mine_closures(); the change consists of storing in a list called
max_support_itemset_list the candidate itemsets that obey the max-support
requirement, and of returning this list at the end of the method. In this case
iterators are not needed, because this method only has to store and
return the list, so the next() and hasNext() methods are not used.
1: max_support_itemset_list = empty list of itemsets
2: identify closures of singletons
3: organize them into a maxheap according to support
4: while heap nonempty do
5: consider increasing the support threshold by monitoring the available
memory
6: if the support threshold must be raised then
7: kill from the heap pending closures of support below the new threshold,
which is chosen so that the size of the heap halves
8: end if
9: pop from the heap the max-support itemset and store it in max_support_itemset_list
10: try to extend it with all singleton closures
11: for such extensions with sufficient support do
12: if their closure is new then
13: add it to the heap
14: end if
15: end for
16: end while
17: return max_support_itemset_list</p>
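      <p>The pseudocode above can be turned into a compact, runnable Python sketch; the closure computation and the toy transactions are illustrative, and the memory-driven raising of the support threshold (lines 5 to 8) is omitted for brevity:</p>
      <p>
```python
import heapq

def mine_closures(transactions, min_support=1):
    # Enumerate closed itemsets in decreasing support: a heap keyed on
    # negated support always pops the pending closure of highest support,
    # which is then extended with the remaining items.
    def close(items):
        covering = [t for t in transactions if items <= t]
        if not covering:
            return items, 0
        return frozenset.intersection(*covering), len(covering)

    items = {i for t in transactions for i in t}
    heap, seen, output = [], set(), []

    def push(candidate):
        closure, support = close(candidate)
        if support >= min_support and closure not in seen:
            seen.add(closure)
            heapq.heappush(heap, (-support, sorted(closure), closure))

    for i in items:                      # closures of singletons
        push(frozenset([i]))
    while heap:
        neg_support, _, closure = heapq.heappop(heap)
        output.append((closure, -neg_support))
        for i in items - closure:        # extend with singleton closures
            push(closure | {i})
    return output
```
      </p>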
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have studied a variant of the basic association mining process. In our variant,
we try to avoid burdening the user with requests to fix threshold parameters.
We keep an internal support threshold and adjust it upwards whenever the
computation process shows that the system will be unable to run down to the
current threshold value. We tackle the problem of limiting the number of rules
through one specific rule measure, closure-based confidence boost, for which the
threshold is self-adjusted along the mining process. A minor detail is that for
full-confidence implications it is not difficult to see that closure-based confidence
boost is inappropriate, and plain confidence boost is to be used. Further details
about this issue will be given in future work.</p>
      <p>The confidence boost constraint is pushed into the mining process through its
connection to the support ratio. Therefore, the closure miner has to coordinate
with processes that move the support threshold upwards, or the support ratio
threshold downwards.</p>
      <p>
        Further study on the basis of our implementation is underway, and further
versions of our association miner, with hopefully faster algorithmics, will be
provided in the coming months. Another line of activity is as follows: granted
that our approach offers partial implications without user-defined parameters,
to what extent are users who are not experts in data analysis satisfied with the
results? Our research group explores that topic in a separate paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Additionally, we are aware of two independent works where an algorithm is
proposed to traverse the closure space in linear time [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; these algorithms
do not follow an order of decreasing support, and we find it nontrivial to modify
them so that they fulfill this condition. Our research group is attempting it,
as, if successful, faster implementations could be designed.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          . Springer-Verlag (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Luxenburger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Implications partielles dans un contexte</article-title>
          .
          <source>Mathématiques et Sciences Humaines</source>
          <volume>29</volume>
          (
          <year>1991</year>
          )
          <fpage>35</fpage>
          -
          <lpage>55</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikant</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verkamo</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          :
          <article-title>Fast discovery of association rules</article-title>
          . In:
          <article-title>Advances in Knowledge Discovery and Data Mining</article-title>
          . AAAI/MIT Press (
          <year>1996</year>
          )
          <fpage>307</fpage>
          -
          <lpage>328</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Geng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamilton</surname>
            ,
            <given-names>H.J.:</given-names>
          </string-name>
          <article-title>Interestingness measures for data mining: A survey</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>38</volume>
          (
          <issue>3</issue>
          ) (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsiao</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>Efficient algorithms for mining closed itemsets and their lattice structure</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ) (
          <year>2005</year>
          )
          <fpage>462</fpage>
          -
          <lpage>478</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Balcazar</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Formal and computational properties of the confidence boost in association rules</article-title>
          . Available at: [http://personales.unican.es/balcazarjl]. Extended abstract appeared as "Objective novelty of association rules: Measuring the confidence boost". In Yahia, S.B., Petit, J.M., eds.
          <source>: EGC</source>
          . Volume RNTI-E-19 of Revue des Nouvelles Technologies de l'Information., Cépaduès-Éditions (
          <year>2010</year>
          )
          <fpage>297</fpage>
          -
          <lpage>302</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kryszkiewicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Closed set based discovery of representative association rules</article-title>
          . In Hoffmann, F.,
          <string-name>
            <surname>Hand</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>N.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fisher</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          , Guimarães, G., eds.
          <source>: Proc. of the 4th International Symposium on Intelligent Data Analysis (IDA)</source>
          .
          <source>Volume 2189 of Lecture Notes in Computer Science</source>
          ., Springer-Verlag (
          <year>2001</year>
          )
          <fpage>350</fpage>
          -
          <lpage>359</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Balcazar</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Parameter-free association rule mining with yacaree</article-title>
          . In Khenchaf, A., Poncelet, P., eds.
          <source>: EGC</source>
          . Volume RNTI-E-20 of Revue des Nouvelles Technologies de l'Information., Hermann-Editions (
          <year>2011</year>
          )
          <fpage>251</fpage>
          -
          <lpage>254</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Baixeries</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szathmary</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valtchev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Godin</surname>
          </string-name>
          , R.:
          <article-title>Yet a faster algorithm for building the Hasse diagram of a concept lattice</article-title>
          . In Ferré, S.,
          <string-name>
            <surname>Rudolph</surname>
          </string-name>
          , S., eds.
          <source>: Proc. of the 7th International Conference on Formal Concept Analysis (ICFCA)</source>
          .
          <source>Volume 5548 of Lecture Notes in Artificial Intelligence</source>
          ., Springer-Verlag (
          <year>2009</year>
          )
          <fpage>162</fpage>
          -
          <lpage>177</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Balcazar</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          , Tîrnăucă, C.:
          <article-title>Border algorithms for computing Hasse diagrams of arbitrary lattices</article-title>
          . In Valtchev, P., Jäschke, R., eds.
          <source>: ICFCA</source>
          . Volume
          <volume>6628</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2011</year>
          )
          <fpage>49</fpage>
          -
          <lpage>64</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Obiedkov</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Algorithms for the construction of concept lattices and their diagram graphs</article-title>
          . In Raedt, L.D.,
          <string-name>
            <surname>Siebes</surname>
          </string-name>
          , A., eds.
          <source>: Proc. of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD)</source>
          .
          <source>Volume 2168 of Lecture Notes in Artificial Intelligence</source>
          ., Springer-Verlag (
          <year>2001</year>
          )
          <fpage>289</fpage>
          -
          <lpage>300</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>García-Saiz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zorrilla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balcazar</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Closures and partial implications in educational data mining</article-title>
          .
          <source>ICFCA</source>
          ,
          Supplementary Proceedings
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Two basic algorithms in concept analysis</article-title>
          (preprint
          <year>1987</year>
          ). In Kwuida, L.,
          <string-name>
            <surname>Sertkaya</surname>
          </string-name>
          , B., eds.
          <source>: ICFCA</source>
          . Volume
          <volume>5986</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2010</year>
          )
          <fpage>312</fpage>
          -
          <lpage>340</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Uno</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uchida</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arimura</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>An efficient algorithm for enumerating closed patterns in transaction databases</article-title>
          . In Suzuki, E.,
          <string-name>
            <surname>Arikawa</surname>
          </string-name>
          , S., eds.: Discovery Science. Volume
          <volume>3245</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2004</year>
          )
          <fpage>16</fpage>
          -
          <lpage>31</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>