=Paper=
{{Paper
|id=None
|storemode=property
|title= Iterator-based Algorithms in Self-Tuning Discovery of Partial Implications
|pdfUrl=https://ceur-ws.org/Vol-876/paper2.pdf
|volume=Vol-876
}}
== Iterator-based Algorithms in Self-Tuning Discovery of Partial Implications==
José L. Balcázar¹, Diego García-Sáiz², and Javier de la Dehesa²

¹ LSI Department, UPC, Campus Nord, Barcelona. jose.luis.balcazar@upc.edu
² Mathematics, Statistics and Computation Department, University of Cantabria, Avda. de los Castros s/n, Santander, Spain. garciasad@unican.es

Abstract. We describe the internal algorithmics of our recent implementation of a closure-based self-tuning associator: yacaree. This system is designed so as not to request the user to specify any threshold. In order to avoid the need of a support threshold, we introduce an algorithm that constructs closed sets in order of decreasing support; we are not aware of any similar previous algorithm. In order not to overwhelm the user with large quantities of partial implications, our system filters the output according to a recently studied lattice-closure-based notion of confidence boost, and self-adjusts the threshold for that rule quality measure as well. As a consequence, the necessary algorithmics interact in complicated ways. In order to control this interaction, we have resorted to a well-known, powerful conceptual tool, called iterators: this notion allows us to distribute control among the various algorithms at play in a relatively simple manner, leading to a fully operative, open-source, efficient system for discovery of partial implications in relational data.

Keywords: Association mining, parameter-free mining, iterators, Python.

1 Introduction

The task of identifying which implications hold in a given dataset has already received a great deal of attention [1]. Since [2], the problem of identifying partial implications has also been considered. A major impulse came with the proposal of "mining association rules", a very closely related concept. A majority of existing association mining programs follow a well-established scheme [3], according to which the user provides a dataset, a support constraint, a confidence constraint, and, optionally, in most modern implementations, further constraints on other rule quality evaluation measures such as lift or leverage (a survey of quality evaluation measures for partial implications is [4]). A wealth of algorithms, of which the most famous is apriori, have been proposed to perform association mining.

Besides helping the algorithm to focus on hopefully useful partial implications, the support constraint has an additional role: by restricting the process to frequent (or frequent closed) itemsets, the antimonotonicity property of the support threshold defines a limited search space for exploration and avoids the often too wide space of the whole powerset of items.

The price, however, becomes a burden on the user, who must supply thresholds on rule evaluation measures and on support. Rule measure thresholds may be difficult to set correctly, but at least they often offer a "semantic" interpretation that guides the choice; for instance, confidence is (the frequentist approximation to) the conditional probability of the consequent of the rule, given the antecedent, whereas lift and leverage refer to the (multiplicative or additive, respectively) deviation from independence of antecedent and consequent. But support thresholds are known to be very difficult to set right.
Some smallish datasets are so dense that any exploration below 95% support, on our current technology, leads to a not always graceful breakdown of the associator program, whereas other, large but sparse datasets hardly yield any association rule unless the support is set at quantities as low as 0.1%, spanning a factor of almost one thousand; and, in order to set the "right" support threshold (whatever that means), no intuitive guidance is currently known, except for the rather trivial one of trying various supports and monitoring the number of resulting rules and the running time and memory needed. The Weka apriori implementation partially automates the process, as follows: it explores repeatedly at several support levels, reducing the threshold from one run to the next by a "delta" parameter (to be set as well by the user), until a given number of rules has been gathered.

Inspired by this idea, but keeping our focus on avoiding user-set parameters, we are developing an alternative association miner. It includes an algorithm that explores closed itemsets in order of decreasing support. This algorithm is similar in spirit to ChARM [5], except that some of the accelerations of that algorithm require ordering some itemsets by increasing support, which becomes inapplicable in our case. Additionally, our algorithm keeps adjusting the support bound automatically as necessary so as to be able to proceed with the exploration within the available memory resources. This is, of course, more expensive in computation time, compared to a traditional exploration with the "right" support threshold, as the number of closed frequent sets that can be filtered out right away is much smaller; on the other hand, no one can tell ahead of time which is the "right" support threshold, and our alternative spares the user the need of guessing it. To our knowledge, this is the first algorithm available for mining closed sets in order of descending support and without employing a user-fixed support threshold.

Similarly, in order to spare the user the choice of rule measure thresholds, we employ a somewhat complex (and slightly slower to evaluate) measure, the closure-based confidence boost, for which our previous work has led to useful, implementable bounds as well as to a specific form of self-tuning [6]. It can be proved that this quantity is bounded by a related, easy-to-compute quantity: namely, the closure-based confidence boost is always less than or equal to the support ratio, introduced (with a less pronounceable name) in [7], and defined below; this bound allows us to "push" into the closure mining process a constraint on the support ratio that spares computation of rules that will fail the rule measure threshold. We do this by postponing the consideration of the closed sets that, upon processing, would give rise only to partial implications below the confidence boost threshold.

As indicated, our algorithm self-tunes this threshold, which starts at a somewhat selective level, by lowering it in case the output rules show it appropriate. Then, the support ratio in the closure miner is to follow suit: the constraint is to be pushed into the closure mining process with the new value. This may mean that previously discarded closures are now to be considered.
Therefore, we must reconcile four processes: one of mining closed frequent sets in order of decreasing support, filtering them according to their support ratio; two further ones that change, along the way, respectively, the support threshold and the support ratio threshold; and the one of obtaining the rules themselves from the closed itemsets.

Unfortunately, these processes interfere very heavily with each other. Closed sets are the first objects to be mined from the dataset, and are to be processed in order of decreasing support to obtain rules from them, but they are to be processed only if they have both high enough support and high enough support ratio. Closed sets of high support and low support ratio, however, cannot be simply discarded: a future decrease of the self-adjusting rule measure bound may require us to "fish" them back in, as a consequence of evaluations made "at the end" of the process upon evaluating rules; likewise, rules of low closure-based confidence boost need to be kept on hold instead of discarded, so as to be produced if, later, they turn out to clear the threshold after adjusting it to a lower level. The picture gains an additional complication from the fact that constructing partial implications requires not only the list of frequent closures, but also the Hasse edges that constitute the corresponding Formal Concept Lattice.

As a consequence, the varying thresholds make it difficult to organize the architecture of the software system in the traditional form of, first, mining the lattice of frequent closures and, then, extracting rules from them. We describe here how iterators offer a simple and efficient solution for the organization of our partial implication miner yacaree, available at SourceForge and shown at the demo track of a recent conference [8]. The details of the implementation are described here for the first time.

2 Concepts, Notation, and Overview

In our technological context (pure Python), "generators" constitute one of the ways of obtaining iterators. An iterator constructed in this way is a method (in the object-oriented sense) containing, anywhere inside, the "yield" instruction; most often, this instruction is inside some loop. This instruction acts as a "return" instruction for the iterator, except that its whole status, including values of local variables and program counter, is stored, and put back into place at the next call to the method. Thus, we obtain a "lazy" method that gives us, one by one, a sequence of values, but only computes one more value whenever it is called from the "consumer" that needs these values.
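As a minimal illustration of the mechanism (our toy example, not code from yacaree), a generator-based iterator and its lazy consumption look as follows:

def countdown(n):
    """Yield n, n-1, ..., 1; the local state of n survives between calls."""
    while n > 0:
        yield n          # acts as a suspending "return": execution resumes here
        n -= 1

consumer = countdown(3)
print(next(consumer))    # 3 -- the first value is computed only now
print(next(consumer))    # 2 -- the generator resumed right after the yield
print(list(consumer))    # [1] -- remaining values, produced on demand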
Generators, as a comfortable way of constructing iterators, are available only in a handful of platforms: several quite specialized lazy functional programming languages offer them, but, among the most common programming languages, only Python and C# include generators. Java and C++ offer a mere "iterator" interface that simply states that classes implementing iterators must offer, with specific names, the natural operations to iterate over them, but the notion of generators to program them easily is not available.

We move on to describe the essentials of our system, and the way iterators defined by means of generators allow us to organize, in a clear and simple way, the various processes involved.

A given set of available items U is assumed; its subsets are called itemsets. We will denote itemsets by capital letters from the end of the alphabet, and use juxtaposition to denote union of itemsets, as in XY. The inclusion sign as in X ⊂ Y denotes proper subset, whereas improper inclusion is denoted X ⊆ Y. For a given dataset D, consisting of n transactions, each of which is an itemset labeled with a unique transaction identifier, we define the support sup(X) of an itemset X as the cardinality of the set of transactions that contain X. Sometimes, the support is measured "normalized" by dividing by the dataset size; then, it is an empirical approximation to the probability of the event that the itemset appears in a "random" transaction. Except where explicitly indicated, all our uses of support will take the form of ratios, and, therefore, it does not matter at all whether they come absolute or normalized.

An association rule is an ordered pair of itemsets, often written X → Y. The confidence c(X → Y) of rule X → Y is sup(XY)/sup(X). We will refer occasionally below to a popular measure of deviation from independence, often named lift: assuming X ∩ Y = ∅, the lift of X → Y is sup(XY)/(sup(X) sup(Y)), where all three supports are assumed normalized (if they are not, then the dataset size must of course appear as an extra factor in the numerator).

An itemset X is called frequent if its support is greater than some user-defined threshold τ: sup(X) > τ. We often assume that τ is known; no support bound is implemented by setting τ = 0. Our algorithms will attempt to self-tune τ to an appropriate value without concourse of the user.

Given an itemset X ⊆ U, its closure is the maximal set (with respect to set inclusion) Y ⊆ U such that X ⊆ Y and sup(X) = sup(Y). It is easy to see that the closure of X is unique. An itemset X is closed if it coincides with its closure. Closure operators are characterized by the three properties of monotonicity, idempotency, and extensivity.

The support ratio was essentially employed first, to our knowledge, in [7], where, together with other similar quotients, it was introduced with the aim of providing a faster algorithm for computing representative rules. The support ratio of an association rule X → Y is that of the itemset XY, defined as follows:

σ(X → Y) = σ(XY) = sup(XY) / max{ sup(Z) | sup(Z) > τ, XY ⊂ Z }.

For many quality measures for partial implications, including support, confidence, and closure-based confidence boost (to be defined momentarily), the relevant supports turn out to be the support of the antecedent and the support of the union of antecedent and consequent. As these are captured by the corresponding closures, we deem two rules X → Y and X′ → Y′ inequivalent exactly when they are not "mutually redundant" with respect to the closure space defined by the dataset: either the closures of X and X′ differ, or the closures of XY and X′Y′ differ. We denote that fact as (X → Y) ≢ (X′ → Y′). We now assume sup(XY) > τ. As indicated, our system keeps a varying threshold on the following rule evaluation measure:

β(X → Y) = c(X → Y) / max{ c(X′ → Y′) | (X → Y) ≢ (X′ → Y′), sup(X′Y′) > τ, X′ ⊆ X, Y ⊆ X′Y′ }.

This notion, known as "closure-based confidence boost", as well as the "plain confidence boost", which is a simpler variant where the closure operator reduces to the identity, are studied in depth in [6]. Intuitively, this is a relative, instead of absolute, form of confidence: we are less interested in a partial implication having very similar confidence to that of a simpler one. A related formula measures relative confidence with respect to logically stronger partial implications (confidence width, see [6]); the formula just given seems to work better in practice. For the value of this measure to be nontrivial, XY must be a closed set; the following inequality holds:

Proposition 1. β(X → Y) ≤ σ(X → Y).

The threshold on β(X → Y) is self-adjusted along the mining process, on the basis of several properties such as coincidence with lift under certain conditions; all these details and properties are described in [6].
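As a concrete reference for these definitions, here is a small, self-contained Python sketch that computes support, confidence, lift and the support ratio by brute force on a made-up four-transaction dataset; the data and function names are ours, purely for illustration, and the closure-based confidence boost itself is not computed here:

from itertools import combinations

# Toy dataset: each transaction is a frozenset of items (illustrative only).
D = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("c")]

def sup(X):
    """Absolute support: number of transactions containing itemset X."""
    return sum(1 for t in D if X <= t)

def confidence(X, Y):
    """c(X -> Y) = sup(XY) / sup(X)."""
    return sup(X | Y) / sup(X)

def lift(X, Y):
    """Multiplicative deviation from independence, with normalized supports."""
    n = len(D)
    return (sup(X | Y) / n) / ((sup(X) / n) * (sup(Y) / n))

def support_ratio(X, Y, tau=0):
    """sigma(XY) = sup(XY) / max{ sup(Z) : sup(Z) > tau, XY proper subset of Z }."""
    XY = X | Y
    U = frozenset().union(*D)
    bigger = [sup(frozenset(Z))
              for r in range(len(XY) + 1, len(U) + 1)
              for Z in combinations(U, r)
              if XY < frozenset(Z) and sup(frozenset(Z)) > tau]
    return sup(XY) / max(bigger) if bigger else float("inf")

print(confidence(frozenset("a"), frozenset("b")))      # 2/3
print(lift(frozenset("a"), frozenset("b")))            # (2/4)/((3/4)*(2/4)) = 4/3
print(support_ratio(frozenset("a"), frozenset("b")))   # 2/1 = 2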
2.1 Architecture of yacaree

The diagram in Figure 1 shows the essentials of the class structure, for easier reference along the description of the iterators. For simplicity, a number of additional classes existing in the system are not shown. A couple of them, added recently, find minimal generators via standard means and implement a plain confidence boost version appropriate for full-confidence implications; their algorithmics are not novel, pose no challenge, and are omitted here. We are also omitting discussion of classes like the Dataset class, some heap-like auxiliary data structures, user interfaces, and a class capturing a few static values, as their role in our description is minor or easy to understand (or both).

Fig. 1. Partial class diagram of the associator

2.2 Class Overview

We give a brief explanation of the roles of the classes given in the diagram. Details about their main methods (the corresponding iterators) come below.

Class ItemSet keeps the information and methods to pretty-print itemsets, including information such as support; it inherits from sets all set-theoretic operations. Class Rule keeps both antecedent and consequent (technically, it keeps the antecedent and the union of antecedent and consequent, as in this case the latter is always closed, which allows for more efficient processing), and is able to provide rule evaluation measures such as confidence or lift.

Class ClMiner runs the actual closure mining, with some auxiliary methods to handle all details. Its main method is the iterator mine_closures() (described below), which yields, one by one and upon being called, all closed sets having support above the threshold, in order of decreasing support. This "decreasing support" condition allows us to increase the support threshold, if necessary, to continue the exploration. As explained below, when the internal data structures of the closure miner are about to overflow, the support threshold is increased in such a way that half the closures found so far and still pending consideration are discarded.

Class Lattice runs its own iterator, candidate_closures(), which, in turn, calls mine_closures() as new closed sets become needed. Its main task is to call methods that implement the algorithms from [9] and actually build the lattice of closed sets, so that further iterators can expect to receive closures for which the immediate predecessors have been identified. Version 1.0 of yacaree employed the Border algorithm, but in version 1.1 we have implemented the faster algorithm iPred and indeed obtained around a 10% acceleration. The fact that iPred could be employed in enumerations of closures by decreasing support was proved in [10]. See [11] for further discussions. Additionally, the support ratio of each closed set is also computed here, and the class offers yet another iterator that provides, for each closure, all the predecessor closures having support above a local, additional support threshold that can be specified at call time. In this way, we obtain all the candidate antecedents for a given closed set as potential consequent. This internal iterator amounts to a plain depth-first search, so we do not discuss it further here. Within class Lattice, two heap-structured lists keep, respectively, the closures that are ready to be passed on as they clear both the support and the support ratio thresholds (Lattice.ready) and the closures that clear the support threshold but fail the support ratio threshold (Lattice.freezer); the latter will be recovered in case a decrease of the confidence boost bound is to affect the support ratio pruning.

Finally, class RuleMiner is in charge of offering the system an iterator over all the association rules passing the current thresholds of support and closure-based confidence boost: mine_rules(). Its usage allows one to include easily further checks of confidence, lift, or any other such quantity.
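The following Python skeleton merely restates these roles in code form; the class names follow the diagram, but the bodies are placeholders and the relationships between the classes (for instance, whether Lattice wraps or subclasses the miner) are our simplification, not the actual yacaree source:

class ItemSet(frozenset):
    """Pretty-printable itemset (string items assumed); set operations inherited."""
    def __str__(self):
        return "{" + ", ".join(sorted(self)) + "}"

class ClMiner:
    """Yields closed sets in decreasing support, raising its internal support
    threshold when memory runs low (sketched in Section 3.1)."""
    def __init__(self, dataset):
        self.dataset = dataset
        self.supp_threshold = 0
    def mine_closures(self):
        yield from ()    # placeholder: the Section 3.1 loop goes here

class Lattice:
    """Adds Hasse edges (iPred) and support-ratio filtering on top of the miner,
    with 'ready' and 'freezer' heaps (sketched in Section 3.2)."""
    def __init__(self, dataset):
        self.miner = ClMiner(dataset)   # composition, chosen here for illustration
        self.ready, self.freezer = [], []
    def candidate_closures(self):
        yield from ()    # placeholder: the Section 3.2 loop goes here

class RuleMiner(Lattice):
    """Forms rules from each closure and its predecessors and applies the
    self-adjusting confidence boost threshold (sketched in Section 3.3)."""
    def mine_rules(self):
        yield from ()    # placeholder: the Section 3.3 loop goes here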
3 Details

This section provides details of the main iterators and their combined use to attain our goals.

3.1 ClMiner.mine_closures()

The closure miner is the iterator that supports the whole scheme; it follows a "best-first" strategy, where here "best" means "highest support". We maintain a heap containing closures already generated, but which have not been expanded yet to generate further closures after them. The heap can provide the one of highest support in logarithmic time, as this is the closure that comes next. Then, as this closure is passed on to the lattice constructor, items (rather, closures of singletons) are added to it in all possible ways, and closure operations are applied in order to generate its closed successors, which are added to the heap unless they were already in it. The decreasing support condition ensures that they were never visited before. For simplicity, we omit discussing the particular case of the empty set, which, if closed, is to be traversed first, separately.

1: identify closures of singletons
2: organize them into a maxheap according to support
3: while heap nonempty do
4:   consider increasing the support threshold by monitoring the available memory
5:   if the support threshold must be raised then
6:     kill from the heap pending closures of support below the new threshold, which is chosen so that the size of the heap halves
7:   end if
8:   pop from heap the max-support itemset
9:   yield it
10:  try to extend it with all singleton closures
11:  for such extensions with sufficient support do
12:    if their closure is new then
13:      add it to the heap
14:    end if
15:  end for
16: end while
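A self-contained Python rendering of this loop is sketched below; it follows the pseudocode under simplifying assumptions of our own (closures computed by intersecting the covering transactions, and a fixed cap max_pending on the heap size standing in for the real memory monitoring), and is not the actual yacaree code:

import heapq

def closure(X, transactions):
    """Closure of X: intersection of all transactions containing X (None if no
    transaction contains X); the support is returned alongside."""
    covering = [t for t in transactions if X <= t]
    if not covering:
        return None, 0
    return frozenset.intersection(*covering), len(covering)

def mine_closures(transactions, max_pending=1000):
    """Yield (closed set, support) pairs in decreasing support order, raising
    the internal support threshold so that the pending heap halves whenever it
    grows beyond max_pending (a stand-in for real memory monitoring)."""
    items = frozenset().union(*transactions)
    singleton_closures = {}
    for i in items:
        C, s = closure(frozenset([i]), transactions)
        singleton_closures[C] = s
    threshold = 0
    # Max-heap via negated support; the sorted item tuple is only a tie-breaker.
    heap = [(-s, tuple(sorted(C)), C) for C, s in singleton_closures.items()]
    heapq.heapify(heap)
    seen = {C for _, _, C in heap}
    # The empty set, if closed, would be treated first, separately (omitted here).
    while heap:
        if len(heap) > max_pending:                   # memory pressure detected
            heap = sorted(heap)[:len(heap) // 2]      # keep the most frequent half
            threshold = -heap[-1][0]                  # raised support threshold
            heapq.heapify(heap)
        neg_s, _, C = heapq.heappop(heap)
        yield C, -neg_s
        for S in singleton_closures:                  # extend with singleton closures
            ext, s = closure(C | S, transactions)
            if ext is not None and s >= threshold and ext not in seen:
                seen.add(ext)
                heapq.heappush(heap, (-s, tuple(sorted(ext)), ext))

The tie-breaking tuple in the heap entries only keeps comparisons well defined; any total order on itemsets would do.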
In order to clarify how this algorithm works, we develop the following example. Consider a dataset with 24 transactions over the universe U = {a, b, c, d, e} of 5 items: {abcde, bcde × 2, abe, cde, be, ae × 3, ab × 4, cd × 6, b × 2, a × 3}. For this dataset, there are 12 closed sets, which we enumerate here with their corresponding supports: ∅/24, a/12, b/11, cd/10, e/9, ab/6, ae/5, be/5, cde/4, bcde/3, abe/2, abcde/1. The empty set is treated first, separately, as indicated.

Then, the four next closures correspond to closures of singletons (the closures of c and d coincide) and form an initial heap, containing: [a/12, b/11, cd/10, e/9]. The heap provides a as the next closure in descending support; it is passed on to further processing at the "yield" instruction, and it is expanded with singleton closures in all possible ways, enlarging the heap into containing six pending closures: [b/11, cd/10, e/9, ab/6, ae/5, abcde/1]; each of the new sets in the heap is obtained by adding to a the closure of a singleton, and closing the result.

The next closure is b, which goes into the "yield" instruction and, subsequently, generates two further closures to be added to the heap, which becomes: [cd/10, e/9, ab/6, ae/5, be/5, bcde/3, abcde/1]. The closure ab generated from b is omitted, as it is repeated since it was already generated from a.

For illustration purposes, we assume now that the length of the heap, currently 7, is deemed too large. Of course, in a toy example like this one there is no need to move up the support threshold, but let us do it anyway: assume that the test indicates that the heap is occupying too much memory, incurring a risk of soon-coming overflow. Then, the support is raised as much as necessary so as to halve the length of the heap. Pending closures of support 5 or less would be discarded from the heap, the support threshold would be set at 6, and only three closures would remain in the heap: [cd/10, e/9, ab/6]. Each of them would be processed in turn, handed onwards by the "yield" instruction, and expanded with all closures of singletons; in all cases, we will find that expanding any of them with a singleton closure leads to a closure of support below 6, which is therefore omitted as it does not clear the threshold. Eventually, these three closures in the heap are passed on, and the iterator will have traversed all closures of support 6 or higher.

As a different run, assume now that we accept the growth of the heap, so that it is not reduced. The traversal of closures would go on, yielding cd, which would add cde to the heap; adding either a or b to cd leads to abcde, which is already in the heap. The next closure e adds nothing new to the heap, and the next is ab, which adds abe; at this point the heap is [ae/5, be/5, cde/4, bcde/3, abe/2, abcde/1]. All further extensions only lead to repeated closures, hence nothing further is added to the heap, and, as it is emptied, the traversal of all the closures is completed.
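Assuming the mine_closures() sketch just given, the run in which the heap is allowed to grow can be reproduced directly; the printed sequence matches the enumeration of closed sets above, up to the order of equal-support ties (and with the empty set omitted, as in the sketch):

# The 24 transactions of the running example, multiplicities expanded.
D = ([frozenset("abcde")] + [frozenset("bcde")] * 2 + [frozenset("abe")] +
     [frozenset("cde")] + [frozenset("be")] + [frozenset("ae")] * 3 +
     [frozenset("ab")] * 4 + [frozenset("cd")] * 6 + [frozenset("b")] * 2 +
     [frozenset("a")] * 3)

for C, s in mine_closures(D):        # default max_pending: the heap never halves
    print("".join(sorted(C)), s)
# a 12, b 11, cd 10, e 9, ab 6, ae 5, be 5, cde 4, bcde 3, abe 2, abcde 1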
The main property that has to be argued here is the following:

Proposition 2. As the support threshold rises, all the closed sets delivered so far by the iterator to the next phase are still correct, that is, have support at least the new value of the threshold.

Proof. We prove this statement by arguing the following invariants: first, the new threshold is bounded above by the highest support of a closure in the heap; second, all the closed sets provided so far up to any "yield" statement have support bounded below by all the supports currently in the heap. These invariants are maintained as we extract the closure C of maximum support from the heap, and also when we add extensions of C to the heap: indeed, C having been freshly taken off the heap, all previous deliveries have at least the same support, whereas all extensions that are to enter the heap are closed supersets of C and must have lower support, because C is closed. Hence, all previous deliveries have support at least the maximum support in the heap, which, in turn, is at least the new threshold; transitivity now proves the statement.

3.2 Lattice.candidate_closures()

In order to actually mine rules from the closures traversed by the loop described in the previous section, further information is necessary: data structures to allow for traversing predecessors, namely, the Hasse edges, that is, the immediate, nontransitive neighbors of each closure. These come from a second iterator that implements the iPred algorithm [9].

Additionally, we wish to push into the closure mining the confidence boost constraint. The way to do it is to compute the support ratio of each closure, and only pass it on to mine rules from it if this support ratio is above the confidence boost threshold; indeed, Proposition 1 tells us that, if the support ratio is below the threshold, the confidence boost will be too. Due to the condition of decreasing support, we know that the closed superset that defines the support ratio is exactly the first successor to appear from the closure mining iterator. As soon as one successor of C appears, if the support ratio is high enough, we can yield C, as the Hasse edges to its own predecessors are guaranteed to have been set before. If the support ratio is not enough, it is kept in a "freezer" (again a heap-like structure) from where it might be "fished back in" if the confidence boost threshold decreases later on.

One has to be careful that the same closed set, say C, may be the first successor of more than one predecessor. As we set the predecessors C′ of C, we move to the "ready" heap those that have C as first successor, if their support ratio is high enough; then we yield them all. Additionally, as we shall see, it may happen that RuleMiner.mine_rules() moves closures from "freezer" to "ready". We explain this below.

1: for each closed set C yielded by ClMiner.mine_closures() do
2:   apply a Hasse edges algorithm (namely iPred) to set up the lattice edges connecting C to its predecessors
3:   for each unprocessed predecessor C′ do
4:     compute the support ratio of C′
5:     if the support ratio is over the rule evaluation threshold then
6:       add C′ to the "ready" heap
7:     else
8:       add C′ to the "freezer" heap
9:     end if
10:  end for
11:  for each closure in the "ready" heap do
12:    yield it
13:  end for
14: end for
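The loop can be mirrored in Python as sketched below, again under assumptions of our own: immediate predecessors are found by a brute-force maximality test standing in for iPred, the freezer is only recorded rather than revisited, and mine_closures() is the sketch from Section 3.1; this is not the literal yacaree code.

def candidate_closures(transactions, boost_threshold=1.15, max_pending=1000):
    """Yield (closure, support, predecessors) triples for closures whose
    support ratio clears the current confidence boost threshold."""
    support = {}            # closed set -> support, in order of arrival
    preds = {}              # closed set -> its immediate (Hasse) predecessors
    awaiting_ratio = set()  # closures whose first successor has not appeared yet
    freezer = []            # closures whose support ratio failed the threshold
    for C, s in mine_closures(transactions, max_pending):
        support[C] = s
        awaiting_ratio.add(C)
        # Immediate predecessors: maximal proper subsets among the closures seen
        # so far (all of them have already arrived, by decreasing support).
        below = [P for P in support if P < C]
        preds[C] = [P for P in below if not any(P < Q < C for Q in below)]
        for P in preds[C]:
            if P in awaiting_ratio:                   # C is P's first successor
                awaiting_ratio.discard(P)
                ratio = support[P] / s                # support ratio of P
                if ratio >= boost_threshold:          # ready: safe to process P
                    yield P, support[P], [(Q, support[Q]) for Q in preds[P]]
                else:                                 # kept aside; the real system
                    freezer.append((ratio, P))        # revisits this heap when the
                                                      # boost threshold decreases
    # Closures that never saw a proper closed superset (ratio unbounded).
    for P in awaiting_ratio:
        yield P, support[P], [(Q, support[Q]) for Q in preds[P]]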
We observe here that we are not guaranteeing decreasing support order in this iterator, as the changes to the support ratio threshold may swap closures with respect to the order in which they were obtained. What we do need is that the most basic iterator, ClMiner.mine_closures(), does provide them in decreasing support order: first, to ensure that the support threshold can be raised if necessary, and, second, to make sure that the support ratio is correctly computed from the first successor found for each closure.

Along the same example as before, consider, for instance, what happens when ClMiner.mine_closures() yields the closure ab to Lattice.candidate_closures(). The iPred algorithm identifies a and b as immediate predecessors of ab, and the corresponding Hasse edges are stored. Then, both a and b are identified as closures whose first successor (in decreasing support) has just appeared; indeed, other successors have less support than ab. The support ratios of a and b, namely, 12/6 = 2 and 11/6, are seen to be higher than the confidence boost threshold (which starts at 1.15 by default), and both a and b are moved to the "ready" heap and yielded to the subsequent rule mining phase. On the other hand, if the confidence boost threshold was, say, at 1.4, upon processing bcde we would find 4/3 < 1.4 as the support ratio of cde, and this closure would wait in the freezer heap until (if at all) a revised lower value of the confidence boost threshold would let it through, by moving it from the freezer queue to the ready queue.

3.3 RuleMiner.mine_rules()

In the class RuleMiner, which inherits from Lattice, the iterator mine_rules() relies on the closures provided by the previous iterator in the pipeline:

1: reserved_rules = [ ]
2: for each closure from candidate_closures() do
3:   for each predecessor having high enough support so as to reach the confidence threshold do
4:     form a rule r with the predecessor as antecedent and the closure as consequent
5:     use it to revise the closure-based confidence boost threshold
6:     if the threshold decreased then
7:       move from Lattice.freezer to Lattice.ready those closures whose support ratio now passes the new threshold
8:       for each rule in reserved_rules do
9:         if its closure-based confidence boost passes the threshold then
10:          yield it
11:        else
12:          keep it in reserved_rules
13:        end if
14:      end for
15:    end if
16:    if the closure-based confidence boost of r passes the threshold then
17:      yield r
18:    else
19:      keep it in reserved_rules
20:    end if
21:  end for
22: end for

Each closure makes available an iterator over its predecessors in the closure lattice (closed proper subsets), up to a given support level that we can specify upon calling it. For instance, at the closure bcde, of support 3, and assuming a confidence threshold of 0.6, we would explore the predecessors be and cde, which lead to the rules be → cd and cde → b. The confidence boost has to be checked, but part of the task is already done, since the very fact that the closure bcde arrived here implies that its support ratio is over the confidence boost threshold. In this case, the support ratio of closure bcde is 3. We must test confidences with smaller antecedents (see [6]). As the confidences of b → cd and e → cd are low enough, the rule be → cd is indeed reported; cde → b is as well, after checking how low the confidences of cd → b and e → b are.

The revision of the closure-based confidence boost threshold can be done in a number of ways. The current implementation keeps computing the lift of those rules whose antecedent is a singleton, as the condition on support ratio ensures that, in this case, it will coincide with the confidence boost [6]; these lift values enter a weighted average with the current threshold, and, if the average is sufficiently smaller, the threshold is decreased. Only a partial justification exists so far for this choice. When the threshold for confidence boost decreases, closures whose support ratio was too low may now become high enough; thus, the freezer is explored and closures whose support ratio is now above the new confidence boost threshold are moved into the ready queue (lines 6 and 7), to be processed subsequently.
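The control flow of this iterator can be sketched in Python on top of the candidate_closures() sketch above; the closure-based confidence boost is replaced here by a lift-based placeholder (exact only for singleton antecedents, as noted above), the threshold revision is a crude stand-in for the weighted average, and the freezer-reviving step of line 7 is omitted, so this illustrates the control flow rather than the real rule miner:

def rule_lift(antecedent, closed_consequent, transactions):
    """Placeholder for the closure-based confidence boost of [6]: the lift of
    antecedent -> (closed_consequent minus antecedent), which coincides with
    the boost only for singleton antecedents under the support ratio condition."""
    n = len(transactions)
    sup = lambda X: sum(1 for t in transactions if X <= t)
    Y = closed_consequent - antecedent
    return (sup(closed_consequent) / n) / ((sup(antecedent) / n) * (sup(Y) / n))

def mine_rules(transactions, conf_threshold=0.6, boost_threshold=1.15):
    """Yield (antecedent, closed consequent) pairs passing the current boost
    threshold; rules failing it wait in reserved_rules and are re-examined
    whenever the threshold is lowered."""
    reserved_rules = []
    for C, s, predecessors in candidate_closures(transactions, boost_threshold):
        for P, sp in predecessors:
            if s / sp < conf_threshold:   # predecessor cannot reach the confidence
                continue
            boost = rule_lift(P, C, transactions)
            if len(P) == 1 and boost < boost_threshold:
                # Crude stand-in for the weighted-average threshold revision.
                boost_threshold = (boost_threshold + boost) / 2
                still_reserved = []
                for old_rule, old_boost in reserved_rules:
                    if old_boost >= boost_threshold:
                        yield old_rule    # a reserved rule now clears the bar
                    else:
                        still_reserved.append((old_rule, old_boost))
                reserved_rules = still_reserved
            if boost >= boost_threshold:
                yield (P, C)
            else:
                reserved_rules.append(((P, C), boost))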
3.4 System

The main program simply traverses all the rules, as obtained from the iterator mine_rules() in the class RuleMiner:

1: for each rule in RuleMiner.mine_rules() do
2:   account for it
3: end for

What is to be done with each rule depends on the instructions from the user interface, but usually we count how many of them are obtained and we write them all to disk, maybe up to a fixed limit on the number of rules (which can be modified by editing the source code). In this case, we report those of highest closure-based confidence boost.

4 A Second Implementation

With a view to offering this system in a more widespread manner, we have developed a joint project with KNIME GmbH, a small company that develops the open source data mining suite KNIME. This data mining suite is implemented in Java. Hence, we have constructed a second implementation in Java.

However, the issue is not fully trivial, for two main reasons. The first is that the notion of iterator in Java is different from that in Python, and is not obtained from generators: the "yield" instruction, which saves the state of an iteration at the time it is invoked, does not exist in Java, which simply declares that hasNext() and next() methods must be made available: respectively, to know whether there are more elements to process and to get the next element. A second significant change is that the memory control to ensure that the list of pending closures does not overflow has to be made in terms of the memory management API of KNIME, and requires one extra loop to check whether the decrease in memory usage was sufficient.

Therefore, we have to use the Iterator class to "copy", to the extent possible, the "yield" behavior, saving all the necessary information to continue in queues and lists. The three most affected methods are, of course, mine_rules(), candidate_closures() and mine_closures(). We describe here only mine_rules(). In this case, a queue, called ready_rules, is needed in order to store the rules that are built from the current closure among the candidates and have achieved the support, confidence, and confidence boost requirements. Rules that do not clear these thresholds are stored in another queue, reserved_rules, as in the Python implementation. The code is shown next:

1: reserved_rules = empty queue of rules
2: ready_rules = empty queue of rules
3: ready_rules_iterator = iterator for ready_rules
4: while !ready_rules_iterator.hasNext() do
5:   for each closure from candidate_closures() do
6:     for each predecessor having high enough support so as to reach the confidence threshold do
7:       form a rule r with the predecessor as antecedent and the closure as consequent
8:       use it to revise the threshold for the rule evaluation measure
9:       if the threshold decreased then
10:        move from Lattice.freezer to Lattice.ready those closures whose support ratio now passes the new threshold
11:        for each rule in reserved_rules do
12:          if its rule measure passes the new threshold then
13:            store it in ready_rules
14:          else
15:            keep it in reserved_rules
16:          end if
17:        end for
18:      end if
19:      if the rule measure of r passes the threshold then
20:        store it in ready_rules
21:      else
22:        keep it in reserved_rules
23:      end if
24:    end for
25:  end for
26: end while
27: return ready_rules
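To illustrate in Python terms the interface that the Java port has to expose, the following wrapper (our illustration only) buffers results in a ready queue and serves them through hasNext()/next()-style methods; the actual Java code replaces the underlying generator by the explicit queues and loops of the pseudocode above:

from collections import deque

class RuleIterator:
    """Explicit-state iterator in the java.util.Iterator style: results are
    buffered in a queue instead of being produced by a suspended 'yield'."""
    def __init__(self, transactions):
        self._source = mine_rules(transactions)   # generator from the Section 3.3 sketch
        self._ready_rules = deque()               # buffered, already-accepted rules
    def has_next(self):
        # Pull from the underlying process until a rule is buffered or it is exhausted.
        while not self._ready_rules:
            try:
                self._ready_rules.append(next(self._source))
            except StopIteration:
                return False
        return True
    def next(self):
        if not self.has_next():
            raise StopIteration
        return self._ready_rules.popleft()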
In Lattice.candidate_closures(), the candidate closures are likewise stored in a list called candidate_closures_list so that the mine_rules() method can obtain them. The program is constructed in the same way as the one just described, and is omitted here.

The last method that needs a change in the translation from Python to Java and KNIME is ClMiner.mine_closures(); the change consists of storing in a list called max_support_itemset_list the candidate itemsets that obey the max-support requirement, and of returning this list at the end of the method. In this case, iterators are not needed, because this method is only required to store and return the list, so the next() and hasNext() methods are not used.

1: max_support_itemset_list = empty list of itemsets
2: identify closures of singletons
3: organize them into a maxheap according to support
4: while heap nonempty do
5:   consider increasing the support threshold by monitoring the available memory
6:   if the support threshold must be raised then
7:     kill from the heap pending closures of support below the new threshold, which is chosen so that the size of the heap halves
8:   end if
9:   pop from heap the max-support itemset and store it in max_support_itemset_list
10:  try to extend it with all singleton closures
11:  for such extensions with sufficient support do
12:    if their closure is new then
13:      add it to the heap
14:    end if
15:  end for
16: end while
17: return max_support_itemset_list

5 Conclusion

We have studied a variant of the basic association mining process. In our variant, we try to avoid burdening the user with requests to fix threshold parameters. We keep an internal support threshold and adjust it upwards whenever the computation process shows that the system will be unable to run down to the current threshold value. We tackle the problem of limiting the number of rules through one specific rule measure, closure-based confidence boost, for which the threshold is self-adjusted along the mining process. A minor detail is that, for full-confidence implications, it is not difficult to see that closure-based confidence boost is inappropriate, and plain confidence boost is to be used. Further details about this issue will be given in future work.

The confidence boost constraint is pushed into the mining process through its connection to the support ratio. Therefore, the closure miner has to coordinate with processes that move the support threshold upwards, or the support ratio threshold downwards. Further study on the basis of our implementation is underway, and further versions of our association miner, with hopefully faster algorithmics, will be provided in the coming months.

Another line of activity is as follows: granted that our approach offers partial implications without user-defined parameters, to what extent are users who are not experts in data analysis satisfied with the results? Our research group explores that topic in a separate paper [12]. Additionally, we are aware of two independent works where an algorithm is proposed to traverse the closure space in linear time [13], [14]; these algorithms do not follow an order of decreasing support, and we find it nontrivial to modify them so that they fulfill this condition. Our research group is attempting this, as, if successful, faster implementations could be designed.

References

1. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer-Verlag (1999)
2. Luxenburger, M.: Implications partielles dans un contexte. Mathématiques et Sciences Humaines 29 (1991) 35–55
3. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996) 307–328
4. Geng, L., Hamilton, H.J.: Interestingness measures for data mining: A survey. ACM Comput. Surv. 38(3) (2006)
5. Zaki, M.J., Hsiao, C.J.: Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering 17(4) (2005) 462–478
6. Balcázar, J.L.: Formal and computational properties of the confidence boost in association rules. Available at: [http://personales.unican.es/balcazarjl]. Extended abstract appeared as "Objective novelty of association rules: Measuring the confidence boost". In Yahia, S.B., Petit, J.M., eds.: EGC. Volume RNTI-E-19 of Revue des Nouvelles Technologies de l'Information., Cépaduès-Éditions (2010) 297–302
7. Kryszkiewicz, M.: Closed set based discovery of representative association rules. In Hoffmann, F., Hand, D.J., Adams, N.M., Fisher, D.H., Guimarães, G., eds.: Proc. of the 4th International Symposium on Intelligent Data Analysis (IDA). Volume 2189 of Lecture Notes in Computer Science., Springer-Verlag (2001) 350–359
8. Balcázar, J.L.: Parameter-free association rule mining with yacaree. In Khenchaf, A., Poncelet, P., eds.: EGC. Volume RNTI-E-20 of Revue des Nouvelles Technologies de l'Information., Hermann-Éditions (2011) 251–254
9. Baixeries, J., Szathmary, L., Valtchev, P., Godin, R.: Yet a faster algorithm for building the Hasse diagram of a concept lattice. In Ferré, S., Rudolph, S., eds.: Proc. of the 7th International Conference on Formal Concept Analysis (ICFCA). Volume 5548 of Lecture Notes in Artificial Intelligence., Springer-Verlag (2009) 162–177
10. Balcázar, J.L., Tîrnăucă, C.: Border algorithms for computing Hasse diagrams of arbitrary lattices. In Valtchev, P., Jäschke, R., eds.: ICFCA. Volume 6628 of Lecture Notes in Computer Science., Springer (2011) 49–64
11. Kuznetsov, S.O., Obiedkov, S.A.: Algorithms for the construction of concept lattices and their diagram graphs. In Raedt, L.D., Siebes, A., eds.: Proc. of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Volume 2168 of Lecture Notes in Artificial Intelligence., Springer-Verlag (2001) 289–300
12. García-Sáiz, D., Zorrilla, M., Balcázar, J.L.: Closures and partial implications in educational data mining. ICFCA, Supplementary Proceedings (2012)
13. Ganter, B.: Two basic algorithms in concept analysis (preprint 1987). In Kwuida, L., Sertkaya, B., eds.: ICFCA. Volume 5986 of Lecture Notes in Computer Science., Springer (2010) 312–340
14. Uno, T., Asai, T., Uchida, Y., Arimura, H.: An efficient algorithm for enumerating closed patterns in transaction databases. In Suzuki, E., Arikawa, S., eds.: Discovery Science. Volume 3245 of Lecture Notes in Computer Science., Springer (2004) 16–31