<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tuning Hyperparameters of Classification Based on Associations (CBA)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomáš Kliegr</string-name>
          <email>tomas.kliegr@vse.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaroslav Kuchař</string-name>
          <email>jaroslav.kuchar@fit.cvut.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics</institution>
          ,
          <addr-line>W. Churchill Sq. 1938/4, Prague 3</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Web Intelligence Research Group, Faculty of Information Technology, Czech Technical University in Prague</institution>
          ,
          <addr-line>Thákurova 9, 160 00, Prague 6</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Classification models composed of crisp rules provide excellent explainability. A limitation of many conventional rule learning algorithms is the separate-and-conquer strategy, which may be slow on large data. Association rule classification (ARC) is an alternative approach that can be very fast on massive datasets but is highly sensitive to the correct choice of metaparameters. Most existing ARC algorithms use default thresholds of 50% for minimum confidence and 1% for minimum support, which can result in excessively long rule generation or underperforming models. Due to the high cost that can be associated with the evaluation of a single combination, it is impractical to use standard metaparameter optimization approaches. In this paper, we introduce two threshold tuning algorithms specifically designed for ARC. Evaluation on 22 standard UCI datasets shows promising results in terms of model size and accuracy in comparison with the default thresholds. The implementation of the proposed algorithms is made available in the R packages rCBA and arc, both available in the CRAN repository.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Association rule classifiers (ARC) are formed by
selecting a subset of rules from a high number of candidates,
which are generated by association rule learning
algorithms known for their excellent performance on big and
sparse datasets. The large base of candidate rules or
frequent itemsets provides opportunities for achieving a good
balance between predictive performance and
interpretability of the produced models.</p>
      <p>
        An ARC algorithm has two fundamental steps:
candidate generation, and building of a classifier by selecting a
subset of the generated candidates. While most research
has focused on the classifier building phase, the candidate
generation phase has not received much attention. Most
ARC algorithms, including state-of-the-art approaches such as
Interpretable Decision Sets (IDS) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Scalable Bayesian
Rule Lists (SBRL) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or Bayesian Rule Sets (BRS) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
rely on simple heuristics for generating the candidates,
such as step-wise increases in support threshold by 5%
until a fixed desired number of candidate frequent itemsets is
reached.
      </p>
      <p>
        Candidate generation can fundamentally affect all facets
of ARC models, including speed of model building, size
of the generated models, and particularly the predictive
performance. In this paper, we provide two alternative
approaches to rule generation. We focus on approaches
applicable to the rule generation step of the
Classification Based on Associations (CBA) algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. While there
are newer approaches, CBA is still one of the best
rule-based classification algorithms in terms of the balance
between comprehensibility of the model, predictive power,
and scalability [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The two tuning algorithms that we describe are based
on different principles. The first approach is a heuristic,
which aims to produce a user-set number of rules by
varying minimum support, minimum confidence, and
maximum antecedent length thresholds. The second approach
is a supervised algorithm, in which each metaparameter
setting is used to create a classifier, which is then
evaluated through internal validation. As the optimization
algorithm, we adopt simulated annealing.</p>
      <p>This paper is organized as follows. Section 2 briefly
introduces the CBA algorithm. Section 3 covers the two
proposed threshold tuning algorithms. Section 4 presents
evaluation and Section 5 summarizes limitations of the
presented work and provides outlook for future extensions.
The conclusions summarize the contributions of our
proposal, briefly discussing possible applications.
</p>
    </sec>
    <sec id="sec-2">
      <title>Association Rule Classifiers</title>
      <p>
        The first association rule classification algorithm was
Classification based on Associations (CBA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. While
there were multiple follow-up algorithms providing
marginal improvements in classification performance (e.g.
CPAR [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], CMAR [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), the structure of most ARC
algorithms follows, with some deviations, that of CBA [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]:
      </p>
      <sec id="sec-2-1">
        <title>1. learn classification association rules,</title>
      </sec>
      <sec id="sec-2-2">
        <title>2. prune the set of rules,</title>
      </sec>
      <sec id="sec-2-3">
        <title>3. classify new objects.</title>
        <p>Rule learning In this phase, some algorithms such as CBA
learn complete association rules of the form antecedent →
consequent. The learning step returns all rules matching
the minimum confidence and minimum support
thresholds. The confidence of a rule is defined as conf(r) =
a/(a + b), where a is the number of correctly classified
objects, i.e. those matching the rule antecedent as well as the rule
consequent, and b is the number of misclassified objects,
i.e. those matching the antecedent but not the consequent.
The support of a rule is defined as supp(r) = a/n, where n
is the number of all objects (relative support), or simply as
a (absolute support). Additionally, the rule mining setup
is constrained so that only the target class values can occur
in the consequent of the rules.</p>
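        <p>As an illustration, the definitions above can be computed directly over a small transactional dataset. The following Python sketch is our own illustration (not part of the referenced implementations); it represents objects as sets of attribute=value items:</p>

```python
# Illustrative sketch: computing rule confidence and relative support.
# A rule is a pair (antecedent, consequent), both frozensets of items.

def conf_supp(rule, transactions):
    """Return (confidence, relative support) of a rule."""
    antecedent, consequent = rule
    # a: correctly classified objects (match antecedent and consequent)
    a = sum(1 for t in transactions
            if antecedent.issubset(t) and consequent.issubset(t))
    # b: misclassified objects (match antecedent, not consequent)
    b = sum(1 for t in transactions
            if antecedent.issubset(t) and not consequent.issubset(t))
    n = len(transactions)
    confidence = a / (a + b) if a + b > 0 else 0.0
    support = a / n  # 'a' alone would be the absolute support
    return confidence, support

transactions = [
    frozenset({"outlook=sunny", "windy=no", "play=yes"}),
    frozenset({"outlook=sunny", "windy=yes", "play=no"}),
    frozenset({"outlook=rainy", "windy=no", "play=yes"}),
    frozenset({"outlook=sunny", "windy=no", "play=yes"}),
]
rule = (frozenset({"outlook=sunny"}), frozenset({"play=yes"}))
print(conf_supp(rule, transactions))  # a=2, b=1 → (0.666..., 0.5)
```

        <p>Here the rule matches three objects, two of them correctly, so conf(r) = 2/3 and supp(r) = 2/4.</p>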
        <p>In some newer methods, the first step involves
generating frequent itemsets rather than complete rules. An
example of such a method is IDS, which does not impose
a minimum confidence threshold. It takes as input
the result of frequent itemset mining (i.e.
conjunctions of conditions). Rules are then formed within IDS by
splitting the frequent itemset into antecedent and
consequent parts.</p>
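        <p>The splitting step can be sketched as follows. This is our own simplified illustration, not the IDS implementation; we assume class items are distinguishable by a "class=" prefix:</p>

```python
# Hypothetical sketch: forming rules from a frequent itemset by splitting
# off the class item(s) into the consequent (IDS-style rule formation).

def split_itemset(itemset, class_prefix="class="):
    """Return (antecedent, consequent) pairs, one per class item present."""
    class_items = {i for i in itemset if i.startswith(class_prefix)}
    rules = []
    for c in sorted(class_items):
        antecedent = frozenset(itemset) - {c}
        rules.append((antecedent, frozenset({c})))
    return rules

itemset = {"outlook=sunny", "windy=no", "class=play"}
print(split_itemset(itemset))
# one candidate rule: {outlook=sunny, windy=no} → {class=play}
```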
        <p>
          In both approaches, adaptations of standard frequent
itemset generation and association rule learning algorithms
such as apriori [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] or FP-growth [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] are used.
Rule pruning What is performed during the pruning phase
varies strongly from algorithm to algorithm. CBA uses a
simple and fast heuristic, which first sorts the rules and
then removes redundant ones. A rule is considered
redundant if it does not correctly classify any of the instances
that remain after removing the instances covered by rules
with higher priority. In contrast,
the IDS algorithm uses computationally intensive
submodular optimization, which provides guarantees in terms of
the optimality of the selected subset of rules with respect
to a chosen balance between predictive performance and
interpretability.
        </p>
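        <p>The data-coverage heuristic can be sketched as follows. This is our own simplified illustration and omits parts of CBA's actual pruning (the default class and the total-error cutoff); rules are assumed to be pre-sorted by priority:</p>

```python
# Simplified sketch of data-coverage pruning: a rule is kept only if it
# correctly classifies at least one instance not yet covered by a
# higher-priority rule; covered instances are then removed.

def prune(sorted_rules, instances):
    """sorted_rules: list of (antecedent, class_item), highest priority first.
    instances: list of frozensets of items (including the class item)."""
    kept, remaining = [], list(instances)
    for antecedent, class_item in sorted_rules:
        covered = [x for x in remaining if antecedent.issubset(x)]
        if any(class_item in x for x in covered):  # correct on at least one
            kept.append((antecedent, class_item))
            remaining = [x for x in remaining if not antecedent.issubset(x)]
    return kept

instances = [
    frozenset({"a=1", "b=1", "class=yes"}),
    frozenset({"a=1", "b=0", "class=no"}),
]
rules = [
    (frozenset({"a=1", "b=1"}), "class=yes"),
    (frozenset({"a=1"}), "class=no"),
    (frozenset({"b=1"}), "class=yes"),  # redundant: its instances are covered
]
print(prune(rules, instances))  # keeps only the first two rules
```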
        <p>Classification phase The way classification is performed
depends primarily on whether the ARC algorithm
produces rule lists or rule sets. Rule lists are ordered, and
typically only the first matching rule in the rule list is used
to classify an instance. CBA produces rule lists. In
contrast, rule sets are unordered and typically all rules with
matching antecedents contribute to classifying an instance.
CPAR is an example of an algorithm that produces a rule
set.
</p>
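        <p>First-match classification over an ordered rule list can be sketched in a few lines. This is our own illustration of the general scheme, not CBA's exact implementation; a default class handles instances no rule matches:</p>

```python
# Minimal sketch: classify an instance with an ordered rule list.
# The first rule whose antecedent holds decides the class.

def classify(rule_list, default_class, instance):
    for antecedent, predicted_class in rule_list:
        if antecedent.issubset(instance):  # all rule conditions hold
            return predicted_class
    return default_class  # no rule matched

rule_list = [
    (frozenset({"outlook=sunny", "windy=yes"}), "play=no"),
    (frozenset({"outlook=sunny"}), "play=yes"),
]
instance = frozenset({"outlook=sunny", "windy=yes"})
print(classify(rule_list, "play=yes", instance))  # first match wins → play=no
```

        <p>With a rule set instead of a rule list, all matching rules would contribute, e.g. by voting.</p>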
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Automatic Tuning of Mining Parameters</title>
      <p>
        The minimum support threshold is a mandatory
hyperparameter of most, if not all, association rule learning
approaches, yet even the latest algorithms do little to tune
it algorithmically. The minimum confidence threshold is used
in a smaller number of algorithms, but when it is used, it
is likewise not tuned. We suspect that the reason is that these
thresholds are notoriously difficult to optimize due to
exponential complexity of the search space [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Additionally, the classification performance is typically very
sensitive to the parameter setting. While lower values of
confidence and support and higher values of rule length
generally produce the best results, the side effect of such a setting
can be a disproportionately long time needed to build the
classifier, caused by a combinatorial explosion, and
consequently extreme memory requirements.
      </p>
      <p>We considered standard approaches such as pure
random or grid search. Since they do not use any
background knowledge of the algorithm, we found them to be
unsuitable for optimizing the hyperparameters of
association rule learning, because of the sudden steep increases
in state space complexity that can be triggered by small
changes in the value of a hyperparameter.</p>
      <p>In the following, we introduce our two proposals for
hyperparameter tuning for association rule classification.</p>
      <p>Simulated Annealing Optimization</p>
      <p>
        Algorithms 1 and 2 present our implementation of
hyperparameter optimization based on simulated
annealing [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]. The objective criterion that is optimized
is the accuracy of the model.
      </p>
      <p>The algorithm starts as a random search for one valid
initial solution providing a non-empty classifier. Each
subsequent classifier is evaluated using nested
cross-validation. Input data are internally divided into a train
and a validation subset with a stratified split. The
classifier is built with a generated setting on the train set. The
accuracy is computed using the created classifier on the
validation set. If the execution time of the evaluation is
over a predefined threshold, we stop the computation,
mark the setting as invalid, and set the computed accuracy
to null.</p>
      <p>Algorithm 2: Perturbate - Generating new setting for SA.
input : Current setting: currentSetting
Current setting status: resultStatus (timeout, success or empty rule set)
output: New setting: newSetting
1 begin
2   newSetting = currentSetting
3   // with uniform probability select one parameter
4   p = random("support", "confidence", "ruleLength")
5   switch p do
6     case "support" or "confidence" do
7       if resultStatus is timeout then
8         // increasing threshold can speed up execution
9         newSetting[p] = newSetting[p] + rand(0, 1 - newSetting[p])
10      else if resultStatus is empty then
11        // threshold may have been too high
12        newSetting[p] = newSetting[p] - rand(0, newSetting[p])
13      else
14        newSetting[p] = random(0, 1)
15    case "ruleLength" do
16      if resultStatus is timeout then
17        // shorter rule length can speed up execution
18        newSetting[p] = newSetting[p] - 1
19      else
20        newSetting[p] = rand(1, MAX_LENGTH)
21  return newSetting</p>
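      <p>The perturbation step transcribes directly into Python. The following is our own sketch following the pseudocode of Algorithm 2, not the rCBA implementation; MAX_LENGTH and the setting keys mirror the listing:</p>

```python
# Sketch of the Perturbate step: change exactly one parameter of the
# current setting, steered by the outcome of the previous evaluation.
import random

MAX_LENGTH = 5  # default maximum antecedent length (see Settings)

def perturbate(current_setting, result_status):
    new_setting = dict(current_setting)
    # with uniform probability select one parameter to change
    p = random.choice(["support", "confidence", "ruleLength"])
    if p in ("support", "confidence"):
        if result_status == "timeout":
            # increasing the threshold can speed up execution
            new_setting[p] += random.uniform(0, 1 - new_setting[p])
        elif result_status == "empty":
            # the threshold may have been too high
            new_setting[p] -= random.uniform(0, new_setting[p])
        else:
            new_setting[p] = random.random()
    else:  # "ruleLength"
        if result_status == "timeout":
            # a shorter rule length can speed up execution
            new_setting[p] -= 1
        else:
            new_setting[p] = random.randint(1, MAX_LENGTH)
    return new_setting

setting = {"support": 0.01, "confidence": 0.5, "ruleLength": 3}
print(perturbate(setting, "timeout"))
```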
      <p>The evaluated new setting is accepted as a candidate for the
next iteration if 1) it is a valid setting not leading to a
timeout, and 2) its accuracy is better than that of the current
setting, or the computed probability of acceptance exceeds a
random value. As an optimization, we always remember the best
solution found so far, so that it can be used if the algorithm
terminates at a sub-optimal place.</p>
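      <p>The acceptance test can be sketched as follows, assuming the standard Metropolis-style criterion; the exact acceptance formula used in Algorithm 1 may differ:</p>

```python
# Illustrative simulated-annealing acceptance test (Metropolis criterion).
import math
import random

def accept(new_accuracy, current_accuracy, temperature):
    if new_accuracy is None:             # invalid setting (timeout): reject
        return False
    if new_accuracy > current_accuracy:  # strict improvement: accept
        return True
    # otherwise accept with a probability that shrinks with the accuracy
    # loss and with the gradually decreasing temperature
    probability = math.exp((new_accuracy - current_accuracy) / temperature)
    return probability > random.random()

print(accept(0.85, 0.80, temperature=1.0))  # improvement → True
```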
      <p>An important part of the algorithm is the generation of a
new setting based on the previous one. Only one parameter,
chosen from support, confidence, and rule length, is changed
during the generation of a new setting. If the current
setting was labeled as invalid, the support or confidence
is increased or the rule length decreased to overcome long
computation times and perform more restricted rule
mining. If the setting does not generate any rule, or no rule is
applicable, the support or confidence is decreased. In the
remaining situations, a random value is generated.</p>
      <sec id="sec-3-1">
        <title>Heuristic algorithm</title>
        <p>As an alternative to the supervised approach based on
simulated annealing, we also introduce an unsupervised
heuristic algorithm. While the search in the simulated
annealing approach uses accuracy as the objective function,
the heuristic algorithm only aims to return a user-set number
of rules. This approach is conceptually faster, since repeated
evaluations of the classification model are not performed.</p>
        <p>
          According to the recommendation in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], CBA
generates the best results when the rule generation step returns
at least 60,000 rules. The experiments performed in
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] also provide recommended values for the minimum
confidence (50%) and support (1%) thresholds.
        </p>
        <p>
          The problem that our CBA-RG-auto algorithm
addresses is that on some datasets the combination of the
values suggested in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] fails. The principal reasons are
either that not enough rules are generated, or a combinatorial
explosion generating a high number of overly short (and thus
overly general) rules.
        </p>
        <p>The CBA-RG-auto algorithm (Alg. 3) takes two principal
parameters on input: the number of desired rules
(targetRuleCount) and the preferred time that can be spent
on tuning (totalTimeout). The algorithm then iteratively
refines the minimum support (supp) and confidence
(conf) thresholds. The mining time and the risk of
combinatorial explosion are controlled by adjusting the constraints on
the minimum and maximum number of conditions that can
appear in the antecedent of the rules (minLen and maxLen).
To guide the search process, the algorithm takes
several additional parameters on input. According to our
experiments, their values can typically be left at their defaults
(we used the same defaults in all experiments reported
in our evaluation).</p>
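        <p>The control loop can be sketched as follows. This is our own condensed illustration of the idea behind Algorithm 3, not the arc implementation: mine_rules() is a hypothetical stand-in for the apriori call, and the stopping and step logic is simplified (e.g. the confidence step and per-iteration timeout handling are omitted):</p>

```python
# Condensed sketch of a CBA-RG-auto-style loop: relax the support threshold
# (or allow longer antecedents) until enough candidate rules are mined.
import time

def tune(mine_rules, target_rule_count=60000, total_timeout=100.0,
         init_support=0.01, init_conf=0.5, supp_step=0.05,
         min_len=2, init_maxlen=3, max_iterations=40):
    supp, conf, max_len = init_support, init_conf, init_maxlen
    start = time.time()
    best_rules = []
    for _ in range(max_iterations):
        if time.time() - start > total_timeout:
            break
        rules = mine_rules(supp, conf, min_len, max_len)
        best_rules = rules
        if len(rules) >= target_rule_count:
            break                  # enough candidates for CBA-CB
        if supp - supp_step > 0:
            supp -= supp_step      # relax support to obtain more rules
        else:
            max_len += 1           # allow longer antecedents instead

    return best_rules

# toy miner: pretends longer allowed rules yield proportionally more rules
toy = lambda supp, conf, lo, hi: ["rule"] * (hi * 100)
print(len(tune(toy, target_rule_count=1000)))  # → 1000
```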
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>In our benchmark, we aim to evaluate the performance of
the two proposed tuning steps against CBA with default
parameters as a baseline.</p>
      <p>For simulated annealing, we report on two setups, one
using default values of the metaparameters of simulated
annealing (denoted sa). To investigate the effect of the
metaparameters introduced in the simulated annealing
algorithm, we also include an approach denoted saopt, which
corresponds to the simulated annealing tuning algorithm
with metaparameter values optimized with random search.
For saopt, the configurations were evaluated against test
data to determine the upper bound of attainable accuracy.
As a result, saopt cannot be directly compared with the
remaining evaluated algorithms, which did not have access
to the test data during training.</p>
      <p>Algorithm 3: Automatic parameter tuning
heuristic algorithm (CBA-RG-auto)
input : train - training data
parameters: main: targetRuleCount, totalTimeout;
supplementary: initSupport = 0.01, initConf = 0.5,
confStep = 0.05, suppStep = 0.05, minLen = 2, initMaxlen = 3,
iterTimeout = 2, maxIterations = 40
output : rules - list of rules to be used as input for CBA-CB
1 begin
2   startTime ← currentTime(), supp ← initSupport, conf ← initConf,
    maxLen ← initMaxlen, iterations ← 0,
    maxLenDecreasedDueToTIMEOUT ← false, lastRuleCount ← -1
3   MAX_RULE_LEN ← number of explanatory attributes
4   while true do
5     iterations ← iterations + 1
6     if iterations = maxIterations then
7       break
8</p>
      <p>
        Datasets The evaluation was performed on 22 datasets
selected from the UCI repository [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. All selected datasets
were previously used in the evaluation of rule learning or
decision tree algorithms in one of the following seminal
papers: [
        <xref ref-type="bibr" rid="ref16 ref17 ref4 ref5">5, 16, 4, 17</xref>
        ]. Numerical attributes with more than 3
values were binned with entropy-based discretization [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
Ten-fold cross-validation was used to generate train-test
splits. The same splits were used for all evaluated
configurations.
      </p>
      <p>Implementation We have made implementations of all
evaluated algorithms available under an open-source
licence. We used the R package rCBA (version 0.4.3,
available via CRAN) to obtain results for the baseline CBA
run and for simulated annealing. The R package arc
(version 1.2, also available via CRAN) was used to obtain
results for the heuristic algorithm. Both implementations
use the apriori algorithm for the rule learning phase.</p>
      <p>Settings The classifier building phase of CBA does not
have any metaparameters. The rule learning phase requires
setting of rule mining parameters – minimum support,
minimum confidence and maximum rule length. The
starting parameters for the proposed threshold tuning methods
(Algorithms 1 – 3) are also listed below.</p>
      <p>Baseline CBA (base): 50% minimum confidence, 1%
minimum support, maximum rule length 3.</p>
      <p>Heuristic algorithm (heuristic): Default setting
is targetRuleCount = 60000, initSupport = 0.01,
initConf = 0.5, confStep = 0.05, suppStep = 0.05,
minLen = 2, initMaxlen = 3, iterTimeout = 2,
maxIterations = 40.</p>
      <p>Simulated Annealing (sa): Default setting for the SA
algorithm is INIT_TEMP = 100.0, ALPHA = 0.05,
MAX_LENGTH = 5, TIME_LIMIT = 10.</p>
      <p>Optimized Simulated Annealing (saopt): Random
search from the following intervals:
INIT_TEMP = 10.0 to 100.0, ALPHA = 0.01 to 0.5,
MAX_LENGTH = 3 to 10, TIME_LIMIT = 1 to 10.</p>
      <sec id="sec-4-1">
        <title>Results</title>
        <p>Results are reported in terms of accuracy (Table 1), rule
count (Table 2), average number of conditions in rules in
the model (Table 3), average model size computed as
average number of conditions × average rule count (Table 4),
and classifier build time (Table 5). Finally, Table 6
provides for each of the evaluated approaches an aggregate
number of wins in each of the five criteria above.
Baseline CBA The results show that CBA with default
parameter values performs surprisingly well, achieving the best
results in terms of overall size of the classifier on most
datasets (14 out of 22), while obtaining the best results on
5 datasets in terms of predictive performance.
Remarkably, there are three datasets (breast-w, credit-g, sonar) for
which the default parameter values generate models that
have the best accuracy and at the same time are the smallest in
terms of combined rule count and rule length.</p>
        <p>Despite the five wins, base CBA had the worst
average and median accuracy. Detailed examination of Table 1
shows that the default thresholds result in either very low
accuracy or excessive model size on several datasets; the drop in
accuracy is particularly strong on the glass and letter datasets.
The instability of the results is reflected by the high standard
deviation of accuracy.</p>
        <p>Heuristic The optimization heuristic provides the best
outcome in terms of predictive performance, both in terms
of accuracy and the number of wins against the other approaches.
This comes at the cost of creating larger models than those
generated by the other methods; the build time is also the highest.
One dataset (letter) could not even be processed. For accuracy,
the heuristic approach provides the most stable results, with the
lowest standard deviation.</p>
        <p>Simulated annealing When it comes to compact
models, very promising results were obtained by simulated
annealing with default parameters (sa), which produced the
smallest models in terms of rule count on 12 datasets. In
two cases (australian, hepatitis), this algorithm produced
much smaller models than the other methods with a small
gap in terms of accuracy. On the ionosphere dataset, sa
even generated a model which was the most accurate and at
the same time the smallest.</p>
        <p>The saopt algorithm generated almost consistently
better results than sa. However, this approach is not fully
comparable with the remaining two, because it used the
test set to select the best combination of hyperparameters.
It is included to show the possible effect of tuning the
hyperparameters of simulated annealing as opposed to only using
the default values.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Limitations and Future Work</title>
      <p>We acknowledge several limitations affecting our
preliminary study:</p>
      <p>Our benchmark did not account for the tradeoff
between rule count and accuracy. For example, a 1%
improvement in accuracy may need to be offset by a
much higher increase in the number of rules, which are
required to cover various specialized cases.</p>
      <p>We have not performed statistical testing of the
significance of differences between the algorithms.</p>
      <p>
        The baseline approaches could include some
previously proposed approaches for metaparameter
optimization, such as [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>For the baseline CBA algorithm, we evaluated only the
setting with the maximum length of the antecedent set to 3,
as higher thresholds sometimes led to combinatorial
explosion.</p>
      <p>The included datasets are of small or moderate size;
evaluation on large datasets was not performed.</p>
      <p>We plan to address some of the limitations noted above
in a larger follow-up study. In future work, it would also
be interesting to adapt the proposed rule tuning heuristics
to the recent generation of association rule classification
algorithms. Unlike CBA, which uses a computationally
lightweight approach to selecting rules for the final
classifier, these algorithms typically subject the input rule set to
a much more sophisticated selection process, involving
optimization techniques such as Markov Chain Monte Carlo
(in SBRL), submodular optimization (in IDS), or simulated
annealing (in BRS).</p>
      <p>
        This adaptation may require experimentation with other
metaparameter optimization algorithms, such as
sequential model-based optimization (SMBO) approaches [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
or racing algorithms, e.g. F-race (irace) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which were experimentally shown to
outperform SMBO on tasks with mixed types of parameters [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper, we have shown how the thresholds used in rule
generation can be tuned in both an unsupervised and a
supervised way to improve the results of association rule
classification algorithms in terms of predictive performance and
the size of the resulting model. Our results showed, somewhat
surprisingly, that the default thresholds recommended for
the CBA algorithm (1% minimum support and 50%
minimum confidence) provide on many datasets
results highly competitive with the best configuration found
with any of the proposed tuning algorithms. Despite this,
using these defaults cannot be unanimously recommended,
as the default settings work well on some datasets but
produce abysmal results on others. The proposed
unsupervised heuristic tuning algorithm provides the best predictive
accuracy and relatively stable results. The supervised
approach based on simulated annealing shows promising results
in terms of generating compact models.</p>
      <p>Possible applications include not only general
classification problems, but particularly the use of associative
classification for anomaly detection, where the results are
known to be very sensitive to the choice of the support
threshold [22].</p>
      <p>The implementation of the proposed algorithms is made
available in R packages rCBA and arc, which are available
in the CRAN repository.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the three anonymous
reviewers for insightful comments that helped to improve
the final version of the paper. This research was supported
by the Faculty of Information Technology, Czech Technical
University in Prague, and by the Faculty of Informatics and
Statistics, University of Economics, Prague, through
institutional support for research and grant IGA 12/2019.</p>
      <p>cal study on hyperparameter tuning of decision trees. arXiv
preprint arXiv:1812.02207, 2018.
[22] Brauckhoff, D.; Dimitropoulos, X.; Wagner, A.; et al.:
Anomaly extraction in backbone networks using
association rules. In Proceedings of the 9th ACM SIGCOMM
Conference on Internet Measurement, ACM, 2009, pp. 28-34.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Lakkaraju</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; Bach,
          <string-name>
            <given-names>S. H.</given-names>
            ;
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          :
          <article-title>Interpretable Decision Sets: A Joint Framework for Description and Prediction</article-title>
          .
          <source>In Proceedings of KDD '16</source>
          , New York, NY, USA: ACM,
          <year>2016</year>
          , ISBN 978-1-
          <fpage>4503</fpage>
          -4232-2, pp.
          <fpage>1675</fpage>
          -
          <lpage>1684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; Rudin,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Seltzer</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Scalable Bayesian rule lists</article-title>
          .
          <source>In Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3921</fpage>
          -
          <lpage>3930</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rudin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Doshi-Velez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>; et al.: A Bayesian framework for learning rule sets for interpretable classification</article-title>
          .
          <source>The Journal of Machine Learning Research, vol. 18, no. 1</source>
          ,
          <year>2017</year>
          : pp.
          <fpage>2357</fpage>
          -
          <lpage>2393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Ma,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Integrating classification and association rule mining</article-title>
          .
          <source>In Proceedings of KDD'98</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Alcala-Fdez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Alcala,
          <string-name>
            <given-names>R.</given-names>
            ;
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>A fuzzy association rule-based classification model for high-dimensional problems with genetic rule selection and lateral tuning</article-title>
          .
          <source>IEEE Transactions on Fuzzy Systems, vol. 19, no. 5</source>
          ,
          <year>2011</year>
          : s.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; Han,
          <string-name>
            <surname>J</surname>
          </string-name>
          .: CPAR:
          <article-title>Classification based on Predictive Association Rules</article-title>
          .
          <source>In Proceedings of the SIAM International Conference on Data Mining</source>
          , San Franciso: SIAM Press,
          <year>2003</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Han,
          <string-name>
            <given-names>J</given-names>
            .;
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.:</surname>
          </string-name>
          <article-title>CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules</article-title>
          .
          <source>In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM '01</source>
          , Washington, DC, USA: IEEE,
          <year>2001</year>
          , ISBN 0-7695-1119-8, pp.
          <fpage>369</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Vanhoof</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Depaire</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Structure of association rule classifiers: a review</article-title>
          .
          <source>In 2010 International Conference on Intelligent Systems and Knowledge Engineering (ISKE)</source>
          ,
          <year>November 2010</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Imielinski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Swami</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          :
          <article-title>Mining Association Rules between Sets of Items in Large Databases</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>1993</year>
          , pp.
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          .;
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Yin, Y.; et al.:
          <article-title>Mining Frequent Patterns Without Candidate Generation: A Frequent-Pattern Tree Approach</article-title>
          .
          <source>Data Mining and Knowledge Discovery, vol. 8, no. 1</source>
          , January
          <year>2004</year>
          : pp.
          <fpage>53</fpage>
          -
          <lpage>87</lpage>
          , ISSN 1384-5810.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Coenen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Leng</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; Zhang, L.:
          <article-title>Threshold tuning for improved classification association rule mining</article-title>
          .
          <source>In Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          , Springer,
          <year>2005</year>
          , pp.
          <fpage>216</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Černý</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm</article-title>
          .
          <source>Journal of Optimization Theory and Applications</source>
          , vol.
          <volume>45</volume>
          , no. 1,
          <year>1984</year>
          : pp.
          <fpage>41</fpage>
          -
          <lpage>51</lpage>
          , ISSN 1573-2878.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Kirkpatrick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gelatt</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vecchi</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          :
          <article-title>Optimization by Simulated Annealing</article-title>
          .
          <source>Science</source>
          , vol.
          <volume>220</volume>
          , no.
          <issue>4598</issue>
          ,
          <year>1983</year>
          : pp.
          <fpage>671</fpage>
          -
          <lpage>680</lpage>
          , ISSN 0036-8075, doi:10.1126/science.220.4598.671.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Aragon</surname>
            ,
            <given-names>C. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McGeoch</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          ; et al.:
          <article-title>Optimization by Simulated Annealing: An Experimental Evaluation. Part I, Graph Partitioning</article-title>
          .
          <source>Oper. Res., vol. 37, no. 6</source>
          ,
          October
          <year>1989</year>
          : pp.
          <fpage>865</fpage>
          -
          <lpage>892</lpage>
          , ISSN 0030-364X, doi:10.1287/opre.37.6.865.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Dua</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>UCI Machine Learning Repository</article-title>
          .
          <year>2017</year>
          . Available at: http://archive.ics.uci.edu/ml
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Hühn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hüllermeier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>FURIA: an algorithm for unordered fuzzy rule induction</article-title>
          .
          <source>Data Mining and Knowledge Discovery, vol. 19, no. 3</source>
          ,
          <year>2009</year>
          : pp.
          <fpage>293</fpage>
          -
          <lpage>319</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          :
          <article-title>Improved use of continuous attributes in C4.5</article-title>
          .
          <source>Journal of Artificial Intelligence Research, vol. 4</source>
          ,
          <year>1996</year>
          : pp.
          <fpage>77</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Irani</surname>
            ,
            <given-names>K. B.</given-names>
          </string-name>
          :
          <article-title>Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning</article-title>
          .
          <source>In 13th International Joint Conference on Artificial Intelligence (IJCAI-93)</source>
          ,
          <year>1993</year>
          , pp.
          <fpage>1022</fpage>
          -
          <lpage>1029</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bardenet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Bengio, Y.; et al.:
          <article-title>Algorithms for hyper-parameter optimization</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>2546</fpage>
          -
          <lpage>2554</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Birattari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Balaprakash</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; et al.:
          <article-title>F-Race and iterated F-Race: An overview</article-title>
          .
          <source>In Experimental Methods for the Analysis of Optimization Algorithms</source>
          , Springer,
          <year>2010</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Mantovani</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Horváth</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cerri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; et al.: An empiri-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>