=Paper=
{{Paper
|id=Vol-3226/paper2
|storemode=property
|title=On the Incremental Construction of Deep Rule Theories
|pdfUrl=https://ceur-ws.org/Vol-3226/paper2.pdf
|volume=Vol-3226
|authors=Florian Beck,Johannes FΓΌrnkranz,Van Quoc Phuong Huynh
|dblpUrl=https://dblp.org/rec/conf/itat/BeckFQ22
}}
==On the Incremental Construction of Deep Rule Theories==
On the Incremental Construction of Deep Rule Theories Florian Beck1 , Johannes FΓΌrnkranz1 and Van Quoc Phuong Huynh1 1 Johannes Kepler University Linz, Department of Computer Science, Institute for Application-oriented Knowledge Processing (FAW), Linz, Austria Abstract While we have seen considerable progress in learning rule-based theories in the past years, all state-of-the-art rule learners still learn descriptions that directly relate the input features to the target concept. However, the limitation of learning concepts in this shallow disjunctive normal form (DNF) may not necessarily lead to the desired models. In this paper, we investigate a novel search strategy that uses conjunctions and disjunctions of individuals as its elementary operators, which are successively combined to deep rule structures with intermediate concepts. We make use of efficient data structures known from association rule mining, which can efficiently summarize counts of conjunctive expressions, and expand them to handle disjunctive expressions as well. The resulting rule concepts develop over multiple generations and consist of arbitrary, deep combinations of conjunctions and disjunctions. The behavior of this algorithm is evaluated on a benchmark task from the domain of poker. A comparison to other rule learning algorithms shows that, while it is not generally competitive, it has some interesting properties, such as finding more compact and better generalizing models, that cannot be found in conventional rule learning algorithms. Keywords rule learning, concept learning, learning in logic, local search, genetic algorithms 1. Introduction pair and no pair, Ripper gets stuck at 70% accuracy since it is not able to generalize to pairs that it has not seen Traditional rule learning algorithms usually learn models during training. that directly relate input features to the output class. In this paper, we make first steps towards tackling This approach works well for most benchmark datasets, this problem by trying to remove the restriction to DNF however, there are also datasets where it comes to its formulas that is commonly made by conventional rule limits. One such case is the poker-dataset, where the task learning algorithms, and directly learn arbitrary logical is to learn to identify the quantitative value (pair, full concept descriptions. More precisely, we investigate the house, straight, etc.) of a hand of five cards. Every card is behavior of a simple algorithm that successively com- defined by a unique combination of a suit (clubs, spades, bines input signals with logical operators, thus building hearts, diamonds) and a rank (ace, 2, 3, ..., queen, king). up complex logic structures. For example, the class one pair includes hands where two The remainder of the paper is organized as follows: of the five cards have the same rank and the remaining Section 2 describes the problem of learning the pair con- three cards a different one. Already this task of detecting cept in different logical representations and refers to re- one pair is particularly difficult for a rule learner, because, lated work. Based on this, we propose a combination of in order to make a correct classification, it essentially has a local search and genetic algorithm in Section 3 and test to enumerate all possible card combinations (card 1 and it in various settings in Section 4. Finally, the results are 2, card 1 and 3, ..., card 4 and 5), for every rank (ace, 2, 3, concluded in Section 5. ..., queen, king). Advanced concepts based on this pair concept, like two pairs or full house, are even harder to learn, so that on the full poker-dataset the state-of-the- 2. Pair Concept art rule learner Ripper only achieves a little more than 50% accuracy, which can already be reached by simply Disjunctive Normal Form. Ripper [1] and most other predicting the most frequent class nothing in hand that rule learners learn their models in disjunctive normal form covers 50% of the data as well. Even in an adapted binary (DNF) [2]. This normal form consists of a disjunction version of the poker-dataset, where the only classes are of conjunctions of literals whereby each conjunction is called a rule and the whole DNF expression a ruleset. To ITATβ22: Information technologies β Applications and Theory, Septem- describe the concept of a pair in DNF for a set of π cards ber 23β27, 2022, Zuberec, Slovakia and π ranks, we have to find one rule for each possible $ fbeck@faw.jku.at (F. Beck); juffi@faw.jku.at (J. FΓΌrnkranz); combination of two cards for every rank as seen in the vqphuynh@faw.jku.at (V. Q. P. Huynh) following equation: 0000-0003-3183-2953 (F. Beck); 0000-0002-1207-0159 (J. FΓΌrnkranz) Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) can reduce this number to 8, as, e.g., in the following two π βοΈ π π βοΈ βοΈ cases: ππ = π β§ ππ = π. (1) π=1 π=1 π=π+1 [(π1 = π0 β¨ π2 = π0 ) β§ (π3 = π0 β¨ π4 = π0 )] β¨ The total (οΈ )οΈnumber of rules in this expression is π Β· π Β· (4) πβ1 = π Β· 2π with each of them consisting of two literals, (π1 = π0 β§ π2 = π0 ) β¨ (π3 = π0 β§ π4 = π0 ), 2 resulting in a total number of π Β· π Β· (π β 1) literals for [(π1 = π0 β¨ π2 = π0 ) β§ (π3 = π0 β¨ π4 = π0 )] β¨ the whole ruleset. (5) [(π1 = π0 β¨ π3 = π0 ) β§ (π2 = π0 β¨ π4 = π0 )]. Conjunctive Normal Form. An alternative represen- The first line of Equation (4) states that one of the first tation is the conjunctive normal form (CNF), using a con- two cards as well as one of the last two cards has rank π0 , junction of disjunctions of literals instead. In general, and the remaining two possibilities to form a pair within every logical formula can be represented in both DNF the first two cards or within the last two cards are covered and CNF, the CNF form is less popular in rule learning by the conjunctions in the second line. Equation (5) uses though (notable exceptions include [3, 4]). However, it the same conjunction in the first line and creates a similar turned out to be similarly powerful in practical applica- one in the second line, this time looking at the first and tions as well [4], with slight advantages for CNF learners third card and the second and fourth card respectively to on some datasets and for DNF learners on others. Look- find the remaining pair combinations. ing at the pair concept in Equation (1), the corresponding Note that these constructions can be generalized to an CNF would be much more complex. By only looking at arbitrary number of cards π: We first use two disjunctive pairs of a single rank π0 , we can find a CNF of a similar clauses with π/2 terms, which describe that a sought complexity though: rank occurs in the first half and in the second half of the cards, and then need to recursively encode the cases that π βοΈ βοΈ π a pair of that rank occurs in the first half, as well as that π π = π0 . (2) a pair occurs in the second rank. Thus, the total number π=1 π=1 πΜΈ=π of terms ππ that we need for encoding all possible pairs for π cards is Each disjunction in this CNF has a length of π β 1 π π literals, so that the whole CNF contains π disjunctions, ππ = + + π 2π + π 2π = π + 2 Β· π 2π . (6) 2 2 resulting in the same number of π Β· (π β 1) literals as the DNF when only considering a single rank π0 . However, in Flattening out the recursion yields comparison to the DNF, this representation might require ππ = log2 π Β· π, (7) a closer look to be comprehensible: Whenever any card ππ has the rank π0 , all disjunctions in the CNF besides which is considerably smaller than, for example, the = π(π2 ) terms that are needed for learning the DNF (οΈπ)οΈ the π0 th disjunction become true, and the latter is true if 2 any other card has the rank π0 as well to form a pair. representation of the concept. Note, however, that the expressions of type 2 need to Because of the more compact representation, deeper be combined disjunctively, so that the resulting formula models like shown in Equations (4) and (5) could be easier actually contains three logical layers: to learn than those restricted to CNF or DNF. Further- more, we noticed that state-of-the-art rule learners like π βοΈ βοΈ π βοΈ π Ripper cannot generalize their pair concept to all possible ππ = π. (3) combinations and therefore only reached 70% accuracy π=1 π=1 π=1 πΜΈ=π on the binary poker-dataset. It may be assumed that deeper models generalize better even if not all pair com- Converting this formula to a pure CNF would be con- binations occur in the training data. If in the previous siderably more complex than a DNF. In general, the example the pair combination of the first and the last card choice whether CNF or DNF is a more compact represen- is missing in the training data, an algorithm looking for tation depends on the problem. deep structures might still find and use the conjunction (π1 = π0 β¨ π2 = π0 ) β§ (π3 = π0 β¨ π4 = π0 ) instead Other Representations. In fact, if we do not restrict of splitting it into two conjunctions of size three (π = 1 the logic formula to consist of only one conjunctive and π β¨π = π )β§π = π and π = π β§(π = π β¨π = π ) 0 2 0 3 0 2 0 3 0 4 0 one disjunctive layer, even more compact representations or three conjunctions of size two π = π β§ π = π , 1 0 3 0 are possible. As an example, consider the pair concept π = π β§ π = π and π = π β§ π = π . In the 2 0 3 0 2 0 4 0 for π = 4 and a single rank π0 . The corresponding CNF following, we present a genetic algorithm that is capable and DNF contain 4 Β· (4 β 1) = 12 literals but other repre- to build a model with arbitrary combinations of conjunc- sentations allowing nested conjunctions and disjunctions tions and disjunctions to verify this hypothesis. 3. Algorithm Algorithm 1: learning()-method Input: population_size, metric_type, metric_arg, The motivation behind the work reported in this pa- n_generations, n_offspring, max_sim, selectors per was to investigate, whether general logical formulas Output: concept could be built up using a local search algorithm, which 1 population β Population(selectors, metric_type, incrementally builds up logical structures. metric_arg); The key idea is to simply start with the logical input 2 population.print_summary(); variables, and use the logical operators AND (β§) and OR 3 for π β 1 to n_generations do (β¨) to combine pairs of such values into a more complex 4 new_population β population; expression. Note that such an approach should, in prin- 5 for π β 1 to n_offspring do ciple, be sufficient for learning, for example, the DNF 6 for π1 β population do 7 repeat representation in the poker pairs domain. If, e.g., the 8 π2 β population size is large enough to form all possible pairs tournament_selection(population, in a first iteration, all subsequent iterations could form tournament_size); pairwise disjunctions of such pairs. Thus, once the 2π (οΈ )οΈ 9 until π1 ΜΈ= π2 ; conjunctive pairs are formed, we would need log2 2 (οΈπ)οΈ 10 πππππ β crossover(π1 , π2 , true); iterations in order to combine them into (οΈ )οΈa single large 11 ππππ π β crossover(π1 , π2 , false); disjunct, which makes in total 1 + log2 2π iterations. 12 sim β πππππ .support / ππππ π .support; The approach is quite similar to pattern trees, a data 13 if sim β€ max_sim then structure designed for and used in fuzzy reasoning [5]. 14 if πππππ ΜΈ= null then However, while this approach deals with numerical data 15 new_population.add(πππππ ); and fuzzification operators, we remain in strictly binary 16 end 17 if ππππ π ΜΈ= null then worlds, where solving an optimization problem is consid- 18 new_population.add(ππππ π ); erably harder than in its linear counterpart [6]. Also, the 19 end structures are often built up in a top-down fashion [7], 20 end whereas we proceed from the bottom upwards. 21 end The goal of the experiments reported is to see whether 22 end a logical formula correctly describing the pair concept 23 population β can be found at all, which parameter settings would be new_population.get_n_fittest(population_size, necessary to find it, whether more compact structures metric_type, metric_arg); can be found, or maybe even better generalizations can 24 population.print_summary(); be achieved. 25 end This attempt is realized in the form of a variant of a ge- 26 concept β population.get_n_fittest(1, metric_type, netic algorithm, which uses two types of crossovers, one metric_arg); for conjunctively combining and one for disjunctively 27 return concept combining two individuals in the current population, as shown in Algorithm 1. The algorithm builds upon data structures created by rithm is shown in Algorithm 1 and starts by creating a Lord, a novel rule learner developed in our group [8]. default population consisting of all possible features, also Lord reuses ideas from association rule learning and only known as selectors in Lord. All single features can al- needs a single pass through the training data during the ready be interpreted and evaluated as a rule for predicting learning phase, which makes it very efficient for large the class pair for some arbitrary metric and parameter datasets. All further operations, e.g., determining how (e.g., m-Estimate with π = 10). A first summary of many examples are correctly and incorrectly classified the population is output to potentially analyze the best for any given rule expression, can be directly extracted selectors. from n-Lists. Note that, apart from the usage of the same The evolution begins in line 3 of Algorithm 1 and con- data structures, the genetic algorithm presented in this sists of an outer loop for each generation, which copies paper and Lord are completely different and separate the population of the previous generation. It is followed approaches: While the genetic algorithm aims at finding by a loop for possibly generating multiple offspring per arbitrary logical formulas bottom upwards, Lord uses the individual and two inner loops for selecting individuals data structures to find the best DNF rule for each training for the crossover. The first individual π1 is already preset example. Detailed information about n-Lists are given in whereas the second one π2 is selected by a tournament [9], and about their application in the Lord algorithm in selection, returning the best individual of a fixed-sized [8]. subset of the population. Since conjunctions and disjunc- The learning method of the variant of genetic algo- Algorithm 2: crossover()-method β’ pairs4 consisting of 11,880 card combinations Input: π1 , π2 , conjunctive with 6,120 pairs (π = 4 cards, π = 6 ranks, π = 2 Output: π3 suits) 1 π1 , π2 β order(π1 , π2 ); β’ pairs4a consisting of 6,660 card combinations 2 if conjunctive then with 900 pairs (π = 4 cards, π = 6 ranks, π = 2 3 nlist β conj(π1 .nlist, π2 .nlist); suits). In comparison to pairs4, all pairs besides 4 π3 β Individual(nlist, π1 .body + "&&" + π2 .body) those with rank π0 = 6 are removed, and ad- 5 else ditionally those with π1 = 6 and π4 = 6 are 6 nlist β disj(π1 .nlist, π2 .nlist); retained for evaluation. 7 π3 β Individual(nlist, "(" + π1 .body + "||" + π2 .body + ")") All datasets are split into ten folds, whereby just use 8 end 9 if π3 .support = π1 .support β¨ π3 .support = π2 .support β¨ nine of them are used for training to break symmetries. π3 .support = 0 then In all experiments, 10 generations with a population size 10 return null of 100 and a tournament size of 5 are used, and the rules 11 end are evaluated by the m-estimate metric with π = 10. 12 return π3 The m-estimate value βπ of a rule π predicting class π has been proposed by Cestnik [10] and is calculated as π π.π + π π +π tions with itself are not changing the coverage of the βπ (π) = , (8) π.π + π.π + π expression, we force π1 and π2 to be different individuals before starting with the crossovers. Note that both the where conjunction and disjunction are computed successively π = a settable parameter in the range [0, +β) in lines 10 and 11. π.π = the number of true positives of rule π Algorithm 2 describes the crossover procedure. After π.π = the number of false positives of rule π ordering the two individuals alphabetically, either the π = the number of examples with class = π conjunctive or disjunctive n-List and condition string are π = the number of examples with class ΜΈ= π. built. If one of the individuals covers a subset of instances of the other one, or both individuals are disjoint from It provides an excellent, tunable trade-off between each other, the resulting individual is meaningless and weighted relative accuracy (π β β β), which is fre- null is returned instead. Otherwise, the created crossover quently used in descriptive rule learning, and precision is returned to the learning method. (π ββ 0), the main target for predictive learning [11]. Optionally, in lines 12 and 13 of Algorithm 1, the Jac- The results of the experiments are summarized in Ta- card similarity between the two generated crossovers is bles 1 and 2, the detailed evaluation is split into one computed to avoid offspring that is too similar to its con- paragraph per poker pairs-dataset. junction respectively disjunction "sibling" and parents. If this is not the case, the crossovers are added to the new population. Since the population increases this way, at Table 1 the end of each generation the population is filtered and Average percentage of pairs covered by the final concept (num- only the π best individuals are kept (line 23). By printing ber of folds with 100% coverage in parenthesis). The column the summary of the population in line 24, the maximum "full c." denotes the number of generation where all crossovers heuristic value and the ten best individuals are output. have been computed. Finally, the best individual of the last generation is re- pairs 2 3 4 4a turned as the concept, which can then be used to evaluate full c. 0 93.5% (5) 74.0% (0) 39.3% (0) 98.9% (6) whether test examples are covered by the concept. 1 100.0% (10) 100.0% (10) 87.0% (0) 100.0% (10) 2 100.0% (10) 98.8% (8) 63.0% (0) 100.0% (10) 4. Experiments Table 2 For the experiments, we generated multiple versions of Average number of generations needed to find best concept. the poker pairs-dataset with varying difficulty: pairs 2 3 4 4a full c. β’ pairs2 consisting of 30 card combinations with 12 0 4.4 5.8 6.5 5.3 pairs (π = 2 cards, π = 2 ranks, π = 3 suits) 1 3.2 5.4 7.8 5.0 β’ pairs3 consisting of 336 card combinations with 2 2.0 5.1 7.2 3.5 144 pairs (π = 3 cards, π = 4 ranks, π = 2 suits) Pairs within two cards. In this minimalistic dataset, Pairs within four cards. In the dataset with four cards there are only two different types of pairs since only we used six ranks, summing up to 36 pair combinations. two cards and ranks are available. Both the minimal DNF While a minimal DNF needs at least 72 literals, with π1 = 1β§π2 = 1β¨π1 = 2β§π2 = 2 and the corresponding deeper logical expressions sketched in Equations 4 and 5 CNF (π1 = 1 β¨ π2 = 2) β§ (π1 = 2 β¨ π2 = 1) consist of only 48 literals are needed. However, this requires to find four literals and could be found in two steps. suitable disjunctions in the first generation. Using a strict genetic algorithm approach, i.e., without The results for the dataset with four cards are similar building all possible crossovers in the first generation(s), to the previous one with three cards but with even more the learned concept describes all pairs for 5 out of 10 significant differences between the three settings. The folds, and this concept is found between the third and performance of the pure genetic algorithm setting drops fifth generation. The 5 remaining folds also diverge to a to a pair coverage between 24% and 49% and it converges fixed concept (in different logical expressions) the latest in an even later generation than in the smaller datasets. in the sixth generation, however, in 4 of them one or two This can again be fixed by a single exhaustive search for pairs are not covered respectively, and in the last one a crossovers and a subsequent genetic algorithm covering non-pair was covered mistakenly. between 79% and 93% of the pairs. However, similar to If in the first generation all possible crossovers are gen- the dataset with three cards, if also in the subsequent gen- erated, thus in particular the crossovers π1 = 1 β§ π2 = 1 eration all crossovers are computed, the convergence of and π1 = 2 β§ π2 = 2, the perfect theory can be found in the algorithm is sped up but the pair coverage decreases all 10 folds. 6 of theses folds need four generations to do to percentages between 52% and 77%. so, the remaining 4 find it already in the second gener- While the best performance is achieved by the al- ation. Interestingly, one of them also finds the minimal gorithm using a single generation with an exhaustive CNF additionally to the minimal DNF. crossover search, all resulting models have in common Finally, if also in the second generation all crossovers that they learn DNFs instead of creating intermediate are generated, both the minimal DNF and CNF are found disjunctions before conjuncting them. To investigate in all folds. further into this, we refine the dataset in the following paragraph. Pairs within three cards. The next dataset is already more complex; three cards and four ranks lead to 12 pair Subset of pairs within four cards. In the last ex- combinations, which need a length of at least 20 literals if periment, we want to focus on a different aspect: the using an arbitrary nested logical expression and a length capability of the learner to augment the concept to pairs of at least 24 literals if using a DNF. that are not covered in the training data like described Even with problems of this size, the presented simple in the end of Section 2. We consider pairs with cards of genetic algorithm can not find an expression covering rank 6 and ensure that only five out of six are part of the all pairs in any fold. The percentage of covered pairs training data. As a consequence, state-of-the-art DNF ranges from 56% to 92% and is approximately equally rule learners like Ripper are only capable to detect these distributed, same holds for the number of generations five pair combinations and conjunct them: that lies between 4 and 7. As discussed in Section 3, already a single generation with all possible crossovers can fix this problem. Even if (π1 = 6 β§ π2 = 6) β¨ (π1 = 6 β§ π3 = 6)β¨ the population size is too small to keep all crossovers, it (π2 = 6 β§ π3 = 6) β¨ (π2 = 6 β§ π4 = 6)β¨ (9) is still large enough so that only irrelevant crossovers are (π3 = 6 β§ π4 = 6) removed immediately (those taking the suit attributes into account). In all folds, a concept covering all pairs is In this representation, the remaining pair combination learned in the fifth generation or sixth generation. π1 = 6 β§ π4 = 6 remains uncovered. However, we have The average number of generations needed as well seen in Equations 4 and 5 that even smaller concepts can as the overall complexity of the found concepts can be be learned, and those possibly also cover the missing slightly decreased by computing all crossovers for a sec- pair combination. We verify whether similar models can ond generation. Surprisingly, this approach leads to in- indeed be learned with the suggested approach. Starting complete concepts in two folds though, that are missing with the pure genetic algorithm, the learner does not even 1 of the 12 pair combinations. This indicates that an ex- reliably find all pair combinations. In four out of ten folds, haustive search over multiple generations leads to too parts of the concepts additionally use suit attributes in many similar individuals that prevent other diverse in- the conjunctions, which led up to 37 of 810 pair instances dividuals from being included into the population and uncovered. used for further concepts. By generating all crossovers in the first generation, the five pair combinations are found and combined to a Generation: 1 Max fitness: 0,9503 Generation: 2 Max fitness: 0,9869 0,9503 (164|0) c1=6 && c2=6 -> class=1 0,9869 (648|0) (c1=6 || c4=6) && (c2=6 || c3=6) -> class=1 0,9503 (164|0) c1=6 && c3=6 -> class=1 0,9826 (486|0) (c1=6 || c2=6) && (c4=6 || c3=6) -> class=1 0,9497 (162|0) c2=6 && c3=6 -> class=1 0,9826 (486|0) (c1=6 || c3=6) && (c4=6 || c2=6) -> class=1 0,9491 (160|0) c4=6 && c2=6 -> class=1 0,9744 (328|0) (c1=6 && c2=6 || c1=6 && c3=6) -> class=1 0,9491 (160|0) c4=6 && c3=6 -> class=1 0,9744 (328|0) c1=6 && (c2=6 || c3=6) -> class=1 0,3822 ( 4|0) c3=5 && c4=5 -> class=1 0,9743 (326|0) (c1=6 && c2=6 || c2=6 && c3=6) -> class=1 0,3822 ( 4|0) c2=5 && c4=5 -> class=1 0,9743 (326|0) (c1=6 && c3=6 || c2=6 && c3=6) -> class=1 0,3822 ( 4|0) c4=1 && c1=1 -> class=1 0,9743 (326|0) (c1=6 || c3=6) && c2=6 -> class=1 0,3822 ( 4|0) c2=1 && c4=1 -> class=1 0,9743 (326|0) (c1=6 || c2=6) && c3=6 -> class=1 0,3822 ( 4|0) c3=1 && c4=1 -> class=1 0,9741 (324|0) (c1=6 && c2=6 || c4=6 && c2=6) -> class=1 Figure 1: Best ten individuals in the first and second generation respectively. DNF similar to Equation 9 within five generations, i.e., approach. one iteration more than in the optimal logarithmic case. While the presented combination of a local search and Therefore, all training instances consisting one of these genetic algorithm is not yet matured to be successfully five pair combinations are covered, but the test instances applied on arbitrary real-world datasets, through the ap- with the remaining sixth pair combination remain disre- plication on different versions of the poker pairs-dataset garded. we could take away some interesting properties of the This behavior changes if the second generation com- deep rule theories learned this way. An exhaustive search putes all crossovers as well. Figure 1 lists the best ten indi- for crossovers in the first generation was needed to re- viduals found in the first and second generation. While in liably find concepts covering all or at least most pairs. the first generation this list only consists of conjunctions However, an additional exhaustive search in the subse- that are already able to only cover pairs, in the second quent generation led to too many similar individuals and generation the best three individuals are conjunctions resulted in a worse performance for the chosen parame- of disjunctions, i.e., crossovers of individuals that have ters. Finally, the last experiment on a subset of the poker not been among the best ten in the first generation. In pairs-dataset with four cards showed that the presented particular, the second and third individual evaluate better learner indeed is capable to find the more compact and than all possible individuals in DNF and at the same time better generalizing model described at the beginning of still cover the missing sixth pair combination. this paper, which can not be learned by state-of-the-art Based on these individuals, a concept covering all pairs rule learners, and highlights the importance of subse- of the training set can be found already in the third gener- quent combinations of disjunctions and conjunctions to ation for five of the folds, and in the fourth generation for do so. the remaining five folds. The percentage of individuals covering the additional the missing sixth pair combina- tion ranges from 13% to 31%. While this number is still 6. Future Work low, it is worthwhile to mention that in comparison to In future work, further refinements of the genetic al- DNF rule learner it is capable to generate such individu- gorithm are necessary and partly already implemented. als, and in comparison to the other two versions it also Most emphasized should be the handling of diversity, actually finds them. which can, e.g., either be handled by a separate reserve population that focuses on covering a wider variety of 5. Conclusion examples instead of covering the most examples as possi- ble or by penalizing the evaluation of individuals if their In this work, we dealt with the question when conven- Jaccard similarity is too high. Additionally, a penalty tional DNF rule learners come to their limits. We con- for the rule length might force the algorithm to prefer ducted this on the basis of a poker pairs-dataset and the most compact representation between those cover- showed that deep rule theories can both describe the ing the same instances. Last but not least, the idea of pair concept in a more concise form and are capable to mutations might be refined for the rule learning setting. generalize the concept better. To investigate into this Features on all levels of the logical expression could be behavior in practical experiments, we created a simple removed, replaced or augmented, and conjunctions could variant of a genetic algorithm, which uses conjunctive be interchanged with disjunctions and vice versa. For all and disjunctive crossovers to built deep rule concepts out these suggestions, a further in-depth analysis is needed of single features. We also implemented the possibility though. to generate all possible crossovers at the beginning of the algorithm, which can then be seen as a local search References [1] W. W. Cohen, Fast effective rule induction, in: A. Prieditis, S. Russell (Eds.), Proceedings of the 12th International Conference on Machine Learning (ML-95), Morgan Kaufmann, Lake Tahoe, CA, 1995, pp. 115β123. [2] J. FΓΌrnkranz, D. Gamberger, N. LavraΔ, Foundations of Rule Learning, Springer-Verlag, 2012. [3] A. Dries, L. D. Raedt, S. Nijssen, Mining predic- tive k-CNF expressions, IEEE Transactions on Knowledge and Data Engineering 22 (2010) 743β 748. URL: https://doi.org/10.1109/TKDE.2009.152. doi:10.1109/TKDE.2009.152. [4] R. J. Mooney, Encouraging experimental results on learning CNF, Machine Learning 19 (1995) 79β92. [5] E. HΓΌllermeier, From knowledge-based to data-driven fuzzy modeling β development, criticism, and alternative directions, Infor- matik Spektrum 38 (2015) 500β509. URL: https://doi.org/10.1007/s00287-015-0931-8. doi:10.1007/s00287-015-0931-8. [6] A. Schrijver, Theory of linear and integer program- ming, John Wiley & Sons, 1998. [7] R. Senge, E. HΓΌllermeier, Top-down induction of fuzzy pattern trees, IEEE Transanctions on Fuzzy Systems 19 (2011) 241β252. URL: https://doi.org/10. 1109/TFUZZ.2010.2093532. doi:10.1109/TFUZZ. 2010.2093532. [8] V. Q. P. Huynh, J. FΓΌrnkranz, F. Beck, Efficient learn- ing of large sets of locally optimal classification rules, Under review for journal publication, 2022. [9] Z.-H. Deng, S.-L. Lv, Prepost+: An efficient n-lists- based algorithm for mining frequent itemsets via childrenβparent equivalence pruning, Expert Sys- tems with Applications 42 (2015) 5424β5432. [10] B. Cestnik, Estimating probabilities: A crucial task in Machine Learning, in: L. Aiello (Ed.), Proceed- ings of the 9th European Conference on Artificial Intelligence (ECAI-90), Pitman, Stockholm, Sweden, 1990, pp. 147β150. [11] J. FΓΌrnkranz, P. A. Flach, ROC βnβ rule learning β Towards a better understanding of covering algo- rithms, Machine Learning 58 (2005) 39β77.