On the Incremental Construction of Deep Rule Theories

Florian Beck¹, Johannes Fürnkranz¹ and Van Quoc Phuong Huynh¹

¹ Johannes Kepler University Linz, Department of Computer Science, Institute for Application-oriented Knowledge Processing (FAW), Linz, Austria


Abstract
While we have seen considerable progress in learning rule-based theories in the past years, all state-of-the-art rule learners still learn descriptions that directly relate the input features to the target concept. However, the limitation to learning concepts in this shallow disjunctive normal form (DNF) may not necessarily lead to the desired models. In this paper, we investigate a novel search strategy that uses conjunctions and disjunctions of individuals as its elementary operators, which are successively combined into deep rule structures with intermediate concepts. We make use of efficient data structures known from association rule mining, which can efficiently summarize counts of conjunctive expressions, and expand them to handle disjunctive expressions as well. The resulting rule concepts develop over multiple generations and consist of arbitrary, deep combinations of conjunctions and disjunctions. The behavior of this algorithm is evaluated on a benchmark task from the domain of poker. A comparison to other rule learning algorithms shows that, while it is not generally competitive, it has some interesting properties, such as finding more compact and better generalizing models, which cannot be found by conventional rule learning algorithms.

Keywords
rule learning, concept learning, learning in logic, local search, genetic algorithms



ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
fbeck@faw.jku.at (F. Beck); juffi@faw.jku.at (J. Fürnkranz); vqphuynh@faw.jku.at (V. Q. P. Huynh)
ORCID: 0000-0003-3183-2953 (F. Beck); 0000-0002-1207-0159 (J. Fürnkranz)

1. Introduction

Traditional rule learning algorithms usually learn models that directly relate input features to the output class. This approach works well for most benchmark datasets; however, there are also datasets where it reaches its limits. One such case is the poker-dataset, where the task is to learn to identify the quantitative value (pair, full house, straight, etc.) of a hand of five cards. Every card is defined by a unique combination of a suit (clubs, spades, hearts, diamonds) and a rank (ace, 2, 3, ..., queen, king). For example, the class one pair includes hands where two of the five cards have the same rank and the remaining three cards a different one. Already this task of detecting one pair is particularly difficult for a rule learner because, in order to make a correct classification, it essentially has to enumerate all possible card combinations (card 1 and 2, card 1 and 3, ..., card 4 and 5) for every rank (ace, 2, 3, ..., queen, king). Advanced concepts based on this pair concept, like two pairs or full house, are even harder to learn, so that on the full poker-dataset the state-of-the-art rule learner Ripper achieves only a little more than 50% accuracy, which can already be reached by simply predicting the most frequent class nothing in hand, which covers 50% of the data as well. Even in an adapted binary version of the poker-dataset, where the only classes are pair and no pair, Ripper gets stuck at 70% accuracy since it is not able to generalize to pairs that it has not seen during training.

In this paper, we make first steps towards tackling this problem by trying to remove the restriction to DNF formulas that is commonly made by conventional rule learning algorithms, and instead directly learn arbitrary logical concept descriptions. More precisely, we investigate the behavior of a simple algorithm that successively combines input signals with logical operators, thus building up complex logical structures.

The remainder of the paper is organized as follows: Section 2 describes the problem of learning the pair concept in different logical representations and refers to related work. Based on this, we propose a combination of a local search and a genetic algorithm in Section 3 and test it in various settings in Section 4. Finally, we conclude in Section 5.

2. Pair Concept

Disjunctive Normal Form. Ripper [1] and most other rule learners learn their models in disjunctive normal form (DNF) [2]. This normal form consists of a disjunction of conjunctions of literals, whereby each conjunction is called a rule and the whole DNF expression a ruleset. To describe the concept of a pair in DNF for a set of $c$ cards and $r$ ranks, we have to find one rule for each possible combination of two cards for every rank, as seen in the following equation:
$$\bigvee_{i=1}^{r} \bigvee_{j=1}^{c} \bigvee_{k=j+1}^{c} c_j = i \wedge c_k = i. \qquad (1)$$

The total number of rules in this expression is $r \cdot c \cdot \frac{c-1}{2} = r \cdot \binom{c}{2}$, with each of them consisting of two literals, resulting in a total number of $r \cdot c \cdot (c-1)$ literals for the whole ruleset.

Conjunctive Normal Form. An alternative representation is the conjunctive normal form (CNF), using a conjunction of disjunctions of literals instead. In general, every logical formula can be represented in both DNF and CNF, though the CNF form is less popular in rule learning (notable exceptions include [3, 4]). However, it has turned out to be similarly powerful in practical applications [4], with slight advantages for CNF learners on some datasets and for DNF learners on others. Looking at the pair concept in Equation (1), the corresponding CNF would be much more complex. By only looking at pairs of a single rank $r_0$, we can find a CNF of similar complexity though:

$$\bigwedge_{j=1}^{c} \bigvee_{\substack{k=1 \\ k \neq j}}^{c} c_k = r_0. \qquad (2)$$

Each disjunction in this CNF has a length of $c - 1$ literals, so that the whole CNF contains $c$ disjunctions, resulting in the same number of $c \cdot (c-1)$ literals as the DNF when only considering a single rank $r_0$. However, in comparison to the DNF, this representation might require a closer look to be comprehensible: Whenever any card $c_k$ has the rank $r_0$, all disjunctions in the CNF besides the $k$th disjunction become true, and the latter is true if any other card has the rank $r_0$ as well to form a pair.

Note, however, that the expressions of type (2) need to be combined disjunctively, so that the resulting formula actually contains three logical layers:

$$\bigvee_{i=1}^{r} \bigwedge_{j=1}^{c} \bigvee_{\substack{k=1 \\ k \neq j}}^{c} c_k = i. \qquad (3)$$

Converting this formula to a pure CNF would be considerably more complex than a DNF. In general, the choice whether CNF or DNF is the more compact representation depends on the problem.

Other Representations. In fact, if we do not restrict the logic formula to consist of only one conjunctive and one disjunctive layer, even more compact representations are possible. As an example, consider the pair concept for $c = 4$ and a single rank $r_0$. The corresponding CNF and DNF contain $4 \cdot (4-1) = 12$ literals, but other representations allowing nested conjunctions and disjunctions can reduce this number to 8, as, e.g., in the following two cases:

$$[(c_1 = r_0 \vee c_2 = r_0) \wedge (c_3 = r_0 \vee c_4 = r_0)] \vee (c_1 = r_0 \wedge c_2 = r_0) \vee (c_3 = r_0 \wedge c_4 = r_0), \qquad (4)$$

$$[(c_1 = r_0 \vee c_2 = r_0) \wedge (c_3 = r_0 \vee c_4 = r_0)] \vee [(c_1 = r_0 \vee c_3 = r_0) \wedge (c_2 = r_0 \vee c_4 = r_0)]. \qquad (5)$$

The first, bracketed term of Equation (4) states that one of the first two cards as well as one of the last two cards has rank $r_0$; the remaining two possibilities to form a pair, within the first two cards or within the last two cards, are covered by the two conjunctions that follow. Equation (5) uses the same bracketed term and creates a similar one as its second disjunct, this time looking at the first and third card and the second and fourth card, respectively, to find the remaining pair combinations.

Note that these constructions can be generalized to an arbitrary number of cards $c$: We first use two disjunctive clauses with $c/2$ terms, which describe that a sought rank occurs in the first half and in the second half of the cards, and then need to recursively encode the cases that a pair of that rank occurs in the first half, as well as that a pair occurs in the second half. Thus, the total number of terms $N_c$ that we need for encoding all possible pairs for $c$ cards is

$$N_c = \frac{c}{2} + \frac{c}{2} + N_{c/2} + N_{c/2} = c + 2 \cdot N_{c/2}. \qquad (6)$$

Flattening out the recursion yields

$$N_c = \log_2 c \cdot c, \qquad (7)$$

which is considerably smaller than, for example, the $\binom{c}{2} = O(c^2)$ terms that are needed for learning the DNF representation of the concept.

Because of the more compact representation, deeper models like those shown in Equations (4) and (5) could be easier to learn than those restricted to CNF or DNF. Furthermore, we noticed that state-of-the-art rule learners like Ripper cannot generalize their pair concept to all possible combinations and therefore only reached 70% accuracy on the binary poker-dataset. It may be assumed that deeper models generalize better even if not all pair combinations occur in the training data. If in the previous example the pair combination of the first and the last card is missing in the training data, an algorithm looking for deep structures might still find and use the conjunction $(c_1 = r_0 \vee c_2 = r_0) \wedge (c_3 = r_0 \vee c_4 = r_0)$ instead of splitting it into two conjunctions of size three, $(c_1 = r_0 \vee c_2 = r_0) \wedge c_3 = r_0$ and $c_2 = r_0 \wedge (c_3 = r_0 \vee c_4 = r_0)$, or three conjunctions of size two, $c_1 = r_0 \wedge c_3 = r_0$, $c_2 = r_0 \wedge c_3 = r_0$, and $c_2 = r_0 \wedge c_4 = r_0$. In the following, we present a genetic algorithm that is capable of building a model with arbitrary combinations of conjunctions and disjunctions to verify this hypothesis.
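As a sanity check on these counts, the following short Python sketch (our illustration, not part of the learning algorithm presented below) verifies that the 8-literal formula of Equation (4) and the 12-literal single-rank CNF and DNF cover exactly the same hands, and evaluates the recursion of Equation (6) for $c = 4$:

    from itertools import product

    def dnf_pair4(hand, r0):
        # 12-literal single-rank DNF: one 2-literal conjunction per pair of positions
        return any(hand[j] == r0 and hand[k] == r0
                   for j in range(4) for k in range(j + 1, 4))

    def cnf_pair4(hand, r0):
        # 12-literal single-rank CNF from Equation (2)
        return all(any(hand[k] == r0 for k in range(4) if k != j) for j in range(4))

    def deep_pair4(hand, r0):
        # 8-literal nested representation from Equation (4)
        c1, c2, c3, c4 = (card == r0 for card in hand)
        return ((c1 or c2) and (c3 or c4)) or (c1 and c2) or (c3 and c4)

    # all three representations agree on all 6^4 rank assignments to four cards
    assert all(dnf_pair4(h, 0) == cnf_pair4(h, 0) == deep_pair4(h, 0)
               for h in product(range(6), repeat=4))

    def n_terms(c):
        # recursion from Equation (6), with base case N_2 = 2
        return 2 if c == 2 else c + 2 * n_terms(c // 2)

    print(n_terms(4), 4 * (4 - 1))   # 8 vs. 12 literals, as stated above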
3. Algorithm

The motivation behind the work reported in this paper was to investigate whether general logical formulas can be built up using a local search algorithm that incrementally constructs logical structures.

The key idea is to simply start with the logical input variables, and to use the logical operators AND (∧) and OR (∨) to combine pairs of such values into a more complex expression. Note that such an approach should, in principle, be sufficient for learning, for example, the DNF representation in the poker pairs domain. If, e.g., the population size is large enough to form all possible pairs in a first iteration, all subsequent iterations could form pairwise disjunctions of such pairs. Thus, once the $\binom{c}{2}$ conjunctive pairs are formed, we would need $\log_2 \binom{c}{2}$ iterations in order to combine them into a single large disjunction, which makes in total $1 + \log_2 \binom{c}{2}$ iterations.

The approach is quite similar to pattern trees, a data structure designed for and used in fuzzy reasoning [5]. However, while that approach deals with numerical data and fuzzification operators, we remain in strictly binary worlds, where solving an optimization problem is considerably harder than in its linear counterpart [6]. Also, pattern trees are often built up in a top-down fashion [7], whereas we proceed from the bottom upwards.

The goal of the reported experiments is to see whether a logical formula correctly describing the pair concept can be found at all, which parameter settings are necessary to find it, whether more compact structures can be found, or whether maybe even better generalizations can be achieved.

This attempt is realized in the form of a variant of a genetic algorithm, which uses two types of crossovers, one for conjunctively and one for disjunctively combining two individuals of the current population, as shown in Algorithm 1.

The algorithm builds upon data structures created by Lord, a novel rule learner developed in our group [8]. Lord reuses ideas from association rule learning and only needs a single pass through the training data during the learning phase, which makes it very efficient for large datasets. All further operations, e.g., determining how many examples are correctly and incorrectly classified for any given rule expression, can be directly extracted from n-Lists. Note that, apart from the usage of the same data structures, the genetic algorithm presented in this paper and Lord are completely different and separate approaches: While the genetic algorithm aims at finding arbitrary logical formulas from the bottom upwards, Lord uses the data structures to find the best DNF rule for each training example. Detailed information about n-Lists is given in [9], and about their application in the Lord algorithm in [8].

Algorithm 1: learning()-method
Input: population_size, metric_type, metric_arg, n_generations, n_offspring, max_sim, selectors
Output: concept
 1  population ← Population(selectors, metric_type, metric_arg);
 2  population.print_summary();
 3  for g ← 1 to n_generations do
 4      new_population ← population;
 5      for o ← 1 to n_offspring do
 6          for i1 ∈ population do
 7              repeat
 8                  i2 ← tournament_selection(population, tournament_size);
 9              until i1 ≠ i2;
10              i_conj ← crossover(i1, i2, true);
11              i_disj ← crossover(i1, i2, false);
12              sim ← i_conj.support / i_disj.support;
13              if sim ≤ max_sim then
14                  if i_conj ≠ null then
15                      new_population.add(i_conj);
16                  end
17                  if i_disj ≠ null then
18                      new_population.add(i_disj);
19                  end
20              end
21          end
22      end
23      population ← new_population.get_n_fittest(population_size, metric_type, metric_arg);
24      population.print_summary();
25  end
26  concept ← population.get_n_fittest(1, metric_type, metric_arg);
27  return concept

The learning method of the variant of the genetic algorithm is shown in Algorithm 1 and starts by creating a default population consisting of all possible features, also known as selectors in Lord. All single features can already be interpreted and evaluated as a rule for predicting the class pair for some arbitrary metric and parameter (e.g., m-estimate with $m = 10$). A first summary of the population is output to potentially analyze the best selectors.

The evolution begins in line 3 of Algorithm 1 and consists of an outer loop for each generation, which copies the population of the previous generation. It is followed by a loop for possibly generating multiple offspring per individual and two inner loops for selecting individuals for the crossover. The first individual $i_1$ is already preset, whereas the second one $i_2$ is selected by a tournament selection, returning the best individual of a fixed-size subset of the population. Since conjunctions and disjunctions of an individual with itself do not change the coverage of the expression, we force $i_1$ and $i_2$ to be different individuals before starting with the crossovers. Note that both the conjunction and the disjunction are computed successively in lines 10 and 11.

Algorithm 2: crossover()-method
Input: i1, i2, conjunctive
Output: i3
 1  i1, i2 ← order(i1, i2);
 2  if conjunctive then
 3      nlist ← conj(i1.nlist, i2.nlist);
 4      i3 ← Individual(nlist, i1.body + "&&" + i2.body)
 5  else
 6      nlist ← disj(i1.nlist, i2.nlist);
 7      i3 ← Individual(nlist, "(" + i1.body + "||" + i2.body + ")")
 8  end
 9  if i3.support = i1.support ∨ i3.support = i2.support ∨ i3.support = 0 then
10      return null
11  end
12  return i3

Algorithm 2 describes the crossover procedure. After ordering the two individuals alphabetically, either the conjunctive or the disjunctive n-List and condition string are built. If one of the individuals covers a subset of the instances of the other one, or if both individuals are disjoint from each other, the resulting individual is meaningless and null is returned instead. Otherwise, the created crossover is returned to the learning method.

Optionally, in lines 12 and 13 of Algorithm 1, the Jaccard similarity between the two generated crossovers is computed to avoid offspring that is too similar to its conjunction or disjunction "sibling" and its parents. If this is not the case, the crossovers are added to the new population. Since the population increases this way, at the end of each generation the population is filtered and only the $n$ best individuals are kept (line 23). By printing the summary of the population in line 24, the maximum heuristic value and the ten best individuals are output.

Finally, the best individual of the last generation is returned as the concept, which can then be used to evaluate whether test examples are covered by the concept.
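For illustration, the following Python sketch (all names are ours) condenses one generation of Algorithm 1 into plain code; the n-Lists are replaced by simple sets of covered example indices, so that the conjunctive and disjunctive crossovers of Algorithm 2 reduce to set intersection and union:

    import random

    class Individual:
        def __init__(self, body, covered):
            self.body = body         # condition string, e.g. "c1=6 && c2=6"
            self.covered = covered   # stand-in for the n-List: ids of covered examples
            self.support = len(covered)

    def crossover(i1, i2, conjunctive):
        # Algorithm 2, with coverage sets instead of n-Lists
        i1, i2 = sorted((i1, i2), key=lambda i: i.body)   # order alphabetically
        if conjunctive:
            i3 = Individual(i1.body + " && " + i2.body, i1.covered & i2.covered)
        else:
            i3 = Individual("(" + i1.body + " || " + i2.body + ")",
                            i1.covered | i2.covered)
        # subset or disjoint parents yield a meaningless crossover
        if i3.support in (i1.support, i2.support, 0):
            return None
        return i3

    def one_generation(population, n_offspring, max_sim, population_size, fitness):
        # lines 3-24 of Algorithm 1 for a single generation
        new_population = list(population)
        for _ in range(n_offspring):
            for i1 in population:
                i2 = i1
                while i2 is i1:   # tournament selection of size 5, forced to differ
                    i2 = max(random.sample(population, k=5), key=fitness)
                i_conj = crossover(i1, i2, True)
                i_disj = crossover(i1, i2, False)
                if i_conj is None or i_disj is None:   # pseudocode leaves this case open
                    continue
                if i_conj.support / i_disj.support <= max_sim:   # Jaccard similarity
                    new_population += [i_conj, i_disj]
        # keep only the population_size fittest individuals (line 23)
        return sorted(new_population, key=fitness, reverse=True)[:population_size]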
4. Experiments

For the experiments, we generated multiple versions of the poker pairs-dataset with varying difficulty:

  • pairs2, consisting of 30 card combinations with 12 pairs ($c = 2$ cards, $r = 2$ ranks, $s = 3$ suits)
  • pairs3, consisting of 336 card combinations with 144 pairs ($c = 3$ cards, $r = 4$ ranks, $s = 2$ suits)
  • pairs4, consisting of 11,880 card combinations with 6,120 pairs ($c = 4$ cards, $r = 6$ ranks, $s = 2$ suits)
  • pairs4a, consisting of 6,660 card combinations with 900 pairs ($c = 4$ cards, $r = 6$ ranks, $s = 2$ suits). In comparison to pairs4, all pairs besides those with rank $r_0 = 6$ are removed, and additionally those with $c_1 = 6$ and $c_4 = 6$ are only retained for evaluation.
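For illustration, the dataset sizes above can be reproduced by enumerating all ordered hands of distinct cards, as in the following Python sketch (a reconstruction; the actual generation code may differ):

    from itertools import permutations

    def generate_pairs_dataset(c, r, s):
        # all ordered hands of c distinct cards from an r-rank, s-suit deck;
        # a hand is a pair if at least two of its cards share a rank
        deck = [(rank, suit) for rank in range(1, r + 1) for suit in range(1, s + 1)]
        return [(hand, len({rank for rank, _ in hand}) < c)
                for hand in permutations(deck, c)]

    data = generate_pairs_dataset(c=4, r=6, s=2)
    print(len(data), sum(is_pair for _, is_pair in data))   # 11880 6120, cf. pairs4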
All datasets are split into ten folds, whereby just nine of them are used for training to break symmetries. In all experiments, 10 generations with a population size of 100 and a tournament size of 5 are used, and the rules are evaluated by the m-estimate metric with $m = 10$. The m-estimate value $h_m$ of a rule $r$ predicting class $c$ has been proposed by Cestnik [10] and is calculated as

$$h_m(r) = \frac{r.p + m \cdot \frac{P}{P+N}}{r.p + r.n + m}, \qquad (8)$$

where

$m$ = a settable parameter in the range $[0, +\infty)$
$r.p$ = the number of true positives of rule $r$
$r.n$ = the number of false positives of rule $r$
$P$ = the number of examples with class $= c$
$N$ = the number of examples with class $\neq c$.

It provides an excellent, tunable trade-off between weighted relative accuracy ($m \to \infty$), which is frequently used in descriptive rule learning, and precision ($m \to 0$), the main target for predictive learning [11].
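For concreteness, Equation (8) transcribes directly into Python (illustration only; the parameter names are ours):

    def m_estimate(tp, fp, pos, neg, m=10.0):
        # m-estimate of a rule: tp/fp are the rule's true and false positives
        # (r.p and r.n), pos/neg the class totals (P and N)
        return (tp + m * pos / (pos + neg)) / (tp + fp + m)

    # m -> 0 approaches precision tp/(tp+fp); large m approaches the prior P/(P+N).
    # With the class totals of pairs4a (900 pairs out of 6,660 hands):
    print(round(m_estimate(tp=164, fp=0, pos=900, neg=5760), 4))   # 0.9503, cf. Figure 1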
The results of the experiments are summarized in Tables 1 and 2; the detailed evaluation is split into one paragraph per poker pairs-dataset.

Table 1
Average percentage of pairs covered by the final concept (number of folds with 100% coverage in parentheses). The column "full c." denotes the number of generations in which all crossovers have been computed.

full c. | pairs2       | pairs3       | pairs4     | pairs4a
   0    |  93.5%  (5)  |  74.0%  (0)  | 39.3% (0)  |  98.9%  (6)
   1    | 100.0% (10)  | 100.0% (10)  | 87.0% (0)  | 100.0% (10)
   2    | 100.0% (10)  |  98.8%  (8)  | 63.0% (0)  | 100.0% (10)

Table 2
Average number of generations needed to find the best concept.

full c. | pairs2 | pairs3 | pairs4 | pairs4a
   0    |  4.4   |  5.8   |  6.5   |  5.3
   1    |  3.2   |  5.4   |  7.8   |  5.0
   2    |  2.0   |  5.1   |  7.2   |  3.5
Pairs within two cards. In this minimalistic dataset, there are only two different types of pairs since only two cards and two ranks are available. Both the minimal DNF $c_1 = 1 \wedge c_2 = 1 \vee c_1 = 2 \wedge c_2 = 2$ and the corresponding CNF $(c_1 = 1 \vee c_2 = 2) \wedge (c_1 = 2 \vee c_2 = 1)$ consist of four literals and could be found in two steps.

Using a strict genetic algorithm approach, i.e., without building all possible crossovers in the first generation(s), the learned concept describes all pairs for 5 out of 10 folds, and this concept is found between the third and fifth generation. The 5 remaining folds also converge to a fixed concept (in different logical expressions) in the sixth generation at the latest; however, in 4 of them one or two pairs are not covered, respectively, and in the last one a non-pair was covered mistakenly.

If in the first generation all possible crossovers are generated, and thus in particular the crossovers $c_1 = 1 \wedge c_2 = 1$ and $c_1 = 2 \wedge c_2 = 2$, the perfect theory can be found in all 10 folds. 6 of these folds need four generations to do so, the remaining 4 find it already in the second generation. Interestingly, one of them also finds the minimal CNF in addition to the minimal DNF.

Finally, if all crossovers are generated in the second generation as well, both the minimal DNF and CNF are found in all folds.

Pairs within three cards. The next dataset is already more complex; three cards and four ranks lead to 12 pair combinations, which need a length of at least 20 literals if using an arbitrarily nested logical expression and a length of at least 24 literals if using a DNF.

Even with problems of this size, the presented simple genetic algorithm cannot find an expression covering all pairs in any fold. The percentage of covered pairs ranges from 56% to 92% and is approximately equally distributed; the same holds for the number of generations, which lies between 4 and 7.

As discussed in Section 3, already a single generation with all possible crossovers can fix this problem. Even if the population size is too small to keep all crossovers, it is still large enough so that only irrelevant crossovers (those taking the suit attributes into account) are removed immediately. In all folds, a concept covering all pairs is learned in the fifth or sixth generation.

The average number of generations needed as well as the overall complexity of the found concepts can be slightly decreased by computing all crossovers for a second generation. Surprisingly, though, this approach leads to incomplete concepts in two folds, which are missing 1 of the 12 pair combinations. This indicates that an exhaustive search over multiple generations leads to too many similar individuals, which prevent other, diverse individuals from being included in the population and used for further concepts.

Pairs within four cards. In the dataset with four cards, we used six ranks, summing up to 36 pair combinations. While a minimal DNF needs at least 72 literals, with deeper logical expressions as sketched in Equations (4) and (5) only 48 literals are needed. However, this requires finding suitable disjunctions in the first generation.

The results for the dataset with four cards are similar to those for the previous one with three cards, but with even more significant differences between the three settings. The performance of the pure genetic algorithm setting drops to a pair coverage between 24% and 49%, and it converges in an even later generation than in the smaller datasets. This can again be fixed by a single exhaustive search for crossovers and a subsequent genetic algorithm, covering between 79% and 93% of the pairs. However, similar to the dataset with three cards, if all crossovers are also computed in the subsequent generation, the convergence of the algorithm is sped up but the pair coverage decreases to percentages between 52% and 77%.

While the best performance is achieved by the algorithm using a single generation with an exhaustive crossover search, all resulting models have in common that they learn DNFs instead of creating intermediate disjunctions before conjoining them. To investigate this further, we refine the dataset in the following paragraph.

Subset of pairs within four cards. In the last experiment, we want to focus on a different aspect: the capability of the learner to extend the concept to pairs that are not covered in the training data, as described at the end of Section 2. We consider pairs with cards of rank 6 and ensure that only five out of six pair combinations are part of the training data. As a consequence, state-of-the-art DNF rule learners like Ripper are only capable of detecting these five pair combinations and combining them disjunctively:

$$(c_1 = 6 \wedge c_2 = 6) \vee (c_1 = 6 \wedge c_3 = 6) \vee (c_2 = 6 \wedge c_3 = 6) \vee (c_2 = 6 \wedge c_4 = 6) \vee (c_3 = 6 \wedge c_4 = 6). \qquad (9)$$

In this representation, the remaining pair combination $c_1 = 6 \wedge c_4 = 6$ remains uncovered. However, we have seen in Equations (4) and (5) that even smaller concepts can be learned, and those possibly also cover the missing pair combination. We verify whether similar models can indeed be learned with the suggested approach. Starting with the pure genetic algorithm, the learner does not even reliably find all pair combinations. In four out of ten folds, parts of the concepts additionally use suit attributes in the conjunctions, which left up to 37 of 810 pair instances uncovered.

By generating all crossovers in the first generation, the five pair combinations are found and combined into a DNF similar to Equation (9) within five generations, i.e., one iteration more than in the optimal logarithmic case. Therefore, all training instances containing one of these five pair combinations are covered, but the test instances with the remaining sixth pair combination remain disregarded.

This behavior changes if the second generation computes all crossovers as well. Figure 1 lists the best ten individuals found in the first and second generation. While in the first generation this list only consists of conjunctions that already cover only pairs, in the second generation the best three individuals are conjunctions of disjunctions, i.e., crossovers of individuals that have not been among the best ten in the first generation. In particular, the second and third individual evaluate better than all possible individuals in DNF and at the same time still cover the missing sixth pair combination.

Generation: 1  Max fitness: 0,9503
0,9503 (164|0) c1=6 && c2=6 -> class=1
0,9503 (164|0) c1=6 && c3=6 -> class=1
0,9497 (162|0) c2=6 && c3=6 -> class=1
0,9491 (160|0) c4=6 && c2=6 -> class=1
0,9491 (160|0) c4=6 && c3=6 -> class=1
0,3822 (  4|0) c3=5 && c4=5 -> class=1
0,3822 (  4|0) c2=5 && c4=5 -> class=1
0,3822 (  4|0) c4=1 && c1=1 -> class=1
0,3822 (  4|0) c2=1 && c4=1 -> class=1
0,3822 (  4|0) c3=1 && c4=1 -> class=1

Generation: 2  Max fitness: 0,9869
0,9869 (648|0) (c1=6 || c4=6) && (c2=6 || c3=6) -> class=1
0,9826 (486|0) (c1=6 || c2=6) && (c4=6 || c3=6) -> class=1
0,9826 (486|0) (c1=6 || c3=6) && (c4=6 || c2=6) -> class=1
0,9744 (328|0) (c1=6 && c2=6 || c1=6 && c3=6) -> class=1
0,9744 (328|0) c1=6 && (c2=6 || c3=6) -> class=1
0,9743 (326|0) (c1=6 && c2=6 || c2=6 && c3=6) -> class=1
0,9743 (326|0) (c1=6 && c3=6 || c2=6 && c3=6) -> class=1
0,9743 (326|0) (c1=6 || c3=6) && c2=6 -> class=1
0,9743 (326|0) (c1=6 || c2=6) && c3=6 -> class=1
0,9741 (324|0) (c1=6 && c2=6 || c4=6 && c2=6) -> class=1

Figure 1: Best ten individuals in the first and second generation, respectively.



Based on these individuals, a concept covering all pairs of the training set can be found already in the third generation for five of the folds, and in the fourth generation for the remaining five folds. The percentage of individuals covering the missing sixth pair combination ranges from 13% to 31%. While this number is still low, it is worthwhile to mention that, in comparison to a DNF rule learner, our learner is capable of generating such individuals, and in comparison to the other two versions it also actually finds them.

5. Conclusion

In this work, we dealt with the question of when conventional DNF rule learners reach their limits. We conducted this investigation on the basis of a poker pairs-dataset and showed that deep rule theories can both describe the pair concept in a more concise form and generalize the concept better. To investigate this behavior in practical experiments, we created a simple variant of a genetic algorithm, which uses conjunctive and disjunctive crossovers to build deep rule concepts out of single features. We also implemented the possibility to generate all possible crossovers at the beginning of the algorithm, which can then be seen as a local search approach.

While the presented combination of a local search and a genetic algorithm is not yet mature enough to be successfully applied to arbitrary real-world datasets, through its application to different versions of the poker pairs-dataset we could identify some interesting properties of the deep rule theories learned this way. An exhaustive search for crossovers in the first generation was needed to reliably find concepts covering all or at least most pairs. However, an additional exhaustive search in the subsequent generation led to too many similar individuals and resulted in a worse performance for the chosen parameters. Finally, the last experiment on a subset of the poker pairs-dataset with four cards showed that the presented learner is indeed capable of finding the more compact and better generalizing model described at the beginning of this paper, which cannot be learned by state-of-the-art rule learners, and it highlights the importance of subsequent combinations of disjunctions and conjunctions to do so.

6. Future Work

In future work, further refinements of the genetic algorithm are necessary and partly already implemented. Most emphasis should be placed on the handling of diversity, which can, e.g., either be handled by a separate reserve population that focuses on covering a wider variety of examples instead of covering as many examples as possible, or by penalizing the evaluation of individuals if their Jaccard similarity is too high. Additionally, a penalty for the rule length might force the algorithm to prefer the most compact representation among those covering the same instances. Last but not least, the idea of mutations might be refined for the rule learning setting. Features on all levels of the logical expression could be removed, replaced or augmented, and conjunctions could be interchanged with disjunctions and vice versa. For all these suggestions, a further in-depth analysis is needed though.
References
 [1] W. W. Cohen, Fast effective rule induction, in:
     A. Prieditis, S. Russell (Eds.), Proceedings of the
     12th International Conference on Machine Learning
     (ML-95), Morgan Kaufmann, Lake Tahoe, CA, 1995,
     pp. 115–123.
 [2] J. Fürnkranz, D. Gamberger, N. Lavrač, Foundations
     of Rule Learning, Springer-Verlag, 2012.
 [3] A. Dries, L. D. Raedt, S. Nijssen, Mining predic-
     tive k-CNF expressions, IEEE Transactions on
     Knowledge and Data Engineering 22 (2010) 743–
     748. URL: https://doi.org/10.1109/TKDE.2009.152.
     doi:10.1109/TKDE.2009.152.
 [4] R. J. Mooney, Encouraging experimental results on
     learning CNF, Machine Learning 19 (1995) 79–92.
 [5] E. Hüllermeier, From knowledge-based to data-driven fuzzy modeling – development, criticism, and alternative directions, Informatik Spektrum 38 (2015) 500–509. URL: https://doi.org/10.1007/s00287-015-0931-8. doi:10.1007/s00287-015-0931-8.
 [6] A. Schrijver, Theory of linear and integer program-
     ming, John Wiley & Sons, 1998.
 [7] R. Senge, E. Hüllermeier, Top-down induction of fuzzy pattern trees, IEEE Transactions on Fuzzy Systems 19 (2011) 241–252. URL: https://doi.org/10.1109/TFUZZ.2010.2093532. doi:10.1109/TFUZZ.2010.2093532.
 [8] V. Q. P. Huynh, J. Fürnkranz, F. Beck, Efficient learning of large sets of locally optimal classification rules, Under review for journal publication, 2022.
 [9] Z.-H. Deng, S.-L. Lv, PrePost+: An efficient N-lists-based algorithm for mining frequent itemsets via children–parent equivalence pruning, Expert Systems with Applications 42 (2015) 5424–5432.
[10] B. Cestnik, Estimating probabilities: A crucial task
     in Machine Learning, in: L. Aiello (Ed.), Proceed-
     ings of the 9th European Conference on Artificial
     Intelligence (ECAI-90), Pitman, Stockholm, Sweden,
     1990, pp. 147–150.
[11] J. Fürnkranz, P. A. Flach, ROC 'n' rule learning – Towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.