                      Beyond DNF: First Steps towards Deep Rule Learning

                                                   Florian Beck, Johannes Fürnkranz

                                      Institute for Application-oriented Knowledge Processing (FAW)
                                                      Department of Computer Science
                                                  Johannes Kepler University Linz, Austria
                                                      {fbeck,juffi}@faw.jku.at

Abstract: Inductive rule learning is arguably among the most traditional paradigms in machine learning. Although we have seen considerable progress over the years in learning rule-based theories, all state-of-the-art learners still learn descriptions that directly relate the input features to the target concept. It could nevertheless be the case that more structured representations, which form deep theories by forming intermediate concepts, are easier to learn, in very much the same way as deep neural networks are able to outperform shallow networks, even though the latter are also universal function approximators. In this paper, we investigate networks with weights and activations limited to the values 0 and 1. In the absence of a powerful algorithm that optimizes deep rule sets, we empirically compare deep and shallow rule networks with a uniform general algorithm, which relies on greedy mini-batch based optimization. Our experiments on both artificial and real-world benchmark data indicate that deep rule networks may outperform shallow networks.


1    Introduction

Dating back to the AQ algorithm [13], inductive rule learning is one of the most traditional fields in machine learning. However, when reflecting upon its long history [7], it can be argued that while modern methods are somewhat more scalable than traditional rule learning algorithms [see, e.g., 16, 10], no major breakthrough has been made. In fact, the RIPPER rule learning algorithm [5] is still very hard to beat in terms of both accuracy and simplicity of the learned rule sets. All these algorithms, traditional or modern, typically provide flat lists or sets of rules, which directly relate the input variables to the desired output. In concept learning, where the goal is to learn a set of rules that collectively describe the target concept, the learned set of rules can be considered as a logical expression in disjunctive normal form (DNF), in which each conjunction forms a rule that predicts the positive class.

In this paper, we argue that one of the key factors for the strength of deep learning algorithms is that latent variables are formed during the learning process. However, while neural networks excel at implementing this ability in their hidden layers, which can be effectively trained via backpropagation, there is essentially no counterpart to this ability in inductive rule learning. We therefore set out to verify the hypothesis that deep rule structures might be easier to learn than flat rule sets, in very much the same way as deep neural networks achieve better performance than single-layer networks [12]. Note that this is not obvious: in principle, every logical formula can be represented as a DNF expression, which corresponds to a flat rule set, in the same way as, in principle, one (sufficiently large) hidden layer is sufficient to approximate any function with a neural network [9]. As no direct comparison is possible because of the lack of a powerful algorithm for learning deep rule sets, our tool of choice is a simple stochastic optimization algorithm that optimizes a rule network of a given size. While this does not quite reach state-of-the-art performance (in either setting, shallow or deep), it nevertheless allows us to gain some insights into these settings. We test on both real-world UCI benchmark datasets and artificial datasets for which we know the underlying target concept representations.

The remainder of the paper is organized as follows: Sect. 2 elaborates why deep rule learning is of particular interest and refers to related work. We propose a new network approach in Sect. 3 and test it in Sect. 4. The results are summarized in Sect. 5, followed by possible future extensions and improvements in Sect. 6. An extended version of this paper containing an elaborate discussion of related work, more details on the methods, and additional experiments is available as [2].

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


2     Deep Rule Learning

Rule learning algorithms typically provide flat lists that directly relate the input to the output. Consider, e.g., the parity concept, which is known to be hard to learn for heuristic, greedy learning algorithms: it checks whether an odd or an even number of R relevant attributes (out of a possibly higher total number of attributes) are set to true. Figure 1a shows a flat rule-based representation¹ of the target concept for R = 5, which requires 2^(R−1) = 16 rules. On the other hand, a structured representation, which introduces three auxiliary predicates (parity2345, parity345 and parity45, as shown in Figure 1b), is much more concise, using only 2 · (R − 1) = 8 rules.

¹ We use a Prolog-like notation for rules, where the consequent (the head of the rule) is written on the left and the antecedent (the body) is written on the right. For example, the first rule reads as: If x1, x2, x3 and x4 are all true and x5 is false, then parity holds.
parity :-     x1,     x2,     x3,     x4, not x5.
parity :-     x1,     x2, not x3, not x4, not x5.
parity :-     x1, not x2,     x3, not x4, not x5.
parity :-     x1, not x2, not x3,     x4, not x5.
parity :- not x1,     x2, not x3,     x4, not x5.
parity :- not x1,     x2,     x3, not x4, not x5.
parity :- not x1, not x2,     x3,     x4, not x5.
parity :- not x1, not x2, not x3, not x4, not x5.
parity :-     x1,     x2,     x3, not x4,     x5.
parity :-     x1,     x2, not x3,     x4,     x5.
parity :-     x1, not x2,     x3,     x4,     x5.
parity :- not x1,     x2,     x3,     x4,     x5.
parity :- not x1, not x2, not x3,     x4,     x5.
parity :- not x1, not x2,     x3, not x4,     x5.
parity :- not x1,     x2, not x3, not x4,     x5.
parity :-     x1, not x2, not x3, not x4,     x5.

          (a) A flat unstructured rule set for the parity concept

parity45    :-     x4,     x5.
parity45    :- not x4, not x5.

parity345   :-     x3, not parity45.
parity345   :- not x3,     parity45.

parity2345  :-     x2, not parity345.
parity2345  :- not x2,     parity345.

parity      :-     x1, not parity2345.
parity      :- not x1,     parity2345.

          (b) A deep structured rule base for parity using three auxiliary predicates

                Figure 1: Unstructured and structured rule sets for the parity concept.
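
To make the size difference concrete, the following Python sketch (ours, not part of the paper's implementation) evaluates both representations of Figure 1: the flat version enumerates all 16 conjunctions of the DNF, while the deep version chains the auxiliary predicates parity45, parity345 and parity2345; the two agree on all 32 possible inputs.

    from itertools import product

    def parity_flat(x1, x2, x3, x4, x5):
        # flat DNF of Figure 1a: 2^(R-1) = 16 conjunctions, one per even assignment
        terms = [t for t in product([True, False], repeat=5) if sum(t) % 2 == 0]
        x = (x1, x2, x3, x4, x5)
        return any(all(xi == ti for xi, ti in zip(x, t)) for t in terms)

    def parity_deep(x1, x2, x3, x4, x5):
        # deep rule base of Figure 1b: 2*(R-1) = 8 rules over auxiliary predicates
        parity45 = (x4 and x5) or (not x4 and not x5)
        parity345 = (x3 and not parity45) or (not x3 and parity45)
        parity2345 = (x2 and not parity345) or (not x2 and parity345)
        return (x1 and not parity2345) or (not x1 and parity2345)

    # both representations agree on all 2^5 = 32 possible inputs
    assert all(parity_flat(*x) == parity_deep(*x)
               for x in product([True, False], repeat=5))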


We argue that the parsimonious structure of the latter could be easier to learn because it uses only a linear number of rules and slowly builds up the complex target concept parity from the smaller subconcepts parity2345, parity345 and parity45.

To motivate this, we draw an analogy to neural network learning and view rule sets as networks. Conventional rule learning algorithms learn a flat rule set of the type shown in Figure 1a, which may be viewed as a concept description in disjunctive normal form (DNF): each rule body corresponds to a single conjunct, and these conjuncts are connected via a disjunction (each positive example must be covered by one or more of these rule bodies). This situation is illustrated in Figure 2a, where the 5 input nodes are connected to 16 hidden nodes - one for each of the 16 rules that define the concept - and these are then connected to a single output node. Analogously, the deep parity rule set of Figure 1b may be encoded into a deeper network structure as shown in Figure 2b. Clearly, the deep network is more compact and considerably sparser in the number of edges. Of course, we need to take into consideration that the optimal structure is not known beforehand and presumably needs to emerge from a fixed network structure that offers the possibility for some redundancy, but we nevertheless expect that such structured representations offer similar advantages as deep neural networks offer over single-layer networks.

It is important to note that deep structures do not increase the expressiveness of the learned concepts. Any formula in propositional logic (and we limit ourselves to propositional logic in this project) can be converted to a DNF formula. In the worst case (a so-called full DNF), each of the input variables appears exactly once in each of the conjuncts, which essentially corresponds to enumerating all the positive examples. Thus, the number of conjuncts in a DNF encoding of the inputs may grow exponentially with the number of input features. This is in many ways analogous to the universal approximation theorem [9], which essentially states that any continuous function can be approximated arbitrarily closely with a shallow neural network with a single hidden layer, provided that the size of this layer is not bounded. So, in principle, deep neural networks are not necessary, and indeed, much of the neural network research in the 90s concentrated on learning such two-layer networks. Nevertheless, we have now seen that deep neural networks are easier to train and often yield better performance, presumably because they require exponentially fewer parameters than shallow networks [12]. In the same way, we expect that deep logical structures will yield more efficient representations of the captured knowledge and might be easier to learn than flat DNF rule sets.

3     Deep Rule Networks

For our studies of deep and shallow rule learning, we define rule-based theories in a networked structure, which we describe in the following. We build upon the shallow two-level networks we have previously used for experimenting with mini-batch rule learning [1], but generalize them from a shallow DNF structure to deeper networks.

3.1   Network Structure

A conventional rule set consisting of multiple conjunctive rules that define a single target concept corresponds to a logical expression in disjunctive normal form (DNF). An equivalent network consists of three layers: the input layer, one hidden layer (= AND layer) and the output layer (= OR layer), as, e.g., illustrated in Figure 2a. The input layer receives one-hot-encoded nominal attribute-value pairs as binary features (= literals), the hidden layer conjoins these literals into rules, and the output layer disjoins the rules into a rule set.
           (a) shallow representation                                      (b) deep representation

Figure 2: Network representations of the parity rule sets of Figure 1. Red connections are logical ANDs, green edges
correspond to logical ORs.


The network is designed for binary classification problems and produces a single prediction output that is true if and only if an input sample is covered by any of the rules in the rule set.

To generalize this structure to deeper networks, we need to define multiple layers. While the input layer and the output layer remain the same, the number and the size of the hidden layers can be chosen arbitrarily. Note that we can still emulate a shallow DNF structure by choosing a single hidden layer. In the more general case, the hidden layers are treated alternately as conjunctive and disjunctive layers. We focus on layer structures starting with a conjunctive hidden layer and ending with a disjunctive output layer, i.e., networks with an odd number of hidden layers. In this way, the output is easier to compare with rule sets in DNF. Furthermore, the closer we are to the output layer, the more extensive are the rules and rule sets, and the smaller is the chance to form new combinations from them that are neither tautological nor contradictory. As a consequence, the number of nodes per hidden layer should be lower the closer it is to the output layer. This makes the network shaped like a funnel.

3.2   Network Weights and Initialization

In the following, we assume the network to have n + 2 layers, with each layer i containing s_i nodes. Layer 0 corresponds to the input layer with s_0 = |x| and layer n + 1 to the output layer with s_{n+1} = 1. Furthermore, a weight w^(i)_jk is identified by the layer i it belongs to, the node j from which it receives the output, and the node k in the successive layer i + 1 to which it passes the activation. Thus, the weights of each layer can be represented by an s_i × s_{i+1}-dimensional matrix W^(i) = [w^(i)_jk]. In total, there are ∑_{i=0}^{n} s_i s_{i+1} Boolean weights which have to be learned, i.e., have to be set to true (resp. 1) or false (resp. 0). If weight w^(i)_jk is set to true, this means that the output of node j is used in the conjunction (if i mod 2 = 0) or disjunction (if i mod 2 = 1) that defines node k. If it is set to false, this output is ignored by node k.

In the beginning, these weights need to be initialized. The initialization process is influenced by two hyperparameters: the average rule length l̄ and the initialization probability p, where l̄ only affects the number of weights that are set to 1 in the first layer. Here we use the additional information which literals belong to the same attribute to avoid immediate contradictions within the first conjunction. Let |A| be the number of attributes; then each attribute is selected with probability l̄/|A|, so that on average the weights corresponding to l̄ literals of different attributes are set to true. In the remaining layers, the weights are set to true with probability p. Additionally, at least one outgoing weight from each node will be set to true to ensure connectivity. This implies that, regardless of the choice of p, all the weights in the last layer will always be initialized with true, because there is only one output node. Note that, as a consequence, shallow DNF-structured networks are not influenced by the choice of p, since they only consist of the first layer, which is influenced by l̄, and the last layer, which is initialized with true.
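
To illustrate this initialization scheme, the following sketch is our own simplification (the function name init_weights and the grouping of literals are assumptions, not the authors' code); it covers the l̄/|A| attribute selection in the first layer, the Bernoulli(p) weights in the remaining layers, and the connectivity guarantee.

    import numpy as np

    def init_weights(layer_sizes, n_attributes, avg_rule_length=2.0, p=0.05, rng=None):
        # Boolean weight matrices W[i] of shape (s_i, s_{i+1}); the input layer is
        # assumed to hold one-hot literals grouped by attribute (here: 2 per attribute)
        rng = np.random.default_rng() if rng is None else rng
        weights = []
        for i in range(len(layer_sizes) - 1):
            s_in, s_out = layer_sizes[i], layer_sizes[i + 1]
            W = np.zeros((s_in, s_out), dtype=bool)
            if i == 0:
                # first (conjunctive) layer: pick each attribute with prob. l_bar / |A|,
                # then one of its literals, so conjunctions are never contradictory
                lits_per_attr = s_in // n_attributes
                for k in range(s_out):
                    for a in np.flatnonzero(rng.random(n_attributes) < avg_rule_length / n_attributes):
                        W[a * lits_per_attr + rng.integers(lits_per_attr), k] = True
            else:
                # remaining layers: independent Bernoulli(p) weights
                W = rng.random((s_in, s_out)) < p
            # connectivity: every node gets at least one outgoing weight
            # (with a single output node this sets the whole last layer to true)
            for j in range(s_in):
                if not W[j].any():
                    W[j, rng.integers(s_out)] = True
            weights.append(W)
        return weights

    # example: 10 attributes -> 20 literals, funnel s = [32, 16, 8, 4, 2] plus one output node
    weights = init_weights([20, 32, 16, 8, 4, 2, 1], n_attributes=10)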
3.3   Prediction

The prediction of the network can be efficiently computed using binary matrix multiplications (denoted by ⊙ in the following). In each layer i, the input activations A^(i) are multiplied with the corresponding weights W^(i) and aggregated at the receiving node in layer i + 1. If the aggregation is disjunctive, this directly corresponds to a binary matrix multiplication. According to De Morgan's law, a ∧ b = ¬(¬a ∨ ¬b) holds. This means that binary matrix multiplication can also be used in the conjunctive case, provided that the inputs and outputs are negated before and after the multiplication. Because of the alternating sequence of conjunctive and disjunctive layers, binary matrix multiplications and negations are also always alternated when passing data through the network, so that a binary matrix multiplication followed by a negation can be considered as a NOR-node. Thus, the activations A^(i+1) can be computed from the activations in the previous layer as

                   A^(i+1) ←− Ã^(i) ⊙ W^(i)                (1)

where X̃ = J − X denotes the element-wise negation of a matrix X (J denotes a matrix of all ones). Hence, internally, we do not distinguish between conjunctive and disjunctive layers within the network, but have a uniform network structure consisting only of NOR-nodes. However, for the sake of ease of interpretation, we chose to represent the networks as alternating AND and OR layers.

In the first layer, we have the choice whether to start with a disjunctive layer or a conjunctive one, which can be controlled by simply using the original input vector (A^(0) = x) or its negation (A^(0) = x̃) as the first layer. Also, if the last layer is conjunctive, an additional negation must be performed at the end of the network so that the output has the same "polarity" as the target values. In our experiments, we always start with a conjunctive and end with a disjunctive layer. In this way, the rule networks can be directly converted into conjunctive rule sets.
3.4   Training

Following [1], we implement a straightforward mini-batch based greedy optimization scheme. While the number, the arrangement, and the aggregation types of the nodes remain unchanged, the training process flips the weights of the network to optimize its outcome. Flipping a weight from 0 to 1 (or vice versa) can be understood as a single addition (or removal) of a literal to the conjunction or disjunction encoded by the following node. After the initialization, the base accuracy on the complete training set and the initial weights are stored, and they are subsequently updated every time the predicted accuracy on the training set exceeds the previous maximum after processing a mini-batch of training examples. However, the predictive performance does not necessarily increase monotonically, since the accuracy is optimized not on the whole training set, but on a mini-batch. For all layers and nodes, possible flips are tried and evaluated, and the flip with the biggest improvement of the accuracy on the current mini-batch is selected. These greedy adjustments are repeated until either no flip improves the accuracy on the mini-batch or a maximum number of flips is reached, which ensures that the network does not overfit the mini-batch data.

When all mini-batches are processed, the procedure is repeated for a fixed number of epochs. Only the composition of the mini-batches is changed in each epoch, by shuffling the training data before proceeding. After all epochs, the weights of the network are reset to the optimum found so far, and a final optimization on the complete training set is conducted to eliminate any overfitting on mini-batches. The returned network can then be used to predict the outcomes of any further test instances.
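
The following condensed sketch of the greedy flip search is our own reimplementation of the described scheme (function names are ours; the bookkeeping of the best weights over the full training set and the final full-batch optimization are omitted for brevity):

    import numpy as np

    def accuracy(weights, A0, y):
        A = np.asarray(A0, dtype=bool)
        for W in weights:                        # NOR-style forward pass, cf. Eq. (1)
            A = (~A).astype(np.uint8) @ W.astype(np.uint8) > 0
        return (A[:, 0] == y).mean()

    def greedy_flip_epoch(weights, A0, y, batch_size=50, max_flips=None):
        # one epoch: for every mini-batch, repeatedly apply the single weight flip
        # with the largest accuracy gain on that mini-batch until no flip helps
        A0, y = np.asarray(A0, dtype=bool), np.asarray(y)
        order = np.random.permutation(len(y))    # new mini-batch composition each epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            Ab, yb = A0[idx], y[idx]
            flips = 0
            while max_flips is None or flips < max_flips:
                base = accuracy(weights, Ab, yb)
                best_gain, best = 0.0, None
                for i, W in enumerate(weights):  # try every single-weight flip
                    for j in range(W.shape[0]):
                        for k in range(W.shape[1]):
                            W[j, k] = not W[j, k]
                            gain = accuracy(weights, Ab, yb) - base
                            W[j, k] = not W[j, k]          # undo the tentative flip
                            if gain > best_gain:
                                best_gain, best = gain, (i, j, k)
                if best is None:                 # no flip improves this mini-batch
                    break
                i, j, k = best
                weights[i][j, k] = not weights[i][j, k]    # commit the best flip
                flips += 1
        return weights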

4     Experiments

In this section, we present the results of differently structured rule networks on both artificial and real-world UCI datasets, with the goal of investigating the effect of differences in the depth of the networks.

4.1   Artificial Datasets

As many standard UCI databases can be solved with very simple rules [8], we generated artificial datasets with a deep structure that we know can be represented by our network. An artificial dataset suitable for our greedy optimization algorithm should not only include intermediate concepts which are meaningful, but also a strictly monotonically decreasing entropy between these concepts, so that they can be learned in a stepwise fashion in successive layers. One way to generate artificial datasets that satisfy these requirements is to take the output of a randomly generated deep rule network. Subsequently, this training information can be used to see whether the function encoded in the original network can be recovered. Note that such a recovery is also possible for networks with different layer structures. In particular, each of the logical functions encoded in such a deep network can, of course, also be encoded as a DNF expression, so that shallow networks are not at an a priori disadvantage (provided that their hidden layer is large enough, which we ensured in preliminary experiments).

We use a dataset of ten Boolean inputs named a to j and generate all possible 2^10 combinations as training or test samples. These samples are extended by the ten negations ¬a to ¬j via one-hot encoding and finally passed to a funnel-shaped deep rule network with n = 5 and s = [32, 16, 8, 4, 2]. The weights of the network are set by randomly initializing the network and then training it on two randomly selected examples, one assigned to the positive and one to the negative class, to ensure that both a positive and a negative output are possible. If the resulting ratio of positively predicted samples is still less than 20% or more than 80%, the network is reinitialized with a new random seed to avoid extremely imbalanced datasets.
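
The sketch below illustrates this generation procedure in simplified form (it draws the Boolean weights directly at random and skips the intermediate step of training the random network on one positive and one negative example; l_bar = 2 and p = 0.05 are illustrative choices):

    import numpy as np
    from itertools import product

    def generate_dataset(seed, sizes=(20, 32, 16, 8, 4, 2, 1), l_bar=2, p=0.05):
        # label all 2^10 inputs over attributes a..j with a randomly initialized deep
        # rule network, re-drawing the weights until 20%-80% of the samples are positive
        X = np.array(list(product([True, False], repeat=10)))
        A0 = np.hstack([X, ~X])                   # literals a..j and their negations
        rng = np.random.default_rng(seed)
        for _ in range(1000):
            weights = []
            for i in range(len(sizes) - 1):
                if i == 0:                        # attribute-wise first layer, ~l_bar literals per rule
                    W = np.zeros((20, sizes[1]), dtype=bool)
                    for k in range(sizes[1]):
                        for a in np.flatnonzero(rng.random(10) < l_bar / 10):
                            W[a + 10 * rng.integers(2), k] = True
                else:                             # Bernoulli(p) weights elsewhere
                    W = rng.random((sizes[i], sizes[i + 1])) < p
                for j in range(W.shape[0]):       # connectivity guarantee
                    if not W[j].any():
                        W[j, rng.integers(W.shape[1])] = True
                weights.append(W)
            A = A0
            for W in weights:                     # NOR-style forward pass, cf. Eq. (1)
                A = (~A).astype(np.uint8) @ W.astype(np.uint8) > 0
            y = A[:, 0]
            if 0.2 <= y.mean() <= 0.8:
                return A0, y
        raise RuntimeError("no sufficiently balanced labelling found")

    A0, y = generate_dataset(seed=5)              # seed value is illustrative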
4.2   Results on Artificial Datasets

We first conducted a few preliminary experiments on three of the artificial datasets to set suitable default values for the hyperparameters of the deep and shallow networks. The detailed grid search, including figures comparing different hyperparameter settings, is presented in the extended version of this paper [2]. Based on these results, we selected three network versions for the main experiments. As a candidate for shallow networks, we take the best combination of s_1 = 20 and l̄ = 5. For the deep networks, however, we chose the second-best network s = [32, 16, 8, 4, 2] combined with l̄ = 2 and an averaged p = 0.05, since it is almost ten times faster than the best deep network while still reaching an accuracy over 0.895. The third network is chosen as an intermediate stage between the first two: s = [32, 8, 2] combined with l̄ = 3 and p = 0.05. While still being a deep network, the learned rules can be passed to the output layer a little faster. In the following, we refer to these (deep) rule network classifiers by their number of layers, i.e., DRNC(5) for s = [32, 16, 8, 4, 2], DRNC(3) for s = [32, 8, 2] and RNC for s_1 = 20. For computational reasons, all of the reported results were estimated with a 2-fold cross validation. While this may not yield the most reliable estimate on each individual dataset, we nevertheless get a coherent picture over all 20 datasets, as we will see in the following.

In the main experiments, we use a combination of 15 artificial datasets with seeds we already used in the prior hyperparameter grid search and 5 artificial datasets with new seeds to detect potential overfitting on the first datasets. All datasets are tested using five epochs, a batch size of 50, and an unlimited number of flips per batch. We also ensured for all of the generated datasets that the DNF concept does not contain more than 20 rules, so that it can theoretically also be learned by the tested shallow network with s_1 = 20 (and therefore also by the two deep networks, since their first layer is already bigger).

Figure 3: Average accuracy of rule networks with 1/3/5 layers on the training set.

Figure 3 shows the development of the accuracies on the training set, averaged over all 20 datasets, over the number of processed mini-batches, whereby after every ten mini-batches a new epoch starts. The base accuracy before processing the first mini-batch and the accuracy after the full-batch optimization are omitted. We can see that the deep networks not only deliver higher accuracies but also converge slightly faster than the shallow one. The orange curve of DRNC(3) runs a little higher than the blue one of DRNC(5), whereas the green curve of RNC has some distance to them, especially during the first two epochs.

Table 1 shows the accuracies of the three networks. For each dataset, the best accuracy of the three network classifiers is highlighted in bold. We can see a clear advantage for the two deep networks, both when considering the average accuracy and the number of highest accuracies. The results clearly show that the best performing deep networks outperform the best performing shallow network in all but 4 of the 20 generated datasets. Both the average rank and the average accuracy of the deep networks are considerably better than the corresponding values for RNC. This also holds for pairwise comparisons of the columns (DRNC(5) vs. RNC 15:5, DRNC(3) vs. RNC 15:5).

Figure 4: Critical distance diagram for the rule networks on the artificial datasets with a significance level of 95%. Learners that are not statistically different from one another are connected.

The Friedman test for the ranks yields a significance of more than 95%. A subsequent Nemenyi test delivers a critical distance of 0.741 (95%) or 0.649 (90%), which shows that DRNC(3) and RNC are significantly different at a level of more than 95%, and DRNC(5) and RNC at a level of more than 90%. The corresponding critical distance diagram (CD = 0.741) is shown in Figure 4. We thus find it safe to conclude that deep networks outperform shallow networks on these datasets.
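
For reference, the critical distance of the Nemenyi test follows the standard formula CD = q_α · sqrt(k(k+1)/(6N)) with k = 3 learners and N = 20 datasets; the short sketch below reproduces the values 0.741 and 0.649 quoted above (the q_α constants are the usual tabulated values for k = 3, and the Friedman test itself is available in SciPy):

    import numpy as np
    from scipy.stats import friedmanchisquare   # Friedman test over per-dataset scores

    def nemenyi_cd(k, n_datasets, q_alpha):
        # critical distance of the Nemenyi post-hoc test for average ranks
        return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

    print(nemenyi_cd(3, 20, q_alpha=2.343))      # 0.741 at alpha = 0.05
    print(nemenyi_cd(3, 20, q_alpha=2.052))      # 0.649 at alpha = 0.10

    # with drnc5, drnc3, rnc holding the 20 accuracies of Table 1 (placeholders here):
    # statistic, p_value = friedmanchisquare(drnc5, drnc3, rnc)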
In the two right-most columns of Table 1 we also show a comparison to the state-of-the-art rule learner RIPPER [5] and the decision tree learner CART [4] in Python implementations using default parameters.² We see that all network approaches are outperformed by the RIPPER and CART classifiers with default settings. The difference between RIPPER and DRNC(3) is approximately the same as the difference between DRNC(3) and RNC.

² We used the implementations available from https://pypi.org/project/wittgenstein/ and https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
Table 1: Accuracies on artificial datasets. Rule networks with 1/3/5 layers vs. RIPPER vs. CART. The best accuracy of the rule networks is marked in bold, the overall best accuracy per dataset is marked in italic.

                 seed    %(+)           DRNC(5)       DRNC(3)       RNC            RIPPER         CART
                 5        0.4453          0.958         0.9863        0.9531           0.9805       0.9844
                 16       0.7959          0.9639        0.9707        0.9629           0.9766       0.9551
                 19       0.6562          1             0.9902        0.9746           1            1
                 24       0.584           0.9053        0.9043        0.916            0.9463       0.9404
                 36       0.6943          0.8828        0.9209        0.9043           0.8867       0.9111
                 44       0.7939          0.9629        0.9551        0.9326           0.9482       0.9697
                 53       0.6055          0.9805        0.9805        0.9775           0.9746       0.9824
                 57       0.7705          0.9824        0.9736        0.9639           0.9951       0.9902
                 60       0.7715          0.9443        0.9453        0.9209           0.958        0.9883
                 65       0.5312          0.9854        0.9688        0.9414           0.9961       0.9922
                 68       0.5654          0.9248        0.9443        0.9619           0.9688       0.9355
                 69       0.6924          0.9551        0.9658        0.9199           0.9795       0.9717
                 70       0.6338          0.9014        0.9062        0.9229           0.9111       0.8984
                 81       0.5684          0.9004        0.9131        0.8857           0.9248       0.9756
                 82       0.7188          0.9941        0.998         0.9717           1            1
                 85       0.5312          1             0.998         0.9736           1            1
                 89       0.6084          0.8926        0.9434        0.9629           0.9502       0.9395
                 107      0.6172          0.8965        0.873         0.8643           0.9043       0.9277
                 112      0.7549          0.9346        0.9248        0.9189           0.9082       0.9561
                 118      0.5957          0.9688        0.9414        0.9434           0.9736       0.9688
                 Ø Accuracy               0.9467        0.9502        0.9386           0.9591       0.9644
                 Ø Rank                   1.775         1.725         2.5


However, considering that we only use a naïve greedy algorithm, it could not be expected (and was also not our objective) to beat state-of-the-art rule learners. In particular, the runtime is far from state-of-the-art: the shallow network already needs about 30 seconds per dataset and the deep networks up to three minutes (in comparison to less than a second for RIPPER and CART). Furthermore, the results also confirm that shallow rule learners (of which both RIPPER and CART are representatives) had no disadvantage by the way we generated the datasets.
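
For completeness, the baselines from footnote 2 can be run roughly along the following lines. This is a hedged sketch, assuming that wittgenstein's RIPPER accepts array inputs with an explicit pos_class in its scikit-learn-style fit/score interface, and it uses randomly generated placeholder data instead of the actual UCI datasets:

    import numpy as np
    import wittgenstein as lw                           # pip install wittgenstein
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    # placeholder data standing in for a one-hot-encoded UCI dataset
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(500, 20))
    y = (X[:, 0] & X[:, 1]) | X[:, 2]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    cart = DecisionTreeClassifier()                     # sklearn's CART, default parameters
    cart.fit(X_tr, y_tr)
    print("CART  ", cart.score(X_te, y_te))

    ripper = lw.RIPPER()                                # wittgenstein's RIPPER, default parameters
    ripper.fit(X_tr, y_tr, pos_class=1)
    print("RIPPER", ripper.score(X_te, y_te))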

4.3   Results on UCI Datasets

To estimate how the rule networks perform on real-world datasets, we select nine classification datasets (car-evaluation, connect-4, kr-vs-kp, monk 1-3, mushroom, tic-tac-toe and vote) from the UCI Repository [6]. They differ in the number of attributes and instances, but have in common that they consist only of nominal attributes. Car-evaluation and connect-4 are actually multi-class datasets and are therefore converted into the binary classification problem of whether a sample belongs to the most frequent class or not. For all binary classification problems, the networks to be tested again treat the more common class as the positive class and the less common one as the negative class, except for the monk datasets, where the positive class is set to 1. As with the artificial datasets, we additionally compare the performance of the networks to RIPPER and CART, and again all accuracies are obtained via 2-fold cross validation. In case a random initialization did not yield any result (i.e., the resulting network classified all examples into a single class), we re-initialized with a different seed (this happened once for both deep network versions).

The results are shown in Table 2. We can again observe that the deep 5-layer network DRNC(5) outperforms the shallow network RNC. Of all rule networks, DRNC(5) provides the highest accuracy on the connect-4, monk-1, monk-3, mushroom and vote datasets, whereas DRNC(3) performs best on car-evaluation and monk-2, and RNC on kr-vs-kp and tic-tac-toe. The latter two datasets are also interesting: tic-tac-toe clearly does not require a deep structure, because for solving it, the learner essentially needs to enumerate all three-in-a-row positions on a 3 × 3 board. This is similar to connect-4, where four-in-a-row positions have to be recognized. However, in the former case, there is only one matching tile for an intermediate concept consisting of two tiles, while in connect-4 there are several, which can potentially be exploited by a deeper network. In the kr-vs-kp dataset, deep structures are also not helpful, because it consists of carefully engineered features for the KRKP chess endgame, which were designed in an iterative process so that the game can be learned with a decision tree learner [15]. It would be an ambitious goal for deep rule learning methods to be able to learn such a dataset from, e.g., only the positions of the chess pieces. This is clearly beyond the state-of-the-art of current rule learning algorithms. The comparison to RIPPER and CART is again clearly in favor of these state-of-the-art algorithms.
Table 2: Accuracies on real-world datasets. Rule networks with 1/3/5 layers vs. RIPPER vs. CART. The best accuracy of the rule networks is marked in bold, the overall best accuracy per dataset is marked in italic.

             dataset              %(+)        DRNC(5)       DRNC(3)      RNC            RIPPER          CART
            car-evaluation       0.7002        0.8999        0.9022       0.8565           0.9838        0.9821
            connect-4            0.6565        0.7728        0.7712       0.7597           0.7475        0.8195
            kr-vs-kp             0.5222        0.9671        0.9643       0.9725           0.9837        0.989
            monk-1               0.5000        1             0.9982       0.9910           0.9478        0.8939
            monk-2               0.3428        0.7321        0.7421       0.7139           0.6872        0.7869
            monk-3               0.5199        0.9693        0.9603       0.9567           0.9386        0.9729
            mushroom             0.784         1             0.978        0.993            0.9992        1
            tic-tac-toe          0.6534        0.8956        0.9196       0.9541           1             0.9217
            vote                 0.6138        0.9655        0.9288       0.9264           0.9011        0.9287
            Ø Rank                             1.556         2            2.444


5   Conclusion

The main objective of this work was to study the question whether deep rule networks have the potential of outperforming shallow DNF rule sets, even though, in principle, every concept can be represented as a DNF formula. As there is no sufficiently competitive deep rule learning algorithm, we proposed a technique with which deep and shallow rule networks can be learned and thus effectively compared in a uniform framework, using a network approach with a greedy optimization algorithm. For both types of networks, we found good hyperparameter settings that allow the networks to reach reasonable accuracies on both artificial and real-world datasets, even though the approach is still outperformed by state-of-the-art learning algorithms such as RIPPER and CART.

Our experiments on both artificial and real-world benchmark data indicate that deep rule networks outperform shallow networks. The deep networks not only obtain a higher accuracy, but also need fewer mini-batch iterations to achieve it. Moreover, in preliminary experiments in the hyperparameter grid search, we have seen indications that deep networks are generally more robust to the choice of the hyperparameters than shallow networks. On the other hand, we also had some cases on real-world datasets where deep networks failed because a poor initialization resulted in indiscriminate predictions.

Overall, we interpret these results as evidence that the investigation of deep rule structures is a promising research goal, which we hope could yield a similar boost in performance in inductive rule learning as could be observed when moving from shallow to deep neural networks. However, this goal is still far ahead.


6   Future Work

In this work, it was not our goal to reach state-of-the-art predictive performance; instead, we wanted to evaluate a very simple greedy optimization algorithm on both shallow and deep networks, in order to get an indication of the potential of deep rule networks. Nevertheless, several avenues for improving our networks have surfaced, which we intend to explore in the near future.

One of the main drawbacks of the presented deep rule networks is the extremely high runtime due to the primitive flipping algorithm. A single flip requires a recalculation of all activations in the network, even though only a few of them are affected by the flip, so the matrix multiplications could be reduced considerably. Conversely, this knowledge can be used to find the small subset of flips that affects a certain activation. On the other hand, the majority of possible flips has no effect on this activation or on the accuracy at all, and this typically remains unchanged after a few more flips are done. Therefore, an exhaustive search over all flips is only needed in the first iteration, while afterwards just a subset of possible flips needs to be considered, which can be built either in a deterministic or a probabilistic way.

Due to the lack of backpropagation, flips are evaluated by their influence on the prediction when executed. However, when looking at a false positive, we can only correct this error by making the overall hypothesis of the network more specific. In order to achieve a generalization of the hypothesis, only flips from false to true in conjunctive layers or flips from true to false in disjunctive layers have to be taken into account. In this way, all flips are split into "generalization-flips" and "specialization-flips", of which only one group has to be considered at a time. This improvement, as well as the above-mentioned selection of a subset of flips, might also allow us to perform two or more flips at the same time, so that a better result than with the greedy approach can be achieved.

An even more promising approach starts one step earlier, in the initialization phase of the network. Instead of specifying the structure of the network and finding optimal initialization parameters l̄ and p for it, a small part of the data could be used to create a rough draft version of the network. The Quine-McCluskey algorithm [11] or RIPPER are suitable methods to generate shallow networks, whereas the ESPRESSO algorithm [3] would generate deep networks.
Decision trees can also be used to generate deep networks, since the contained rules already share some conditions and, moreover, similar subtrees can be merged.

All these approaches share some significant advantages over the network approach we have developed so far. First of all, the decision which class value will be treated as positive or negative no longer has to be made manually. Second, they automatically deliver a suitable initialization of the network, which otherwise would have to be improved by approaches similar to those used in neural networks [e.g., 14] to achieve robust performance. Third, the general structure of the network is not limited to a fixed size and depth in which each node is strictly assigned to a specific layer. Instead of generating nodes that become useless after a few flips have been processed and that should be removed, we can thereby start with a small structure which can be adapted purposefully by copying and mutating good nodes and pruning bad ones. However, it remains unclear whether these changes still lead to improvements in performance or whether the network in the given structure is already optimal.


Acknowledgments. We are grateful to Eneldo Loza Mencía, Michael Rapp, Eyke Hüllermeier, and Van Quoc Phuong Huynh for inspiring discussions, fruitful pointers to related work, and for suggesting the evaluation with artificial datasets.
References

 [1] F. Beck and J. Fürnkranz. An investigation into mini-batch rule learning. In K. Kersting, S. Kramer, and Z. Ahmadi, editors, Proceedings of the 2nd Workshop on Deep Continuous-Discrete Machine Learning (DeCoDeML), 2020. Available as arXiv preprint abs/2106.10202.

 [2] F. Beck and J. Fürnkranz. An empirical investigation into deep and shallow rule learning. arXiv preprint, abs/2106.10254, 2021. Submitted for journal publication.

 [3] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L. Sangiovanni-Vincentelli. Logic Minimization Algorithms for VLSI Synthesis, volume 2 of The Kluwer International Series in Engineering and Computer Science. Springer, 1984.

 [4] L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA, 1984.

 [5] W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proceedings of the 12th International Conference on Machine Learning (ML-95), pages 115–123, Lake Tahoe, CA, 1995. Morgan Kaufmann.

 [6] D. Dua and C. Graff. UCI machine learning repository, 2017.

 [7] J. Fürnkranz, D. Gamberger, and N. Lavrač. Foundations of Rule Learning. Springer-Verlag, 2012.

 [8] R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63–91, 1993.

 [9] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[10] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, and R. Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-16), pages 1675–1684, San Francisco, CA, 2016. ACM.

[11] E. J. McCluskey. Minimization of Boolean functions. The Bell System Technical Journal, 35(6):1417–1444, 1956.

[12] H. Mhaskar, Q. Liao, and T. A. Poggio. When and why are deep networks better than shallow ones? In S. P. Singh and S. Markovitch, editors, Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2343–2349, San Francisco, California, USA, 2017. AAAI Press.

[13] R. S. Michalski. On the quasi-minimal solution of the covering problem. In Proceedings of the 5th International Symposium on Information Processing (FCIP-69), volume A3 (Switching Circuits), pages 125–128, Bled, Yugoslavia, 1969.

[14] E. Z. Ramos, M. Nakakuni, and E. Yfantis. Quantitative measures to evaluate neural network weight initialization strategies. In Proceedings of the IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC), pages 1–7, Las Vegas, NV, USA, 2017. IEEE.

[15] A. D. Shapiro. Structured Induction in Expert Systems. Turing Institute Press. Addison-Wesley, 1987.

[16] T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research, 18:70:1–70:37, 2017.