Beyond DNF: First Steps towards Deep Rule Learning

Florian Beck, Johannes Fürnkranz
Institute for Application-oriented Knowledge Processing (FAW)
Department of Computer Science
Johannes Kepler University Linz, Austria
{fbeck,juffi}@faw.jku.at

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Inductive rule learning is arguably among the most traditional paradigms in machine learning. Although we have seen considerable progress over the years in learning rule-based theories, all state-of-the-art learners still learn descriptions that directly relate the input features to the target concept. It could nevertheless be the case that more structured representations, which form deep theories by forming intermediate concepts, could be easier to learn, in very much the same way as deep neural networks are able to outperform shallow networks, even though the latter are also universal function approximators. In this paper, we investigate networks with weights and activations limited to the values 0 and 1. For lack of a powerful algorithm that optimizes deep rule sets, we empirically compare deep and shallow rule networks with a uniform general algorithm, which relies on greedy mini-batch based optimization. Our experiments on both artificial and real-world benchmark data indicate that deep rule networks may outperform shallow networks.

1 Introduction

Dating back to the AQ algorithm [13], inductive rule learning is one of the most traditional fields in machine learning. However, when reflecting upon its long history [7], it can be argued that while modern methods are somewhat more scalable than traditional rule learning algorithms [see, e.g., 16, 10], no major breakthrough has been made. In fact, the RIPPER rule learning algorithm [5] is still very hard to beat in terms of both accuracy and simplicity of the learned rule sets.

All these algorithms, traditional or modern, typically provide flat lists or sets of rules, which directly relate the input variables to the desired output. In concept learning, where the goal is to learn a set of rules that collectively describe the target concept, the learned set of rules can be considered as a logical expression in disjunctive normal form (DNF), in which each conjunction forms a rule that predicts the positive class.

In this paper, we argue that one of the key factors for the strength of deep learning algorithms is that latent variables are formed during the learning process. However, while neural networks excel in implementing this ability in their hidden layers, which can be effectively trained via backpropagation, there is essentially no counterpart to this ability in inductive rule learning. We therefore set out to verify the hypothesis that deep rule structures might be easier to learn than flat rule sets, in very much the same way as deep neural networks have a better performance than single-layer networks [12]. Note that this is not obvious, because, in principle, every logical formula can be represented with a DNF expression, which corresponds to a flat rule set, in the same way as, in principle, one (sufficiently large) hidden layer is sufficient to approximate any function with a neural network [9]. As no direct comparison is possible because of the lack of a powerful algorithm for learning deep rule sets, our tool of choice is a simple stochastic optimization algorithm to optimize a rule network of a given size. While this does not quite reach state-of-the-art performance (in either setting, shallow or deep), it nevertheless allows us to gain some insights into these settings. We test on both real-world UCI benchmark datasets and artificial datasets for which we know the underlying target concept representations.

The remainder of the paper is organized as follows: Sect. 2 elaborates why deep rule learning is of particular interest and refers to related work. We propose a new network approach in Sect. 3 and test it in Sect. 4. The results are concluded in Sect. 5, followed by possible future extensions and improvements in Sect. 6. An extended version of this paper containing an elaborate discussion of related work, more details on the methods, and additional experiments is available as [2].
2 Deep Rule Learning

Rule learning algorithms typically provide flat lists that directly relate the input to the output. Consider, e.g., the following example: the parity concept, which is known to be hard to learn for heuristic, greedy learning algorithms, checks whether an odd or an even number of R relevant attributes (out of a possibly higher total number of attributes) are set to true. Figure 1a shows a flat rule-based representation¹ of the target concept for R = 5, which requires 2^{R−1} = 16 rules. On the other hand, a structured representation, which introduces three auxiliary predicates (parity2345, parity345 and parity45 as shown in Figure 1b), is much more concise, using only 2·(R−1) = 8 rules. We argue that the parsimonious structure of the latter could be easier to learn because it uses only a linear number of rules, and slowly builds up the complex target concept parity from the smaller subconcepts parity2345, parity345 and parity45.

¹ We use a Prolog-like notation for rules, where the consequent (the head of the rule) is written on the left and the antecedent (the body) is written on the right. For example, the first rule reads as: If x1, x2, x3 and x4 are all true and x5 is false then parity holds.

(a) A flat unstructured rule set for the parity concept:

parity :- x1, x2, x3, x4, not x5.
parity :- x1, x2, not x3, not x4, not x5.
parity :- x1, not x2, x3, not x4, not x5.
parity :- x1, not x2, not x3, x4, not x5.
parity :- not x1, x2, not x3, x4, not x5.
parity :- not x1, x2, x3, not x4, not x5.
parity :- not x1, not x2, x3, x4, not x5.
parity :- not x1, not x2, not x3, not x4, not x5.
parity :- x1, x2, x3, not x4, x5.
parity :- x1, x2, not x3, x4, x5.
parity :- x1, not x2, x3, x4, x5.
parity :- not x1, x2, x3, x4, x5.
parity :- not x1, not x2, not x3, x4, x5.
parity :- not x1, not x2, x3, not x4, x5.
parity :- not x1, x2, not x3, not x4, x5.
parity :- x1, not x2, not x3, not x4, x5.

(b) A deep structured rule base for parity using three auxiliary predicates:

parity45 :- x4, x5.
parity45 :- not x4, not x5.

parity345 :- x3, not parity45.
parity345 :- not x3, parity45.

parity2345 :- x2, not parity345.
parity2345 :- not x2, parity345.

parity :- x1, not parity2345.
parity :- not x1, parity2345.

Figure 1: Unstructured and structured rule sets for the parity concept.
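To make the contrast in Figure 1 concrete, the following Python sketch (ours, not part of the paper) encodes the structured rule base of Figure 1b directly and checks on all 2^5 assignments that it agrees with the flat rule set of Figure 1a, which enumerates the 16 assignments with an even number of true attributes:

```python
from itertools import product

# Deep, structured rule base of Figure 1b: every auxiliary predicate is a
# disjunction of two conjunctive rules, built on top of the previous one.
def parity45(x4, x5):
    return (x4 and x5) or (not x4 and not x5)

def parity345(x3, x4, x5):
    p45 = parity45(x4, x5)
    return (x3 and not p45) or (not x3 and p45)

def parity2345(x2, x3, x4, x5):
    p345 = parity345(x3, x4, x5)
    return (x2 and not p345) or (not x2 and p345)

def parity_deep(x1, x2, x3, x4, x5):
    p2345 = parity2345(x2, x3, x4, x5)
    return (x1 and not p2345) or (not x1 and p2345)

# The flat rule set of Figure 1a enumerates the 16 assignments with an even
# number of true attributes, so it is equivalent to this direct check.
def parity_flat(*xs):
    return sum(xs) % 2 == 0

assert all(parity_deep(*xs) == parity_flat(*xs)
           for xs in product([False, True], repeat=5))
```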
To motivate this, we draw an analogy to neural network learning, and view rule sets as networks. Conventional rule learning algorithms learn a flat rule set of the type shown in Figure 1a, which may be viewed as a concept description in disjunctive normal form (DNF): Each rule body corresponds to a single conjunct, and these conjuncts are connected via a disjunction (each positive example must be covered by one or more of these rule bodies). This situation is illustrated in Figure 2a, where the 5 input nodes are connected to 16 hidden nodes - one for each of the 16 rules that define the concept - and these are then connected to a single output node. Analogously, the deep parity rule set of Figure 1b may be encoded into a deeper network structure as shown in Figure 2b. Clearly, the deep network is more compact and considerably sparser in the number of edges. Of course, we need to take into consideration that the optimal structure is not known beforehand and presumably needs to emerge from a fixed network structure that offers the possibility for some redundancy, but nevertheless we expect that such structured representations offer similar advantages as deep neural networks offer over single-layer networks.

Figure 2: Network representations of the parity rule sets of Figure 1: (a) shallow representation, (b) deep representation. Red connections are logical ANDs, green edges correspond to logical ORs.

It is important to note that deep structures do not increase the expressiveness of the learned concepts. Any formula in propositional logic (and we limit ourselves to propositional logic in this project) can be converted to a DNF formula. In the worst case (a so-called full DNF), each of the input variables appears exactly once in all of the conjuncts, which essentially corresponds to enumerating all the positive examples. Thus, the number of conjuncts in a DNF encoding of the inputs may grow exponentially with the number of input features. This is in many ways analogous to the universal approximation theorem [9], which essentially states that any continuous function can be approximated arbitrarily closely with a shallow neural network with a single hidden layer, provided that the size of this layer is not bounded. So, in principle, deep neural networks are not necessary, and indeed, much of the neural network research in the 90s has concentrated on learning such two-layer networks. Nevertheless, we have now seen that deep neural networks are easier to train and often yield better performance, presumably because they require exponentially fewer parameters than shallow networks [12]. In the same way, we expect that deep logical structures will yield more efficient representations of the captured knowledge and might be easier to learn than flat DNF rule sets.

3 Deep Rule Networks

For our studies of deep and shallow rule learning, we define rule-based theories in a networked structure, which we describe in the following. We build upon the shallow two-level networks we have previously used for experimenting with mini-batch rule learning [1], but generalize them from a shallow DNF-structure to deeper networks.

3.1 Network Structure

A conventional rule set consisting of multiple conjunctive rules that define a single target concept corresponds to a logical expression in disjunctive normal form (DNF). An equivalent network consists of three layers: the input layer, one hidden layer (= AND layer) and the output layer (= OR layer), as, e.g., illustrated in Figure 2a. The input layer receives one-hot-encoded nominal attribute-value pairs as binary features (= literals), the hidden layer conjuncts these literals to rules and the output layer disjuncts the rules to a rule set. The network is designed for binary classification problems and produces a single prediction output that is true if and only if an input sample is covered by any of the rules in the rule set.

For generalizing this structure to deeper networks, we need to define multiple layers. While the input layer and the output layer remain the same, the number and the size of the hidden layers can be chosen arbitrarily. Note that we can still emulate a shallow DNF-structure by choosing a single hidden layer. In the more general case, the hidden layers are treated alternately as conjunctive and disjunctive layers. We focus on layer structures starting with a conjunctive hidden layer and ending with a disjunctive output layer, i.e. networks with an odd number of hidden layers. In this way, the output will be easier to compare with rule sets in DNF. Furthermore, the closer we are to the output layer, the more extensive are the rules and rule sets, and the smaller is the chance to form new combinations from them that are neither tautological nor contradictory. As a consequence, the number of nodes per hidden layer should be lower the closer it is to the output layer. This makes the network shaped like a funnel.
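As an illustration of this layout, such a funnel-shaped rule network can be held as a list of Boolean weight matrices whose logical role alternates between AND and OR. The sketch below is our own illustration (the helper name is made up, not the authors' code):

```python
import numpy as np

def make_network(n_literals, hidden_sizes):
    """Boolean rule network skeleton with layer sizes [n_literals, *hidden_sizes, 1].

    Hidden layer 1 is conjunctive, layer 2 disjunctive, and so on; the single
    output node forms the final disjunction. The weights of layer i form a
    Boolean matrix of shape (s_i, s_{i+1}).
    """
    sizes = [n_literals] + list(hidden_sizes) + [1]
    weights = [np.zeros((sizes[i], sizes[i + 1]), dtype=bool)
               for i in range(len(sizes) - 1)]
    return sizes, weights

# A shallow DNF network (one AND layer with 20 rule nodes) next to the
# funnel-shaped deep network used in Sect. 4, s = [32, 16, 8, 4, 2],
# both over 20 literals (10 Boolean attributes plus their negations).
shallow_sizes, shallow_weights = make_network(20, [20])
deep_sizes, deep_weights = make_network(20, [32, 16, 8, 4, 2])
```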
3.2 Network Weights and Initialization

In the following, we assume the network to have n + 2 layers, with each layer i containing s_i nodes. Layer 0 corresponds to the input layer with s_0 = |x| and layer n + 1 to the output layer with s_{n+1} = 1. Furthermore, a weight w^{(i)}_{jk} is identified by the layer i it belongs to, the node j from which it receives the output, and the node k in the successive layer i + 1 to which it passes the activation. Thus, the weights of each layer can be represented by an s_i × s_{i+1}-dimensional matrix W^{(i)} = [w^{(i)}_{jk}]. In total, there are ∑_{i=0}^{n} s_i·s_{i+1} Boolean weights which have to be learned, i.e., have to be set to true (resp. 1) or false (resp. 0). If weight w^{(i)}_{jk} is set to true, this means that the output of node j is used in the conjunction (if i mod 2 = 0) or disjunction (if i mod 2 = 1) that defines node k. If it is set to false, this output is ignored by node k.

In the beginning, these weights need to be initialized. This initialization process is influenced by two hyperparameters: the average rule length (l̄) and the initialization probability (p), where l̄ only affects the number of weights that are set to 1 in the first layer. Here we use the additional information which literals belong to the same attribute to avoid immediate contradictions within the first conjunction. Let |A| be the number of attributes; then each attribute is selected with the probability l̄/|A|, so that on average for l̄ literals of different attributes the corresponding weight will be set to true. In the remaining layers, the weights are set to true with the probability p. Additionally, at least one outgoing weight from each node will be set to true to ensure connectivity. This implies that, regardless of the choice of p, all the weights in the last layer will always be initialized with true because there is only one output node. Note that, as a consequence, shallow DNF-structured networks will not be influenced by the choice of p, since they only consist of the first layer influenced by l̄ and the last layer initialized with true.
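A minimal sketch of this initialization, under our reading of the description above (not the authors' code; the choice of one random literal per selected attribute is our assumption):

```python
import numpy as np

def init_weights(layer_sizes, literals_per_attribute, l_bar, p, seed=0):
    """Initialize Boolean weight matrices W[i] of shape (s_i, s_{i+1}).

    In the first layer each attribute is selected with probability l_bar / |A|
    and one of its literals is connected, so contradictory literals of the
    same attribute never meet in the initial conjunctions; in the remaining
    layers weights are set to True with probability p, and every node keeps
    at least one outgoing connection.
    """
    rng = np.random.default_rng(seed)
    n_attributes = len(literals_per_attribute)
    weights = []
    for i in range(len(layer_sizes) - 1):
        s_in, s_out = layer_sizes[i], layer_sizes[i + 1]
        if i == 0:
            w = np.zeros((s_in, s_out), dtype=bool)
            for k in range(s_out):                        # one conjunction per node
                for literals in literals_per_attribute:   # literal indices of one attribute
                    if rng.random() < l_bar / n_attributes:
                        w[rng.choice(literals), k] = True
        else:
            w = rng.random((s_in, s_out)) < p
        for j in range(s_in):                             # ensure connectivity
            if not w[j].any():
                w[j, rng.integers(s_out)] = True
        weights.append(w)
    return weights   # the last matrix ends up all True: there is only one output node

# Ten Boolean attributes, each encoded as a positive and a negated literal.
literals = [[2 * a, 2 * a + 1] for a in range(10)]
W = init_weights([20, 32, 16, 8, 4, 2, 1], literals, l_bar=2, p=0.05)
```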
3.3 Prediction

The prediction of the network can be efficiently computed using binary matrix multiplications (⊗). In each layer i, the input features A^{(i)} are multiplied with the corresponding weights W^{(i)} and aggregated at the receiving node in layer i + 1. If the aggregation is disjunctive, this directly corresponds to a binary matrix multiplication. According to De Morgan's law, a ∧ b = ¬(¬a ∨ ¬b) holds. This means that binary matrix multiplication can be used also in the conjunctive case, provided that the inputs and outputs are negated before and after the multiplication. Because of the alternating sequence of conjunctive and disjunctive layers, binary matrix multiplications and negations are also always alternated when passing data through the network, so that a binary matrix multiplication followed by a negation can be considered as a NOR-node. Thus, the activations A^{(i+1)} can be computed from the activations in the previous layer as

    A^{(i+1)} ← Ã^{(i)} ⊗ W^{(i)}    (1)

where X̃ = J − X denotes the element-wise negation of a matrix X (J denotes a matrix of all ones). Hence, internally, we do not distinguish between conjunctive and disjunctive layers within the network, but have a uniform network structure consisting only of NOR-nodes. However, for the sake of the ease of interpretation, we chose to represent the deep networks as alternating AND and OR layers.

In the first layer, we have the choice whether to start with a disjunctive layer or a conjunctive one, which can be controlled by simply using the original input vector (A^{(0)} = x) or its negation (A^{(0)} = x̃) as the first layer. Also, if the last layer is conjunctive, an additional negation must be performed at the end of the network so that the output has the same "polarity" as the target values. In our experiments, we always start with a conjunctive and end with a disjunctive layer. In this way, the rule networks can be directly converted into conjunctive rule sets.
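The forward pass can then be sketched as follows. This is our reading of Eq. (1), not the authors' implementation: each layer negates its input and applies one binary matrix multiplication, and the polarity of the input vector and of the output determines whether the first and last layers act conjunctively or disjunctively.

```python
import numpy as np

def bool_matmul(a, w):
    """Binary matrix multiplication: out[n, k] = OR_j (a[n, j] AND w[j, k])."""
    return (a.astype(np.uint8) @ w.astype(np.uint8)) > 0

def predict(x, weights, first_layer_conjunctive=True, last_layer_conjunctive=False):
    """Forward pass following Eq. (1): A^(i+1) <- ~A^(i) (binary matmul) W^(i)."""
    a = x if first_layer_conjunctive else ~x    # choose A^(0) = x or its negation
    for w in weights:
        a = bool_matmul(~a, w)                  # Eq. (1): negate, then multiply
    if last_layer_conjunctive:                  # restore the output "polarity"
        a = ~a
    return a[:, 0]

# Tiny check: encode (a AND not b) OR (not a AND b) over the literals
# [a, not a, b, not b] as a shallow AND/OR network and verify it computes XOR.
X = np.array([[ta, not ta, tb, not tb] for ta in (0, 1) for tb in (0, 1)], dtype=bool)
W0 = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=bool)   # rule bodies a AND not b, not a AND b
W1 = np.ones((2, 1), dtype=bool)                              # final disjunction
assert predict(X, [W0, W1]).tolist() == [False, True, True, False]
```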
3.4 Training

Following [1], we implement a straightforward mini-batch based greedy optimization scheme. While the number, the arrangement and the aggregation types of the nodes remain unchanged, the training process will flip the weights of the network to optimize its outcome. Flipping a weight from 0 to 1 (or vice versa) can be understood as a single addition (or removal) of a literal to the conjunction or disjunction encoded by the following node. After the initialization, the base accuracy on the complete training set and the initial weights are stored and subsequently updated every time the predicted accuracy on the training set exceeds the previous maximum after processing a mini-batch of training examples. However, the predictive performance does not necessarily increase monotonically, since the accuracy is optimized not on the whole training set, but on a mini-batch. For all layers and nodes, possible flips are tried and evaluated, and the flip with the biggest improvement of the accuracy on the current mini-batch is selected. These greedy adjustments are repeated until either no flip improves the accuracy on the mini-batch or a maximum number of flips is reached, which ensures that the network does not overfit the mini-batch data.

When all mini-batches are processed, the procedure is repeated for a fixed number of epochs. Only the composition of the mini-batches is changed in each epoch by shuffling the training data before proceeding. After all epochs, the weights of the network are reset to the optimum found so far, and a final optimization on the complete training set is conducted to eliminate any overfitting on mini-batches. The returned network can then be used to predict outcomes of any further test instances.
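The inner greedy step might look roughly as follows (an illustrative sketch, not the authors' code; the epoch loop, the shuffling and the bookkeeping of the best weights on the full training set are omitted):

```python
import numpy as np

def greedy_flips(weights, batch_x, batch_y, predict_fn, max_flips=10):
    """One mini-batch pass of the greedy flipping scheme of Sect. 3.4."""
    def accuracy(ws):
        return np.mean(predict_fn(batch_x, ws) == batch_y)

    current = accuracy(weights)
    for _ in range(max_flips):
        best_flip, best_acc = None, current
        for i, w in enumerate(weights):
            for j in range(w.shape[0]):
                for k in range(w.shape[1]):
                    w[j, k] = ~w[j, k]          # tentatively flip a single weight
                    acc = accuracy(weights)
                    w[j, k] = ~w[j, k]          # undo the flip
                    if acc > best_acc:
                        best_flip, best_acc = (i, j, k), acc
        if best_flip is None:                   # no flip improves this mini-batch
            break
        i, j, k = best_flip
        weights[i][j, k] = ~weights[i][j, k]    # commit the best flip
        current = best_acc
    return weights, current
```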
4 Experiments

In this section, we present the results of differently structured rule networks on both artificial and real-world UCI datasets, with the goal of investigating the effect of differences in the depth of the networks.

4.1 Artificial Datasets

As many standard UCI datasets can be solved with very simple rules [8], we generated artificial datasets with a structure that we know can be represented by our network. An artificial dataset suitable for our greedy optimization algorithm should not only include intermediate concepts which are meaningful, but also a strictly monotonically decreasing entropy between these concepts, so that they can be learned in a stepwise fashion in successive layers. One way to generate artificial datasets that satisfy these requirements is to take the output of a randomly generated deep rule network. Subsequently, this training information can be used to see whether the function encoded in the original network can be recovered. Note that such a recovery is also possible for networks with different layer structures. In particular, each of the logical functions encoded in such a deep network can, of course, also be encoded as a DNF expression, so that shallow networks are not at an a priori disadvantage (provided that their hidden layer is large enough, which we ensured in preliminary experiments).

We use a dataset of ten Boolean inputs named a to j and generate all possible 2^{10} combinations as training or test samples. These samples are extended by the ten negations ¬a to ¬j via one-hot-encoding and finally passed to a funnel-shaped deep rule network with n = 5 and s = [32, 16, 8, 4, 2]. The weights of the network are set by randomly initializing the network and then training it on two randomly selected examples, one assigned to the positive and one to the negative class, to ensure that both a positive and a negative output is possible. If the resulting ratio of positively predicted samples is still less than 20% or more than 80%, the network is reinitialized with a new random seed to avoid extremely imbalanced datasets.
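The generation procedure can be outlined as follows; the network routines are only stand-ins with hypothetical names (the actual implementations correspond to Sects. 3.2-3.4), since the point of the sketch is the rejection of extremely imbalanced labelings:

```python
import numpy as np
from itertools import product

# All 2^10 assignments of the Boolean attributes a..j, extended by their
# negations, give 1024 samples described by 20 binary literals (Sect. 4.1).
X = np.array(list(product([False, True], repeat=10)))
X = np.concatenate([X, ~X], axis=1)

def label_with_random_network(X, init_network, train_on_two_examples, predict,
                              max_tries=100):
    """Label an artificial dataset with a randomly generated deep rule network.

    `init_network`, `train_on_two_examples` and `predict` are placeholders,
    not the paper's API; the loop re-samples until the labeling has between
    20% and 80% positive samples.
    """
    for seed in range(max_tries):
        net = init_network(layer_sizes=[20, 32, 16, 8, 4, 2, 1], seed=seed)
        net = train_on_two_examples(net, X)    # one positive, one negative target
        y = predict(X, net)
        if 0.2 <= y.mean() <= 0.8:
            return y
    raise RuntimeError("no sufficiently balanced labeling found")
```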
4.2 Results on Artificial Datasets

We first conducted a few preliminary experiments on three of the artificial datasets to set suitable default values for the hyperparameters of the deep and shallow networks. The detailed grid search, including figures comparing different hyperparameter settings, is presented in the extended version of this paper [2]. Based on these results, we selected three network versions for the main experiments. As a candidate for shallow networks, we take the best combination of s_1 = 20 and l̄ = 5. For the deep networks, however, we will choose the second-best network s = [32, 16, 8, 4, 2] combined with l̄ = 2 and an averaged p = 0.05, since it is almost ten times faster than the best deep network while still reaching an accuracy over 0.895. The third network is chosen as an intermediate stage between the first two: s = [32, 8, 2] combined with l̄ = 3 and p = 0.05. While still being a deep network, the learned rules can be passed to the output layer a little faster. In the following, we will refer to these (deep) rule network classifiers based on their number of layers, i.e. DRNC(5) for s = [32, 16, 8, 4, 2], DRNC(3) for s = [32, 8, 2] and RNC for s_1 = 20. For computational reasons, all of the reported results were estimated with a 2-fold cross validation. While this may not yield the most reliable estimate on each individual dataset, we nevertheless get a coherent picture over all 20 datasets, as we will see in the following.

In the main experiments, we use a combination of 15 artificial datasets with seeds we already used in the prior hyperparameter grid search and 5 artificial datasets with new seeds to detect potential overfitting on the first datasets. All datasets are tested using five epochs, a batch size of 50 and an unlimited number of flips per batch. We also ensured for all of the generated datasets that the DNF concept does not contain more than 20 rules, so that it can theoretically also be learned by the tested shallow network with s_1 = 20 (and therefore also by the two deep networks, since their first layer is already bigger).

Figure 3 shows the development of the accuracies on the training set, averaged over all 20 datasets, over the number of processed mini-batches, whereby after every ten mini-batches a new epoch starts. The base accuracy before processing the first mini-batch and the accuracy after the full batch optimization are omitted. We can see that the deep networks not only deliver higher accuracies but also converge slightly faster than the shallow one. The orange curve of DRNC(3) runs a little higher than the blue one of DRNC(5), whereas the green curve of RNC has some distance to them, especially during the first two epochs.

Figure 3: Average accuracy of rule networks with 1/3/5 layers on the training dataset.

Table 1 shows the accuracies of the three networks. For each dataset, the best accuracy of the three network classifiers is highlighted in bold. We can see a clear advantage for the two deep networks, both when considering the average accuracy and the number of highest accuracies. The results clearly show that the best performing deep networks outperform the best performing shallow network in all but 4 of the 20 generated datasets. Both the average rank and the average accuracy of the deep networks are considerably better than the corresponding values for RNC. This also holds for pairwise comparisons of the columns (DRNC(5) vs. RNC 15:5, DRNC(3) vs. RNC 15:5).

Table 1: Accuracies on artificial datasets. Rule networks with 1/3/5 layers vs. RIPPER vs. CART. The best accuracy of the rule networks is marked in bold, the overall best accuracy per dataset is marked in italic.

seed        %(+)    DRNC(5)  DRNC(3)  RNC     RIPPER  CART
5           0.4453  0.958    0.9863   0.9531  0.9805  0.9844
16          0.7959  0.9639   0.9707   0.9629  0.9766  0.9551
19          0.6562  1        0.9902   0.9746  1       1
24          0.584   0.9053   0.9043   0.916   0.9463  0.9404
36          0.6943  0.8828   0.9209   0.9043  0.8867  0.9111
44          0.7939  0.9629   0.9551   0.9326  0.9482  0.9697
53          0.6055  0.9805   0.9805   0.9775  0.9746  0.9824
57          0.7705  0.9824   0.9736   0.9639  0.9951  0.9902
60          0.7715  0.9443   0.9453   0.9209  0.958   0.9883
65          0.5312  0.9854   0.9688   0.9414  0.9961  0.9922
68          0.5654  0.9248   0.9443   0.9619  0.9688  0.9355
69          0.6924  0.9551   0.9658   0.9199  0.9795  0.9717
70          0.6338  0.9014   0.9062   0.9229  0.9111  0.8984
81          0.5684  0.9004   0.9131   0.8857  0.9248  0.9756
82          0.7188  0.9941   0.998    0.9717  1       1
85          0.5312  1        0.998    0.9736  1       1
89          0.6084  0.8926   0.9434   0.9629  0.9502  0.9395
107         0.6172  0.8965   0.873    0.8643  0.9043  0.9277
112         0.7549  0.9346   0.9248   0.9189  0.9082  0.9561
118         0.5957  0.9688   0.9414   0.9434  0.9736  0.9688
Ø Accuracy          0.9467   0.9502   0.9386  0.9591  0.9644
Ø Rank              1.775    1.725    2.5

The Friedman test for the ranks yields a significance of more than 95%. A subsequent Nemenyi test delivers a critical distance of 0.741 (95%) or 0.649 (90%), which shows that DRNC(3) and RNC are significantly different on a level of more than 95% and DRNC(5) and RNC on a level of more than 90%. The corresponding critical distance diagram (CD = 0.741) is shown in Figure 4. We thus find it safe to conclude that deep networks outperform shallow networks on these datasets.

Figure 4: Critical distance diagram for the rule networks on the artificial datasets with a significance level of 95%. Learners that are not statistically different from one another are connected.

In the two right-most columns of Table 1 we also show a comparison to the state-of-the-art rule learner RIPPER [5] and the decision tree learner CART [4] in Python implementations using default parameters.² We see that all network approaches are outperformed by the RIPPER and CART classifiers with default settings. The difference between RIPPER and DRNC(3) is approximately the same as the difference between DRNC(3) and RNC. However, considering that we only use a naïve greedy algorithm, it could not be expected (and was also not our objective) to be able to beat state-of-the-art rule learners. In particular, the runtime is far from state-of-the-art, since already for the shallow network 30 seconds are needed per dataset and up to three minutes for the deep networks (in comparison to less than a second for RIPPER and CART). Furthermore, the results also confirm that shallow rule learners (of which both RIPPER and CART are representatives) had no disadvantage by the way we generated the datasets.

² We used the implementations available from https://pypi.org/project/wittgenstein/ and https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.
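For reference, the critical distances reported above follow from the usual Nemenyi formula; the snippet below (ours, not part of the paper) reproduces the two values for k = 3 classifiers and N = 20 datasets using the standard studentized-range constants:

```python
import math

# Nemenyi critical distance for comparing k classifiers on N datasets:
#     CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
# With k = 3 rule networks and N = 20 artificial datasets, q_0.05 = 2.343 and
# q_0.10 = 2.052 give the critical distances of 0.741 and 0.649 quoted above.
def critical_distance(q_alpha, k, n_datasets):
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

print(round(critical_distance(2.343, 3, 20), 3))   # 0.741
print(round(critical_distance(2.052, 3, 20), 3))   # 0.649
```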
4.3 Results on UCI Datasets

To estimate how the rule networks perform on real-world datasets, we select nine classification datasets (car-evaluation, connect-4, kr-vs-kp, monk-1 to monk-3, mushroom, tic-tac-toe and vote) from the UCI Repository [6]. They differ in the number of attributes and instances, but have in common that they consist only of nominal attributes. Car-evaluation and connect-4 are actually multi-class datasets and are therefore converted into the binary classification problem whether a sample belongs to the most frequent class or not. For all binary classification problems, the networks to be tested again treat the more common class as the positive class and the less common one as the negative class, except for the monk datasets, whereby the positive class is set to 1. As with the artificial datasets, we additionally compare the performance of the networks to RIPPER and CART, and again all accuracies are obtained via 2-fold cross validation. In case a random initialization did not yield any result (i.e., the resulting network classified all examples into a single class), we re-initialized with a different seed (this happened once for both deep network versions).

The results are shown in Table 2. We can again observe that the deep 5-layer network DRNC(5) outperforms the shallow network RNC. Of all rule networks, DRNC(5) provides the highest accuracy on the connect-4, monk-1, monk-3, mushroom and vote datasets, whereas DRNC(3) performs best on car-evaluation and monk-2, and RNC on kr-vs-kp and tic-tac-toe. The latter two datasets are also interesting: tic-tac-toe clearly does not require a deep structure, because for solving it, the learner essentially needs to enumerate all three-in-a-row positions on a 3 × 3 board. This is similar to connect-4, where four-in-a-row positions have to be recognized. However, in the former case, there is only one matching tile for an intermediate concept consisting of two tiles, while in connect-4 there are several, which can potentially be exploited by a deeper network. In the kr-vs-kp dataset, deep structures are also not helpful because it consists of carefully engineered features for the KRKP chess endgame, which were designed in an iterative process so that the game can be learned with a decision tree learner [15]. It would be an ambitious goal of deep rule learning methods to be able to learn such a dataset from, e.g., only the positions of the chess pieces. This is clearly beyond the state-of-the-art of current rule learning algorithms. The comparison to RIPPER and CART is again clearly in favor of these state-of-the-art algorithms.

Table 2: Accuracies on real-world datasets. Rule networks with 1/3/5 layers vs. RIPPER vs. CART. The best accuracy of the rule networks is marked in bold, the overall best accuracy per dataset is marked in italic.

dataset         %(+)    DRNC(5)  DRNC(3)  RNC     RIPPER  CART
car-evaluation  0.7002  0.8999   0.9022   0.8565  0.9838  0.9821
connect-4       0.6565  0.7728   0.7712   0.7597  0.7475  0.8195
kr-vs-kp        0.5222  0.9671   0.9643   0.9725  0.9837  0.989
monk-1          0.5000  1        0.9982   0.9910  0.9478  0.8939
monk-2          0.3428  0.7321   0.7421   0.7139  0.6872  0.7869
monk-3          0.5199  0.9693   0.9603   0.9567  0.9386  0.9729
mushroom        0.784   1        0.978    0.993   0.9992  1
tic-tac-toe     0.6534  0.8956   0.9196   0.9541  1       0.9217
vote            0.6138  0.9655   0.9288   0.9264  0.9011  0.9287
Ø Rank                  1.556    2        2.444
tion, while afterwards just a subset of possible flips should Our experiments on both artificial and real-world bench- be considered which can be built either in a deterministic mark data indicate that deep rule networks outperform shal- or probabilistic way. low networks. The deep networks obtain not only a higher Due to this lack of backpropagation, the flips are eval- accuracy, but also need less mini-batch iterations to achieve uated by their influence on the prediction when executed. it. Moreover, in preliminary experiments in the hyperpa- However, when looking at a false positive, we can only rameter grid search, we have seen indications that the deep correct this error by making the overall hypothesis of the networks are generally more robust to the choice of the network more specific. In order to achieve a generalization hyperparameters than shallow networks. On the other hand, of the hypothesis, only flips from false to true in con- we also had some cases on real-world data sets where deep junctive layers or flips from true to false in disjunctive networks failed because a poor initialization resulted in layers have to be taken into account. In this way, all flips are indiscriminate predictions. split into ”generalization-flips” and ”specialization-flips” Overall, we interpret these results as evidence that an of which only one group has to be considered at the same investigation of deep rule structures is a promising research time. This improvement as well as the above mentioned goal, which we hope could yield a similar boost in perfor- selection of a subset of flips might also allow us to perform mance in inductive rule learning as could be observed by two or more flips at the same time so that a better result moving from shallow to deep neural networks. However, than with the greedy approach can be achieved. this goal is still far ahead. An even more promising approach starts one step ear- lier in the initialization phase of the network. Instead of 6 Future Work specifying the structure of the network and finding optimal initialization parameters l¯ and p for it, a small part of the In this work, it was not our goal to reach a state-of-the-art data could be used to create a rough draft version of the predictive performance, but instead we wanted to evaluate network. The Quine-McCluskey algorithm [11] or R IPPER a very simple greedy optimization algorithm on both shal- are suitable methods to generate shallow networks, whereas low and deep networks, in order to get an indication on the ESPRESSO-algorithm [3] would generate deep networks. Decision trees can also be used to generate deep networks [7] J. Fürnkranz, D. Gamberger, and N. Lavrač. Founda- since the contained rules already share some conditions tions of Rule Learning. Springer-Verlag, 2012. and, moreover, similar subtrees can be merged. All these approaches share some significant advantages [8] R. C. Holte. Very simple classification rules per- over the network approach we developed so far. First of all, form well on most commonly used datasets. Machine the decision which class value will be treated as positive Learning, 11:63–91, 1993. or negative does not have to be made manually any longer. [9] K. Hornik. Approximation capabilities of multilayer Second, they automatically deliver a suitable initialization feedforward networks. Neural Networks, 4(2):251 – of the network, which otherwise would have to be improved 257, 1991. by similar approaches like used in neural networks [e.g., 14] to achieve a robust performance. 
Third, the general structure of the network is not limited to a fixed size and depth where each node is strictly assigned to a specific layer. Instead of generating nodes that become useless after a few flips have been processed and that should be removed, we can thereby start with a small structure which can be adapted purposefully by copying and mutating good nodes and pruning bad ones. However, it remains unclear whether these changes still lead to improvements in performance or if the network in the given structure is already optimal.

Acknowledgments. We are grateful to Eneldo Loza Mencía, Michael Rapp, Eyke Hüllermeier, and Van Quoc Phuong Huynh for inspiring discussions, fruitful pointers to related work, and for suggesting the evaluation with artificial datasets.

References

[1] F. Beck and J. Fürnkranz. An investigation into mini-batch rule learning. In K. Kersting, S. Kramer, and Z. Ahmadi, editors, Proceedings of the 2nd Workshop on Deep Continuous-Discrete Machine Learning (DeCoDeML), 2020. Available as arXiv preprint abs/2106.10202.

[2] F. Beck and J. Fürnkranz. An empirical investigation into deep and shallow rule learning. arXiv preprint abs/2106.10254, 2021. Submitted for journal publication.

[3] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L. Sangiovanni-Vincentelli. Logic Minimization Algorithms for VLSI Synthesis, volume 2 of The Kluwer International Series in Engineering and Computer Science. Springer, 1984.

[4] L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA, 1984.

[5] W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proceedings of the 12th International Conference on Machine Learning (ML-95), pages 115–123, Lake Tahoe, CA, 1995. Morgan Kaufmann.

[6] D. Dua and C. Graff. UCI machine learning repository, 2017.

[7] J. Fürnkranz, D. Gamberger, and N. Lavrač. Foundations of Rule Learning. Springer-Verlag, 2012.

[8] R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63–91, 1993.

[9] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[10] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, and R. Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-16), pages 1675–1684, San Francisco, CA, 2016. ACM.

[11] E. J. McCluskey. Minimization of Boolean functions. The Bell System Technical Journal, 35(6):1417–1444, 1956.

[12] H. Mhaskar, Q. Liao, and T. A. Poggio. When and why are deep networks better than shallow ones? In S. P. Singh and S. Markovitch, editors, Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2343–2349, San Francisco, California, USA, 2017. AAAI Press.

[13] R. S. Michalski. On the quasi-minimal solution of the covering problem. In Proceedings of the 5th International Symposium on Information Processing (FCIP-69), volume A3 (Switching Circuits), pages 125–128, Bled, Yugoslavia, 1969.

[14] E. Z. Ramos, M. Nakakuni, and E. Yfantis. Quantitative measures to evaluate neural network weight initialization strategies. In Proceedings of the IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC), pages 1–7, Las Vegas, NV, USA, 2017. IEEE.

[15] A. D. Shapiro. Structured Induction in Expert Systems. Turing Institute Press. Addison-Wesley, 1987.

[16] T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research, 18:70:1–70:37, 2017.