Differentiable Rule Induction with Learned Relational Features

Remy Kusters1,2,βˆ—, Yusik Kim1,2, Marine Collery2,3, Christian de Sainte Marie2 and Shubham Gupta1,2
1 IBM Research, Orsay, France
2 IBM France Lab, Orsay, France
3 Inria Saclay Ile-de-France, Palaiseau, France

NeSy 22: 16th International Workshop on Neural-Symbolic Learning and Reasoning
βˆ— Corresponding author: remy.kusters@ibm.com (R. Kusters)
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Rule-based decision models are attractive due to their interpretability. However, existing rule induction methods often produce long and consequently less interpretable rule models. This problem can often be attributed to the lack of an appropriately expressive vocabulary, i.e., relevant predicates used as literals in the decision model. Most existing rule induction algorithms presume pre-defined literals, naturally decoupling the definition of the literals from the rule learning phase. In contrast, we propose the Relational Rule Network (R2N), a neural architecture that learns literals representing a linear relationship among numerical input features, along with the rules that use them. This approach opens the door to increasing the expressiveness of induced decision models by coupling literal learning directly with rule learning in an end-to-end differentiable fashion. On benchmark tasks, we show that these learned literals are simple enough to retain interpretability, yet improve prediction accuracy and yield sets of rules that are more concise than those of state-of-the-art rule induction algorithms.

Keywords
Decision rules, Rule learning, Feature learning

1. Introduction
Over the last decade, black-box decision models (e.g., neural networks) have been increasingly used in high-stakes decision making, yet there has been equally growing concern over their use [1]. Regulations in various industries demand accountability and transparency from the models involved in the decision-making process. Beyond regulatory concerns, the end-users of many complex decision models also increasingly demand interpretability of the final decision. Rule-based models are a potential solution but, to date, most rule-based decision support systems do not support learning practically useful decision models directly from data, often because the resulting rules are too long to be considered interpretable [2, 3]. Learning rule models that express higher-level concepts in terms of the lower-level input features, rather than rule models that use only these lower-level input features directly, has been put forward as one of the primary avenues for improving the expressiveness and interpretability of rule models [2, 4]. The scope of this paper is learning rule models, expressed in Disjunctive Normal Form (DNF), for classification tasks on tabular data. Of particular interest is enriching the representation language used by rule models with higher-level latent representations that are functions of the lower-level input features. This is in contrast to most state-of-the-art rule learning algorithms, which use pre-defined literals (either hand-crafted or obtained from a-priori binarized features) in the rule model (e.g., BRCG [5], RIPPER [6], CORELS [7]).
The main difficulty in learning appropriate literals is incorporating feedback from the rule model that uses them. One way to achieve this is to learn the literals and the rule model that uses them jointly, via a single differentiable neural architecture. We propose the Relational Rule Network (R2N): a three-layer neural network whose first layer learns literals, each representing a partition of the input feature space delimited by a hyperplane, which we call a halfspace. The second and third layers map the binary vector of literals to a single binary prediction, encoding a crisp logical formula in the form of a DNF. We consider halfspaces as literals because they are simple enough to retain interpretability, yet significantly improve expressivity and model accuracy. Encoding this structure in an end-to-end differentiable neural network has the benefit that the learning of the literals is directly informed by their usage in the rules.

To showcase the value of using appropriate predicates as literals in a rule model, we present a simple toy example. Given a dataset with two numerical input features (π‘₯0, π‘₯1) and a corresponding label 𝑦 determined by the ground truth

if (π‘₯0/π‘₯1 > 1.0 ∧ π‘₯0 > 0.5) then class = True; else class = False;  (1)

the task is to find a rule model that accurately predicts the label given the input. This model involves two distinct decision boundaries: π‘₯0 = 0.5 and π‘₯0/π‘₯1 = 1.0. Training our model with default parameters (see Appendix C) recovers the ground truth (see Fig. 1), whereas when literals are predefined as univariate value comparisons (e.g., π‘₯0 < 0.2), we learn a longer DNF. This is because each conjunction represents a rectangle in the feature space (π‘₯0, π‘₯1), and many of them are needed to adequately approximate the sloped decision boundary π‘₯0/π‘₯1 = 1 (Fig. 1b). Therefore, by improving the expressivity of the rule model with the ability to represent a ratio as a single literal rather than as a conjunction of many, we simplify the model and improve accuracy.

Figure 1: Decision boundaries for Eq. (1) learned by (a) R2N and (b) the RIPPER algorithm [6]. The decision boundaries learned by R2N (a) correspond to π‘₯0 > 0.5 and π‘₯0/π‘₯1 > 1.0, while the RIPPER algorithm (b) learns nine rules that correspond to the nine rectangles inside the green area.

We summarize our contributions as follows: i) We propose a neural network architecture (R2N) that learns a DNF together with the literals it uses. The literals represent halfspaces whose boundary hyperplanes are learned by the network. ii) We demonstrate that, compared to SOTA algorithms (RIPPER, BRCG, DR-Net, etc.), we improve the sparsity and interpretability of the learned rule model, both on benchmark problems where the underlying decision model is known and on ones where it is not.

2. Related works
Our work contributes to the large body of work on neuro-symbolic rule learning systems that represent atoms and clauses as individual nodes in a neural network (NN) and learn rules as configurations of weights on the edges that link these nodes. This approach differs from existing distillation-based approaches that construct a simpler interpretable network to mimic the behavior of a complex black-box model [8, 9]. While those generate post-hoc explanations that only approximate a model, R2N and the like natively learn explainable rules: the rules do not approximate the NN, they are equivalent to it by construction.
The representation of rules in R2N is essentially the same as in C-IL2P [10] and its followers (e.g., [11, 12, 13]). R2N differs from C-IL2P and similar systems in two main ways: whereas they focus on approximating logic programs and their semantics in discrete domains (infinite and continuous domains can also be dealt with, but only in a probabilistic setting [14]), R2N learns crisp classification rules in continuous numerical domains, like [15, 16]. However, our approach differs from [15] in two regards: i) the encoding and ii) the binarization of the rules, as we discuss in Section 4. In addition, R2N learns a set of literals that map the 𝑛-dimensional continuous numerical input space to an π‘š-dimensional Boolean space in which the rule models are learned. Similar neuro-symbolic systems require binarized input data using predefined predicates [10, 15, 16, 17], are limited to learning univariate threshold values [18], or learn latent relations in a finite Herbrand base [4, 13, 19]. The latter systems rely on prior knowledge about the relations to be learned [4, 19], whereas R2N does not require prior knowledge. Rather than predicate invention, learning literals in R2N can be considered a form of propositionalization [20]. Contrary to other systems relying on it [12, 20, 21], propositionalization is intrinsically part of the rule learning in R2N: R2N is able to learn a more useful grounding of the learned literals in the input space without prior knowledge because the literals are learned as part of the supervised rule learning process. This is similar to the way DeepProbLog uses the semantics of the program to supervise the grounding of predicates [22].

Extending the branching conditions from simple value comparisons to linear halfspaces has been proposed for decision trees [23]. While it is possible to translate the resulting decision tree to a DNF, such a mapping usually results in excessively long rules, hindering interpretability. Furthermore, the single literal learning layer that grounds the learned literals using hyperplanes in R2N can be replaced with an arbitrarily complex NN (learning hypersurfaces), at the cost of interpretability.

Figure 2: Relational Rule Network (R2N): The literal learning module encodes halfspaces and evaluates to Boolean literals πœ™. Alongside these learned literals, zero or more predefined literals πœ™π‘š+1, ..., πœ™π‘€ are supplied directly to the rule learning module. The rule learning module uses these literals in the conjunctions 𝑧 of its first layer. In the final layer, the relevant conjunctions are selected into a DNF.

3. Rule language
The rule model that we consider for binary classification is a DNF, whose formal grammar is defined by the following production rules:

rule model β†’ if DNF then class = True; else class = False
DNF β†’ conjunction | conjunction ∨ DNF
conjunction β†’ literal | literal ∧ conjunction
literal β†’ πœ™1 | ... | πœ™π‘š | πœ™π‘š+1 | ... | πœ™π‘€

The underlined terms are terminal symbols, and the typewriter terms are non-terminal symbols. Among the predicates (of arbitrary arity) in our dictionary of literals, some have learnable parameters, {πœ™1, ..., πœ™π‘š}, e.g., the coefficients of a hyperplane defining a halfspace, and some do not, {πœ™π‘š+1, ..., πœ™π‘€}, meaning they are pre-determined (see Fig. 2). We call the subset of literals with learnable parameters learned literals, and those without learnable parameters predefined literals. Note that negations are not part of the rule language. Therefore, negated predicates, if desired, must be explicitly included in the dictionary of literals.
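For concreteness, one rule model derivable from this grammar (with purely illustrative literals of our choosing) is

if (πœ™1 ∧ πœ™π‘š+1) ∨ πœ™2 then class = True; else class = False,

where, for instance, πœ™1 ≑ (π‘₯0 + 0.5π‘₯1 > 0.5) and πœ™2 ≑ (π‘₯0 βˆ’ π‘₯1 > 0) would be learned halfspace literals and πœ™π‘š+1 a predefined literal such as a categorical value comparison. The derivation applies DNF β†’ conjunction ∨ DNF once and conjunction β†’ literal ∧ conjunction once.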
4. The Relational Rule Network (R2N)
The Relational Rule Network (R2N) is composed of two modules connected sequentially:
1. The literal learning module, which learns halfspaces to be used as literals of the rule language.
2. The rule learning module, which maps the binary vector of evaluated literals to a single binary prediction of the class label, in such a way that it encodes an equivalent logical formula in disjunctive normal form (DNF).
Any predefined literals we want to use, including categorical value comparisons, are supplied directly to the input layer of the rule learning module. See Fig. 2 for an illustration. We introduce the following notation for the rest of this paper:
β€’ π‘₯𝑖: Numerical input variable 𝑖.
β€’ πœ™π‘–(x): The 𝑖-th literal in our rule language. Learned literals (𝑖 = 1, ..., π‘š) represent halfspaces in the 𝑛-dimensional space of x. We omit the argument when the context is clear.
β€’ 𝑧𝑗: The 𝑗-th conjunction formed by a combination of literals.
Unless stated otherwise, normal-faced lower-case variables with subscripts (e.g., π‘₯𝑖) represent scalar components of their bold-faced counterparts, which denote vectors (lower-case; x) or matrices (upper-case; X). All vectors are column vectors.

4.1. The literal learning module
To enrich the vocabulary of our rule language, we consider a neural network layer that learns predicates representing halfspaces of the numerical feature space. These learned predicates can then be used as literals in the rule language. The literal learning module is represented by a single-layer perceptron with a learnable bias 𝑏𝑖 and weight vector w𝑖^(𝑝) for each halfspace πœ™π‘– that we wish to learn (see Fig. 2). Extending this to multiple output nodes using matrix notation, the transformation is represented by

πœ™(x) = 𝜎((x^𝑇 W^(𝑝) + b)/𝜏),  (2)

where W^(𝑝) is the weight matrix, b the corresponding biases, 𝜎 is the sigmoid activation function applied component-wise, and 𝜏 is a temperature parameter that changes according to a cooling schedule. Observing that 𝜎(π‘₯/𝜏) converges to the Heaviside step function as 𝜏 β†’ 0 (pointwise on ℝ β§΅ {0}), we effectively approach strict binarization at the end of the cooling schedule (the step function is used during testing), while avoiding zero gradients during training.
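As an illustration, the literal learning module can be sketched in a few lines of PyTorch. This is our own illustrative code with assumed class and attribute names, not the authors' implementation; the temperature 𝜏 is assumed to be annealed externally by the cooling schedule described in Appendix C.

```python
import torch
import torch.nn as nn

class LiteralLearningModule(nn.Module):
    """Learns m halfspace literals phi_i(x) = [w_i . x + b_i > 0], as in Eq. (2).

    A scaled sigmoid sigma(./tau) stands in for the Heaviside step during
    training; at test time (tau -> 0) the output is effectively binary.
    """

    def __init__(self, n_features: int, n_literals: int):
        super().__init__()
        self.linear = nn.Linear(n_features, n_literals)  # W^(p) and b
        self.tau = 1.0  # temperature, annealed by an external cooling schedule

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.linear(x)
        if self.training:
            return torch.sigmoid(logits / self.tau)  # soft, differentiable
        return (logits > 0).float()  # hard Heaviside step at test time
```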
4.2. The rule learning module
The rule learning module maps the binary vector of literals to a single binary prediction of the class label, in such a way that it encodes an equivalent logical formula in DNF. We specify this rule learning module as a two-layer neural network architecture implementing an AND and a subsequent OR operation, similar to [10, 15].

AND-layer The AND-layer maps a binary vector of literals πœ™ = (πœ™1, ..., πœ™π‘€) to another binary vector (𝑧1, ..., 𝑧𝐽) whose components represent conjunctions made from some combination of the input literals. This combination is specified by a binary vector w𝑗 indicating which of the 𝑀 literals are present in conjunction 𝑧𝑗. The 𝑖-th component 𝑀𝑖𝑗 of w𝑗 is 1 if literal πœ™π‘– is included in conjunction 𝑧𝑗 and 0 otherwise [16].

Proposition 1. Let 𝑀𝑖𝑗 be binary parameters and πœ™π‘– be Boolean variables whose values are represented by binary values. Then the mappings

πœ™ ↦ β‹€_{π‘–βˆΆπ‘€π‘–π‘—=1} πœ™π‘–  and  πœ™ ↦ 1 βˆ’ min{βˆ‘π‘– 𝑀𝑖𝑗(1 βˆ’ πœ™π‘–), 1}  (3)

are equivalent. Proof: See Appendix A.

The AND-layer implements transformation (3) as a computational graph. Although the parameter we eventually want to learn is the binary weight matrix W, we learn it indirectly by reparameterizing it with a scaled sigmoid that converges to a Heaviside step function in the limit, as in the literal learning module. The learnable parameter 𝑒𝑖𝑗 for each binary weight 𝑀𝑖𝑗 is continuous and unconstrained:

𝑀𝑖𝑗 = 𝜎(𝑒𝑖𝑗/𝜏).  (4)

The resulting computational graph defined by equations (3) and (4) is differentiable with non-zero derivatives. Note that ensuring non-zero gradients during the backward pass, rather than using a straight-through estimator as in [15], is essential for robust training, as described in [24].

OR-layer The OR-layer maps a binary vector z to a single binary value by implementing a logical disjunction over selected components of z. The binary weights w^(𝑑) of this layer encode which of the conjunctions learned in the AND-layer are included in the final disjunction. We seek a differentiable approximation that approaches the non-differentiable OR operation in the limit, similar to what we have done for the AND-layer.

Proposition 2. Let 𝑀𝑗^(𝑑) be binary parameters and 𝑧𝑗 be Boolean variables. Then the mappings

z ↦ ⋁_{π‘—βˆΆπ‘€π‘—^(𝑑)=1} 𝑧𝑗  and  z ↦ max{𝑀𝑗^(𝑑)𝑧𝑗 ; 𝑗 = 1, ..., 𝐽}  (5)

are equivalent. Proof: See Appendix B.

The binary weight vector w^(𝑑) is learned in exactly the same way as the weight matrix W of the AND-layer, by reparameterizing the binary weights with a scaled sigmoid:

𝑀𝑗^(𝑑) = 𝜎(𝑣𝑗/𝜏),  (6)

where the 𝑣𝑗 are continuous and free learnable parameters. The OR-layer implements the computational graph defined by equations (5) and (6).

Decoding the DNF from the trained network The two-layer network composing the AND and OR layers can be represented as a composition of the two differentiable functions (3) and (5). As a direct result of Propositions 1 and 2, this network implements the evaluation of the DNF

πœ™ ↦ ⋁_{π‘—βˆΆπ‘€π‘—^(𝑑)=1} β‹€_{π‘–βˆΆπ‘€π‘–π‘—=1} πœ™π‘–.  (7)

The binary weights 𝑀𝑖𝑗 and 𝑀𝑗^(𝑑) can be retrieved by inspecting the signs of the learned parameters 𝑒𝑖𝑗 and 𝑣𝑗. Note, however, that during training the network weights are not binary when the temperature is different from 0. Hence Propositions 1 and 2 are not strictly applicable at training time, so there is no guarantee that at this stage the network output and the output of the decoded DNF are equivalent. However, we can make the outputs arbitrarily close to each other by driving the temperature down to zero. We observed that the difference between the network output and the output of the decoded DNF is negligible for reasonably cold temperatures (𝜏 = 10^βˆ’4). Therefore, at the end of training we obtain a crisp DNF as the classification rule.
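Continuing the sketch above, the rule learning module and the decoding of Eq. (7) might look as follows. This is illustrative code under the equations above; the class name and the parameter names 𝑒, 𝑣 (following Eqs. (4) and (6)) are ours.

```python
import torch
import torch.nn as nn

class RuleLearningModule(nn.Module):
    """Differentiable AND/OR layers of Eqs. (3)-(6), plus DNF decoding."""

    def __init__(self, n_literals: int, n_conjunctions: int):
        super().__init__()
        self.u = nn.Parameter(torch.randn(n_literals, n_conjunctions))  # -> W
        self.v = nn.Parameter(torch.randn(n_conjunctions))              # -> w^(t)
        self.tau = 1.0

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        w_and = torch.sigmoid(self.u / self.tau)              # Eq. (4)
        # AND-layer, Eq. (3): z_j = 1 - min(sum_i w_ij (1 - phi_i), 1)
        z = 1.0 - torch.clamp((1.0 - phi) @ w_and, max=1.0)
        w_or = torch.sigmoid(self.v / self.tau)               # Eq. (6)
        # OR-layer, Eq. (5): output = max_j w_j^(t) z_j
        return (w_or * z).max(dim=-1).values

    def decode_dnf(self):
        """Read off the DNF of Eq. (7) from the signs of u and v."""
        w_and = self.u > 0
        w_or = self.v > 0
        # One list of literal indices per selected conjunction.
        return [torch.nonzero(w_and[:, j]).flatten().tolist()
                for j in range(self.v.numel()) if w_or[j]]
```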
π‘›π‘Ÿ , 𝑙 π‘Ÿ Acc 1) (π‘₯0 > 0.25) ∨ (π‘₯1 < 0.5) 1.0 2, 1.0 0.99 28, 1.6 0.99 5, 2.2 1.0 2, 1.0 1.0 2) (π‘₯0 /π‘₯1 > 0.5) ∨ (π‘₯4 < 0.25) 1.0 2, 1.5 0.99 39, 2.1 0.99 22, 2.8 0.98 6, 1.5 1.0 3) (π‘₯0 + 0.5π‘₯1 > 0.5) ∨ (π‘₯4 < 0.25) 0.99 3, 1.7 0.98 40, 2.3 0.99 15, 2.7 0.97 6, 1.7 1.0 4) (π‘₯0 < 0.2 ∧ π‘₯1 + π‘₯2 > 0.5)∨ 0.99 4, 2.3 0.98 40, 3.7 0.98 23, 4.8 0.94 5, 2.2 1.0 (π‘₯4 /π‘₯3 > 1 ∧ π‘₯1 < 0.5) 5) (π‘₯4 < 0.2 ∧ π‘₯0 /π‘₯1 > 0.5)∨ 0.99 3, 1.3 0.98 39, 2.0 0.98 15, 3.5 0.97 6, 1.8 0.99 (0.5π‘₯3 + 0.2π‘₯1 > 0.5) ∨ (π‘₯0 < 0.2) Results with unknown ground truth Examples R2N DR-Netβˆ— RIPPERβˆ— BRCGβˆ— MLPβˆ— Magic 0.86 6, 3.3 0.84 18, 5.2 0.82 27, 6.0 0.84 22, 3.7 0.87 Heloc 0.72 3, 2.0 0.70 2, 6.3 0.70 12, 5.2 0.69 1, 1.9 0.71 Adult 0.83 2, 3.5 0.83 6, 13.5 0.82 20, 4.7 0.83 24, 3.8 0.84 House 0.86 5, 2.2 0.86 12, 6.3 0.82 41, 7.0 0.84 5, 5.2 0.89 Table 1 Accuracy, number of conjunctions in the DNF (π‘›π‘Ÿ ) and average number of literals per conjunction (π‘™π‘Ÿ ) for (top) the examples with known ground truth and (bottom) four datasets taken from the UCI ML database. Note that for the columns with an βˆ— , the benchmark accuracy and (π‘›π‘Ÿ , π‘™π‘Ÿ ) are taken from [15] to ensure a fair comparison. In this section we study the accuracy and rule complexity for a set of benchmark problems (i) where the underlying DNF is known and (ii) where this is not the case. We compare with three SOTA methods: RIPPER [6], DR-Net [15] and BRCG [5]. The hyperparameters, training routine and benchmarks are discussed in the Appendix F. We first consider five simple DNFs with increasing complexity (see Table 1) and compare the prediction accuracy, number of conjunctions in the DNF (π‘›π‘Ÿ ) and average number of literals per conjunction (π‘™π‘Ÿ ) of the R2N with SOTA methods. We generate a dataset of 104 samples from the underlying rules where the five input features (π‘₯0 , ..., π‘₯4 ) are randomly sampled between [0, 1]. For all examples, R2N obtains near perfect accuracy (> 99%) and the DNF we obtain approximate the ground truth very well. E.g., for the most complex example, Ex. 5, we obtain the DNF (0.2π‘₯1 + 0.5π‘₯3 > 0.5) ∨ (π‘₯0 < 0.2) ∨ (π‘₯0 /π‘₯1 > 0.59 ∧ π‘₯4 < 0.2). This example shows that R2N has not only obtained a compact rule model, but has also learned the ground truth literals involving the linear relationships and ratios of input features. The DNF we obtain for the other examples are detailed in Appendix E. The three benchmark methods (RIPPER, DR-Net and BRCG) all require binarizing the numer- ical input features, hence trading off accuracy and length of the optimal DNF (fewer pre-defined literals will result in more compact rule models, but cruder approximations of the decision boundary). We selected the default parameters to make a fair comparisons in this case (See Appendix F for a detailed description), but these methods inherently require this accuracy/DNF complexity trade-off. RIPPER and DR-Net provide a comparable accuracy, yet provide DNFs that are considerably longer than R2N while BRCG provides more compact DNFs, but com- promises significantly on the prediction accuracy. Note that this discrepancy is particularly important when the decision boundaries cannot be captured by univariate value comparisons as pre-defined literals (Ex. 2-5 in Table 1). In all cases, the rule language is not expressive enough to describe the linear decision boundary accurately with a sparse set of univariate literals. 
Next we consider four datasets for which the underlying rule models are unknown (taken from the UCI ML repository [25]). Besides numerical attributes, these datasets contain categorical data, which are one-hot encoded and supplied as additional input to the rule learning module as described in Section 4. As for the examples with known underlying DNF, we obtain significantly shorter DNFs than all the SOTA models (see Table 1). This shows that even though we have no a-priori information about the presence of multivariate literals, R2N provides more concise rule sets while outperforming SOTA methods on prediction accuracy. The accuracy we obtain with R2N approaches that of a Multi-Layer Perceptron (the MLP comparisons are taken directly from [15]). In Appendices G and D we show the sensitivity of our approach with respect to the size of the dataset, the noise level and the sparsity parameters. Overall, we show that the obtained rule model and its near-optimal accuracy can still be recovered for noise levels (randomly flipping a fraction of the labels in the dataset) up to 10% and for training sets on the order of 10Β³ samples.

6. Outlook
The method presented in this paper extends the vocabulary of rule-learning algorithms by learning literals that represent halfspaces. R2N can also serve in other applications as part of the domain vocabulary (or ontology), for instance by providing input predicates for other rule models like RIPPER or BRCG. We show that using the learned literals from R2N as input for RIPPER reduces the number of rules it learns, compared to RIPPER with univariate value comparisons, for all examples listed in Table 1 (see Appendix H). Note that while rules with fewer conjunctions and fewer literals per conjunction can in general be considered more easily interpretable, true interpretability may have to be examined by subject matter experts.

Although learning literals defined through hyperplanes has been shown to be a good first step towards improving the expressiveness of the rules while retaining interpretability, they are by no means sufficient for all practical applications. The framework we present can be extended to non-linear decision boundaries. A simple yet powerful example of this involves using black-box function approximators (e.g., MLPs and LSTMs) to learn non-linear literals with our general machinery from Section 4 for retaining differentiability. While this results in literals that are not explicitly interpretable, the final DNF that uses these literals can still be understood and audited by humans, who can then try to understand the learned literals in light of the rules that use them. As a concrete example, consider an application where the input data is sequential and the ground-truth class assignments depend on complex aggregates such as variances (e.g., π‘£π‘Žπ‘Ÿ(𝑐) > πœƒ1 β‡’ True) or counts (e.g., π‘π‘œπ‘’π‘›π‘‘(π‘₯𝑑 > πœƒ2) ≀ πœƒ3 β‡’ False). Halfspace predicates are inadequate for representing these relationships. However, LSTMs can learn binary β€œnon-linear” predicates from the data, which can then be used as literals in the rule model, as sketched below. Post-hoc analysis of these literals can then shed light on their meaning, reducing the complex problem of interpreting the entire model to the simpler problem of interpreting individual literals in the context of an a-priori understandable rule model.
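As a conceptual sketch of this extension (our illustration, not part of the paper's implementation), a recurrent literal module could replace the linear layer of Section 4.1 while keeping the scaled-sigmoid binarization of Eq. (2); all names below are hypothetical.

```python
import torch
import torch.nn as nn

class SequenceLiteralModule(nn.Module):
    """Hypothetical non-linear literal learner for sequential inputs.

    An LSTM summarizes the sequence; a linear head plus scaled sigmoid
    produces (approximately) Boolean literals, exactly as in Eq. (2).
    """

    def __init__(self, n_channels: int, hidden: int, n_literals: int):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_literals)
        self.tau = 1.0  # same externally annealed temperature as before

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); use the last hidden state as summary.
        _, (h_n, _) = self.lstm(x)
        logits = self.head(h_n[-1])
        return torch.sigmoid(logits / self.tau)
```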
Acknowledgments
This work has been partially funded by the French government as part of project PSPC AIDA 2019-PSPC-09, in the framework of the β€œProgramme d’Investissement d’Avenir”.

References
[1] J. Angwin, J. Larson, S. Mattu, L. Kirchner, Machine bias, 2016. URL: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, accessed: 2021-12-21.
[2] C. de Sainte Marie, Learning decision rules or learning decision models?, in: International Joint Conference on Rules and Reasoning, Springer, 2021, pp. 276–283.
[3] S. Kramer, A brief history of learning symbolic higher-level representations from data (and a curious look forward), in: C. Bessiere (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, 2020, pp. 4868–4876. doi:10.24963/ijcai.2020/678, survey track.
[4] S. DumančiΔ‡, H. Blockeel, Demystifying relational latent representations, in: International Conference on Inductive Logic Programming, Springer, 2017, pp. 63–77.
[5] S. Dash, O. GΓΌnlΓΌk, D. Wei, Boolean decision rules via column generation, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Curran Associates Inc., Red Hook, NY, USA, 2018, pp. 4660–4670.
[6] W. W. Cohen, Fast effective rule induction, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp. 115–123.
[7] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, C. Rudin, Learning certifiably optimal rule lists for categorical data, Journal of Machine Learning Research 18 (2018) 1–78. URL: http://jmlr.org/papers/v18/17-716.html.
[8] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, NIPS Deep Learning and Representation Learning Workshop (2015).
[9] J. R. Zilke, E. L. MencΓ­a, F. Janssen, DeepRED: Rule extraction from deep neural networks, in: International Conference on Discovery Science, Springer, 2016, pp. 457–473.
[10] A. S. Avila Garcez, G. Zaverucha, The connectionist inductive learning and logic programming system, Applied Intelligence 11 (1999) 59–77.
[11] M. V. FranΓ§a, G. Zaverucha, A. S. d’Avila Garcez, Fast relational learning using bottom clause propositionalization with artificial neural networks, Machine Learning 94 (2014) 81–104.
[12] K. Gao, K. Inoue, Y. Cao, H. Wang, Learning first-order rules with differentiable logic program semantics, arXiv preprint arXiv:2204.13570 (2022).
[13] G. Marra, M. Diligenti, F. Giannini, M. Gori, M. Maggini, Relational neural machines, arXiv preprint arXiv:2002.02193 (2020).
[14] V. Belle, Symbolic logic meets machine learning: A brief survey in infinite domains, in: International Conference on Scalable Uncertainty Management, Springer, 2020, pp. 3–16.
[15] L. Qiao, W. Wang, B. Lin, Learning accurate and interpretable decision rule sets from neural networks, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 4303–4311. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16555.
[16] F. Beck, J. FΓΌrnkranz, An investigation into mini-batch rule learning, arXiv preprint arXiv:2106.10202 (2021).
[17] L. Fu, Rule generation from neural networks, IEEE Transactions on Systems, Man, and Cybernetics 24 (1994) 1114–1124. doi:10.1109/21.299696.
[18] Z. Wang, W. Zhang, N. Liu, J. Wang, Scalable rule-based representation learning for interpretable classification, Advances in Neural Information Processing Systems 34 (2021).
[19] G. Sourek, V. Aschenbrenner, F. Zelezny, O. Kuzelka, Lifted relational neural networks, arXiv preprint arXiv:1508.05128 (2015).
[20] S. Kramer, E. Frank, Bottom-up propositionalization, in: ILP Work-in-progress Reports, 2000.
[21] M.-A. Krogel, S. Rawles, F. Ε½eleznyΜ€, P. A. Flach, N. Lavrač, S. Wrobel, Comparative evaluation of approaches to propositionalization, in: International Conference on Inductive Logic Programming, Springer, 2003, pp. 197–214.
[22] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. De Raedt, DeepProbLog: Neural probabilistic logic programming, Advances in Neural Information Processing Systems 31 (2018).
[23] H. Zhu, P. Murali, D. Phan, L. Nguyen, J. Kalagnanam, A scalable MIP-based method for learning optimal multivariate decision trees, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1771–1781. URL: https://proceedings.neurips.cc/paper/2020/file/1373b284bc381890049e92d324f56de0-Paper.pdf.
[24] Y. Bengio, N. LΓ©onard, A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv preprint arXiv:1308.3432 (2013).
[25] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.

A. Proof of Proposition 1
As the functions are binary-valued, it suffices to show that one function evaluating to 1 is equivalent to the other function evaluating to 1:

β‹€_{π‘–βˆΆπ‘€π‘–π‘—=1} πœ™π‘– = 1 ⟺ βˆ€π‘–: (𝑀𝑖𝑗 = 1) β†’ (πœ™π‘– = 1)
⟺ βˆ€π‘–: (𝑀𝑖𝑗 = 0) ∨ (πœ™π‘– = 1)
⟺ βˆ€π‘–: 𝑀𝑖𝑗(1 βˆ’ πœ™π‘–) = 0
⟺ βˆ‘π‘– 𝑀𝑖𝑗(1 βˆ’ πœ™π‘–) = 0  (8)
⟺ 1 βˆ’ min{βˆ‘π‘– 𝑀𝑖𝑗(1 βˆ’ πœ™π‘–), 1} = 1  (9)

B. Proof of Proposition 2

⋁_{π‘—βˆΆπ‘€π‘—^(𝑑)=1} 𝑧𝑗 = 1 ⟺ βˆƒπ‘—: (𝑀𝑗^(𝑑) = 1) ∧ (𝑧𝑗 = 1)  (10)
⟺ max{𝑀𝑗^(𝑑)𝑧𝑗 ; 𝑗 = 1, ..., 𝐽} = 1  (11)

C. Loss function and training
Our loss function has a mean-squared error component together with components that promote sparsity of the obtained rule model. The overall loss function is

β„’_mse + πœ†_and β€–Wβ€–β‚ + πœ†_or β€–w^(𝑑)β€–β‚ + πœ†_p β€–W^(𝑝)β€–β‚,  (12)

where β€–β‹…β€–β‚ is the 1-norm. Increasing πœ†_p increases the sparsity of the learned literals, increasing πœ†_and shortens the conjunctions, and increasing πœ†_or shortens the disjunction. For simplicity, we choose a single regularization penalty for all layers, πœ† = πœ†_and = πœ†_or = πœ†_p. Unless stated otherwise, we use πœ† = 10^βˆ’2 for training on synthetic datasets (with known ground truth) and a more conservative value of πœ† = 10^βˆ’3 for the UCI ML datasets [25]. We use the Adam optimizer from the PyTorch ecosystem with default learning rate and parameters to optimize Eq. (12). To avoid local minima during training, we randomly re-initialize the weights of the neural network every 10⁴ epochs and run the network for 2.5 Γ— 10⁡ epochs in total. Training is done with a fixed batch size of 100. We select the epoch with the lowest loss on the training set and report the prediction accuracy at that stage on the test set (train/test split of 80/20).
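A sketch of this loss in PyTorch, assuming the illustrative module sketches from Section 4 (the attribute names model.literals and model.rules are ours, not the paper's):

```python
import torch

def r2n_loss(y_pred, y_true, model, lam_and=1e-2, lam_or=1e-2, lam_p=1e-2):
    """Eq. (12): MSE plus L1 sparsity penalties on W, w^(t) and W^(p)."""
    mse = torch.mean((y_pred - y_true) ** 2)
    w_and = torch.sigmoid(model.rules.u / model.rules.tau)  # W, via Eq. (4)
    w_or = torch.sigmoid(model.rules.v / model.rules.tau)   # w^(t), via Eq. (6)
    w_p = model.literals.linear.weight                      # W^(p)
    return (mse + lam_and * w_and.abs().sum()
                + lam_or * w_or.abs().sum()
                + lam_p * w_p.abs().sum())
```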
As mentioned in the section on the literal learning module, to ensure smooth training as well as a binary output of the literal learning layer at the end, we use a cooling schedule for the sigmoid activation: a high temperature in the beginning allows larger gradients across a wider range of the input so that training can proceed. As the temperature approaches zero, the sigmoid is progressively scaled and approaches the Heaviside function in the limit, so that we achieve strict binarization in the end. The temperature is cooled by a factor 𝛾 = 0.995 at every epoch for all the examples presented in the paper. The number of output nodes of the literal learning module is π‘š = 10, while the number of conjunction nodes in the rule learning module is 𝐽 = 25. Increasing or decreasing these numbers impacts the expressivity of the network, but we empirically observed no significant sensitivity when varying them in the range 10-50.

D. Sparsity of the DNF
In the examples presented in Table 1, the number of conjunctions in the DNF is always smaller than the number of available conjunction nodes in the hidden layer of the rule learning module. As described in Appendix C, the loss function of R2N contains sparsity regularization on the weights of the literal learning, AND and OR layers, controlling the sparsity of the resulting DNF. To assess the role of the sparsity-promoting parameter πœ†, we trained the R2N for values of πœ† in the range [10^βˆ’5, 1], for both Ex. 4 and Ex. 5 from Table 1. As shown in Fig. 3 (red), the number of conjunctions in the DNF decreases as πœ† increases. The accuracy (black), however, remains near-optimal for values πœ† < 10^βˆ’1, suggesting that the most compact DNF with the smallest compromise on accuracy is obtained for πœ† β‰ˆ 10^βˆ’2 or 10^βˆ’3. At larger values of πœ†, the strength of the sparsity constraint results in DNFs that are shorter than the ground truth and in sparser approximations of the hyperplanes, compromising prediction accuracy; e.g., for Ex. 4, 77% accuracy is obtained with a DNF containing a single conjunction (βˆ’π‘₯0 βˆ’ 0.7π‘₯1 βˆ’ 0.4π‘₯3 + 0.3π‘₯4 > βˆ’0.7), while the ground-truth DNF contains three conjunctions. Controlling the value of πœ† gives the user an opportunity to balance prediction accuracy against the sparsity of the DNF.

E. DNFs obtained with R2N
In this section we show the DNFs obtained with R2N and discuss how they are obtained. The DNF associated with the learned rule model can be extracted directly from the binary weight matrix W and vector w^(𝑑), while the learned literals are extracted from the weight matrix W^(𝑝). Since the values of W^(𝑝) are constrained with a sparsity penalty (see Appendix C), most values are small, yet not strictly zero. We therefore threshold away all values whose normalized coefficient is smaller than 2.5% of the magnitude of the largest coefficient in a predicate, as sketched below.
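The thresholding step can be sketched as follows (illustrative code; the 2.5% relative tolerance is the one quoted above, and the function name is ours):

```python
import torch

def prune_small_coefficients(w_p: torch.Tensor, rel_tol: float = 0.025) -> torch.Tensor:
    """Zero out coefficients below 2.5% of each predicate's largest coefficient.

    w_p holds one learned predicate (hyperplane) per row, as in nn.Linear.weight.
    """
    max_per_predicate = w_p.abs().max(dim=1, keepdim=True).values
    mask = w_p.abs() >= rel_tol * max_per_predicate
    return w_p * mask
```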
For instance, for example 5 in Table 1, the unprocessed output of R2N is

[(βˆ’0.7π‘₯0 βˆ’ 0.7π‘₯1 βˆ’ π‘₯2 βˆ’ 0.2π‘₯3 βˆ’ 0.8π‘₯4 > βˆ’1173) ∧ (π‘₯4 < 0.2) ∧ (π‘₯0 βˆ’ 0.6π‘₯1 > βˆ’0.1)]
∨ [(π‘₯4 < 0.2) ∧ (βˆ’π‘₯0 βˆ’ 0.4π‘₯1 βˆ’ 0.1π‘₯2 βˆ’ 0.2π‘₯4 > 704)]
∨ [π‘₯0 < 0.2]  (13)
∨ [(0.5π‘₯0 βˆ’ 0.2π‘₯1 βˆ’ π‘₯2 βˆ’ 0.4π‘₯3 + 0.1π‘₯4 > βˆ’540) ∧ (βˆ’π‘₯0 βˆ’ 0.4π‘₯1 βˆ’ 0.1π‘₯2 βˆ’ 0.2π‘₯4 > 704)]
∨ [(π‘₯4 < 0.2) ∧ (0.3π‘₯0 + 0.1π‘₯2 βˆ’ π‘₯3 βˆ’ 0.3π‘₯4 > 3050)]
∨ [0.4π‘₯1 + π‘₯3 > 1.0].

This can be simplified by identifying and removing all constantly false (βŠ₯) or constantly true (⊀) predicates, e.g., 0.5π‘₯0 βˆ’ 0.2π‘₯1 βˆ’ π‘₯2 βˆ’ 0.4π‘₯3 + 0.1π‘₯4 > βˆ’540 β†’ ⊀, yielding

[(π‘₯4 < 0.2) ∧ (π‘₯0 βˆ’ 0.6π‘₯1 > βˆ’0.1)]
∨ [π‘₯0 < 0.2]  (14)
∨ [0.4π‘₯1 + π‘₯3 > 1.0].

The simplified DNFs we obtain for the other examples in Table 1 are as follows. For Ex. 1:

[(π‘₯0 > 0.26)]
∨ [(π‘₯1 < 0.49)].  (15)

For Ex. 2:

[(π‘₯0 βˆ’ 0.41π‘₯1 > 0) ∧ (π‘₯0 βˆ’ 0.53π‘₯1 > 0)]
∨ [(π‘₯4 < 0.25)].  (16)

For Ex. 3:

[(π‘₯0 + 0.49π‘₯1 > 0.41) ∧ (π‘₯4 < 0.23)]
∨ [(π‘₯0 + 0.49π‘₯1 > 0.41) ∧ (π‘₯0 + 0.50π‘₯1 > 0.51)]
∨ [(π‘₯4 < 0.25)].  (17)

Finally, for Ex. 4:

[(π‘₯1 < 0.49) ∧ (βˆ’π‘₯3 + π‘₯4 β‰₯ 0)]
∨ [π‘₯0 < βˆ’0.20]
∨ [(π‘₯0 βˆ’ 0.1π‘₯1 βˆ’ 0.2π‘₯2 βˆ’ 0.4π‘₯3 βˆ’ 0.2π‘₯4 > βˆ’2.2) ∧ (π‘₯0 < 0.2) ∧ (βˆ’π‘₯3 + π‘₯4 > 0)]
∨ [(π‘₯0 βˆ’ 0.1π‘₯1 βˆ’ 0.2π‘₯2 βˆ’ 0.4π‘₯3 βˆ’ 0.2π‘₯4 > βˆ’2.1) ∧ (π‘₯4 βˆ’ π‘₯3 > 0) ∧ (0.1π‘₯0 + π‘₯1 + π‘₯2 βˆ’ 0.1π‘₯3 > 0.46)].  (18)

F. Baselines
This section describes the configuration settings for the three baselines used in our experiments: DR-Net [15], RIPPER [6], and BRCG [5]. For all cases, we used a common training set of size 8000, evaluated on a common test set of size 2000, and used the equi-quantile feature binarizer included in the BRCG repository3 with 40 bins. For BRCG, we use the publicly available implementation from the AI Explainability 360 repository4. The algorithm uses two parameters, πœ†1 and πœ†2, to penalize the learning of more and longer clauses, respectively. We set πœ†1 = πœ†2 = 0.001, which is the default value recommended in the implementation. We also use the default values for the maximum number of iterations (100), the maximum number of columns generated per iteration (10), and the maximum degree (10) and width (5) parameters of the beam search. For DR-Net we use the implementation by the authors5 with the default parameters, except for the number of training epochs, which was increased from 2000 to 10000 following the authors’ recommendation for learning sparse rules. For RIPPER, we used the wittgenstein implementation6; instead of using its internal binarizer, we binarized the features upfront with the aforementioned feature binarizer. For evaluating the predictive accuracy on the synthetic examples in Table 1 in the main text, we also report the values obtained from an MLP (Multi-Layer Perceptron). We used a simple two-layer architecture with 10 neurons in the hidden layer, trained for 10⁴ epochs with default optimizer parameters (Adam optimizer).

Figure 3: (top) Accuracy on the test set and number of conjunctions in the DNF, π‘›π‘Ÿ, as a function of πœ† = πœ†_and = πœ†_or = πœ†_p for Ex. 4 (dashed) and Ex. 5 (solid) from Table 1. (bottom) Accuracy on the test set as a function of (left) the size of the training set and (right) the noise level for (red) Ex. 4 and (blue) Ex. 5 from Table 1.

3 https://github.com/Trusted-AI/AIX360/blob/master/aix360/algorithms/rbm/features.py
4 https://github.com/Trusted-AI/AIX360
5 https://github.com/Joeyonng/decision-rules-network
6 https://github.com/imoscovitz/wittgenstein/tree/master/wittgenstein
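For reference, a generic equi-quantile binarizer of the kind used here can be sketched as follows. This is a simplified stand-in of our own; the actual FeatureBinarizer in the AIX360 repository has a richer interface.

```python
import numpy as np

def equi_quantile_binarize(X: np.ndarray, n_bins: int = 40) -> np.ndarray:
    """Turn each numerical column into Boolean threshold literals x_i <= t.

    Thresholds are taken at equally spaced quantiles of each feature,
    mimicking the equi-quantile binarization applied to the baselines.
    """
    literals = []
    for col in X.T:
        thresholds = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
        literals.append(col[:, None] <= thresholds[None, :])
    return np.concatenate(literals, axis=1)
```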
G. Robustness w.r.t. noise and sample size
Data from which rule models are learned are oftentimes noisy and sparse. To assess the robustness of our approach with respect to the number of training examples, we plot the prediction accuracy on the test set for Ex. 4 and 5 of Table 1 as a function of the size of the training set (Fig. 3) and find that even for a training set size of 𝑂(10Β³) we obtain good accuracy (> 0.95). This shows that our method is not overly sensitive to the size of the training dataset, a common problem with neural-network based models. Alternatively, when we add label noise to the data provided to the algorithm (randomly flipping a fraction of the labels in the dataset), we find that for both examples the accuracy on the test set approaches near-optimal accuracy up to noise levels of 10% (Fig. 3; the solid line indicates optimal performance, which, since the data is noisy, is the upper limit for the performance).

H. Interoperability with other rule learning algorithms
The relational literals learned by R2N can also serve in other applications as part of the domain vocabulary (or ontology), for instance as input literals for other rule learning algorithms, ensuring interoperability with existing rule learning workflows. To showcase how this can augment algorithms like RIPPER, we used the learned literals from R2N as input literals for RIPPER and found that, for all synthetic examples in Table 1, the number of rules learned is considerably smaller than with RIPPER on univariate value comparisons. As shown in Table 2, the number of learned rules decreases and becomes comparable with that of R2N, along with an improvement in accuracy for 4 of the 5 cases (not shown in the table), demonstrating that the learned literals are not only specific to R2N but can also benefit other rule learning algorithms.

Example | 1 | 2 | 3 | 4 | 5
π‘›π‘Ÿ with quantile binning | 5 | 22 | 15 | 23 | 15
π‘›π‘Ÿ with learned literals | 3 | 3 | 2 | 5 | 3

Table 2: Number of conjunctions (π‘›π‘Ÿ) obtained with RIPPER using quantile binning and using the learned literals obtained from R2N, for the 5 synthetic examples presented in Table 1.
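To illustrate this workflow, the following sketch evaluates two hypothetical learned halfspaces on synthetic data (Ex. 2) and feeds the resulting Boolean features to wittgenstein's RIPPER. The halfspace coefficients are invented for illustration, and the wittgenstein call follows the package's documented interface; details may differ across versions.

```python
import numpy as np
import pandas as pd
import wittgenstein as lw

# Hypothetical learned halfspaces (w, b), as would be read off W^(p) and b.
halfspaces = [(np.array([1.0, -0.5, 0.0, 0.0, 0.0]), 0.0),   # x0 - 0.5*x1 > 0
              (np.array([0.0, 0.0, 0.0, 0.0, -1.0]), 0.25)]  # x4 < 0.25

def binarize_with_literals(X: np.ndarray) -> np.ndarray:
    # Evaluate each learned literal phi_i(x) = [w . x + b > 0] on the data.
    return np.stack([X @ w + b > 0 for w, b in halfspaces], axis=1)

X = np.random.default_rng(0).uniform(size=(1000, 5))
y = (X[:, 0] / X[:, 1] > 0.5) | (X[:, 4] < 0.25)  # Ex. 2 ground truth

df = pd.DataFrame(binarize_with_literals(X), columns=["phi1", "phi2"])
df["label"] = y

ripper = lw.RIPPER()
ripper.fit(df, class_feat="label", pos_class=True)
print(ripper.ruleset_)  # rules expressed over the learned literals
```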