Extracting Propositional Rules from Feed-forward Neural Networks by Means of Binary Decision Diagrams

Sebastian Bader
Department of Computer Science, University of Rostock, Germany
sebastian.bader@uni-rostock.de

Abstract

We discuss how to extract symbolic rules from a given binary threshold feed-forward network. The proposed decompositional approach is based on an internal representation using binary decision diagrams. They allow for an efficient composition of the intermediate results as well as for an easy integration of integrity constraints into the extraction. We also discuss some experimental results indicating a good performance of the approach.

1 Introduction

During the training process, neural networks acquire knowledge by generalising from raw data. Unfortunately, this learnt knowledge is hidden in the weights associated with the connections, and humans have no direct access to it. One goal of rule extraction is the generation of a human-readable description of the behaviour of the output units with respect to the input units. Usually, the result is described in the form of if-then rules, giving conditions that activate (or inactivate) a given output unit.

Rule extraction from connectionist systems is still an open research problem, even though a number of algorithms exist. For an overview of different approaches we refer to [Andrews et al., 1995] and [Jacobsson, 2005]. Extraction techniques can be divided into pedagogical and decompositional approaches. While the former treats the network as a black box, the latter decomposes the network, constructs rules describing the behaviour of the simpler parts, and then re-composes those results.

In [Bader et al., 2007], we proposed the CoOp algorithm, a decompositional approach for the extraction of propositional rules from feed-forward neural networks. Here, we discuss an extension of this approach. In this new extension, binary decision diagrams (BDDs) are used to store intermediate results, i.e., rules extracted from single units (perceptrons). This representation has three advantages:

1. results are stored in a very compact form,
2. intermediate results can easily be combined, and
3. integrity constraints can easily be incorporated.

After a presentation of all necessary concepts in Section 2, we discuss the proposed extension of the CoOp approach, i.e., the extraction of BDDs from simple perceptrons. Afterwards, we discuss how to compose the intermediate results and how to incorporate integrity constraints. In Section 6, first experimental results are presented, and further work is discussed in Section 7.

2 Preliminaries

In this section, we introduce the necessary concepts. After defining feed-forward neural networks and the rule extraction problem, we discuss binary decision diagrams.

Feed-forward artificial neural networks (ANNs), also called connectionist systems, consist of simple computational units (neurons) which are connected. The set of units U together with the connections C ⊆ U × U forms an acyclic directed graph. In this paper we concentrate on networks whose units apply the ±1-threshold function. Such a network can be represented as a 6-tuple ⟨U, U_inp, U_out, C, ω, θ⟩, where U_inp, U_out ⊆ U denote the input and output units of the network, i.e., the sources and sinks of the underlying graph. The functions ω : C → R and θ : U → R assign a weight to every connection and a threshold to every unit, respectively. Every unit u has an activation value act_u ∈ {−1, +1}, which is set from outside for input units and which is computed from the activation values of the predecessor units and the threshold θ(u) for all other units as follows:

    act_u = +1  if Σ_{c=(v,u)∈C} act_v · ω(c) ≥ θ(u)
            −1  otherwise

Figure 1 shows a simple network serving as running example throughout the paper.
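To make the activation rule concrete, here is a minimal Python sketch; the function name `activation` and the dictionary-based encoding of ω and θ are our own illustration, not part of the paper.

```python
# A minimal sketch of the +/-1-threshold activation rule. The dictionary-based
# network representation is a hypothetical choice made only for illustration.

def activation(unit, weights, thresholds, act):
    """Compute act_u for a non-input unit u.

    weights:    maps connections (v, u) to their weight omega((v, u))
    thresholds: maps units to their threshold theta(u)
    act:        maps the predecessor units of u to activations in {-1, +1}
    """
    net = sum(act[v] * w for (v, u), w in weights.items() if u == unit)
    return +1 if net >= thresholds[unit] else -1


# The perceptron of unit g from Figure 2: theta(g) = 4 and
# omega(c) = 1, omega(d) = 2, omega(e) = 3, omega(f) = 5.
weights = {('c', 'g'): 1, ('d', 'g'): 2, ('e', 'g'): 3, ('f', 'g'): 5}
thresholds = {'g': 4}
print(activation('g', weights, thresholds, {'c': -1, 'd': -1, 'e': +1, 'f': +1}))  # +1
```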
Because every unit can only be active (act_u = +1) or inactive (act_u = −1), we can associate a propositional variable u with it, which is assumed to be true if and only if the unit u is active; we use ū to denote the negation of u. Furthermore, we can characterise network inputs as interpretations I ⊆ U_inp of the propositional variables U_inp. We use act_u(I) to denote the state of unit u if all input units contained in I are active and all other input units are inactive. Using this notation, we can define the rule extraction problem as follows: The rule extraction problem for a given node u of a feed-forward threshold network ⟨U, U_inp, U_out, C, ω, θ⟩ is the construction of a propositional formula F over U_inp such that for all interpretations I we find

    act_u(I) = +1  if I |= F
               −1  otherwise

I.e., we are looking for a propositional formula stating necessary and sufficient conditions (in terms of the activation of input units) such that the unit u is active.

[Figure 1: A simple network serving as running example. The thresholds are shown within the nodes and the connection weights are shown on the right.]

[Figure 2: The perceptron corresponding to unit g: P_g = ⟨θ, I, ω⟩ with θ = 4, I = {c, d, e, f}, and ω(c) = 1, ω(d) = 2, ω(e) = 3, ω(f) = 5.]

Usually not all input combinations make sense in a given application domain, because some of them would correspond to invalid states of the world. We use the term valid inputs to denote the set of allowed input combinations. Even more important for the extraction is the fact that all training samples are taken from this subset. Therefore, the network learns to solve a task under the implicit conditions hidden in the selection of inputs. Integrity constraints are a way to make those conditions explicit during the extraction. An integrity constraint is a formula IC over U_inp describing the set of valid inputs V ⊆ P(U_inp) as follows: for all I ⊆ U_inp we find I |= IC if and only if I ∈ V. Using integrity constraints, we can reformulate the extraction problem as follows: The rule extraction problem for a given network N and a given integrity constraint IC is the construction of a propositional formula F over U_inp such that for all interpretations I with I |= IC we find

    act_u(I) = +1  if I |= F
               −1  otherwise

Networks can be decomposed into their basic building blocks, namely single units together with their incoming connections. These single units can be seen as simple sub-networks (perceptrons) consisting of a number of inputs and a single output unit, together with the corresponding weighted connections. To simplify the notation, we use P_p = ⟨θ, I, ω⟩ to denote the perceptron corresponding to the unit p together with its threshold θ, the set of predecessor units I and the weight function ω. Figure 2 shows the perceptron for the output unit g.

Binary decision diagrams (BDDs) are a data structure to represent propositional formulae in a very compact way and to manipulate them efficiently. A nice introduction can be found, for example, in [Andersen, 1999]. Intuitively, a BDD is a directed acyclic graph with a variable associated to every node, such that every node n ≠ 0, 1 has exactly two successors, called the high and the low branch of n. The nodes 0 and 1 are the sinks of the graph. We use ⟨≺, 0, 1, R, N⟩ to refer to a BDD with sinks 0 and 1, a set of nodes N and a root node with identifier R. Furthermore, we use ⟨i, v, h, l⟩ to denote the node with identifier i, with variable v = var(i), and with high and low branches pointing to the nodes with identifiers h and l, respectively.

Usually a BDD is assumed to be ordered and reduced. It is called ordered iff there exists a linear order ≺ on the variables such that the successors of a node are marked with variables that are bigger with respect to ≺. It is called reduced if no two nodes for the same variable have identical high and low branches, and if for no node the high and low branch coincide.

BDDs represent propositional formulae in if-then-else normal form. The formula corresponding to a given node is defined recursively as follows:

    pf(0) := ⊥        pf(1) := ⊤
    pf(i) := (var(i) ∧ pf(h)) ∨ (¬var(i) ∧ pf(l))

Figure 3 shows a simple BDD using a graphical representation as well as the underlying data structure and the corresponding logic formulae for every node.

[Figure 3: A simple BDD ⟨≺, 0, 1, R, N⟩ with R = 4 and N = {⟨2, b, 0, 1⟩, ⟨3, b, 1, 0⟩, ⟨4, a, 2, 3⟩}; the corresponding formulae are pf(2) = (b ∧ ⊥) ∨ (¬b ∧ ⊤) = ¬b, pf(3) = (b ∧ ⊤) ∨ (¬b ∧ ⊥) = b and pf(4) = (a ∧ ¬b) ∨ (¬a ∧ b). Nodes are annotated with their variables and their IDs. High branches are depicted as solid and low branches as dashed lines.]
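As a small illustration (our own sketch, not the implementation used in the paper), the BDD of Figure 3 can be stored as a map from identifiers to ⟨variable, high, low⟩ triples, and pf can be evaluated recursively:

```python
# A sketch of the BDD node representation <i, v, h, l> and of evaluating the
# formula pf(i) under an interpretation (the set of variables that are true).

# The BDD of Figure 3: sinks 0 and 1, root R = 4.
nodes = {2: ('b', 0, 1), 3: ('b', 1, 0), 4: ('a', 2, 3)}

def evaluate(i, interpretation, nodes):
    """Evaluate pf(i): pf(0) = False, pf(1) = True, and otherwise
    (var(i) and pf(h)) or (not var(i) and pf(l)), here implemented as the
    equivalent if-then-else shortcut on the known truth value of var(i)."""
    if i == 0:
        return False
    if i == 1:
        return True
    var, high, low = nodes[i]
    return evaluate(high if var in interpretation else low, interpretation, nodes)

# pf(4) = (a and not b) or (not a and b), i.e., exclusive or:
print(evaluate(4, {'a'}, nodes))       # True
print(evaluate(4, {'a', 'b'}, nodes))  # False
```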
3 From Perceptrons to Search Trees

In this section, we discuss an algorithm to extract a BDD from a single unit such that the BDD represents necessary and sufficient conditions on the inputs to turn the unit active. Following [Bader et al., 2007], we define input patterns I as subsets of the inputs of a given perceptron P_p = ⟨θ, I, ω⟩ which are assumed to be active. The inputs not contained in I can be either active or inactive. We define the corresponding minimal input i_min(I) as follows:

    i_min(I) = Σ_{a∈I} ω(a) − Σ_{a∉I} |ω(a)|

where the second sum ranges over the inputs of the perceptron not contained in I. The minimal input is thus computed by adding the contribution Σ_{a∈I} ω(a) of the fixed inputs and the minimal input caused by all other inputs. A perceptron is called positive if all weights are positive. For the following constructions, we assume the perceptrons to be positive. In Section 5, we discuss how to apply the extraction to arbitrary perceptrons.

The construction of BDDs below is based on the search trees described in [Bader et al., 2007]. These search trees contain a node for every possible input pattern. Children of a given node correspond to input patterns which contain exactly one symbol more, and all nodes are sorted with respect to their minimal inputs. If the minimal input of some node exceeds the threshold, that node is marked (i.e., the corresponding input pattern represents a sufficient condition to turn the perceptron active). The complete tree is pruned by removing all nodes for which no descendant is marked and all nodes which are descendants of marked nodes. The construction of a pruned tree is shown in Algorithm 1. Figure 4 shows the full search tree and the resulting pruned search tree on top of it.

[Figure 4: The pruned search tree for the perceptron P_g from above. The underlying full tree is depicted in grey using dashed lines. The nodes contain the newly added symbol and are annotated with the corresponding input pattern and the resulting minimal input. All nodes for which the minimal input exceeds the threshold of θ(g) = 4 are shown with grey background.]

Input: A positive perceptron P_p⁺.
Output: A pruned search tree.
1 Fix an order ≺ such that b ≺ c if ω(b) ≥ ω(c).
2 Create a root node for the empty input pattern.
3 Add a child labelled x for each input symbol x (sorted wrt. ≺).
4 foreach newly added node labelled y do
5   Add a new child c for every symbol z with y ≺ z (sorted wrt. ≺).
6   Label c with the corresponding pattern I.
7   Mark c if i_min(I) > θ(p).
8 Remove all descendants of marked nodes.
9 Remove all nodes for which no descendant is marked.
Algorithm 1: Constructing a pruned search tree.
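The following brute-force sketch (our own reference implementation, not Algorithm 1 itself, which avoids enumerating all subsets by pruning the search tree) illustrates i_min and the resulting minimal input patterns for the running example P_g; the function names are hypothetical.

```python
from itertools import combinations

# A brute-force reference (ours) for the minimal input patterns of a positive
# perceptron; Algorithm 1 computes the same patterns without enumerating all
# subsets, by pruning the search tree instead.

def i_min(pattern, weights):
    """i_min(I): weights of the inputs in I minus the absolute weights of all
    remaining inputs (which may still be inactive)."""
    return (sum(weights[a] for a in pattern)
            - sum(abs(w) for a, w in weights.items() if a not in pattern))

def minimal_input_patterns(theta, weights):
    # A pattern is sufficient iff its minimal input reaches the threshold,
    # matching the activation rule act_u = +1 iff the weighted sum >= theta.
    sufficient = [set(p)
                  for r in range(len(weights) + 1)
                  for p in combinations(weights, r)
                  if i_min(p, weights) >= theta]
    # keep only patterns without a sufficient proper subset
    return [p for p in sufficient if not any(q < p for q in sufficient)]

# The perceptron P_g from Figure 2 yields the patterns {e, f} and {c, d, f}.
print(minimal_input_patterns(4, {'c': 1, 'd': 2, 'e': 3, 'f': 5}))
```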
Exploiting the structure of these search trees, we can easily construct BDDs representing conditions to turn the perceptron active, i.e., the perceptron is active for exactly those input patterns which, understood as interpretations, turn the logic formula corresponding to the BDD true.

4 From Search Trees to BDDs

Before presenting an algorithm to construct BDDs from a given pruned search tree, we introduce some further notation. Every node in the search tree is represented as a pair ⟨I, C⟩ with I being the corresponding input pattern and C being the set of children. id(n) denotes a unique identifier for the node n (e.g., the corresponding input pattern, or some index); this identifier is also used as internal index for the BDD nodes. We assume id(n) := 0 if there is no node n. var(n) denotes the symbol which is added to the input pattern at node n.

The construction of a BDD for a given search tree is shown as Algorithm 2. This algorithm transforms a search tree into a BDD by traversing the tree in a left-depth-first manner. A node's high branch points to 1 if its minimal input exceeds the threshold. Otherwise, it points to its leftmost child, or to 0 if there is no child. The low branch points to the right sibling, or to 0 if there is none. The result for the perceptron P_g is shown in Figure 5.

[Figure 5: The BDD corresponding to the pruned search tree from Figure 4.]

Input: A search tree T for P_p = ⟨θ, I, ω⟩ wrt. ≺.
Output: A corresponding OBDD ⟨≺, 0, 1, R, N⟩.
1 if T is empty then
2   R = 0 and N = {}
3 else if T contains only the root node then
4   R = 1 and N = {}
5 else
6   R = id(r_l) for the leftmost child r_l of the root.
7   N = {}.
8   foreach leaf node n in T do
9     Add ⟨id(n), var(n), 1, l⟩ to N with l = id(r) for the right sibling r of n.
10  foreach node n = ⟨I, C⟩ in T with left sibling l do
11    Let ⟨id(l), var(l), h_l, l_l⟩ be the node for l.
12    Let ⟨id(l_l), var(n), h_l1, l_l1⟩ be the node for the leftmost child l_l of l.
13    if mci(l) − 2ω(l) + 2ω(n) > θ then
14      Add ⟨id(n), var(n), l_l1, l_n⟩ to N with l_n = id(r_n) for the right sibling r_n of n.
15  foreach other non-root node n = ⟨I, C⟩ do
16    Add a node ⟨id(n), var(n), id(c), l⟩ to N for the leftmost child c of n, where l = id(r_n) for the right sibling r_n of n and l = 0 if there is none.
Algorithm 2: Constructing a BDD.
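For the running example, the two minimal input patterns of P_g correspond to the conditions (e ∧ f) ∨ (c ∧ d ∧ f). The following brute-force check (our own sanity test, not part of the extraction procedure) confirms on all 16 possible inputs that these conditions are necessary and sufficient for g to become active:

```python
from itertools import product

# Our own sanity test (not from the paper): unit g of the running example is
# active exactly when the input contains one of the minimal patterns
# {e, f} or {c, d, f}, i.e., when (e and f) or (c and d and f) holds.

weights = {'c': 1, 'd': 2, 'e': 3, 'f': 5}
theta = 4

for values in product([-1, +1], repeat=4):
    act = dict(zip(['c', 'd', 'e', 'f'], values))
    active = sum(act[x] * w for x, w in weights.items()) >= theta
    formula = (act['e'] == +1 and act['f'] == +1) or \
              (act['c'] == +1 and act['d'] == +1 and act['f'] == +1)
    assert active == formula  # holds for all 16 input assignments
print("extracted conditions for g are necessary and sufficient")
```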
Please note that the BDD can be constructed without building the search tree first. The tree is used only to describe the underlying ideas; all conditions tested in Algorithm 2 can be checked while expanding the tree step by step. Looking a little closer at the constructed search tree, we find that some sub-trees have an identical internal structure, which is exemplified in Figure 6. If the condition tested in Line 13 of Algorithm 2 is fulfilled, two neighbouring sub-trees are structured identically. mci(n) denotes the minimum of all minimal inputs associated to nodes below n. Please note that mci(n) can be computed without expanding the sub-tree, by looking at the associated input pattern.

The mentioned structural equivalence can be exploited by using a shortcut into the already constructed BDD, thus preventing the expansion of an identical sub-tree. Figure 6 contains a number of those shortcuts, e.g., one from node {b} to the node {c, a}, because the children of {b} are annotated in the same way as the nodes below and to the right of {c, a}. Please note that this identity can be recognised without expanding the second sub-tree, i.e., the construction of the whole tree below {b} can be avoided.

The condition in Line 13 is fulfilled whenever the perceptron shows a so-called n-of-m behaviour, i.e., if there are m inputs of which n suffice to turn the perceptron active. In this case, there will be n equivalent sub-trees, which can be shortcut. As discussed in [Towell and Shavlik, 1993], this occurs quite frequently in trained neural networks.

[Figure 6: A larger BDD and its underlying search tree. The tree has been constructed for a perceptron with 5 inputs whose weights are all 1 and with threshold 0, i.e., this perceptron is active if 3 of the five inputs are +1. Please note that there are certain symmetries in the tree: all labels and minimal inputs of the sub-trees of node {b} coincide with those of the right sub-trees of {a} if a is substituted by b. This has been exploited by linking from {b} to {c, a}.]

5 Composition of Intermediate Results

In the previous section, we have been concerned with positive perceptrons only. But we can easily turn every perceptron into a positive one by multiplying negative weights by −1 and inverting the corresponding input symbols. By doing so, we can apply the algorithm to all output units of a given network and obtain a BDD describing necessary conditions, with respect to the predecessor units, that turn the output unit active. But some of the input symbols may have been inverted, i.e., we also need an algorithm to construct BDDs stating conditions which turn a perceptron inactive. Due to the symmetry of the threshold function, this algorithm is dual to Algorithms 1 and 2: by inverting the order and the inequalities we obtain an algorithm that constructs such a BDD.
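A minimal sketch of this normalisation step (function name and data layout are our own):

```python
# A sketch (ours) of the normalisation described above: negative weights are
# flipped and the affected input symbols are remembered as inverted. Reading
# an inverted symbol as the negation of the original input leaves the weighted
# sum unchanged, because act * w == (-act) * (-w) for act in {-1, +1}.

def make_positive(theta, weights):
    positive, inverted = {}, set()
    for symbol, w in weights.items():
        if w < 0:
            positive[symbol] = -w
            inverted.add(symbol)   # this input has to be read negated
        else:
            positive[symbol] = w
    return theta, positive, inverted

# Hypothetical example: a weight of -3 becomes 3 on the inverted input b.
print(make_positive(1, {'a': 2, 'b': -3}))  # (1, {'a': 2, 'b': 3}, {'b'})
```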
Once we have extracted the BDD for a given output unit, we can continue by substituting the nodes testing non-input units (i.e., nodes not corresponding to input units of the network) by their corresponding BDDs. A non-negated node is replaced by the BDD as constructed above, and negated nodes are replaced by the dual BDDs. As mentioned above, BDDs have been designed to allow for an efficient manipulation of logic formulae, and in fact it is straightforward to compose the intermediate results into an overall diagram by simply replacing the nodes. But this is not the best approach, because the resulting 'global' BDD would not be ordered any more. Instead, while expanding the BDD, we can keep it ordered (and reduced) as described in [Andersen, 1999]. After expanding all non-input nodes, we obtain a final BDD representing necessary and sufficient conditions on the network's inputs to turn a given output unit active.

Using BDDs as internal data structure has some further advantages. We can actually extract all output units into the same global BDD. Doing so automatically leads to a sharing of intermediate results, because common substructures are contained only once within this BDD. Figure 7 shows the final 'global' BDD for the network from Figure 1. Please note that the right node labelled b is used for both output units g and h.

[Figure 7: The global BDD for the network from Figure 1.]

Furthermore, we can integrate integrity constraints in a straightforward fashion. Instead of starting with an empty BDD, we extract the output nodes into a BDD representing the integrity constraints. To exemplify this, a network has been trained on the encoder-decoder task. It contains 8 input units, 8 output units and 3 hidden units, and it is trained to learn the identity mapping for all inputs in which exactly one unit is active, i.e., the network has to learn a compressed representation within the hidden layer. Applying the algorithm presented above yields an unnecessarily large result, shown in Figure 8 on the left. Using the integrity constraint that at most one input is active at a time yields the BDD shown on the right. 900 nodes have to be constructed (including all intermediate results while constructing the BDD) for the 'normal' BDD, but only 124 while using the integrity constraint. This shows the advantage of using integrity constraints right from the beginning of the extraction process. Usually they are used to refine the extraction result afterwards. This would be possible here as well by simply computing the conjunction of the 'normal' BDD with one representing the integrity constraint. But starting with the constraint avoids the construction of many intermediate nodes which would be removed afterwards.

[Figure 8: The result of extracting output unit 1 from an 8-3-8 encoder-decoder network. The BDD on the left is the result of the 'normal' extraction. On the right the constraint that at most one unit is active has been incorporated.]
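To give a feeling for the effect (our own illustration, not taken from the paper): under the max1 constraint only 9 of the 2^8 = 256 possible input vectors of the encoder-decoder network are valid, which suggests why the constrained extraction has to build far fewer nodes.

```python
from itertools import product

# Our own illustration of the max_1 integrity constraint used for the
# 8-3-8 encoder-decoder example: at most one of the 8 inputs is active.

def max_n(activations, n):
    """Integrity constraint: at most n of the given +/-1 activations are +1."""
    return sum(1 for a in activations if a == +1) <= n

valid = [a for a in product([-1, +1], repeat=8) if max_n(a, 1)]
print(len(valid))  # 9 of 256 input vectors satisfy the constraint

# Refining afterwards corresponds to conjoining an extracted formula F with
# the constraint IC; extracting into a BDD that already represents IC avoids
# building the parts of the diagram that this conjunction would remove again.
```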
6 Experimental Evaluation

To evaluate the approach, a Prolog implementation has been used to gather some statistics. The results are shown in Table 1. The table shows average numbers for different numbers of inputs: the size of the full search tree, the number of minimal input patterns, the size of the corresponding BDD and the number of BDD nodes per input pattern. All numbers have been collected from 100 random perceptrons per size. The extraction using the full search tree is not feasible due to the exponential growth. The number of input patterns is a conservative lower bound for the size of the pruned search tree, because those trees have at least one node per minimal input pattern. The results show that the use of BDDs proposed here yields a very compact representation. Even though the number of nodes in the BDD grows, the ratio (node/IP) between the size of the BDD and the number of minimal input patterns decreases.

    |I|       |T|       #IPs      |BDD|   node/IP
      1         2       0.58       2.58     4.448
      5        32       4.31       8.49     1.969
     10      1024      63.51      49.97     0.786
     15     32768    1270.45     313.25     0.246
     20   1048576   25681.70    1863.90     0.072

Table 1: The size of the full search tree (|T|), the number of minimal input patterns as a lower bound for the size of the pruned search tree (#IPs) and the size of the corresponding BDDs (|BDD|) for different numbers of inputs (|I|).

A second experiment has been performed to show the effect of using integrity constraints while extracting the BDDs. A network with 6 input, 4 hidden and 2 output units has been used for this experiment. The possible inputs have been constrained by a max_n integrity constraint for 0 ≤ n ≤ 6. The results are presented in Figure 9. For every n the experiment has been conducted for the same 100 randomised networks, and the following numbers have been collected: the size of the sub-BDD encoding the constraint, and the minimal, maximal and average size of the final BDD.

[Figure 9: Resulting BDD sizes of the extraction for different max_n integrity constraints. The bars indicate minimal, maximal and average sizes of the BDDs. The size of the sub-BDD for the constraint is shown in grey.]
The approach the numbers show the total number of internal nodes as presented here detects equivalent sub-BDDs for n- constructed for the BDD, i.e., including all necessary in- of-m patterns. But there are more cases for equivalent termediate nodes. For n = 1, i.e., the biggest restriction, sub-BDDs [Mayer-Eichberger, 2008]. Those have to be we obtain very small BDDs. The size of the BDD grows integrated into the extraction procedure. It would also up to n = 4 and decreases again for n > 4. From those be interesting to study the evolution of a network during observations we can conclude that the incorporation of the training process by repeatedly applying the extrac- integrity constraints into the extraction process can lead tion method and compare the results. to big savings in terms of nodes constructed for the final BDD. Without their use during the extraction, we would Acknowledgements The author is thankful for the have to construct the BDD corresponding to n = 6. This comments of Valentin Mayer-Eichberger and two anony- big BDD of ≈ 400 nodes would have to be refined with mous reviewers. respect to the constraints afterwards. There seem to be cases (e.g., for n = 4) where the use of integrity constrain References yields larger BDDs, but nonetheless, the final BDD does [Andersen, 1999] H. R. Andersen. An introduction to binary decision not have to be revised afterwards, and the difference is diagrams. Lecture Notes, 1999. [Andrews et al., 1995] R. Andrews, J. Diederich, and A. Tickle. A not too big. survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge–Based Systems, 8(6), 1995. 7 Conclusions and Future Work [Bader et al., 2007] S. Bader, S. Hölldobler, and V. Mayer-Eichberger. Extracting propositional rules from feed-forward neural networks — A novel approach for the extraction of propositional rules a new decompositional approach. In Proceedings of the 3rd Inter- from feed-forward networks of threshold units has been national Workshop on Neural-Symbolic Learning and Reasoning, NeSy’07, January 2007. presented. After decomposing the network into percep- [Jacobsson, 2005] H. Jacobsson. Rule extraction from recurrent neu- trons, binary decision diagrams representing precondi- ral networks: A taxonomy and review. Neural Computation, tions that activate or inactivate the perceptron have 17(6):1223–1263, 2005. been extracted. Those intermediate representations can [Krishnan et al., 1999] R. Krishnan, G. Sivakumar, and P. Bhat- be composed using the usual algorithms for BDDs, or tacharya. A search technique for rule extraction from trained neural networks. Non-Linear Anal., 20(3):273–280, 1999. they can be combined during their construction by ex- [Mayer-Eichberger, 2008] V. Mayer-Eichberger. Towards solving a sys- tracting one into the other. The latter approach does tem of pseudo boolean constraints with binary decision diagrams. also allow for an incorporation of integrity constraints Master’s thesis, Universidade Nova de Lisboa, SEP 2008. – already during the extraction of the intermediate re- [Towell and Shavlik, 1993] G. Towell and J. W. Shavlik. Extract- sults. As already mentioned in [Bader et al., 2007], the ing refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993. pruned search trees constructed above are related to the 27