Extracting Propositional Rules from Feed-forward Neural Networks by Means of Binary Decision Diagrams

Sebastian Bader
Department of Computer Science, University of Rostock, Germany
sebastian.bader@uni-rostock.de

Abstract

We discuss how to extract symbolic rules from a given binary threshold feed-forward network. The proposed decompositional approach is based on an internal representation using binary decision diagrams. They allow for an efficient composition of the intermediate results as well as for an easy integration of integrity constraints into the extraction. We also discuss some experimental results indicating a good performance of the approach.

1 Introduction

During the training process, neural networks acquire knowledge by generalising from raw data. Unfortunately, this learnt knowledge is hidden in the weights associated with the connections, and humans have no direct access to it. One goal of rule extraction is the generation of a human-readable description of the behaviour of the output units with respect to the input units. Usually, the result is described in the form of if-then rules, giving conditions that activate (or inactivate) a given output unit.

Rule extraction from connectionist systems is still an open research problem, even though a number of algorithms exist. For an overview of different approaches we refer to [Andrews et al., 1995] and [Jacobsson, 2005]. Extraction techniques can be divided into pedagogical and decompositional approaches. While the former treats the network as a black box, the latter decomposes the network, constructs rules describing the behaviour of the simpler parts, and then re-composes those results.

In [Bader et al., 2007], we proposed the CoOp algorithm, a decompositional approach for the extraction of propositional rules from feed-forward neural networks. Here, we discuss an extension of this approach. In this new extension, binary decision diagrams (BDDs) are used to store intermediate results, i.e., rules extracted from single units (perceptrons). This representation has three advantages:

1. results are stored in a very compact form,
2. intermediate results can easily be combined, and
3. integrity constraints can easily be incorporated.

After a presentation of all necessary concepts in Section 2, we discuss the proposed extension of the CoOp approach, i.e., the extraction of BDDs from simple perceptrons. Afterwards, we discuss how to compose the intermediate results and how to incorporate integrity constraints. In Section 6, first experimental results are presented, and further work is discussed in Section 7.

2 Preliminaries

In this section, we introduce the necessary concepts. After defining feed-forward neural networks and the rule extraction problem, we discuss binary decision diagrams.

Feed-forward artificial neural networks (ANNs), also called connectionist systems, consist of simple computational units (neurons) which are connected. The set of units U together with the connections C ⊆ U × U forms an acyclic directed graph. In this paper we concentrate on networks whose units apply the ±1-threshold function. Such a network can be represented as a 6-tuple ⟨U, U_inp, U_out, C, ω, θ⟩, where U_inp, U_out ⊆ U denote the input and output units of the network, i.e., the sources and sinks of the underlying graph. The functions ω : C → R and θ : U → R assign a weight to every connection and a threshold to every unit, respectively. Every unit u has an activation value act_u ∈ {−1, +1}, which is set from outside for input units and which is computed from the activation values of the predecessor units and the threshold θ(u) for all other units as follows:

    act_u = +1  if Σ_{c=(v,u)∈C} act_v · ω(c) ≥ θ(u)
            −1  otherwise

Figure 1 shows a simple network serving as running example throughout the paper.
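To make the activation rule concrete, here is a minimal Python sketch; the function name `activation` and the dictionary-based encoding of ω and θ are our own illustration, not part of the paper.

```python
# A minimal sketch of the +/-1-threshold activation rule. The dictionary-based
# network representation is a hypothetical choice made only for illustration.

def activation(unit, weights, thresholds, act):
    """Compute act_u for a non-input unit u.

    weights:    maps connections (v, u) to their weight omega((v, u))
    thresholds: maps units to their threshold theta(u)
    act:        maps the predecessor units of u to activations in {-1, +1}
    """
    net = sum(act[v] * w for (v, u), w in weights.items() if u == unit)
    return +1 if net >= thresholds[unit] else -1


# The perceptron of unit g from Figure 2: theta(g) = 4 and
# omega(c) = 1, omega(d) = 2, omega(e) = 3, omega(f) = 5.
weights = {('c', 'g'): 1, ('d', 'g'): 2, ('e', 'g'): 3, ('f', 'g'): 5}
thresholds = {'g': 4}
print(activation('g', weights, thresholds, {'c': -1, 'd': -1, 'e': +1, 'f': +1}))  # +1
```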
Because every unit can only be active (act_u = +1) or inactive (act_u = −1), we can associate a propositional variable u with it, which is assumed to be true if and only if the unit u is active; we use ū to denote the negation of u. Furthermore, we can characterise network inputs as interpretations I ⊆ U_inp of the propositional variables U_inp. We use act_u(I) to denote the state of unit u if all input units contained in I are active and all other input units are inactive. Using this notation, we can define the rule extraction problem as follows: The rule extraction problem for a given node u of a feed-forward threshold network ⟨U, U_inp, U_out, C, ω, θ⟩ is the construction of a propositional formula F over U_inp such that for all interpretations I we find

    act_u(I) = +1  if I |= F
               −1  otherwise

I.e., we are looking for a propositional formula stating necessary and sufficient conditions (in terms of the activation of input units) such that the unit u is active.

[Figure 1: A simple network serving as running example. The thresholds are shown within the nodes and the connection weights are shown on the right.]

[Figure 2: The perceptron corresponding to unit g: P_g = ⟨θ, I, ω⟩ with θ = 4, I = {c, d, e, f}, and ω(c) = 1, ω(d) = 2, ω(e) = 3, ω(f) = 5.]

Usually not all input combinations make sense in a given application domain, because some of them would correspond to invalid states of the world. We use the term valid inputs to denote the set of allowed input combinations. Even more important for the extraction is the fact that all training samples are taken from this subset. Therefore, the network learns to solve a task under the implicit conditions hidden in the selection of inputs. Integrity constraints are a way to make those conditions explicit during the extraction. An integrity constraint is a formula IC over U_inp describing the set of valid inputs V ⊆ P(U_inp) as follows: for all I ⊆ U_inp we find I |= IC if and only if I ∈ V. Using integrity constraints, we can reformulate the extraction problem as follows: The rule extraction problem for a given network N and a given integrity constraint IC is the construction of a propositional formula F over U_inp such that for all interpretations I with I |= IC we find

    act_u(I) = +1  if I |= F
               −1  otherwise

Networks can be decomposed into their basic building blocks, namely single units together with their incoming connections. These single units can be seen as simple sub-networks (perceptrons) consisting of a number of inputs and a single output unit, together with the corresponding weighted connections. To simplify the notation, we use P_p = ⟨θ, I, ω⟩ to denote the perceptron corresponding to the unit p together with its threshold θ, the set of predecessor units I and the weight function ω. Figure 2 shows the perceptron for the output unit g.

Binary decision diagrams (BDDs) are a data structure to represent propositional formulae in a very compact way and to manipulate them efficiently. A nice introduction can be found, for example, in [Andersen, 1999]. Intuitively, a BDD is a directed acyclic graph with a variable associated to every node, such that every node n ≠ 0, 1 has exactly two successors, called the high and the low branch of n. The nodes 0 and 1 are the sinks of the graph. We use ⟨≺, 0, 1, R, N⟩ to refer to a BDD with sinks 0 and 1, a set of nodes N and a root node with identifier R. Furthermore, we use ⟨i, v, h, l⟩ to denote the node with identifier i, with variable v = var(i), and with high and low branches pointing to the nodes with identifiers h and l, respectively.

Usually a BDD is assumed to be ordered and reduced. It is called ordered iff there exists a linear order ≺ on the variables such that the successors of a node are marked with variables that are bigger with respect to ≺. It is called reduced if no two nodes for the same variable have identical high and low branches, and if for no node the high and low branch coincide.

BDDs represent propositional formulae in if-then-else normal form. The formula corresponding to a given node is defined recursively as follows:

    pf(0) := ⊥        pf(1) := ⊤
    pf(i) := (var(i) ∧ pf(h)) ∨ (¬var(i) ∧ pf(l))

Figure 3 shows a simple BDD using a graphical representation as well as the underlying data structure and the corresponding logic formulae for every node.

[Figure 3: A simple BDD ⟨≺, 0, 1, R, N⟩ with R = 4 and N = {⟨2, b, 0, 1⟩, ⟨3, b, 1, 0⟩, ⟨4, a, 2, 3⟩}; the corresponding formulae are pf(2) = (b ∧ ⊥) ∨ (¬b ∧ ⊤) = ¬b, pf(3) = (b ∧ ⊤) ∨ (¬b ∧ ⊥) = b and pf(4) = (a ∧ ¬b) ∨ (¬a ∧ b). Nodes are annotated with their variables and their IDs. High branches are depicted as solid and low branches as dashed lines.]
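As a small illustration (our own sketch, not the implementation used in the paper), the BDD of Figure 3 can be stored as a map from identifiers to ⟨variable, high, low⟩ triples, and pf can be evaluated recursively:

```python
# A sketch of the BDD node representation <i, v, h, l> and of evaluating the
# formula pf(i) under an interpretation (the set of variables that are true).

# The BDD of Figure 3: sinks 0 and 1, root R = 4.
nodes = {2: ('b', 0, 1), 3: ('b', 1, 0), 4: ('a', 2, 3)}

def evaluate(i, interpretation, nodes):
    """Evaluate pf(i): pf(0) = False, pf(1) = True, and otherwise
    (var(i) and pf(h)) or (not var(i) and pf(l)), here implemented as the
    equivalent if-then-else shortcut on the known truth value of var(i)."""
    if i == 0:
        return False
    if i == 1:
        return True
    var, high, low = nodes[i]
    return evaluate(high if var in interpretation else low, interpretation, nodes)

# pf(4) = (a and not b) or (not a and b), i.e., exclusive or:
print(evaluate(4, {'a'}, nodes))       # True
print(evaluate(4, {'a', 'b'}, nodes))  # False
```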
3 From Perceptrons to Search Trees

In this section, we discuss an algorithm to extract a BDD from a single unit such that the BDD represents necessary and sufficient conditions on the inputs to turn the unit active. Following [Bader et al., 2007], we define input patterns I as subsets of the inputs of a given perceptron P_p = ⟨θ, I, ω⟩ which are assumed to be active. The inputs not contained in I can be either active or inactive. We define the corresponding minimal input i_min(I) as follows:

    i_min(I) = Σ_{a∈I} ω(a) − Σ_{a∉I} |ω(a)|

where the second sum ranges over the inputs of the perceptron not contained in I. The minimal input is thus computed by adding the contribution Σ_{a∈I} ω(a) of the fixed inputs and the minimal input caused by all other inputs. A perceptron is called positive if all weights are positive. For the following constructions, we assume the perceptrons to be positive. In Section 5, we discuss how to apply the extraction to arbitrary perceptrons.

The construction of BDDs below is based on the search trees described in [Bader et al., 2007]. These search trees contain a node for every possible input pattern. Children of a given node correspond to input patterns which contain exactly one symbol more, and all nodes are sorted with respect to their minimal inputs. If the minimal input of some node exceeds the threshold, that node is marked (i.e., the corresponding input pattern represents a sufficient condition to turn the perceptron active). The complete tree is pruned by removing all nodes for which no descendant is marked and all nodes which are descendants of marked nodes. The construction of a pruned tree is shown in Algorithm 1. Figure 4 shows the full search tree and the resulting pruned search tree on top of it.

[Figure 4: The pruned search tree for the perceptron P_g from above. The underlying full tree is depicted in grey using dashed lines. The nodes contain the newly added symbol and are annotated with the corresponding input pattern and the resulting minimal input. All nodes for which the minimal input exceeds the threshold of θ(g) = 4 are shown with grey background.]

Input: A positive perceptron P_p⁺.
Output: A pruned search tree.
1 Fix an order ≺ such that b ≺ c if ω(b) ≥ ω(c).
2 Create a root node for the empty input pattern.
3 Add a child labelled x for each input symbol x (sorted wrt. ≺).
4 foreach newly added node labelled y do
5   Add a new child c for every symbol z with y ≺ z (sorted wrt. ≺).
6   Label c with the corresponding pattern I.
7   Mark c if i_min(I) > θ(p).
8 Remove all descendants of marked nodes.
9 Remove all nodes for which no descendant is marked.
Algorithm 1: Constructing a pruned search tree.
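The following brute-force sketch (our own reference implementation, not Algorithm 1 itself, which avoids enumerating all subsets by pruning the search tree) illustrates i_min and the resulting minimal input patterns for the running example P_g; the function names are hypothetical.

```python
from itertools import combinations

# A brute-force reference (ours) for the minimal input patterns of a positive
# perceptron; Algorithm 1 computes the same patterns without enumerating all
# subsets, by pruning the search tree instead.

def i_min(pattern, weights):
    """i_min(I): weights of the inputs in I minus the absolute weights of all
    remaining inputs (which may still be inactive)."""
    return (sum(weights[a] for a in pattern)
            - sum(abs(w) for a, w in weights.items() if a not in pattern))

def minimal_input_patterns(theta, weights):
    # A pattern is sufficient iff its minimal input reaches the threshold,
    # matching the activation rule act_u = +1 iff the weighted sum >= theta.
    sufficient = [set(p)
                  for r in range(len(weights) + 1)
                  for p in combinations(weights, r)
                  if i_min(p, weights) >= theta]
    # keep only patterns without a sufficient proper subset
    return [p for p in sufficient if not any(q < p for q in sufficient)]

# The perceptron P_g from Figure 2 yields the patterns {e, f} and {c, d, f}.
print(minimal_input_patterns(4, {'c': 1, 'd': 2, 'e': 3, 'f': 5}))
```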
Exploiting the structure of these search trees, we can easily construct BDDs representing conditions to turn the perceptron active, i.e., the perceptron is active for exactly those input patterns which, understood as interpretations, turn the logic formula corresponding to the BDD true.

4 From Search Trees to BDDs

Before presenting an algorithm to construct BDDs from a given pruned search tree, we introduce some further notation. Every node in the search tree is represented as a pair ⟨I, C⟩ with I being the corresponding input pattern and C being the set of children. id(n) denotes a unique identifier for the node n (e.g., the corresponding input pattern, or some index); this identifier is also used as internal index for the BDD nodes. We assume id(n) := 0 if there is no node n. var(n) denotes the symbol which is added to the input pattern at node n.

The construction of a BDD for a given search tree is shown as Algorithm 2. This algorithm transforms a search tree into a BDD by traversing the tree in a left-depth-first manner. A node's high branch points to 1 if its minimal input exceeds the threshold. Otherwise, it points to its leftmost child, or to 0 if there is no child. The low branch points to the right sibling, or to 0 if there is none. The result for the perceptron P_g is shown in Figure 5.

[Figure 5: The BDD corresponding to the pruned search tree from Figure 4.]

Input: A search tree T for P_p = ⟨θ, I, ω⟩ wrt. ≺.
Output: A corresponding OBDD ⟨≺, 0, 1, R, N⟩.
1 if T is empty then
2   R = 0 and N = {}
3 else if T contains only the root node then
4   R = 1 and N = {}
5 else
6   R = id(r_l) for the leftmost child r_l of the root.
7   N = {}.
8   foreach leaf node n in T do
9     Add ⟨id(n), var(n), 1, l⟩ to N with l = id(r) for the right sibling r of n.
10  foreach node n = ⟨I, C⟩ in T with left sibling l do
11    Let ⟨id(l), var(l), h_l, l_l⟩ be the node for l.
12    Let ⟨id(l_l), var(n), h_l1, l_l1⟩ be the node for the leftmost child l_l of l.
13    if mci(l) − 2ω(l) + 2ω(n) > θ then
14      Add ⟨id(n), var(n), l_l1, l_n⟩ to N with l_n = id(r_n) for the right sibling r_n of n.
15  foreach other non-root node n = ⟨I, C⟩ do
16    Add a node ⟨id(n), var(n), id(c), l⟩ to N for the leftmost child c of n, where l = id(r_n) for the right sibling r_n of n and l = 0 if there is none.
Algorithm 2: Constructing a BDD.
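For the running example, the two minimal input patterns of P_g correspond to the conditions (e ∧ f) ∨ (c ∧ d ∧ f). The following brute-force check (our own sanity test, not part of the extraction procedure) confirms on all 16 possible inputs that these conditions are necessary and sufficient for g to become active:

```python
from itertools import product

# Our own sanity test (not from the paper): unit g of the running example is
# active exactly when the input contains one of the minimal patterns
# {e, f} or {c, d, f}, i.e., when (e and f) or (c and d and f) holds.

weights = {'c': 1, 'd': 2, 'e': 3, 'f': 5}
theta = 4

for values in product([-1, +1], repeat=4):
    act = dict(zip(['c', 'd', 'e', 'f'], values))
    active = sum(act[x] * w for x, w in weights.items()) >= theta
    formula = (act['e'] == +1 and act['f'] == +1) or \
              (act['c'] == +1 and act['d'] == +1 and act['f'] == +1)
    assert active == formula  # holds for all 16 input assignments
print("extracted conditions for g are necessary and sufficient")
```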
Please note that the BDD can be constructed without building the search tree first. The tree is used only to describe the underlying ideas; all conditions tested in Algorithm 2 can be checked while expanding the tree step by step. Looking a little closer at the constructed search tree, we find that some sub-trees have an identical internal structure, which is exemplified in Figure 6. If the condition tested in Line 13 of Algorithm 2 is fulfilled, two neighbouring sub-trees are structured identically. mci(n) denotes the minimum of all minimal inputs associated to nodes below n. Please note that mci(n) can be computed without expanding the sub-tree, by looking at the associated input pattern.

The mentioned structural equivalence can be exploited by using a shortcut into the already constructed BDD, thus preventing the expansion of an identical sub-tree. Figure 6 contains a number of those shortcuts, e.g., one from node {b} to the node {c, a}, because the children of {b} are annotated in the same way as the nodes below and to the right of {c, a}. Please note that this identity can be recognised without expanding the second sub-tree, i.e., the construction of the whole tree below {b} can be avoided.

The condition in Line 13 is fulfilled whenever the perceptron shows a so-called n-of-m behaviour, i.e., if there are m inputs of which n suffice to turn the perceptron active. In this case, there will be n equivalent sub-trees, which can be shortcut. As discussed in [Towell and Shavlik, 1993], this occurs quite frequently in trained neural networks.

[Figure 6: A larger BDD and its underlying search tree. The tree has been constructed for a perceptron with 5 inputs whose weights are all 1 and with threshold 0, i.e., this perceptron is active if 3 of the five inputs are +1. Please note that there are certain symmetries in the tree: all labels and minimal inputs of the sub-trees of node {b} coincide with those of the right sub-trees of {a} if a is substituted by b. This has been exploited by linking from {b} to {c, a}.]

5 Composition of Intermediate Results

In the previous section, we have been concerned with positive perceptrons only. But we can easily turn every perceptron into a positive one by multiplying negative weights by −1 and inverting the corresponding input symbols. By doing so, we can apply the algorithm to all output units of a given network and obtain a BDD describing necessary conditions, with respect to the predecessor units, that turn the output unit active. But some of the input symbols may have been inverted, i.e., we also need an algorithm to construct BDDs stating conditions which turn a perceptron inactive. Due to the symmetry of the threshold function, this algorithm is dual to Algorithms 1 and 2: by inverting the order and the inequalities we obtain an algorithm that constructs such a BDD.
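A minimal sketch of this normalisation step (function name and data layout are our own):

```python
# A sketch (ours) of the normalisation described above: negative weights are
# flipped and the affected input symbols are remembered as inverted. Reading
# an inverted symbol as the negation of the original input leaves the weighted
# sum unchanged, because act * w == (-act) * (-w) for act in {-1, +1}.

def make_positive(theta, weights):
    positive, inverted = {}, set()
    for symbol, w in weights.items():
        if w < 0:
            positive[symbol] = -w
            inverted.add(symbol)   # this input has to be read negated
        else:
            positive[symbol] = w
    return theta, positive, inverted

# Hypothetical example: a weight of -3 becomes 3 on the inverted input b.
print(make_positive(1, {'a': 2, 'b': -3}))  # (1, {'a': 2, 'b': 3}, {'b'})
```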
Once we have extracted the BDD for a given output unit, we can continue by substituting the nodes testing non-input units (i.e., nodes not corresponding to input units of the network) by their corresponding BDDs. A non-negated node is replaced by the BDD as constructed above, and negated nodes are replaced by the dual BDDs. As mentioned above, BDDs have been designed to allow for an efficient manipulation of logic formulae, and in fact it is straightforward to compose the intermediate results into an overall diagram by simply replacing the nodes. But this is not the best approach, because the resulting 'global' BDD would not be ordered any more. Instead, while expanding the BDD, we can keep it ordered (and reduced) as described in [Andersen, 1999]. After expanding all non-input nodes, we obtain a final BDD representing necessary and sufficient conditions on the network's inputs to turn a given output unit active.

Using BDDs as internal data structure has some further advantages. We can actually extract all output units into the same global BDD. Doing so automatically leads to a sharing of intermediate results, because common substructures are contained only once within this BDD. Figure 7 shows the final 'global' BDD for the network from Figure 1. Please note that the right node labelled b is used for both output units g and h.

[Figure 7: The global BDD for the network from Figure 1.]

Furthermore, we can integrate integrity constraints in a straightforward fashion. Instead of starting with an empty BDD, we extract the output nodes into a BDD representing the integrity constraints. To exemplify this, a network has been trained on the encoder-decoder task. It contains 8 input units, 8 output units and 3 hidden units, and it is trained to learn the identity mapping for all inputs in which exactly one unit is active, i.e., the network has to learn a compressed representation within the hidden layer. Applying the algorithm presented above yields an unnecessarily large result, shown in Figure 8 on the left. Using the integrity constraint that at most one input is active at a time yields the BDD shown on the right. 900 nodes have to be constructed (including all intermediate results while constructing the BDD) for the 'normal' BDD, but only 124 while using the integrity constraint. This shows the advantage of using integrity constraints right from the beginning of the extraction process. Usually they are used to refine the extraction result afterwards. This would be possible here as well by simply computing the conjunction of the 'normal' BDD with one representing the integrity constraint. But starting with the constraint avoids the construction of many intermediate nodes which would be removed afterwards.

[Figure 8: The result of extracting output unit 1 from an 8-3-8 encoder-decoder network. The BDD on the left is the result of the 'normal' extraction. On the right the constraint that at most one unit is active has been incorporated.]
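To give a feeling for the effect (our own illustration, not taken from the paper): under the max1 constraint only 9 of the 2^8 = 256 possible input vectors of the encoder-decoder network are valid, which suggests why the constrained extraction has to build far fewer nodes.

```python
from itertools import product

# Our own illustration of the max_1 integrity constraint used for the
# 8-3-8 encoder-decoder example: at most one of the 8 inputs is active.

def max_n(activations, n):
    """Integrity constraint: at most n of the given +/-1 activations are +1."""
    return sum(1 for a in activations if a == +1) <= n

valid = [a for a in product([-1, +1], repeat=8) if max_n(a, 1)]
print(len(valid))  # 9 of 256 input vectors satisfy the constraint

# Refining afterwards corresponds to conjoining an extracted formula F with
# the constraint IC; extracting into a BDD that already represents IC avoids
# building the parts of the diagram that this conjunction would remove again.
```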
6 Experimental Evaluation

To evaluate the approach, a Prolog implementation has been used to gather some statistics. The results are shown in Table 1. The table shows average numbers for different numbers of inputs: the size of the full search tree, the number of minimal input patterns, the size of the corresponding BDD and the number of BDD nodes per input pattern. All numbers have been collected from 100 random perceptrons per size. The extraction using the full search tree is not feasible due to the exponential growth. The number of input patterns is a conservative lower bound for the size of the pruned search tree, because those trees have at least one node per minimal input pattern. The results show that the use of BDDs proposed here yields a very compact representation. Even though the number of nodes in the BDD grows, the ratio (node/IP) between the size of the BDD and the number of minimal input patterns decreases.

    |I|       |T|       #IPs      |BDD|   node/IP
      1         2       0.58       2.58     4.448
      5        32       4.31       8.49     1.969
     10      1024      63.51      49.97     0.786
     15     32768    1270.45     313.25     0.246
     20   1048576   25681.70    1863.90     0.072

Table 1: The size of the full search tree (|T|), the number of minimal input patterns as a lower bound for the size of the pruned search tree (#IPs) and the size of the corresponding BDDs (|BDD|) for different numbers of inputs (|I|).

A second experiment has been performed to show the effect of using integrity constraints while extracting the BDDs. A network with 6 input, 4 hidden and 2 output units has been used for this experiment. The possible inputs have been constrained by a max_n integrity constraint for 0 ≤ n ≤ 6. The results are presented in Figure 9. For every n the experiment has been conducted for the same 100 randomised networks, and the following numbers have been collected: the size of the sub-BDD encoding the constraint, and the minimal, maximal and average size of the final BDD.

[Figure 9: Resulting BDD sizes of the extraction for different max_n integrity constraints. The bars indicate minimal, maximal and average sizes of the BDDs. The size of the sub-BDD for the constraint is shown in grey.]
The approach the numbers show the total number of internal nodes as presented here detects equivalent sub-BDDs for n- constructed for the BDD, i.e., including all necessary in- of-m patterns. But there are more cases for equivalent termediate nodes. For n = 1, i.e., the biggest restriction, sub-BDDs [Mayer-Eichberger, 2008]. Those have to be we obtain very small BDDs. The size of the BDD grows integrated into the extraction procedure. It would also up to n = 4 and decreases again for n > 4. From those be interesting to study the evolution of a network during observations we can conclude that the incorporation of the training process by repeatedly applying the extrac- integrity constraints into the extraction process can lead tion method and compare the results. to big savings in terms of nodes constructed for the final BDD. Without their use during the extraction, we would Acknowledgements The author is thankful for the have to construct the BDD corresponding to n = 6. This comments of Valentin Mayer-Eichberger and two anony- big BDD of ≈ 400 nodes would have to be refined with mous reviewers. respect to the constraints afterwards. There seem to be cases (e.g., for n = 4) where the use of integrity constrain References yields larger BDDs, but nonetheless, the final BDD does [Andersen, 1999] H. R. Andersen. An introduction to binary decision not have to be revised afterwards, and the difference is diagrams. Lecture Notes, 1999. [Andrews et al., 1995] R. Andrews, J. Diederich, and A. Tickle. A not too big. survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge–Based Systems, 8(6), 1995. 7 Conclusions and Future Work [Bader et al., 2007] S. Bader, S. Hölldobler, and V. Mayer-Eichberger. Extracting propositional rules from feed-forward neural networks — A novel approach for the extraction of propositional rules a new decompositional approach. In Proceedings of the 3rd Inter- from feed-forward networks of threshold units has been national Workshop on Neural-Symbolic Learning and Reasoning, NeSy’07, January 2007. presented. After decomposing the network into percep- [Jacobsson, 2005] H. Jacobsson. Rule extraction from recurrent neu- trons, binary decision diagrams representing precondi- ral networks: A taxonomy and review. Neural Computation, tions that activate or inactivate the perceptron have 17(6):1223–1263, 2005. been extracted. Those intermediate representations can [Krishnan et al., 1999] R. Krishnan, G. Sivakumar, and P. Bhat- be composed using the usual algorithms for BDDs, or tacharya. A search technique for rule extraction from trained neural networks. Non-Linear Anal., 20(3):273–280, 1999. they can be combined during their construction by ex- [Mayer-Eichberger, 2008] V. Mayer-Eichberger. Towards solving a sys- tracting one into the other. The latter approach does tem of pseudo boolean constraints with binary decision diagrams. also allow for an incorporation of integrity constraints Master’s thesis, Universidade Nova de Lisboa, SEP 2008. – already during the extraction of the intermediate re- [Towell and Shavlik, 1993] G. Towell and J. W. Shavlik. Extract- sults. As already mentioned in [Bader et al., 2007], the ing refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993. pruned search trees constructed above are related to the 27