Confidence Values and Compact Rule Extraction From Probabilistic Neural Networks

Simon Odense and Artur d'Avila Garcez
City, University of London, UK
simon.odense@city.ac.uk, a.garcez@city.ac.uk

Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.

One of the key challenges of extracting rules from neural networks is accommodating the inherent flexibility of knowledge representation in neural networks within more rigid rule-based systems. Neural networks are often seen as having 'soft constraints' as opposed to the 'hard constraints' of rule-based systems. This distinction has been identified as one of the key differences between the connectionist approach and more traditional symbolic AI [1]. For deterministic networks the distinction becomes somewhat fuzzy, since every input/output relationship of a network can be encoded in a set of propositional rules with arbitrary precision; however, representing a neural network this way is at best incomprehensible and in most cases intractable. Thus the issue of flexibility is tied to the issue of compactness of representation. For probabilistic networks, flexibility becomes even more of a challenge. Here we review results showing that a previous rule extraction method applied to Restricted Boltzmann Machines (RBMs) [5] can be improved by considering more compact rules called M of N rules. We also consider an example highlighting the advantage these rules have over traditional conjunctive rules in minimising the incidence of 'false negatives'. Finally, we look at the notion of 'confidence values', numeric values associated with a rule that represent our degree of belief in it, and show that a more refined notion of confidence may be helpful when considering rules extracted from RBMs and other probabilistic networks.

Tran and Garcez developed a rule extraction algorithm for RBMs (and for DBNs built from RBMs) which associates confidence values with extracted rules by looking at the average weight of the literals in the rule [2]. The extraction algorithm starts with the conjunction of every literal, then iteratively updates the confidence value and prunes literals with sufficiently small weights until equilibrium is reached. The extracted rules are then composed in a deep belief network, along with an inference rule, in order to calculate confidence values for the output given a (partial) set of confidence values for the input. When looking at an RBM in isolation, the extracted rules can be thought of as biconditionals; however, the following example shows that a high confidence value does not necessarily correspond to a high probability of the rule being true. First, when we say that an extracted rule has a certain probability in the network, we mean that, given that the visible units are uniformly distributed, the rule has a certain probability of being true in the distribution of the network (if the visible distribution is instead the one defined by the network itself, it can be shown that local rule extraction preserving the probabilities is impossible under some basic conditions on the network). For example, if a network has a probability distribution P, then for two hidden units h1 and h2, the sentence h1 ∨ h2 is given the probability P(h1) + P(h2) − P(h1 ∧ h2). Given a biconditional h ↔ x_1 ∧ ... ∧ x_k ∧ ¬x_{k+1} ∧ ... ∧ ¬x_n, where each x_i represents a visible unit, we will consider the probability of this biconditional being true in an RBM.
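As an illustration of the pruning loop just described (and not the exact procedure of [2]), the following Python sketch extracts a conjunctive rule with a confidence value from the weight vector of a single hidden unit; the function name and the pruning fraction threshold_ratio are assumptions made for the example.

```python
import numpy as np

def extract_conjunctive_rule(weights, threshold_ratio=0.5, max_iters=100):
    """Illustrative confidence-based extraction for one hidden unit.

    Start from the conjunction of all literals (x_j if w_j > 0, otherwise
    the negation of x_j), take the confidence to be the average absolute
    weight of the literals currently kept, and repeatedly prune literals
    whose weight is small relative to the confidence until nothing changes.
    """
    kept = list(range(len(weights)))
    confidence = float(np.mean(np.abs(weights)))
    for _ in range(max_iters):
        new_kept = [j for j in kept
                    if abs(weights[j]) >= threshold_ratio * confidence]
        if not new_kept:
            break
        new_confidence = float(np.mean([abs(weights[j]) for j in new_kept]))
        if new_kept == kept and np.isclose(new_confidence, confidence):
            break  # equilibrium: the rule and its confidence stopped changing
        kept, confidence = new_kept, new_confidence
    literals = [(j, weights[j] > 0) for j in kept]  # (visible index, sign of literal)
    return literals, confidence
```

In the single-hidden-unit example analysed below, where all k weights equal W, such a procedure keeps every literal and assigns the rule confidence W.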
For brevity we will denote the antecedent of the biconditional as ANT. The probability of this biconditional in an RBM is then

P(h = 1, ANT) + P(h = 0, ¬ANT),

where the distribution over the literals in the antecedent is uniform (since they represent visible units). We now consider an example of a rule extracted using the algorithm mentioned above to show that the associated confidence does not reflect the probability of the biconditional in the network. Define a network with a single hidden neuron with k identical weights W and bias 0. The antecedent of the extracted rule is the conjunction of all k literals and the confidence is W. This means that the antecedent is satisfied only when all k literals are satisfied. Using some algebra, the probability of the biconditional being true in the network can be written as

P(h = 1 \mid ANT)\, P(ANT) + \sum_{i=0}^{k-1} \binom{k}{i} \bigl(1 - P(h = 1 \mid ANT_i)\bigr) P(ANT_i),

where P(h = 1 | ANT) is the probability of the hidden neuron being on when the antecedent is satisfied and P(h = 1 | ANT_i) is the probability of the hidden neuron being on when exactly i literals of the rule are satisfied; since all the weights are equal, this does not depend on which specific literals are unsatisfied. Furthermore, we are assuming that the visible units are drawn from a uniform distribution, so P(ANT) = P(ANT_i) = 1/2^k. This gives us

\frac{1}{2^k} \left( \sigma(Wk) + \sum_{i=0}^{k-1} \binom{k}{i} \bigl(1 - \sigma(iW)\bigr) \right).

Since W and k are arbitrary, we can take them to be as large as we like. As W → ∞, the sum tends to 1 − σ(0) = 0.5 (only the i = 0 term survives) while the left-hand term σ(Wk) tends to 1; because the whole expression is scaled by 1/2^k, it therefore tends to 0 as k → ∞. This shows that we can extract rules with arbitrarily high confidence but arbitrarily low probability.

The issue with this example is that the extracted rule gives many false negatives: there are many cases where the rule should output 1 but fails to because not every literal is satisfied. Rather than requiring every literal in the antecedent to be satisfied in order to predict 1, we really only need one of them. It is difficult to extract a single conjunctive rule which accurately captures the behaviour of a probabilistic network, and by extracting many different rules one loses compactness. In order to find a compact way to capture the behaviour of an RBM more faithfully, we relax the condition that every literal in the antecedent must be satisfied. This gives us the so-called M of N rules. In an M of N rule the antecedent is satisfied if at least M of the N literals are. In the previous example the correct rule would be 1 of the set of literals. By first applying the rule extraction algorithm and then selecting M as the minimum value for which M · c (where c is the confidence given to the rule) exceeds a predetermined threshold (in our case the minimum input to the hidden node), we can convert the purely conjunctive rules into M of N ones. If we cannot find an appropriate M, we add new literals until either an appropriate M exists or we run out of literals. The rules produced by this algorithm perform much better than the purely conjunctive rules when tested on a variety of small datasets [6].

Assigning values to logical sentences to measure degrees of belief has been done before. The most relevant examples for us are penalty logic [4] and Markov logic networks [3]; in both cases 'weights' are attached to logical sentences and then translated into weights of a network (Hopfield networks and Markov random fields, respectively).
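The closed-form expression above, and the corresponding behaviour of a 1 of N rule, can be checked numerically. The following sketch assumes the same setting as the example (a single hidden unit with k equal weights W, zero bias, and a uniform visible distribution); the M of N probability function is our own illustration rather than part of the extraction algorithm.

```python
from math import comb, exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def biconditional_probability(W, k):
    """P(h <-> all k literals hold) for one hidden unit with k equal weights W
    and zero bias, under a uniform visible distribution:
    (1/2^k) * (sigma(kW) + sum_{i=0}^{k-1} C(k, i) * (1 - sigma(iW)))."""
    total = sigmoid(k * W) + sum(comb(k, i) * (1.0 - sigmoid(i * W))
                                 for i in range(k))
    return total / 2 ** k

def m_of_n_probability(W, k, M):
    """P(h <-> at least M of the k literals hold) in the same setting."""
    agree = sum(comb(k, i) * (sigmoid(i * W) if i >= M else 1.0 - sigmoid(i * W))
                for i in range(k + 1))
    return agree / 2 ** k

for W, k in [(5.0, 5), (10.0, 10), (20.0, 20)]:
    # The conjunctive rule has confidence W yet a vanishing probability of
    # being true, while the 1 of N rule agrees with the unit almost always.
    print(W, k, round(biconditional_probability(W, k), 4),
          round(m_of_n_probability(W, k, 1), 4))
```

For instance, for W = 5 and k = 5 the biconditional already has probability below 0.05, while the 1 of N rule agrees with the hidden unit more than 98% of the time.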
A similar philosophy was used to define confidence values for deep belief networks, using the weights of the constituent RBMs in the algorithm described above. The example above shows that the extracted confidence does not accurately reflect the underlying probability structure of the constituent RBMs, and that the extracted rules are perhaps better considered in a feed-forward context rather than as biconditionals. Extending the algorithm to M of N rules alleviates some of the problems by loosening the requirements for a rule to be satisfied, but it remains to be seen whether the confidence values extracted with M of N rules more accurately reflect the probability structure of the RBM. One possible avenue of research is, rather than looking simply at the weights attached to the literals to derive confidence, to look at both the minimum input to a node when the rule is satisfied and the maximum input to the node when the rule is not satisfied. Ultimately, M of N rules are a promising way of representing knowledge in a neural network, with the possibility of further flexibility obtained by exploring various notions of confidence value.

References

[1] Smolensky, P.: On the Proper Treatment of Connectionism. Behavioral and Brain Sciences 11(1), 1–23 (1988)
[2] Tran, S.N., d'Avila Garcez, A.: Deep Logic Networks: Inserting and Extracting Knowledge from Deep Belief Networks. IEEE Transactions on Neural Networks and Learning Systems PP(99), 1–13 (2016)
[3] Richardson, M., Domingos, P.: Markov Logic Networks. Machine Learning 62(1), 107–136 (2006)
[4] Pinkas, G.: Reasoning, Nonmonotonicity and Learning in Connectionist Networks that Capture Propositional Knowledge. Artificial Intelligence 77, 203–247 (1995)
[5] Smolensky, P.: Information Processing in Dynamical Systems: Foundations of Harmony Theory. In: Parallel Distributed Processing, Volume 1: Foundations, pp. 194–281. MIT Press, Cambridge (1986)
[6] Odense, S., d'Avila Garcez, A.: Extracting M of N Rules From Restricted Boltzmann Machines. In: The 26th International Conference on Artificial Neural Networks, to appear (2017)
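As a brief illustration of the alternative confidence measure suggested above, the following sketch computes, by exhaustive enumeration over visible states, the minimum net input to a hidden unit when an M of N rule is satisfied and the maximum net input when it is not; this is only a sketch of the idea rather than a method evaluated here, and the function name and return convention are our own.

```python
import numpy as np

def satisfied_unsatisfied_inputs(weights, bias, literals, M):
    """For a single hidden unit, return the minimum net input over visible
    states that satisfy an M of N rule and the maximum net input over states
    that do not (exhaustive enumeration, so only feasible for small N).

    `literals` is a list of (visible index, positive?) pairs giving the N
    literals of the rule."""
    n = len(weights)
    min_sat, max_unsat = float("inf"), float("-inf")
    for state in range(2 ** n):
        v = np.array([(state >> j) & 1 for j in range(n)])
        net = float(np.dot(weights, v)) + bias
        hits = sum(1 for j, positive in literals if v[j] == int(positive))
        if hits >= M:
            min_sat = min(min_sat, net)
        else:
            max_unsat = max(max_unsat, net)
    # A confidence value could then be derived from both quantities,
    # for example from the gap min_sat - max_unsat.
    return min_sat, max_unsat
```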