<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Confidence Values and Compact Rule Extraction From Probabilistic Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simon Odense</string-name>
          <email>simon.odense@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artur d'Avila Garcez</string-name>
          <email>a.garcez@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City, University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>One of the key challenges of extracting rules from neural networks is accommodating the inherent flexibility of knowledge representation in neural networks within more rigid rule-based systems. Neural networks are often seen as having `soft constraints' as opposed to the `hard constraints' of rule-based systems. This distinction has been identified as one of the key differences between the connectionist approach and more traditional symbolic AI [1]. For deterministic networks the distinction becomes somewhat fuzzy, as every input/output relationship of a network can be encoded in a set of propositional rules with arbitrary precision; however, representing a neural network this way will be at the very least incomprehensible and in most cases intractable. Thus the issue of flexibility is tied to the issue of compactness of representation. For probabilistic networks, the issue of flexibility becomes even more of a challenge. Here we review results showing that a previous rule extraction method applied to Restricted Boltzmann Machines (RBMs) [5] can be improved by considering more compact rules called M of N rules. We also consider an example highlighting the advantage that these rules have over traditional conjunctive rules in minimizing the incidence of `false negatives'. Finally, we look at the notion of `confidence values', numeric values associated with a rule that are meant to represent our degree of belief in it, and show that a more refined notion of confidence may be helpful when considering rules extracted from RBMs and other probabilistic networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      <p>
        Tran and Garcez developed a rule extraction algorithm for RBMs (and DBNs
built from RBMs) which associates confidence values with extracted rules by
looking at the average weight of the literals in the rule [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The extraction
algorithm works by starting with the conjunction of every literal in the rule and
iteratively updating the confidence value and pruning literals with small enough
weights until equilibrium is achieved. The extracted rules are then composed
in a deep belief network along with an inference rule in order to calculate
confidence values for the output given a (partial) set of confidence values for the
input. When looking at RBMs in isolation, the extracted rules can be thought
of as biconditionals; however, the following example shows that for
RBMs a high confidence value does not necessarily correspond to a high
probability of the rule being true. First, when we say that an extracted rule has a
certain probability in the network, we mean that, given that the visible units
are uniformly distributed, the rule has a certain probability of being true in the
distribution of the network (if the visible distribution is instead defined by the network
itself, it can be shown that local rule extraction preserving the probabilities is
impossible under some basic conditions on the network). For example, if a
network has a probability distribution P, then for two hidden units h1 and h2
the disjunction h1 ∨ h2 is given the probability P(h1) + P(h2). Given a biconditional
h ↔ x1 ∧ ... ∧ xk ∧ ¬xk+1 ∧ ... ∧ ¬xn, where each xi represents a visible unit, we will
consider the probability of the biconditional being true in an RBM. For brevity we
will denote the antecedent of the biconditional as ANT; the probability of the
biconditional in an RBM is then P(h = 1, ANT) + P(h = 0, ¬ANT), where
the distribution on the set of literals in the antecedent is uniform (since they
represent visible units). We will consider an example of a rule extracted using
the algorithm mentioned above to show that the associated confidence does not
reflect the probability of the biconditional in the network. Define a network with
a single hidden neuron with k identical weights W and bias 0; the antecedent
of the extracted rule is the conjunction of all the literals and the confidence is
W. This means that the antecedent is satisfied only when all k literals are
satisfied. Using some algebra, the probability of the biconditional being true in the
network can be written as
      </p>
      <p>P(h = 1 | ANT) P(ANT) + Σ_{i=0}^{k−1} C(k, i) (1 − P(h = 1 | ANT_i)) P(ANT_i),</p>
      <p>where C(k, i) denotes the binomial coefficient, P(h = 1 | ANT) is the probability of the hidden neuron being on when
the antecedent is satisfied, and P(h = 1 | ANT_i) is the probability of the hidden
neuron being on when exactly i literals of the rule are satisfied; since all the
weights are the same, this does not depend on which specific literals are not
satisfied. Furthermore, we are assuming that the visible units are drawn from a
uniform distribution, so we have P(ANT) = P(ANT_i) = 1/2^k. This gives us</p>
      <p>(1/2^k) (σ(Wk) + Σ_{i=0}^{k−1} C(k, i) (1 − σ(iW))),</p>
      <p>where σ is the logistic sigmoid. Since W and k are arbitrary, we can take both to be as large as we like:
as W grows, the sum on the right tends to 1 − σ(0) = 0.5 and the left-hand
term σ(Wk) tends to 1, so as k → ∞ the whole expression goes to 0. This shows we can extract
rules with arbitrarily high confidence but arbitrarily low probability.</p>
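      <p>As a sanity check on this example, the closed-form expression can be compared against a brute-force enumeration over all visible configurations. The following sketch is our own illustration (not code from the papers cited), assuming a single hidden unit with k identical weights W, bias 0, and uniformly distributed visible units:</p>

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def biconditional_prob(W, k):
    """Brute-force P(h = 1, ANT) + P(h = 0, not ANT) for a single hidden
    unit with k identical weights W and bias 0, with the visible units
    drawn from the uniform distribution on {0, 1}^k."""
    total = 0.0
    for v in product([0, 1], repeat=k):
        p_h1 = sigmoid(W * sum(v))        # P(h = 1 | v) in an RBM
        ant = all(v)                      # antecedent: all k literals satisfied
        total += (p_h1 if ant else 1.0 - p_h1) / 2 ** k
    return total

def closed_form(W, k):
    """The expression derived above: (1/2^k)(sigma(Wk) + sum_i C(k,i)(1 - sigma(iW)))."""
    s = sigmoid(W * k)
    s += sum(math.comb(k, i) * (1.0 - sigmoid(i * W)) for i in range(k))
    return s / 2 ** k
```

      <p>For W = 5 and k = 12 the extracted confidence is 5, yet both functions give a probability of roughly 0.0004, illustrating the limit argument above.</p>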
      <p>
        The issue with this example is that the extracted rule gives many false
negatives: there are many cases where the rule should output 1 but
fails to because not every literal is satisfied. Rather than requiring every literal
in the antecedent to be satisfied in order to predict 1, we really only need one of
them. It is difficult to extract a single conjunctive rule which accurately
captures the behaviour of a probabilistic network, and by extracting many different
rules you lose compactness. In order to find a compact way to more faithfully
capture the behaviour of an RBM, we relax the condition that every literal in the
antecedent needs to be satisfied. This gives us the so-called M of N rules. In an
M of N rule the antecedent is satisfied if only M of the N literals are. In the
previous example the correct rule would be 1 of the set of literals. By first applying
the rule extraction algorithm and then selecting the minimum value
of M for which M · c (where c is the confidence given to the rule) is greater than
a predetermined threshold (in our case the minimum input to the hidden node),
we can convert the purely conjunctive rules into M of N ones. If we cannot find
an appropriate M, we add new literals until either there is an appropriate M or
we run out of literals. The rules produced by this algorithm perform much better
than the purely conjunctive rules when tested on a variety of small datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
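      <p>The two-stage procedure above can be sketched as follows. This is a simplified reading on our part: the exact pruning criterion and update rule of the original algorithm are replaced by assumed ones (confidence as the mean absolute weight of the retained literals, pruning against a fraction of it), and the literal-extension step is omitted:</p>

```python
def extract_rule(weights, theta=0.5):
    """Sketch of conjunctive rule extraction: confidence is the mean absolute
    weight of the retained literals, and literals whose weights fall well
    below it (by factor theta, an assumption) are pruned until equilibrium.
    weights maps each visible unit's name to its weight into the hidden unit."""
    kept = dict(weights)
    while True:
        confidence = sum(abs(w) for w in kept.values()) / len(kept)
        pruned = {x: w for x, w in kept.items() if abs(w) >= theta * confidence}
        if len(pruned) == len(kept):      # equilibrium: nothing left to prune
            break
        kept = pruned
    return kept, confidence

def to_m_of_n(literals, confidence, threshold):
    """Convert the conjunctive rule into an M of N rule: the minimum M for
    which M * confidence exceeds the threshold (assumed given, e.g. the
    minimum input needed to turn the hidden node on). Adding new literals
    when no appropriate M exists is omitted for brevity."""
    n = len(literals)
    for m in range(1, n + 1):
        if m * confidence > threshold:
            return m, n                   # at least M of the N literals
    return None
```

      <p>For example, weights {x1: 2.0, x2: 2.0, x3: 0.1} prune down to {x1, x2} with confidence 2.0, and a threshold of 3.0 then yields a 2 of 2 rule.</p>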
      <p>
        Assigning values to logical sentences to measure degrees of belief has been done
before. The most relevant examples for us are penalty logic [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Markov
logic networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; in both cases `weights' were given to logical sentences which
were then translated into weights of a network (Hopfield networks and Markov
random fields respectively). A similar philosophy was used to define confidence
values for deep belief networks by using the weights of the RBM in the previous
algorithm. The example above shows that the extracted confidence does
not accurately reflect the underlying probability structure of the constituent
RBMs, and that the extracted rules are perhaps better considered in the
feed-forward context rather than as biconditionals. Extending this algorithm to
M of N rules relieves some of the problems by loosening the requirements for the
rule to be satisfied, but it remains to be seen whether the confidence values
extracted with M of N rules more accurately reflect the probability structure of
the RBM. One possible avenue of research is, rather than looking simply at the
weights attached to the literals to derive confidence, to look at both the minimum
input to a node when the rule is satisfied and the maximum input to the node
when the rule is not satisfied. Ultimately, the M of N rule is a promising way of
representing knowledge in a neural network, with the possibility of imbuing it
with more flexibility by exploring various notions of confidence values.
      </p>
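      <p>One concrete (and entirely hypothetical) form this input-based confidence could take is the gap between those two quantities. A minimal sketch, restricted for simplicity to positive weights on the N chosen literals:</p>

```python
def margin_confidence(weights, m):
    """Hypothetical confidence for an M of N rule over literals with positive
    weights: the minimum input to the node when the rule is satisfied (the M
    weakest literals on) minus the maximum input when it is not satisfied
    (the M - 1 strongest literals on)."""
    ascending = sorted(weights)
    min_when_satisfied = sum(ascending[:m])
    max_when_unsatisfied = sum(ascending[len(weights) - (m - 1):])
    return min_when_satisfied - max_when_unsatisfied
```

      <p>A positive margin would indicate that the rule's satisfaction cleanly separates the node's inputs; a negative one, that the weights are too uneven for the M of N rule to track the node's behaviour.</p>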
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Smolensky</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On the Proper Treatment of Connectionism</article-title>
          .
          <source>Behavioral and Brain Sciences</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>23</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>S. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Logic Networks: Inserting and Extracting Knowledge from Deep Belief Networks</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <issue>99</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Markov Logic Networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>62</volume>
          (
          <issue>1</issue>
          ),
          <fpage>107</fpage>
          –
          <lpage>136</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Pinkas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Reasoning, Nonmonotonicity and Learning in Connectionist Networks that Capture Propositional Knowledge</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>77</volume>
          ,
          <fpage>203</fpage>
          –
          <lpage>247</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Smolensky</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Information Processing in Dynamical Systems: Foundations of Harmony Theory</article-title>
          . In:
          <source>Parallel Distributed Processing, Volume 1: Foundations</source>
          . MIT Press, Cambridge,
          <fpage>194</fpage>
          –
          <lpage>281</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Odense</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Extracting M of N Rules From Restricted Boltzmann Machines</article-title>
          .
          <source>The 26th International Conference on Artificial Neural Networks</source>
          , to appear (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>