<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Confidence Values and Compact Rule Extraction From Probabilistic Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simon Odense</string-name>
          <email>simon.odense@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artur d'Avila Garcez</string-name>
          <email>a.garcez@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City, University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>One of the key challenges of extracting rules from neural networks is accommodating the inherent flexibility of knowledge representation in neural networks within more rigid rule-based systems. Neural networks are often seen as having `soft constraints' as opposed to the `hard constraints' of rule-based systems. This distinction has been identified as one of the key differences between the connectionist approach and more traditional symbolic AI [1]. For deterministic networks the distinction becomes somewhat fuzzy, as every input/output relationship of a network can be encoded in a set of propositional rules with arbitrary precision; however, representing a neural network this way will be at the very least incomprehensible and in most cases intractable. Thus the issue of flexibility is tied to the issue of compactness of representation. For probabilistic networks, the issue of flexibility becomes even more of a challenge. Here we review results showing that a previous rule extraction method applied to Restricted Boltzmann Machines (RBMs) [5] can be improved by considering more compact rules called M of N rules. We also consider an example highlighting the advantage that these rules have over traditional conjunctive rules in minimizing the incidence of `false negatives'. Finally, we look at the notion of `confidence values', numeric values associated with a rule that are meant to represent our degree of belief in it, and show that a more refined notion of confidence may be helpful when considering rules extracted from RBMs and other probabilistic networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      <p>
        Tran and Garcez developed a rule extraction algorithm for RBMs (and DBNs
built from RBMs) which associates confidence values with extracted rules by
looking at the average weight of the literals in the rule [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The extraction
algorithm works by starting with the conjunction of every literal in the rule and
iteratively updating the confidence value and pruning literals with small enough
weights until equilibrium is achieved. The extracted rules are then composed
in a deep belief network along with an inference rule in order to calculate
confidence values for the output given a (partial) set of confidence values for the
input. When looking at RBMs in isolation, the extracted rules can be thought
of as biconditionals; however, the following example shows that for
RBMs a high confidence value does not necessarily correspond to a high
probability of the rule being true. First, when we say that an extracted rule has a
certain probability in the network, we mean that, given that the visible units
are uniformly distributed, the rule has a certain probability of being true in the
distribution of the network (if the visible distribution is instead defined by the network
itself, it can be shown that local rule extraction preserving the probabilities is
impossible under some basic conditions on the network). For example, if a
network has a probability distribution P, then for two hidden units h1 and h2
the disjunction h1 ∨ h2 is given the probability P(h1) + P(h2). Given a biconditional
h ↔ x1 ∧ ... ∧ xk ∧ ¬xk+1 ∧ ... ∧ ¬xn, where each xi represents a visible unit, we will
consider the probability of the biconditional being true in an RBM. For brevity we
will denote the antecedent of the biconditional as ANT; the probability of the
biconditional in an RBM is then P(h = 1, ANT) + P(h = 0, ¬ANT), where
the distribution on the set of literals in the antecedent is uniform (since they
represent visible units). We will consider an example of a rule extracted using
the algorithm mentioned above to show that the associated confidence does not
reflect the probability of the biconditional in the network. Define a network with
a single hidden neuron with k identical weights W and bias 0; the antecedent
of the extracted rule is the conjunction of all the literals and the confidence is
W. This means that the antecedent is satisfied only when all k literals are
satisfied. Using some algebra, the probability of the biconditional being true in the
network can be written as
      </p>
      <p>P(h = 1 | ANT) P(ANT) + Σ_{i=0}^{k−1} C(k, i) (1 − P(h = 1 | ANT_i)) P(ANT_i),</p>
      <p>where C(k, i) denotes the binomial coefficient, P(h = 1 | ANT) is the probability of the hidden neuron being on when
the antecedent is satisfied, and P(h = 1 | ANT_i) is the probability of the hidden
neuron being on when exactly i literals of the rule are satisfied; since all the
weights are the same, this does not depend on which specific literals are not
satisfied. Furthermore, we are assuming that the visible units are drawn from a
uniform distribution, so we have P(ANT) = P(ANT_i) = 1/2^k. This gives us</p>
      <p>(1/2^k) (σ(Wk) + Σ_{i=0}^{k−1} C(k, i) (1 − σ(iW))),</p>
      <p>where σ is the logistic sigmoid. Since W and k are arbitrary, we can take both to be as large as we like:
as W grows, the sum on the right tends to 1 − σ(0) = 0.5 and the left-hand
term σ(Wk) tends to 1, so as k → ∞ the whole expression goes to 0. This shows we can extract
rules with arbitrarily high confidence but arbitrarily low probability.</p>
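      <p>As a sanity check on this example, the closed-form expression can be compared against a brute-force enumeration over all visible configurations. The following sketch is our own illustration (not code from the papers cited), assuming a single hidden unit with k identical weights W, bias 0, and uniformly distributed visible units:</p>

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def biconditional_prob(W, k):
    """Brute-force P(h = 1, ANT) + P(h = 0, not ANT) for a single hidden
    unit with k identical weights W and bias 0, with the visible units
    drawn from the uniform distribution on {0, 1}^k."""
    total = 0.0
    for v in product([0, 1], repeat=k):
        p_h1 = sigmoid(W * sum(v))        # P(h = 1 | v) in an RBM
        ant = all(v)                      # antecedent: all k literals satisfied
        total += (p_h1 if ant else 1.0 - p_h1) / 2 ** k
    return total

def closed_form(W, k):
    """The expression derived above: (1/2^k)(sigma(Wk) + sum_i C(k,i)(1 - sigma(iW)))."""
    s = sigmoid(W * k)
    s += sum(math.comb(k, i) * (1.0 - sigmoid(i * W)) for i in range(k))
    return s / 2 ** k
```

      <p>For W = 5 and k = 12 the extracted confidence is 5, yet both functions give a probability of roughly 0.0004, illustrating the limit argument above.</p>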
      <p>
        The issue with this example is that the extracted rule gives many false
negatives: there are many cases where the rule should output 1 but
fails to because not every literal is satisfied. Rather than requiring every literal
in the antecedent to be satisfied in order to predict 1, we really only need one of
them. It is difficult to extract a single conjunctive rule which accurately
captures the behaviour of a probabilistic network, and by extracting many different
rules you lose compactness. In order to find a compact way to more faithfully
capture the behaviour of an RBM, we relax the condition that every literal in the
antecedent needs to be satisfied. This gives us the so-called M of N rules. In an
M of N rule the antecedent is satisfied if only M of the N literals are. In the
previous example the correct rule would be 1 of the set of literals. By first applying
the rule extraction algorithm and then selecting the minimum value
of M for which M · c (where c is the confidence given to the rule) is greater than
a predetermined threshold (in our case the minimum input to the hidden node),
we can convert the purely conjunctive rules into M of N ones. If we cannot find
an appropriate M, we add new literals until either there is an appropriate M or
we run out of literals. The rules produced by this algorithm perform much better
than the purely conjunctive rules when tested on a variety of small datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
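      <p>The two-stage procedure above can be sketched as follows. This is a simplified reading on our part: the exact pruning criterion and update rule of the original algorithm are replaced by assumed ones (confidence as the mean absolute weight of the retained literals, pruning against a fraction of it), and the literal-extension step is omitted:</p>

```python
def extract_rule(weights, theta=0.5):
    """Sketch of conjunctive rule extraction: confidence is the mean absolute
    weight of the retained literals, and literals whose weights fall well
    below it (by factor theta, an assumption) are pruned until equilibrium.
    weights maps each visible unit's name to its weight into the hidden unit."""
    kept = dict(weights)
    while True:
        confidence = sum(abs(w) for w in kept.values()) / len(kept)
        pruned = {x: w for x, w in kept.items() if abs(w) >= theta * confidence}
        if len(pruned) == len(kept):      # equilibrium: nothing left to prune
            break
        kept = pruned
    return kept, confidence

def to_m_of_n(literals, confidence, threshold):
    """Convert the conjunctive rule into an M of N rule: the minimum M for
    which M * confidence exceeds the threshold (assumed given, e.g. the
    minimum input needed to turn the hidden node on). Adding new literals
    when no appropriate M exists is omitted for brevity."""
    n = len(literals)
    for m in range(1, n + 1):
        if m * confidence > threshold:
            return m, n                   # at least M of the N literals
    return None
```

      <p>For example, weights {x1: 2.0, x2: 2.0, x3: 0.1} prune down to {x1, x2} with confidence 2.0, and a threshold of 3.0 then yields a 2 of 2 rule.</p>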
      <p>
        Assigning values to logical sentences to measure degrees of belief has been done
before. The most relevant examples for us are penalty logic [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Markov
logic networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; in both cases `weights' were given to logical sentences which
were then translated into weights of a network (Hopfield networks and Markov
random fields respectively). A similar philosophy was used to define confidence
values for deep belief networks by using the weights of the RBM in the previous
algorithm. The example above shows that the extracted confidence does
not accurately reflect the underlying probability structure of the constituent
RBMs, and that the extracted rules are perhaps better considered in the
feed-forward context rather than as biconditionals. Extending this algorithm to
M of N rules relieves some of the problems by loosening the requirements for the
rule to be satisfied, but it remains to be seen whether the confidence values
extracted with M of N rules more accurately reflect the probability structure of
the RBM. One possible avenue of research is, rather than looking simply at the
weights attached to the literals to derive confidence, to look at both the minimum
input to a node when the rule is satisfied and the maximum input to the node
when the rule is not satisfied. Ultimately, the M of N rule is a promising way of
representing knowledge in a neural network, with the possibility of imbuing it
with more flexibility by exploring various notions of confidence values.
      </p>
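      <p>One concrete (and entirely hypothetical) form this input-based confidence could take is the gap between those two quantities. A minimal sketch, restricted for simplicity to positive weights on the N chosen literals:</p>

```python
def margin_confidence(weights, m):
    """Hypothetical confidence for an M of N rule over literals with positive
    weights: the minimum input to the node when the rule is satisfied (the M
    weakest literals on) minus the maximum input when it is not satisfied
    (the M - 1 strongest literals on)."""
    ascending = sorted(weights)
    min_when_satisfied = sum(ascending[:m])
    max_when_unsatisfied = sum(ascending[len(weights) - (m - 1):])
    return min_when_satisfied - max_when_unsatisfied
```

      <p>A positive margin would indicate that the rule's satisfaction cleanly separates the node's inputs; a negative one, that the weights are too uneven for the M of N rule to track the node's behaviour.</p>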
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Smolensky</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On the Proper Treatment of Connectionism</article-title>
          .
          <source>Behavioral and Brain Sciences</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>23</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>S. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Logic Networks: Inserting and Extracting Knowledge from Deep Belief Networks</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <issue>99</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Markov Logic Networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>62</volume>
          (
          <issue>1</issue>
          ),
          <fpage>107</fpage>
          –
          <lpage>136</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Pinkas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Reasoning, Nonmonotonicity and Learning in Connectionist Networks that Capture Propositional Knowledge</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>77</volume>
          ,
          <fpage>203</fpage>
          –
          <lpage>247</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Smolensky</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Information Processing in Dynamical Systems: Foundations of Harmony Theory</article-title>
          . In:
          <source>Parallel Distributed Processing, Volume 1: Foundations</source>
          . MIT Press, Cambridge,
          <fpage>194</fpage>
          –
          <lpage>281</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Odense</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Avila Garcez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Extracting M of N Rules From Restricted Boltzmann Machines</article-title>
          .
          <source>The 26th International Conference on Artificial Neural Networks</source>
          , to appear (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>