<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PU-learning disjunctive concepts in ILP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hendrik Blockeel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven, Department of Computer Science</institution>
        </aff>
      </contrib-group>
      <fpage>6</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>PU learning is a variant of semi-supervised learning where labels of only one class are given. In this late-breaking paper, we present early results on an ILP method that PU-learns disjunctive concepts. The problem is motivated by work on natural language learning, more specifically, learning the meaning of n-grams from (sentence, context)-pairs, where the context is described using first-order logic. An earlier solution for this task could learn only conjunctive concepts. We show experimentally that the novel method effectively learns disjunctive concepts from PU data. We also discuss challenges and limitations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        PU-learners learn a binary classifier from two sets: a set of instances known to be positive, and a set of unlabeled instances, each of which may be positive or negative. Elkan and Noto [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] made the following observation, which has influenced recent work on PU-learning. Let + denote the event that an instance is positive, and pos the event that an instance is labeled positive. Since only positives can be labeled positive, P(pos|x) = P(pos|x, +) P(+|x).
      </p>
      <p>Under the assumption that the labeled positives are a completely random subset of all positives (that is, each positive has the same probability k of being labeled), this reduces to P(pos|x) = k P(+|x), and hence, P(+|x) = P(pos|x)/k. Thus, if k is known, it suffices to learn a probabilistic classifier that predicts whether x is labeled, which is a supervised learning task, and then divide the predicted probabilities by k.</p>
      <p>Most work on PU-learning implicitly or explicitly makes the “constant k” assumption. Under this assumption, PU-learning reduces to estimating the constant k and running a standard supervised learner. In the next section, we show a PU-learning problem where this assumption is clearly violated.</p>
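      <p>The Elkan-Noto reduction can be sketched as follows (a minimal illustration, assuming k has already been estimated; the function name is ours):</p>

```python
# Elkan-Noto adjustment: turn P(pos | x), the probability that x is
# labeled, into P(+ | x), the probability that x is positive, under the
# "constant k" assumption (every positive is labeled with probability k).

def adjust_pu_probability(p_pos_given_x, k):
    """Estimate P(+ | x) as P(pos | x) / k, clipped to at most 1."""
    return min(p_pos_given_x / k, 1.0)

# Example: a classifier trained to predict the *label* outputs
# P(pos | x) = 0.3; with k = 0.5, the estimated P(+ | x) is 0.6.
```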
    </sec>
    <sec id="sec-2">
      <title>Relational grounded language learning</title>
      <p>
        The motivating context for this work is the following task, “relational grounded language learning” (RGLL) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Given a set of sentence/context pairs, where context is a Datalog description of the context in which sentence was used, learn the circumstances under which an n-gram can be used (we call this the “meaning” of the n-gram). For instance, in one dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], derived from Zitnick et al.'s “clip art” dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the sentence “a cat sits under the picnic table” accompanies the Datalog model
[object(o1),large(o1,c_table),color(o1,c_yellow),
object(o2),human(o2,c_boy),pose(o2,c_pose2),expression(o2,c_surprised),
object(o3),human(o3,c_girl),pose(o3,c_pose0),expression(o3,c_surprised),
object(o4),animal(o4,c_cat),size(o4,c_small),
object(o5),clothing(o5,c_hat),style(o5,c_pirate),act(o2,c_wear,o5),
object(o6),food(o6,c_pizza)]
which mentions many things, including a cat and a table. Creating, for each model where “cat” occurs in the corresponding sentence, a clause of the form “cat ← model”, and computing the least general generalization under theta-subsumption (briefly, “lgg”) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of these clauses, might yield, e.g.,
      </p>
      <p>cat ← object(A), animal(A, c_cat), size(A, c_small)
which summarizes the conditions under which the word “cat” can apparently be used (or: the “meaning” of “cat”).</p>
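      <p>The lgg computation above can be sketched for function-free atoms (a minimal illustration of Plotkin's construction under theta-subsumption; the tuple representation and variable naming are our own):</p>

```python
# Least general generalization (lgg) of two clauses, for function-free
# atoms represented as tuples like ("animal", "o4", "c_cat"). Two equal
# terms generalize to themselves; two distinct terms generalize to a
# fresh variable that is *reused* for the same pair of terms -- this
# sharing is what makes the result least general.

def lgg_term(t1, t2, var_map):
    if t1 == t2:
        return t1
    if (t1, t2) not in var_map:
        var_map[(t1, t2)] = "V" + str(len(var_map))
    return var_map[(t1, t2)]

def lgg_clause(c1, c2):
    var_map = {}
    out = []
    for a1 in c1:
        for a2 in c2:
            # only atoms with the same predicate and arity have an lgg
            if a1[0] == a2[0] and len(a1) == len(a2):
                atom = (a1[0],) + tuple(
                    lgg_term(x, y, var_map) for x, y in zip(a1[1:], a2[1:]))
                if atom not in out:
                    out.append(atom)
    return out

# Two toy "cat" contexts: their lgg keeps what they share about the cat
# and drops the pizza, which occurs in only one of them.
cat1 = [("object", "o1"), ("animal", "o1", "c_cat"), ("size", "o1", "c_small")]
cat2 = [("object", "p7"), ("animal", "p7", "c_cat"), ("size", "p7", "c_small"),
        ("food", "p9", "c_pizza")]
# lgg_clause(cat1, cat2): object(V0), animal(V0, c_cat), size(V0, c_small)
```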
      <p>
        Earlier work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] computed the meaning of an n-gram as the lgg of all the contexts where that n-gram was used, possibly allowing some exceptions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in order to handle noise. The disadvantage of that approach is that the lgg is a single clause, hence a conjunctive concept. Such an approach cannot handle words with multiple meanings, which can be used in quite different contexts. For instance, we expected the word “dog” to have only one meaning within the used dataset, just like “cat”, but its occurrence in the bigram “hot dog” made it occur frequently in cases where the food, but not the animal, was present. As a result, the mentioned approaches could not learn the meaning of “dog”; they overgeneralized over dogs and hotdogs.
      </p>
      <p>While ILP is perfectly capable of learning disjunctive concepts, RGLL faces the problem that a word does not
have to be used when it is relevant. When the word \cat" occurs, a cat should be present (this is an assumption
of RGLL), but when a cat is present, the sentence may not mention it. If we consider the occurrence of a word
as a label for the context, it is clear that this is a PU-learning problem.</p>
      <p>RGLL does not, however, fulfill the “constant k” condition. This condition would imply that the probability that a dog is mentioned, given that it appears, equals the probability that a hotdog is mentioned (using the spelling “hot dog”, rather than “hotdog”), given that it appears. There is no reason to believe these are equal: some objects are more likely to be mentioned than others, there is the effect of alternative spellings or synonyms, etc. This is corroborated by the Zitnick dataset: e.g., the probability of having “dog” in the sentence, given that the context contains a dog or a hotdog, is 0.315 and 0.249, respectively.</p>
      <p>For this reason, we developed a PU rule learner that does not assume a constant k. Instead, it assumes that the target concept is a disjunctive concept, and that the probability of being labeled is constant within each disjunct, but may differ from one disjunct to another.</p>
    </sec>
    <sec id="sec-3">
      <title>PU-learning disjunctive concepts</title>
      <p><bold>PU-learning one rule.</bold> We first consider the case of PU-learning a conjunctive concept, which can be expressed using a single Horn clause. If there is no noise in the data, the clause must cover all positive examples. While it may cover unlabeled examples, the clause c that covers the fewest unlabeled examples is optimal: P(pos|c) is maximal in this case, and therefore, so is P(+|c) = P(pos|c)/k.</p>
      <p>
        The earlier approaches did not explicitly consider RGLL as a PU-learning task because, when learning conjunctive concepts from noiseless data, this is not necessary: the problem corresponds to computing an lgg. To allow for noisy data (sentences mentioning things not present in the context), [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] computed the lgg of “almost all” cases, rather than strictly all cases, by randomly generalizing the current clause with new positive instances until almost all positives are covered.
      </p>
      <p>The PU-learning viewpoint gives a more principled method. Consider a sequence of clauses c1, c2, ..., cn, where c1 is a random positive instance and ci = lgg(ci−1, ei), with ei a random positive instance not covered by ci−1; we call this a generalization path.</p>
      <p>Assume that the clauses c1, c2, ..., cj cover subsets of the target clause t, but cj+1 does not. Then P(pos|ci) = k for i ≤ j, and P(pos|ci) &lt; k for i &gt; j, since labels occur entirely randomly within the disjunct, and there are no labeled examples outside the disjunct. Thus, even though we do not know k, we could in principle find k by constructing a curve that shows P(pos|ci) in terms of i, and looking at what part of the curve is flat. The optimal ci is the one just before the curve starts decreasing.</p>
      <p>In practice, we cannot construct P(pos|ci); we can at best estimate it from a dataset as the proportion of instances covered by ci that are labeled. This proportion randomly deviates from P(pos|ci). Furthermore, it tends to be biased positively because ci is the result of repeated lggs of labeled instances. One way to compensate for the deviation is to construct a confidence interval for P(pos|ci) based on the proportion, and take the lower bound of this confidence interval; call this value q(ci). The confidence interval becomes narrower with increasing i, since the coverage of the clause increases with i. Assuming P(pos|ci) is constant for i = 1, ..., j, the expected value of q(ci) reaches a maximum for i = j. From that point of view, choosing the ci that maximizes q(ci) is a good heuristic.</p>
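      <p>As an illustration of such a lower bound, the following sketch uses the Wilson score interval (the text does not fix a particular interval, so this choice is an assumption):</p>

```python
import math

# Lower bound of a confidence interval for the proportion of labeled
# instances among those covered by a clause -- one possible concrete
# form of the q-heuristic. The Wilson score interval is used here.

def wilson_lower(labeled, covered, z=1.96):
    """Lower bound of the Wilson score interval for labeled/covered
    (z = 1.96 corresponds to a 95% interval)."""
    if covered == 0:
        return 0.0
    p = labeled / covered
    denom = 1 + z * z / covered
    center = p + z * z / (2 * covered)
    margin = z * math.sqrt(p * (1 - p) / covered + z * z / (4 * covered ** 2))
    return (center - margin) / denom

# At the same observed proportion, more coverage gives a tighter interval,
# hence a higher lower bound: wilson_lower(80, 100) exceeds wilson_lower(8, 10).
```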
      <p>This leads to the following algorithm for PU-learning one rule (PULOR):
procedure PULOR:
    c1 = random (not-yet-covered) labeled instance
    i = 1
    stop = false
    while not stop:
        repeat
            e = random labeled instance not covered by earlier rules
            ci+1 = lgg(ci, e)
        until ci+1 ≠ ci or the repeat-loop was executed 5 times
        if ci+1 = ci then stop = true else i++
    return arg max over cj, 1 ≤ j ≤ i, of q(cj)</p>
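      <p>The procedure can be rendered schematically in Python as follows (a sketch, not the actual implementation; lgg, covers and q are assumed to be supplied by the caller):</p>

```python
import random

def pulor(labeled, lgg, covers, q, max_tries=5):
    """Sketch of PULOR: grow a generalization path from a random labeled
    instance, then return the clause on the path that maximizes q."""
    path = [random.choice(labeled)]      # c1 = random labeled instance
    while True:
        current = path[-1]
        uncovered = [e for e in labeled if not covers(current, e)]
        new = None
        for _ in range(max_tries):       # the 5-attempt repeat loop
            if not uncovered:
                break
            e = random.choice(uncovered)
            g = lgg(current, e)
            if g != current:             # a proper generalization
                new = g
                break
        if new is None:                  # no progress possible: stop
            break
        path.append(new)
    return max(path, key=q)              # arg max over c1..ci of q
```

As a toy instantiation, clauses and examples can both be modeled as sets of ground facts, with lgg as set intersection and coverage as the subset test.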
      <p>Our actual implementation of PULOR uses as heuristic not the q-function as defined above, but a monotonic function of it, namely the (lower bound of the) odds ratio of labeled versus unlabeled examples in the rule's coverage, as estimated using a sample of the unlabeled set. Figure 1 shows some typical curves. It confirms that using the lower bound of a confidence interval is advantageous; if the ratio itself is used, there is less generalization.</p>
      <p>Rule sets are often learned using a covering approach: learn one rule, mark the examples covered by that rule as covered, then repeat the process on the still uncovered examples. A single rule is typically learned by refining the rule until its accuracy is close to 1. In the PU-learning setting, the observed accuracy (the proportion of instances covered by the rule that have a positive label) differs from the real accuracy (the proportion of covered instances that are truly positive): if the true accuracy is a, the observed accuracy is ka.</p>
      <p>If k were constant over all rules, one could try to determine it and then run a standard rule learner where the
accuracy aimed for is k rather than 1. Since we assume different rules may have a different k value, k must be
estimated per rule, but that can only be done once we know the rule. Here, therefore, the learning of the rule
and the estimation of k must go hand in hand.</p>
      <p>Assume the target is a disjunctive concept consisting of d disjuncts, where each disjunct can be accurately
expressed as a single rule. Each such rule can be learned using PULOR.</p>
      <p>Consider a generalization path that starts with a positively labeled example, which is part of disjunct i. Let
us denote the “local” k value for the i-th disjunct as ki. As long as the clauses cover a subset of that disjunct
i, the expected proportion of labeled examples in the clause's coverage is ki. As the clauses become increasingly
general, they start covering negative instances and/or instances from other disjuncts. Clearly, covering negative
instances decreases the observed accuracy, while covering instances from another disjunct j may increase or
decrease it; this depends on whether kj &gt; ki or not.</p>
      <p>Under these circumstances, it is not obvious that the q-maximization criterion that PULOR uses is optimal.
When a generalization path starts in a disjunct with low ki and at some point starts covering instances from
a disjunct with high kj without also covering too many negatives, the maximal q-value may be obtained for a
clause that covers instances from multiple disjuncts. If rules are learned in the right order, from high to low k
value, this will not occur.</p>
      <p>Our algorithm for PU-learning a rule set, PULSE, uses the standard covering approach: it runs PULOR
several times, marking covered examples as such before proceeding to learn the next rule. It does not attempt
to optimize the order in which rules are learned (high to low k value).</p>
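      <p>The covering loop of PULSE can be sketched as follows (schematic only; learn_one_rule stands in for PULOR and covers for the coverage test, both assumed supplied):</p>

```python
def pulse(labeled, unlabeled, learn_one_rule, covers):
    """Sketch of PULSE: the standard covering loop around a PU rule
    learner. Rules are learned until no labeled examples remain or no
    further progress is made; rule order is not optimized."""
    rules = []
    remaining = list(labeled)
    while remaining:
        rule = learn_one_rule(remaining, unlabeled)
        if rule is None:
            break
        still_uncovered = [e for e in remaining if not covers(rule, e)]
        if len(still_uncovered) == len(remaining):
            break                 # the rule covers nothing new; stop
        rules.append(rule)
        remaining = still_uncovered
    return rules
```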
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        We have run PULSE on a dataset of 10000 sentence/context pairs, trying to learn rules for specific words or n-grams. Table 1 presents a sample from the outcomes. Whether a result is correct is somewhat subjective; there is often no agreed-upon ground truth, so the best we can do is interpret the result. We cannot compare to other systems, because no other systems solve the considered task; for conjunctive concepts, however, PULSE often finds similar results to the earlier proposed ReGLL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Note that PULSE sometimes finds sets of clauses where one clause is a specialization of another. This is
a consequence of the fact that one can never be sure what the maximal reachable accuracy is (because k is
unknown).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have proposed a general algorithm for PU-learning sets of Horn clauses. The algorithm can trivially be
extended to general rule learning. Contrary to existing work, it does not rely on the assumption that all positives
are equally likely to be labeled. Applied to a language learning dataset, the method achieves interesting results: it
is the first to learn disjunctive concepts in this setting, and makes an earlier method for handling noise obsolete.
The proposed algorithm is highly preliminary: possible improvements include learning the rules in the optimal
order, and estimating a rule's label proportion in a more principled manner. A more thorough experimental
study is also warranted.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Research conducted while H.B. visited Universite Jean Monnet, Saint-Etienne. H.B. thanks Elisa Fromont and
Leonor Becerra-Bonache for making this stay possible, motivating this work, and for many interesting discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. L.
          <string-name>
            <surname>Becerra-Bonache</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Blockeel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Galvan</surname>
            , and
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Jacquenet</surname>
          </string-name>
          .
          <article-title>A first-order-logic based model for grounded language learning</article-title>
          .
          <source>In Proc. of IDA-2015</source>
          , pages
          <fpage>49</fpage>
          –
          <lpage>60</lpage>
          . LNCS 9385, Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. L.
          <string-name>
            <surname>Becerra-Bonache</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Blockeel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Galvan</surname>
            , and
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Jacquenet</surname>
          </string-name>
          .
          <article-title>Relational grounded language learning</article-title>
          .
          <source>In Proc. of ECAI</source>
          <year>2016</year>
          , pages
          <fpage>1764</fpage>
          –
          <lpage>1765</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Noto</surname>
          </string-name>
          .
          <article-title>Learning classifiers from only positive and unlabeled data</article-title>
          .
          <source>In Proc. of KDD-2008</source>
          , pages
          <fpage>213</fpage>
          –
          <lpage>220</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Muggleton</surname>
          </string-name>
          .
          <article-title>Learning from positive data</article-title>
          .
          <source>In Selected Papers from the 6th International Workshop on Inductive Logic Programming</source>
          ,
          <source>ILP '96</source>
          , pages
          <fpage>358</fpage>
          –
          <lpage>376</lpage>
          . Springer-Verlag,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Plotkin</surname>
          </string-name>
          .
          <article-title>A note on inductive generalization</article-title>
          . In
          <source>Machine Intelligence</source>
          <volume>5</volume>
          , pages
          <fpage>153</fpage>
          –
          <lpage>163</lpage>
          . Edinburgh University Press,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          .
          <article-title>Bringing semantics into focus using visual abstraction</article-title>
          .
          <source>In Proceedings of the International Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>3009</fpage>
          –
          <lpage>3016</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>