                 PU-learning disjunctive concepts in ILP

                                                Hendrik Blockeel
                                  KU Leuven, Department of Computer Science




                                                    Abstract.
                     PU learning is a variant of semi-supervised learning where labels of only
                     one class are given. In this late-breaking paper, we present early results
                     on an ILP method that PU-learns disjunctive concepts. The problem is
                     motivated by some work on natural language learning, more specifically,
                     learning the meaning of n-grams from (sentence, context)-pairs, where
                     the context is described using first-order logic. An earlier solution for
                     this task could learn only conjunctive concepts. We show experimentally
                     that the novel method effectively learns disjunctive concepts from PU
                     data. We also discuss challenges and limitations.

1   Introduction
Many systems for inductive logic programming learn classifiers in the form of rule sets, and more specifically, sets
of definite Horn clauses. Such sets are typically learned in a supervised setting. An alternative learning setting
is PU-learning, which stands for “learning from positive and unlabeled examples”. PU-learners are given a set
of instances E+ that are known to be positive, and a set of instances E? for which it is not known whether they
are positive or negative.
   To our knowledge, PU-learning has not been considered in ILP. Learning from positives only has been considered [4], but is slightly different; it relates to PU-learning as supervised learning relates to semi-supervised learning.
   In this short paper, we first briefly discuss PU-learning. We next show an ILP problem where PU-learning is
relevant, and argue that it is beyond reach for standard PU-learners in two orthogonal ways: first, it is relational;
second, a crucial assumption typically made by PU-learners is violated. We present an ILP algorithm that
addresses both challenges and present some experimental results with it.

2   PU learning
PU-learners learn a binary classifier from two sets: a set of instances known to be positive, and a set of unlabeled
instances, each of which may be positive or negative. Elkan and Noto [3] made the following observation, which
has influenced recent work on PU-learning. Let + denote the event that an instance is positive, and pos the event
that an instance is labeled positive. Since only positives can be labeled positive, P(pos|x) = P(pos|x, +) · P(+|x).
   Under the assumption that the labeled positives are a completely random subset of all positives (that is,
each positive has the same probability k of being labeled), this reduces to P(pos|x) = k · P(+|x), and hence,
P(+|x) = P(pos|x)/k. Thus, if k is known, it suffices to learn a probabilistic classifier that predicts whether x is
labeled, which is a supervised learning task, and then divide the predicted probabilities by k.
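
   For concreteness, here is a minimal Python sketch of this reduction under the constant-k assumption (the standard approach, not the method proposed below). The function name and interface are illustrative; scikit-learn is assumed available, and k is estimated, following Elkan and Noto, as the average predicted label probability over held-out labeled positives.

# Minimal sketch of the Elkan-Noto reduction (assumes scikit-learn).
# X: NumPy feature matrix; s: 0/1 array, 1 iff the instance is *labeled* positive.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def pu_probabilities(X, s):
    X_tr, X_val, s_tr, s_val = train_test_split(X, s, test_size=0.2, random_state=0)
    # Step 1: supervised model for P(pos|x), i.e. "is x labeled?"
    g = LogisticRegression().fit(X_tr, s_tr)
    # Step 2: k = P(pos|+), estimated as the mean of g over held-out labeled positives.
    k = g.predict_proba(X_val[s_val == 1])[:, 1].mean()
    # Step 3: P(+|x) = P(pos|x)/k, clipped to [0, 1].
    return np.clip(g.predict_proba(X)[:, 1] / k, 0.0, 1.0)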
   Most work on PU-learning implicitly or explicitly makes the “constant k” assumption. Under this assumption,
PU-learning reduces to estimating the constant k and running a standard supervised learner. In the next section,
we show a PU-learning problem where this assumption is clearly violated.

Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In: Nicolas Lachiche, Christel Vrain (eds.): Late Breaking Papers of ILP 2017, Orléans, France, September 4-6, 2017,
published at http://ceur-ws.org




3     Relational grounded language learning
The motivating context for this work is the following task, “relational grounded language learning” (RGLL) [1,
2]. Given a set of sentence/context pairs, where context is a Datalog description of the context in which sentence
was used, learn the circumstances under which an n-gram can be used (we call this the “meaning” of the n-gram).
For instance, in one dataset [2], derived from Zitnick et al.’s “clip art” dataset [6], the sentence “a cat sits under
the picnic table” accompanies the Datalog model

[object(o1),large(o1,c_table),color(o1,c_yellow),
object(o2),human(o2,c_boy),pose(o2,c_pose2),expression(o2,c_surprised),
object(o3),human(o3,c_girl),pose(o3,c_pose0),expression(o3,c_surprised),
object(o4),animal(o4,c_cat),size(o4,c_small),
object(o5),clothing(o5,c_hat),style(o5,c_pirate),act(o2, c_wear, o5),
object(o6),food(o6,c_pizza)]

which mentions many things, including a cat and a table. Creating, for each model whose corresponding sentence
contains “cat”, a clause of the form “cat ← model”, and computing the least general generalization under
theta-subsumption (“lgg” for short) [5] of these clauses might yield, e.g.,

                                  cat ← object(A),animal(A,c_cat),size(A,c_small)

which summarizes the conditions under which the word “cat” can apparently be used (or: the “meaning” of
“cat”).
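
   As an illustration of the core operation, the following is a minimal sketch of Plotkin’s lgg, restricted to ground, function-free clause bodies. Literals are represented as hypothetical Python tuples such as ("animal", "o4", "c_cat"); each distinct pair of differing terms is anti-unified to one shared variable, so object identifiers that co-occur across literals yield the same variable (the A in the “cat” clause above).

# Sketch of Plotkin's lgg for ground, function-free clause bodies (sets of
# literal tuples). Identical terms are kept; distinct pairs of terms are
# anti-unified to a variable, one shared variable per pair.
def lgg_bodies(body1, body2):
    varmap = {}                    # (term1, term2) -> shared variable name
    def anti_unify(t1, t2):
        if t1 == t2:
            return t1              # identical terms are kept as-is
        if (t1, t2) not in varmap:
            varmap[(t1, t2)] = "V%d" % len(varmap)
        return varmap[(t1, t2)]
    result = set()
    for l1 in body1:
        for l2 in body2:
            if l1[0] == l2[0] and len(l1) == len(l2):   # same predicate, same arity
                result.add((l1[0],) + tuple(anti_unify(a, b)
                                            for a, b in zip(l1[1:], l2[1:])))
    return result

Note that the result can contain up to |body1| · |body2| literals, which is one reason lgg-based learners apply clause reduction in practice.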
   Earlier work [1] computed the meaning of an n-gram as the lgg of all the contexts where that n-gram was used,
possibly allowing some exceptions [2] in order to handle noise. The disadvantage of that approach is that the lgg
is a single clause, hence, a conjunctive concept. Such an approach cannot handle words with multiple meanings,
which can be used in quite different contexts. For instance, we expected the word “dog” to have only one meaning
within the used dataset, just like “cat”, but its occurrence in the bigram “hot dog” made it occur frequently in
cases where the food, but not the animal, was present. As a result, the mentioned approaches could not learn the
meaning of “dog”: they overgeneralized over dogs and hot dogs.
   While ILP is perfectly capable of learning disjunctive concepts, RGLL faces the problem that a word does not
have to be used when it is relevant. When the word “cat” occurs, a cat should be present (this is an assumption
of RGLL), but when a cat is present, the sentence may not mention it. If we consider the occurrence of a word
as a label for the context, it is clear that this is a PU-learning problem.
   RGLL does not, however, fulfill the “constant k” condition. This condition would imply that the probability
that a dog is mentioned, given that it appears, equals the probability that a hotdog is mentioned (using the
spelling “hot dog”, rather than “hotdog”), given that it appears. There is no reason to believe these are equal:
some objects are more likely to be mentioned than others, there is the effect of alternative spellings or synonyms,
etc. This is corroborated by the Zitnick dataset: e.g., the probability of having “dog” in the sentence, given that
the context contains a dog or a hotdog, is 0.315 and 0.249, respectively.
   For this reason, we developed a PU rule learner that does not assume a constant k. Instead, it assumes that
the target concept is a disjunctive concept, and that the probability of being labeled is constant within each
disjunct, but may differ from one disjunct to another.
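
   To make the relaxed assumption concrete, the following hypothetical generator produces PU data in that spirit: each disjunct i has its own label rate ki, and negatives are never labeled. The function and the uniform instance mix are illustrative only; the 0.315/0.249 rates are the dog / hot-dog figures reported above.

# Illustrative PU data generator with per-disjunct label rates k_i.
# Representation is schematic: an instance is a (concept, labeled) pair,
# where concept None stands for a negative instance.
import random

def generate_pu_data(n, rates):
    data = []
    for _ in range(n):
        concept = random.choice(list(rates) + [None])   # None = negative instance
        labeled = concept is not None and random.random() < rates[concept]
        data.append((concept, labeled))
    return data

# e.g. generate_pu_data(10000, {"dog": 0.315, "hot dog": 0.249})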

4     PU-learning disjunctive concepts
4.1   PU-learning one rule
We first consider the case of PU-learning a conjunctive concept, which can be expressed using a single Horn
clause. If there is no noise in the data, the clause must cover all positive examples. While it may cover unlabeled
examples, the clause c that covers the fewest unlabeled examples is optimal: P(pos|c) is maximal in this case,
and therefore, so is P(+|c) = P(pos|c)/k.
   The earlier approaches did not explicitly consider RGLL as a PU-learning task because, when learning
conjunctive concepts from noiseless data, this is not necessary: the problem corresponds to computing an lgg. To
allow for noisy data (sentences mentioning things not present in the context), [2] computed the lgg of “almost
all” cases, rather than strictly all cases, by randomly generalizing the current clause with new positive instances
until almost all positives are covered.




    The PU-learning viewpoint gives a more principled method. Consider a sequence of clauses c1, c2, ..., cn,
where c1 is a random positive instance and ci = lgg(ci−1, ei), with ei a random positive instance not covered by
ci−1; we call this a generalization path.
    Assume that the clauses c1, c2, ..., cj cover subsets of the target clause t, but cj+1 does not. Then P(pos|ci) = k
for i ≤ j, and P(pos|ci) < k for i > j, since labels occur entirely randomly within the disjunct, and there are
no labeled examples outside the disjunct. Thus, even though we do not know k, we could in principle find k by
constructing a curve that shows P(pos|ci) in terms of i, and looking at which part of the curve is flat. The optimal
ci is the one just before the curve starts decreasing.
    In practice, we cannot construct P(pos|ci); we can at best estimate it from a dataset as the proportion of
instances covered by ci that are labeled. This proportion randomly deviates from P(pos|ci). Furthermore, it tends
to be biased positively because ci is the result of repeated lggs of labeled instances. One way to compensate for
the deviation is to construct a confidence interval for P(pos|ci) based on the proportion, and take the lower
bound of this confidence interval; call this value q(ci). The confidence interval becomes narrower with increasing
i, since the coverage of the clause increases with i. Assuming P(pos|ci) is constant for i = 1, ..., j, the expected
value of q(ci) reaches a maximum for i = j. From that point of view, choosing the ci that maximizes q(ci) is a
good heuristic.
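
   The paper leaves the choice of interval open; as one illustration (an assumption, not the paper’s formula), q(ci) could be instantiated with the Wilson score lower bound on the labeled proportion:

# Illustrative q(c_i): Wilson score lower bound (z = 1.96 for ~95%) on the
# proportion of labeled instances among the `covered` instances of c_i.
from math import sqrt

def wilson_lower(labeled, covered, z=1.96):
    if covered == 0:
        return 0.0
    p = labeled / covered
    centre = p + z * z / (2 * covered)
    margin = z * sqrt(p * (1 - p) / covered + z * z / (4 * covered * covered))
    return max(0.0, (centre - margin) / (1 + z * z / covered))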
    This leads to the following algorithm for PU-learning one rule (PULOR):

procedure PULOR:
   c1 = random (not-yet-covered) labeled instance
   i = 1
   stop = false
   while not stop:
       repeat
            e = random labeled instance not covered by earlier rules
            ci+1 = lgg(ci, e)
       until ci+1 ≠ ci or repeat-loop was executed 5 times
       if ci+1 = ci then stop = true else i++
   return arg max{ q(cj) : 1 ≤ j ≤ i }
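
   The pseudocode can be rendered as runnable Python under some interface assumptions not fixed by the paper: labeled instances are represented as clauses themselves (e.g. literal sets), lgg(c1, c2) generalizes two clauses (the sketch above is a candidate), covers(c, x) tests theta-subsumption, and q(c) is a confidence lower bound as discussed.

# Runnable rendering of PULOR. `labeled` contains only the labeled instances
# not covered by earlier rules (the caller, e.g. PULSE, ensures this).
import random

def pulor(labeled, q, lgg, covers, max_tries=5):
    path = [random.choice(labeled)]        # c_1: a random labeled instance
    while True:
        for _ in range(max_tries):
            e = random.choice(labeled)
            c_next = lgg(path[-1], e)
            if c_next != path[-1]:
                break
        if c_next == path[-1]:             # no proper generalization found: stop
            break
        path.append(c_next)
    return max(path, key=q)                # the c_j maximizing q(c_j)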

   Our actual implementation of PULOR uses as its heuristic not the q-function as defined above, but a monotonic
function of it: the (lower bound of the) odds ratio of labeled versus unlabeled examples in the rule’s
coverage, as estimated using a sample of the unlabeled set. Figure 1 shows some typical curves. It confirms
that using the lower bound of a confidence interval is advantageous; if the ratio itself is used, there is less
generalization.
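
   The exact bound used in the implementation is not spelled out; one plausible instantiation (again an assumption on our part) uses the delta-method standard error sqrt(1/a + 1/b) of the log-odds, with a Haldane-Anscombe correction for zero counts:

# Illustrative lower bound on the odds of labeled vs. unlabeled examples in a
# rule's coverage (assumed formula, not taken from the paper). n_unlabeled
# should already be rescaled if only a sample of the unlabeled set is used.
from math import exp, log, sqrt

def odds_lower_bound(n_labeled, n_unlabeled, z=1.96):
    a, b = n_labeled + 0.5, n_unlabeled + 0.5   # Haldane-Anscombe correction
    return exp(log(a / b) - z * sqrt(1 / a + 1 / b))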

Fig. 1. Observed odds ratio (dotted) and the corresponding lower bound (solid), versus number of generalization steps,
for some generalization paths


4.2      PU-learning a set of rules
Rule sets are often learned using a covering approach: learn one rule, mark the examples covered by that rule
as covered, then repeat the process on the still uncovered examples. A single rule is typically learned by refining
the rule until its accuracy is close to 1. In the PU-learning setting, the observed accuracy (the proportion of
instances covered by the rule that have a positive label) differs from the real accuracy (the proportion of covered
instances that are truly positive): if the true accuracy is a, the observed accuracy is ka.
   If k were constant over all rules, one could try to determine it and then run a standard rule learner where the
accuracy aimed for is k rather than 1. Since we assume different rules may have a different k value, k must be




estimated per rule, but that can only be done once we know the rule. Here, therefore, the learning of the rule
and the estimation of k must go hand in hand.
    Assume the target is a disjunctive concept consisting of d disjuncts, where each disjunct can be accurately
expressed as a single rule. Each such rule can be learned using PULOR.
    Consider a generalization path that starts with a positively labeled example, which is part of disjunct i. Let
us denote the “local” k value for the i-th disjunct as ki. As long as the clauses cover a subset of that disjunct
i, the expected proportion of labeled examples in the clause’s coverage is ki. As the clauses become increasingly
general, they start covering negative instances and/or instances from other disjuncts. Clearly, covering negative
instances decreases the observed accuracy, while covering instances from another disjunct j may increase or
decrease it, depending on whether kj > ki or not.
    Under these circumstances, it is not obvious that the q-maximization criterion that PULOR uses is optimal.
When a generalization path starts in a disjunct with low ki and at some point starts covering instances from
a disjunct with high kj without also covering too many negatives, the maximal q-value may be obtained for a
clause that covers instances from multiple disjuncts. If rules are learned in the right order, from high to low k
value, this will not occur.
    Our algorithm for PU-learning a rule set, PULSE, uses the standard covering approach: it runs PULOR
several times, marking covered examples as such before proceeding to learn the next rule. It does not attempt
to optimize the order in which rules are learned (high to low k value).
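
   The covering loop itself is short; the following sketch reuses the pulor function above, with rule ordering, noise handling, and stopping refinements omitted:

# PULSE as a standard covering loop. Each rule returned by pulor generalizes
# its seed instance and hence covers it, so `remaining` shrinks every round.
def pulse(labeled, q, lgg, covers):
    rules, remaining = [], list(labeled)
    while remaining:
        rule = pulor(remaining, q, lgg, covers)
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules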

5   Experiments
We have run PULSE on a dataset of 10000 sentence/context pairs, trying to learn rules for specific words or
n-grams. Table 1 presents a sample of the outcomes. Whether a result is correct is somewhat subjective, as there
is often no agreed-upon ground truth; the best we can do is interpret the result. We cannot compare to other
systems, because no other systems solve the considered task; for conjunctive concepts, however, PULSE often
finds results similar to those of the earlier proposed ReGLL [2].
   Note that PULSE sometimes finds sets of clauses where one clause is a specialization of another. This is
a consequence of the fact that one can never be sure what the maximal reachable accuracy is (because k is
unknown).
n-gram      meaning                                                           associated with
tree        2/3: ob(A),lg(A,tree),spec(A,B),col(A,green),sz(A,big)           tree
apple tree  3/3: ob(A),lg(A,tree),spec(A,apples),col(A,green),sz(A,big)      apple tree
table       2/3: ob(A),lg(A,table),col(A,yellow)                             table
            1/3: ob(A),lg(A,table),col(A,yellow) ;
                 ob(A),food(A,B),ob(C),lg(C,table),col(C,yellow)             table and food
dog         1/3: ob(A),animal(A,dog),col(A,brown),sz(A,small);               dog
                 ob(A),lg(A,table),col(A,yellow),ob(B),food(B,hot dog);      hotdog
                 (3rd clause is a specialization of the second)
hot dog     1/3: ob(A),col(A,B),ob(C),food(C,hot dog)                        hotdog

Table 1. Some “meanings” learned using the proposed algorithm (predicate names and constants abbreviated for
conciseness: ob=object, col=color, lg=large object, spec=species, sz=size, hu=human, ex=expression)




6   Conclusions
We have proposed a general algorithm for PU-learning sets of Horn clauses. The algorithm can trivially be
extended to general rule learning. In contrast to existing work, it does not rely on the assumption that all positives
are equally likely to be labeled. Applied to a language learning dataset, the method achieves interesting results: it
is the first to learn disjunctive concepts in this setting, and it makes an earlier method for handling noise obsolete.
The proposed algorithm is highly preliminary: possible improvements include learning the rules in the optimal
order, and estimating a rule’s label proportion in a more principled manner. A more thorough experimental
study is also warranted.

Acknowledgements
Research conducted while H.B. visited Université Jean Monnet, Saint-Etienne. H.B. thanks Elisa Fromont and
Leonor Becerra-Bonache for making this stay possible, motivating this work, and for many interesting discussions.




References
1. L. Becerra-Bonache, H. Blockeel, M. Galván, and F. Jacquenet. A first-order-logic based model for grounded language
   learning. In Proc. of IDA-2015, pages 49–60. LNCS 9385, Springer, 2015.
2. L. Becerra-Bonache, H. Blockeel, M. Galván, and F. Jacquenet. Relational grounded language learning. In Proc. of
   ECAI 2016, pages 1764–1765, 2016.
3. C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proc. of KDD-2008, pages
   213–220, 2008.
4. S. Muggleton. Learning from positive data. In Selected Papers from the 6th International Workshop on Inductive
   Logic Programming, ILP ’96, pages 358–376. Springer-Verlag, 1997.
5. G. D. Plotkin. A note on inductive generalization. In Machine Intelligence 5, pages 153–163. Edinburgh University
   Press, 1970.
6. C. L. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In Proceedings of the International
   Conference on Computer Vision and Pattern Recognition, pages 3009–3016. IEEE, 2013.



