<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pointwise Confidence Scores with Provable Guarantees</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nivasini Ananthakrishnan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shai Ben-David</string-name>
          <email>bendavid.shai@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tosca Lechner</string-name>
          <email>tlechner@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Quantifying the probability of a machine learning prediction being correct on a given test point enables users to better decide how to use those predictions. Confidence scores are pointwise estimates of such probabilities. Ideally, these would be the label probabilities (a.k.a. the labeling rule) of the data generating distribution. However, in learning scenarios the learner does not have access to the labeling rule. The learner aims to figure out a best approximation of that rule based on training data and some prior knowledge about the task at hand, both for predicting a label and for possibly providing a confidence score. We formulate two goals for confidence scores, motivated by the use of these scores for safety-critical applications. First, they must not under-estimate the probability of error for any test point. Second, they must not be trivially low for most points. We consider a few common types of learner's prior knowledge and provide tools for obtaining pointwise confidence scores based on such prior knowledge and the relation between the test point and the training data. We prove that under the corresponding prior knowledge assumptions, our proposed tools meet these desired goals.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The reliability of machine-learnt programs is of course a major concern and has been the focus of much research. Theory offers quite a selection of tools for evaluating reliability, from generalization bounds to empirical results on test sets. However, most of those guarantees are statistical, in the sense that they only hold with high probability (over the generation of the training data and of the test points), and they provide no information about the correctness of the prediction on any specific instance. In cases where an error on a specific instance may incur a very high cost, as in safety-critical applications, the common statistical guarantees do not suffice. We would also wish to be able to identify predictions with low confidence, so that one could apply some safety procedure (such as a review by a human expert). Ideally, no low-confidence prediction should go undetected. At the same time, since expert intervention could be expensive, one also wishes to minimize the occurrence of false positives among the predictions flagged as low confidence.</p>
      <p>Can one do better than the overall statistical
estimates when it comes to evaluating reliability on a given
test case?</p>
      <p>Arguably, the most common reason for a statistically reliable machine learning program to fail on a test point is that the point is an 'outlier', in the sense of not being well represented by the sample the program was trained on. This research aims to quantify this 'outlierness'. We propose theoretically founded confidence bounds that take into account the relation between the training sample and the specific test point in question (on top of the commonly used parameters: the learning algorithm, the size of the training sample, and assumptions about the processes generating both the training data and the test instance).</p>
      <p>Clearly, the confidence of any prediction of an unknown label (or any piece of information) hinges upon some prior knowledge or assumptions. In this work we consider a few forms of prior knowledge that are commonly employed in machine learning theory, and we develop and analyse confidence scores for predictions on individual test points under such assumptions.</p>
      <p>We consider the following types of prior knowledge:</p>
    </sec>
    <sec id="sec-2">
      <title>Known hypothesis class with low approximation error:</title>
      <p>We discuss two cases: the realizable setting (i.e., when that approximation error is zero) and the agnostic setup (both in Section 2).</p>
      <p>In the realizable case, we show that there are indeed hypothesis classes for which it is possible to define a confidence score that does not overestimate confidence for any point while providing high confidence for many points. However, there are also hypothesis classes that do not allow non-trivial confidence scores fulfilling such a guarantee.</p>
      <p>For the agnostic setup, assuming the learner has knowledge of a hypothesis class with low (but not necessarily zero) approximation error, we show that it is not possible to give any non-trivial confidence score that does not overestimate the confidence for some instances.</p>
    </sec>
    <sec id="sec-3">
      <title>The data generating distribution is Lipschitz:</title>
      <p>We provide an algorithm that calculates confidence scores under such a Lipschitzness assumption. We show that with high probability over samples, the resulting confidence score of every point is an underestimate of its true confidence, while the confidence score we obtain is non-trivial. We also provide bounds on the probability (over points and samples) of assigning a low confidence score to a point with high true confidence; these bounds converge to zero as a function of the training set size. For more details, see Section 5.</p>
      <sec id="sec-3-1">
        <title>Related work</title>
        <p>
          The closest previous work to ours is Jiang et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. They consider a problem very similar to the one we address here: determining when a classifier's prediction on a given point can be trusted. For the sake of saving space, we refer the reader to that paper for a more extensive review of previous work on this topic. Their theoretical results differ from our work in two essential aspects. First, they consider only one setup, a data generating distribution that satisfies several technical assumptions. In particular, they rely on the following strong condition: "for any point $x \in X$, if the ratio of the distance to one class's high-density-region to that of another is smaller by some margin, then it is more likely that x's label corresponds to the former class." We analyse our notion of confidence under several different incomparable assumptions, arguably none of which is as strong as that. The second significant difference is that their main result on the trust of labels (Theorem 4 of [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) states that if a certain inequality holds then the predicted label agrees with that of the Bayes optimal predictor, and if another inequality holds, there is disagreement between them. However, those inequalities are not complementary. It may very well be that in many cases every domain point (or a high-probability set of instances) fails both inequalities. For such points, that main result says nothing at all. That paper does not offer any discussion of the conditions under which their main result is not vacuous in that respect. Additionally, their result holds with high probability over the domain and is not a point-wise guarantee.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Selective Classification / Classification with Abstention:</title>
      <p>
        One line of work that is related to our paper is learning with abstention. Similar to our setting, the classification problem does not only consist of the goal of classifying correctly, but also allows the classifier to abstain from making a prediction if the confidence of a prediction is too low. Many works in this line provide accuracy guarantees that hold with high probability over the domain ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). This is different from our goal of point-wise guarantees.
      </p>
      <p>
        Point-wise guarantees are provided in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. El-Yaniv et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] gave a theoretical analysis of the selective classification setup, in which a classification function and a selection function are learned simultaneously. The risk of a classification is then only assessed on the set of instances that was selected for classification. Selection functions are evaluated by their coverage: how many instances, in expectation, are selected for classification. They analyse the trade-off between risk and coverage, and introduce the notion of "perfect classification", which requires risk 0 with certainty. This is similar to our requirements on a confidence score in the deterministic setting, where we require 0 risk with high probability. Their notion of coverage is similar to our notion of non-triviality; in fact, non-triviality corresponds to worst-case coverage over a family of distributions. They provide an optimal perfect learning strategy in the realizable setting and show that there are hypothesis classes with a coverage that converges to 1 and hypothesis classes for which coverage is always 0 for some distributions. We use their results in our Section 4. In contrast to their paper, our setup also considers probabilistic labeling functions, and our analysis also considers other assumptions on the family of probability distributions, besides generalization guarantees for some fixed hypothesis class.
      </p>
      <sec id="sec-4-1">
        <title>Problem definition</title>
        <p>Let the domain of instances be $X$ and the label set be $\{0,1\}$. A learning task is determined by a probability distribution $P$ over $X \times \{0,1\}$. We denote the marginal distribution over the domain by $P_X$ and the conditional labeling rule by $\ell_P$ (namely, for $x \in X$, $\ell_P(x) = \Pr_{(x',y') \sim P}[y' = 1 \mid x' = x]$).</p>
        <p>The Bayes classifier, $h^P_B(x) = 1 \iff \Pr_{(x',y') \sim P}[y' = 1 \mid x' = x] \geq 0.5$, is the pointwise minimizer of the zero-one prediction loss. We sometimes refer to its prediction $h^P_B(x)$ as the majority label or the Bayes label of a point.</p>
        <p>We are interested in point-wise confidence. For a point $x \in X$, the confidence of a label $y \in \{0,1\}$ is $C_P(x, y) = \Pr_{(x',y') \sim P}[y' = y \mid x' = x]$. Note that the label assigned by the Bayes predictor maximizes this confidence for every domain point $x$.</p>
        <p>A confidence score is an empirical estimate (based on some training sample S) of the true label confidence. Inevitably, the reliability of a confidence score depends on some assumptions on the data generating distribution (or, in other words, on prior knowledge about the task at hand). Given a family of data generating distributions $\mathcal{P}$ (fulfilling some niceness assumptions that reflect the learner's prior knowledge or beliefs about the task), a training sample S, and a parameter $\delta$, the empirical confidence estimate for a point x and label y is a function $C(x, y, S, \delta)$. We want the following property to hold: for every probability distribution $P \in \mathcal{P}$, with probability at least $1 - \delta$ over an i.i.d. generation of S by P, we have</p>
        <p>$C(x, y, S, \delta) \leq \Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y]$. That is, with high probability, we do not overestimate the probability of y being the correct label for x. Ideally, this should hold for every point x in the domain. Of course, there is a trivial solution for this: just let $C(\cdot, \cdot, \cdot, \cdot)$ be the constant 0 function. The goal therefore is to get a confidence score that fulfils the condition above while still being as high as possible for 'many' x's. That is, we aim for a confidence score such that $\mathbb{E}_{x \sim P, S \sim P^m}[\max\{C(x, 1, S, \delta), C(x, 0, S, \delta)\}]$ is high. As mentioned above, given a data generating distribution P and a data representation available to the learner, the highest confidence on every instance $x \in X$ is achieved by the Bayes predictor $h^P_B(x)$, and it is easy to see that it is $\max\{\ell_P(x), 1 - \ell_P(x)\}$.</p>
        <p>In contrast with the common notion of a PAC-style error bound, confidence scores may vary over individual instances, capturing the heterogeneity of the domain and the specific training sample the label prediction relies on. To demonstrate this point, consider the following example:
Example 1. Let $X_1$ be the 0.1 grid over $[0,1]^d$, let $X_0$ be the 0.01 grid over $[0, 0.1]^d$, and let our domain be $X = X_0 \cup X_1$ (for some large d). Consider the family $\mathcal{P}$ of all probability distributions over X that have a deterministic labeling rule satisfying the 10-Lipschitz condition (so points at distance 0.1 or more have no effect on each other). Assume further that all the distributions in $\mathcal{P}$ have half of their mass uniformly distributed over the $X_1$ grid points and the other half of the mass uniformly distributed over $X_0$.</p>
        <p>Since outside the $[0, 0.1]^d$ cube every labeling is possible, for every learner there is a distribution $P \in \mathcal{P}$ w.r.t. which it errs on every domain point in $X_1 \setminus S_X$ (where $S_X = \{x : \exists y \in \{0,1\} \text{ s.t. } (x, y) \in S\}$). On the other hand, due to the Lipschitz condition, all the points in the $[0, 0.1]^d$ grid, $X_0$, must get the same label. Therefore, given a sample S that includes a point in $X_0$, a learner that labels all the points in $X_0$ by the label of the sample points in it achieves confidence 1 for all these points.</p>
        <p>
          We conclude that, for sample sizes between 2 and, say, $10^d/2$, for most samples a learner can achieve confidence 1 for points in $X_0$, while no learner can achieve confidence above 0 for even half of the domain points in $X_1$. Note also that the No Free Lunch theorem (as formulated in, e.g., [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) implies that for sample sizes smaller than $10^d/2$, for every learner there exists some probability distribution $P \in \mathcal{P}$ for which its expected error is at least 1/8 (1/4 over the subspace $X_1$, which has probability weight 1/2).
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Confidence scores for hypothesis classes</title>
        <p>In the following we will analyze point-wise confidence when all the prior knowledge available about the data generating distribution P is a bound on the approximation error of a class of predictors. We will distinguish two cases here:
1. the family $\mathcal{P}_{H,0}$ of distributions P which are realizable w.r.t. H, i.e. $\inf_{h \in H} L_P(h) = 0$, and
2. the family $\mathcal{P}_{H,\epsilon}$ of distributions P for which the approximation error of the class H is low but not guaranteed to be zero, i.e. $\inf_{h \in H} L_P(h) \leq \epsilon$ for some $\epsilon &gt; 0$.
Note that, given a class of predictors H, the second family of possible data generating distributions is a superset of the first. Consequently, the pointwise error guarantees one can give in that not-necessarily-realizable case are weaker¹.</p>
        <p>Definition 1 (Confidence score fulfilling the no-overestimation guarantee for all instances). Consider a function C that takes as input a point x, a label y, a sample S, and a parameter $\delta$, and outputs a value in $[0,1]$. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions $\mathcal{P}$ if for every $P \in \mathcal{P}$, the probability over $S \sim P^m$ that there exist $x \in X$ and $y \in \{0,1\}$ with $\Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y] &lt; C(x, y, S, \delta)$ is at most $\delta$.</p>
        <p>We say a function $C(x, y, S, \delta)$ is a confidence score fulfilling the no-overestimation guarantee for positive-mass instances if the above guarantee holds not for all x, but for all x with $P(\{x\}) &gt; 0$.</p>
        <p>Obviously, this guarantee is trivial to achieve: simply give the confidence score 0 for all predictions. In order to measure the informativeness of a confidence score for a particular distribution, we introduce the notion of non-triviality of a confidence score.</p>
        <p>Definition 2 (Non-triviality). Given a confidence score C for some family of distributions $\mathcal{P}$, we define the non-triviality $nt_P(C, m, \delta)$ w.r.t. a distribution $P \in \mathcal{P}$ for a given sample size m and parameter $\delta$ to be the expected closeness between the estimated and the true confidence, i.e.:
$nt_P(C, m, \delta) = \mathbb{E}_{x \sim P, S \sim P^m}\left[\min_{y \in \{0,1\}} \left(1 - |C_P(x, y) - C(x, y, S, \delta)|\right)\right].$</p>
        <p>Next we consider a specific confidence score that takes into account whether a hypothesis class is undecided on a point x given a sample S:</p>
        <p>$C_H(x, y, S, \delta) = 1$ if there is no $h \in H$ with $L_S(h) = 0$ and $h(x) \neq y$, and $C_H(x, y, S, \delta) = 0$ otherwise.</p>
        <p>Proposition 1. For the family of distributions $\mathcal{P}_{H,0}$ that are realizable with respect to H, $C_H$ is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.</p>
        <p>¹To not have to deal with the ambiguity of the labeling function for points with mass 0, we will restrict this discussion to the family of distributions which have positive mass on all points. This implies that in the realizable setting all labeling functions $\ell_P$ we consider are part of H.</p>
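        <p>The following Python sketch spells out $C_H$ for a finite hypothesis class given as a list of callables (a minimal illustration under the realizability assumption; the helper names are ours):</p>
        <preformat>
def empirical_error(h, S):
    # L_S(h): fraction of sample points that h mislabels.
    return sum(h(x) != y for x, y in S) / len(S)

def C_H(H, x, y, S):
    # Version space: hypotheses consistent with the sample S.
    version_space = [h for h in H if empirical_error(h, S) == 0]
    # Confidence 1 iff every sample-consistent hypothesis predicts y at x.
    return int(all(h(x) == y for h in version_space))

# Example with thresholds h_a(x) = 1 iff x &gt; a, for a in {0, 0.1, ..., 1}:
H = [lambda x, a=0.1 * k: int(x &gt; a) for k in range(11)]
S = [(0.15, 0), (0.85, 1)]
print(C_H(H, 0.05, 0, S), C_H(H, 0.5, 1, S))  # 1 (all agree), 0 (undecided)
        </preformat>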
        <p>
          Proposition 1 was proven, in a different setup, by El-Yaniv et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We note that our confidence score $C_H$ is equivalent to their notion of a consistent selective strategy. Using our terminology, they show that if the realizability assumption holds and an instance (x, y) is assigned score 1 by $C_H$, then x is guaranteed to have true label y (with probability 1). Furthermore, their Theorems 11 and 14, as well as their Corollary 28, give rise to the following observation about confidence scores.
Observation 1. The non-triviality $nt_P(C_H, m, \delta)$ of the confidence scoring rule $C_H$ under the realizability assumption displays different behaviours for different classes (even when they have similar VC dimension): for some hypothesis classes, e.g. the class of thresholds on the real line $H_{thres}$ or the class of linear separators in $\mathbb{R}^d$, $nt_P(C_H, m, \delta)$ converges to 1 for every $\delta &gt; 0$ and every $P \in \mathcal{P}_{H,0}$ as the sample size goes to infinity.
        </p>
        <p>On the other hand, for some hypothesis classes with finite VC-dimension there exist $P \in \mathcal{P}_{H,0}$ with $nt_P(C_H, m, \delta) = 0$ for every sample size m and every $\delta &lt; 1$. This phenomenon occurs, for example, for H being the class of singletons.</p>
        <p>
          For a more detailed analysis of which hypothesis
classes have high non-triviality, we refer the reader to
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], noting that our notion of non-triviality corresponds
to their notion of coverage.
        </p>
        <p>We now look at the second case we wanted to address in this section: the family of probability distributions such that the approximation error of a class H is bounded by some $\epsilon$. We fix a hypothesis class H and let $\mathcal{P}_{H,\epsilon}$ be the family of all probability distributions P such that H has approximation error at most $\epsilon$ w.r.t. P. We show that for any (non-trivial) hypothesis class H, it is not possible to find any satisfying confidence score for such a family.</p>
        <p>Proposition 2. Let H be any hypothesis class over an infinite domain. Then there is no confidence score C such that the following two statements are true simultaneously:</p>
        <p>C fulfills the no-overestimation guarantee for all positive-mass instances w.r.t. $\mathcal{P}_{H,\epsilon}$; and there exists some $\alpha &gt; 0$ such that for every $P \in \mathcal{P}_{H,\epsilon}$ there are $\delta \in (0,1)$ and $m \in \mathbb{N}$ such that C has non-triviality $nt_P(C, m, \delta) &gt; \alpha$.</p>
        <p>This shows us that restricting ourselves to a family of probability distributions that allows for good generalization is not sufficient for allowing satisfying confidence scores. In the following section we make stronger, more local assumptions and show that under these assumptions more satisfying confidence scores can be found.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Confidence scores under Lipschitz assumption</title>
        <p>Lipschitz assumption: We say that a probability distribution P over $X \times \{0,1\}$ satisfies $\lambda$-Lipschitzness w.r.t. a metric $d(\cdot, \cdot)$ over X if for every $x, x' \in X$, $|\ell_P(x) - \ell_P(x')| \leq \lambda\, d(x, x')$. When the domain is a subset of a Euclidean space, we will assume that d is the Euclidean distance unless we specify otherwise.</p>
        <p>We provide an algorithm (Algorithm 1) that estimates the labelling probability of points using labelled samples. The algorithm partitions the space into cells and outputs the same answer for all points in the same cell. The input parameter r dictates the size of the cells. The algorithm estimates the average labelling probability for each cell, and a confidence interval for this estimate is calculated based on the number of sample points in the cell; the interval is narrower when there are more sample points in the cell.</p>
        <p>The following lemmas show how to estimate probability weights and average labelling probabilities of subsets of the domain.
Lemma 1. Let P be a distribution over domain X. Let $X_0$ be a subset of X. Let S be an i.i.d. sample of size m drawn from the distribution P. Let $\hat{p}(X_0, S)$ be the fraction of the m samples that are in $X_0$. For any $\delta &gt; 0$, with probability $1 - \delta$ over the generation of the sample S, $|P(X_0) - \hat{p}(X_0, S)| \leq w_p(m, \delta)$, where $w_p(m, \delta) = \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}}$.</p>
        <p>Lemma 2. Let P be a distribution over $X \times \{0,1\}$. Let $X_0$ be a subset of X. Let S be an i.i.d. sample of size m drawn from P. Let $\hat{\ell}(X_0, S)$ be the fraction of the labelled samples in $S \cap X_0$ that have label 1. For any $\delta &gt; 0$, with probability $1 - \delta$ over the generation of the sample S, if $\hat{p}(X_0, S) - w_p(m, \delta/2) &gt; 0$, then $|\ell_P(X_0) - \hat{\ell}(X_0, S)| &lt; w_\ell(m, \delta, \hat{p}(X_0, S))$, where $w_\ell(m, \delta, \hat{p}(X_0, S)) = \frac{1}{\hat{p}(X_0, S) - w_p(m, \delta/2)} \left( w_p(m, \delta/2) + \sqrt{\frac{1}{2m} \ln \frac{4}{\delta}} \right)$, $\hat{p}(X_0, S)$ is the fraction of the samples from S that lie in $X_0$, and $w_p(m, \delta/2)$ is as defined in Lemma 1.</p>
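        <p>Under our reconstruction of the two width formulas above, a minimal Python rendering (the function names are ours) is:</p>
        <preformat>
import math

def w_p(m, delta):
    # Hoeffding width for estimating P(X0) from m samples (Lemma 1).
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def w_ell(m, delta, p_hat):
    # Width for the average labelling probability of X0 (Lemma 2);
    # only meaningful when p_hat - w_p(m, delta / 2) &gt; 0.
    return (w_p(m, delta / 2) + math.sqrt(math.log(4.0 / delta) / (2.0 * m))) / \
           (p_hat - w_p(m, delta / 2))
        </preformat>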
        <p>The following theorem shows that the true labelling probability of a point lies within the estimated interval returned by Algorithm 1, with high probability over the sample used to compute the estimate.</p>
        <p>Theorem 1. Let the domain be $[0,1]^d$, and suppose the data generating distribution P satisfies $\lambda$-Lipschitzness.</p>
        <p>Algorithm 1: Lipschitz labelling probability estimate.
Input: test point x, labelled sample $S = (x_i, y_i)_{i=1}^m$, radius r, estimation parameter $\delta$, Lipschitz constant $\lambda$.
Output: labelling probability estimate and confidence width of the estimate.
Split the domain $X = [0,1]^d$ into a grid of $(1/r)^d$ hypercube cells, each of side length r.
Find the cell $t_x$ containing the test point x.
$\hat{p}[t_x]$ := fraction of samples in $t_x$; $\hat{\ell}[t_x]$ := fraction of samples in the cell $t_x$ with label 1.
$w[t_x] := 1$; if $\hat{p}[t_x] - w_p(m, \delta/2) &gt; 0$, then $w[t_x] := w_\ell(m, \delta, \hat{p}[t_x])$.
Return $\hat{\ell}[t_x]$ and $\min(1,\ w[t_x] + \lambda r \sqrt{2})$.</p>
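        <p>A self-contained Python sketch of Algorithm 1, under the assumptions of our reconstruction (the widths $w_p$ and $w_\ell$ are those of Lemmas 1 and 2; variable names are ours):</p>
        <preformat>
import math

def w_p(m, delta):
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def w_ell(m, delta, p_hat):
    return (w_p(m, delta / 2) + math.sqrt(math.log(4.0 / delta) / (2.0 * m))) / \
           (p_hat - w_p(m, delta / 2))

def lipschitz_estimate(x, S, r, delta, lam):
    # Identify the grid cell of a point by its vector of integer coordinates.
    def cell(p):
        return tuple(min(int(c / r), int(1 / r) - 1) for c in p)

    m = len(S)
    t_x = cell(x)
    in_cell = [(xi, yi) for xi, yi in S if cell(xi) == t_x]
    p_hat = len(in_cell) / m                 # fraction of samples in t_x
    ell_hat = sum(yi for _, yi in in_cell) / len(in_cell) if in_cell else 0.0
    width = 1.0                              # trivial width by default
    if p_hat - w_p(m, delta / 2) &gt; 0:
        width = w_ell(m, delta, p_hat)
    # Lipschitz slack: the paper bounds within-cell variation by lam * r * sqrt(2).
    return ell_hat, min(1.0, width + lam * r * math.sqrt(2))
        </preformat>
        <p>The confidence score $\hat{C}^{Lipschitz}_r$ defined next is then read off the returned pair.</p>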
        <p>For any $r &gt; 0$, $\delta &gt; 0$, $m \in \mathbb{N}$, and any $x \in [0,1]^d$, define the confidence score based on Algorithm 1 as $\hat{C}^{Lipschitz}_r(x, y, S, \delta) = \hat{\ell}_S(x) - w_S(x)$ if $y = 1$, and $\hat{C}^{Lipschitz}_r(x, y, S, \delta) = 1 - \hat{\ell}_S(x) - w_S(x)$ if $y = 0$, where $(\hat{\ell}_S(x), w_S(x))$ is the output of the algorithm on input $x$, $S$, $r$, $\delta$, $\lambda$. Then with probability $1 - \delta$ over samples S of size m, $\hat{C}^{Lipschitz}_r(x, y, S, \delta) \leq C_P(x, y)$.</p>
        <p>We now show that as the sample size increases, for an appropriately chosen input parameter r, Algorithm 1 returns narrow estimate intervals for the labelling probabilities of most points. This implies that for most points, the confidence score is not much lower than the true confidence $\max\{\ell_P(x), 1 - \ell_P(x)\}$.</p>
        <p>Theorem 2. For every $\lambda$-Lipschitz distribution and every $\epsilon_x, c, \delta &gt; 0$, there is a sample size $m(\epsilon_x, c, \delta)$ such that with probability $1 - \delta$ over samples S of size $m(\epsilon_x, c, \delta)$, $\Pr_{x \sim P}[w_S(x) &gt; c] &lt; \epsilon_x$, where $w_S$ is the width of the labelling probability estimate obtained from Algorithm 1 with grid-size parameter $r = 1/m^{1/(8d)}$.</p>
        <p>Note that this theorem implies that the expected non-triviality is high.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Proofs</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Useful lemmas</title>
      <p>
        The following lemma appears as Theorem 2.2.6 in the book of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where the reader can find its proof.
Lemma 3 (Hoeffding's inequality for general bounded random variables). Let $X_1, \ldots, X_N$ be independent random variables. Assume that $X_i \in [m_i, M_i]$ for every i. Then, for any $t &gt; 0$, we have
      </p>
      <p>" N # 2t2 !
Pr X(Xi E[Xi]) t exp
:
i=1
Proof of Lemma 1. Let Xi be a random variable
indicating if the ith sample belongs to set X0. Xi = 1 if
the ith sample belongs to X0 and zero otherwise. For
each i, E[Xi] = P (X0). p^(X0; S) = PiN=m1 Xi .
Applying Hoe ding's inequality, we get the inequality of the
theorem.</p>
      <p>Proof of Lemma 2. Let $X_i$ be a random variable such that $X_i = 1$ if the i-th sample is in $X_0$ and has label one, and $X_i = 0$ otherwise. Then $\mathbb{E}[X_i] = P(X_0)\, \ell_P(X_0)$ for each i, and $\sum_{i=1}^{m} X_i = m\, \hat{p}(X_0, S)\, \hat{\ell}(X_0, S)$. Applying Hoeffding's inequality to this sum with deviation width $\sqrt{\frac{1}{2m} \ln \frac{4}{\delta}}$, combining it with Lemma 1 (invoked with parameter $\delta/2$) and using the triangle inequality, we get that with probability $1 - \delta$, $|\ell_P(X_0) - \hat{\ell}(X_0, S)| &lt; w_\ell(m, \delta, \hat{p}(X_0, S))$.</p>
    </sec>
    <sec id="sec-6">
      <title>Confidence scores for hypothesis classes</title>
      <p>We start by recalling the definition of confidence scores fulfilling the no-overestimation guarantee for all instances.
Definition 1 (Confidence score fulfilling the no-overestimation guarantee for all instances). Consider a function C that takes as input a point x, a label y, a sample S, and a parameter $\delta$, and outputs a value in $[0,1]$. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions $\mathcal{P}$ if for every $P \in \mathcal{P}$, the probability over $S \sim P^m$ that there exist $x \in X$ and $y \in \{0,1\}$ with $\Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y] &lt; C(x, y, S, \delta)$ is at most $\delta$.
Proposition 1. For the family of distributions $\mathcal{P}_{H,0}$ that are realizable with respect to H, $C_H$ is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.</p>
      <p>We need to show that $C_H$ fulfills Definition 1; that is, we need to show that for every hypothesis class H and every distribution P that fulfills the realizability condition w.r.t. H, the probability over $S \sim P^m$ that there exists $x \in X$ with</p>
      <p>$\Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y] &lt; C_H(x, y, S, \delta)$
is less than $\delta$. Since $C_H$ only assigns the values 0 and 1, and the condition is trivially fulfilled for instances with confidence score 0, we will now only discuss the case where $C_H$ assigns confidence score 1. Recall that $C_H$ assigns confidence 1 if and only if there is no $h \in H$ with $L_S(h) = 0$ and $h(x) \neq y$. Since we know that realizability holds, we know that $\ell_P \in H$, and since S is an i.i.d. sample from P, we know $L_S(\ell_P) = 0$. Now let $C_H(x, y, S, \delta) = 1$; then by definition we know that $L_S(h) = 0$ implies $h(x) = y$. Thus $\ell_P(x) = y$. Thus, this confidence score does not overestimate the confidence of a point in any label.</p>
    </sec>
    <sec id="sec-7">
      <title>Proof of Observation 1</title>
      <p>
        We start by noting that our definition of the confidence score $C_H$ is equivalent to the consistent selective strategy from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In order to state their definition, we first need to introduce some other concepts. We begin with the definition of a version space from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Definition 3 (Version space). Given a hypothesis class H and a labeled sample S, the version space $H_S$ is the set of all hypotheses in H that classify S correctly.
      </p>
      <p>
        Now we can define the agreement set and the maximal agreement set as in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Definition 4 (Agreement set). Let $G \subseteq H$. A subset $X' \subseteq X$ is an agreement set with respect to G if all hypotheses in G agree on every instance in $X'$, namely for all $g_1, g_2 \in G$ and all $x \in X'$: $g_1(x) = g_2(x)$.</p>
      <p>Definition 5 (Maximal agreement set). Let $G \subseteq H$. The maximal agreement set with respect to G is the union of all agreement sets with respect to G.</p>
      <p>We can now state the definition of a consistent selective strategy. Note that a selective strategy is defined by a pair (h, g) of a classification function h and a selection function g. For our purposes, we will only need to look at the selection function g.</p>
      <p>Definition 6 (Consistent selective strategy (CSS)). Given a labeled sample S, a consistent selective strategy (CSS) is a selective classification strategy that takes h to be any hypothesis in $H_S$ (i.e., a consistent learner), and takes a (deterministic) selection function g that equals one for all points in the maximal agreement set with respect to $H_S$ and zero otherwise.</p>
      <p>
        We now see that for any H and any labeled sample S, the selection function g assigns one to x if and only if every two $h_1, h_2 \in H$ with $L_S(h_1) = L_S(h_2) = 0$ satisfy $h_1(x) = h_2(x)$. Thus $g(x) = C_H(x, h(x), S, \delta)$ for any $h \in H_S$. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the selection function is then analysed with respect to its coverage, which is defined as $\Phi(h, g) = \mathbb{E}_{x \sim P}[g(x)]$ for a given distribution P. Note that our notion of non-triviality corresponds to this notion of coverage for deterministic distributions and binary confidence scores. We can now use some of their results to show our observation.
      </p>
      <p>
        Theorem 3 (Non-achievable coverage; Theorem 14 from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Let m and $d &gt; 2$ be given. There exist a distribution P, an infinite hypothesis class H with finite VC-dimension d, and a target hypothesis in H, such that $\Phi(h, g) = 0$ for any selective classifier (h, g) chosen from H by CSS using a training sample S of size m drawn i.i.d. according to P.
      </p>
      <p>This directly implies the second part of our observation. For a more concrete example, consider the class of singletons over the real line, $H_{sgl} = \{h_z : \mathbb{R} \to \{0,1\} : h_z(x) = 1 \text{ iff } x = z\}$. We note that $C_{H_{sgl}}$ gives positive confidence scores to instances outside the sample only if the sample contains a positively labeled instance. Let P be the uniform distribution over $[0,1] \times \{0\}$. Obviously $P \in \mathcal{P}_{H_{sgl},0}$. Furthermore, no sample generated by P will contain a positively labeled instance. Thus $nt_P(C_{H_{sgl}}, m, \delta) = \Pr_{x \sim P_X, S \sim P^m}[x \in S_X] = 0$, where $P_X$ and $S_X$ denote the projections of P and S to the domain $X = \mathbb{R}$.</p>
      <p>Let us now consider the class of thresholds, $H_{thres} = \{h_a : \mathbb{R} \to \{0,1\} : h_a(x) = 1 \text{ iff } x &gt; a\}$. We can define the following two learning rules for thresholds: $A_1(S) = h_{a_1}$, where $a_1 = \max\{x_i : (x_i, 0) \in S\}$; and $A_2(S) = h'_{a_2}$, where $h'_{a_2}(x) = 1$ iff $x \geq a_2 := \min\{x_i : (x_i, 1) \in S\}$.</p>
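      <p>A brief Python sketch of these two rules and the induced agreement check (hedged: the names are ours, and we assume the sample contains at least one point of each label):</p>
      <preformat>
def A1(S):
    # Threshold at the right-most negative point; predicts 1 iff x &gt; a1.
    a1 = max(x for x, y in S if y == 0)
    return lambda x: int(x &gt; a1)

def A2(S):
    # Threshold at the left-most positive point; predicts 1 iff x &gt;= a2.
    a2 = min(x for x, y in S if y == 1)
    return lambda x: int(x &gt;= a2)

def agree(S, x):
    # A1 and A2 agree at x exactly when all sample-consistent thresholds do,
    # i.e. when C_{H_thres} assigns confidence 1 to some label at x.
    return A1(S)(x) == A2(S)(x)
      </preformat>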
      <p>The way $C_{H_{thres}}$ is defined, for any S and any $\delta \in (0,1)$ we have $A_1(S)(x) = A_2(S)(x)$ if and only if there is $y \in \{0,1\}$ with $C_{H_{thres}}(x, y, S, \delta) = 1$. Furthermore, both of these learning rules are empirical risk minimizers for $H_{thres}$ in the realizable case, and thus both of them are PAC-learners. Thus for any $\epsilon, \delta &gt; 0$, there is an $m(\epsilon, \delta)$ such that for any $m \geq m(\epsilon, \delta)$ and any $P \in \mathcal{P}_{H_{thres},0}$,</p>
      <p>$1 - \epsilon \leq \Pr_{S \sim P^m, x \sim P}[A_1(S)(x) = A_2(S)(x)] = \Pr_{S \sim P^m, x \sim P}\left[\max_{y \in \{0,1\}} C_{H_{thres}}(x, y, S, \delta) = 1\right] = nt_P(C_{H_{thres}}, m, \delta).$
Thus $\lim_{m \to \infty} nt_P(C_{H_{thres}}, m, \delta) = 1$ for any $\delta \in (0,1)$ and any $P \in \mathcal{P}_{H_{thres},0}$; furthermore, note that this convergence is uniform.</p>
      <p>Finally, we turn to Proposition 2. Fix some $h_1 \in H$ and a countably infinite subset $X_0$ of the domain. For every $\epsilon &gt; 0$, every $x \in X_0$, and every $n \in \mathbb{N}$ with $n &gt; \frac{1}{\epsilon}$, we can construct a distribution $P_{x,n}$ such that $\ell_{P_{x,n}}(x') = h_1(x')$ for every $x' \in X \setminus \{x\}$, $h_1(x) \neq \ell_{P_{x,n}}(x)$, and the marginal $P_{x,n,X}$ is uniform over some $X_n \subseteq X_0$ with $x \in X_n$ and $|X_n| = n$; note that such a $P_{x,n}$ lies in $\mathcal{P}_{H,\epsilon}$. For a sample to distinguish between two such distributions $P_{x_1,n}$ and $P_{x_2,n}$, either the point $x_1$ or the point $x_2$ needs to be visited by the sample. Thus, in order to give a point-wise guarantee for all instances with positive mass, only points in the sample can be assigned a positive confidence in this scenario. Hence any confidence score fulfilling this guarantee satisfies $nt_{P_{x,n}}(C, m, \delta) \leq \Pr_{x' \sim P_X, S \sim P^m}[x' \in S_X] \leq \frac{m}{n}$, and for every $\alpha &gt; 0$ and every m we can find n such that $\frac{m}{n} &lt; \alpha$, proving our claim.</p>
    </sec>
    <sec id="sec-8">
      <title>Confidence scores using Lipschitz assumption</title>
      <p>Proof of Theorem 1. The algorithm partitions the space into $(1/r)^d$ cells. Let $p_c$ be the probability weight of a cell c and let $\hat{p}_c$ be its estimate, calculated from a sample as the fraction of sample points in the cell c. From Lemma 1 and a union bound over cells, we know that with probability $1 - \frac{\delta}{2}$, for every cell c, $p_c \in [\hat{p}_c - w_p(c),\ \hat{p}_c + w_p(c)]$, where $w_p(c) = w_p(m, \frac{\delta}{2(1/r)^d})$ (as defined in Lemma 1).</p>
      <p>The algorithm also estimates the average label $\ell_c$ of a cell c by $\hat{\ell}_c$, the fraction of the sample points in the cell that have the label one; this is the labelling probability estimate defined in Lemma 2. When the true probability weights of cells lie within the calculated confidence intervals, by Lemma 2 we know that with probability $1 - \frac{\delta}{2}$, for every cell c, $\ell_c \in [\hat{\ell}_c - w_\ell(c),\ \hat{\ell}_c + w_\ell(c)]$, where $w_\ell(c) = w_\ell(m, \frac{\delta}{2(1/r)^d}, \hat{p}_c)$ (as defined in Lemma 2).</p>
      <p>The maximum distance between any two points in any cell is $r\sqrt{2}$. By $\lambda$-Lipschitzness, any point in the cell has labelling probability within $\lambda r \sqrt{2}$ of the average labelling probability of the cell. Therefore, with probability $1 - \delta$, for each cell c and every point x in the cell c, the labelling probability of x satisfies $\ell_P(x) \in [\hat{\ell}_c - w_\ell(c) - \lambda r \sqrt{2},\ \hat{\ell}_c + w_\ell(c) + \lambda r \sqrt{2}]$.</p>
      <p>This is the interval returned by the algorithm. Now we lower bound the true confidence based on the confidence interval of the labelling probability. For a point x, let c(x) denote the cell containing the point, and let $w_S(x) = w_\ell(c(x)) + \lambda r \sqrt{2}$ be the returned width. Then $C_P(x, 1) = \ell_P(x) \geq \hat{\ell}_{c(x)} - w_S(x)$ and $C_P(x, 0) = 1 - \ell_P(x) \geq 1 - \hat{\ell}_{c(x)} - w_S(x)$, which are exactly the scores $\hat{C}^{Lipschitz}_r$ assigns.</p>
      <p>Proof of Theorem 2. We choose the input to the algorithm to be $r = \frac{1}{m^{1/(8d)}}$. With probability $1 - \frac{\delta}{2}$, for all cells with probability weight greater than $\frac{1}{m^{1/4}}$, the length of the confidence interval of the labelling probability is bounded by a quantity that decreases as m increases and converges to zero. Therefore, for every $c &gt; 0$, there is $M_1(c, \delta)$ such that this interval length is less than c: when the sample size is larger than $M_1(c, \delta)$, with probability $1 - \frac{\delta}{2}$, the confidence intervals for the labelling probabilities of cells with weight greater than $\frac{1}{m^{1/4}}$ are smaller than c.</p>
      <p>The points for which we cannot say anything about the interval lengths are points in cells with weight at most $\frac{1}{m^{1/4}}$. The total weight of such points is at most $(1/r)^d \cdot \frac{1}{m^{1/4}} = \frac{1}{m^{1/8}}$. For any $\epsilon_x &gt; 0$, let $M_2(\epsilon_x)$ be such that $\frac{1}{M_2(\epsilon_x)^{1/8}} &lt; \epsilon_x$.</p>
      <p>Choosing a sample size M greater than both $M_1(c, \delta)$ and $M_2(\epsilon_x)$, we get that with probability $1 - \delta$ over $S \sim P^M$, $\Pr_{x \sim P}[w_S(x) &gt; c] &lt; \epsilon_x$.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>P. L.</given-names>
          </string-name>
          ; and Wegkamp,
          <string-name>
            <surname>M. H.</surname>
          </string-name>
          <year>2008</year>
          .
          <article-title>Classi cation with a Reject Option using a Hinge Loss</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <issue>59</issue>
          ):
          <year>1823</year>
          {
          <year>1840</year>
          . URL http://jmlr.org/papers/v9/ bartlett08a.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>El-Yaniv</surname>
            , R.; and Wiener,
            <given-names>Y.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>On the Foundations of Noise-free Selective Classi cation</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>11</volume>
          :
          <issue>1605</issue>
          {
          <fpage>1641</fpage>
          . URL http: //portal.acm.org/citation.cfm?id=
          <fpage>1859904</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mansour</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          ; et al.
          <year>2004</year>
          .
          <article-title>Generalization bounds for averaged classiers</article-title>
          .
          <source>The annals of statistics 32(4):</source>
          <volume>1698</volume>
          {
          <fpage>1722</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Herbei</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and Wegkamp,
          <string-name>
            <surname>M. H.</surname>
          </string-name>
          <year>2006</year>
          .
          <article-title>Classi - cation with reject option</article-title>
          .
          <source>The Canadian Journal of Statistics/La Revue Canadienne de Statistique</source>
          <volume>709</volume>
          {
          <fpage>721</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Gupta,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>To trust or not to trust a classi er</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <volume>5541</volume>
          {
          <fpage>5552</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Kalai</surname>
            ,
            <given-names>A. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kanade</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and Mansour,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Reliable agnostic learning</article-title>
          .
          <source>Journal of Computer and System Sciences</source>
          <volume>78</volume>
          (
          <issue>5</issue>
          ):
          <volume>1481</volume>
          {
          <fpage>1495</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Mitchell,
          <string-name>
            <surname>T. M.</surname>
          </string-name>
          <year>1977</year>
          .
          <article-title>Version Spaces: A Candidate Elimination Approach to Rule Learning</article-title>
          . In Reddy, R., ed.,
          <source>Proceedings of the 5th International Joint Conference on Arti cial Intelligence</source>
          . Cambridge, MA, USA,
          <year>August</year>
          22-
          <issue>25</issue>
          ,
          <year>1977</year>
          ,
          <volume>305</volume>
          {
          <fpage>310</fpage>
          . William Kaufmann. URL http://ijcai. org/Proceedings/77-1/Papers/048.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Shalev-Shwartz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ben-David</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Understanding Machine Learning -</article-title>
          <source>From Theory to Algorithms</source>
          . Cambridge University Press.
          <source>ISBN 978-1-10-705713- 5</source>
          . URL http://www.cambridge.org/de/ academic/subjects/computer-science/
          <article-title>pattern-recognition-and-machine-learning/ understanding-machine-learning-theory-algorithms.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Vershynin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>High-dimensional probability</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Wiener</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>El-Yaniv</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Agnostic pointwise-competitive selective classi cation</article-title>
          .
          <source>Journal of Arti cial Intelligence Research</source>
          <volume>52</volume>
          :
          <fpage>171</fpage>
          {
          <fpage>201</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Wegkamp,
          <string-name>
            <surname>M. H.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>Classi cation Methods with Reject Option Based on Convex Risk Minimization</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>11</volume>
          :
          <issue>111</issue>
          {
          <fpage>130</fpage>
          . URL https://dl.acm.org/citation.cfm? id=
          <fpage>1756011</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>