Pointwise Confidence Scores with Provable Guarantees

Nivasini Ananthakrishnan, Shai Ben-David, Tosca Lechner
University of Waterloo
nanantha@uwaterloo.ca, bendavid.shai@gmail.com, tlechner@uwaterloo.ca

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




Abstract

Quantifying the probability of a machine learning prediction being correct on a given test point enables users to better decide how to use those predictions. Confidence scores are pointwise estimates of such probabilities. Ideally, these would be the label probabilities (a.k.a. the labeling rule) of the data generating distribution. However, in learning scenarios the learner does not have access to the labeling rule. The learner aims to figure out a best approximation of that rule based on training data and some prior knowledge about the task at hand, both for predicting a label and for possibly providing a confidence score. We formulate two goals for confidence scores, motivated by the use of these scores in safety-critical applications. First, they must not under-estimate the probability of error for any test point. Second, they must not be trivially low for most points. We consider a few common types of learner's prior knowledge and provide tools for obtaining pointwise confidence scores based on such prior knowledge and the relation between the test point and the training data. We prove that under the corresponding prior-knowledge assumptions, our proposed tools meet these desired goals.

1 Introduction

The reliability of machine-learnt programs is of course a major concern and has been the focus of much research. Theory offers quite a selection of tools for evaluating reliability, from generalization bounds to experimental results on test sets. However, most of those guarantees are statistical, in the sense that they only hold with high probability (over the generation of the training data and of the test points) and they provide no information about the correctness of the prediction on any specific instance. In cases where an error on a specific instance may incur a very high cost, as in safety-critical applications, the common statistical guarantees do not suffice. We would also wish to be able to identify predictions with low confidence, so that one could apply some safety procedure (such as a review by a human expert). Ideally, no low-confidence prediction should go undetected. At the same time, since expert intervention can be expensive, one also wishes to minimize the occurrence of false positives among the predictions flagged as low confidence.

Can one do better than the overall statistical estimates when it comes to evaluating reliability on a given test case? Arguably, the most common reason for a statistically reliable machine learning program to fail on a test point is that the point is an 'outlier', in the sense of not being well represented by the sample the program was trained on. This research aims to quantify this 'outlierness'. We propose theoretically founded confidence bounds that take into account the relation between the training sample and the specific test point in question (on top of the commonly used parameters of the learning algorithm: the size of the training sample and assumptions about the processes generating both the training data and the test instance).

Clearly, the confidence of any prediction of an unknown label (or any piece of information) hinges upon some prior knowledge or assumptions. In this work we consider a few forms of prior knowledge that are commonly employed in machine learning theory, and we develop and analyse confidence scores for predictions on individual test points under such assumptions.

We consider the following types of prior knowledge:

Known hypothesis class with low approximation error: We discuss two cases - the realizable setting (i.e., when that approximation error is zero) and the agnostic setup (both in Section 4).

• In the realizable case, we show that there are indeed hypothesis classes for which it is possible to define a confidence score that does not overestimate confidences for any points, while providing high confidences to many points. However, there are also hypothesis classes that do not allow non-trivial confidence scores fulfilling such a guarantee.

• For the agnostic setup, assuming the learner has knowledge of a hypothesis class with low (but not necessarily 0) approximation error, we show that in this case it is not possible to give any non-trivial confidence score that does not overestimate confidence for some instances.

The data generating distribution is Lipschitz: We provide an algorithm that calculates confidence scores under such a Lipschitzness assumption. We show that, with high probability over samples, the resulting confidence score of every point is an underestimate of its true confidence, while the confidence score we obtain is non-trivial. We provide bounds on the probability (over points and samples) of assigning a low confidence score to a point with high true confidence; these bounds converge to zero as the training sample size grows. For more details, see Section 5.

2 Related work

The closest previous work to ours is Jiang et al. [5]. They consider a problem very similar to the one we address here - the problem of determining when a classifier's prediction on a given point can be trusted. For the sake of saving space, we refer the reader to that paper for a more extensive review of previous work on this topic. Their theoretical results differ from our work in two essential aspects. First, they consider only one setup - a data generating distribution that satisfies several technical assumptions. In particular they rely on the following strong condition: "for any point x ∈ X, if the ratio of the distance to one class's high-density-region to that of another is smaller by some margin γ, then it is more likely that x's label corresponds to the former class." We analyse our notion of confidence under several different incomparable assumptions, arguably none of which is as strong as that. The second significant difference is that the main result on trust of labels there (Theorem 4 of [5]) states that if a certain inequality holds then the predicted label agrees with that of the Bayes optimal predictor, and if another inequality holds, there is disagreement between them. However, those inequalities are not complementary. It may very well be that in many cases every domain point (or a high-probability set of instances) fails both inequalities. For such points, that main result tells nothing at all. That paper does not offer any discussion of the conditions under which their main result is not vacuous in that respect. Additionally, their result holds with high probability over the domain and is not a point-wise guarantee.

Selective classification / classification with abstention: One line of work that is related to our paper is learning with abstention. Similar to our setting, the classification problem does not only consist of the goal of classifying correctly, but also allows the classifier to abstain from making a prediction if the confidence of a prediction is too low. Many works in this line provide accuracy guarantees that hold with high probability over the domain ([1], [11], [3], [4], [6]). This is different from our goal of point-wise guarantees.

Point-wise guarantees are provided in [2] and [10]. El-Yaniv et al. [2] gave a theoretical analysis of the selective classification setup, in which a classification function and a selective function are learned simultaneously. The risk of a classification is then only assessed on the set of instances that was selected for classification. The selective function is evaluated by its coverage - how many instances, in expectation, are selected for classification. They analyse the trade-off between risk and coverage, and introduce the notion of "perfect classification", which requires risk 0 with certainty. This is similar to our requirements on a confidence score in the deterministic setting, where we require 0 risk with high probability. Their notion of coverage is similar to our notion of non-triviality - in fact, non-triviality corresponds to worst-case coverage over a family of distributions. They provide an optimal perfect learning strategy in the realizable setting and show that there are hypothesis classes with a coverage that converges to 1 and hypothesis classes for which coverage is always 0 for some distributions. We use their results in our Section 4. In contrast to their paper, our setup also considers probabilistic labeling functions, and our analysis also considers other assumptions on the family of probability distributions besides generalization guarantees for some fixed hypothesis class.

3 Problem definition

Let the domain of instances be X and the label set be {0, 1}. A learning task is determined by a probability distribution P over X × {0, 1}. We denote the marginal distribution over the domain by P_X and the conditional labeling rule by ℓ_P (namely, for x ∈ X, ℓ_P(x) = Pr_{(x′,y′)∼P}[y′ = 1 | x′ = x]).

The Bayes classifier,

    h^B_P(x) = 1  iff  Pr_{(x′,y′)∼P}[y′ = 1 | x′ = x] ≥ 0.5,

is the pointwise minimizer of the zero-one prediction loss. We sometimes refer to its prediction h^B_P(x) as the majority label or the Bayes label of a point.

We are interested in point-wise confidence. For a point x ∈ X, the confidence of a label y ∈ {0, 1} is

    C_P(x, y) = Pr_{(x′,y′)∼P}[y′ = y | x′ = x].

Note that the label assigned by the Bayes predictor maximizes this confidence for every domain point x.
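To make these definitions concrete, the following minimal sketch (our own illustration; the toy distribution and helper names are not from the paper) computes the labeling rule ℓ_P, the confidences C_P(x, y), and the Bayes label for a small finite task:

    # Our own toy example: a finite distribution P over X x {0,1},
    # its labeling rule l_P, the confidences C_P(x, y), and the Bayes label.

    # P as a dict: (x, y) -> probability mass
    P = {
        ("a", 1): 0.30, ("a", 0): 0.10,   # l_P("a") = 0.75
        ("b", 1): 0.05, ("b", 0): 0.25,   # l_P("b") ~ 0.167
        ("c", 1): 0.15, ("c", 0): 0.15,   # l_P("c") = 0.5
    }

    def marginal(x):
        """P_X(x): marginal probability of the instance x."""
        return P.get((x, 0), 0.0) + P.get((x, 1), 0.0)

    def labeling_rule(x):
        """l_P(x) = Pr[y' = 1 | x' = x]."""
        return P.get((x, 1), 0.0) / marginal(x)

    def confidence(x, y):
        """C_P(x, y) = Pr[y' = y | x' = x]."""
        l = labeling_rule(x)
        return l if y == 1 else 1.0 - l

    def bayes_label(x):
        """h^B_P(x) = 1 iff l_P(x) >= 0.5; maximizes C_P(x, .)."""
        return 1 if labeling_rule(x) >= 0.5 else 0

    for x in ("a", "b", "c"):
        y = bayes_label(x)
        print(x, y, round(confidence(x, y), 3))  # confidence = max(l, 1 - l)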
A confidence score is an empirical estimate (based on some training sample S) of the true label confidence. Inevitably, the reliability of a confidence score depends on some assumptions on the data generating distribution (or, in other words, on prior knowledge about the task at hand). Given a family of data generating distributions P (fulfilling some niceness assumption that reflects the learner's prior knowledge or beliefs about the task), a training sample S, and a parameter δ, the empirical confidence estimate for a point x and label y is a function C(x, y, S, δ). We want the following property to hold: for every probability distribution P ∈ P, with probability at least 1 − δ over an i.i.d. generation of S by P, we have

    Pr_{y′∼Bernoulli(ℓ_P(x))}[y′ = y] ≥ C(x, y, S, δ).
That is, with high probability, we do not overestimate the probability of y being the correct label for x. Ideally, this should hold for every point x in the domain. Of course, there is a trivial solution: just let C(·, ·, ·, ·) be the constant 0 function. The goal therefore is to get a confidence score that fulfils the condition above while still being as high as possible for 'many' x's. That is, we aim for a confidence score such that E_{x∼P, S∼P^m}[max{C(x, 1, S, δ), C(x, 0, S, δ)}] is high. As mentioned above, given a data generating distribution P and a data representation available to the learner, the highest confidence on every instance x ∈ X is achieved by the Bayes predictor h^B_P(x), and it is easy to see that it equals max{ℓ_P(x), 1 − ℓ_P(x)}.

In contrast with the common notion of a PAC-style error bound, confidence scores may vary over individual instances, capturing the heterogeneity of the domain and the specific training sample the label prediction relies on. To demonstrate this point, consider the following example:

Example 1. Let X_1 be the 0.1 grid over [0, 1]^d, let X_0 be the 0.01 grid over [0, 0.1]^d, and let our domain be X = X_0 ∪ X_1 (for some large d). Consider the family P of all probability distributions over X that have a deterministic labeling rule satisfying the 10-Lipschitz condition (so points at distance 0.1 or more have no effect on each other). Assume further that all the distributions in P have half of their mass uniformly distributed over the X_1 grid points and the other half of their mass uniformly distributed over X_0.

Since outside the [0, 0.1]^d cube every labeling is possible, for every learner there is a distribution P ∈ P w.r.t. which it errs on every domain point in X_1 \ S_X (where S_X = {x : ∃y ∈ {0, 1} s.t. (x, y) ∈ S}). On the other hand, due to the Lipschitz condition, all the points in the [0, 0.1]^d grid, X_0, must get the same label. Therefore, given a sample S that includes a point in X_0, a learner that labels all the points in X_0 by the label of the sample points in it induces confidence 1 for all these points.

We conclude that, for sample sizes between 2 and, say, 10^d/2, for most samples a learner can achieve confidence 1 for points in X_0, while no learner can achieve confidence above 0 for even half of the domain points in X_1. Note also that the No Free Lunch theorem (as formulated in, e.g., [8]) implies that for sample sizes smaller than 10^d/2, for every learner there exists some probability distribution P ∈ P for which its expected error is at least 1/8 (1/4 over the subspace X_1, which has probability weight 1/2).

4 Confidence scores for hypothesis classes

In the following we analyze the point-wise confidence when all the prior knowledge available about the data generating distribution P is a bound on the approximation error of a class of predictors. We distinguish two cases here:

1. the family P_{H,0} of distributions P which are realizable w.r.t. H, i.e. inf_{h∈H} L_P(h) = 0, and
2. the family P_{H,ε} of distributions P for which the approximation error of the class H is low but not guaranteed to be zero, i.e. inf_{h∈H} L_P(h) ≤ ε, for some ε > 0.

Note that, given a class of predictors H, the second family of possible data generating distributions is a superset of the first. Consequently, the pointwise error guarantees one can give in the not-necessarily-realizable case are weaker.¹

¹ To avoid dealing with the ambiguity of the labeling function on points of mass 0, we restrict this discussion to the family of distributions which have positive mass on all points. This implies that in the realizable setting all labeling functions ℓ_P we consider are members of H.

Definition 1 (Confidence score fulfilling the no-overestimation guarantee for all instances). Consider a function C that takes as input a sample S, a point x, a hypothesis h and a parameter δ, and outputs a value in [0, 1]. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions P if for every P ∈ P the probability over S ∼ P^m that there exists x ∈ X with

    Pr_{y∼Bernoulli(ℓ_P(x))}[h(x) = y] < C(x, y, S, δ)

is less than δ.

We say a function C(x, y, S, δ) is a confidence score fulfilling the no-overestimation guarantee for positive-mass instances if the above guarantee holds not for all x, but for all x with P({x}) > 0.

Obviously, this guarantee is achieved trivially if we give the confidence score 0 to all predictions. In order to measure the informativeness of a confidence score for a particular distribution, we introduce the notion of non-triviality of a confidence score.

Definition 2 (Non-triviality). Given a confidence score C for some family of distributions P, we define the non-triviality nt_P(C, m, δ) of C w.r.t. a distribution P ∈ P, for a given sample size m and parameter δ, to be the expected difference between the estimated and the true confidence, i.e.

    nt_P(C, m, δ) = E_{x∼P, S∼P^m}[1 − min_{y∈{0,1}} |C_P(x, y) − C(x, y, S, δ)|].

Next we consider a specific confidence score that takes into account whether a hypothesis class is undecided on a point x given a sample S:

    C_H(x, y, S, δ) = 0  if there is h ∈ H s.t. L_S(h) = 0 and h(x) ≠ y,
                      1  otherwise.
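For intuition, here is a brute-force sketch of C_H (our own code, not from the paper; enumerating a finite H and the threshold-grid example are simplifications for illustration, since the definition itself puts no finiteness restriction on H):

    # A brute-force sketch of C_H for a *finite* hypothesis class.

    def empirical_loss(h, S):
        """L_S(h): fraction of sample points that h mislabels."""
        return sum(1 for x, y in S if h(x) != y) / len(S)

    def C_H(x, y, S, H):
        """The score from the text: 0 if some sample-consistent h in H
        disagrees with label y on x, else 1. (The parameter delta plays
        no role for this particular score.)"""
        for h in H:
            if empirical_loss(h, S) == 0 and h(x) != y:
                return 0
        return 1

    # Example: thresholds h_a(x) = 1 iff x > a, on a coarse grid of a's.
    H = [(lambda a: (lambda x: 1 if x > a else 0))(a)
         for a in [i / 10 for i in range(11)]]
    S = [(0.2, 0), (0.3, 0), (0.8, 1)]

    print(C_H(0.1, 0, S, H))  # 1: every consistent threshold labels 0.1 with 0
    print(C_H(0.5, 1, S, H))  # 0: consistent thresholds disagree inside (0.3, 0.8)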
Proposition 1. For the family of distributions P that are realizable with respect to H, C_H is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.

This statement was made in a different setup by El-Yaniv et al. [2]. We note that our confidence score C_H is equivalent to their notion of a consistent selective strategy. Using our terminology, they show that if the realizability assumption holds and an instance (x, y) is assigned score 1 by C_H, then x is guaranteed to have true label y (with probability 1). Furthermore, their Theorems 11 and 14, as well as their Corollary 28, give rise to the following observation about confidence scores.

Observation 1. It turns out that nt_P(C, m, δ) for the confidence scoring rule C_H under the realizability assumption displays different behaviours for different classes (even when they have similar VC-dimension):

• For some hypothesis classes, e.g. the class of thresholds on the real line H_thres or the class of linear separators in R^d, nt_P(C, m, δ) converges to 1 for every δ > 0 and every P ∈ P as the sample size goes to infinity.

• On the other hand, for some hypothesis classes with finite VC-dimension, for every ε > 0 there exists P ∈ P_{H,0} with nt_P(C, m, δ) = 0 for every sample size m and every δ < 1. This phenomenon occurs, for example, for H being the class of singletons.

For a more detailed analysis of which hypothesis classes have high non-triviality, we refer the reader to [2], noting that our notion of non-triviality corresponds to their notion of coverage.

We now look at the second case we wanted to address in this section: the family of probability distributions such that the approximation error of a class H is bounded by some ε. We fix a hypothesis class H and let P_{H,ε} be the family of all probability distributions P such that H has approximation error at most ε w.r.t. P. We show that for any (non-trivial) hypothesis class H, it is not possible to find any satisfying confidence score for such a family.

Proposition 2. Let H be any hypothesis class over an infinite domain. Then there is no confidence score C such that the following two statements are true simultaneously:

• C fulfills the no-overestimation guarantee for all positive-mass instances w.r.t. P_{H,ε};
• there exists some η > 0 such that for every P ∈ P_{H,ε} there are δ ∈ (0, 1), m ∈ N such that C has non-triviality nt_P(C, m, δ) > η.

This shows us that restricting ourselves to a family of probability distributions that allows for good generalization is not sufficient for obtaining satisfying confidence scores. In the following section we make stronger, more local assumptions and show that under these assumptions more satisfying confidence scores can be found.

5 Confidence scores under a Lipschitz assumption

Lipschitz assumption: We say that the probability distribution P over X × {0, 1} satisfies λ-Lipschitzness w.r.t. a metric d(·, ·) over X if for every x, x′ ∈ X, |ℓ_P(x) − ℓ_P(x′)| ≤ λ d(x, x′). When the domain is a subset of a Euclidean space, we assume that d is the Euclidean distance unless we specify otherwise.

We provide an algorithm (Algorithm 1) that estimates the labelling probability of points using labelled samples. The algorithm partitions the space into cells and outputs the same answer for all points in the same cell; the input parameter r dictates the size of the cells. The algorithm estimates the average labelling probability of each cell. A confidence interval for this estimate is calculated based on the number of sample points in the cell; the interval is narrower when there are more sample points in the cell.

The following lemmas show how to estimate probability weights and average labelling probabilities of subsets of the domain.

Lemma 1. Let P be a distribution over a domain X and let X′ be a subset of X. Let S be an i.i.d. sample of size m drawn from P, and let p̂(X′, S) be the fraction of the m samples that lie in X′. For any δ > 0, with probability 1 − δ over the generation of the sample S,

    |P(X′) − p̂(X′, S)| ≤ w_p(m, δ),

where

    w_p(m, δ) = √((1/(2m)) ln(2/δ)).

Lemma 2. Let D be a distribution over X × {0, 1} and let X′ be a subset of X. Let S be an i.i.d. sample of size m drawn from D, and let ℓ̂(X′, S) be the fraction of the samples in S ∩ X′ that have label 1. For any δ > 0, with probability 1 − δ over the generation of the sample S, if p̂(X′, S) − w_p(m, δ/2) > 0, then

    |ℓ̄_P(X′) − ℓ̂(X′, S)| < w_ℓ(m, δ, p̂(X′, S)),

where

    w_ℓ(m, δ, p̂(X′, S)) = (1/(p̂(X′, S) − w_p(m, δ/2))) · (w_p(m, δ/2) + √((1/(2m)) ln(4/δ))),

p̂(X′, S) is the fraction of the samples from S that lie in X′, and w_p(m, δ/2) is as defined in Lemma 1.
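Both widths are elementary to compute. The following sketch (our own rendering of the formulas in Lemmas 1 and 2; the helper names mirror the notation) implements the estimates and their widths:

    import math
    import random

    def w_p(m, delta):
        """Lemma 1 width: |P(X') - p_hat| <= sqrt(ln(2/delta)/(2m)) w.p. 1 - delta."""
        return math.sqrt(math.log(2 / delta) / (2 * m))

    def w_l(m, delta, p_hat):
        """Lemma 2 width for the average labeling probability of X',
        valid when p_hat - w_p(m, delta/2) > 0."""
        denom = p_hat - w_p(m, delta / 2)
        assert denom > 0, "region too light for a meaningful estimate"
        return (w_p(m, delta / 2) + math.sqrt(math.log(4 / delta) / (2 * m))) / denom

    def estimates(S, in_region):
        """p_hat and l_hat for a region X' (in_region: x -> bool),
        from a labeled sample S = [(x, y), ...]."""
        m = len(S)
        hits = [(x, y) for x, y in S if in_region(x)]
        p_hat = len(hits) / m
        l_hat = sum(y for _, y in hits) / len(hits) if hits else 0.0
        return p_hat, l_hat

    # Illustrative numbers only.
    random.seed(0)
    S = [(random.random(), random.randint(0, 1)) for _ in range(10_000)]
    p_hat, l_hat = estimates(S, lambda x: x < 0.3)
    print(round(p_hat, 3), round(l_hat, 3))      # ~0.3, ~0.5
    print(round(w_p(len(S), 0.05), 4))           # ~0.0136
    print(round(w_l(len(S), 0.05, p_hat), 4))    # ~0.104 for this p_hat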
Algorithm 1 Lipschitz labelling probability estimate
  Input: test point x, labelled sample S = (x_i, y_i)_{i=1}^m, radius r, estimation parameter δ, Lipschitz constant λ
  Output: labelling probability estimate and confidence width of the estimate
  Split the domain X = [0, 1]^d into a grid of (1/r)^d hypercube cells, each of side length r.
  Find the cell t_x containing the test point x.
  p̂[t_x] := fraction of samples in t_x.
  ℓ̂[t_x] := fraction of samples in the cell t_x with label 1.
  w[t_x] := 1
  if p̂[t_x] − w_p(m, δ/2) > 0 then
      w[t_x] := w_ℓ(m, δ, p̂[t_x])
  end if
  Return ℓ̂[t_x], min(1, w[t_x] + rλ√2)

The following theorem shows that the true labelling probability of a point lies within the estimated interval returned by Algorithm 1, with high probability over the sample used to compute the estimate.

Theorem 1. Let the domain be [0, 1]^d. Suppose the data generating distribution P satisfies λ-Lipschitzness. For any r > 0, δ > 0, m ∈ N and any x ∈ [0, 1]^d, define the confidence score based on Algorithm 1 as

    Ĉ^{r,λ}_Lipschitz(x, y; S, δ) = ℓ̂_S(x) − w_S(x)        if y = 1,
                                    1 − ℓ̂_S(x) − w_S(x)    if y = 0,

where (ℓ̂_S(x), w_S(x)) is the output of the algorithm with input r, δ, λ. Then with probability 1 − δ over samples S of size m,

    Ĉ^{r,λ}_Lipschitz(x, y; S, δ) ≤ C_P(x, y).
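A minimal Python rendering of Algorithm 1 and of the score of Theorem 1 (our own sketch; the grid bookkeeping and helper names are ours, and w_p, w_l are the widths of Lemmas 1 and 2):

    import math

    def w_p(m, delta):
        # Lemma 1 width
        return math.sqrt(math.log(2 / delta) / (2 * m))

    def w_l(m, delta, p_hat):
        # Lemma 2 width (requires p_hat - w_p(m, delta/2) > 0)
        denom = p_hat - w_p(m, delta / 2)
        return (w_p(m, delta / 2) + math.sqrt(math.log(4 / delta) / (2 * m))) / denom

    def lipschitz_estimate(x, S, r, delta, lam):
        """Algorithm 1 (a sketch): labeling-probability estimate and
        confidence width for a test point x in [0,1]^d, from a labeled
        sample S = [(x_i, y_i)] with vector-valued x_i."""
        m = len(S)
        cells_per_axis = int(round(1 / r))

        def cell_of(pt):
            # index of the axis-aligned cube of side r containing pt
            return tuple(min(int(c / r), cells_per_axis - 1) for c in pt)

        t_x = cell_of(x)
        in_cell = [y for xi, y in S if cell_of(xi) == t_x]
        p_hat = len(in_cell) / m                 # fraction of sample in t_x
        l_hat = sum(in_cell) / len(in_cell) if in_cell else 0.0
        w = 1.0                                  # trivial width by default
        if p_hat - w_p(m, delta / 2) > 0:
            w = w_l(m, delta, p_hat)
        return l_hat, min(1.0, w + r * lam * math.sqrt(2))

    def C_lipschitz(x, y, S, r, delta, lam):
        """The score of Theorem 1: an empirical confidence that, with
        high probability, does not overestimate C_P(x, y)."""
        l_hat, w = lipschitz_estimate(x, S, r, delta, lam)
        return l_hat - w if y == 1 else 1.0 - l_hat - w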
We now show that, as the sample size increases, for an appropriately chosen input parameter r, Algorithm 1 returns narrow estimate intervals for the labelling probabilities of most points. This implies that for most points the confidence score is not much lower than the true confidence (2|ℓ_P(x) − 1/2|).

Theorem 2. For every λ-Lipschitz distribution and every ε_x, ε_c, δ > 0, there is a sample size m(ε_x, ε_c, δ) such that with probability 1 − δ over samples S of size m(ε_x, ε_c, δ),

    Pr_{x∼P}[w_S(x) > ε_c] < ε_x,

where w_S is the width of the labelling probability estimate obtained from Algorithm 1 with input grid-size parameter r = 1/m^{1/(8d)}.

Note that this theorem implies that the expected non-triviality is high.

6 Proofs

Useful lemmas

The following lemma appears as Theorem 2.2.6 of the book [9], where the reader can find its proof.

Lemma 3 (Hoeffding's inequality for general bounded random variables). Let X_1, ..., X_N be independent random variables. Assume that X_i ∈ [m_i, M_i] for every i. Then, for any t > 0, we have

    Pr[Σ_{i=1}^N (X_i − E[X_i]) ≥ t] ≤ exp(−2t² / Σ_{i=1}^N (M_i − m_i)²).

Proof of Lemma 1. Let X_i be a random variable indicating whether the i-th sample belongs to the set X′: X_i = 1 if the i-th sample belongs to X′ and X_i = 0 otherwise. For each i, E[X_i] = P(X′), and p̂(X′, S) = (Σ_{i=1}^m X_i)/m. Applying Hoeffding's inequality, we get the inequality of the lemma.

Proof of Lemma 2. Let X_i be a random variable such that

    X_i = 1 if the i-th sample is in X′ and has label one, and X_i = 0 otherwise.

For each i, E[X_i] = P(X′) ℓ̄_P(X′), and Σ_{i=1}^m X_i = m p̂ ℓ̂_P(X′, S). Note that by the triangle inequality,

    |P(X′) ℓ̂_P(X′, S) − P(X′) ℓ̄_P(X′)|
      ≤ |p̂ ℓ̂_P(X′, S) − P(X′) ℓ̄_P(X′)| + |p̂ − P(X′)| · ℓ̂_P(X′, S)
      ≤ |p̂ ℓ̂_P(X′, S) − P(X′) ℓ̄_P(X′)| + w_p.

For any ε > 0,

    Pr[|ℓ̄_P(X′) − ℓ̂(X′, S)| > ε]
      = Pr[P(X′) · |ℓ̄_P(X′) − ℓ̂(X′, S)| > P(X′) ε]
      ≤ Pr[|p̂ ℓ̂_P(X′, S) − P(X′) ℓ̄_P(X′)| + w_p > (p̂ − w_p) ε]
      = Pr[|m p̂ ℓ̂(X′, S) − m P(X′) ℓ̄_P(X′)| > m((p̂ − w_p) ε − w_p)]
      = Pr[|Σ_{i=1}^m (X_i − E[X_i])| > m((p̂ − w_p) ε − w_p)]
      ≤ 2 exp(−2m((p̂ − w_p) ε − w_p)²).

When p̂ − w_p > 0, choosing

    w_ℓ(m, δ, p̂) > w_p/(p̂ − w_p) + (1/(p̂ − w_p)) √((1/(2m)) ln(4/δ)),

we get that with probability 1 − δ, |ℓ̄_P(X′) − ℓ̂(X′, S)| < w_ℓ(m, δ, p̂).

Confidence scores for hypothesis classes

We start by recalling the definition of confidence scores fulfilling the no-overestimation guarantee for all instances:

Definition 1 (Confidence score fulfilling the no-overestimation guarantee for all instances). Consider a function C that takes as input a sample S, a point x, a hypothesis h and a parameter δ, and outputs a value in [0, 1]. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions P if for every P ∈ P the probability over S ∼ P^m that there exists x ∈ X with

    Pr_{y∼Bernoulli(ℓ_P(x))}[h(x) = y] < C(x, y, S, δ)

is less than δ.
6.1 Proof of Proposition 1

Proposition 1. For the family of distributions P that are realizable with respect to H, C_H is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.

We need to show that C_H fulfills Definition 1, that is, that for every hypothesis class H and every distribution P that fulfills the realizability condition w.r.t. H, the probability over S ∼ P^m that there exists x ∈ X with

    Pr_{y∼Bernoulli(ℓ_P(x))}[h(x) = y] < C_H(x, y, S, δ)

is less than δ. Since C_H only assigns the values 0 and 1, and the condition is trivially fulfilled for instances with confidence score 0, we only need to discuss the case where C_H assigns confidence score 1. Recall that C_H assigns confidence 1 if and only if there is no h ∈ H with L_S(h) = 0 and h(x) ≠ y. Since realizability holds, we know that ℓ_P ∈ H, and since S is an i.i.d. sample from P, we know L_S(ℓ_P) = 0. Now let C_H(x, y, S, δ) = 1; then by definition, L_S(h) = 0 implies h(x) = y. Thus ℓ_P(x) = y. Hence this confidence score does not overestimate the confidence of a point in any label.

6.2 Proof of Observation 1

We start by noting that our definition of the confidence score C_H is equivalent to the consistent selective strategy from [2]. In order to state their definition, we first need to introduce some other concepts, starting with the definition of a version space from [7].

Definition 3 (Version space). Given a hypothesis class H and a labeled sample S, the version space H_S is the set of all hypotheses in H that classify S correctly.

Now we can define the agreement set and the maximal agreement set as in [2].

Definition 4 (Agreement set). Let G ⊆ H. A subset X′ ⊆ X is an agreement set with respect to G if all hypotheses in G agree on every instance in X′, namely, for all g_1, g_2 ∈ G and all x ∈ X′,

    g_1(x) = g_2(x).

Definition 5 (Maximal agreement set). Let G ⊆ H. The maximal agreement set with respect to G is the union of all agreement sets with respect to G.

We can now state the definition of a consistent selective strategy. Note that a selective strategy is defined by a pair (h, g) of a classification function h and a selection function g. For our purposes, we only need to look at the selection function g.

Definition 6 (Consistent selective strategy (CSS)). Given a labeled sample S, a consistent selective strategy (CSS) is a selective classification strategy that takes h to be any hypothesis in H_S (i.e., a consistent learner), and takes a (deterministic) selection function g that equals one for all points in the maximal agreement set with respect to H_S and zero otherwise.

We now see that, for any H and any labeled sample S, the selection function g assigns one to x if and only if every two h_1, h_2 ∈ H with L_S(h_1) = L_S(h_2) = 0 satisfy h_1(x) = h_2(x). Thus g(x) = C_H(x, h(x), S, δ) for any h ∈ H_S. In [2] the selection function is then analysed with respect to its coverage, which is defined by φ(h, g) = E_{x∼P}[g(x)] for a given distribution P. Note that our notion of non-triviality corresponds to this notion of coverage for deterministic distributions and binary confidence scores. We can now use some of their results to show our observation.

Theorem 3 (Non-achievable coverage, Theorem 14 from [2]). Let m and d > 2 be given. There exist a distribution P, an infinite hypothesis class H with finite VC-dimension d, and a target hypothesis in H, such that φ(h, g) = 0 for any selective classifier (h, g) chosen from H by CSS using a training sample S of size m drawn i.i.d. according to P.

This directly implies the second part of our observation. For a more concrete example, consider the class of singletons over the real line, H_sgl = {h_z : R → {0, 1} : h_z(x) = 1 if and only if z = x}. We note that C_{H_sgl} only gives positive confidence scores to instances outside the sample if the sample contains a positively labeled instance. Let P be the uniform distribution over [0, 1] × {0}. Obviously P ∈ P_{H_sgl,0}. Furthermore, any sample generated by P will not contain any positively labeled instance. Thus nt_P(C_{H_sgl}, m, δ) = Pr_{x∼P_X, S∼P^m}[x ∈ S_X] = 0, where P_X and S_X denote the projections of P and S to the domain X = R.

Let us now consider the class of thresholds, H_thres = {h_a : R → {0, 1} : h_a(x) = 1 if and only if x > a}. We can define the following two learning rules for thresholds:

    A_1(S) = h_{a_1}, where a_1 = argmax_{x_i ∈ R : (x_i, 0) ∈ S} x_i,
    A_2(S) = h̄_{a_2}, where h̄_{a_2}(x) = 1 if and only if x ≥ a_2 := argmin_{x_i ∈ R : (x_i, 1) ∈ S} x_i.

The way C_{H_thres} is defined, for any S and δ ∈ (0, 1) we have A_1(S)(x) = A_2(S)(x) if and only if there is y ∈ {0, 1} with C_{H_thres}(x, y, S, δ) = 1. Furthermore, both of these learning rules are empirical risk minimizers for H_thres in the realizable case, and thus both of them are PAC learners. Hence, for any ε, δ there is an m(ε, δ) such that for any m ≥ m(ε, δ) and any P ∈ P_{H_thres,0},

    1 − δ ≤ Pr_{S∼P^m, x∼P}[A_1(S)(x) = A_2(S)(x)]
          = Pr_{S∼P^m, x∼P}[max_{y∈{0,1}} C_{H_thres}(x, y, S, δ) = 1]
          = nt_P(C_{H_thres}, m, δ).

Thus lim_{m→∞} nt_P(C_{H_thres}, m, δ) = 1 for any δ ∈ (0, 1) and any P ∈ P_{H_thres,0}. Furthermore, note that we have uniform convergence.
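The agreement test behind this argument is easy to make concrete; the following sketch (our own code, not from the paper) implements the two ERM rules A_1, A_2 and the induced binary confidence for thresholds:

    # The two consistent learners for thresholds; their agreement region is
    # exactly the set of points receiving confidence 1 under C_{H_thres}.

    def A1(S):
        """Threshold just above the largest negatively labeled point."""
        a1 = max((x for x, y in S if y == 0), default=float("-inf"))
        return lambda x: 1 if x > a1 else 0

    def A2(S):
        """Threshold at the smallest positively labeled point."""
        a2 = min((x for x, y in S if y == 1), default=float("inf"))
        return lambda x: 1 if x >= a2 else 0

    def C_thres(x, S):
        """1 iff all consistent thresholds agree on x, i.e. A1(S)(x) == A2(S)(x)."""
        return 1 if A1(S)(x) == A2(S)(x) else 0

    S = [(0.1, 0), (0.4, 0), (0.7, 1)]
    print(C_thres(0.2, S))  # 1: both rules label 0.2 with 0
    print(C_thres(0.5, S))  # 0: the rules disagree inside the margin (0.4, 0.7)
    print(C_thres(0.9, S))  # 1: both rules label 0.9 with 1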
6.3 Proof of Proposition 2

Fix some hypothesis h_1 ∈ H. For every ε > 0, every x ∈ X and every n ∈ N with m > 1/ε, we can construct a distribution P_{x,n} such that ℓ_{P_{x,n}}(x′) = h_1(x′) for every x′ ∈ X \ {x}, while h_1(x) ≠ ℓ_{P_{x,n}}(x), and such that the marginal P_{x,n,X} is uniform over some X_n ⊆ X with |X_n| = n. For a sample to distinguish between two such distributions P_{x_1,n} and P_{x_2,n}, either the point x_1 or the point x_2 needs to be visited by the sample. Thus, in order to give a point-wise guarantee for all instances with positive mass, only points in the sample can be assigned a positive confidence in this scenario. Hence any confidence score fulfilling this guarantee has nt_{P_{x,n}}(C, m, δ) = Pr_{x∼P_X, S∼P^m}[x ∈ S_X] ≤ m/n. For every η > 0 we can find n such that m/n ≤ η, proving our claim.

Confidence scores using the Lipschitz assumption

Proof of Theorem 1. The algorithm partitions the space into (1/r)^d cells. Let p_c be the probability weight of a cell c and let p̂_c be the estimate of p_c, calculated from the sample as the fraction of sample points in the cell c. From Lemma 1 and a union bound, we know that with probability 1 − δ/2, for every cell c,

    p_c ∈ [p̂_c − w_p(c), p̂_c + w_p(c)],

where w_p(c) = w_p(m, δ/(2(1/r)^d)) (as defined in Lemma 1).

The algorithm also estimates the average label ℓ_c of each cell c by ℓ̂_c, the fraction of the sample points in the cell that have label one. This is the same as the labelling probability estimate defined in Lemma 2. When the true probability weights of the cells lie within the calculated confidence intervals, by Lemma 2 we know that with probability 1 − δ/2, for every cell c,

    ℓ_c ∈ [ℓ̂_c − w_ℓ(c), ℓ̂_c + w_ℓ(c)],

where w_ℓ(c) = w_ℓ(m, δ/(2(1/r)^d), p̂_c) (as defined in Lemma 2).

The maximum distance between any two points in any cell is r√2. By λ-Lipschitzness, every point in a cell has labelling probability within λr√2 of the average labelling probability of the cell. Therefore, with probability 1 − δ, for each cell c and every point x in c, the labelling probability of x satisfies

    ℓ_P(x) ∈ [ℓ̂_c − w_ℓ(c) − λr√2, ℓ̂_c + w_ℓ(c) + λr√2].

This is the interval returned by the algorithm. Now we lower-bound the true confidence based on the confidence interval of the labelling probability. For a point x, let c(x) denote the cell containing the point. Then

    C_P(x, 1) = ℓ_P(x) ≥ ℓ̂_{c(x)} − w_ℓ(c(x)) − λr√2,
    C_P(x, 0) = 1 − ℓ_P(x) ≥ 1 − ℓ̂_{c(x)} − w_ℓ(c(x)) − λr√2.

Proof of Theorem 2. We choose the input to the algorithm to be r = 1/m^{1/(8d)}. With probability 1 − δ/2, for all cells with probability weight greater than γ = 1/m^{1/4}, the length of the confidence interval of the labelling probability is less than

    (1/m^{1/2}) / (1/m^{1/4} + 1/m^{1/2}) + (1/(1/m^{1/4} − 1/m^{1/2})) √((1/(2m)) ln(4m^{1/8}/δ)) + λ√2/m^{1/8}
      ≤ 1/(m^{1/4} − 1) + (1/(m^{1/4} − 1)) √((1/16) ln(4m/δ)) + λ√2/m^{1/8}.

This quantity decreases as m increases and converges to zero. Therefore, for every ε_c > 0, there is an M_1(ε_c, δ) such that this interval length is less than ε_c. When the sample size is larger than M_1(ε_c, δ), with probability 1 − δ/2, the size of the confidence intervals for the labelling probabilities of cells with weight greater than γ = 1/m^{1/4} is smaller than ε_c.

The points for which we cannot say anything about the interval lengths are points in cells with weight at most γ. The total weight of such points is at most γ (1/r)^d = 1/m^{1/8}. For any ε_x > 0, let M_2(ε_x) be such that 1/M_2(ε_x)^{1/8} < ε_x.

Choosing a sample size M greater than M_1(ε_c, δ) and M_2(ε_x), we get that

    Pr_{x∼P}[w_S(x) > ε_c] < ε_x.
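As a quick sanity check of Theorem 1 (our own experiment, not part of the paper), one can simulate a one-dimensional λ-Lipschitz task and verify empirically that the score essentially never exceeds the true confidence:

    import math, random

    # Monte Carlo sanity check of Theorem 1 in d = 1: l_P below is
    # 1-Lipschitz, so with high probability the score should never
    # exceed the true confidence C_P(x, y).

    lam = 1.0

    def l_P(x):
        return 0.5 + 0.4 * math.sin(x)     # a 1-Lipschitz labeling rule

    def w_p(m, delta):
        return math.sqrt(math.log(2 / delta) / (2 * m))

    def w_l(m, delta, p_hat):
        denom = p_hat - w_p(m, delta / 2)
        return (w_p(m, delta / 2) + math.sqrt(math.log(4 / delta) / (2 * m))) / denom

    def score(x, y, S, r, delta):
        m, k = len(S), int(round(1 / r))
        cell = min(int(x / r), k - 1)
        hits = [yi for xi, yi in S if min(int(xi / r), k - 1) == cell]
        p_hat = len(hits) / m
        l_hat = sum(hits) / len(hits) if hits else 0.0
        w = w_l(m, delta, p_hat) if p_hat - w_p(m, delta / 2) > 0 else 1.0
        w = min(1.0, w + r * lam * math.sqrt(2))
        return l_hat - w if y == 1 else 1.0 - l_hat - w

    random.seed(0)
    m, r, delta = 20_000, 0.1, 0.05
    S = [(x, 1 if random.random() < l_P(x) else 0)
         for x in (random.random() for _ in range(m))]

    violations = 0
    for _ in range(1000):
        x, y = random.random(), random.choice([0, 1])
        true_conf = l_P(x) if y == 1 else 1 - l_P(x)
        if score(x, y, S, r, delta) > true_conf:
            violations += 1
    print("overestimates out of 1000 test points:", violations)  # expect 0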
References

[1] Bartlett, P. L.; and Wegkamp, M. H. 2008. Classification with a Reject Option using a Hinge Loss. Journal of Machine Learning Research 9(59): 1823–1840. URL http://jmlr.org/papers/v9/bartlett08a.html.

[2] El-Yaniv, R.; and Wiener, Y. 2010. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research 11: 1605–1641. URL http://portal.acm.org/citation.cfm?id=1859904.

[3] Freund, Y.; Mansour, Y.; and Schapire, R. E. 2004. Generalization bounds for averaged classifiers. The Annals of Statistics 32(4): 1698–1722.

[4] Herbei, R.; and Wegkamp, M. H. 2006. Classification with reject option. The Canadian Journal of Statistics / La Revue Canadienne de Statistique: 709–721.

[5] Jiang, H.; Kim, B.; Guan, M.; and Gupta, M. 2018. To trust or not to trust a classifier. In Advances in Neural Information Processing Systems, 5541–5552.

[6] Kalai, A. T.; Kanade, V.; and Mansour, Y. 2012. Reliable agnostic learning. Journal of Computer and System Sciences 78(5): 1481–1495.

[7] Mitchell, T. M. 1977. Version Spaces: A Candidate Elimination Approach to Rule Learning. In Reddy, R., ed., Proceedings of the 5th International Joint Conference on Artificial Intelligence, Cambridge, MA, USA, August 22–25, 1977, 305–310. William Kaufmann. URL http://ijcai.org/Proceedings/77-1/Papers/048.pdf.

[8] Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. ISBN 978-1-10-705713-5. URL http://www.cambridge.org/de/academic/subjects/computer-science/pattern-recognition-and-machine-learning/understanding-machine-learning-theory-algorithms.

[9] Vershynin, R. 2019. High-Dimensional Probability. Cambridge University Press.

[10] Wiener, Y.; and El-Yaniv, R. 2015. Agnostic pointwise-competitive selective classification. Journal of Artificial Intelligence Research 52: 171–201.

[11] Yuan, M.; and Wegkamp, M. H. 2010. Classification Methods with Reject Option Based on Convex Risk Minimization. Journal of Machine Learning Research 11: 111–130. URL https://dl.acm.org/citation.cfm?id=1756011.