=Paper=
{{Paper
|id=Vol-2808/Paper_33
|storemode=property
|title=Classification Confidence Scores with Point-wise Guarantees
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_33.pdf
|volume=Vol-2808
|authors=Nivasini Ananthakrishnan,Shai Ben-David,Tosca Lechner
|dblpUrl=https://dblp.org/rec/conf/aaai/Ananthakrishnan21
}}
==Classification Confidence Scores with Point-wise Guarantees==
Pointwise Confidence Scores with Provable Guarantees

Nivasini Ananthakrishnan, Shai Ben-David, Tosca Lechner
University of Waterloo
nanantha@uwaterloo.ca, bendavid.shai@gmail.com, tlechner@uwaterloo.ca

'''Abstract.''' Quantifying the probability that a machine learning prediction is correct on a given test point enables users to better decide how to use those predictions. Confidence scores are pointwise estimates of such probabilities. Ideally, these would be the label probabilities (a.k.a. the labeling rule) of the data generating distribution. However, in learning scenarios the learner does not have access to the labeling rule. The learner aims to figure out a best approximation of that rule based on training data and some prior knowledge about the task at hand, both for predicting a label and for possibly providing a confidence score. We formulate two goals for confidence scores, motivated by the use of these scores in safety-critical applications. First, they must not under-estimate the probability of error for any test point. Second, they must not be trivially low for most points. We consider a few common types of learner's prior knowledge and provide tools for obtaining pointwise confidence scores based on such prior knowledge and on the relation between the test point and the training data. We prove that, under the corresponding prior knowledge assumptions, our proposed tools meet these desired goals.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Introduction===

The reliability of machine-learnt programs is of course a major concern and has been the focus of much research. Theory offers quite a selection of tools for evaluating reliability, from generalization bounds to experimental results on test sets. However, most of those guarantees are statistical, in the sense that they only hold with high probability (over the generation of the training data and of the test points), and they provide no information about the correctness of the prediction on any specific instance. In cases where an error on a specific instance may incur a very high cost, as in safety-critical applications, the common statistical guarantees do not suffice. We would also wish to be able to identify predictions with low confidence, so that one could apply some safety procedure (such as a review by a human expert). Ideally, no low-confidence prediction should go undetected. At the same time, since expert intervention can be expensive, one also wishes to minimize the occurrence of false positives among the predictions flagged as low confidence.

Can one do better than the overall statistical estimates when it comes to evaluating reliability on a given test case? Arguably, the most common reason for a statistically reliable machine learning program to fail on a test point is that the point is an "outlier", in the sense of not being well represented by the sample the program was trained on. This research aims to quantify this "outlierness". We propose theoretically founded confidence bounds that take into account the relation between the training sample and the specific test point in question (on top of the commonly used parameters: the learning algorithm, the size of the training sample, and assumptions about the processes generating both the training data and the test instance).

Clearly, the confidence of any prediction of an unknown label (or any piece of information) hinges upon some prior knowledge or assumptions. In this work we consider a few forms of prior knowledge that are commonly employed in machine learning theory, and we develop and analyse confidence scores for predictions on individual test points under such assumptions. We consider the following types of prior knowledge:

'''Known hypothesis class with low approximation error.''' We discuss two cases - the realizable setting (i.e., when that approximation error is zero) and the agnostic setup (both discussed in Section 4).

* In the realizable case, we show that there are indeed hypothesis classes for which it is possible to define a confidence score that does not overestimate the confidence of any point while providing high confidence to many points. However, there are also hypothesis classes that do not allow any non-trivial confidence score fulfilling such a guarantee.
* For the agnostic setup, assuming the learner has knowledge of a hypothesis class with low (but not necessarily zero) approximation error, we show that it is not possible to give any non-trivial confidence score that does not overestimate confidence for some instances.

'''The data generating distribution is Lipschitz.''' We provide an algorithm that calculates confidence scores under such a Lipschitzness assumption. We show that, with high probability over samples, the resulting confidence score of every point is an underestimate of its true confidence, while the confidence score we obtain is non-trivial. We provide bounds on the probability (over points and samples) of assigning a low confidence score to a point with high true confidence, and these bounds converge to zero as a function of the training sizes. For more details, see Section 5.

===2 Related work===

The closest previous work to ours is Jiang et al. [5]. They consider a problem very similar to the one we address here - the problem of determining when a classifier's prediction on a given point can be trusted. For the sake of saving space, we refer the reader to that paper for a more extensive review of previous work on this topic. Their theoretical results differ from our work in two essential aspects. First, they consider only one setup - a data generating distribution that satisfies several technical assumptions. In particular, they rely on the following strong condition: "for any point x ∈ X, if the ratio of the distance to one class's high-density-region to that of another is smaller by some margin γ, then it is more likely that x's label corresponds to the former class." We analyse our notion of confidence under several different, incomparable assumptions, arguably none of which is as strong as that. The second significant difference is that their main result on trust of labels (Theorem 4 of [5]) states that if a certain inequality holds then the predicted label agrees with that of the Bayes optimal predictor, and if another inequality holds then there is disagreement between them. However, those inequalities are not complementary. It may very well be that in many cases every domain point (or a high-probability set of instances) fails both inequalities; for such points, that main result tells nothing at all. That paper does not offer any discussion of the conditions under which their main result is not vacuous in that respect. Additionally, their result holds with high probability over the domain and is not a point-wise guarantee.

'''Selective classification / classification with abstention.''' One related line of work is learning with abstention. Similar to our setting, the classification problem does not only consist of the goal of classifying correctly, but also allows the classifier to abstain from making a prediction if the confidence of a prediction is too low. Many works in this line provide accuracy guarantees that hold with high probability over the domain ([1], [11], [3], [4], [6]). This is different from our goal of point-wise guarantees. Point-wise guarantees are provided in [2] and [10]. El-Yaniv and Wiener [2] gave a theoretical analysis of the selective classification setup, in which a classification function and a selection function are learned simultaneously. The risk of the classification is then only assessed on the set of instances that was selected for classification. The selection function is evaluated by its coverage - how many instances, in expectation, are selected for classification. They analyse the trade-off between risk and coverage, and introduce the notion of "perfect classification", which requires risk 0 with certainty. This is similar to our requirements on a confidence score in the deterministic setting, where we require 0 risk with high probability. Their notion of coverage is similar to our notion of non-triviality - in fact, non-triviality corresponds to worst-case coverage over a family of distributions. They provide an optimal perfect learning strategy in the realizable setting, and show that there are hypothesis classes whose coverage converges to 1 as well as hypothesis classes for which coverage is always 0 for some distributions. We use their results in our Section 4. In contrast to their paper, our setup also considers probabilistic labeling functions, and our analysis also considers other assumptions on the family of probability distributions besides generalization guarantees for some fixed hypothesis class.

===3 Problem definition===

Let the domain of instances be X and the label set be {0, 1}. A learning task is determined by a probability distribution P over X × {0, 1}. We denote the marginal distribution over the domain by P_X and the conditional labeling rule by ℓ_P (namely, for x ∈ X, ℓ_P(x) = Pr_{(x′,y′)∼P}[y′ = 1 | x′ = x]).

The Bayes classifier,

h_P^B(x) = 1 iff Pr_{(x′,y′)∼P}[y′ = 1 | x′ = x] ≥ 0.5,

is the pointwise minimizer of the zero-one prediction loss. We sometimes refer to its prediction h_P^B(x) as the majority label or the Bayes label of a point.

We are interested in point-wise confidence. For a point x ∈ X, the confidence of a label y ∈ {0, 1} is

C_P(x, y) = Pr_{(x′,y′)∼P}[y′ = y | x′ = x].

Note that the label assigned by the Bayes predictor maximizes this confidence for every domain point x.

A confidence score is an empirical estimate (based on some training sample S) of the true label confidence. Inevitably, the reliability of a confidence score depends on some assumptions on the data generating distribution (or, in other words, on prior knowledge about the task at hand). Given a family of data generating distributions P (fulfilling some niceness assumptions that reflect the learner's prior knowledge or beliefs about the task), a training sample S, and a parameter δ, the empirical confidence estimate for a point x and label y is a function C(x, y, S, δ). We want the following property to hold: for every probability distribution P ∈ P, with probability at least 1 − δ over an i.i.d. generation of S by P, we have

Pr_{y′∼Bernoulli(ℓ_P(x))}[y′ = y] ≥ C(x, y, S, δ).

That is, with high probability, we do not overestimate the probability of y being the correct label for x. Ideally, this should hold for every point x in the domain. Of course, there is a trivial solution: just let C(·, ·, ·, ·) be the constant 0 function. The goal therefore is to get a confidence score that fulfils the condition above while still being as high as possible for "many" x's. That is, we aim for a confidence score such that E_{x∼P, S∼P^m}[max{C(x, 1, S, δ), C(x, 0, S, δ)}] is high.

As mentioned above, given a data generating distribution P and a data representation available to the learner, the highest confidence on every instance x ∈ X is achieved by the Bayes predictor h_P^B(x), and it is easy to see that it equals max{ℓ_P(x), 1 − ℓ_P(x)}. In contrast with the common notion of a PAC-style error bound, confidence scores may vary over individual instances, capturing the heterogeneity of the domain and of the specific training sample the label prediction relies on. To demonstrate this point, consider the following example:

'''Example 1.''' Let X_1 be the 0.1 grid over [0, 1]^d, let X_0 be the 0.01 grid over [0, 0.1]^d, and let our domain be X = X_0 ∪ X_1 (for some large d). Consider the family P of all probability distributions over X that have a deterministic labeling rule satisfying the 10-Lipschitz condition (so points at distance 0.1 or more have no effect on each other). Assume further that all the distributions in P have half of their mass uniformly distributed over the X_1 grid points and the other half uniformly distributed over X_0. Since outside the [0, 0.1]^d cube every labeling is possible, for every learner there is a distribution P ∈ P w.r.t. which it errs on every domain point in X_1 \ S_X (where S_X = {x : ∃y ∈ {0,1} s.t. (x, y) ∈ S}). On the other hand, due to the Lipschitz condition, all the points in the [0, 0.1]^d grid X_0 must get the same label. Therefore, given a sample S that includes a point in X_0, a learner that labels all the points in X_0 by the label of the sample points in it induces confidence 1 for all these points. We conclude that, for sample sizes between 2 and, say, 10^d/2, for most samples a learner can achieve confidence 1 for points in X_0, while no learner can achieve confidence above 0 for even half of the domain points in X_1. Note also that the No Free Lunch theorem (as formulated in, e.g., [8]) implies that for sample sizes smaller than 10^d/2, for every learner there exists some probability distribution P ∈ P for which its expected error is at least 1/8 (1/4 over the subspace X_1, which has probability weight 1/2).

'''Definition 1''' (Confidence score fulfilling the no-overestimation guarantee for all instances). Consider a function C that takes as input a sample S, a point x, a hypothesis h, and a parameter δ, and outputs a value in [0, 1]. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions P if for every P ∈ P, the probability over S ∼ P^m that there exists x ∈ X with

Pr_{y∼Bernoulli(ℓ_P(x))}[h(x) = y] < C(x, y, S, δ)

is less than δ.

We say a function C(x, y, S, δ) is a confidence score fulfilling the no-overestimation guarantee for positive-mass instances if the above guarantee holds not for all x, but for all x with P({x}) > 0.

Obviously, this guarantee is achieved if we give the confidence score 0 for all predictions. In order to measure the informativeness of a confidence score for a particular distribution, we introduce the notion of non-triviality of a confidence score.

'''Definition 2''' (Non-triviality). Given a confidence score C for some family of distributions P, we define the non-triviality nt_P(C, m, δ) w.r.t. a distribution P ∈ P, for a given sample size m and parameter δ, as one minus the expected difference between the true and the estimated confidence, i.e.:

nt_P(C, m, δ) = E_{x∼P, S∼P^m}[1 − min_{y∈{0,1}} |C_P(x, y) − C(x, y, S, δ)|].

Next we consider a specific confidence score that takes into account whether a hypothesis class is undecided on a point x given a sample S:

C_H(x, y, S, δ) = 0 if there is h ∈ H s.t. L_S(h) = 0 and h(x) ≠ y, and 1 otherwise.

===4 Confidence scores for hypothesis classes===

In the following we analyze point-wise confidence when all the prior knowledge available about the data generating distribution P is a bound on the approximation error of a class of predictors. We distinguish two cases here:

1. the family P_{H,0} of distributions P which are realizable w.r.t. H, i.e. inf_{h∈H} L_P(h) = 0, and
2. the family P_{H,ε} of distributions P for which the approximation error of the class H is low but not guaranteed to be zero, i.e. inf_{h∈H} L_P(h) ≤ ε, for some ε > 0.

Note that, given a class of predictors H, the second family of possible data generating distributions is a superset of the first. Consequently, the pointwise error guarantees one can give in the not-necessarily-realizable case are weaker. (To avoid the ambiguity of the labeling function on points with mass 0, we restrict this discussion to the family of distributions that have positive mass on all points. This implies that in the realizable setting all labeling functions ℓ_P we consider are members of H.)

'''Proposition 1.''' For the family of distributions P that are realizable with respect to H, C_H is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.

A statement of this kind was made, in a different setup, by El-Yaniv and Wiener [2]. We note that our confidence score C_H is equivalent to their notion of a consistent selective strategy. Using our terminology, they show that under the realizability assumption, if an instance (x, y) is assigned score 1 by C_H, then x is guaranteed to have true label y (with probability 1). Furthermore, their Theorems 11 and 14, as well as their Corollary 28, give rise to the following observation about confidence scores.

'''Observation 1.''' The non-triviality nt_P(C_H, m, δ) of this confidence scoring rule under the realizability assumption displays different behaviours for different classes (even when they have similar VC dimension):

* For some hypothesis classes, e.g. the class of thresholds on the real line H_thres or the class of linear separators in ℝ^d, nt_P(C_H, m, δ) converges to 1 for every δ > 0 and every P ∈ P as the sample size goes to infinity.
* On the other hand, for some hypothesis classes with finite VC-dimension there exist P ∈ P_{H,0} with nt_P(C_H, m, δ) = 0 for every sample size m and every δ < 1. This phenomenon occurs, for example, for H being the class of singletons.

For a more detailed analysis of which hypothesis classes have high non-triviality, we refer the reader to [2], noting that our notion of non-triviality corresponds to their notion of coverage.

We now look at the second case we wanted to address in this section: the family of probability distributions for which the approximation error of a class H is bounded by some ε. We fix a hypothesis class H and let P_{H,ε} be the family of all probability distributions P such that H has approximation error at most ε w.r.t. P. We show that for any (non-trivial) hypothesis class H, it is not possible to find any satisfying confidence score for such a family.

'''Proposition 2.''' Let H be any hypothesis class over an infinite domain. Then there is no confidence score C such that the following two statements are true simultaneously:

* C fulfills the no-overestimation guarantee for all positive-mass instances w.r.t. P_{H,ε};
* there exists some η > 0 such that for every P ∈ P_{H,ε} there are δ ∈ (0, 1) and m ∈ ℕ such that C has non-triviality nt_P(C, m, δ) > η.

This shows that restricting ourselves to a family of probability distributions that allows for good generalization is not sufficient for obtaining satisfying confidence scores. In the following section we make stronger, more local assumptions, and show that under these assumptions more satisfying confidence scores can be found.

===5 Confidence scores under Lipschitz assumption===

'''Lipschitz assumption.''' We say that a probability distribution P over X × {0, 1} satisfies λ-Lipschitzness w.r.t. a metric d(·, ·) over X if for every x, x′ ∈ X, |ℓ_P(x) − ℓ_P(x′)| ≤ λ d(x, x′). When the domain is a subset of a Euclidean space, we assume that d is the Euclidean distance unless we specify otherwise.

We provide an algorithm (Algorithm 1) to estimate the labelling probability of points using labelled samples. The algorithm partitions the space into cells and outputs the same answer for points in the same cell; the input parameter r dictates the size of the cells. The algorithm estimates the average labelling probability for each cell, and a confidence interval for this estimate is calculated based on the number of sample points in the cell. The interval is narrower when there are more sample points in the cell.

The following lemmas show how to estimate probability weights and average labelling probabilities of subsets of the domain.

'''Lemma 1.''' Let P be a distribution over domain X, let X′ be a subset of X, and let S be an i.i.d. sample of size m drawn from P. Let p̂(X′, S) be the fraction of the m samples that are in X′. For any δ > 0, with probability 1 − δ over the generation of the sample S,

|P(X′) − p̂(X′, S)| ≤ w_p(m, δ), where w_p(m, δ) = √(ln(2/δ) / (2m)).

'''Lemma 2.''' Let P be a distribution over X × {0, 1}, let X′ be a subset of X, and let S be an i.i.d. sample of size m drawn from P. Let ℓ̄_P(X′) denote the average labelling probability of X′, and let ℓ̂(X′, S) be the fraction of the samples in S ∩ X′ that have label 1. For any δ > 0, with probability 1 − δ over the generation of the sample S, if p̂(X′, S) − w_p(m, δ/2) > 0, then

|ℓ̄_P(X′) − ℓ̂(X′, S)| < w_ℓ(m, δ, p̂(X′, S)), where

w_ℓ(m, δ, p̂(X′, S)) = (1 / (p̂(X′, S) − w_p(m, δ/2))) · (w_p(m, δ/2) + √(ln(4/δ) / (2m))),

and p̂(X′, S) and w_p(m, δ/2) are as defined in Lemma 1.
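The interval widths of Lemmas 1 and 2 are straightforward to compute. The following is a minimal sketch (ours, not code from the paper); the function names w_p and w_l follow the notation in the text:

```python
import math

# Sketch of the interval widths from Lemmas 1 and 2 (not code from the paper).
def w_p(m, delta):
    """Lemma 1: Hoeffding width for estimating the weight P(X') from m samples."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

def w_l(m, delta, p_hat):
    """Lemma 2: width for the average labelling probability of X',
    valid only when p_hat - w_p(m, delta/2) > 0."""
    slack = p_hat - w_p(m, delta / 2)
    if slack <= 0:
        raise ValueError("too few samples fell in X' for the bound to apply")
    return (w_p(m, delta / 2) + math.sqrt(math.log(4 / delta) / (2 * m))) / slack

# Both widths shrink like 1/sqrt(m) for a fixed set weight:
print(round(w_p(1000, 0.05), 3))        # about 0.043
print(round(w_l(1000, 0.05, 0.5), 3))   # about 0.207
```

Note how w_l degrades as p_hat shrinks: estimating the average label of a low-weight region needs proportionally more samples, which is what drives the cell-weight threshold in the proof of Theorem 2 below.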
'''Algorithm 1''' (Lipschitz labelling probability estimate).

Input: test point x, labelled sample S = (x_i, y_i)_{i=1}^m, radius r, estimation parameter δ, Lipschitz constant λ.
Output: labelling probability estimate and confidence width of the estimate.

Split the domain X = [0, 1]^d into a grid of (1/r)^d hypercube cells, each of side length r.
Find the cell t_x containing the test point x.
p̂[t_x] := fraction of samples in t_x.
ℓ̂[t_x] := fraction of samples in t_x with label 1.
w[t_x] := 1.
If p̂[t_x] − w_p(m, δ/2) > 0, then w[t_x] := w_ℓ(m, δ, p̂[t_x]).
Return ℓ̂[t_x] and min(1, w[t_x] + rλ√2).

The following theorem shows that the true labelling probability of a point lies within the estimate interval provided by Algorithm 1, with high probability over the sample used to compute the estimate.

'''Theorem 1.''' Let the domain be [0, 1]^d and suppose the data generating distribution P satisfies λ-Lipschitzness. For any r > 0, δ > 0, m ∈ ℕ and any x ∈ [0, 1]^d, define the confidence score based on Algorithm 1 as

Ĉ_Lipschitz^{r,λ}(x, y; S, δ) = ℓ̂_S(x) − w_S(x) if y = 1, and 1 − ℓ̂_S(x) − w_S(x) if y = 0,

where (ℓ̂_S(x), w_S(x)) is the output of Algorithm 1 with input r, δ, λ. Then with probability 1 − δ over samples S of size m,

Ĉ_Lipschitz^{r,λ}(x, y; S, δ) ≤ C_P(x, y).

'''Proof of Lemma 1.''' Let X_i be a random variable indicating whether the i-th sample belongs to X′: X_i = 1 if the i-th sample belongs to X′ and 0 otherwise. For each i, E[X_i] = P(X′), and p̂(X′, S) = (1/m) Σ_{i=1}^m X_i. Applying Hoeffding's inequality (Lemma 3) yields the inequality of the lemma. ∎

'''Proof of Lemma 2.''' Let X_i = 1 if the i-th sample is in X′ and has label one, and 0 otherwise. Then E[X_i] = P(X′) ℓ̄_P(X′) for each i, and Σ_{i=1}^m X_i = m p̂(X′, S) ℓ̂(X′, S). Note that, by the triangle inequality,

|P(X′) ℓ̂(X′, S) − P(X′) ℓ̄_P(X′)| ≤ |p̂ ℓ̂(X′, S) − P(X′) ℓ̄_P(X′)| + |p̂ − P(X′)| ℓ̂(X′, S) ≤ |p̂ ℓ̂(X′, S) − P(X′) ℓ̄_P(X′)| + w_p.

For any ε > 0,

Pr[|ℓ̄_P(X′) − ℓ̂(X′, S)| > ε]
= Pr[P(X′) · |ℓ̄_P(X′) − ℓ̂(X′, S)| > ε P(X′)]
≤ Pr[|p̂ ℓ̂(X′, S) − P(X′) ℓ̄_P(X′)| + w_p > ε(p̂ − w_p)]
= Pr[|m p̂ ℓ̂(X′, S) − m P(X′) ℓ̄_P(X′)| > m(ε(p̂ − w_p) − w_p)]
= Pr[|Σ_{i=1}^m (X_i − E[X_i])| > m(ε(p̂ − w_p) − w_p)]
≤ 2 exp(−2m(ε(p̂ − w_p) − w_p)²).

When p̂ − w_p > 0, choosing

ε = w_ℓ(m, δ, p̂) > w_p/(p̂ − w_p) + (1/(p̂ − w_p)) · √(ln(4/δ)/(2m)),

we get that, with probability 1 − δ, |ℓ̄_P(X′) − ℓ̂(X′, S)| < w_ℓ(m, δ, p̂). ∎

We now show that as the sample size increases, for an appropriately chosen input parameter r, Algorithm 1 returns narrow estimate intervals for the labelling probabilities of most points. This implies that for most points the confidence score is not much lower than the true confidence max{ℓ_P(x), 1 − ℓ_P(x)}.
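Algorithm 1 can be sketched in a few lines. This is our reading of the pseudocode; the per-cell union-bound adjustment of δ used in the proof of Theorem 1 is omitted for brevity, and note that the paper's slack term uses rλ√2 (the diameter of a cell of side r in d dimensions is r√d):

```python
import math
import random

# A sketch of Algorithm 1 (Lipschitz labelling probability estimate),
# following the pseudocode in the text; w_p / w_l are the Lemma 1/2 widths.
def w_p(m, delta):
    return math.sqrt(math.log(2 / delta) / (2 * m))

def w_l(m, delta, p_hat):
    return (w_p(m, delta / 2) + math.sqrt(math.log(4 / delta) / (2 * m))) \
        / (p_hat - w_p(m, delta / 2))

def lipschitz_estimate(x, S, r, delta, lam):
    """Return (l_hat, width): estimated labelling probability of the cell
    containing x, and the width of its confidence interval."""
    m = len(S)
    cell = tuple(math.floor(xi / r) for xi in x)           # grid cell of side r
    in_cell = [(xi, yi) for xi, yi in S
               if tuple(math.floor(c / r) for c in xi) == cell]
    p_hat = len(in_cell) / m                               # estimated cell weight
    l_hat = sum(y for _, y in in_cell) / len(in_cell) if in_cell else 0.0
    width = 1.0                                            # trivial width by default
    if p_hat - w_p(m, delta / 2) > 0:
        width = w_l(m, delta, p_hat)
    # Lipschitz slack: the paper adds lam * r * sqrt(2) for the cell diameter.
    return l_hat, min(1.0, width + lam * r * math.sqrt(2))

# Toy 1-d usage: deterministic labels (1 iff x >= 0.5), test point in [0.25, 0.5).
random.seed(0)
S = [((u,), int(u >= 0.5)) for u in (random.random() for _ in range(2000))]
l_hat, width = lipschitz_estimate((0.25,), S, r=0.25, delta=0.05, lam=1.0)
```

On this toy sample the cell [0.25, 0.5) contains only label-0 points, so l_hat is 0 and the returned width is non-trivial (strictly below 1).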
'''Theorem 2.''' For every λ-Lipschitz distribution and every ε_x, ε_c, δ > 0, there is a sample size m(ε_x, ε_c, δ) such that, with probability 1 − δ over samples S of size m(ε_x, ε_c, δ),

Pr_{x∼P}[w_S(x) > ε_c] < ε_x,

where w_S is the width of the labelling probability estimate obtained from Algorithm 1 with input grid size r = 1/m^{1/(8d)}.

Note that this theorem implies that the expected non-triviality is high.

===6 Proofs===

'''Useful lemmas.''' The following lemma appears as Theorem 2.2.6 of the book [9], where the reader can find its proof.

'''Lemma 3''' (Hoeffding's inequality for general bounded random variables). Let X_1, ..., X_N be independent random variables, and assume that X_i ∈ [m_i, M_i] for every i. Then, for any t > 0,

Pr[Σ_{i=1}^N (X_i − E[X_i]) ≥ t] ≤ exp(−2t² / Σ_{i=1}^N (M_i − m_i)²).

'''Confidence scores for hypothesis classes.''' We start by recalling the definition of confidence scores fulfilling the no-overestimation guarantee for all instances:

'''Definition 1''' (Confidence score fulfilling the no-overestimation guarantee for all instances, restated). Consider a function C that takes as input a sample S, a point x, a hypothesis h, and a parameter δ, and outputs a value in [0, 1]. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions P if for every P ∈ P, the probability over S ∼ P^m that there exists x ∈ X with Pr_{y∼Bernoulli(ℓ_P(x))}[h(x) = y] < C(x, y, S, δ) is less than δ.

'''6.1 Proof of Proposition 1.''' Recall the statement: for the family of distributions P that are realizable with respect to H, C_H is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.
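For a finite hypothesis class, the score C_H from Section 3 (whose guarantee Proposition 1 establishes) can be sketched by brute force over the version space. The representation of hypotheses as a list of Python functions is our own choice; the paper defines C_H abstractly:

```python
# Sketch of the score C_H for a *finite* hypothesis class, given as a list
# of functions; the paper defines C_H abstractly for arbitrary classes.
def C_H(H, x, y, S):
    """1 iff no hypothesis consistent with S predicts a label other than y at x."""
    version_space = [h for h in H if all(h(xi) == yi for xi, yi in S)]
    if any(h(x) != y for h in version_space):
        return 0
    return 1

# Thresholds h_a(x) = 1 iff x > a, for a = 0, ..., 9.
H = [lambda x, a=a: 1 if x > a else 0 for a in range(10)]
S = [(2, 0), (7, 1)]      # realizable sample: any a in {2, ..., 6} is consistent
print(C_H(H, 1, 0, S))    # 1: all consistent thresholds agree on label 0 at x = 1
print(C_H(H, 5, 1, S))    # 0: consistent thresholds disagree at x = 5
```

The score is 1 exactly on the agreement region of the version space, which is the equivalence with the consistent selective strategy of [2] used in the proof of Observation 1 below.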
We need to show that C_H fulfills Definition 1, that is, that for every hypothesis class H and every distribution P that fulfills the realizability condition w.r.t. H, the probability over S ∼ P^m that there exists x ∈ X with

Pr_{y∼Bernoulli(ℓ_P(x))}[h(x) = y] < C_H(x, y, S, δ)

is less than δ. Since C_H only assigns the values 0 and 1, and the condition is trivially fulfilled for instances with confidence score 0, we only discuss the case where C_H assigns confidence score 1. Recall that C_H assigns confidence 1 if and only if there is no h ∈ H with L_S(h) = 0 and h(x) ≠ y. Since realizability holds, we know that ℓ_P ∈ H, and since S is an i.i.d. sample from P, we know L_S(ℓ_P) = 0. Now let C_H(x, y, S, δ) = 1; then, by definition, L_S(h) = 0 implies h(x) = y, so ℓ_P(x) = y. Thus this confidence score does not overestimate the confidence of any point in any label. ∎

'''6.2 Proof of Observation 1.''' We start by noting that our definition of the confidence score C_H is equivalent to the consistent selective strategy from [2]. In order to state their definition, we first need to introduce some other concepts, starting with the definition of a version space from [7].

'''Definition 3''' (Version space). Given a hypothesis class H and a labeled sample S, the version space H_S is the set of all hypotheses in H that classify S correctly.

Now we can define agreement sets and the maximal agreement set as in [2].

'''Definition 4''' (Agreement set). Let G ⊆ H. A subset X′ ⊆ X is an agreement set with respect to G if all hypotheses in G agree on every instance in X′, namely, for all g_1, g_2 ∈ G and all x ∈ X′, g_1(x) = g_2(x).

'''Definition 5''' (Maximal agreement set). Let G ⊆ H. The maximal agreement set with respect to G is the union of all agreement sets with respect to G.

We can now state the definition of a consistent selective strategy. Note that a selective strategy is defined by a pair (h, g) of a classification function h and a selection function g; for our purposes, we only need the selection function g.

'''Definition 6''' (Consistent selective strategy (CSS)). Given a labeled sample S, a consistent selective strategy is a selective classification strategy that takes h to be any hypothesis in H_S (i.e., a consistent learner), and takes a (deterministic) selection function g that equals one for all points in the maximal agreement set with respect to H_S and zero otherwise.

We now see that, for any H and any labeled sample S, the selection function g assigns one to x if and only if every two h_1, h_2 ∈ H with L_S(h_1) = L_S(h_2) = 0 satisfy h_1(x) = h_2(x). Thus, g(x) = C_H(x, h(x), S, δ) for any h ∈ H_S. In [2] the selection function is then analysed with respect to its coverage, defined by φ(h, g) = E_{x∼P}[g(x)] for a given distribution P. Note that our notion of non-triviality corresponds to this notion of coverage for deterministic distributions and binary confidence scores. We can now use some of their results to show our observation.

'''Theorem 3''' (Non-achievable coverage; Theorem 14 from [2]). Let m and d > 2 be given. There exist a distribution P, an infinite hypothesis class H with finite VC-dimension d, and a target hypothesis in H, such that φ(h, g) = 0 for any selective classifier (h, g) chosen from H by CSS using a training sample S of size m drawn i.i.d. according to P.

This directly implies the second part of our observation. For a more concrete example, consider the class of singletons over the real line, H_sgl = {h_z : ℝ → {0, 1} | h_z(x) = 1 if and only if z = x}. Note that C_{H_sgl} only gives positive confidence scores to instances outside the sample if the sample contains a positively labeled instance. Let P be the uniform distribution over [0, 1] × {0}. Obviously P ∈ P_{H_sgl,0}. Furthermore, no sample generated by P contains a positively labeled instance. Thus nt_P(C_{H_sgl}, m, δ) = Pr_{x∼P_X, S∼P^m}[x ∈ S_X] = 0, where P_X and S_X denote the projections of P and S to the domain X = ℝ.

For the first part of the observation, let us consider the class of thresholds H_thres = {h_a : ℝ → {0, 1} | h_a(x) = 1 if and only if x > a}. We can define the following two learning rules for thresholds:

A_1(S) = h_{a_1}, where a_1 = max{x_i : (x_i, 0) ∈ S},
A_2(S) = h̄_{a_2}, where h̄_{a_2}(x) = 1 if and only if x ≥ a_2 := min{x_i : (x_i, 1) ∈ S}.

The way C_{H_thres} is defined, for any S and δ ∈ (0, 1) we have A_1(S)(x) = A_2(S)(x) if and only if there is y ∈ {0, 1} with C_{H_thres}(x, y, S, δ) = 1. Furthermore, both of these learning rules are empirical risk minimizers for H_thres in the realizable case, and thus both are PAC-learners. Thus for any ε, δ there is an m(ε, δ) such that for any m ≥ m(ε, δ) and any P ∈ P_{H_thres,0},

1 − δ ≤ Pr_{S∼P^m, x∼P}[A_1(S)(x) = A_2(S)(x)] = Pr_{S∼P^m, x∼P}[max_{y∈{0,1}} C_{H_thres}(x, y, S, δ) = 1] = nt_P(C_{H_thres}, m, δ).

Thus lim_{m→∞} nt_P(C_{H_thres}, m, δ) = 1 for any δ ∈ (0, 1) and any P ∈ P_{H_thres,0}. Furthermore, note that we have uniform convergence.

'''6.3 Proof of Proposition 2.''' For every ε > 0, every x ∈ X and every n ∈ ℕ, we can construct a distribution P_{x,n} such that, for a fixed hypothesis h_1 ∈ H, ℓ_{P_{x,n}}(x′) = h_1(x′) for every x′ ∈ X \ {x} and h_1(x) ≠ ℓ_{P_{x,n}}(x), and such that the marginal P_{x,n,X} is uniform over some X_n ⊆ X with |X_n| = n. For a sample to distinguish between two such distributions P_{x_1,n} and P_{x_2,n}, either the point x_1 or the point x_2 needs to be visited by the sample. Thus, in order to give a point-wise guarantee for all instances with positive mass, only points in the sample can be assigned a positive confidence in this scenario. Thus any confidence score fulfilling this guarantee has nt_{P_{x,n}}(C, m, δ) = Pr_{x∼P_X, S∼P^m}[x ∈ S_X] ≤ m/n. For every η > 0 we can find n such that m/n ≤ η, proving our claim. ∎
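The two learning rules A_1, A_2 for thresholds, and the fact that C_{H_thres} is 1 exactly where they agree, can be sketched as follows (the toy sample is our own):

```python
# Sketch of the two empirical risk minimizers for thresholds from the proof
# of Observation 1: A1 puts the threshold at the largest negative point,
# A2 at the smallest positive point; they disagree only on the gap between them.
def A1(S):
    a1 = max(x for x, y in S if y == 0)
    return lambda x: 1 if x > a1 else 0

def A2(S):
    a2 = min(x for x, y in S if y == 1)
    return lambda x: 1 if x >= a2 else 0

S = [(0.1, 0), (0.4, 0), (0.7, 1), (0.9, 1)]   # realizable w.r.t. thresholds
h1, h2 = A1(S), A2(S)
assert h1(0.2) == h2(0.2) == 0     # agreement: confident label 0
assert h1(0.8) == h2(0.8) == 1     # agreement: confident label 1
assert h1(0.5) != h2(0.5)          # disagreement gap (0.4, 0.7): no confidence
```

As the sample grows, the gap between the largest negative and smallest positive point shrinks, which is exactly the convergence of nt_P(C_{H_thres}, m, δ) to 1 noted above.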
'''Proof of Theorem 2.''' We choose the input to the algorithm to be r = 1/m^{1/(8d)}. With probability 1 − δ/2, for all cells with probability weight greater than γ = 1/m^{1/4}, the length of the confidence interval of the labelling probability is less than

(1/(1/m^{1/4} − 1/m^{1/2})) · (1/m^{1/2} + √(ln(4m^{1/8}/δ)/(2m))) + λ√2/m^{1/8}
≤ 1/(m^{1/4} − 1) + (1/(m^{1/4} − 1)) · √((1/16) ln(4m/δ)) + λ√2/m^{1/8}.

This quantity decreases as m increases and converges to zero. Therefore, for every ε_c > 0 there is an M_1(ε_c, δ) such that this interval is less than ε_c: once the sample size exceeds M_1(ε_c, δ), with probability 1 − δ/2, the confidence intervals for the labelling probabilities of cells with weight greater than γ = 1/m^{1/4} are smaller than ε_c.

The points for which we cannot say anything about the interval lengths are the points in cells with weight at most γ. The total weight of such points is at most γ · (1/r)^d = 1/m^{1/8}. For any ε_x > 0, let M_2(ε_x) be such that 1/M_2(ε_x)^{1/8} < ε_x. Choosing a sample size M greater than both M_1(ε_c, δ) and M_2(ε_x), we get that

Pr_{S∼P^M}[w_ℓ > ε_c] < ε_x. ∎

'''Confidence scores using Lipschitz assumption.'''

'''Proof of Theorem 1.''' The algorithm partitions the space into (1/r)^d cells. Let p_c be the probability weight of a cell c, and let p̂_c be the estimate of p_c computed from the sample as the fraction of sample points in c. From Lemma 1 and a union bound, we know that with probability 1 − δ/2, for every cell c,

p_c ∈ [p̂_c − w_p(c), p̂_c + w_p(c)], where w_p(c) = w_p(m, δ/(2(1/r)^d)) (as defined in Lemma 1).

The algorithm also estimates the average labelling probability ℓ̄_c of each cell c by ℓ̂_c, the fraction of the sample points in the cell that have label one; this is the labelling probability estimate of Lemma 2. When the true probability weights of the cells lie within the calculated confidence intervals, by Lemma 2, with probability 1 − δ/2, for every cell c,

ℓ̄_c ∈ [ℓ̂_c − w_ℓ(c), ℓ̂_c + w_ℓ(c)], where w_ℓ(c) = w_ℓ(m, δ/(2(1/r)^d), p̂_c) (as defined in Lemma 2).

The maximum distance between any two points in a cell is r√2.
By λ-Lipschitzness, any point in a cell has labelling probability within λr√2 of the average labelling probability of the cell. Therefore, with probability 1 − δ, for each cell c and every point x in c, the labelling probability of x satisfies

ℓ_P(x) ∈ [ℓ̂_c − w_ℓ(c) − λr√2, ℓ̂_c + w_ℓ(c) + λr√2].

This is the interval returned by the algorithm. Now we lower bound the true confidence using the confidence interval of the labelling probability. For a point x, let c(x) denote the cell containing it:

C_P(x, 1) = ℓ_P(x) ≥ ℓ̂_{c(x)} − w_ℓ(c(x)) − λr√2,
C_P(x, 0) = 1 − ℓ_P(x) ≥ 1 − ℓ̂_{c(x)} − w_ℓ(c(x)) − λr√2. ∎

===References===

[1] Bartlett, P. L.; and Wegkamp, M. H. 2008. Classification with a Reject Option using a Hinge Loss. Journal of Machine Learning Research 9(59): 1823–1840. URL http://jmlr.org/papers/v9/bartlett08a.html.
[2] El-Yaniv, R.; and Wiener, Y. 2010. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research 11: 1605–1641. URL http://portal.acm.org/citation.cfm?id=1859904.
[3] Freund, Y.; Mansour, Y.; and Schapire, R. E. 2004. Generalization Bounds for Averaged Classifiers. The Annals of Statistics 32(4): 1698–1722.
[4] Herbei, R.; and Wegkamp, M. H. 2006. Classification with Reject Option. The Canadian Journal of Statistics / La Revue Canadienne de Statistique: 709–721.
[5] Jiang, H.; Kim, B.; Guan, M.; and Gupta, M. 2018. To Trust or Not to Trust a Classifier. In Advances in Neural Information Processing Systems, 5541–5552.
[6] Kalai, A. T.; Kanade, V.; and Mansour, Y. 2012. Reliable Agnostic Learning. Journal of Computer and System Sciences 78(5): 1481–1495.
[7] Mitchell, T. M. 1977. Version Spaces: A Candidate Elimination Approach to Rule Learning. In Reddy, R., ed., Proceedings of the 5th International Joint Conference on Artificial Intelligence, 305–310. William Kaufmann. URL http://ijcai.org/Proceedings/77-1/Papers/048.pdf.
[8] Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. ISBN 978-1-107-05713-5. URL http://www.cambridge.org/de/academic/subjects/computer-science/pattern-recognition-and-machine-learning/understanding-machine-learning-theory-algorithms.
[9] Vershynin, R. 2019. High-Dimensional Probability.
[10] Wiener, Y.; and El-Yaniv, R. 2015. Agnostic Pointwise-Competitive Selective Classification. Journal of Artificial Intelligence Research 52: 171–201.
[11] Yuan, M.; and Wegkamp, M. H. 2010. Classification Methods with Reject Option Based on Convex Risk Minimization. Journal of Machine Learning Research 11: 111–130. URL https://dl.acm.org/citation.cfm?id=1756011.