<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pointwise Confidence Scores with Provable Guarantees</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nivasini Ananthakrishnan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shai Ben-David</string-name>
          <email>bendavid.shai@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tosca Lechner</string-name>
          <email>tlechner@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Quantifying the probability of a machine learning prediction being correct on a given test point enables users to better decide how to use those predictions. Confidence scores are pointwise estimates of such probabilities. Ideally, these would be the label probabilities (a.k.a. the labeling rule) of the data generating distribution. However, in learning scenarios the learner does not have access to the labeling rule. The learner aims to figure out a best approximation of that rule based on training data and some prior knowledge about the task at hand, both for predicting a label and for possibly providing a confidence score. We formulate two goals for confidence scores, motivated by the use of these scores for safety-critical applications. First, they must not under-estimate the probability of error for any test point. Second, they must not be trivially low for most points. We consider a few common types of learner's prior knowledge and provide tools for obtaining pointwise confidence scores based on such prior knowledge and the relation between the test point and the training data. We prove that under the corresponding prior knowledge assumptions, our proposed tools meet these desired goals.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The reliability of machine-learnt programs is of course a major concern and has been the focus of much research. Theory offers quite a selection of tools for evaluating reliability, from generalization bounds to empirical results on test sets. However, most of those guarantees are statistical, in the sense that they only hold with high probability (over the generation of the training data and of the test points), and they provide no information about the correctness of the prediction on any specific instance. In cases where an error on a specific instance may incur a very high cost, as in safety-critical applications, the common statistical guarantees do not suffice. We would also wish to be able to identify predictions with low confidence, so that one could apply some safety procedure (such as a review by a human expert). Ideally, no low-confidence prediction should go undetected. At the same time, since expert intervention could be expensive, one also wishes to minimize the occurrence of false positives among the predictions flagged as low confidence.</p>
      <p>Can one do better than the overall statistical
estimates when it comes to evaluating reliability on a given
test case?</p>
      <p>Arguably, the most common reason for a statistically reliable machine learning program to fail on a test point is that the point is an 'outlier', in the sense of not being well represented by the sample the program was trained on. This research aims to quantify this 'outlierness'. We propose theoretically founded confidence bounds that take into account the relation between the training sample and the specific test point in question (on top of the commonly used parameters: the learning algorithm, the size of the training sample, and assumptions about the processes generating both the training data and the test instance).</p>
      <p>Clearly, the confidence of any prediction of an unknown label (or any piece of information) hinges upon some prior knowledge or assumptions. In this work we consider a few forms of prior knowledge that are commonly employed in machine learning theory, and we develop and analyse confidence scores for predictions on individual test points under such assumptions.</p>
      <p>We consider the following types of prior knowledge:</p>
    </sec>
    <sec id="sec-2">
      <title>Known hypothesis class with low approximation error:</title>
      <p>We discuss two cases: the realizable setting (i.e., when that approximation error is zero) and the agnostic setup (both in Section 2).</p>
      <p>In the realizable case, we show that there are indeed hypothesis classes for which it is possible to define a confidence score that does not overestimate confidence for any point while providing high confidence for many points. However, there are also hypothesis classes that do not allow non-trivial confidence scores fulfilling such a guarantee.</p>
      <p>For the agnostic setup, assuming the learner has knowledge of a hypothesis class with low (but not necessarily zero) approximation error, we show that it is not possible to give any non-trivial confidence score that does not overestimate the confidence for some instances.</p>
    </sec>
    <sec id="sec-3">
      <title>The data generating distribution is Lipschitz:</title>
      <p>We provide an algorithm that calculates confidence scores under such a Lipschitzness assumption. We show that with high probability over samples, the resulting confidence score of every point is an underestimate of its true confidence, while the confidence score we obtain is non-trivial. We also provide bounds on the probability (over points and samples) of assigning a low confidence score to a point with high true confidence; these bounds converge to zero as a function of the training set size. For more details, see Section 5.</p>
      <sec id="sec-3-1">
        <title>Related work</title>
        <p>
          The closest previous work to ours is Jiang et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. They consider a problem very similar to the one we address here: determining when a classifier's prediction on a given point can be trusted. For the sake of saving space, we refer the reader to that paper for a more extensive review of previous work on this topic. Their theoretical results differ from our work in two essential aspects. First, they consider only one setup, a data generating distribution that satisfies several technical assumptions. In particular, they rely on the following strong condition: "for any point $x \in X$, if the ratio of the distance to one class's high-density-region to that of another is smaller by some margin, then it is more likely that x's label corresponds to the former class." We analyse our notion of confidence under several different incomparable assumptions, arguably none of which is as strong as that. The second significant difference is that their main result on the trust of labels (Theorem 4 of [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) states that if a certain inequality holds then the predicted label agrees with that of the Bayes optimal predictor, and if another inequality holds, there is disagreement between them. However, those inequalities are not complementary. It may very well be that in many cases every domain point (or a high-probability set of instances) fails both inequalities. For such points, that main result says nothing at all. That paper does not offer any discussion of the conditions under which their main result is not vacuous in that respect. Additionally, their result holds with high probability over the domain and is not a point-wise guarantee.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Selective Classification / Classification with Abstention:</title>
      <p>
        One line of work that is related to our paper is learning with abstention. Similar to our setting, the classification problem does not only consist of the goal of classifying correctly, but also allows the classifier to abstain from making a prediction if the confidence of a prediction is too low. Many works in this line provide accuracy guarantees that hold with high probability over the domain ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). This is different from our goal of point-wise guarantees.
      </p>
      <p>
        Point-wise guarantees are provided in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. El-Yaniv et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] gave a theoretical analysis of the selective classification setup, in which a classification function and a selection function are learned simultaneously. The risk of a classification is then only assessed on the set of instances that was selected for classification. Selection functions are evaluated by their coverage: how many instances, in expectation, are selected for classification. They analyse the trade-off between risk and coverage, and introduce the notion of "perfect classification", which requires risk 0 with certainty. This is similar to our requirements on a confidence score in the deterministic setting, where we require 0 risk with high probability. Their notion of coverage is similar to our notion of non-triviality; in fact, non-triviality corresponds to worst-case coverage over a family of distributions. They provide an optimal perfect learning strategy in the realizable setting and show that there are hypothesis classes with a coverage that converges to 1 and hypothesis classes for which coverage is always 0 for some distributions. We use their results in our Section 4. In contrast to their paper, our setup also considers probabilistic labeling functions, and our analysis also considers other assumptions on the family of probability distributions, besides generalization guarantees for some fixed hypothesis class.
      </p>
      <sec id="sec-4-1">
        <title>Problem definition</title>
        <p>Let the domain of instances be $X$ and the label set be $\{0,1\}$. A learning task is determined by a probability distribution $P$ over $X \times \{0,1\}$. We denote the marginal distribution over the domain by $P_X$ and the conditional labeling rule by $\ell_P$ (namely, for $x \in X$, $\ell_P(x) = \Pr_{(x',y') \sim P}[y' = 1 \mid x' = x]$).</p>
        <p>The Bayes classifier, $h^P_B(x) = 1 \iff \Pr_{(x',y') \sim P}[y' = 1 \mid x' = x] \geq 0.5$, is the pointwise minimizer of the zero-one prediction loss. We sometimes refer to its prediction $h^P_B(x)$ as the majority label or the Bayes label of a point.</p>
        <p>We are interested in point-wise confidence. For a point $x \in X$, the confidence of a label $y \in \{0,1\}$ is $C_P(x, y) = \Pr_{(x',y') \sim P}[y' = y \mid x' = x]$. Note that the label assigned by the Bayes predictor maximizes this confidence for every domain point $x$.</p>
        <p>A confidence score is an empirical estimate (based on some training sample S) of the true label confidence. Inevitably, the reliability of a confidence score depends on some assumptions on the data generating distribution (or, in other words, on prior knowledge about the task at hand). Given a family of data generating distributions $\mathcal{P}$ (fulfilling some niceness assumptions that reflect the learner's prior knowledge or beliefs about the task), a training sample S, and a parameter $\delta$, the empirical confidence estimate for a point x and label y is a function $C(x, y, S, \delta)$. We want the following property to hold: for every probability distribution $P \in \mathcal{P}$, with probability at least $1 - \delta$ over an i.i.d. generation of S by P, we have</p>
        <p>$C(x, y, S, \delta) \leq \Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y]$. That is, with high probability, we do not overestimate the probability of y being the correct label for x. Ideally, this should hold for every point x in the domain. Of course, there is a trivial solution for this: just let $C(\cdot, \cdot, \cdot, \cdot)$ be the constant 0 function. The goal therefore is to get a confidence score that fulfils the condition above while still being as high as possible for 'many' x's. That is, we aim for a confidence score such that $\mathbb{E}_{x \sim P, S \sim P^m}[\max\{C(x, 1, S, \delta), C(x, 0, S, \delta)\}]$ is high. As mentioned above, given a data generating distribution P and a data representation available to the learner, the highest confidence on every instance $x \in X$ is achieved by the Bayes predictor $h^P_B(x)$, and it is easy to see that it is $\max\{\ell_P(x), 1 - \ell_P(x)\}$.</p>
        <p>In contrast with the common notion of a PAC-style error bound, confidence scores may vary over individual instances, capturing the heterogeneity of the domain and the specific training sample the label prediction relies on. To demonstrate this point, consider the following example:
Example 1. Let $X_1$ be the 0.1 grid over $[0,1]^d$, let $X_0$ be the 0.01 grid over $[0, 0.1]^d$, and let our domain be $X = X_0 \cup X_1$ (for some large d). Consider the family $\mathcal{P}$ of all probability distributions over X that have a deterministic labeling rule satisfying the 10-Lipschitz condition (so points at distance 0.1 or more have no effect on each other). Assume further that all the distributions in $\mathcal{P}$ have half of their mass uniformly distributed over the $X_1$ grid points and the other half of the mass uniformly distributed over $X_0$.</p>
        <p>Since outside the $[0, 0.1]^d$ cube every labeling is possible, for every learner there is a distribution $P \in \mathcal{P}$ w.r.t. which it errs on every domain point in $X_1 \setminus S_X$ (where $S_X = \{x : \exists y \in \{0,1\} \text{ s.t. } (x, y) \in S\}$). On the other hand, due to the Lipschitz condition, all the points in the $[0, 0.1]^d$ grid, $X_0$, must get the same label. Therefore, given a sample S that includes a point in $X_0$, a learner that labels all the points in $X_0$ by the label of the sample points in it achieves confidence 1 for all these points.</p>
        <p>
          We conclude that, for sample sizes between 2 and, say, $10^d/2$, for most samples a learner can achieve confidence 1 for points in $X_0$, while no learner can achieve confidence above 0 for even half of the domain points in $X_1$. Note also that the No Free Lunch theorem (as formulated in, e.g., [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) implies that for sample sizes smaller than $10^d/2$, for every learner there exists some probability distribution $P \in \mathcal{P}$ for which its expected error is at least 1/8 (1/4 over the subspace $X_1$, which has probability weight 1/2).
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Confidence scores for hypothesis classes</title>
        <p>In the following we will analyze point-wise confidence when all the prior knowledge available about the data generating distribution P is a bound on the approximation error of a class of predictors. We will distinguish two cases here:
1. the family $\mathcal{P}_{H,0}$ of distributions P which are realizable w.r.t. H, i.e. $\inf_{h \in H} L_P(h) = 0$, and
2. the family $\mathcal{P}_{H,\epsilon}$ of distributions P for which the approximation error of the class H is low but not guaranteed to be zero, i.e. $\inf_{h \in H} L_P(h) \leq \epsilon$ for some $\epsilon &gt; 0$.
Note that, given a class of predictors H, the second family of possible data generating distributions is a superset of the first. Consequently, the pointwise error guarantees one can give in that not-necessarily-realizable case are weaker¹.</p>
        <p>Definition 1 (Confidence score fulfilling the no-overestimation guarantee for all instances). Consider a function C that takes as input a point x, a label y, a sample S, and a parameter $\delta$, and outputs a value in $[0,1]$. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions $\mathcal{P}$ if for every $P \in \mathcal{P}$, the probability over $S \sim P^m$ that there exist $x \in X$ and $y \in \{0,1\}$ with $\Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y] &lt; C(x, y, S, \delta)$ is at most $\delta$.</p>
        <p>We say a function $C(x, y, S, \delta)$ is a confidence score fulfilling the no-overestimation guarantee for positive-mass instances if the above guarantee holds not for all x, but for all x with $P(\{x\}) &gt; 0$.</p>
        <p>Obviously, this guarantee is trivial to achieve: simply give the confidence score 0 for all predictions. In order to measure the informativeness of a confidence score for a particular distribution, we introduce the notion of non-triviality of a confidence score.</p>
        <p>Definition 2 (Non-triviality). Given a confidence score C for some family of distributions $\mathcal{P}$, we define the non-triviality $nt_P(C, m, \delta)$ w.r.t. a distribution $P \in \mathcal{P}$ for a given sample size m and parameter $\delta$ to be the expected closeness between the estimated and the true confidence, i.e.:
$nt_P(C, m, \delta) = \mathbb{E}_{x \sim P, S \sim P^m}\left[\min_{y \in \{0,1\}} \left(1 - |C_P(x, y) - C(x, y, S, \delta)|\right)\right].$</p>
        <p>Next we consider a specific confidence score that takes into account whether a hypothesis class is undecided on a point x given a sample S:</p>
        <p>$C_H(x, y, S, \delta) = 1$ if there is no $h \in H$ with $L_S(h) = 0$ and $h(x) \neq y$, and $C_H(x, y, S, \delta) = 0$ otherwise.</p>
        <p>Proposition 1. For the family of distributions $\mathcal{P}_{H,0}$ that are realizable with respect to H, $C_H$ is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.</p>
        <p>¹To not have to deal with the ambiguity of the labeling function for points with mass 0, we will restrict this discussion to the family of distributions which have positive mass on all points. This implies that in the realizable setting all labeling functions $\ell_P$ we consider are part of H.</p>
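        <p>The following Python sketch spells out $C_H$ for a finite hypothesis class given as a list of callables (a minimal illustration under the realizability assumption; the helper names are ours):</p>
        <preformat>
def empirical_error(h, S):
    # L_S(h): fraction of sample points that h mislabels.
    return sum(h(x) != y for x, y in S) / len(S)

def C_H(H, x, y, S):
    # Version space: hypotheses consistent with the sample S.
    version_space = [h for h in H if empirical_error(h, S) == 0]
    # Confidence 1 iff every sample-consistent hypothesis predicts y at x.
    return int(all(h(x) == y for h in version_space))

# Example with thresholds h_a(x) = 1 iff x &gt; a, for a in {0, 0.1, ..., 1}:
H = [lambda x, a=0.1 * k: int(x &gt; a) for k in range(11)]
S = [(0.15, 0), (0.85, 1)]
print(C_H(H, 0.05, 0, S), C_H(H, 0.5, 1, S))  # 1 (all agree), 0 (undecided)
        </preformat>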
        <p>
          Proposition 1 was proven, in a different setup, by El-Yaniv et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We note that our confidence score $C_H$ is equivalent to their notion of a consistent selective strategy. Using our terminology, they show that if the realizability assumption holds and an instance (x, y) is assigned score 1 by $C_H$, then x is guaranteed to have true label y (with probability 1). Furthermore, their Theorems 11 and 14, as well as their Corollary 28, give rise to the following observation about confidence scores.
Observation 1. The non-triviality $nt_P(C_H, m, \delta)$ of the confidence scoring rule $C_H$ under the realizability assumption displays different behaviours for different classes (even when they have similar VC dimension): for some hypothesis classes, e.g. the class of thresholds on the real line $H_{thres}$ or the class of linear separators in $\mathbb{R}^d$, $nt_P(C_H, m, \delta)$ converges to 1 for every $\delta &gt; 0$ and every $P \in \mathcal{P}_{H,0}$ as the sample size goes to infinity.
        </p>
        <p>On the other hand, for some hypothesis classes with finite VC-dimension there exist $P \in \mathcal{P}_{H,0}$ with $nt_P(C_H, m, \delta) = 0$ for every sample size m and every $\delta &lt; 1$. This phenomenon occurs, for example, for H being the class of singletons.</p>
        <p>
          For a more detailed analysis of which hypothesis
classes have high non-triviality, we refer the reader to
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], noting that our notion of non-triviality corresponds
to their notion of coverage.
        </p>
        <p>We now look at the second case we wanted to address in this section: the family of probability distributions such that the approximation error of a class H is bounded by some $\epsilon$. We fix a hypothesis class H and let $\mathcal{P}_{H,\epsilon}$ be the family of all probability distributions P such that H has approximation error at most $\epsilon$ w.r.t. P. We show that for any (non-trivial) hypothesis class H, it is not possible to find any satisfying confidence score for such a family.</p>
        <p>Proposition 2. Let H be any hypothesis class over an infinite domain. Then there is no confidence score C such that the following two statements are true simultaneously:</p>
        <p>C fulfills the no-overestimation guarantee for all positive-mass instances w.r.t. $\mathcal{P}_{H,\epsilon}$; and there exists some $\alpha &gt; 0$ such that for every $P \in \mathcal{P}_{H,\epsilon}$ there are $\delta \in (0,1)$ and $m \in \mathbb{N}$ such that C has non-triviality $nt_P(C, m, \delta) &gt; \alpha$.</p>
        <p>This shows us that restricting ourselves to a family of probability distributions that allows for good generalization is not sufficient for allowing satisfying confidence scores. In the following section we make stronger, more local assumptions and show that under these assumptions more satisfying confidence scores can be found.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Confidence scores under Lipschitz assumption</title>
        <p>Lipschitz assumption: We say that a probability distribution P over $X \times \{0,1\}$ satisfies $\lambda$-Lipschitzness w.r.t. a metric $d(\cdot, \cdot)$ over X if for every $x, x' \in X$, $|\ell_P(x) - \ell_P(x')| \leq \lambda\, d(x, x')$. When the domain is a subset of a Euclidean space, we will assume that d is the Euclidean distance unless we specify otherwise.</p>
        <p>We provide an algorithm (Algorithm 1) that estimates the labelling probability of points using labelled samples. The algorithm partitions the space into cells and outputs the same answer for all points in the same cell. The input parameter r dictates the size of the cells. The algorithm estimates the average labelling probability for each cell, and a confidence interval for this estimate is calculated based on the number of sample points in the cell; the interval is narrower when there are more sample points in the cell.</p>
        <p>The following lemmas show how to estimate probability weights and average labelling probabilities of subsets of the domain.
Lemma 1. Let P be a distribution over domain X. Let $X_0$ be a subset of X. Let S be an i.i.d. sample of size m drawn from the distribution P. Let $\hat{p}(X_0, S)$ be the fraction of the m samples that are in $X_0$. For any $\delta &gt; 0$, with probability $1 - \delta$ over the generation of the sample S, $|P(X_0) - \hat{p}(X_0, S)| \leq w_p(m, \delta)$, where $w_p(m, \delta) = \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}}$.</p>
        <p>Lemma 2. Let P be a distribution over $X \times \{0,1\}$. Let $X_0$ be a subset of X. Let S be an i.i.d. sample of size m drawn from P. Let $\hat{\ell}(X_0, S)$ be the fraction of the labelled samples in $S \cap X_0$ that have label 1. For any $\delta &gt; 0$, with probability $1 - \delta$ over the generation of the sample S, if $\hat{p}(X_0, S) - w_p(m, \delta/2) &gt; 0$, then $|\ell_P(X_0) - \hat{\ell}(X_0, S)| &lt; w_\ell(m, \delta, \hat{p}(X_0, S))$, where $w_\ell(m, \delta, \hat{p}(X_0, S)) = \frac{1}{\hat{p}(X_0, S) - w_p(m, \delta/2)} \left( w_p(m, \delta/2) + \sqrt{\frac{1}{2m} \ln \frac{4}{\delta}} \right)$, $\hat{p}(X_0, S)$ is the fraction of the samples from S that lie in $X_0$, and $w_p(m, \delta/2)$ is as defined in Lemma 1.</p>
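        <p>Under our reconstruction of the two width formulas above, a minimal Python rendering (the function names are ours) is:</p>
        <preformat>
import math

def w_p(m, delta):
    # Hoeffding width for estimating P(X0) from m samples (Lemma 1).
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def w_ell(m, delta, p_hat):
    # Width for the average labelling probability of X0 (Lemma 2);
    # only meaningful when p_hat - w_p(m, delta / 2) &gt; 0.
    return (w_p(m, delta / 2) + math.sqrt(math.log(4.0 / delta) / (2.0 * m))) / \
           (p_hat - w_p(m, delta / 2))
        </preformat>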
        <p>The following theorem shows that the true labelling probability of a point lies within the estimated interval returned by Algorithm 1, with high probability over the sample used to compute the estimate.</p>
        <p>Theorem 1. Let the domain be $[0,1]^d$, and suppose the data generating distribution P satisfies $\lambda$-Lipschitzness.</p>
        <p>Algorithm 1: Lipschitz labelling probability estimate.
Input: test point x, labelled sample $S = (x_i, y_i)_{i=1}^m$, radius r, estimation parameter $\delta$, Lipschitz constant $\lambda$.
Output: labelling probability estimate and confidence width of the estimate.
Split the domain $X = [0,1]^d$ into a grid of $(1/r)^d$ hypercube cells, each of side length r.
Find the cell $t_x$ containing the test point x.
$\hat{p}[t_x]$ := fraction of samples in $t_x$; $\hat{\ell}[t_x]$ := fraction of samples in the cell $t_x$ with label 1.
$w[t_x] := 1$; if $\hat{p}[t_x] - w_p(m, \delta/2) &gt; 0$, then $w[t_x] := w_\ell(m, \delta, \hat{p}[t_x])$.
Return $\hat{\ell}[t_x]$ and $\min(1,\ w[t_x] + \lambda r \sqrt{2})$.</p>
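        <p>A self-contained Python sketch of Algorithm 1, under the assumptions of our reconstruction (the widths $w_p$ and $w_\ell$ are those of Lemmas 1 and 2; variable names are ours):</p>
        <preformat>
import math

def w_p(m, delta):
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def w_ell(m, delta, p_hat):
    return (w_p(m, delta / 2) + math.sqrt(math.log(4.0 / delta) / (2.0 * m))) / \
           (p_hat - w_p(m, delta / 2))

def lipschitz_estimate(x, S, r, delta, lam):
    # Identify the grid cell of a point by its vector of integer coordinates.
    def cell(p):
        return tuple(min(int(c / r), int(1 / r) - 1) for c in p)

    m = len(S)
    t_x = cell(x)
    in_cell = [(xi, yi) for xi, yi in S if cell(xi) == t_x]
    p_hat = len(in_cell) / m                 # fraction of samples in t_x
    ell_hat = sum(yi for _, yi in in_cell) / len(in_cell) if in_cell else 0.0
    width = 1.0                              # trivial width by default
    if p_hat - w_p(m, delta / 2) &gt; 0:
        width = w_ell(m, delta, p_hat)
    # Lipschitz slack: the paper bounds within-cell variation by lam * r * sqrt(2).
    return ell_hat, min(1.0, width + lam * r * math.sqrt(2))
        </preformat>
        <p>The confidence score $\hat{C}^{Lipschitz}_r$ defined next is then read off the returned pair.</p>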
        <p>For any $r &gt; 0$, $\delta &gt; 0$, $m \in \mathbb{N}$, and any $x \in [0,1]^d$, define the confidence score based on Algorithm 1 as $\hat{C}^{Lipschitz}_r(x, y, S, \delta) = \hat{\ell}_S(x) - w_S(x)$ if $y = 1$, and $\hat{C}^{Lipschitz}_r(x, y, S, \delta) = 1 - \hat{\ell}_S(x) - w_S(x)$ if $y = 0$, where $(\hat{\ell}_S(x), w_S(x))$ is the output of the algorithm on input $x$, $S$, $r$, $\delta$, $\lambda$. Then with probability $1 - \delta$ over samples S of size m, $\hat{C}^{Lipschitz}_r(x, y, S, \delta) \leq C_P(x, y)$.</p>
        <p>We now show that as the sample size increases, for an appropriately chosen input parameter r, Algorithm 1 returns narrow estimate intervals for the labelling probabilities of most points. This implies that for most points, the confidence score is not much lower than the true confidence $\max\{\ell_P(x), 1 - \ell_P(x)\}$.</p>
        <p>Theorem 2. For every $\lambda$-Lipschitz distribution and every $\epsilon_x, c, \delta &gt; 0$, there is a sample size $m(\epsilon_x, c, \delta)$ such that with probability $1 - \delta$ over samples S of size $m(\epsilon_x, c, \delta)$, $\Pr_{x \sim P}[w_S(x) &gt; c] &lt; \epsilon_x$, where $w_S$ is the width of the labelling probability estimate obtained from Algorithm 1 with grid-size parameter $r = 1/m^{1/(8d)}$.</p>
        <p>Note that this theorem implies that the expected non-triviality is high.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Proofs</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Useful lemmas</title>
      <p>
        The following lemma appears as Theorem 2.2.6 in the book of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where the reader can find its proof.
Lemma 3 (Hoeffding's inequality for general bounded random variables). Let $X_1, \ldots, X_N$ be independent random variables. Assume that $X_i \in [m_i, M_i]$ for every i. Then, for any $t &gt; 0$, we have
      </p>
      <p>" N # 2t2 !
Pr X(Xi E[Xi]) t exp
:
i=1
Proof of Lemma 1. Let Xi be a random variable
indicating if the ith sample belongs to set X0. Xi = 1 if
the ith sample belongs to X0 and zero otherwise. For
each i, E[Xi] = P (X0). p^(X0; S) = PiN=m1 Xi .
Applying Hoe ding's inequality, we get the inequality of the
theorem.</p>
      <p>Proof of Lemma 2. Let $X_i$ be a random variable such that $X_i = 1$ if the i-th sample is in $X_0$ and has label one, and $X_i = 0$ otherwise. Then $\mathbb{E}[X_i] = P(X_0)\, \ell_P(X_0)$ for each i, and $\sum_{i=1}^{m} X_i = m\, \hat{p}(X_0, S)\, \hat{\ell}(X_0, S)$. Applying Hoeffding's inequality to this sum with deviation width $\sqrt{\frac{1}{2m} \ln \frac{4}{\delta}}$, combining it with Lemma 1 (invoked with parameter $\delta/2$) and using the triangle inequality, we get that with probability $1 - \delta$, $|\ell_P(X_0) - \hat{\ell}(X_0, S)| &lt; w_\ell(m, \delta, \hat{p}(X_0, S))$.</p>
    </sec>
    <sec id="sec-6">
      <title>Confidence scores for hypothesis classes</title>
      <p>We start by recalling the definition of confidence scores fulfilling the no-overestimation guarantee for all instances.
Definition 1 (Confidence score fulfilling the no-overestimation guarantee for all instances). Consider a function C that takes as input a point x, a label y, a sample S, and a parameter $\delta$, and outputs a value in $[0,1]$. We say such a function C is a confidence score fulfilling the no-overestimation guarantee for all instances for a family of probability distributions $\mathcal{P}$ if for every $P \in \mathcal{P}$, the probability over $S \sim P^m$ that there exist $x \in X$ and $y \in \{0,1\}$ with $\Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y] &lt; C(x, y, S, \delta)$ is at most $\delta$.
Proposition 1. For the family of distributions $\mathcal{P}_{H,0}$ that are realizable with respect to H, $C_H$ is indeed a confidence score fulfilling the no-overestimation guarantee for all instances.</p>
      <p>We need to show that $C_H$ fulfills Definition 1; that is, we need to show that for every hypothesis class H and every distribution P that fulfills the realizability condition w.r.t. H, the probability over $S \sim P^m$ that there exists $x \in X$ with</p>
      <p>$\Pr_{y' \sim \mathrm{Bernoulli}(\ell_P(x))}[y' = y] &lt; C_H(x, y, S, \delta)$
is less than $\delta$. Since $C_H$ only assigns the values 0 and 1, and the condition is trivially fulfilled for instances with confidence score 0, we will now only discuss the case where $C_H$ assigns confidence score 1. Recall that $C_H$ assigns confidence 1 if and only if there is no $h \in H$ with $L_S(h) = 0$ and $h(x) \neq y$. Since we know that realizability holds, we know that $\ell_P \in H$, and since S is an i.i.d. sample from P, we know $L_S(\ell_P) = 0$. Now let $C_H(x, y, S, \delta) = 1$; then by definition we know that $L_S(h) = 0$ implies $h(x) = y$. Thus $\ell_P(x) = y$. Thus, this confidence score does not overestimate the confidence of a point in any label.</p>
    </sec>
    <sec id="sec-7">
      <title>Proof of Observation 1</title>
      <p>
        We start by noting that our definition of the confidence score $C_H$ is equivalent to the consistent selective strategy from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In order to state their definition, we first need to introduce some other concepts. We begin with the definition of a version space from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Definition 3 (Version space). Given a hypothesis class H and a labeled sample S, the version space $H_S$ is the set of all hypotheses in H that classify S correctly.
      </p>
      <p>
        Now we can define the agreement set and the maximal agreement set as in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Definition 4 (Agreement set). Let $G \subseteq H$. A subset $X' \subseteq X$ is an agreement set with respect to G if all hypotheses in G agree on every instance in $X'$, namely for all $g_1, g_2 \in G$ and all $x \in X'$: $g_1(x) = g_2(x)$.</p>
      <p>Definition 5 (Maximal agreement set). Let $G \subseteq H$. The maximal agreement set with respect to G is the union of all agreement sets with respect to G.</p>
      <p>We can now state the definition of a consistent selective strategy. Note that a selective strategy is defined by a pair (h, g) of a classification function h and a selection function g. For our purposes, we will only need to look at the selection function g.</p>
      <p>Definition 6 (Consistent selective strategy (CSS)). Given a labeled sample S, a consistent selective strategy (CSS) is a selective classification strategy that takes h to be any hypothesis in $H_S$ (i.e., a consistent learner), and takes a (deterministic) selection function g that equals one for all points in the maximal agreement set with respect to $H_S$ and zero otherwise.</p>
      <p>
        We now see that for any H and any labeled sample S, the selection function g assigns one to x if and only if every two $h_1, h_2 \in H$ with $L_S(h_1) = L_S(h_2) = 0$ satisfy $h_1(x) = h_2(x)$. Thus $g(x) = C_H(x, h(x), S, \delta)$ for any $h \in H_S$. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the selection function is then analysed with respect to its coverage, which is defined as $\Phi(h, g) = \mathbb{E}_{x \sim P}[g(x)]$ for a given distribution P. Note that our notion of non-triviality corresponds to this notion of coverage for deterministic distributions and binary confidence scores. We can now use some of their results to show our observation.
      </p>
      <p>
        Theorem 3 (Non-achievable coverage; Theorem 14 from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Let m and $d &gt; 2$ be given. There exist a distribution P, an infinite hypothesis class H with finite VC-dimension d, and a target hypothesis in H, such that $\Phi(h, g) = 0$ for any selective classifier (h, g) chosen from H by CSS using a training sample S of size m drawn i.i.d. according to P.
      </p>
      <p>This directly implies the second part of our observation. For a more concrete example, consider the class of singletons over the real line, $H_{sgl} = \{h_z : \mathbb{R} \to \{0,1\} : h_z(x) = 1 \text{ iff } x = z\}$. We note that $C_{H_{sgl}}$ gives positive confidence scores to instances outside the sample only if the sample contains a positively labeled instance. Let P be the uniform distribution over $[0,1] \times \{0\}$. Obviously $P \in \mathcal{P}_{H_{sgl},0}$. Furthermore, no sample generated by P will contain a positively labeled instance. Thus $nt_P(C_{H_{sgl}}, m, \delta) = \Pr_{x \sim P_X, S \sim P^m}[x \in S_X] = 0$, where $P_X$ and $S_X$ denote the projections of P and S to the domain $X = \mathbb{R}$.</p>
      <p>Let us now consider the class of thresholds, $H_{thres} = \{h_a : \mathbb{R} \to \{0,1\} : h_a(x) = 1 \text{ iff } x &gt; a\}$. We can define the following two learning rules for thresholds: $A_1(S) = h_{a_1}$, where $a_1 = \max\{x_i : (x_i, 0) \in S\}$; and $A_2(S) = h'_{a_2}$, where $h'_{a_2}(x) = 1$ iff $x \geq a_2 := \min\{x_i : (x_i, 1) \in S\}$.</p>
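      <p>A brief Python sketch of these two rules and the induced agreement check (hedged: the names are ours, and we assume the sample contains at least one point of each label):</p>
      <preformat>
def A1(S):
    # Threshold at the right-most negative point; predicts 1 iff x &gt; a1.
    a1 = max(x for x, y in S if y == 0)
    return lambda x: int(x &gt; a1)

def A2(S):
    # Threshold at the left-most positive point; predicts 1 iff x &gt;= a2.
    a2 = min(x for x, y in S if y == 1)
    return lambda x: int(x &gt;= a2)

def agree(S, x):
    # A1 and A2 agree at x exactly when all sample-consistent thresholds do,
    # i.e. when C_{H_thres} assigns confidence 1 to some label at x.
    return A1(S)(x) == A2(S)(x)
      </preformat>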
      <p>The way $C_{H_{thres}}$ is defined, for any S and any $\delta \in (0,1)$ we have $A_1(S)(x) = A_2(S)(x)$ if and only if there is $y \in \{0,1\}$ with $C_{H_{thres}}(x, y, S, \delta) = 1$. Furthermore, both of these learning rules are empirical risk minimizers for $H_{thres}$ in the realizable case, and thus both of them are PAC-learners. Thus for any $\epsilon, \delta &gt; 0$, there is an $m(\epsilon, \delta)$ such that for any $m \geq m(\epsilon, \delta)$ and any $P \in \mathcal{P}_{H_{thres},0}$,</p>
      <p>$1 - \epsilon \leq \Pr_{S \sim P^m, x \sim P}[A_1(S)(x) = A_2(S)(x)] = \Pr_{S \sim P^m, x \sim P}\left[\max_{y \in \{0,1\}} C_{H_{thres}}(x, y, S, \delta) = 1\right] = nt_P(C_{H_{thres}}, m, \delta).$
Thus $\lim_{m \to \infty} nt_P(C_{H_{thres}}, m, \delta) = 1$ for any $\delta \in (0,1)$ and any $P \in \mathcal{P}_{H_{thres},0}$; furthermore, note that this convergence is uniform.</p>
      <p>Finally, we turn to Proposition 2. Fix some $h_1 \in H$ and a countably infinite subset $X_0$ of the domain. For every $\epsilon &gt; 0$, every $x \in X_0$, and every $n \in \mathbb{N}$ with $n &gt; \frac{1}{\epsilon}$, we can construct a distribution $P_{x,n}$ such that $\ell_{P_{x,n}}(x') = h_1(x')$ for every $x' \in X \setminus \{x\}$, $h_1(x) \neq \ell_{P_{x,n}}(x)$, and the marginal $P_{x,n,X}$ is uniform over some $X_n \subseteq X_0$ with $x \in X_n$ and $|X_n| = n$; note that such a $P_{x,n}$ lies in $\mathcal{P}_{H,\epsilon}$. For a sample to distinguish between two such distributions $P_{x_1,n}$ and $P_{x_2,n}$, either the point $x_1$ or the point $x_2$ needs to be visited by the sample. Thus, in order to give a point-wise guarantee for all instances with positive mass, only points in the sample can be assigned a positive confidence in this scenario. Hence any confidence score fulfilling this guarantee satisfies $nt_{P_{x,n}}(C, m, \delta) \leq \Pr_{x' \sim P_X, S \sim P^m}[x' \in S_X] \leq \frac{m}{n}$, and for every $\alpha &gt; 0$ and every m we can find n such that $\frac{m}{n} &lt; \alpha$, proving our claim.</p>
    </sec>
    <sec id="sec-8">
      <title>Confidence scores using Lipschitz assumption</title>
      <p>Proof of Theorem 1. The algorithm partitions the space into $(1/r)^d$ cells. Let $p_c$ be the probability weight of a cell c and let $\hat{p}_c$ be its estimate, calculated from a sample as the fraction of sample points in the cell c. From Lemma 1 and a union bound over cells, we know that with probability $1 - \frac{\delta}{2}$, for every cell c, $p_c \in [\hat{p}_c - w_p(c),\ \hat{p}_c + w_p(c)]$, where $w_p(c) = w_p(m, \frac{\delta}{2(1/r)^d})$ (as defined in Lemma 1).</p>
      <p>The algorithm also estimates the average label $\ell_c$ of a cell c by $\hat{\ell}_c$, the fraction of the sample points in the cell that have the label one; this is the labelling probability estimate defined in Lemma 2. When the true probability weights of cells lie within the calculated confidence intervals, by Lemma 2 we know that with probability $1 - \frac{\delta}{2}$, for every cell c, $\ell_c \in [\hat{\ell}_c - w_\ell(c),\ \hat{\ell}_c + w_\ell(c)]$, where $w_\ell(c) = w_\ell(m, \frac{\delta}{2(1/r)^d}, \hat{p}_c)$ (as defined in Lemma 2).</p>
      <p>The maximum distance between any two points in any cell is $r\sqrt{2}$. By $\lambda$-Lipschitzness, any point in the cell has labelling probability within $\lambda r \sqrt{2}$ of the average labelling probability of the cell. Therefore, with probability $1 - \delta$, for each cell c and every point x in the cell c, the labelling probability of x satisfies $\ell_P(x) \in [\hat{\ell}_c - w_\ell(c) - \lambda r \sqrt{2},\ \hat{\ell}_c + w_\ell(c) + \lambda r \sqrt{2}]$.</p>
      <p>This is the interval returned by the algorithm. Now we lower bound the true confidence based on the confidence interval of the labelling probability. For a point x, let c(x) denote the cell containing the point, and let $w_S(x) = w_\ell(c(x)) + \lambda r \sqrt{2}$ be the returned width. Then $C_P(x, 1) = \ell_P(x) \geq \hat{\ell}_{c(x)} - w_S(x)$ and $C_P(x, 0) = 1 - \ell_P(x) \geq 1 - \hat{\ell}_{c(x)} - w_S(x)$, which are exactly the scores $\hat{C}^{Lipschitz}_r$ assigns.</p>
      <p>Proof of Theorem 2. We choose the input to the algorithm to be $r = \frac{1}{m^{1/(8d)}}$. With probability $1 - \frac{\delta}{2}$, for all cells with probability weight greater than $\frac{1}{m^{1/4}}$, the length of the confidence interval of the labelling probability is bounded by a quantity that decreases as m increases and converges to zero. Therefore, for every $c &gt; 0$, there is $M_1(c, \delta)$ such that this interval length is less than c: when the sample size is larger than $M_1(c, \delta)$, with probability $1 - \frac{\delta}{2}$, the confidence intervals for the labelling probabilities of cells with weight greater than $\frac{1}{m^{1/4}}$ are smaller than c.</p>
      <p>The points for which we cannot say anything about the interval lengths are points in cells with weight at most $\frac{1}{m^{1/4}}$. The total weight of such points is at most $(1/r)^d \cdot \frac{1}{m^{1/4}} = \frac{1}{m^{1/8}}$. For any $\epsilon_x &gt; 0$, let $M_2(\epsilon_x)$ be such that $\frac{1}{M_2(\epsilon_x)^{1/8}} &lt; \epsilon_x$.</p>
      <p>Choosing a sample size M greater than both $M_1(c, \delta)$ and $M_2(\epsilon_x)$, we get that with probability $1 - \delta$ over $S \sim P^M$, $\Pr_{x \sim P}[w_S(x) &gt; c] &lt; \epsilon_x$.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>P. L.</given-names>
          </string-name>
          ; and Wegkamp,
          <string-name>
            <surname>M. H.</surname>
          </string-name>
          <year>2008</year>
          .
          <article-title>Classi cation with a Reject Option using a Hinge Loss</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <issue>59</issue>
          ):
          <year>1823</year>
          {
          <year>1840</year>
          . URL http://jmlr.org/papers/v9/ bartlett08a.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>El-Yaniv</surname>
            , R.; and Wiener,
            <given-names>Y.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>On the Foundations of Noise-free Selective Classi cation</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>11</volume>
          :
          <issue>1605</issue>
          {
          <fpage>1641</fpage>
          . URL http: //portal.acm.org/citation.cfm?id=
          <fpage>1859904</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mansour</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          ; et al.
          <year>2004</year>
          .
          <article-title>Generalization bounds for averaged classiers</article-title>
          .
          <source>The annals of statistics 32(4):</source>
          <volume>1698</volume>
          {
          <fpage>1722</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Herbei</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and Wegkamp,
          <string-name>
            <surname>M. H.</surname>
          </string-name>
          <year>2006</year>
          .
          <article-title>Classi - cation with reject option</article-title>
          .
          <source>The Canadian Journal of Statistics/La Revue Canadienne de Statistique</source>
          <volume>709</volume>
          {
          <fpage>721</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Gupta,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>To trust or not to trust a classi er</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <volume>5541</volume>
          {
          <fpage>5552</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Kalai</surname>
            ,
            <given-names>A. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kanade</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and Mansour,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Reliable agnostic learning</article-title>
          .
          <source>Journal of Computer and System Sciences</source>
          <volume>78</volume>
          (
          <issue>5</issue>
          ):
          <volume>1481</volume>
          {
          <fpage>1495</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Mitchell,
          <string-name>
            <surname>T. M.</surname>
          </string-name>
          <year>1977</year>
          .
          <article-title>Version Spaces: A Candidate Elimination Approach to Rule Learning</article-title>
          . In Reddy, R., ed.,
          <source>Proceedings of the 5th International Joint Conference on Arti cial Intelligence</source>
          . Cambridge, MA, USA,
          <year>August</year>
          22-
          <issue>25</issue>
          ,
          <year>1977</year>
          ,
          <volume>305</volume>
          {
          <fpage>310</fpage>
          . William Kaufmann. URL http://ijcai. org/Proceedings/77-1/Papers/048.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Shalev-Shwartz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ben-David</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Understanding Machine Learning -</article-title>
          <source>From Theory to Algorithms</source>
          . Cambridge University Press.
          <source>ISBN 978-1-10-705713- 5</source>
          . URL http://www.cambridge.org/de/ academic/subjects/computer-science/
          <article-title>pattern-recognition-and-machine-learning/ understanding-machine-learning-theory-algorithms.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Vershynin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>High-dimensional probability</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Wiener</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>El-Yaniv</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Agnostic pointwise-competitive selective classi cation</article-title>
          .
          <source>Journal of Arti cial Intelligence Research</source>
          <volume>52</volume>
          :
          <fpage>171</fpage>
          {
          <fpage>201</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Wegkamp,
          <string-name>
            <surname>M. H.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>Classi cation Methods with Reject Option Based on Convex Risk Minimization</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>11</volume>
          :
          <issue>111</issue>
          {
          <fpage>130</fpage>
          . URL https://dl.acm.org/citation.cfm? id=
          <fpage>1756011</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>