<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Certifiable Active Class Selection in Multi-Class Classification</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund University, Artificial Intelligence Group</institution>
          ,
          <addr-line>D-44227 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>68</fpage>
      <lpage>76</lpage>
      <abstract>
<p>Active class selection (ACS) requires the developer of a classifier to actively choose the class proportions of the training data. This freedom of choice puts the trust in the trained classifier at risk if the true class proportions, which occur during deployment, are subject to uncertainties. This issue has recently motivated a certificate for ACS-trained classifiers, which builds trust by proving that a classifier is sufficiently correct within a specific set of class proportions and with a high probability. However, this certificate was only developed in the context of binary classification. In this paper, we employ Hölder's inequality to extend the binary ACS certificate to multi-class settings. We demonstrate that our extension indeed provides correct and tight upper bounds of the classifier's error. We conclude with several directions for future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Active class selection</kwd>
        <kwd>Prior probability shift</kwd>
        <kwd>Multi-class classification</kwd>
        <kwd>Model certification</kwd>
        <kwd>Learning theory</kwd>
        <kwd>Validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The ongoing deployment of machine learning models in real-world
applications increases the importance of validating these models thoroughly. Ideally, the
robustness of these models against distribution shifts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is certified in the sense
of being formally proven or extensively tested [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In active class selection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a class-conditional data generator is repeatedly
asked to produce feature vectors for arbitrarily chosen classes. In this setting,
the developer of a classifier must actively decide for the class proportions in
which the training data set is produced. While this freedom can reduce the data
acquisition cost while improving classification performance, it also puts the trust
in the trained classifier at risk: what if the class proportions, which occur during
deployment, are not precisely known or are even subject to changes?
      </p>
      <p>
        These uncertainties have motivated a certificate for ACS-trained classifiers,
which declares a set of class proportions to which a classifier is safely applicable
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In particular, the certified classifier is required to exhibit an ACS-induced
error of at most some ϵ &gt; 0, with a probability of at least 1 − δ. However, this
certificate was only developed in the context of binary classification; a multi-class
certificate has not yet been proposed, to the best of our knowledge.
      </p>
      <p>© 2022 for this paper by its authors. Use permitted under CC BY 4.0.</p>
      <p>In this paper, we close the gap between ACS model certification and
multi-class classification. In the following, we recapitulate the theoretical background
of binary ACS certification in Section 2 before we develop our multi-class ACS
certificate in Section 3. We validate our claims empirically in Section 4 before
we conclude with Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Theoretical Background</title>
      <p>
        The term “domain”, as used in domain adaptation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], describes a probability
density function over X × Y, where X is the feature space and Y is the label
space. In ACS, we assume that the source domain S, where a machine learning
model is trained, differs from the target domain T, where the model is deployed,
only in terms of the class proportions pS ̸= pT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Such deviations, also known
as target shift [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or as prior probability shift [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], occur due to the freedom of
choosing any pS for the acquisition of training data. We are interested in the
impact of such deviations on the classification performance with respect to T .
      </p>
      <p>
        Recently, a PAC learning perspective [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on this setting has provided us with
Theorem 1. This result quantifies the difference in loss values L(h) between an
ACS-generated training set D and the target domain T. Only if this difference
is small can we expect to learn a classifier h from D that is accurate also with
respect to T, similar to standard PAC learning theory. The key insight of this
theorem is that the relevant loss difference between D and T continuously
approaches the inter-domain gap |LT(h) − LS(h)|, which is independent of the
random draw of D from S, as the training set size m increases. In ACS, this
increase happens naturally as more and more data is actively being acquired,
so that the error of any ACS-trained classifier is increasingly dominated by this
gap. Since the inter-domain gap is constant with respect to the random draw of
the training set D, it is also independent of ϵ, δ, and m.
      </p>
      <p>
        Theorem 1 (Identical mechanism bound [
        <xref ref-type="bibr" rid="ref2">2</xref>
]). For any ϵ &gt; 0 and any fixed
h ∈ H, it holds with probability at least 1 − δ, where δ = 4e^{−2mϵ²}, that

|LT(h) − LS(h)| − ϵ ≤ |LT(h) − LD(h)| ≤ |LT(h) − LS(h)| + ϵ.
      </p>
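      <p>
        As a hedged illustration of Theorem 1 (the ϵ and δ values below are arbitrary examples, not values from the paper), the relation δ = 4e^{−2mϵ²} can be inverted to estimate the training set size m that a given budget demands:
      </p>
      <p>
```python
import math

# Theorem 1 relates the training set size m to the probability
# δ = 4·exp(−2·m·ϵ²) with which the identical mechanism bound may fail.

def delta_for(m: int, eps: float) -> float:
    """Failure probability δ for a training set of size m and tolerance ϵ."""
    return 4.0 * math.exp(-2.0 * m * eps ** 2)

def m_for(delta: float, eps: float) -> int:
    """Smallest m such that the bound holds with probability at least 1 − δ."""
    return math.ceil(math.log(4.0 / delta) / (2.0 * eps ** 2))

# Example budget (arbitrary): ϵ = 0.05 and δ = 0.05 require m = 877 samples.
m = m_for(delta=0.05, eps=0.05)
print(m)  # 877
```
      </p>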
      <p>
        This theorem can be used to certify a trained classification model h with N
classes in terms of a set of safe class proportions P ⊆ [0, 1]^N. By “safe”, we mean
that, during the deployment of h on T, the trained model induces, with a high
probability, at most a small domain-induced error ϵ.
      </p>
      <p>
        Definition 1 (Certified hypothesis [
        <xref ref-type="bibr" rid="ref2">2</xref>
]). A hypothesis h ∈ H is certified for
all class proportions in P ⊆ [0, 1]^N if, with probability at least 1 − δ and ϵ, δ &gt; 0,

|LT(h) − LS(h)| ≤ ϵ.
      </p>
      <p>
        Let pS, pT ∈ [0, 1]^N be vectors with components [p•]i = P•(Y = i), which
express the probabilities of the class labels in the respective domains S and T.
Furthermore, let ℓh ∈ R^N be a vector that represents the class-wise losses

[ℓh]i = ℓX(h, i) = ∫_X P(X = x | Y = i) · ℓ(h(x), i) dx,   (1)

according to some loss function ℓ. The total loss of the hypothesis h is then
given by L•(h) = Σ_{i∈Y} [p•]i [ℓh]i = ⟨p•, ℓh⟩. Consequently, the inter-domain gap
for classification problems can be expressed as

|LT(h) − LS(h)| = |⟨pT, ℓh⟩ − ⟨pS, ℓh⟩| = |⟨pT − pS, ℓh⟩| = |⟨d, ℓh⟩|,   (2)

where d = pT − pS is the difference between the class probabilities in the
domains S and T.
      </p>
      <p>In order to certify classification models, it is necessary to calculate Eq. 2.
However, the true class-wise losses ℓh are unknown, and we can only estimate
the empirical class-wise losses ℓ̂X(h, y) = (1/m_y) Σ_{i : y_i = y} ℓ(y, h(x_i)) from a finite
amount of labeled validation data. Therefore, our goal is to constrain Eq. 2 with
the smallest upper bound that holds with a high probability.</p>
      <p>
        For binary classification problems, the inter-domain gap can be factorized
into a product of two scalars, ∆p · ∆ℓ. Here, ∆p = |pT − pS| ∈ R denotes
the difference between class proportions and ∆ℓ = |ℓ_{Y=2}(h) − ℓ_{Y=1}(h)| ∈ R
denotes the difference between class-wise losses. A smallest upper bound ∆ℓ*,
which holds with probability 1 − δ, can be found for the empirical estimate ∆ℓ̂.
Therefore, by Def. 1, binary classifiers can be certified as a function of ϵ and δ,
where P is characterized by a range [p_{T,min}, p_{T,max}] of class proportions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
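      <p>
        The binary factorization can be checked numerically; the following sketch uses invented losses and proportions:
      </p>
      <p>
```python
import numpy as np

# Binary case: the inter-domain gap factorizes as |⟨d, ℓh⟩| = ∆p · ∆ℓ.
# Illustrative (made-up) class-wise losses and class proportions.
ell_h = np.array([0.10, 0.30])   # losses of class 1 and class 2
p_S = np.array([0.50, 0.50])     # training proportions
p_T = np.array([0.65, 0.35])     # deployment proportions

gap = float(abs((p_T - p_S) @ ell_h))        # |⟨d, ℓh⟩|
delta_p = float(abs(p_T[0] - p_S[0]))        # ∆p, shift of class proportions
delta_ell = float(abs(ell_h[1] - ell_h[0]))  # ∆ℓ, gap between class-wise losses

# The factorization holds because d = (∆p, −∆p) in the binary case.
assert np.isclose(gap, delta_p * delta_ell)
print(round(delta_p * delta_ell, 6))  # 0.03
```
      </p>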
    </sec>
    <sec id="sec-3">
      <title>Certification in Multi-Class Classification</title>
      <p>
        To certify multi-class classifiers according to Def. 1, an estimate of the
inter-domain gap with multiple classes must be found. For this purpose, we will make
use of Hölder’s inequality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a fundamental inequality for the study
of Lp spaces. This inequality will help us use PAC bounds for multi-class
certification, similar to the certification of binary classifiers.
      </p>
      <p>
        Theorem 2 (Hölder’s inequality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Let (S, Σ, µ) be a measure space and
let p, q ∈ [1, ∞] with 1/p + 1/q = 1, where 1/∞ = 0. Then, for all measurable
real- or complex-valued functions f and g on S, ∥fg∥1 ≤ ∥f∥p ∥g∥q.
      </p>
      <p>With this inequality, the inter-domain gap from Eq. 2 can be transformed to

|⟨d, ℓh⟩| ≤ ∥d∥1 · ∥ℓh∥∞   for p = 1, q = ∞,   (3)
|⟨d, ℓh⟩| ≤ ∥d∥2 · ∥ℓh∥2   for p = 2, q = 2,   (4)
|⟨d, ℓh⟩| ≤ ∥d∥∞ · ∥ℓh∥1   for p = ∞, q = 1.   (5)</p>
      <p>In the following we restrict ourselves to the consideration of the Hölder
conjugate p = ∞, q = 1. In principle, the other conjugate forms are also applicable.
However, we will see that the infinity norm on d provides a simple and intuitive
characterization of P.</p>
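      <p>
        All three conjugate pairs can be verified numerically; d and ℓh below are arbitrary example vectors, not values from our experiments:
      </p>
      <p>
```python
import numpy as np

# Arbitrary example vectors: d = pT − pS (components sum to zero) and
# class-wise losses ℓh.
d = np.array([0.15, -0.05, -0.10])
ell_h = np.array([0.10, 0.30, 0.20])

gap = float(abs(d @ ell_h))  # |⟨d, ℓh⟩|

# The three Hölder conjugate pairs applied to the inter-domain gap.
bounds = {
    "p=1, q=inf": np.linalg.norm(d, 1) * np.linalg.norm(ell_h, np.inf),
    "p=2, q=2": np.linalg.norm(d, 2) * np.linalg.norm(ell_h, 2),
    "p=inf, q=1": np.linalg.norm(d, np.inf) * np.linalg.norm(ell_h, 1),
}

for name, bound in bounds.items():
    assert bound >= gap  # every conjugate pair yields a valid upper bound
    print(name, round(float(bound), 4))
```
      </p>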
      <p>In order to yield a certified hypothesis, as according to Def. 1, it must hold,
with a probability of at least 1 − δ, that, for ϵ, δ &gt; 0,

|LT(h) − LS(h)| ≤ ∥d∥∞ · ∥ℓh∥1 ≤ ϵ.</p>
      <p>
        Let us now describe the set P of safe class proportions. In extension to the
requirement given in Def. 1, P is supposed to cover all class proportions that
are valid according to the certificate. With the minimum upper bound ∥ℓh∥1*, we
can rearrange Eq. 5 to ∥d∥∞ ≤ ϵ / ∥ℓh∥1* for all pT ∈ P.
      </p>
      <p>
        By taking the infinity norm of d, ∥d∥∞ reduces to the class i which has the
largest absolute label distribution shift |[pT]i − [pS]i| = ∆p_i. In analogy to the
binary certification, the range of safe deployment proportions for a class i can
be described by [[pS]i − ∆p*, [pS]i + ∆p*] = [p_{T,i}^{min}, p_{T,i}^{max}]. Here, ∆p* = ϵ / ∥ℓh∥1*
is constant for all classes and represents the largest absolute shift that a class is
allowed to have to satisfy Eq. 5 with probability at least 1 − δ. Therefore,

P = { p ∈ [0, 1]^N : [p]_i ∈ [p_{T,i}^{min}, p_{T,i}^{max}] ∀ i ∈ {1, …, N} and Σ_{i=1}^N [p]_i = 1 },   (8)

where strict inequalities are realized through non-strict inequalities with some
sufficiently small τ &gt; 0, defines the set of class proportions to which the certified
classifier h is safely applicable.
      </p>
      <p>
        Based on this approach, a variant of the certificate can be derived by
modifying d slightly. For this modification, the negative vector components of d are set
to zero, so that a vector d+ is formed. This variant is motivated by the
observation that, by applying the norm to d, the negative error components (falsely)
contribute as positives to the estimation of the error. Accordingly, d+ addresses
only the positive error components and allows a tighter estimate of the
inter-domain gap. However, since with d+ only the positive error components
are considered, the range of class proportions can no longer be expressed by
[p_{T,i}^{min}, p_{T,i}^{max}] and P_{d+} cannot be defined by Eq. 8. As a consequence, P_{d+} is more
difficult to characterize than P.
      </p>
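      <p>
        To make the characterization of P concrete, the following sketch assumes illustrative values for ϵ and for a validation estimate ∥ℓh∥1* (both invented, not from our experiments), and also contrasts the ∥d∥∞ bound with its d+ variant:
      </p>
      <p>
```python
import numpy as np

# Assumed quantities: training proportions pS, a high-probability upper bound
# ell_1_star on ∥ℓh∥1 (estimated from validation data), and an error budget ϵ.
p_S = np.array([1 / 3, 1 / 3, 1 / 3])
ell_1_star = 0.60
eps = 0.06

# Largest allowed absolute per-class shift: ∆p* = ϵ / ∥ℓh∥1*.
dp_star = eps / ell_1_star

def is_certified_safe(p_T: np.ndarray) -> bool:
    """Membership test for the set P of Eq. 8."""
    in_ranges = bool(np.all(dp_star + 1e-12 >= np.abs(p_T - p_S)))
    sums_to_one = bool(np.isclose(np.sum(p_T), 1.0))
    return in_ranges and sums_to_one

print(is_certified_safe(np.array([0.40, 0.30, 0.30])))  # True: max shift 0.067
print(is_certified_safe(np.array([0.50, 0.25, 0.25])))  # False: shift 0.167

# Variant: zeroing the negative components of d gives the d+ bound, which is
# tighter whenever the largest absolute shift is a decrease of a class.
d = np.array([0.20, 0.40, 0.40]) - p_S
d_plus = np.clip(d, 0.0, None)
bound = float(np.max(np.abs(d)) * ell_1_star)    # ∥d∥∞ · ∥ℓh∥1*
bound_plus = float(np.max(d_plus) * ell_1_star)  # ∥d+∥∞ · ∥ℓh∥1*
assert bound >= bound_plus
```
      </p>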
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In the following evaluation, we show that the introduced multi-class certificate
indeed represents an upper bound of the inter-domain gap. Besides the
correctness of the certificate, the accuracy and tightness of the estimated upper bound
are inspected. Ideally, the certificates correspond to upper bounds that are both
correct and tight. To this end, we randomly subsample the data to generate
different deployment class proportions pT while keeping P(X = x | Y = y) fixed.
To facilitate visualizations in two dimensions, we limit our evaluation to data
sets with three classes. The implementation of our configurable experiments is
available online.¹</p>
      <p>
        Correctness
The certificate is correct if LˆS + ϵ ≥ LˆT holds, where ϵ is the predicted
domain-induced error and LˆT is the empirical estimate of the target domain loss [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
At this point, recognize that computing LˆT requires target domain data, which
is typically not available in ACS. This unavailability raises the desire for an
upper bound LˆS + ϵ of LˆT, which allows practitioners to assess, using only
ACS-generated data from S, whether their classifier is sufficiently accurate on T.
Our certificate is designed to provide this upper bound, and the purpose of our
experiments is to validate this claim.
      </p>
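      <p>
        The correctness criterion and the failure budget under δ reduce to a few lines; the loss values below are invented for illustration:
      </p>
      <p>
```python
# Invented example values: a certificate issued from source-domain data, and an
# empirical target loss measured only to evaluate the certificate afterwards.
L_hat_S = 0.18  # empirical source-domain loss
eps = 0.05      # certified domain-induced error ϵ
L_hat_T = 0.21  # empirical target-domain loss (unavailable during ACS)

# The certificate is counted as correct if its upper bound holds.
is_correct = (L_hat_S + eps >= L_hat_T)
print(is_correct)  # True, since 0.23 >= 0.21

# With δ = 0.05, up to a 5% share of certificates may fail by design:
n_certificates = 216_000
max_allowed_failures = int(n_certificates * 0.05)
print(max_allowed_failures)  # 10800
```
      </p>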
      <p>Our experiments cover a repeated three-fold cross-validation on six data sets
and two learning algorithms, to represent a broad range of scenarios. In total, we
have generated 216 000 certificates under the zero-one loss with δ = 0.05. Among
these certificates, only one failed, by producing an LˆS + ϵ that is larger than LˆT .
Due to the statistical nature of our certificates, δ = 0.05 would have allowed
for up to 10 800 failures. Therefore, the number of failures is much smaller than
expected. This small number of failures results from the coarse bound estimation
that Hölder’s inequality provides.</p>
      <p>
        Tightness
A fair comparison between our certificates and our empirical estimate
requires us to take the estimation error ϵT of the baseline, LˆT, into account [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This necessity stems from the fact that LˆT is just an estimate from a finite
amount of data. Having access to labeled target domain data would thus yield
an upper bound LˆT + ϵT of the true target domain error LT. We speak of a tight
bound if LˆS + ϵ ≈ LˆT + ϵT.
      </p>
      <p>¹ https://github.com/martinsenz/MultiClassAcsCertificates</p>
      <p>Fig. 1: The certified error ϵ (left) and its tightness |LˆS + ϵ − LT| (right), according to ∥d∥∞ · ∥ℓh∥1, on the satimage data set, using a logistic regression and the zero-one loss.</p>
      <p>Fig. 2: The certified error ϵ (left) and its tightness |LˆS + ϵ − LT| (right), according to the variant ∥d+∥∞ · ∥ℓh∥1, on the satimage data set, using a logistic regression and the zero-one loss.</p>
      <p>For example, the prediction of the domain-induced error ϵ, as according to
our certificate, can be inspected in Fig. 1. The prediction by our d+ certificate
variant is shown in Fig. 2. As we can see, the upper bound is very tight in the
area around pS. With increasing distance from pS, the estimation of the upper
bound becomes larger, and hence, the upper bound becomes coarser. As expected,
the variant using the d+ vector provides a finer bound of the inter-domain gap
in some regions. Tab. 2 summarizes the absolute deviations between
LˆS + ϵ and LˆT + ϵT in terms of the mean absolute deviation (MAD) and quartiles
(Q1, Q2, Q3).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Outlook</title>
      <p>Using Hölder’s inequality and considering PAC bounds, we have proposed an
upper bound ∥d∥∞ · ∥ℓh∥1, from which certificates of model robustness in
multi-class ACS can be issued. Our experiments demonstrate that this certification
is correct within a probability budget δ. Moreover, safe class proportions can
easily be described by the maximum allowable absolute deviation ∆p*. Thus, the
certification of a multi-class ACS classifier is straightforward for the practitioner
to interpret and intuitive to understand, regardless of the number of classes
considered in the classification problem.</p>
      <p>By decomposing the inter-domain gap into positive and negative error
components, it is possible to find estimates that bound the domain gap even more
precisely. An example is the presented d+ certification variant, which considers
only the positive error components. In order to obtain even more precise
estimates, it is further conceivable to also take the negative error components
(correctly) into account. However, as already indicated by the d+ variant, the
complexity of describing the set P of valid class proportions increases with the
expressiveness of the upper bound.</p>
      <p>
        In future work, we plan to evaluate more precise estimates of this kind, as
well as the other bounds that are provided by Hölder’s inequality in Eqs. 3 and 4. We
also plan to use our multi-class certificates as a basis for theoretically justified
data acquisition strategies for multi-class ACS, similar to the binary acquisition
strategy that is based on binary certificates [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bunse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Active class selection with uncertain deployment class proportions</article-title>
          .
          <source>In: Workshop on Interactive Adaptive Learning</source>
          . p.
          <volume>70</volume>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bunse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Certification of model robustness in active class selection</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>266</fpage>
          -
          <lpage>281</lpage>
          . Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kroening</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thamo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>X.:</given-names>
          </string-name>
          <article-title>A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability</article-title>
          .
          <source>Computer Science Review</source>
          <volume>37</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lomasky</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodley</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aernecke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Active class selection</article-title>
          .
          <source>In: European Conference on Machine Learning</source>
          . pp.
          <fpage>640</fpage>
          -
          <lpage>647</lpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Moreno-Torres</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raeder</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alaíz-Rodríguez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A unifying view on dataset shift in classification</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>521</fpage>
          -
          <lpage>530</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>A survey on transfer learning</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>22</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1345</fpage>
          -
          <lpage>1359</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>W.H.</given-names>
          </string-name>
          :
          <article-title>On generalized Hölder inequality</article-title>
          .
          <source>Nonlinear Analysis: Theory, Methods &amp; Applications</source>
          <volume>16</volume>
          (
          <issue>5</issue>
          ),
          <fpage>489</fpage>
          -
          <lpage>498</lpage>
          (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schölkopf</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muandet</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Domain adaptation under target and conditional shift</article-title>
          .
          <source>In: International Conference on Machine Learning. JMLR Workshop and Conference Proceedings</source>
          , vol.
          <volume>28</volume>
          , pp.
          <fpage>819</fpage>
          -
          <lpage>827</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>