Certifiable Active Class Selection in Multi-Class Classification

Martin Senz[0000−0002−9377−3939], Mirko Bunse[0000−0002−5515−6278], and Katharina Morik[0000−0003−1153−5986]

TU Dortmund University, Artificial Intelligence Group, D-44227 Dortmund, Germany
{martin.senz, mirko.bunse, katharina.morik}@tu-dortmund.de

Abstract. Active class selection (ACS) requires the developer of a classifier to actively choose the class proportions of the training data. This freedom of choice puts the trust in the trained classifier at risk if the true class proportions, which occur during deployment, are subject to uncertainties. This issue has recently motivated a certificate for ACS-trained classifiers, which builds trust by proving that a classifier is sufficiently correct within a specific set of class proportions and with a high probability. However, this certificate was only developed in the context of binary classification. In this paper, we employ Hölder's inequality to extend the binary ACS certificate to multi-class settings. We demonstrate that our extension indeed provides correct and tight upper bounds of the classifier's error. We conclude with several directions for future work.

Keywords: Active class selection · Prior probability shift · Multi-class classification · Model certification · Learning theory · Validation.

1 Introduction

The increasing deployment of machine learning models in real-world applications makes it ever more important to validate these models thoroughly. Ideally, the robustness of these models against distribution shifts [5] is certified in the sense of being formally proven or extensively tested [3]. In active class selection [4], a class-conditional data generator is repeatedly asked to produce feature vectors for arbitrarily chosen classes. In this setting, the developer of a classifier must actively decide on the class proportions in which the training data set is produced.
While this freedom can reduce the data acquisition cost and improve classification performance, it also puts the trust in the trained classifier at risk: what if the class proportions that occur during deployment are not precisely known or are even subject to changes? These uncertainties have motivated a certificate for ACS-trained classifiers, which declares a set of class proportions to which a classifier is safely applicable [2]. In particular, the certified classifier is required to exhibit an ACS-induced error of at most some ϵ > 0, with a probability of at least 1 − δ. However, this certificate was only developed in the context of binary classification; a multi-class certificate has not yet been proposed, to the best of our knowledge.

© 2022 for this paper by its authors. Use permitted under CC BY 4.0.

In this paper, we close the gap between ACS model certification and multi-class classification. In the following, we recapitulate the theoretical background of binary ACS certification in Section 2 before we develop our multi-class ACS certificate in Section 3. We validate our claims empirically in Section 4 before we conclude with Section 5.

2 Theoretical Background

The term "domain", as used in domain adaptation [6], describes a probability density function over X × Y, where X is the feature space and Y is the label space. In ACS, we assume that the source domain S, where a machine learning model is trained, differs from the target domain T, where the model is deployed, only in terms of the class proportions pS ≠ pT [2]. Such deviations, also known as target shift [8] or as prior probability shift [5], occur due to the freedom of choosing any pS for the acquisition of training data. We are interested in the impact of such deviations on the classification performance with respect to T.
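To make this setting concrete: since P(X = x | Y = y) is assumed to be identical in both domains, the total loss of a fixed classifier changes only through the class proportions. The following minimal sketch illustrates this effect; all loss values and proportions are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical class-wise losses of a fixed classifier h (one entry per class).
ell_h = np.array([0.05, 0.20, 0.10])

# Class proportions chosen during ACS (source) vs. met at deployment (target).
p_S = np.array([0.70, 0.20, 0.10])
p_T = np.array([0.40, 0.30, 0.30])

# Under prior probability shift, P(X = x | Y = y) is fixed, so the total
# loss L(h) = <p, l_h> changes only through the class proportions p.
L_S = float(p_S @ ell_h)
L_T = float(p_T @ ell_h)
print(L_S, L_T, abs(L_T - L_S))
```

Even though the classifier itself is unchanged, its total loss differs between the domains; this difference is exactly the inter-domain gap studied below.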
Recently, a PAC learning perspective [2] on this setting has provided us with Theorem 1. This result quantifies the difference in loss values L(h) between an ACS-generated training set D and the target domain T. Only if this difference is small can we expect to learn a classifier h from D that is accurate also with respect to T, similar to standard PAC learning theory. The key insight of this theorem is that the relevant loss difference between D and T continuously approaches the inter-domain gap |LT(h) − LS(h)|, which is independent of the random draw of D from S, as the training set size m increases. In ACS, this increase happens naturally while more and more data is actively being acquired, so that the error of any ACS-trained classifier is increasingly dominated by this gap. Since the inter-domain gap is constant with respect to the random draw of the training set D, it is also independent of ϵ, δ, and m.

Theorem 1 (Identical mechanism bound [2]). For any ϵ > 0 and any fixed h ∈ H, it holds with probability at least 1 − δ, where δ = 4e^(−2mϵ²), that

  |LT(h) − LS(h)| − ϵ ≤ |LT(h) − LD(h)| ≤ |LT(h) − LS(h)| + ϵ.

This theorem can be used to certify a trained classification model h with N classes in terms of a set of safe class proportions P ⊆ [0, 1]^N. By "safe", we mean that, during the deployment of h on T, the trained model induces, with a high probability, at most a small domain-induced error ϵ.

Definition 1 (Certified hypothesis [2]). A hypothesis h ∈ H is certified for all class proportions in P ⊆ [0, 1]^N if, with probability at least 1 − δ and ϵ, δ > 0,

  |LT(h) − LS(h)| ≤ ϵ  ∀ pT ∈ P.

Let pS, pT ∈ [0, 1]^N be vectors with components [p•]i = P•(Y = i), which express the probabilities of the class labels in the respective domains S and T. Furthermore, let ℓh ∈ R^N be a vector that represents the class-wise losses

  [ℓh]i = ℓX(h, i) = ∫X P(X = x | Y = i) · ℓ(h(x), i) dx,  (1)

according to some loss function ℓ. The total loss of the hypothesis h is then given by L•(h) = Σ_{i∈Y} [p•]i · [ℓh]i = ⟨p•, ℓh⟩. Consequently, the inter-domain gap for classification problems can be expressed as

  |LT(h) − LS(h)| = |⟨pT, ℓh⟩ − ⟨pS, ℓh⟩| = |⟨pT − pS, ℓh⟩| = |⟨d, ℓh⟩|,  (2)

where d = pT − pS is the difference between the class probabilities in the domains S and T.

In order to certify classification models, it is necessary to calculate Eq. 2. However, the true class-wise losses ℓh are unknown, and we can only estimate the empirical class-wise losses ℓ̂X(h, y) = (1 / my) · Σ_{i : yi = y} ℓ(h(xi), y) from a finite amount of labeled validation data. Therefore, our goal is to constrain Eq. 2 with the smallest upper bound that holds with a high probability.

For binary classification problems, the inter-domain gap can be factorized into a product of two scalars, ∆p · ∆ℓ. Here, ∆p = |pT − pS| ∈ R denotes the difference between class proportions and ∆ℓ = |[ℓh]2 − [ℓh]1| ∈ R denotes the difference between class-wise losses. A smallest upper bound ∆ℓ∗, which holds with probability 1 − δ, can be found for the empirical estimate ∆ℓ̂. Therefore, by Def. 1, binary classifiers can be certified as a function of ϵ and δ, where P is characterized by the range [pT^min, pT^max] of class proportions [2].

3 Certification in Multi-Class Classification

To certify multi-class classifiers according to Def. 1, an estimation of the inter-domain gap with multiple classes must be found. For this purpose, we will make use of Hölder's inequality [7], a fundamental inequality for the study of Lp spaces. This inequality will help us in using PAC bounds for multi-class certification, similar to the certification of binary classifiers.

Theorem 2 (Hölder's inequality [7]).
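The quantities of Eq. 2 are straightforward to compute from a labeled validation set. The following sketch estimates the empirical class-wise zero-one losses and the resulting inter-domain gap; the labels, predictions, and proportions are toy values for illustration only.

```python
import numpy as np

def class_wise_losses(y_true, y_pred, n_classes):
    """Empirical class-wise zero-one losses: one averaged loss per class."""
    return np.array([np.mean(y_pred[y_true == c] != c) for c in range(n_classes)])

# Toy validation set with three classes (all values hypothetical).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
ell_hat = class_wise_losses(y_true, y_pred, 3)

# Empirical inter-domain gap |<d, l_h>| from Eq. 2.
p_S = np.array([0.4, 0.3, 0.3])
p_T = np.array([0.2, 0.5, 0.3])
gap = abs((p_T - p_S) @ ell_hat)
print(ell_hat, gap)
```

Note that the true gap uses the unknown ℓh; the certificate below therefore replaces this plug-in estimate with a high-probability upper bound.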
Let (S, Σ, µ) be a measure space and let p, q ∈ [1, ∞] with 1/p + 1/q = 1, where 1/∞ = 0. Then, for all measurable real- or complex-valued functions f and g on S,

  ∥fg∥1 ≤ ∥f∥p · ∥g∥q.  (3)

With this inequality, the inter-domain gap from Eq. 2 can be transformed to

  |⟨d, ℓh⟩| ≤ ∥d∥1 · ∥ℓh∥∞  for p = 1, q = ∞,
  |⟨d, ℓh⟩| ≤ ∥d∥2 · ∥ℓh∥2  for p = 2, q = 2,  (4)
  |⟨d, ℓh⟩| ≤ ∥d∥∞ · ∥ℓh∥1  for p = ∞, q = 1.

In the following, we restrict ourselves to the Hölder conjugate p = ∞, q = 1. In principle, the other conjugate forms are also applicable. However, we will see that the infinity norm on d provides a simple and intuitive characterization of P. In order to yield a certified hypothesis, as according to Def. 1, it must hold, with a probability of at least 1 − δ, that, for ϵ, δ > 0,

  |LT(h) − LS(h)| ≤ ∥d∥∞ · ∥ℓh∥1 ≤ ϵ  ∀ pT ∈ P.  (5)

Like in the binary setting, only the empirical class-wise loss ℓ̂h is given. Hence, an upper bound ∥ℓh∥∗1 of the true norm ∥ℓh∥1, computed from the estimate ∥ℓ̂h∥1 and valid with a probability of at least 1 − δ, must be found. Each ℓ̂X(h, y) is associated with a positive corresponding error ϵy with δy = e^(−2·my·ϵy²). For a given probability budget of δ, we get the smallest upper bound ∥ℓh∥∗1 = ∥ℓ̂h∥1 + Σ_{y=1}^{N} ϵ∗y by minimizing Σ_{y=1}^{N} ϵy through the optimization problem

  min_{ϵ1,...,ϵN ∈ R}  Σ_{y=1}^{N} ϵy   s.t.   ϵ1, ..., ϵN ≥ τ   and   δ − Σ_{y=1}^{N} δy ≥ 0,   where δy = e^(−2·my·ϵy²),  (6)

and where strict inequalities are realized through non-strict inequalities with some sufficiently small τ > 0.

Let us now describe the set P of safe class proportions. In extension to the requirement given in Def. 1, P is supposed to cover all class proportions that are valid according to the certificate. With the minimum upper bound ∥ℓh∥∗1, we can rearrange Eq. 5 to

  ∥d∥∞ ≤ ϵ / ∥ℓh∥∗1  ∀ pT ∈ P.  (7)

By taking the infinity norm on d, ∥d∥∞ is determined by the class i with the largest absolute label distribution shift |[pT]i − [pS]i| = ∆p. In analogy to the binary certification, the range of safe deployment proportions for a class i can be described by [[pS]i − ∆p∗, [pS]i + ∆p∗] = [pT,i^min, pT,i^max]. Here, ∆p∗ = ϵ / ∥ℓh∥∗1 is constant for all classes and represents the largest absolute shift that a class is allowed to have to satisfy Eq. 5 with probability at least 1 − δ. Therefore,

  P = { p ∈ [0, 1]^N : [p]i ∈ [pT,i^min, pT,i^max] ∀ i ∈ {1, ..., N} and Σ_{i=1}^{N} [p]i = 1 }  (8)

defines the set of class proportions to which the certified classifier h is safely applicable.

Based on this approach, a variant of the certificate can be derived by modifying d slightly. For this modification, the negative vector components of d are set to zero, so that a vector d+ is formed. This variant is motivated by the observation that, by applying the norm to d, the negative loss components (falsely) contribute as positives to the estimation of the error. Accordingly, d+ addresses only the positive error component and allows a tighter estimate of the inter-domain gap. However, since with d+ only the positive error components are considered, the range of class proportions can no longer be expressed by [pT,i^min, pT,i^max], and Pd+ cannot be defined by Eq. 8. As a consequence, Pd+ is more difficult to characterize than P.

4 Experiments

In the following evaluation, we show that the introduced multi-class certificate indeed represents an upper bound of the inter-domain gap. Besides the correctness of the certificate, the accuracy and tightness of the estimated upper bound are inspected. Ideally, the certificates correspond to upper bounds that are both correct and tight.
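The steps above can be sketched in a few lines. Instead of solving the minimization problem (6), this sketch uses the even budget split δy = δ/N, which is a feasible (but not necessarily optimal) choice and therefore yields a valid, if possibly looser, bound ∥ℓh∥∗1; all input values are hypothetical.

```python
import numpy as np

def holder_certificate(ell_hat, m_y, eps, delta):
    """Bound ||l_h||_1 from empirical class-wise losses and derive Δp*.

    Simplification: instead of solving optimization problem (6), use the
    feasible even budget split δ_y = δ/N. Inverting δ_y = exp(-2 m_y ε_y²)
    gives ε_y = sqrt(ln(N/δ) / (2 m_y)) for each class y."""
    N = len(ell_hat)
    eps_y = np.sqrt(np.log(N / delta) / (2.0 * m_y))
    norm1_star = ell_hat.sum() + eps_y.sum()  # ||l_h||_1* = ||l_hat||_1 + Σ ε_y
    dp_star = eps / norm1_star                # Δp* = ε / ||l_h||_1*
    return norm1_star, dp_star

# Hypothetical empirical losses and per-class validation sample counts.
ell_hat = np.array([0.10, 0.15, 0.05])
m_y = np.array([400, 300, 300])
norm1_star, dp_star = holder_certificate(ell_hat, m_y, eps=0.1, delta=0.05)
print(norm1_star, dp_star)
```

Solving (6) exactly would shrink Σ ϵy further by allocating more of the budget δ to classes with few validation samples; the even split is merely the simplest feasible point.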
To this end, we randomly subsample the data to generate different deployment class proportions pT while keeping P(X = x | Y = y) fixed. To facilitate visualizations in two dimensions, we limit our evaluation to data sets with three classes. The implementation of our configurable experiments is available online¹.

¹ https://github.com/martinsenz/MultiClassAcsCertificates

Correctness

The certificate is correct if L̂S + ϵ ≥ L̂T holds, where ϵ is the predicted domain-induced error and L̂T is the empirical estimate of the target domain loss [2]. At this point, recognize that computing L̂T requires target domain data, which is typically not available in ACS. This unavailability raises the desire for an upper bound L̂S + ϵ of L̂T, which allows practitioners to assess, using only ACS-generated data from S, whether their classifier is sufficiently accurate on T. Our certificate is designed to provide this upper bound, and the purpose of our experiments is to validate this claim.

Our experiments cover a repeated three-fold cross validation on six data sets and two learning algorithms, to represent a broad range of scenarios. In total, we have generated 216 000 certificates under the zero-one loss with δ = 0.05. Among these certificates, only one failed, by producing an L̂S + ϵ that is smaller than L̂T. Due to the statistical nature of our certificates, δ = 0.05 would have allowed for up to 10 800 failures. Therefore, the number of failures is much smaller than expected. This small number of failures results from the coarse bound estimation that Hölder's inequality provides.

Tightness

A fair comparison between our certificates and our empirical estimate L̂T requires us to take the estimation error ϵT of the baseline, L̂T, into account [2]. This necessity stems from the fact that L̂T is just an estimate from a finite amount of data. Having access to labeled target domain data would thus yield an upper bound L̂T + ϵT of the true target domain error LT. We speak of a tight bound if L̂S + ϵ ≈ L̂T + ϵT.

For example, the prediction of the domain-induced error ϵ, as according to our certificate, can be inspected in Fig. 1. The prediction by our d+ certificate variant is shown in Fig. 2. As we can see, the upper bound is very tight in the area around pS. With increasing distance from pS, the estimation of the upper bound becomes larger, and hence, the upper bound becomes coarser. As expected, the variant using the d+ vector provides a finer bound of the inter-domain gap in some regions. Tab. 2 summarizes the absolute deviations between L̂S + ϵ and L̂T + ϵT in terms of the mean absolute deviation (MAD) and quartiles (Q1, Q2, Q3).

Fig. 1: The certified domain-induced error ϵ (left) and its tightness |L̂S + ϵ − LT| (right), according to ∥d∥∞ · ∥ℓh∥1 on the satimage data set, using a logistic regression and the zero-one loss.

Fig. 2: The certified domain-induced error ϵ (left) and its tightness |L̂S + ϵ − LT| (right), according to the variant ∥d+∥∞ · ∥ℓh∥1 on the satimage data set, using a logistic regression and the zero-one loss.

Table 1: Feasible class proportions ∆p∗, according to ∥d∥∞ · ∥ℓh∥1 certificates, which are computed for a zero-one loss with ϵ = 0.1 and δ = 0.05.

data set       classifier          LS(h)       pS⊤                 ∆p∗
optdigits      DecisionTree        0.100225    [0.70, 0.20, 0.10]  0.174888
optdigits      LogisticRegression  0.0955851   [0.70, 0.20, 0.10]  0.19131
satimage       DecisionTree        0.10647     [0.58, 0.31, 0.11]  0.2285
satimage       LogisticRegression  0.106242    [0.58, 0.31, 0.11]  0.217247
pendigits      DecisionTree        0.0467701   [0.70, 0.20, 0.10]  0.34621
pendigits      LogisticRegression  0.160971    [0.70, 0.20, 0.10]  0.137843
eye movements  DecisionTree        0.488188    [0.35, 0.26, 0.40]  0.0652661
eye movements  LogisticRegression  0.515892    [0.35, 0.26, 0.40]  0.061098
shuttle        DecisionTree        0.00521672  [0.15, 0.79, 0.06]  1.29197
shuttle        LogisticRegression  0.0573444   [0.15, 0.79, 0.06]  0.251336
connect4       DecisionTree        0.297242    [0.65, 0.1, 0.25]   0.0700096
connect4       LogisticRegression  0.343249    [0.65, 0.1, 0.25]   0.0491499

Table 2: MAD and quartiles of the absolute difference between L̂S + ϵ and L̂T + ϵT.

data set       method           MAD              Q1      Q2      Q3
optdigits      ∥d∥∞ · ∥ℓh∥1     0.2023 ± 0.0954  0.1258  0.2015  0.2744
optdigits      ∥d+∥∞ · ∥ℓh∥1    0.18 ± 0.1006    0.1009  0.1607  0.2471
satimage       ∥d∥∞ · ∥ℓh∥1     0.1809 ± 0.087   0.1218  0.1741  0.2324
satimage       ∥d+∥∞ · ∥ℓh∥1    0.1661 ± 0.0908  0.0999  0.1513  0.22
pendigits      ∥d∥∞ · ∥ℓh∥1     0.1999 ± 0.1445  0.1034  0.168   0.2357
pendigits      ∥d+∥∞ · ∥ℓh∥1    0.1703 ± 0.1452  0.0751  0.1319  0.2034
eye movements  ∥d∥∞ · ∥ℓh∥1     0.5433 ± 0.245   0.3639  0.5239  0.7331
eye movements  ∥d+∥∞ · ∥ℓh∥1    0.5207 ± 0.2643  0.3058  0.5006  0.7369
shuttle        ∥d∥∞ · ∥ℓh∥1     0.0879 ± 0.0836  0.0315  0.0531  0.1203
shuttle        ∥d+∥∞ · ∥ℓh∥1    0.071 ± 0.0792   0.0223  0.0424  0.0825
connect4       ∥d∥∞ · ∥ℓh∥1     0.5094 ± 0.221   0.3419  0.5167  0.6541
connect4       ∥d+∥∞ · ∥ℓh∥1    0.4331 ± 0.2515  0.2274  0.405   0.613

5 Conclusion and Outlook

Using Hölder's inequality and considering PAC bounds, we have proposed an upper bound ∥d∥∞ · ∥ℓh∥1, from which certificates of model robustness in multi-class ACS can be issued. Our experiments demonstrate that this certification is correct within a probability budget δ. Moreover, safe class proportions can easily be described by the maximum allowable absolute deviation ∆p∗. Thus, the certification of a multi-class ACS classifier is straightforward for the practitioner to interpret and intuitive to understand, regardless of the number of classes considered in the classification problem.
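To illustrate this practitioner's view, the correctness criterion of Section 4 can be probed in a small simulation that samples deployment proportions from the certified box around pS and checks LS + ϵ against the resulting target loss. All numbers are hypothetical and not taken from our experiments.

```python
import numpy as np

rng = np.random.default_rng(42)

ell_h = np.array([0.10, 0.15, 0.05])  # hypothetical class-wise losses
p_S = np.array([0.5, 0.3, 0.2])       # source (training) proportions
eps, dp_star = 0.1, 0.18              # certified error budget and safe radius

L_S = float(p_S @ ell_h)
violations = 0
for _ in range(1000):
    # sample deployment proportions from the certified box around p_S
    p_T = np.clip(p_S + rng.uniform(-dp_star, dp_star, size=3), 0.0, None)
    p_T = p_T / p_T.sum()
    if L_S + eps < float(p_T @ ell_h):  # is the certificate violated?
        violations += 1
print(violations)
```

With the true class-wise losses fixed, no violations occur inside the certified box; randomness, and hence the probability budget δ, enters only through the empirical estimation of the class-wise losses.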
By decomposing the inter-domain gap into positive and negative error components, it is possible to find estimates that bound the domain gap even more precisely. An example is the presented d+ certification variant, which considers only the positive error components. In order to obtain even more precise estimates, it is further conceivable to also take the negative error components (correctly) into account. However, as already indicated by the d+ variant, the complexity of describing the set P of valid class proportions increases with the expressive strength of the upper bound. In future work, we plan to evaluate more precise estimates of this kind, as well as the other bounds that are provided by Hölder's inequality in Eq. 4. We also plan to use our multi-class certificates as a basis for theoretically justified data acquisition strategies for multi-class ACS, similar to the binary acquisition strategy that is based on binary certificates [1].

References

1. Bunse, M., Morik, K.: Active class selection with uncertain deployment class proportions. In: Workshop on Interactive Adaptive Learning. p. 70 (2021)
2. Bunse, M., Morik, K.: Certification of model robustness in active class selection. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 266–281. Springer (2021)
3. Huang, X., Kroening, D., Ruan, W., Sharp, J., Sun, Y., Thamo, E., Wu, M., Yi, X.: A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Computer Science Review 37 (2020)
4. Lomasky, R., Brodley, C.E., Aernecke, M., Walt, D., Friedl, M.: Active class selection. In: European Conference on Machine Learning. pp. 640–647. Springer (2007)
5. Moreno-Torres, J.G., Raeder, T., Alaíz-Rodríguez, R., Chawla, N.V., Herrera, F.: A unifying view on dataset shift in classification.
Pattern Recognition 45(1), 521–530 (2012) 6. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359 (2010) 7. Yang, W.H.: On generalized Hölder inequality. Nonlinear Analysis: Theory, Methods & Applications 16(5), 489–498 (1991) 8. Zhang, K., Schölkopf, B., Muandet, K., Wang, Z.: Domain adaptation under target and conditional shift. In: International Conference on Machine Learning. JMLR Workshop and Conference Proceedings, vol. 28, pp. 819–827 (2013)