<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Certifiable Active Class Selection in Multi-Class Classification</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund University, Artificial Intelligence Group</institution>
          ,
          <addr-line>D-44227 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>68</fpage>
      <lpage>76</lpage>
      <abstract>
<p>Active class selection (ACS) requires the developer of a classifier to actively choose the class proportions of the training data. This freedom of choice puts the trust in the trained classifier at risk if the true class proportions, which occur during deployment, are subject to uncertainties. This issue has recently motivated a certificate for ACS-trained classifiers, which builds trust by proving that a classifier is sufficiently correct within a specific set of class proportions and with a high probability. However, this certificate was only developed in the context of binary classification. In this paper, we employ Hölder's inequality to extend the binary ACS certificate to multi-class settings. We demonstrate that our extension indeed provides correct and tight upper bounds of the classifier's error. We conclude with several directions for future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Active class selection</kwd>
        <kwd>Prior probability shift</kwd>
        <kwd>Multi-class classification</kwd>
        <kwd>Model certification</kwd>
        <kwd>Learning theory</kwd>
        <kwd>Validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The ongoing deployment of machine learning models in real-world
applications increases the importance of validating these models thoroughly. Ideally, the
robustness of these models against distribution shifts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is certified in the sense
of being formally proven or extensively tested [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In active class selection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a class-conditional data generator is repeatedly
asked to produce feature vectors for arbitrarily chosen classes. In this setting,
the developer of a classifier must actively decide for the class proportions in
which the training data set is produced. While this freedom can reduce the data
acquisition cost while improving classification performance, it also puts the trust
in the trained classifier at risk: what if the class proportions, which occur during
deployment, are not precisely known or are even subject to changes?
      </p>
      <p>
        These uncertainties have motivated a certificate for ACS-trained classifiers,
which declares a set of class proportions to which a classifier is safely applicable
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In particular, the certified classifier is required to exhibit an ACS-induced
error of at most some ϵ &gt; 0, with a probability of at least 1 − δ. However, this
certificate was only developed in the context of binary classification; a multi-class
certificate has not yet been proposed, to the best of our knowledge.
      </p>
      <p>© 2022 for this paper by its authors. Use permitted under CC BY 4.0.</p>
      <p>In this paper, we close the gap between ACS model certification and
multi-class classification. In the following, we recapitulate the theoretical background
of binary ACS certification in Section 2 before we develop our multi-class ACS
certificate in Section 3. We validate our claims empirically in Section 4 before
we conclude with Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Theoretical Background</title>
      <p>
        The term “domain”, as used in domain adaptation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], describes a probability
density function over X × Y, where X is the feature space and Y is the label
space. In ACS, we assume that the source domain S, where a machine learning
model is trained, differs from the target domain T, where the model is deployed,
only in terms of the class proportions pS ̸= pT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Such deviations, also known
as target shift [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or as prior probability shift [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], occur due to the freedom of
choosing any pS for the acquisition of training data. We are interested in the
impact of such deviations on the classification performance with respect to T .
      </p>
      <p>
        Recently, a PAC learning perspective [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on this setting has provided us with
Theorem 1. This result quantifies the difference in loss values L(h) between an
ACS-generated training set D and the target domain T. Only if this difference
is small can we expect to learn a classifier h from D that is accurate also with
respect to T, similar to standard PAC learning theory. The key insight of this
theorem is that the relevant loss difference between D and T continuously
approaches the inter-domain gap |LT(h) − LS(h)|, which is independent of the
random draw of D from S, as the training set size m increases. In ACS, this
increase happens naturally as more and more data is actively being acquired,
so that the error of any ACS-trained classifier is increasingly dominated by this
gap. Since the inter-domain gap is constant with respect to the random draw of
the training set D, it is also independent of ϵ, δ, and m.
      </p>
      <p>
        Theorem 1 (Identical mechanism bound [
        <xref ref-type="bibr" rid="ref2">2</xref>
]). For any ϵ &gt; 0 and any fixed
h ∈ H, it holds with probability at least 1 − δ, where δ = 4e^{−2mϵ²}, that

|LT(h) − LS(h)| − ϵ ≤ |LT(h) − LD(h)| ≤ |LT(h) − LS(h)| + ϵ.
      </p>
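      <p>
        As a hedged illustration of Theorem 1 (the ϵ and δ values below are arbitrary examples, not values from the paper), the relation δ = 4e^{−2mϵ²} can be inverted to estimate the training set size m that a given budget demands:
      </p>
      <p>
```python
import math

# Theorem 1 relates the training set size m to the probability
# δ = 4·exp(−2·m·ϵ²) with which the identical mechanism bound may fail.

def delta_for(m: int, eps: float) -> float:
    """Failure probability δ for a training set of size m and tolerance ϵ."""
    return 4.0 * math.exp(-2.0 * m * eps ** 2)

def m_for(delta: float, eps: float) -> int:
    """Smallest m such that the bound holds with probability at least 1 − δ."""
    return math.ceil(math.log(4.0 / delta) / (2.0 * eps ** 2))

# Example budget (arbitrary): ϵ = 0.05 and δ = 0.05 require m = 877 samples.
m = m_for(delta=0.05, eps=0.05)
print(m)  # 877
```
      </p>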
      <p>
        This theorem can be used to certify a trained classification model h with N
classes in terms of a set of safe class proportions P ⊆ [0, 1]^N. By “safe”, we mean
that, during the deployment of h on T, the trained model induces, with a high
probability, at most a small domain-induced error ϵ.
      </p>
      <p>
        Definition 1 (Certified hypothesis [
        <xref ref-type="bibr" rid="ref2">2</xref>
]). A hypothesis h ∈ H is certified for
all class proportions in P ⊆ [0, 1]^N if, with probability at least 1 − δ and ϵ, δ &gt; 0,

|LT(h) − LS(h)| ≤ ϵ.
      </p>
      <p>
        Let pS, pT ∈ [0, 1]^N be vectors with components [p•]i = P•(Y = i), which
express the probabilities of the class labels in the respective domains S and T.
Furthermore, let ℓh ∈ R^N be a vector that represents the class-wise losses

[ℓh]i = ℓX(h, i) = ∫_X P(X = x | Y = i) · ℓ(h(x), i) dx,   (1)

according to some loss function ℓ. The total loss of the hypothesis h is then
given by L•(h) = Σ_{i∈Y} [p•]i [ℓh]i = ⟨p•, ℓh⟩. Consequently, the inter-domain gap
for classification problems can be expressed as

|LT(h) − LS(h)| = |⟨pT, ℓh⟩ − ⟨pS, ℓh⟩| = |⟨pT − pS, ℓh⟩| = |⟨d, ℓh⟩|,   (2)

where d = pT − pS is the difference between the class probabilities in the
domains S and T.
      </p>
      <p>In order to certify classification models, it is necessary to calculate Eq. 2.
However, the true class-wise losses ℓh are unknown, and we can only estimate
the empirical class-wise losses ℓ̂X(h, y) = (1/m_y) Σ_{i : y_i = y} ℓ(y, h(x_i)) from a finite
amount of labeled validation data. Therefore, our goal is to constrain Eq. 2 with
the smallest upper bound that holds with a high probability.</p>
      <p>
        For binary classification problems, the inter-domain gap can be factorized
into a product of two scalars, ∆p · ∆ℓ. Here, ∆p = |pT − pS| ∈ R denotes
the difference between class proportions and ∆ℓ = |ℓ_{Y=2}(h) − ℓ_{Y=1}(h)| ∈ R
denotes the difference between class-wise losses. A smallest upper bound ∆ℓ*,
which holds with probability 1 − δ, can be found for the empirical estimate ∆ℓ̂.
Therefore, by Def. 1, binary classifiers can be certified as a function of ϵ and δ,
where P is characterized by a range [p_{T,min}, p_{T,max}] of class proportions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
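      <p>
        The binary factorization can be checked numerically; the following sketch uses invented losses and proportions:
      </p>
      <p>
```python
import numpy as np

# Binary case: the inter-domain gap factorizes as |⟨d, ℓh⟩| = ∆p · ∆ℓ.
# Illustrative (made-up) class-wise losses and class proportions.
ell_h = np.array([0.10, 0.30])   # losses of class 1 and class 2
p_S = np.array([0.50, 0.50])     # training proportions
p_T = np.array([0.65, 0.35])     # deployment proportions

gap = float(abs((p_T - p_S) @ ell_h))        # |⟨d, ℓh⟩|
delta_p = float(abs(p_T[0] - p_S[0]))        # ∆p, shift of class proportions
delta_ell = float(abs(ell_h[1] - ell_h[0]))  # ∆ℓ, gap between class-wise losses

# The factorization holds because d = (∆p, −∆p) in the binary case.
assert np.isclose(gap, delta_p * delta_ell)
print(round(delta_p * delta_ell, 6))  # 0.03
```
      </p>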
    </sec>
    <sec id="sec-3">
      <title>Certification in Multi-Class Classification</title>
      <p>
        To certify multi-class classifiers according to Def. 1, an estimate of the
inter-domain gap with multiple classes must be found. For this purpose, we will make
use of Hölder’s inequality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a fundamental inequality for the study
of Lp spaces. This inequality will help us use PAC bounds for multi-class
certification, similar to the certification of binary classifiers.
      </p>
      <p>
        Theorem 2 (Hölder’s inequality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Let (S, Σ, µ) be a measure space and
let p, q ∈ [1, ∞] with 1/p + 1/q = 1, where 1/∞ = 0. Then, for all measurable
real- or complex-valued functions f and g on S, ∥fg∥1 ≤ ∥f∥p ∥g∥q.
      </p>
      <p>With this inequality, the inter-domain gap from Eq. 2 can be transformed to

|⟨d, ℓh⟩| ≤ ∥d∥1 · ∥ℓh∥∞   for p = 1, q = ∞,   (3)
|⟨d, ℓh⟩| ≤ ∥d∥2 · ∥ℓh∥2   for p = 2, q = 2,   (4)
|⟨d, ℓh⟩| ≤ ∥d∥∞ · ∥ℓh∥1   for p = ∞, q = 1.   (5)</p>
      <p>In the following we restrict ourselves to the consideration of the Hölder
conjugate p = ∞, q = 1. In principle, the other conjugate forms are also applicable.
However, we will see that the infinity norm on d provides a simple and intuitive
characterization of P.</p>
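      <p>
        All three conjugate pairs can be verified numerically; d and ℓh below are arbitrary example vectors, not values from our experiments:
      </p>
      <p>
```python
import numpy as np

# Arbitrary example vectors: d = pT − pS (components sum to zero) and
# class-wise losses ℓh.
d = np.array([0.15, -0.05, -0.10])
ell_h = np.array([0.10, 0.30, 0.20])

gap = float(abs(d @ ell_h))  # |⟨d, ℓh⟩|

# The three Hölder conjugate pairs applied to the inter-domain gap.
bounds = {
    "p=1, q=inf": np.linalg.norm(d, 1) * np.linalg.norm(ell_h, np.inf),
    "p=2, q=2": np.linalg.norm(d, 2) * np.linalg.norm(ell_h, 2),
    "p=inf, q=1": np.linalg.norm(d, np.inf) * np.linalg.norm(ell_h, 1),
}

for name, bound in bounds.items():
    assert bound >= gap  # every conjugate pair yields a valid upper bound
    print(name, round(float(bound), 4))
```
      </p>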
      <p>In order to yield a certified hypothesis, as according to Def. 1, it must hold,
with a probability of at least 1 − δ, that, for ϵ, δ &gt; 0,

|LT(h) − LS(h)| ≤ ∥d∥∞ · ∥ℓh∥1 ≤ ϵ.</p>
      <p>
        Let us now describe the set P of safe class proportions. In extension to the
requirement given in Def. 1, P is supposed to cover all class proportions that
are valid according to the certificate. With the minimum upper bound ∥ℓh∥1*, we
can rearrange Eq. 5 to ∥d∥∞ ≤ ϵ / ∥ℓh∥1* for all pT ∈ P.
      </p>
      <p>
        By taking the infinity norm of d, ∥d∥∞ reduces to the class i which has the
largest absolute label distribution shift |[pT]i − [pS]i| = ∆p_i. In analogy to the
binary certification, the range of safe deployment proportions for a class i can
be described by [[pS]i − ∆p*, [pS]i + ∆p*] = [p_{T,i}^{min}, p_{T,i}^{max}]. Here, ∆p* = ϵ / ∥ℓh∥1*
is constant for all classes and represents the largest absolute shift that a class is
allowed to have to satisfy Eq. 5 with probability at least 1 − δ. Therefore,

P = { p ∈ [0, 1]^N : [p]_i ∈ [p_{T,i}^{min}, p_{T,i}^{max}] ∀ i ∈ {1, …, N} and Σ_{i=1}^N [p]_i = 1 },   (8)

where strict inequalities are realized through non-strict inequalities with some
sufficiently small τ &gt; 0, defines the set of class proportions to which the certified
classifier h is safely applicable.
      </p>
      <p>
        Based on this approach, a variant of the certificate can be derived by
modifying d slightly. For this modification, the negative vector components of d are set
to zero, so that a vector d+ is formed. This variant is motivated by the
observation that, by applying the norm to d, the negative error components (falsely)
contribute as positives to the estimation of the error. Accordingly, d+ addresses
only the positive error components and allows a tighter estimate of the
inter-domain gap. However, since with d+ only the positive error components
are considered, the range of class proportions can no longer be expressed by
[p_{T,i}^{min}, p_{T,i}^{max}] and P_{d+} cannot be defined by Eq. 8. As a consequence, P_{d+} is more
difficult to characterize than P.
      </p>
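      <p>
        To make the characterization of P concrete, the following sketch assumes illustrative values for ϵ and for a validation estimate ∥ℓh∥1* (both invented, not from our experiments), and also contrasts the ∥d∥∞ bound with its d+ variant:
      </p>
      <p>
```python
import numpy as np

# Assumed quantities: training proportions pS, a high-probability upper bound
# ell_1_star on ∥ℓh∥1 (estimated from validation data), and an error budget ϵ.
p_S = np.array([1 / 3, 1 / 3, 1 / 3])
ell_1_star = 0.60
eps = 0.06

# Largest allowed absolute per-class shift: ∆p* = ϵ / ∥ℓh∥1*.
dp_star = eps / ell_1_star

def is_certified_safe(p_T: np.ndarray) -> bool:
    """Membership test for the set P of Eq. 8."""
    in_ranges = bool(np.all(dp_star + 1e-12 >= np.abs(p_T - p_S)))
    sums_to_one = bool(np.isclose(np.sum(p_T), 1.0))
    return in_ranges and sums_to_one

print(is_certified_safe(np.array([0.40, 0.30, 0.30])))  # True: max shift 0.067
print(is_certified_safe(np.array([0.50, 0.25, 0.25])))  # False: shift 0.167

# Variant: zeroing the negative components of d gives the d+ bound, which is
# tighter whenever the largest absolute shift is a decrease of a class.
d = np.array([0.20, 0.40, 0.40]) - p_S
d_plus = np.clip(d, 0.0, None)
bound = float(np.max(np.abs(d)) * ell_1_star)    # ∥d∥∞ · ∥ℓh∥1*
bound_plus = float(np.max(d_plus) * ell_1_star)  # ∥d+∥∞ · ∥ℓh∥1*
assert bound >= bound_plus
```
      </p>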
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In the following evaluation, we show that the introduced multi-class certificate
indeed represents an upper bound of the inter-domain gap. Besides the
correctness of the certificate, the accuracy and tightness of the estimated upper bound
are inspected. Ideally, the certificates correspond to upper bounds that are both
correct and tight. To this end, we randomly subsample the data to generate
different deployment class proportions pT while keeping P(X = x | Y = y) fixed.
To facilitate visualizations in two dimensions, we limit our evaluation to data
sets with three classes. The implementation of our configurable experiments is
available online.¹</p>
      <p>
        Correctness
The certificate is correct if LˆS + ϵ ≥ LˆT holds, where ϵ is the predicted
domain-induced error and LˆT is the empirical estimate of the target domain loss [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
At this point, recognize that computing LˆT requires target domain data, which
is typically not available in ACS. This unavailability raises the desire for an
upper bound LˆS + ϵ of LˆT, which allows practitioners to assess, using only
ACS-generated data from S, whether their classifier is sufficiently accurate on T.
Our certificate is designed to provide this upper bound, and the purpose of our
experiments is to validate this claim.
      </p>
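      <p>
        The correctness criterion and the failure budget under δ reduce to a few lines; the loss values below are invented for illustration:
      </p>
      <p>
```python
# Invented example values: a certificate issued from source-domain data, and an
# empirical target loss measured only to evaluate the certificate afterwards.
L_hat_S = 0.18  # empirical source-domain loss
eps = 0.05      # certified domain-induced error ϵ
L_hat_T = 0.21  # empirical target-domain loss (unavailable during ACS)

# The certificate is counted as correct if its upper bound holds.
is_correct = (L_hat_S + eps >= L_hat_T)
print(is_correct)  # True, since 0.23 >= 0.21

# With δ = 0.05, up to a 5% share of certificates may fail by design:
n_certificates = 216_000
max_allowed_failures = int(n_certificates * 0.05)
print(max_allowed_failures)  # 10800
```
      </p>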
      <p>Our experiments cover a repeated three-fold cross-validation on six data sets
and two learning algorithms, to represent a broad range of scenarios. In total, we
have generated 216 000 certificates under the zero-one loss with δ = 0.05. Among
these certificates, only one failed, by producing an LˆS + ϵ that is larger than LˆT .
Due to the statistical nature of our certificates, δ = 0.05 would have allowed
for up to 10 800 failures. Therefore, the number of failures is much smaller than
expected. This small number of failures results from the coarse bound estimation
that Hölder’s inequality provides.</p>
      <p>
        Tightness
A fair comparison between our certificates and our empirical estimate
requires us to take the estimation error ϵT of the baseline, LˆT, into account [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This necessity stems from the fact that LˆT is just an estimate from a finite
amount of data. Having access to labeled target domain data would thus yield
an upper bound LˆT + ϵT of the true target domain error LT. We speak of a tight
bound if LˆS + ϵ ≈ LˆT + ϵT.
      </p>
      <p>¹ https://github.com/martinsenz/MultiClassAcsCertificates</p>
      <p>Fig. 1: The certified error ϵ (left) and its tightness |LˆS + ϵ − LT| (right), according to ∥d∥∞ · ∥ℓh∥1, on the satimage data set, using a logistic regression and the zero-one loss.</p>
      <p>Fig. 2: The certified error ϵ (left) and its tightness |LˆS + ϵ − LT| (right), according to the variant ∥d+∥∞ · ∥ℓh∥1, on the satimage data set, using a logistic regression and the zero-one loss.</p>
      <p>For example, the prediction of the domain-induced error ϵ, as according to
our certificate, can be inspected in Fig. 1. The prediction by our d+ certificate
variant is shown in Fig. 2. As we can see, the upper bound is very tight in the
area around pS. With increasing distance from pS, the estimation of the upper
bound becomes larger, and hence, the upper bound becomes coarser. As expected,
the variant using the d+ vector provides a finer bound of the inter-domain gap
in some regions. Tab. 2 summarizes the absolute deviations between
LˆS + ϵ and LˆT + ϵT in terms of the mean absolute deviation (MAD) and quartiles
(Q1, Q2, Q3).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Outlook</title>
      <p>Using Hölder’s inequality and considering PAC bounds, we have proposed an
upper bound ∥d∥∞ · ∥ℓh∥1, from which certificates of model robustness in
multi-class ACS can be issued. Our experiments demonstrate that this certification
is correct within a probability budget δ. Moreover, safe class proportions can
easily be described by the maximum allowable absolute deviation ∆p*. Thus, the
certification of a multi-class ACS classifier is straightforward for the practitioner
to interpret and intuitive to understand, regardless of the number of classes
considered in the classification problem.</p>
      <p>By decomposing the inter-domain gap into positive and negative error
components, it is possible to find estimates that bound the domain gap even more
precisely. An example is the presented d+ certification variant, which considers
only the positive error components. In order to obtain even more precise
estimates, it is further conceivable to also take the negative error components
(correctly) into account. However, as already indicated by the d+ variant, the
complexity of describing the set P of valid class proportions increases with the
expressiveness of the upper bound.</p>
      <p>
        In future work, we plan to evaluate more precise estimates of this kind, as
well as the other bounds that are provided by Hölder’s inequality in Eqs. 3 and 4. We
also plan to use our multi-class certificates as a basis for theoretically justified
data acquisition strategies for multi-class ACS, similar to the binary acquisition
strategy that is based on binary certificates [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bunse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Active class selection with uncertain deployment class proportions</article-title>
          .
          <source>In: Workshop on Interactive Adaptive Learning</source>
          . p.
          <volume>70</volume>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bunse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Certification of model robustness in active class selection</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>266</fpage>
          -
          <lpage>281</lpage>
          . Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kroening</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thamo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>X.:</given-names>
          </string-name>
          <article-title>A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability</article-title>
          .
          <source>Computer Science Review</source>
          <volume>37</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lomasky</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodley</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aernecke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Active class selection</article-title>
          .
          <source>In: European Conference on Machine Learning</source>
          . pp.
          <fpage>640</fpage>
          -
          <lpage>647</lpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Moreno-Torres</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raeder</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alaíz-Rodríguez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A unifying view on dataset shift in classification</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>521</fpage>
          -
          <lpage>530</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>A survey on transfer learning</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>22</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1345</fpage>
          -
          <lpage>1359</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>W.H.</given-names>
          </string-name>
          :
          <article-title>On generalized Hölder inequality</article-title>
          .
          <source>Nonlinear Analysis: Theory, Methods &amp; Applications</source>
          <volume>16</volume>
          (
          <issue>5</issue>
          ),
          <fpage>489</fpage>
          -
          <lpage>498</lpage>
          (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schölkopf</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muandet</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Domain adaptation under target and conditional shift</article-title>
          .
          <source>In: International Conference on Machine Learning. JMLR Workshop and Conference Proceedings</source>
          , vol.
          <volume>28</volume>
          , pp.
          <fpage>819</fpage>
          -
          <lpage>827</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>