<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Transductive Model Selection under Prior Probability Shift</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Volpi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Moreo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Sebastiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>Via Giuseppe Moruzzi 1, 56124, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Transductive learning is a supervised machine learning task in which, unlike in traditional inductive learning, the unlabelled data that require labelling are a finite set and are available at training time. Similarly to inductive learning contexts, transductive learning contexts may be affected by dataset shift, i.e., may be such that the assumption according to which the training data and the unlabelled data are independently and identically distributed (IID) does not hold. We here propose a method, tailored to transductive classification contexts, for performing model selection (i.e., hyperparameter optimisation) when the data exhibit prior probability shift, an important type of dataset shift typical of anti-causal learning problems. In our proposed method the hyperparameters can be optimised directly on the unlabelled data to which the trained classifier must be applied; this is unlike traditional model selection methods, which are based on performing cross-validation on the labelled training data. By tailoring model selection to the actual test distribution, our approach contributes to the trustworthiness of AI systems, as it enables more reliable and robust classifier deployment under changed conditions. We provide experimental results that show the benefits brought about by our method.</p>
      </abstract>
      <kwd-group>
        <kwd>Model selection</kwd>
        <kwd>Hyperparameter optimisation</kwd>
        <kwd>Classifier accuracy prediction</kwd>
        <kwd>Dataset shift</kwd>
        <kwd>Prior probability shift</kwd>
        <kwd>Transductive learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A key requirement for trustworthy AI is robustness under dataset shift (i.e., robustness to scenarios in
which the assumption that data are independently and identically distributed (IID) does not hold), as
models trained and validated under the IID assumption often face reliability issues when deployed in
real-world, dynamic scenarios. In applications where the distribution of the priors may vary unexpectedly,
relying on traditional cross-validation for accuracy evaluation and, as a consequence, for hyperparameter
selection, can result in biased and misleading estimates of model performance.</p>
      <p>Consider the outbreak of an epidemic, in which the prevalence of individuals affected by an infectious
disease rapidly increases while the distribution of the symptoms (i.e., the effects) across the affected
individuals remains unchanged. In such a scenario, we may want to train a classifier that, from the
symptoms an individual displays, infers whether they are affected by the disease. We consider
the situation in which (i) the data to be classified arrive in successive batches 𝑈1, 𝑈2, . . ., and (ii)
given the epidemic, we may expect the prevalence of the affected individuals to evolve rapidly across
batches. Assume that, for training the classifier, and in particular for selecting the combination of
hyperparameters expected to yield the best accuracy under the new (i.e., epidemic) conditions, we have
access to training data collected in the old (i.e., pre-epidemic) conditions. This scenario is problematic,
as the training data and the unlabelled data to be classified are not IID, due to the fact that the prevalence
of the individuals affected by the disease has changed from the training data to the unlabelled data. The
classifier, and the chosen hyperparameter combination, may thus prove suboptimal once used on the
unlabelled data.</p>
      <p>In this paper we present a method for optimising the hyperparameters (i.e., for performing model
selection – MS) directly on the batch of unlabelled data that need to be classified. For reasons that will be
explained in Section 2, we call this task transductive model selection (TMS). A TMS method has obvious
advantages over the standard inductive model selection (IMS) method (which relies on cross-validation on
the training data), since the chosen hyperparameters are tailored to the batch 𝑈 of unlabelled data that
need to be classified, and can thus deliver better performance on 𝑈 than hyperparameters chosen via
standard IMS. By moving from a “one-size-fits-all” approach to a context-aware solution, our method
enhances the reliability of model selection under dataset shift, thereby contributing to more robust and
trustworthy decision-making processes.</p>
      <p>
        Our proposed method is based on techniques for classifier accuracy prediction (CAP) under dataset
shift [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], and essentially consists of (i) predicting the accuracy that different classifiers, instantiated
with different choices of hyperparameters, would obtain on our “out-of-distribution” batch 𝑈, and (ii)
classifying the data in batch 𝑈 using the classifier whose predicted accuracy is highest. In other words,
our TMS method replaces traditional accuracy computation on labelled data with accuracy estimation
on unlabelled data. While the method is generic, we here restrict our attention to the case in which the
data are affected by prior probability shift (PPS), an important type of violation of the IID assumption.
      </p>
      <p>For high-stakes applications such as the healthcare-related classification task discussed above, this
suggests a policy of (a) training, on the labelled data, multiple classifiers, each characterised by a
different combination of hyperparameters (note that this step is performed anyway during traditional
hyperparameter optimisation), (b) storing these classifiers for later use, and, every time a
new batch 𝑈 of unlabelled data becomes available, (c) estimating (via CAP techniques) the accuracy
that the different classifiers would have on 𝑈, and (d) classifying the data in 𝑈 via the classifier whose
estimated accuracy is highest. In this way, assuming that Step (b) can be carried out efficiently, the
classification of newly arrived data can be performed immediately and, as we will show, with a much
higher accuracy than can be obtained via the traditional model selection method. Note that Step (a) is
carried out only once, since we do not assume new labelled data to become available during the process.</p>
      <p>The rest of the paper is organised as follows. Section 2 introduces the notation and provides a
detailed description of the proposed method. Section 3 presents the experiments we have carried out
and discusses the results we have obtained. Section 4 concludes the paper with a summary of our
findings and a discussion of potential applications of this method.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Transductive Model Selection under Prior Probability Shift</title>
      <p>In supervised machine learning we use training data to learn the internal parameters of the model
(e.g., the weights of a neural network, or the coefficients of a hyperplane in a support vector machine).
Many models trained in this way also rely on a set of hyperparameters (e.g., the learning rate in
neural training, or the trade-off between margin and training error in support vector machines) that
impose higher-level constraints on the learning process. Unlike the internal parameters of the model,
hyperparameters are not learned during training, but must be set in advance. Finding good values for
the hyperparameters is crucial for achieving good performance. Model selection, the task of choosing
the values of the hyperparameters, is typically carried out by (a) testing the accuracy of the model
under different combinations of hyperparameter values using cross-validation on labelled data, and (b)
choosing the combination of hyperparameter values that maximizes model accuracy.</p>
      <p>
        Relying on labelled data to evaluate differently configured models requires the labelled data to be
representative of the unlabelled data the trained model will be applied to, a distributional assumption
typically referred to as the IID assumption. Unfortunately, in real-world problems this assumption is
often violated; in this case, the training data are not representative of the unlabelled data (which are
thus said to be “out-of-distribution” data), and we say that the problem is affected by dataset shift [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
In problems characterised by dataset shift, cross-validation on training data is thus a biased estimator
of model accuracy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and this often leads to suboptimal choices of the hyperparameter values.
      </p>
      <p>
        In classification, one type of dataset shift of particular relevance is prior probability shift (PPS) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
also known as label shift [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This type of shift (sometimes considered the “paradigmatic” case of dataset
shift in classification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) is characteristic of anti-causal learning problems [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (also known as 𝑌 → 𝑋
problems [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where 𝑌 is a random variable ranging on the class labels and 𝑋 is a random variable
ranging on vectors of covariates), i.e., problems where the goal is to predict the causes of a phenomenon
from its observed effects.
      </p>
      <p>PPS is characterised by two distributional assumptions, often called the PPS assumptions, i.e., (i) the
class priors of the training distribution differ from those of the distribution of the unlabelled data (in
symbols: 𝑃(𝑌) ≠ 𝑄(𝑌), where 𝑃 and 𝑄 are the distributions from which the training data and the
unlabelled data are sampled, respectively); and (ii) the class-conditional distribution of the covariates in
𝑃 is the same as that in 𝑄 (in symbols: 𝑃(𝑋|𝑌) = 𝑄(𝑋|𝑌)).</p>
      <p>The healthcare-related problem discussed in Section 1 is indeed an anti-causal learning problem.
Indeed, if we take random variable 𝑌 to range over 𝒴 = {Disease, NoDisease}, and random variable 𝑋
to range over the vectors of covariates representing symptoms exhibited by individuals, the anti-causal
nature of the problem is evident. If we take 𝑃 and 𝑄 to be the data distributions characterizing the
pre-epidemic and the epidemic scenarios, respectively, we are in the presence of PPS, since 𝑃(𝑌) ≠ 𝑄(𝑌)
(the prevalence values of Disease and NoDisease have changed when switching from 𝑃 to 𝑄) and
𝑃(𝑋|𝑌) = 𝑄(𝑋|𝑌) (the distributions of the symptoms exhibited by affected individuals are the same
in 𝑃 and 𝑄).</p>
      <p>In the presence of PPS (as in the presence of any other type of shift, for that matter), a classifier whose
hyperparameters have been optimised on data from 𝑃 may behave suboptimally when applied to data
from 𝑄 (see Section 2.1 for a formal proof). For it to behave optimally on data from 𝑄, hyperparameter
optimisation should have been carried out on data from 𝑄, but this is not possible if using standard
cross-validation techniques, since the labels of data from 𝑄 are not known.</p>
      <p>
        To address this problem, we introduce transductive model selection (TMS), a new strategy aimed at
selecting the hyperparameter configuration for a given classifier (or the model from a pool of already
trained candidates) that is predicted to be the best for a specific batch 𝑈 of unlabelled data characterised
by dataset shift. This strategy leverages recent advances in classifier accuracy prediction (CAP) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]
(a family of techniques specifically designed to estimate classifier accuracy under dataset shift), and
focuses in particular on CAP methods tailored to PPS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Why Can’t We Trust Cross-Validation Estimates under Prior Probability Shift?</title>
        <p>Assume our classifier accuracy measure is (“vanilla”) accuracy (in symbols: Acc), i.e., the fraction of
classification decisions that are correct, and assume we have an (arbitrarily good) estimator Âcc of the
classifier’s accuracy obtained by means of cross-validation on training data. Assume our unlabelled data
is drawn from a distribution 𝑄 related to 𝑃 via PPS: can we trust our estimate? This amounts to asking
whether our estimator is unbiased under PPS, i.e., whether Bias(Âcc) ≡ E[Âcc] − 𝑄(𝑌̂ = 𝑌) = 0,
where 𝑌̂ is a random variable ranging on the predicted class labels. As our training data is drawn from
distribution 𝑃, and since our estimate is arbitrarily good, asymptotically it holds that E[Âcc] = 𝑃(𝑌̂ = 𝑌).
For simplicity, let us focus on a generic binary problem (with 𝒴 = {0, 1}). Note that
𝑃(𝑌̂ = 𝑌) = 𝑃(𝑌̂ = 1, 𝑌 = 1) + 𝑃(𝑌̂ = 0, 𝑌 = 0) = 𝑃(𝑌̂ = 1|𝑌 = 1)𝑃(𝑌 = 1) + 𝑃(𝑌̂ = 0|𝑌 = 0)𝑃(𝑌 = 0) (1)
and that, similarly,
𝑄(𝑌̂ = 𝑌) = 𝑄(𝑌̂ = 1|𝑌 = 1)𝑄(𝑌 = 1) + 𝑄(𝑌̂ = 0|𝑌 = 0)𝑄(𝑌 = 0) (2)</p>
        <p>We first observe that, as shown in [ 6, Lemma 1], the PPS assumption 𝑃(𝑋|𝑌) = 𝑄(𝑋|𝑌) (see Section 2)
implies that 𝑃(𝑓(𝑋)|𝑌) = 𝑄(𝑓(𝑋)|𝑌) for any deterministic and measurable function 𝑓. In particular,
if we take 𝑓 to be our classifier ℎ, it holds that 𝑃(𝑌̂|𝑌) = 𝑄(𝑌̂|𝑌). This means that 𝑃(𝑌̂ = 1|𝑌 = 1)
and 𝑄(𝑌̂ = 1|𝑌 = 1) are equal; we indicate both by the symbol “tpr”, since they both represent the
true positive rate of the classifier. Similarly, 𝑃(𝑌̂ = 0|𝑌 = 0) and 𝑄(𝑌̂ = 0|𝑌 = 0) are equal, and we
indicate both as “tnr”, which stands for the true negative rate of the classifier. We can further simplify
our equations via the shorthands 𝑝 = 𝑃(𝑌 = 1) and 𝑞 = 𝑄(𝑌 = 1). It then follows that
Bias(Âcc) = tpr · 𝑝 + tnr · (1 − 𝑝) − (tpr · 𝑞 + tnr · (1 − 𝑞)) = (𝑝 − 𝑞) · (tpr − tnr) (3)
PPS means that 𝑝 ≠ 𝑞; Equation 3 thus implies that Bias(Âcc) = 0 holds only if (tpr − tnr) = 0.
However, tpr = tnr is not true in general (and is unlikely to be true in practice), thus implying that the
cross-validation estimator is biased under PPS. A similar reasoning holds for the multiclass case.</p>
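        <p>To make Equation 3 concrete, the following minimal Python sketch (ours, for illustration only; the function name and the numeric values are hypothetical) computes the asymptotic bias of the cross-validation accuracy estimate for a classifier with given tpr and tnr when the positive prevalence shifts from 𝑝 to 𝑞:</p>

```python
def cv_bias(tpr: float, tnr: float, p: float, q: float) -> float:
    """Asymptotic bias of the cross-validation accuracy estimate under
    prior probability shift, as in Equation 3: (p - q) * (tpr - tnr)."""
    acc_train = tpr * p + tnr * (1 - p)   # what cross-validation estimates (under P)
    acc_test = tpr * q + tnr * (1 - q)    # the true accuracy on the shifted data (under Q)
    return acc_train - acc_test

# A classifier with tpr = 0.90, tnr = 0.70; prevalence shifts from p = 0.2 to q = 0.8:
# bias = (0.2 - 0.8) * (0.90 - 0.70) = -0.12, i.e., cross-validation
# underestimates the accuracy on the shifted data by 12 points.
bias = cv_bias(tpr=0.90, tnr=0.70, p=0.2, q=0.8)
```

        <p>Note that the bias vanishes when tpr = tnr, exactly as the derivation above predicts, regardless of how large the shift 𝑝 − 𝑞 is.</p>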
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Model Selection: From Induction to Transduction</title>
        <p>Let us assume the following problem setting. Let Θ be the set of all assignments of values to hyperparameters
that we want to explore as part of our model selection process; in this paper we concentrate
on a standard grid search exploration, although other strategies (e.g., Gaussian processes, randomized
search) might be used instead. Let 𝐿 be our (labelled) training set and 𝑈 our batch of (unlabelled) data.
Consider the class ℋ of hypotheses, and let ℎ𝜃 ∈ ℋ be the classifier with hyperparameters 𝜃 trained via
some learning algorithm using the labelled data 𝐿. Let 𝐴 : ℋ × 𝒰 → ℝ be the measure of accuracy, for
a classifier ℎ ∈ ℋ on a batch 𝑈 ∈ 𝒰 of unlabelled data, that we want to optimise for. The model selection
problem can thus be formalized as
𝜃* = arg max𝜃∈Θ 𝐴(ℎ𝜃, 𝑈) (4)
Since we do not have access to the labels in 𝑈, the problem cannot be solved directly, and we must
instead resort to approximations. The most common approximation corresponds to the traditional
inductive model selection method (IMS – Section 2.2.1).</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Inductive Model Selection</title>
          <p>The IMS approach comes down to using part of the training data for evaluating each configuration of
hyperparameters, based on the assumption that such an estimate of classifier accuracy generalizes to
future data. In this paper we carry this out via standard cross-validation (although everything we say
applies to 𝑘-fold cross-validation as well), splitting 𝐿 (with stratification) into a proper training set 𝐿tr
and a validation set 𝐿va. IMS is described in Algorithm 1.</p>
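          <p>The IMS loop can be sketched in a few lines of scikit-learn code; the following is a minimal illustration on synthetic data (ours, not the paper’s experimental code; the grid values are merely illustrative):</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A toy labelled set L, split (with stratification) into L_tr and L_va
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# A small grid Theta of hyperparameter assignments (illustrative values)
theta_grid = [{"C": C, "class_weight": cw}
              for C in (0.01, 0.1, 1.0, 10.0, 100.0)
              for cw in (None, "balanced")]

best_acc, best_clf = -1.0, None
for theta in theta_grid:
    h = LogisticRegression(max_iter=1000, **theta).fit(X_tr, y_tr)
    acc = accuracy_score(y_va, h.predict(X_va))  # accuracy on labelled validation data
    if acc > best_acc:
        best_acc, best_clf = acc, h

# best_clf is an inductive classifier, applicable to any future unlabelled data
```

        <p>Note that the selection criterion is computed entirely on the labelled validation split, which is precisely what makes IMS unreliable when the validation data are not representative of the deployment data.</p>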
          <p>However, since IMS is unreliable under PPS for the reasons discussed in Section 2.1, we propose an
alternative model selection method called transductive model selection (TMS – Section 2.2.2).</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Transductive Model Selection</title>
          <p>The main difference between IMS and TMS lies in the accuracy estimation step. Unlike IMS, which
estimates the accuracy on unlabelled data by computing accuracy on labelled (validation) data, TMS
estimates the accuracy on unlabelled data directly on the available set of unlabelled data. To this aim,
TMS employs a classifier accuracy prediction (CAP) method, i.e., a predictor 𝑆ℎ : 𝒰 → ℝ of the accuracy
that ℎ will exhibit on a batch 𝑈 ∈ 𝒰 of unlabelled data. However, this does not mean that TMS can
avoid using part of the labelled data, since it still requires a portion of it to train the CAP method. Since
the procedure is transductive, its outcome is not a generic classifier that can be applied to any future
data, but the set of labels assigned to the unlabelled instances in 𝑈. TMS is described in Algorithm 2.</p>
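          <p>The TMS loop can be sketched as follows. The CAP method used here is a deliberately simple stand-in that we introduce for illustration only (tpr/tnr estimated on validation data, which are invariant under PPS, combined with a test prevalence estimated via adjusted classify-and-count); it is not the O-LEAPKDEy method actually used in the paper:</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def cap_estimate_accuracy(h, X_va, y_va, X_u):
    """Toy PPS-aware CAP: tpr and tnr are estimated on validation data (they
    are invariant under PPS), the test prevalence q is estimated via adjusted
    classify-and-count, and accuracy is recomposed as tpr*q + tnr*(1-q)."""
    pred_va = h.predict(X_va)
    tpr = np.mean(pred_va[y_va == 1] == 1)
    tnr = np.mean(pred_va[y_va == 0] == 0)
    fpr = 1.0 - tnr
    obs = np.mean(h.predict(X_u) == 1)   # observed rate of positive predictions on U
    denom = tpr - fpr
    q = np.clip((obs - fpr) / denom, 0.0, 1.0) if abs(denom) > 1e-9 else obs
    return tpr * q + tnr * (1.0 - q)

# Synthetic labelled pool, split into L_tr and L_va; U is a prevalence-shifted batch
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.7], random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.6, stratify=y, random_state=0)
X_va, X_pool, y_va, y_pool = train_test_split(X_rest, y_rest, test_size=0.5,
                                              stratify=y_rest, random_state=0)
rng = np.random.default_rng(0)
idx = np.concatenate([rng.choice(np.where(y_pool == 1)[0], 80),   # U is 80% positive,
                      rng.choice(np.where(y_pool == 0)[0], 20)])  # training was 30%
X_u = X_pool[idx]

best_est, h_star = -1.0, None
for C in (0.01, 0.1, 1.0, 10.0):
    for cw in (None, "balanced"):
        h = LogisticRegression(C=C, class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
        est = cap_estimate_accuracy(h, X_va, y_va, X_u)  # estimated accuracy on U
        if est > best_est:
            best_est, h_star = est, h

labels = h_star.predict(X_u)  # transductive outcome: labels for this specific batch U
```

          <p>Note that, unlike in IMS, the selection criterion is evaluated on the very batch that must be classified, and the outcome is the set of labels for that batch rather than a reusable classifier.</p>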
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In this section we present an experimental comparison between IMS and TMS under PPS. The code to
reproduce all our experiments is available on GitHub at https://github.com/lorenzovolpi/tms.</p>
      <p>Experimental Protocol. The effectiveness measure we use in order to assess the quality of the model
selection strategies is the (“vanilla”) accuracy of the selected model, i.e., the fraction of classification
decisions that are correct.</p>
      <p>The experimental protocol we adopt is as follows. Given a dataset 𝐷, we split it into a training set
𝐿 (70%) and a test set 𝑇 (30%) via stratified sampling; we further split the training set into a “proper”
training set 𝐿tr and a validation set 𝐿va, with |𝐿tr| = |𝐿va|, via stratification. In order to simulate
PPS we apply the Artificial Prevalence Protocol (APP) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] on 𝑇. This consists of drawing 𝑚 vectors
v1, ..., v𝑚 (we here take 𝑚 = 1000) of 𝑛 prevalence values (with 𝑛 the number of classes) from the unit
simplex Δ𝑛−1 (using the Kraemer sampling algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), and extracting from 𝑇, for each v𝑖, a bag
𝑈𝑖 of |𝑈𝑖| = 𝑘 elements (we here take 𝑘 = 100) such that 𝑈𝑖 satisfies the prevalence distribution of
v𝑖 (we use the term “bag”, i.e., multiset, since we sample with replacement, which might lead to 𝑈𝑖
containing duplicates). The advantage of the APP is that it allows us to test the robustness of our models
to the entire range of PPS values, while embodying a bag extraction method that implements exactly
the two distributional assumptions, mentioned in Section 2, that lie at the heart of PPS.
      </p>
      <p>Algorithm 1: Inductive Model Selection
for 𝜃 ∈ Θ do
    ℎ𝜃 ← Train(ℋ, 𝜃, 𝐿tr)          // Trains the classifier
    Acc𝜃 ← 𝐴(ℎ𝜃, 𝐿va)              // Computes accuracy on validation data
    if Acc𝜃 &gt; BestAcc then
        ℎ* ← ℎ𝜃
        BestAcc ← Acc𝜃
    end if
end for
return ℎ*                           // Returns an inductive classifier that can be
                                    // applied to any set of unlabelled data</p>
      <p>Algorithm 2: Transductive Model Selection
for 𝜃 ∈ Θ do
    ℎ𝜃 ← Train(ℋ, 𝜃, 𝐿tr)          // Trains classifier ℎ𝜃
    𝑆ℎ𝜃 ← CAP(ℎ𝜃, 𝐿va)             // Trains a CAP method for classifier ℎ𝜃
    Âcc𝜃 ← 𝑆ℎ𝜃(𝑈)                  // Estimates accuracy on the unlabelled data
    if Âcc𝜃 &gt; BestAcc then
        ℎ* ← ℎ𝜃
        BestAcc ← Âcc𝜃
    end if
end for
return {(𝑥, ℎ*(𝑥)) : 𝑥 ∈ 𝑈}        // Returns the inferred labels for the specific
                                    // unlabelled data</p>
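      <p>The APP bag-extraction step described above can be sketched as follows (our illustration on toy data; function names are ours, and the Kraemer draw is realised via spacings of sorted uniform variates):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def kraemer_prevalence(n: int) -> np.ndarray:
    """Draw a prevalence vector uniformly at random from the unit simplex
    of dimension n-1, via spacings of sorted uniform draws."""
    cuts = np.sort(rng.uniform(size=n - 1))
    return np.diff(np.concatenate(([0.0], cuts, [1.0])))

def extract_bag(X, y, prev, k=100):
    """Sample (with replacement) a bag of k items from (X, y) whose class
    proportions match prev as closely as possible (largest-remainder
    rounding). Sampling within each class leaves the class-conditional
    distribution of the covariates untouched, i.e., the second PPS assumption."""
    base = np.floor(prev * k).astype(int)
    order = np.argsort(-(prev * k - base))   # classes with the largest remainders
    base[order[:k - base.sum()]] += 1        # distribute the leftover items
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=m, replace=True)
                          for c, m in enumerate(base)])
    return X[idx], y[idx]

# Toy data with three classes of 100 items each
X_toy = np.arange(300).reshape(-1, 1)
y_toy = np.repeat([0, 1, 2], 100)
prev = kraemer_prevalence(3)
X_bag, y_bag = extract_bag(X_toy, y_toy, prev, k=100)
```

      <p>Repeating the two steps 𝑚 times yields bags whose prevalence vectors cover the whole simplex, i.e., the entire range of PPS values.</p>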
      <p>We then run our experiments by first training all the differently configured classifiers on 𝐿tr and then:
1. for applying IMS: computing the accuracy of each trained classifier on 𝐿va and classifying the
datapoints in all the 𝑈𝑖’s via the classifier that has shown the best accuracy;
2. for applying TMS: for each 𝑈𝑖, estimating the accuracy on 𝑈𝑖 of each trained classifier via a CAP
method trained on 𝐿va, and applying to 𝑈𝑖 the classifier that has shown the best estimated accuracy.</p>
      <sec id="sec-3-1">
        <title>Classifiers and Hyperparameters.</title>
        <p>We test both model selection approaches on four classifier types,
namely, classifiers trained via Logistic Regression (LR), 𝑘-Nearest Neighbours (𝑘-NN), Support Vector
Machines (SVM), and Multi-Layer Perceptron (MLP). Each classifier type is instantiated with multiple
combinations of hyperparameters, with the total number of combinations depending on the number of
classes in the dataset and on the classifier type.</p>
        <p>Under PPS, one of the most interesting hyperparameters is probably the class_weight hyperparameter
of LR and SVM, which allows rebalancing the relative importance of the classes to compensate for
class imbalance. In the presence of PPS, exploring different class-balancing configurations increases the
probability of instantiating a classifier trained according to a class importance scheme that fits well the
unlabelled data. For LR and SVM, we consider different values for the class_weight hyperparameter
depending on the number 𝑛 of classes. In all cases, we include the configurations balanced (which
assigns different weights to instances of different classes to compensate for class imbalance in the
training data) and None (all instances count the same, which results in more populous classes dominating
the learning process). Aside from these, we explore alternative class_weight values that try to
compensate for a potentially high prevalence of a single class, one class at a time. The reason why we limit
ourselves to this kind of exploration is to prevent combinatorial explosion; focusing on more than
one class at a time, or on more weight values, would result in a potentially unmanageable number
of hyperparameter combinations, especially for high values of 𝑛. These alternative values are
specified as points in the probability simplex (i.e., the per-class balancing weights add up to one).
In multiclass problems with 𝑛 &gt; 2, we add 𝑛 such configurations to the pool of values, which we obtain
as all different permutations composed of one “high” weight and (𝑛 − 1) “low” weights. We set the high
value to 2/𝑛 (i.e., twice the mass of a uniform assignment) and distribute the remainder among the
low values, thus setting each to (1 − 2/𝑛)/(𝑛 − 1); the choice of 2/𝑛 as the high value is arbitrary, but
we consider it a good choice to compensate for class imbalance. For example, when 𝑛 = 3 we explore the class_weight
assignments (0.66, 0.165, 0.165), (0.165, 0.66, 0.165), and (0.165, 0.165, 0.66). In the binary setting,
where a finer-grained set of combinations is manageable, we instead use a grid 𝑊 of class weights and
explore all combinations (𝑤, 1 − 𝑤) with 𝑤 ∈ 𝑊. In particular, we use the grid 𝑊 = (0.2, 0.4, 0.6, 0.8),
thus considering the class_weight assignments (0.2, 0.8), (0.4, 0.6), (0.6, 0.4), and (0.8, 0.2).</p>
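        <p>The enumeration described above can be reproduced as follows (our sketch; the function name is ours, and the output uses scikit-learn’s dict format for class_weight; note that for 𝑛 = 3 the exact weights are 2/3 ≈ 0.667 and 1/6 ≈ 0.167, printed in the text as rounded values):</p>

```python
def class_weight_grid(n: int):
    """Enumerate the class_weight configurations explored in the text:
    'balanced', None, plus either the binary grid (w, 1-w) for
    w in W = (0.2, 0.4, 0.6, 0.8), or, for n > 2, the n permutations with
    one 'high' weight 2/n and (n-1) 'low' weights (1 - 2/n)/(n - 1)."""
    configs = ["balanced", None]
    if n == 2:
        for w in (0.2, 0.4, 0.6, 0.8):
            configs.append({0: w, 1: round(1 - w, 10)})
    else:
        high = 2.0 / n
        low = (1.0 - high) / (n - 1)
        for c in range(n):
            configs.append({j: (high if j == c else low) for j in range(n)})
    return configs

grid3 = class_weight_grid(3)  # 'balanced', None, plus 3 one-high-weight permutations
```
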
        <p>Concerning the other hyperparameters, we consider five different values (10⁻², 10⁻¹, 10⁰, 10¹, 10²)
for hyperparameter C (the regularization strength for both LR and SVM), as well as two additional values
of gamma (scale and auto) for SVM only. For 𝑘-NN we explore five values of 𝑘, i.e., of n_neighbors (5,
7, 9, 11, 13), and two values of weights (uniform and distance). For MLP, we test five values of alpha
(10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹) and two values of learning_rate (constant and adaptive). Hyperparameter
names and default configurations follow those provided by the scikit-learn library (https://scikit-learn.org/);
we have left all hyperparameters not explicitly discussed here at their default values.</p>
        <p>Datasets. We use the 25 datasets from the UCI machine learning repository (https://archive.ics.uci.edu/)
that can be directly imported through UCI’s Python API and that have at least 6 features and at least
800 instances. The number of instances per dataset varies from 830 (mammographic) to 1,024,985 (poker-hand), while the
number of features varies from 6 (mhr) to 617 (isolet). The number of classes varies from 2 (german,
mammographic, semeion, spambase, tictactoe) to 26 (isolet, letter). Class balance is highly variable,
from datasets in which some of the classes represent less than 1% of the instances (e.g., one of the
classes in poker-hand), to perfectly balanced datasets (e.g., image-seg).</p>
        <p>
          Implementation Details for TMS. As our choice of the CAP method we adopt O-LEAPKDEy, a
member of the Linear Equations for Accuracy Prediction (LEAP) family [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] specifically devised for PPS.
LEAP methods work by estimating the values of the cells of the contingency table deriving from the
application of the classifier to the set of unlabelled data, where the estimation is obtained by solving a
system of linear equations that represent the problem constraints (including the PPS assumptions); once
the contingency table is estimated, any classifier accuracy measure can be computed from it. LEAP
internally relies on a quantifier (i.e., a predictor of class prevalence values) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]; following [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], for this
purpose we employ the KDEy-ML quantification method [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. From now on, we will refer to the TMS
method that uses O-LEAPKDEy simply as TMS-All, since this method selects the best model across all
classifier types (LR, SVM, 𝑘-NN, MLP) and all their hyperparameter combinations.
        </p>
        <p>
          Baselines. We compare TMS-All against standard (inductive) approaches for model selection. In
particular, we consider two variants of IMS: one in which model selection is performed independently
for each classifier type (IMS-LR, IMS-SVM, etc.), and another in which model selection chooses among
all classifier types and all hyperparameter combinations for each type (IMS-All); in other words, TMS-All
stands to TMS as IMS-All stands to IMS. We also compare these model selection strategies against
configurations of each classifier (∅-LR, ∅-SVM, etc.) in which default hyperparameters are used.
        </p>
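        <p>As an aside, the system of linear equations that LEAP methods solve can be illustrated in the binary case by the following simplified sketch (ours, for illustration only; the actual O-LEAPKDEy method handles the multiclass case and a richer set of constraints). The unknowns are the four cells of the contingency table on the unlabelled batch, constrained by the class prevalences estimated by the quantifier and by the PPS-invariance of tpr and tnr:</p>

```python
import numpy as np

def leap_style_accuracy(tpr: float, tnr: float, q1: float) -> float:
    """Simplified binary illustration of the LEAP idea: estimate the four
    cells of the contingency table on the unlabelled batch by solving a
    linear system encoding (i) the class prevalence q1 estimated by a
    quantifier and (ii) the PPS assumptions (tpr and tnr are invariant).
    Unknowns: x = [n00, n01, n10, n11], with n_ij = Q(Yhat = i, Y = j)."""
    A = np.array([
        [1.0, 1.0, 1.0, 1.0],          # all cells sum to 1
        [0.0, 1.0, 0.0, 1.0],          # Q(Y=1) = q1 (quantifier estimate)
        [0.0, -tpr, 0.0, 1.0 - tpr],   # n11 = tpr * (n01 + n11)
        [1.0 - tnr, 0.0, -tnr, 0.0],   # n00 = tnr * (n00 + n10)
    ])
    b = np.array([1.0, q1, 0.0, 0.0])
    n00, n01, n10, n11 = np.linalg.solve(A, b)
    return n11 + n00   # any accuracy measure is computable from the estimated table

acc = leap_style_accuracy(0.9, 0.8, 0.7)  # equals tpr*q1 + tnr*(1-q1) = 0.87
```

        <p>Once the table is estimated, vanilla accuracy is just the sum of the diagonal cells, but any other measure (F1, balanced accuracy, etc.) is equally computable from it.</p>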
        <p>
          We also consider ∅-TSVM, an instance of the transductive support vector machine (TSVM) algorithm
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which directly infers the label of each unlabelled datapoint without generating a classifier; for
TSVM we use the implementation proposed by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. While
TSVM relies on the IID assumption and is not a proper model selection approach, we include it as a
reference baseline because it captures the essence of transductive learning, and thus offers a meaningful
point of comparison. For TSVM, we only consider instantiations with default hyperparameters since, to
the best of our knowledge, there is no established way to tune hyperparameters for non-IID settings.
        </p>
        <p>
          Table 1: Accuracy obtained by classifiers whose hyperparameters have been chosen via
IMS or via TMS. Boldface represents the best result obtained for the given dataset. Superscript † denotes
the methods (if any) whose scores are not statistically significantly different from the best one according to a
Wilcoxon signed-rank test at 0.01 confidence level. Cells are colour-coded in order to facilitate readability, with
a bright green (resp., red) cell indicating the best (resp., worst) system on the given dataset, and paler shades
indicating intermediate performance values. Values after ± in each cell represent standard deviation.
        </p>
        <p>Results. Table 1 reports the accuracy scores obtained by the classifiers resulting from each of the
model selection strategies considered; each accuracy value is the average across the 1000 tests we
have carried out for that dataset. Overall, TMS-All tends to obtain the (per-dataset) best results, and
obtains the greatest number of best results. In cases when TMS-All does not obtain the best result,
it still tends to obtain, for most datasets, results that are not statistically significantly different from
the best-performing baseline. Our experiments clearly show that adopting MS strategies tailored to
PPS yields a substantial performance advantage with respect to MS techniques that assume IID data.
Applying TMS also appears preferable to training TSVMs; although TSVMs are tailored to transductive
contexts, in our experiments they underperform when facing scenarios affected by PPS.</p>
        <p>We also plot accuracy as a function of the amount of PPS, measured
as the L1 distance between the vectors of class prevalence values of the training set and the test bag; the
results are obtained as within-bin averages, where a bin groups all the bags 𝑈𝑖 affected by a similar
amount of PPS, across all datasets. A clear pattern emerging from the plot is that most methods perform
similarly for low levels of PPS, but there is a clear advantage in adopting TMS at higher levels of shift.</p>
        <p>The same figure also shows (as a black dashed line) the performance of an oracle, i.e., an idealized
method that always picks the best classifier for each bag 𝑈𝑖. The gap between the oracle and classifiers
optimized via traditional MS is small at low shift levels (indicating that IMS works well in near-IID
scenarios). However, as shift increases, traditional approaches degrade substantially, while the gap
between the oracle and TMS remains narrow, showing that TMS performs well under such conditions.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>
        The ability to perform robustly under distribution shift is a key requirement for trustworthy AI [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
We have discussed transductive model selection (TMS), a new way of performing context-aware model
selection (i.e., hyperparameter optimisation) for classification applications. Essentially, TMS replaces
traditional classifier accuracy computation on training data with classifier accuracy estimation on the
finite set of unlabelled data that need to be classified at a certain point in time.
      </p>
      <p>We have presented TMS experiments in a restricted setting, i.e., when the data are affected by prior
probability shift, an important type of dataset shift that often affects anti-causal learning problems. Here,
our experiments have shown that TMS boosts classification accuracy, i.e., brings about classifiers that
outperform the classifiers whose hyperparameters have been optimised, as usual, by cross-validation
on the labelled data. Note that TMS is not restricted to dealing with prior probability shift, and can
deal with other types of shift too (e.g., covariate shift). For this, one only needs to use, in place of the
O-LEAPKDEy method used in this paper (which is tailored to prior probability shift), a CAP method
explicitly devised for the type of shift that the data suffer from.</p>
      <p>
        TMS also holds promise for all applications that have a strictly transductive nature (i.e., in which all
the unlabelled data to which the classifier needs to be applied are already known at training time), such
as technology-assisted review (TAR – see e.g., [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) for supporting e-discovery [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], online content
moderation [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], or the production of systematic reviews [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Indeed, the next steps in our TMS
research will include its application to these domains.
      </p>
      <p>Transductive model selection not only improves robustness to distributional changes, but also
strengthens the trustworthiness of AI systems by enabling more reliable classifier deployment than
conventional “one-size-fits-all” inductive approaches.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>LV’s work was supported by project “Italian Strengthening of ESFRI RI RESILIENCE” (ITSERR), funded
by the European Union under the NextGenerationEU funding scheme (CUP B53C22001770006). AM’s
and FS’s work was partially supported by project “Future Artificial Intelligence Research” (FAIR—CUP
B53D22000980006), project “Quantification under Dataset Shift” (QuaDaSh—CUP B53D23026250001),
and project “Strengthening the Italian RI for Social Mining and Big Data Analytics” (SoBigData.it—CUP
B53C22001760006), all funded by the European Union under the NextGenerationEU funding scheme.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Neyshabur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sedghi</surname>
          </string-name>
          ,
          <article-title>Leveraging unlabeled data to predict out-of-distribution performance</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Learning Representations (ICLR</source>
          <year>2022</year>
          ), Virtual Event,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guillory</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ebrahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>Predicting with confidence on unseen distributions</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV
          <year>2021</year>
          ), Montreal, CA,
          <year>2021</year>
          , pp.
          <fpage>1134</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Volpi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>LEAP: Linear equations for classifier accuracy prediction under prior probability shift</article-title>
          ,
          <source>Machine Learning</source>
          (
          <year>2025</year>
          ). Forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Storkey</surname>
          </string-name>
          ,
          <article-title>When training and test sets are different: Characterizing learning transfer</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Quiñonero-Candela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaighofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          (Eds.),
          <source>Dataset shift in machine learning</source>
          , The MIT Press, Cambridge, US,
          <year>2009</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Model selection under covariate shift</article-title>
          ,
          <source>in: Proceedings of the 15th International Conference on Artificial Neural Networks (ICANN</source>
          <year>2005</year>
          ), Warsaw, PL,
          <year>2005</year>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>240</lpage>
          . doi:10.1007/11550907_37.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Detecting and correcting for label shift with black box predictors</article-title>
          ,
          <source>in: Proceedings of the 35th International Conference on Machine Learning (ICML</source>
          <year>2018</year>
          ), Stockholm, SE,
          <year>2018</year>
          , pp.
          <fpage>3128</fpage>
          -
          <lpage>3136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Czyż</surname>
          </string-name>
          ,
          <article-title>Bayesian quantification with black-box estimators</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          <year>2024</year>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Janzing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sgouritsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Mooij</surname>
          </string-name>
          ,
          <article-title>On causal and anticausal learning</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Machine Learning (ICML</source>
          <year>2012</year>
          ), Edinburgh, UK,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fawcett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>A response to Webb and Ting's 'On the application of ROC analysis to predict classification performance under varying class distributions'</article-title>
          ,
          <source>Machine Learning</source>
          <volume>58</volume>
          (
          <year>2005</year>
          )
          <fpage>33</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fabris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <source>Learning to quantify</source>
          , Springer Nature, Cham, CH,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Tromble</surname>
          </string-name>
          ,
          <article-title>Sampling uniformly from the unit simplex</article-title>
          ,
          <source>Technical Report</source>
          , Johns Hopkins University,
          <year>2004</year>
          . https://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>del Coz</surname>
          </string-name>
          ,
          <article-title>Kernel density estimation for multiclass quantification</article-title>
          ,
          <source>Machine Learning</source>
          <volume>114</volume>
          (
          <year>2025</year>
          ).
          doi:10.1007/s10994-024-06726-5.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gammerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Vovk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Learning by transduction</article-title>
          ,
          <source>in: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI</source>
          <year>1998</year>
          ), Madison, US,
          <year>1998</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Transductive inference for text classification using support vector machines</article-title>
          ,
          <source>in: Proceedings of the 16th International Conference on Machine Learning (ICML</source>
          <year>1999</year>
          ), Bled, SL,
          <year>1999</year>
          , pp.
          <fpage>200</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Calegari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pratesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Milano</surname>
          </string-name>
          ,
          <article-title>Introduction to the Special Issue on Trustworthy Artificial Intelligence</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>56</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          . doi:10.1145/3649452.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pickens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>High-recall retrieval via technology-assisted review</article-title>
          ,
          <source>in: Proceedings of the 47th ACM Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2024</year>
          ), Washington, US,
          <year>2024</year>
          , pp.
          <fpage>2987</fpage>
          -
          <lpage>2988</lpage>
          . doi:10.1145/3626772.3661376.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <article-title>Information retrieval for e-discovery</article-title>
          ,
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>7</volume>
          (
          <year>2013</year>
          )
          <fpage>99</fpage>
          -
          <lpage>237</lpage>
          . doi:10.1561/1500000025.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>TAR on social media: A framework for online content moderation</article-title>
          ,
          <source>in: Proceedings of the 2nd International Conference on Design of Experimental Search &amp; Information REtrieval Systems (DESIRES</source>
          <year>2021</year>
          ), Padova, IT,
          <year>2021</year>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ferdinands</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>de Bruin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bagheri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Oberski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tummers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Teijema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>van de Schoot</surname>
          </string-name>
          ,
          <article-title>Performance of active learning models for screening prioritization in systematic reviews: A simulation study into the average time to discover relevant records</article-title>
          ,
          <source>Systematic Reviews</source>
          <volume>10</volume>
          (
          <year>2023</year>
          ). doi:10.1186/s13643-023-02257-7.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>