<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CIRCLE</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Active Learning and the Saerens-Latinne-Decaestecker Algorithm: An Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessio Molinari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Sebastiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>56124, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>The Saerens-Latinne-Decaestecker (SLD) algorithm is a method whose goal is to improve the quality of the posterior probabilities (or simply “posteriors”) returned by a probabilistic classifier in scenarios characterized by prior probability shift (PPS) between the training set and the unlabelled (“test”) set. This is an important task, (a) because posteriors are of the utmost importance in downstream tasks such as, e.g., multiclass classification and cost-sensitive classification, and (b) because PPS is ubiquitous in many applications. In this paper we explore whether using SLD can indeed improve the quality of the posteriors returned by a classifier trained via active learning (AL), a class of machine learning (ML) techniques that indeed tend to generate substantial PPS. Specifically, we target AL via relevance sampling (ALvRS) and AL via uncertainty sampling (ALvUS), two AL techniques that are very well known especially because, due to their low computational cost, they are suitable for application in scenarios characterized by large datasets. We present experimental results obtained on the RCV1-v2 dataset, showing that SLD fails to deliver better-quality posteriors with both ALvRS and ALvUS, thus contradicting previous findings in the literature, and that this is due not to the amount of PPS that these techniques generate, but to how the examples they prioritize for annotation are distributed.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Classification</kwd>
        <kwd>Probabilistic Classifiers</kwd>
        <kwd>Active Learning</kwd>
        <kwd>Posterior Probabilities</kwd>
        <kwd>Prior Probabilities</kwd>
        <kwd>Prior Probability Shift</kwd>
        <kwd>Dataset Shift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the field of probabilistic classification, a posterior probability (or simply: a posterior) Pr(c|x)
represents the confidence that a classifier h : X → C has in the fact that an unlabelled
(“test”) document x belongs to class c. Like all confidence scores, posteriors are useful for
ranking unlabelled documents (say, in terms of perceived relevance to class c). However, for
some downstream tasks other than ranking, such as multiclass classification and cost-sensitive
classification, standard (non-probabilistic) confidence scores are not enough, and true posteriors
are needed.</p>
      <p>
        For these downstream tasks to be carried out accurately, it is essential that the posteriors are
high-quality, i.e., well-calibrated.1 Some classifiers (e.g., those trained by logistic regression)
tend to return calibrated posteriors (we thus say that they are calibrated classifiers ); some other
classifiers (e.g., those trained by naive Bayesian methods) tend to return posteriors that are not
calibrated; yet some other classifiers return confidence scores that are not probabilities. For the
last two cases, methods exist (see e.g., [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]) to calibrate uncalibrated classifiers.
      </p>
      <p>
        Unfortunately, independently of the learning method used for training the classifiers,
posteriors tend to be uncalibrated when the application scenario suffers from prior probability shift
(PPS – [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), i.e., the (ubiquitous) phenomenon according to which the distribution Pr_U(c) of the
unlabelled test documents U across the classes is different from the distribution Pr_L(c) of the
labelled training documents L. This is due to the fact that when the (calibrated or uncalibrated)
classifiers generate the posteriors, they assume that the class prior probabilities Pr_U(c) (a.k.a.
“priors”, or “class prevalence values”) in the set U of unlabelled documents are the same as those
encountered in the training set L. If this is not the case, the returned posteriors end up not
being calibrated.
      </p>
      <p>
        The Saerens-Latinne-Decaestecker (SLD) algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a well-known method for
recalibrating the posteriors of a set of unlabelled documents in the presence of PPS between the training
set and this latter set. Given a machine-learned classifier and a set of unlabelled documents
for which the classifier has returned posteriors and estimates of the priors, SLD updates them
both in an iterative, mutually recursive way, with the goal of making both more accurate. Since
its publication, SLD has become the standard algorithm for recalibrating the posteriors in the
presence of PPS, and is still considered a top contender (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) when we need to estimate the
priors (a task that has become known as “quantification”).
      </p>
      <p>
        However, its real effectiveness in improving the quality of the posteriors is not yet entirely
clear. On the one hand, a recent, large experimental study [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] has shown that, at least when the
number of classes in the classification scheme is very small and the classifier is calibrated,
SLD does improve the quality of the posteriors, and especially so when the amount of PPS is
high. On the other hand, in experiments aimed at improving the quality of cost-sensitive text
classification in technology-assisted review (TAR) [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ], SLD has (surprisingly) not delivered any
measurable improvement in the quality of the posteriors, not even when the amount of PPS
was high [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The relationship between SLD and PPS is thus still unclear.
      </p>
      <p>
        The goal of this paper is to shed some light on this relationship. The reason why we are
interested in this is that, if SLD indeed improved the quality of the posteriors under PPS, it
would be extremely useful for TAR. In fact, in TAR we typically use a classifier trained on
labelled data in order to return posterior probabilities of relevance for a large set of unlabelled
documents. These posteriors are needed for ranking the unlabelled documents in terms of their
probability of relevance, and high-quality posteriors are of key importance for approaches to
1 The posteriors Pr(c|x), where x belongs to a set S = {x1, ..., x_{|S|}}, are said to be well-calibrated when, for all
p ∈ [0, 1], it holds that

    |{x ∈ c ∩ S : Pr(c|x) = p}| / |{x ∈ S : Pr(c|x) = p}| ≈ p    (1)

Perfect calibration is usually unattainable on any non-trivial dataset; however, calibration comes in degrees (and
the quality of calibration can indeed be measured), so efforts can be made to obtain posteriors which are as close as
possible to their perfectly calibrated counterparts.
      </p>
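      <p>The condition of Equation 1 can be checked empirically by bucketing the posteriors into bins, since exact equality Pr(c|x) = p almost never holds on finite data. The following is an illustrative sketch of our own (the function name calibration_bins and its parameters are hypothetical, not from the paper):</p>

```python
import numpy as np

def calibration_bins(y_is_pos, post_pos, n_bins=10):
    """Binned check of Equation (1): for each posterior bin, compare the mean
    posterior (confidence) with the observed fraction of positive documents."""
    y = np.asarray(y_is_pos, dtype=float)
    p = np.asarray(post_pos, dtype=float)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)  # bin index of each posterior
    out = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # well-calibrated posteriors: the two values in each pair are close
            out.append((p[mask].mean(), y[mask].mean()))
    return out
```

For well-calibrated posteriors, the two values in each returned pair should be approximately equal, which is the binned analogue of Equation 1.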
      <p>
        TAR based on risk minimization [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Additionally, TAR settings are typically characterized by
PPS, because the typical way to build a training set L in TAR is via active learning (AL), which
usually generates PPS. So, the research question we want to answer is
      </p>
      <p>RQ: Does SLD improve the quality of posterior probabilities in situations
in which the training set L used for training the probabilistic classifier has
been generated via active learning?
In the rest of the paper we briefly introduce the SLD algorithm (Section 2) and the two active
learning techniques (ALvRS and ALvUS) we use in order to investigate our research question
(Section 3), after which we present the results of our experiments (Section 4) followed (Section 5)
by a few concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The SLD Algorithm</title>
      <p>We assume a training set L of labelled examples and a set U = {(x1, y(x1)), . . . , (x_{|U|}, y(x_{|U|}))}
of unlabelled examples, i.e., examples whose true labels y(x) ∈ C = {c1, . . . , c_{|C|}} are unknown
to the system.</p>
      <p>
        SLD, proposed by Saerens et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is an instance of Expectation Maximization [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a
well-known iterative algorithm for finding maximum-likelihood estimates of parameters (in our case:
the class prior probabilities) for models that depend on unobserved variables (in our case: the
class labels). Pseudocode of the SLD algorithm is here included as Algorithm 1.
      </p>
      <p>Essentially, SLD iteratively updates (Line 13) the estimates of the class priors by using the
posteriors computed in the previous iteration, and updates (Line 15) the posteriors by using the
estimates of the class priors computed in the present iteration, in a mutually recursive fashion.
The main goal is to adjust the posteriors and re-estimate the priors in such a way that they are
mutually consistent, i.e., such that</p>
      <p>Pr_U(c) = (1 / |U|) · Σ_{x ∈ U} Pr(c|x)    (2)
Equation 2 is a necessary (albeit not sufficient) condition for the posteriors Pr(c|x) of the
documents x ∈ U to be calibrated. SLD may thus be viewed as making a step towards calibrating
these posteriors.</p>
      <p>The algorithm iterates until convergence, i.e., until the class priors become stable and
Equation 2 is satisfied. The convergence of SLD may be tested by computing how much the distribution of
the priors at iteration (s − 1) and that at iteration (s) still diverge; this can be evaluated, for
instance, in terms of absolute error, i.e.,2</p>
      <p>
        AE(p̂^(s−1), p̂^(s)) = (1 / |C|) · Σ_{c ∈ C} | P̂r^(s)(c) − P̂r^(s−1)(c) |    (3)
2 Consistently with most mathematical literature, we use the caret symbol (ˆ) to indicate estimation.
Algorithm 1: The SLD algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Input : Class priors Pr(c) on L, for all c ∈ C;
Posterior probabilities Pr(c|x), for all c ∈ C and for all x ∈ U;
Output : Estimates P̂r_U(c) of class prevalence values on U, for all c ∈ C;
Updated posterior probabilities Pr(c|x), for all c ∈ C and for all x ∈ U;</p>
      <p>1  // Initialization
2  s ← 0;
3  for c ∈ C do
4      P̂r^(0)(c) ← Pr(c);                                  // Initialize the prior estimates
5      for x ∈ U do
6          Pr^(0)(c|x) ← Pr(c|x);                           // Initialize the posteriors
7      end
8  end
9  // Main Iteration Cycle
10 while stopping condition = false do
11     s ← s + 1;
12     for c ∈ C do
13         P̂r^(s)(c) ← (1 / |U|) · Σ_{x ∈ U} Pr^(s−1)(c|x); // Update the prior estimates
14         for x ∈ U do
15             Pr^(s)(c|x) ← [ (P̂r^(s)(c) / P̂r^(0)(c)) · Pr^(0)(c|x) ] / [ Σ_{c′ ∈ C} (P̂r^(s)(c′) / P̂r^(0)(c′)) · Pr^(0)(c′|x) ];   // Update the posteriors
16         end
17     end
18 end</p>
      <p>In the experiments of Section 4, we decree that convergence has been reached when
AE(p̂^(s−1), p̂^(s)) &lt; 10^−6; we stop SLD when we have reached either convergence or the
maximum number of iterations (which we set to 1000).</p>
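      <p>For concreteness, Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative reimplementation under our reading of the pseudocode, not the authors' code; the function name sld and its argument names are our own:</p>

```python
import numpy as np

def sld(posteriors, train_priors, max_iter=1000, tol=1e-6):
    """SLD sketch: `posteriors` is an (n_docs, n_classes) array of Pr(c|x) on U;
    `train_priors` is an (n_classes,) array of class priors Pr(c) observed on L."""
    p0 = np.asarray(posteriors, dtype=float)         # Pr^(0)(c|x), kept fixed
    priors0 = np.asarray(train_priors, dtype=float)  # P^r^(0)(c): prior estimates at step 0
    priors = priors0.copy()
    post = p0.copy()
    for _ in range(max_iter):
        new_priors = post.mean(axis=0)               # Line 13: update the prior estimates
        scaled = (new_priors / priors0) * p0         # Line 15: rescale the ORIGINAL posteriors
        post = scaled / scaled.sum(axis=1, keepdims=True)
        ae = np.abs(new_priors - priors).mean()      # Equation (3): absolute error
        priors = new_priors
        if tol > ae:                                 # convergence reached
            break
    return post, priors
```

When there is no shift (i.e., the mean posterior already equals the training prior), the update is a no-op, which matches the fixed-point reading of Equation 2.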
      <p>While SLD is a natively multiclass algorithm, in this paper we restrict our analysis to the
binary case, with codeframe C = {⊕, ⊖}.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Active learning policies</title>
      <p>
        In the experiments for this work, we test the SLD algorithm on training/test sets generated via
two of the best-known active learning policies, namely Active Learning via Relevance Sampling
(ALvRS) and Active Learning via Uncertainty Sampling (ALvUS), first presented in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. While
fairly old and unsophisticated, these policies are still very popular because of their small
computational cost, which makes them well suited to applications (such as TAR) in which the
set of unlabelled documents that are candidates for annotation, and that the AL policy must
thus rank, is large.
      </p>
      <p>Active Learning via Relevance Sampling (ALvRS). ALvRS is an interactive process which,
given a data pool of unlabelled documents P, asks the reviewer to annotate an initial “seed” set
of documents S ⊂ P, uses S as the training set L to train a binary classifier h, and uses h to
rank the documents in (P ∖ L) in decreasing order of their posterior probability of relevance
Pr(⊕|x). Then, the reviewer is asked to annotate the b documents for which Pr(⊕|x) is highest
(with b the batch size), which, once annotated, are added to the training set L. Finally, we
retrain our classifier on the new training set and repeat the process, until a predefined number
of documents (the annotation budget) have been reviewed.</p>
      <p>Active Learning via Uncertainty Sampling (ALvUS). The ALvUS policy is a variation of
ALvRS, where we review the documents not in decreasing order of Pr(⊕|x) but in increasing
order of | Pr(⊕|x) − 0.5|, i.e., we top-rank the documents which the classifier is most uncertain
about.</p>
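      <p>The two ranking criteria can be sketched as follows. This is an illustrative sketch of our own; the function select_batch and its signature are hypothetical names, not from the paper:</p>

```python
import numpy as np

def select_batch(post_pos, b, policy="ALvRS"):
    """Return the indices of the b pool documents to annotate next.
    `post_pos` is an (n,) array of posteriors Pr(+|x) on the unlabelled pool."""
    post_pos = np.asarray(post_pos, dtype=float)
    if policy == "ALvRS":
        scores = post_pos                  # most probably relevant first
    elif policy == "ALvUS":
        scores = -np.abs(post_pos - 0.5)   # closest to 0.5, i.e., most uncertain, first
    else:
        raise ValueError(policy)
    return np.argsort(-scores)[:b]         # top-b documents by descending score
```

Both policies only sort one vector of posteriors per round, which is what makes their computational cost so small.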
      <p>The Rand policies. For each of the two policies defined above, we define an “oracle-like”
policy which we will use for a control experiment. The aim of these policies, which we call
Rand(RS) and Rand(US) (corresponding to ALvRS and ALvUS, respectively), is to help us better
understand whether the results we are seeing are due to the PPS generated by the two active
learning policies, or to their document selection strategy. Given a set L of labelled documents
and a set U of unlabelled documents generated via an active learning policy by sampling
P, the Rand policy samples P randomly to generate alternative labelled and unlabelled sets
L′ and U′ subject to the constraints that |L| = |L′|, |U| = |U′|, Pr_L(⊕) = Pr_{L′}(⊕), and
Pr_U(⊕) = Pr_{U′}(⊕). In other words, the Rand policies generate the same PPS as the active
learning policy, but with a different choice of documents.</p>
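      <p>A prevalence-matched random split of this kind can be sketched as follows (an illustrative helper of our own; rand_split and its parameters are hypothetical names, and we specify the split via the training size and its number of positives):</p>

```python
import numpy as np

def rand_split(labels_pos, n_train, n_train_pos, seed=None):
    """Randomly split a pool into alternative sets L' and U' so that L' has
    n_train documents of which exactly n_train_pos are positive, thus
    reproducing the sizes and prevalences of an AL-generated L/U split."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels_pos, dtype=bool)
    pos = rng.permutation(np.flatnonzero(labels))    # shuffled positive documents
    neg = rng.permutation(np.flatnonzero(~labels))   # shuffled negative documents
    train = np.concatenate([pos[:n_train_pos], neg[:n_train - n_train_pos]])
    test = np.concatenate([pos[n_train_pos:], neg[n_train - n_train_pos:]])
    return train, test
```

Since only the class counts are constrained, the documents themselves are an unbiased random sample, which is exactly what distinguishes Rand from the AL policies.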
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We run a set of comparative experiments to explore the interaction between AL-based classifiers
and SLD. In order to do this we test the two AL policies described above, ALvRS and ALvUS,
and compare them with the Rand(RS) and Rand(US) policies.</p>
      <sec id="sec-4-1">
        <title>4.1. The RCV1-v2 dataset</title>
        <p>
          We run our experiments on the RCV1-v2 dataset [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], a multi-label multi-class collection of
804,414 Reuters news stories (produced from August 1996 to August 1997).3
3 We use the RCV1-v2 dataset as provided by the scikit-learn implementation: https://scikit-learn.org/stable/datasets/
real_world.html#rcv1-dataset
The RCV1-v2 codeframe consists of a set of 103 classes. Since in this work we experiment with binary classification
problems only, for each such class c we consider a binary codeframe C = {⊕, ⊖}, where
⊕ = c and ⊖ is its complement. Finally, in order to keep computational costs within reasonable bounds, we
only work with a pool P consisting of the first 100,000 documents of the RCV1-v2 collection.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental setup</title>
        <p>
          For each class c ∈ C, and for each AL policy, we run the AL process to generate a sequence
of binary classification training sets with incremental sizes; this determines a corresponding
sequence of test sets, since the pool P is always the union of the training set L and the test set U.
As for training the classifier, in all of our experiments we use an SVM algorithm, post-calibrated
via Platt calibration [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
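        <p>A minimal sketch of this training step with scikit-learn follows. We assume LinearSVC as the underlying SVM (the paper does not specify the exact implementation), and sigmoid calibration is scikit-learn's name for Platt scaling:</p>

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

def train_calibrated_svm(X, y, cv=2):
    """Train a linear SVM and map its decision scores to posterior probabilities
    via Platt scaling (sigmoid calibration), estimated by cross-validation."""
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=cv)
    clf.fit(X, y)
    return clf
```

Note that cross-validated calibration is what imposes the requirement, mentioned below, of having at least two positive training instances.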
        <p>The active learning process is seeded with a set of 1000 initial training documents S ⊂ P,
i.e., we randomly sample 1000 documents from our pool P and train our classifier on them. Since
in order to calibrate the SVM classifier we need at least 2 positive instances (i.e., instances of
⊕) for cross-validation, we always ensure that this condition is respected in S. We then run the
active learning process on the remaining 99,000 documents. This procedure is illustrated in
Algorithm 2. As previously mentioned, we also generate an analogous sequence of training/test</p>
        <p>Algorithm 2: Pseudo-code to generate active learning datasets.</p>
        <p>Input : Documents P; Set of training set sizes Σ; AL policy π; Batch size b
1  L ← random_sample(P, 1000);
2  U ← P − L;
3  m ← max(Σ);
4  n ← |L|;
5  h ← train_svm(L);
6  while |L| &lt; m do
7      L ← L ∪ select_via_policy(U, h, π, b);
8      h ← train_svm(L);
9      U ← P − L;
10     n ← |L|;
11     if n ∈ Σ then
12         save(L, U)
13     end
14 end</p>
        <p>sets with a Rand policy, i.e., random sampling constrained to keep the same class prevalence
values obtained by the corresponding active learning policies.</p>
        <p>Once the different training sets are generated, we train a calibrated SVM from scratch on
each of them and obtain a set of posterior probabilities PrPreSLD(⊕|x) for each respective test
set. Finally, we apply the SLD algorithm, obtaining a new set of posteriors PrPostSLD(⊕|x).</p>
        <p>In TAR scenarios, we are usually interested in the classification performance on the entire
pool P: for this reason, we merge the labels on the training set with the posterior probabilities
on the test set, obtaining a new set of probabilities Pr(⊕|x) where, for all x ∈ L, we take
Pr(⊕|x) = 1 if ⊕ is the true label of x and Pr(⊕|x) = 0 if ⊖ is the true label of x, with L the
training set. All of our evaluation measures are computed on this set of probabilities.</p>
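        <p>This merging step can be sketched as follows (an illustrative helper of our own; the function pool_probabilities and its parameter names are hypothetical):</p>

```python
import numpy as np

def pool_probabilities(train_idx, train_is_pos, test_idx, test_post, n_pool):
    """Combine training labels (taken as 0/1 probabilities) with test-set
    posteriors into a single vector of probabilities over the whole pool."""
    probs = np.empty(n_pool, dtype=float)
    probs[np.asarray(train_idx)] = np.asarray(train_is_pos, dtype=float)  # 1 for ⊕, 0 for ⊖
    probs[np.asarray(test_idx)] = np.asarray(test_post, dtype=float)      # classifier posteriors on U
    return probs
```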
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation measures</title>
        <p>To evaluate the performance of our classifier and the quality of the posteriors we use several
metrics, namely, Accuracy, Precision, Recall, F1, and Brier Score. We explain the last metric in
more detail, as the reader is likely familiar with the first four.</p>
        <p>
          Given a set U = {(x1, y(x1)), . . . , (x_{|U|}, y(x_{|U|}))} of unlabelled documents to be labelled
according to codeframe C = {⊕, ⊖}, and given posteriors Pr(⊕|x) for these documents, the
Brier score [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is defined as
        </p>
        <p>BS = (1 / |U|) · Σ_{i=1}^{|U|} ( 1(y(x_i) = ⊕) − Pr(⊕|x_i) )²    (4)
where 1(·) is a function that returns 1 if its argument is true and 0 otherwise. BS ranges between
0 (best) and 1 (worst), i.e., it is a measure of error, and not of accuracy, and rewards probabilistic
classifiers that return a high value of Pr(⊕|x) for instances of ⊕ and a low such value for
instances of ⊖. In our result tables we will report, instead of the Brier score, its complement to
1, i.e., (1 − BS), so that all our metrics can be interpreted as “the higher, the better”.</p>
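        <p>Equation 4 translates directly into code; the following is a small vectorized sketch (the function name brier_score is ours):</p>

```python
import numpy as np

def brier_score(y_is_pos, post_pos):
    """Brier score of Equation (4): mean squared difference between the
    0/1 indicator of the positive class and the posterior Pr(+|x)."""
    y = np.asarray(y_is_pos, dtype=float)   # 1 where the true label is ⊕, else 0
    p = np.asarray(post_pos, dtype=float)   # posteriors Pr(⊕|x)
    return float(np.mean((y - p) ** 2))
```

For instance, a classifier that is maximally confident and always right scores 0, and one that is maximally confident and always wrong scores 1.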
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Results</title>
        <p>We present the results of our experiments in Table 1 for ALvRS and in Table 2 for ALvUS.</p>
        <p>These results are averages across all of the 103 RCV1-v2 classes used in our experiments. We
show both the average results for each training set size (2000, 4000, 8000, 16000) and the results
averaged on all sizes. We note that in all cases (i.e., for both ALvRS and ALvUS, and for all sizes),
the use of SLD has a detrimental efect on the posterior probabilities. However, while this is true
for the setups generated via active learning, the use of SLD has a beneficial efect on the posteriors
when the  policies have been used.</p>
        <p>
          What we see on the AL datasets seems to contradict what was argued in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], i.e., that SLD
can improve posteriors in binary classification contexts with high PPS. The Rand policy, which
resembles the test data generation technique used in [6, Section 3.2.1], seems instead to confirm
the conclusions of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. However, when the training and the test sets do not originate from
random sampling, as is the case for the AL datasets, this hypothesis is disconfirmed.
        </p>
        <p>
          While we defer a proper analysis of the causes of this problem to future work, a first hypothesis
might be that the following is happening. When building active learning datasets, we can assume
that the documents that remain in the test set, as this decreases in size, are documents for which
the classifier is either fairly sure of their negative label (ALvRS) or of their label in general
(ALvUS). Furthermore, AL policies such as relevance sampling or uncertainty sampling suffer
from sampling bias [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], since both AL strategies solely depend on what the classifier thinks
is either relevant or uncertain; this means that, as the active learning phase proceeds, the
annotator is asked to review documents that are very similar to each other and, because of
this, not informative or representative enough of the actual dataset. Hence, especially when the
prevalence of ⊕ is very low, we may expect the distribution of the posterior probabilities to
be strongly skewed towards the negative class, much more than if the dataset were a random
sample of the population (as it is for the two Rand policies); in the latter case, the classifier
might still find documents in the test set for which its confidence is lower than in the AL
case. This can be seen in Figures 1 and 2, where we plot the posteriors Pr(⊕|x) pre- and
post-SLD for ALvRS, ALvUS and Rand on a randomly chosen RCV1-v2 class used in our experiments
(C17, training size 16,000). Notice how in both cases (RS and US) the posterior distribution
on the AL dataset is strongly skewed towards 0, whereas Rand’s is slightly more spread over
the [0.0, 0.3] interval.4 SLD seems to perform a correct rescaling of the posteriors in the Rand
cases, whereas it simply sets all posteriors to 0 in the AL cases. Since the PPS is equivalent
in both cases, the reasons are to be found in the document selection strategy, within the SLD
algorithm, or both. As we mentioned before, the sampling bias is likely responsible for the
skewness of the posterior probability distributions that we see in the plots, as this is the only
major difference between the AL and Rand policies. On the other hand, if the estimated
prevalence Pr(c) (which we compute as the average of the posteriors, see Algorithm 1) is close
to 0, as we see in the figures, then indeed SLD will drag the distribution towards 0. As a matter
of fact, consider the SLD updates of the prior and posterior probabilities performed in Line 13
and Line 15, respectively, of Algorithm 1. It is trivial to see that Pr^(s)(c|x) → 0 as
P̂r^(s)(c) → 0, i.e., the “maximization” of the “expectation” is that there are no positive
instances in the AL test set.
        </p>
        <p>(Figure 1: ALvRS posteriors pre- and post-SLD.)</p>
        <p>All this would require a deeper analysis, which however we defer to future work.
4 We did not plot the entire [0.0, 1.0] interval as there was hardly any probability mass beyond the 0.3 threshold. This makes
the plots more readable.</p>
        <p>(Figure 2: ALvUS posteriors pre- and post-SLD.)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We have studied the interactions between active learning methods and the SLD algorithm. It is
known that AL-generated scenarios tend to exhibit high prior probability shift, and that in past
research SLD has proven effective in improving the quality of the posteriors on sets of unlabelled
data, especially in cases of high PPS. We thus tested the use of SLD on the posteriors generated
by classifiers trained on AL-generated training sets, testing the hypothesis that SLD would
improve the quality of these posteriors. Our results do not support this hypothesis, showing
instead that the posteriors returned by AL-based classifiers deteriorate after the application of
SLD. We have run control experiments that used the same amount of PPS as the AL-generated
scenarios, albeit obtained by sampling the elements of the pool randomly. In this case SLD did
improve the quality of the posteriors, which indicates that SLD has a specific problem not with
the amount of PPS but with the documents selected by AL techniques.</p>
      <p>From these preliminary experiments we conclude that, counterintuitively, it is not
recommended to combine AL and SLD. In future work we will investigate more deeply the causes of
this problem, i.e., what aspect of the AL process results in the bad interaction with SLD, and if
and how it is possible to solve this problem, so as to combine the benefits of both methods.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by the AI4Media project, funded by the European Commission
(Grant 951911) under the H2020 Programme ICT-48-2020, and by the SoBigData++ project,
funded by the European Commission (Grant 871042) under the H2020 Programme
INFRAIA2019-1. The authors’ opinions do not necessarily reflect those of the European Commission.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <article-title>Probabilistic outputs for support vector machines and comparison to regularized likelihood methods</article-title>
          , in: A.
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Schölkopf</surname>
          </string-name>
          , D. Schuurmans (Eds.),
          <source>Advances in Large Margin Classifiers</source>
          , The MIT Press, Cambridge, MA,
          <year>2000</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <article-title>Transforming classifier scores into accurate multiclass probability estimates</article-title>
          ,
          <source>in: Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (KDD</source>
          <year>2002</year>
          ), Edmonton, CA,
          <year>2002</year>
          , pp.
          <fpage>694</fpage>
          -
          <lpage>699</lpage>
          . doi:10.1145/775107.775151.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Moreno-Torres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Raeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Alaíz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>A unifying view on dataset shift in classification</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>45</volume>
          (
          <year>2012</year>
          )
          <fpage>521</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saerens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Latinne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Decaestecker</surname>
          </string-name>
          ,
          <article-title>Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure</article-title>
          ,
          <source>Neural Computation</source>
          <volume>14</volume>
          (
          <year>2002</year>
          )
          <fpage>21</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>Tweet sentiment quantification: An experimental re-evaluation</article-title>
          ,
          <source>PLoS ONE</source>
          (
          <year>2022</year>
          ). Forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>A critical reassessment of the Saerens-Latinne-Decaestecker algorithm for posterior probability adjustment</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>39</volume>
          (
          <year>2021</year>
          )
          <article-title>Article 19</article-title>
          . doi:10.1145/3433164.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <article-title>Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review</article-title>
          ,
          <source>Richmond Journal of Law and Technology</source>
          <volume>17</volume>
          (
          <year>2011</year>
          )
          Article 5.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Baron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hedin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tomlinson</surname>
          </string-name>
          ,
          <article-title>Evaluation of information retrieval for E-discovery</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>18</volume>
          (
          <year>2010</year>
          )
          <fpage>347</fpage>
          -
          <lpage>386</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Vinjumur</surname>
          </string-name>
          ,
          <article-title>Jointly minimizing the expected costs of review for responsiveness and privilege in e-discovery</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>37</volume>
          (
          <year>2018</year>
          )
          <fpage>11:1</fpage>
          -
          <lpage>11:35</lpage>
          . doi:10.1145/3268928.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <article-title>Leveraging the transductive nature of e-discovery in cost-sensitive technology-assisted review</article-title>
          ,
          <source>in: Proceedings of the 8th BCS-IRSG Symposium on Future Directions in Information Access (FDIA</source>
          <year>2019</year>
          ), Milano, IT
          ,
          <year>2019</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <article-title>Risk minimization models for technology-assisted review and their application to e-discovery</article-title>
          , Master's thesis, Department of Computer Science, University of Pisa, Pisa, IT
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Dempster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Laird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <article-title>Maximum likelihood from incomplete data via the EM algorithm</article-title>
          ,
          <source>Journal of the Royal Statistical Society, B</source>
          <volume>39</volume>
          (
          <year>1977</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <article-title>A sequential algorithm for training text classifiers</article-title>
          ,
          <source>in: Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>1994</year>
          ), Dublin, IE,
          <year>1994</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1007/978-1-4471-2099-5_1.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>RCV1: A new benchmark collection for text categorization research</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Brier</surname>
          </string-name>
          ,
          <article-title>Verification of forecasts expressed in terms of probability</article-title>
          ,
          <source>Monthly Weather Review</source>
          <volume>78</volume>
          (
          <year>1950</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          . doi:10.1175/1520-0493(1950)078&lt;0001:VOFEIT&gt;2.0.CO;2.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Hierarchical sampling for active learning</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on Machine Learning (ICML</source>
          <year>2008</year>
          ), Helsinki, FI,
          <year>2008</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>