<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CIRCLE</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Active Learning and the Saerens-Latinne-Decaestecker Algorithm: An Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessio Molinari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Sebastiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>56124, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>The Saerens-Latinne-Decaestecker (SLD) algorithm is a method whose goal is to improve the quality of the posterior probabilities (or simply “posteriors”) returned by a probabilistic classifier in scenarios characterized by prior probability shift (PPS) between the training set and the unlabelled (“test”) set. This is an important task, (a) because posteriors are of the utmost importance in downstream tasks such as, e.g., multiclass classification and cost-sensitive classification, and (b) because PPS is ubiquitous in many applications. In this paper we explore whether using SLD can indeed improve the quality of the posteriors returned by a classifier trained via active learning (AL), a class of machine learning (ML) techniques that indeed tend to generate substantial PPS. Specifically, we target AL via relevance sampling (ALvRS) and AL via uncertainty sampling (ALvUS), two AL techniques that are very well known especially because, due to their low computational cost, they are suitable for application in scenarios characterized by large datasets. We present experimental results obtained on the RCV1-v2 dataset, showing that SLD fails to deliver better-quality posteriors with both ALvRS and ALvUS, thus contradicting previous findings in the literature, and that this is due not to the amount of PPS that these techniques generate, but to how the examples they prioritize for annotation are distributed.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Classification</kwd>
        <kwd>Probabilistic Classifiers</kwd>
        <kwd>Active Learning</kwd>
        <kwd>Posterior Probabilities</kwd>
        <kwd>Prior Probabilities</kwd>
        <kwd>Prior Probability Shift</kwd>
        <kwd>Dataset Shift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the field of probabilistic classification, a posterior probability (or simply: a posterior) Pr(c|x)
represents the confidence that a classifier h : X → C has in the fact that an unlabelled
(“test”) document x belongs to class c. Like all confidence scores, posteriors are useful for
ranking unlabelled documents (say, in terms of perceived relevance to class c). However, for
some downstream tasks other than ranking, such as multiclass classification and cost-sensitive
classification, standard (non-probabilistic) confidence scores are not enough, and true posteriors
are needed.</p>
      <p>
        For these downstream tasks to be carried out accurately, it is essential that the posteriors are
high-quality, i.e., well-calibrated.1 Some classifiers (e.g., those trained by logistic regression)
tend to return calibrated posteriors (we thus say that they are calibrated classifiers ); some other
classifiers (e.g., those trained by naive Bayesian methods) tend to return posteriors that are not
calibrated; yet some other classifiers return confidence scores that are not probabilities. For the
last two cases, methods exist (see e.g., [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]) to calibrate uncalibrated classifiers.
      </p>
      <p>
        Unfortunately, independently of the learning method used for training the classifiers,
posteriors tend to be uncalibrated when the application scenario suffers from prior probability shift
(PPS – [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), i.e., the (ubiquitous) phenomenon according to which the distribution Pr_U(c) of the
unlabelled test documents U across the classes is different from the distribution Pr_L(c) of the
labelled training documents L. This is due to the fact that when the (calibrated or uncalibrated)
classifiers generate the posteriors, they assume that the class prior probabilities Pr_U(c) (a.k.a.
“priors”, or “class prevalence values”) in the set U of unlabelled documents are the same as those
encountered in the training set L. If this is not the case, the returned posteriors end up not
being calibrated.
      </p>
      <p>
        The Saerens-Latinne-Decaestecker (SLD) algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a well-known method for
recalibrating the posteriors of a set of unlabelled documents in the presence of PPS between the training
set and this latter set. Given a machine-learned classifier and a set of unlabelled documents
for which the classifier has returned posteriors and estimates of the priors, SLD updates them
both in an iterative, mutually recursive way, with the goal of making both more accurate. Since
its publication, SLD has become the standard algorithm for recalibrating the posteriors in the
presence of PPS, and is still considered a top contender (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) when we need to estimate the
priors (a task that has become known as “quantification”).
      </p>
      <p>
        However, its real effectiveness in improving the quality of the posteriors is not yet entirely
clear. On the one hand, a recent, large experimental study [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] has shown that, at least when the
number of classes in the classification scheme is very small and the classifier is calibrated,
SLD does improve the quality of the posteriors, and especially so when the amount of PPS is
high. On the other hand, in experiments aimed at improving the quality of cost-sensitive text
classification in technology-assisted review (TAR) [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ], SLD has (surprisingly) not delivered any
measurable improvement in the quality of the posteriors, not even when the amount of PPS
was high [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The relationship between SLD and PPS is thus still unclear.
      </p>
      <p>
        The goal of this paper is to shed some light on this relationship. The reason why we are
interested in this is that, if SLD indeed improved the quality of the posteriors under PPS, it
would be extremely useful for TAR. In fact, in TAR we typically use a classifier trained on
labelled data in order to return posterior probabilities of relevance for a large set of unlabelled
documents. These posteriors are needed for ranking the unlabelled documents in terms of their
probability of relevance, and high-quality posteriors are of key importance for approaches to
1 The posteriors Pr(c|x), where x belongs to a set S = {x1, ..., x_{|S|}}, are said to be well-calibrated when, for all
p ∈ [0, 1], it holds that

    |{x ∈ c ∩ S : Pr(c|x) = p}| / |{x ∈ S : Pr(c|x) = p}| ≈ p    (1)

Perfect calibration is usually unattainable on any non-trivial dataset; however, calibration comes in degrees (and
the quality of calibration can indeed be measured), so efforts can be made to obtain posteriors which are as close as
possible to their perfectly calibrated counterparts.
      </p>
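      <p>The condition of Equation 1 can be checked empirically by bucketing the posteriors into bins, since exact equality Pr(c|x) = p almost never holds on finite data. The following is an illustrative sketch of our own (the function name calibration_bins and its parameters are hypothetical, not from the paper):</p>

```python
import numpy as np

def calibration_bins(y_is_pos, post_pos, n_bins=10):
    """Binned check of Equation (1): for each posterior bin, compare the mean
    posterior (confidence) with the observed fraction of positive documents."""
    y = np.asarray(y_is_pos, dtype=float)
    p = np.asarray(post_pos, dtype=float)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)  # bin index of each posterior
    out = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # well-calibrated posteriors: the two values in each pair are close
            out.append((p[mask].mean(), y[mask].mean()))
    return out
```

For well-calibrated posteriors, the two values in each returned pair should be approximately equal, which is the binned analogue of Equation 1.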
      <p>
        TAR based on risk minimization [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Additionally, TAR settings are typically characterized by
PPS, because the typical way to build a training set L in TAR is via active learning (AL), which
usually generates PPS. So, the research question we want to answer is
      </p>
      <p>RQ: Does SLD improve the quality of posterior probabilities in situations
in which the training set L used for training the probabilistic classifier has
been generated via active learning?
In the rest of the paper we briefly introduce the SLD algorithm (Section 2) and the two active
learning techniques (ALvRS and ALvUS) we use in order to investigate our research question
(Section 3), after which we present the results of our experiments (Section 4) followed (Section 5)
by a few concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The SLD Algorithm</title>
      <p>We assume a training set L of labelled examples and a set U = {(x1, y(x1)), . . . , (x_{|U|}, y(x_{|U|}))}
of unlabelled examples, i.e., examples whose true labels y(x) ∈ C = {c1, . . . , c_{|C|}} are unknown
to the system.</p>
      <p>
        SLD, proposed by Saerens et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is an instance of Expectation Maximization [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a
well-known iterative algorithm for finding maximum-likelihood estimates of parameters (in our case:
the class prior probabilities) for models that depend on unobserved variables (in our case: the
class labels). Pseudocode of the SLD algorithm is here included as Algorithm 1.
      </p>
      <p>Essentially, SLD iteratively updates (Line 13) the estimates of the class priors by using the
posteriors computed in the previous iteration, and updates (Line 15) the posteriors by using the
estimates of the class priors computed in the present iteration, in a mutually recursive fashion.
The main goal is to adjust the posteriors and re-estimate the priors in such a way that they are
mutually consistent, i.e., such that</p>
      <p>Pr_U(c) = (1 / |U|) · Σ_{x ∈ U} Pr(c|x)    (2)
Equation 2 is a necessary (albeit not sufficient) condition for the posteriors Pr(c|x) of the
documents x ∈ U to be calibrated. SLD may thus be viewed as making a step towards calibrating
these posteriors.</p>
      <p>The algorithm iterates until convergence, i.e., until the class priors become stable and
Equation 2 is satisfied. The convergence of SLD may be tested by computing how much the distribution of
the priors at iteration (s − 1) and that at iteration (s) still diverge; this can be evaluated, for
instance, in terms of absolute error, i.e.,2</p>
      <p>
        AE(p̂^(s−1), p̂^(s)) = (1 / |C|) · Σ_{c ∈ C} | P̂r^(s)(c) − P̂r^(s−1)(c) |    (3)
2 Consistently with most mathematical literature, we use the caret symbol (ˆ) to indicate estimation.
Algorithm 1: The SLD algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Input : Class priors Pr(c) on L, for all c ∈ C;
Posterior probabilities Pr(c|x), for all c ∈ C and for all x ∈ U;
Output : Estimates P̂r_U(c) of class prevalence values on U, for all c ∈ C;
Updated posterior probabilities Pr(c|x), for all c ∈ C and for all x ∈ U;</p>
      <p>1  // Initialization
2  s ← 0;
3  for c ∈ C do
4      P̂r^(0)(c) ← Pr(c);                                  // Initialize the prior estimates
5      for x ∈ U do
6          Pr^(0)(c|x) ← Pr(c|x);                           // Initialize the posteriors
7      end
8  end
9  // Main Iteration Cycle
10 while stopping condition = false do
11     s ← s + 1;
12     for c ∈ C do
13         P̂r^(s)(c) ← (1 / |U|) · Σ_{x ∈ U} Pr^(s−1)(c|x); // Update the prior estimates
14         for x ∈ U do
15             Pr^(s)(c|x) ← [ (P̂r^(s)(c) / P̂r^(0)(c)) · Pr^(0)(c|x) ] / [ Σ_{c′ ∈ C} (P̂r^(s)(c′) / P̂r^(0)(c′)) · Pr^(0)(c′|x) ];   // Update the posteriors
16         end
17     end
18 end</p>
      <p>In the experiments of Section 4, we decree that convergence has been reached when
AE(p̂^(s−1), p̂^(s)) &lt; 10^−6; we stop SLD when we have reached either convergence or the
maximum number of iterations (which we set to 1000).</p>
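      <p>For concreteness, Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative reimplementation under our reading of the pseudocode, not the authors' code; the function name sld and its argument names are our own:</p>

```python
import numpy as np

def sld(posteriors, train_priors, max_iter=1000, tol=1e-6):
    """SLD sketch: `posteriors` is an (n_docs, n_classes) array of Pr(c|x) on U;
    `train_priors` is an (n_classes,) array of class priors Pr(c) observed on L."""
    p0 = np.asarray(posteriors, dtype=float)         # Pr^(0)(c|x), kept fixed
    priors0 = np.asarray(train_priors, dtype=float)  # P^r^(0)(c): prior estimates at step 0
    priors = priors0.copy()
    post = p0.copy()
    for _ in range(max_iter):
        new_priors = post.mean(axis=0)               # Line 13: update the prior estimates
        scaled = (new_priors / priors0) * p0         # Line 15: rescale the ORIGINAL posteriors
        post = scaled / scaled.sum(axis=1, keepdims=True)
        ae = np.abs(new_priors - priors).mean()      # Equation (3): absolute error
        priors = new_priors
        if tol > ae:                                 # convergence reached
            break
    return post, priors
```

When there is no shift (i.e., the mean posterior already equals the training prior), the update is a no-op, which matches the fixed-point reading of Equation 2.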
      <p>While SLD is a natively multiclass algorithm, in this paper we restrict our analysis to the
binary case, with codeframe C = {⊕, ⊖}.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Active learning policies</title>
      <p>
        In the experiments for this work, we test the SLD algorithm on training/test sets generated via
two of the best-known active learning policies, namely Active Learning via Relevance Sampling
(ALvRS) and Active Learning via Uncertainty Sampling (ALvUS), first presented in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. While
fairly old and unsophisticated, these policies are still very popular because of their small
computational cost, which makes them well suited to applications (such as TAR) in which the
set of unlabelled documents that are candidates for annotation, and that the AL policy must
thus rank, is large.
      </p>
      <p>Active Learning via Relevance Sampling (ALvRS). ALvRS is an interactive process which,
given a data pool of unlabelled documents P, asks the reviewer to annotate an initial “seed” set
of documents S ⊂ P, uses S as the training set L to train a binary classifier h, and uses h to
rank the documents in (P ∖ L) in decreasing order of their posterior probability of relevance
Pr(⊕|x). Then, the reviewer is asked to annotate the b documents for which Pr(⊕|x) is highest
(with b the batch size), which, once annotated, are added to the training set L. Finally, we
retrain our classifier on the new training set and repeat the process, until a predefined number
of documents (the annotation budget) have been reviewed.</p>
      <p>Active Learning via Uncertainty Sampling (ALvUS). The ALvUS policy is a variation of
ALvRS, where we review the documents not in decreasing order of Pr(⊕|x) but in increasing
order of | Pr(⊕|x) − 0.5|, i.e., we top-rank the documents which the classifier is most uncertain
about.</p>
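      <p>The two ranking criteria can be sketched as follows. This is an illustrative sketch of our own; the function select_batch and its signature are hypothetical names, not from the paper:</p>

```python
import numpy as np

def select_batch(post_pos, b, policy="ALvRS"):
    """Return the indices of the b pool documents to annotate next.
    `post_pos` is an (n,) array of posteriors Pr(+|x) on the unlabelled pool."""
    post_pos = np.asarray(post_pos, dtype=float)
    if policy == "ALvRS":
        scores = post_pos                  # most probably relevant first
    elif policy == "ALvUS":
        scores = -np.abs(post_pos - 0.5)   # closest to 0.5, i.e., most uncertain, first
    else:
        raise ValueError(policy)
    return np.argsort(-scores)[:b]         # top-b documents by descending score
```

Both policies only sort one vector of posteriors per round, which is what makes their computational cost so small.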
      <p>The Rand policies. For each of the two policies defined above, we define an “oracle-like”
policy which we will use for a control experiment. The aim of these policies, which we call
Rand(RS) and Rand(US) (corresponding to ALvRS and ALvUS, respectively), is to help us better
understand whether the results we are seeing are due to the PPS generated by the two active
learning policies, or to their document selection strategy. Given a set L of labelled documents
and a set U of unlabelled documents generated via an active learning policy by sampling
P, the Rand policy samples P randomly to generate alternative labelled and unlabelled sets
L′ and U′ subject to the constraints that |L| = |L′|, |U| = |U′|, Pr_L(⊕) = Pr_{L′}(⊕), and
Pr_U(⊕) = Pr_{U′}(⊕). In other words, the Rand policies generate the same PPS as the active
learning policy, but with a different choice of documents.</p>
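      <p>A prevalence-matched random split of this kind can be sketched as follows (an illustrative helper of our own; rand_split and its parameters are hypothetical names, and we specify the split via the training size and its number of positives):</p>

```python
import numpy as np

def rand_split(labels_pos, n_train, n_train_pos, seed=None):
    """Randomly split a pool into alternative sets L' and U' so that L' has
    n_train documents of which exactly n_train_pos are positive, thus
    reproducing the sizes and prevalences of an AL-generated L/U split."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels_pos, dtype=bool)
    pos = rng.permutation(np.flatnonzero(labels))    # shuffled positive documents
    neg = rng.permutation(np.flatnonzero(~labels))   # shuffled negative documents
    train = np.concatenate([pos[:n_train_pos], neg[:n_train - n_train_pos]])
    test = np.concatenate([pos[n_train_pos:], neg[n_train - n_train_pos:]])
    return train, test
```

Since only the class counts are constrained, the documents themselves are an unbiased random sample, which is exactly what distinguishes Rand from the AL policies.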
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We run a set of comparative experiments to explore the interaction between AL-based classifiers
and SLD. In order to do this we test the two AL policies described above, ALvRS and ALvUS,
and compare them with the Rand(RS) and Rand(US) policies.</p>
      <sec id="sec-4-1">
        <title>4.1. The RCV1-v2 dataset</title>
        <p>
          We run our experiments on the RCV1-v2 dataset [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], a multi-label multi-class collection of
804,414 Reuters news stories (produced from August 1996 to August 1997).3
3 We use the RCV1-v2 dataset as provided by the scikit-learn implementation: https://scikit-learn.org/stable/datasets/
real_world.html#rcv1-dataset
The RCV1-v2 codeframe consists of a set of 103 classes. Since in this work we experiment with binary classification
problems only, for each such class c we consider a binary codeframe C = {⊕, ⊖}, where
⊕ = c and ⊖ is its complement. Finally, in order to keep computational costs within reasonable bounds, we
only work with a pool P consisting of the first 100,000 documents of the RCV1-v2 collection.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental setup</title>
        <p>
          For each class c ∈ C, and for each AL policy, we run the AL process to generate a sequence
of binary classification training sets with incremental sizes; this determines a corresponding
sequence of test sets, since the pool P is always the union of the training set L and the test set U.
As for training the classifier, in all of our experiments we use an SVM algorithm, post-calibrated
via Platt calibration [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
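        <p>A minimal sketch of this training step with scikit-learn follows. We assume LinearSVC as the underlying SVM (the paper does not specify the exact implementation), and sigmoid calibration is scikit-learn's name for Platt scaling:</p>

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

def train_calibrated_svm(X, y, cv=2):
    """Train a linear SVM and map its decision scores to posterior probabilities
    via Platt scaling (sigmoid calibration), estimated by cross-validation."""
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=cv)
    clf.fit(X, y)
    return clf
```

Note that cross-validated calibration is what imposes the requirement, mentioned below, of having at least two positive training instances.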
        <p>The active learning process is seeded with a set of 1000 initial training documents S ⊂ P,
i.e., we randomly sample 1000 documents from our pool P and train our classifier on them. Since
in order to calibrate the SVM classifier we need at least 2 positive instances (i.e., instances of
⊕) for cross-validation, we always ensure that this condition is respected in S. We then run the
active learning process on the remaining 99,000 documents. This procedure is illustrated in
Algorithm 2. As previously mentioned, we also generate an analogous sequence of training/test</p>
        <p>Algorithm 2: Pseudo-code to generate active learning datasets.</p>
        <p>Input : Documents P; Set of training set sizes Σ; AL policy π; Batch size b
1  L ← random_sample(P, 1000);
2  U ← P − L;
3  m ← max(Σ);
4  n ← |L|;
5  h ← train_svm(L);
6  while |L| &lt; m do
7      L ← L ∪ select_via_policy(U, h, π, b);
8      h ← train_svm(L);
9      U ← P − L;
10     n ← |L|;
11     if n ∈ Σ then
12         save(L, U)
13     end
14 end</p>
        <p>sets with a Rand policy, i.e., random sampling constrained to keep the same class prevalence
values obtained by the corresponding active learning policies.</p>
        <p>Once the different training sets are generated, we train a calibrated SVM from scratch on
each of them and obtain a set of posterior probabilities PrPreSLD(⊕|x) for each respective test
set. Finally, we apply the SLD algorithm, obtaining a new set of posteriors PrPostSLD(⊕|x).</p>
        <p>In TAR scenarios, we are usually interested in the classification performance on the entire
pool P: for this reason, we merge the labels on the training set with the posterior probabilities
on the test set, obtaining a new set of probabilities Pr(⊕|x) where, for all x ∈ L, we take
Pr(⊕|x) = 1 if ⊕ is the true label of x and Pr(⊕|x) = 0 if ⊖ is the true label of x, with L the
training set. All of our evaluation measures are computed on this set of probabilities.</p>
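        <p>This merging step can be sketched as follows (an illustrative helper of our own; the function pool_probabilities and its parameter names are hypothetical):</p>

```python
import numpy as np

def pool_probabilities(train_idx, train_is_pos, test_idx, test_post, n_pool):
    """Combine training labels (taken as 0/1 probabilities) with test-set
    posteriors into a single vector of probabilities over the whole pool."""
    probs = np.empty(n_pool, dtype=float)
    probs[np.asarray(train_idx)] = np.asarray(train_is_pos, dtype=float)  # 1 for ⊕, 0 for ⊖
    probs[np.asarray(test_idx)] = np.asarray(test_post, dtype=float)      # classifier posteriors on U
    return probs
```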
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation measures</title>
        <p>To evaluate the performance of our classifier and the quality of the posteriors we use several
metrics, namely, Accuracy, Precision, Recall, F1, and Brier Score. We explain the last metric in
more detail, as the reader is likely familiar with the first four.</p>
        <p>
          Given a set U = {(x1, y(x1)), . . . , (x_{|U|}, y(x_{|U|}))} of unlabelled documents to be labelled
according to codeframe C = {⊕, ⊖}, and given posteriors Pr(⊕|x) for these documents, the
Brier score [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is defined as
        </p>
        <p>BS = (1 / |U|) · Σ_{i=1}^{|U|} ( 1(y(x_i) = ⊕) − Pr(⊕|x_i) )²    (4)
where 1(·) is a function that returns 1 if its argument is true and 0 otherwise. BS ranges between
0 (best) and 1 (worst), i.e., it is a measure of error, and not of accuracy, and rewards probabilistic
classifiers that return a high value of Pr(⊕|x) for instances of ⊕ and a low such value for
instances of ⊖. In our result tables we will report, instead of the Brier score, its complement to
1, i.e., (1 − BS), so that all our metrics can be interpreted as “the higher, the better”.</p>
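        <p>Equation 4 translates directly into code; the following is a small vectorized sketch (the function name brier_score is ours):</p>

```python
import numpy as np

def brier_score(y_is_pos, post_pos):
    """Brier score of Equation (4): mean squared difference between the
    0/1 indicator of the positive class and the posterior Pr(+|x)."""
    y = np.asarray(y_is_pos, dtype=float)   # 1 where the true label is ⊕, else 0
    p = np.asarray(post_pos, dtype=float)   # posteriors Pr(⊕|x)
    return float(np.mean((y - p) ** 2))
```

For instance, a classifier that is maximally confident and always right scores 0, and one that is maximally confident and always wrong scores 1.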
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Results</title>
        <p>We present the results of our experiments in Table 1 for ALvRS and in Table 2 for ALvUS.</p>
        <p>These results are averages across all of the 103 RCV1-v2 classes used in our experiments. We
show both the average results for each training set size (2000, 4000, 8000, 16000) and the results
averaged on all sizes. We note that in all cases (i.e., for both ALvRS and ALvUS, and for all sizes),
the use of SLD has a detrimental efect on the posterior probabilities. However, while this is true
for the setups generated via active learning, the use of SLD has a beneficial efect on the posteriors
when the  policies have been used.</p>
        <p>
          What we see on the AL datasets seems to contradict what was argued in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], i.e., that SLD
can improve posteriors in binary classification contexts with high PPS. The Rand policy, which
resembles the test data generation technique used in [6, Section 3.2.1], seems instead to confirm
the conclusions of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. However, when the training and the test sets do not originate from
random sampling, as is the case for the AL datasets, this hypothesis is disconfirmed.
        </p>
        <p>
          While we defer a proper analysis of the causes of this problem to future work, a first hypothesis
might be that the following is happening. When building active learning datasets, we can assume
that the documents that remain in the test set, as this decreases in size, are documents for which
the classifier is either fairly sure of their negative label (ALvRS) or of their label in general
(ALvUS). Furthermore, AL policies such as relevance sampling or uncertainty sampling suffer
from sampling bias [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], since both AL strategies solely depend on what the classifier thinks
is either relevant or uncertain; this means that, as the active learning phase proceeds, the
annotator is asked to review documents that are very similar to each other and, because of
this, not informative or representative enough of the actual dataset. Hence, especially when the
prevalence of ⊕ is very low, we may expect the distribution of the posterior probabilities to
be strongly skewed towards the negative class, much more than if the dataset were a random
sample of the population (as it is for the two Rand policies); in the latter case, the classifier
might still find documents in the test set for which its confidence is lower than in the AL
case. This can be seen in Figures 1 and 2, where we plot the posteriors Pr(⊕|x) pre- and
post-SLD for ALvRS, ALvUS and Rand on a randomly chosen RCV1-v2 class used in our experiments
(C17, training size 16,000). Notice how in both cases (RS and US) the posterior distribution
on the AL dataset is strongly skewed towards 0, whereas Rand’s is slightly more spread over
the [0.0, 0.3] interval.4 SLD seems to perform a correct rescaling of the posteriors in the Rand
cases, whereas it simply sets all posteriors to 0 in the AL cases. Since the PPS is equivalent
in both cases, the reasons are to be found in the document selection strategy, within the SLD
algorithm, or both. As we mentioned before, the sampling bias is likely responsible for the
skewness of the posterior probability distributions that we see in the plots, as this is the only
major difference between the AL and Rand policies. On the other hand, if the estimated
prevalence Pr(c) (which we compute as the average of the posteriors, see Algorithm 1) is close
to 0, as we see in the figures, then indeed SLD will drag the distribution towards 0. As a matter
of fact, consider the SLD updates of the prior and posterior probabilities performed in Line 13
and Line 15, respectively, of Algorithm 1. It is trivial to see that Pr^(s)(c|x) → 0 as
P̂r^(s)(c) → 0, i.e., the “maximization” of the “expectation” is that there are no positive
instances in the AL test set.
        </p>
        <p>(Figure 1: ALvRS posteriors pre- and post-SLD.)</p>
        <p>All this would require a deeper analysis, which however we defer to future work.
4 We did not plot the entire [0.0, 1.0] interval as there was hardly any probability mass beyond the 0.3 threshold. This makes
the plots more readable.</p>
        <p>(Figure 2: ALvUS posteriors pre- and post-SLD.)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We have studied the interactions between active learning methods and the SLD algorithm. It is
known that AL-generated scenarios tend to exhibit high prior probability shift, and that in past
research SLD has proven effective in improving the quality of the posteriors on sets of unlabelled
data, especially in cases of high PPS. We thus tested the use of SLD on the posteriors generated
by classifiers trained on AL-generated training sets, testing the hypothesis that SLD would
improve the quality of these posteriors. Our results do not support this hypothesis, showing
instead that the posteriors returned by AL-based classifiers deteriorate after the application of
SLD. We have run control experiments that used the same amount of PPS as the AL-generated
scenarios, albeit obtained by sampling the elements of the pool randomly. In this case SLD did
improve the quality of the posteriors, which indicates that SLD has a specific problem not with
the amount of PPS but with the documents selected by AL techniques.</p>
      <p>From these preliminary experiments we conclude that, counterintuitively, it is not
recommended to combine AL and SLD. In future work we will investigate more deeply the causes of
this problem, i.e., what aspect of the AL process results in the bad interaction with SLD, and if
and how it is possible to solve this problem, so as to combine the benefits of both methods.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by the AI4Media project, funded by the European Commission
(Grant 951911) under the H2020 Programme ICT-48-2020, and by the SoBigData++ project,
funded by the European Commission (Grant 871042) under the H2020 Programme
INFRAIA2019-1. The authors’ opinions do not necessarily reflect those of the European Commission.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <article-title>Probabilistic outputs for support vector machines and comparison to regularized likelihood methods</article-title>
          , in: A.
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Schölkopf</surname>
          </string-name>
          , D. Schuurmans (Eds.),
          <source>Advances in Large Margin Classifiers</source>
          , The MIT Press, Cambridge, MA,
          <year>2000</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <article-title>Transforming classifier scores into accurate multiclass probability estimates</article-title>
          ,
          <source>in: Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (KDD</source>
          <year>2002</year>
          ), Edmonton, CA,
          <year>2002</year>
          , pp.
          <fpage>694</fpage>
          -
          <lpage>699</lpage>
          . doi:10.1145/775107.775151.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Moreno-Torres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Raeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Alaíz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>A unifying view on dataset shift in classification</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>45</volume>
          (
          <year>2012</year>
          )
          <fpage>521</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saerens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Latinne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Decaestecker</surname>
          </string-name>
          ,
          <article-title>Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure</article-title>
          ,
          <source>Neural Computation</source>
          <volume>14</volume>
          (
          <year>2002</year>
          )
          <fpage>21</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>Tweet sentiment quantification: An experimental re-evaluation</article-title>
          ,
          <source>PLoS ONE</source>
          (
          <year>2022</year>
          ). Forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>A critical reassessment of the Saerens-Latinne-Decaestecker algorithm for posterior probability adjustment</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>39</volume>
          (
          <year>2021</year>
          )
          <article-title>Article 19</article-title>
          . doi:10.1145/3433164.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <article-title>Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review</article-title>
          ,
          <source>Richmond Journal of Law and Technology</source>
          <volume>17</volume>
          (
          <year>2011</year>
          )
          Article 5.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Baron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hedin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tomlinson</surname>
          </string-name>
          ,
          <article-title>Evaluation of information retrieval for E-discovery</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>18</volume>
          (
          <year>2010</year>
          )
          <fpage>347</fpage>
          -
          <lpage>386</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Vinjumur</surname>
          </string-name>
          ,
          <article-title>Jointly minimizing the expected costs of review for responsiveness and privilege in e-discovery</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>37</volume>
          (
          <year>2018</year>
          )
          <fpage>11:1</fpage>
          -
          <lpage>11:35</lpage>
          . doi:10.1145/3268928.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <article-title>Leveraging the transductive nature of e-discovery in cost-sensitive technology-assisted review</article-title>
          ,
          <source>in: Proceedings of the 8th BCS-IRSG Symposium on Future Directions in Information Access (FDIA</source>
          <year>2019</year>
          ), Milano, IT
          ,
          <year>2019</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <article-title>Risk minimization models for technology-assisted review and their application to e-discovery</article-title>
          , Master's thesis, Department of Computer Science, University of Pisa, Pisa, IT
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Dempster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Laird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <article-title>Maximum likelihood from incomplete data via the EM algorithm</article-title>
          ,
          <source>Journal of the Royal Statistical Society, B</source>
          <volume>39</volume>
          (
          <year>1977</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <article-title>A sequential algorithm for training text classifiers</article-title>
          ,
          <source>in: Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>1994</year>
          ), Dublin, IE,
          <year>1994</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1007/978-1-4471-2099-5_1.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>RCV1: A new benchmark collection for text categorization research</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Brier</surname>
          </string-name>
          ,
          <article-title>Verification of forecasts expressed in terms of probability</article-title>
          ,
          <source>Monthly Weather Review</source>
          <volume>78</volume>
          (
          <year>1950</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          . doi:10.1175/1520-0493(1950)078&lt;0001:VOFEIT&gt;2.0.CO;2.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Hierarchical sampling for active learning</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on Machine Learning (ICML</source>
          <year>2008</year>
          ), Helsinki, FI,
          <year>2008</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>