UniLeiden at LeQua 2022: The first step in understanding the behaviour of the median sweep quantifier using continuous sweep

Kevin Kloos1,2, Quinten A. Meertens2,3 and Julian D. Karch1

1 Leiden University, Faculty of Social Sciences, Institute of Psychology, Department of Methodology and Statistics, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands
2 Statistics Netherlands, Henri Faasdreef 312, 2492 JP Den Haag, The Netherlands
3 University of Amsterdam, Amsterdam School of Economics, Center for Nonlinear Dynamics in Economics and Finance, Roetersstraat 11, 1018 WB Amsterdam, The Netherlands

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
k.kloos@fsw.leidenuniv.nl (K. Kloos); q.a.meertens@uva.nl (Q. A. Meertens); j.d.karch@fsw.leidenuniv.nl (J. D. Karch)
https://github.com/kevinkloos (K. Kloos)
ORCID: 0000-0001-6980-4259 (K. Kloos); 0000-0002-3485-8895 (Q. A. Meertens); 0000-0002-1625-2822 (J. D. Karch)

Abstract
This paper presents the continuous sweep quantifier, a smoothed adaptation of the median sweep quantifier. Previous research has shown that median sweep performs well empirically, but it is not well understood why, because its theoretical properties are hard to derive. The continuous sweep quantifier is a modification of median sweep that enables computing theoretical results. The continuous sweep quantifier 1) uses kernel estimates instead of the empirical distribution, 2) constructs decision boundaries instead of applying discrete decision rules, and 3) uses the mean instead of the median. We show that a simplified adaptation of the continuous sweep quantifier performs similarly to the median sweep quantifier in terms of bias and variance on the LeQua 2022 dataset. The continuous sweep quantifier can therefore be used to provide insight into the median sweep quantifier by deriving theoretical expressions for bias and variance.

Keywords: quantification learning, learning to quantify, classification, machine learning, median sweep, continuous sweep, LeQua 2022

1. Introduction

Quantification learning, also known as learning to quantify or simply quantification, is a machine learning task that aims to estimate the class prevalences in an unlabeled test set [1]. Quantification used to be seen as a by-product of classification: a good classifier, it was assumed, should also produce good prevalence estimates. However, Forman objected to this view and showed that simply classifying and counting the predicted labels of a classifier may lead to severe bias [2]. Therefore, more advanced techniques are needed.

Over the past decades, specific techniques for quantification learning, called quantifiers, have been developed. Binary quantifiers can be categorized into three groups [1]: the group based
on Classify, Count and Correct, the group based on direct learners, and the group based on distribution matching [3, 4]. Currently, there is no consensus in the academic literature about which group of techniques performs best. According to Vapnik's principle [5], a problem should be solved directly, without solving a more general problem as an intermediate step. Classification is a more general task than quantification. Therefore, Vapnik's principle implies that quantifiers should be constructed without the intermediate step of building a classifier [5, 6]. Schumacher et al. compared quantification techniques empirically using an extensive simulation study [7]. They concluded that some techniques based on Classify, Count and Correct, that is, quantifiers that do construct a classifier as an intermediate step, performed best. In particular, the median sweep method of Forman performed well among all popular quantifiers [3]. This empirical finding is at odds with what Vapnik's principle suggests. An open question is therefore when and why median sweep is such a good quantifier [7].

In this paper, we take the first step in understanding why median sweep is a good quantifier. We propose to perform a theoretical analysis. More specifically, we aim to derive the mean squared error of the median sweep method as a quantifier for the prevalence of the positive class (𝛼) in a binary classification setting. Fortunately, theoretical results for several threshold-based quantifiers have already been derived [8, 9, 10, 11, 12]. We aim to extend these results to median sweep, which is, in fact, an ensemble of threshold-based quantifiers.

The key challenge in the theoretical analysis is the discrete nature of median sweep. Therefore, this paper introduces the new continuous sweep quantifier. Continuous sweep is constructed to have empirical performance similar to median sweep while allowing for easier analytical derivations. Since continuous sweep and median sweep are closely related, we anticipate that thoroughly understanding the theoretical properties of continuous sweep will also provide insight into the properties of median sweep. In this paper, we construct the continuous sweep quantifier, study its empirical performance, and specify a research agenda for the theoretical analysis of this new quantifier.

The remainder of the paper is organized as follows. In Section 2, we introduce the mathematical notation and reiterate the mathematical expressions for the common quantifiers from the group Classify, Count and Correct, including median sweep. Moreover, we introduce the continuous sweep quantifier and show how it is related to the median sweep quantifier. In Section 3, we evaluate and compare the performance of median sweep and continuous sweep using data from the LeQua 2022 task [6]. In Section 4, we discuss our new continuous sweep quantifier and provide suggestions for future research.

2. Methods

In this section, we introduce the continuous sweep quantifier and explain how it differs from the median sweep quantifier. First, we introduce the notation and reiterate the definition of median sweep. Second, we present three theoretical difficulties in analyzing the median sweep quantifier and introduce the continuous sweep quantifier.

2.1. Notation and median sweep

Consider a population of observations where each observation consists of a feature vector 𝑥 ∈ 𝒳 = ℝ𝑝 and a class label 𝑦 ∈ 𝒴 = {+, −}. The feature vector 𝑥 consists of 𝑝 (numeric) covariate values. Denote a training set of size 𝑛train by 𝐷train, where the feature vectors are independent and identically distributed (i.i.d.) with density 𝑓train. Moreover, we denote a validation set of size 𝑛val by 𝐷val with corresponding density 𝑓val. Last, denote the test set of size 𝑛test by 𝐷test with density 𝑓test. Importantly, the class label 𝑦 is only observed in 𝐷train and 𝐷val.
The class label 𝑦 is unobserved in 𝐷test. The aim of quantification in a binary setting is to estimate the proportion of observations with a positive label in 𝐷test using the available data and machine learning techniques.

We denote the probability density functions of the feature vector for observations in the positive and negative class by 𝑓(+)(𝑥) and 𝑓(−)(𝑥), respectively. The probability density functions of the feature vector for the training, validation and test set are each a mixture of 𝑓(+)(𝑥) and 𝑓(−)(𝑥), but with different mixture parameters 𝛼train, 𝛼val and 𝛼test, respectively. So, we assume

$$f_{\text{train}}(x) = \alpha_{\text{train}} \cdot f^{(+)}(x) + (1 - \alpha_{\text{train}}) \cdot f^{(-)}(x),$$
$$f_{\text{val}}(x) = \alpha_{\text{val}} \cdot f^{(+)}(x) + (1 - \alpha_{\text{val}}) \cdot f^{(-)}(x),$$
$$f_{\text{test}}(x) = \alpha_{\text{test}} \cdot f^{(+)}(x) + (1 - \alpha_{\text{test}}) \cdot f^{(-)}(x).$$

In other words, we assume that the distributions of the positive class in the training, validation, and test set are identical (and we make the same assumption for the negative class), while the mixture parameters may differ across the data sets. The combination of these assumptions is referred to as prior-probability shift [13].

We consider a soft classifier 𝛿̂ that maps each feature vector 𝑥 to an estimate of 𝑃(𝑌 = + | 𝑋 = 𝑥). The soft classifier 𝛿̂ can be obtained from a machine learning algorithm that is trained on the training data 𝐷train. Then, we compute probability estimates 𝛿̂(𝑥) for all feature vectors in the validation set 𝐷val. Note that these values can only be interpreted as classification probabilities if the classifier is properly calibrated. Otherwise, we interpret these values as scores. With those scores, we can estimate the marginal densities of 𝛿̂(𝑥) for both classes. We define 𝑓̂(𝑖) as the estimated marginal probability density function of 𝛿̂(𝑥) given that 𝑦 = 𝑖. The true positive rate and false positive rate can be computed by integrating 𝑓(𝑖) from the threshold upwards. Hence, 𝐹(+)(𝜃) denotes the true positive rate and 𝐹(−)(𝜃) denotes the false positive rate.

Quantifiers of the type Classify, Count and Correct use a threshold to make an initial guess of the prevalence. The threshold is applied to the estimated score that an observation in 𝐷test has a positive label. Usually, classifiers use a threshold of 1/2 to classify an observation: observations with an estimated score larger than or equal to 1/2 are labeled as positive and observations with an estimated score smaller than 1/2 are labeled as negative. Other score values can also be chosen as the threshold. We denote the threshold value by 𝜃, where we assume 𝜃 ∈ [0, 1] for convenience. Then, observations with an estimated score larger than 𝜃 are labelled positive and observations with an estimated score smaller than 𝜃 are labelled negative. There are several ways to estimate the prevalence of 𝐷test using 𝐷train and 𝐷val, which we discuss in the next subsections.

Classify-and-count (𝛼̂CC) The most straightforward technique to estimate the prevalence 𝛼 is to count the number of observations in 𝐷test with a score larger than a certain threshold 𝜃 ∈ [0, 1] and divide it by the total number of observations in 𝐷test. This technique is known as the classify-and-count quantifier 𝛼̂CC. The classify-and-count quantifier is not a good quantifier for 𝛼, even when the underlying classifier performs well: good classification performance is not sufficient for reliable quantification [1].
The most common threshold for 𝜃 is 1/2, which makes sense for classification but is, in general, suboptimal for quantification. For a biased soft classifier 𝛿̂(𝑥), and/or when the prevalences differ across the training, validation and test set, a threshold of 𝜃 = 1/2 is suboptimal for quantification. Given the notation from the previous paragraphs, we define the classify-and-count quantifier as

$$\hat{\alpha}_{CC}(D_{\text{test}}, \theta) = \frac{1}{n_{\text{test}}} \sum_{x \in D_{\text{test}}} \mathbb{1}\{\hat{\delta}(x) \geq \theta\}. \tag{1}$$

In the next subsection, we use the classify-and-count quantifier to define the adjusted-count quantifier.

Adjusted count (𝛼̂AC) The adjusted-count quantifier 𝛼̂AC corrects the classify-and-count quantifier 𝛼̂CC using estimated classification rates. It uses the true positive rate and false positive rate of the classifier 𝛿̂(𝑥) ≥ 𝜃 to adjust the classify-and-count estimate. The two classification rates are estimated from the validation set: the classification rate of class 𝑖 is the proportion of observations in 𝐷val with label 𝑦 = 𝑖 whose score 𝛿̂(𝑥) is larger than 𝜃. The classification rates are thus defined as

$$\hat{F}^{(i)}(D_{\text{val}}, \theta) = \frac{\sum_{(x,y) \in D_{\text{val}} : y = i} \mathbb{1}\{\hat{\delta}(x) \geq \theta\}}{\sum_{(x,y) \in D_{\text{val}}} \mathbb{1}\{y = i\}}. \tag{2}$$

The adjusted-count quantifier is then derived as

$$\hat{\alpha}_{AC}(D_{\text{test}}, D_{\text{val}}, \theta) = \frac{\hat{\alpha}_{CC}(D_{\text{test}}, \theta) - \hat{F}^{(-)}(D_{\text{val}}, \theta)}{\hat{F}^{(+)}(D_{\text{val}}, \theta) - \hat{F}^{(-)}(D_{\text{val}}, \theta)}. \tag{3}$$

In contrast to classify-and-count, the adjusted-count quantifier has been proven to be asymptotically unbiased [8, 10, 12]. However, the adjusted-count quantifier does not produce reliable prevalence estimates for every threshold value 𝜃. If 𝜃 is such that the difference between the true positive rate 𝐹̂(+)(𝐷val, 𝜃) and the false positive rate 𝐹̂(−)(𝐷val, 𝜃) is small, then the denominator of Eq. (3) is small, which, in turn, leads to a large variance of the quantifier [8, 12].

Median sweep (𝛼̂MS) The median sweep quantifier applies the adjusted-count quantifier for a range of threshold values and takes the median of the resulting prevalence estimates as the final estimate [3]. As a remedy for the large variance of the adjusted-count quantifier, Forman advised computing the adjusted-count quantifier only for those threshold values 𝜃 for which the difference between 𝐹̂(+)(𝐷val, 𝜃) and 𝐹̂(−)(𝐷val, 𝜃) exceeds 1/4 [3]. In notation, the median sweep quantifier is

$$\hat{\alpha}_{MS}(D_{\text{test}}, D_{\text{val}}) = \operatorname{med}\left(\left\{\hat{\alpha}_{AC}(D_{\text{test}}, D_{\text{val}}, \theta) : \hat{F}^{(+)}(D_{\text{val}}, \theta) - \hat{F}^{(-)}(D_{\text{val}}, \theta) > \frac{1}{4}\right\}\right). \tag{4}$$

This can be simplified by only considering thresholds 𝜃 ∈ {𝛿̂(𝑥) : 𝑥 ∈ 𝐷test}, which yields a finite set of prevalence estimates from which the median is easily computed. We implement median sweep by fitting an empirical cumulative distribution function (ecdf) to the estimated probabilities/scores of the validation set 𝐷val, conditional on the labels; the classify-and-count quantifier 𝛼̂CC is computed analogously from the test data. Hence, 𝐹̂MS(+)(𝜃) denotes the true positive rate function and 𝐹̂MS(−)(𝜃) the false positive rate function under the median sweep paradigm.
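To make the definitions above concrete, the following minimal R sketch implements Eqs. (1)–(4) with empirical step functions. The input names scores_val, labels_val and scores_test are hypothetical placeholders for the validation scores, validation labels and test scores; note that 1 - ecdf(theta) uses a strict inequality, whereas Eqs. (1) and (2) use ≥, a difference that vanishes for continuous scores.

```r
# Minimal sketch of Eqs. (1)-(4) using empirical step functions.
# Assumed inputs: scores_val, labels_val (validation scores and labels),
# and scores_test (test scores); labels are coded "+" and "-".
cdf_pos  <- ecdf(scores_val[labels_val == "+"])  # P(score <= theta | y = +)
cdf_neg  <- ecdf(scores_val[labels_val == "-"])  # P(score <= theta | y = -)
cdf_test <- ecdf(scores_test)

F_pos    <- function(theta) 1 - cdf_pos(theta)   # true positive rate, Eq. (2)
F_neg    <- function(theta) 1 - cdf_neg(theta)   # false positive rate, Eq. (2)
alpha_cc <- function(theta) 1 - cdf_test(theta)  # classify-and-count, Eq. (1)

# adjusted count, Eq. (3)
alpha_ac <- function(theta) {
  (alpha_cc(theta) - F_neg(theta)) / (F_pos(theta) - F_neg(theta))
}

# median sweep, Eq. (4): thresholds at the observed test scores,
# keeping those where the rate difference exceeds 1/4
thetas   <- sort(unique(scores_test))
keep     <- (F_pos(thetas) - F_neg(thetas)) > 1/4
alpha_ms <- median(alpha_ac(thetas[keep]))
```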
2.2. Continuous sweep

In this section, we first explain why it is difficult to derive the mean squared error of the median sweep quantifier. Second, we introduce the continuous sweep quantifier, in two variants: the original continuous sweep quantifier and the simplified continuous sweep quantifier.

Difficulties of median sweep In Section 2.1, we explained how the median sweep quantifier works. The median sweep quantifier has a few properties that make it difficult to derive its mean squared error. We present the three most important ones.

First, the classify-and-count quantifier 𝛼̂CC and the classification rates 𝐹̂(−) and 𝐹̂(+) are step functions in 𝜃. Step functions are not differentiable and are therefore difficult to study analytically. Second, outliers are removed using a complicated data-dependent selection rule, see Eq. (4). The number of thresholds that pass this rule differs per test set, so the variance would have to be computed for every possible number of selected thresholds, which quickly becomes computationally demanding. Third, it is in general difficult to compute the mean and variance of a median, especially for complex algorithms and distributions. Even for well-behaved densities, computing the median analytically requires inverting the cumulative distribution function, which is often not available in closed form. In the next subsection, we propose solutions to these problems and introduce the continuous sweep quantifier.

Continuous sweep quantifier The continuous sweep quantifier is a smoothed adaptation of the median sweep quantifier that resolves the problems that hamper the theoretical analysis of median sweep.

Instead of step functions for the classify-and-count quantifier 𝛼̂CC and the classification rates 𝐹̂(−) and 𝐹̂(+), the continuous sweep quantifier uses continuous functions. If the type of distribution is known, the classify-and-count quantifier and the classification rates can be estimated parametrically with maximum likelihood estimation. If the type of distribution is unknown, kernel methods can be used to estimate the marginal densities. In this paper, we use kernel estimates to obtain continuous versions of the classify-and-count quantifier 𝛼̂CC and the classification rates 𝐹̂(−) and 𝐹̂(+). The classification rates 𝐹̂(−) and 𝐹̂(+) become kernel estimates of cumulative distribution functions given the soft classifier 𝛿̂(𝑥) and the validation data 𝐷val, and the classify-and-count quantifier 𝛼̂CC becomes a kernel estimate of a cumulative distribution function given the soft classifier 𝛿̂(𝑥) and the test data 𝐷test.

Figures 1a, 1b and 1c show some examples. The black dots in Figures 1a and 1b show the observations in 𝐷val from which we construct the empirical distribution functions of the true positive rate and the false positive rate. The red lines show the continuous versions of the classification rates using a kernel. The black dots in Figure 1c show the classify-and-count estimate for each observation in 𝐷test and the red line shows the continuous classify-and-count function for each threshold value 𝜃. Figure 1d shows two things: the continuous function of the adjusted-count quantifier constructed from the functions in Figures 1a, 1b, and 1c, and the prevalence estimates of each observation in 𝐷test that we need to compute median sweep. All continuous functions closely resemble their discrete equivalents.
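As an illustration of this smoothing step, the sketch below replaces the empirical step functions of the previous sketch by kernel CDF estimates, using the kcde function from the ks package as in Section 3.1; the input names are the same hypothetical placeholders as before.

```r
library(ks)

# Kernel CDF estimates on [0, 1], mirroring the smoothing described above.
kc_pos  <- kcde(scores_val[labels_val == "+"], xmin = 0, xmax = 1)
kc_neg  <- kcde(scores_val[labels_val == "-"], xmin = 0, xmax = 1)
kc_test <- kcde(scores_test, xmin = 0, xmax = 1)

# Smooth counterparts of F^(+), F^(-) and the classify-and-count curve.
F_pos_s    <- function(theta) 1 - predict(kc_pos,  x = theta)
F_neg_s    <- function(theta) 1 - predict(kc_neg,  x = theta)
alpha_cc_s <- function(theta) 1 - predict(kc_test, x = theta)

# Smooth adjusted-count curve, as plotted in Figure 1d.
alpha_ac_s <- function(theta) {
  (alpha_cc_s(theta) - F_neg_s(theta)) / (F_pos_s(theta) - F_neg_s(theta))
}
```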
With continuous sweep, we must still account for the fact that prevalence estimates at extreme values of 𝜃 have large variances. In median sweep, every prevalence estimate for which the difference between the classification rates is smaller than 1/4 is discarded. In order to keep the differences between continuous sweep and median sweep as small as possible, we apply the same rule to continuous sweep, in the form of decision boundaries. Consider two decision boundaries 𝜃𝑙 and 𝜃𝑟, where 𝜃𝑙 is the lower (left) threshold value at which 𝐹̂(+)(𝐷val, 𝜃) − 𝐹̂(−)(𝐷val, 𝜃) = 1/4 and 𝜃𝑟 is the upper (right) threshold value at which 𝐹̂(+)(𝐷val, 𝜃) − 𝐹̂(−)(𝐷val, 𝜃) = 1/4. In Figure 1d, the decision boundaries 𝜃𝑙 and 𝜃𝑟 are shown as vertical orange lines. We then integrate over the interval between 𝜃𝑙 and 𝜃𝑟, where 𝐹̂(+)(𝐷val, 𝜃) − 𝐹̂(−)(𝐷val, 𝜃) ≥ 1/4, and divide by the difference between 𝜃𝑙 and 𝜃𝑟. In Figure 1d, we see a slight difference between the decision boundaries of the continuous sweep quantifier and the decision rule of the median sweep quantifier. In this example, the median sweep quantifier admits observations with more extreme threshold values 𝜃 than the continuous sweep quantifier, as can be seen from the blue dots that lie outside the orange decision boundaries. This happens because the kernel estimates do not exactly match the discrete observations.

Using the estimated continuous distributions, we can evaluate the adjusted-count quantifier at any threshold. Hence, instead of computing the median of discrete data points, we propose to integrate over the admissible threshold range. Finding the median would be more complex, since it requires the quantile function of 𝛼̂AC: Figure 1d shows that the adjusted-count quantifier as a function of the threshold is not bijective, which makes it hard to find the inverse function needed to compute the median. Therefore, we propose to compute the (weighted) mean instead of the median. Even though the median is a more robust estimator, we expect the mean to give similar estimates because outliers are already discarded by the decision boundaries. The mean can be computed as an area under the curve, using integrals of the continuous functions.

In order to make the continuous sweep quantifier as similar as possible to the median sweep quantifier, we should weight regions containing many observations in 𝐷test more heavily than regions containing few observations. The probability density function $\hat{f}_{\hat{\delta}(x)}(\theta)$ of the observations' scores in 𝐷test defines the weights of the continuous sweep quantifier. In fact, this density is the negative of the derivative of the classify-and-count quantifier with respect to 𝜃; we have already computed the classify-and-count function and can use its derivative to obtain the weights. Taking the decision boundaries into account, the continuous sweep quantifier 𝛼̂CS is given by

$$\hat{\alpha}_{CS}(D_{\text{test}}, D_{\text{val}}, \theta_l, \theta_r) = \frac{1}{\hat{F}(\theta_r) - \hat{F}(\theta_l)} \int_{\theta_l}^{\theta_r} \hat{f}_{\hat{\delta}(x)}(\theta) \cdot \hat{\alpha}_{AC}(D_{\text{test}}, D_{\text{val}}, \theta) \, d\theta$$
$$= \frac{1}{\hat{\alpha}_{CC}(D_{\text{test}}, \theta_l) - \hat{\alpha}_{CC}(D_{\text{test}}, \theta_r)} \int_{\theta_l}^{\theta_r} -\left(\frac{d}{d\theta} \hat{\alpha}_{CC}(D_{\text{test}}, \theta)\right) \hat{\alpha}_{AC}(D_{\text{test}}, D_{\text{val}}, \theta) \, d\theta$$
$$= \frac{1}{\hat{\alpha}_{CC}(D_{\text{test}}, \theta_l) - \hat{\alpha}_{CC}(D_{\text{test}}, \theta_r)} \int_{\theta_l}^{\theta_r} -\left(\frac{d}{d\theta} \hat{\alpha}_{CC}(D_{\text{test}}, \theta)\right) \frac{\hat{\alpha}_{CC}(D_{\text{test}}, \theta) - \hat{F}^{(-)}(D_{\text{val}}, \theta)}{\hat{F}^{(+)}(D_{\text{val}}, \theta) - \hat{F}^{(-)}(D_{\text{val}}, \theta)} \, d\theta, \tag{5}$$

where $\hat{F}$ denotes the (kernel) cumulative distribution function of the scores in 𝐷test, so that $\hat{F}(\theta_r) - \hat{F}(\theta_l) = \hat{\alpha}_{CC}(D_{\text{test}}, \theta_l) - \hat{\alpha}_{CC}(D_{\text{test}}, \theta_r)$.
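Below is a minimal sketch of how Eq. (5) could be evaluated numerically, reusing the smooth curves from the previous sketches. It assumes that the estimated rate difference exceeds 1/4 somewhere in (0, 1), so that both decision boundaries exist; note also that each evaluation of the integrand involves several data-driven estimates, which foreshadows the numerical burden discussed next.

```r
# Decision boundaries: outermost solutions of F_pos_s - F_neg_s = 1/4.
gap     <- function(theta) F_pos_s(theta) - F_neg_s(theta)
peak    <- optimize(gap, c(0, 1), maximum = TRUE)$maximum
theta_l <- uniroot(function(t) gap(t) - 1/4, c(0, peak))$root
theta_r <- uniroot(function(t) gap(t) - 1/4, c(peak, 1))$root

# Weights: kernel density of the test scores, i.e. -d/dtheta of alpha_cc_s
# (fitted separately with kde, so only approximately the derivative of kcde).
kd_test <- kde(scores_test, xmin = 0, xmax = 1)
w       <- function(theta) predict(kd_test, x = theta)

# Eq. (5): weighted mean of the adjusted-count curve over [theta_l, theta_r].
num      <- integrate(function(t) w(t) * alpha_ac_s(t), theta_l, theta_r)$value
denom    <- alpha_cc_s(theta_l) - alpha_cc_s(theta_r)  # score mass in between
alpha_cs <- num / denom
```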
The integral in Eq. (5) is numerically tedious to evaluate because it contains many estimates from the data. In order to reduce the numerical complexity, we introduce the simplified continuous sweep quantifier 𝛼̂SCS, which omits the density $\hat{f}_{\hat{\delta}(x)}(\theta)$ from the integral. The interpretation of leaving out this density is that we no longer weight regions with many observations in 𝐷test more heavily than regions with few observations. We believe that the impact of this omission on the theoretical properties of the quantifier is limited. We include a brief explanation, as an elaborate theoretical analysis is beyond the scope of this paper. First, we note that the adjusted-count estimator is asymptotically unbiased for every threshold value 𝜃 [8, 10, 12]. Hence, the continuous sweep quantifier can be interpreted as a weighted average of asymptotically unbiased estimators, and the simplified continuous sweep quantifier as an unweighted average of asymptotically unbiased estimators. Both quantifiers are therefore asymptotically unbiased; the difference between the two lies in the asymptotic variance. A more detailed theoretical comparison between median sweep, continuous sweep, and simplified continuous sweep will be included in a future paper. The key take-home message is that the simplified continuous sweep quantifier is theoretically similar to the continuous sweep quantifier and has more appealing numerical properties. The simplified continuous sweep quantifier 𝛼̂SCS can be computed as

$$\hat{\alpha}_{SCS}(D_{\text{test}}, D_{\text{val}}, \theta_l, \theta_r) = \frac{1}{\theta_r - \theta_l} \int_{\theta_l}^{\theta_r} \hat{\alpha}_{AC}(D_{\text{test}}, D_{\text{val}}, \theta) \, d\theta$$
$$= \frac{1}{\theta_r - \theta_l} \int_{\theta_l}^{\theta_r} \frac{\hat{\alpha}_{CC}(D_{\text{test}}, \theta) - \hat{F}^{(-)}(D_{\text{val}}, \theta)}{\hat{F}^{(+)}(D_{\text{val}}, \theta) - \hat{F}^{(-)}(D_{\text{val}}, \theta)} \, d\theta. \tag{6}$$
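Under the same assumptions as the previous sketch, Eq. (6) reduces to a single unweighted integral:

```r
# Eq. (6): unweighted mean of the smooth adjusted-count curve.
alpha_scs <- integrate(alpha_ac_s, theta_l, theta_r)$value / (theta_r - theta_l)
```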
Concluding, the continuous sweep quantifiers are continuous adaptations of median sweep that make it easier to compute theoretical results. In the next section, we compare the continuous sweep quantifiers with the median sweep quantifier on the data provided by the LeQua 2022 task.

[Figure 1 panels: (a) true positive rate, (b) false positive rate, (c) classify-and-count, (d) adjusted-count; horizontal axes show the threshold value 𝜃.]

Figure 1: This figure shows the strong numerical similarity between median sweep as in Eq. (4) and our continuous sweep method as in Eqs. (5) and (6). In subfigures (a)–(c), the red curves are the continuous versions of the discrete median sweep estimates (black dots). In subfigure (d), the black line shows the estimated adjusted-count value for every threshold value 𝜃 using the curves from subfigures (a)–(c). The vertical orange lines show the decision boundaries 𝜃𝑙 and 𝜃𝑟. The blue dots show the adjusted-count estimates from median sweep that pass the criterion that the difference between the true positive rate and the false positive rate is larger than 1/4; the red dots are the estimates that fail the criterion. The median sweep quantifier is computed by taking the median of the blue dots in subfigure (d). The simplified continuous sweep quantifier is computed by integrating the area between the decision boundaries in subfigure (d) and dividing it by the distance between the decision boundaries. The original continuous sweep quantifier is computed by integrating the weighted area between the decision boundaries in subfigure (d) and dividing it by the weighted distance between the decision boundaries, with weights based on the classify-and-count quantifier.

3. Evaluation

In this section, we evaluate the continuous sweep quantifiers and the median sweep quantifier. In short, the objective is to quantify the prevalence 𝛼 of positive product reviews (from a webshop) as accurately as possible across 5,000 test sets. For more information on the quantification task, we refer to the LeQua 2022 overview paper [6]. First, we explain the technical details of our study. Second, we show the results of the quantifiers on the test sets. Third, we discuss the similarities and differences between the continuous sweep quantifiers and the median sweep quantifier on this quantification task.

3.1. Technical setup

The analysis is performed using the statistical software R, version 4.1.3 [14]. Besides the core packages, we used tidyverse and tidymodels [15, 16]. The training data consist of 5,000 observations, each with 300 covariates and a label indicating whether the review is positive or negative. The training set is imbalanced: 3,870 reviews are positive and 1,130 reviews are negative. We randomly split this dataset into two parts: a training set 𝐷train containing 4,000 observations and a validation set 𝐷val containing 1,000 observations. The training data 𝐷train were balanced, meaning that some of the negatively labelled observations were replicated to match the number of positively labelled observations.

Our classification model, denoted by 𝛿̂, was a support vector machine (SVM) [17] with a linear kernel and regularisation parameter 𝐶 = 1, trained on 𝐷train. We converted the decision values of the SVM to probabilities/scores using Platt scaling [18], such that we could use the theory of the previous section.

We computed the classify-and-count estimator 𝛼̂CC and the classification rates 𝐹̂MS(+) and 𝐹̂MS(−) for the median sweep quantifier using the ecdf function, which fits an empirical step function to the input data. We computed the classify-and-count estimator 𝛼̂CC and the classification rates 𝐹̂CS(+) and 𝐹̂CS(−) for the continuous sweep quantifiers using the kcde function from the ks package [19]. Moreover, we computed $\hat{f}_{\hat{\delta}(x)}(\theta)$ using the kde function from the same ks package. We passed no additional arguments to either function, except the boundaries of the estimated probabilities, which were set to 0 and 1.
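For concreteness, the sketch below shows one way to obtain such Platt-scaled scores using kernlab directly [18]. We used the tidymodels interface in our actual analysis, so details such as preprocessing and the object names train_x, train_y, val_x and test_x are illustrative assumptions rather than our exact code.

```r
library(kernlab)

# Linear-kernel SVM with C = 1; prob.model = TRUE applies Platt scaling.
svm_fit <- ksvm(x = as.matrix(train_x), y = factor(train_y),
                kernel = "vanilladot", C = 1, prob.model = TRUE)

# Scores: estimated probability of the positive class for each observation
# (assumes the positive class level is coded "+").
scores_val  <- predict(svm_fit, as.matrix(val_x),  type = "probabilities")[, "+"]
scores_test <- predict(svm_fit, as.matrix(test_x), type = "probabilities")[, "+"]
```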
3.2. Results

In this section, we evaluate the median sweep quantifier and the continuous sweep quantifiers on the test sets of the LeQua 2022 task. First, we compare the quantifiers with the true prevalences. Second, we compare the median sweep quantifier with the continuous sweep quantifiers.

First, we evaluate the median sweep quantifier on the test sets. Figures 2a and 2b plot the estimated prevalence by the median sweep quantifier against the true prevalence and against the residuals, respectively. As expected, the error for very small estimated prevalences is positive and the error for very large estimated prevalences is negative. Moreover, there seems to be a small positive bias among the estimated prevalences.

Table 1: Summary statistics of the median sweep and continuous sweep quantifiers on the test sets.

Quantifier                     Bias       Variance   MAE
Continuous sweep               0.02565    0.00302    0.0473
Simplified continuous sweep   -0.00916    0.00151    0.0317
Median sweep                   0.00650    0.00129    0.0289

Second, we evaluate the continuous sweep quantifiers on the test sets. Figures 2c and 2e plot the estimated prevalences of the continuous sweep quantifiers against the true prevalence; Figures 2d and 2f plot them against the residuals. The two continuous sweep quantifiers behave differently. The original continuous sweep quantifier performs worse than the simplified continuous sweep quantifier: it has a large bias for large prevalence values and a larger variance. Since the simplified continuous sweep quantifier clearly performs better than the original continuous sweep quantifier, we will from here on only compare the simplified continuous sweep quantifier with the median sweep quantifier.

Comparing the median sweep quantifier with the simplified continuous sweep quantifier, we see both similarities and differences. Both quantifiers have only a small bias across the range of prevalences; however, the direction and pattern of the bias differ. The bias of the median sweep quantifier is monotonically increasing (Figure 2b), while the bias of the simplified continuous sweep quantifier seems to have a local minimum and a local maximum (Figure 2f). The variance of the simplified continuous sweep quantifier is slightly larger than the variance of the median sweep quantifier (see Table 1); hence, the mean absolute error (MAE) of the simplified continuous sweep quantifier is also slightly larger.

A reason for the larger variance of the simplified continuous sweep quantifier could be that the mean is more sensitive to extreme values than the median. Figure 3 shows nine examples of the adjusted-count integral and the median sweep estimates. Remarkably, the continuous sweep function is close to the discrete estimates over the whole range of 𝜃, except around 𝜃𝑟. This discrepancy is a possible cause of the small difference between the simplified continuous sweep quantifier and the median sweep quantifier.

Concluding, the simplified continuous sweep quantifier performs slightly worse than the median sweep quantifier under the procedure described in this section, while the original continuous sweep quantifier performs much worse than the other two. The results for the simplified continuous sweep quantifier and the median sweep quantifier are similar, and we believe that the (simplified) continuous sweep quantifier can be used to compute theoretical results that are relevant to the median sweep quantifier.
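The summary statistics in Table 1 can be reproduced along the following lines, assuming hypothetical vectors alpha_hat (one estimate per test set) and alpha_true; reading the Variance column as the variance of the errors is our interpretation.

```r
# Bias, variance, and mean absolute error over the 5,000 test sets.
errors <- alpha_hat - alpha_true
c(bias     = mean(errors),
  variance = var(errors),   # one plausible reading of Table 1's "Variance"
  mae      = mean(abs(errors)))
```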
[Figure 2 panels: (a) median sweep against true prevalences, (b) fitted residuals of median sweep, (c) continuous sweep against true prevalences, (d) fitted residuals of continuous sweep, (e) simplified continuous sweep against true prevalences, (f) fitted residuals of simplified continuous sweep; axes show estimated prevalence, true prevalence, and difference from true prevalence.]

Figure 2: Quantifiers against the true prevalence among 5,000 test sets. The red lines mark where the estimated prevalence equals the true prevalence. The blue lines show a fitted GAM model representing the bias across the prevalences.

[Figure 3: a 3×3 grid of panels; axes show the threshold value and the estimated prevalence.]

Figure 3: Nine examples of the adjusted-count integral. The black line denotes the estimated adjusted-count quantifier at threshold 𝜃 for a development set. The orange vertical lines are the two decision boundaries 𝜃𝑙 and 𝜃𝑟, and the grey horizontal lines denote the prevalence of each development set. The blue dots show the adjusted-count estimates from median sweep that pass the criterion that the difference between the true positive rate and the false positive rate is larger than 1/4; the red dots are the estimates that fail the criterion.

4. Conclusion and Discussion

The goal of this paper was to design the continuous sweep quantifier, study its empirical performance, and specify a research agenda for the theoretical analysis of this new quantifier.

In this paper, we constructed the continuous sweep quantifier in two versions: the original continuous sweep quantifier, in which every threshold is weighted using the classify-and-count quantifier, and the simplified continuous sweep quantifier without weights. The continuous sweep quantifiers are based on the well-known median sweep quantifier. Previous research has shown that median sweep is a good quantifier, but it is not well understood why it performs well, because its theoretical properties are hard to derive. The median sweep quantifier uses empirical distributions for the classify-and-count quantifier 𝛼̂CC and the classification rates 𝐹̂(+) and 𝐹̂(−), which makes operations such as differentiation and integration difficult. Moreover, median sweep uses discrete decision rules to remove outliers, which complicates the calculations further. Last, the median is hard to compute analytically, since the function of prevalence estimates against the threshold 𝜃 is not bijective. Therefore, we proposed a new quantifier named the continuous sweep.
The continuous sweep quantifier is a modification of the median sweep quantifier that enables computing theoretical results. The continuous sweep quantifier 1) uses kernel estimates instead of the empirical distribution, 2) constructs decision boundaries instead of applying discrete decision rules, and 3) uses the mean instead of the median. Figure 1 showed that the continuous functions are closely related to the empirical functions. The simplified continuous sweep quantifier performed similarly to the median sweep quantifier in terms of bias and variance, while the original continuous sweep quantifier performed much worse. Both continuous sweep quantifiers can be further optimized by choosing better kernels and other hyper-parameters.

The theoretical agenda consists of two parts: defining the assumptions on the continuous distributions, and computing the theoretical results. First, we make assumptions about the continuous distributions. In this paper, the continuous distributions are kernel estimates with default parameters, estimated from the training and validation data. Deriving theoretical results for such kernel estimates is still a cumbersome task. Therefore, we will make parametric assumptions on the distributions instead, starting with basic distributions such as the uniform and later extending to more complex distributions such as the beta.

Second, we discuss how to compute the theoretical results. In the first step, we assume that the classification rates follow a uniform distribution with given limits. Then, we can compute the expected value of the classify-and-count quantifier for each prevalence 𝛼 and each threshold value 𝜃. Adding the information on the distributions of the classification rates, we can compute the expected value of the adjusted-count quantifier using the results of [8], and integrate over the whole range of 𝜃 to compute the expected value of the continuous sweep quantifier. We can apply a similar strategy for the variance. Combining the expected value and the variance yields the mean squared error of the continuous sweep quantifier, which we can then compare with the mean squared error of other quantifiers such as the adjusted count, calibration, or mixed quantifier [8, 9, 10].
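As a minimal illustration of this first step, consider the expected value of the classify-and-count quantifier under prior-probability shift; the derivation below is a sketch in our notation (cf. [8]) that ignores the estimation error in the classification rates. Writing 𝛼 for the test prevalence,

$$\mathbb{E}\left[\hat{\alpha}_{CC}(D_{\text{test}}, \theta)\right] = \alpha \cdot F^{(+)}(\theta) + (1 - \alpha) \cdot F^{(-)}(\theta).$$

Substituting this expression and the population classification rates into Eq. (3) gives

$$\mathbb{E}\left[\hat{\alpha}_{AC}(\theta)\right] \approx \frac{\alpha F^{(+)}(\theta) + (1 - \alpha) F^{(-)}(\theta) - F^{(-)}(\theta)}{F^{(+)}(\theta) - F^{(-)}(\theta)} = \alpha,$$

which recovers the asymptotic unbiasedness used above. The agenda is to make this argument exact, including the variance, under specific distributional assumptions on $F^{(+)}$ and $F^{(-)}$.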
After computing theoretical results for the continuous sweep quantifier, we can further improve the quantifier itself. The continuous sweep quantifier has been constructed primarily to obtain theoretical results for the median sweep quantifier; with innovative techniques regarding kernel estimates and the handling of large variances, its predictive performance can also be improved. In conclusion, the continuous sweep quantifier can be used to understand median sweep more thoroughly. It enables us to compute theoretical results for bias and variance in future papers.

References

[1] P. González, A. Castaño, N. V. Chawla, J. del Coz, A review on quantification learning, ACM Computing Surveys 50 (2017) 74:1–74:40.
[2] G. Forman, Counting positives accurately despite inaccurate classification, in: Machine Learning: ECML 2005, Lecture Notes in Computer Science, vol. 3720, Springer, Berlin, Heidelberg, 2005, pp. 564–575. doi:10.1007/11564096_55.
[3] G. Forman, Quantifying counts and costs via classification, Data Mining and Knowledge Discovery 17 (2008) 164–206. doi:10.1007/s10618-008-0097-y.
[4] L. Milli, A. Monreale, G. Rossetti, F. Giannotti, D. Pedreschi, F. Sebastiani, Quantification trees, in: IEEE International Conference on Data Mining, IEEE, 2013, pp. 528–536.
[5] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[6] A. Esuli, A. Moreo, F. Sebastiani, LeQua@CLEF2022: Learning to quantify, 2021. URL: https://arxiv.org/abs/2111.11249. doi:10.48550/ARXIV.2111.11249.
[7] T. Schumacher, M. Strohmaier, F. Lemmerich, A comparative evaluation of quantification methods, arXiv:2103.03223 [cs] (2021). URL: http://arxiv.org/abs/2103.03223.
[8] K. Kloos, Q. Meertens, S. Scholtus, J. Karch, Comparing correction methods to reduce misclassification bias, Springer International Publishing, Cham, 2021, pp. 64–90.
[9] K. Kloos, A new generic method to improve machine learning applications in official statistics, Statistical Journal of the IAOS 37 (2021) 1181–1196. doi:10.3233/sji-210885.
[10] Q. A. Meertens, C. G. H. Diks, H. J. van den Herik, F. W. Takes, Understanding the output quality of official statistics that are based on machine learning algorithms, 2021.
[11] D. Tasche, Fisher consistency for prior probability shift, Journal of Machine Learning Research 18 (2017) 3338–3369.
[12] D. Tasche, Minimising quantifier variance under prior probability shift, 2021. URL: https://arxiv.org/abs/2107.08209. doi:10.48550/ARXIV.2107.08209.
[13] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, F. Herrera, A unifying view on dataset shift in classification, Pattern Recognition 45 (2012) 521–530.
[14] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2021. URL: https://www.R-project.org/.
[15] H. Wickham, M. Averick, J. Bryan, W. Chang, L. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. Pedersen, E. Miller, S. Bache, K. Müller, J. Ooms, D. Robinson, D. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, H. Yutani, Welcome to the tidyverse, Journal of Open Source Software 4 (2019) 1686. doi:10.21105/joss.01686.
[16] M. Kuhn, H. Wickham, Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles, 2020. URL: https://www.tidymodels.org.
[17] J. H. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, Springer, New York, 2001. doi:10.1007/978-0-387-84858-7.
[18] A. Karatzoglou, A. Smola, K. Hornik, A. Zeileis, kernlab – an S4 package for kernel methods in R, Journal of Statistical Software 11 (2004). URL: http://www.jstatsoft.org/v11/i09/.
[19] T. Duong, ks: Kernel Smoothing, 2022. URL: https://CRAN.R-project.org/package=ks. R package version 1.13.4.