<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explicit Loss Minimization in Quantification Applications (Preliminary Draft)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <email>andrea.esuli@isti.cnr.it</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Qatar Computing Research Institute Qatar Foundation</institution>
          <addr-line>PO Box 5825, Doha</addr-line>
          <country country="QA">Qatar</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years there has been a growing interest in quantification, a variant of classification in which the final goal is not accurately classifying each unlabelled document but accurately estimating the prevalence (or “relative frequency”) of each class c in the unlabelled set. Quantification has several applications in information retrieval, data mining, machine learning, and natural language processing, and is a dominant concern in fields such as market research, epidemiology, and the social sciences. This paper describes recent research in addressing quantification via explicit loss minimization, discussing works that have adopted this approach and some open questions that they raise.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Quantification is a variant of classification in which the goal is not accurately
classifying each unlabelled object but accurately estimating the percentages of
unlabelled objects that belong to each class c ∈ C.
Quantification is usually tackled via supervised learning; it has several applications
in information retrieval [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], data mining [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11–13</xref>
        ], machine learning [
        <xref ref-type="bibr" rid="ref1 ref20">1, 20</xref>
        ], and
natural language processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and is a dominant concern in fields such as
market research [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], epidemiology [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and the social / political sciences [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Classification comes in many variants, including binary classification (where
|C| = 2 and exactly one class per item is assigned), single-label multi-class
(SLMC) classification (where |C| &gt; 2 and exactly one class per item is assigned),
and multi-label multi-class (MLMC) classification (where |C| ≥ 2 and zero, one
or several classes per item may be assigned). To each such classification task
there corresponds a quantification task, which is concerned with evaluating
at the aggregate level (i.e., in terms of class prevalence) the results of the
corresponding classification task. In this paper we will mostly be concerned with
binary quantification, although we will also occasionally hint at how and whether
the solutions we discuss extend to the SLMC and MLMC cases.</p>
      <p>Quantification is not a mere byproduct of classification, and is tackled as a
task on its own. The reason is that the naive quantification approach consisting
of (a) classifying all the test items and (b) counting the number of items assigned
to each class (the “classify and count” method – CC) is suboptimal. In fact, a
classifier may have good classification accuracy but bad quantification accuracy;
for instance, if the binary classifier generates many more false negatives (FN)
than false positives (FP), the prevalence of the positive class will be severely
underestimated.</p>
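      <p>To see this numerically, consider a made-up contingency table (the counts below
are purely illustrative): a classifier can be 93% accurate and still report only half
the true prevalence when false negatives outnumber false positives.</p>
      <preformat>
```python
# Hypothetical contingency table over 1000 test items, 100 of them truly positive.
tp, fn = 40, 60       # the classifier misses 60 of the 100 positives
fp, tn = 10, 890      # but produces only 10 false positives

accuracy = (tp + tn) / (tp + fn + fp + tn)    # 0.93: decent classification accuracy
true_prev = (tp + fn) / (tp + fn + fp + tn)   # 0.10: true prevalence of the positives
cc_prev = (tp + fp) / (tp + fn + fp + tn)     # 0.05: "classify and count" estimate

# Since FN greatly exceeds FP, CC underestimates the prevalence by half.
```
      </preformat>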
      <p>
        As a result, several quantification methods that deviate from mere “classify
and count” have been proposed. Most such methods fall in two classes. In the
first approach a generic classifier is trained and applied to the test data, and
the computed prevalences are then corrected according to the estimated bias
of the classifier, which is estimated via k-fold cross-validation on the training
set; “adjusted classify and count” (ACC) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], “probabilistic classify and count”
(PCC) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and “adjusted probabilistic classify and count” (PACC) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] fall in
this category. In the second approach, a “classify and count” method is used on
a classifier in which the acceptance threshold has been tuned so as to deliver
a different proportion of predicted positives and predicted negatives; example
methods falling in this category are the “Threshold@50” (T50), “MAX”, “X”,
and “median sweep” (MS) methods proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. See also [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for a more
detailed explanation of all these methods.
      </p>
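      <p>For concreteness, the correction behind ACC can be sketched as follows. This is
the standard adjustment formula from the quantification literature (the function name
and the clipping policy are ours), with tpr and fpr assumed to have been estimated
via k-fold cross-validation on the training set.</p>
      <preformat>
```python
def acc_correct(cc_prev, tpr, fpr):
    """Adjusted Classify and Count: correct the raw "classify and count"
    prevalence estimate using the classifier's estimated true positive
    rate (tpr) and false positive rate (fpr)."""
    if tpr == fpr:                        # correction undefined; fall back to CC
        return cc_prev
    adjusted = (cc_prev - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, adjusted))   # clip to the [0, 1] range
```
      </preformat>
      <p>For instance, a classifier with tpr = 0.4 and fpr = 10/900 that yields a raw CC
estimate of 0.05 is corrected back to a prevalence of 0.10.</p>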
      <p>
        In this paper we review an emerging class of methods, based on explicit loss
minimization (ELM). Essentially, their underlying idea is to use (unlike the
first approach mentioned above) simple “classify and count” without (unlike the
second approach mentioned above) any heuristic threshold tuning, but using a
classifier trained via a learning method explicitly optimized for quantification
accuracy. This idea was first proposed, but not implemented, in a position paper
by Esuli and Sebastiani [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and was taken up by three very recent works [
        <xref ref-type="bibr" rid="ref19 ref2 ref9">2, 9,
19</xref>
        ] that we will discuss here.
      </p>
      <p>The rest of this paper is organized as follows. Section 2 discusses evaluation
measures for quantification used in the literature. Section 3 discusses why
approaching quantification via ELM is impossible with standard learning
algorithms, and presents three ELM approaches to quantification that make
use of non-standard learning algorithms. Section 4 discusses experimental results,
while Section 5 concludes by discussing questions that existing research has left
open.</p>
    </sec>
    <sec id="sec-2">
      <title>Loss Measures for Evaluating Quantification Error</title>
      <p>ELM requires the loss measure used for evaluating prediction error to be directly
minimized within the learning process. Let us thus look at the measures which
are currently being used for evaluating SLMC quantification error. Note that
a measure for SLMC quantification is also a measure for binary quantification,
since the latter task is a special case of the former. Note also that a measure for
binary quantification is also a measure for MLMC quantification, since the latter
task can be solved by separately solving |C| instances of the former task, one for
each c ∈ C.</p>
      <p>Notation-wise, by Λ(pˆ, p, S, C) we will indicate a quantification loss, i.e., a
measure Λ of the error made in estimating a distribution p defined on set S and
classes C by another distribution pˆ; we will often simply write Λ(pˆ, p) when S
and C are clear from the context.</p>
      <p>The simplest measure for SLMC quantification is absolute error (AE), which
corresponds to the sum (across the classes in C) of the absolute differences
between the predicted class prevalences and the true class prevalences; i.e.,

AE(p̂, p) = Σ_{c_j ∈ C} |p̂(c_j) − p(c_j)|   (1)

AE ranges between 0 (best) and 2(1 − min_{c_j ∈ C} p(c_j)) (worst); a normalized
version of AE that always ranges between 0 (best) and 1 (worst) can thus be
obtained as

NAE(p̂, p) = Σ_{c_j ∈ C} |p̂(c_j) − p(c_j)| / (2(1 − min_{c_j ∈ C} p(c_j)))   (2)

The main advantage of AE and NAE is that they are intuitive, and easy for
non-initiates to understand too.</p>
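      <p>In code, AE and NAE (Equations 1 and 2) amount to the following direct
transcription, with prevalence distributions represented as Python dicts over the
classes (the function names are ours):</p>
      <preformat>
```python
def absolute_error(p_hat, p):
    """AE: sum over the classes of |predicted - true| prevalence (Equation 1)."""
    return sum(abs(p_hat[c] - p[c]) for c in p)

def normalized_absolute_error(p_hat, p):
    """NAE: AE rescaled to [0, 1] by its maximum value 2(1 - min_c p(c))."""
    return absolute_error(p_hat, p) / (2 * (1 - min(p.values())))
```
      </preformat>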
      <p>However, AE and NAE do not address the fact that the same absolute
difference between predicted class prevalence and true class prevalence should
count as a more serious mistake when the true class prevalence is small. For
instance, predicting p̂(c) = 0.10 when p(c) = 0.01 and predicting p̂(c) = 0.50 when
p(c) = 0.41 are equivalent errors according to AE, but the former is intuitively a
more serious error than the latter. Relative absolute error (RAE) addresses this
problem by relativizing the value |p̂(c_j) − p(c_j)| in Equation 1 to the true class
prevalence, i.e.,

RAE(p̂, p) = Σ_{c_j ∈ C} |p̂(c_j) − p(c_j)| / p(c_j)   (3)

RAE may be undefined in some cases, due to the presence of zero denominators.
To solve this problem, in computing RAE we can smooth both p(c_j) and p̂(c_j)
via additive smoothing, i.e.,

p_s(c_j) = (p(c_j) + ε) / ((Σ_{c_j ∈ C} p(c_j)) + ε|C|)   (4)

where p_s(c_j) denotes the smoothed version of p(c_j) and the denominator is just
a normalizing factor (same for the p̂_s(c_j)’s); the quantity ε = 1/(2|S|) is often used
as a smoothing factor. The smoothed versions of p(c_j) and p̂(c_j) are then used in
place of their original versions in Equation 3; as a result, RAE is always defined
and still returns a value of 0 when p and p̂ coincide.</p>
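      <p>A sketch of smoothed RAE (Equations 3 and 4) in the same style; since
ε = 1/(2|S|), the size of the sample S must be passed in:</p>
      <preformat>
```python
def smoothed(p, eps):
    """Additive smoothing of a prevalence distribution (Equation 4)."""
    z = sum(p.values()) + eps * len(p)            # normalizing factor
    return {c: (v + eps) / z for c, v in p.items()}

def relative_absolute_error(p_hat, p, sample_size):
    """RAE (Equation 3) on smoothed prevalences, so it is always defined."""
    eps = 1.0 / (2 * sample_size)                 # smoothing factor of Section 2
    ps, ps_hat = smoothed(p, eps), smoothed(p_hat, eps)
    return sum(abs(ps_hat[c] - ps[c]) / ps[c] for c in ps)
```
      </preformat>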
      <p>RAE ranges between 0 (best) and ((1 − min_{c_j ∈ C} p(c_j)) / min_{c_j ∈ C} p(c_j)) + |C| − 1
(worst); a normalized version of RAE that always ranges between 0 (best) and 1
(worst) can thus be obtained as

NRAE(p̂, p) = RAE(p̂, p) / (((1 − min_{c_j ∈ C} p(c_j)) / min_{c_j ∈ C} p(c_j)) + |C| − 1)   (5)
      </p>
      <p>
        A third measure, and the one that has become somehow standard in the
evaluation of SLMC quantification, is normalized cross-entropy, better known as
Kullback-Leibler Divergence (KLD – see e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). KLD was proposed as a SLMC
quantification measure in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and is defined as

KLD(p̂, p) = Σ_{c_j ∈ C} p(c_j) log(p(c_j) / p̂(c_j))   (6)

KLD is a measure of the error made in estimating a true distribution p over a
set C of classes by means of a predicted distribution p̂. KLD is thus suitable for
evaluating quantification, since quantifying exactly means predicting how the
items in set S are distributed across the classes in C.
      </p>
      <p>KLD ranges between 0 (best) and +∞ (worst). Note that, unlike AE and
RAE, the upper bound of KLD is not finite since Equation 6 has predicted
probabilities, and not true probabilities, at the denominator: that is, by making a
predicted probability p̂(c_j) infinitely small we can make KLD be infinitely large.
A normalized version of KLD yielding values between 0 (best) and 1 (worst)
may be defined by applying a logistic function, e.g.,

NKLD(p̂, p) = (e^{KLD(p̂, p)} − 1) / e^{KLD(p̂, p)}   (7)

Also KLD may be undefined in some cases. While the case in which p(c_j) = 0
is not problematic (since continuity arguments indicate that 0 log(0/a) should be
taken to be 0 for any a ≥ 0), the case in which p̂(c_j) = 0 and p(c_j) &gt; 0 is indeed
problematic, since a log(a/0) is undefined for a &gt; 0. To solve this problem, also in
computing KLD we use the smoothed probabilities of Equation 4; as a result,
KLD is always defined and still returns a value of zero when p and p̂ coincide.</p>
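      <p>Smoothed KLD and its logistic normalization NKLD (Equations 6 and 7) can be
transcribed analogously (again with distributions as dicts and ε = 1/(2|S|)):</p>
      <preformat>
```python
import math

def kld(p_hat, p, sample_size):
    """KLD of the estimate p_hat from the true p (Equation 6), computed on
    the smoothed prevalences of Equation 4 so that it is always defined."""
    eps = 1.0 / (2 * sample_size)
    z = 1.0 + eps * len(p)          # both distributions sum to 1 before smoothing
    return sum(((p[c] + eps) / z) * math.log(((p[c] + eps) / z) /
                                             ((p_hat[c] + eps) / z)) for c in p)

def nkld(p_hat, p, sample_size):
    """Logistic normalization of KLD onto [0, 1) (Equation 7)."""
    k = kld(p_hat, p, sample_size)
    return (math.exp(k) - 1) / math.exp(k)
```
      </preformat>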
      <p>
        The main advantage of KLD is that it is a very well-known measure, having
been the subject of intense study within information theory [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and, although from
a more applicative angle, within the language modelling approach to information
retrieval [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Its main disadvantage is that it is less easy for non-initiates to
understand than AE or RAE.
      </p>
      <p>Overall, while no measure is advantageous under all respects, KLD (or
NKLD) wins over the other measures on several accounts; as a consequence, it
has emerged as the de facto standard in the SLMC quantification literature. We
will hereafter consider it as such.</p>
    </sec>
    <sec id="sec-3">
      <title>Quantification Methods Based on Explicit Loss Minimization</title>
      <p>A problem with the quantification methods hinted at in Section 1 is that most
of them are fairly heuristic in nature. A further problem is that some of these
methods rest on assumptions that seem problematic. For instance, one problem
with the ACC method is that it seems to implicitly rely on the hypothesis that
estimating the bias of the classifier via k-fold cross-validation on Tr can be
done reliably. However, since the very motivation of doing quantification is that
the training set and the test set may have quite different characteristics, this
hypothesis seems adventurous. In sum, the very same arguments that are used
to deem the CC method unsuitable for quantification seem to undermine the
previously mentioned attempts at improving on CC.</p>
      <p>Note that all of the methods discussed in Section 1 employ general-purpose
supervised learning methods, i.e., address quantification by leveraging a
classifier trained via a general-purpose learning method. In particular, most of the
supervised learning methods adopted in the literature on quantification optimize
zero-one loss or variants thereof, and not a quantification-specific evaluation
function. When the dataset is imbalanced (typically: when the positives are by
far outnumbered by the negatives), as is frequently the case in text classification,
this is suboptimal, since a supervised learning method that minimizes zero-one
loss will generate classifiers with a tendency to make negative predictions. This
means that FN will be much higher than FP, to the detriment of quantification
accuracy.</p>
      <p>
        In this paper we look at new, theoretically better-founded quantification
methods based upon the use of classifiers explicitly optimized for the evaluation
function used for assessing quantification accuracy. (To witness the imbalance
problem just discussed, in the experiments reported in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the 5148 test sets exhibit, when
classified by the classifiers generated by the linear SVM used for implementing the
CC method, an average FP/FN ratio of 0.109; by contrast, for an optimal quantifier
this ratio is always 1.) The idea of using learning
algorithms capable of directly optimizing the measure (a.k.a. “loss”) used for
evaluating effectiveness is well-established in supervised learning. However, in
our case following this route is non-trivial; let us see why.
      </p>
      <p>As usual, we assume that our training data Tr = {(x_1, y_1), ..., (x_{|Tr|}, y_{|Tr|})}
and our test data Te = {(x′_1, y′_1), ..., (x′_{|Te|}, y′_{|Te|})} are independently generated
from an unknown joint distribution P(X, Y), where X and Y are the input and
output spaces, respectively. In this paper we will assume Y to be {−1, +1}.</p>
      <p>Our task is to learn from Tr a hypothesis h ∈ H (where h : X → Y and H
is the hypothesis space) that minimizes the expected risk R^Λ(h) on sets Te of
previously unseen inputs. Here Λ is our chosen loss measure; note that it is a
set-based (rather than an instance-based) measure, i.e., it measures the error
incurred by making an entire set of predictions, rather than (as instance-based
measures λ do) the error incurred by making a single prediction. Our task consists
thus of finding

arg min_{h ∈ H} R^Λ(h) = arg min_{h ∈ H} ∫ Λ((h(x′_1), y′_1), ..., (h(x′_{|Te|}), y′_{|Te|})) dP(Te)   (8)

If the loss function Λ over sets Te can be linearly decomposed into the sum of
the individual losses λ generated by the members of Te, i.e., if

Λ((h(x′_1), y′_1), ..., (h(x′_{|Te|}), y′_{|Te|})) = Σ_{i=1}^{|Te|} λ(h(x′_i), y′_i)   (9)

then Equation 8 comes down to

arg min_{h ∈ H} R^Λ(h) = arg min_{h ∈ H} R^λ(h) = arg min_{h ∈ H} ∫ λ(h(x′), y′) dP(x′, y′)   (10)

Discriminative learning algorithms estimate the expected risk R^Λ(h) via the
empirical risk (or “training error”) R^Λ_{Tr}(h), which by virtue of Equation 9 becomes

R̂^Λ(h) = R^Λ_{Tr}(h) = R^λ_{Tr}(h) = Σ_{i=1}^{|Tr|} λ(h(x_i), y_i)   (11)

and they pick the hypothesis h which minimizes R^λ_{Tr}(h).</p>
      <p>The problem with adopting this approach in learning to quantify is that all
quantification loss measures Λ are such that Equation 9 does not hold. In other
words, such loss measures are nonlinear (since they do not linearly decompose into
the individual losses brought about by the members of the set) and multivariate
(since Λ is a function of all the instances, and does not break down into univariate
loss functions). For instance, we cannot evaluate the contribution to KLD of a
classification decision for a single item x′_i, since this contribution will depend on
how the other items have been classified; if the other items have given rise, say, to
more false negatives than false positives, then misclassifying a negative example
(thus bringing about an additional false positive) is even beneficial, since false
positives and false negatives cancel each other out when it comes to quantification
accuracy. This latter fact shows that any quantification loss (and not just the
ones discussed in Section 2) is inherently nonlinear and multivariate. This means
that, since Equations 9–11 do not hold for quantification loss measures Λ, we
need to seek learning methods that can explicitly minimize R^Λ_{Tr}(h) holistically,
i.e., without making the “reductionistic” assumption that R^Λ(h) = R^λ(h).</p>
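      <p>The cancellation effect can be seen numerically (the counts are illustrative):
two classifiers making the same number of errors receive very different KLD scores
depending on how the errors balance out.</p>
      <preformat>
```python
import math

def kld_binary(tp, fp, fn, tn):
    """Binary KLD between true and predicted prevalence, computed from a
    contingency table (no smoothing needed for these non-degenerate counts)."""
    n = tp + fp + fn + tn
    p, q = (tp + fn) / n, (tp + fp) / n       # true vs. predicted prevalence
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

balanced = kld_binary(tp=40, fp=10, fn=10, tn=40)   # 20 errors, FP == FN
skewed = kld_binary(tp=30, fp=0, fn=20, tn=50)      # 20 errors, all of them FN

# balanced evaluates to 0.0: the 10 FP and 10 FN cancel out at the aggregate
# level, so quantification is perfect despite 20 misclassified items;
# skewed is strictly positive: the same error count, unbalanced, hurts KLD.
```
      </preformat>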
      <p>
        As mentioned in the introduction, the idea to use ELM in quantification
applications was first proposed, but not implemented, in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this section we
will look at three works [
        <xref ref-type="bibr" rid="ref19 ref2 ref9">2, 9, 19</xref>
        ] that have indeed exploited this idea, although
in three different directions.
      </p>
      <p>
        <bold>3.1 Quantification via Structured Prediction I: SVM(KLD) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]</bold>
 In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] Esuli and Sebastiani also suggested using, as a “holistic” algorithm of the
type discussed in the previous paragraph, the SVM for Multivariate Performance
Measures (SVMperf) learning algorithm proposed by Joachims [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>SVMperf is a learner of the Support Vector Machine family that can generate
classifiers optimized for any non-linear, multivariate loss function that can be
computed from a contingency table (as all the measures presented in Section 2
are). SVMperf is an algorithm for multivariate prediction: instead of handling
hypotheses h : X → Y mapping an individual item x_i into an individual label y_i, it
considers hypotheses h̄ : X̄ → Ȳ which map entire tuples of items x̄ = (x_1, ..., x_n)
into tuples of labels ȳ = (y_1, ..., y_n). Instead of learning hypotheses of type

h(x) = sign(w · x + b)   (12)

it learns hypotheses of type

h̄(x̄) = arg max_{ȳ′ ∈ Ȳ} (w · Ψ(x̄, ȳ′))   (13)

where w is the vector of parameters to be learnt during training and

Ψ(x̄, ȳ′) = Σ_{i=1}^{n} x_i y′_i   (14)

(the joint feature map) is a function that scores the pair of tuples (x̄, ȳ′) according
to how “compatible” the tuple of labels ȳ′ is with the tuple of inputs x̄. (In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] SVMperf is actually called SVM^{multi}_Λ, but the author has released its
implementation under the name SVMperf; we will indeed use this latter name.)</p>
      <p>While the optimization problem of classic soft-margin SVMs consists in finding

arg min_{w, ξ_i ≥ 0} (1/2) w · w + C Σ_{i=1}^{|Tr|} ξ_i   (15)
such that y_i[w · x_i + b] ≥ (1 − ξ_i) for all i ∈ {1, ..., |Tr|}

the corresponding problem of SVMperf consists instead of finding

arg min_{w, ξ ≥ 0} (1/2) w · w + Cξ   (16)
such that w · [Ψ(x̄, ȳ) − Ψ(x̄, ȳ′) + b] ≥ Λ(ȳ, ȳ′) − ξ for all ȳ′ ∈ Ȳ \ {ȳ}

Here, the relevant thing to observe is that the sample-based loss Λ explicitly
appears in the optimization problem.</p>
      <p>
        We refer the interested reader to [
        <xref ref-type="bibr" rid="ref16 ref17 ref21">16, 17, 21</xref>
        ] for more details on SVMperf and
on SVMs for structured output prediction in general. From the point of view
of the user interested in applying it to a certain task, the implementation of
SVMperf made available by its author is essentially an off-the-shelf package, since
for customizing it to a specific loss Λ one only needs to write a small module
that describes how to compute Λ from a contingency table.
      </p>
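      <p>As an illustration only (the actual SVMperf plug-ins are small C routines, and
the function below is our sketch, not its API), a loss module computing smoothed
KLD from a binary contingency table could look like this:</p>
      <preformat>
```python
import math

def kld_from_contingency(tp, fp, fn, tn):
    """Smoothed binary KLD (Equations 4 and 6) computed from the four cells
    of a contingency table; a sketch of a quantification loss module."""
    n = tp + fp + fn + tn
    eps = 1.0 / (2 * n)                            # smoothing factor of Section 2
    p_pos, p_neg = (tp + fn) / n, (fp + tn) / n    # true prevalences
    q_pos, q_neg = (tp + fp) / n, (fn + tn) / n    # predicted prevalences
    kld = 0.0
    for p, q in ((p_pos, q_pos), (p_neg, q_neg)):
        ps = (p + eps) / (1.0 + 2 * eps)           # additive smoothing, |C| = 2
        qs = (q + eps) / (1.0 + 2 * eps)
        kld += ps * math.log(ps / qs)
    return kld
```
      </preformat>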
      <p>
        While [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] only went as far as suggesting the use of SVMperf to optimize a
quantification loss, its authors later went on to actually implement the idea, using
KLD as the quantification loss and naming the resulting system SVM(KLD)
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In Section 4 we will describe some of the insights that they obtained from
experimenting with it.
      </p>
      <p>
        <bold>3.2 Quantification Trees and Quantification Forests [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]</bold>
 Rather than working in the framework of SVMs, the work of Milli and colleagues
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] performs explicit loss minimization in the context of a decision tree framework.
Essentially, their idea is to use a quantification loss as the splitting criterion in
generating a decision tree, thereby generating a quantification tree (i.e., a decision
tree specialized for quantification). The authors experiment with three different
quantification loss measures: (a) (a proxy of) absolute error, i.e., D(p̂, p) =
Σ_{c_j ∈ C} |FP_j − FN_j|; (b) KLD; (c) MOM(p̂, p) = Σ_{c_j ∈ C} |FP_j² − FN_j²|. Measure
(c) is of particular significance since it is not a “pure” quantification loss. In fact,
notice that MOM(p̂, p) is equivalent to Σ_{c_j ∈ C} (FN_j + FP_j) · |FN_j − FP_j|, and
that while the second factor (|FN_j − FP_j|) may indeed be seen as representing
quantification error, the first factor (FN_j + FP_j) is a measure of classification
error. The motivation behind the authors’ choice is to minimize at the same
time classification and quantification error, based on the notion that a quantifier
that has good quantification accuracy but low classification accuracy is somehow
unreliable, and should be avoided.
      </p>
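      <p>The equivalence just invoked is simply the difference-of-squares factorization
|FP² − FN²| = (FN + FP) · |FN − FP|, which can be checked per class:</p>
      <preformat>
```python
def mom_term(fp, fn):
    """Per-class MOM term: |FP^2 - FN^2|."""
    return abs(fp ** 2 - fn ** 2)

def mom_term_factored(fp, fn):
    """The same quantity factored into a classification-error factor (FN + FP)
    times a quantification-error factor |FN - FP|."""
    return (fn + fp) * abs(fn - fp)
```
      </preformat>
      <p>The two always agree, which is what licenses reading MOM as jointly penalizing
classification error and quantification error.</p>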
      <p>
        The authors go on to propose the use of quantification forests, i.e., random
forests of quantification trees, where the latter are defined as above. For more
details we refer the interested reader to [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        It should be remarked that [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] is the only one, among the three works we
review in this section, that directly addresses SLMC quantification. The other two
works that we have discussed [
        <xref ref-type="bibr" rid="ref2 ref9">2, 9</xref>
        ] instead address binary quantification only; in
order to extend their approach to SLMC quantification, binary quantification has
to be performed independently for each class and the resulting class prevalences
have to be rescaled so that they sum up to 1. This is certainly suboptimal, but
better solutions are not known, since an SLMC equivalent of SVMperf, which is
binary in nature, is not known.
      </p>
      <p>
        <bold>3.3 Quantification via Structured Prediction II: SVM(Q) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]</bold>
Barranquero et al.’s forthcoming work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposes an approach to binary
quantification that combines elements of the works carried out in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. As
suggested in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and as later implemented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Barranquero et al. also use
SVMperf to directly optimize quantification accuracy. However, similarly to [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
they optimize a measure (which they call Q-measure) that combines
classification accuracy and quantification accuracy. Their Q-measure is shaped upon the
famous Fβ measure [22, Chapter 7], leading to a loss defined as
Λ = 1 − Qβ(p̂, p) = 1 − ((β² + 1) Γc(p̂, p) · Γq(p̂, p)) / (β² Γc(p̂, p) + Γq(p̂, p))

where Γc and Γq are a measure of classification “gain” (the opposite of loss) and
a measure of quantification gain, respectively, and 0 ≤ β ≤ +∞ is a parameter
that controls the relative importance of the two; for β = 0 the Qβ measure
coincides with Γc, while when β tends to infinity Qβ asymptotically tends to Γq.
      </p>
      <p>As a measure of classification gain Barranquero et al. use recall, while as a
measure of quantification gain they use (1 − N AE), where N AE is as defined
in Equation 2. The authors motivate the (apparently strange) decision to use
recall as a measure of classification gain with the fact that, while recall by itself
is not a suitable measure of classification gain (since it is always possible to
arbitrarily increase recall at the expense of precision or specificity), to include
precision or specificity in Qβ is unnecessary, since the presence of Γq in Qβ has
the effect of ruling out anyway those hypotheses characterized by high recall and
low precision / specificity (since these hypotheses are indeed penalized by Γq).
The experiments presented in the paper test values for β in {0.5, 1, 2}.</p>
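      <p>A sketch of the resulting loss for the binary case, under the instantiation used
in the paper (Γc = recall, Γq = 1 − NAE); the helper below is ours, not Barranquero
et al.’s implementation:</p>
      <preformat>
```python
def q_beta_loss(tp, fp, fn, tn, beta=1.0):
    """Loss 1 - Q_beta with classification gain = recall and quantification
    gain = 1 - NAE, computed from a binary contingency table (a sketch)."""
    n = tp + fp + fn + tn
    gain_c = tp / (tp + fn)                     # recall
    p_pos = (tp + fn) / n                       # true prevalence of the positives
    ae = 2 * abs((tp + fp) / n - p_pos)         # binary AE (Equation 1)
    gain_q = 1 - ae / (2 * (1 - min(p_pos, 1 - p_pos)))   # 1 - NAE (Equation 2)
    q = ((beta ** 2 + 1) * gain_c * gain_q) / (beta ** 2 * gain_c + gain_q)
    return 1 - q
```
      </preformat>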
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>
        The approaches that the three papers discussed in this section have proposed
have never been compared experimentally, since the experiments they report
use different datasets. The only paper among the three where the experimentation
is carried out on high-dimensional datasets is [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where tests are conducted on
text classification datasets, while [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] only report tests on low-dimensional
ones; in the case of [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], this is due to the fact that the underlying technology
(decision trees) does not scale well to high dimensionalities.
      </p>
      <p>
We are currently carrying out experiments aimed at comparing the approaches
of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] on the above high-dimensional datasets, testing a range of different
experimental conditions (different class prevalence, different distribution drift,
etc.), similarly to what was done in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We hope to have the results ready in time for
them to be presented at the workshop.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>The ELM approach to quantification combines solid theoretical foundations with
state-of-the-art performance, and promises to provide a superior alternative to
the mostly empirical approaches that have been standard in the quantification
literature. The key question that the (few) past works along this line leave open
is: should one (a) optimize a combination of a quantification and a classification
measure, or rather (b) optimize a pure quantification measure? In other words:
how fundamental is classification accuracy to a quantifier? Approach (a) has
indeed intuitive appeal, since we intuitively tend to trust a quantifier if it is
also a good classifier; a quantifier that derives its good quantification accuracy
from a high, albeit balanced, number of false positives and false negatives makes
us a bit uneasy. On the other hand, approach (b) seems more in line with
accepted machine learning wisdom (“optimize the measure you are going to
evaluate the results with”), and one might argue that being serious about the
fact that classification and quantification are fundamentally different means that,
if a quantifier delivers consistently good quantification accuracy at the expense
of classification accuracy, this latter fact should not be our concern. Further
research is needed to answer these questions and to determine which among these
contrasting intuitions is more correct.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>We thank Thorsten Joachims for making SVMperf available, and Jose Barranquero
for providing us with the module that implements the Q-measure within SVMperf .</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Rocío</given-names>
            <surname>Alaíz-Rodríguez</surname>
          </string-name>
          ,
          <article-title>Alicia Guerrero-Curieses, and Jesús Cid-Sueiro. Class and subclass probability re-estimation to adapt a classifier in the presence of concept drift</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>74</volume>
          (
          <issue>16</issue>
          ):
          <fpage>2614</fpage>
          -
          <lpage>2623</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Jose Barranquero,
          <source>Jorge Díez, and Juan José del Coz</source>
          .
          <article-title>Quantification-oriented learning based on reliable classifiers</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>48</volume>
          (
          <issue>2</issue>
          ):
          <fpage>591</fpage>
          -
          <lpage>604</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Antonio Bella, Cèsar Ferri,
          <article-title>José Hernández-Orallo, and María José Ramírez-Quintana</article-title>
          .
          <article-title>Quantification via probability estimators</article-title>
          .
          <source>In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM</source>
          <year>2010</year>
          ), pages
          <fpage>737</fpage>
          -
          <lpage>742</lpage>
          , Sydney,
          <string-name>
            <surname>AU</surname>
          </string-name>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><given-names>Yee Seng</given-names> <surname>Chan</surname></string-name> and
          <string-name><given-names>Hwee Tou</given-names> <surname>Ng</surname></string-name>.
          <article-title>Estimating class priors in domain adaptation for word sense disambiguation</article-title>.
          <source>In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006)</source>,
          pages <fpage>89</fpage>-<lpage>96</lpage>, Sydney, AU,
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>Thomas M.</given-names> <surname>Cover</surname></string-name> and
          <string-name><given-names>Joy A.</given-names> <surname>Thomas</surname></string-name>.
          <source>Elements of information theory</source>. John Wiley &amp; Sons, New York, US,
          <year>1991</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Imre</given-names>
            <surname>Csiszár</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul C.</given-names>
            <surname>Shields</surname>
          </string-name>
          .
          <article-title>Information theory and statistics: A tutorial</article-title>. <source>Foundations and Trends in Communications and Information Theory</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <fpage>417</fpage>
          -
          <lpage>528</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Esuli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Machines that learn how to code open-ended survey data</article-title>
          .
          <source>International Journal of Market Research</source>
          ,
          <volume>52</volume>
          (
          <issue>6</issue>
          ):
          <fpage>775</fpage>
          -
          <lpage>800</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Esuli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Sentiment quantification</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>25</volume>
          (
          <issue>4</issue>
          ):
          <fpage>72</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Esuli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Optimizing text quantifiers for multivariate loss functions</article-title>
          .
          <source>ACM Transactions on Knowledge Discovery from Data</source>
          ,
          <year>2014</year>
          . Forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>George</given-names>
            <surname>Forman</surname>
          </string-name>
          .
          <article-title>Counting positives accurately despite inaccurate classification</article-title>
          .
          <source>In Proceedings of the 16th European Conference on Machine Learning (ECML 2005)</source>
          , pages
          <fpage>564</fpage>
          -
          <lpage>575</lpage>
          , Porto, PT,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>George</given-names>
            <surname>Forman</surname>
          </string-name>
          .
          <article-title>Quantifying trends accurately despite classifier error and class imbalance</article-title>
          .
          <source>In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006)</source>
          , pages
          <fpage>157</fpage>
          -
          <lpage>166</lpage>
          , Philadelphia, US,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>George</given-names>
            <surname>Forman</surname>
          </string-name>
          .
          <article-title>Quantifying counts and costs via classification</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>164</fpage>
          -
          <lpage>206</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name><given-names>George</given-names> <surname>Forman</surname></string-name>,
          <string-name><given-names>Evan</given-names> <surname>Kirshenbaum</surname></string-name>, and
          <string-name><given-names>Jaap</given-names> <surname>Suermondt</surname></string-name>.
          <article-title>Pragmatic text mining: Minimizing human effort to quantify many issues in call logs</article-title>.
          <source>In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006)</source>,
          pages
          <fpage>852</fpage>
          -
          <lpage>861</lpage>
          , Philadelphia, US,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name><given-names>John J.</given-names> <surname>Gart</surname></string-name> and
          <string-name><given-names>Alfred A.</given-names> <surname>Buck</surname></string-name>
          .
          <article-title>Comparison of a screening test and a reference test in epidemiologic studies: II. A probabilistic model for the comparison of diagnostic tests</article-title>
          .
          <source>American Journal of Epidemiology</source>
          ,
          <volume>83</volume>
          (
          <issue>3</issue>
          ):
          <fpage>593</fpage>
          -
          <lpage>602</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name><given-names>Daniel J.</given-names> <surname>Hopkins</surname></string-name> and
          <string-name><given-names>Gary</given-names> <surname>King</surname></string-name>
          .
          <article-title>A method of automated nonparametric content analysis for social science</article-title>
          .
          <source>American Journal of Political Science</source>
          ,
          <volume>54</volume>
          (
          <issue>1</issue>
          ):
          <fpage>229</fpage>
          -
          <lpage>247</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>A support vector method for multivariate performance measures</article-title>
          .
          <source>In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005)</source>
          , pages
          <fpage>377</fpage>
          -
          <lpage>384</lpage>
          , Bonn, DE,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name><given-names>Thorsten</given-names> <surname>Joachims</surname></string-name>,
          <string-name><given-names>Thomas</given-names> <surname>Hofmann</surname></string-name>,
          <string-name><given-names>Yisong</given-names> <surname>Yue</surname></string-name>, and
          <string-name><given-names>Chun-Nam</given-names> <surname>Yu</surname></string-name>
          .
          <article-title>Predicting structured objects with support vector machines</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>52</volume>
          (
          <issue>11</issue>
          ):
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Gary</given-names>
            <surname>King</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ying</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Verbal autopsy methods with multiple causes of death</article-title>
          .
          <source>Statistical Science</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ):
          <fpage>78</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name><given-names>Letizia</given-names> <surname>Milli</surname></string-name>
          , Anna Monreale, Giulio Rossetti, Fosca Giannotti, Dino Pedreschi, and
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>Quantification trees</article-title>
          .
          <source>In Proceedings of the 13th IEEE International Conference on Data Mining (ICDM 2013)</source>
          , pages
          <fpage>528</fpage>
          -
          <lpage>536</lpage>
          , Dallas, US,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name><given-names>Marco</given-names> <surname>Saerens</surname></string-name>
          , Patrice Latinne, and
          <string-name>
            <given-names>Christine</given-names>
            <surname>Decaestecker</surname>
          </string-name>
          .
          <article-title>Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>21</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name><given-names>Ioannis</given-names> <surname>Tsochantaridis</surname></string-name>
          , Thorsten Joachims, Thomas Hofmann, and
          <string-name>
            <given-names>Yasemin</given-names>
            <surname>Altun</surname>
          </string-name>
          .
          <article-title>Large margin methods for structured and interdependent output variables</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>6</volume>
          :
          <fpage>1453</fpage>
          -
          <lpage>1484</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name><given-names>Cornelis J.</given-names> <surname>van Rijsbergen</surname></string-name>.
          <source>Information Retrieval</source>. Butterworths, London, UK, second edition,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Statistical language models for information retrieval: A critical review</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ):
          <fpage>137</fpage>
          -
          <lpage>213</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>