UniOviedo(Team2) at LeQua 2022: Comparison of
traditional quantifiers and a new method based on
Energy Distance
Juan José del Coz1
1
    Artificial Intelligence Center at Gijón, University of Oviedo, Spain


Abstract
The idea of this team was to compare the performance of some of the most important quantification
methods with a new approach, based on the Energy Distance, that our group has proposed recently.
This paper describes this method, called 𝐸𝐷𝑦, and the experimentation carried out to tackle the vector
subtasks (T1A and T1B) of the LeQua 2022 quantification competition.

Keywords
quantification, prevalence estimation, energy distance




1. Motivation
Our main intention in this competition was to analyze the behavior of a new quantification
algorithm devised by our group. This method, called 𝐸𝐷𝑦, is as yet unpublished and will be
briefly described in Section 2.2. To assess its performance, we compare it with some of the most
popular quantification algorithms, see Section 2.1.
   We focus only on the vector subtasks (T1A and T1B) because we are not experts in deep
learning, which is more or less required to tackle the subtasks using raw documents (T2A and
T2B). According to our previous studies using 𝐸𝐷𝑦 over benchmark data, our hopes of being
truly competitive were centered on T1B, because 𝐸𝐷𝑦 usually provides better results for
multiclass quantification tasks. In fact, we only submitted the scores of 𝐸𝐷𝑦 for T1B. For the
binary subtask T1A we employed 𝐻𝐷𝑦 [1] with some customization. We achieved a bronze
medal in both subtasks but, as we will see later, our results could easily have been better in
subtask T1B.


2. Methods
Before describing the methods used, we introduce some notation. In the general setting,
quantification methods learn from a training set, $D = \{(x_i, y_i)\}_{i=1}^{n}$, in which $x_i$ is
the description of an instance using the features of the input space, and $y_i$ is its class. In the
tasks of the LeQua competition, $y_i \in \{c_1, \ldots, c_k\}$, with $k = 2$ for the binary subtask
T1A and $k = 28$ for the multiclass subtask T1B.
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
Email: juanjo@uniovi.es (J. J. del Coz)
Web: www.aic.uniovi.es/juanjo (J. J. del Coz)
ORCID: 0000-0002-4288-3839 (J. J. del Coz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The goal of quantification learning is to automatically obtain models able to predict the
prevalence of all classes, $\hat{p} = \{\hat{p}_1, \ldots, \hat{p}_k\}$, given a set of unlabeled
examples, $T = \{x_j\}_{j=1}^{m}$, ensuring that $\hat{p}_l \geq 0$ and $\sum_{l=1}^{k} \hat{p}_l = 1$.

2.1. SOTA quantifiers
There are several quantification methods that can be considered state-of-the-art; see [2]. We
chose the best-performing ones according to our experience, namely:

    • 𝐸𝑀 [3]. This method is based on the expectation–maximization algorithm, in which the
      parameters to be estimated are the class prior probabilities. It is denoted as 𝐸𝑀𝑄 in
      QuaPy [4] and as 𝑆𝐿𝐷 in the baseline results given by the organizers1 ; a sketch follows
      this list.
    • 𝐻𝐷𝑦 [1]. This is a distribution matching quantifier that uses histograms to represent the
      distributions and the Hellinger Distance to measure histogram similarity.
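
   As a concrete illustration of the 𝐸𝑀 bullet above, the following is a minimal sketch of the
EM/SLD prior-adjustment loop of [3], assuming posteriors produced by a probabilistic classifier;
the function name and stopping criterion are ours, and this is not the QuaPy implementation.

import numpy as np

def em_quantify(posteriors, p_train, n_iter=1000, tol=1e-6):
    # posteriors: (m, k) array of test-set posterior probabilities
    # p_train: (k,) array with the training prevalence of each class
    p = p_train.copy()
    for _ in range(n_iter):
        # E-step: rescale each posterior by the ratio between the current
        # prevalence estimate and the training prevalence, then renormalize
        w = posteriors * (p / p_train)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: the new prevalence estimate is the mean adjusted posterior
        p_new = w.mean(axis=0)
        if np.abs(p_new - p).max() < tol:
            break
        p = p_new
    return p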

  However, while we used the 𝐸𝑀 method without any major customization, we tested two
possible improvements for the 𝐻𝐷𝑦 method (both sketched below):
   1. A different way of computing the histograms. The original method is based on equal-width
      bins. We also tested equal-count bins, computed over the examples of all the classes.
   2. Taking into account the results reported in [5], we also tested the Topsøe divergence as
      the similarity measure.

  We improved the 𝐻𝐷𝑦 results provided by the organizers using these modifications.
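
   For illustration, here is a minimal sketch of both customizations, assuming normalized
histograms as input; the function names are ours and purely illustrative.

import numpy as np

def equal_count_cutpoints(scores, n_bins):
    # Cut points such that each bin receives roughly the same number of
    # examples, computed over the scores of all classes pooled together
    return np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1])

def topsoe(p, q, eps=1e-12):
    # Topsoe divergence between two normalized histograms p and q
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    m = (p + q) / 2
    return np.sum(p * np.log(p / m) + q * np.log(q / m))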

2.2. EDy
𝐸𝐷𝑦 is based on the method presented in [6]. It is also a distribution matching algorithm, like
𝐻𝐷𝑦, but the distributions are represented by the complete sets of examples (the sets with the
training examples of each class, denoted as $D_{c_l}$, and the testing set $T$), and the metric is
the Energy Distance (ED). Formally, 𝐸𝐷𝑦 minimizes the ED between $T$ and the weighted
mixture distribution

    $D' = D_{c_1} \cdot \hat{p}_1 + D_{c_2} \cdot \hat{p}_2 + \ldots + D_{c_k} \cdot \hat{p}_k$    (1)

with respect to $\hat{p}$. That is:

    $\min_{\hat{p}_1, \ldots, \hat{p}_k} \; 2 \cdot \mathbb{E}_{x_i \sim D',\, x_j \sim T}\, \delta(x_i, x_j) - \mathbb{E}_{x_i, x'_i \sim D'}\, \delta(x_i, x'_i) - \mathbb{E}_{x_j, x'_j \sim T}\, \delta(x_j, x'_j)$    (2)

where $\delta$ is a distance. The last term can be removed (it does not depend on $\hat{p}$), so we
have:

    $\min_{\hat{p}_1, \ldots, \hat{p}_k} \; 2 \sum_{l=1}^{k} \hat{p}_l\, \mathbb{E}_{x_i \sim D_{c_l},\, x_j \sim T}\, \delta(x_i, x_j) - \sum_{l=1}^{k} \sum_{l'=1}^{k} \hat{p}_l \hat{p}_{l'}\, \mathbb{E}_{x_i \sim D_{c_l},\, x'_i \sim D_{c_{l'}}}\, \delta(x_i, x'_i)$    (3)
    1 https://github.com/HLT-ISTI/QuaPy/tree/lequa2022/LeQua2022
   The difference between 𝐸𝐷𝑦 and the method introduced in [6] lies in how $\delta(x_i, x_j)$ is
computed. The authors of [6] propose to use the actual features of the input space; we denote
that approach as 𝐸𝐷𝑋. Our proposal is to use the predictions of a classifier $h$ instead, that is,
$\delta(h(x_i), h(x_j))$: the same predictions used by 𝐸𝑀 and 𝐻𝐷𝑦. As the distance $\delta$ we
used the Manhattan distance.
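
   The objective in Eq. (3) is a quadratic function of $\hat{p}$ over the probability simplex. The
following is a minimal sketch of this optimization, assuming the per-class training predictions
and the test-bag predictions are already available; the function name is ours and this is not the
exact competition code.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def edy_prevalences(preds_per_class, preds_test):
    # preds_per_class: list of (n_l, k) arrays with the classifier
    # predictions for the training examples of each class
    # preds_test: (m, k) array with the predictions for the test bag
    k = len(preds_per_class)
    # First term of Eq. (3): expected Manhattan distance between each
    # class distribution and the test bag T
    t = np.array([cdist(P, preds_test, metric="cityblock").mean()
                  for P in preds_per_class])
    # Second term: expected distances between every pair of classes
    K = np.array([[cdist(P, Q, metric="cityblock").mean()
                   for Q in preds_per_class] for P in preds_per_class])

    def objective(p):
        return 2 * p @ t - p @ K @ p

    # Minimize over the probability simplex: p_l >= 0 and sum(p) = 1
    p0 = np.full(k, 1.0 / k)
    constraint = {"type": "eq", "fun": lambda p: p.sum() - 1.0}
    result = minimize(objective, p0, bounds=[(0.0, 1.0)] * k,
                      constraints=constraint)
    return result.x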


3. Experiments
The first step of our experiments2 was to select the best classifier, because all the compared
algorithms require one. We tested several classifiers using just the training set of subtask
T1A, including Logistic Regression, Random Forest, Support Vector Machines, XGBoost, Naive
Bayes and Gaussian Processes. The best one, rather clearly, was Logistic Regression (LR). We
then adjusted its regularization parameter, the best value being $C = 0.01$. We employed this
classifier for the rest of the experiments, including subtask T1B.
   Another important factor, according to our experience, is how to estimate the distributions.
It is well described in the literature, for instance for the 𝐴𝐶 method [7], that it is better to use
some sort of cross-validation (CV). The approach in our recent papers is to use such CV to
estimate not only the distributions of the training data but also those of the testing sets, averaging
the predictions of the classifiers that compose the CV model. This works better than learning a
separate classifier to compute the predictions for the testing bags. We used 20 folds for subtask
T1A and 10 for T1B, as in the sketch below.
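
   A minimal sketch of this estimation procedure, assuming a scikit-learn workflow; averaging
the per-fold classifiers over the test bags is our reading of the description above, and the function
name is ours.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def cv_posteriors(X_train, y_train, X_test, n_folds=20):
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    # Out-of-fold posteriors for the training set, used to build the
    # per-class training distributions
    train_probs = cross_val_predict(
        LogisticRegression(C=0.01, max_iter=1000), X_train, y_train,
        cv=skf, method="predict_proba")
    # For the test bag: average the posteriors of the fold classifiers
    test_probs = np.zeros((X_test.shape[0], len(np.unique(y_train))))
    for train_idx, _ in skf.split(X_train, y_train):
        clf = LogisticRegression(C=0.01, max_iter=1000)
        clf.fit(X_train[train_idx], y_train[train_idx])
        test_probs += clf.predict_proba(X_test)
    return train_probs, test_probs / n_folds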
   The only compared method that has hyperparameters is 𝐻𝐷𝑦:
    • Similarity measure. Two alternatives: HD, as in the original 𝐻𝐷𝑦 method, and Topsøe
      (method 𝐷𝑦𝑆-𝑇𝑆 in [5]). We denote this last method as 𝑃𝐷𝐹𝑦𝑇, because it uses
      histograms (𝑃𝐷𝐹), the predictions from a classifier (𝑦) and Topsøe (𝑇).
    • Number of bins. We tried the following values: {30, 40, 50}.
    • Method used for computing the cut points of the histograms: equal-width or equal-count.
We tried these six combinations to select the best one for 𝐻𝐷𝑦 and 𝑃𝐷𝐹𝑦𝑇.
  Recall that the target performance measure is the Mean Relative Absolute Error (MRAE):

    $MRAE(p, \hat{p}) = \frac{1}{k} \sum_{l=1}^{k} \frac{|\hat{p}_l - p_l|}{p_l},$    (4)

where $p_l$ and $\hat{p}_l$ are the true and the predicted prevalences of class $l$. RAE may be
undefined when $p_l = 0$, so both prevalences are smoothed before computing it [8]:

    $smooth(p) = \frac{\epsilon + p}{\epsilon k + 1}, \quad \epsilon = \frac{1}{2m}.$    (5)

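   As a worked sketch of Eqs. (4) and (5), assuming bags of m = 250 documents (which yields
the value ε = 0.002 used later for subtask T1A):

import numpy as np

def smooth(p, m):
    # Eq. (5): additive smoothing with epsilon = 1/(2m)
    eps = 1.0 / (2 * m)
    p = np.asarray(p, dtype=float)
    return (eps + p) / (eps * len(p) + 1)

def mrae(p_true, p_hat, m):
    # Eq. (4) computed on the smoothed prevalences
    p_true, p_hat = smooth(p_true, m), smooth(p_hat, m)
    return np.mean(np.abs(p_hat - p_true) / p_true)

# A binary bag of m = 250 documents (epsilon = 0.002)
print(mrae([0.9, 0.1], [0.85, 0.15], m=250))  # ~0.27
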
3.1. Subtask T1A
For this task the equal-count method works better. The results using LR over the validation set
are shown in Table 1. The first conclusion is that 𝐻𝐷𝑦 and 𝑃𝐷𝐹𝑦𝑇 are the best performers.
There is not much difference between them, but 𝑃𝐷𝐹𝑦𝑇 seems slightly better. This is in line
with the conclusions of [5].
    2 Source code: https://github.com/jjdelcoz/QU-Ant
Table 1
Results over the validation set of subtask T1A using Logistic Regression
                               Method                MRAE        MAE
                               𝐸𝑀                   1.19731    0.22649
                               𝐻𝐷𝑦 (30 bins)        0.15273    0.02570
                               𝐻𝐷𝑦 (40 bins)        0.13941    0.02639
                               𝐻𝐷𝑦 (50 bins)        0.14917    0.02748
                               𝑃𝐷𝐹𝑦𝑇 (30 bins)     0.13542    0.02411
                               𝑃𝐷𝐹𝑦𝑇 (40 bins)     0.13225    0.02480
                               𝑃𝐷𝐹𝑦𝑇 (50 bins)     0.13112    0.02469
                               𝐸𝐷𝑦                  0.21878    0.02676


Table 2
Results over the validation set of T1A using Calibrated Logistic Regression
                               Method                MRAE        MAE
                               𝐸𝑀                   0.13775    0.02374
                               𝐻𝐷𝑦 (30 bins)        0.18334    0.03077
                               𝐻𝐷𝑦 (40 bins)        0.19601    0.03561
                               𝐻𝐷𝑦 (50 bins)        0.20383    0.04044
                               𝑃𝐷𝐹𝑦𝑇 (30 bins)     0.13025    0.02425
                               𝑃𝐷𝐹𝑦𝑇 (40 bins)     0.12701    0.02470
                               𝑃𝐷𝐹𝑦𝑇 (50 bins)     0.12825    0.02552
                               𝐸𝐷𝑦                  0.21586    0.02692


   𝐸𝐷𝑦 is clearly outperformed in terms of MRAE, but its performance is similar to that of
𝐻𝐷𝑦 in terms of MAE. Moreover, it is pretty clear from the results in Table 1 that the scores
of 𝐸𝑀 are rather bad because it requires well-calibrated posterior probabilities. Thus, we used
the CalibratedClassifierCV class of sklearn to obtain calibrated probabilities. The scores of this
experiment are in Table 2. 𝐸𝑀 clearly improves, but it still performs worse than 𝑃𝐷𝐹𝑦𝑇.
Notice that the score of 𝐸𝑀 is just slightly better than that provided by the organizers (0.13775
vs. 0.1393).
   Taking into account all these results, we finally selected 𝑃𝐷𝐹𝑦𝑇 with 40 equal-count bins
using calibrated Logistic Regression with $C = 0.01$. Notice that 𝑃𝐷𝐹𝑦𝑇 obtains better results
than the original version of 𝐻𝐷𝑦 provided by the organizers (0.12701 vs. 0.1767).
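
   A minimal sketch of this final T1A configuration, assuming standard scikit-learn classes;
any settings beyond those stated above (e.g., max_iter) are our assumptions.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Calibrated LR with C = 0.01; its posteriors feed PDFyT with 40
# equal-count bins and the Topsoe divergence (see Section 2.1)
base = LogisticRegression(C=0.01, max_iter=1000)
classifier = CalibratedClassifierCV(base)
# classifier.fit(X_train, y_train)
# posteriors = classifier.predict_proba(X_bag)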

3.2. Subtask T1B
Due to the lack of time, we basically tried the same classifier configuration selected for subtask
T1A, in combination with the OneVsRestClassifier provided by sklearn, as sketched below. The
only changes were: i) we had to reduce the number of folds of the cross-validation to 10 because
the smallest class has 14 examples, and ii) for 𝐻𝐷𝑦 the best bin strategy was equal-width and
the numbers of bins tested were {4, 8, 16}, because in this case the performance tended to
decrease as the number of bins increased. Notice that 𝑃𝐷𝐹𝑦𝑇 could not be employed here
because it uses search (not optimization) to compute the final prevalences: exhaustive search
is not feasible when the search space is $[0, 1]^{28}$, and other methods that would have had to
be implemented, such as genetic algorithms, do not guarantee finding the optimal solution.
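
   A sketch of this T1B configuration, again assuming standard scikit-learn classes:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One-vs-rest calibrated LR for the 28-class subtask; its posteriors are
# then estimated with the 10-fold CV procedure described above
base = LogisticRegression(C=0.01, max_iter=1000)
classifier = OneVsRestClassifier(CalibratedClassifierCV(base))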
Table 3
Results over the validation set of T1B using OVR(Calibrated LR) with πœ– = 0.002 (incorrect)
                                Method             MRAE        MAE
                                𝐸𝑀                0.74921     0.01637
                                𝐻𝐷𝑦 (4 bins)      0.86716     0.01527
                                𝐻𝐷𝑦 (8 bins)      0.80127     0.01402
                                𝐻𝐷𝑦 (16 bins)     0.85876     0.01586
                                𝐸𝐷𝑦               0.68223     0.01173


Table 4
Results over the validation set of T1B using OVR(Calibrated LR) with πœ– = 0.0005 (correct)
                                Method             MRAE        MAE
                                𝐸𝑀                1.12322     0.01637
                                𝐻𝐷𝑦 (4 bins)      1.47463     0.01527
                                𝐻𝐷𝑦 (8 bins)      1.33846     0.01402
                                𝐻𝐷𝑦 (16 bins)     1.39885     0.01586
                                𝐸𝐷𝑦               1.16777     0.01173


Table 5
Results over the validation set of T1B using Logistic Regression with ε = 0.0005 (correct)
                                Method             MRAE        MAE
                                𝐸𝑀                2.35675     0.02811
                                𝐻𝐷𝑦 (4 bins)      0.95555     0.01158
                                𝐻𝐷𝑦 (8 bins)      1.07063     0.01257
                                𝐻𝐷𝑦 (16 bins)     1.19310     0.01494
                                𝐸𝐷𝑦               0.89837     0.00996


When we performed this experiment we made a serious mistake: the value used for the
parameter ε of MRAE was 0.002 (the one for subtask T1A) instead of the correct value, 0.0005.
The results of this incorrect experiment are in Table 3. Under those circumstances, 𝐸𝐷𝑦
seemed the best method: its performance was much better than that of the rest of the approaches,
including the baselines provided by the organizers and the results of HistNet (the method of the
other team from the University of Oviedo). Thus we submitted the predictions of 𝐸𝐷𝑦 to the
competition.
   The problem, of course, was the value of ε. The results over the validation set using the
correct value are in Table 4. In this case, 𝐸𝑀 performs better than 𝐸𝐷𝑦 in terms of MRAE but
worse in terms of MAE. In both cases, their results are worse than those of the two best
competitors.
   After exchanging some emails with the TU Dortmund University team, we did one last
experiment. Instead of using OneVsRestClassifier and calibrated Logistic Regression, we applied
a plain Logistic Regression classifier in combination with the same cross-validation estimation
procedure (10 folds). The results of 𝐸𝐷𝑦 improve significantly, see Table 5. The scores of 𝐻𝐷𝑦
are also very competitive, while those of 𝐸𝑀 are much worse, as occurred in subtask T1A when
the posteriors were not calibrated. Had we submitted the predictions of this version of 𝐸𝐷𝑦,
the scores over the test set would have been MRAE 0.864787 and MAE 0.00994, which are better
than those of the winning team of the competition (MRAE 0.879870, MAE 0.011733).

3.3. Conclusions
We have drawn several interesting conclusions from our participation in LeQua:
   1. To obtain good results with quantification algorithms that rely on a classifier, it is crucial
      to select the best classifier-quantifier combination. Obviously, the same classifier is not
      always the most appropriate one for all quantification algorithms.
   2. This implies that quantification competitions are even more complex than classification
      ones: there are more elements to adjust, namely the combination of a classifier and a
      quantifier and the hyperparameters of both. The search space is sometimes doubled.
   3. 𝐸𝑀 is a very good quantification algorithm, but it is very sensitive to the calibration of
      the classifier. Other methods are more robust in this sense and work well with more
      classifiers.
   4. 𝐸𝐷𝑦 seems a good approach for multiclass quantification.


Acknowledgments
This research was funded by MINECO (Ministerio de Economía y Competitividad) and FEDER
(Fondo Europeo de Desarrollo Regional), grant PID2019-110742RB-I00 (MINECO/FEDER).


References
[1] V. González-Castro, R. Alaiz-Rodríguez, E. Alegre, Class distribution estimation based on
    the Hellinger distance, Information Sciences 218 (2013) 146–164.
[2] P. González, A. Castaño, N. V. Chawla, J. J. del Coz, A review on quantification learning,
    ACM Computing Surveys 50 (2017) 74:1–74:40.
[3] M. Saerens, P. Latinne, C. Decaestecker, Adjusting the Outputs of a Classifier to New a
    Priori Probabilities: A Simple Procedure, Neural Computation 14 (2002) 21–41.
[4] A. Moreo, A. Esuli, F. Sebastiani, QuaPy: A Python-based framework for quantification,
    in: Proceedings of the 30th ACM International Conference on Information & Knowledge
    Management, 2021, pp. 4534–4543.
[5] A. Maletzke, D. dos Reis, E. Cherman, G. Batista, DyS: A framework for mixture models
    in quantification, in: Proceedings of the AAAI Conference on Artificial Intelligence,
    volume 33, 2019, pp. 4552–4560.
[6] H. Kawakubo, M. C. du Plessis, M. Sugiyama, Computationally efficient class-prior
    estimation under class balance change using energy distance, IEICE Transactions on
    Information and Systems 99 (2016) 176–186.
[7] G. Forman, Quantifying counts and costs via classification, Data Mining and Knowledge
    Discovery 17 (2008) 164–206.
[8] F. Sebastiani, Evaluation measures for quantification: an axiomatic approach, Information
    Retrieval Journal 23 (2020) 255–288.