=Paper=
{{Paper
|id=Vol-3180/paper-147
|storemode=property
|title=UniOviedo(Team2) at LeQua 2022: Comparison of traditional quantifiers and a new method based on Energy Distance
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-147.pdf
|volume=Vol-3180
|authors=Juan José del Coz
|dblpUrl=https://dblp.org/rec/conf/clef/Coz22
}}
==UniOviedo(Team2) at LeQua 2022: Comparison of traditional quantifiers and a new method based on Energy Distance==
UniOviedo(Team2) at LeQua 2022: Comparison of traditional quantifiers and a new method based on Energy Distance

Juan José del Coz (juanjo@uniovi.es), Artificial Intelligence Center at Gijón, University of Oviedo, Spain

Abstract
The idea of this team was to compare the performance of some of the most important quantification methods and a new approach, based on the Energy Distance, that has recently been proposed by our group. This paper describes this method, called EDy, and the experiments carried out to tackle the vector subtasks (T1A and T1B) of the LeQua 2022 quantification competition.

Keywords: quantification, prevalence estimation, energy distance

1. Motivation

Our main intention in this competition was to analyze the behavior of a new quantification algorithm devised by our group. This method, called EDy, is still unpublished and is briefly described in Section 2.2. To assess its performance, we compare it with some of the most popular quantification algorithms, see Section 2.1. We focused only on the vector subtasks (T1A and T1B) because we are not experts in deep learning, which is more or less required to tackle the subtasks that use raw documents (T2A and T2B). According to our previous studies using EDy over benchmark data, our hopes of being truly competitive were centered on T1B, because EDy usually provides better results for multiclass quantification tasks. In fact, we only submitted the scores of EDy for T1B. For the binary subtask T1A we employed HDy [1] with some customization. We achieved a bronze medal in both competitions but, as we will see later, our results could easily have been better in subtask T1B.

2. Methods

Before describing the methods used, we introduce some notation. In the general setting, quantification methods learn from a training set, $D = \{(x_i, y_i)\}_{i=1}^{m}$, in which $x_i$ is the description of an instance using the features of the input space and $y_i$ is its class. In the tasks of the LeQua competition, $y_i \in \{c_1, \ldots, c_n\}$, with $n = 2$ for the binary tasks (TA) and $n = 28$ for the multiclass quantification tasks (TB). The goal of quantification learning is to automatically obtain models able to predict the prevalence of all classes, $\hat{p} = \{\hat{p}_1, \ldots, \hat{p}_n\}$, given a set of unlabeled examples, $T = \{x_j\}_{j=1}^{t}$, ensuring that $\hat{p}_j \geq 0$ and $\sum_{j=1}^{n} \hat{p}_j = 1$.

2.1. SOTA quantifiers

There are several quantification methods that can be considered state-of-the-art, see [2]. We chose the best performing methods according to our experience, namely:

- EM [3]. This method is based on the expectation-maximization algorithm, in which the parameters to be estimated are the class prior probabilities. It is denoted as EMQ in QuaPy [4] and SLD in the baseline results given by the organizers (https://github.com/HLT-ISTI/QuaPy/tree/lequa2022/LeQua2022). A minimal sketch of its iteration is given after this list.
- HDy [1]. This is a matching distribution quantifier that uses histograms to represent the distributions and the Hellinger Distance to measure histogram similarity; a binary sketch follows the EM one below.
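To make the EM bullet concrete, the following is a minimal sketch of the EM/SLD iteration of [3], assuming the test posteriors come from a probabilistic classifier; the function name and arguments are ours, not those of QuaPy.

```python
import numpy as np

def em_quantify(posteriors, train_prevs, max_iter=1000, tol=1e-6):
    """EM/SLD prior adjustment [3]: iteratively rescale the test posteriors
    by the ratio between the current prevalence estimate and the training
    prevalences, until the estimate stops changing.

    posteriors  : (t, n) array with P(c_j | x) for the t test items
    train_prevs : (n,) class prevalences observed in the training set
    """
    prevs = train_prevs.copy()
    for _ in range(max_iter):
        # E-step: rescale each posterior by the current prior ratio
        scaled = posteriors * (prevs / train_prevs)
        scaled /= scaled.sum(axis=1, keepdims=True)
        # M-step: new prevalence estimate = mean of the adjusted posteriors
        new_prevs = scaled.mean(axis=0)
        if np.abs(new_prevs - prevs).sum() < tol:
            break
        prevs = new_prevs
    return prevs
```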
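HDy itself fits in a few lines for the binary case. The sketch below assumes positive-class scores in [0, 1], equal-width bins, and a simple grid search over the candidate prevalence; the Topsøe divergence used later for the PDFy_T variant only swaps the distance function. All names here are ours.

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two normalized histograms
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def topsoe(p, q, eps=1e-12):
    # Topsøe divergence, the alternative similarity tested for PDFy_T [5]
    p, q = p + eps, q + eps
    m = p + q
    return np.sum(p * np.log(2 * p / m) + q * np.log(2 * q / m))

def hdy_binary(scores_pos, scores_neg, scores_test, n_bins=40, dist=hellinger):
    """HDy [1]: return the prevalence whose mixture of the per-class score
    histograms is closest to the histogram of the test scores."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)   # equal-width bins; equal-count
                                                # bins would use quantile edges
    h_pos = np.histogram(scores_pos, bins=edges)[0] / len(scores_pos)
    h_neg = np.histogram(scores_neg, bins=edges)[0] / len(scores_neg)
    h_test = np.histogram(scores_test, bins=edges)[0] / len(scores_test)
    alphas = np.linspace(0.0, 1.0, 101)         # candidate prevalences
    dists = [dist(a * h_pos + (1 - a) * h_neg, h_test) for a in alphas]
    return alphas[int(np.argmin(dists))]        # positive-class prevalence
```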
While we used the EM method without any major customization, we tested two possible improvements for the HDy method:

1. A different way of computing the histograms. The original method is based on equal-width bins. We also tested equal-count bins, considering the examples of all the classes.
2. Taking into account the results reported in [5], we also tested Topsøe as the similarity measure.

We improved the HDy results provided by the organizers using these modifications.

2.2. EDy

EDy is based on the method presented in [6]. It is also a matching distribution algorithm, like HDy, but the distributions are represented by the complete sets of examples (the sets with the training examples of each class, denoted as $D_{c_j}$, and the testing set $T$), and the metric is the Energy Distance (ED). Formally, EDy minimizes the ED between $T$ and the weighted mixture distribution

$$D' = D_{c_1} \cdot \hat{p}_1 + D_{c_2} \cdot \hat{p}_2 + \ldots + D_{c_n} \cdot \hat{p}_n \quad (1)$$

with respect to $\hat{p}$. That is:

$$\min_{\hat{p}_1, \ldots, \hat{p}_n} \; 2 \cdot \mathbb{E}_{x_i \sim D', x_j \sim T} \, L(x_i, x_j) - \mathbb{E}_{x_i, x'_i \sim D'} \, L(x_i, x'_i) - \mathbb{E}_{x_j, x'_j \sim T} \, L(x_j, x'_j), \quad (2)$$

where $L$ is a distance. The last term can be removed (it does not depend on $\hat{p}$), so we have:

$$\min_{\hat{p}_1, \ldots, \hat{p}_n} \; 2 \sum_{j=1}^{n} \hat{p}_j \, \mathbb{E}_{x_i \sim D_{c_j}, x_j \sim T} \, L(x_i, x_j) - \sum_{j=1}^{n} \sum_{j'=1}^{n} \hat{p}_j \hat{p}_{j'} \, \mathbb{E}_{x_i \sim D_{c_j}, x'_i \sim D_{c_{j'}}} \, L(x_i, x'_i). \quad (3)$$

The difference between EDy and the method introduced in [6] is how $L(x_i, x_j)$ is computed. The authors of [6] propose to use the actual features of the input space; we denote such an approach as EDX. Our proposal is to use the predictions of a classifier $h$, in symbols $L(h(x_i), h(x_j))$, the same predictions used by EM and HDy. As the function $L$ we used the Manhattan distance.
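The two expectation terms in Eq. (3) depend only on pairwise distances, so they can be precomputed and the problem becomes a quadratic program over the probability simplex. Below is a minimal sketch under these assumptions: the distances are Manhattan distances between classifier posteriors (the EDy setting), SciPy's SLSQP solver handles the simplex constraints, and the function names are ours rather than those of the QU-Ant implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

def edy(train_probs, train_labels, test_probs, classes):
    """Sketch of the EDy objective of Eq. (3), with the expectations
    computed over classifier posteriors and L = Manhattan distance."""
    parts = [train_probs[train_labels == c] for c in classes]
    n = len(classes)
    # K[j]    = E_{x_i ~ D_cj, x_j ~ T}   L(h(x_i), h(x_j))
    K = np.array([cdist(P, test_probs, 'cityblock').mean() for P in parts])
    # Q[j,j'] = E_{x_i ~ D_cj, x'_i ~ D_cj'} L(h(x_i), h(x'_i))
    Q = np.array([[cdist(P, P2, 'cityblock').mean() for P2 in parts]
                  for P in parts])

    def objective(p):                     # 2 p.K - p^T Q p, as in Eq. (3)
        return 2 * p @ K - p @ Q @ p

    cons = ({'type': 'eq', 'fun': lambda p: p.sum() - 1.0},)
    res = minimize(objective, np.full(n, 1.0 / n), method='SLSQP',
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    p = np.clip(res.x, 0.0, 1.0)          # guard against tiny violations
    return p / p.sum()
```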
3. Experiments

The first aspect of our experiments (source code: https://github.com/jjdelcoz/QU-Ant) was to select the best classifier, because all the compared algorithms require one. We tested several classifiers using just the training set of subtask T1A, including Logistic Regression, Random Forest, Support Vector Machines, XGBoost, Naive Bayes and Gaussian Processes. The best one, rather clearly, was Logistic Regression (LR). We then adjusted its regularization parameter, finding that the best value was $C = 0.01$. We employed this classifier for the rest of the experiments, including subtask T1B.

Another important factor, according to our experience, is how to estimate the distributions. It is well described in the literature, for instance for the AC method [7], that it is better to use some sort of cross-validation (CV). The approach in our recent papers is to use such CV to estimate not only the distributions of the training data but also those of the testing sets, averaging the predictions of the classifiers that compose the CV model. This works better than learning a separate classifier to estimate the values of the testing bags. We used 20 folds for subtask T1A and 10 for T1B.
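A sketch of how this CV-based estimation can look with scikit-learn follows; the helper name and the array-style inputs are our assumptions, while the classifier, the number of folds and $C = 0.01$ follow the text above. The out-of-fold predictions give the training distribution, and the test posteriors are averaged over the per-fold classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def cv_estimates(X_train, y_train, X_test, n_folds=20, C=0.01):
    """Out-of-fold posteriors for the training set, plus test posteriors
    averaged over the per-fold classifiers (instead of a single classifier
    refitted on the whole training set)."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    clf = LogisticRegression(C=C, max_iter=1000)
    # Training distribution: each example is predicted by the fold whose
    # training split did not contain it.
    train_probs = cross_val_predict(clf, X_train, y_train, cv=skf,
                                    method='predict_proba')
    # Test distribution: average the posteriors of the n_folds classifiers.
    test_probs = np.zeros((len(X_test), len(np.unique(y_train))))
    for tr_idx, _ in skf.split(X_train, y_train):
        fold_clf = LogisticRegression(C=C, max_iter=1000)
        fold_clf.fit(X_train[tr_idx], y_train[tr_idx])
        test_probs += fold_clf.predict_proba(X_test)
    return train_probs, test_probs / n_folds
```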
The only compared method that has hyperparameters is HDy:

- Similarity measure. Two alternatives: HD, as in the original HDy method, and Topsøe (method DyS-TS in [5]). We will denote this last method as PDFy_T, because it uses histograms (PDF), the predictions from a classifier (y) and Topsøe (T).
- Number of bins. We tried the following group of values: {30, 40, 50}.
- Method used for computing the cut points of the histograms: equal-width or equal-count.

We just tried these six choices to select the best combination for HDy and PDFy_T.

Recall that the target performance measure is the Mean of the Relative Absolute Error (MRAE), the average over all testing samples of

$$RAE(p, \hat{p}) = \frac{1}{n} \sum_{j=1}^{n} \frac{|\hat{p}_j - p_j|}{p_j}, \quad (4)$$

where $p_j$ and $\hat{p}_j$ are the real and the predicted prevalences for class $c_j$. RAE may be undefined when $p_j = 0$, so both prevalences are smoothed before computing it [8]:

$$smooth(p_j) = \frac{\epsilon + p_j}{\epsilon \, n + 1}, \qquad \epsilon = \frac{1}{2t}, \quad (5)$$

where $t$ is the size of the test samples.
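For completeness, a small sketch of the smoothed RAE of Eqs. (4)-(5); note that, from $\epsilon = 1/(2t)$, sample sizes of 250 and 1000 documents yield exactly the $\epsilon$ values of 0.002 and 0.0005 that play a role in Section 3.2. The function name is ours.

```python
import numpy as np

def smoothed_rae(p_true, p_pred, sample_size):
    """RAE of Eq. (4) after the smoothing of Eq. (5).

    eps = 1/(2t) depends on the test sample size t:
    t = 250 gives eps = 0.002, t = 1000 gives eps = 0.0005.
    """
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    eps = 1.0 / (2.0 * sample_size)
    n = len(p_true)
    p = (eps + p_true) / (eps * n + 1)      # Eq. (5), applied to both vectors
    p_hat = (eps + p_pred) / (eps * n + 1)
    return float(np.mean(np.abs(p_hat - p) / p))

# MRAE is the mean of this score over all validation or test samples.
```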
3.1. Subtask T1A

For this task the equal-count method works better. The results using LR over the validation set are shown in Table 1. The first conclusion is that HDy and PDFy_T are the best performers. There is not much difference between them, but PDFy_T seems slightly better. This is in line with the conclusions in [5].

Table 1: Results over the validation set of subtask T1A using Logistic Regression

  Method             MRAE      MAE
  EM                 1.19731   0.22649
  HDy (30 bins)      0.15273   0.02570
  HDy (40 bins)      0.13941   0.02639
  HDy (50 bins)      0.14917   0.02748
  PDFy_T (30 bins)   0.13542   0.02411
  PDFy_T (40 bins)   0.13225   0.02480
  PDFy_T (50 bins)   0.13112   0.02469
  EDy                0.21878   0.02676

EDy is clearly outperformed in terms of MRAE, but its performance is similar to that of HDy in terms of MAE. Moreover, it is pretty clear from the results in Table 1 that the scores of EM are rather bad, because it requires well-calibrated posterior probabilities. Thus, we used the CalibratedClassifierCV object of sklearn to obtain calibrated probabilities. The scores of that experiment are in Table 2.

Table 2: Results over the validation set of T1A using Calibrated Logistic Regression

  Method             MRAE      MAE
  EM                 0.13775   0.02374
  HDy (30 bins)      0.18334   0.03077
  HDy (40 bins)      0.19601   0.03561
  HDy (50 bins)      0.20383   0.04044
  PDFy_T (30 bins)   0.13025   0.02425
  PDFy_T (40 bins)   0.12701   0.02470
  PDFy_T (50 bins)   0.12825   0.02552
  EDy                0.21586   0.02692

EM clearly improves, but it still performs worse than PDFy_T. Notice that the score of EM is just slightly better than the one provided by the organizers (0.13775 vs. 0.1393). Taking all these results into account, we finally selected PDFy_T with 40 equal-count bins, using Calibrated Logistic Regression with $C = 0.01$. Notice that PDFy_T obtains better results than the original version of HDy provided by the organizers (0.12701 vs. 0.1767).

3.2. Subtask T1B

Due to the lack of time, we basically tried here the same configuration of the classifier selected for subtask T1A, in combination with the OneVsRestClassifier provided by sklearn. The only changes were: i) we had to reduce the number of folds of the cross-validation to 10, because the smallest class has only 14 examples; and ii) for HDy the best bin strategy was equal-width and the numbers of bins tested were {4, 8, 16}, because in this case the performance tended to decrease as the number of bins increased. Notice that PDFy_T could not be employed here because it uses search (not optimization) to compute the final prevalences: exhaustive search is not suitable because the search space is $[0, 1]^{28}$, and other methods that would have had to be implemented, such as genetic algorithms, do not guarantee finding the optimal solution.

When we performed this experiment we made a terrible mistake: the value used for the parameter $\epsilon$ of MRAE was 0.002 (the one for subtask T1A) instead of the correct value, 0.0005. The results of that incorrect experiment are in Table 3. Under those circumstances, EDy seemed the best method: its performance was much better than that of the rest of the approaches, including the baselines provided by the organizers and the results of HistNet (the method of the other team from the University of Oviedo). Thus, we submitted the predictions of EDy to the competition.

Table 3: Results over the validation set of T1B using OVR(Calibrated LR) with ε = 0.002 (incorrect)

  Method          MRAE      MAE
  EM              0.74921   0.01637
  HDy (4 bins)    0.86716   0.01527
  HDy (8 bins)    0.80127   0.01402
  HDy (16 bins)   0.85876   0.01586
  EDy             0.68223   0.01173

But the problem was, of course, the value of $\epsilon$. The results over the validation set using the correct value are in Table 4. In this case, EM performs better than EDy in terms of MRAE but worse in terms of MAE. In both cases, their results are worse than those of the two best competitors.

Table 4: Results over the validation set of T1B using OVR(Calibrated LR) with ε = 0.0005 (correct)

  Method          MRAE      MAE
  EM              1.12322   0.01637
  HDy (4 bins)    1.47463   0.01527
  HDy (8 bins)    1.33846   0.01402
  HDy (16 bins)   1.39885   0.01586
  EDy             1.16777   0.01173

After exchanging some emails with the TU Dortmund University team, we carried out one last experiment. Instead of using OneVsRestClassifier and Calibrated Logistic Regression, we just applied a plain Logistic Regression classifier in combination with the same cross-validation estimation procedure (10 folds). The results of EDy improve significantly, see Table 5. The scores of HDy are also very competitive, while those of EM are much worse, as occurred in subtask T1A when the posteriors were not calibrated. Had we submitted the predictions of this version of EDy, the scores over the test set would have been MRAE 0.864787 and MAE 0.00994, which are better than those of the winning team of the competition (MRAE 0.879870 and MAE 0.011733).

Table 5: Results over the validation set of T1B using Logistic Regression with ε = 0.0005 (correct)

  Method          MRAE      MAE
  EM              2.35675   0.02811
  HDy (4 bins)    0.95555   0.01158
  HDy (8 bins)    1.07063   0.01257
  HDy (16 bins)   1.19310   0.01494
  EDy             0.89837   0.00996

3.3. Conclusions

We have drawn several interesting conclusions from our participation in LeQua:

1. To obtain good results with quantification algorithms that rely on a classifier, it is crucial to select the best classifier-quantifier combination. Obviously, the same classifier is not always the most appropriate one for every quantification algorithm.
2. This implies that quantification competitions are even more complex than classification ones: there are more elements to be adjusted, since one must select a combination of a classifier and a quantifier and tune the hyperparameters of both. The search space is sometimes doubled.
3. EM is a very good quantification algorithm, but it is very sensitive to the calibration of the classifier. Other methods are more robust in this sense and work well with more classifiers.
4. EDy seems a good approach for multiclass quantification.

Acknowledgments

This research was funded by MINECO (Ministerio de Economía y Competitividad) and FEDER (Fondo Europeo de Desarrollo Regional), grant PID2019-110742RB-I00 (MINECO/FEDER).

References

[1] V. González-Castro, R. Alaiz-Rodríguez, E. Alegre, Class distribution estimation based on the Hellinger distance, Information Sciences 218 (2013) 146-164.
[2] P. González, A. Castaño, N. V. Chawla, J. J. del Coz, A review on quantification learning, ACM Computing Surveys 50 (2017) 74:1-74:40.
[3] M. Saerens, P. Latinne, C. Decaestecker, Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure, Neural Computation 14 (2002) 21-41.
[4] A. Moreo, A. Esuli, F. Sebastiani, QuaPy: A Python-based framework for quantification, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 4534-4543.
[5] A. Maletzke, D. dos Reis, E. Cherman, G. Batista, DyS: A framework for mixture models in quantification, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 4552-4560.
[6] H. Kawakubo, M. C. du Plessis, M. Sugiyama, Computationally efficient class-prior estimation under class balance change using energy distance, IEICE Transactions on Information and Systems 99 (2016) 176-186.
[7] G. Forman, Quantifying counts and costs via classification, Data Mining and Knowledge Discovery 17 (2008) 164-206.
[8] F. Sebastiani, Evaluation measures for quantification: an axiomatic approach, Information Retrieval Journal 23 (2020) 255-288.