=Paper=
{{Paper
|id=Vol-3318/short25
|storemode=property
|title=Finding components of a good accuracy with XAI!
|pdfUrl=https://ceur-ws.org/Vol-3318/short25.pdf
|volume=Vol-3318
|authors=Benjamin Chamand,Olivier Risser-Maroix
|dblpUrl=https://dblp.org/rec/conf/cikm/ChamandR22
}}
==Finding components of a good accuracy with XAI!==
Benjamin Chamand (IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3, France) and Olivier Risser-Maroix (LIPADE, Université Paris Cité, France)

Abstract: This research presents a pipeline to find the key elements to achieve high accuracy. Indeed, one of the most common tasks in machine learning is classification, and numerous loss functions have been created to maximize this non-differentiable goal. Previous work on loss function design was mainly guided by intuition and theory before being validated by experiments. Here, we take a different approach: we aim to learn from experiments. This data-driven method is comparable to how general laws are found from data in physics. On more than 260 datasets, we automatically discovered a mathematical expression that is highly correlated with the accuracy of a linear classifier. More interestingly, this formula replicates key findings from several earlier papers on loss design and is highly explainable. We hope this research will open up novel possibilities for developing new heuristics and foster a deeper comprehension of machine learning theory.

Keywords: Symbolic Regression, Explainability, Datasets Representation

===1. Introduction===

Most machine learning (ML) research involves creating and assessing components based on theoretical intuitions. Acquiring knowledge from experimentation would be a distinct strategy, similar to how physicists have attempted to deduce the analytical laws underlying physical processes in nature from observations. With the development of AI, a new tendency to automate and support research with ML tools is emerging, and some mathematics [1] and physics [2, 3] researchers have started to use it. The most similar approach in ML would be meta-learning, where a model gains experience over numerous learning sessions to enhance its performance without human intervention. Although this paradigm has been used successfully for many tasks, including hyperparameter optimization and neural architecture search (NAS), the solutions found are generally not explainable. Thus, it is not so surprising that the use of AI as a tool to assist theoretical findings in ML research has received so little attention.

Understanding the mathematical relationships between the variables of a given system is a requirement of the scientific method. Symbolic regression (SR) aims to find a function that explains the hidden relationships in the data without knowing the structure of the function beforehand. Since SR is NP-hard, evolutionary approaches have been created to find approximate solutions [4, 5, 6].

While the task of predicting accuracy may look odd at first glance, solving it has multiple applications, such as speeding up NAS [7, 8], evaluating the accuracy of a classifier on an unlabeled test set [9], or measuring the difficulty of a dataset [10, 11]. Previous works mostly rely on neural networks or random forests, making the solutions found non-explainable [9, 12]. The text classification task has already been studied with features such as n-grams [10]. That approach is nevertheless constrained by the choice of features, which limits it to textual datasets, and by only finding an unweighted summation of some of those statistics. Statistics to characterize datasets have been investigated in broader contexts [13, 14, 15]. While studying each variable independently, [14] suggested that the relationship between such statistics and the difficulty of a dataset is complex and would require a nonlinear combination of those variables.

In this work, we propose a pipeline able to produce a general formula predicting the future performance of a linear classifier with a strong Pearson correlation and R² score. We found our solution highly explainable and examined it in the context of decades of research.

===2. Proposed Approach===

Datasets and Feature Extractors. We chose 12 datasets and 22 feature extractors, in the same manner as [13], to find a general law spanning a large range of factors for a classification challenge. The number of classes varies from 10 to 1854, and the dimension of the embeddings spans from 256 to 2048. We used datasets such as CIFAR10, CUB200, ImageNetMini, or THINGS. To cover a large number of dimensions and difficulty levels of linear classification, varied architectures with different pretraining have been chosen; some of them are kept untrained. We used different variants of popular feature extractors such as ResNet, MobileNet, SqueezeNet, CLIP, etc. We construct a meta-dataset ℳ from those 264 datasets of embeddings (the combination of all datasets by all feature extractors).

Meta-Dataset Representation. To find the hidden relationship between a given dataset and the associated optimal accuracy, we need to describe each of those datasets by a feature vector s in a shared representation space 𝒮. We crafted 19 features such as: the dimensionality of the embeddings (dim), the number of output classes (n_classes), the traces of the average matrices of all intra-class and inter-class covariance matrices (sb_trace, sw_trace), the mean cosine similarity between each pair of dimensions (feats_cos_sim), the cosine similarity between prototypes (prototype_cos_sim), etc.
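The short paper does not give formal definitions of the 19 statistics, so the following is a minimal sketch of how a few of them (sb_trace, sw_trace, feats_cos_sim, prototype_cos_sim) could be computed from a dataset of embeddings; the exact formulas used by the authors may differ.

```python
import numpy as np

def dataset_statistics(X, y):
    """Sketch of a few of the 19 dataset-level statistics.

    X: (n_samples, dim) embeddings, y: (n_samples,) integer labels.
    The definitions below (class prototypes, scatter traces, mean pairwise
    cosine similarities) are plausible reconstructions, not the authors' code.
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    prototypes = np.stack([X[y == c].mean(axis=0) for c in classes])

    # Between-class scatter: trace of the covariance of the class prototypes.
    sb_trace = np.trace(np.cov((prototypes - mu).T))
    # Within-class scatter: average trace of the per-class covariance matrices.
    sw_trace = np.mean([np.trace(np.cov(X[y == c].T)) for c in classes])

    # Mean absolute cosine similarity between pairs of embedding dimensions.
    D = X.T / (np.linalg.norm(X.T, axis=1, keepdims=True) + 1e-12)
    dim_sim = D @ D.T
    feats_cos_sim = np.abs(dim_sim[np.triu_indices(len(dim_sim), k=1)]).mean()

    # Mean cosine similarity between class prototypes.
    P = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-12)
    proto_sim = P @ P.T
    prototype_cos_sim = proto_sim[np.triu_indices(len(classes), k=1)].mean()

    return {
        "dim": X.shape[1],
        "n_classes": len(classes),
        "sb_trace": sb_trace,
        "sw_trace": sw_trace,
        "feats_cos_sim": feats_cos_sim,
        "prototype_cos_sim": prototype_cos_sim,
    }
```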
Ground Truth Creation. After extracting the embeddings from the diverse datasets with the feature extractors, we need to determine the best achievable accuracy of a softmax classifier for each dataset of embeddings. We divided each embedding dataset into training and testing sets and trained the model for 1000 epochs with a batch size of 2048. As pre-processing, all embeddings were only ℓ2-normalized. By tracking the accuracy on the test set, we can observe the best-reached accuracy α, an approximation of the best reachable accuracy α*. Our meta-dataset ℳ = {(s_i, α_i)}_{i=1..D} corresponds to all the pairs of the statistical representation s_i ∈ 𝒮 of each dataset d_i of the D datasets and the observed optimal accuracy α_i ∈ ℝ. These tuples constitute our inputs and outputs.
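As an illustration, here is a minimal PyTorch sketch of this ground-truth step. The ℓ2 normalization, the 1000 epochs, the batch size of 2048, and the tracking of the best test accuracy follow the text; the optimizer and learning rate are not specified in the paper and are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def best_linear_accuracy(X_train, y_train, X_test, y_test, n_classes,
                         epochs=1000, batch_size=2048, lr=1e-3):
    """Estimate alpha: the best test accuracy of a softmax (linear) classifier.

    X_*: (n, dim) float tensors of embeddings, y_*: (n,) long tensors of labels.
    """
    X_train = F.normalize(X_train, dim=1)   # embeddings are only l2-normalized
    X_test = F.normalize(X_test, dim=1)
    model = torch.nn.Linear(X_train.shape[1], n_classes)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer/lr: assumption

    best_acc = 0.0
    for _ in range(epochs):
        perm = torch.randperm(len(X_train))
        for i in range(0, len(X_train), batch_size):
            idx = perm[i:i + batch_size]
            loss = F.cross_entropy(model(X_train[idx]), y_train[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            acc = (model(X_test).argmax(dim=1) == y_test).float().mean().item()
            best_acc = max(best_acc, acc)   # best accuracy reached on the test set
    return best_acc
```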
Symbolic Regression. We use the gplearn implementation because of the compactness of the solutions, the speed of execution, the robustness to noise [16], and the ease of use. The set of primitive functions used is {log, exp, √, +, −, ×, ÷}, and the set of terminals corresponds to the statistics s_i describing the dataset d_i. We evolved a population of 5000 individuals for 20 steps. We designed a fitness function ℱ such that both pretrained and untrained extracted embeddings have a linear correlation with accuracy, independently. We split our meta-dataset in a fixed 75/25 train/test fashion and repeat each experiment 1000×. Since ℱ only seeks correlation, a linear transformation of the output value is learned on the training set in order to predict the accuracy (α̂ = a·f(s) + b).
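A rough sketch of this step with gplearn is shown below. The population size, number of generations, 75/25 split, and post-hoc linear calibration follow the text; the toy data, the simplified fitness (a plain Pearson correlation rather than the paper's ℱ, which enforces correlation on pretrained and untrained embeddings independently), and the primitive set restricted to gplearn's built-in functions are assumptions of this sketch.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor
from gplearn.fitness import make_fitness
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

# Placeholder meta-dataset: 264 datasets described by 19 statistics, split 75/25.
rng = np.random.default_rng(0)
S_train, alpha_train = rng.random((198, 19)), rng.random(198)
S_test, alpha_test = rng.random((66, 19)), rng.random(66)

# Simplified fitness: absolute Pearson correlation between the program's output
# and the observed accuracies (the paper's F adds per-subset constraints).
def _corr(y, y_pred, w):
    if np.std(y_pred) < 1e-12:
        return 0.0
    return abs(pearsonr(y, y_pred)[0])

correlation_fitness = make_fitness(function=_corr, greater_is_better=True)

sr = SymbolicRegressor(
    population_size=5000,
    generations=20,
    function_set=("log", "sqrt", "add", "sub", "mul", "div"),  # close to the paper's primitives
    metric=correlation_fitness,
    random_state=0,
)
sr.fit(S_train, alpha_train)

# Since the fitness only rewards correlation, calibrate a linear map
# alpha_hat = a * f(s) + b on the training set to actually predict accuracy.
calib = LinearRegression().fit(sr.predict(S_train).reshape(-1, 1), alpha_train)
alpha_hat = calib.predict(sr.predict(S_test).reshape(-1, 1))
```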
===3. Results===

Baselines. To evaluate the performance of our GP solution, we compare it with popular regression methods using the same train/test split. Performances on the test set are reported in Table 1. The substantial gap in R² score with the linear regressor suggests that the task of predicting the accuracy requires a nonlinear combination of variables. Thus, we compare against nonlinear regressors such as decision trees and random forests because of their performance and the widespread belief that those models are among the most interpretable ones. Our formula outperformed them while being more explainable.

Table 1. Our formula has a better correlation and higher predictive power with only 5 variables (all p-values < 0.01).

Method | Pearson r | R²
Linear Regression | 0.9042 | 0.8011
Decision Tree Regressor | 0.9472 | 0.8868
Random Forest Regressor (10 trees) | 0.9643 | 0.9246
Our GP formula (GPF) | 0.9682 | 0.9319

Symbolic Regression Formula. We ran our GP pipeline 1000× on the same training set and serialized the resulting solutions and scores for analysis. The solution having the best test R² score was found 6×. Our formula has a complexity of 6 nodes. We will refer to this Genetic Programming Formula (GPF) as:

GPF = \log\left(\frac{sb\_trace / st\_trace}{\sqrt{n\_classes \cdot feats\_corr \cdot prototypes\_cos\_sim}}\right)   (1)

We can easily rewrite it as GPF = SEP − COR, with:

SEP = \log\left(\frac{sb\_trace}{st\_trace}\right), \qquad COR = \frac{1}{2}\log\left(n\_classes \cdot feats\_corr \cdot prototypes\_cos\_sim\right)   (2)

SEP may correspond to a separability criterion, while COR may correspond to correlation information. We found those two parts to be complementary: taken individually, SEP and COR correlate with accuracy at only 0.65 and −0.87 (Pearson r), respectively. Finally, we found that the other best-performing GP formulas have similar structures and variables.
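To make Eqs. (1) and (2) concrete, the sketch below evaluates GPF and its decomposition on one set of dataset statistics and checks that GPF = SEP − COR; the statistic values are made up purely for illustration.

```python
import math

def gpf(sb_trace, st_trace, n_classes, feats_corr, prototypes_cos_sim):
    """Eq. (1): the genetic-programming formula found by the pipeline."""
    return math.log((sb_trace / st_trace) /
                    math.sqrt(n_classes * feats_corr * prototypes_cos_sim))

def sep(sb_trace, st_trace):
    """Separability term of Eq. (2), reminiscent of Fisher's criterion."""
    return math.log(sb_trace / st_trace)

def cor(n_classes, feats_corr, prototypes_cos_sim):
    """Correlation term of Eq. (2)."""
    return 0.5 * math.log(n_classes * feats_corr * prototypes_cos_sim)

# Illustrative (made-up) statistics for one dataset of embeddings:
stats = dict(sb_trace=3.2, st_trace=11.5, n_classes=200,
             feats_corr=0.08, prototypes_cos_sim=0.15)

g = gpf(**stats)
assert abs(g - (sep(stats["sb_trace"], stats["st_trace"])
                - cor(stats["n_classes"], stats["feats_corr"],
                      stats["prototypes_cos_sim"]))) < 1e-9
# GPF only correlates with accuracy; predicting it still requires the linear
# calibration alpha_hat = a * GPF + b fitted on the training set.
```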
===4. Discussion===

GPF can be written as a combination of two components. One can see that the first element, SEP, is close to the Fisher's criterion used in Linear Discriminant Analysis (LDA) [17], where the objective is to find a linear projection that maximizes the ratio of between-class variance to within-class variance. Thus, SEP corresponds to a separability measure of classes. Remarkably, this criterion has been effectively applied as a loss function in deep learning [18, 19]. The choice of an LDA-based loss function remains marginal in deep learning, the cross-entropy (CE) being a more popular choice. However, strong similarities between the LDA and the CE allow us to swap this first separability measure with the latter one. Indeed, [20] noticed that one of the most widely studied technical routes for CE-based losses is to encourage stronger intra-class compactness and larger inter-class separability, as the Fisher's criterion does.

The second part, COR, is negatively correlated with the accuracy. Its first variable is the number of classes (n_classes). Indeed, it is natural to expect scores to decrease as the number of classes grows. For example, [21] observed a drop in accuracy on the CUB200 dataset when changing the number of classes from a coarse level to a fine-grained one. In support of the weight decorrelation term (prototypes_cos_sim), [22] found on several state-of-the-art CNNs that better accuracy, more stable training, and smoother convergence could be achieved by using orthogonal regularization of the weights. Previous works on feature decorrelation strongly justify the presence of our feature decorrelation variable (feats_corr) [23, 24, 25, 20, 26]. Indeed, [25] found that correlated input variables usually lead to slower convergence; thus, several methods were developed to better decorrelate variables, such as PCA or ZCA. More recently, decorrelation played an important role in the performance increase of self-supervised methods [23, 24, 26].

In this paper, we showed that a simple pipeline can help us extract theoretical intuitions from experimentation. Our formula is highly explainable and is consistent with decades of research. While this work is still ongoing, we are working on an extended version [27].

===References===

[1] A. Davies, et al., Advancing mathematics by guiding human intuition with AI, Nature 600 (2021) 70–74.
[2] M. R. Douglas, Machine learning as a tool in theoretical science, Nature Reviews Physics (2022).
[3] M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science 324 (2009).
[4] D. A. Augusto, H. J. C. Barbosa, Symbolic regression via genetic programming, in: SBRN, Procs., 2000.
[5] J. R. Koza, Genetic programming - on the programming of computers by means of natural selection, Complex Adaptive Systems, MIT Press, 1993.
[6] Q. Lu, J. Ren, Z. Wang, Using genetic programming with prior formula knowledge to solve symbolic regression problem, Computational Intelligence and Neuroscience (2016).
[7] R. Istrate, et al., TAPAS: Train-less accuracy predictor for architecture search, in: AAAI, Procs., 2019.
[8] W. Wen, et al., Neural predictor for neural architecture search, in: ECCV, Procs., 2020.
[9] W. Deng, L. Zheng, Are labels always necessary for classifier accuracy evaluation?, in: CVPR, Procs., 2021, pp. 15069–15078.
[10] E. Collins, N. Rozanov, B. Zhang, Evolutionary data measures: Understanding the difficulty of text classification tasks, in: CoNLL, 2018.
[11] F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, C. Malossi, Efficient image dataset classification difficulty estimation for predicting deep-learning accuracy, The Visual Computer 37 (2021).
[12] Y. Yamada, T. Morimura, Weight features for predicting future model performance of deep neural networks, in: IJCAI, 2016, pp. 2231–2237.
[13] B. Chamand, O. Risser-Maroix, C. Kurtz, P. Joly, N. Loménie, Fine-tune your classifier: Finding correlations with temperature, in: ICIP, Procs., 2022.
[14] T. K. Ho, M. Basu, Complexity measures of supervised classification problems, TPAMI 24 (2002).
[15] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, T. K. Ho, How complex is your classification problem? A survey on measuring classification complexity, ACM Computing Surveys 52 (2019).
[16] W. La Cava, P. Orzechowski, B. Burlacu, F. O. de França, M. Virgolin, Y. Jin, M. Kommenda, J. H. Moore, Contemporary symbolic regression methods and their relative performance, in: J. Vanschoren, S. Yeung (Eds.), Proceedings of the NIPS Track on Datasets and Benchmarks, 2021.
[17] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936).
[18] M. Dorfer, R. Kelz, G. Widmer, Deep linear discriminant analysis, in: ICLR, Procs., 2016.
[19] B. Ghojogh, et al., Fisher discriminant triplet and contrastive losses for training Siamese networks, in: IJCNN, Procs., IEEE, 2020.
[20] W. Wan, Y. Zhong, T. Li, J. Chen, Rethinking feature distribution for loss functions in image classification, in: CVPR, Procs., 2018.
[21] D. Chang, K. Pang, Y. Zheng, Z. Ma, Y.-Z. Song, J. Guo, Your "flamingo" is my "bird": Fine-grained, or not, in: CVPR, Procs., 2021.
[22] N. Bansal, X. Chen, Z. Wang, Can we gain more from orthogonality regularizations in training deep networks?, in: NIPS, Procs., 2018.
[23] A. Ermolov, A. Siarohin, E. Sangineto, N. Sebe, Whitening for self-supervised representation learning, in: ICML, Procs., PMLR, 2021.
[24] T. Hua, et al., On feature decorrelation in self-supervised learning, in: ICCV, Procs., 2021.
[25] Y. A. LeCun, L. Bottou, G. B. Orr, K.-R. Müller, Efficient BackProp, 2012.
[26] S. Zhang, et al., Zero-CL: Instance and feature decorrelation for negative-free symmetric contrastive learning, in: ICLR, Procs., 2022.
[27] O. Risser-Maroix, B. Chamand, What can we learn by predicting accuracy?, 2022. arXiv:2208.01358.