Finding components of a good accuracy with XAI!
Benjamin CHAMAND¹, Olivier RISSER-MAROIX²
¹ IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3, France
² LIPADE, Université Paris Cité, France

Proceedings of the CIKM 2022 Workshops, October 2022, Atlanta, USA
benjamin.chamand@irit.fr (B. CHAMAND); orissermaroix@gmail.com (O. RISSER-MAROIX)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract
This research presents a pipeline to find the key elements for achieving high accuracy. Classification is one of the most common tasks in machine learning, and numerous loss functions have been created to maximize this non-differentiable objective. Previous work on loss function design was mainly guided by intuition and theory before being validated by experiments. Here, we use a different approach: we aim to learn from experiments. This data-driven method is comparable to how general laws are derived from data in physics. On more than 260 datasets, we automatically discovered a mathematical expression that is highly correlated with the accuracy of a linear classifier. More interestingly, this formula replicates key findings from several earlier papers on loss design and is highly explainable. We hope this research will open up novel possibilities for developing new heuristics and foster a deeper comprehension of machine learning theory.

Keywords
Symbolic Regression, Explainability, Datasets Representation



1. Introduction

Most machine learning (ML) research involves creating and assessing components based on theoretical intuitions. Acquiring knowledge from experimentation is a distinct strategy, similar to how physicists have attempted to deduce the analytical laws underlying the physical processes of nature from observations. With the development of AI, a new tendency to automate and support research with ML tools is emerging: some mathematics [1] and physics [2, 3] researchers have started to use it. The most similar approach in ML is meta-learning, where a model gains experience over numerous learning sessions to improve its performance without human intervention. Although this paradigm has been applied successfully to many tasks, including hyperparameter optimization and neural architecture search (NAS), the solutions found are generally not explainable. Thus, it is not surprising that the use of AI as a tool to assist theoretical findings in ML research has received so little attention.

Understanding the mathematical relationships between the variables of a given system is a requirement of the scientific method. Symbolic regression (SR) aims to find a function that explains the hidden relationships in the data without knowing the structure of the function beforehand. Since SR is NP-hard, evolutionary approaches have been developed to find approximate solutions [4, 5, 6].

While the task of predicting accuracy may look odd at first glance, solving it has multiple applications, such as speeding up NAS [7, 8], evaluating the accuracy of a classifier on an unlabeled test set [9], or measuring the difficulty of a dataset [10, 11]. Previous works mostly rely on neural networks or random forests, making the solutions they find non-explainable [9, 12]. The text classification task has already been studied with features such as n-grams [10]. That approach is nevertheless constrained by its choice of features, which limits it to textual datasets, and by only finding an unweighted summation of some of those statistics. Statistics to characterize datasets have been investigated in broader contexts [13, 14, 15]. While studying each variable independently, [14] suggested that the relationship between such statistics and the difficulty of a dataset is complex and would require a nonlinear combination of those variables.

In this work, we propose a pipeline able to produce a general formula predicting the future performance of a linear classifier with a strong Pearson correlation and r² score. We found our solution highly explainable and examined it in the context of decades of research.


2. Proposed Approach

Datasets and Feature Extractors We choose 12 datasets and 22 feature extractors in the same manner as [13] in order to find a general law spanning a large range of factors for a classification challenge. The number of classes varies from 10 to 1854, and the dimension of the embeddings spans from 256 to 2048. We used datasets such as CIFAR10, CUB200, ImageNetMini, or THINGS. To cover a large number of dimensions and difficulty levels of linear classification, varied architectures with different pretrainings were chosen; some of them are kept untrained. We used different variants of popular feature extractors such as ResNet, MobileNet, SqueezeNet, CLIP, etc. We construct a meta-dataset ℳ from those 264 datasets of embeddings (the combination of all datasets with all feature extractors).

Meta-Dataset Representation To find the hidden relationship between a given dataset and the associated optimal accuracy, we need to describe each of those datasets by a feature vector s in a shared representation space 𝒮. We crafted 19 features s_i, such as: the dimensionality of the embeddings (dim), the number of output classes (n_classes), the traces of the average within-class and between-class covariance matrices (sw_trace, sb_trace), the mean cosine similarity between each pair of dimensions (feats_cos_sim), the cosine similarity between prototypes (prototypes_cos_sim), etc.
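
To make these statistics concrete, below is a minimal sketch (not the authors' exact implementation) of how a few of them could be computed with NumPy from an embedding matrix X (one row per sample) and integer labels y. The precise averaging conventions behind sb_trace, sw_trace, and the pairwise similarities are assumptions on our part.

    import numpy as np

    def dataset_statistics(X, y):
        """Sketch of a few of the 19 per-dataset statistics (assumed definitions)."""
        classes = np.unique(y)
        mu = X.mean(axis=0)                                               # global mean
        protos = np.stack([X[y == c].mean(axis=0) for c in classes])      # class prototypes

        # Traces of the (average) between-class and within-class covariance matrices.
        sb_trace = np.mean([np.sum((p - mu) ** 2) for p in protos])
        sw_trace = np.mean([np.trace(np.cov(X[y == c], rowvar=False)) for c in classes])

        # Mean cosine similarity between pairs of embedding dimensions.
        D = X - mu
        norms = np.linalg.norm(D, axis=0) + 1e-12
        dim_cos = (D / norms).T @ (D / norms)
        off_diag = ~np.eye(dim_cos.shape[0], dtype=bool)
        feats_cos_sim = dim_cos[off_diag].mean()

        # Mean cosine similarity between pairs of class prototypes.
        P = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-12)
        proto_cos = P @ P.T
        off_diag_p = ~np.eye(len(classes), dtype=bool)
        prototypes_cos_sim = proto_cos[off_diag_p].mean()

        return {
            "dim": X.shape[1],
            "n_classes": len(classes),
            "sb_trace": sb_trace,
            "sw_trace": sw_trace,
            "feats_cos_sim": feats_cos_sim,
            "prototypes_cos_sim": prototypes_cos_sim,
        }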

Ground Truth Creation After extracting the embeddings from the diverse datasets with the feature extractors, we need to determine the best achievable accuracy of a softmax classifier for each dataset of embeddings. We divided each embedding dataset into training and testing sets and trained the model for 1000 epochs with a batch size of 2048. As pre-processing, all embeddings were only ℓ2-normalized. By tracking the accuracy on the test set, we can observe the best-reached accuracy α, an approximation of the best reachable accuracy α*. Our meta-dataset ℳ = {(s_i, α_i) | i = 1, …, D} corresponds to all the pairs of the statistical representation s_i ∈ 𝒮 of each dataset d_i of the D datasets and the observed optimal accuracy α_i ∈ 𝒜. These tuples contain our inputs and outputs.
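
A minimal sketch of this ground-truth step is given below, assuming PyTorch tensors for the embeddings and labels; the optimizer and learning rate are not specified in the paper and are placeholders here.

    import torch
    import torch.nn.functional as F

    def best_linear_accuracy(train_emb, train_y, test_emb, test_y,
                             n_classes, epochs=1000, batch_size=2048, lr=1e-3):
        """Sketch: train a softmax (linear) classifier on l2-normalized embeddings
        and return the best test accuracy reached (the label alpha of a dataset)."""
        train_emb = F.normalize(train_emb, dim=1)   # l2-normalization, the only pre-processing
        test_emb = F.normalize(test_emb, dim=1)

        clf = torch.nn.Linear(train_emb.shape[1], n_classes)
        opt = torch.optim.Adam(clf.parameters(), lr=lr)

        best_acc = 0.0
        for _ in range(epochs):
            perm = torch.randperm(len(train_emb))
            for i in range(0, len(perm), batch_size):
                idx = perm[i:i + batch_size]
                loss = F.cross_entropy(clf(train_emb[idx]), train_y[idx])
                opt.zero_grad()
                loss.backward()
                opt.step()
            with torch.no_grad():
                acc = (clf(test_emb).argmax(dim=1) == test_y).float().mean().item()
                best_acc = max(best_acc, acc)   # alpha: best accuracy observed on the test set
        return best_acc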

Symbolic Regression We use the gplearn implementation because of the compactness of its solutions, its speed of execution, its robustness to noise [16], and its ease of use. The set of primitive functions used is {log, exp, √, +, −, ×, ÷} and the set of terminals corresponds to the statistics s_i describing the dataset d_i. We evolved a population of 5000 individuals for 20 generations. We designed a fitness function ℱ such that both pretrained and untrained extracted embeddings have a linear correlation with accuracy, independently. We split our meta-dataset in a fixed 75/25 train/test fashion and repeated each experiment 1000×. Since ℱ only seeks correlation, a linear transformation of the output value is learned on the training set in order to predict the accuracy (α̂ = a · p(·) + b).
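
A hedged sketch of such a gplearn run is shown below. The arrays S_train, S_test, and alpha_train stand for the statistics and accuracies of the meta-dataset split described above (assumed names). The fitness shown only rewards a global Pearson correlation, whereas the paper's ℱ additionally requires the correlation to hold independently for pretrained and untrained embeddings.

    import numpy as np
    from gplearn.genetic import SymbolicRegressor
    from gplearn.fitness import make_fitness
    from gplearn.functions import make_function

    # Fitness: absolute Pearson correlation between program output and accuracy.
    def _pearson(y, y_pred, w):
        if np.std(y_pred) < 1e-12:
            return 0.0
        return np.abs(np.corrcoef(y, y_pred)[0, 1])

    pearson_fitness = make_fitness(function=_pearson, greater_is_better=True)

    # Protected exponential, since 'exp' is not part of gplearn's built-in function set.
    exp_fn = make_function(function=lambda x: np.exp(np.clip(x, -50, 50)),
                           name='exp', arity=1)

    sr = SymbolicRegressor(
        population_size=5000,
        generations=20,
        function_set=('add', 'sub', 'mul', 'div', 'log', 'sqrt', exp_fn),
        metric=pearson_fitness,
        verbose=1,
        random_state=0,
    )
    sr.fit(S_train, alpha_train)          # S_*: dataset statistics, alpha_*: observed accuracies

    # Since the fitness only rewards correlation, calibrate a linear map on the training set.
    a, b = np.polyfit(sr.predict(S_train), alpha_train, 1)
    alpha_hat = a * sr.predict(S_test) + b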

3. Results

Baselines To evaluate the performance of our GP solution, we compare it with popular regression methods using the same train/test split. Performances on the test set are reported in Table 1. The substantial gap in r² score between the linear regressor and the nonlinear ones suggests that the task of predicting the accuracy requires a nonlinear combination of variables. Thus, we compare against nonlinear regressors such as decision trees and random forests because of their performance and the widespread belief that those models are among the most interpretable ones. Our formula outperformed them while being more explainable.

Table 1
Our formula has a better correlation and higher predictive power with only 5 variables (all p-values < 0.01).

  Method                            Pearson r    r²
  Linear Regression                 0.9042       0.8011
  Decision Tree Regressor           0.9472       0.8868
  Random Forest Regr. (10 trees)    0.9643       0.9246
  Our GP formula (GPF)              0.9682       0.9319
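
The baseline comparison of Table 1 can be reproduced along these lines (a sketch, reusing the same assumed S_train/S_test/alpha_train/alpha_test split as above; hyperparameters other than the 10 trees are assumptions):

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    baselines = {
        "Linear Regression": LinearRegression(),
        "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
        "Random Forest Regr. (10 trees)": RandomForestRegressor(n_estimators=10, random_state=0),
    }

    for name, model in baselines.items():
        model.fit(S_train, alpha_train)
        pred = model.predict(S_test)
        r, p_value = pearsonr(alpha_test, pred)
        print(f"{name:32s}  Pearson r={r:.4f} (p={p_value:.1e})  r2={r2_score(alpha_test, pred):.4f}")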

Symbolic Regression Formula We ran our GP pipeline 1000× on the same training set and serialized the respective solutions and scores for analysis. The solution with the best test r² score was found 6×. Our formula has a complexity of 6 nodes. We will refer to this Genetic Programming Formula as:

    GPF = log( (sb_trace / st_trace) / √(n_classes · feats_corr · prototypes_cos_sim) )        (1)

We can easily rewrite it as GPF = SEP − COR, with:

    SEP = log(sb_trace / st_trace)
    COR = (1/2) · log(n_classes · feats_corr · prototypes_cos_sim)                             (2)

SEP may correspond to a separability criterion, while COR may correspond to correlation information. We found those two parts to be complementary: SEP and COR have, respectively, a Pearson r with the accuracy of only 0.65 and −0.87. Finally, we found that the other best-performing GP formulas have similar structures and variables.
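
As a small illustration, Eq. (1)-(2) can be evaluated directly from the per-dataset statistics; st_trace (the trace of the total covariance matrix) and feats_corr are assumed here to be among the 19 features, computed analogously to the statistics sketched earlier:

    import numpy as np

    def gpf(stats):
        """Eq. (1): the discovered formula, from a dict of dataset statistics."""
        sep = np.log(stats["sb_trace"] / stats["st_trace"])                 # separability term (Eq. 2)
        cor = 0.5 * np.log(stats["n_classes"]
                           * stats["feats_corr"]
                           * stats["prototypes_cos_sim"])                   # correlation term (Eq. 2)
        return sep - cor                                                    # GPF = SEP - COR

    # Calibrated accuracy prediction (alpha_hat = a * GPF + b), with a, b fitted on the training split:
    # gpf_scores = np.array([gpf(s) for s in train_stats]); a, b = np.polyfit(gpf_scores, alpha_train, 1)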

4. Discussion

GPF can be written as a summation of two components. One can see that the first element, SEP, is close to Fisher's criterion used in Linear Discriminant Analysis (LDA) [17], where the objective is to find a linear projection that maximizes the ratio of the between-class variance to the within-class variance. Thus, SEP corresponds to a separability measure of classes. Remarkably, this criterion has been effectively applied as a loss function in deep learning [18, 19]. The choice of an LDA-based loss function remains marginal in deep learning, the cross-entropy (CE) being a more popular choice. However, strong similarities between the LDA and the CE allow us to swap this first separability measure with the latter one. Indeed, [20] noticed that one of the most widely studied technical routes for CE-based losses is to encourage stronger intra-class compactness and larger inter-class separability, as in Fisher's criterion. The second part, COR, is negatively correlated with the accuracy. Its first variable is the number of classes (n_classes). Indeed, it is natural to expect scores to decrease as the number of classes grows. For example, [21] observed a drop in accuracy on the CUB200 dataset when changing the number of classes from a coarse level to a fine-grained one.

In defense of the weight decorrelation term (prototypes_cos_sim), [22] found on several state-of-the-art CNNs that they could achieve better accuracy, more stable training, and smoother convergence by using orthogonal regularization of weights. Previous works on feature decorrelation heavily justify the presence of our feature decorrelation variable (feats_corr) [23, 24, 25, 20, 26]. Indeed, [25] found that correlated input variables usually lead to slower convergence; thus, several techniques were developed to better decorrelate variables, such as PCA or ZCA. More recently, decorrelation played an important role in the performance increase of self-supervised methods [23, 24, 26].

In this paper, we showed that a simple pipeline can help us to extract theoretical intuitions from experimentation. Our formula is highly explainable and is consistent with decades of research. While this work is still ongoing, we are working on an extended version [27].

References

 [1] A. Davies, et al., Advancing mathematics by guiding human intuition with AI, Nature 600 (2021) 70-74.
 [2] M. R. Douglas, Machine learning as a tool in theoretical science, Nature Reviews Physics (2022).
 [3] M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science 324 (2009).
 [4] D. A. Augusto, H. J. C. Barbosa, Symbolic regression via genetic programming, in: SBRN, Procs., 2000.
 [5] J. R. Koza, Genetic programming - on the programming of computers by means of natural selection, Complex adaptive systems, MIT Press, 1993.
 [6] Q. Lu, J. Ren, Z. Wang, Using genetic programming with prior formula knowledge to solve symbolic regression problem, Computational Intelligence and Neuroscience (2016).
 [7] R. Istrate, et al., TAPAS: Train-less accuracy predictor for architecture search, in: AAAI, Procs., 2019.
 [8] W. Wen, et al., Neural predictor for neural architecture search, in: ECCV, Procs., 2020.
 [9] W. Deng, L. Zheng, Are labels always necessary for classifier accuracy evaluation?, in: CVPR, Procs., 2021, pp. 15069-15078.
[10] E. Collins, N. Rozanov, B. Zhang, Evolutionary data measures: Understanding the difficulty of text classification tasks, in: CoNLL, 2018.
[11] F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, C. Malossi, Efficient image dataset classification difficulty estimation for predicting deep-learning accuracy, The Visual Computer 37 (2021).
[12] Y. Yamada, T. Morimura, Weight features for predicting future model performance of deep neural networks, in: IJCAI, 2016, pp. 2231-2237.
[13] B. Chamand, O. Risser-Maroix, C. Kurtz, P. Joly, N. Loménie, Fine-tune your classifier: Finding correlations with temperature, in: ICIP, Procs., 2022.
[14] T. K. Ho, M. Basu, Complexity measures of supervised classification problems, TPAMI 24 (2002).
[15] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, T. K. Ho, How complex is your classification problem? A survey on measuring classification complexity, ACM Computing Surveys 52 (2019).
[16] W. La Cava, P. Orzechowski, B. Burlacu, F. O. de França, M. Virgolin, Y. Jin, M. Kommenda, J. H. Moore, Contemporary symbolic regression methods and their relative performance, in: J. Vanschoren, S. Yeung (Eds.), Proceedings of the NIPS Track on Datasets and Benchmarks, 2021.
[17] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936).
[18] M. Dorfer, R. Kelz, G. Widmer, Deep linear discriminant analysis, in: ICLR, Procs., 2016.
[19] B. Ghojogh, et al., Fisher discriminant triplet and contrastive losses for training siamese networks, in: IJCNN, Procs., IEEE, 2020.
[20] W. Wan, Y. Zhong, T. Li, J. Chen, Rethinking feature distribution for loss functions in image classification, in: CVPR, Procs., 2018.
[21] D. Chang, K. Pang, Y. Zheng, Z. Ma, Y.-Z. Song, J. Guo, Your "flamingo" is my "bird": Fine-grained, or not, in: CVPR, Procs., 2021.
[22] N. Bansal, X. Chen, Z. Wang, Can we gain more from orthogonality regularizations in training deep networks?, in: NIPS, Procs., 2018.
[23] A. Ermolov, A. Siarohin, E. Sangineto, N. Sebe, Whitening for self-supervised representation learning, in: ICML, Procs., PMLR, 2021.
[24] T. Hua, et al., On feature decorrelation in self-supervised learning, in: ICCV, Procs., 2021.
[25] Y. A. LeCun, L. Bottou, G. B. Orr, K.-R. Müller, Efficient BackProp, 2012.
[26] S. Zhang, et al., Zero-CL: Instance and feature decorrelation for negative-free symmetric contrastive learning, in: ICLR, Procs., 2022.
[27] O. Risser-Maroix, B. Chamand, What can we learn by predicting accuracy?, 2022. arXiv:2208.01358.