Finding components of a good accuracy with XAI!
Benjamin CHAMAND¹, Olivier RISSER-MAROIX²
¹ IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3, France
² LIPADE, Université Paris Cité, France

Proceedings of the CIKM 2022 Workshops, October 2022, Atlanta, USA
benjamin.chamand@irit.fr (B. CHAMAND); orissermaroix@gmail.com (O. RISSER-MAROIX)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract
This research presents a pipeline to find the key elements for achieving high accuracy. Classification is one of the most common tasks in machine learning, and numerous loss functions have been created to maximize this non-differentiable objective. Previous work on loss function design was mainly guided by intuition and theory before being validated by experiments. Here, we use a different approach: we aim to learn from experiments. This data-driven method is comparable to how general laws are derived from data in physics. On more than 260 datasets, we automatically discovered a mathematical expression that is highly correlated with the accuracy of a linear classifier. More interestingly, this formula replicates key findings from several earlier papers on loss design and is highly explainable. We hope this research will open up novel possibilities for developing new heuristics and foster a deeper comprehension of machine learning theory.

Keywords
Symbolic Regression, Explainability, Datasets Representation



1. Introduction

Most machine learning (ML) research involves creating and assessing components based on theoretical intuitions. Acquiring knowledge from experimentation is a distinct strategy, similar to how physicists have attempted to deduce the analytical laws underlying the physical processes of nature from observations. With the development of AI, a new tendency to automate and support research with ML tools is emerging: some mathematics [1] and physics [2, 3] researchers have started to use it. The most similar approach in ML is meta-learning, where a model gains experience over numerous learning sessions to improve its performance without human intervention. Although this paradigm has been applied successfully to many tasks, including hyperparameter optimization and neural architecture search (NAS), the solutions found are generally not explainable. Thus, it is not surprising that the use of AI as a tool to assist theoretical findings in ML research has received so little attention.

Understanding the mathematical relationships between the variables of a given system is a requirement of the scientific method. Symbolic regression (SR) aims to find a function that explains the hidden relationships in the data without knowing the structure of the function beforehand. Since SR is NP-hard, evolutionary approaches have been developed to find approximate solutions [4, 5, 6].

While the task of predicting accuracy may look odd at first glance, solving it has multiple applications, such as speeding up NAS [7, 8], evaluating the accuracy of a classifier on an unlabeled test set [9], or measuring the difficulty of a dataset [10, 11]. Previous works mostly rely on neural networks or random forests, making the solutions they find non-explainable [9, 12]. The text classification task has already been studied with features such as n-grams [10]. That approach is nevertheless constrained by its choice of features, which limits it to textual datasets, and by only finding an unweighted summation of some of those statistics. Statistics to characterize datasets have been investigated in broader contexts [13, 14, 15]. While studying each variable independently, [14] suggested that the relationship between such statistics and the difficulty of a dataset is complex and would require a nonlinear combination of those variables.

In this work, we propose a pipeline able to produce a general formula predicting the future performance of a linear classifier with a strong Pearson correlation and r² score. We found our solution highly explainable and examined it in the context of decades of research.


2. Proposed Approach

Datasets and Feature Extractors We choose 12 datasets and 22 feature extractors in the same manner as [13] in order to find a general law spanning a large range of factors for a classification challenge. The number of classes varies from 10 to 1854, and the dimension of the embeddings spans from 256 to 2048. We used datasets such as CIFAR10, CUB200, ImageNetMini, or THINGS. To cover a large number of dimensions and difficulty levels of linear classification, varied architectures with different pretrainings were chosen; some of them are kept untrained. We used different variants of popular feature extractors such as ResNet, MobileNet, SqueezeNet, CLIP, etc. We construct a meta-dataset ℳ from those 264 datasets of embeddings (the combination of all datasets with all feature extractors).

Meta-Dataset Representation To find the hidden relationship between a given dataset and the associated optimal accuracy, we need to describe each of those datasets by a feature vector s in a shared representation space 𝒮. We crafted 19 features s_i, such as: the dimensionality of the embeddings (dim), the number of output classes (n_classes), the traces of the average within-class and between-class covariance matrices (sw_trace, sb_trace), the mean cosine similarity between each pair of dimensions (feats_cos_sim), the cosine similarity between prototypes (prototypes_cos_sim), etc.
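
To make these statistics concrete, below is a minimal sketch (not the authors' exact implementation) of how a few of them could be computed with NumPy from an embedding matrix X (one row per sample) and integer labels y. The precise averaging conventions behind sb_trace, sw_trace, and the pairwise similarities are assumptions on our part.

    import numpy as np

    def dataset_statistics(X, y):
        """Sketch of a few of the 19 per-dataset statistics (assumed definitions)."""
        classes = np.unique(y)
        mu = X.mean(axis=0)                                               # global mean
        protos = np.stack([X[y == c].mean(axis=0) for c in classes])      # class prototypes

        # Traces of the (average) between-class and within-class covariance matrices.
        sb_trace = np.mean([np.sum((p - mu) ** 2) for p in protos])
        sw_trace = np.mean([np.trace(np.cov(X[y == c], rowvar=False)) for c in classes])

        # Mean cosine similarity between pairs of embedding dimensions.
        D = X - mu
        norms = np.linalg.norm(D, axis=0) + 1e-12
        dim_cos = (D / norms).T @ (D / norms)
        off_diag = ~np.eye(dim_cos.shape[0], dtype=bool)
        feats_cos_sim = dim_cos[off_diag].mean()

        # Mean cosine similarity between pairs of class prototypes.
        P = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-12)
        proto_cos = P @ P.T
        off_diag_p = ~np.eye(len(classes), dtype=bool)
        prototypes_cos_sim = proto_cos[off_diag_p].mean()

        return {
            "dim": X.shape[1],
            "n_classes": len(classes),
            "sb_trace": sb_trace,
            "sw_trace": sw_trace,
            "feats_cos_sim": feats_cos_sim,
            "prototypes_cos_sim": prototypes_cos_sim,
        }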

Ground Truth Creation After extracting the embeddings from the diverse datasets with the feature extractors, we need to determine the best achievable accuracy of a softmax classifier for each dataset of embeddings. We divided each embedding dataset into training and testing sets and trained the model for 1000 epochs with a batch size of 2048. As pre-processing, all embeddings were only ℓ2-normalized. By tracking the accuracy on the test set, we can observe the best-reached accuracy α, an approximation of the best reachable accuracy α*. Our meta-dataset ℳ = {(s_i, α_i) | i = 1, …, D} corresponds to all the pairs of the statistical representation s_i ∈ 𝒮 of each dataset d_i of the D datasets and the observed optimal accuracy α_i ∈ 𝒜. These tuples contain our inputs and outputs.
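
A minimal sketch of this ground-truth step is given below, assuming PyTorch tensors for the embeddings and labels; the optimizer and learning rate are not specified in the paper and are placeholders here.

    import torch
    import torch.nn.functional as F

    def best_linear_accuracy(train_emb, train_y, test_emb, test_y,
                             n_classes, epochs=1000, batch_size=2048, lr=1e-3):
        """Sketch: train a softmax (linear) classifier on l2-normalized embeddings
        and return the best test accuracy reached (the label alpha of a dataset)."""
        train_emb = F.normalize(train_emb, dim=1)   # l2-normalization, the only pre-processing
        test_emb = F.normalize(test_emb, dim=1)

        clf = torch.nn.Linear(train_emb.shape[1], n_classes)
        opt = torch.optim.Adam(clf.parameters(), lr=lr)

        best_acc = 0.0
        for _ in range(epochs):
            perm = torch.randperm(len(train_emb))
            for i in range(0, len(perm), batch_size):
                idx = perm[i:i + batch_size]
                loss = F.cross_entropy(clf(train_emb[idx]), train_y[idx])
                opt.zero_grad()
                loss.backward()
                opt.step()
            with torch.no_grad():
                acc = (clf(test_emb).argmax(dim=1) == test_y).float().mean().item()
                best_acc = max(best_acc, acc)   # alpha: best accuracy observed on the test set
        return best_acc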

Symbolic Regression We use the gplearn implementation because of the compactness of its solutions, its speed of execution, its robustness to noise [16], and its ease of use. The set of primitive functions used is {log, exp, √, +, −, ×, ÷} and the set of terminals corresponds to the statistics s_i describing the dataset d_i. We evolved a population of 5000 individuals for 20 generations. We designed a fitness function ℱ such that both pretrained and untrained extracted embeddings have a linear correlation with accuracy, independently. We split our meta-dataset in a fixed 75/25 train/test fashion and repeated each experiment 1000×. Since ℱ only seeks correlation, a linear transformation of the output value is learned on the training set in order to predict the accuracy (α̂ = a · p(·) + b).
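
A hedged sketch of such a gplearn run is shown below. The arrays S_train, S_test, and alpha_train stand for the statistics and accuracies of the meta-dataset split described above (assumed names). The fitness shown only rewards a global Pearson correlation, whereas the paper's ℱ additionally requires the correlation to hold independently for pretrained and untrained embeddings.

    import numpy as np
    from gplearn.genetic import SymbolicRegressor
    from gplearn.fitness import make_fitness
    from gplearn.functions import make_function

    # Fitness: absolute Pearson correlation between program output and accuracy.
    def _pearson(y, y_pred, w):
        if np.std(y_pred) < 1e-12:
            return 0.0
        return np.abs(np.corrcoef(y, y_pred)[0, 1])

    pearson_fitness = make_fitness(function=_pearson, greater_is_better=True)

    # Protected exponential, since 'exp' is not part of gplearn's built-in function set.
    exp_fn = make_function(function=lambda x: np.exp(np.clip(x, -50, 50)),
                           name='exp', arity=1)

    sr = SymbolicRegressor(
        population_size=5000,
        generations=20,
        function_set=('add', 'sub', 'mul', 'div', 'log', 'sqrt', exp_fn),
        metric=pearson_fitness,
        verbose=1,
        random_state=0,
    )
    sr.fit(S_train, alpha_train)          # S_*: dataset statistics, alpha_*: observed accuracies

    # Since the fitness only rewards correlation, calibrate a linear map on the training set.
    a, b = np.polyfit(sr.predict(S_train), alpha_train, 1)
    alpha_hat = a * sr.predict(S_test) + b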

3. Results

Baselines To evaluate the performance of our GP solution, we compare it with popular regression methods using the same train/test split. Performances on the test set are reported in Table 1. The substantial gap in r² score between the linear regressor and the nonlinear ones suggests that the task of predicting the accuracy requires a nonlinear combination of variables. Thus, we compare against nonlinear regressors such as decision trees and random forests because of their performance and the widespread belief that those models are among the most interpretable ones. Our formula outperformed them while being more explainable.

Table 1
Our formula has a better correlation and higher predictive power with only 5 variables (all p-values < 0.01).

  Method                            Pearson r    r²
  Linear Regression                 0.9042       0.8011
  Decision Tree Regressor           0.9472       0.8868
  Random Forest Regr. (10 trees)    0.9643       0.9246
  Our GP formula (GPF)              0.9682       0.9319
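
The baseline comparison of Table 1 can be reproduced along these lines (a sketch, reusing the same assumed S_train/S_test/alpha_train/alpha_test split as above; hyperparameters other than the 10 trees are assumptions):

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    baselines = {
        "Linear Regression": LinearRegression(),
        "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
        "Random Forest Regr. (10 trees)": RandomForestRegressor(n_estimators=10, random_state=0),
    }

    for name, model in baselines.items():
        model.fit(S_train, alpha_train)
        pred = model.predict(S_test)
        r, p_value = pearsonr(alpha_test, pred)
        print(f"{name:32s}  Pearson r={r:.4f} (p={p_value:.1e})  r2={r2_score(alpha_test, pred):.4f}")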

Symbolic Regression Formula We ran our GP pipeline 1000× on the same training set and serialized the respective solutions and scores for analysis. The solution with the best test r² score was found 6×. Our formula has a complexity of 6 nodes. We will refer to this Genetic Programming Formula as:

    GPF = log( (sb_trace / st_trace) / √(n_classes · feats_corr · prototypes_cos_sim) )        (1)

We can easily rewrite it as GPF = SEP − COR, with:

    SEP = log(sb_trace / st_trace)
    COR = (1/2) · log(n_classes · feats_corr · prototypes_cos_sim)                             (2)

SEP may correspond to a separability criterion, while COR may correspond to correlation information. We found those two parts to be complementary: SEP and COR have, respectively, a Pearson r with the accuracy of only 0.65 and −0.87. Finally, we found that the other best-performing GP formulas have similar structures and variables.
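
As a small illustration, Eq. (1)-(2) can be evaluated directly from the per-dataset statistics; st_trace (the trace of the total covariance matrix) and feats_corr are assumed here to be among the 19 features, computed analogously to the statistics sketched earlier:

    import numpy as np

    def gpf(stats):
        """Eq. (1): the discovered formula, from a dict of dataset statistics."""
        sep = np.log(stats["sb_trace"] / stats["st_trace"])                 # separability term (Eq. 2)
        cor = 0.5 * np.log(stats["n_classes"]
                           * stats["feats_corr"]
                           * stats["prototypes_cos_sim"])                   # correlation term (Eq. 2)
        return sep - cor                                                    # GPF = SEP - COR

    # Calibrated accuracy prediction (alpha_hat = a * GPF + b), with a, b fitted on the training split:
    # gpf_scores = np.array([gpf(s) for s in train_stats]); a, b = np.polyfit(gpf_scores, alpha_train, 1)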

4. Discussion

GPF can be written as a summation of two components. One can see that the first element, SEP, is close to Fisher's criterion used in Linear Discriminant Analysis (LDA) [17], where the objective is to find a linear projection that maximizes the ratio of the between-class variance to the within-class variance. Thus, SEP corresponds to a separability measure of classes. Remarkably, this criterion has been effectively applied as a loss function in deep learning [18, 19]. The choice of an LDA-based loss function remains marginal in deep learning, the cross-entropy (CE) being a more popular choice. However, strong similarities between the LDA and the CE allow us to swap this first separability measure with the latter one. Indeed, [20] noticed that one of the most widely studied technical routes for CE-based losses is to encourage stronger intra-class compactness and larger inter-class separability, as in Fisher's criterion. The second part, COR, is negatively correlated with the accuracy. Its first variable is the number of classes (n_classes). Indeed, it is natural to expect scores to decrease as the number of classes grows. For example, [21] observed a drop in accuracy on the CUB200 dataset when changing the number of classes from a coarse level to a fine-grained one.

In defense of the weight decorrelation term (prototypes_cos_sim), [22] found on several state-of-the-art CNNs that they could achieve better accuracy, more stable training, and smoother convergence by using orthogonal regularization of weights. Previous works on feature decorrelation heavily justify the presence of our feature decorrelation variable (feats_corr) [23, 24, 25, 20, 26]. Indeed, [25] found that correlated input variables usually lead to slower convergence; thus, several techniques were developed to better decorrelate variables, such as PCA or ZCA. More recently, decorrelation played an important role in the performance increase of self-supervised methods [23, 24, 26].

In this paper, we showed that a simple pipeline can help us to extract theoretical intuitions from experimentation. Our formula is highly explainable and is consistent with decades of research. While this work is still ongoing, we are working on an extended version [27].

References

 [1] A. Davies, et al., Advancing mathematics by guiding human intuition with AI, Nature 600 (2021) 70-74.
 [2] M. R. Douglas, Machine learning as a tool in theoretical science, Nature Reviews Physics (2022).
 [3] M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science 324 (2009).
 [4] D. A. Augusto, H. J. C. Barbosa, Symbolic regression via genetic programming, in: SBRN, Procs., 2000.
 [5] J. R. Koza, Genetic programming - on the programming of computers by means of natural selection, Complex adaptive systems, MIT Press, 1993.
 [6] Q. Lu, J. Ren, Z. Wang, Using genetic programming with prior formula knowledge to solve symbolic regression problem, Computational Intelligence and Neuroscience (2016).
 [7] R. Istrate, et al., TAPAS: Train-less accuracy predictor for architecture search, in: AAAI, Procs., 2019.
 [8] W. Wen, et al., Neural predictor for neural architecture search, in: ECCV, Procs., 2020.
 [9] W. Deng, L. Zheng, Are labels always necessary for classifier accuracy evaluation?, in: CVPR, Procs., 2021, pp. 15069-15078.
[10] E. Collins, N. Rozanov, B. Zhang, Evolutionary data measures: Understanding the difficulty of text classification tasks, in: CoNLL, 2018.
[11] F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, C. Malossi, Efficient image dataset classification difficulty estimation for predicting deep-learning accuracy, The Visual Computer 37 (2021).
[12] Y. Yamada, T. Morimura, Weight features for predicting future model performance of deep neural networks, in: IJCAI, 2016, pp. 2231-2237.
[13] B. Chamand, O. Risser-Maroix, C. Kurtz, P. Joly, N. Loménie, Fine-tune your classifier: Finding correlations with temperature, in: ICIP, Procs., 2022.
[14] T. K. Ho, M. Basu, Complexity measures of supervised classification problems, TPAMI 24 (2002).
[15] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, T. K. Ho, How complex is your classification problem? A survey on measuring classification complexity, ACM Computing Surveys 52 (2019).
[16] W. La Cava, P. Orzechowski, B. Burlacu, F. O. de França, M. Virgolin, Y. Jin, M. Kommenda, J. H. Moore, Contemporary symbolic regression methods and their relative performance, in: J. Vanschoren, S. Yeung (Eds.), Proceedings of the NIPS Track on Datasets and Benchmarks, 2021.
[17] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936).
[18] M. Dorfer, R. Kelz, G. Widmer, Deep linear discriminant analysis, in: ICLR, Procs., 2016.
[19] B. Ghojogh, et al., Fisher discriminant triplet and contrastive losses for training siamese networks, in: IJCNN, Procs., IEEE, 2020.
[20] W. Wan, Y. Zhong, T. Li, J. Chen, Rethinking feature distribution for loss functions in image classification, in: CVPR, Procs., 2018.
[21] D. Chang, K. Pang, Y. Zheng, Z. Ma, Y.-Z. Song, J. Guo, Your "flamingo" is my "bird": Fine-grained, or not, in: CVPR, Procs., 2021.
[22] N. Bansal, X. Chen, Z. Wang, Can we gain more from orthogonality regularizations in training deep networks?, in: NIPS, Procs., 2018.
[23] A. Ermolov, A. Siarohin, E. Sangineto, N. Sebe, Whitening for self-supervised representation learning, in: ICML, Procs., PMLR, 2021.
[24] T. Hua, et al., On feature decorrelation in self-supervised learning, in: ICCV, Procs., 2021.
[25] Y. A. LeCun, L. Bottou, G. B. Orr, K.-R. Müller, Efficient BackProp, 2012.
[26] S. Zhang, et al., Zero-CL: Instance and feature decorrelation for negative-free symmetric contrastive learning, in: ICLR, Procs., 2022.
[27] O. Risser-Maroix, B. Chamand, What can we learn by predicting accuracy?, 2022. arXiv:2208.01358.