<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on AI Evaluation Beyond Metrics, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>FERM: A FEature-space Representation Measure for Improved Model Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yeu Shin Fu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbo Ge</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jo Plested</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Australian National University</institution>
          ,
          <addr-line>Canberra, ACT 2601</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Equal contribution</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of New South Wales</institution>
          ,
          <addr-line>Northcott Dr, Campbell ACT 2612</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>25</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Understanding whether a particular dataset and task are well represented by a deep learning model can be as crucial as the model's prediction accuracy in many applications. Currently, the best prediction performance for large, modern datasets is often achieved by complex and difficult-to-interpret deep learning models. As deep learning model size and complexity increase relative to the size of the training dataset, the capacity of the model to overfit to inappropriate features and perform poorly or unreliably also increases. Unreliability may not be obvious in traditional performance measures during evaluation, so it is important to also consider how well the model is representing the current data distribution. There has previously been little work focusing on measuring this. We introduce several measures, collectively named FERM: A FEature-space Representation Measure, for determining how well the current feature space representation models the current data distribution and task. We compare our new measures with potential candidates from other related research areas, and demonstrate that our new method, along with two others, has excellent potential for measuring how well a trained model is currently representing a dataset and task. These findings have many implications for deep learning research and applications, including: evaluating when the current model no longer represents new data well, to reduce the frequency of computationally expensive retraining; assessing hard-to-evaluate failure modes, such as model biases that result in particular input samples being poorly represented; and guiding the choice of hyperparameters when updating models with limited new data.</p>
      </abstract>
      <kwd-group>
<kwd>Representation learning</kwd>
        <kwd>Feature space evaluation</kwd>
        <kwd>Deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the successes of deep learning in the past decade when applied to modelling large, well formed, and stable data distributions, recent focus has turned to modelling datasets that are:
1. Not well formed, because they are very different to the source dataset, in the case of some transfer learning applications.
2. Not stable over time, in the case of online learning tasks.
3. Difficult to model, as they have long tailed distributions, including for example rare minority classes, or other non-standard distributions.
      </p>
      <p>
        Related areas of recent research include:
• transferability, being how well a model trained on a related source task is likely to perform when fine-tuned on a target task [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]
• analysis of deep learning feature spaces and how those produced by pretrained models differ from those with random initialisation [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ].
      </p>
      <p>
        Our contributions are:
1. New evaluation measures for determining how well the current feature space representation models the current data distribution and task.
2. A thorough comparison of our new measures and potential candidates from other related research areas.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        There have been limited previous investigations into measures of how well data is being represented by a deep learning model. There are, however, many potential methods that could be adapted for this purpose from other fields, including:
1. Recent methods designed for measuring the "transferability" of a pretrained deep learning model [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">2, 1, 3</xref>
        ]. The logic for this is that how well modelled a source dataset is would likely be strongly correlated with how transferable the current model is: if the pretrained model weights produce a poorly modelled feature space, transfer learning is likely to perform poorly with those weights.
2. Methods designed for measuring how well clustered a high dimensional space is. The logic here is that a well modelled feature space for classification is one where the data points are well clustered and separated into their classes in feature space, ready to be classified by the final classification algorithm. Many clustering measures fail in high dimensional spaces or with a high number of classes, which means they are not useful for many deep learning feature spaces. However, there are several that do work well in these spaces [8].
3. Adapting methods designed to measure distance in high dimensions. A major problem with measuring the feature space is its high dimensionality. We propose a new method of measuring clustering based on the Fisher Score [9], which is commonly used as a clustering measure in two dimensions. We replace the Euclidean distance measure in the Fisher Score with cosine similarity, which is known to be an effective distance measure in high dimensions, along with other adaptations.
Several research areas that are related to measuring the feature space are outlined below.
      </p>
      <sec id="sec-2-1">
        <title>2.2. Exploring and visualising the deep learning feature space</title>
        <p>
          There are many methods that work on visualising either:
1. Recent methods designed for measuring the
"transferability" of a pretrained deep learning • the feature activations within a deep neural
netmodel [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">2, 1, 3</xref>
          ]. The logic for this being that how work [10, 11]
well modelled a source dataset is would likely • the final feature space [
          <xref ref-type="bibr" rid="ref11 ref12">12, 13, 14, 15</xref>
          ]
be strongly correlated with how transferable the • the predictions and their accuracy [16, 17].
current model is. If the pretrained model weights While some of these methods, particularly those in
produce a poorly modelled feature space transfer item two above, do result in a projection of the feature
learning is likely to perform poorly with those space into a lower dimensional visualisation that would
weights. be easier to measure, they focus on visual inspection
2. Methods designed for measuring how well clus- rather than on measurement. They also don’t analyse
tered a high dimensional space is. The logic here the loss of information, and thus intra-class separation,
is that a well modeled feature space for classifi- by projecting from a high dimensional space to a low
cation is one where the data points are well clus- dimensional space that can be visualised.
tered and separated into their classes in feature
space ready to be classified by the final
classification algorithm. There are many clustering mea- 2.3. Interpreting Model Predictions
sures that fail in high dimensional spaces or with There has been a large amount of work done in
interhigh number of classes which mean that they are preting model predictions and producing measures and
not useful for many deep learning feature spaces. visualisations that show how much a prediction should
However, there are several that do work well in be trusted [18, 17]. These models focus on analysing and
these spaces [8]. interpreting the importance of input features, rather than
3. Adapting methods designed to measure distance the final learned feature space.
        </p>
        <p>in high dimensions. A major problem with
measuring the feature space is the high dimension- 2.4. Metric Learning
ality. We propose a new method of measuring
clustering based on the Fisher Score [9] that is Metric learning techniques aim to find a feature
embedcommonly used as a clustering measure in two ding space that optimises some predefined distance
metdimensions. We replace the Euclidean distance ric given pairs of examples that are classified as either
measure in the Fisher Score with cosine similar- the same or diferent [ 19, 20, 21]. This problem has been
ity, which is known to be an efective distance well studied. Our problem is the opposite in that we
almeasure in high dimensions, along with other ready have an embedding space and we wish to find a
adaptations. metric that measures how well our current embedding is
separating our current samples into the same and
diferent classes or clusters. There may be some potential to
repurpose scores designed for the metric learning space,
however we leave this to future work as we have focused
on the most promising closely related measures in this
work.</p>
        <sec id="sec-2-1-1">
          <title>Several research areas that are related to measuring the feature space are outlined below.</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.1. Exploring the feature space in deep transfer learning</title>
        <p>Several methods have been proposed for analysing the
feature space from a pretrained model applied to a
3. Methodology
4. Notation
•  ∈  where  is an input and  is the domain
•  = {1, 2, ..., } where  is the set of
inputs
•  is the finite set of labels
•  = {,1, ,2, ..., , } is the set of inputs
that belong to class  with  samples, and thus
1 ∪ 2 ∪ ... ∪  =  where  is the number
of classes
•  is the trained model, which can be decomposed
as  () = ℎ(())
•  is the feature extractor that maps an input  to
a representation (or embedding)  = ()
•  is the feature representation
• ℎ is a classifier (or head) that takes the
representation  as input and returns a probability
distribution over .
• ℛ = {,1, ,2, ..., ,} = () is the
feature representation of the inputs in a class,
processed by the feature extractor 
• We define  (, ) as a function that operates
on two sets,  and , and gives the unordered
set of all unique pairs from  and . That is,
 (1, 2) ={(1,1, 2,1), (1,1, 2,2), ...</p>
        <p>, (1, , 2,− 1), (1, , 2, )}
• We can also say that, when  = ,  (, ) =
 () and instead gives the unordered set of
unique pairs, excluding pairs with itself. That
is,
 (, ) =  () ={(,1, ,2), (,1, ,3), ...
, (,− 1, , )}</p>
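        <p>To make the pair-set notation concrete, a minimal Python sketch is given below; the helper names pairs_within and pairs_between are ours and are not part of the paper's notation.</p>
        <preformat>
import itertools

def pairs_between(A, B):
    """P(A, B): all unique cross pairs between two different sets."""
    return [(a, b) for a in A for b in B]

def pairs_within(A):
    """P(A): unordered unique pairs within one set, excluding self-pairs."""
    return list(itertools.combinations(A, 2))
        </preformat>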
      </sec>
      <sec id="sec-2-3">
        <title>4.1. Scoring the feature space</title>
        <p>The aim of this work was to quantify how well constructed a feature space is, by creating or finding a measure that gives high scores when the feature space is well formed and low scores when it is malformed. Here, we think of a well formed feature space as one where there is high similarity/tight clustering within a class (intra-class) and low similarity/sparse clustering between classes (inter-class). Figure 1 shows a well formed 1,500 dimensional feature space reduced using T-SNE into the normalised top-2 representative dimensions so that it can be visualised. Note that the data points from all classes are grouped tightly within their class and are mostly well separated from other classes.</p>
        <p>The motivation for a score that measures how well constructed the feature space is, is three-fold: detecting domain shift and predicting the best response in terms of model retraining; detecting when an existing model has biases that make it unreliable for use on rarer data; and predicting the optimal way to train or retrain a model with limited training examples for a new or changing target dataset.</p>
        <p>We propose several scores that use cosine similarity to quantify the level of inter-class similarity versus intra-class similarity. We expect that a well formed feature space, as shown in Figure 1, should have high intra-class similarity and low inter-class similarity. Our measure is based on adapting the Fisher Score [9], which is known to perform poorly in high dimensions, by replacing the Euclidean distance with cosine similarity, which is known to perform well in high dimensions. Cosine similarity is defined as:</p>
        <disp-formula>s(u, v) = \cos \angle (u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{u^{\top} v}{\sqrt{u^{\top} u} \sqrt{v^{\top} v}} \qquad (1)</disp-formula>
        <p>where u and v are vectors, · is the dot product, and ‖·‖ is the magnitude of a vector.</p>
        <p>We define our first FERM:</p>
        <disp-formula>\mathrm{FERM}_1 = \frac{1}{C} \sum_{c=1}^{C} \frac{\dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\dfrac{1}{N_c (N - N_c)} \sum_{r_i, r_j \in P(\mathcal{R}_c,\, \mathcal{R} \setminus \mathcal{R}_c)} s(r_i, r_j)} \qquad (2)</disp-formula>
        <p>The intuition is quite simple: the numerator is the sum of cosine similarities of all unique pairs in a class, normalised by the number of unique pairs (i.e., an average). The denominator is the sum of cosine similarities of all unique pairs between samples in the class and samples out of the class, again normalised by the number of unique pairs (i.e., an average). This gives a ratio of intra-class similarity to inter-class similarity, which is then averaged across all classes, resulting in FERM 1.</p>
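        <p>A minimal numpy sketch of FERM 1 as reconstructed in Eq. (2) is given below; the function names and the (N, D) array layout for embeddings are illustrative assumptions, not part of the paper.</p>
        <preformat>
import numpy as np

def cosine_matrix(F):
    # Normalise rows to the unit hyper-sphere, so s(r_i, r_j) = r_i . r_j
    U = F / np.linalg.norm(F, axis=1, keepdims=True)
    return U @ U.T

def ferm1(features, labels):
    """features: (N, D) embeddings r = f(x); labels: (N,) class indices."""
    S = cosine_matrix(features)
    ratios = []
    for c in np.unique(labels):
        in_c = labels == c
        n_c = in_c.sum()
        # mean similarity over unique within-class pairs P(R_c);
        # subtracting n_c removes the diagonal self-similarities
        intra = (S[np.ix_(in_c, in_c)].sum() - n_c) / (n_c * (n_c - 1))
        # mean similarity over cross pairs P(R_c, R \ R_c)
        inter = S[np.ix_(in_c, ~in_c)].mean()
        ratios.append(intra / inter)
    return float(np.mean(ratios))
        </preformat>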
        <p>We can then define our second FERM:</p>
        <disp-formula>\mathrm{FERM}_2 = \frac{\sum_{c=1}^{C} \dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\sum_{c, k \,:\, c &lt; k} \dfrac{1}{N_c N_k} \sum_{r_i, r_j \in P(\mathcal{R}_c, \mathcal{R}_k)} s(r_i, r_j)} \qquad (3)</disp-formula>
        <p>The intuition is similar to the first FERM. The numerator remains the same after incorporating the outer sum (an average of cosine similarities of all unique pairs in a class, across all classes), but the denominator is now an average of cosine similarities of unique pairs between samples in the class and samples out of the class that have not yet been accounted for. In the first measure, only the unique pairs of samples in and out of a class are averaged; repeating this over the outer sum results in double counting across classes. FERM 2 prevents this double counting.</p>
        <p>We define our third FERM through the use of a centroid in terms of cosine similarities, a so-called 'angular centroid'. In the same way that the average Euclidean distance of one point to several other points can be represented as the distance of that one point to a Euclidean centroid of the points, the average angle between one point and several other points can be represented as the angle between that one point and an 'angular centroid' of the points. The centroid for a class c is defined as:</p>
        <disp-formula>\kappa_c = \frac{1}{N_c} \sum_{r \in \mathcal{R}_c} \frac{r}{\lVert r \rVert} \qquad (4)</disp-formula>
        <p>This can be interpreted as normalising all samples to the unit hyper-sphere, then finding the centroid point on the unit hyper-sphere by adding all normalised samples together and normalising the combined vector. We can then define our third FERM:</p>
        <disp-formula>\mathrm{FERM}_3 = \frac{\sum_{c=1}^{C} \dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\sum_{c, k \,:\, c \neq k} \dfrac{1}{N_c (C - 1)} \sum_{r \in \mathcal{R}_c} s(r, \kappa_k)} \qquad (5)</disp-formula>
        <p>The numerator term is still the same, but the denominator is now the average cosine similarity of samples within a class to the centroids of the other classes.</p>
        <p>Using the same notation as above, we can then define our fourth FERM:</p>
        <disp-formula>\mathrm{FERM}_4 = \frac{\sum_{c=1}^{C} \dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\dfrac{1}{C^2 - C} \sum_{c \neq k} s(\kappa_c, \kappa_k)} \qquad (6)</disp-formula>
        <p>This further simplifies the calculation of the denominator to a comparison of the centroid of a class to the centroids of the other classes.</p>
        <p>For all FERMs, a higher score means better clustering. As the numerator and denominator of each FERM are averages of cosine similarities and thus bounded within [-1, 1], a positive score above 1.0 reflects more intra-class similarity than inter-class similarity.</p>
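        <p>The angular centroid and the fourth FERM can be sketched in the same style as the FERM 1 sketch above; this follows one plausible reading of Eqs. (4) and (6), and the helper names are our own.</p>
        <preformat>
import numpy as np

def angular_centroid(F):
    U = F / np.linalg.norm(F, axis=1, keepdims=True)
    k = U.mean(axis=0)               # add the normalised samples together
    return k / np.linalg.norm(k)     # normalise the combined vector

def ferm4(features, labels):
    classes = np.unique(labels)
    U = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = U @ U.T
    num, cents = 0.0, []
    for c in classes:
        in_c = labels == c
        n_c = in_c.sum()
        # average within-class pair similarity, summed over classes
        num += (S[np.ix_(in_c, in_c)].sum() - n_c) / (n_c * (n_c - 1))
        cents.append(angular_centroid(features[in_c]))
    K = np.stack(cents)
    G = K @ K.T                      # centroid-to-centroid cosines
    C = len(classes)
    inter = (G.sum() - np.trace(G)) / (C * (C - 1))
    return num / inter
        </preformat>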
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Data sets</title>
        <sec id="sec-2-4-1">
          <title>We have selected the following datasets.</title>
          <p>The intuition is similar to the first FERM. The numerator 4.3.1. Source Dataset
remains the same after incorporating the out sum (an
average of cosine similarities of all unique pairs in a class, ImageNet 1K (ImageNet) [25] A general image
across all classes), but the denominator is now an average dataset containing 1,000 common image classes with at
of cosine similarities of unique pairs between samples in least 1,000 total images in each class for a total of just over
the class and samples out of the class that has not yet been 1.3 million images in the training set. We use ImageNet
accounted for. Although, in the first measure, only the as the source dataset for all our experiments.
unique pairs of samples in and out of a class are averaged,
further repeating this (the outer sum) results in double 4.3.2. Target Datasets
counting across classes. FERM 2 prevents this double Caltech-256 (Caltech) [26] Pictures of objects
becounting. longing to 256 categories, with at least 80 images per</p>
          <p>We define our third FERM through the use of a cen- category. The Caltech dataset is a general image
clastroid in terms of cosine similarities, a so called ‘angular sification dataset similar to ImageNet but with orders
centroid’. In the same way that the average Euclidean of magnitude fewer training examples. It is generally
distance of one point to several other points can be rep- considered to be the most similar target dataset to
Imaresented as the distance of that one point to a Euclidean geNet and fixed weights pretrained on ImageNet tend to
centroid of points, the average angle between one point perform about as well as fine-tuned weights [22, 23].
and several other points can be represented as the
angle between that one point and an ‘angular centroid’ of
points. The centroid for a class  is defined as:
(4)</p>
          <p>FGVC Aircraft (Aircraft) [27] Contains 100 diferent
makes and models of aircraft with 6,667 training
examples and 3,333 test examples. The Aircraft dataset is a
finegrained image classification dataset that is very diferent
 =
1
∑︁</p>
          <p>∈ ‖‖
from ImageNet. Fixed weights pretrained on ImageNet
perform extremely poorly on this dataset [22, 23].</p>
          <p>Stanford Cars (Cars) [28] Contains 196 diferent
makes and models of cars with 8,144 training examples
and 8,041 test examples. The Cars dataset is also a
finegrained image classification dataset that is very diferent
from ImageNet and fixed weights pretrained on ImageNet
also perform extremely poorly on this dataset [22, 23].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <sec id="sec-3-1">
        <title>We performed two sets of experiments:</title>
        <p>Describable textures (DTD) [29] Consists of 3,760
training examples of texture images jointly annotated
with 47 attributes. While the DTD dataset is conceptually
very diferent to ImageNet recent results have shown that
ifxed weights pretrained on ImageNet perform
reasonably well on this dataset compared to fine-tuned weights
[22, 23].</p>
        <p>The ratio of the fixed features to fine-tuned results for
a model pretrained on ImageNet are shown in Table 4 for
all datasets.</p>
        <p>For each experiment we used the Inception v4
architecture [36] pretrained on ImageNet 1k. Using this model,
we compared the diferent FERMs on the diferent
target data sets: Aircraft, DTD, Cars, and Caltech-256. We
also used ImageNet 1k as a target data set to determine a
baseline score for each measure.</p>
        <p>During this evaluation, two pipelines were constructed:
one that utilises transformations of the data, and one that
does not. When determining how well classes are
clustered together, a forward pass of the unaltered data was
initially used, providing us with the exact feature
representation of that sample. During a standard deep learning
training process, samples are randomly flipped, scaled,
resized, and rotated. These samples incur a loss if
classiifed incorrectly, and so we expect the model to still learn
to classify those samples correctly. Therefore it is likely
that the feature representation of these randomly
transformed samples are still able to be represented in a well
1. Conducting experiments to compare the efec- formed feature space. Assuming the model adequately
tiveness of our score along with candidate scores classifies the transformed data, a measure that is robust
from other fields in measuring how well a model to these transformations (that is, does not change much
trained on the ImageNet 1K source dataset repre- in the presence or absence of transformations) would be
sents a particular known and stable target dataset. better than one that is not, as it would allow us to use
We use datasets where it is well known how well this during the training process.
ifxed pretrained ImageNet 1K weights perform on We explored the four proposed FERMs on the five
them so they make a good basis for comparison. target data sets (including ImageNet 1k) with the two
dif2. Using the above measures to detect ‘corruption’ ferent pipelines (with or without transformations). Each
or domain shift in the feature space. transformation experiment was also repeated five times,
as the transformations are random.</p>
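        <p>A hedged sketch of the two feature-extraction pipelines follows. We assume the timm package for an ImageNet-pretrained Inception v4 backbone, and standard torchvision transformations; the exact transform parameters and input normalisation are ours, not the paper's.</p>
        <preformat>
import timm
import torch
from torchvision import transforms

# num_classes=0 makes timm return pooled embeddings instead of logits
model = timm.create_model('inception_v4', pretrained=True, num_classes=0).eval()

plain = transforms.Compose([
    transforms.Resize(342),
    transforms.CenterCrop(299),      # Inception v4 expects 299x299 inputs
    transforms.ToTensor(),
])

augmented = transforms.Compose([     # mimics standard training-time augmentation
    transforms.RandomResizedCrop(299),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

@torch.no_grad()
def extract_features(loader):
    """Forward-pass a dataloader and collect (features, labels) as numpy."""
    feats, labels = [], []
    for x, y in loader:
        feats.append(model(x))
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()
        </preformat>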
      </sec>
      <sec id="sec-3-2">
        <p>We also investigated recent transferability scores that have been shown to perform well when measuring how well transfer learning will perform on a particular target dataset: LEEP [1], OTCE [3], and H-score [2].</p>
        <p>In addition to our proposed measures, we explored several other clustering measures. These were chosen by reviewing [8] and removing clustering scores that were not stable as dimensionality increased (showing large perturbations or sensitivity to outliers), or that scored overlapping clusters and well separated clusters similarly:
• Silhouette score [30]
• Davies Bouldin score [31]
• Calinski Harabasz score [32]
• Dunn score [33]
• RS index [8]
• Point Biserial Index [34]
• C√K index [35]</p>
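        <p>Three of the clustering measures listed above are available directly in scikit-learn; a minimal sketch, assuming features and integer labels as numpy arrays (the remaining indices would need custom implementations):</p>
        <preformat>
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def clustering_scores(features, labels):
    return {
        'silhouette': silhouette_score(features, labels, metric='cosine'),
        'davies_bouldin': davies_bouldin_score(features, labels),   # lower is better
        'calinski_harabasz': calinski_harabasz_score(features, labels),
    }
        </preformat>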
      </sec>
      <sec id="sec-3-3">
        <title>5.1.1. Results</title>
        <p>We know that fixed features pretrained on ImageNet 1k perform well on Caltech-256, moderately well on DTD, and poorly on Aircraft and Cars [22, 23], as shown by our ratios of fixed-feature to fine-tuned performance in Table 1. We use this as a proxy for a well formed feature space and expect a good score to reflect the same knowledge: a low score for Aircraft and Cars, a moderate score for DTD, a high score for Caltech-256, and a very high score for ImageNet.</p>
        <p>Comparisons between the different FERMs across the different target datasets, with and without transformations, can be seen in Table 1. Note that results with transformations are reported as means and standard deviations, as those experiments were repeated. The datasets in all tables are listed in order of the ratio of the performance of fixed features pretrained on ImageNet to the best fine-tuned model performance, using results from [22, 23] as a proxy for how well formed the feature space is.</p>
        <p>Both with and without transformations, ImageNet consistently scored highest, followed consistently by Caltech, except with FERM 4. For FERM 1 and 2, Aircraft and Cars score much lower than ImageNet and Caltech, with DTD in between. This is the same ordering as our proxy for a well formed feature space. Our results with and without random transformations of the data suggest that FERM 1 and 2 can reproduce this ordering consistently, and thus have potential as a way to measure how well formed the feature space is for a particular trained model and target task.</p>
        <p>Of the transferability measures, LEEP is the only score that consistently ranks ImageNet 1k and Caltech-256 as most transferable, in both the presence and absence of transformations; however, it ranks DTD as least transferable in both cases, which is incorrect. Given the scores are in the same order as the number of classes in each dataset, it seems likely that LEEP is affected by the number of classes.</p>
        <p>H-score also seems to be strongly affected by the number of classes, as its scores are close to proportional to the number of classes in the target dataset.</p>
        <p>Of the clustering measures, Silhouette score, Davies Bouldin score, Point Biserial Index, and C√K index also consistently rank ImageNet 1k and Caltech-256 as the most transferable, in the presence and absence of transformations. However, only Silhouette score ranks DTD as moderately transferable compared to the others. Point Biserial Index may also be strongly affected by the number of classes, as its scores are again close to proportional to the number of classes in the target dataset.</p>
        <p>In summary, when looking only at stable target datasets, our proposed scores FERM 1 and 2, as well as the clustering measure Silhouette score, are good candidates for measuring how well formed the feature space is for a given trained model and target task.</p>
      </sec>
      <sec id="sec-3-4">
        <title>5.2. Detecting and quantifying domain shifts</title>
        <p>We attempted to detect and quantify incremental domain shifts. As it is hard to concretely quantify different levels of domain shift, we reduce the problem to detecting levels of 'corruption'. 'Corruption' is defined as the presence of the target dataset mixed into the source dataset, where pure source data can be thought of as no domain shift, whilst pure target data can be thought of as complete domain shift. The level of corruption can then be quantified as the percentage of target data in the source dataset.</p>
        <p>We again started with an Inception v4 model pretrained on ImageNet 1K. We then incrementally shifted the domain by either adding target data to the evaluation set or removing source data from the evaluation set. The source samples are derived from the ImageNet 1k validation set, whilst the target samples are derived from the training set of Aircraft. The Aircraft dataset was used in this case as it was the most poorly represented by the pretrained model in our previous experiments. Each time we added more 'corruption' we used all measures from our previous experiments to score the feature space.</p>
        <p>Specifically, we created the evaluation set by randomly choosing 200 classes from ImageNet 1k, and then randomly choosing the same number of samples across those classes. Aircraft was combined with this in a similar way, that is, by randomly choosing the same number of samples across all 100 classes. The union of both creates the evaluation set.</p>
        <p>The feature representation of a sample is defined as r = f(x), where f(·) is the feature extractor from the trained source model. We expected that as the level of corruption increases (as more of the source dataset is replaced by the target dataset), the clustering of classes in the feature space degrades: features in the new classes are not clustered well, and thus the overall clustering score should decrease.</p>
        <p>Another way we approached the problem is by looking at the transferability measures. Since measures of transferability are largest when the source task is the same as the target task, we hypothesised that at 0% corruption (i.e., no domain shift) transferability scores will be high, and will slowly degrade with increasing levels of corruption.</p>
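        <p>A minimal sketch of how such a corrupted evaluation set could be assembled from pre-extracted features is given below; all names are illustrative, and the paper's exact sampling protocol may differ.</p>
        <preformat>
import numpy as np

def corrupted_eval_set(src_feats, src_labels, tgt_feats, tgt_labels,
                       corruption, n_total, rng):
    """corruption: fraction of evaluation samples drawn from the target pool."""
    n_tgt = int(round(corruption * n_total))
    n_src = n_total - n_tgt
    si = rng.choice(len(src_feats), size=n_src, replace=False)
    ti = rng.choice(len(tgt_feats), size=n_tgt, replace=False)
    feats = np.concatenate([src_feats[si], tgt_feats[ti]])
    # offset target labels so the two label spaces do not collide
    labels = np.concatenate([src_labels[si],
                             tgt_labels[ti] + src_labels.max() + 1])
    return feats, labels

# usage: rng = np.random.default_rng(0); repeat over corruption levels 0.0..1.0
        </preformat>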
        <p>We expect a measure that is good at detecting domain shift to start with a normalised score of 1 (or 0 if inversely proportional) with no domain shift, and to incrementally decrease to 0 (or increase to 1) as the domain is completely shifted. We would also like the measure to be monotonically decreasing (or increasing).</p>
      </sec>
      <sec id="sec-3-5">
        <title>5.2.1. Results</title>
        <p>For each different combination of source and target dataset we ran the experiment 10 times, as the selection of the examples for each class was random. The classes chosen from ImageNet were fixed to allow for a consistent comparison. The change in each of the different scores as the domain shifts to the target dataset of Aircraft can be seen in Figure 2. The scores have been normalised between 0 and 1. Although several of these runs were repeated and averaged, we did not plot error bars as they are largely uninformative, as seen in Section 5.1.1.</p>
        <p>[Figure 2: Normalised scores as the evaluation set shifts from ImageNet 1k to Aircraft, shown in three panels: cosine measures (cosine measures 1-3), clustering measures (silhouette score, davies bouldin score, calinski harabasz score, dunn score, rs index, point biserial index, C root K index), and transferability measures (leep, otce, h score).]</p>
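        <p>The min-max normalisation applied to each score trajectory before plotting can be sketched as follows; this is our reading of "normalised between 0 and 1".</p>
        <preformat>
import numpy as np

def normalise_trajectory(scores):
    """Rescale one measure's scores across corruption levels into [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)
        </preformat>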
      </sec>
      <sec id="sec-3-6">
        <title>5.2.2. Discussion</title>
        <p>The results in Figure 2 show that only the Point Biserial Index is almost entirely monotonic in its trend. Ignoring the last point (0 samples of ImageNet 1k), H-score seems to have strong potential to detect domain shift; however, more investigation is required to see why the final point is so far out of sequence.</p>
        <p>RS index, Davies Bouldin score, and Silhouette score also have sections of monotonic trend. Further work is required to make a strong claim about the ability of these measures to detect and quantify domain shift.</p>
        <p>The results of our FERMs are particularly interesting. If the points where there is only one example per class of either Aircraft or ImageNet are excluded (second from the left and right on the graph), the trend is almost monotonic from all ImageNet examples to all Aircraft examples. Also, the point where the score reduces significantly from the original ImageNet score is approximately the point where the dataset has shifted to the extent that its composition is more than 50% target data. The experiments with only one example from each class of either the source or the target dataset can be thought of as just adding noise, as intra-class distances cannot be measured with only one example per class. Thought of in this way, it is useful that our measure is strongly sensitive to this situation.</p>
        <p>More extensive work should be done to compare our methods with the Point Biserial Index and H-score across a broader range of domain shift applications.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>We have created a selection of new scores for evaluating how well a particular dataset is represented by the current model weights and architecture. We have performed extensive experiments comparing our new scores with measures from other fields that could potentially be reused for this purpose. We compared the efficacy of these measures both for measuring how well existing model weights represent a new stable target dataset, and for detecting domain shift. The results of these experiments indicate that our new method, along with two others, has excellent potential for measuring how well a dataset is currently being represented by a model.</p>
      <p>Measures for this purpose have not been investigated before, and our results have strong implications for the wider deep learning community. These measures have the potential to be used to:
1. Detect domain shift and predict the best response in terms of model retraining.
2. Detect when an existing model has biases that make it unreliable for use on rarer data.
3. Predict the optimal way to train or retrain a model with limited training examples for a new or changing target dataset.</p>
      <p>There are a great many examples of ways these measures could be useful as an important part of an overall evaluation of a model; some of these are:
• Uncovering and quantifying biases in models. For example, how well is a model that is trained on mostly Caucasian faces likely to perform in identifying faces from other races?
• Quantifying how well prediction models based on historical data are representing data from the last few years that has changed due to COVID and other modern challenges. Once quantified, these measures could also give guidance on how to update models to better incorporate modern data.
• Highlighting when models are performing well on training and test data, but overfitting a poor representation that will not generalise well to new data; a classic example being the snow in the foreground being used to classify a husky versus a wolf in [17].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Thanks to Dawn Olley for editing services.</p>
      <sec id="sec-5-1">
        <title>References</title>
        <p>[15] G. E. Hinton, S. Roweis, Stochastic neighbor embedding, Advances in Neural Information Processing Systems 15 (2002).</p>
        <p>[16] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, J. Wilson, The what-if tool: Interactive probing of machine learning models, IEEE Transactions on Visualization and Computer Graphics 26 (2019) 56-65.</p>
        <p>[17] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135-1144.</p>
        <p>[18] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).</p>
        <p>[19] K. Q. Weinberger, J. Blitzer, L. Saul, Distance metric learning for large margin nearest neighbor classification, Advances in Neural Information Processing Systems 18 (2005).</p>
        <p>[20] E. Xing, M. Jordan, S. J. Russell, A. Ng, Distance metric learning with application to clustering with side-information, Advances in Neural Information Processing Systems 15 (2002).</p>
        <p>[21] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (2010).</p>
        <p>[22] S. Kornblith, J. Shlens, Q. V. Le, Do better ImageNet models transfer better?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2661-2671.</p>
        <p>[23] J. Plested, X. Shen, T. Gedeon, Non-binary deep transfer learning for image classification, arXiv e-prints (2021) arXiv:2107.08585.</p>
        <p>[24] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 77-91.</p>
        <p>[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: CVPR09, 2009.</p>
        <p>[26] G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset, authors.library.caltech.edu (2007).</p>
        <p>[27] Y. Cui, F. Zhou, Y. Lin, S. Belongie, Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1153-1162.</p>
        <p>[28] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, A. Vedaldi, Fine-grained visual classification of aircraft, Technical Report, Toyota Technological Institute at Chicago, 2013. arXiv:1306.5151.</p>
        <p>[29] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, A. Vedaldi, Describing textures in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3606-3613.</p>
        <p>[30] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53-65.</p>
        <p>[31] D. L. Davies, D. W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence (1979) 224-227.</p>
        <p>[32] T. Caliński, J. Harabasz, A dendrite method for cluster analysis, Communications in Statistics - Theory and Methods 3 (1974) 1-27.</p>
        <p>[33] J. C. Dunn, Well-separated clusters and optimal fuzzy partitions, Journal of Cybernetics 4 (1974) 95-104.</p>
        <p>[34] G. W. Milligan, A Monte Carlo study of thirty internal criterion measures for cluster analysis, Psychometrika 46 (1981) 187-199.</p>
        <p>[35] D. Ratkowsky, A stopping rule and clustering method of wide applicability, Botanical Gazette 145 (1984) 518-523.</p>
        <p>[36] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seeger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Archambeau</surname>
          </string-name>
          ,
          <article-title>Leep: A new measure to evaluate transferability of learned representations</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7294</fpage>
          -
          <lpage>7305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <article-title>An information-theoretic approach to transferability in task transfer learning</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Image Processing (ICIP)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>2309</fpage>
          -
          <lpage>2313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Otce: A transferability metric for cross-domain cross-task representations</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>15779</fpage>
          -
          <lpage>15788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , T. Hassner,
          <article-title>Transferability and hardness of supervised classification tasks</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1395</fpage>
          -
          <lpage>1405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] B. Neyshabur, H. Sedghi, C. Zhang, What is being transferred in transfer learning?, arXiv preprint arXiv:2008.11687 (2020).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] H. Liu, M. Long, J. Wang, M. I. Jordan, Towards understanding the transferability of deep representations, arXiv preprint arXiv:1909.12031 (2019).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] X. Shen, J. Plested, S. Caldwell, T. Gedeon, Exploring biases and prejudice of facial synthesis via semantic latent space, in: 2021 International Joint Conference on Neural Networks (IJCNN), IEEE, 2021, pp. 1-8.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] N. Tomašev, M. Radovanović, Clustering evaluation in high-dimensional data, in: Unsupervised Learning Algorithms, Springer, 2016, pp. 71-107.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179-188.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>and other modern challenges</article-title>
          .
          <source>Once</source>
          <volume>quantified</volume>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <article-title>Multifaceted feathese measures could also give guidance on how ture visualization: Uncovering the diferent types to update models to better incorporate modern of features learned by each neuron in deep neural data</article-title>
          .
          <source>networks, arXiv preprint arXiv:1602.03616</source>
          (
          <year>2016</year>
          ).
          <article-title>• Highlighting when models are performing well</article-title>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          , H. Lipson,
          <article-title>on training and test data, but overfitting a poor Understanding neural networks through deep visurepresentation that will not generalise well to alization</article-title>
          ,
          <source>arXiv preprint arXiv:1506.06579</source>
          (
          <year>2015</year>
          ).
          <article-title>new data. A classic example being the snow</article-title>
          in [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aubry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Understanding deep feathe foreground being used to classify a husky tures with computer-generated imagery</article-title>
          ,
          <source>in: Proversus a wolf in [17]. ceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2875</fpage>
          -
          <lpage>2883</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[13] T.-Y. Lin, S. Maji, Visualizing and understanding deep texture representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2791-2799.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[14] C. Vondrick, A. Khosla, H. Pirsiavash, T. Malisiewicz, A. Torralba, Visualizing object detection features, International Journal of Computer Vision 119 (2016) 145-158.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>