<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on AI Evaluation Beyond Metrics, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>FERM: A FEature-space Representation Measure for Improved Model Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yeu Shin Fu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbo Ge</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jo Plested</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Australian National University</institution>
          ,
          <addr-line>Canberra, ACT 2601</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Equal contribution</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of New South Wales</institution>
          ,
          <addr-line>Northcott Dr, Campbell ACT 2612</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>25</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Understanding whether a particular dataset and task are well represented by a deep learning model can be as crucial as the model's prediction accuracy in many applications. Currently, the best prediction performance for large, modern datasets is often achieved by complex and difficult-to-interpret deep learning models. As deep learning model size and complexity increase relative to the size of the training dataset, the capacity of the model to overfit to inappropriate features and perform poorly or unreliably also increases. Unreliability may not be obvious in traditional performance measures during evaluation, so it is important to also consider how well the model is representing the current data distribution. There has previously been little work focusing on measuring this. We introduce several measures, collectively named FERM: A FEature-space Representation Measure, for determining how well the current feature space representation models the current data distribution and task. We compare our new measures with potential candidates from other related research areas, and demonstrate that our new method, along with two others, has excellent potential for measuring how well a trained model is currently representing a dataset and task. These findings have many implications for deep learning research and applications, including: evaluating when the current model no longer represents new data well, to reduce the frequency of computationally expensive retraining; assessing hard-to-evaluate failure modes, such as model biases that result in particular input samples being poorly represented; and guiding the choice of hyperparameters when updating models with limited new data.</p>
      </abstract>
      <kwd-group>
<kwd>Representation learning</kwd>
        <kwd>Feature space evaluation</kwd>
        <kwd>Deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the successes of deep learning in the past decade when applied to modelling large, well formed, and stable data distributions, recent focus has turned to modelling datasets that are:
1. Not well formed, because they are very different to the source dataset, in the case of some transfer learning applications.
2. Not stable over time, in the case of online learning tasks.
3. Difficult to model, as they have long tailed distributions, including for example rare minority classes, or other non-standard distributions.
      </p>
      <p>
        Related areas of recent research include:
• transferability, being how well a model trained on a related source task is likely to perform when fine-tuned on a target task [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]
• analysis of deep learning feature spaces and how those produced by pretrained models differ from those with random initialisation [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ].
      </p>
      <p>
        Our contributions are:
1. New evaluation measures for determining how well the current feature space representation models the current data distribution and task.
2. A thorough comparison of our new measures and potential candidates from other related research areas.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        There have been limited previous investigations into measures of how well data is being represented by a deep learning model. There are, however, many potential methods that could be adapted for this purpose from other fields, including:
1. Recent methods designed for measuring the "transferability" of a pretrained deep learning model [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">2, 1, 3</xref>
        ]. The logic for this is that how well modelled a source dataset is would likely be strongly correlated with how transferable the current model is: if the pretrained model weights produce a poorly modelled feature space, transfer learning is likely to perform poorly with those weights.
2. Methods designed for measuring how well clustered a high dimensional space is. The logic here is that a well modelled feature space for classification is one where the data points are well clustered and separated into their classes in feature space, ready to be classified by the final classification algorithm. Many clustering measures fail in high dimensional spaces or with a high number of classes, which means they are not useful for many deep learning feature spaces. However, there are several that do work well in these spaces [8].
3. Adapting methods designed to measure distance in high dimensions. A major problem with measuring the feature space is its high dimensionality. We propose a new method of measuring clustering based on the Fisher Score [9], which is commonly used as a clustering measure in two dimensions. We replace the Euclidean distance measure in the Fisher Score with cosine similarity, which is known to be an effective distance measure in high dimensions, along with other adaptations.
Several research areas that are related to measuring the feature space are outlined below.
      </p>
      <sec id="sec-2-1">
        <title>2.2. Exploring and visualising the deep learning feature space</title>
        <p>
          There are many methods that work on visualising either:
1. Recent methods designed for measuring the
"transferability" of a pretrained deep learning • the feature activations within a deep neural
netmodel [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">2, 1, 3</xref>
          ]. The logic for this being that how work [10, 11]
well modelled a source dataset is would likely • the final feature space [
          <xref ref-type="bibr" rid="ref11 ref12">12, 13, 14, 15</xref>
          ]
be strongly correlated with how transferable the • the predictions and their accuracy [16, 17].
current model is. If the pretrained model weights While some of these methods, particularly those in
produce a poorly modelled feature space transfer item two above, do result in a projection of the feature
learning is likely to perform poorly with those space into a lower dimensional visualisation that would
weights. be easier to measure, they focus on visual inspection
2. Methods designed for measuring how well clus- rather than on measurement. They also don’t analyse
tered a high dimensional space is. The logic here the loss of information, and thus intra-class separation,
is that a well modeled feature space for classifi- by projecting from a high dimensional space to a low
cation is one where the data points are well clus- dimensional space that can be visualised.
tered and separated into their classes in feature
space ready to be classified by the final
classification algorithm. There are many clustering mea- 2.3. Interpreting Model Predictions
sures that fail in high dimensional spaces or with There has been a large amount of work done in
interhigh number of classes which mean that they are preting model predictions and producing measures and
not useful for many deep learning feature spaces. visualisations that show how much a prediction should
However, there are several that do work well in be trusted [18, 17]. These models focus on analysing and
these spaces [8]. interpreting the importance of input features, rather than
3. Adapting methods designed to measure distance the final learned feature space.
        </p>
        <p>in high dimensions. A major problem with
measuring the feature space is the high dimension- 2.4. Metric Learning
ality. We propose a new method of measuring
clustering based on the Fisher Score [9] that is Metric learning techniques aim to find a feature
embedcommonly used as a clustering measure in two ding space that optimises some predefined distance
metdimensions. We replace the Euclidean distance ric given pairs of examples that are classified as either
measure in the Fisher Score with cosine similar- the same or diferent [ 19, 20, 21]. This problem has been
ity, which is known to be an efective distance well studied. Our problem is the opposite in that we
almeasure in high dimensions, along with other ready have an embedding space and we wish to find a
adaptations. metric that measures how well our current embedding is
separating our current samples into the same and
diferent classes or clusters. There may be some potential to
repurpose scores designed for the metric learning space,
however we leave this to future work as we have focused
on the most promising closely related measures in this
work.</p>
        <sec id="sec-2-1-1">
          <title>Several research areas that are related to measuring the feature space are outlined below.</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.1. Exploring the feature space in deep transfer learning</title>
        <p>Several methods have been proposed for analysing the
feature space from a pretrained model applied to a
3. Methodology
4. Notation
•  ∈  where  is an input and  is the domain
•  = {1, 2, ..., } where  is the set of
inputs
•  is the finite set of labels
•  = {,1, ,2, ..., , } is the set of inputs
that belong to class  with  samples, and thus
1 ∪ 2 ∪ ... ∪  =  where  is the number
of classes
•  is the trained model, which can be decomposed
as  () = ℎ(())
•  is the feature extractor that maps an input  to
a representation (or embedding)  = ()
•  is the feature representation
• ℎ is a classifier (or head) that takes the
representation  as input and returns a probability
distribution over .
• ℛ = {,1, ,2, ..., ,} = () is the
feature representation of the inputs in a class,
processed by the feature extractor 
• We define  (, ) as a function that operates
on two sets,  and , and gives the unordered
set of all unique pairs from  and . That is,
 (1, 2) ={(1,1, 2,1), (1,1, 2,2), ...</p>
        <p>, (1, , 2,− 1), (1, , 2, )}
• We can also say that, when  = ,  (, ) =
 () and instead gives the unordered set of
unique pairs, excluding pairs with itself. That
is,
 (, ) =  () ={(,1, ,2), (,1, ,3), ...
, (,− 1, , )}</p>
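        <p>To make the pair-set notation concrete, a minimal Python sketch is given below; the helper names pairs_within and pairs_between are ours and are not part of the paper's notation.</p>
        <preformat>
import itertools

def pairs_between(A, B):
    """P(A, B): all unique cross pairs between two different sets."""
    return [(a, b) for a in A for b in B]

def pairs_within(A):
    """P(A): unordered unique pairs within one set, excluding self-pairs."""
    return list(itertools.combinations(A, 2))
        </preformat>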
      </sec>
      <sec id="sec-2-3">
        <title>4.1. Scoring the feature space</title>
        <p>The aim of this work was to quantify how well constructed a feature space is, by creating or finding a measure that gives high scores when the feature space is well formed and low scores when it is malformed. Here, we think of a well formed feature space as one where there is high similarity/tight clustering within a class (intra-class) and low similarity/sparse clustering between classes (inter-class). Figure 1 shows a well formed 1,500 dimensional feature space reduced using T-SNE into the normalised top-2 representative dimensions so that it can be visualised. Note that the data points from all classes are grouped tightly within their class and are mostly well separated from other classes.</p>
        <p>The motivation for a score that measures how well constructed the feature space is, is three-fold: detecting domain shift and predicting the best response in terms of model retraining; detecting when an existing model has biases that make it unreliable for use on rarer data; and predicting the optimal way to train or retrain a model with limited training examples for a new or changing target dataset.</p>
        <p>We propose several scores that use cosine similarity to quantify the level of inter-class similarity versus intra-class similarity. We expect that a well formed feature space, as shown in Figure 1, should have high intra-class similarity and low inter-class similarity. Our measure is based on adapting the Fisher Score [9], which is known to perform poorly in high dimensions, by replacing the Euclidean distance with cosine similarity, which is known to perform well in high dimensions. Cosine similarity is defined as:</p>
        <disp-formula>s(u, v) = \cos \angle (u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{u^{\top} v}{\sqrt{u^{\top} u} \sqrt{v^{\top} v}} \qquad (1)</disp-formula>
        <p>where u and v are vectors, · is the dot product, and ‖·‖ is the magnitude of a vector.</p>
        <p>We define our first FERM:</p>
        <disp-formula>\mathrm{FERM}_1 = \frac{1}{C} \sum_{c=1}^{C} \frac{\dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\dfrac{1}{N_c (N - N_c)} \sum_{r_i, r_j \in P(\mathcal{R}_c,\, \mathcal{R} \setminus \mathcal{R}_c)} s(r_i, r_j)} \qquad (2)</disp-formula>
        <p>The intuition is quite simple: the numerator is the sum of cosine similarities of all unique pairs in a class, normalised by the number of unique pairs (i.e., an average). The denominator is the sum of cosine similarities of all unique pairs between samples in the class and samples out of the class, again normalised by the number of unique pairs (i.e., an average). This gives a ratio of intra-class similarity to inter-class similarity, which is then averaged across all classes, resulting in FERM 1.</p>
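        <p>A minimal numpy sketch of FERM 1 as reconstructed in Eq. (2) is given below; the function names and the (N, D) array layout for embeddings are illustrative assumptions, not part of the paper.</p>
        <preformat>
import numpy as np

def cosine_matrix(F):
    # Normalise rows to the unit hyper-sphere, so s(r_i, r_j) = r_i . r_j
    U = F / np.linalg.norm(F, axis=1, keepdims=True)
    return U @ U.T

def ferm1(features, labels):
    """features: (N, D) embeddings r = f(x); labels: (N,) class indices."""
    S = cosine_matrix(features)
    ratios = []
    for c in np.unique(labels):
        in_c = labels == c
        n_c = in_c.sum()
        # mean similarity over unique within-class pairs P(R_c);
        # subtracting n_c removes the diagonal self-similarities
        intra = (S[np.ix_(in_c, in_c)].sum() - n_c) / (n_c * (n_c - 1))
        # mean similarity over cross pairs P(R_c, R \ R_c)
        inter = S[np.ix_(in_c, ~in_c)].mean()
        ratios.append(intra / inter)
    return float(np.mean(ratios))
        </preformat>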
        <p>We can then define our second FERM:</p>
        <disp-formula>\mathrm{FERM}_2 = \frac{\sum_{c=1}^{C} \dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\sum_{c, k \,:\, c &lt; k} \dfrac{1}{N_c N_k} \sum_{r_i, r_j \in P(\mathcal{R}_c, \mathcal{R}_k)} s(r_i, r_j)} \qquad (3)</disp-formula>
        <p>The intuition is similar to the first FERM. The numerator remains the same after incorporating the outer sum (an average of cosine similarities of all unique pairs in a class, across all classes), but the denominator is now an average of cosine similarities of unique pairs between samples in the class and samples out of the class that have not yet been accounted for. In the first measure, only the unique pairs of samples in and out of a class are averaged; repeating this over the outer sum results in double counting across classes. FERM 2 prevents this double counting.</p>
        <p>We define our third FERM through the use of a centroid in terms of cosine similarities, a so-called 'angular centroid'. In the same way that the average Euclidean distance of one point to several other points can be represented as the distance of that one point to a Euclidean centroid of the points, the average angle between one point and several other points can be represented as the angle between that one point and an 'angular centroid' of the points. The centroid for a class c is defined as:</p>
        <disp-formula>\kappa_c = \frac{1}{N_c} \sum_{r \in \mathcal{R}_c} \frac{r}{\lVert r \rVert} \qquad (4)</disp-formula>
        <p>This can be interpreted as normalising all samples to the unit hyper-sphere, then finding the centroid point on the unit hyper-sphere by adding all normalised samples together and normalising the combined vector. We can then define our third FERM:</p>
        <disp-formula>\mathrm{FERM}_3 = \frac{\sum_{c=1}^{C} \dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\sum_{c, k \,:\, c \neq k} \dfrac{1}{N_c (C - 1)} \sum_{r \in \mathcal{R}_c} s(r, \kappa_k)} \qquad (5)</disp-formula>
        <p>The numerator term is still the same, but the denominator is now the average cosine similarity of samples within a class to the centroids of the other classes.</p>
        <p>Using the same notation as above, we can then define our fourth FERM:</p>
        <disp-formula>\mathrm{FERM}_4 = \frac{\sum_{c=1}^{C} \dfrac{2}{N_c^2 - N_c} \sum_{r_i, r_j \in P(\mathcal{R}_c)} s(r_i, r_j)}{\dfrac{1}{C^2 - C} \sum_{c \neq k} s(\kappa_c, \kappa_k)} \qquad (6)</disp-formula>
        <p>This further simplifies the calculation of the denominator to a comparison of the centroid of a class to the centroids of the other classes.</p>
        <p>For all FERMs, a higher score means better clustering. As the numerator and denominator of each FERM are averages of cosine similarities and thus bounded within [-1, 1], a positive score above 1.0 reflects more intra-class similarity than inter-class similarity.</p>
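        <p>The angular centroid and the fourth FERM can be sketched in the same style as the FERM 1 sketch above; this follows one plausible reading of Eqs. (4) and (6), and the helper names are our own.</p>
        <preformat>
import numpy as np

def angular_centroid(F):
    U = F / np.linalg.norm(F, axis=1, keepdims=True)
    k = U.mean(axis=0)               # add the normalised samples together
    return k / np.linalg.norm(k)     # normalise the combined vector

def ferm4(features, labels):
    classes = np.unique(labels)
    U = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = U @ U.T
    num, cents = 0.0, []
    for c in classes:
        in_c = labels == c
        n_c = in_c.sum()
        # average within-class pair similarity, summed over classes
        num += (S[np.ix_(in_c, in_c)].sum() - n_c) / (n_c * (n_c - 1))
        cents.append(angular_centroid(features[in_c]))
    K = np.stack(cents)
    G = K @ K.T                      # centroid-to-centroid cosines
    C = len(classes)
    inter = (G.sum() - np.trace(G)) / (C * (C - 1))
    return num / inter
        </preformat>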
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Data sets</title>
        <sec id="sec-2-4-1">
          <title>We have selected the following datasets.</title>
          <p>The intuition is similar to the first FERM. The numerator 4.3.1. Source Dataset
remains the same after incorporating the out sum (an
average of cosine similarities of all unique pairs in a class, ImageNet 1K (ImageNet) [25] A general image
across all classes), but the denominator is now an average dataset containing 1,000 common image classes with at
of cosine similarities of unique pairs between samples in least 1,000 total images in each class for a total of just over
the class and samples out of the class that has not yet been 1.3 million images in the training set. We use ImageNet
accounted for. Although, in the first measure, only the as the source dataset for all our experiments.
unique pairs of samples in and out of a class are averaged,
further repeating this (the outer sum) results in double 4.3.2. Target Datasets
counting across classes. FERM 2 prevents this double Caltech-256 (Caltech) [26] Pictures of objects
becounting. longing to 256 categories, with at least 80 images per</p>
          <p>We define our third FERM through the use of a cen- category. The Caltech dataset is a general image
clastroid in terms of cosine similarities, a so called ‘angular sification dataset similar to ImageNet but with orders
centroid’. In the same way that the average Euclidean of magnitude fewer training examples. It is generally
distance of one point to several other points can be rep- considered to be the most similar target dataset to
Imaresented as the distance of that one point to a Euclidean geNet and fixed weights pretrained on ImageNet tend to
centroid of points, the average angle between one point perform about as well as fine-tuned weights [22, 23].
and several other points can be represented as the
angle between that one point and an ‘angular centroid’ of
points. The centroid for a class  is defined as:
(4)</p>
          <p>FGVC Aircraft (Aircraft) [27] Contains 100 diferent
makes and models of aircraft with 6,667 training
examples and 3,333 test examples. The Aircraft dataset is a
finegrained image classification dataset that is very diferent
 =
1
∑︁</p>
          <p>∈ ‖‖
from ImageNet. Fixed weights pretrained on ImageNet
perform extremely poorly on this dataset [22, 23].</p>
          <p>Stanford Cars (Cars) [28] Contains 196 diferent
makes and models of cars with 8,144 training examples
and 8,041 test examples. The Cars dataset is also a
finegrained image classification dataset that is very diferent
from ImageNet and fixed weights pretrained on ImageNet
also perform extremely poorly on this dataset [22, 23].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <sec id="sec-3-1">
        <title>We performed two sets of experiments:</title>
        <p>Describable textures (DTD) [29] Consists of 3,760
training examples of texture images jointly annotated
with 47 attributes. While the DTD dataset is conceptually
very diferent to ImageNet recent results have shown that
ifxed weights pretrained on ImageNet perform
reasonably well on this dataset compared to fine-tuned weights
[22, 23].</p>
        <p>The ratio of the fixed features to fine-tuned results for
a model pretrained on ImageNet are shown in Table 4 for
all datasets.</p>
        <p>For each experiment we used the Inception v4
architecture [36] pretrained on ImageNet 1k. Using this model,
we compared the diferent FERMs on the diferent
target data sets: Aircraft, DTD, Cars, and Caltech-256. We
also used ImageNet 1k as a target data set to determine a
baseline score for each measure.</p>
        <p>During this evaluation, two pipelines were constructed:
one that utilises transformations of the data, and one that
does not. When determining how well classes are
clustered together, a forward pass of the unaltered data was
initially used, providing us with the exact feature
representation of that sample. During a standard deep learning
training process, samples are randomly flipped, scaled,
resized, and rotated. These samples incur a loss if
classiifed incorrectly, and so we expect the model to still learn
to classify those samples correctly. Therefore it is likely
that the feature representation of these randomly
transformed samples are still able to be represented in a well
1. Conducting experiments to compare the efec- formed feature space. Assuming the model adequately
tiveness of our score along with candidate scores classifies the transformed data, a measure that is robust
from other fields in measuring how well a model to these transformations (that is, does not change much
trained on the ImageNet 1K source dataset repre- in the presence or absence of transformations) would be
sents a particular known and stable target dataset. better than one that is not, as it would allow us to use
We use datasets where it is well known how well this during the training process.
ifxed pretrained ImageNet 1K weights perform on We explored the four proposed FERMs on the five
them so they make a good basis for comparison. target data sets (including ImageNet 1k) with the two
dif2. Using the above measures to detect ‘corruption’ ferent pipelines (with or without transformations). Each
or domain shift in the feature space. transformation experiment was also repeated five times,
as the transformations are random.</p>
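        <p>A hedged sketch of the two feature-extraction pipelines follows. We assume the timm package for an ImageNet-pretrained Inception v4 backbone, and standard torchvision transformations; the exact transform parameters and input normalisation are ours, not the paper's.</p>
        <preformat>
import timm
import torch
from torchvision import transforms

# num_classes=0 makes timm return pooled embeddings instead of logits
model = timm.create_model('inception_v4', pretrained=True, num_classes=0).eval()

plain = transforms.Compose([
    transforms.Resize(342),
    transforms.CenterCrop(299),      # Inception v4 expects 299x299 inputs
    transforms.ToTensor(),
])

augmented = transforms.Compose([     # mimics standard training-time augmentation
    transforms.RandomResizedCrop(299),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

@torch.no_grad()
def extract_features(loader):
    """Forward-pass a dataloader and collect (features, labels) as numpy."""
    feats, labels = [], []
    for x, y in loader:
        feats.append(model(x))
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()
        </preformat>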
      </sec>
      <sec id="sec-3-2">
        <p>We also investigated recent transferability scores that have been shown to perform well when measuring how well transfer learning will perform on a particular target dataset: LEEP [1], OTCE [3], and H-score [2].</p>
        <p>In addition to our proposed measures, we explored several other clustering measures. These were chosen by reviewing [8] and removing clustering scores that were not stable as dimensionality increased (showing large perturbations or sensitivity to outliers), or that scored overlapping clusters and well separated clusters similarly:
• Silhouette score [30]
• Davies Bouldin score [31]
• Calinski Harabasz score [32]
• Dunn score [33]
• RS index [8]
• Point Biserial Index [34]
• C√K index [35]</p>
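        <p>Three of the clustering measures listed above are available directly in scikit-learn; a minimal sketch, assuming features and integer labels as numpy arrays (the remaining indices would need custom implementations):</p>
        <preformat>
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def clustering_scores(features, labels):
    return {
        'silhouette': silhouette_score(features, labels, metric='cosine'),
        'davies_bouldin': davies_bouldin_score(features, labels),   # lower is better
        'calinski_harabasz': calinski_harabasz_score(features, labels),
    }
        </preformat>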
      </sec>
      <sec id="sec-3-3">
        <title>5.1.1. Results</title>
        <p>We know that fixed features pretrained on ImageNet 1k perform well on Caltech-256, moderately well on DTD, and poorly on Aircraft and Cars [22, 23], as shown by our ratios of fixed-feature to fine-tuned performance in Table 1. We use this as a proxy for a well formed feature space and expect a good score to reflect the same knowledge: a low score for Aircraft and Cars, a moderate score for DTD, a high score for Caltech-256, and a very high score for ImageNet.</p>
        <p>Comparisons between the different FERMs across the different target datasets, with and without transformations, can be seen in Table 1. Note that results with transformations are reported as means and standard deviations, as those experiments were repeated. The datasets in all tables are listed in order of the ratio of the performance of fixed features pretrained on ImageNet to the best fine-tuned model performance, using results from [22, 23] as a proxy for how well formed the feature space is.</p>
        <p>Both with and without transformations, ImageNet consistently scored highest, followed consistently by Caltech, except with FERM 4. For FERM 1 and 2, Aircraft and Cars score much lower than ImageNet and Caltech, with DTD in between. This is the same ordering as our proxy for a well formed feature space. Our results with and without random transformations of the data suggest that FERM 1 and 2 can reproduce this ordering consistently, and thus have potential as a way to measure how well formed the feature space is for a particular trained model and target task.</p>
        <p>Of the transferability measures, LEEP is the only score that consistently ranks ImageNet 1k and Caltech-256 as most transferable, in both the presence and absence of transformations; however, it ranks DTD as least transferable in both cases, which is incorrect. Given the scores are in the same order as the number of classes in each dataset, it seems likely that LEEP is affected by the number of classes.</p>
        <p>H-score also seems to be strongly affected by the number of classes, as its scores are close to proportional to the number of classes in the target dataset.</p>
        <p>Of the clustering measures, Silhouette score, Davies Bouldin score, Point Biserial Index, and C√K index also consistently rank ImageNet 1k and Caltech-256 as the most transferable, in the presence and absence of transformations. However, only Silhouette score ranks DTD as moderately transferable compared to the others. Point Biserial Index may also be strongly affected by the number of classes, as its scores are again close to proportional to the number of classes in the target dataset.</p>
        <p>In summary, when looking only at stable target datasets, our proposed scores FERM 1 and 2, as well as the clustering measure Silhouette score, are good candidates for measuring how well formed the feature space is for a given trained model and target task.</p>
      </sec>
      <sec id="sec-3-4">
        <title>5.2. Detecting and quantifying domain shifts</title>
        <p>We attempted to detect and quantify incremental domain shifts. As it is hard to concretely quantify different levels of domain shift, we reduce the problem to detecting levels of 'corruption'. 'Corruption' is defined as the presence of the target dataset mixed into the source dataset, where pure source data can be thought of as no domain shift, whilst pure target data can be thought of as complete domain shift. The level of corruption can then be quantified as the percentage of target data in the source dataset.</p>
        <p>We again started with an Inception v4 model pretrained on ImageNet 1K. We then incrementally shifted the domain by either adding target data to the evaluation set or removing source data from the evaluation set. The source samples are derived from the ImageNet 1k validation set, whilst the target samples are derived from the training set of Aircraft. The Aircraft dataset was used in this case as it was the most poorly represented by the pretrained model in our previous experiments. Each time we added more 'corruption' we used all measures from our previous experiments to score the feature space.</p>
        <p>Specifically, we created the evaluation set by randomly choosing 200 classes from ImageNet 1k, and then randomly choosing the same number of samples across those classes. Aircraft was combined with this in a similar way, that is, by randomly choosing the same number of samples across all 100 classes. The union of both creates the evaluation set.</p>
        <p>The feature representation of a sample is defined as r = f(x), where f(·) is the feature extractor from the trained source model. We expected that as the level of corruption increases (as more of the source dataset is replaced by the target dataset), the clustering of classes in the feature space degrades: features in the new classes are not clustered well, and thus the overall clustering score should decrease.</p>
        <p>Another way we approached the problem is by looking at the transferability measures. Since measures of transferability are largest when the source task is the same as the target task, we hypothesised that at 0% corruption (i.e., no domain shift) transferability scores will be high, and will slowly degrade with increasing levels of corruption.</p>
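        <p>A minimal sketch of how such a corrupted evaluation set could be assembled from pre-extracted features is given below; all names are illustrative, and the paper's exact sampling protocol may differ.</p>
        <preformat>
import numpy as np

def corrupted_eval_set(src_feats, src_labels, tgt_feats, tgt_labels,
                       corruption, n_total, rng):
    """corruption: fraction of evaluation samples drawn from the target pool."""
    n_tgt = int(round(corruption * n_total))
    n_src = n_total - n_tgt
    si = rng.choice(len(src_feats), size=n_src, replace=False)
    ti = rng.choice(len(tgt_feats), size=n_tgt, replace=False)
    feats = np.concatenate([src_feats[si], tgt_feats[ti]])
    # offset target labels so the two label spaces do not collide
    labels = np.concatenate([src_labels[si],
                             tgt_labels[ti] + src_labels.max() + 1])
    return feats, labels

# usage: rng = np.random.default_rng(0); repeat over corruption levels 0.0..1.0
        </preformat>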
        <p>We expect a measure that is good at detecting domain shift to start with a normalised score of 1 (or 0 if inversely proportional) with no domain shift, and to incrementally decrease to 0 (or increase to 1) as the domain is completely shifted. We would also like the measure to be monotonically decreasing (or increasing).</p>
      </sec>
      <sec id="sec-3-5">
        <title>5.2.1. Results</title>
        <p>For each different combination of source and target dataset we ran the experiment 10 times, as the selection of the examples for each class was random. The classes chosen from ImageNet were fixed to allow for a consistent comparison. The change in each of the different scores as the domain shifts to the target dataset of Aircraft can be seen in Figure 2. The scores have been normalised between 0 and 1. Although several of these runs were repeated and averaged, we did not plot error bars as they are largely uninformative, as seen in Section 5.1.1.</p>
        <p>[Figure 2: Normalised scores as the evaluation set shifts from ImageNet 1k to Aircraft, shown in three panels: cosine measures (cosine measures 1-3), clustering measures (silhouette score, davies bouldin score, calinski harabasz score, dunn score, rs index, point biserial index, C root K index), and transferability measures (leep, otce, h score).]</p>
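        <p>The min-max normalisation applied to each score trajectory before plotting can be sketched as follows; this is our reading of "normalised between 0 and 1".</p>
        <preformat>
import numpy as np

def normalise_trajectory(scores):
    """Rescale one measure's scores across corruption levels into [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)
        </preformat>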
      </sec>
      <sec id="sec-3-6">
        <title>5.2.2. Discussion</title>
        <p>The results in Figure 2 show that only the Point Biserial Index is almost entirely monotonic in its trend. Ignoring the last point (0 samples of ImageNet 1k), H-score seems to have strong potential to detect domain shift; however, more investigation is required to see why the final point is so far out of sequence.</p>
        <p>RS index, Davies Bouldin score, and Silhouette score also have sections of monotonic trend. Further work is required to make a strong claim about the ability of these measures to detect and quantify domain shift.</p>
        <p>The results of our FERMs are particularly interesting. If the points where there is only one example per class of either Aircraft or ImageNet are excluded (second from the left and right on the graph), the trend is almost monotonic from all ImageNet examples to all Aircraft examples. Also, the point where the score reduces significantly from the original ImageNet score is approximately the point where the dataset has shifted to the extent that its composition is more than 50% target data. The experiments with only one example from each class of either the source or the target dataset can be thought of as just adding noise, as intra-class distances cannot be measured with only one example per class. Thought of in this way, it is useful that our measure is strongly sensitive to this situation.</p>
        <p>More extensive work should be done to compare our methods with the Point Biserial Index and H-score across a broader range of domain shift applications.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>We have created a selection of new scores for evaluating how well a particular dataset is represented by the current model weights and architecture. We have performed extensive experiments comparing our new scores with measures from other fields that could potentially be reused for this purpose. We compared the efficacy of these measures both for measuring how well existing model weights represent a new stable target dataset, and for detecting domain shift. The results of these experiments indicate that our new method, along with two others, has excellent potential for measuring how well a dataset is currently being represented by a model.</p>
      <p>Measures for this purpose have not been investigated before, and our results have strong implications for the wider deep learning community. These measures have the potential to be used to:
1. Detect domain shift and predict the best response in terms of model retraining.
2. Detect when an existing model has biases that make it unreliable for use on rarer data.
3. Predict the optimal way to train or retrain a model with limited training examples for a new or changing target dataset.</p>
      <p>There are a great many examples of ways these measures could be useful as an important part of an overall evaluation of a model; some of these are:
• Uncovering and quantifying biases in models. For example, how well is a model that is trained on mostly Caucasian faces likely to perform in identifying faces from other races?
• Quantifying how well prediction models based on historical data are representing data from the last few years that has changed due to COVID and other modern challenges. Once quantified, these measures could also give guidance on how to update models to better incorporate modern data.
• Highlighting when models are performing well on training and test data, but overfitting a poor representation that will not generalise well to new data; a classic example being the snow in the foreground being used to classify a husky versus a wolf in [17].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Thanks to Dawn Olley for editing services.</p>
      <sec id="sec-5-1">
        <title>References</title>
        <p>[15] G. E. Hinton, S. Roweis, Stochastic neighbor embedding, Advances in Neural Information Processing Systems 15 (2002).</p>
        <p>[16] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, J. Wilson, The what-if tool: Interactive probing of machine learning models, IEEE Transactions on Visualization and Computer Graphics 26 (2019) 56-65.</p>
        <p>[17] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135-1144.</p>
        <p>[18] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).</p>
        <p>[19] K. Q. Weinberger, J. Blitzer, L. Saul, Distance metric learning for large margin nearest neighbor classification, Advances in Neural Information Processing Systems 18 (2005).</p>
        <p>[20] E. Xing, M. Jordan, S. J. Russell, A. Ng, Distance metric learning with application to clustering with side-information, Advances in Neural Information Processing Systems 15 (2002).</p>
        <p>[21] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (2010).</p>
        <p>[22] S. Kornblith, J. Shlens, Q. V. Le, Do better ImageNet models transfer better?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2661-2671.</p>
        <p>[23] J. Plested, X. Shen, T. Gedeon, Non-binary deep transfer learning for image classification, arXiv e-prints (2021) arXiv:2107.08585.</p>
        <p>[24] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 77-91.</p>
        <p>[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: CVPR09, 2009.</p>
        <p>[26] G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset, authors.library.caltech.edu (2007).</p>
        <p>[27] Y. Cui, F. Zhou, Y. Lin, S. Belongie, Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1153-1162.</p>
        <p>[28] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, A. Vedaldi, Fine-grained visual classification of aircraft, Technical Report, Toyota Technological Institute at Chicago, 2013. arXiv:1306.5151.</p>
        <p>[29] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, A. Vedaldi, Describing textures in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3606-3613.</p>
        <p>[30] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53-65.</p>
        <p>[31] D. L. Davies, D. W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence (1979) 224-227.</p>
        <p>[32] T. Caliński, J. Harabasz, A dendrite method for cluster analysis, Communications in Statistics - Theory and Methods 3 (1974) 1-27.</p>
        <p>[33] J. C. Dunn, Well-separated clusters and optimal fuzzy partitions, Journal of Cybernetics 4 (1974) 95-104.</p>
        <p>[34] G. W. Milligan, A Monte Carlo study of thirty internal criterion measures for cluster analysis, Psychometrika 46 (1981) 187-199.</p>
        <p>[35] D. Ratkowsky, A stopping rule and clustering method of wide applicability, Botanical Gazette 145 (1984) 518-523.</p>
        <p>[36] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seeger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Archambeau</surname>
          </string-name>
          ,
          <article-title>Leep: A new measure to evaluate transferability of learned representations</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7294</fpage>
          -
          <lpage>7305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <article-title>An information-theoretic approach to transferability in task transfer learning</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Image Processing (ICIP)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>2309</fpage>
          -
          <lpage>2313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Otce: A transferability metric for cross-domain cross-task representations</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>15779</fpage>
          -
          <lpage>15788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , T. Hassner,
          <article-title>Transferability and hardness of supervised classification tasks</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1395</fpage>
          -
          <lpage>1405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] B. Neyshabur, H. Sedghi, C. Zhang, What is being transferred in transfer learning?, arXiv preprint arXiv:2008.11687 (2020).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] H. Liu, M. Long, J. Wang, M. I. Jordan, Towards understanding the transferability of deep representations, arXiv preprint arXiv:1909.12031 (2019).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] X. Shen, J. Plested, S. Caldwell, T. Gedeon, Exploring biases and prejudice of facial synthesis via semantic latent space, in: 2021 International Joint Conference on Neural Networks (IJCNN), IEEE, 2021, pp. 1-8.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] N. Tomašev, M. Radovanović, Clustering evaluation in high-dimensional data, in: Unsupervised Learning Algorithms, Springer, 2016, pp. 71-107.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179-188.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>and other modern challenges</article-title>
          .
          <source>Once</source>
          <volume>quantified</volume>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <article-title>Multifaceted feathese measures could also give guidance on how ture visualization: Uncovering the diferent types to update models to better incorporate modern of features learned by each neuron in deep neural data</article-title>
          .
          <source>networks, arXiv preprint arXiv:1602.03616</source>
          (
          <year>2016</year>
          ).
          <article-title>• Highlighting when models are performing well</article-title>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          , H. Lipson,
          <article-title>on training and test data, but overfitting a poor Understanding neural networks through deep visurepresentation that will not generalise well to alization</article-title>
          ,
          <source>arXiv preprint arXiv:1506.06579</source>
          (
          <year>2015</year>
          ).
          <article-title>new data. A classic example being the snow</article-title>
          in [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aubry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Understanding deep feathe foreground being used to classify a husky tures with computer-generated imagery</article-title>
          ,
          <source>in: Proversus a wolf in [17]. ceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2875</fpage>
          -
          <lpage>2883</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[13] T.-Y. Lin, S. Maji, Visualizing and understanding deep texture representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2791-2799.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[14] C. Vondrick, A. Khosla, H. Pirsiavash, T. Malisiewicz, A. Torralba, Visualizing object detection features, International Journal of Computer Vision 119 (2016) 145-158.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>