Now You See Me (CME): Concept-based Model Extraction
Dmitry Kazhdan (a,c), Botty Dimanov (a,c), Mateja Jamnik (a), Pietro Liò (a) and Adrian Weller (a,b)
(a) The University of Cambridge, UK
(b) The Alan Turing Institute, London, UK
(c) Equal contribution



Abstract

Deep Neural Networks (DNNs) have achieved remarkable performance on a range of tasks. A key step to further empowering DNN-based approaches is improving their explainability. In this work we present CME: a concept-based model extraction framework, used for analysing DNN models via concept-based extracted models. Using two case studies (dSprites, and Caltech-UCSD Birds), we demonstrate how CME can be used to (i) analyse the concept information learned by a DNN model, (ii) analyse how a DNN uses this concept information when predicting output labels, and (iii) identify key concept information that can further improve DNN predictive performance (for one of the case studies, we show how model accuracy can be improved by over 14%, using only 30% of the available concepts).

                                          Keywords
                                          interpretability, concept extraction, concept-based explanations, model extraction, latent space analysis, xai


1. Introduction

The black-box nature of Deep Neural Networks (DNNs) hinders their widespread adoption, especially in industries under heavy regulation with a high cost of error [1]. As a result, there has recently been a dramatic increase in research on Explainable AI (XAI), focusing on improving the explainability of DL systems [2, 3].
   Currently, the most widely used XAI methods are feature importance methods (also referred to as saliency methods) [4]. For a given data point, these methods provide scores showing the importance of each feature (e.g., pixel, patch, or word vector) to the algorithm's decision. Unfortunately, feature importance methods have been shown to be fragile to input perturbations [5, 6] or model parameter perturbations [7, 8]. Human experiments also demonstrate that feature importance explanations do not necessarily increase human understanding, trust, or ability to correct mistakes in a model [9, 10].
   As a consequence, two other types of XAI approaches are receiving increasing attention: model extraction approaches and concept-based explanation approaches. Model extraction methods (also referred to as model translation methods) approximate black-box models with simpler models in order to increase model explainability. Concept-based explanation approaches provide model explanations in terms of human-understandable units, rather than individual features, pixels, or characters (e.g., the concepts of a wheel and a door are important for the detection of cars) [10, 11, 12].
   In this paper we introduce CME¹: a (C)oncept-based (M)odel (E)xtraction framework². Figure 1 depicts how CME can be used to analyse DNN models via explainable concept-based extracted models, in order to explain and improve the performance of DNNs, as well as to extract useful knowledge from them. Although this example focuses on a CNN model, CME is model-agnostic, and can be applied to any DNN architecture.
   In particular, we make the following contributions:
   • We present the novel CME framework, capable of analysing DNN models via concept-based extracted models
   • We demonstrate, using two case studies, how CME can analyse (both quantitatively and qualitatively) the concept information a DNN model has learned, and how this information is represented across the DNN layers
   • We propose a novel metric for evaluating the quality of concept extraction methods
   • We demonstrate, using two case studies, how CME can analyse (both quantitatively and qualitatively) how a DNN uses concept information when predicting output labels
   • We demonstrate how CME can identify key concept information that can further improve DNN predictive performance
Title of the Proceedings: "Proceedings of the CIKM 2020 Workshops". Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi.
email: dk525@cam.ac.uk (D. Kazhdan); btd26@cam.ac.uk (B. Dimanov); mateja.jamnik@cl.cam.ac.uk (M. Jamnik); pietro.lio@cl.cam.ac.uk (P. Liò); adrian.weller@eng.cam.ac.uk (A. Weller)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
¹ Pronounced "See Me."
² All relevant code is available at https://github.com/dmitrykazhdan/CME
Figure 1: CME extracted model example. (a) Given an input image, a CNN uses the image's pixel information as input, and returns class information as output (in this case, class label 3, corresponding to the Red-headed Woodpecker class), performing data processing in a non-explainable, black-box fashion. (b) Given an input image, a CME extracted model uses an Input-to-Concept function (I-to-C) to compute concept information from the pixel data (e.g. bird wing color, or head color values). Next, the model uses a Concept-to-Output function (C-to-O) to compute the output class label from this concept information.

2. Related Work

2.1. Concept-based Explanations

Concept-based explanations have been used in a wide range of different ways, including: inspecting what a model has learned [12, 13], providing class-specific explanations [14, 10], and discovering causal relations between concepts [15]. Similarly to CME, these approaches typically seek to explain model behaviour in terms of high-level concepts, extracting this concept information from a model's latent space.
   Importantly, existing concept-based explanation approaches are typically capable of handling binary-valued concepts only, which implies that multi-valued concepts have to be binarised first. For instance, given a concept such as "shape", with possible values 'square' and 'circle', these approaches have to convert "shape" into two binary concepts 'is_square' and 'is_circle'. This makes such approaches (i) computationally expensive, since the binarised concept space usually has a high cardinality, and (ii) error-prone, since mutual exclusivity of concept values is no longer enforced (e.g., a single data point can now have both 'is_square' and 'is_circle' set to true). In contrast, our approach is capable of handling multi-valued concepts directly, without binarisation.
   Furthermore, concept-based explanation approaches typically rely on the latent space of a single layer when extracting concept information. DNNs have been shown to perform hierarchical feature extraction, with layers closer to the output utilising higher-level data representations, compared to layers closer to the input [16, 17]. This implies that choosing a single layer imposes an unnecessary trade-off between low- and high-level concepts. On the other hand, CME is capable of efficiently combining latent space information from multiple layers, thereby avoiding this constraint.
   Finally, existing methods typically represent concept explanations as a list of concepts, with their relative importance with respect to the classification task. In contrast, our approach describes the functional relationship between concepts and outputs, thereby showing in more detail how the model utilises concept information when making predictions.

2.2. Concept Bottleneck Models

Recent work on concept-based explanations relies on models that use an intermediate concept-based representation when making predictions [18, 19]. Work in [18] refers to these types of models as concept bottleneck models (CBMs). A concept bottleneck model is a model which, given an input, first predicts an intermediate set of human-specified concepts, and then uses only this concept information to predict the output task label. Work in [18] proposes a method for turning any DNN into a concept bottleneck model given concept annotations at training time. This is achieved by resizing one of the layers to match the number of concepts provided, and re-training with an added intermediate loss that encourages the neurons in that layer to align component-wise to the provided concepts.
   Crucially, CBM approaches provide ways of generating DNN models which are explicitly encouraged to rely on specified concept information. In contrast, our approach is used for analysing DNN models (and is much cheaper computationally).
   Furthermore, CBM approaches require concept annotations to be available at training time for all of the training data, which is often expensive to produce. In contrast, CME can be used with partially-labelled datasets in a semi-supervised fashion, as will be described in Section 3.
   Finally, CBM approaches require the concepts themselves to be known beforehand. On the other hand, CME can efficiently utilise knowledge contained in pre-trained DNNs, in order to learn which concepts are/aren't required for a given task. Further details on the CME/CBM comparison can be found in Appendix A.
2.3. Model Extraction

Model extraction techniques use rules [20, 21, 22], decision trees [23, 24], or other more readily explainable models [25] to approximate complex models, in order to study their behaviour. Provided the approximation quality (referred to as fidelity) is high enough, an extracted model can preserve many statistical properties of the original model, while remaining open to interpretation.
   However, extracted models generated by existing methods represent their decision-making using the same input representation as the original model, which is typically difficult for the user to understand directly. Instead, our extracted models represent decision-making via human-understandable concepts, making them easier to interpret.

3. Methodology

In this section we present our CME approach, describing how it can be used to analyse DNN models using concept-based extracted models.

3.1. Formulation

We consider a pre-trained DNN classifier 𝑓 ∶ 𝒳 → 𝒴 (𝒳 ⊂ ℝⁿ, 𝒴 ⊂ ℝᵒ), where 𝑓(𝐱) = 𝑦 maps an input 𝐱 ∈ 𝒳 to an output class 𝑦 ∈ 𝒴. For every DNN layer 𝑙, we denote by 𝑓ˡ ∶ 𝒳 → ℋˡ (ℋˡ ⊂ ℝᵐ) the mapping from the input space 𝒳 to the hidden representation space ℋˡ, where 𝑚 denotes the number of hidden units, and can be different for each layer.
   Similarly to [18, 19], we assume the existence of a concept representation 𝒞 ⊂ ℝᵏ, defining 𝑘 distinct concepts associated with the input data. 𝒞 is defined such that every basis vector in 𝒞 spans the space of possible values for one particular concept. We further assume the existence of a function 𝑝⋆ ∶ 𝒳 → 𝒞, where 𝑝⋆(𝐱) = 𝐜 maps an input 𝐱 to its concept representation 𝐜. Thus, 𝑝⋆ defines the concepts and their values (referred to as the ground truth concepts) for every input point.

3.2. CME

In this work, we define a DNN 𝑓 as being concept-decomposable if it can be well-approximated by a composition of functions 𝑝 and 𝑞, such that 𝑓(𝐱) = 𝑞(𝑝(𝐱)). In this definition, the function 𝑝 ∶ 𝒳 → 𝒞 is an input-to-concept function, mapping data points from their input representation 𝐱 ∈ 𝒳 to their concept representation 𝐜 ∈ 𝒞. The function 𝑞 ∶ 𝒞 → 𝒴 is a concept-to-output function, mapping data points in their concept representation to the output space 𝒴. Thus, when processing an input 𝐱, a DNN 𝑓 can be seen as converting this input into an interpretable concept representation using 𝑝, and using 𝑞 to predict the output from this representation. The significance of this decomposition is further discussed in Appendix A.
   CME explores whether a given DNN 𝑓 is concept-decomposable, by attempting to approximate 𝑓 with an extracted model 𝑓̂ ∶ 𝒳 → 𝒴. In this case, 𝑓̂ is defined as 𝑓̂(𝐱) = 𝑞̂(𝑝̂(𝐱)), using an input-to-concept function 𝑝̂ and a concept-to-output function 𝑞̂ extracted by CME from the original DNN. We describe our approach to extracting 𝑝̂ and 𝑞̂ in the remainder of this section.

3.3. Input-to-Concept (𝑝̂)

When extracting 𝑝̂ from a pre-trained DNN, we assume we have access to the DNN training data and labels {(𝐱⁽⁰⁾, 𝑦⁽⁰⁾), ..., (𝐱⁽ᵈ⁾, 𝑦⁽ᵈ⁾)}. Furthermore, we assume partial access to 𝑝⋆, such that a small set of 𝑖 training points {𝐱⁽⁰⁾, ..., 𝐱⁽ⁱ⁻¹⁾} have concept labels {𝐜⁽⁰⁾, ..., 𝐜⁽ⁱ⁻¹⁾} associated with them, while the remaining 𝑢 points {𝐱⁽ⁱ⁾, ..., 𝐱⁽ⁱ⁺ᵘ⁾} do not (in this case 𝑢 = 𝑑 − 𝑖). We refer to these subsets respectively as the concept labelled dataset and the concept unlabelled dataset. Using these datasets, we generate 𝑝̂ by aggregating concept label predictions across multiple layers of the given DNN model, as described below.
   Given a DNN layer 𝑙 with 𝑚 hidden units, we compute the layer's representation of the input data 𝐡 = 𝑓ˡ(𝐱), obtaining (𝐡⁽⁰⁾, ..., 𝐡⁽ⁱ⁺ᵘ⁾). Using this data and the concept labels, we construct a semi-supervised dataset, consisting of labelled data {(𝐡⁽⁰⁾, 𝐜⁽⁰⁾), ..., (𝐡⁽ⁱ⁻¹⁾, 𝐜⁽ⁱ⁻¹⁾)} and unlabelled data {𝐡⁽ⁱ⁾, ..., 𝐡⁽ⁱ⁺ᵘ⁾}.
   Next, we rely on Semi-Supervised Multi-Task Learning (SSMTL) [26] in order to extract a function 𝑔ˡ ∶ ℋˡ → 𝒞, which predicts concept labels from layer 𝑙's hidden space. In this work, we treat each concept as a separate, independent task. Hence, 𝑔ˡ(𝐡) is decomposed into 𝑘 separate tasks (one per concept), and is defined as 𝑔ˡ(𝐡) = (𝑔₁ˡ(𝐡), ..., 𝑔ₖˡ(𝐡)), where each 𝑔ᵢˡ(𝐡) (𝑖 ∈ {1..𝑘}) predicts the value of concept 𝑖 from 𝐡.
   Repeating this process for all model layers 𝐿, we obtain a set of functions 𝐺 = {𝑔ᵢˡ | 𝑙 ∈ {1..𝐿} ∧ 𝑖 ∈ {1..𝑘}}. For every concept 𝑖, we define the "best" layer 𝑙ⁱ for predicting that concept as shown in (1):

   \( l^i = \arg\min_{l \in L} \, \ell(g_i^l, i) \)   (1)

Here, 𝓁 is a loss function (in this case the error rate), computing the predictive loss of function 𝑔ᵢˡ with respect to concept 𝑖. Finally, we define 𝑝̂ as shown in (2):

   \( \hat{p}(\mathbf{x}) = \big( g_1^{l^1} \circ f^{l^1}(\mathbf{x}), \, \dots, \, g_k^{l^k} \circ f^{l^k}(\mathbf{x}) \big) \)   (2)

   Thus, given an input 𝐱, the value computed by 𝑝̂(𝐱) for every concept 𝑖 ∈ {1..𝑘} is equal to the value computed by 𝑔ᵢ from that input's representation in layer 𝑙ⁱ. Overall, 𝑝̂ encapsulates concept information contained in a given DNN model, and can be used to analyse how this information is represented, as well as to predict concept values for new inputs.
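As an illustration of Equations (1) and (2), the following minimal sketch (our own, not the released CME implementation) fits one semi-supervised concept predictor per layer using scikit-learn's LabelSpreading (the semi-supervised model reported in Section 5.1), selects the best layer per concept by error rate on the labelled subset, and composes the selected predictors into 𝑝̂. The dictionary of layer activations and the array shapes are assumptions:

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    def extract_p_hat(layer_activations, concept_labels, n_labelled):
        """layer_activations: dict layer_name -> (n_samples, n_units) array.
        concept_labels: (n_labelled, k) array of integer concept values.
        Returns, per concept, the selected layer and its fitted predictor."""
        k = concept_labels.shape[1]
        p_hat = {}
        for concept in range(k):
            best = None
            for layer, h in layer_activations.items():
                # Semi-supervised target vector: -1 marks unlabelled points.
                y = np.full(h.shape[0], -1)
                y[:n_labelled] = concept_labels[:, concept]
                g = LabelSpreading().fit(h, y)
                # Error rate on the labelled subset, a stand-in for the loss in Eq. (1).
                err = 1.0 - g.score(h[:n_labelled], y[:n_labelled])
                if best is None or err < best[0]:
                    best = (err, layer, g)
            p_hat[concept] = (best[1], best[2])  # (selected layer l_i, predictor g_i)
        return p_hat

    def predict_concepts(p_hat, layer_activations):
        """Implements Eq. (2): each concept is predicted from its selected layer."""
        return np.stack(
            [g.predict(layer_activations[layer]) for layer, g in p_hat.values()],
            axis=1)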
3.4. Concept-to-Label (𝑞̂)

We set up the extraction of 𝑞̂ as a classification problem, in which we train 𝑞̂ to predict output labels 𝑦 from concept labels 𝐜 predicted by 𝑝̂. We use 𝑝̂ to generate concept labels for all training data points, obtaining a set of concept labels {𝐜⁽⁰⁾, ..., 𝐜⁽ⁱ⁺ᵘ⁾}. Next, we produce a labelled dataset, consisting of concept labels and corresponding DNN output labels {(𝐜⁽⁰⁾, 𝑦⁽⁰⁾), ..., (𝐜⁽ⁱ⁺ᵘ⁾, 𝑦⁽ⁱ⁺ᵘ⁾)}, and use it to train 𝑞̂ in a supervised manner. We experimented with using Decision Trees (DTs) and Logistic Regression (LR) models for representing 𝑞̂, as will be discussed in Section 5. Overall, 𝑞̂ can be used to analyse how a DNN uses concept information when making predictions.
predictions.                                                         Task 2 model in the rest of this work.


4. Experimental Setup

We evaluated CME using two datasets: dSprites [27], and Caltech-UCSD Birds [28]. All relevant code is publicly available.³

4.1. dSprites

dSprites is a well-established dataset used for evaluating unsupervised latent factor disentanglement approaches. dSprites consists of 2D 64×64 pixel black-and-white shape images, procedurally generated from all possible combinations of 6 ground truth independent concepts (color, shape, scale, rotation, x and y position). Further details can be found in Appendix B, and in the official dSprites repository.⁴

³ https://github.com/dmitrykazhdan/CME
⁴ https://github.com/deepmind/dsprites-dataset/

4.1.1. Classification Tasks

We define two classification tasks, used to evaluate our framework (a code sketch of the corresponding label construction is given at the end of this subsection):
   • Task 1: This task consists of determining the shape concept value from an input image. For every image sample, we define its task label as the shape concept label of that sample.
   • Task 2: This task consists of discriminating between all possible shape and scale concept value combinations. We assign a distinct identifier to each possible combination of the shape and scale concept labels. For every image sample, we define its task label as the identifier corresponding to this sample's shape and scale concept values.
   Overall, Task 1 explores a scenario in which a DNN has to learn to recognise a specific concept from an input image. Task 2 explores a relatively more complex scenario, in which a DNN has to learn to recognise combinations of concepts from an input image.
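For illustration, Task 2 labels can be constructed as follows; this is a hedged sketch assuming the dSprites shape and scale concept values are available as integer arrays (function and variable names are ours):

    import numpy as np

    def task2_labels(shape_values, scale_values, n_scales=6):
        """Assign one distinct identifier per (shape, scale) combination."""
        shape_values = np.asarray(shape_values)
        scale_values = np.asarray(scale_values)
        # dSprites has 3 shapes and 6 scales, giving 18 Task 2 classes.
        return shape_values * n_scales + scale_values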
4.1.2. Model

We trained a Convolutional Neural Network (CNN) model [29] for each task. Both models had the same architecture, consisting of 3 convolutional layers, 2 dense layers with ReLUs, 50% dropout [30] and a softmax output layer. The models were trained using categorical cross-entropy loss, and achieved 100.0 ± 0.0% classification accuracies on their respective held-out test sets. We refer to these models as the Task 1 model and the Task 2 model in the rest of this work.
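A Keras sketch matching this description is given below; filter counts, kernel sizes and dense-layer widths are our assumptions, since the paper only specifies the layer types:

    from tensorflow.keras import layers, models

    def build_dsprites_cnn(n_classes):
        """3 conv layers, 2 dense ReLU layers, 50% dropout, softmax output."""
        return models.Sequential([
            layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 1)),
            layers.MaxPooling2D(),
            layers.Conv2D(64, 3, activation="relu"),
            layers.MaxPooling2D(),
            layers.Conv2D(64, 3, activation="relu"),
            layers.Flatten(),
            layers.Dense(128, activation="relu"),
            layers.Dropout(0.5),
            layers.Dense(64, activation="relu"),
            layers.Dense(n_classes, activation="softmax"),
        ])

    model = build_dsprites_cnn(n_classes=3)  # e.g. Task 1: 3 shape classes
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])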
4.1.3. Ground-truth Concept Information

Importantly, the task and dataset definitions described in this section imply that we know precisely which concepts the models had to learn in order to achieve 100.0 ± 0.0% task performances (shape for Task 1, and shape and scale for Task 2). We refer to this as the ground truth concept information learned by these models.

4.2. Caltech-UCSD Birds (CUB)

For our second dataset, we used Caltech-UCSD Birds 200 2011 (CUB). This dataset consists of 11,788 images of 200 bird species, with every image annotated using 312 binary concept labels (e.g. beak and wing colour, shape, and pattern). We relied on the concept pre-processing steps defined in [18] (used for de-noising concept annotations, and filtering out outlier concepts), which produce a refined set of 𝑘 = 112 binary concept labels for every image sample.

4.2.1. Classification Task

We relied on the standard CUB classification task, which consists of predicting the bird species from an input image.

4.2.2. Model

We used the Inception-v3 architecture [31], pretrained on ImageNet [32] (except for the fully-connected layers) and fine-tuned end-to-end on the CUB dataset, following the preprocessing practices described in [33]. The model achieved 82.7 ± 0.4% classification accuracy on a held-out test set. We refer to this model as the CUB model in the rest of this work.

4.2.3. Ground-truth Concept Information

Unlike dSprites, the CUB dataset does not explicitly define how the available concepts relate to the output task. Thus, we do not have access to the ground truth concept information learned by the CUB model.

4.3. Benchmarks

We compare the performance of our CME approach to two other benchmarks, described in the remainder of this section.

4.3.1. Net2Vec

We rely on the work in [34] for defining benchmark 𝑝̂ functions for the three tasks. Work in [34] attempts to predict the presence/absence of concepts from spatially-averaged hidden layer activations of the convolutional layers of a CNN model. Given a binary concept 𝑐, this approach trains a logistic regressor, predicting the presence/absence of this concept in an input image from the latent representation of a given CNN layer. In the case of multi-valued concepts, the concept space has to be binarised, as discussed in Section 2.1. In this case, the binarised concept value with the highest likelihood is returned.
   Unlike CME, [34] does not provide a way of selecting the convolutional layer to use for concept extraction. We consider the best-case scenario by selecting, for all tasks, the convolutional layers yielding the best concept extraction performance. For all tasks, these layers were the convolutional layers closest to the output (the 3rd convolutional layer in the case of the dSprites tasks, and the final inception block output layer in the case of the CUB task).
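A simplified sketch of this benchmark 𝑝̂ (our rendering of the spatial-averaging step, not the original Net2Vec code):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def net2vec_p_hat(conv_maps, binary_concept_labels):
        """conv_maps: (n_samples, H, W, n_filters) activations of one conv layer.
        binary_concept_labels: (n_samples, n_binary_concepts) 0/1 matrix.
        Trains one logistic regressor per binarised concept on the
        spatially-averaged filter activations."""
        features = conv_maps.mean(axis=(1, 2))  # spatial average -> (n_samples, n_filters)
        return [LogisticRegression(max_iter=1000).fit(features, binary_concept_labels[:, j])
                for j in range(binary_concept_labels.shape[1])]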
4.3.2. CBM

As discussed in Section 4.2.3, we do not have access to the ground truth concept information learned by the CUB model. Instead, we rely on the pre-trained sequential bottleneck model defined in [18] (referred to as CBM in the rest of this work). CBM is a bottleneck model, obtained by resizing one of the layers of the CUB model to match the number of concepts provided (we refer to this as the bottleneck layer), and training the model in two steps. First, the sub-model consisting of the layers between the input layer and the bottleneck layer (inclusive) is trained to predict concept values from input data. Next, the sub-model consisting of the layers between the layer following the bottleneck layer and the output layer is trained to predict task labels from the concept values predicted by the first sub-model. Hence, this bottleneck model is guaranteed to rely solely on concept information that is learnable from the data when making task label predictions. Thus, this benchmark serves as an upper bound for the concept information learnable from the dataset, and for the task performance achievable using this information. Importantly, CBM does not attempt to approximate/analyse the CUB model, but instead attempts to solve the same classification task using concept information only.
   We use the first CBM sub-model as a 𝑝̂ benchmark, representing the upper bound of concept information learnable from the data. We use the second sub-model as a 𝑞̂ benchmark, representing the upper bound of task performance achievable from predicted concept information only. Finally, we use the entire model as an 𝑓̂ benchmark. We make use of the saved trained model from [18], available in their official repository.⁵

⁵ https://github.com/yewsiang/ConceptBottleneck

5. Results

We present the results obtained by evaluating our approach using the two case studies described above.
   We obtain the concept labelled dataset by returning the ground-truth concept values for a random set of samples in the model training data. For dSprites, we found that a concept labelled dataset of 100 samples or more worked well in practice for both tasks. Thus, we fix the size of the concept labelled dataset to 100 in all of the dSprites experiments. For CUB, we found that a concept labelled dataset containing 15 or more samples per class worked well in practice. Thus, we fix the size of the concept labelled dataset to 15 samples per class in all of the CUB experiments. In the future, we intend to explore the variation of model extraction performance with the size of the concept labelled dataset in more detail.

Figure 2: Predictive accuracy of CME and Net2Vec 𝑝̂ functions for all concepts. (a) Task 1; (b) Task 2.
5.1. Concept Prediction Performance

First, we evaluate the quality of the 𝑝̂ functions produced by CME, Net2Vec, and CBM. For both dSprites tasks, we relied on the Label Spreading semi-supervised model [35], provided in scikit-learn [36], when learning the 𝑔ᵢˡ functions for CME. For CUB, we used logistic regression functions instead, as they gave better performance.
5.1.1. dSprites

Figure 2 shows the predictive performance of the 𝑝̂ functions on all concepts for the two dSprites tasks (averaged over 5 runs). As discussed in Section 4.1.1, we have access to the ground truth concept information learned by these models (shape concept information for Task 1, and shape and scale concept information for Task 2). For both tasks, 𝑝̂ functions extracted by CME successfully achieved high predictive accuracy on concepts relevant to the tasks, whilst achieving a low performance on concepts irrelevant to the tasks. Thus, CME was able to successfully extract the concept information contained in the task models. For both tasks, 𝑝̂ functions extracted by Net2Vec achieved a much lower performance on the relevant concepts.

5.1.2. CUB

As discussed in Section 4.2.3, the CUB dataset does not explicitly define how the concepts relate to the output task labels. Thus, we do not know how relevant/important different concepts are with respect to task label prediction. In this section, we make the conservative assumption that all concepts are relevant when evaluating 𝑝̂ functions, and explore relative concept importance in more detail in Section 5.3.
   Firstly, we relied on the 'average-per-concept' metrics introduced in [18] when evaluating the 𝑝̂ function performances, by computing their 𝐹1 predictive scores for each concept, and then averaging over all concepts. We obtained 𝐹1 scores of 92 ± 0.5%, 86.3 ± 2.0%, and 85.9 ± 2.3% for the CBM, CME, and Net2Vec 𝑝̂ functions, respectively (averaged over 5 runs).
   Importantly, we argue that in the case of a large number of concepts, it is crucial to measure how concept mispredictions are distributed across the test samples. For instance, consider a dSprites Task 2 𝑝̂ function that achieves 90% predictive accuracy on both the shape and scale concepts. The average predictive accuracy on relevant concepts achieved by this 𝑝̂ will therefore be 90%. However, if the two concepts are mis-predicted for strictly different samples (i.e. none of the samples have both shape and scale predicted incorrectly at the same time), this means that 20% of the test samples will have one relevant concept predicted incorrectly. Given that both concepts need to be predicted correctly when using them for task label prediction, this implies that consequent task label prediction will not be able to achieve over 80% task label accuracy. This effect becomes even more pronounced in the case of a larger number of relevant concepts.
   Consequently, we defined a novel cumulative misprediction error metric, which we refer to as the 'misprediction-overlap' (MPO) metric. Given a test set 𝑇 = {(𝐱⁽⁰⁾, 𝐜⁽⁰⁾), ..., (𝐱⁽ⁿ⁾, 𝐜⁽ⁿ⁾)} consisting of 𝑛 + 1 input samples 𝐱 with corresponding concept labels 𝐜, and a prediction set 𝑃 = {𝐜̂⁽⁰⁾, ..., 𝐜̂⁽ⁿ⁾}, MPO computes the fraction of samples in the test set that have at least 𝑚 relevant concepts predicted incorrectly, as shown in Equation 3 (where 𝕀(·) denotes the indicator function):

   \( MPO(T, P, m) = \frac{1}{n} \sum_{i=0}^{n} \mathbb{I}\big( err(\mathbf{c}^{(i)}, \hat{\mathbf{c}}^{(i)}) \geq m \big) \)   (3)

Here, 𝑒𝑟𝑟 can be used to specify which concepts to measure the misprediction error on (i.e. in case some of the provided concepts are irrelevant). Under our assumption of all concepts being relevant, we defined 𝑒𝑟𝑟 as shown in Equation 4:

   \( err(\mathbf{c}^{(i)}, \hat{\mathbf{c}}^{(i)}) = \sum_{j=1}^{k} \mathbb{I}\big( c^{(i)}_{j} \neq \hat{c}^{(i)}_{j} \big) \)   (4)

   Using a held-out test set, we plot the MPO metric values for 𝑚 ∈ [0, ..., 112], as shown in Figure 3 (averaged over 5 runs). Importantly, 𝑝̂ function performances can be evaluated by observing their MPO scores for different values of 𝑚. A larger MPO score implies that a bigger proportion of samples had at least 𝑚 relevant concepts predicted incorrectly.
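A small illustrative implementation of the MPO metric of Equations 3 and 4 (our own sketch, assuming concepts are encoded as equal-shaped integer arrays):

    import numpy as np

    def mpo(true_concepts, predicted_concepts, m):
        """Fraction of test samples with at least m concepts predicted incorrectly.
        true_concepts, predicted_concepts: (n_samples, k) arrays."""
        errors_per_sample = (true_concepts != predicted_concepts).sum(axis=1)  # err(c, c-hat)
        return float(np.mean(errors_per_sample >= m))

    # Example: the MPO curve over m = 0..k, as plotted in Figure 3.
    # curve = [mpo(c_true, c_pred, m) for m in range(c_true.shape[1] + 1)]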
Figure 3: Performances of 𝑝̂ functions, evaluated using the MPO metric. The green line plots the case for perfect prediction, when the predicted concepts are equivalent to the ground truth concepts (i.e. the 𝑝⋆ performance), in which case MPO = 1 for 𝑚 = 0, and MPO = 0 otherwise. Net2Vec obtained values within 1% deviation from the corresponding CME values for all 𝑚, and is therefore omitted here for simplicity.

   Overall, CME performed almost identically to Net2Vec, and worse than CBM, according to the MPO metric. The similar performance to Net2Vec is likely caused by (i) the concepts being binary (requiring no binarisation), and (ii) the Inception-v3 model having a relatively large number of convolutional layers, implying that the final convolutional layer likely learned higher-level features relevant to concept prediction.
   Importantly, MPO showed that both the CBM and CME 𝑝̂ functions had a significant proportion of test samples with incorrectly-predicted relevant concepts (e.g. CME had an MPO score of 0.25 at 𝑚 = 4, implying that 25% of all test samples have at least 4 concepts predicted incorrectly). In practice, these mispredictions can have a significant impact on consequent task label predictive performance, as will be further explored in the next section.

5.2. Task Performance

In this section, we evaluate the fidelity and performance of the extracted 𝑓̂ models. For all CME and Net2Vec 𝑝̂ functions evaluated in the previous section, we trained concept-to-output functions 𝑞̂, predicting class labels from the 𝑝̂ concept predictions. Next, for every 𝑝̂, we defined its corresponding 𝑓̂ as discussed in Section 3, via a composition of 𝑝̂ and its associated 𝑞̂. For every 𝑓̂, we evaluated its fidelity and its task performance, using a held-out test set. Table 1 shows the fidelity of the extracted models, and Table 2 shows the task performance for these models (averaged over 5 runs). The original Task 1, Task 2, and CUB models achieved task performances of 100±0%, 100±0%, and 82.7±0.4%, respectively, as described in Section 4.

Table 1
Fidelity of extracted 𝑓̂ models

             CME           CBM          Net2Vec
   Task 1    100.0±0.0%    –            24.5±3.6%
   Task 2    99.3±0.5%     –            38.3±4.0%
   CUB       74.42±3.1%    77.5±0.2%    73.8±2.8%

Table 2
Task performance of extracted 𝑓̂ models

             CME           CBM          Net2Vec
   Task 1    100.0±0%      –            24.5±3.6%
   Task 2    99.3±0.5%     –            38.3±4.0%
   CUB       70.8±1.8%     75.7±0.6%    69.8±1.5%

   For both dSprites tasks, CME 𝑓̂ models achieved high (99%+) fidelity and task performance scores, indicating that CME successfully approximated the original dSprites models. Furthermore, these scores were considerably higher than those produced by the Net2Vec 𝑓̂ models.
   For the CUB task, both the CME and Net2Vec 𝑓̂ models achieved relatively lower fidelity and task performance scores (in this case, the performance of CME was very similar to that of Net2Vec). Crucially, the CBM model also achieved relatively low fidelity and accuracy scores (as anticipated from our MPO metric analysis). This implies that the concept information learnable from the data is insufficient for achieving high task accuracy. Hence, the relatively high CUB model accuracy has to be caused by the CUB model relying on other, non-concept information. Thus, the low fidelity of CME and Net2Vec is a consequence of the CUB model being non-concept-decomposable, implying that its behaviour cannot be explained by the desired concepts. The next section discusses possible approaches to fixing this issue.
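Fidelity here is measured as agreement between the extracted model and the original DNN on the same inputs; a minimal sketch of the two scores reported in Tables 1 and 2 (our helper functions, assuming label predictions are available as arrays):

    import numpy as np

    def fidelity(extracted_labels, dnn_labels):
        """Fraction of samples on which the extracted model f-hat agrees
        with the original DNN f (its approximation target)."""
        return float(np.mean(np.asarray(extracted_labels) == np.asarray(dnn_labels)))

    def task_performance(extracted_labels, ground_truth_labels):
        """Accuracy of the extracted model against the true task labels."""
        return float(np.mean(np.asarray(extracted_labels) == np.asarray(ground_truth_labels)))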
5.3. Intervening

In the previous section, we demonstrated how CME can be used to identify whether a model relies on desired concepts during decision-making. In this section, we demonstrate how CME can be used to suggest model improvements, aligning model behaviour with the desired concepts.
   We trained a logistic regression 𝑞̂ model predicting task labels from ground-truth concept labels for the CUB task, obtaining an accuracy score of 96.4 ± 0.5% on a held-out test set (averaged over 5 runs). Using this model's coefficient magnitudes as a measure of concept importance, we discovered that the 32 most important concepts identified this way were sufficient for achieving over 96% task accuracy using logistic regression.
   Using this reduced concept set, we inspected how our CUB 𝑞̂ function performances would change if their corresponding 𝑝̂ functions extracted these concepts perfectly. This was achieved by taking the 𝑝̂ predictions of these concepts on the test and training sets, setting the values of the top 𝑖 most important concepts to their ground truth values, training logistic regression 𝑞̂ functions on these modified training sets, and measuring their accuracies on the modified test sets (this approach is referred to as concept intervention in the rest of this work). The results are shown in Figure 4, with 𝑖 ranging from 0 to 32.

Figure 4: The task accuracy of 𝑞̂ functions, trained on concepts predicted by 𝑝̂ functions, with the top 𝑖 most important concepts set to their ground truth values. The performance of Net2Vec was very similar to that of CME, and is thus omitted here for simplicity.
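A sketch of this concept-intervention experiment (our illustrative code; the dictionary keys and the importance_order argument, derived from the logistic-regression coefficient magnitudes described above, are assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def intervene(pred_concepts, true_concepts, importance_order, top_i):
        """Replace the top_i most important predicted concepts with ground truth."""
        corrected = pred_concepts.copy()
        idx = importance_order[:top_i]
        corrected[:, idx] = true_concepts[:, idx]
        return corrected

    def intervention_curve(train, test, importance_order, max_i=32):
        """Accuracy of a logistic-regression q-hat after correcting the top i concepts."""
        accs = []
        for i in range(max_i + 1):
            q_hat = LogisticRegression(max_iter=1000).fit(
                intervene(train["pred_c"], train["true_c"], importance_order, i), train["y"])
            accs.append(q_hat.score(
                intervene(test["pred_c"], test["true_c"], importance_order, i), test["y"]))
        return accs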
   These results demonstrate that concept information from only 32 concepts is sufficient for achieving over 96% task performance. Thus, the predictive performance of the CUB model can be significantly improved (by up to 14%) by ensuring that the model is able to learn and use this concept information. Crucially, these results show that CME concept intervention also significantly improves CBM model performance, indicating that the necessary concept information is not learnable from the data. Hence, the undesired CUB model behaviour likely arises from data properties (e.g. the data not being representative with respect to key concepts), not model properties (e.g. architecture, or training regime).
   Overall, we demonstrated how CME can be used to identify the key concept information that can be used to improve the performance of DNN models, and to ensure that they are aligned more closely with the desired concept-based behaviour. Furthermore, we demonstrated how CME can be used to identify whether undesired model behaviour is caused by model properties or data properties.

5.4. Explainability

By studying CME-extracted 𝑝̂ and 𝑞̂ functions separately, we can gain additional insights into what concept information the original model learned, and how this concept information is used to make predictions. We give examples of how these sub-models can be inspected in the remainder of this section.

5.4.1. Input-to-Concept (𝑝̂)

CME extraction of 𝑝̂ functions from a DNN model is highly complementary to existing approaches to latent space analysis. For example, Figure 5 shows a t-SNE [37] 2D projected plot of every layer's hidden space of the dSprites Task 2 model, highlighting different concept values of the two relevant concepts, as well as the layers used by CME to predict them. Figure 5 demonstrates several important ways in which CME concept extraction can be combined with existing latent space analysis approaches, which will be discussed in the remainder of this section. Further examples are given in Appendix C.

Figure 5: t-SNE plots for the relevant Task 2 concepts. Each row corresponds to a different concept, and each column corresponds to a different layer of the Task 2 model. Each plot is coloured with respect to the concept's values. For every concept row, the subplot with a green border indicates the layer CME selected for predicting the value of that concept.

Manifold Types Using ground-truth concept information and hidden space visualisation, it is possible to inspect the nature of latent space manifolds with respect to specific concepts. Firstly, this inspection allows one to build an intuition of how concept information is represented in a particular latent space. Secondly, it is possible to use this information when selecting the types of 𝑝̂ functions to use during concept extraction. For instance, some manifolds consist of "blobs" encoding distinct concept values (e.g. row shape, columns dense, dense_1), suggesting that the latent space is clustered with respect to a concept's values.

Variation Across Layers Using ground-truth concept information and hidden space visualisation, it is also possible to inspect how the representation of concept information varies across the layers of a DNN model. Firstly, this inspection allows one to build an intuition of how concept-related information is transformed by the DNN. Secondly, it is possible to use this information to identify the 'best' layers to extract concept information from. For instance, both rows shape and scale illustrate that the manifolds of higher layers become more unimodal (separating concept values) with respect to the relevant concepts. Importantly, this analysis, together with the definition of 𝑝̂, allows using different layers for extracting different concepts.

   Overall, we argue that CME concept extraction can be well-integrated with existing latent space analysis approaches, in order to study which concept information is learned by a DNN, and how this information is represented across DNN layers. This type of inspection can have numerous applications, including: (i) inspecting which concepts a model has learned, and verifying whether it has learned the desired concepts (useful for model explanations and model verification), (ii) inspecting how concept information is represented across different layers (useful for fine-grained model analysis), and (iii) extracting concept predictions from a DNN (useful for knowledge extraction). Further examples and analysis of extracted 𝑝̂ functions can be found in Appendix C.
5.4.2. Concept-to-Output (𝑞̂)

𝑞̂ functions encapsulate how a DNN uses concept information when making predictions. Hence, these functions can be inspected directly, in order to analyse model behaviour represented in terms of concepts. An example is given in Figure 6, in which we plot the decision tree 𝑞̂ function extracted by CME from the Task 1 model. Further examples are given in Appendix D.

Figure 6: Visualisation of a decision tree 𝑞̂ extracted from the Task 1 model. The model has correctly learned to differentiate between classes based on the shape concept values.

   Overall, inspection of 𝑞̂ functions can be used for (i) verifying that a DNN uses concept information correctly during decision-making, and that its high-level behaviour is consistent with user expectations (model verification), (ii) identifying specific concepts or concept interactions (if any) causing incorrect behaviour (model debugging), and (iii) extracting new knowledge about how concept information can be used for solving a particular task (knowledge extraction). Further examples and analysis of extracted 𝑞̂ functions can be found in Appendix D.
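When 𝑞̂ is a decision tree, a Figure 6-style inspection can be obtained directly from scikit-learn (a minimal sketch; the concept names are assumed to be supplied by the user):

    from sklearn.tree import DecisionTreeClassifier, export_text

    def describe_q_hat(q_hat: DecisionTreeClassifier, concept_names):
        """Print the extracted decision rules of q-hat in terms of concepts."""
        print(export_text(q_hat, feature_names=list(concept_names)))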
6. Conclusions

We present CME: a concept-based model extraction framework, used for analysing DNN models via concept-based extracted models. Using two case studies, we demonstrate how CME can be used to (i) analyse concept information learned by DNN models, (ii) analyse how DNNs use concept information when making predictions, and (iii) identify key concept information that can further improve DNN predictive performance. CME is a model-agnostic, general-purpose framework, which can be combined with a wide variety of different DNN models and corresponding tasks.
   In this work, we assume a fixed set of concept labels available to CME before model extraction begins (i.e. the concept-labelled dataset). In the future, we intend to explore active-learning based approaches to obtaining maximally-informative concept labels in an interactive fashion. Consequently, these approaches will improve extracted model fidelity by retrieving the most informative concept labels, and reduce manual concept labelling effort.
   Given the rapidly-increasing interest in concept-based explanations of DNN models, we believe our approach can play an important role in providing granular concept-based analyses of DNN models.

Acknowledgements

AW acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI). BD acknowledges support from EPSRC Award #1778323. DK acknowledges support from an EPSRC ICASE scholarship and GSK. DK and BD acknowledge the experience at Tenyks as fundamental to developing this research idea.

References

[1] B. Goodman, S. Flaxman, European union regulations on algorithmic decision-making and a "right to explanation", AI Magazine 38 (2017) 50–57.
[2] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion 58 (2020).
[3] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018).
[4] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh, R. Puri, J. M. Moura, P. Eckersley, Explainable machine learning in deployment, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 648–657.
[5] P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, B. Kim, The (un)reliability of saliency methods, in: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer, 2019, pp. 267–280.
[6] D. A. Melis, T. Jaakkola, Towards robust interpretability with self-explaining neural networks, in: Advances in Neural Information Processing Systems, 2018, pp. 7775–7784.
[7] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, B. Kim, Sanity checks for saliency maps, in: Advances in Neural Information Processing Systems, 2018, pp. 9505–9515.
[8] B. Dimanov, U. Bhatt, M. Jamnik, A. Weller, You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods, in: European Conference on Artificial Intelligence, 2020.
[9] F. Poursabzi-Sangdeh, D. G. Goldstein, J. M. Hofman, J. W. Vaughan, H. Wallach, Manipulating and measuring model interpretability, arXiv preprint arXiv:1802.07810 (2018).
[10] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 2673–2682. URL: http://proceedings.mlr.press/v80/kim18d.html.
[11] B. Zhou, Y. Sun, D. Bau, A. Torralba, Interpretable basis decomposition for visual explanation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 119–134.
[12] A. Ghorbani, J. Wexler, J. Y. Zou, B. Kim, Towards automatic concept-based explanations, in: Advances in Neural Information Processing Systems, 2019.
[13] C.-K. Yeh, B. Kim, S. O. Arik, C.-L. Li, P. Ravikumar, T. Pfister, On concept-based explanations in deep neural networks, arXiv preprint arXiv:1910.07969 (2019).
[14] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), arXiv preprint arXiv:1711.11279 (2017).
[15] Y. Goyal, U. Shalit, B. Kim, Explaining classifiers with causal concept effect (CaCE), arXiv preprint arXiv:1907.07165 (2019).
[16] G. E. Hinton, Learning multiple layers of representation, Trends in Cognitive Sciences 11 (2007) 428–434.
[17] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Object detectors emerge in deep scene CNNs, arXiv preprint arXiv:1412.6856 (2014).
[18] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, P. Liang, Concept bottleneck models, in: Proceedings of Machine Learning and Systems 2020, International Conference on Machine Learning, 2020, pp. 11313–11323.
[19] I. Lage, F. Doshi-Velez, Human-in-the-loop learning of interpretable and intuitive representations, in: ICML Workshop on Human Interpretability, 2020. URL: http://whi2020.online/static/pdfs/paper_31.pdf.
     based systems 8 (1995) 373–389.                              ceedings of the IEEE conference on computer vi-
[21] J. R. Zilke, E. L. Mencía, F. Janssen, Deepred–              sion and pattern recognition, 2018, pp. 4109–4118.
     rule extraction from deep neural networks, in:          [34] R. Fong, A. Vedaldi, Net2vec: Quantifying and
     International Conference on Discovery Science,               explaining how concepts are encoded by filters
     Springer, 2016, pp. 457–473.                                 in deep neural networks, in: Proceedings of the
[22] D. Chen, S. P. Fraiberger, R. Moakler, F. Provost,           IEEE conference on computer vision and pattern
     Enhancing transparency and control when draw-                recognition, 2018, pp. 8730–8738.
     ing data-driven inferences about individuals, Big       [35] D. Zhou, O. Bousquet, T. N. Lal, J. Weston,
     data 5 (2017) 197–212.                                       B. Schölkopf, Learning with local and global con-
[23] R. Krishnan, G. Sivakumar, P. Bhattacharya, Ex-              sistency, in: Advances in Neural Information
     tracting decision trees from trained neural net-             Processing Systems 16, 2004.
     works, Pattern recognition 32 (1999).                   [36] F. Pedregosa, G. Varoquaux, A. Gramfort,
[24] M. Sato, H. Tsukimoto, Rule extraction from                  V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret-
     neural networks via decision tree induction, in:             tenhofer, R. Weiss, V. Dubourg, J. Vanderplas,
     IJCNN’01. International Joint Conference on Neu-             A. Passos, D. Cournapeau, M. Brucher, M. Per-
     ral Networks. Proceedings (Cat. No. 01CH37222),              rot, E. Duchesnay, Scikit-learn: Machine learning
     volume 3, IEEE, 2001, pp. 1870–1875.                         in Python, Journal of Machine Learning Research
[25] D. Kazhdan, Z. Shams, P. Liò, Marleme: A multi-              12 (2011).
     agent reinforcement learning model extraction           [37] L. v. d. Maaten, G. Hinton, Visualizing data using
     library, arXiv preprint arXiv:2004.07928 (2020).             t-sne, Journal of Machine Learning Research 9
[26] Q. Liu, X. Liao, L. Carin, Semi-supervised multi-            (2008) 2579–2605.
     task learning, in: Advances in Neural Information
     Processing Systems, 2008.
[27] L. Matthey, I. Higgins, D. Hassabis, A. Lerch-          A. Concept Decomposition
     ner, dsprites: Disentanglement testing sprites
     dataset, https://github.com/deepmind/dsprites-          The results and findings presented in existing work on
     dataset/, 2017.                                         concept-based explanations suggests that users often
[28] C. Wah, S. Branson, P. Welinder, P. Perona, S. Be-      think of tasks in terms of concepts and concept interac-
     longie, The caltech-ucsd birds-200-2011 dataset         tions (see Section 2.1 for further details). For instance,
     (2011).                                                 consider the task of determining the species of a bird
[29] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson,      from an image. A user will typically perform this task
     R. E. Howard, W. E. Hubbard, L. D. Jackel, Hand-        by first identifying relevant concepts (e.g. wing color,
     written digit recognition with a back-propagation       head color, and beak length) present in a given image,
     network, in: Advances in neural information pro-        and then using the values of these concepts to infer the
     cessing systems, 1990, pp. 396–404.                     bird species, in a bottom-up fashion.
[30] N. Srivastava, G. Hinton, A. Krizhevsky,                   On the other hand, Machine Learning (ML) mod-
     I. Sutskever, R. Salakhutdinov,            Dropout:     els usually rely on high-dimensional data representa-
     A simple way to prevent neural networks                 tions, and infer task labels directly from these high-
     from overfitting, Journal of Machine Learn-             dimensional inputs (e.g. a CNN produces a class label
     ing Research 15 (2014) 1929–1958. URL:                  from raw input pixels of an image).
     http://jmlr.org/papers/v15/srivastava14a.html.             Consequently, Concept Decomposition (CD) approaches
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wo-   attempt to explain the behaviour of such ML models by
     jna, Rethinking the inception architecture for          decomposing their processing into two distinct steps:
     computer vision, in: Proceedings of the IEEE            concept extraction, and label prediction. In concept
     conference on computer vision and pattern recog-        extraction, concept information is extracted from the
     nition, 2016, pp. 2818–2826.                            high-dimensional input data. In label prediction, con-
[32] A. Krizhevsky, I. Sutskever, G. E. Hinton, Ima-         cept information is used to produce the output label.
     genet classification with deep convolutional neu-       Hence, CD approaches attempt to explain ML model
     ral networks, in: Advances in neural information        behaviour in terms of human-understandable concepts
     processing systems, 2012, pp. 1097–1105.                and their interactions in a bottom-up fashion, parallel-
[33] Y. Cui, Y. Song, C. Sun, A. Howard, S. Be-              ing human-like reasoning more closely.
     longie, Large scale fine-grained categorization            Importantly, whilst this work focuses on CNN mod-
     and domain-specific transfer learning, in: Pro-         els and tasks, the notion of CD can in principle be
applied to any ML model and task.                            concept-related knowledge stored in these models. Con-
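To make the two-step structure concrete, the sketch below shows one way a concept-decomposed predictor could be organised in code. All names here (ConceptDecomposedModel, extract_concepts, predict_label) are illustrative only; they are not part of any existing implementation.

```python
# Minimal illustration of concept decomposition (all names are hypothetical).
# Step 1 (concept extraction): map a high-dimensional input to concept values.
# Step 2 (label prediction): map the concept values to an output label.

class ConceptDecomposedModel:
    def __init__(self, extract_concepts, predict_label):
        self.extract_concepts = extract_concepts  # p-hat: input -> concepts
        self.predict_label = predict_label        # q-hat: concepts -> label

    def predict(self, x):
        concepts = self.extract_concepts(x)   # e.g. {"shape": "ellipse", "scale": 0.7}
        return self.predict_label(concepts)   # task class inferred from concepts only
```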
A.1. CBMs

CBMs can be seen as a special case of models performing CD, in which CD behaviour is enforced by design. Hence, these models explicitly consist of two submodels, with the first submodel extracting concept information, and the second submodel using this concept information for producing task labels. Importantly, non-CBM models can still demonstrate CD behaviour. For instance, the dSprites Task 2 model was shown to have CD behaviour, with relevant concept information extracted in the dense layers, and used for classification decisions.
A.2. CBMs & CME

The utility of CBMs is that they produce models explicitly encouraged to use CD. Consequently, these models are much more likely to rely on the desired concepts during decision-making, and be more aligned with a user’s mental model of the corresponding task.

However, a given DNN model can already exhibit CD behaviour, and use the desired concept information (e.g. as was the case with both dSprites task models). In this case, costly modifications and model re-training are unnecessary. As discussed in Section 3, CME can extract concept information from pre-trained DNNs by training 𝐿 × 𝑘 concept predictors (where 𝐿 denotes the number of DNN layers used in concept extraction, and 𝑘 denotes the number of concepts). As demonstrated in Section 5, these concept predictors can consist of simpler models (e.g. LRs), trained on only a fraction of the DNN training data. Thus, the computational cost of training these concept predictors is significantly smaller, compared to training a bottleneck model on all the training data, as done in the case of CBMs.
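A minimal sketch of this per-layer concept-predictor training is given below, using scikit-learn logistic regressions. The get_activations helper is a hypothetical stand-in for reading a layer’s hidden representation from the pre-trained DNN, and the sketch shows only the 𝐿 × 𝑘 supervised fits; it omits CME’s semi-supervised use of unlabelled data.

```python
from sklearn.linear_model import LogisticRegression

def fit_concept_predictors(get_activations, layers, x_labelled, concept_labels):
    """Fit one logistic regression per (layer, concept) pair.

    get_activations(layer, x) -> 2D array of hidden activations (hypothetical helper).
    concept_labels: array of shape (n_samples, k) holding discrete concept values.
    Returns a dict mapping (layer, concept_index) -> fitted classifier.
    """
    predictors = {}
    n_concepts = concept_labels.shape[1]
    for layer in layers:                          # the L layers considered for extraction
        h = get_activations(layer, x_labelled)    # hidden representation at this layer
        for c in range(n_concepts):               # the k concepts
            clf = LogisticRegression(max_iter=1000)
            clf.fit(h, concept_labels[:, c])
            predictors[(layer, c)] = clf
    return predictors
```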
More importantly, CBM models require knowledge of existing concepts and available concept annotations. In practice, these annotations are often expensive to produce, especially for large datasets and/or a large number of concepts. Furthermore, information about which concepts are relevant and/or sufficient for solving a given task is often not fully available either. Instead, CME is capable of using existing DNN models to extract this information automatically in a semi-supervised fashion, making concept discovery (identifying the relevant concepts) and concept annotation both faster and cheaper.

Overall, CME permits efficient interaction with pre-trained DNN models, which can be used to leverage concept-related knowledge stored in these models. Consequently, we believe that CME will be invaluable in situations where concept-related information is expensive or difficult to obtain, or is only partially known. In these cases, a user may interact with existing DNN models via CME, in order to refine existing concept-related knowledge.

It should be noted that a CBM can trivially be approximated using CME, by defining 𝑝̂ as the output of a CBM’s concept bottleneck layer, and defining 𝑞̂ as the CBM’s submodel producing task labels from the bottleneck layer output.


A.3. Further Discussion

As discussed in Section 3, CME explores whether a DNN is concept-decomposable, by attempting to approximate it with an extracted model that is concept-decomposable by design (i.e. one that explicitly consists of two separate stages). Intuitively, if a given DNN learns and relies on information about the specified concepts during label prediction, this concept information will be contained in the DNN latent space. Hence, the DNN decision process could be separated into two steps: concept information extraction, and consequent task label prediction.

Importantly, existing CD-based approaches (such as those discussed in Section 2.2) require the set of concepts and their values to be (i) sufficient to solve the corresponding classification task (i.e. the class labels can be predicted from concept information with high accuracy), and (ii) learnable from the data (i.e. the DNN model will be able to learn concept information from the given dataset), in order to achieve high task performance.

However, these works do not discuss how to handle cases where these assumptions do not hold (e.g. as was the case with the CUB task). Thus, exploring ways of efficiently discovering relevant concepts sufficient for solving a given task, as well as ways of verifying whether this concept information is learnable from the data, are both important research directions for future work.
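One simple way to probe requirement (i) on an annotated dataset is to fit an interpretable model directly from ground-truth concept values to task labels and measure its accuracy: low accuracy suggests the chosen concept set is insufficient. The sketch below illustrates this check only; it is not part of the CME pipeline itself.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def concept_sufficiency_score(concepts, labels, test_size=0.2, random_state=0):
    """Accuracy of predicting task labels from ground-truth concept values alone."""
    c_train, c_test, y_train, y_test = train_test_split(
        concepts, labels, test_size=test_size, random_state=random_state)
    clf = DecisionTreeClassifier(random_state=random_state).fit(c_train, y_train)
    return accuracy_score(y_test, clf.predict(c_test))
```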
B. dSprites Dataset

B.1. Description

dSprites is a dataset of 2D shapes, procedurally generated from 6 ground-truth independent concepts (color, shape, scale, rotation, x position, and y position). Table 3 lists the concepts and their corresponding values. dSprites consists of 64×64 pixel black-and-white images, generated from all possible combinations of these concepts, for a total of 1 × 3 × 6 × 40 × 32 × 32 = 737280 images.
Figure 7: t-SNE plots for the relevant Task 1 concept. Each column corresponds to a different layer of the Task 1 model. Each plot is colored with respect to the concept’s values. The subplot with a green border indicates the layer 𝑝̂ uses for predicting the value of that concept.



Table 3
dSprites concepts and values

    Name          Values
    Color         white
    Shape         square, ellipse, heart
    Scale         6 values linearly spaced in [0.5, 1]
    Rotation      40 values in [0, 2π]
    Position X    32 values in [0, 1]
    Position Y    32 values in [0, 1]


B.2. Pre-processing

We select 16 of the 32 values for Position X and Position Y (keeping every other value only), and select 8 of the 40 values for Rotation (retaining every 5th value). This step makes the dataset size more manageable (reducing it from 737280 to 3 × 6 × 8 × 16 × 16 = 36864 samples), whilst preserving its characteristics and properties, such as concept value ranges and diversity.
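A sketch of this subsampling is given below. It assumes the standard dSprites archive layout (an imgs array and a latents_classes array whose columns follow the color, shape, scale, rotation, position-X, position-Y ordering); the exact key names and column order should be checked against the dataset’s metadata.

```python
import numpy as np

# Load the dSprites archive (file name as distributed in the dSprites repository).
data = np.load("dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz", allow_pickle=True)
imgs = data["imgs"]                      # (737280, 64, 64) binary images
latents = data["latents_classes"]        # (737280, 6) integer concept indices

# Columns assumed: 0 color, 1 shape, 2 scale, 3 rotation, 4 position X, 5 position Y.
keep_rotation = np.arange(0, 40, 5)      # every 5th rotation value (8 of 40)
keep_position = np.arange(0, 32, 2)      # every other X/Y value (16 of 32)

mask = (np.isin(latents[:, 3], keep_rotation)
        & np.isin(latents[:, 4], keep_position)
        & np.isin(latents[:, 5], keep_position))

imgs_sub, latents_sub = imgs[mask], latents[mask]
print(imgs_sub.shape[0])                 # expected: 3 * 6 * 8 * 16 * 16 = 36864
```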


C. Input-to-Concept Functions

Figure 7 shows a t-SNE 2D-projected plot of every layer’s hidden space of the dSprites Task 1 model, highlighting the different values of the relevant shape concept, and indicating which layers were used by CME to predict it.

The CUB model has a considerably larger number of layers, and a considerably larger number of task concepts. Hence, for the sake of space, we demonstrate an example here (Figure 8) using only 6 different layers of the CUB model, and showing only the top 5 important concepts identified in Section 5.3. In this figure, the concepts are named using their indices, and the layers are named following the naming convention used in [18]. Further details regarding layer naming and/or concept naming can be found at the link in footnote 6. For all concepts, concept values become significantly better separated after the Mixed_7c layer. However, the figure shows that concept values are still quite mixed together for some of the points, even for later layers. This low separability indicates that concept values will still be mis-predicted for some of the points, and that concept extraction for the CUB task will likely perform suboptimally.

6 https://github.com/yewsiang/ConceptBottleneck/tree/master/CUB
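Plots of this kind can be reproduced, in spirit, with scikit-learn’s t-SNE applied to a layer’s hidden activations. In the sketch below, get_activations is again a hypothetical helper returning the hidden representation of a given layer for a batch of inputs, and concept_values is assumed to be a numeric array of per-sample concept values.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_layer_tsne(get_activations, layer, x, concept_values, ax=None):
    """Project a layer's hidden space to 2D and colour points by concept value."""
    h = get_activations(layer, x)                      # (n_samples, hidden_dim)
    h2d = TSNE(n_components=2).fit_transform(h)        # 2D t-SNE embedding
    ax = ax or plt.gca()
    scatter = ax.scatter(h2d[:, 0], h2d[:, 1], c=concept_values, s=4, cmap="tab10")
    ax.set_title(str(layer))
    return scatter
```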
Figure 8: t-SNE plots for the top 5 CUB concepts. Each column corresponds to a different layer of the CUB model. Each
plot is colored with respect to the concept’s values.
D. Concept-to-Output Functions

Figure 9 shows the decision tree extracted for dSprites Task 2. Overall, this model has correctly learned to differentiate between classes based on the shape and scale concepts (note: there are 3 × 6 shape and scale concept value combinations, for a total of 18 output classes).

Figure 9: Visualisation of a decision tree 𝑞̂ extracted from the Task 2 model. The model has correctly learned to differentiate between classes based on the shape and scale concept values.
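A tree of this kind can be obtained with scikit-learn by fitting a decision tree from extracted concept values to the DNN’s output labels. The sketch below illustrates only this concept-to-output step (the concept values themselves would come from the input-to-concept stage); the max_depth setting is an arbitrary illustrative choice.

```python
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

def extract_tree(concepts, labels, concept_names):
    """Fit a decision tree q-hat from concept values to task labels and print its rules.

    concepts:      (n_samples, k) array of extracted concept values (e.g. shape, scale).
    labels:        (n_samples,) array of task labels produced by the original DNN.
    concept_names: list of k strings naming the concept columns.
    """
    clf = DecisionTreeClassifier(max_depth=5).fit(concepts, labels)
    print(tree.export_text(clf, feature_names=concept_names))  # textual view of the rules
    return clf
```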