=Paper=
{{Paper
|id=Vol-2699/paper02
|storemode=property
|title=Now You See Me (CME): Concept-based Model Extraction
|pdfUrl=https://ceur-ws.org/Vol-2699/paper02.pdf
|volume=Vol-2699
|authors=Dmitry Kazhdan,Botty Dimanov,Mateja Jamnik,Pietro Liò,Adrian Weller
|dblpUrl=https://dblp.org/rec/conf/cikm/KazhdanDJLW20
}}
==Now You See Me (CME): Concept-based Model Extraction==
Now You See Me (CME): Concept-based Model Extraction Dmitry Kazhdana,c , Botty Dimanova,c , Mateja Jamnika , Pietro Liòa and Adrian Wellera,b a The University of Cambridge, UK b The Alan Turing Institute, London, UK c Denotes equal contribution Abstract Deep Neural Networks (DNNs) have achieved remarkable performance on a range of tasks. A key step to further empowering DNN-based approaches is improving their explainability. In this work we present CME: a concept-based model extraction framework, used for analysing DNN models via concept-based extracted models. Using two case studies (dSprites, and Caltech UCSD Birds), we demonstrate how CME can be used to (i) analyse the concept information learned by a DNN model (ii) analyse how a DNN uses this concept information when predicting output labels (iii) identify key concept information that can further improve DNN predictive performance (for one of the case studies, we showed how model accuracy can be improved by over 14%, using only 30% of the available concepts). Keywords interpretability, concept extraction, concept-based explanations, model extraction, latent space analysis, xai 1. Introduction Concept-based explanation approaches provide model explanations in terms of human-understandable units, The black-box nature of Deep Neural Networks (DNNs) rather than individual features, pixels, or characters hinders their widespread adoption, especially in indus- (e.g., the concepts of a wheel and a door are important tries under heavy regulation with high-cost of error for the detection of cars) [10, 11, 12]. [1]. As a result, there has recently been a dramatic In this paper we introduce CME1 : a (C)oncept-based increase in research on Explainable AI (XAI), focusing (M)odel (E)xtraction framework2 . Figure 1 depicts how on improving explainability of DL systems [2, 3]. CME can be used to analyse DNN models via explain- Currently, the most widely used XAI methods are fea- able concept-based extracted models, in order to ex- ture importance methods (also referred to as saliency plain and improve performance of DNNs, as well as to methods) [4]. For a given data point, these methods pro- extract useful knowledge from them. Although this ex- vide scores showing the importance of each feature (e.g., ample focuses on a CNN model, CME is model-agnostic, pixel, patch, or word vector) to the algorithm’s deci- and can be applied to any DNN architecture. sion. Unfortunately, feature importance methods have In particular, we make the following contributions: been shown to be fragile to input perturbations [5, 6] • We present the novel CME framework, capable or model parameter perturbations [7, 8]. Human ex- of analysing DNN models via concept-based ex- periments also demonstrate that feature importance tracted models explanations do not necessarily increase human un- • We demonstrate, using two case-studies, how derstanding, trust, or ability to correct mistakes in a CME can analyse (both quantitatively and quali- model [9, 10]. tatively) the concept information a DNN model As a consequence, two other types of XAI approaches has learned, and how this information is repre- are receiving increasing attention: model extraction ap- sented accross the DNN layers proaches, and concept-based explanation approaches. 
• We propose a novel metric for evaluating the Model extraction methods (also referred to as model quality of concept extraction methods translation methods) approximate black-box models • We demonstrate, using two case-studies, how with simpler models to increase model explainability. CME can analyse (both quantitatively and quali- tatively) how a DNN uses concept information Title of the Proceedings: "Proceedings of the CIKM 2020 Workshops" when predicting output labels Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi • We demonstrate how CME can identify key con- email: dk525@cam.ac.uk (D. Kazhdan); btd26@cam.ac.uk (B. cept information that can further improve DNN Dimanov); mateja.jamnik@cl.cam.ac.uk (M. Jamnik); pietro.lio@cl.cam.ac.uk (P. Liò); adrian.weller@eng.cam.ac.uk (A. predictive performance Weller) orcid: 1 Pronounced “See Me.” © 2020 Copyright for this paper by its authors. Use permitted under Creative 2 All Commons License Attribution 4.0 International (CC BY 4.0). relevant code is available at CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) https://github.com/dmitrykazhdan/CME extracting concept information. DNNs have been shown to perform hierarchical feature extraction, with layers closer to the output utilising higher-level data represen- tations, compared to layers closer to the input [16, 17]. (a) This implies that choosing a single layer imposes an unnecessary trade-off between low- and high-level con- cepts. On the other hand, CME is capable of efficiently combining latent space information from multiple lay- ers, thereby avoiding this constraint. Finally, existing methods typically represent concept (b) explanations as a list of concepts, with their relative Figure 1: CME extracted model example. (a) Given an input importance with respect to the classification task. In image, a CNN uses the image’s pixel information as input, contrast, our approach describes the functional relation- and returns class information as output (in this case, class ship between concepts and outputs, thereby showing in label 3, corresponding to the Red-headed Woodpecker class), more detail how the model utilises concept information performing data processing in a non-explainable, black-box when making predictions. fashion. (b) Given an input image, a CME extracted model uses an Input-to-Concept function (I-to-C) to compute con- cept information from the pixel data (e.g. bird wing color, 2.2. Concept Bottleneck Models or head color values). Next, the model uses a Concept-to- Recent work on concept-based explanations relies on Output function (C-to-O) to compute the output class label models that use an intermediate concept-based repre- from this concept information. sentation when making predictions [18, 19]. Work in [18] refer to these types of models as concept bottleneck models (CBMs). A concept bottleneck model is a model 2. Related Work which, given an input, first predicts an intermediate set of human-specified concepts, and then uses only 2.1. Concept-based Explanations this concept information to predict the output task la- Concept-based explanations have been used in a wide bel. Work in [18] proposes a method for turning any range of different ways, including: inspecting what DNN into a concept bottleneck model given concept a model has learned [12, 13], providing class-specific annotations at training time. 
This is achieved by resiz- explanations [14, 10], and discovering causal relations ing one of the layers to match the number of concepts between concepts [15]. Similarly to CME, these ap- provided, and re-training with an added intermediate proaches typically seek to explain model behaviour in loss that encourages the neurons in that layer to align terms of high-level concepts, extracting this concept component-wise to the provided concepts. information from a model’s latent space. Crucially, CBM approaches provide ways for gen- Importantly, existing concept-based explanation ap- erating DNN models, which are explicitly encouraged proaches are typically capable of handling binary-valued to rely on specified concept information. In contrast, concepts only, which implies that multi-valued con- our approach is used for analysing DNN models (and is cepts have to be binarised first. For instance, given a much cheaper computationally). concept such as “shape”, with possible values ‘square’ Furthermore, CBM approaches require concept an- and ‘circle’, these approaches have to convert “shape” notations to be available at training time for all of the into two binary concepts ‘is_square’, and ‘is_circle’. training data, which is often expensive to produce. This makes such approaches (i) computationally expen- In contrast, CME can be used with partially-labelled sive, since the binarised concept space usually has a datasets in a semi-supervised fashion, as will be de- high cardinality, (ii) error-prone, since mutual exclusiv- scribed in Section 3. ity of concept values is now not enforced (e.g., a single Finally, CBM approaches require the concepts them- data point can now have both ‘is_square’ and ‘is_circle’ selves to be known beforehand. On the other hand, concepts being true). In contrast, our approach is capa- CME can efficiently utilise knowledge contained in pre- ble of handling multi-valued concepts directly, without trained DNNs, in order to learn about which concepts binarisation. are/aren’t required for a given task. Further details on Furthermore, concept-based explanation approaches CME/CBM comparison can be found in Appendix A. typically rely on the latent space of a single layer when 2.3. Model Extraction output function, mapping data-points in their concept representation to output space . Thus, when pro- Model extraction techniques use rules [20, 21, 22], de- cessing an input 𝐱, a DNN 𝑓 can be seen as converting cision trees [23, 24], or other more readily explainable this input into an interpretable concept representation models [25] to approximate complex models, in order using 𝑝, and using 𝑞 to predict the output from this to study their behaviour. Provided the approximation representation. The significance of this decomposition quality (referred to as fidelity) is high enough, an ex- is further discussed in Appendix A. tracted model can preserve many statistical properties CME explores whether a given DNN 𝑓 is concept- of the original model, while remaining open to inter- decomposable, by attempting to approximate 𝑓 with an pretation. extracted model 𝑓̂ ∶ → . In this case, 𝑓̂ is defined However, extracted models generated by existing methods represent their decision-making using the as 𝑓 (𝐱) = 𝑞̂ (𝑝̂ (𝐱)), using input-to-concept 𝑝̂ and output- ̂ same input representation as the original model, which to-concept 𝑞̂ extracted by CME from the original DNN. is typically difficult for the user to understand directly. 
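As a minimal illustration of this decomposition (a sketch, not the authors' released implementation), an extracted model can be treated as a plain composition of an input-to-concept function and a concept-to-output function, with fidelity measured as the agreement between the extracted model's predictions and the original DNN's predictions. The names `original_predict`, `p_hat` and `q_hat` below are placeholders for a trained DNN's prediction function and the two extracted functions.

```python
# Sketch of a concept-based extracted model f_hat(x) = q_hat(p_hat(x)) and its fidelity
# to the original DNN. `p_hat` and `q_hat` are assumed to be callables produced elsewhere.
import numpy as np

class ExtractedModel:
    """Concept-based extracted model: a composition of p_hat and q_hat."""
    def __init__(self, p_hat, q_hat):
        self.p_hat = p_hat    # input space -> concept representation
        self.q_hat = q_hat    # concept representation -> output labels

    def predict(self, X):
        concepts = self.p_hat(X)     # interpretable intermediate representation
        return self.q_hat(concepts)  # task labels computed from concepts only

def fidelity(original_predict, extracted_model, X):
    """Fraction of inputs on which the extracted model agrees with the original DNN."""
    y_dnn = np.asarray(original_predict(X))
    y_ext = np.asarray(extracted_model.predict(X))
    return float(np.mean(y_dnn == y_ext))
```

Any concrete 𝑝̂ and 𝑞̂ (for example, those extracted in Section 3) can be plugged into this interface unchanged.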
We describe our approach to extracting 𝑝̂ and 𝑞̂ in the Instead, our extracted models represent decision-making remainder of this section. via human-understandable concepts, making them eas- ier to interpret. 3.3. Input-to-Concept (𝑝̂ ) When extracting 𝑝̂ from a pre-trained DNN, we as- 3. Methodology sume we have access to the DNN training data and labels {(𝐱(0) , 𝑦 (0) ), ..., (𝐱(𝑑) , 𝑦 (𝑑) )}. Furthermore, we as- In this section we present our CME approach, describ- sume partial access to 𝑝 ⋆ , such that a small set of ing how it can be used to analyse DNN models using 𝑖 training points {𝐱(0) , ..., 𝐱(𝑖−1) } have concept labels concept-based extracted models. {𝐜(0) , ..., 𝐜(𝑖−1) } associated with them, while the remain- ing 𝑢 points {𝐱(𝑖) , ..., 𝐱(𝑖+𝑢) } do not (in this case 𝑢 = 𝑑 −𝑖). 3.1. Formulation We refer to these subsets respectively as the concept labelled dataset and concept unlabelled dataset. Using We consider a pre-trained DNN classifier 𝑓 ∶ → , these datasets, we generate 𝑝̂ by aggregating concept ( ⊂ ℝ𝑛 , ⊂ ℝ𝑜 ), where 𝑓 (𝐱) = 𝑦 is mapping an input label predictions across multiple layers of the given 𝐱 ∈ to an output class 𝑦 ∈ . For every DNN layer DNN model, as described below. 𝑙, we denote the function 𝑓 𝑙 ∶ → 𝑙 , ( 𝑙 ⊂ ℝ𝑚 ) Given a DNN layer 𝑙 with 𝑚 hidden units, we com- as a mapping from the input space to the hidden pute the layer’s representation of the input data 𝐡 = representation space 𝑙 , where 𝑚 denotes the number 𝑓 𝑙 (𝐱), obtaining (𝐡(0) , ..., 𝐡(𝑖+𝑢) ). Using this data and the of hidden units, and can be different for each layer. concept labels, we construct a semi-supervised dataset, Similarly to [18, 19], we assume the existence of a consisting of labelled data {(𝐡(0) , 𝐜(0) ), ..., (𝐡(𝑖−1) , 𝐜(𝑖−1) )}, concept representation ⊂ ℝ𝑘 , defining 𝑘 distinct con- and unlabelled data {𝐡(𝑖) , ..., 𝐡(𝑖+𝑢) }. cepts associated with the input data. is defined such Next, we rely on Semi-Supervised Multi-Task Learn- that every basis vector in spans the space of possi- ing (SSMTL) [26], in order to extract a function 𝑔 𝑙 ∶ ble values for one particular concept. We further as- 𝑙 → , which predicts concept labels from layer 𝑙’s sume the existence of a function 𝑝 ⋆ ∶ → , where hidden space. In this work, we treat each concept as 𝑝 ⋆ (𝐱) = 𝐜 is mapping an input 𝐱 to its concept represen- a separate, independent task. Hence, 𝑔 𝑙 (𝐡) is decom- tation 𝐜. Thus, 𝑝 ⋆ defines the concepts and their values posed into 𝑘 separate tasks (one per concept), and is (referred to as the ground truth concepts) for every input defined as 𝑔 𝑙 (𝐡) = (𝑔1𝑙 (𝐡), ..., 𝑔𝑘𝑙 (𝐡)) where each 𝑔𝑖𝑙 (𝐡) point. (𝑖 ∈ {1..𝑘}) predicts the value of concept 𝑖 from 𝐡. Repeating this process for all model layers 𝐿, we 3.2. CME obtain a set of functions 𝐺 = {𝑔𝑖𝑙 | 𝑙 ∈ {1..𝐿} ∧ 𝑖 ∈ {1..𝑘}}. For every concept 𝑖, we define the “best” layer In this work, we define a DNN 𝑓 as being concept- 𝑙 𝑖 for predicting that concept as shown in (1): decomposable, if it can be well-approximated by a com- position of functions 𝑝 and 𝑞, such that 𝑓 (𝐱) = 𝑞(𝑝(𝐱)). 𝑙 𝑖 = arg min 𝓁 (𝑔𝑖𝑙 , 𝑖) (1) In this definition, the function 𝑝 ∶ → is an input- 𝑙∈𝐿 to-concept function, mapping data-points from their Here, 𝓁 is a loss function (in this case the error rate), input representation 𝐱 ∈ to their concept represen- computing the predictive loss of function 𝑔 𝑙 with re- tation 𝐜 ∈ . The function 𝑞 ∶ → is a concept-to- 𝑖 spect to a concept 𝑖. 
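The per-layer, per-concept extraction just described can be sketched as follows, using scikit-learn's LabelSpreading (the semi-supervised model the paper reports using for the dSprites tasks). Here `layer_outputs(l, X)` is a hypothetical helper returning the flattened hidden representation of the pre-trained DNN at layer 𝑙, concept labels use -1 for concept-unlabelled points, and the error rate used for layer selection is measured on a small concept-labelled validation split; that split is an illustrative assumption rather than a detail fixed by the paper.

```python
# Sketch of extracting p_hat: fit one semi-supervised concept predictor per (layer, concept)
# pair, then keep, for each concept, the layer with the lowest validation error rate (Eq. 1).
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def fit_concept_predictors(layer_outputs, layers, X_train, concepts_train, X_val, concepts_val):
    """concepts_train: (n_samples, k) integer array, with -1 marking unlabelled points.
    Returns {concept_index: (best_layer, fitted predictor g_i^l)}."""
    k = concepts_train.shape[1]
    best = {}
    for i in range(k):
        best_err, best_layer, best_model = np.inf, None, None
        for l in layers:
            g = LabelSpreading(kernel="knn", n_neighbors=10)
            g.fit(layer_outputs(l, X_train), concepts_train[:, i])              # semi-supervised fit
            err = 1.0 - g.score(layer_outputs(l, X_val), concepts_val[:, i])    # error rate (Eq. 1)
            if err < best_err:
                best_err, best_layer, best_model = err, l, g
        best[i] = (best_layer, best_model)
    return best

def p_hat(best, layer_outputs, X):
    """Predict every concept from its selected layer (Eq. 2)."""
    k = len(best)
    preds = [best[i][1].predict(layer_outputs(best[i][0], X)) for i in range(k)]
    return np.stack(preds, axis=1)
```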
Finally, we define 𝑝̂ as shown in • Task 1: This task consists of determining the (2): shape concept value from an input image. For 1 1 𝑘 𝑘 every image sample, we define its task label as 𝑝̂ (𝐱) = (𝑔1𝑙 ◦𝑓 𝑙 (𝐱), ..., 𝑔𝑘𝑙 ◦𝑓 𝑙 (𝐱)) (2) the shape concept label of that sample. • Task 2: This task consists of discriminating be- Thus, given an input 𝐱, the value computed by 𝑝̂ (𝐱) tween all possible shape and scale concept value for every concept 𝑖 ∈ {1..𝑘} is equal to the value com- combinations. We assign a distinct identifier to puted by 𝑔𝑖𝑙 from that input’s representation in layer 𝑙 𝑖 . 𝑖 each possible combination of the shape and scale Overall, 𝑝̂ encapsulates concept information contained concept labels. For every image sample, we de- in a given DNN model, and can be used to analyse how fine its task label as the identifier corresponding this information is represented, as well as to predict to this sample’s shape and scale concept values. concept values for new inputs. Overall, Task 1 explores a scenario in which a DNN has to learn to recognise a specific concept from an 3.4. Concept-to-Label (𝑞̂ ) input image. Task 2 explores a relatively more complex scenario, in which a DNN has to learn to recognise We setup extraction of 𝑞̂ as a classification problem, in combinations of concepts from an input image. which we train 𝑞̂ to predict output labels 𝑦 from concept labels 𝐜 predicted by 𝑝̂ . We use 𝑝̂ to generate concept labels for all training data points, obtaining a set of con- 4.1.2. Model cept labels {𝐜(0) , ..., 𝐜(𝑖+𝑢) }. Next, we produce a labelled We trained a Convolutional Neural Network (CNN) dataset, consisting of concept labels and corresponding model [29] for each task. Both models had the same ar- DNN output labels {(𝐜(0) , 𝑦 (0) ), ..., (𝐜(𝑖+𝑢) , 𝑦 (𝑖+𝑢) )}, and chitecture, consisting of 3 convolutional layers, 2 dense use it to train 𝑞̂ in a supervised manner. We experi- layers with ReLUs, 50% dropout [30] and a softmax out- mented with using Decision Trees (DTs), and Logistic put layer. The models were trained using categorical Regression (LR) models for representing 𝑞̂ , as will be cross-entropy loss, and achieved 100.0 ± 0.0% classifi- discussed in Section 5. Overall, 𝑞̂ can be used to analyse cation accuracies on their respective held-out test sets. how a DNN uses concept information when making We refer to these models as the Task 1 model and the predictions. Task 2 model in the rest of this work. 4. Experimental Setup 4.1.3. Ground-truth Concept Information Importantly, the task and dataset definitions described We evaluated CME using two datasets: dSprites [27], in this section imply that we know precisely which and Caltech-UCSD birds [28]. All relevant code is pub- concepts the models had to learn, in order to achieve licly available at3 . 100.0 ± 0.0% task performances (shape for Task 1, and shape and scale for Task 2). We refer to this as the 4.1. dSprites ground truth concept information learned by these mod- els. dSprites is a well-established dataset used for evalu- ating unsupervised latent factor disentanglement ap- proaches. dSprites consists of 2D 64×64 pixel black-and- 4.2. Caltech-UCSD Birds (CUB) white shape images, procedurally generated from all For our second dataset, we used Caltech-UCSD Birds possible combinations of 6 ground truth independent 200 2011 (CUB). This dataset consists of 11,788 im- concepts (color, shape, scale, rotation, x and y position). 
ages of 200 bird species with every image annotated Further details can be found in Appendix B, and the using 312 binary concept labels (e.g. beak and wing official dSprites repository. 4 colour, shape, and pattern). We relied on concept pre- processing steps defined in [18] (used for de-noising 4.1.1. Classification Tasks concept annotations, and filtering out outlier concepts), We define 2 classification tasks, used to evaluate our which produces a refined set of 𝑘 = 112 binary concept framework: labels for every image sample. 3 https://github.com/dmitrykazhdan/CME 4 https://github.com/deepmind/dsprites-dataset/ 4.2.1. Classification Task. 4.3.2. CBM We relied on the standard CUB classification task, which As discussed in Section 4.2.3, we do not have access to consists of predicting the bird species from an input ground truth concept information learned by the CUB image. model. Instead, we rely on the pre-trained sequential bottleneck model defined in [18] (referred to as CBM 4.2.2. Model in the rest of this work). CBM is a bottleneck model, obtained by resizing one of the layers of the CUB model We used the Inception-v3 architecture [31], pretrained to match the number of concepts provided (we refer on ImageNet [32] (except for the fully-connected lay- to this as the bottleneck layer), and training the model ers) and fine-tuned end-to-end on the CUB dataset, in two steps. First, the sub-model consisting of the following the preprocessing practices described in [33]. layers between the input layer and the bottleneck layer The model achieved 82.7 ± 0.4% classification accuracy (inclusive) is trained to predict concept values from on a held-out test set. We refer to this model as the input data. Next, the submodel consisting of the lay- CUB model in the rest of this work. ers between the layer following the bottleneck layer and the output layer is trained to predict task labels 4.2.3. Ground-truth Concept Information from the concept values predicted by the first submodel. Unlike dSprites, the CUB dataset does not explicitly Hence, this bottleneck model is guaranteed to solely define how the available concepts relate to the output rely on concept information that is learnable from the task. Thus, we do not have access to the ground truth data, when making task label predictions. Thus, this concept information learned by the CUB model. benchmark serves as an upper bound for the concept information learnable from the dataset, and for the task performance achievable using this information. Impor- 4.3. Benchmarks tantly, CBM does not attempt to approximate/analyse We compare performance of our CME approach to two the CUB model, but instead attempts to solve the same other benchmarks, described in the remainder of this classification task using concept information only. section. We use the first CBM submodel as a 𝑝̂ benchmark, representing the upper bound of concept information learnable from the data. We use the second submodel 4.3.1. Net2Vec as a 𝑞̂ benchmark, representing the upper bound of We rely on work in [34] for defining benchmark 𝑝̂ func- task performance achievable from predicted concept tions for the three tasks. Work in [34] attempts to information only. Finally, we use the entire model as predict presence/absence of concepts from spatially- an 𝑓̂ benchmark. We make use of the saved trained averaged hidden layer activations of convolutional lay- model from [18], available in their official repository5 . ers of a CNN model. 
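A rough sketch of this benchmark 𝑝̂, as it is described here (not the original Net2Vec code): spatially average the chosen convolutional layer's activations and fit one logistic regressor per binary concept. The helper `conv_activations(X)` is assumed to return channels-last activations of shape (samples, height, width, filters).

```python
# Sketch of a Net2Vec-style concept probe on spatially-averaged convolutional activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_concept_probes(conv_activations, X, binary_concepts):
    """binary_concepts: (n_samples, k) array of 0/1 concept annotations."""
    pooled = conv_activations(X).mean(axis=(1, 2))    # spatial average per filter
    probes = []
    for i in range(binary_concepts.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(pooled, binary_concepts[:, i])        # presence/absence of concept i
        probes.append(clf)
    return probes

def predict_binary_concepts(probes, conv_activations, X):
    pooled = conv_activations(X).mean(axis=(1, 2))
    return np.stack([clf.predict(pooled) for clf in probes], axis=1)
```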
Given a binary concept 𝑐, this approach trains a logistic regressor, predicting the pres- ence/absence of this concept in an input image from 5. Results the latent representation of a given CNN layer. In case of multi-valued concepts, the concept space has to be We present the results obtained by evaluating our ap- binarised, as discussed in Section 2.2. In this case, the proach using the two case studies described above. binarised concept value with the highest likelihood is We obtain the concept labelled dataset by returning returned. the ground-truth concept values for a random set of Unlike CME, [34] does not provide a way of selecting samples in the model training data. For dSprites, we the convolutional layer to use for concept extraction. found that a concept labelled dataset of a 100 samples or We consider the best-case scenario by selecting, for all more worked well in practice for both tasks. Thus, we tasks, the convolutional layers yielding the best concept fix the size of the concept labelled dataset to 100 in all extraction performance. For all tasks, these layers were of the dSprites experiments. For CUB, we found that a convolutional layers closest to the output (the 3rd conv. concept labelled dataset containing 15 or more samples layer in case of dSprites tasks, and the final inception per class worked well in practice. Thus, we fix the size block output layer in case of the CUB task). of the concept labelled dataset to 15 samples per class in all of the CUB experiments. In the future, we intend to explore the variation of model extraction performance 5 https://github.com/yewsiang/ConceptBottleneck performances, by computing their 𝐹 1 predictive scores for each concept, and then averaging over all concepts. We obtained 𝐹 1 scores of 92 ± 0.5%, 86.3 ± 2.0%, and 85.9 ± 2.3% for CBM, CME, and Net2Vec 𝑝̂ functions, respectively (averaged over 5 runs). Importantly, we argue that in case of a large num- ber of concepts, it is crucial to measure how concept (a) Task 1 (b) Task 2 mispredictions are distributed accross the test samples. For instance, consider a dSprites Task 2 𝑝̂ function that Figure 2: Predictive accuracy of CME and Net2Vec 𝑝̂ func- tions for all concepts achieves 90% predictive accuracy on both shape and scale concepts. The average predictive accuracy on relevant concepts achieved by this 𝑝̂ will therefore be 90%. However, if the two concepts are mis-predicted with the size of the concept labelled dataset in more for strictly different samples (i.e. none of the samples detail. have both shape and scale predicted incorrectly at the same time), this means that 20% of the test samples 5.1. Concept Prediction Performance will have one relevant concept predicted incorrectly. Given that both concepts need to be predicted correctly First, we evaluate the quality of 𝑝̂ functions produced when using them for task label prediction, this implies by CME, Net2Vec, and CBM. For both dSprites tasks, we that consequent task label prediction will not be able relied on the Label Spreading semi-supervised model to achieve over 80% task label accuracy. This effect [35], provided in scikit-learn [36], when learning the 𝑔𝑖𝑙 becomes even more pronounced in case of a larger functions for CME. For CUB, we used logistic regression number of relevant concepts. functions instead, as they gave better performance. Consequently, we defined a novel cumulative mis- prediction error metric, which we refer to as the ‘mis- 5.1.1. dSprites prediction-overlap’ (MPO) metric. 
Given a test set 𝑇 = {(𝐱^(0), 𝐜^(0)), ..., (𝐱^(𝑛), 𝐜^(𝑛))}, consisting of 𝑛 + 1 input samples 𝐱 with corresponding concept labels 𝐜, and a prediction set 𝑃 = {𝐜̂^(0), ..., 𝐜̂^(𝑛)}, 𝑀𝑃𝑂 computes the fraction of samples in the test set that have at least 𝑚 relevant concepts predicted incorrectly, as shown in Equation 3 (where 𝕀(⋅) denotes the indicator function):

\[ MPO(T, P, m) = \frac{1}{n} \sum_{i=0}^{n} \mathbb{I}\big( err(\mathbf{c}_i, \hat{\mathbf{c}}_i) \geq m \big) \tag{3} \]

Here, 𝑒𝑟𝑟 can be used to specify which concepts to measure the mis-prediction error on (i.e. in case some of the provided concepts are irrelevant). Under our assumption of all concepts being relevant, we defined 𝑒𝑟𝑟 as shown in Equation 4:

\[ err(\mathbf{c}_i, \hat{\mathbf{c}}_i) = \sum_{j=0}^{k} \mathbb{I}\big( c_{i,j} \neq \hat{c}_{i,j} \big) \tag{4} \]

Using a held-out test set, we plot the 𝑀𝑃𝑂 metric values for 𝑚 ∈ [0, ..., 112], as shown in Figure 3 (averaged over 5 runs). Importantly, 𝑝̂ function performances can be evaluated by observing their 𝑀𝑃𝑂 scores for different values of 𝑚: a larger 𝑀𝑃𝑂 score implies that a bigger proportion of samples had at least 𝑚 relevant concepts predicted incorrectly.

Figure 3: Performances of 𝑝̂ functions, evaluated using the 𝑀𝑃𝑂 metric. The green line plots the case for perfect prediction, when the predicted concepts are equivalent to the ground truth concepts (i.e. the 𝑝⋆ performance), in which case 𝑀𝑃𝑂 = 1 for 𝑚 = 0, and 𝑀𝑃𝑂 = 0 otherwise. Net2Vec obtained values within 1% deviation from the corresponding CME values for all 𝑚, and is therefore omitted here for simplicity.

Figure 2 shows the predictive performance of the 𝑝̂ functions on all concepts for the two dSprites tasks (averaged over 5 runs). As discussed in Section 4.1.1, we have access to the ground truth concept information learned by these models (shape concept information for Task 1, and shape and scale concept information for Task 2). For both tasks, 𝑝̂ functions extracted by CME successfully achieved high predictive accuracy on concepts relevant to the tasks, whilst achieving a low performance on concepts irrelevant to the tasks. Thus, CME was able to successfully extract the concept information contained in the task models. For both tasks, 𝑝̂ functions extracted by Net2Vec achieved a much lower performance on the relevant concepts.

5.1.2. CUB

As discussed in Section 4.2.3, the CUB dataset does not explicitly define how the concepts relate to the output task labels. Thus, we do not know how relevant/important different concepts are with respect to task label prediction. In this section, we make the conservative assumption that all concepts are relevant when evaluating 𝑝̂ functions, and explore relative concept importance in more detail in Section 5.3. Firstly, we relied on the 'average-per-concept' metrics introduced in [18] when evaluating the 𝑝̂ function performance for these models (averaged over 5 runs).

The original Task 1, Task 2, and CUB models achieved task performances of 100±0%, 100±0%, and 82.7±0.4%, respectively, as described in Section 4.

Table 1: Fidelity of extracted 𝑓̂ models
         CME           CBM          Net2Vec
Task 1   100.0±0.0%    –            24.5±3.6%
Task 2   99.3±0.5%     –            38.3±4.0%
CUB      74.42±3.1%    77.5±0.2%    73.8±2.8%

Table 2: Task performance of extracted 𝑓̂ models
         CME           CBM          Net2Vec
Task 1   100.0±0%      –            24.5±3.6%
Task 2   99.3±0.5%     –            38.3±4.0%
CUB      70.8±1.8%     75.7±0.6%    69.8±1.5%

Overall, CME performed almost identically to Net2Vec, and worse than CBM according to the 𝑀𝑃𝑂 metric.
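A direct implementation sketch of the 𝑀𝑃𝑂 metric defined in Equations (3) and (4), assuming the ground-truth and predicted concept labels are available as integer arrays of shape (n_samples, k); the sum is normalised here by the number of samples.

```python
# Sketch of the mis-prediction-overlap (MPO) metric.
import numpy as np

def mpo(c_true, c_pred, m):
    """Fraction of test samples with at least m concepts predicted incorrectly."""
    err_per_sample = (np.asarray(c_true) != np.asarray(c_pred)).sum(axis=1)  # Eq. (4)
    return float(np.mean(err_per_sample >= m))                               # Eq. (3)

# An MPO curve over m, of the kind plotted in Figure 3:
# curve = [mpo(c_true, c_pred, m) for m in range(0, 113)]
```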
Similar performance to Net2Vec is likely caused by For both dSprites tasks, CME 𝑓̂ models achieved high (i) concepts being binary (requiring no binarisation) (99%+) fidelity and task performance scores, indicat- (ii) the Inception-v3 model having a relatively large ing that CME successfully approximated the original number of convolutional layers, implying that the final dSprites models. Furthermore, these scores were con- convolutional layer likely learned higher-level features, siderably higher than those produced by the Net2Vec relevant to concept prediction. 𝑓̂ models. Importantly, 𝑀𝑃𝑂 showed that both CBM and CME For the CUB task, both CME and Net2Vec 𝑓̂ models 𝑝̂ functions had a significant proportion of test samples achieved relatively lower fidelity and task performance with incorrectly-predicted relevant concepts (e.g. CME scores (in this case, performance of CME was very had an MPO score of 0.25 at 𝑚 = 4, implying that 25% similar to that of Net2Vec). Crucially, the CBM model of all test samples have at least 4 concepts predicted also achieved relatively low fidelity and accuracy scores incorrectly). In practice, these mispredictions can have (as anticipated from our 𝑀𝑃𝑂 metric analysis). This a significant impact on consequent task label predictive implies that concept information learnable from the performance, as will be further explored in the next data is insufficient for achieving high task accuracy. section. Hence the relatively high CUB model accuracy has to be caused by the CUB model relying on other non-concept 5.2. Task Performance information. Thus, the low fidelity of CME and Net2Vec is a consequence of the CUB model being non-concept- In this section, we evaluate the fidelity and performance decomposable, implying that it’s behaviour cannot be of the extracted 𝑓̂ models. For all CME and Net2Vec 𝑝̂ explained by the desired concepts. The next section functions evaluated in the previous section, we trained discusses possible approaches to fixing this issue. output-to-concept functions 𝑞̂ , predicting class labels from the 𝑝̂ concept predictions. Next, for every 𝑝̂ , we defined its corresponding 𝑓̂ as discussed in Section 3, 5.3. Intervening via a composition of 𝑝̂ and its associated 𝑞̂ . For every In the previous section, we demonstrated how CME can 𝑓̂ , we evaluated its fidelity and its task performance, be used to identify whether a model relies on desired using a held-out sample test set. Table 1 shows the concepts during decision-making. In this section, we fidelity of extracted models, and Table 2 shows the task demonstrate how CME can be used to suggest model Figure 5: t-SNE plots for the relevant Task 2 concepts. Each row corresponds to a different concept, and each column corresponds to a different layer of the Task 2 model. Each plot is colored with respect to the concept’s values. For every concept row, the subplot with a green border indicates the layer CME selected for predicting the value of that concept. Figure 4: The task accuracy of 𝑞̂ functions, trained on con- cepts predicted by 𝑝̂ functions, with top # No. corrected concepts set to their ground truth values. Performance of Net2Vec was very similar to that of CME, and is thus omit- arising due to data properties (e.g. the data not being ted here for simplicity. representative with respect to key concepts), not model properties (e.g. architecture, or training regime). 
Overall, we demonstrated how CME can be used to improvements, aligning model behaviour with the de- identify the key concept information that can be used sired concepts. to improve performance of DNN models, and ensure We trained a logistic regression 𝑞̂ model predicting that they are closer aligned with the desired concept- task labels from ground-truth concept labels for the based behaviour. Furthermore, we demonstrated how CUB task, obtaining an accuracy score of 96.4 ± 0.5% CME can be used to identify whether undesired model on a held-out test set (averaged over 5 runs). Using behaviour is caused by model properties, or data prop- this model’s coefficient magnitudes as a measure of erties. concept importance, we discovered that the 32 most important concepts identified this way were sufficient 5.4. Explainability for achieving over 96% task accuracy using logistic regression. By studying CME-extracted 𝑝̂ and 𝑞̂ functions sepa- Using this reduced concept set, we inspected how our rately, we can gain additional insights into what con- CUB 𝑞̂ function performances would change, if their cept information the original model learned and how corresponding 𝑝̂ functions extracted these concepts this concept information is used to make predictions. perfectly. This was achieved by taking the 𝑝̂ concept We give examples of how these sub-models can be in- predictions of these concepts on the test and training spected in the remainder of this section. sets, setting the values of the top 𝑖 most important concepts to their ground truth values, training logistic 5.4.1. Input-to-Concept (𝑝̂ ) regression 𝑞̂ functions on these modified training sets, CME extraction of 𝑝̂ functions from a DNN model is and measuring their accuracies on the modified test sets highly complementary to existing approaches on la- (this approach is referred to as concept intervention in tent space analysis. For example, Figure 5 shows a the rest of this work). The results are shown in Figure t-SNE [37] 2D projected plot of every layer’s hidden 4, with 𝑖 ranging from 0 to 32. space of the dSprites Task 2 model, highlighting dif- These results demonstrate that concept information ferent concept values of the two relevant concepts, as from only 32 concepts is sufficient for achieving over well as the layers used by CME to predict them. Figure 96% task performance. Thus, predictive performance 5 demonstrates several important ways in which CME of the CUB model can be significantly improved (up concept extraction can be combined with existing la- to 14%) by ensuring that the model is able to learn and tent space analysis approaches, which will be discussed use this concept information. Crucially, these results in the remainder of this section. Further examples are show that CME concept intervention also significantly given in Appendix C. improves CBM model performance, indicating that the necessary concept information is not learnable from the Manifold Types Using ground-truth concept infor- data. Hence, undesired CUB model behaviour is likely mation and hidden space visualisation, it is possible to inspect the nature of latent space manifolds, with respect to specific concepts. Firstly, this inspection al- lows to build an intuition of how concept information is represented in a particular latent space. Secondly, it is possible to use this information when selecting the types of 𝑝̂ functions to use during concept extraction. For instance, some manifolds consist of “blobs” encod- ing distinct concept values (e.g. 
row shape, columns dense, dense_1), suggesting that the latent space is clustered with respect to a concept’s values. Figure 6: Visualisation of a decision tree 𝑞̂ extracted from the Task 1 model. The model has correctly learned to differ- Variation Across Layers Using ground-truth con- entiate between classes based on the shape concept values. cept information and hidden space visualisation, it is also possible to inspect how concept information representation varies across layers of a DNN model. behaviour is consistent with user expectations (model Firstly, this inspection allows to build an intuition of verification), (ii) identifying specific concepts or con- how concept-related information is transformed by the cept interactions (if any) causing incorrect behaviour DNN. Secondly, it is possible to use this information to (model debugging), (iii) extracting new knowledge about identify the ‘best’ layers to extract concept information how concept information can be used for solving a par- from. For instance, both rows shape and scale illus- ticular task (knowledge extraction). Further examples trate that the manifolds of higher layers become more and analysis of extracted 𝑞̂ functions can be found in unimodal (separating concept values) with respect to Appendix D. the relevant concepts. Importantly, this analysis, to- gether with the definition of 𝑝̂ allows using different layers for extracting different concepts. 6. Conclusions Overall, we argue that CME concept extraction can We present CME: a concept-based model extraction be well-integrated with existing latent space analysis framework, used for analysing DNN models via concept- approaches, in order to study which concept informa- based extracted models. Using two case-studies, we tion is learned by a DNN, and how this information is demonstrate how CME can be used to (i) analyse con- represented across DNN layers. This type of inspec- cept information learned by DNN models (ii) analyse tion can have numerous applications, including: (i) how DNNs use concept information when making pre- inspecting which concepts a model has learned, and dictions (iii) identifying key concept information that verifying whether it has learned the desired concepts can further improve DNN predictive performance. CME (useful for model explanations and model verification), is a model-agnostic, general-purpose framework, which (ii) inspecting how concept information is represented can be combined with a wide variety of different DNN across different layers (useful for fine-grained model models and corresponding tasks. analysis), (iii) extracting concept predictions from a In this work, we assume a fixed set of concept la- DNN (useful for knowledge extraction). Further exam- bels available to CME before model extraction begins ples and analysis of extracted 𝑝̂ functions can be found (i.e. the concept-labelled dataset). In the future, we in Appendix C. intend to explore active-learning based approaches to obtaining maximally-informative concept labels in an 5.4.2. Concept-to-Output (𝑞̂ ) interactive fashion. Consequently, these approaches 𝑞̂ functions encapsulate how a DNN uses concept infor- will improve extracted model fidelity by retrieving the mation when making predictions. Hence, these func- most informative concept labels, and reduce manual tions can be inspected directly, in order to analyse concept labelling effort. model behaviour represented in terms of concepts. 
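As a sketch of such an inspection (the array names below are illustrative: `predicted_concepts` stands for the output of 𝑝̂ on the training inputs, and `dnn_labels` for the original DNN's labels on the same inputs), a decision-tree 𝑞̂ can be fitted on the predicted concepts and its rules printed directly:

```python
# Sketch of fitting and inspecting a decision-tree q_hat mapping concepts -> task labels.
from sklearn.tree import DecisionTreeClassifier, export_text

def extract_q_hat(predicted_concepts, dnn_labels, max_depth=5):
    q_hat = DecisionTreeClassifier(max_depth=max_depth)
    q_hat.fit(predicted_concepts, dnn_labels)   # train on the DNN's own output labels
    return q_hat

# Example inspection, with one human-readable name per concept column:
# q_hat = extract_q_hat(predicted_concepts, dnn_labels)
# print(export_text(q_hat, feature_names=list(concept_names)))
```

For a concept-decomposable model such as the dSprites Task 1 model, the printed tree should split only on the concepts the task actually depends on (here, shape).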
An Given the rapidly-increasing interest in concept-based example is given in Figure 6, in which we plot the deci- explanations of DNN models, we believe our approach sion tree 𝑞̂ function extracted by CME from the Task 1 can play an important role in providing granular concept- model. Further examples are given in Appendix D. based analyses of DNN models. Overall, inspection of 𝑞̂ functions can be used for (i) verifying that a DNN uses concept information cor- rectly during decision-making, and that it’s high-level Acknowledgements man, J. W. Vaughan, H. Wallach, Manipulat- ing and measuring model interpretability, arXiv AW acknowledges support from the David MacKay preprint arXiv:1802.07810 (2018). Newton research fellowship at Darwin College, The [10] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, Alan Turing Institute under EPSRC grant EP/N510129/1 J. Wexler, F. B. Viégas, R. Sayres, Interpretabil- & TU/B/000074, and the Leverhulme Trust via the Lev- ity beyond feature attribution: Quantitative test- erhulme Centre for the Future of Intelligence (CFI). BD ing with concept activation vectors (TCAV), in: acknowledges support from EPSRC Award #1778323. J. G. Dy, A. Krause (Eds.), Proceedings of the DK acknowledges support from EPSRC ICASE scholar- 35th International Conference on Machine Learn- ship and GSK. DK and BD acknowledge the experience ing, ICML 2018, Stockholmsmässan, Stockholm, at Tenyks as fundamental to developing this research Sweden, July 10-15, 2018, volume 80 of Proceed- idea. ings of Machine Learning Research, PMLR, 2018, pp. 2673–2682. URL: http://proceedings.mlr.press/ v80/kim18d.html. References [11] B. Zhou, Y. Sun, D. Bau, A. Torralba, Interpretable [1] B. Goodman, S. Flaxman, European union regula- basis decomposition for visual explanation, in: tions on algorithmic decision-making and a “right Proceedings of the European Conference on Com- to explanation”, AI magazine 38 (2017) 50–57. puter Vision (ECCV), 2018, pp. 119–134. [2] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Ben- [12] A. Ghorbani, J. Wexler, J. Y. Zou, B. Kim, Towards netot, S. Tabik, A. Barbado, S. García, S. Gil-López, automatic concept-based explanations, in: Ad- D. Molina, R. Benjamins, et al., Explainable ar- vances in Neural Information Processing Systems, tificial intelligence (xai): Concepts, taxonomies, 2019. opportunities and challenges toward responsible [13] C.-K. Yeh, B. Kim, S. O. Arik, C.-L. Li, P. Ravikumar, ai, Information Fusion 58 (2020). T. Pfister, On concept-based explanations in deep [3] A. Adadi, M. Berrada, Peeking inside the black- neural networks, arXiv preprint arXiv:1910.07969 box: A survey on explainable artificial intelligence (2019). (xai), IEEE Access 6 (2018). [14] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, [4] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, F. Viegas, R. Sayres, Interpretability beyond fea- Y. Jia, J. Ghosh, R. Puri, J. M. Moura, P. Eckersley, ture attribution: Quantitative testing with con- Explainable machine learning in deployment, in: cept activation vectors (tcav), arXiv preprint Proceedings of the 2020 Conference on Fairness, arXiv:1711.11279 (2017). Accountability, and Transparency, 2020, pp. 648– [15] Y. Goyal, U. Shalit, B. Kim, Explaining classifiers 657. with causal concept effect (cace), arXiv preprint [5] P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, arXiv:1907.07165 (2019). K. T. Schütt, S. Dähne, D. Erhan, B. Kim, The (un) [16] G. E. 
Hinton, Learning multiple layers of repre- reliability of saliency methods, in: Explainable sentation, Trends in cognitive sciences 11 (2007) AI: Interpreting, Explaining and Visualizing Deep 428–434. Learning, Springer, 2019, pp. 267–280. [17] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Tor- [6] D. A. Melis, T. Jaakkola, Towards robust inter- ralba, Object detectors emerge in deep scene cnns, pretability with self-explaining neural networks, arXiv preprint arXiv:1412.6856 (2014). in: Advances in Neural Information Processing [18] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, Systems, 2018, pp. 7775–7784. E. Pierson, B. Kim, P. Liang, Concept bottleneck [7] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, models, in: Proceedings of Machine Learning M. Hardt, B. Kim, Sanity checks for saliency maps, and Systems 2020, International Conference on in: Advances in Neural Information Processing Machine Learning, 2020, pp. 11313–11323. Systems, 2018, pp. 9505–9515. [19] F. D.-V. Isaac Lage, Human-in-the-loop learning [8] B. Dimanov, U. Bhatt, M. Jamnik, A. Weller, You of interpretable and intuitive representations, in: shouldn’t trust me: Learning models which con- ICML Workshop on Human Interpretability, 2020. ceal unfairness from multiple explanation meth- URL: http://whi2020.online/static/pdfs/paper_31. ods, in: European Conference on Artificial Intelli- pdf. gence, 2020. [20] R. Andrews, J. Diederich, A. B. Tickle, Survey and [9] F. Poursabzi-Sangdeh, D. G. Goldstein, J. M. Hof- critique of techniques for extracting rules from trained artificial neural networks, Knowledge- based systems 8 (1995) 373–389. ceedings of the IEEE conference on computer vi- [21] J. R. Zilke, E. L. Mencía, F. Janssen, Deepred– sion and pattern recognition, 2018, pp. 4109–4118. rule extraction from deep neural networks, in: [34] R. Fong, A. Vedaldi, Net2vec: Quantifying and International Conference on Discovery Science, explaining how concepts are encoded by filters Springer, 2016, pp. 457–473. in deep neural networks, in: Proceedings of the [22] D. Chen, S. P. Fraiberger, R. Moakler, F. Provost, IEEE conference on computer vision and pattern Enhancing transparency and control when draw- recognition, 2018, pp. 8730–8738. ing data-driven inferences about individuals, Big [35] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, data 5 (2017) 197–212. B. Schölkopf, Learning with local and global con- [23] R. Krishnan, G. Sivakumar, P. Bhattacharya, Ex- sistency, in: Advances in Neural Information tracting decision trees from trained neural net- Processing Systems 16, 2004. works, Pattern recognition 32 (1999). [36] F. Pedregosa, G. Varoquaux, A. Gramfort, [24] M. Sato, H. Tsukimoto, Rule extraction from V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret- neural networks via decision tree induction, in: tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, IJCNN’01. International Joint Conference on Neu- A. Passos, D. Cournapeau, M. Brucher, M. Per- ral Networks. Proceedings (Cat. No. 01CH37222), rot, E. Duchesnay, Scikit-learn: Machine learning volume 3, IEEE, 2001, pp. 1870–1875. in Python, Journal of Machine Learning Research [25] D. Kazhdan, Z. Shams, P. Liò, Marleme: A multi- 12 (2011). agent reinforcement learning model extraction [37] L. v. d. Maaten, G. Hinton, Visualizing data using library, arXiv preprint arXiv:2004.07928 (2020). t-sne, Journal of Machine Learning Research 9 [26] Q. Liu, X. Liao, L. Carin, Semi-supervised multi- (2008) 2579–2605. 
task learning, in: Advances in Neural Information Processing Systems, 2008. [27] L. Matthey, I. Higgins, D. Hassabis, A. Lerch- A. Concept Decomposition ner, dsprites: Disentanglement testing sprites dataset, https://github.com/deepmind/dsprites- The results and findings presented in existing work on dataset/, 2017. concept-based explanations suggests that users often [28] C. Wah, S. Branson, P. Welinder, P. Perona, S. Be- think of tasks in terms of concepts and concept interac- longie, The caltech-ucsd birds-200-2011 dataset tions (see Section 2.1 for further details). For instance, (2011). consider the task of determining the species of a bird [29] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, from an image. A user will typically perform this task R. E. Howard, W. E. Hubbard, L. D. Jackel, Hand- by first identifying relevant concepts (e.g. wing color, written digit recognition with a back-propagation head color, and beak length) present in a given image, network, in: Advances in neural information pro- and then using the values of these concepts to infer the cessing systems, 1990, pp. 396–404. bird species, in a bottom-up fashion. [30] N. Srivastava, G. Hinton, A. Krizhevsky, On the other hand, Machine Learning (ML) mod- I. Sutskever, R. Salakhutdinov, Dropout: els usually rely on high-dimensional data representa- A simple way to prevent neural networks tions, and infer task labels directly from these high- from overfitting, Journal of Machine Learn- dimensional inputs (e.g. a CNN produces a class label ing Research 15 (2014) 1929–1958. URL: from raw input pixels of an image). http://jmlr.org/papers/v15/srivastava14a.html. Consequently, Concept Decomposition (CD) approaches [31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wo- attempt to explain the behaviour of such ML models by jna, Rethinking the inception architecture for decomposing their processing into two distinct steps: computer vision, in: Proceedings of the IEEE concept extraction, and label prediction. In concept conference on computer vision and pattern recog- extraction, concept information is extracted from the nition, 2016, pp. 2818–2826. high-dimensional input data. In label prediction, con- [32] A. Krizhevsky, I. Sutskever, G. E. Hinton, Ima- cept information is used to produce the output label. genet classification with deep convolutional neu- Hence, CD approaches attempt to explain ML model ral networks, in: Advances in neural information behaviour in terms of human-understandable concepts processing systems, 2012, pp. 1097–1105. and their interactions in a bottom-up fashion, parallel- [33] Y. Cui, Y. Song, C. Sun, A. Howard, S. Be- ing human-like reasoning more closely. longie, Large scale fine-grained categorization Importantly, whilst this work focuses on CNN mod- and domain-specific transfer learning, in: Pro- els and tasks, the notion of CD can in principle be applied to any ML model and task. concept-related knowledge stored in these models. Con- sequently, we believe that CME will be invaluable in A.1. CBMs situations where concept-related information is expen- sive/difficult to obtain, or is only partially-known. In CBMs can be seen as a special case of models perform- these cases, a user may interact with existing DNN mod- ing CD, in which CD behaviour is enforced by design. els via CME, in order to refine existing concept-related Hence, these models explicitly consist of two submod- knowledge. 
els, with the first submodel extracting concept infor- It should be noted that a CBM can trivially be ap- mation, and the second submodel using this concept proximated using CME, by defining 𝑝̂ as the output of information for producing task labels. Importantly, a CBM’s concept bottleneck layer, and defining 𝑞̂ as non-CBM models can still demonstrate CD behaviour. the CBM’s submodel producing task labels from the For instance, the dSprites Task 2 model was shown to bottleneck layer output. have CD behaviour, with relevant concept information extracted in the dense layers, and used for classification A.3. Further Discussion decisions. As discussed in Section 3, CME explores whether a A.2. CBMs & CME DNN is concept-decomposable, by attempting to ap- proximate it with an extracted model that is concept- The utility of CBMs is that they produce models explic- decomposable by design (i.e. explicitly consists of two itly encouraged to use CD. Consequently, these models separate stages). Intuitively, if a given DNN learns and are much more likely to rely on the desired concepts relies on concept information of the specified concepts during decision-making, and be more aligned with a during label prediction, this concept information will user’s mental model of the corresponding task. be contained in the DNN latent space. Hence, the DNN However, a given DNN model can already exhibit decision process could be separated into two steps: con- CD behaviour, and use the desired concept information cept information extraction, and consequent task label (e.g. as was the case with both dSprites task models). prediction. In this case, costly modifications and model re-training Importantly, existing CD-based approaches (such as are unnecessary. As discussed in Section 3, CME can those discussed in Section 2.2) require the set of con- extract concept information from pre-trained DNNs by cepts and their values to be (i) sufficient to solve the training 𝐿 ∗ 𝑘 concept predictors (where 𝐿 denotes the corresponding classification task (i.e. the class labels number of DNN layers used in concept extraction, and can be predicted from concept information with high 𝑘 denotes the number of concepts). As demonstrated accuracy) (ii) learnable from the data (i.e. the DNN in Section 5, these concept predictors can consist of model will be able to learn concept information from simpler models (e.g. LRs), trained on only a fraction the given dataset), in order to achieve high task perfor- of the DNN training data. Thus, the computational mance. cost of training these concept predictors is significantly However, these works do not discuss how to handle smaller, compared to training a bottleneck model on cases where these assumptions do not hold (e.g. as was all the training data, as done in the case of CBMs. the case with the CUB task). Thus, exploring ways More importantly, CBM models require knowledge of efficiently discovering relevant concepts sufficient of existing concepts and available concept annotations. for solving a given task, as well as ways of ensuring In practice, these annotations are often expensive to whether this concept information is learnable from the produce, especially for large datasets and/or a large data are both important research directions for future number of concepts. Furthermore, information about work. which concepts are relevant and/or sufficient for solv- ing a given task is often not fully available either. In- stead, CME is capable of using existing DNN models B. 
dSprites Dataset to extract this information automatically in a semi- supervised fashion, making concept discovery (identi- B.1. Description fying the relevant concepts), and concept annotation dSprites is a dataset of 2D shapes, procedurally gener- both faster and cheaper. ated from 6 ground truth independent concepts (color, Overall, CME permits efficient interaction with pre- shape, scale, rotation, x and y position). Table 3 lists the trained DNN models, which can be used to leverage concepts, and corresponding values. dSprites consists Figure 7: t-SNE plots for the relevant Task 1 concept. Each column corresponds to a different layer of the Task 1 model. Each plot is colored with respect to the concept’s values. The subplot with a green border indicates the layer 𝑝̂ uses for predicting the value of that concept of 64×64 pixel black-and-white images, generated from in [18]. Further details regarding layer naming and/or all possible combinations of these concepts, for a total concept naming can be found in 6 . For all concepts, con- of 1 × 3 × 6 × 40 × 32 × 32 = 737280 total images. cept values become significantly better-separated after the Mixed_7c layer. However, the figure shows that Table 3 concept values are still quite mixed together for some dSprites concepts and values of the points, even for later layers. This low separability Name Values indicates that concept values will still be mis-predicted Color white for some of the points, and that concept extraction for Shape square, ellipse, heart the CUB task will likely perform suboptimally. Scale 6 values linearly spaced in [0.5, 1] Rotation 40 values in [0, 2𝜋] Position X 32 values in [0, 1] D. Concept-to-Output Functions Position Y 32 values in [0, 1] Figure 9 shows the decision tree extracted for dSprites Task 2. Overall, this model has correctly learned to differentiate between classes based on the shape and B.2. Pre-processing scale concepts (note: there are 3 × 6 shape and scale concept values, for a total of 18 output classes). We select 16 of the 32 values for Position X and Posi- tion Y (keeping every other value only), and select 8 of the 40 values for Rotation (retaining every 5th value). This step makes the dataset size more manageable (re- ducing it from 737280 to 3 ∗ 6 ∗ 8 ∗ 16 ∗ 16 = 36864 samples), whilst preserving its characteristics and prop- erties, such as concept value ranges and diversity. C. Input-to-Concept Functions Figure 7 shows a t-SNE 2D projected plot of every layer’s hidden space of the dSprites Task 1 model, high- lighting different concept values of the relevant shape concept, and which layers were used by CME to predict it. The CUB model has a considerably larger number of layers, and a considerably larger number of task con- cepts. Hence, for the sake of space, we demonstrate an example here using only 6 different model layers of the CUB model, and showing only the top 5 important concepts identified in Section 5.3. In this Figure, the concepts are named using their indices, and the lay- ers are named following the naming convention used 6 https://github.com/yewsiang/ConceptBottleneck/tree/master/CUB Figure 8: t-SNE plots for the top 5 CUB concepts. Each column corresponds to a different layer of the CUB model. Each plot is colored with respect to the concept’s values. Figure 9: Visualisation of a decision tree 𝑞̂ extracted from the Task 2 model. The model has correctly learned to differentiate between classes based on the shape and scale concept values.
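A sketch of the latent-space inspection behind Figures 7 and 8, assuming the same hypothetical `layer_outputs(l, X)` helper for reading a layer's (flattened) hidden representation: each listed layer is projected to two dimensions with t-SNE and coloured by a concept's values.

```python
# Sketch of per-layer t-SNE plots coloured by ground-truth concept values.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_concept_manifolds(layer_outputs, layers, X, concept_values, concept_name):
    fig, axes = plt.subplots(1, len(layers), figsize=(4 * len(layers), 4))
    for ax, l in zip(np.atleast_1d(axes), layers):
        emb = TSNE(n_components=2).fit_transform(layer_outputs(l, X))  # 2D projection of layer l
        sc = ax.scatter(emb[:, 0], emb[:, 1], c=concept_values, s=5, cmap="viridis")
        ax.set_title(f"layer {l}")
    fig.suptitle(f"t-SNE of hidden spaces, coloured by '{concept_name}'")
    fig.colorbar(sc, ax=np.atleast_1d(axes).tolist(), shrink=0.8)
    plt.show()
```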