<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Globally local and fast explanations of t-SNE-like nonlinear</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cyril de Bodt</string-name>
          <email>cyril.debodt@uclouvain.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Lambert</string-name>
          <email>pierre.h.lambert@uclouvain.be</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rebecca Marion</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julien Albert</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel Jean</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sacha Corbugy</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>UNamur - NaDI/PReCISE</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>TRAIL</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Namur</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Belgium</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>MIT Media Lab</institution>
          ,
          <addr-line>Cambridge [MA]</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multitel &amp; TRAIL</institution>
          ,
          <addr-line>Mons</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Stochastic Neighbor Embedding (SNE) [5] mitigates</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>UCLouvain - ICTEAM &amp; TRAIL</institution>
          ,
          <addr-line>Louvain-la-Neuve</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Workshop Proceedings</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>[24] L. Pagliosa</institution>
          ,
          <addr-line>P. Pagliosa, L. G. Nonato, Understand-</addr-line>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>[32] R. Marion, A. Bibal</institution>
          ,
          <addr-line>B. Frénay, Bir: A method for</addr-line>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>[33] B. Kang</institution>
          ,
          <addr-line>D. García García, J. Lijffijt, R. Santos-</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2002</year>
      </pub-date>
      <abstract>
        <p>Nonlinear dimensionality reduction (NLDR) algorithms such as t-SNE are often employed to visually analyze high-dimensional (HD) data sets in the form of low-dimensional (LD) embeddings. Unfortunately, the nonlinearity of the NLDR process prohibits the interpretation of the resulting embeddings in terms of the HD features. State-of-the-art studies propose post-hoc explanation approaches to locally explain the embeddings. However, such tools are typically slow and do not automatically cover the entire LD embedding, instead providing local explanations around one selected data point at a time. This prevents users from quickly gaining insights about the general explainability landscape of the embedding. This paper presents a globally local and fast explanation framework for NLDR embeddings. This framework is fast because it only requires the computation of sparse linear regression models on subsets of the data, without ever reapplying the NLDR algorithm itself. In addition, the framework is globally local in the sense that the entire LD embedding is automatically covered by multiple local explanations. The different interpretable structures in the embedding are directly characterized, making it possible to quantify the importance of the HD features in various regions of the LD embedding. An example use-case is examined, emphasizing the value of the presented framework. Public codes and a software are available at https://github.com/PierreLambert3/glocally_explained. dimensionality reduction, data visualization, interactivity, interpretability, explainability, t-SNE, data exploration Advances in Interpretable Machine Learning and Artificial Intelligence, ∗Corresponding author.</p>
      </abstract>
      <kwd-group>
        <kwd>Numerous other</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Dimensionality
reduction
(DR)
computes
dimensional (LD) representations of high-dimensional
(HD) data, e.g., to visually explore them or to curb the
curse of dimensionality [1].</p>
      <p>
        The relevance of a DR
method for a given visualization task typically depends
on its preservation of the HD neighborhoods in the
resulting LD embedding [2]. Two major frameworks
have been proposed for projecting from HD to LD
coordinates [1]: one is based on preserving distances [3],
while the other is based on reproducing neighborhoods
[
        <xref ref-type="bibr" rid="ref9">4, 5</xref>
        ]. For instance, distance-preserving methods like
principal component analysis (PCA) [6] and classical
metric multidimensional scaling (MDS) [3] project HD
samples linearly; nonlinear variants of these methods
      </p>
      <p>https://github.com/PierreLambert3 (P. Lambert);
https://github.com/cdebodt (C. de Bodt)</p>
      <p>0000-0003-2347-1756 (C. de Bodt)
as the popular t-SNE [15], UMAP [16], multi-scale
perplexity-free approaches [17, 18, 19], etc.</p>
      <sec id="sec-1-1">
        <title>While these nonlinear DR (NLDR) algorithms deliver</title>
      </sec>
      <sec id="sec-1-2">
        <title>HD data, their intrinsic nonlinearity greatly affects the</title>
        <p>interpretability of the LD representations. Indeed, the
obtained LD dimensions are hardly or most often not
interpretable in terms of the HD features [20]. Since NLDR
methods are not interpretable by design, previous studies
have developed techniques to analyze and interpret the</p>
      </sec>
      <sec id="sec-1-3">
        <title>LD embeddings, which is known as post-hoc explanation</title>
        <p>
          or interpretability [21]. One can for instance cite [
          <xref ref-type="bibr" rid="ref15">22</xref>
          ],
(e.g., [7, 8]) aim to preserve weighted Euclidean or impressively faithful LD embeddings with respect to the
decision trees. On the other hand, [21] locally explain fast. The globally local nature of our approach refers to
t-SNE embeddings by adapting LIME; the authors argue the fact that multiple local explanations are automatically
that explaining the entire embedding at once would be computed over the entire LD embedding (i.e., globally).
difficult, as t-SNE usually does not preserve large HD Such an automatic processing enables the user to directly
distances well [20]. However, the local nature of t-SNE glimpse the overall explainability landscape of the
emmotivates the computation of local explanations in the LD bedding, as well as a structured overview of the impact of
embedding; LIME can then be revisited and performed the HD features in the various parts of the LD embedding.
locally around a user-selected data point. Nevertheless, The regions for which local explanations are learned in
such an approach has two main limitations: (1) it is slow, the LD embedding can be determined in different ways
and (2) it does not cover the entire LD embedding auto- [24]: using a clustering algorithm such as K-means, as
matically, as local explanations are only provided around in this work, thanks to a manual selection performed by
data points that have been selected, one at a time. This the user, or by recursively splitting the embedding into
approach is slow because, in order to explain a given data subcells along the LD dimensions based on a model error
point’s position in the embedding, t-SNE must be reap- criterion.
plied to many artificially simulated points around that Our fast and globally local explanation framework can
data point; the non-parametric nature of t-SNE, combined be viewed as taking the best of both linear and
nonlinwith its significant computational cost, greatly increases ear projection worlds: the LD embedding can indeed be
computation time, which decreases the potential for in- generated by a nonlinear DR algorithm, achieving much
teractivity. The second limitation of the method is that better DR quality in terms of data visualization thanks to
the user only receives a local explanation around the increased flexibility and adaptability [ 12, 15, 2]. On the
selected point in the embedding; she must thus explore other hand, the computed local explanations are linear
the various regions of the embedding manually. This is and sparse, which promotes interpretability. Moreover,
not realistic in practice, especially when working with the globally local explanations make it possible to readily
large databases, and even more so since the approach is depict the importance of the HD features in the different
not fast. regions of the LD embedding. As an experiment, an
exam
        </p>
        <p>This paper aims to address these limitations by devel- ple use-case on a public data set is presented,
highlightoping a fast and globally local explanation framework ing the usefulness of the proposed approach. Free code
for NLDR embeddings. Based on the BIOT explanation and software are publicly available online ( https://github.
approach [23], this framework learns sparse linear re- com/PierreLambert3/glocally_explained), enabling the
gression models for subsets of the data set and does not easy use of the proposed framework.
require a reapplication of the NLDR algorithm, making it This paper is organized as follows: Section 2 first
reviews some related works. Section 3 then presents our in this paper addresses the limitations of [21] by (1)
diproposed approach, while Section 4 discusses an example rectly providing local explanations everywhere in the LD
use-case. Section 5 draws final conclusions. embedding (i.e., globally local explanations), (2)
avoiding the need to sample new artificial data points, and (3)
relying only on the calculation of linear regression
mod2. Related works els, which ensures fast processing and hence facilitates
interactivity.</p>
        <p>
          Interpreting NLDR techniques is a challenging task. To
tackle this challenge, various approaches have been
proposed. Some papers (e.g., [
          <xref ref-type="bibr" rid="ref16">25, 26, 27</xref>
          ]) have proposed 3. Proposed approach
methods for explaining the LD embedding dimensions
with respect to the HD features. Since local NLDR algo- This section introduces our proposed approach for
globrithms such as t-SNE do not effectively preserve large dis- ally local and fast explanations of NLDR embeddings.
tances, explaining the resulting embedding dimensions Section 3.1 first summarizes our notations. Section 3.2
with these methods may be misleading. Other methods then details our methodology, and Section 3.3 finally
attempt to interpret NLDR results by explaining visual presents an optional fine-tuning strategy.
clusters [
          <xref ref-type="bibr" rid="ref15">22, 28, 29</xref>
          ]. For example, in [
          <xref ref-type="bibr" rid="ref15">22</xref>
          ], the authors
propose an interactive pipeline for explaining clusters in the 3.1. Notations
LD embedding using decision trees; this pipeline enables
the user to manually select LD clusters, which are then Matrices are denoted with bold-faced capital letters (e.g.,
explained in terms of the HD features with a decision  ⃗), vectors with bold-faced lower-case letters (e.g.,  ⃗) and
tree, an interpretable model. The resulting model can scalars with lower-case letters (e.g.,  ). A single element
be used to explain why certain data points are clustered from a matrix is denoted with a lower-case letter with two
together and to identify the HD features that distinguish subscripts (e.g.,   ), the first indicating the row and the
the different clusters. In contrast, our proposed approach second indicating the column. Instances are indexed by
aims to understand intra-cluster positions, i.e., the HD the letter  ∈ {1, ..., } , features by the letter  ∈ {1, ..., } ,
features that make two points from the same cluster lie embedding dimensions by the letter  ∈ {1, ..., } and
at different corners of this cluster. Moreover, our frame- regions or subcells of the embedding by the letter ℓ ∈
work makes it possible to not only explain LD clusters, {1, ..., } .
but more generally interpret the overall positions of the
points in the embedding. 3.2. General methodology
        </p>
        <p>
          Other existing methods aim to locally and linearly
explain the position of a specific instance in the LD space. In [23], the Best Interpretable Orthogonal
TransformaIn particular, [21] adapts LIME [30] to locally explain  - tion (BIOT) method was proposed to explain the
dimenSNE embeddings. The original version of LIME involves sions of multidimensional scaling (MDS) embeddings. In
three steps. First, it samples instances around a point of the case of t-SNE, such an explanation strategy is not
interest. Then, it queries the model for these instances. directly applicable because t-SNE only preserves local
Finally, it fits an interpretable model with the result of the structure from the high-dimensional data. However, as
queries. In [21], the authors use a SMOTE oversampling proposed in [21],  -SNE embeddings may be explained
technique [31] to create new artificial neighbors for the locally. Instead of learning a BIOT explanation model for
point of interest. To query  -SNE, the entire DR process the entire embedding (i.e., a single global explanation),
is re-applied for each sampled instance, since the t-SNE we propose learning different BIOT models for different
mapping function is unknown. Finally, BIR [32] —which regions (or subcells) of the embedding (i.e. local
explais the predecessor of BIOT [23], a method employed in nation). For a given region, the BIOT model identifies
our work —is used to produce local explanations; BIR the features that best explain the positioning of points
finds the rotation of the queried sampled data that results within that region of the embedding, independently of all
in the best explanation model (in terms of model sparsity other regions. This approach can be applied to any
nonand error). While the approach presented in [21] pro- linear 2-D embedding, including embeddings generated
vides nice intuitions about the LD embedding structure, by t-SNE and its extensions (e.g., [33, 19]) or by other
it has several limitations. First, it can only compute one NLDR algorithms (e.g., [
          <xref ref-type="bibr" rid="ref13">16, 17, 9, 34</xref>
          ]).
local explanation at a time, for one selected point. Sec- Let  (⃗×) be the matrix of  features used to generate
ond, the obtained explanation is highly dependent on the the embedding  ⃗ ( × 2 ). Furthermore, let  ⃗ ( × 2 ) and  ⃗0
artificial sampling. Finally, running the entire NLDR pro- (2 × 1) contain the weights and intercepts for the linear
cess for all sampled instances is (very) time consuming, models relating the features in  ⃗ to each dimension of the
and thus prohibits interactivity. The approach presented embedding  ⃗, where there is one model per dimension.
        </p>
        <p>1</p>
        <p>∑
2 =1 =1</p>
        <p>2
 0( ,⃗ ⃗
0, )⃗=</p>
        <p>⊤
which is minimized w.r.t  ⃗ ,  ⃗0 and  ⃗ under the constraint</p>
        <p>Clearly, this objective function can be extended to the
case where different model parameters  ⃗ (ℓ),  ⃗0
are optimized for different regions ℓ of the embedding,
where the set of instances in region ℓ is denoted  ℓ. In
practice, the best segmentation of the embedding into
re</p>
        <p>(ℓ) and  ⃗(ℓ)
gions is unknown. In this paper, we propose segmenting
the embedding automatically by performing K-means on
the embedding data. The choice of the hyperparameter 
depends on the topology apparent in the embedding and
of the granularity of details desired by the user. Other
strategies are possible, for instance by recursively
dividing the LD dimensions along their medians.</p>
        <p>2
=1
⃗2).</p>
        <p>Finally,  ⃗ (22 ) is an orthogonal transformation matrix
that is applied to  ⃗ to promote model sparsity and
pre</p>
        <p>The interface displayed in Fig. 1 shows an embedding
with multiple local linear explanations: each explanation
diction quality, and  &gt; 0 is a hyperparameter to control is composed of a green and a burgundy axis. Explanation
model sparsity. For 2-D embeddings, the BIOT objective ⃝ A has been selected by the user; the color transparency
function for global explanation is
ding. An example use-case demonstrates that the method
can effectively reveal zones in the embedding where
points are organized according to specific HD features.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Finally, some accompanying software is provided ( https:</title>
        <p>//github.com/PierreLambert3/glocally_explained),
targeting both DR researchers and experts seeking to analyse
their data with nonlinear dimensionality reduction
visualization tools.</p>
        <p>Further works will include testing our framework with
of the points increases linearly with the absolute
difference between their position in the embedding and the
position predicted by the selected linear model (i.e., the
greater the error, the more transparent). This enables the
user to visualize the portion of the embedding for which
(1) the selected linear model is faithful. The right panel
depicts the relative importance of the HD features for each
axis of the selected explanation (i.e., ⃝ A in this case), as
quantified by the local linear model weights; the
horizontal bar under each feature name represents the feature’s
signed linear projection weight (LPW) on the considered
axis, highlighting the importance of the feature in the
local explanation. For visual clarity, only the 5 features
with the greatest LPW magnitudes are depicted for each
local explanation axis. The feature total sulfur dioxide has
been selected by the user (mark ⃝ B ). When selecting a
feature in the right panel, thick indicators appear on both
axes of all local explanations, with lengths proportional
to the LPW magnitudes of the corresponding feature on
all axes; mark ⃝ C shows two such indicators. This makes
it possible to grasp the influence of an HD feature in the
various regions of the entire embedding.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Each view in Fig. 2 shows the importance of a particu</title>
        <p>indicated at the bottom of the panel. The left view
highlights that free sulfur dioxide is particularly important
when explaining the top portion of the embedding along
a vertical direction, whereas the horizontal direction can
be partly explained by the concentration of citric acid.</p>
      </sec>
      <sec id="sec-1-6">
        <title>We observe that the structures apparent in the bottomleft part of the embedding are not very dependent on the three analyzed features.</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <sec id="sec-2-1">
        <title>This work proposes a globally local and fast explanation framework that provides multiple local linear explanations for 2-D data embeddings, enabling the user to assess, at a glance, the importance of different HD</title>
        <p>3.3. Fine-tuning
In Section 3.2, the proposed strategy for automatic seg- lar feature in the embedding, with the respective feature
mentation (K-means) depends on the coordinates of the
instances in the embedding. However, the shape and size
of the zone that can be explained may not directly depend
on the spatial coordinates of the embedding. This means
that the regions identified using K-means may not be the
most optimal with respect to the quality of the resulting
explanations. In some cases, it is hence useful to
finetune the final regions by directly considering explanation
quality. To do so, we propose a method called Clustered</p>
      </sec>
      <sec id="sec-2-2">
        <title>BIOT, which reassigns instances  to explanation regions</title>
        <p>
          Clustered BIOT can be found in Appendix A.
 ℓ based on a modification of BIOT. Further details on
4. Experiments and discussion
posed method using an interactive user interface. This
user interface is available on the public repository
indicated in the abstract. All of the featured embeddings
are representations of the winequality-red dataset,
available in the UCI machine learning repository [35]. This
data set contains 11 physico-chemical variables
describing various red wines. The embeddings are produced by
a recent NE algorithm that mixes t-SNE gradients with
those of a fast stochastic approximation of MDS, which
This section presents an example use-case for the pro- features, both locally and across the whole LD
embedpreserves HD data structures across multiple scales [
          <xref ref-type="bibr" rid="ref13">34</xref>
          ]. actual end-users in the context of a real use case; their
        </p>
        <p>An interactive pipeline for explaining visual clus- [35] D. Dua, C. Graf, UCI machine learning repository,
ters in dimensionality reduction visualizations with</p>
      </sec>
      <sec id="sec-2-3">
        <title>Explaining multidimensional nonlinear mds embed</title>
        <p>dings using the best interpretable orthogonal
transing attribute variability in multidimensional
projections, in: 2016 29th SIBGRAPI Conference on</p>
      </sec>
      <sec id="sec-2-4">
        <title>Graphics, Patterns and Images (SIBGRAPI), IEEE,</title>
        <p>2016, pp. 297–304.</p>
        <p>Telea, F. V. Paulovich, Explaining three-dimensional
dimensionality reduction plots, Information
Visuwork for dimensionality reduction based data
exploration, in: Proceedings of the 2018 CHI Conference
on Human Factors in Computing Systems, 2018, pp.
1–13.
projection matrix/tree: Interactive subspace visual
exploration and analysis of high dimensional data,</p>
      </sec>
      <sec id="sec-2-5">
        <title>IEEE Transactions on Visualization and Computer</title>
        <p>Graphics 19 (2013) 2625–2633.
analysis of dimensionality reduction results with</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. Clustered BIOT</title>
      <sec id="sec-3-1">
        <title>As mentioned in Section 3.3, the main method proposed</title>
        <p>in this paper can be fine-tuned with a method we call
Clustered BIOT. Let  ℓ = 1 if instance  is in region ℓ
and 0 otherwise. The matrix  ⃗ containing all elements
 ℓ respects the general conventions of hard clustering
(each instance belongs to exactly one cluster and each
cluster contains at least one instance). Then, the objective
function for Clustered BIOT is
 1( ,⃗{  ⃗ (ℓ),  ⃗0</p>
        <p>, ⃗(ℓ)}|ℓ=1)
=1
⊤
ization and computer graphics 26 (2019) 45–55.
contrastive learning, IEEE transactions on visual- where  ℓ ∶= { |  ℓ = 1}. For fixed  ⃗ (ℓ),  ⃗0
a given instance  , the solution for  ⃗ is the vector that
(ℓ) and  ⃗(ℓ) and
international conference on knowledge discovery  can belong to only one cluster), the optimal cluster for
minimizes
ality reduction results using shapley values, Expert</p>
      </sec>
      <sec id="sec-3-2">
        <title>Systems with Applications 178 (2021) 115020.</title>
        <p>trust you?” explaining the predictions of any
classifier, in: Proceedings of the 22nd ACM SIGKDD
and data mining, 2016, pp. 1135–1144.
selecting the best interpretable multidimensional
scaling rotation using external variables,
Neuro</p>
        <p>2
∑  ℓ
ℓ=1
Since only one element of  ⃗ can be equal to one (instance
instance  is whichever model ℓ minimizes the prediction
arg minℓ ( ⃗
⊤</p>
        <p>Thus, Clustered BIOT can be optimized by alternating
between clustering instances according to prediction
error and fitting BIOT models to the clusters. An instance
 is assigned to cluster ℓ if BIOT model ℓ has the lowest
prediction error for that instance compared to the other
which is minimized w.r.t  ⃗ and { ⃗ (ℓ),  ⃗0
,  ⃗(ℓ)}|ℓ=1 under
the constraints that (i)  ⃗(ℓ) is an orthogonal matrix ∀ℓ
and (ii)  ⃗ respects the clustering conventions above.</p>
        <p>For fixed  ⃗, the solution for { ⃗ (ℓ),  ⃗0
,  ⃗(ℓ)}|ℓ=1 can be
found by training BIOT on each subset of instances  ℓ,
(2)
(3)
(4)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>feedback will enable the improvement of the various sionality reduction</article-title>
          ,
          <source>Science</source>
          <volume>290</volume>
          (
          <year>2000</year>
          )
          <fpage>2319</fpage>
          -
          <lpage>2323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>design choices of our interface</article-title>
          .
          <source>In addition, a qualitative DOI: 10.1126/science.290.5500</source>
          .2319.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>comparison with other explainability methods such as [9</article-title>
          ]
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Roweis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Saul</surname>
          </string-name>
          , Nonlinear dimensionality
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>LIME will enable a more comprehensive evaluation of reduction by locally linear embedding</article-title>
          ,
          <source>science 290</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>the proposed method</article-title>
          . (
          <year>2000</year>
          )
          <fpage>2323</fpage>
          -
          <lpage>2326</lpage>
          . [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Suykens</surname>
          </string-name>
          ,
          <article-title>Data visualization and dimensionality reduction using kernel maps with a reference point,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Acknowledgments IEEE Trans. Neural Netw</source>
          .
          <volume>19</volume>
          (
          <year>2008</year>
          )
          <fpage>1501</fpage>
          -
          <lpage>1517</lpage>
          . [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Francois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Wertz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verleysen</surname>
          </string-name>
          , The concentra-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>This work was supported by Service Public de Wallonie tion of fractional distances 19 (</article-title>
          <year>2007</year>
          )
          <fpage>873</fpage>
          -
          <lpage>886</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>Recherche under grant n° 2010235-ARIAC by DIGITAL</article-title>
          - [12]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verleysen</surname>
          </string-name>
          , Quality assessment of di-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          WALLONIA4.
          <article-title>AI</article-title>
          .
          <article-title>SC is supported by a FRIA grant (</article-title>
          <string-name>
            <surname>F.R.S</surname>
          </string-name>
          .
          <article-title>- mensionality reduction: Rank-based criteria</article-title>
          , Neu-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>FNRS</surname>
          </string-name>
          ).
          <source>rocomputing 72</source>
          (
          <year>2009</year>
          )
          <fpage>1431</fpage>
          -
          <lpage>1443</lpage>
          . [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-R.</given-names>
            <surname>Müller</surname>
          </string-name>
          , Nonlinear
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>t-sne effectively</article-title>
          ,
          <source>Distill</source>
          <volume>1</volume>
          (
          <year>2016</year>
          )
          <article-title>e2</article-title>
          . [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bibal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Nanfack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frénay</surname>
          </string-name>
          , Explain-
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>ESANN</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>393</fpage>
          -
          <lpage>398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>de Bodt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verleysen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>bedding like t-sne and umap</article-title>
          ,
          <source>Neurocomputing 503</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bibal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clarinval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dumas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frénay</surname>
          </string-name>
          , Ixvc:
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Coimbra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
          </string-name>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          , Dimension [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fujiwara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.-H.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-L.</given-names>
            <surname>Ma</surname>
          </string-name>
          , Supporting
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          2017. URL: http://archive.ics.uci.edu/ml. [29]
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Marcilio-Jr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Eler</surname>
          </string-name>
          , Explaining dimension[30]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , “why should i [31]
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Bowyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. O.</given-names>
            <surname>Hall</surname>
          </string-name>
          , W. P. error:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>