Globally local and fast explanations of t-SNE-like nonlinear embeddings

Pierre Lambert1,∗, Rebecca Marion2, Julien Albert2, Emmanuel Jean3, Sacha Corbugy2 and Cyril de Bodt1,4
1 UCLouvain - ICTEAM & TRAIL, Louvain-la-Neuve, Belgium
2 UNamur - NaDI/PReCISE & TRAIL, Namur, Belgium
3 Multitel & TRAIL, Mons, Belgium
4 MIT Media Lab, Cambridge, MA, USA

Abstract
Nonlinear dimensionality reduction (NLDR) algorithms such as t-SNE are often employed to visually analyze high-dimensional (HD) data sets in the form of low-dimensional (LD) embeddings. Unfortunately, the nonlinearity of the NLDR process prohibits the interpretation of the resulting embeddings in terms of the HD features. State-of-the-art studies propose post-hoc explanation approaches to locally explain the embeddings. However, such tools are typically slow and do not automatically cover the entire LD embedding, instead providing local explanations around one selected data point at a time. This prevents users from quickly gaining insights about the general explainability landscape of the embedding. This paper presents a globally local and fast explanation framework for NLDR embeddings. This framework is fast because it only requires the computation of sparse linear regression models on subsets of the data, without ever reapplying the NLDR algorithm itself. In addition, the framework is globally local in the sense that the entire LD embedding is automatically covered by multiple local explanations. The different interpretable structures in the embedding are directly characterized, making it possible to quantify the importance of the HD features in various regions of the LD embedding. An example use-case is examined, emphasizing the value of the presented framework. Public code and software are available at https://github.com/PierreLambert3/glocally_explained.

Keywords
dimensionality reduction, data visualization, interactivity, interpretability, explainability, t-SNE, data exploration

Advances in Interpretable Machine Learning and Artificial Intelligence, October 21, 2022, Atlanta, Georgia, USA
∗ Corresponding author.
pierre.h.lambert@uclouvain.be (P. Lambert); cyril.debodt@uclouvain.be (C. de Bodt)
https://github.com/PierreLambert3 (P. Lambert); https://github.com/cdebodt (C. de Bodt)
ORCID: 0000-0003-2347-1756 (C. de Bodt)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Dimensionality reduction (DR) computes low-dimensional (LD) representations of high-dimensional (HD) data, e.g., to visually explore them or to curb the curse of dimensionality [1]. The relevance of a DR method for a given visualization task typically depends on its preservation of the HD neighborhoods in the resulting LD embedding [2]. Two major frameworks have been proposed for projecting from HD to LD coordinates [1]: one is based on preserving distances [3], while the other is based on reproducing neighborhoods [4, 5]. For instance, distance-preserving methods like principal component analysis (PCA) [6] and classical metric multidimensional scaling (MDS) [3] project HD samples linearly; nonlinear variants of these methods (e.g., [7, 8]) aim to preserve weighted Euclidean or approximately geodesic distances. Numerous other schemes have also been developed that determine the LD embedding design based on HD affinity matrices [9, 10]. Regrettably, the local neighborhood preservation of all of these techniques is limited in visualization contexts by the norm concentration phenomenon [11, 12], most probably due to their distance-preserving nature [1, 13]. In contrast, the native shift invariance of neighbor embedding (NE) algorithms [14] such as Stochastic Neighbor Embedding (SNE) [5] mitigates this phenomenon, leading to astonishing DR quality. These achievements have naturally encouraged the development of numerous SNE-based methods, such as the popular t-SNE [15], UMAP [16], multi-scale perplexity-free approaches [17, 18, 19], etc.

While these nonlinear DR (NLDR) algorithms deliver impressively faithful LD embeddings with respect to the HD data, their intrinsic nonlinearity greatly affects the interpretability of the LD representations. Indeed, the obtained LD dimensions are hardly or most often not interpretable in terms of the HD features [20].

Since NLDR methods are not interpretable by design, previous studies have developed techniques to analyze and interpret the LD embeddings, which is known as post-hoc explanation or interpretability [21]. One can for instance cite [22], which proposes to explain visual LD clusters thanks to decision trees. On the other hand, [21] locally explains t-SNE embeddings by adapting LIME; the authors argue that explaining the entire embedding at once would be difficult, as t-SNE usually does not preserve large HD distances well [20]. However, the local nature of t-SNE motivates the computation of local explanations in the LD embedding; LIME can then be revisited and performed locally around a user-selected data point.

Figure 1: Interface for the proposed globally local and fast explanation framework.
Nevertheless, such an approach has two main limitations: (1) it is slow, and (2) it does not cover the entire LD embedding automatically, as local explanations are only provided around data points that have been selected, one at a time. This approach is slow because, in order to explain a given data point's position in the embedding, t-SNE must be reapplied to many artificially simulated points around that data point; the non-parametric nature of t-SNE, combined with its significant computational cost, greatly increases computation time, which decreases the potential for interactivity. The second limitation of the method is that the user only receives a local explanation around the selected point in the embedding; she must thus explore the various regions of the embedding manually. This is not realistic in practice, especially when working with large databases, and even more so since the approach is not fast.

This paper aims to address these limitations by developing a fast and globally local explanation framework for NLDR embeddings. Based on the BIOT explanation approach [23], this framework learns sparse linear regression models for subsets of the data set and does not require a reapplication of the NLDR algorithm, making it fast. The globally local nature of our approach refers to the fact that multiple local explanations are automatically computed over the entire LD embedding (i.e., globally). Such automatic processing enables the user to directly glimpse the overall explainability landscape of the embedding, as well as a structured overview of the impact of the HD features in the various parts of the LD embedding.

The regions for which local explanations are learned in the LD embedding can be determined in different ways [24]: using a clustering algorithm such as K-means, as in this work, thanks to a manual selection performed by the user, or by recursively splitting the embedding into subcells along the LD dimensions based on a model error criterion.

Our fast and globally local explanation framework can be viewed as taking the best of both the linear and nonlinear projection worlds: the LD embedding can indeed be generated by a nonlinear DR algorithm, achieving much better DR quality in terms of data visualization thanks to increased flexibility and adaptability [12, 15, 2]. On the other hand, the computed local explanations are linear and sparse, which promotes interpretability. Moreover, the globally local explanations make it possible to readily depict the importance of the HD features in the different regions of the LD embedding.

As an experiment, an example use-case on a public data set is presented, highlighting the usefulness of the proposed approach. Free code and software are publicly available online (https://github.com/PierreLambert3/glocally_explained), enabling the easy use of the proposed framework.

This paper is organized as follows: Section 2 first reviews some related works. Section 3 then presents our proposed approach, while Section 4 discusses an example use-case. Section 5 draws final conclusions.

2. Related works

Interpreting NLDR techniques is a challenging task. To tackle this challenge, various approaches have been proposed. Some papers (e.g., [25, 26, 27]) have proposed methods for explaining the LD embedding dimensions with respect to the HD features. Since local NLDR algorithms such as t-SNE do not effectively preserve large distances, explaining the resulting embedding dimensions with these methods may be misleading. Other methods attempt to interpret NLDR results by explaining visual clusters [22, 28, 29]. For example, in [22], the authors propose an interactive pipeline for explaining clusters in the LD embedding using decision trees; this pipeline enables the user to manually select LD clusters, which are then explained in terms of the HD features with a decision tree, an interpretable model. The resulting model can be used to explain why certain data points are clustered together and to identify the HD features that distinguish the different clusters. In contrast, our proposed approach aims to understand intra-cluster positions, i.e., the HD features that make two points from the same cluster lie at different corners of this cluster. Moreover, our framework makes it possible to not only explain LD clusters, but more generally to interpret the overall positions of the points in the embedding.
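The cluster-explanation strategy of [22] can be illustrated in a few lines: a decision tree is trained to predict cluster membership from the HD features, and its splits name the features that separate the clusters. The sketch below is only a minimal illustration of this general idea, not the interactive IXVC pipeline itself; the synthetic data, the K-means clustering standing in for user-selected clusters, and all variable names are illustrative placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # HD features (synthetic placeholder)
# Toy 2-D "embedding": in practice this would come from t-SNE or a similar NLDR method
Y = np.c_[X[:, 0], X[:, 1]] + 0.1 * rng.normal(size=(200, 2))

# Visual clusters in the embedding (standing in for a manual user selection)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y)

# A shallow tree explaining cluster membership in terms of the HD features
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=[f"feat_{j}" for j in range(5)]))
```

The printed rules expose which HD features drive the separation between clusters, which is the kind of inter-cluster explanation that the approach of this paper complements with intra-cluster explanations.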
Other existing methods aim to locally and linearly explain the position of a specific instance in the LD space. In particular, [21] adapts LIME [30] to locally explain t-SNE embeddings. The original version of LIME involves three steps. First, it samples instances around a point of interest. Then, it queries the model for these instances. Finally, it fits an interpretable model on the results of the queries. In [21], the authors use the SMOTE oversampling technique [31] to create new artificial neighbors for the point of interest. To query t-SNE, the entire DR process is re-applied for each sampled instance, since the t-SNE mapping function is unknown. Finally, BIR [32], the predecessor of BIOT [23] (a method employed in our work), is used to produce local explanations; BIR finds the rotation of the queried sampled data that results in the best explanation model (in terms of model sparsity and error). While the approach presented in [21] provides nice intuitions about the LD embedding structure, it has several limitations. First, it can only compute one local explanation at a time, for one selected point. Second, the obtained explanation is highly dependent on the artificial sampling. Finally, running the entire NLDR process for all sampled instances is (very) time consuming, and thus prohibits interactivity. The approach presented in this paper addresses the limitations of [21] by (1) directly providing local explanations everywhere in the LD embedding (i.e., globally local explanations), (2) avoiding the need to sample new artificial data points, and (3) relying only on the calculation of linear regression models, which ensures fast processing and hence facilitates interactivity.

3. Proposed approach

This section introduces our proposed approach for globally local and fast explanations of NLDR embeddings. Section 3.1 first summarizes our notations. Section 3.2 then details our methodology, and Section 3.3 finally presents an optional fine-tuning strategy.

3.1. Notations

Matrices are denoted with bold-faced capital letters (e.g., X), vectors with bold-faced lower-case letters (e.g., x), and scalars with lower-case letters (e.g., x). A single element of a matrix is denoted with a lower-case letter with two subscripts (e.g., x_ij), the first indicating the row and the second indicating the column. Instances are indexed by the letter i ∈ {1, ..., n}, features by the letter j ∈ {1, ..., d}, embedding dimensions by the letter k ∈ {1, ..., m} and regions or subcells of the embedding by the letter ℓ ∈ {1, ..., L}.

3.2. General methodology

In [23], the Best Interpretable Orthogonal Transformation (BIOT) method was proposed to explain the dimensions of multidimensional scaling (MDS) embeddings. In the case of t-SNE, such an explanation strategy is not directly applicable because t-SNE only preserves local structure from the high-dimensional data. However, as proposed in [21], t-SNE embeddings may be explained locally. Instead of learning a BIOT explanation model for the entire embedding (i.e., a single global explanation), we propose learning different BIOT models for different regions (or subcells) of the embedding (i.e., local explanations). For a given region, the BIOT model identifies the features that best explain the positioning of points within that region of the embedding, independently of all other regions. This approach can be applied to any nonlinear 2-D embedding, including embeddings generated by t-SNE and its extensions (e.g., [33, 19]) or by other NLDR algorithms (e.g., [16, 17, 9, 34]).

Let X (n × d) be the matrix of d features used to generate the embedding Y (n × 2). Furthermore, let W (d × 2) and w_0 (2 × 1) contain the weights and intercepts of the linear models relating the features in X to each dimension of the embedding Y, where there is one model per dimension. Finally, R (2 × 2) is an orthogonal transformation matrix that is applied to Y to promote model sparsity and prediction quality, and λ > 0 is a hyperparameter controlling model sparsity. For 2-D embeddings, the BIOT objective function for global explanation is

J_0(W, w_0, R) = (1 / (2n)) ∑_{i=1}^{n} ∑_{k=1}^{2} (y_i^⊤ r_k − w_{0k} − x_i^⊤ w_k)^2 + λ ∑_{k=1}^{2} ‖w_k‖_1,    (1)

which is minimized w.r.t. W, w_0 and R under the constraint that R is an orthogonal matrix (R R^⊤ = R^⊤ R = I_2).
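Under the simplifying assumption that the rotation R is fixed to the identity, each term of Eq. (1) reduces to a standard ℓ1-penalized least-squares problem, so the per-region scheme described above (one sparse linear model per region of the embedding and per LD dimension) can be sketched with K-means segmentation plus Lasso fits. This is a simplified stand-in for the actual BIOT optimization, which also learns R; the function and variable names below are illustrative, not those of the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def glocal_explanations(X, Y, n_regions=4, lam=0.05, seed=0):
    """Fit one sparse linear model per region and per LD dimension.

    Simplified stand-in for per-region BIOT: the orthogonal rotation R
    of Eq. (1) is fixed to the identity, leaving a Lasso problem per
    dimension. Returns region labels and a (d x 2) weight matrix per region.
    """
    # Segment the embedding automatically, as proposed in Section 3.2
    regions = KMeans(n_clusters=n_regions, n_init=10,
                     random_state=seed).fit_predict(Y)
    W = {}
    for l in range(n_regions):
        S = regions == l  # instances belonging to region l
        W[l] = np.column_stack([
            Lasso(alpha=lam).fit(X[S], Y[S, k]).coef_  # one model per LD dimension k
            for k in range(Y.shape[1])
        ])
    return regions, W

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))          # HD data (synthetic placeholder)
Y = np.c_[X[:, 0] + X[:, 1], X[:, 2]]  # toy 2-D embedding
regions, W = glocal_explanations(X, Y)
# W[l][:, 0] holds the signed weights explaining the first LD axis in region l
```

The signed columns of each W[l] play the role of the linear projection weights visualized in the interface of Section 4.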
Clearly, this objective function can be extended to the case where different model parameters W^(ℓ), w_0^(ℓ) and R^(ℓ) are optimized for different regions ℓ of the embedding, where the set of instances in region ℓ is denoted S_ℓ. In practice, the best segmentation of the embedding into regions is unknown. In this paper, we propose segmenting the embedding automatically by performing K-means on the embedding coordinates. The choice of the hyperparameter K depends on the topology apparent in the embedding and on the granularity of details desired by the user. Other strategies are possible, for instance recursively dividing the LD dimensions along their medians.

3.3. Fine-tuning

In Section 3.2, the proposed strategy for automatic segmentation (K-means) depends on the coordinates of the instances in the embedding. However, the shape and size of the zones that can be explained may not directly depend on the spatial coordinates of the embedding. This means that the regions identified using K-means may not be optimal with respect to the quality of the resulting explanations. In some cases, it is hence useful to fine-tune the final regions by directly considering explanation quality. To do so, we propose a method called Clustered BIOT, which reassigns instances i to explanation regions S_ℓ based on a modification of BIOT. Further details on Clustered BIOT can be found in Appendix A.

4. Experiments and discussion

This section presents an example use-case for the proposed method using an interactive user interface. This user interface is available in the public repository indicated in the abstract. All of the featured embeddings are representations of the winequality-red dataset, available in the UCI machine learning repository [35]. This data set contains 11 physico-chemical variables describing various red wines. The embeddings are produced by a recent NE algorithm that mixes t-SNE gradients with those of a fast stochastic approximation of MDS, which preserves HD data structures across multiple scales [34].

The interface displayed in Fig. 1 shows an embedding with multiple local linear explanations: each explanation is composed of a green and a burgundy axis. Explanation A has been selected by the user; the color transparency of the points increases linearly with the absolute difference between their position in the embedding and the position predicted by the selected linear model (i.e., the greater the error, the more transparent). This enables the user to visualize the portion of the embedding for which the selected linear model is faithful. The right panel depicts the relative importance of the HD features for each axis of the selected explanation (i.e., A in this case), as quantified by the local linear model weights; the horizontal bar under each feature name represents the feature's signed linear projection weight (LPW) on the considered axis, highlighting the importance of the feature in the local explanation. For visual clarity, only the 5 features with the greatest LPW magnitudes are depicted for each local explanation axis. The feature total sulfur dioxide has been selected by the user (mark B). When selecting a feature in the right panel, thick indicators appear on both axes of all local explanations, with lengths proportional to the LPW magnitudes of the corresponding feature on all axes; mark C shows two such indicators. This makes it possible to grasp the influence of an HD feature in the various regions of the entire embedding.

Figure 2: Importance of 3 features in the local explanations of an embedding.

Each view in Fig. 2 shows the importance of a particular feature in the embedding, with the respective feature indicated at the bottom of the panel. The left view highlights that free sulfur dioxide is particularly important when explaining the top portion of the embedding along a vertical direction, whereas the horizontal direction can be partly explained by the concentration of citric acid. We observe that the structures apparent in the bottom-left part of the embedding are not very dependent on the three analyzed features.

5. Conclusion

This work proposes a globally local and fast explanation framework that provides multiple local linear explanations for 2-D data embeddings, enabling the user to assess, at a glance, the importance of different HD features, both locally and across the whole LD embedding. An example use-case demonstrates that the method can effectively reveal zones in the embedding where points are organized according to specific HD features. Finally, some accompanying software is provided (https://github.com/PierreLambert3/glocally_explained), targeting both DR researchers and experts seeking to analyse their data with nonlinear dimensionality reduction visualization tools.

Further works will include testing our framework with actual end-users in the context of a real use case; their feedback will enable the improvement of the various design choices of our interface. In addition, a qualitative comparison with other explainability methods such as LIME will enable a more comprehensive evaluation of the proposed method.

Acknowledgments

This work was supported by Service Public de Wallonie Recherche under grant n° 2010235-ARIAC by DIGITALWALLONIA4.AI. SC is supported by a FRIA grant (F.R.S.-FNRS).

References

[1] J. A. Lee, M. Verleysen, Nonlinear dimensionality reduction, Springer Science & Business Media, 2007.
[2] J. Venna, J. Peltonen, K. Nybo, H. Aidos, S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, Journal of Machine Learning Research 11 (2010).
[3] I. Borg, P. J. F. Groenen, Modern Multidimensional Scaling: Theory and applications, Springer Science & Business Media, 2005.
[4] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (1990) 1464–1480. DOI: 10.1109/5.58325.
[5] G. Hinton, S. Roweis, Stochastic neighbor embedding, in: NIPS, volume 15, 2002, pp. 833–840.
[6] I. T. Jolliffe, Principal component analysis and factor analysis, in: Principal component analysis, Springer, 1986, pp. 115–128.
[7] J. W. Sammon, A nonlinear mapping for data structure analysis 100 (1969) 401–409.
[8] J. B. Tenenbaum, V. De Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323. DOI: 10.1126/science.290.5500.2319.
[9] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[10] J. Suykens, Data visualization and dimensionality reduction using kernel maps with a reference point, IEEE Trans. Neural Netw. 19 (2008) 1501–1517.
[11] D. Francois, V. Wertz, M. Verleysen, The concentration of fractional distances 19 (2007) 873–886.
[12] J. A. Lee, M. Verleysen, Quality assessment of dimensionality reduction: Rank-based criteria, Neurocomputing 72 (2009) 1431–1443.
[13] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998) 1299–1319.
[14] J. A. Lee, M. Verleysen, Shift-invariant similarities circumvent distance concentration in stochastic neighbor embedding and variants, Procedia Computer Science 4 (2011) 538–547.
[15] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).
[16] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
[17] J. A. Lee, D. H. Peluffo-Ordóñez, M. Verleysen, Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure, Neurocomputing 169 (2015) 246–261.
[18] C. de Bodt, D. Mulders, M. Verleysen, J. A. Lee, Perplexity-free t-SNE and twice Student tt-SNE, in: ESANN, 2018, pp. 123–128.
[19] C. de Bodt, D. Mulders, M. Verleysen, J. A. Lee, Fast multiscale neighbor embedding, IEEE Transactions on Neural Networks and Learning Systems (2020).
[20] M. Wattenberg, F. Viégas, I. Johnson, How to use t-SNE effectively, Distill 1 (2016) e2.
[21] A. Bibal, V. M. Vu, G. Nanfack, B. Frénay, Explaining t-SNE embeddings locally by adapting LIME, in: ESANN, 2020, pp. 393–398.
[22] A. Bibal, A. Clarinval, B. Dumas, B. Frénay, IXVC: An interactive pipeline for explaining visual clusters in dimensionality reduction visualizations with decision trees, Array 11 (2021) 100080.
[23] A. Bibal, R. Marion, R. von Sachs, B. Frénay, BIOT: Explaining multidimensional nonlinear MDS embeddings using the best interpretable orthogonal transformation, Neurocomputing 453 (2021) 109–118.
[24] L. Pagliosa, P. Pagliosa, L. G. Nonato, Understanding attribute variability in multidimensional projections, in: 2016 29th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), IEEE, 2016, pp. 297–304.
[25] D. B. Coimbra, R. M. Martins, T. T. Neves, A. C. Telea, F. V. Paulovich, Explaining three-dimensional dimensionality reduction plots, Information Visualization 15 (2016) 154–172.
[26] M. Cavallo, Ç. Demiralp, A visual interaction framework for dimensionality reduction based data exploration, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–13.
[27] X. Yuan, D. Ren, Z. Wang, C. Guo, Dimension projection matrix/tree: Interactive subspace visual exploration and analysis of high dimensional data, IEEE Transactions on Visualization and Computer Graphics 19 (2013) 2625–2633.
[28] T. Fujiwara, O.-H. Kwon, K.-L. Ma, Supporting analysis of dimensionality reduction results with contrastive learning, IEEE Transactions on Visualization and Computer Graphics 26 (2019) 45–55.
[29] W. E. Marcilio-Jr, D. M. Eler, Explaining dimensionality reduction results using Shapley values, Expert Systems with Applications 178 (2021) 115020.
[30] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[31] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
[32] R. Marion, A. Bibal, B. Frénay, BIR: A method for selecting the best interpretable multidimensional scaling rotation using external variables, Neurocomputing 342 (2019) 83–96.
[33] B. Kang, D. García García, J. Lijffijt, R. Santos-Rodríguez, T. De Bie, Conditional t-SNE: more informative t-SNE embeddings, Machine Learning 110 (2021) 2905–2940.
[34] P. Lambert, C. de Bodt, M. Verleysen, J. A. Lee, SQuadMDS: A lean stochastic quartet MDS improving global structure preservation in neighbor embedding like t-SNE and UMAP, Neurocomputing 503 (2022) 17–27.
[35] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.

A. Clustered BIOT

As mentioned in Section 3.3, the main method proposed in this paper can be fine-tuned with a method we call Clustered BIOT. Let z_iℓ = 1 if instance i is in region ℓ and 0 otherwise. The matrix Z containing all elements z_iℓ respects the general conventions of hard clustering (each instance belongs to exactly one cluster and each cluster contains at least one instance). Then, the objective function for Clustered BIOT is

J_1(Z, {W^(ℓ), w_0^(ℓ), R^(ℓ)}_{ℓ=1}^{L}) = (1 / (2n)) ∑_{i=1}^{n} ∑_{ℓ=1}^{L} z_{iℓ} ∑_{k=1}^{2} (y_i^⊤ r_k^(ℓ) − w_{0k}^(ℓ) − x_i^⊤ w_k^(ℓ))^2 + λ ∑_{ℓ=1}^{L} ∑_{k=1}^{2} ‖w_k^(ℓ)‖_1,    (2)

which is minimized w.r.t. Z and {W^(ℓ), w_0^(ℓ), R^(ℓ)}_{ℓ=1}^{L} under the constraints that (i) R^(ℓ) is an orthogonal matrix ∀ℓ and (ii) Z respects the clustering conventions above.

For fixed Z, the solution for {W^(ℓ), w_0^(ℓ), R^(ℓ)}_{ℓ=1}^{L} can be found by training BIOT on each subset of instances S_ℓ, where S_ℓ := {i | z_iℓ = 1}. For fixed W^(ℓ), w_0^(ℓ) and R^(ℓ) and a given instance i, the solution for z_i is the vector that minimizes

∑_{ℓ=1}^{L} z_{iℓ} ∑_{k=1}^{2} (y_i^⊤ r_k^(ℓ) − w_{0k}^(ℓ) − x_i^⊤ w_k^(ℓ))^2.    (3)

Since only one element of z_i can be equal to one (instance i can belong to only one cluster), the optimal cluster for instance i is whichever model ℓ minimizes the prediction error:

arg min_ℓ ∑_{k=1}^{2} (y_i^⊤ r_k^(ℓ) − w_{0k}^(ℓ) − x_i^⊤ w_k^(ℓ))^2.    (4)

Thus, Clustered BIOT can be optimized by alternating between clustering instances according to prediction error and fitting BIOT models to the clusters. An instance i is assigned to cluster ℓ if BIOT model ℓ has the lowest prediction error for that instance compared to the other models.
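The alternating scheme above can be sketched as follows, again fixing the rotations R^(ℓ) to the identity and letting Lasso stand in for the full BIOT fit. This is only an illustrative sketch under those assumptions, not the released implementation: clusters that become empty during reassignment are simply dropped here, whereas the hard-clustering convention on Z would require handling them explicitly, and all names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def clustered_fit(X, Y, z, L, lam=0.05):
    """Fit one Lasso per non-empty cluster and per LD dimension (R fixed to identity)."""
    return {l: [Lasso(alpha=lam).fit(X[z == l], Y[z == l, k])
                for k in range(Y.shape[1])]
            for l in range(L) if np.any(z == l)}

def clustered_biot(X, Y, z0, L, n_iter=5, lam=0.05):
    """Alternate model fitting (Eq. 2 for fixed Z) and reassignment (Eq. 4)."""
    z = z0.copy()
    for _ in range(n_iter):
        models = clustered_fit(X, Y, z, L, lam)
        # Per-instance squared prediction error under each cluster's model
        err = np.column_stack([
            sum((models[l][k].predict(X) - Y[:, k]) ** 2
                for k in range(Y.shape[1]))
            for l in sorted(models)
        ])
        new_z = np.array(sorted(models))[err.argmin(axis=1)]  # Eq. (4)
        if np.array_equal(new_z, z):
            break  # assignments are stable: converged
        z = new_z
    return z, models

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
# Piecewise-linear toy map: no single global linear model fits both halves
Y = np.c_[np.where(X[:, 0] > 0, X[:, 1], -X[:, 1]), X[:, 2]]
z0 = (X[:, 0] > 0).astype(int)  # initial segmentation (e.g., from K-means)
z, models = clustered_biot(X, Y, z0, L=2)
```

Each iteration refits the per-cluster models and then moves every instance to the cluster whose model predicts its embedding position best, mirroring the alternating optimization described above.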