Detecting out-of-distribution text using topological features of transformer-based language models

Andres Pollano 1, Anupam Chaudhuri 2,*, and Anj Simmons 3
1 University of Melbourne, Melbourne, Australia
2 Deakin University, Geelong, Australia
3 Hashtag AI, Melbourne, Australia

The IJCAI-2024 AISafety Workshop
* Corresponding author.
apollano@student.unimelb.edu.au (A. Pollano); anupam.chaudhuri@deakin.edu.au (A. Chaudhuri); anj@simmons.ai (A. Simmons)
Β© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
To safeguard machine learning systems that operate on textual data against out-of-distribution (OOD) inputs that could cause unpredictable behaviour, we explore the use of topological features of self-attention maps from transformer-based language models to detect when input text is out of distribution. Self-attention forms the core of transformer-based language models, dynamically assigning vectors to words based on context, so in theory our methodology is applicable to any transformer-based language model with multi-head self-attention. We evaluate our approach on BERT and compare it to a traditional OOD approach using CLS embeddings. Our results show that our approach outperforms CLS embeddings in distinguishing in-distribution samples from far-out-of-domain samples, but struggles with near or same-domain datasets.

Keywords
Large language model, Topological data analysis, Out-of-distribution detection

1. Introduction

Machine learning (ML) models perform well on the datasets they have been trained on, but can behave unreliably when tested on data that is out-of-distribution (OOD). For example, when an ML model that has been trained to recognise different breeds of cats is fed an image of a dog, the results are unpredictable. OOD detection is the task of identifying that an input does not seem to be drawn from the same distribution as the training data, and thus that the prediction given by the ML model should not be trusted. OOD detectors can be used to defend ML models deployed in high-stakes applications from OOD data by providing a warning/error message for OOD inputs rather than processing the input and producing untrustworthy results [1].

In this paper, we focus on OOD detection for textual inputs to safeguard ML models that perform natural language processing (NLP) tasks. For example, a sentiment classification model trained on formal restaurant reviews may not produce valid results when applied to informal posts from social media. Determining that an input is OOD requires a way to measure the distance between an input and the in-distribution data. This in turn requires a method to convert textual data into an embedding space in which we can measure distance. One approach is to input the text to a transformer-based language model, such as BERT [2], and extract an embedding vector for the input text (e.g., the hidden representation of the special [CLS] token). We can then measure the distance of the embedding vector for an input text to the nearest (or k-nearest) embedding vector of a text from an in-distribution validation set. When this distance is beyond some threshold (which needs to be calibrated for the application), the input text is flagged as out of distribution.

The internal state of transformer-based language models contains important information, which may offer richer representations than only the embedding obtained from the last or penultimate layer. For example, Azaria and Mitchell [3] demonstrated that it is possible to train a classifier on the activation values of the hidden layers of large language models to predict when they are generating false information rather than true information. However, training a classifier in this manner is not a suitable approach for OOD detection, as the distribution of the OOD data that will be encountered is not knowable in advance. That is, due to the nature of OOD detection, we need to extract an embedding vector and an associated distance metric (calibrated solely on the training/validation data) without training a further classifier over this space.

Recently, Kushnareva et al. [4] proposed an approach that analyzes the topology of attention maps of transformer-based language models to determine when text has been artificially generated, and Perez and Reinauer [5] propose using the topology of attention maps of transformer-based language models to detect adversarial textual attacks. Specifically, topological data analysis (TDA) provides a way to extract high-level features (related to the topology of the attention maps for each attention head in each layer) that can serve as an embedding vector of lower dimension than the full internal model state. In this paper, we investigate the suitability of these topological embeddings for the task of OOD detection, and contrast them to traditional approaches. Related work on out-of-distribution detection in the context of transformer-based language models and on the use of Mahalanobis distance can be found in [6, 7, 8, 9]. We have made the code used to generate our results public under the MIT licence (https://github.com/andrespollano/neural_nets-tda), with the intention of aiding the application of TDA methods to transformer-based models.
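As a concrete illustration of this distance-and-threshold recipe, the sketch below flags an input as OOD when its embedding lies too far from the nearest embedding of an in-distribution validation text. It is a minimal sketch, assuming the embeddings are already available as NumPy arrays; the function name, array names, and the use of Euclidean distance are illustrative rather than a description of our released code.

    import numpy as np

    def flag_ood(embedding, id_embeddings, threshold):
        """Flag an input as OOD when the distance from its embedding to the
        nearest in-distribution (ID) validation embedding exceeds a threshold
        calibrated on ID data.

        embedding:     (d,) vector for the input text
        id_embeddings: (m, d) matrix of ID validation embeddings
        threshold:     distance threshold calibrated for the application
        """
        # Euclidean distance to every ID validation embedding
        distances = np.linalg.norm(id_embeddings - embedding, axis=1)
        # The distance to the nearest ID neighbour decides the flag
        return "out" if distances.min() > threshold else "in"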
2. Background

2.1. Topological Data Analysis

Topology studies properties of geometric objects that are invariant under continuous deformation. For instance, a donut and a coffee cup are topologically equivalent. Algebraic topology, as in Hatcher's work [10], attaches algebraic objects such as groups to topological spaces. Certain features of these algebraic objects help to quantify those topological spaces.

Persistence extends topology to finite data sets, tracing back to Frosini [11] and Robins [12]. Persistent homology groups, derived from homology groups, serve as invariants for discrete objects.

For any finite set of points, we can construct a distance matrix in which both the rows and columns are labeled by these points, and each entry represents the distance between a pair of points. We can then apply tools from Topological Data Analysis (TDA) to this set of points, allowing us to assign certain invariant characteristics to the collection.

In the context of language or text, we can think of each word as a point in some vector space, with a distance defined between words. For example, the distance might be related to semantic similarity or other linguistic relationships. By considering a text as a collection of such points, we can assign various numerical characteristics to it. These characteristics can distinguish the text from others and provide insights into its structure and content.

2.1.1. Simplicial Complex and Chain

A simplicial complex is a fundamental construct in algebraic topology, used to approximate and study more complex topological spaces. It is formed by combining simpler building blocks called simplices.

Simplices: A k-dimensional simplex, denoted Οƒ, is the convex hull of k + 1 affinely independent points. For example, a 0-simplex is a point, a 1-simplex is a line segment, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron.

Forming a Simplicial Complex: A simplicial complex K in R^d is a collection of simplices that satisfies two conditions:
1. Any face of a simplex in K is also in K.
2. The intersection of any two simplices in K is either empty or a common face of both.

Simplicial Chains: To study the algebraic properties of simplicial complexes, we introduce the concept of simplicial chains. A simplicial chain in a complex is a formal sum of simplices. For a given dimension k, the group of k-chains, denoted C_k, is the free abelian group generated by the k-dimensional simplices of the complex.

Boundary Operators: The boundary of a simplex is the sum of its faces. The boundary operator βˆ‚_k : C_k β†’ C_{k-1} maps each k-simplex to its (k βˆ’ 1)-dimensional boundary. This operator is crucial for defining the homology of the complex. For example, the boundary of a 2-simplex (triangle) Οƒ = [v_0, v_1, v_2] is the sum of its 1-dimensional faces (edges): βˆ‚_2(Οƒ) = [v_1, v_2] + [v_2, v_0] + [v_0, v_1].

Chain Complex: A chain complex is a sequence of chain groups connected by boundary operators:

    0 β†’ C_n --βˆ‚_n--> C_{n-1} --βˆ‚_{n-1}--> Β·Β·Β· --βˆ‚_2--> C_1 --βˆ‚_1--> C_0 β†’ 0.

Cycle and Boundary Groups: Z_p = ker βˆ‚_p and B_p = im βˆ‚_{p+1}, with B_p βŠ‚ Z_p.

Simplicial Homology: The k-th simplicial homology group of a complex K is H_k(K) = Z_k(K)/B_k(K), with Betti number Ξ²_k(K) = dim H_k(K).

2.1.2. Vietoris-Rips Complex

The Vietoris-Rips complex is a key construct in topological data analysis, used for forming a simplicial complex from a set of data points based on their pairwise distances.

Definition: Given a set of points X and a distance threshold Ξ΅, the Vietoris-Rips complex VR_Ξ΅(X) is defined as follows: for any subset Οƒ βŠ† X, Οƒ is a simplex in VR_Ξ΅(X) if and only if the distance between every pair of points in Οƒ is less than or equal to Ξ΅.

Formal Construction:
β€’ Vertices: Each point in X is a 0-simplex (vertex).
β€’ Edges: An edge (1-simplex) connects vertices x_i and x_j if d(x_i, x_j) ≀ Ξ΅.
β€’ Higher Simplices: A k-simplex is formed by a set of k + 1 vertices if every pair of vertices in the set is connected by an edge.
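To make the construction concrete, the sketch below computes Vietoris-Rips persistence from a pairwise distance matrix using the Giotto-tda library (which we also use later for feature extraction). The toy point cloud is invented for illustration; only the shape conventions of the library calls matter here.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from gtda.homology import VietorisRipsPersistence

    # Toy point cloud (coordinates are illustrative only)
    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

    # Distance matrix labelled by the points, as described in Section 2.1
    distance_matrix = squareform(pdist(points))

    # Vietoris-Rips filtration over the distance matrix, tracking H0 and H1
    vr = VietorisRipsPersistence(metric="precomputed", homology_dimensions=(0, 1))
    diagrams = vr.fit_transform(distance_matrix[None, :, :])

    # Each row of the diagram is a (birth, death, homology dimension) triple
    print(diagrams.shape)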
2.2. BERT Model

BERT [2] is a transformer-based language model that has been pre-trained on a large corpus of text from BooksCorpus and English Wikipedia. Input text first needs to be tokenized, in which each word is converted to one or more tokens. The first token is the special [CLS] token, followed by the tokenization of each word, with the special [SEP] token used to separate "sentences" (e.g., question and answer; these do not necessarily correspond to linguistic sentences). BERT is trained with two objectives: Masked Language Modelling (MLM), in which tokens are masked at random (replaced with the special [MASK] token) and the language model must learn to fill them in; and Next Sentence Prediction (NSP), in which the final hidden vector of the special [CLS] token is used to predict whether two sentences follow each other in the corpus.

As a transformer-based model, BERT consists of multiple layers, each with multiple attention heads. While multiple variants of BERT are available, for the purpose of this paper we use BERT_BASE, which consists of 12 layers, each with 12 attention heads (i.e., 144 attention heads in total), operating on an input matrix X of n tokens and d = 768 hidden dimensions.

2.2.1. Sentence Embeddings

The final hidden vector of the special [CLS] token can be used to embed the input sequence (which varies in length) in d hidden dimensions (768 in the case of BERT_BASE). The authors of the BERT paper [2] note that the [CLS] embedding is not a meaningful sentence representation without fine-tuning. Nevertheless, Uppaal et al. [13] claim that the practice of using it to obtain sentence embeddings "is standard for most BERT-like models", and find that in the case of RoBERTa (a BERT-like model without the NSP training objective) this embedding serves as a "near perfect" OOD detector even without fine-tuning.
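The sketch below shows one way to obtain such a [CLS] sentence embedding with the Hugging Face transformers library; the model name and the example sentence (taken from Figure 1) are illustrative, and the exact extraction code in our repository may differ.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("President issues vows as tensions with China rise",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Final-layer hidden state of the first token ([CLS]): a 768-dimensional vector
    cls_embedding = outputs.last_hidden_state[0, 0]
    print(cls_embedding.shape)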
2.2.2. Attention Maps

Each attention head computes an attention map, W^attn, of shape n Γ— n as an intermediate step of its calculation. We use the same definition of attention maps as Kushnareva et al. [4]:

    X^out = W^attn (X W^V),    where W^attn = softmax( (X W^Q)(X W^K)^T / √d )

and W^Q, W^K, W^V are learned projection matrices of shape d Γ— d, while X^out is the output of the attention head applied to the n Γ— d matrix X from the previous layer. In this paper, we analyse the attention maps of each of the 144 attention heads in BERT_BASE using TDA.
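A sketch of how these attention maps can be read out of BERT_BASE via the transformers library; requesting output_attentions=True returns one n Γ— n map per head per layer. The variable names and the final reshaping into a stack of 144 maps are our own illustration rather than the exact code we ran.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
    model.eval()

    inputs = tokenizer("President issues vows as tensions with China rise",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.attentions is a tuple of 12 tensors, one per layer,
    # each of shape (batch, heads=12, n, n)
    stacked = torch.stack(outputs.attentions).squeeze(1)   # (12, 12, n, n)
    n = stacked.shape[-1]
    attention_maps = stacked.reshape(-1, n, n)             # (144, n, n)
    print(attention_maps.shape)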
3. Experiment design

In this section, we outline the design of our methodology for OOD detection using Topological Data Analysis. For a supervised classification task, given a test sample x, OOD detection aims to determine whether it belongs to the in-distribution (ID) dataset, x ∈ D_in, or not. Background and related literature on confidence scores for OOD detection can be found in [9, 14, 15]. We consider a d-dimensional representation h(x) ∈ R^d of an input text x. To analyse the benefits of TDA in OOD detection, we consider two encoding functions h1(x) and h2(x):

1. Topological feature vector h1(x): given x, we generate a vector of d1 topological features using the graph representations of the 144 attention maps generated by BERT_BASE. Subsections 3.3 and 3.4 explain in detail how the topological features are generated from an input sentence.

2. Sentence embedding h2(x): we take the d2-dimensional text embedding of the [CLS] token output by BERT_BASE, which captures the contextual and semantic information of the input text x.

Similar to Uppaal et al. [13], we define the OOD detection function G(x), which maps an instance x to {in, out}, as follows:

    G_Ξ»(x; h) = in   if S(x; h) β‰₯ Ξ»
                out  if S(x; h) < Ξ»

where S(x; h) is an OOD scoring function using a distance-based method (Mahalanobis distance to the ID class centroids or Euclidean distance to the k-nearest ID neighbour), described in subsection 3.5, and Ξ» is a threshold chosen so that a high proportion of ID samples' scores lie above Ξ».

3.1. Data

As the in-distribution dataset, we choose the headlines and abstract text of 'Politics' and 'Entertainment' news articles from HuffPost in the news-category dataset [16]. To test the robustness of the OOD method, we conduct experiments on three kinds of dataset distribution shift [17]:

β€’ Near Out-of-Domain shift. In this paradigm, ID and OOD samples come from different distributions (datasets) that exhibit semantic similarities. In our experiments, we evaluate the abstracts of news articles from the cnn-dailymail dataset [18].
β€’ Far Out-of-Domain shift. In this type of shift, the OOD samples come from a different domain and exhibit significant semantic differences. In particular, we evaluate the IMDB movie review dataset [19] as OOD samples.
β€’ Same-Domain shift. We also test a more challenging setting, where ID and OOD samples are drawn from the same domain but with different labels. Specifically, we extract the 'Business' news articles from the news-category dataset.

In our experiments we used a sample of 30,000 points from the in-distribution dataset for the fine-tuned version of the model, and a validation and test size of 1,000 data points.

3.2. Model

We focus on the attention heads of a pre-trained BERT_BASE (L=12, H=12) applied to an input text x to produce topological features, and compare this encoding to the embedding of the [CLS] token as the sentence representation. We replicate our experiments on a BERT_BASE fine-tuned on the ID news categorisation task X β†’ {'Politics', 'Entertainment'}. We fine-tune the model for 3 epochs, using Adam with a batch size of 32 and a learning rate of 10^-5.

3.3. Attention Maps and Attention Graphs

Attention maps play a crucial role in our methodology as they form the basis for extracting the topological features used in our OOD detection. An attention map W^attn is an n Γ— n matrix in which each entry represents the attention weight between two tokens. Each element w_ij can be interpreted as the level of 'attention' token i pays to token j in the input sequence during the encoding process. The higher the weight, the stronger the relation between the two tokens. The weights are non-negative, and the attention weights of a token sum to one (i.e., Ξ£_{j=1}^{n} w_ij = 1 for all i = 1, ..., n).

To generate topological features from an attention map, we first convert it into an attention graph following the approach of Perez and Reinauer [5]. Given an attention matrix W^attn, we create an undirected weighted graph where the vertices represent the tokens of the input text x, and the weights are determined by the attention weights in the corresponding attention map. To emphasise the important relationships and reduce noise, we calculate the distance between vertices as 1 βˆ’ max(w_ij, w_ji). This makes the relationship symmetric and ensures that strong attention results in smaller distances. To prevent the formation of self-loops, all diagonal entries of the adjacency matrix are set to 0. Figure 1 shows an example of constructing the attention graph for an attention map.

Figure 1: Process of transforming an attention map into an attention graph (one per attention head). (a) Attention maps (12 Γ— 12) derived from pre-trained BERT for the input text "President issues vows as tensions with China rise"; (b) BERT attention map (Layer 7; Head 10); (c) undirected attention graph (Layer 7; Head 10) where edges are proportional to the maximal attention between the two vertices, with edge width representing shorter distances (stronger attention).
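A minimal sketch of this conversion, assuming a single n Γ— n attention map is available as a NumPy array (the function name is ours):

    import numpy as np

    def attention_to_distance_matrix(attn_map):
        """Convert one n x n attention map into the symmetric distance matrix
        of the undirected attention graph described in subsection 3.3."""
        # Symmetrise by taking the maximal attention in either direction
        sym = np.maximum(attn_map, attn_map.T)
        # Strong attention -> small distance
        dist = 1.0 - sym
        # Set the diagonal to 0 to prevent self-loops
        np.fill_diagonal(dist, 0.0)
        return dist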
3.4. Persistent Homology

The attention graphs constructed from the attention heads contain the structure and relationships we need to extract topological features. To encode the topological information provided by an attention graph, we use a filtration process to generate a persistence diagram. Filtration in TDA is a systematic process in which a topological space is progressively constructed across varying scales, in order to analyse the emergence, persistence and disappearance of simplicial complexes such as connected components, holes, or voids.

We apply one of the most widely used filtrations to the attention graphs, the Vietoris-Rips filtration. This process starts with only the vertices of the graph, considering them as zero-dimensional simplices. It then adds edges one by one, depending on their weights (i.e., distances). Edges with shorter distances below a threshold are added first, gradually connecting the vertices as the threshold increases until a complete graph is formed. As edges are added, the filtration process captures the graph's properties and the relationships between its vertices [20]. This process is visualised in Figure 2.

Figure 2: Filtration process for the attention graph (Layer 7; Head 10), where edges with shorter distances below a threshold are added first, gradually connecting the nodes until a complete graph is formed.

To construct a persistence diagram, we keep track of the lifetime of persistence features as the threshold is increased. One can think of 0-dimensional persistent features as connected components, 1-dimensional features as holes, 2-dimensional features as voids (2-dimensional holes), and so on. The birth and death times of a persistence feature are the threshold values at which the feature appeared and disappeared. For example, when the threshold is 0 all 0-dimensional features are born (vertices), and when two vertices i and j are connected at threshold w_ij, one 0-dimensional feature disappears. Similarly, a 1-dimensional feature (hole) appears at the threshold where 3 vertices connect to each other, and disappears when a fourth vertex forms a 2-dimensional simplex (void). The birth and death of all k-dimensional features are recorded in a persistence diagram. An example persistence diagram is shown in Figure 3a.

From the persistence diagrams, we extract various topological features to represent the underlying graph's structure. In our experiments, we focus on the following topological features:

1. Persistence Entropy: This feature quantifies the complexity of the persistence diagram, calculated as the Shannon entropy of the persistence values (birth and death), with higher entropy indicating a more complex topology.

2. Amplitude: We compute amplitude using two different distance measures, 'bottleneck' and 'Wasserstein'. The amplitude measures the maximum persistence value within the diagram, providing insight into the significance of the topological features.

We consider different homology dimensions to capture topological features of varying complexity; in our experiments, we use homology dimensions [0, 1, 2, 3] to account for different aspects of the attention graph's topology. We use the Giotto-tda library to generate the persistence diagrams and extract the topological features, as per Figure 3b. Both persistence entropy and amplitude features are used in the experiment by concatenating all features into a single feature vector.

Figure 3: Example persistence diagram and extracted topological features. (a) Persistence diagram generated from the filtration process for the attention map in Layer 7, Head 10; the set of H0 points (red) represents the birth and death of connected components, and the set of H1 points (teal) represents the birth and death of holes. (b) Topological features extracted from the persistence diagram: persistence entropy, and amplitude with 'bottleneck' and 'Wasserstein' distances for homology dimensions 0, 1, 2 and 3. (In the case of NaN values, e.g. due to no higher-dimensional simplices, we set the persistence entropy feature to -1, as per the default behaviour of Giotto-tda.)
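A sketch of this feature-extraction step with Giotto-tda, assuming the 144 attention-graph distance matrices of one input text are stacked into a single array; the exact feature ordering and batching in our released code may differ.

    import numpy as np
    from gtda.homology import VietorisRipsPersistence
    from gtda.diagrams import Amplitude, PersistenceEntropy

    def topological_features(distance_matrices):
        """distance_matrices: array of shape (144, n, n), one attention-graph
        distance matrix per attention head for a single input text."""
        vr = VietorisRipsPersistence(metric="precomputed",
                                     homology_dimensions=(0, 1, 2, 3))
        diagrams = vr.fit_transform(distance_matrices)

        # Persistence entropy per homology dimension (NaNs become -1 by default)
        entropy = PersistenceEntropy(nan_fill_value=-1.0).fit_transform(diagrams)

        # Amplitude per homology dimension, for both distance measures
        amplitudes = [Amplitude(metric=m, order=None).fit_transform(diagrams)
                      for m in ("bottleneck", "wasserstein")]

        # Concatenate per-head features, then flatten into h1(x)
        per_head = np.concatenate([entropy] + amplitudes, axis=1)
        return per_head.reshape(-1)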
3.5. OOD Scoring Function

Similar to Perez and Reinauer [5], given h(x), a d-dimensional representation of an input text x, we employ two distance-based methods as OOD scoring functions:

1. Mahalanobis distance to the ID class centroids: the Mahalanobis distance is used to measure the distance between the feature vector h(x) and the class centroids. This distance is based on the covariance matrix of the class features, under the assumption that the data in each class follows a multivariate Gaussian distribution. The OOD score is calculated as follows:

    S_Maha(x; h, Ξ£, ΞΌ) = min_{c ∈ Y} (z_x βˆ’ ΞΌ_c)^T Ξ£^{-1} (z_x βˆ’ ΞΌ_c)

where z_x is the standardised feature vector for the input h(x), Ξ£ is the covariance matrix of the standardised ID feature vectors, and ΞΌ is the set of class-mean standardised embeddings. Both Ξ£ and ΞΌ_c are estimated from the ID validation set embeddings to account for the inherent distribution of the ID data. The covariance matrix Ξ£ captures how the features vary with respect to one another, and ΞΌ_c represents the centroid, or average representation, of the data belonging to class c.

2. Euclidean distance to the k-nearest ID neighbour: we measure the distance between h(x) and the feature vector of its k-nearest ID neighbour from the validation set. Given h(x) and a set of m ID feature vectors {h(x_1), h(x_2), ..., h(x_m)}, the Euclidean distance to the k-nearest ID neighbour is calculated as follows:

    S_KNN(x; h) = ||z_x βˆ’ z_{x_k}||_2

where z_x and z_{x_k} are the standardised feature vectors for the input h(x) and its k-nearest ID sample h(x_k). In our experiments, we set k = 5.
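A sketch of the two distance computations, assuming standardised feature vectors are available as NumPy arrays; the function and variable names are ours, and the resulting scores are thresholded against Ξ» as in the decision function G_Ξ» of Section 3.

    import numpy as np

    def mahalanobis_score(z_x, class_means, cov_inv):
        """Minimum Mahalanobis score to the ID class centroids, as defined in
        subsection 3.5 (class_means and cov_inv come from the ID validation set)."""
        return min(float((z_x - mu) @ cov_inv @ (z_x - mu)) for mu in class_means)

    def knn_score(z_x, z_id, k=5):
        """Euclidean distance from z_x to its k-th nearest standardised ID
        validation vector (k = 5 in our experiments)."""
        dists = np.sort(np.linalg.norm(z_id - z_x, axis=1))
        return float(dists[k - 1])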
4. Results

We conduct our experiments using Topological Data Analysis to generate topological feature vectors h1(x) from attention maps, which are then compared to standard sentence embeddings h2(x) generated from the [CLS] token of BERT. Table 1 shows the OOD detection performance of both approaches on the three out-of-distribution datasets, using both pre-trained and fine-tuned BERT models.

    Dataset                    Method   Pre-trained KNN     Pre-trained MAHA    Fine-tuned KNN      Fine-tuned MAHA
                                        AUROC↑  FPR95↓      AUROC↑  FPR95↓      AUROC↑  FPR95↓      AUROC↑  FPR95↓
    IMDB                       TDA      0.940   0.090       0.940   0.112       0.958   0.084       0.950   0.124
                               CLS      0.680   0.875       0.799   0.704       0.771   0.916       0.814   0.852
    CNN/Dailymail              TDA      0.572   0.890       0.563   0.908       0.551   0.909       0.521   0.927
                               CLS      0.875   0.591       0.897   0.445       0.947   0.215       0.949   0.208
    News-Category (Business)   TDA      0.527   0.929       0.543   0.921       0.570   0.923       0.568   0.925
                               CLS      0.580   0.921       0.638   0.878       0.884   0.431       0.885   0.424

Table 1: Comparison of the performance of our scoring functions on all three out-of-distribution datasets using both pre-trained and fine-tuned models (AUROC: higher is better; FPR95: lower is better).

For visualisation purposes, we use UMAP projections of the in-distribution (validation and test sets) and out-of-distribution data points in the corresponding feature space. Figure 4, Figure 5, and Figure 6 show the data representations from the TDA and CLS approaches for the far out-of-domain dataset (IMDB), the near out-of-domain dataset (CNN/Dailymail), and the same-domain dataset (business news-category), respectively.

Figure 4: The data representations from the TDA and CLS approaches for the far out-of-domain IMDB dataset (rows: TDA, CLS; columns: pre-trained, fine-tuned models).

Figure 5: The data representations from the TDA and CLS approaches for the near out-of-domain CNN/Dailymail dataset (rows: TDA, CLS; columns: pre-trained, fine-tuned models).

The results demonstrate that the TDA-based approach consistently outperforms the CLS embeddings in detecting OOD samples from the IMDB dataset, for both the pre-trained and fine-tuned models. OOD detection using TDA detects IMDB review samples with 8-9% FPR95, in stark contrast to the 87-91% FPR95 exhibited by CLS embeddings. As seen in Figure 4, the TDA feature vectors project the data into well-separated and compact clusters, which explains the superior performance.

The TDA approach was less effective than the CLS approach at detecting OOD samples from the near out-of-domain CNN/Dailymail dataset. Even though the data visualisation in Figure 5 shows that TDA was able to cluster OOD samples together, the cluster was not distant enough from the ID samples, rendering both distance-based OOD detection methods less effective.

For the same-domain dataset (business news-category), both approaches struggled to detect OOD samples. As seen in Figure 6, when both ID and OOD data are from the same domain, their feature vectors overlap heavily, although fine-tuning appears to provide stronger separability between ID and OOD data for the CLS approach.
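For reference, a sketch of how the two reported metrics can be computed with scikit-learn, assuming per-sample scores where higher means "more in-distribution" (the convention of G_Ξ»). The exact FPR95 convention used by our evaluation script is an assumption here: the variant below is the fraction of OOD samples still accepted at the threshold that retains 95% of ID samples.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evaluate_ood(scores_id, scores_ood):
        """AUROC and FPR95 for ID/OOD scores (higher = more in-distribution)."""
        y_true = np.concatenate([np.ones(len(scores_id)), np.zeros(len(scores_ood))])
        y_score = np.concatenate([scores_id, scores_ood])
        auroc = roc_auc_score(y_true, y_score)
        # Threshold that keeps 95% of ID scores above it (assumed convention)
        threshold = np.percentile(scores_id, 5)
        fpr95 = float(np.mean(np.asarray(scores_ood) >= threshold))
        return auroc, fpr95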
Figure 6: The data representations from the TDA and CLS approaches for the same-domain News-Category (Business) dataset (rows: TDA, CLS; columns: pre-trained, fine-tuned models).

5. Discussion

From our experiments, we showed that the TDA approach outperforms the CLS approach at detecting far out-of-domain OOD samples, like those in the IMDB dataset. Yet its effectiveness deteriorates with near out-of-domain (CNN/Dailymail) or same-domain (business news-category) datasets. To understand why, we looked at the samples that each approach thrived and struggled with, and we highlight three observations:

(1) The TDA approach accentuates features associated with textual flow or grammatical structure rather than lexical semantics, consistent with the findings of Deng and Duzhin [21] and Kushnareva et al. [4]. For example, TDA was adept at identifying OOD samples that are structurally unique in the IMDB dataset, as the most confident OOD samples detected were:

β€’ 'OK...i have seen just about everything....and some are considered classics that shouldn't be ( like all those Halloween movies that suck crap or even Steven king junk).......and some are considered just OK that are really great.....( like carnival of souls )........and then some are just plain ignored............like ( evil ed ) [. . . ]'
β€’ 'Time line of the film: * Laugh * Laugh * Laugh * Smirk * Smirk * Yawn * Look at watch * walk out * remember funny parts at the beginning * smirk <br /> [. . . ]'

In contrast, TDA struggled to detect CNN/Dailymail OOD samples, as they have sentence structures and lengths similar to the ID samples, even if they are semantically unrelated. Table 2 shows the samples with the least confident OOD scores from the CNN/Dailymail dataset and their nearest ID neighbours.

    CNN/Dailymail sample: "Footage showed an unusual 'apocalyptic' dust storm hitting Belarus. China has suffered four massive sandstorms since the start of the year. Half of dust in atmosphere today is due to human activity, said Nasa."
    Nearest ID neighbour: "Trump's Proposed Cuts To Foreign Food Aid Are Proving Unpopular. The president might see zeroed-out funding for foreign food aid as 'putting America first,' but members of Congress clearly disagree."

    CNN/Dailymail sample: "Video posted by YouTube user Richard Stewart showing a Porsche Cayman flying out of control. Police cited unidentified driver for the crash. Car reportedly wrecked and needed to be towed from the scene."
    Nearest ID neighbour: "Trump Signs Larry Nassar-Inspired Sexual Assault Bill Behind Closed Doors. The president quietly signed the bill the week after two White House staffers resigned amid allegations of domestic violence."

Table 2: Least confident OOD samples from the CNN/Dailymail dataset and their nearest ID neighbours, from the TDA approach using the pre-trained BERT model.

(2) CLS embeddings are sensitive to the semantic and contextual meaning of the samples, regardless of sentence structure. This explains why this approach struggled with OOD detection on IMDB reviews, as it often classified IMDB movie reviews as in-distribution due to their semantic similarity with the entertainment news articles in the ID dataset, especially those related to movies. A closer look at the IMDB samples with the smallest OOD scores from the CLS embeddings in Table 3 exemplifies this insight: the nearest neighbours identified are ID samples on similar topics, even though they are clearly from different domains.

(3) Fine-tuning improved the performance of CLS embeddings for near or same-domain shifts, but shows no significant benefit for TDA. Fine-tuning induces a model to divide a single domain cluster into class clusters, as highlighted by Uppaal et al. [13]. For the CNN/Dailymail and Business news OOD datasets, this is beneficial for the CLS approach as it learns to better distinguish topics. However, fine-tuning made the CLS embeddings of IMDB movie reviews appear even more similar to entertainment news, deteriorating OOD performance. For the TDA approach, fine-tuning did not present any considerable benefit. This can be partly attributed to observation (1): TDA primarily captures structural differences, and fine-tuning, which is driven by semantics, does not significantly alter the topological representation.
    IMDB review sample: "[...] I would spend good, hard-earned cash money to see it again on DVD. And as long as we're requesting Smart Series That Never Got a Chance...How about DVD releases of Maximum Bob (another well written, odd duck show with a delightful cast of characters.) [...]"
    Nearest ID neighbour: "DVDs: Great Blimp, Badlands, Buster Keaton & More. Let's catch up with some reissues of classic – and not so classic – movies, with a few documentaries tossed in at the end for good measure."

    IMDB review sample: "[...] I am generally not a fan of Zeta-Jones but even I must admit that Kate is STUNNING in this movie. [...]"
    Nearest ID neighbour: "How 'Erin Brockovich' Became One Of The Most Rewatchable Movies Ever Made. Julia Roberts gives the best performance of her career, aided by a sassy Susannah Grant script full of one-liners."

Table 3: Least confident OOD samples from the IMDB dataset and their nearest ID neighbours, from the CLS approach using the pre-trained BERT model.

6. Conclusion

In this paper, we explore the capabilities of Topological Data Analysis for identifying out-of-distribution samples by leveraging the attention maps derived from BERT, a transformer-based large language model. Our results demonstrate the potential of TDA as an effective tool to capture the structural information of textual data.

Nevertheless, our experiments also highlighted the intrinsic limitations of TDA-based methods. Predominantly, our TDA method captured the inter-word relations derived from the attention maps, but failed to account for the actual lexical meaning of the text. This distinction suggests that while TDA offers valuable insights into textual structure, a lexical and more holistic understanding of textual data is needed for OOD detection, especially under near or same-domain shifts.

For future work, it may be worth combining the topological features that capture the structural information of textual data with features that encode the semantics of the text in an ensemble model, which might boost the ability to detect OOD samples. In addition, there is an opportunity to investigate the effectiveness of TDA in other NLP tasks where textual structure is important.

Acknowledgments

The research was supported by a National Intelligence Postdoctoral Grant (NIPG-2021-006).
Pi- demonstrate the potential of TDA as an effective tool to antanida, Beyond mahalanobis-based scores for tex- capture the structural information of textual data. tual ood detection, arXiv preprint arXiv:2211.13527 Nevertheless, our experiments also highlighted the intrin- (2022). sic limitations of TDA-based methods. Predominantly, our [8] X. Li, J. Li, X. Sun, C. Fan, T. Zhang, F. Wu, Y. Meng, TDA method captured the inter-word relations derived from J. Zhang, π‘˜ folden: π‘˜-fold ensemble for out-of- the attention maps, but failed to account for the actual lexi- distribution detection, arXiv preprint arXiv:2108.12731 cal meaning of the text. This distinction suggests that while (2021). TDA offers valuable insights into textual structure, a lexical [9] K. Lee, K. Lee, H. Lee, J. Shin, A simple unified frame- and more holistic understanding of textual data is needed work for detecting out-of-distribution samples and for OOD detection, especially with near or same-domain adversarial attacks, Advances in neural information shifts. processing systems 31 (2018). For future work, it might be worth combining the topo- [10] A. Hatcher, Algebraic Topology, Cambridge University logical features that capture the structural information of Press, 2002. URL: https://pi.math.cornell.edu/~hatcher/ textual data, with those that encode the semantics of text AT/ATpage.html. in an ensemble model that might boost our ability to detect [11] P. Frosini, Measuring shapes by size functions, in: OOD samples. In addition, there is an opportunity to inves- Intelligent Robots and Computer Vision X: Algorithms tigate the effectiveness of TDA in other NLP tasks where and Techniques, volume 1607, SPIE, 1992, pp. 122–133. the textual structure might be important. URL: https://doi.org/10.1117/12.57059. doi:10.1117/ 12.57059. Acknowledgments [12] V. Robins, Towards computing homology from finite approximations, in: Topology proceedings, volume 24, The research was supported by a National Intelligence Post- 1999, pp. 503–532. doctoral Grant (NIPG-2021-006). [13] R. Uppaal, J. Hu, Y. Li, Is fine-tuning needed? pre- trained language models are near perfect for out-of- domain detection, in: Proceedings of the 61st Annual References Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), 2023, pp. 12813– [1] S. Wong, S. Barnett, J. Rivera-Villicana, A. Simmons, 12832. URL: https://aclanthology.org/2023.acl-long. H. Abdelkader, J.-G. Schneider, R. Vasa, MLGuard: 717. doi:10.18653/v1/2023.acl-long.717. Defend your machine learning model!, in: Proceedings [14] Y. Sun, Y. Ming, X. Zhu, Y. Li, Out-of-distribution of the 1st International Workshop on Dependability detection with deep nearest neighbors, in: Interna- and Trustworthiness of Safety-Critical Systems with tional Conference on Machine Learning, PMLR, 2022, Machine Learned Components, SE4SafeML 2023, 2023, pp. 20827–20840. p. 10–13. doi:10.1145/3617574.3617859. [15] J. Yang, K. Zhou, Y. Li, Z. Liu, Generalized out- of-distribution detection: A survey, arXiv preprint arXiv:2110.11334 (2021). [16] R. Misra, News category dataset (2022). URL: https: //arxiv.org/abs/2209.11429. [17] U. Arora, W. Huang, H. He, Types of out-of- distribution texts and how to detect them, in: Proceed- ings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 10687–10701. URL: https://aclanthology.org/2021.emnlp-main.835. doi:10.18653/v1/2021.emnlp-main.835. [18] A. See, P. J. Liu, C. D. 
Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083. URL: https://www. aclweb.org/anthology/P17-1099. doi:10.18653/v1/ P17-1099. [19] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 142–150. URL: http: //www.aclweb.org/anthology/P11-1015. [20] U. Bauer, Ripser: efficient computation of vietoris–rips persistence barcodes, Journal of Applied and Com- putational Topology 5 (2021) 391–423. URL: https: //doi.org/10.1007/s41468-021-00071-5. doi:10.1007/ s41468-021-00071-5. [21] R. Deng, F. Duzhin, Topological data analysis helps to improve accuracy of deep learning models for fake news detection trained on very small training sets, Big Data Cogn. Comput. 6 (2022) 74. URL: https://doi.org/ 10.3390/bdcc6030074. doi:10.3390/bdcc6030074.