Detecting out-of-distribution text using topological features of transformer-based language models

Andres Pollano 1, Anupam Chaudhuri 2,*, and Anj Simmons 3
1 University of Melbourne, Melbourne, Australia
2 Deakin University, Geelong, Australia
3 Hashtag AI, Melbourne, Australia

The IJCAI-2024 AISafety Workshop
* Corresponding author.
apollano@student.unimelb.edu.au (A. Pollano); anupam.chaudhuri@deakin.edu.au (A. Chaudhuri); anj@simmons.ai (A. Simmons)
Β© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
To safeguard machine learning systems that operate on textual data against out-of-distribution (OOD) inputs that could cause unpredictable behaviour, we explore the use of topological features of self-attention maps from transformer-based language models to detect when input text is out of distribution. Self-attention forms the core of transformer-based language models, dynamically assigning vectors to words based on context, so in theory our methodology is applicable to any transformer-based language model with multi-head self-attention. We evaluate our approach on BERT and compare it to a traditional OOD approach using CLS embeddings. Our results show that our approach outperforms CLS embeddings in distinguishing in-distribution samples from far-out-of-domain samples, but struggles with near or same-domain datasets.

Keywords
Large language model, Topological data analysis, Out-of-distribution detection

1. Introduction

Machine learning (ML) models perform well on the datasets they have been trained on, but can behave unreliably when tested on data that is out-of-distribution (OOD). For example, when an ML model that has been trained to recognise different breeds of cats is fed an image of a dog, the results are unpredictable. OOD detection is the task of identifying that an input does not seem to be drawn from the same distribution as the training data, and thus that the prediction given by the ML model should not be trusted. OOD detectors can be used to defend ML models deployed in high-stakes applications from OOD data by providing a warning/error message for OOD inputs rather than processing the input and producing untrustworthy results [1].

In this paper, we focus on OOD detection for textual inputs to safeguard ML models that perform natural language processing (NLP) tasks. For example, a sentiment classification model trained on formal restaurant reviews may not produce valid results when applied to informal posts from social media. Determining that an input is OOD requires a way to measure the distance between an input and the in-distribution data. This in turn requires a method to convert textual data into an embedding space in which we can measure distance. One approach is to input the text to a transformer-based language model, such as BERT [2], and extract an embedding vector for the input text (e.g., the hidden representation of the special [CLS] token). We can then measure the distance of the embedding vector for an input text to the nearest (or k-nearest) embedding vector of a text from an in-distribution validation set. When this distance is beyond some threshold (which needs to be calibrated for the application), the input text is flagged as out of distribution.

The internal state of transformer-based language models contains important information, which may offer richer representations than only the embedding obtained from the last or penultimate layer. For example, Azaria and Mitchell [3] demonstrated that it is possible to train a classifier on the activation values of the hidden layers of large language models to predict when they are generating false information rather than true information. However, training a classifier in this manner is not a suitable approach for OOD detection, as the distribution of the OOD data that will be encountered is not knowable in advance. That is, due to the nature of OOD detection, we need to extract an embedding vector and an associated distance metric (calibrated solely on the training/validation data) without training a further classifier over this space.

Recently, Kushnareva et al. [4] proposed an approach that analyzes the topology of attention maps of transformer-based language models to determine when text has been artificially generated, and Perez and Reinauer [5] propose using the topology of attention maps of transformer-based language models to detect adversarial textual attacks. Specifically, topological data analysis (TDA) provides a way to extract high-level features (related to the topology of the attention maps for each attention head in each layer) that can serve as an embedding vector of lower dimension than the full internal model state. In this paper, we investigate the suitability of these topological embeddings for the task of OOD detection, and contrast them to traditional approaches. Related work on out-of-distribution detection in the context of transformer-based language models and on the use of Mahalanobis distance can be found in [6, 7, 8, 9]. We have made the code used to generate our results public under the MIT licence (https://github.com/andrespollano/neural_nets-tda), with the intention of aiding the application of TDA methods to transformer-based models.
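As a concrete illustration of this distance-and-threshold recipe, the sketch below flags an input as OOD when its embedding lies too far from the nearest embedding of an in-distribution validation text. It is a minimal sketch, assuming the embeddings are already available as NumPy arrays; the function name, array names, and the use of Euclidean distance are illustrative rather than a description of our released code.

    import numpy as np

    def flag_ood(embedding, id_embeddings, threshold):
        """Flag an input as OOD when the distance from its embedding to the
        nearest in-distribution (ID) validation embedding exceeds a threshold
        calibrated on ID data.

        embedding:     (d,) vector for the input text
        id_embeddings: (m, d) matrix of ID validation embeddings
        threshold:     distance threshold calibrated for the application
        """
        # Euclidean distance to every ID validation embedding
        distances = np.linalg.norm(id_embeddings - embedding, axis=1)
        # The distance to the nearest ID neighbour decides the flag
        return "out" if distances.min() > threshold else "in"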
2. Background

2.1. Topological Data Analysis

Topology studies properties of geometric objects that are invariant under continuous deformation. For instance, a donut and a coffee cup are topologically equivalent. Algebraic topology, as in Hatcher's work [10], attaches algebraic objects such as groups to topological spaces. Certain features of these algebraic objects help to quantify those topological spaces.

Persistence extends topology to finite data sets, tracing back to Frosini [11] and Robins [12]. Persistent homology groups, derived from homology groups, serve as invariants for discrete objects.

For any finite set of points, we can construct a distance matrix in which both the rows and columns are labeled by these points, and each entry represents the distance between a pair of points. We can then apply tools from Topological Data Analysis (TDA) to this set of points, allowing us to assign certain invariant characteristics to the collection.

In the context of language or text, we can think of each word as a point in some vector space, with a distance defined between words. For example, the distance might be related to semantic similarity or other linguistic relationships. By considering a text as a collection of such points, we can assign various numerical characteristics to it. These characteristics can distinguish the text from others and provide insights into its structure and content.

2.1.1. Simplicial Complex and Chain

A simplicial complex is a fundamental construct in algebraic topology, used to approximate and study more complex topological spaces. It is formed by combining simpler building blocks called simplices.

Simplices: A k-dimensional simplex, denoted Οƒ, is the convex hull of k + 1 affinely independent points. For example, a 0-simplex is a point, a 1-simplex is a line segment, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron.

Forming a Simplicial Complex: A simplicial complex K in R^d is a collection of simplices that satisfies two conditions:
1. Any face of a simplex in K is also in K.
2. The intersection of any two simplices in K is either empty or a common face of both.

Simplicial Chains: To study the algebraic properties of simplicial complexes, we introduce the concept of simplicial chains. A simplicial chain in a complex is a formal sum of simplices. For a given dimension k, the group of k-chains, denoted C_k, is the free abelian group generated by the k-dimensional simplices of the complex.

Boundary Operators: The boundary of a simplex is the sum of its faces. The boundary operator βˆ‚_k : C_k β†’ C_{k-1} maps each k-simplex to its (k βˆ’ 1)-dimensional boundary. This operator is crucial for defining the homology of the complex. For example, the boundary of a 2-simplex (triangle) Οƒ = [v_0, v_1, v_2] is the sum of its 1-dimensional faces (edges): βˆ‚_2(Οƒ) = [v_1, v_2] + [v_2, v_0] + [v_0, v_1].

Chain Complex: A chain complex is a sequence of chain groups connected by boundary operators:

    0 β†’ C_n --βˆ‚_n--> C_{n-1} --βˆ‚_{n-1}--> Β·Β·Β· --βˆ‚_2--> C_1 --βˆ‚_1--> C_0 β†’ 0.

Cycle and Boundary Groups: Z_p = ker βˆ‚_p and B_p = im βˆ‚_{p+1}, with B_p βŠ‚ Z_p.

Simplicial Homology: The k-th simplicial homology group of a complex K is H_k(K) = Z_k(K)/B_k(K), with Betti number Ξ²_k(K) = dim H_k(K).

2.1.2. Vietoris-Rips Complex

The Vietoris-Rips complex is a key construct in topological data analysis, used for forming a simplicial complex from a set of data points based on their pairwise distances.

Definition: Given a set of points X and a distance threshold Ξ΅, the Vietoris-Rips complex VR_Ξ΅(X) is defined as follows: for any subset Οƒ βŠ† X, Οƒ is a simplex in VR_Ξ΅(X) if and only if the distance between every pair of points in Οƒ is less than or equal to Ξ΅.

Formal Construction:
β€’ Vertices: Each point in X is a 0-simplex (vertex).
β€’ Edges: An edge (1-simplex) connects vertices x_i and x_j if d(x_i, x_j) ≀ Ξ΅.
β€’ Higher Simplices: A k-simplex is formed by a set of k + 1 vertices if every pair of vertices in the set is connected by an edge.
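To make the construction concrete, the sketch below computes Vietoris-Rips persistence from a pairwise distance matrix using the Giotto-tda library (which we also use later for feature extraction). The toy point cloud is invented for illustration; only the shape conventions of the library calls matter here.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from gtda.homology import VietorisRipsPersistence

    # Toy point cloud (coordinates are illustrative only)
    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

    # Distance matrix labelled by the points, as described in Section 2.1
    distance_matrix = squareform(pdist(points))

    # Vietoris-Rips filtration over the distance matrix, tracking H0 and H1
    vr = VietorisRipsPersistence(metric="precomputed", homology_dimensions=(0, 1))
    diagrams = vr.fit_transform(distance_matrix[None, :, :])

    # Each row of the diagram is a (birth, death, homology dimension) triple
    print(diagrams.shape)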
2.2. BERT Model

BERT [2] is a transformer-based language model that has been pre-trained on a large corpus of text from BooksCorpus and English Wikipedia. Input text first needs to be tokenized, in which each word is converted to one or more tokens. The first token is the special [CLS] token, followed by the tokenization of each word, with the special [SEP] token used to separate "sentences" (e.g., question and answer; these do not necessarily correspond to linguistic sentences). BERT is trained with two objectives: Masked Language Modelling (MLM), in which tokens are masked at random (replaced with the special [MASK] token) and the language model must learn to fill them in; and Next Sentence Prediction (NSP), in which the final hidden vector of the special [CLS] token is used to predict whether two sentences follow each other in the corpus.

As a transformer-based model, BERT consists of multiple layers, each with multiple attention heads. While multiple variants of BERT are available, for the purpose of this paper we use BERT_BASE, which consists of 12 layers, each with 12 attention heads (i.e., 144 attention heads in total), operating on an input matrix X of n tokens and d = 768 hidden dimensions.

2.2.1. Sentence Embeddings

The final hidden vector of the special [CLS] token can be used to embed the input sequence (which varies in length) in d hidden dimensions (768 in the case of BERT_BASE). The authors of the BERT paper [2] note that the [CLS] embedding is not a meaningful sentence representation without fine-tuning. Nevertheless, Uppaal et al. [13] claim that the practice of using it to obtain sentence embeddings "is standard for most BERT-like models", and find that in the case of RoBERTa (a BERT-like model without the NSP training objective) this embedding serves as a "near perfect" OOD detector even without fine-tuning.
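The sketch below shows one way to obtain such a [CLS] sentence embedding with the Hugging Face transformers library; the model name and the example sentence (taken from Figure 1) are illustrative, and the exact extraction code in our repository may differ.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("President issues vows as tensions with China rise",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Final-layer hidden state of the first token ([CLS]): a 768-dimensional vector
    cls_embedding = outputs.last_hidden_state[0, 0]
    print(cls_embedding.shape)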
2.2.2. Attention Maps

Each attention head computes an attention map, W^attn, of shape n Γ— n as an intermediate step of its calculation. We use the same definition of attention maps as Kushnareva et al. [4]:

    X^out = W^attn (X W^V),    where W^attn = softmax( (X W^Q)(X W^K)^T / √d )

and W^Q, W^K, W^V are learned projection matrices of shape d Γ— d, while X^out is the output of the attention head applied to the n Γ— d matrix X from the previous layer. In this paper, we analyse the attention maps of each of the 144 attention heads in BERT_BASE using TDA.
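A sketch of how these attention maps can be read out of BERT_BASE via the transformers library; requesting output_attentions=True returns one n Γ— n map per head per layer. The variable names and the final reshaping into a stack of 144 maps are our own illustration rather than the exact code we ran.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
    model.eval()

    inputs = tokenizer("President issues vows as tensions with China rise",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.attentions is a tuple of 12 tensors, one per layer,
    # each of shape (batch, heads=12, n, n)
    stacked = torch.stack(outputs.attentions).squeeze(1)   # (12, 12, n, n)
    n = stacked.shape[-1]
    attention_maps = stacked.reshape(-1, n, n)             # (144, n, n)
    print(attention_maps.shape)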
3. Experiment design

In this section, we outline the design of our methodology for OOD detection using Topological Data Analysis. For a supervised classification task, given a test sample x, OOD detection aims to determine whether it belongs to the in-distribution (ID) dataset, x ∈ D_in, or not. Background and related literature on confidence scores for OOD detection can be found in [9, 14, 15]. We consider a d-dimensional representation h(x) ∈ R^d of an input text x. To analyse the benefits of TDA in OOD detection, we consider two encoding functions h1(x) and h2(x):

1. Topological feature vector h1(x): given x, we generate a vector of d1 topological features using the graph representations of the 144 attention maps generated by BERT_BASE. Subsections 3.3 and 3.4 explain in detail how the topological features are generated from an input sentence.

2. Sentence embedding h2(x): we take the d2-dimensional text embedding of the [CLS] token output by BERT_BASE, which captures the contextual and semantic information of the input text x.

Similar to Uppaal et al. [13], we define the OOD detection function G(x), which maps an instance x to {in, out}, as follows:

    G_Ξ»(x; h) = in   if S(x; h) β‰₯ Ξ»
                out  if S(x; h) < Ξ»

where S(x; h) is an OOD scoring function using a distance-based method (Mahalanobis distance to the ID class centroids or Euclidean distance to the k-nearest ID neighbour), described in subsection 3.5, and Ξ» is a threshold chosen so that a high proportion of ID samples' scores lie above Ξ».

3.1. Data

As the in-distribution dataset, we choose the headlines and abstract text of 'Politics' and 'Entertainment' news articles from HuffPost in the news-category dataset [16]. To test the robustness of the OOD method, we conduct experiments on three kinds of dataset distribution shift [17]:

β€’ Near Out-of-Domain shift. In this paradigm, ID and OOD samples come from different distributions (datasets) that exhibit semantic similarities. In our experiments, we evaluate the abstracts of news articles from the cnn-dailymail dataset [18].
β€’ Far Out-of-Domain shift. In this type of shift, the OOD samples come from a different domain and exhibit significant semantic differences. In particular, we evaluate the IMDB movie review dataset [19] as OOD samples.
β€’ Same-Domain shift. We also test a more challenging setting, where ID and OOD samples are drawn from the same domain but with different labels. Specifically, we extract the 'Business' news articles from the news-category dataset.

In our experiments we used a sample of 30,000 points from the in-distribution dataset for the fine-tuned version of the model, and a validation and test size of 1,000 data points.

3.2. Model

We focus on the attention heads of a pre-trained BERT_BASE (L=12, H=12) applied to an input text x to produce topological features, and compare this encoding to the embedding of the [CLS] token as the sentence representation. We replicate our experiments on a BERT_BASE fine-tuned on the ID news categorisation task X β†’ {'Politics', 'Entertainment'}. We fine-tune the model for 3 epochs, using Adam with a batch size of 32 and a learning rate of 10^-5.

3.3. Attention Maps and Attention Graphs

Attention maps play a crucial role in our methodology as they form the basis for extracting the topological features used in our OOD detection. An attention map W^attn is an n Γ— n matrix in which each entry represents the attention weight between two tokens. Each element w_ij can be interpreted as the level of 'attention' token i pays to token j in the input sequence during the encoding process. The higher the weight, the stronger the relation between the two tokens. The weights are non-negative, and the attention weights of a token sum to one (i.e., Ξ£_{j=1}^{n} w_ij = 1 for all i = 1, ..., n).

To generate topological features from an attention map, we first convert it into an attention graph following the approach of Perez and Reinauer [5]. Given an attention matrix W^attn, we create an undirected weighted graph where the vertices represent the tokens of the input text x, and the weights are determined by the attention weights in the corresponding attention map. To emphasise the important relationships and reduce noise, we calculate the distance between vertices as 1 βˆ’ max(w_ij, w_ji). This makes the relationship symmetric and ensures that strong attention results in smaller distances. To prevent the formation of self-loops, all diagonal entries of the adjacency matrix are set to 0. Figure 1 shows an example of constructing the attention graph for an attention map.

Figure 1: Process of transforming an attention map into an attention graph (one per attention head). (a) Attention maps (12 Γ— 12) derived from pre-trained BERT for the input text "President issues vows as tensions with China rise"; (b) BERT attention map (Layer 7; Head 10); (c) undirected attention graph (Layer 7; Head 10) where edges are proportional to the maximal attention between the two vertices, with edge width representing shorter distances (stronger attention).
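A minimal sketch of this conversion, assuming a single n Γ— n attention map is available as a NumPy array (the function name is ours):

    import numpy as np

    def attention_to_distance_matrix(attn_map):
        """Convert one n x n attention map into the symmetric distance matrix
        of the undirected attention graph described in subsection 3.3."""
        # Symmetrise by taking the maximal attention in either direction
        sym = np.maximum(attn_map, attn_map.T)
        # Strong attention -> small distance
        dist = 1.0 - sym
        # Set the diagonal to 0 to prevent self-loops
        np.fill_diagonal(dist, 0.0)
        return dist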
3.4. Persistent Homology

The attention graphs constructed from the attention heads contain the structure and relationships we need to extract topological features. To encode the topological information provided by an attention graph, we use a filtration process to generate a persistence diagram. Filtration in TDA is a systematic process in which a topological space is progressively constructed across varying scales, in order to analyse the emergence, persistence and disappearance of simplicial complexes such as connected components, holes, or voids.

We apply one of the most widely used filtrations to the attention graphs, the Vietoris-Rips filtration. This process starts with only the vertices of the graph, considering them as zero-dimensional simplices. It then adds edges one by one, depending on their weights (i.e., distances). Edges with shorter distances below a threshold are added first, gradually connecting the vertices as the threshold increases until a complete graph is formed. As edges are added, the filtration process captures the graph's properties and the relationships between its vertices [20]. This process is visualised in Figure 2.

Figure 2: Filtration process for the attention graph (Layer 7; Head 10), where edges with shorter distances below a threshold are added first, gradually connecting the nodes until a complete graph is formed.

To construct a persistence diagram, we keep track of the lifetime of persistence features as the threshold is increased. One can think of 0-dimensional persistent features as connected components, 1-dimensional features as holes, 2-dimensional features as voids (2-dimensional holes), and so on. The birth and death times of a persistence feature are the threshold values at which the feature appeared and disappeared. For example, when the threshold is 0 all 0-dimensional features are born (vertices), and when two vertices i and j are connected at threshold w_ij, one 0-dimensional feature disappears. Similarly, a 1-dimensional feature (hole) appears at the threshold where 3 vertices connect to each other, and disappears when a fourth vertex forms a 2-dimensional simplex (void). The birth and death of all k-dimensional features are recorded in a persistence diagram. An example persistence diagram is shown in Figure 3a.

From the persistence diagrams, we extract various topological features to represent the underlying graph's structure. In our experiments, we focus on the following topological features:

1. Persistence Entropy: This feature quantifies the complexity of the persistence diagram, calculated as the Shannon entropy of the persistence values (birth and death), with higher entropy indicating a more complex topology.

2. Amplitude: We compute amplitude using two different distance measures, 'bottleneck' and 'Wasserstein'. The amplitude measures the maximum persistence value within the diagram, providing insight into the significance of the topological features.

We consider different homology dimensions to capture topological features of varying complexity; in our experiments, we use homology dimensions [0, 1, 2, 3] to account for different aspects of the attention graph's topology. We use the Giotto-tda library to generate the persistence diagrams and extract the topological features, as per Figure 3b. Both persistence entropy and amplitude features are used in the experiment by concatenating all features into a single feature vector.

Figure 3: Example persistence diagram and extracted topological features. (a) Persistence diagram generated from the filtration process for the attention map in Layer 7, Head 10; the set of H0 points (red) represents the birth and death of connected components, and the set of H1 points (teal) represents the birth and death of holes. (b) Topological features extracted from the persistence diagram: persistence entropy, and amplitude with 'bottleneck' and 'Wasserstein' distances for homology dimensions 0, 1, 2 and 3. (In the case of NaN values, e.g. due to no higher-dimensional simplices, we set the persistence entropy feature to -1, as per the default behaviour of Giotto-tda.)
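A sketch of this feature-extraction step with Giotto-tda, assuming the 144 attention-graph distance matrices of one input text are stacked into a single array; the exact feature ordering and batching in our released code may differ.

    import numpy as np
    from gtda.homology import VietorisRipsPersistence
    from gtda.diagrams import Amplitude, PersistenceEntropy

    def topological_features(distance_matrices):
        """distance_matrices: array of shape (144, n, n), one attention-graph
        distance matrix per attention head for a single input text."""
        vr = VietorisRipsPersistence(metric="precomputed",
                                     homology_dimensions=(0, 1, 2, 3))
        diagrams = vr.fit_transform(distance_matrices)

        # Persistence entropy per homology dimension (NaNs become -1 by default)
        entropy = PersistenceEntropy(nan_fill_value=-1.0).fit_transform(diagrams)

        # Amplitude per homology dimension, for both distance measures
        amplitudes = [Amplitude(metric=m, order=None).fit_transform(diagrams)
                      for m in ("bottleneck", "wasserstein")]

        # Concatenate per-head features, then flatten into h1(x)
        per_head = np.concatenate([entropy] + amplitudes, axis=1)
        return per_head.reshape(-1)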
3.5. OOD Scoring Function

Similar to Perez and Reinauer [5], given h(x), a d-dimensional representation of an input text x, we employ two distance-based methods as OOD scoring functions:

1. Mahalanobis distance to the ID class centroids: the Mahalanobis distance is used to measure the distance between the feature vector h(x) and the class centroids. This distance is based on the covariance matrix of the class features, under the assumption that the data in each class follows a multivariate Gaussian distribution. The OOD score is calculated as follows:

    S_Maha(x; h, Ξ£, ΞΌ) = min_{c ∈ Y} (z_x βˆ’ ΞΌ_c)^T Ξ£^{-1} (z_x βˆ’ ΞΌ_c)

where z_x is the standardised feature vector for the input h(x), Ξ£ is the covariance matrix of the standardised ID feature vectors, and ΞΌ is the set of class-mean standardised embeddings. Both Ξ£ and ΞΌ_c are estimated from the ID validation set embeddings to account for the inherent distribution of the ID data. The covariance matrix Ξ£ captures how the features vary with respect to one another, and ΞΌ_c represents the centroid, or average representation, of the data belonging to class c.

2. Euclidean distance to the k-nearest ID neighbour: we measure the distance between h(x) and the feature vector of its k-nearest ID neighbour from the validation set. Given h(x) and a set of m ID feature vectors {h(x_1), h(x_2), ..., h(x_m)}, the Euclidean distance to the k-nearest ID neighbour is calculated as follows:

    S_KNN(x; h) = ||z_x βˆ’ z_{x_k}||_2

where z_x and z_{x_k} are the standardised feature vectors for the input h(x) and its k-nearest ID sample h(x_k). In our experiments, we set k = 5.
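A sketch of the two distance computations, assuming standardised feature vectors are available as NumPy arrays; the function and variable names are ours, and the resulting scores are thresholded against Ξ» as in the decision function G_Ξ» of Section 3.

    import numpy as np

    def mahalanobis_score(z_x, class_means, cov_inv):
        """Minimum Mahalanobis score to the ID class centroids, as defined in
        subsection 3.5 (class_means and cov_inv come from the ID validation set)."""
        return min(float((z_x - mu) @ cov_inv @ (z_x - mu)) for mu in class_means)

    def knn_score(z_x, z_id, k=5):
        """Euclidean distance from z_x to its k-th nearest standardised ID
        validation vector (k = 5 in our experiments)."""
        dists = np.sort(np.linalg.norm(z_id - z_x, axis=1))
        return float(dists[k - 1])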
4. Results

We conduct our experiments using Topological Data Analysis to generate topological feature vectors h1(x) from attention maps, which are then compared to standard sentence embeddings h2(x) generated from the [CLS] token of BERT. Table 1 shows the OOD detection performance of both approaches on the three out-of-distribution datasets, using both pre-trained and fine-tuned BERT models.

    Dataset                    Method   Pre-trained KNN     Pre-trained MAHA    Fine-tuned KNN      Fine-tuned MAHA
                                        AUROC↑  FPR95↓      AUROC↑  FPR95↓      AUROC↑  FPR95↓      AUROC↑  FPR95↓
    IMDB                       TDA      0.940   0.090       0.940   0.112       0.958   0.084       0.950   0.124
                               CLS      0.680   0.875       0.799   0.704       0.771   0.916       0.814   0.852
    CNN/Dailymail              TDA      0.572   0.890       0.563   0.908       0.551   0.909       0.521   0.927
                               CLS      0.875   0.591       0.897   0.445       0.947   0.215       0.949   0.208
    News-Category (Business)   TDA      0.527   0.929       0.543   0.921       0.570   0.923       0.568   0.925
                               CLS      0.580   0.921       0.638   0.878       0.884   0.431       0.885   0.424

Table 1: Comparison of the performance of our scoring functions on all three out-of-distribution datasets using both pre-trained and fine-tuned models (AUROC: higher is better; FPR95: lower is better).

For visualisation purposes, we use UMAP projections of the in-distribution (validation and test sets) and out-of-distribution data points in the corresponding feature space. Figure 4, Figure 5, and Figure 6 show the data representations from the TDA and CLS approaches for the far out-of-domain dataset (IMDB), the near out-of-domain dataset (CNN/Dailymail), and the same-domain dataset (business news-category), respectively.

Figure 4: The data representations from the TDA and CLS approaches for the far out-of-domain IMDB dataset (rows: TDA, CLS; columns: pre-trained, fine-tuned models).

Figure 5: The data representations from the TDA and CLS approaches for the near out-of-domain CNN/Dailymail dataset (rows: TDA, CLS; columns: pre-trained, fine-tuned models).

The results demonstrate that the TDA-based approach consistently outperforms the CLS embeddings in detecting OOD samples from the IMDB dataset, for both the pre-trained and fine-tuned models. OOD detection using TDA detects IMDB review samples with 8-9% FPR95, in stark contrast to the 87-91% FPR95 exhibited by CLS embeddings. As seen in Figure 4, the TDA feature vectors project the data into well-separated and compact clusters, which explains the superior performance.

The TDA approach was less effective than the CLS approach at detecting OOD samples from the near out-of-domain CNN/Dailymail dataset. Even though the data visualisation in Figure 5 shows that TDA was able to cluster OOD samples together, the cluster was not distant enough from the ID samples, rendering both distance-based OOD detection methods less effective.

For the same-domain dataset (business news-category), both approaches struggled to detect OOD samples. As seen in Figure 6, when both ID and OOD data are from the same domain, their feature vectors overlap heavily, although fine-tuning appears to provide stronger separability between ID and OOD data for the CLS approach.
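For reference, a sketch of how the two reported metrics can be computed with scikit-learn, assuming per-sample scores where higher means "more in-distribution" (the convention of G_Ξ»). The exact FPR95 convention used by our evaluation script is an assumption here: the variant below is the fraction of OOD samples still accepted at the threshold that retains 95% of ID samples.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evaluate_ood(scores_id, scores_ood):
        """AUROC and FPR95 for ID/OOD scores (higher = more in-distribution)."""
        y_true = np.concatenate([np.ones(len(scores_id)), np.zeros(len(scores_ood))])
        y_score = np.concatenate([scores_id, scores_ood])
        auroc = roc_auc_score(y_true, y_score)
        # Threshold that keeps 95% of ID scores above it (assumed convention)
        threshold = np.percentile(scores_id, 5)
        fpr95 = float(np.mean(np.asarray(scores_ood) >= threshold))
        return auroc, fpr95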
Figure 6: The data representations from the TDA and CLS approaches for the same-domain News-Category (Business) dataset (rows: TDA, CLS; columns: pre-trained, fine-tuned models).

5. Discussion

From our experiments, we showed that the TDA approach outperforms the CLS approach at detecting far out-of-domain OOD samples, like those in the IMDB dataset. Yet its effectiveness deteriorates with near out-of-domain (CNN/Dailymail) or same-domain (business news-category) datasets. To understand why, we looked at the samples that each approach thrived and struggled with, and we highlight three observations:

(1) The TDA approach accentuates features associated with textual flow or grammatical structure rather than lexical semantics, consistent with the findings of Deng and Duzhin [21] and Kushnareva et al. [4]. For example, TDA was adept at identifying OOD samples that are structurally unique in the IMDB dataset, as the most confident OOD samples detected were:

β€’ 'OK...i have seen just about everything....and some are considered classics that shouldn't be ( like all those Halloween movies that suck crap or even Steven king junk).......and some are considered just OK that are really great.....( like carnival of souls )........and then some are just plain ignored............like ( evil ed ) [. . . ]'
β€’ 'Time line of the film: * Laugh * Laugh * Laugh * Smirk * Smirk * Yawn * Look at watch * walk out * remember funny parts at the beginning * smirk <br /> [. . . ]'

In contrast, TDA struggled to detect CNN/Dailymail OOD samples, as they have sentence structures and lengths similar to the ID samples, even if they are semantically unrelated. Table 2 shows the samples with the least confident OOD scores from the CNN/Dailymail dataset and their nearest ID neighbours.

    CNN/Dailymail sample: "Footage showed an unusual 'apocalyptic' dust storm hitting Belarus. China has suffered four massive sandstorms since the start of the year. Half of dust in atmosphere today is due to human activity, said Nasa."
    Nearest ID neighbour: "Trump's Proposed Cuts To Foreign Food Aid Are Proving Unpopular. The president might see zeroed-out funding for foreign food aid as 'putting America first,' but members of Congress clearly disagree."

    CNN/Dailymail sample: "Video posted by YouTube user Richard Stewart showing a Porsche Cayman flying out of control. Police cited unidentified driver for the crash. Car reportedly wrecked and needed to be towed from the scene."
    Nearest ID neighbour: "Trump Signs Larry Nassar-Inspired Sexual Assault Bill Behind Closed Doors. The president quietly signed the bill the week after two White House staffers resigned amid allegations of domestic violence."

Table 2: Least confident OOD samples from the CNN/Dailymail dataset and their nearest ID neighbours, from the TDA approach using the pre-trained BERT model.

(2) CLS embeddings are sensitive to the semantic and contextual meaning of the samples, regardless of sentence structure. This explains why this approach struggled with OOD detection on IMDB reviews, as it often classified IMDB movie reviews as in-distribution due to their semantic similarity with the entertainment news articles in the ID dataset, especially those related to movies. A closer look at the IMDB samples with the smallest OOD scores from the CLS embeddings in Table 3 exemplifies this insight: the nearest neighbours identified are ID samples on similar topics, even though they are clearly from different domains.

(3) Fine-tuning improved the performance of CLS embeddings for near or same-domain shifts, but shows no significant benefit for TDA. Fine-tuning induces a model to divide a single domain cluster into class clusters, as highlighted by Uppaal et al. [13]. For the CNN/Dailymail and Business news OOD datasets, this is beneficial for the CLS approach as it learns to better distinguish topics. However, fine-tuning made the CLS embeddings of IMDB movie reviews appear even more similar to entertainment news, deteriorating OOD performance. For the TDA approach, fine-tuning did not present any considerable benefit. This can be partly attributed to observation (1): TDA primarily captures structural differences, and fine-tuning, which is driven by semantics, does not significantly alter the topological representation.
    IMDB review sample: "[...] I would spend good, hard-earned cash money to see it again on DVD. And as long as we're requesting Smart Series That Never Got a Chance...How about DVD releases of Maximum Bob (another well written, odd duck show with a delightful cast of characters.) [...]"
    Nearest ID neighbour: "DVDs: Great Blimp, Badlands, Buster Keaton & More. Let's catch up with some reissues of classic – and not so classic – movies, with a few documentaries tossed in at the end for good measure."

    IMDB review sample: "[...] I am generally not a fan of Zeta-Jones but even I must admit that Kate is STUNNING in this movie. [...]"
    Nearest ID neighbour: "How 'Erin Brockovich' Became One Of The Most Rewatchable Movies Ever Made. Julia Roberts gives the best performance of her career, aided by a sassy Susannah Grant script full of one-liners."

Table 3: Least confident OOD samples from the IMDB dataset and their nearest ID neighbours, from the CLS approach using the pre-trained BERT model.

6. Conclusion

In this paper, we explore the capabilities of Topological Data Analysis for identifying out-of-distribution samples by leveraging the attention maps derived from BERT, a transformer-based large language model. Our results demonstrate the potential of TDA as an effective tool to capture the structural information of textual data.

Nevertheless, our experiments also highlighted the intrinsic limitations of TDA-based methods. Predominantly, our TDA method captured the inter-word relations derived from the attention maps, but failed to account for the actual lexical meaning of the text. This distinction suggests that while TDA offers valuable insights into textual structure, a lexical and more holistic understanding of textual data is needed for OOD detection, especially under near or same-domain shifts.

For future work, it may be worth combining the topological features that capture the structural information of textual data with features that encode the semantics of the text in an ensemble model, which might boost the ability to detect OOD samples. In addition, there is an opportunity to investigate the effectiveness of TDA in other NLP tasks where textual structure is important.

Acknowledgments

The research was supported by a National Intelligence Postdoctoral Grant (NIPG-2021-006).
Pi- demonstrate the potential of TDA as an effective tool to antanida, Beyond mahalanobis-based scores for tex- capture the structural information of textual data. tual ood detection, arXiv preprint arXiv:2211.13527 Nevertheless, our experiments also highlighted the intrin- (2022). sic limitations of TDA-based methods. Predominantly, our [8] X. Li, J. Li, X. Sun, C. Fan, T. Zhang, F. Wu, Y. Meng, TDA method captured the inter-word relations derived from J. Zhang, π‘˜ folden: π‘˜-fold ensemble for out-of- the attention maps, but failed to account for the actual lexi- distribution detection, arXiv preprint arXiv:2108.12731 cal meaning of the text. This distinction suggests that while (2021). TDA offers valuable insights into textual structure, a lexical [9] K. Lee, K. Lee, H. Lee, J. Shin, A simple unified frame- and more holistic understanding of textual data is needed work for detecting out-of-distribution samples and for OOD detection, especially with near or same-domain adversarial attacks, Advances in neural information shifts. processing systems 31 (2018). For future work, it might be worth combining the topo- [10] A. Hatcher, Algebraic Topology, Cambridge University logical features that capture the structural information of Press, 2002. URL: https://pi.math.cornell.edu/~hatcher/ textual data, with those that encode the semantics of text AT/ATpage.html. in an ensemble model that might boost our ability to detect [11] P. Frosini, Measuring shapes by size functions, in: OOD samples. In addition, there is an opportunity to inves- Intelligent Robots and Computer Vision X: Algorithms tigate the effectiveness of TDA in other NLP tasks where and Techniques, volume 1607, SPIE, 1992, pp. 122–133. the textual structure might be important. URL: https://doi.org/10.1117/12.57059. doi:10.1117/ 12.57059. Acknowledgments [12] V. Robins, Towards computing homology from finite approximations, in: Topology proceedings, volume 24, The research was supported by a National Intelligence Post- 1999, pp. 503–532. doctoral Grant (NIPG-2021-006). [13] R. Uppaal, J. Hu, Y. Li, Is fine-tuning needed? pre- trained language models are near perfect for out-of- domain detection, in: Proceedings of the 61st Annual References Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), 2023, pp. 12813– [1] S. Wong, S. Barnett, J. Rivera-Villicana, A. Simmons, 12832. URL: https://aclanthology.org/2023.acl-long. H. Abdelkader, J.-G. Schneider, R. Vasa, MLGuard: 717. doi:10.18653/v1/2023.acl-long.717. Defend your machine learning model!, in: Proceedings [14] Y. Sun, Y. Ming, X. Zhu, Y. Li, Out-of-distribution of the 1st International Workshop on Dependability detection with deep nearest neighbors, in: Interna- and Trustworthiness of Safety-Critical Systems with tional Conference on Machine Learning, PMLR, 2022, Machine Learned Components, SE4SafeML 2023, 2023, pp. 20827–20840. p. 10–13. doi:10.1145/3617574.3617859. [15] J. Yang, K. Zhou, Y. Li, Z. Liu, Generalized out- of-distribution detection: A survey, arXiv preprint arXiv:2110.11334 (2021). [16] R. Misra, News category dataset (2022). URL: https: //arxiv.org/abs/2209.11429. [17] U. Arora, W. Huang, H. He, Types of out-of- distribution texts and how to detect them, in: Proceed- ings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 10687–10701. URL: https://aclanthology.org/2021.emnlp-main.835. doi:10.18653/v1/2021.emnlp-main.835. [18] A. See, P. J. Liu, C. D. 
Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083. URL: https://www. aclweb.org/anthology/P17-1099. doi:10.18653/v1/ P17-1099. [19] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 142–150. URL: http: //www.aclweb.org/anthology/P11-1015. [20] U. Bauer, Ripser: efficient computation of vietoris–rips persistence barcodes, Journal of Applied and Com- putational Topology 5 (2021) 391–423. URL: https: //doi.org/10.1007/s41468-021-00071-5. doi:10.1007/ s41468-021-00071-5. [21] R. Deng, F. Duzhin, Topological data analysis helps to improve accuracy of deep learning models for fake news detection trained on very small training sets, Big Data Cogn. Comput. 6 (2022) 74. URL: https://doi.org/ 10.3390/bdcc6030074. doi:10.3390/bdcc6030074.