<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Optimal Robust Simplifications for Explaining Time Series Classifications</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jan</forename><forename type="middle">Arne</forename><surname>Telle</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">University of Bergen</orgName>
								<address>
									<country key="NO">Norway</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cèsar</forename><surname>Ferri</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">VRAIN</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Brigt</forename><surname>Håvardstun</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">University of Bergen</orgName>
								<address>
									<country key="NO">Norway</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Explainable AI for Time Series and Data Streams Tutorial-Workshop</orgName>
								<address>
									<addrLine>September 9</addrLine>
									<postCode>2024</postCode>
									<settlement>Vilnius</settlement>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Optimal Robust Simplifications for Explaining Time Series Classifications</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">DE322D51DD9AAF3B2B8DF39DF001EC5E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Time Series</term>
					<term>XAI</term>
					<term>Simplification</term>
					<term>Explainability</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Given a Time Series Classifier (TSC) and three parameters that balance between error, simplicity and robustness, we define an optimization problem over all possible ways of simplifying a given time series 𝑡𝑠 into straight-line segments. Robustness is defined as the fraction of perturbations that have the same classification as 𝑡𝑠 under the TSC, and we introduce a novel method for generating a set of perturbations where this information is easy to visualize. We prove that under some mild conditions on the TSC and the three parameters, we can find the optimal solution in time polynomial in the length of 𝑡𝑠, by first doing dynamic programming to solve for error and simplicity, and then adding robustness. We test the resulting Optimal Robust Simplification (ORS)-Algorithm on binary TSCs for three datasets from UCR. We apply the ORS-Algorithm to prototypes of the classes, with varying parameters, to evaluate its power as an explanatory tool for the trained classifiers. We also provide a tool for visualizing the robustness information. We believe the resulting insights show the usefulness of the Optimal Robust Simplifications in explaining TSCs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Temporal data is encountered in many real-world applications, ranging from patient data in healthcare <ref type="bibr" target="#b0">[1]</ref> to the field of cyber security <ref type="bibr" target="#b1">[2]</ref>. Deep learning methods have been successful <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b1">2]</ref> for TSC (Time Series Classification), but such methods are not easily interpretable and are often viewed as black boxes, which limits their application when user trust in the decision process is crucial. To enable the analysis of these black-box models we resort to post-hoc interpretability. Recent research has focused on adapting existing methods to time series, both specific methods like SHAP-LIME <ref type="bibr" target="#b3">[4]</ref>, Saliency Methods <ref type="bibr" target="#b4">[5]</ref> and Counterfactuals <ref type="bibr" target="#b5">[6]</ref>, and also combinations of these <ref type="bibr" target="#b6">[7]</ref>.</p><p>Since humans learn and reason by forming mental representations of concepts based on examples, and any machine learning model has been trained on data, we believe that data, e.g. in the form of prototypes and counterfactuals, is indeed the natural common language between the user and the model. However, the basic problem is that, compared to images and text, time series data are not intuitively understandable to humans <ref type="bibr" target="#b7">[8]</ref>. This makes interpretability of time series extra demanding, both when it comes to understanding how users will react to the provided explanations and to predicting which explanatory tools are best. For example, in <ref type="bibr" target="#b8">[9]</ref> a tool for explainability of a TSC was presented that allowed model inspection, so that users could form their own mental model of the classifier. 
However, a user evaluation showed that non-expert end-users were not able to make use of this freedom, supporting the notion that time series are particularly non-intuitive for humans.</p><p>Theissler et al. <ref type="bibr" target="#b9">[10]</ref> give an intriguing taxonomy of XAI for TSC, divided into those focused on (i) Time-points (e.g. SHAP and LIME), (ii) Subsequences (e.g. Shapelets), or (iii) Instances (Prototypes, CF, Feature-based). Explaining a classifier by instances like prototypes has a definite appeal, but it is a challenge to highlight the features important for the classification while at the same time simplifying the instance to mitigate the non-intuitive aspects of this domain, as we attempt in this work. Theissler et al. <ref type="bibr" target="#b9">[10]</ref> review the literature on XAI for TSC, and of the 9 papers they discuss on Instance-based explanations using Prototypes or Features, it is remarkable that not a single one is specialized for Local explanations; rather, they are all Global. In this work we fill this gap and introduce a way of simplifying a time series by straight-line segments, doing so in a way that is robust with respect to the classification of a given TSC ℎ. <ref type="foot" target="#foot_0">1</ref> Our ORS-Algorithm, for Optimal Robust Simplifications, is model-agnostic, with the robustness of a simplification being the fraction of local perturbations that retain the classification under TSC ℎ. There are various ways of perturbing time series, see e.g. <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. In this work we develop a different method of computing a set of perturbations, both because we start from a simplification into straight-line segments and because we want the aggregate information about their classifications to be visualized in an informative way. 
When computing robustness we start with a straight-line simplification, and any perturbation must to a certain degree keep its position and shape, while it is also allowed to deviate as the perturbation moves inside a band around the simplification, see Figure <ref type="figure" target="#fig_9">8</ref>.</p><p>The ORS-Algorithm computing a robust simplification 𝑠𝑡𝑠 of a given time series 𝑡𝑠, under a TSC ℎ, has three parameters that balance between error (Squared Euclidean distance between 𝑡𝑠 and 𝑠𝑡𝑠), simplicity (number of segments of 𝑠𝑡𝑠), and robustness (the fraction of perturbations of 𝑠𝑡𝑠 that have the same classification as 𝑡𝑠 under ℎ), and 𝑠𝑡𝑠 is defined as the minimizer of an objective function taking all of this into account, see Definition 1. Perhaps surprisingly, in Theorem 4 we show that under some mild conditions on ℎ and the three parameters, we can find the optimal solution in time polynomial in the length of the original time series 𝑡𝑠, by first solving for error and simplicity and then adding robustness. Note that our algorithm allows not only the balance between error, simplicity and robustness to change, but also their computation, i.e. the algorithm is agnostic to the particular methods used to calculate these values. Hence Squared Euclidean distance can be replaced by some other distance function as long as it is amenable to the dynamic programming, simplicity can depend on aspects other than the number of segments, and robustness can be computed from other types of perturbations.</p><p>In the rest of this paper we first discuss related work and cover some standard definitions before presenting the ORS-Algorithm together with a proof of correctness and runtime. We then discuss the usefulness of the robust simplifications, with a focus in this paper on time series that are discrete, univariate, evenly sampled, aperiodic, short, and non-stationary, based on the UCR datasets Chinatown, Italy Power Demand and ECG200. 
We also provide an aid to visualize the robustness information. Our findings suggest that the robust simplifications of a prototype 𝑡𝑠 can be used to extract explanations of the classification done by the TSC, as they simplify 𝑡𝑠 to the point where human intuition can more easily be applied. We conclude the paper by discussing some directions for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Several techniques have been applied to generate explanations from TSC models <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b13">14]</ref>. Specifically, techniques originally developed for Convolutional Neural Networks or Recurrent Neural Networks have been carried over when these architectures are used for TSC. For instance, the authors in <ref type="bibr" target="#b6">[7]</ref> apply to time series several XAI methods previously used in the image and text domains. They also introduce verification techniques specific to time series, in particular a perturbation analysis and a sequence evaluation. Related to the datasets used in our paper, <ref type="bibr" target="#b14">[15]</ref> presents a user study of XAI methods to explain the ECG200 dataset from UCR.</p><p>Another alternative is to produce explanations through examples, which can be specifically utilized in the time series domain. One such explanation method involves giving the nearest example from the training dataset, which acts as a prototype depicting the normal behavior of a similar sample <ref type="bibr" target="#b15">[16]</ref>. A method to generate prototypes for time series data using an autoencoder is presented in <ref type="bibr" target="#b16">[17]</ref>; the main novelty of that work is the method to generate diverse prototypes.</p><p>Given that in many cases raw time series can be too complex for humans, several studies have tried to employ simplified versions as explanations. In <ref type="bibr" target="#b17">[18]</ref>, the authors propose a dimensionality reduction technique for time series, called Piecewise Aggregate Approximation, as a tool for similarity search in large time series databases. 
The work <ref type="bibr" target="#b18">[19]</ref> presents a review of piecewise linear approximations employed for time series data, and the authors also propose a method named SWAB (based on the Sliding Window and Bottom-up approaches). Finally, in <ref type="bibr" target="#b19">[20]</ref> Camponogara and Nazari introduce a range of piecewise-linear models and algorithms for approximating unknown functions. Another option is to segment time series into fixed-width windows and employ these to justify decisions. In <ref type="bibr" target="#b20">[21]</ref> Schlegel et al. propose TS-MULE, a model explanation method that segments and perturbs time series data, extending LIME. A study on the effect of segmentation on time series, in particular in the field of finance for the use of pattern matching, was presented in <ref type="bibr" target="#b21">[22]</ref>. In <ref type="bibr" target="#b22">[23]</ref>, Si and Yin also apply segmentation to financial time series data, now as a preprocessing step to locate technical patterns.</p><p>Perturbations have also been previously employed for XAI on time series. In <ref type="bibr" target="#b10">[11]</ref>, the authors propose a framework to generate explanations for time series classifications based on a generative model that creates within-distribution perturbations. Enguehard, in <ref type="bibr" target="#b11">[12]</ref>, introduces a technique to explain multivariate time series predictions using a perturbation-based saliency method. Recently, the approach ContraLSP has been detailed in <ref type="bibr" target="#b12">[13]</ref>. This method is based on a locally sparse model that produces counterfactual samples to build uninformative perturbations while keeping them in-distribution using contrastive learning.</p><p>Some tools have been proposed to allow users to interact with time series and, in that way, to get some insights about their classification, like <ref type="bibr" target="#b23">[24]</ref> where model inspection is the key aspect. 
In <ref type="bibr" target="#b24">[25]</ref>, the authors develop an interactive XAI tool for loan applications that allows users to experiment with hypothetical input values and inspect their effect on the outcomes of the model, and they perform a user evaluation on MTurk. <ref type="bibr" target="#b25">[26]</ref> presents a Python package providing a unified interface for the interpretation of time series classification.</p><p>Some papers have developed an approach similar to our proposal. In <ref type="bibr" target="#b26">[27]</ref>, Tang et al. present the method Dual Prototypical Shapelet Networks (DPSN) for few-shot time-series classification. This method trains a neural network from few examples and also interprets the model at two levels of granularity: a global overview with representative time series examples, and local highlights with discriminative shapelets. Another related paper for sequence learning is <ref type="bibr" target="#b27">[28]</ref>. That work proposes ProSeNet, an RNN-based method for deep sequence modeling. The approach combines prototype learning and RNNs to construct models with high accuracy and interpretability. The method considers three criteria in constructing prototypes: Simplicity, the prototypes can be subsequences containing only the key events determining the output; Diversity, redundant prototypes should be avoided; Sparsity, for each example to be explained only a few prototypes are shown, to avoid long explanations. The authors show the validity of ProSeNet for time series using the MIT-BIH Arrhythmia ECG dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Standard definitions</head><p>Let us present formal definitions for Time Series Classification (TSC) and recall basic notions.</p><p>Staying consistent with earlier notation <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b5">6]</ref>, a time series 𝑇 = {𝑡 1 , 𝑡 2 , . . . , 𝑡 𝑚 } is an ordered set of 𝑚 real-valued observations (or time steps). Note we may also view each 𝑡 𝑖 as a pair consisting of an 𝑥-value (the time) and a 𝑦-value (the observation). A time series dataset 𝐷 = {𝑇 1 , 𝑇 2 , ..., 𝑇 𝑛 } ∈ 𝑅 𝑛×𝑚 is a collection of such time series where each time series has a class label 𝑐, forming a vector of class labels. In this paper we consider only binary classification tasks. Given such a dataset, Time Series Classification is the task of training a mapping 𝑏 from the space of possible inputs to a probability distribution over the class values. Thus, a black-box classifier 𝑏(𝑇 ) takes a time series 𝑇 as input and predicts a probability output over the class values.</p><p>Prototypes are time series exemplifying the main aspects responsible for a classifier's specific decision outcome. A prototype can be a real instance (which is what we opt for), sampled from the dataset, that is important and meaningful because it summarizes the shape of many other similar instances, or a synthetic one, for example a cluster centroid or an instance generated by some ad-hoc process.</p></div>
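As a minimal illustration of these notions (our own hypothetical sketch, not the paper's code), a dataset can be held as an 𝑛 × 𝑚 array together with a label vector, and a black-box classifier 𝑏 maps a time series to a probability distribution over the two classes:

```python
import numpy as np

# A time series T = {t_1, ..., t_m}: m real-valued observations (y-values),
# sampled at implicit x-values 0, ..., m-1.
D = np.array([[0.0, 1.0, 2.0],      # dataset of n = 2 time series, m = 3
              [2.0, 1.0, 0.0]])
labels = np.array([0, 1])           # one class label per time series

def b(T):
    """A toy black-box binary classifier: returns a probability distribution
    over the two classes, here driven by whether the series trends up or down."""
    up = 1.0 / (1.0 + np.exp(-(T[-1] - T[0])))  # sigmoid of the overall trend
    return np.array([up, 1.0 - up])
```

Here `b(D[0])` assigns more probability to class 0, since that series trends upward; any trained FCN exposing predicted class probabilities plays the same role in the paper's setting.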
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">An Algorithm for Optimal Robust Simplifications</head><p>In this section we give the ORS-Algorithm that takes a binary Time Series Classifier ℎ and a univariate time series 𝑡𝑠, on 𝑛 points 𝑝 1 , ..., 𝑝 𝑛 given by increasing 𝑥-values, and computes the optimal robust simplification 𝑜𝑝𝑡 𝑠𝑖𝑚𝑝 (𝑡𝑠, ℎ, 𝛼, 𝛽, 𝛾), where 𝛼, 𝛽, 𝛾 are factors freely chosen to balance between error, simplicity and robustness. This simplification will select 𝑘 + 1 of the 𝑛 points, 𝑝 𝑖 1 , 𝑝 𝑖 2 , ..., 𝑝 𝑖 𝑘+1 with 𝑖 𝑗 &lt; 𝑖 𝑗+1 , ∀𝑗, to form a simplified time series 𝑠𝑡𝑠 on 𝑘 segments, by drawing a line between each consecutive pair of the 𝑘 + 1 points and by letting the first segment and the last segment extend to the 𝑥-values of 𝑝 1 and 𝑝 𝑛 , respectively. See Figure <ref type="figure">1</ref>. The optimal simplification thus reflects a selected balance of error, simplicity and robustness achieved by freely choosing the three factors 𝛼, 𝛽, 𝛾:</p><p>• 𝛼 for error: 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠) is the Squared Euclidean distance between 𝑡𝑠 and 𝑠𝑡𝑠, i.e. the sum over all 𝑥-values of the square of the difference between the corresponding 𝑦-values of 𝑡𝑠 and 𝑠𝑡𝑠. • 𝛽 for simplicity: 𝑘 𝑠𝑡𝑠 is the number of segments. • 𝛾 for robustness: 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠, ℎ) is a measure of how robust the classifier ℎ is on perturbations of 𝑠𝑡𝑠, measured by the fraction of perturbations that do not fall in the same class as 𝑡𝑠, thus in the range [0, 1] with 0 being most robust (𝑓 𝑟𝑎𝑔 is short for fragility and can be viewed as 1 − robustness). See subsection 4.2.2 for the definition of the perturbations used.</p><p>Perhaps surprisingly, under some mild assumptions this complicated objective function still allows an algorithm that finds the optimum in time polynomial in 𝑛. In this section we describe this algorithm in detail.</p></div>
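The objective just described can be sketched in a few lines of Python. This is a hypothetical illustration of the three terms (the function and variable names are ours, not taken from the paper's implementation), with 𝑓𝑟𝑎𝑔 estimated from an externally supplied set of perturbations:

```python
import numpy as np

def fragility(perturbations, h, target_class):
    """Fraction of perturbations NOT classified as target_class by classifier h.
    In [0, 1]; 0 is most robust (frag can be viewed as 1 - robustness)."""
    labels = [h(p) for p in perturbations]
    return sum(lab != target_class for lab in labels) / len(labels)

def ors_objective(ts, sts, k, frag, alpha, beta, gamma):
    """alpha * err(ts, sts) + beta * k + gamma * frag, where err is the
    Squared Euclidean distance between the y-values of ts and sts."""
    err = float(np.sum((np.asarray(ts, float) - np.asarray(sts, float)) ** 2))
    return alpha * err + beta * k + gamma * frag
```

For example, with ts = [0, 2], sts = [1, 2], k = 1, frag = 0.25 and 𝛼 = 𝛽 = 𝛾 = 1 the cost is 1 + 1 + 0.25 = 2.25.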
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">If no robustness: Dynamic Programming</head><p>We first solve the case of 𝛾 = 0, i.e. without giving any weight to robustness. First, recall that Least Squares is a foundational problem in statistics and numerical analysis: given points 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑛 in the plane, find the straight line that minimizes the squared error. In the more general Segmented Least Squares (SLS) problem we ask for a sequence of straight-line segments that minimizes cost, which is some balance of error and simplicity (number of segments). The SLS problem is famously solvable in polynomial time by a textbook Dynamic Programming algorithm based on the following recurrence for 𝑂𝑃 𝑇 (𝑗), the minimum cost for points 𝑝 1 , ..., 𝑝 𝑗 :</p><formula xml:id="formula_0">𝑂𝑃 𝑇 (𝑗) = {︃ 0, if 𝑗 = 0 min 1≤𝑖≤𝑗 {𝑙𝑠𝑒(𝑖, 𝑗) + 𝛽 + 𝑂𝑃 𝑇 (𝑖 − 1)} else</formula><p>where 𝑙𝑠𝑒(𝑖, 𝑗) is the least square error over all single lines that cover 𝑝 𝑖 to 𝑝 𝑗 , and 𝛽 is the cost of one extra segment, with the value of 𝛽 chosen to balance between error and simplicity. This problem formulation differs from our setting in 3 important ways:</p><p>• We want a continuous set of segments, with each segment starting and ending in given points.</p><p>• We want the first (resp. last) segment to be defined by any two points 𝑝 𝑖 , 𝑝 𝑗 and the line drawn between them continued until it hits the 𝑥-value of 𝑝 1 (resp. of 𝑝 𝑛 ). 
• We want robustness to perturbations as part of the objective function.</p><p>The first two differences are relatively easy to accommodate, as follows:</p><formula xml:id="formula_1">𝑂𝑃 𝑇 (𝑗) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎩ 0, if 𝑗 = 0 min 1≤𝑖≤𝑗 {[𝛼 • 𝑒𝑟𝑟(𝑖, 𝑗) + 𝛽 + 𝑂𝑃 𝑇 (𝑖)], [𝛼 • 𝑒𝑟𝑟(1, 𝑖, 𝑗) + 𝛽]} if 𝑗 &lt; 𝑛 min 1≤𝑖≤𝑛 min 𝑖+1≤𝑘≤𝑛 {[𝛼 • 𝑒𝑟𝑟(𝑖, 𝑘, 𝑛) + 𝛽 + 𝑂𝑃 𝑇 (𝑖)], [𝛼 • 𝑒𝑟𝑟(1, 𝑖, 𝑘, 𝑛) + 𝛽]} if 𝑗 = 𝑛</formula><p>where • 𝑒𝑟𝑟(𝑖, 𝑗) is the Squared Euclidean distance between the straight line from 𝑝 𝑖 to 𝑝 𝑗 and the part of the original time series 𝑡𝑠 given by points 𝑝 𝑖 , ..., 𝑝 𝑗 • 𝑒𝑟𝑟(𝑖, 𝑗, 𝑛) is the same except the straight line continues past 𝑝 𝑗 to the 𝑥-value of 𝑝 𝑛 and we compare to the final part of 𝑡𝑠 given by 𝑝 𝑖 , ..., 𝑝 𝑛 • 𝑒𝑟𝑟(1, 𝑖, 𝑗) is similar as above except the line continues in the other direction to the 𝑥-value of 𝑝 1 and we compare to the initial part of 𝑡𝑠 𝑝 1 , ..., 𝑝 𝑗 • 𝑒𝑟𝑟(1, 𝑖, 𝑗, 𝑛) compares 𝑡𝑠 to a single line covering all 𝑥-values and going through points 𝑝 𝑖 , 𝑝 𝑗 For the case of 𝑜𝑝𝑡 𝑠𝑖𝑚𝑝 (𝑡𝑠, ℎ, 𝛼, 𝛽, 𝛾) where 𝛾 = 0 i.e. where robustness does not play a part, the argument for correctness of the above recurrence is similar to the textbook argument for correctness of the recurrence of SLS, so we do not repeat it here, except to note the new parts being that • We use 𝑂𝑃 𝑇 (𝑖) rather than 𝑂𝑃 𝑇 (𝑖 − 1) since now for two consecutive segments the endpoint of the first is the startpoint of the next. • When defining 𝑂𝑃 𝑇 (𝑗) we use 𝑒𝑟𝑟(1, 𝑖, 𝑗) so first segment can end in 𝑗 • When defining 𝑂𝑃 𝑇 (𝑛) we use 𝑒𝑟𝑟(𝑖, 𝑘, 𝑛) so last segment can choose any pair 𝑝 𝑖 , 𝑝 𝑘 , and 𝑒𝑟𝑟(1, 𝑖, 𝑘, 𝑛) to allow simplification with a single segment.</p><p>Transforming this recurrence into a dynamic programming algorithm is straightforward, by first computing each of the 𝑂(𝑛 2 ) error terms in time 𝑂(𝑛) each, followed by a nested loop computing the 𝑛 𝑂𝑃 𝑇 (𝑗) values in 𝑂(𝑛) time each. 
Retrieving the optimal solution is done by a second, standard pass over the stored values. This gives the following result. Theorem 1. Given a time series 𝑡𝑠 of length 𝑛, a binary TSC ℎ, and factors 𝛼, 𝛽, 𝛾 that balance between error, simplicity and robustness, in the case of 𝛾 = 0 (no robustness) we can compute the optimal simplification achieving the minimum 𝑜𝑝𝑡 𝑠𝑖𝑚𝑝 (𝑡𝑠, ℎ, 𝛼, 𝛽, 𝛾 = 0) in time 𝑂(𝑛 3 ) by dynamic programming.</p></div>
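As a concrete illustration, the following Python sketch implements the core of the 𝛾 = 0 recurrence in the simplified setting where segments start and end in chosen points; the first/last-segment extensions of the full recurrence are omitted for brevity, and all names are ours rather than the authors' implementation. Note the use of 𝑂𝑃𝑇(𝑖) rather than 𝑂𝑃𝑇(𝑖 − 1), so consecutive segments share an endpoint:

```python
import numpy as np

def seg_err(x, y, i, j):
    """Squared Euclidean distance between points p_i..p_j of the time series
    and the straight line through p_i and p_j (segments start/end in given points)."""
    slope = (y[j] - y[i]) / (x[j] - x[i])
    line = y[i] + slope * (x[i:j + 1] - x[i])
    return float(np.sum((y[i:j + 1] - line) ** 2))

def simplify_no_robustness(x, y, alpha, beta):
    """OPT(j) = min over i < j of { alpha * err(i, j) + beta + OPT(i) },
    with OPT(0) = 0. Returns (optimal cost, indices of selected points)."""
    n = len(x)
    opt = [0.0] * n
    back = [0] * n
    for j in range(1, n):
        opt[j], back[j] = min(
            (alpha * seg_err(x, y, i, j) + beta + opt[i], i) for i in range(j)
        )
    pts, j = [n - 1], n - 1
    while j > 0:                      # second pass: trace back the breakpoints
        j = back[j]
        pts.append(j)
    return opt[n - 1], pts[::-1]
```

On the tent-shaped series y = (0, 1, 2, 1, 0) over x = (0, 1, 2, 3, 4) with 𝛼 = 𝛽 = 1, the optimum uses two exact segments meeting at the apex, for cost 2𝛽 = 2 and breakpoints [0, 2, 4].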
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Accommodating Robustness: DP with Heaps</head><p>Dynamic programming relies on the property that an optimal solution to a larger problem consists of optimal solutions to subproblems. Robustness depends on the TS Classifier ℎ and we cannot expect it to satisfy this property, since its classification of only a part of a time series will not in general correspond to its classification of the full time series. Thus, we cannot incorporate robustness as part of the DP scheme. Instead, our algorithm for Optimal Robust Simplifications (ORS), the ORS-Algorithm, uses a 2-stage approach: Stage 1 computes the 𝑞 best simplifications 𝑠𝑡𝑠 1 , ..., 𝑠𝑡𝑠 𝑞 ranked by error and simplicity only, and Stage 2 then adds the robustness term and picks the minimizer among them. Pseudo-code is given in the Appendix. We will here give a high-level explanation, but let us first state the following formal result. Theorem 2. Let the difference in cost between 𝑠𝑡𝑠 1 and 𝑠𝑡𝑠 𝑞 , as computed by Stage 1, be 𝑑. For any value of 𝛾 ≤ 𝑑 the simplification returned by the ORS-Algorithm will then be an optimal one, achieving 𝑜𝑝𝑡 𝑠𝑖𝑚𝑝 (𝑡𝑠, ℎ, 𝛼, 𝛽, 𝛾).</p><p>Proof 1. We prove the statement by contradiction. Let the simplification 𝑠𝑡𝑠 returned by the ORS-Algorithm have 𝑘 𝑠𝑡𝑠 segments and assume there is another simplification 𝑠𝑡𝑠 ′ with 𝑘 𝑠𝑡𝑠 ′ segments such that</p><formula xml:id="formula_2">𝛼 • 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠 ′ ) + 𝛽 • 𝑘 𝑠𝑡𝑠 ′ + 𝛾 • 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠 ′ , ℎ) &lt; 𝛼 • 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠) + 𝛽 • 𝑘 𝑠𝑡𝑠 + 𝛾 • 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠, ℎ)</formula><p>Since Stage 2 optimized this value over all simplifications 𝑠𝑡𝑠 1 , ..., 𝑠𝑡𝑠 𝑞 we cannot have 𝑠𝑡𝑠 ′ among these, and thus by the assumption in the statement of the theorem we have 𝛼</p><formula xml:id="formula_3">• 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠 ′ ) + 𝛽 • 𝑘 𝑠𝑡𝑠 ′ ≥ 𝑑 + 𝛼 • 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠 1 ) + 𝛽 • 𝑘 𝑠𝑡𝑠 1 , with 𝑘 𝑠𝑡𝑠 1 the number of segments of 𝑠𝑡𝑠 1 . 
Thus we get 𝑑 + 𝛼 • 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠 1 ) + 𝛽 • 𝑘 𝑠𝑡𝑠 1 + 𝛾 • 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠 ′ , ℎ) &lt; 𝛼 • 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠) + 𝛽 • 𝑘 𝑠𝑡𝑠 + 𝛾 • 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠, ℎ) ≤ 𝛼 • 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠 1 ) + 𝛽 • 𝑘 𝑠𝑡𝑠 1 + 𝛾 • 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠 1 , ℎ)</formula><p>with the latter inequality following since 𝑠𝑡𝑠 was the minimum over 𝑠𝑡𝑠 1 , ..., 𝑠𝑡𝑠 𝑞 . Rearranging, this gives</p><formula xml:id="formula_4">𝑑 &lt; 𝛾•𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠 1 , ℎ)−𝛾•𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠 ′ , ℎ)+((𝛼•𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠 1 )+𝛽•𝑘 𝑠𝑡𝑠 1 )−(𝛼•𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠 1 )+𝛽•𝑘 𝑠𝑡𝑠 1 ))</formula><p>with the latter term being zero, and thus 𝑑 &lt; 𝛾 • (𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠 1 , ℎ) − 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠 ′ , ℎ))</p><p>Note we have 𝑓 𝑟𝑎𝑔(•) in the range [0, 1], which means the above implies that 𝛾 &gt; 𝑑, contradicting the assumption in the statement of the theorem. □</p></div>
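Given candidates from Stage 1, the final re-ranking of Stage 2 is then a straightforward minimization. The sketch below (our own hypothetical naming, not the paper's pseudo-code) takes the 𝑞 best simplifications with their Stage-1 costs 𝛼 · 𝑒𝑟𝑟 + 𝛽 · 𝑘 and returns the one minimizing the full objective, together with the Theorem 2 check that 𝛾 does not exceed the Stage-1 cost gap 𝑑:

```python
def stage2(candidates, gamma, frag):
    """candidates: list of (sts, stage1_cost) pairs, stage1_cost = alpha*err + beta*k,
    sorted so candidates[0] is the Stage-1 optimum and candidates[-1] the q-th best.
    frag(sts): fraction of perturbations of sts misclassified, in [0, 1].
    Returns the simplification minimizing stage1_cost + gamma * frag(sts)."""
    best_sts, _ = min(candidates, key=lambda c: c[1] + gamma * frag(c[0]))
    return best_sts

def gamma_is_safe(candidates, gamma):
    """Theorem 2 condition: the returned simplification is provably optimal
    whenever gamma <= d, the Stage-1 cost gap between sts_1 and sts_q."""
    d = candidates[-1][1] - candidates[0][1]
    return gamma <= d
```

For instance, with Stage-1 costs 1.0, 1.05, 2.0 and fragilities 1.0, 0.0, 0.0, choosing 𝛾 = 0.5 ≤ 𝑑 = 1.0 selects the second candidate (total cost 1.05), overruling the Stage-1 winner.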
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Important details of Stage 1.</head><p>We describe some key details of how to implement Stage 1, where 𝑠 1 (and 𝑠 2 ) is the 𝑦-value of 𝑠𝑡𝑠 that corresponds to the smallest (and largest) 𝑥-value, where smallest is the same as the 𝑥-value of 𝑝 1 (and largest the same as the 𝑥-value of 𝑝 𝑛 ). See Figure <ref type="figure">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Usefulness of the robust simplifications</head><p>We have implemented<ref type="foot" target="#foot_1">2</ref> the ORS-Algorithm and in this section we illustrate its use for explaining a Time Series Classifier. The algorithm is designed for classifiers working on univariate discrete time series with a binary classification. Moreover, we believe its use is more easily evaluated if the time series are evenly sampled, aperiodic and non-stationary. However, it is important to note that apart from the above constraints we did not want simple datasets where the binary classification is particularly easy or could be described (e.g. by ourselves) in any straightforward way. Based on these criteria we have chosen to focus on three datasets.</p><p>• From UCR <ref type="bibr" target="#b28">[29]</ref>: Chinatown. This dataset shows the number of pedestrians on a street corner of Chinatown in Melbourne over a 24-hour time period, and classifies these into Weekend and Weekdays. • From UCR <ref type="bibr" target="#b28">[29]</ref>: ECG200. This dataset traces the electrical activity recorded in 96 time steps during one heartbeat and classifies the time series into a normal heartbeat and a Myocardial Infarction. • From UCR <ref type="bibr" target="#b28">[29]</ref>: Italy Power Demand. This dataset shows power demand in Italian households over a 24-hour time period, and classifies these into Winter (October-March) and Summer (April-September).  For each of these datasets we trained a fully convolutional neural network (FCN), originally proposed by Wang et al. <ref type="bibr" target="#b29">[30]</ref>. Specifically we implemented a model closely following the work by Delaney et al. <ref type="bibr" target="#b5">[6]</ref>. To select prototypes we used SPOTGreedy by Gurumoorthy et al. <ref type="bibr" target="#b30">[31]</ref>, implemented in the Python package InterpretML <ref type="bibr" target="#b31">[32]</ref>. 
We then ran the robust simplification algorithm on the resulting prototypes, with varying balancing factors, to evaluate their power as explanatory tools for the trained classifiers.</p><p>In this section we focus on three aspects of the robust simplifications to argue that they are helpful explanations of the classifier model.</p><p>• Showing prototypes together with their simplifications is more informative than showing prototypes only • Varying the balance of error vs simplicity vs robustness will highlight different aspects of the prototypes • Extracting a rule for class membership from this information can give simple and powerful explanations</p><p>We cover these aspects in separate subsections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Simplifications are informative as explanations</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1.">Chinatown</head><p>For the Chinatown dataset we have computed 3 prototypes from each of the two classes, which we call Red and Blue. These 6 prototypes are presented in Figure <ref type="figure" target="#fig_4">3</ref>. From these prototypes alone it seems difficult to identify, with any kind of certainty, a rule-of-thumb that can be used to decide the classification of new instances. In the next figure, Figure <ref type="figure" target="#fig_5">4</ref>, we show a single simplification added to each of the prototypes. These simplifications are chosen with a well-balanced mix of error, simplicity and robustness, choosing each of 𝛼, 𝛽, 𝛾 in the mid-range of values.<ref type="foot" target="#foot_2">3</ref> These simplifications function as explanations for the classification of each of the prototypes, as they also incorporate the robustness to 10,000 perturbations, and as such they allow a rule-of-thumb to be more readily guessed at. Looking at the first segment of each of the robust simplifications, it is striking that the blue slopes are quite sharply downwards, whereas none of the red are. A natural guess is as follows:</p><p>• Rule-of-thumb for Chinatown classification: if the time series starts with a noticeable downward slope then it is Blue, otherwise Red </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Varying balancing factors</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1.">Chinatown</head><p>In Figure <ref type="figure">5</ref> we see the effect of varying the balancing factors for a Red prototype in the Chinatown dataset. We see that increasing the importance of simplicity from low to middle to high will decrease the number of segments, while naturally also increasing the Euclidean distance to the original time series. We also see that increasing the importance of robustness changes the slope of the first segment from slightly downward (in the other 6-segment simplification) to slightly upward. This means that, in this vicinity, a higher percentage of time series whose first segment slopes slightly downward are classified as Blue, compared to the percentage classified Blue among those whose first segment slopes slightly upward. This change is significant information, and it holds even in the presence of a higher Euclidean distance to the original prototype. Note that this observation constitutes supporting evidence for the rule-of-thumb guessed at previously for Chinatown.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2.">Italy Power Demand</head><p>In Figure <ref type="figure">6</ref> we see a single prototype from ItalyPowerDemand in black and two robust simplifications of it in blue. Both simplifications have the same balancing factors for error and simplicity, but (a) has 4 segments while (b) has 5 segments, resulting from an increase in the factor for robustness. Simplifying, we could say that a single segment in (a) has been replaced by two segments in (b) to give two peaks. This indicates a rule-of-thumb for the classifier trained on ItalyPowerDemand, namely that two such peaks at or around those 𝑥-values are important for the Blue classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3.">ECG200</head><p>In Figure <ref type="figure">7</ref> we see a prototype from ECG200 in black and two distinct simplifications of it in blue. Note that even though the factor for robustness is much higher in (b) than in (a), the two simplifications are (almost) identical, both on 5 segments. To explain this lack of effect of robustness we turn to Theorem 2, which says that the difference between the best cost and the 𝑞th best cost output by Stage 1 of the ORS Algorithm may limit the values of 𝛾 for which Stage 2 returns the optimal robust simplification.</p><p>For the simplifications of Figure <ref type="figure">7</ref> we ran Stage 1 with the high value of 𝑞 = 1,000,000 and found this difference to be 0.03630 − 0.02884 = 0.00746. Thus by Theorem 2 the simplification in (a) may not be optimal, as 𝛾 = 0.01 &gt; 0.00746, whereas the one in (b) is likely not optimal, as 𝛾 = 0.1 &gt;&gt; 0.00746. Let us explain why we believe this happens. This time series has length 96, and the DP to get 5 segments optimizes over all possible ways of choosing 6 points from 96. Note that there are many such choices of 6 points that give 5 segments similar to the ones in Figure <ref type="figure">7</ref>.</p><p>Counting the number of ways of choosing these 6 points, starting with the rightmost point and going left, a back-of-the-envelope calculation estimates this to be in the ballpark of 35 · 20 · 10 · 10 · 3 · 5, which is over 1 million. Thus, we take this as evidence that almost all of the 𝑞 best simplifications returned by the dynamic programming of Stage 1, which does not take robustness into account, are very close to these 5 segments. Increasing the importance of robustness therefore cannot help, as the result will always be a simplification similar to the one given. We leave dealing with such cases for future work, for example by a heuristic during the DP that discards simplifications that are only marginally different from others.</p></div>
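The back-of-the-envelope count above can be checked in a few lines; the per-point choice counts 35, 20, 10, 10, 3, 5 are the estimates from the text, taken rightmost pivot first:

```python
from math import prod

# Estimated number of near-identical ways to choose the 6 pivot points,
# rightmost to leftmost, using the per-point counts from the text.
choices = [35, 20, 10, 10, 3, 5]
total = prod(choices)
print(total)  # 1050000 -- indeed over 1 million
```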
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Evaluating a classification rule guessed from robust simplifications</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1.">Chinatown</head><p>For the Chinatown classification we made a rule-of-thumb based on information gathered from the explanations of the classifications given for 6 prototypes, by means of the robust simplifications. To check whether this rule-of-thumb is actually useful we need to make it more specific. Looking at the values on the 𝑦-axis of Figure <ref type="figure" target="#fig_5">4</ref>, we see that the first segments of the 3 Blue simplifications are lines starting at point (0, 500) and decreasing to about (5, 0), while the first segments of the Red simplifications are only very slightly decreasing or increasing. Thus, the difference between the 𝑦-values at 𝑥 = 0 and 𝑥 = 5 distinguishes between these Red and Blue prototypes. We expect that this difference also distinguishes between many other time series in the dataset, as this insight is gathered from the robust simplifications. Taking the halfway point of 250 between the 𝑦-values of 500 and 0 as the cutoff, the simple classification rule becomes</p><p>• Simple Rule for Chinatown: if the 𝑦-value at 𝑥 = 0 minus the 𝑦-value at 𝑥 = 5 is larger than 250 then 𝑡𝑠 is classified Blue, otherwise Red</p><p>The Chinatown dataset has 363 time series. We checked their classification by the model and compared it to the Simple Rule. Indeed, the Simple Rule classifies all but 1 of the model-classified Blue time series as Blue, and all but 19 of the Red time series as Red, for an accuracy of 94.5%. Note this is accuracy with respect to the model, not the ground truth, as the Simple Rule is trying to explain the model. Let us mention that the accuracy of the model with respect to ground truth is 96.4%, while the accuracy of the Simple Rule with respect to ground truth is 97%.</p><p>Finally, let us also mention that if we raise the cutoff in the Simple Rule from 250 to 325 it actually achieves an accuracy of 99.7% with respect to the model (with accuracy of the Simple Rule against ground truth down to 96.7%), failing on only one instance. In retrospect, we could ask whether any information was available to us from the robust simplifications that would point to this higher cutoff. In fact, consider Figure <ref type="figure">5</ref> (a), which visualizes the robustness of a simplification whose first segment goes from (0, 400) to (7, 50). Note that the red part of the band around the first segment stretches up above (0, 250). This should have made us wonder whether the cutoff for the Simple Rule should be higher than 250. We could have tested this, for example by constructing a time series whose first segment goes from (0, 300) to (5, 0) and visualizing its robustness. See Figure <ref type="figure" target="#fig_9">8</ref>, where we have done exactly this; pay close attention to the colourings of the band around the first segment. This band is predominantly blue only down to a point somewhat above (0, 300), where it already turns red. This in itself could have made us guess a higher cutoff than 250 for the Simple Rule. We believe these insights show the usefulness of the robust simplifications and their visualization.</p></div>
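The Simple Rule for Chinatown is easy to state as code. A minimal sketch, assuming a time series is given as a sequence of 𝑦-values indexed by 𝑥; the function name and the cutoff parameter are ours:

```python
def simple_rule_chinatown(ts, cutoff=250):
    """Simple Rule for Chinatown: Blue if the drop from x=0 to x=5
    exceeds the cutoff, otherwise Red."""
    return "Blue" if ts[0] - ts[5] > cutoff else "Red"

# A steep initial drop (500 -> 0 over the first five steps) is classified Blue.
print(simple_rule_chinatown([500, 400, 300, 200, 100, 0]))    # prints "Blue"
# A nearly flat start is classified Red.
print(simple_rule_chinatown([120, 125, 130, 128, 126, 124]))  # prints "Red"
```

The higher cutoff discussed above corresponds simply to calling `simple_rule_chinatown(ts, cutoff=325)`.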
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2.">ItalyPowerDemand</head><p>In the previous subsection we saw that two peaks in the mid-afternoon seemed to signify the Blue class for the classifier on the ItalyPowerDemand dataset. Looking more closely at Figure <ref type="figure">6</ref>, we note that the difference between the high part of the peaks and the low part is about 0.5. We also note from the original prototype that the first peak is centered at 10, the second peak at 18, and that in (b) the bottom of the simplification is centered at 14. From this we extract the following Simple Rule for ItalyPowerDemand. The dataset ItalyPowerDemand has 1096 time series. We checked their classification by the model and compared it to the Simple Rule. Indeed, the Simple Rule classifies all but 50 of the model-classified Blue time series correctly, and all but 49 of the Red, for an accuracy of 91%. Let us mention that the accuracy of the model with respect to ground truth is 96.3%, and that of the Simple Rule with respect to ground truth is 89.6%. Note that classifiers of ItalyPowerDemand have previously been shown to be hard to explain to end-users <ref type="bibr" target="#b23">[24]</ref>. Thus, the Simple Rule, developed by examining the robust simplifications in Figure <ref type="figure">6</ref>, has a surprisingly high accuracy.</p></div>
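The Simple Rule for ItalyPowerDemand can likewise be sketched directly from its statement; the indexing assumes the series' 𝑦-values form a 0-indexed sequence over the 24 hourly 𝑥-values, and the function and parameter names are ours:

```python
def simple_rule_italy(ts, threshold=0.5):
    """Simple Rule for ItalyPowerDemand: Blue if both peaks (around x=10
    and x=18) rise at least `threshold` above the valley around x=14."""
    max1 = max(ts[9], ts[10], ts[11])    # first peak region
    min2 = min(ts[13], ts[14], ts[15])   # valley region
    max3 = max(ts[17], ts[18], ts[19])   # second peak region
    blue = (max1 - min2 >= threshold) and (max3 - min2 >= threshold)
    return "Blue" if blue else "Red"

# A synthetic series with two clear peaks at x=10 and x=18 is classified Blue.
two_peaks = [0.0] * 24
two_peaks[10] = 1.0
two_peaks[18] = 1.0
print(simple_rule_italy(two_peaks))   # prints "Blue"
print(simple_rule_italy([0.0] * 24))  # prints "Red" (flat series, no peaks)
```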
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>In this paper, we addressed the challenge of interpretability in Time Series Classification (TSC) by introducing the Optimal Robust Simplifications (ORS) algorithm. Our approach focuses on simplifying time series data into straight-line segments while ensuring that these simplifications remain robust with respect to the classifications made by a given TSC model.</p><p>Our ORS-Algorithm is model-agnostic and allows a balance between error (fidelity with respect to the original instance), simplicity (complexity of the simplification), and robustness (fraction of local perturbations that retain the correct classification). We prove that the ORS-Algorithm, under certain mild conditions, finds the optimal solution in polynomial time, by a first dynamic programming stage solving for error and simplicity, and then incorporating robustness into the solution. This simplification process aids in making time series data more intuitive and interpretable for humans.</p><p>Our experimental results, based on the UCR datasets Chinatown, Italy Power Demand, and ECG200, confirm the practical utility of our approach. The robust simplifications provided by the ORS algorithm make time series data more accessible to human intuition, facilitating better user understanding and trust in the TSC models. In conclusion, our work contributes a novel method for enhancing the interpretability of TSC models through robust simplifications.</p><p>Future research could explore other ways to compute the factors employed in the algorithm. For instance, the squared Euclidean distance could be replaced by some other distance function, simplicity could depend on aspects other than the number of segments, and robustness could be computed using other types of perturbations. 
Finally, we plan to conduct experiments, including human studies, to demonstrate empirically the benefits of using optimal robust simplifications to explain TSC models, and to compare them with related XAI methods.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :Definition 1 .</head><label>11</label><figDesc>Figure 1: Original time series 𝑡𝑠 in black, with selected points given by 6 grey circles, resulting in a simplification 𝑠𝑡𝑠 in red with 5 segments.</figDesc><graphic coords="4,162.25,196.70,270.78,203.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>𝑠 1 is the leftmost circled point and 𝑠 2 the rightmost circled point, at 𝑥-values 0 and 23 respectively. For these 𝑘 + 1 pivot points of 𝑠𝑡𝑠 we allow an increase or decrease of the 𝑦-value by an 𝜖 which is 10% of the 𝑦-range of the dataset the classifier ℎ was trained on, i.e. 𝜖 = 0.1 * (𝑦 𝑚𝑎𝑥 − 𝑦 𝑚𝑖𝑛 ), with 𝑦 𝑚𝑎𝑥 and 𝑦 𝑚𝑖𝑛 the maximum and minimum 𝑦-value in this dataset. We compute 10,000 random perturbations of 𝑠𝑡𝑠, each consisting of 𝑘 segments anchored at the same 𝑥-values as the 𝑘 + 1 pivot points of 𝑠𝑡𝑠 but with the 𝑦-value of each pivot point shifted by an amount drawn independently for the 𝑘 + 1 values uniformly at random in the range [−𝜖, +𝜖]. Thus the perturbation will lie inside a band around 𝑠𝑡𝑠 of height 2 • 𝜖 at each pivot point. See Figure 2. We then use ℎ to classify each of the perturbations, and define the 𝑓 𝑟𝑎𝑔-value of 𝑠𝑡𝑠 as the fraction of the 10,000 random perturbations that have the opposite classification from 𝑡𝑠.</figDesc></figure>
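The perturbation scheme described above can be sketched in a few lines. Here `h` stands for any black-box classifier taking a full-length series; the pivot arrays, labels, and parameter names are illustrative, not the authors' implementation:

```python
import numpy as np

def frag_value(pivot_x, pivot_y, h, ts_label, y_min, y_max,
               n_pert=10_000, seed=0):
    """Fraction of random perturbations of the simplification that the
    classifier h labels differently from the original series ts
    (lower frag-value = more robust)."""
    rng = np.random.default_rng(seed)
    eps = 0.1 * (y_max - y_min)          # 10% of the training data's y-range
    xs = np.arange(pivot_x[0], pivot_x[-1] + 1)
    flipped = 0
    for _ in range(n_pert):
        # shift every pivot's y-value independently, uniformly in [-eps, +eps]
        shifts = rng.uniform(-eps, eps, len(pivot_y))
        pert_y = np.asarray(pivot_y, dtype=float) + shifts
        # rebuild a full-length series by linear interpolation between pivots
        pert = np.interp(xs, pivot_x, pert_y)
        if h(pert) != ts_label:
            flipped += 1
    return flipped / n_pert
```

For example, a classifier that always answers with the original label gives a frag-value of 0 (perfectly robust), and one that always answers the opposite label gives 1.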
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>( a )</head><label>a</label><figDesc>𝑓 𝑟𝑎𝑔 = 0.45 blue. Not robust. (b) 𝑓 𝑟𝑎𝑔 = 0.04 blue. Very robust. (c) 𝑓 𝑟𝑎𝑔 = 0.02 red. Very robust.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Visualizing robustness of the three simplified time series in dotted black lines. All perturbations consist of straight-line segments that are forced to lie inside the colored band, with pivot points at the same 𝑥-values as the simplification, but different 𝑦-values. We compute several perturbations of 𝑠𝑡𝑠 by randomly altering the 𝑦-values of the 𝑘 𝑠𝑡𝑠 + 1 pivot points inside the narrow band, e.g. in (b) 𝑠𝑡𝑠 in black has 5 segments and the 6 pivot points of the perturbations are at 𝑥-values 0, 5, 9, 13, 18, 24. Each perturbation is drawn with a color according to its classification by the classifier ℎ. Thus if many have the same color as 𝑡𝑠 then robustness is good (and the 𝑓 𝑟𝑎𝑔-value low).</figDesc><graphic coords="7,72.00,486.17,139.89,104.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Six prototypes from Chinatown. Hard to make a rule-of-thumb.</figDesc><graphic coords="8,72.00,538.90,451.27,188.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Same six prototypes from Chinatown, now shown in dotted black lines, with robust simplifications making it easier to extract a rule-of-thumb.</figDesc><graphic coords="9,72.00,65.61,451.27,188.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>(a) 11 Figure 5 :Figure 6 :</head><label>1156</label><figDesc>Figure 5: Prototype from Chinatown in black and four distinct simplifications of it in red. From (a) to (b) to(c) we see that increasing the importance of simplicity from low to middle to high will decrease the number of segments from 11 to 6 to 2, and increase the Euclidean distance. In (d) we see that increasing the importance of robustness, when comparing to (b) which also has 6 segments, changes the slope of the first segment even though this increases the Euclidean distance, which is significant information about the red classification.</figDesc><graphic coords="10,103.96,421.50,170.27,127.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>(a) 5 Figure 7 :</head><label>57</label><figDesc>Figure 7: Prototype from ECG200 in black and two distinct simplifications of it in blue. Note that even though the factor for robustness is much higher in (b) than (a), the two simplifications are almost identical (difference being that in (a) the third segment ends in 𝑥 = 40 whereas in (b) it ends in 𝑥 = 39).</figDesc><graphic coords="11,138.44,65.61,157.95,118.46" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>•</head><label></label><figDesc>Simple Rule for ItalyPowerDemand: let 𝑚𝑎𝑥1 = maximum of the 𝑦-values over 𝑥 = 9, 10, 11, let 𝑚𝑖𝑛2= minimum of the 𝑦-values over 𝑥 = 13, 14, 15 and 𝑚𝑎𝑥3 = maximum of the 𝑦-values over 𝑥 = 17, 18, 19. If 𝑚𝑎𝑥1 − 𝑚𝑖𝑛2 ≥ 0.5 and 𝑚𝑎𝑥3 − 𝑚𝑖𝑛2 ≥ 0.5 then classify this instance Blue, else Red.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Visualization of robustness of a time series starting at (0, 300) suggesting that a cutoff somewhat higher than (0, 250) for Blue vs Red would have been a better guess for the Simple Rule for Chinatown.</figDesc><graphic coords="13,162.25,65.61,270.78,203.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Stage 1 Use DP to compute a 2-dimensional table of size 𝑛 × 𝑞 that in cell 𝐷[𝑗, 𝑘] stores the 𝑘th least costly simplification for the points 𝑝 1 , ..., 𝑝 𝑗 of the input time series 𝑡𝑠, considering only error and simplicity but not robustness, as in the DP above. Stage 2 Consider the 𝑞 least costly simplifications 𝑠𝑡𝑠 1 , ..., 𝑠𝑡𝑠 𝑞 , ordered by non-decreasing cost under error and simplicity as stored in 𝐷[𝑛, 1]..𝐷[𝑛, 𝑞] respectively. Compute robustness for any simplification that ℎ classifies the same as 𝑡𝑠. The ORS-Algorithm then returns the simplification having lowest overall total cost, i.e. the 𝑠𝑡𝑠 minimizing the value of 𝛼 • 𝑒𝑟𝑟(𝑡𝑠, 𝑠𝑡𝑠) + 𝛽 • 𝑘 𝑠𝑡𝑠 + 𝛾 • 𝑓 𝑟𝑎𝑔(𝑡𝑠, 𝑠𝑡𝑠, ℎ)</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Definition of Perturbations and Robustness in Stage 2.</head><label></label><figDesc>As in Stage 1, we compute a table of size 𝑛 × 𝑞 with 𝐷[𝑗, 𝑘] storing the 𝑘th least costly simplification of points 𝑝 1 , ..., 𝑝 𝑗 under error and simplicity only. The difference to the earlier DP is that we compute not only the least costly but the 𝑞 least costly, with 𝑞 &gt; 1. After computing error terms we run an outer loop from 𝑗 = 1 to 𝑛 with two inner loops, one after the other. The first loop, from 𝑖 = 1 to 𝑗 − 1, is similar to the previous DP, computing the cost 𝐷[𝑖, 1] + 𝑒𝑟𝑟(𝑖, 𝑗) of having the last segment go from 𝑝 𝑖 to 𝑝 𝑗 , but now adding all these values to a heap 𝐻 𝑗 . The second loop, from 𝑘 = 1 to 𝑞, fills the 𝐷[𝑗, 𝑘] entries by extracting the minimum element from 𝐻 𝑗 ; say this element is 𝐷[𝑖, 𝑟] + 𝑒𝑟𝑟(𝑖, 𝑗). We first set 𝐷[𝑗, 𝑘] to this element, but most importantly we also add to the heap 𝐻 𝑗 the new value 𝐷[𝑖, 𝑟 + 1] + 𝑒𝑟𝑟(𝑖, 𝑗), since the (𝑟 + 1)st least costly solution up to 𝑝 𝑖 may be lower than many other solutions in the heap. For details, in particular regarding the edge-cases which are cumbersome to implement properly, see the pseudo-code in the appendix.</figDesc><table><row><cell>4.2.2.</cell></row></table><note>A simplified time series 𝑠𝑡𝑠 on 𝑘 segments is computed by choosing 𝑘 + 1 points 𝑝 𝑖 1 , ..., 𝑝 𝑖 𝑘+1 from an original time series 𝑡𝑠 on points 𝑝 1 , ..., 𝑝 𝑛 . When computing the robustness of 𝑠𝑡𝑠 we focus on the following 𝑘 + 1 pivot points 𝑠 1 , 𝑝 𝑖 2 , ..., 𝑝 𝑖 𝑘 , 𝑠 2 of 𝑠𝑡𝑠 (note the first and last may not be points of 𝑡𝑠)</note></figure>
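The heap-based 𝑞-best dynamic program described above can be sketched as follows. This is a simplified version that tracks only costs (the bookkeeping needed to reconstruct the actual pivot choices, and the edge-cases mentioned in the text, are omitted), and the cost constants `alpha` and `beta` are illustrative:

```python
import heapq

def seg_err(y, i, j):
    """Squared error of approximating points i..j by the straight
    segment from (i, y[i]) to (j, y[j])."""
    return sum((y[t] - (y[i] + (y[j] - y[i]) * (t - i) / (j - i))) ** 2
               for t in range(i + 1, j))

def q_best_costs(y, q, alpha=1.0, beta=0.1):
    """D[j][k] = cost of the (k+1)-th least costly simplification of
    y[0..j] under error and simplicity only (robustness is Stage 2)."""
    n, INF = len(y), float("inf")
    D = [[INF] * q for _ in range(n)]
    D[0][0] = 0.0                        # a single point, no segments
    for j in range(1, n):
        heap = []
        for i in range(j):               # candidate last segment i -> j
            if D[i][0] < INF:
                heapq.heappush(heap, (D[i][0] + alpha * seg_err(y, i, j) + beta, i, 0))
        for k in range(q):
            if not heap:
                break
            cost, i, r = heapq.heappop(heap)
            D[j][k] = cost
            # the (r+2)-th best solution ending at i may beat others in the heap
            if r + 1 < q and D[i][r + 1] < INF:
                heapq.heappush(heap, (cost - D[i][r] + D[i][r + 1], i, r + 1))
    return D                             # D[n-1] holds the q best overall costs

# For three collinear points the best simplification is one zero-error
# segment (cost beta); the second best uses two segments (cost 2*beta).
D = q_best_costs([0.0, 1.0, 2.0], q=2)
print(D[2])  # [0.1, 0.2] with alpha=1.0, beta=0.1
```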
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>The initialization of error terms is 𝑂(𝑛 3 ) as in the first DP version. We use binary heaps having 𝑂(log 𝑛) insertion and extraction operations. Then, for each of the 𝑛 values of 1 ≤ 𝑗 ≤ 𝑛 we have two for-loops, one inserting at most 𝑛 values into heap 𝐻 𝑗 (for time 𝑂(𝑛 log 𝑛)), and the other extracting and inserting at most 𝑞 values of 𝐻 𝑗 (for time 𝑂(𝑞 log 𝑛)). Note there are never more than 𝑛 elements in any heap 𝐻 𝑗 , as the first loop never inserts more than 𝑛 elements and the latter loop extracts one element and inserts at most one new element. This completes the proof. The optimal simplification 𝑜𝑝𝑡 𝑠𝑖𝑚𝑝 (𝑡𝑠, ℎ, 𝛼, 𝛽, 𝛾) can be computed in polynomial time for a time series 𝑡𝑠 on 𝑛 points, a binary TSC ℎ, and balancing factors 𝛼, 𝛽, 𝛾, if there exists a constant 𝑐 such that the difference between the least costly simplification that does not take robustness into account and the 𝑛 𝑐 th least costly one is ≥ 𝛾. This follows from Theorems 2 and 3 by running the ORS Algorithm with 𝑞 = 𝑛 𝑐 .</figDesc><table><row><cell>Theorem 4. Proof 3.</cell></row></table><note>Theorem 3. The runtime of the ORS-Algorithm is 𝑂(𝑛 • max{𝑛 2 , 𝑞 log 𝑛}).Proof 2.</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Note that if one pieces together several such local explanations, say for prototypes of each class, then this can still form a global explanation of the given TSC ℎ.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Available on GitHub: https://github.com/BrigtHaavardstun/kSimplification</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Exact values used for the parameters can be found on the GitHub page. Here we only refer to them as Low-Mid-High to keep the presentation simple.</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Scalable and accurate deep learning with electronic health records</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rajkomar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Oren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hajaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NPJ digital medicine</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Time-series classification methods: Review and applications to power systems data, Big data application in power systems</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Susto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cenedese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Terzi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="179" to="220" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Deep learning for time series classification: a review</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ismail Fawaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Forestier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Idoumghar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-A</forename><surname>Muller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data mining and knowledge discovery</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="917" to="963" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Agnostic local explanation for time series classification</title>
		<author>
			<persName><forename type="first">M</forename><surname>Guillemé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Masson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rozé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Termier</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICTAI.2019.00067</idno>
		<ptr target="https://doi.org/10.1109/ICTAI.2019.00067" />
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019</title>
				<meeting><address><addrLine>Portland, OR, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">November 4-6, 2019</date>
			<biblScope unit="page" from="432" to="439" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Benchmarking deep learning interpretability in time series predictions</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Ismail</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Gunady</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">C</forename><surname>Bravo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Feizi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Annual Conference on Neural Information Processing Systems 2020</title>
				<meeting><address><addrLine>NeurIPS; virtual</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020-12-06">2020. December 6-12, 2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Instance-based counterfactual explanations for time series classification</title>
		<author>
			<persName><forename type="first">E</forename><surname>Delaney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Keane</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Case-Based Reasoning Research and Development -Int. Conference, ICCBR</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="32" to="47" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Towards a rigorous evaluation of xai methods on time series</title>
		<author>
			<persName><forename type="first">U</forename><surname>Schlegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Arnout</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>El-Assady</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Oelke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Keim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019. 2019</date>
			<biblScope unit="page" from="4197" to="4201" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Tsviz: Demystification of deep learning models for time-series analysis</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Siddiqui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mercier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Munir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dengel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmed</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2019.2912823</idno>
		<ptr target="https://doi.org/10.1109/ACCESS.2019.2912823" />
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="67027" to="67040" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An interactive tool for interpretability of time series classification</title>
		<author>
			<persName><forename type="first">B</forename><surname>Håvardstun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Telle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>To be published</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Explainable AI for time series classification: A review, taxonomy and research directions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Theissler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Spinnato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Schlegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Guidotti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="100700" to="100724" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Explaining time series classifiers through meaningful perturbation and optimisation</title>
		<author>
			<persName><forename type="first">H</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wagner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Triguero</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ins.2023.119334</idno>
		<ptr target="https://doi.org/10.1016/j.ins.2023.119334" />
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">645</biblScope>
			<biblScope unit="page">119334</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Learning perturbations to explain time series predictions</title>
		<author>
			<persName><forename type="first">J</forename><surname>Enguehard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="9329" to="9342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Explaining time series via contrastive and locally sparse perturbations</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wen</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=qDdSRaOiyb" />
	</analytic>
	<monogr>
		<title level="m">The Twelfth International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Rojat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Puget</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Filliat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Del Ser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gelin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Díaz-Rodríguez</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.00950</idno>
		<title level="m">Explainable artificial intelligence (XAI) on time series data: A survey</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Example or prototype? Learning concept-based explanations in time-series</title>
		<author>
			<persName><forename type="first">C</forename><surname>Obermair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fuchs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pernkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Felsberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Apollonio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wollmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Asian Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="816" to="831" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Weighted kNN and constrained elastic distances for time-series classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Geler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kurbalija</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ivanović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Radovanović</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">162</biblScope>
			<biblScope unit="page">113829</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Explaining deep classification of time-series data with learned prototypes</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Gee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>García-Olano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Paydarfar</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-2429/paper3.pdf" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Dimensionality reduction for fast similarity search in large time series databases</title>
		<author>
			<persName><forename type="first">E</forename><surname>Keogh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pazzani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehrotra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="263" to="286" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Segmenting time series: A survey and novel approach</title>
		<author>
			<persName><forename type="first">E</forename><surname>Keogh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pazzani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data mining in time series databases</title>
				<imprint>
			<publisher>World Scientific</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="1" to="21" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Models and algorithms for optimal piecewise-linear function approximation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Camponogara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Nazari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematical Problems in Engineering</title>
		<imprint>
			<biblScope unit="volume">2015</biblScope>
			<biblScope unit="page">876862</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">U</forename><surname>Schlegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">V</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Keim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Seebacher</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.08438</idno>
		<title level="m">TS-MULE: Local interpretable model-agnostic explanations for time series forecast models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Effect of segmentation on financial time series pattern matching</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-W</forename><surname>Si</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.asoc.2015.10.012</idno>
		<ptr target="https://doi.org/10.1016/j.asoc.2015.10.012" />
	</analytic>
	<monogr>
		<title level="j">Applied Soft Computing</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="346" to="359" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">OBST-based segmentation approach to financial time series</title>
		<author>
			<persName><forename type="first">Y.-W</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yin</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.engappai.2013.08.015</idno>
		<ptr target="https://doi.org/10.1016/j.engappai.2013.08.015" />
	</analytic>
	<monogr>
		<title level="j">Engineering Applications of Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="2581" to="2596" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">XAI for time series classification: Evaluating the benefits of model inspection for end-users</title>
		<author>
			<persName><forename type="first">B</forename><surname>Håvardstun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Flikka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Telle</surname></persName>
		</author>
		<ptr target="https://xaiworldconference.com/2024/" />
	</analytic>
	<monogr>
		<title level="m">Explainable Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>to be published</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">GAM Coach: Towards interactive and user-centered algorithmic recourse</title>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Caruana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Chau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Conf. on Human Factors in Computing Systems</title>
				<meeting>Conf. on Human Factors in Computing Systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">835</biblScope>
			<biblScope unit="page">20</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">TSInterpret: A Python package for the interpretability of time series classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Höllig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kulbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thoma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Open Source Software</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page">5220</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Interpretable time-series classification on few-shot samples</title>
		<author>
			<persName><forename type="first">W</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Long</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2020 International Joint Conference on Neural Networks (IJCNN)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Interpretable and steerable sequence learning via prototypes</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ming</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</title>
				<meeting>the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="903" to="913" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">A</forename><surname>Dau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Keogh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kamgar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><forename type="middle">M</forename><surname>Yeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gharghabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Ratanamahatana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yanping</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Begum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bagnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mueen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Batista</surname></persName>
		</author>
		<title level="m">The UCR time series classification archive</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>Hexagon-ML</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Oates</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1611.06455</idno>
		<ptr target="http://arxiv.org/abs/1611.06455" />
		<title level="m">Time series classification from scratch with deep neural networks: A strong baseline</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Gurumoorthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jawanpuria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mishra</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.10159</idno>
		<title level="m">Spot: A framework for selection of prototypes using optimal transport</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><surname>InterpretML Contributors</surname></persName>
		</author>
		<ptr target="https://github.com/interpretml/interpret" />
		<title level="m">InterpretML: Fit interpretable models. Explain blackbox machine learning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>GitHub repository</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
