<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>xSTAE: Explaining Classifier Decisions through EEG Signal Style Transfer Autoencoding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natalia Koliou</string-name>
          <email>nataliakoliou@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panagiotis Zazos</string-name>
          <email>pzazos@fourdotinfinity.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoforos Romesis</string-name>
          <email>chris.romesis@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristian Bosch</string-name>
          <email>cristian.boschserrano@ucd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stasinos Konstantopoulos</string-name>
          <email>konstant@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panagiotis Trakadas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CeADAR, University College Dublin</institution>
          ,
          <addr-line>Dublin, Ireland</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Four Dot Infinity</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Informatics and Telecommunications, NCSR 'Demokritos'</institution>
          ,
          <addr-line>Ag. Paraskevi</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Style transfer methods are a powerful visualization tool that can be used to generate counterfactual explanations: plausible alternatives to the original input that lead to a different classification. In this paper we present xSTAE, a system that restyles a misclassified example into the correct class, in order to help the expert understand what patterns the classifier was looking for, but failed to see in the instance, in order to assign the correct class. The system is based on an autoencoder trained with a loss function that balances identity loss (similarity with the original instance) against a classification loss derived from a pre-trained classifier, allowing xSTAE to remain completely agnostic with respect to the internals of the classifier it interprets. We present promising experimental results on sleep-stage classification decisions over EEG data, which validate the core of the idea and show future research directions.</p>
      </abstract>
      <kwd-group>
        <kwd>EEG</kwd>
        <kwd>Style Transfer</kwd>
        <kwd>Autoencoders</kwd>
        <kwd>Sleep Stage Classification</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Explainable AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Explainable AI (XAI) has gained significant attention in the last few years, mainly due to
the wide application of deep learning models in high-stake domains, prominently including
healthcare where understanding model decisions can directly impact patient needs, treatment
and overall well-being. In practice, XAI helps users not only determine whether to trust
individual predictions, but also compare different models and identify areas for improvement
when a model performs poorly [1].</p>
      <p>While most XAI research has focused on image and tabular data, time-series remain relatively
under-explored. One possible reason for this gap is that, unlike images, the semantics of
time-series cannot be easily visualized [2] and their interpretation requires combining domain
expertise with an understanding of temporal dependencies. This gap has considerable impact in
the healthcare domain, where bio-signals are widely used in clinical practice.
Electroencephalography (EEG) signals for instance—which are the focus of this paper—are widely used in clinical
practice to monitor brain activity and support the diagnosis of neurological conditions [3], sleep
and mental disorders [4, 5], and cognitive impairments [6].</p>
      <p>In this paper we propose a method for generating instance-based interpretations of sleep-stage
classification decisions over EEG data. Our method is based on training a generative AI model
to interpret a pre-trained classifier by giving counterfactual examples of what a misclassified
instance should have looked like to be correctly classified. Observing such examples can help
the operator identify the patterns that lead to misclassifications and create a basis for a more
focused collection and annotation of further training data.</p>
      <p>In the remainder of this paper we first discuss related work (Section 2) and formally state
the problem (Section 3). We then present our method (Section 4) and proceed to provide and
discuss experimental results (Section 5) and conclude (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>While deep learning is used for a variety of tasks, classification has been a primary focus within
the XAI research community due to its broad applicability and conceptual simplicity. Relevant
research has primarily focused on image and tabular data, where model interpretability is more
intuitive and visual explanations are easier to generate [7, 8, 9].</p>
      <p>As attention has only recently turned to the, more challenging, problem of explaining
timeseries classifiers, the literature on this topic remains relatively limited [ 10, 2]. In many cases, the
need to interpret decisions drives the development of the classifier itself, with LAXCAT and
PatchX being characteristic examples. LAXCAT [11] is a deep neural network architecture that
simultaneously identifies the time intervals and the variables of a multi-variate time-series that
contribute to classification decisions. LAXCAT uses convolutional layers to extract features from
the time series, and two attention modules to identify which variables and time intervals are
most important for the classification. PatchX [12] divides time series into smaller patches and
performs fine-grained classification on each patch using deep neural networks. These patch-level
results are then combined with a traditional classifier to generate the final prediction, and
explanations are provided by interpreting the importance of each patch in the overall decision.</p>
      <p>A different line of research couples the training of the classifier with the training of its interpreter.
Gee et al. [13] propose a prototype-based approach for explaining deep time-series classifiers by
learning diverse and representative latent prototypes that highlight class-discriminative patterns
across multiple modalities, including ECG, respiration, and audio waveforms. They achieve this
by integrating a prototype learning mechanism directly into the training process, encouraging the
model to associate input instances with learned prototypes that reflect meaningful and diverse
latent representations for each class. XTF-CNN [14] is a dual-channel convolutional neural
network that learns representations of microseismic waveforms from both the time and frequency
domains to improve classification of rock fracturing events. XTF-CNN is integrated with
EUG-CAM that generates fine-grained, gradient-based activation maps over input waveforms to
illustrate which parts of the signal influence the model’s decisions. Finally, DeepVix [15] is a
visual analytics system designed to explain Long Short-Term Memory (LSTM) networks applied
to high-dimensional multivariate time series data. DeepVix provides interactive visualizations
of the LSTM architecture, including node activations and gate weights across layers and time
steps, enabling users to investigate intermediate computations and trace how different input
variables contribute to the model’s predictions.</p>
      <p>Naturally, the lines of research above require a tight coupling between the classifier and its
interpretation module, restricting the range of possible networks that can be used to those for
which an appropriate interpretation module has been devised. On the other hand, timeXplain
[16] is a post-hoc, model-agnostic explanation framework for time series classifiers based on
SHAP. timeXplain uses domain-specific perturbation strategies organized by time, frequency,
and statistical mappings. By observing the effect of input perturbations on the results,
timeXplain evaluates feature importance. timeXplain demonstrates that certain mappings (e.g.,
time slices with noise replacements) can produce explanations that are more faithful than some
model-specific methods.</p>
      <p>Focusing on the EEG domain, Apicella et al. [17] evaluated several established XAI methods to
explain machine learning models trained on EEG data for emotion recognition, aiming to address
the dataset shift problem, a common issue in Brain-Computer Interfaces where EEG signal
characteristics vary across recording sessions causing models trained on one session to perform
poorly on others. They applied these XAI techniques to identify which EEG signal components
the models rely on and tested how consistently these important features appear within and
across different recording sessions of the same subjects. Their results showed that many relevant
features detected by XAI remain stable across sessions, indicating that these explanations can
be used to improve the generalization and robustness of EEG-based classification systems.</p>
      <p>Zanola et al. [18] developed xEEGNet, a compact and fully interpretable neural network for
EEG-based dementia classification that transforms a traditional “black box” model (ShallowNet)
into a “white box” by progressively modifying its architecture to highlight explainable components.
They achieve interpretability by designing the network to learn EEG band-specific filters and
spatial topographies, which correspond to meaningful brain signal features clinicians can relate
to dementia pathology. This approach not only reduces the number of parameters by over 200
times—helping to resist overfitting—but also allows direct inspection of the learned kernels and
weights to provide clear, medically relevant explanations for the model’s decisions.</p>
      <p>Hussain et al. [19] developed machine learning models to classify human activities—resting,
motor, and cognitive—using EEG spectral features collected from healthy individuals. They
applied the model-agnostic explainability technique LIME to interpret which EEG features most
influenced the classification decisions, providing clinically relevant insights into brain activity
during these tasks. Their results showed strong classification performance and meaningful
explanations that could support improved patient monitoring and rehabilitation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>According to the taxonomy introduced by Theissler et al. [2], counterfactual explanations
are considered a promising instance-based approach for interpreting time-series classifiers. A
counterfactual is a plausible alternative to the original input that leads to a different classification,
while remaining as similar as possible to the original. By comparing the original time series
with its counterfactual, one can infer which parts of the signal had the greatest impact on the
model’s decision.</p>
      <p>Let D = {x^(i)}_{i=1}^{N} be an EEG dataset of N sequences, where each sequence x^(i) ∈ X ⊆ R^d is
represented as a d-dimensional vector, and f : X → Y a trained classifier that maps each input
x ∈ X to one of K discrete class labels in Y = {1, 2, . . . , K}. Given an input x, the classifier
outputs a label y = f(x). Our goal is to provide an explanation for why the classifier assigned
that specific label to the input. This becomes particularly insightful in cases of misclassification,
where understanding what made the classifier predict an incorrect label can reveal class-specific
patterns that were strong enough to override the correct label. Identifying the distinguishing
characteristics that the model associates with the incorrect class provides a means to explain its
decision-making process.</p>
      <p>
        Formally, we define the problem as follows. Given that f is an imperfect classifier, we focus
on the subset of inputs x_m ∈ X for which the predicted label f(x_m) differs from the true label
y*(x_m). For each such misclassified instance, we aim to identify the minimal modification x′ of
the input such that the classifier’s prediction aligns with the ground truth, i.e., f(x′) = y*(x_m),
while ensuring that the modification is as small as possible according to a chosen distance metric
d(·, ·). This can be expressed as the optimization problem:

x′ = arg min_{x ∈ X} d(x_m, x)  such that  f(x) = y*(x_m)    (1)

Here, x′ serves as a counterfactual that reveals how the original input x_m would need to
change to be correctly classified. By analyzing the difference Δx = x′ − x_m, we can identify
dominant patterns associated with class f(x_m) and gain insight into the model’s decision
boundaries.
      </p>
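      <p>To make the optimization above concrete, the following sketch (our illustration, not the paper’s method, which instead trains autoencoders) finds a counterfactual for a toy one-dimensional classifier by greedily shifting the input until the prediction flips, keeping the modification small; all names are ours:</p>

```python
import numpy as np

# Toy stand-in classifier: class 1 iff the signal mean is positive.
# This only illustrates the arg-min of Equation (1); it is not xSTAE.
def f(x):
    return int(x.mean() > 0)

def counterfactual(x, target, step=0.01, max_iter=100_000):
    """Greedily nudge x until f(x') == target, keeping d(x, x') small."""
    x_prime = x.copy()
    direction = 1.0 if target == 1 else -1.0
    for _ in range(max_iter):
        if f(x_prime) == target:
            return x_prime
        x_prime = x_prime + direction * step  # uniform shift: smallest mean change
    raise RuntimeError("no counterfactual found within budget")

x = np.array([-0.5, -0.2, -0.3])   # classified as 0
x_cf = counterfactual(x, target=1)
print(f(x), f(x_cf))  # 0 1
```

      <p>The difference x_cf − x then reveals which direction of change the toy classifier is sensitive to, mirroring the role of Δx above.</p>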
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Methodology</title>
      <p>To generate meaningful counterfactuals, we propose a generative framework based on a set of
class-conditional autoencoders, where each autoencoder is trained to reconstruct inputs from
any class, while restyling them toward a specific target class y_tgt ∈ Y.</p>
      <p>For every target class, we train a separate autoencoder A_tgt : X → X. Given an input x
and a target label y_tgt ≠ f(x), the corresponding autoencoder A_tgt generates a counterfactual
x′ = A_tgt(x) that closely resembles x, but is modified just enough for f to classify it as y_tgt,
i.e., f(x′) = y_tgt. By comparing the original input x with its counterfactual x′, one can identify
the patterns in x that were responsible for the classifier’s original decision, and thus detect the
characteristics of class f(x) that were most prominent.</p>
      <p>Training each of these autoencoders requires incorporating two forms of feedback:
1. The first evaluates how well the generated output approximates the original input. This
can be quantified using a similarity function s(x, x′), where x is the original input
and x′ is the reconstructed output. The goal is to keep s(x, x′) sufficiently small to ensure
the generated output remains close to the original.
2. The second evaluates how well the output aligns with the desired target class y_tgt ∈ Y.
This feedback is provided by the classifier f that we aim to explain. During training, the
generated output x′ is passed through f, and its prediction f(x′) is compared against the
target label y_tgt.</p>
      <p>This dual feedback guides the autoencoders towards producing counterfactuals that are both
similar to the input and representative of the target class.</p>
      <sec id="sec-4-1">
        <title>4.1. EEG Data</title>
        <p>Our methodology is specifically designed for time-series EEG data. Let x ∈ R^{T × C} denote an
EEG segment in the time domain (or epoch), where T is the number of time samples and C
is the number of recording channels. Each epoch corresponds to a fixed-duration window of w
seconds of EEG signal recording.</p>
        <p>To reduce the complexity of the input data, we transform raw EEG time-series from the time
domain to the frequency domain. Given an input signal x ∈ R^{T × C}, we first apply the Fast
Fourier Transform (FFT) to obtain its spectral representation. To retain only domain-relevant
information, we filter out frequencies outside a predefined range of interest (e.g., those not
associated with meaningful brain activity). The remaining frequency band is then divided into
F′ non-overlapping segments, each represented by three features: the midpoint frequency, the
phase at that frequency, and the average amplitude across the segment. This results in a matrix
x_f ∈ R^{F′ × d}, where each row represents a frequency-region waveform using d features.</p>
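        <p>A minimal single-channel sketch of this feature extraction (function and variable names are ours; the concrete band limits and segment count follow the experimental setup in Section 5) could look as follows:</p>

```python
import numpy as np

def spectral_features(x, fs=256, f_lo=0.4, f_hi=30.0, n_seg=300):
    """Sketch of the FFT feature pipeline for one channel (names are ours).

    Returns an (n_seg, 3) array of [midpoint frequency, phase, mean amplitude]
    for n_seg non-overlapping frequency segments.
    """
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    keep = (freqs >= f_lo) & (freqs <= f_hi)          # band-limit the spectrum
    spectrum, freqs = spectrum[keep], freqs[keep]
    bins = np.array_split(np.arange(len(freqs)), n_seg)
    feats = []
    for idx in bins:
        mid = idx[len(idx) // 2]                      # midpoint bin of the segment
        feats.append([freqs[mid],                     # midpoint frequency
                      np.angle(spectrum[mid]),        # phase at that frequency
                      np.abs(spectrum[idx]).mean()])  # average amplitude
    return np.asarray(feats)

feats = spectral_features(np.random.randn(7680))      # one 30 s epoch at 256 Hz
print(feats.shape)  # (300, 3)
```

        <p>Per-channel outputs would then be stacked column-wise to form the F′ × d matrix described above.</p>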
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Classifier</title>
        <p>The classifier is a two-stage convolutional neural network designed to process EEG data in the
time domain. The architecture is based on the one proposed by Youness [20] and Esparza-Iaizzo
et al. [21]. This architecture captures short-term temporal dependencies by analyzing sequences
of consecutive epochs. To predict the label for some epoch x_t, the model
considers a sequence of L epochs, including the current epoch and the previous L − 1 epochs:

S_t = [x_{t−L+1}, x_{t−L+2}, . . . , x_{t−1}, x_t] ∈ R^{L × F′ × d}</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Autoencoder</title>
        <p>The autoencoder is a hybrid architecture combining self-attention mechanisms and convolutional
operations. The input to A is a sequence of F′ vectors, each of dimensionality d + 1:

x_in = x_f ⊕ pos ∈ R^{F′ × (d + 1)}

The first d dimensions correspond to the extracted features, while the last dimension is a
positional encoding added to inject a sense of temporal order. Since Transformers inherently
treat their input as a set of unordered tokens, this positional encoding, implemented as a simple
increasing counter (e.g., 1, 2, 3, . . . , F′), allows the model to capture temporal dependencies
across the samples in the epoch. A detailed overview of the autoencoder architecture is provided
in Table 2.</p>
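        <p>The counter-style positional encoding above can be sketched as follows (an illustration with our own names; shapes follow the text):</p>

```python
import numpy as np

# Append a simple increasing counter as an extra feature column: without it,
# the self-attention layers would treat the F' segments as an unordered set.
def add_positional_encoding(x):
    """x: (F', d) feature matrix -> (F', d + 1) with position appended."""
    n = x.shape[0]
    pos = np.arange(1, n + 1, dtype=x.dtype).reshape(n, 1)  # 1, 2, ..., F'
    return np.concatenate([x, pos], axis=1)

x_in = add_positional_encoding(np.zeros((300, 6)))
print(x_in.shape)                  # (300, 7)
print(x_in[0, -1], x_in[-1, -1])   # 1.0 300.0
```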
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>We apply our methodology to sleep stage classification using EEG data from the Bitbrain Open
Access Sleep (BOAS) dataset [22]. This dataset consists of 128 full-night recordings collected
from healthy volunteers wearing a two-channel EEG headband (C = 2). The EEG signals were
sampled at 256 Hz and segmented into non-overlapping 30-second epochs, each containing 7,680
samples per channel (T = 7680). Every epoch was independently scored by three certified
sleep experts following the American Academy of Sleep Medicine (AASM) guidelines [23]. These
guidelines define five standard sleep stages: Wake (W), N1 (light sleep), N2 (intermediate
sleep), N3 (deep sleep), and REM (rapid eye movement sleep). Since our focus is on sleep stage
classification, we exclude the Wake class and restrict our task to the four sleep-related stages:
N1, N2, N3, and REM (K = 4). To address typical inter-scorer variability (∼85% agreement
[24, 25]), a fourth expert reviewed the annotations and produced a consensus label for each
epoch. These consensus sleep-stage labels were then aligned with the EEG segments to provide
a reliable ground truth for each 30-second window.</p>
      <sec id="sec-5-0">
        <title>5.1. Setup</title>
        <p>According to our methodology, the first step involves transforming the Bitbrain EEG data
from the time domain to the frequency domain. To achieve this, we pre-process each epoch
(per channel) using the pipeline shown in Figure 1. The raw time-domain signal x ∈ R^{T × C}
(Figure 1a) is first converted to the frequency domain using a Fast Fourier Transform (FFT)
(Figure 1b). We then filter out all frequency components outside the 0.4-30 Hz range (Figure 1c),
which includes the primary EEG waveforms relevant to sleep staging: Delta (0.5-4 Hz), Theta
(4-8 Hz), Alpha (8-13 Hz), and Beta (13-30 Hz) [26].</p>
        <p>Next, we split the retained frequency range into F′ equal, non-overlapping segments. From
each segment, we extract a representative waveform defined by its midpoint frequency and
phase, along with the average amplitude across the segment (Figure 1d). This produces a
compact and structured representation of the original time-series in the frequency domain:
x_f ∈ R^{F′ × d}, where d = 6, corresponding to three features (frequency, phase, amplitude) for
each of the two EEG channels. In our experiments, we set F′ = 300 to balance expressiveness
with computational efficiency.</p>
        <p>We split the dataset subject-wise into training, validation, and test sets, ensuring that all
epochs from a given participant remain within a single split. This way, we prevent subject
leakage and preserve the validity of the evaluation results. Approximately 80% of the nights (80
participants) were assigned to the training set, 10% (12 participants) to the validation set, and
20% (25 participants) to the test set. To standardize the data, we apply z-score normalization
using the mean and standard deviation computed from the training set; these statistics are
then used to normalize the validation and test sets accordingly:

x̃ = (x − μ_train) / σ_train</p>
      </sec>
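      <p>The split-wise z-score normalization can be sketched as follows (illustrative code, names ours): the statistics are fit on the training split only and then reused unchanged for validation and test data.</p>

```python
import numpy as np

# Fit normalization statistics on the training split only, then reuse them
# for validation/test data to avoid leaking information across splits.
def fit_zscore(train):
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return lambda x: (x - mu) / sigma

train = np.array([[1.0, 10.0], [3.0, 30.0]])
normalize = fit_zscore(train)
print(normalize(train))                      # training columns now have mean 0
print(normalize(np.array([[2.0, 20.0]])))    # test data uses train statistics
```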
      <sec id="sec-5-1">
        <title>5.1.1. Classifier</title>
        <p>The sleep stage classifier described in Section 4.2 is the model we aim to explain. It takes as
input a sequence of L = 5 consecutive EEG epochs in the frequency domain and produces a
prediction for the final epoch x_5 (Equation 5):

S = [x_1, x_2, x_3, x_4, x_5] ∈ R^{5 × 300 × 6}    (4)

f(S) = ŷ ∈ {1, 2, 3, 4}    (5)</p>
        <p>A critical part of training this model is choosing a loss function that matches the data
distribution and learning objectives. The Bitbrain dataset has a highly imbalanced class
distribution: N2 accounts for around 50% of the epochs, while N1 and N3 are less common
(about 5% and 20%, respectively), and REM makes up the remaining 25%. This skewed class
distribution can bias the model toward over-predicting the majority class, resulting in poor
performance on minority classes. To address this pronounced class imbalance, we adopt the
sparse categorical focal loss, which modifies the standard cross-entropy loss to emphasize learning
from hard-to-classify examples. The focal loss is defined as:

ℒ_focal = − Σ_{c=1}^{K} α_c (1 − p_c)^γ log(p_c),    (6)

where p_c is the predicted probability for the true class c, α_c is a weighting factor that balances
the importance of each class, and γ ≥ 0 is a focusing parameter that reduces the contribution of
well-classified examples. By tuning α and γ, the focal loss places greater importance on difficult
or misclassified instances, enhancing performance on underrepresented classes. Although the
loss is defined as a sum over all classes, in practice only the term corresponding to the true
class contributes to the loss for each training sample. This is due to the one-hot encoding of the
target labels: the correct class is represented with a value of 1, while all other classes are 0. As a
result, all terms involving incorrect classes are multiplied by zero and do not affect the outcome.</p>
        <p>Table 3. Classifier’s Hyperparameter Search Space and Optimal Values.</p>
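        <p>With one-hot targets, the focal loss for a sample therefore reduces to a single term. A minimal numerical sketch (our code, scalar form of Equation 6) shows how the focusing parameter γ down-weights easy examples:</p>

```python
import numpy as np

def focal_loss(p_true, alpha, gamma):
    """Focal loss for one sample, given the predicted probability of the
    true class (Equation 6 with one-hot targets: only that term survives)."""
    return -alpha * (1.0 - p_true) ** gamma * np.log(p_true)

# With gamma = 0 the focal loss reduces to weighted cross-entropy; raising
# gamma shrinks the loss of well-classified samples (p_true near 1).
easy, hard = 0.95, 0.30
assert focal_loss(easy, 1.0, 2.0) < focal_loss(easy, 1.0, 0.0)
print(round(focal_loss(hard, alpha=1.0, gamma=2.0), 2))  # 0.59
```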
        <p>To identify the optimal configuration for the classifier, we conducted an automated
hyperparameter search using the Optuna framework. The goal was to maximize the macro-F1 score
on the validation set, which better reflects performance across all sleep stages. We performed
50 trials, with each trial running for up to 20 training epochs. Early stopping was applied if
the validation macro-F1 did not improve for 5 consecutive epochs. The search space included
a variety of architectural and training parameters, presented in the middle column of Table 3.
These ranged from convolutional filter sizes and kernel widths to dropout rates, dense layer
units, learning rate, and optimizer choice. The focal loss parameters α and γ were also included,
to fine-tune the loss function’s sensitivity to class imbalance; they proved to have a great impact.
The optimal hyperparameters identified by Optuna’s best trial are presented in the rightmost
column of the same table.</p>
        <p>During evaluation, our classifier achieved an overall accuracy of 0.87. Table 4 provides a
detailed per-class performance breakdown, including precision, recall, F1-score, and support.</p>
        <p>To benchmark our approach, we compare against the work of Esparza-Iaizzo et al. [21], who
performed non-causal, single-channel sleep stage decoding directly on raw EEG signals, including
the Wake stage. Comparing the confusion matrices in Figure 3, we note that Light Sleep (N1)
remains the most challenging stage to classify, with our model achieving a recall of only 27%.
However, by excluding Wake, our pipeline avoids masking N1 errors within dominant Wake bins,
as seen in previous works. Instead, misclassified N1 epochs are primarily confused with N2 (52%)
and REM (21%), reflecting the transitional nature of light sleep. Mid-stage non-REM sleep
(N2) detection shows considerable improvement, with recall increasing from 85% in the baseline
to 91% in our model. Performance on deep slow-wave sleep (N3) remains comparable, with
both methods effectively capturing delta-band features. The largest boost is observed in REM
sleep detection, where recall rises from 75% to 93%. This suggests that spectral representation
combined with sequential modeling better isolates the characteristic EEG patterns of REM.</p>
        <p>Together, these results validate that converting raw headband signals into compact,
two-channel spectral slices and modeling their temporal evolution over consecutive epochs yields a
more discriminative representation for the predominant sleep stages (N2 and REM) without
sacrificing accuracy in deep sleep. Light sleep (N1) remains inherently difficult due to its low
prevalence (∼5%) and high intra-stage variability in healthy adults [27].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.1.2. Autoencoders</title>
        <p>To perform style transfer across the K = 4 sleep stages, we train a separate autoencoder A_tgt for
each target class y_tgt ∈ {1, 2, 3, 4}. Each model learns to restyle input epochs from any class to
match the target class y_tgt. Unlike the classifier, which takes a sequence of epochs as input, the
autoencoder processes a single 30-second EEG epoch in the frequency domain. Instead of using
all frequency-domain features (frequency, phase, and amplitude), we choose to transform only
the amplitude values, as we observed that learning is more effective when focusing on energy
patterns alone. This reduces the input dimensionality to d = 3, corresponding to two amplitude
values (one per channel) along with the positional encoding.</p>
        <p>Each autoencoder thus maps the amplitude features of an epoch x (plus positional encoding) to their restyled counterpart:

a_in(x) = a_x ⊕ pos ∈ R^{300×3}  →  â_{x,tgt} = A_tgt(a_in(x))</p>
        <p>After decoding, we reconstruct the full frequency-domain representation by adding back the
original frequency and phase features:

x̂_{x,tgt} = A_tgt(a_in(x)) ⊕ (f_x, φ_x) ∈ R^{300×7}</p>
        <p>To train the autoencoders, we combine two loss functions:

ℒ_ae = λ_id · ℒ_id + λ_clf · ℒ_clf,

where ℒ_id is the reconstruction loss computed as the mean squared error (MSE) between the
input and its reconstruction, ℒ_clf is the focal loss defined in Equation 6, and λ_id and λ_clf are
their respective weighting coefficients.</p>
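        <p>The weighted combination ℒ_ae can be sketched numerically as follows (our code; the weights w_id and w_clf stand in for λ_id and λ_clf, with toy values rather than the tuned ones):</p>

```python
import numpy as np

# Weighted sum of the identity (MSE) and classification (focal) losses,
# mirroring L_ae = lambda_id * L_id + lambda_clf * L_clf.
def ae_loss(x, x_rec, p_true, w_id=1.0, w_clf=1.0, alpha=1.0, gamma=2.0):
    l_id = np.mean((x - x_rec) ** 2)                         # reconstruction MSE
    l_clf = -alpha * (1 - p_true) ** gamma * np.log(p_true)  # focal loss term
    return w_id * l_id + w_clf * l_clf

loss = ae_loss(np.zeros(4), np.full(4, 0.1), p_true=0.5)
print(round(loss, 3))  # 0.183
```

        <p>Raising w_clf pushes the autoencoder to restyle more aggressively toward the target class, while raising w_id keeps the counterfactual closer to the original signal.</p>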
      </sec>
      <sec id="sec-5-3">
        <title>Identity Loss</title>
        <p>To ensure that the counterfactual remains close to the original input, we compute
the identity loss in the time domain. The predicted amplitude-phase-frequency triplets x̂_{x,tgt}
are denormalized and decoded back into the four canonical EEG bands (delta, theta, alpha, and
beta). We denote the resulting restyled signal in the time domain as s_{x,tgt}. The identity loss
is then computed as the mean squared error between the reconstructed signal and the original
time-domain signal:

ℒ_id = MSE(s_x, s_{x,tgt})</p>
      </sec>
      <sec id="sec-5-4">
        <title>Classification Loss</title>
        <p>To encourage the output to resemble the target class, we guide the
autoencoders’ training using the pre-trained classifier. The classifier operates on sequences of L = 5
consecutive epochs. For each input epoch x, we retrieve four preceding epochs from a donor
sequence that belongs to the target class y_tgt. We then concatenate the restyled output x̂_{x,tgt} to
these donor epochs (excluding positional encodings), forming a complete input sequence for the
classifier:

S_seq = [x_{d−4}, x_{d−3}, x_{d−2}, x_{d−1}, x̂_{x,tgt}] ∈ R^{5×300×6}</p>
        <p>The predicted label ŷ for the last epoch in the sequence is then compared to the target class
y_tgt using the focal loss:

ℒ_clf = ℒ_focal(ŷ, y_tgt)</p>
        <p>Table 5. Autoencoders’ Hyperparameter Search Space and Optimal Values.</p>
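        <p>Assembling the classifier input from donor epochs can be sketched as follows (our code; shapes follow the text, names are ours):</p>

```python
import numpy as np

# Prepend four donor epochs of the target class to the restyled epoch,
# forming the 5-epoch sequence the pre-trained classifier expects.
def make_sequence(donor_epochs, restyled):
    """donor_epochs: (4, 300, 6) from a target-class recording;
    restyled: (300, 6) autoencoder output, positional encoding removed."""
    return np.concatenate([donor_epochs, restyled[None]], axis=0)

seq = make_sequence(np.zeros((4, 300, 6)), np.ones((300, 6)))
print(seq.shape)  # (5, 300, 6)
```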
        <p>The autoencoders were trained for a maximum of 1000 epochs using early stopping with a
patience of 50. We manually tuned the architecture and training setup based on performance
on the validation set. Our experiments included varying model depth, attention configurations,
learning rates, and the weighting of loss terms. Table 5 summarizes the hyperparameters explored
and the final configuration that yielded the best results.</p>
        <p>To better understand the training dynamics, we plotted the training and validation loss curves
for each of the four autoencoders. Each figure displays three loss components per model: the
total loss, the identity loss, and the classification loss. The identity and classification losses are
unweighted (i.e., before applying λ_id and λ_clf) to provide a clearer view of how each component
evolves through the epochs. Figure 4 shows the training and validation loss curves for each
autoencoder. Each row corresponds to one autoencoder, with training loss on the left and
validation loss on the right.</p>
        <p>We refer to xSTAE as the complete architecture that combines the autoencoder with the
classifier guiding its training. Figure 5 illustrates the entire pipeline, from the input EEG
spectrum to the style-transferred output. The top section of the figure depicts the classifier
architecture, while the bottom section shows the autoencoder.</p>
        <sec id="sec-5-4-1">
          <title>5.2. Quantitative Results</title>
          <p>To quantitatively evaluate how well each autoencoder learns the style of its corresponding sleep
stage, we pass the entire test dataset through each class-specific autoencoder and then classify
the restyled outputs using the original sleep stage classifier. Table 6 shows the classification
accuracy for each sleep stage when the test signals are restyled into that stage. The fact that the
restyled signals dramatically reduce the classification error demonstrates that the autoencoders
have indeed captured the patterns that the classifier uses in order to recognize sleep stages. That
is to say, xSTAE learns to interpret the answers of that specific
classifier.</p>
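<p>The evaluation procedure above can be sketched as follows; the names <italic>autoencoders</italic> (mapping each stage to its restyling function) and <italic>classify</italic> (the frozen classifier) are illustrative assumptions, not the released code:</p>

```python
# Sketch of the quantitative evaluation: restyle every test instance into a
# target stage with that stage's autoencoder, then measure how often the
# original classifier assigns the target label to the restyled signal.
def restyle_accuracy(test_set, autoencoders, classify):
    accuracy = {}
    for stage, restyle in autoencoders.items():
        hits = sum(1 for x in test_set if classify(restyle(x)) == stage)
        accuracy[stage] = hits / len(test_set)
    return accuracy
```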
        </sec>
        <sec id="sec-5-4-2">
          <title>5.3. Qualitative Discussion</title>
          <p>We present three examples of misclassified EEG segments and use xSTAE to explore why the
classifier may have assigned the wrong label. By comparing the original signals with their
restyled counterparts, we aim to uncover the patterns the classifier has implicitly learned to
associate with each sleep stage.</p>
          <p>In each case, the original EEG fragment is passed through the autoencoder corresponding to
the correct target class. The restyled signal represents what the classifier "expects" to see for
that class, highlighting which features may have been overlooked in the original. This visual
comparison allows us to extract two kinds of insights: (1) why the original signal was misclassified
and (2) what changes xSTAE introduced that led the classifier to correctly identify the restyled
signal as belonging to the target class. Table 7 summarizes the typical EEG characteristics of
each sleep stage, which serve as a reference in our interpretations.</p>
          <p>Table 7 (per-stage characteristics): low amplitude, moderate density, almost no spindles;
low amplitude, high density, clear presence of spindles; high amplitude, low density, no
spindles; low amplitude, low density, almost no spindles.</p>
          <p>N2 signal misclassified as N1. The EEG segment shown in Fig. 6 was originally labeled as N2
but was misclassified by the model as N1. To interpret this decision, we use xSTAE to restyle
the signal into the correct target class (N2), as shown in Fig. 7. This allows us to visualize what
the classifier was expecting to see in order to recognize the signal as N2.</p>
          <p>Comparing the original and restyled signals, we observe that the restyled version has more
pronounced features, e.g., around timesteps 55-60, 70-75, and 90-95, where existing spikes rise
more prominently above the baseline. This suggests that the classifier's patterns point in the
correct direction overall, but that in this case it did not assign enough importance to the existing
features. The restyling reveals that the cause of the misclassification was not that the classifier
looks for the wrong features, but that it does not place enough significance on the spikes it
does identify, expecting a higher number of spike instances or more prominent spikes.</p>
          <p>N3 signal misclassified as N2. The EEG segment shown in Fig. 8 was originally labeled as N3
but was misclassified by the model as N2. To interpret this decision, we again use xSTAE to
restyle the signal into the correct target class (N3), as shown in Fig. 9.</p>
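<p>Comparisons like these rely on spotting where the restyled signal departs most from the original, such as the timestep ranges cited above. A small helper can rank timesteps by absolute change; this is an illustrative aid, not part of xSTAE itself:</p>

```python
import numpy as np

# Rank timesteps by how much the restyling changed them, so the expert can
# focus on the k most-modified regions (illustrative helper only).
def most_changed_timesteps(original, restyled, k=5):
    diff = np.abs(np.asarray(restyled) - np.asarray(original))
    # indices of the k largest absolute changes, in timestep order
    return sorted(np.argsort(diff)[-k:].tolist())
```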
          <p>Comparing the original and restyled signals, we observe that the restyled version shows
significantly higher amplitudes, especially around timesteps 60-80 and 80-100, where the
fluctuations become more pronounced. This suggests that the classifier had already detected some
high-amplitude activity but did not consider it strong enough to classify the signal as N3. The
restyling process amplifies these features, making them more consistent with the typical N3
pattern. So again, the classifier appears to be looking for the right features (namely, high-amplitude
waveforms), but fails to assign sufficient weight to the ones already present.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>We presented xSTAE, a system that interprets erroneous decisions by a classifier by restyling
an instance where the classifier failed into an instance that would make the classifier give the
correct label. At the core of xSTAE is an autoencoder that reconstructs the signal in a way that
minimizes a loss balancing between maintaining identity and pushing towards the correct
label. This makes the restyled instance not just any example of what the correct class looks like,
but a depiction of the morphological features the classifier was looking for in this instance (and
did not find) in order to avoid the error.</p>
      <p>The general strategy of providing counterfactual explanations is an established method for
explaining AI systems, but not yet one of the most developed ones. As demonstrated by recent
(mostly visual) advances in Generative AI, this strategy is rapidly coming within reach. Our
contribution is to (a) ground the general strategy in a specific problem statement for time-series
data; (b) identify a spectral representation that works well specifically for EEG signals, for which
training converges and the restyled EEG is visually convincing; and (c) validate the
approach on open data and publish the complete experimental setup as open source.<sup>1</sup></p>
      <p>Although the empirical validation was promising, there are some further steps before the
system can be validated in trials with experts. Specifically, we plan to further explore different
ways to define the identity loss. For one, visual observation has shown that mean square error
biases the model towards making small changes everywhere, which makes it harder for the expert
to identify what changes have been effected. Before trials, we need to define (and validate the
convergence and low classification loss of) alternative identity loss definitions, e.g. preferring
bigger local changes or preferring changes that do not affect all four brainwave bands. The
trials can then be used to establish which notion of 'identity' makes it easier to spot the changes
effected in order to achieve the intended re-labelling.</p>
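<p>As one illustration of the alternatives just mentioned, an identity loss that tolerates changes concentrated in a single brainwave band could be sketched as follows. The band slicing and the discounting of the most-changed band are assumptions made for illustration, not a validated design:</p>

```python
import numpy as np

# Candidate alternative identity loss: compute the reconstruction error
# separately per brainwave band and discount the most-changed band, so
# changes concentrated in one band are penalized less than changes spread
# over all four bands (illustrative sketch only).
def band_sparse_identity(x, x_restyled, band_slices):
    errors = np.array([np.mean((x[s] - x_restyled[s]) ** 2) for s in band_slices])
    # total error minus the largest per-band error: only changes outside
    # the most-changed band contribute to the loss
    return float(np.sum(errors) - np.max(errors))
```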
      <p>A further, more ambitious, step is to link the insights extracted from interpreting
misclassifications to possible actions for alleviating them. Since xSTAE is specifically designed to
be model-agnostic and can be matched to any pre-trained classifier, such actions also need to
operate at the same level of abstraction to maintain the generality of the system. In other words,
the outcome of an expert's understanding of the classifier's pain points should operate at the
level of re-balancing data or of post hoc establishing the confidence of specific classification
decisions, as opposed to recommending hyper-parameter, architectural, or similar changes.</p>
      <p><sup>1</sup>The data is the BOAS dataset [28] and the experimental setup is available at
https://doi.org/10.5281/zenodo.17085776.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was co-funded by the European Union under GA no. 101135782 (MANOLO
project). Views and opinions expressed are however those of the authors only and do not
necessarily reflect those of the European Union or CNECT. Neither the European Union nor
CNECT can be held responsible for them.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>As the subject of this article is generative AI, the example outputs in Figures 7 and 9 are
generated by AI. No generative AI was used to prepare any of the remaining content, either
textual or graphical.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , “
          <article-title>Why should I trust you?”: Explaining the predictions of any classifier</article-title>
          ,
          <source>arXiv:1602.04938 [cs.LG]</source>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1602.04938.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>