<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Online Explainable Forecasting using Regions of Competence</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Amal</forename><surname>Saadallah</surname></persName>
							<email>amal.saadallah@cs.tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">Lamarr Institute for Machine Learning and Artificial Intelligence</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matthias</forename><surname>Jakobs</surname></persName>
							<email>matthias.jakobs@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">Lamarr Institute for Machine Learning and Artificial Intelligence</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Online Explainable Forecasting using Regions of Competence</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">2ECF9DCF0C7691DBFA54ACC4DD8572C4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Time Series Forecasting</term>
					<term>Explainability</term>
					<term>Model Selection</term>
					<term>Ensembling</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Several machine learning models have been applied to time series forecasting. However, it is generally acknowledged that none of these models is universally valid for every application and over time. This is due to the complex and changing nature of time series data that may involve non-stationary processes but can also be explained by the fact that different models have varying expected regions of expertise or so-called Regions of Competence (RoCs) over the time series. Therefore, adequate and adaptive online model selection is often required. In this work, we review online model selection works that exploit the notion of RoCs by summarizing their methods and highlighting their strengths as well as limitations. In particular, we present a taxonomy of these methods and show how they can be exploited for both single model selection and ensemble of models pruning. We additionally discuss how the RoCs can promote explainability. Finally, we suggest future directions to provide useful research guidance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Time series forecasting is considered a key step in making informed decisions in a wide range of applications <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. Several Machine Learning (ML) models have been applied to the time series forecasting task <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. While certain guidelines, such as task complexity or data size, can help select a suitable family of ML models for forecasting <ref type="bibr" target="#b6">[7]</ref>, it is generally acknowledged that no existing ML model is universally applicable to all forecasting problems. This observation aligns with Wolpert's No Free Lunch theorem <ref type="bibr" target="#b7">[8]</ref>, which suggests that no learning algorithm can be optimal for all learning tasks. In addition, ML models may exhibit time-dependent performance <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>: their accuracy may not remain consistent over time due to the dynamic nature of time series data, which may involve non-stationary processes and, as a result, be subject to the so-called concept drift phenomenon <ref type="bibr" target="#b10">[11]</ref>. Thus, different forecasting models have different expected areas of expertise over different parts of the input time series. 
The part of the series where a specific model outperforms the other candidate models is referred to as a Region of Competence (RoC) of that model <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>Several works in the ML literature have used this notion of RoCs either implicitly <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b11">12]</ref> or explicitly <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref> to perform online model selection for forecasting that copes with the time-evolving nature of time series and the fact that models have a certain expected level of competence in predicting a particular region of the time series. While some of these works have focused on the online selection of a single model <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>, others have built on the assumption that no single model is an expert all the time, and have proposed to adaptively select and combine multiple forecasting models into a single model using an ensemble technique <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b13">14]</ref>. The selection of the individual models that make up the ensemble is referred to as ensemble pruning <ref type="bibr" target="#b17">[18]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Regions of Competence Computation</head><p>Let 𝑋 be a univariate time series and let 𝑋 𝑡 be the value of that time series at time 𝑡. Additionally, we denote with 𝑋 𝑎:𝑏 the subsequence entailing the values from 𝑡 = 𝑎 to 𝑡 = 𝑏 with 𝑎 &lt; 𝑏. A Region of Competence (RoC) for a forecasting model 𝑓 ∈ P, where P is a set of given models, refers to a set of subsequences 𝑋 𝑎:𝑏 on which 𝑓 exhibits superior performance compared to all other models in P. However, pinpointing these subsequences and determining their optimal length through empirical evaluations is computationally infeasible. Recent literature has therefore leveraged machine learning to approximate these RoCs. These approaches fall into two main categories. The first category includes model-agnostic methods that compute RoCs independently of the family of forecasting models. Within this category, two primary machine learning paradigms have been used, namely pattern matching and meta-learning. The second category comprises model-specific methods tailored to a particular class of forecasting models, such as tree-based models or DNNs.</p><p>The main idea of pattern-matching methods is to find, given the current subsequence, the closest subsequences in the training or validation data. The model exhibiting the best performance on these patterns is considered the expert, and the recent subsequence is treated as one of its RoCs. In <ref type="bibr" target="#b11">[12]</ref>, 1-Nearest Neighbors is used to determine the closest pattern. In <ref type="bibr" target="#b15">[16]</ref>, the known datapoints are first clustered. Then, each model in P is evaluated on all subsequences from each cluster. The Mean Squared Error (MSE) of each model is computed, and each cluster is assigned as a Region of Competence of the model that minimizes the computed MSE. 
This ensures that each cluster is associated with an expert model that excels in predicting values that follow the dominant pattern in the corresponding cluster.</p><p>In <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b4">5]</ref>, meta-learning is used to build models capable of modeling the competence of each candidate model across the input time series. The meta-learning approach is composed of an arbitrating architecture <ref type="bibr" target="#b18">[19]</ref> and a mixture of experts <ref type="bibr" target="#b19">[20]</ref>. A meta-learner is created for each candidate model. While each model 𝑓 𝑖 is trained to model the future values of the time series, its meta-learning associate 𝑔 𝑖 is trained to model the error of 𝑓 𝑖 . The arbiter 𝑔 𝑖 can then predict the error that 𝑓 𝑖 will incur when forecasting the future values of the time series. At test time, the candidate models 𝑓 𝑖 are weighted according to their expected degree of competence in 𝑋 𝑡−𝑝:𝑡−1 , estimated by the predictions of the meta-learners. These methods can be viewed as an implicit exploitation of the RoC concept, as they do not identify subsequences in the input time series but rather model the competence of the models using error estimates.</p><p>The works in the model-specific category mainly focus on CNN-based models comprising 1D-convolutional layers with varied filter and kernel sizes <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b9">10]</ref>. More recently, in <ref type="bibr" target="#b13">[14]</ref>, a hybrid architecture is proposed by concatenating the convolutional layers with a set of heterogeneous models that can be trained jointly using a gradient-based approach. These works share the same principle for computing the RoCs. 
They suggest a modified version of Grad-CAM <ref type="bibr" target="#b21">[22]</ref>, which utilizes the spatial information preserved in convolutional layers to identify important regions of the input. To do so, they start by evaluating the Mean Squared Error on a specific validation time window 𝑋 𝑣𝑎𝑙 , denoted as 𝜁 𝑓 𝑖 for model 𝑓 𝑖 . The objective is to determine the significance of each time point in this window with respect to the computed error. The last feature-maps layer of the CNN is utilized for this purpose. Importance weights 𝛼 𝑓 𝑖 associated with 𝜁 𝑓 𝑖 are computed for each activation unit 𝑢 in each generic feature map 𝐴 by calculating the gradient of 𝜁 𝑓 𝑖 relative to 𝐴, followed by a global average over all units in 𝐴:</p><formula xml:id="formula_0">𝛼 𝑓 𝑖 = 1 𝑈 ∑︀ 𝑢 𝜕𝜁 𝑓 𝑖 /𝜕𝐴 𝑢</formula><p>where 𝑈 is the total number of units in 𝐴. 𝛼 𝑓 𝑖 is then used to compute a weighted combination of all the feature maps for a given measured value of the error 𝜁 𝑓 𝑖 . Since the main interest lies in highlighting the temporal features contributing most to 𝜁 𝑓 𝑖 , ReLU is used to remove all the negative contributions: 𝐿 𝑓 𝑖 = 𝑅𝑒𝐿𝑈 ( ∑︀ 𝛼 𝑓 𝑖 𝐴). To identify the RoC in 𝑋 𝑣𝑎𝑙 that primarily contributed to 𝜁 𝑓 𝑖 , 𝐿 𝑓 𝑖 ∈ R 𝑈 is used. Note that 𝑈 is chosen to be smaller than the length of 𝑋 𝑣𝑎𝑙 , and that multiple time windows are created from 𝑋 𝑣𝑎𝑙 using a time-sliding approach to evaluate the performance of the same model on different windows and increase the number of computed RoCs.</p><p>While the methods presented above proved successful in providing an explanation for the model selection and ensembling processes, the problem of opaque, non-interpretable base models still persists. Recent work <ref type="bibr" target="#b14">[15]</ref> utilized tree-based models to make the model decision process more transparent, in addition to having an interpretable model selection algorithm. 
The authors create a pool of Decision Trees, Random Forests, and Gradient Boosting Trees that are subsequently trained on 𝑋 𝑡𝑟𝑎𝑖𝑛 . Shapley values, a well-established explainability method for feature attribution, are used to generate the explanations necessary for the creation of the RoCs. Since computing Shapley values is in general NP-hard <ref type="bibr" target="#b22">[23]</ref>, the authors utilize TreeSHAP <ref type="bibr" target="#b23">[24]</ref>, a method tailored to tree-based models that computes Shapley values in polynomial time. Similar to <ref type="bibr" target="#b8">[9]</ref>, the loss of the prediction of each model is explained rather than the prediction itself.</p></div>
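The clustering-based RoC construction from this section can be sketched as follows. This is a minimal illustration, not code from any of the cited works: the function name is hypothetical, a toy k-means stands in for the clustering step, and `models` is assumed to map model names to simple `predict(window) -> float` callables.

```python
import numpy as np

def assign_rocs_by_clustering(windows, targets, models, n_clusters=2, n_iter=20):
    """Cluster validation subsequences, evaluate every candidate model on each
    cluster, and assign each cluster as a RoC of the model with the lowest MSE
    there. `models` maps a name to a predict(window) -> float callable."""
    windows = np.asarray(windows, dtype=float)
    # Toy k-means on the raw subsequences (a stand-in for any clustering step),
    # initialized deterministically with the first n_clusters windows.
    centers = windows[:n_clusters].copy()
    for _ in range(n_iter):
        # assign each window to its nearest center (squared Euclidean distance)
        dists = ((windows[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = windows[labels == k].mean(axis=0)
    # assign each cluster to the model that minimizes the MSE on its members
    rocs = {name: [] for name in models}
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        if len(idx) == 0:
            continue
        mses = {name: np.mean([(f(windows[i]) - targets[i]) ** 2 for i in idx])
                for name, f in models.items()}
        expert = min(mses, key=mses.get)  # the expert model for this cluster
        rocs[expert].extend(windows[i] for i in idx)
    return rocs
```

For instance, with a pool containing a window-mean predictor and a naive trend extrapolator, clusters of trending subsequences end up as RoCs of the trend model, while flat or oscillating clusters become RoCs of the mean model.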
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Online Model Selection</head><p>At test time 𝑡, after computing the RoCs of the candidate models, an online decision on which model to select for forecasting the value of 𝑋 at 𝑡 has to be made. Note that the following can be applied to any future time instant 𝑡 + ℎ, ℎ ≥ 0; for simplicity of notation, assume ℎ = 0. In this part, we focus on the works that explicitly compute the RoCs, i.e., that result in identified subsequences within the input time series.</p><p>The models in P are devised such that they use the same 𝑝-lagged values of the time series as input. As a result, at time 𝑡, the input subsequence is 𝑋 𝑡−𝑝:𝑡−1 , 𝑡 ≥ 𝑝. To perform the selection, the distance of the input subsequence 𝑋 𝑡−𝑝:𝑡−1 to the RoCs of each candidate model is measured <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b14">15]</ref>. Since the length of each RoC can differ from 𝑝 (i.e., the length of 𝑋 𝑡−𝑝:𝑡−1 ), Dynamic Time Warping (DTW) <ref type="bibr" target="#b24">[25]</ref> is generally used to measure the similarity between 𝑋 𝑡−𝑝:𝑡−1 and each RoC member. The model whose RoC member has the smallest DTW distance to 𝑋 𝑡−𝑝:𝑡−1 is selected for forecasting. In <ref type="bibr" target="#b15">[16]</ref>, the RoCs are smoothed using a moving-average process and filtered to keep only the RoCs of length 𝑝, and DTW is replaced by the Euclidean distance.</p><p>Recent works have extended the RoC concept from single model selection to ensemble pruning <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b9">10]</ref>. 
Their reasoning is based on the expected error of the ensemble, which can be decomposed at each time point into a weighted average of the errors of the individual composing models and an ambiguity term, i.e., the variance of the ensemble around the weighted mean of its composing models. The weighted error term reflects the ensemble accuracy, while the ambiguity term reflects the ensemble diversity <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b26">27]</ref>. Since one model can have multiple RoCs, the first step is to select one representative RoC for each model based on its topological closeness to the current input subsequence 𝑋 𝑡−𝑝:𝑡−1 . In this manner, the same model is not selected multiple times, which contributes to the ensemble diversity <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b13">14]</ref>. In <ref type="bibr" target="#b9">[10]</ref>, diversity is promoted further by clustering the RoC representatives so that selecting models with similar competencies or expertise is avoided. The second step consists of ranking the models by the distance of their representative RoCs to the current input subsequence 𝑋 𝑡−𝑝:𝑡−1 to promote accuracy. The top-𝑀 closest models are selected to make up the ensemble. 𝑀 is arbitrarily set to a fixed number in <ref type="bibr" target="#b15">[16]</ref>, while the authors of <ref type="bibr" target="#b9">[10]</ref> derived bounds for both error terms to set 𝑀 automatically.</p></div>
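The distance-based selection step described in this section can be sketched as follows; this is a minimal, illustrative implementation (textbook O(nm) DTW, hypothetical function names), not code from the cited works, and it omits the diversity clustering of the RoC representatives.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences. Unlike the
    Euclidean distance, it tolerates sequences of different lengths, which is
    why it is used when RoC members are not all of length p."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def select_models(rocs, query, top_m=1):
    """Rank the models by the DTW distance of their closest RoC member to the
    current input subsequence `query`; top_m=1 is single model selection,
    top_m>1 is a simple form of ensemble pruning."""
    scores = {name: min(dtw_distance(query, member) for member in members)
              for name, members in rocs.items() if members}  # skip empty RoCs
    return sorted(scores, key=scores.get)[:top_m]
```

With `top_m=1` this reproduces the single-model rule (closest RoC member wins); larger `top_m` yields the top-M ensemble, where M is either fixed or derived from error bounds as discussed above.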
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Explainability Aspects</head><p>Most of the studies mentioned earlier emphasized the importance of transparency in models and learning decisions, including the selection of models <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>. They demonstrated that the computed Regions of Competence serve as a valuable tool for explaining why a specific forecast value is generated at a given time and why a particular individual model or ensemble member is chosen within a specific time frame or interval. In essence, the RoCs act as an explanatory mechanism, shedding light on the decision-making process behind forecast outputs and model selections. More specifically, they are part of the so-called exemplar-based or prototype-based explainability approaches <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>. Thus, local explanations, i.e., why a certain model was chosen at time 𝑡, can be given by visually inspecting the closest RoC member for each model. Global insights can be gained by visualizing the (clustered) RoCs for all models, as seen in Figure <ref type="figure">1</ref>.</p></div>
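An exemplar-based local explanation of this kind can be sketched as follows. This is a minimal illustration with hypothetical names, using the Euclidean distance (assuming RoCs filtered to the input length, as in the smoothing/filtering variant of [16]): for each model it reports the closest RoC member, so the selection at time 𝑡 can be inspected next to the current subsequence.

```python
import numpy as np

def explain_selection(rocs, query):
    """For each model, find the RoC member closest to the current input
    subsequence (Euclidean distance; assumes RoC members have the same length
    as `query`). Returns the selected model plus, per model, the exemplar that
    justifies choosing it -- or None for empty RoCs, i.e., models that never
    outperformed the others during validation."""
    query = np.asarray(query, dtype=float)
    report = {}
    for name, members in rocs.items():
        if not members:  # empty RoC: no region where this model was the expert
            report[name] = None
            continue
        dists = [float(np.linalg.norm(query - np.asarray(m, dtype=float)))
                 for m in members]
        best = int(np.argmin(dists))
        report[name] = {"exemplar": members[best], "distance": dists[best]}
    # the selected model is the one whose exemplar lies closest to the query
    selected = min((n for n in report if report[n] is not None),
                   key=lambda n: report[n]["distance"])
    return selected, report
```

Plotting each returned exemplar next to the current subsequence gives exactly the kind of local, visual justification discussed above, while plotting all (clustered) RoCs gives the global view.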
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This work provides a unified view of several online adaptive time series forecasting methods that are based on Regions of Competence. These regions not only enable state-of-the-art selection and ensemble performance but can also be used to gain insights into the selection and ensembling process on a local as well as a global level. One avenue of further research could be concerned with utilizing model-agnostic explainability methods such as KernelSHAP to enable heterogeneous model selection and ensembling. However, such a method may be limited by runtime, which would need to be improved to allow for concept drift adaptation. Another direction of future work might be to improve overall explainability by limiting the model pool to small, transparent models. An open question remains whether it is possible to select or ensemble these small models using Regions of Competence in a way that reaches state-of-the-art performance compared to opaque approaches such as selecting from pools of Deep Neural Networks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Visualization of RoCs for AbnormalHeartbeat data using the DNN-based approach for RoC computation <ref type="bibr" target="#b8">[9]</ref>. Notice that some RoCs can be empty, reflecting that the corresponding model did not outperform the others at any point during validation.</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research has been funded by the Federal Ministry of Education and Research of Germany and the state of North Rhine-Westphalia as part of the Lamarr Institute for Machine Learning and Artificial Intelligence.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">25 years of time series forecasting</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>De Gooijer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Hyndman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of forecasting</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="443" to="473" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Bright-drift-aware demand predictions for taxi networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Moreira-Matias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sousa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Khiari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jenelius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gama</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Monash time series forecasting archive</title>
		<author>
			<persName><forename type="first">R</forename><surname>Godahewa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bergmeir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">I</forename><surname>Webb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Hyndman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Montero-Manso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems Track on Datasets and Benchmarks</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>Forthcoming</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An empirical comparison of machine learning models for time series forecasting</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">K</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Atiya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">E</forename><surname>Gayar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>El-Shishiny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Econometric reviews</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="594" to="621" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Arbitrated ensemble for time series forecasting</title>
		<author>
			<persName><forename type="first">V</forename><surname>Cerqueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Torgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European conference on machine learning and knowledge discovery in databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="478" to="494" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Time-series forecasting with deep learning: a survey</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zohren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Philosophical Transactions of the Royal Society A</title>
		<imprint>
			<biblScope unit="volume">379</biblScope>
			<biblScope unit="page">20200209</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Cerqueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Torgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.13316</idno>
		<title level="m">Machine learning vs statistical methods for time series forecasting: Size matters</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The lack of a priori distinctions between learning algorithms</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Wolpert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="1341" to="1390" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Explainable online deep neural network selection using adaptive saliency maps for time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning and Knowledge Discovery in Databases</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Oliver</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Pérez-Cruz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Kramer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Read</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Lozano</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="404" to="420" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Explainable online ensemble of deep neural network pruning for time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">111</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A survey on concept drift adaptation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Žliobaitė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bifet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pechenizkiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bouchachia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM computing surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="1" to="37" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Dynamic model selection for automated machine learning in time series</title>
		<author>
			<persName><forename type="first">F</forename><surname>Priebe</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Constructive aggregation and its application to forecasting with dynamic ensembles</title>
		<author>
			<persName><forename type="first">V</forename><surname>Cerqueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Torgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Moniz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European Conference on Machine Learning and Knowledge Discovery in Databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="620" to="636" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Online deep hybrid ensemble learning for time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakobs</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European Conference on Machine Learning and Knowledge Discovery in Databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="156" to="171" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Explainable Adaptive Tree-based Model Selection for Time-Series Forecasting</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jakobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDM58522.2023.00027</idno>
	</analytic>
	<monogr>
		<title level="m">2023 IEEE International Conference on Data Mining (ICDM)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="180" to="189" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Online explainable model selection for time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Arbitrage of forecasting experts</title>
		<author>
			<persName><forename type="first">V</forename><surname>Cerqueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Torgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">An ensemble pruning primer</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Partalas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vlahavas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Applications of supervised and unsupervised ensemble methods</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1" to="13" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Arbitrating among competing classifiers using learned referees</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ortega</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="470" to="490" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Adaptive mixtures of local experts</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Jacobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Nowlan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="79" to="87" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">An actor-critic ensemble aggregation model for time-series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tavakol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Data Engineering (ICDE)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Grad-CAM: Visual explanations from deep networks via gradient-based localization</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Selvaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cogswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vedantam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">On the complexity of cooperative solution concepts</title>
		<author>
			<persName><forename type="first">X</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Papadimitriou</surname></persName>
		</author>
		<idno type="JSTOR">3690220</idno>
	</analytic>
	<monogr>
		<title level="j">Mathematics of Operations Research</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="257" to="266" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">From local explanations to global understanding with explainable AI for trees</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Lundberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Erion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Degrave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Prutkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Himmelfarb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-I</forename><surname>Lee</surname></persName>
		</author>
		<idno type="DOI">10.1038/s42256-019-0138-9</idno>
	</analytic>
	<monogr>
		<title level="j">Nature Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="56" to="67" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Using dynamic time warping to find patterns in time series</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Berndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clifford</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD workshop</title>
				<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="359" to="370" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Neural network ensembles, cross validation, and active learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krogh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vedelsby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="1995">1995</date>
			<biblScope unit="volume">7</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Managing Diversity in Regression Ensembles</title>
		<author>
			<persName><forename type="first">G</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Wyatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tiňo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1621" to="1650" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">A Comprehensive Review on Explainable AI Techniques, Challenges, and Future Scope</title>
		<author>
			<persName><forename type="first">A</forename><surname>Patil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Patil</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-981-99-3177-4_39</idno>
	</analytic>
	<monogr>
		<title level="m">Intelligent Computing and Networking</title>
				<editor>
			<persName><forename type="first">V</forename><forename type="middle">E</forename><surname>Balas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><forename type="middle">B</forename><surname>Semwal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Khandare</surname></persName>
		</editor>
		<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="517" to="529" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">Interpretable Machine Learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Molnar</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
