<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ICDM58522.2023.00027</article-id>
      <title-group>
        <article-title>Online Explainable Forecasting using Regions of Competence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amal Saadallah</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Jakobs</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>10</volume>
      <fpage>180</fpage>
      <lpage>189</lpage>
      <abstract>
        <p>Several machine learning models have been applied to time series forecasting. However, it is generally acknowledged that none of these models is universally valid for every application and over time. This is due to the complex and changing nature of time series data that may involve non-stationary processes, but can also be explained by the fact that different models have varying expected regions of expertise, so-called Regions of Competence (RoCs), over the time series. Therefore, adequate and adaptive online model selection is often required. In this work, we review online model selection works that exploit the notion of RoCs by summarizing their methods and highlighting their strengths as well as limitations. In particular, we present a taxonomy of these methods and show how they can be exploited both for single model selection and for ensemble pruning. We additionally discuss how RoCs can promote explainability. Finally, we suggest future directions to provide useful research guidance.</p>
      </abstract>
      <kwd-group>
        <kwd>Time Series Forecasting</kwd>
        <kwd>Explainability</kwd>
        <kwd>Model Selection</kwd>
        <kwd>Ensembling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Time series forecasting is considered a key step in making informed decisions in a wide range of
applications [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Several Machine Learning (ML) models have been applied to the time series
forecasting task [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. While certain guidelines, such as task complexity or data size, can help select
a suitable family of ML models for forecasting [7], it is generally acknowledged that no existing ML
model is universally applicable to all forecasting problems. This observation aligns with Wolpert’s No
Free Lunch theorem [8], suggesting that no learning algorithm can be optimal for all learning tasks. In
addition, ML models may demonstrate a time-dependent performance [9, 10]. This means their accuracy
may not remain consistent over time due to the dynamic nature of time series data, which may involve
non-stationary processes and, as a result, be subject to the so-called concept drift phenomenon [11].
Thus, it is clear that different forecasting models have different expected areas of expertise placed over
different parts of the input time series. The part where a specific model outperforms the other candidate
models is referred to as a Region of Competence (RoC) of that model [
        <xref ref-type="bibr" rid="ref5">5, 12, 9</xref>
        ].
      </p>
      <p>
        Several works in the ML literature have used this notion of RoCs either implicitly [
        <xref ref-type="bibr" rid="ref5">5, 13, 12</xref>
        ] or
explicitly [9, 10, 14, 15, 16] to perform online model selection for forecasting that copes with the
time-evolving nature of time series and the fact that models have a certain expected level of competence in
predicting a particular region of the time series. While some of these works have focused on the online
selection of a single model [9, 15, 16], others have built on the assumption that no single model is
an expert all the time, and have proposed to adaptively select and combine multiple forecasting models
into a single model using an ensemble technique [
        <xref ref-type="bibr" rid="ref5">5, 17, 12, 10, 14</xref>
        ]. The selection of individual models
that make up the ensemble is referred to as ensemble pruning [18].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Regions of Competence Computation</title>
      <p>Let Y be a univariate time series and let y_t be the value of that time series at time t. Additionally,
we denote with y_{i:j} the subsequence entailing the values from t = i to t = j with i &lt; j. A Region of
Competence (RoC) for a forecasting model m ∈ P, where P is a set of given models, refers to a set of
subsequences y_{i:j} where m exhibits superior performance compared to all other models in P. However,
pinpointing these subsequences and determining their optimal length through empirical evaluations
is computationally infeasible. Recent literature has leveraged machine learning to approximate these
RoCs. These approaches fall into two main categories. The first category includes model-agnostic
methods that compute RoCs independently of the family of forecasting models. Within this category, two
primary machine learning paradigms have been used, namely pattern matching and meta-learning. The
second category comprises model-specific methods tailored to a particular class of forecasting models,
such as tree-based models or DNNs.</p>
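      <p>As a concrete illustration of the definition above, the following Python sketch performs the brute-force RoC computation for one fixed subsequence length: every subsequence is assigned to the pool model that forecasts its successor with the lowest squared error. The toy pool (a last-value and a window-mean forecaster) and all names are hypothetical, and this exhaustive search over subsequences and lengths is exactly what becomes infeasible at realistic scales.</p>
      <preformat>
```python
# Brute-force sketch of the RoC definition (toy pool, hypothetical names):
# each fixed-length subsequence is assigned to the model that predicts its
# successor value with the lowest squared error.

def compute_rocs(series, models, length):
    """models: dict mapping a name to a callable that maps a window to a forecast."""
    rocs = {name: [] for name in models}
    for start in range(len(series) - length):
        window = series[start:start + length]
        target = series[start + length]
        errors = {name: (target - f(window)) ** 2 for name, f in models.items()}
        best = min(errors, key=errors.get)          # the locally competent model
        rocs[best].append((start, start + length))  # index range of the RoC member
    return rocs

# Toy pool: a naive last-value forecaster and a window-mean forecaster.
models = {"naive": lambda w: w[-1], "mean": lambda w: sum(w) / len(w)}
series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0]
rocs = compute_rocs(series, models, length=3)
```
      </preformat>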
      <p>The main idea of pattern-matching methods is to find, for the current subsequence, the closest
subsequence in the training or validation data. The model exhibiting the best performance on these
patterns is considered the expert, and the current subsequence is treated as one of its RoCs. In [12],
1-Nearest Neighbors is used to determine the closest pattern. In [16], the known data points are first
clustered. Then, each model in P is evaluated on all subsequences from each cluster.
The Mean Squared Error (MSE) for each model is computed, and each cluster is assigned as a Region of
Competence (RoC) of the model that minimizes the computed MSE. This ensures that each cluster is
associated with an expert model that excels in predicting values based on the dominant pattern in the
corresponding cluster.</p>
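      <p>A minimal sketch of this cluster-then-assign scheme follows, with a trivial up/down-trend rule standing in for the actual clustering algorithm of [16] and a hypothetical toy pool: each group of subsequences is scored per model by MSE, and the group becomes a RoC of the model minimizing that MSE.</p>
      <preformat>
```python
# Hedged sketch of cluster-based RoC assignment: subsequences are grouped
# (here by a trivial trend rule standing in for real clustering), each pool
# model is scored per group by MSE, and each group is assigned as a RoC of
# the model minimising that MSE.

def cluster_label(window):
    # Stand-in for a clustering algorithm: rising windows vs. the rest.
    return "up" if window[-1] > window[0] else "down"

def assign_clusters(series, models, length):
    clusters = {}
    for start in range(len(series) - length):
        w = series[start:start + length]
        clusters.setdefault(cluster_label(w), []).append((w, series[start + length]))
    assignment = {}
    for label, members in clusters.items():
        mse = {
            name: sum((t - f(w)) ** 2 for w, t in members) / len(members)
            for name, f in models.items()
        }
        assignment[label] = min(mse, key=mse.get)  # expert for this pattern
    return assignment

models = {"naive": lambda w: w[-1], "mean": lambda w: sum(w) / len(w)}
series = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 3.0, 4.0, 3.0]
experts = assign_clusters(series, models, length=3)
```
      </preformat>
      <p>On this toy series the rising windows favor the naive forecaster, while the oscillating ones favor the mean, so each cluster ends up with a different expert.</p>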
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">17, 5</xref>
        ], meta-learning is used to build models capable of modeling the competence of each candidate
model across the input time series. Their meta-learning approach is composed of an arbitrating
architecture [19] and a mixture of experts [20]. A meta-learner is created for each candidate model.
While each model m ∈ P is trained to model the future values of the time series, its meta-learning associate
z_m is trained to model the error of m. The arbiter z_m can then make predictions regarding the error that
m will incur when predicting the future values of the time series. At test time, the candidate models are
weighted according to their expected degree of competence in y_{t−λ:t−1}, estimated by the predictions
of the meta-learners. These methods can be viewed as an implicit exploitation of the RoC concept as
they do not result in some identified subsequences in the input time series but rather a modeling of the
competence of the models using error estimates.
      </p>
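      <p>The arbitrating mechanism can be sketched as follows, with stub meta-learners (hypothetical constants standing in for trained error-regression models) and a simple normalized inverse-error weighting rule; the cited works define their own weighting functions, so this only illustrates the mechanism.</p>
      <preformat>
```python
# Hedged sketch of arbitrated forecasting: one meta-learner per candidate
# model predicts that model's error on the current input, and candidates are
# weighted by normalised inverse predicted error. Meta-learners here are
# hypothetical stubs; in practice they are regression models trained on the
# base models' historical errors.

def arbitrated_forecast(window, models, meta_learners):
    predicted_errors = {name: meta_learners[name](window) for name in models}
    inv = {name: 1.0 / (e + 1e-8) for name, e in predicted_errors.items()}
    total = sum(inv.values())
    weights = {name: v / total for name, v in inv.items()}
    forecast = sum(weights[name] * models[name](window) for name in models)
    return forecast, weights

models = {"naive": lambda w: w[-1], "mean": lambda w: sum(w) / len(w)}
# Stub meta-learners: "naive" is expected to be far more accurate here.
meta = {"naive": lambda w: 0.1, "mean": lambda w: 0.9}
yhat, weights = arbitrated_forecast([1.0, 2.0, 3.0], models, meta)
```
      </preformat>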
      <p>The works in the model-specific category mainly focus on CNN-based models, comprising 1D-convolutional layers
with varied filter and kernel sizes [21, 10]. More recently, in [14], a hybrid architecture was proposed by
concatenating the convolutional layers to a set of heterogeneous models that can be trained jointly
using a gradient-based approach. These works use the same principle to compute the RoCs.
They suggest a modified version of Grad-CAM [22], which utilizes the spatial information preserved in
convolutional layers to identify important regions of the input. To do so, they start by evaluating
the Mean Squared Error on a specific validation time window V_j, denoted as MSE_{V_j}, for model m.
The objective is to determine the significance of each time point in this window with respect to the
computed error. The last feature-map layer of the CNN is utilized for this purpose. Importance weights
α_u^k associated with MSE_{V_j} are computed for each activation unit u in each generic feature map A^k by
calculating the gradient of MSE_{V_j} relative to u. Then, a global average is computed over all units in
A^k: α^k = (1/U) Σ_u α_u^k, where U is the total number of units in A^k. The weights α^k are used to compute a weighted
combination of all the feature maps for a given measured value of the error MSE_{V_j}. Since we are
mainly interested in highlighting the temporal features contributing most to MSE_{V_j}, ReLU is used to remove
all the negative contributions: H = ReLU(Σ_k α^k A^k). To identify the RoC in V_j that primarily
contributed to MSE_{V_j}, a threshold τ ∈ R on H is used. Note that τ is chosen such that the resulting RoC is smaller than V_j.
Note also that multiple time windows are created from the validation data using a time-sliding approach to evaluate the
performance of the same model on different windows and increase the number of computed RoCs.</p>
      <p>While the methods presented above proved successful in providing an explanation for
the model selection and ensembling processes, the problem of opaque, non-interpretable base models
still persists. Recent work [15] utilized tree-based models to make the model decision process more
transparent, in addition to having an interpretable model selection algorithm. The authors create a
pool of Decision Trees, Random Forests, and Gradient Boosting Trees that are subsequently trained on
the training data. Shapley values, a well-established explainability method for feature attribution, are used to
generate the explanations necessary for the creation of RoCs. Since computing Shapley values is in
general NP-hard [23], the authors utilize TreeSHAP [24], an estimation method designed for
tree-based models that is able to estimate Shapley values in polynomial time. Similar to [9], the loss of
the prediction for each model is explained rather than the prediction itself.</p>
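      <p>The gradient-weighted combination and thresholding step of the Grad-CAM-style computation can be illustrated framework-free as follows. The feature maps and their gradients with respect to the validation MSE are assumed to be precomputed (hand-made lists here stand in for the last convolutional layer), and all names are hypothetical.</p>
      <preformat>
```python
# Framework-free sketch of Grad-CAM-style RoC extraction: gradient-averaged
# importance weights, a ReLU-ed weighted combination of feature maps, and a
# threshold tau selecting the time points that form the RoC.

def extract_roc(feature_maps, gradients, tau):
    # alpha_k: global average of the gradient over all units of feature map k.
    alphas = [sum(g) / len(g) for g in gradients]
    length = len(feature_maps[0])
    # Weighted combination of feature maps; ReLU removes negative contributions.
    heat = [
        max(0.0, sum(a * fm[t] for a, fm in zip(alphas, feature_maps)))
        for t in range(length)
    ]
    # Time points whose importance exceeds tau form the RoC.
    return [t for t in range(length) if heat[t] > tau], heat

feature_maps = [[0.2, 0.9, 0.8, 0.1], [0.1, 0.7, 0.9, 0.0]]
gradients = [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]
roc, heat = extract_roc(feature_maps, gradients, tau=0.5)
```
      </preformat>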
    </sec>
    <sec id="sec-3">
      <title>3. Online Model Selection</title>
      <p>At test time t, after computing the RoCs of the candidate models for selection, an online decision on
model selection for forecasting the value y_t at time t has to be made. Note that the following can be
applied to any future time instant t + h, h ≥ 0. For simplicity of notation, assume h = 0. In this part,
we focus on the works that explicitly compute the RoCs, i.e., result in identified subsequences within
the input time series.</p>
      <p>The models in P are devised such that they use the same λ lagged values of the time series as input.
As a result, at time t, the input subsequence is y_{t−λ:t−1}, t ≥ λ. To perform the selection, the distance
of the input subsequence y_{t−λ:t−1} to the RoCs of each candidate model is measured [9, 12, 16, 15].
Since the length of each RoC can be different from λ (i.e., the length of y_{t−λ:t−1}), Dynamic Time Warping
(DTW) [25] is generally used to measure the similarity between y_{t−λ:t−1} and each RoC member. The
model with the RoC member of smallest DTW distance to y_{t−λ:t−1} is selected for forecasting. In [16],
the RoCs are smoothed using a moving-average process and filtered to keep only the RoCs with length
λ, and DTW is replaced by the Euclidean distance.</p>
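      <p>The distance-based selection step can be sketched as follows, assuming the RoCs are already available: a plain dynamic-programming DTW, after which the model owning the RoC member closest to the current input is selected. The RoC contents and model names below are hypothetical.</p>
      <preformat>
```python
# Sketch of DTW-based single model selection: compute the DTW distance from
# the current input window to every RoC member of every model, and select the
# model owning the closest member.

def dtw(a, b):
    inf = float("inf")
    # cost[i][j]: DTW distance between a[:i] and b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def select_model(window, rocs):
    """rocs: dict mapping a model name to its RoC members (of varying lengths)."""
    return min(
        rocs,
        key=lambda name: min(dtw(window, member) for member in rocs[name]),
    )

rocs = {
    "naive": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0]],
    "mean": [[3.0, 3.0, 3.0]],
}
chosen = select_model([2.0, 3.0, 4.0], rocs)
```
      </preformat>
      <p>Because DTW aligns sequences of unequal lengths, the four-point RoC member can still be matched against the three-point input, which is the reason it is preferred over the Euclidean distance here.</p>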
      <p>Recent works have extended the RoC concept from single model selection to ensemble pruning
[16, 14, 10]. They based their reasoning on the expected error of the ensemble, which can be decomposed at
each time point into a weighted average error term of the individual composing models and an ambiguity
term, which is simply the variance of the ensemble around the weighted mean of its composing models.
The weighted error term reflects the ensemble accuracy, while the ambiguity term reflects the ensemble
diversity [26, 27]. Since one model can have multiple RoCs, the first step is to select one representative
RoC for each model based on the topological closeness to the current input subsequence y_{t−λ:t−1}.
In this manner, selecting the same model multiple times is avoided, which contributes to the ensemble
diversity [16, 14]. In [10], diversity is promoted further by clustering the RoC representatives so that
selecting models with similar competencies or expertise is avoided. The second step consists of ranking
the models by the distance of their representative RoCs to the current input subsequence y_{t−λ:t−1}
to promote accuracy. The top-k closest models are selected to make up the ensemble. The number k is
arbitrarily set to a fixed value in [16], while the authors in [10] derived bounds for both error terms to set
k automatically.</p>
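      <p>The two-step pruning can be sketched as follows, with the Euclidean distance standing in for DTW (assuming equal-length RoC members) and a fixed top-k: first one representative RoC per model is chosen, then the k models whose representatives lie closest to the input are kept. Pool contents and names are hypothetical.</p>
      <preformat>
```python
# Hedged sketch of RoC-based ensemble pruning: pick one representative RoC
# per model (the member closest to the input), rank models by that distance,
# and keep the top-k for the ensemble.

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def prune_ensemble(window, rocs, k):
    representative = {
        name: min(members, key=lambda m: euclid(window, m))
        for name, members in rocs.items()
    }
    ranked = sorted(representative, key=lambda name: euclid(window, representative[name]))
    return ranked[:k]

rocs = {
    "naive": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    "mean": [[3.0, 3.0, 3.0]],
    "drift": [[9.0, 9.0, 9.0]],
}
ensemble = prune_ensemble([2.0, 3.0, 4.0], rocs, k=2)
```
      </preformat>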
    </sec>
    <sec id="sec-4">
      <title>4. Explainability Aspects</title>
      <p>Most studies mentioned earlier emphasized the importance of transparency in models and learning
decisions, including the selection of models [9, 10, 16, 14, 15]. They demonstrated that the computed
Regions of Competence serve as a valuable tool for explaining why a specific forecast value is generated
at a given time and for choosing a particular individual model or ensemble member within a specific
time frame or interval. In essence, these RoCs act as an explanatory mechanism, shedding light on
the decision-making process behind forecast outputs and model selections. More specifically, they are
part of the so-called exemplar-based or prototype-based explainability approaches [28, 29]. Thus, local
explanations, i.e., why a certain model was chosen at time t, can be given by visually inspecting the
RoCs of the candidate models and their distances to the current input subsequence.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work provides a unified view of several methods for online adaptive time series forecasting
that are based on Regions of Competence. These regions not only enable state-of-the-art selection and
ensemble performance but can also be used to gain insights into the selection and ensembling process
on a local as well as a global level.</p>
      <p>One avenue of further research could be concerned with utilizing model-agnostic explainability
methods such as KernelSHAP to enable heterogeneous model selection and ensembling. However, such
a method may be limited by runtime, which would need to be improved to allow for concept drift
adaptation. Another direction of future work might be to improve overall explainability by limiting the
model pool to small, transparent models. An open question remains whether it is possible to select or ensemble
these small models using Regions of Competence in a way that reaches state-of-the-art performance
compared to opaque approaches such as selecting from pools of Deep Neural Networks.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research has been funded by the Federal Ministry of Education and Research of Germany and the
state of North Rhine-Westphalia as part of the Lamarr Institute for Machine Learning and Artificial
Intelligence.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>De Gooijer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hyndman</surname>
          </string-name>
          ,
          <article-title>25 years of time series forecasting</article-title>
          ,
          <source>International journal of forecasting 22</source>
          (
          <year>2006</year>
          )
          <fpage>443</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreira-Matias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sousa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Khiari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jenelius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <article-title>BRIGHT - drift-aware demand predictions for taxi networks</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Godahewa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bergmeir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hyndman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Montero-Manso</surname>
          </string-name>
          ,
          <article-title>Monash time series forecasting archive</article-title>
          ,
          <source>in: Neural Information Processing Systems Track on Datasets and Benchmarks</source>
          ,
          <year>2021</year>
          . Forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Atiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Gayar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>El-Shishiny</surname>
          </string-name>
          ,
          <article-title>An empirical comparison of machine learning models for time series forecasting</article-title>
          ,
          <source>Econometric reviews 29</source>
          (
          <year>2010</year>
          )
          <fpage>594</fpage>
          -
          <lpage>621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cerqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Arbitrated ensemble for time series forecasting</article-title>
          ,
          <source>in: Joint European conference on machine learning and knowledge discovery in databases</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zohren</surname>
          </string-name>
          ,
          <article-title>Time-series forecasting with deep learning: a survey</article-title>
          ,
          <source>Philosophical Transactions of the Royal Society A</source>
          <volume>379</volume>
          (
          <year>2021</year>
          )
          <fpage>20200209</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>