<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Online Explainable Ensemble of Tree Models Pruning for Time Series Forecasting ⋆</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Amal</forename><surname>Saadallah</surname></persName>
							<email>amal.saadallah@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Lamarr Institute for Machine Learning and AI</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Online Explainable Ensemble of Tree Models Pruning for Time Series Forecasting ⋆</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">F23FBC5C73D5EE07D1D0948CEF747EA8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Tree Models</term>
					<term>Online Ensemble Pruning</term>
					<term>TreeSHAP</term>
					<term>Time Series Forecasting</term>
					<term>Concept-drift</term>
					<term>Explainability</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Tree-based models are commonly used in time series forecasting due to their inherent interpretability, which makes them preferable to more complex black-box models. However, simple tree-based models are prone to overfitting, limiting their applicability in real-world scenarios. Ensembles of tree-based models are employed to mitigate this, but ensemble pruning is challenging, especially in the presence of dynamic time series data and concept drift. In this paper, we use TreeSHAP, a tree-specific explainability tool, to perform online tree-based ensemble pruning that adapts dynamically to changes in the time series, addressing the concept drift issue. Empirical evaluations on real-world time series datasets demonstrate that our method performs on par with or better than state-of-the-art techniques. In future research, we plan to automate the determination of the optimal number of clusters for ensemble pruning by leveraging ensemble properties like diversity, accuracy, and stability. This automation aims to enhance both the flexibility and explainability of the model selection process. Given that this work is in its early stages, we seek feedback and collaboration with experts to create a robust and explainable framework for ensemble-based time series forecasting.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Time series forecasting is crucial for real-time planning and decision-making across various fields like traffic management, weather prediction, and financial markets. However, it is also one of the most challenging tasks due to the complex and dynamic nature of time series data, which often involves non-stationary variations and is susceptible to concept drift <ref type="bibr" target="#b0">[1]</ref>. This makes accurate forecasting inherently difficult, necessitating models that can adapt to changing data patterns <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. Given these challenges, explainability in forecasting models has become increasingly important, especially for safety-critical applications. Tree-based models are often favored for their intrinsic explainability, but identifying appropriate models for specific time series requires adaptability due to time-varying characteristics. Decision Trees and their ensembles, like Random Forests and Gradient-boosted Trees, are commonly used for time series forecasting. However, these models can struggle with dynamic data since they typically operate in a static manner, not inherently considering variations in the underlying time series. In addition, combining multiple models into ensembles can improve forecasting accuracy, but at the cost of explainability. To address these issues, we propose an online ensemble pruning approach for time series forecasting, where the ensemble members are selected based on an adaptive clustering procedure that uses TreeSHAP values to group models with similar modeling paradigms. 
This methodology not only ensures diversity within the ensemble but also allows for an explainable selection process by indicating which aspects of the time series data contribute most to the predictions.</p><p>In our future research, one key goal is to automatically determine the optimal number of clusters, which corresponds to the ideal number of trees or ensemble members in the ensemble. This would involve using ensemble properties such as diversity, accuracy, and stability to guide the selection of the most suitable cluster count. By automating this process, we aim to improve the ensemble's flexibility and effectiveness in adapting to dynamic time series data. Moreover, we intend to deepen the explainability aspect of our approach by explicitly demonstrating that selecting models based on different TreeSHAP values aligns with distinct modeling paradigms and hypotheses. This could be achieved by visualizing or analyzing how these varying TreeSHAP values translate into different interpretations of the underlying data, providing insights into the rationale behind model selection. Given that this is early-stage work, we plan to engage with experts in the field to exchange ideas and gather feedback. Collaboration with specialists will be instrumental in refining our methodology for selecting the optimal number of trees and enhancing explainability. By incorporating diverse perspectives, we hope to develop a robust and transparent approach that addresses the complexities of time series forecasting while maintaining clarity in model selection and ensemble pruning. This collaborative effort will contribute to building a reliable framework for ensemble-based forecasting, with a particular emphasis on explainability and adaptability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>Our proposed method uses TreeSHAP for online ensemble pruning using model clustering. First, we define the used notation. Second, we describe Shapley values with a focus on TreeSHAP values <ref type="bibr" target="#b7">[8]</ref>. Third, we show how we generate the candidate tree-based models. Finally, we demonstrate how TreeSHAP values are used for model clustering to allow for efficient ensemble pruning and how the whole process is made adaptive to the changes in the time series.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Preliminaries</head><p>A time series 𝑋 is a temporal sequence of values. The ensemble weights are constrained to be positive and sum to one. In addition, it can be seen from the notation that the weights are time-dependent. This is one of the requirements in online ensemble learning, where the weights have to be set in a timely manner to cope with the dynamic nature of the time series and the time-changing performance of the ensemble members <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. The goal of dynamic online ensemble pruning is to identify the subset of models S ⊂ T that should compose the ensemble at each time step 𝑡 + ℎ such that the expected prediction error of the pruned ensemble 𝑇 ¯S is reduced compared to that of the full ensemble 𝑇 ¯T for each forecast.</p><formula xml:id="formula_0">𝑋 1:𝑡 = {𝑥 1 , 𝑥 2 , • • • , 𝑥 𝑡 }</formula><formula xml:id="formula_1">𝑎𝑟𝑔𝑚𝑎𝑥 S⊂T E [︀(︀ 𝑥 𝑡+ℎ − 𝑇 ¯T(𝑥 ^𝑡+ℎ ) )︀ 2 |𝑋 1:𝑡+ℎ−1 ]︀ − E [︀(︀ 𝑥 𝑡+ℎ − 𝑇 ¯S(𝑥 ^𝑡+ℎ ) )︀ 2 |𝑋 1:𝑡+ℎ−1 ]︀<label>(1)</label></formula></div>
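As a minimal sketch (not the authors' implementation), the convex combination of member forecasts and the pruning objective of Eq. (1) can be written as follows; the function names and the plug-in estimate of the expected squared error over a validation window are assumptions:

```python
def ensemble_forecast(forecasts, weights):
    """Convex combination of member forecasts; weights are positive and sum to one."""
    assert all(w > 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return sum(w * f for w, f in zip(weights, forecasts))

def pruning_gain(errors_full, errors_pruned):
    """Empirical version of Eq. (1): mean squared forecast error of the full
    ensemble minus that of the pruned ensemble over a validation window.
    A positive value indicates that pruning reduces the expected error."""
    mse = lambda errors: sum(e * e for e in errors) / len(errors)
    return mse(errors_full) - mse(errors_pruned)
```

In practice, the subset S maximizing this gain would be searched via the clustering procedure of Section 2.2 rather than by exhaustive enumeration.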
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">TreeSHAP Ensemble Learning</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">Ensemble Pruning</head><p>We divide the time series </p><formula xml:id="formula_2">𝑋 1:𝑡 into 𝑋 𝑡𝑟𝑎𝑖𝑛 𝜔 = {𝑥 1 , 𝑥 2 , • • • , 𝑥 𝑡−𝜔 } and 𝑋 𝑣𝑎𝑙 𝜔 = {𝑥 𝑡−𝜔+1 , 𝑥 𝑡−𝜔+2 , • • • , 𝑥 𝑡 },</formula><formula xml:id="formula_3">𝐼 𝑗 𝑖 = 1 𝜔 𝜔 ∑︁ 𝑘=1 |𝜑 𝑗 𝑖 (𝑥 𝑡−𝜔+𝑘 )|, ∀ 𝑖 ∈ [1, 𝑙 𝑗 ], ∀ 𝑇 𝑗 ∈ 𝒯<label>(2)</label></formula><p>Each model 𝑇 𝑗 ∈ 𝒯 can then be characterized by a vector</p><formula xml:id="formula_4">I 𝑗 = {𝐼 𝑗 1 , 𝐼 𝑗 2 , • • • 𝐼 𝑗 𝑙 𝑗 }.</formula><p>The models can thus be clustered using their SHAP-based lag importance vectors I 𝑗 . However, different models in 𝒯 might be trained using different lag values. As a result, the length of the vectors I 𝑗 can vary between 𝑙 𝑚𝑖𝑛 and 𝑙 𝑚𝑎𝑥 . There exist clustering distance measures that can handle vectors of different lengths <ref type="bibr" target="#b8">[9]</ref>. However, we are mainly interested in grouping models based on the way they represent the relationship between the input lagged values and the output. Therefore, we assume that models trained using a lag value 𝑙 𝑗 lower than 𝑙 𝑚𝑎𝑥 ignore the importance and the contribution of lagged features greater than 𝑙 𝑗 . In other words, if the model 𝑇 𝑗 is trained on 𝑙 𝑗 ≤ 𝑙 𝑚𝑎𝑥 , then for each 𝑖 such that 𝑙 𝑗 &lt; 𝑖 ≤ 𝑙 𝑚𝑎𝑥 , its corresponding SHAP-based lag importance 𝐼 𝑗 𝑖 is set to zero. In this manner, we bring all the vectors I 𝑗 for all the models 𝑇 𝑗 ∈ 𝒯 to the same length 𝑙 𝑚𝑎𝑥 , and we use K-means with Euclidean distance for model clustering. Models belonging to different clusters are expected to model the contributions of different lagged values to the predictions differently, which boosts the ensemble diversity. Only cluster representatives take part in the ensemble: we simply select the model closest to each cluster center.</p></div>
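A minimal sketch of these steps, under the assumption that the per-observation TreeSHAP values have already been computed (e.g. with a TreeSHAP implementation) and that cluster centers come from a K-means run; all function names are illustrative:

```python
import math

def lag_importance(shap_values):
    """Eq. (2): mean absolute SHAP value per lag over the validation window.
    shap_values is a list of per-observation lists, one SHAP value per lag."""
    w, n_lags = len(shap_values), len(shap_values[0])
    return [sum(abs(row[i]) for row in shap_values) / w for i in range(n_lags)]

def pad_importance(importance, l_max):
    """Zero-pad to length l_max: lags beyond a model's own lag l_j are treated
    as contributing nothing, as assumed in the text."""
    return importance + [0.0] * (l_max - len(importance))

def cluster_representatives(vectors, centers):
    """For each cluster center, return the index of the closest model vector
    in Euclidean distance; these representatives form the pruned ensemble."""
    return [min(range(len(vectors)), key=lambda i: math.dist(vectors[i], c))
            for c in centers]
```

Zero-padding keeps all vectors comparable under the plain Euclidean distance that K-means assumes, at the cost of treating "lag not used" and "lag used but unimportant" identically.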
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2.">Ensemble Adaptation</head><p>Streaming time series data is prone to significant changes, leading to concept drifts. To account for these shifts, the selection of ensemble members must be updated, allowing for the inclusion of models that can better address newly emerging patterns. Concept drift is detected by monitoring deviations in the mean of the time series over time, using the Hoeffding Bound to evaluate whether these deviations are significant. If a drift is detected, an alarm is triggered, the TreeSHAP-based model clustering is updated, and the ensemble is adjusted to reflect the new patterns in the data.</p></div>
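The exact detector configuration is not specified in the text; one common realization, sketched here under that caveat, compares the means of a reference and a recent window against the Hoeffding bound:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the empirical mean of n
    observations whose values span a range of width value_range deviates from
    the true mean by at most this epsilon."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def drift_detected(reference, recent, delta=0.05):
    """Raise a drift alarm when the two window means differ by more than the
    Hoeffding bound, i.e. by more than random fluctuation can explain."""
    n = min(len(reference), len(recent))
    lo, hi = min(reference + recent), max(reference + recent)
    eps = hoeffding_bound(hi - lo, delta, n)
    mean = lambda window: sum(window) / len(window)
    return abs(mean(reference) - mean(recent)) > eps
```

On an alarm, the TreeSHAP values would be recomputed on the newest validation window and the clustering step rerun.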
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>Our method is denoted in the following as OEP-TT: Online explainable Ensemble Pruning of Tree models for Time series forecasting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Experimental Setup</head><p>We use 100 univariate time series datasets from various application domains, including financial, weather, and synthetic data. These datasets are provided by the Monash Forecasting Repository <ref type="bibr" target="#b9">[10]</ref>. We process each time series 𝑋 by using the first 50% for training (𝑋 𝑡𝑟𝑎𝑖𝑛 𝜔 ), the following 25% for validation (𝑋 𝑣𝑎𝑙 𝜔 ), and the remaining 25% for testing. Due to this way of splitting the time series, we discard series that are shorter than 250 observations to ensure enough training and validation data. All experiments have been performed on consumer hardware, namely a 2022 MacBook Pro, in R.</p></div>
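The chronological 50/25/25 split can be sketched as follows (a trivial helper, shown in Python for consistency with the other sketches even though the experiments ran in R):

```python
def split_series(x):
    """Chronological 50% / 25% / 25% split into train, validation, and test,
    following the experimental setup; no shuffling, since order matters."""
    n = len(x)
    t_end, v_end = n // 2, n // 2 + n // 4
    return x[:t_end], x[t_end:v_end], x[v_end:]
```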
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">OEP-TT Setup</head><p>Tree-based models set-up: We construct a pool 𝒯 of tree-based models using different parameter settings that are summarized in Table <ref type="table" target="#tab_2">1</ref>, which lists the hyper-parameter values of the tree-based models; different configurations are generated by taking combinations of these hyper-parameters, as described in its last column. The list of parameters and their value ranges in Table <ref type="table" target="#tab_2">1</ref> is not exhaustive, and further parameters and values can be considered to generate more base learners. We also vary the lag parameter 𝑙 on which the tree-based models are trained, i.e., 𝑙 ∈ {3, 5, 7, 10, 15, 20}. Considering different combinations of all the parameters, we train a total of 294 tree-based models.</p></div>
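Generating the pool amounts to enumerating hyper-parameter combinations; the grid values below are hypothetical placeholders (the actual values are those of Table 1), except for the lag values, which are taken from the text:

```python
from itertools import product

# hypothetical hyper-parameter grid; the paper's actual values are in Table 1
n_estimators = [50, 100, 200]
max_depth = [4, 8]
lags = [3, 5, 7, 10, 15, 20]  # lag values as given in the text

configs = [
    {"n_estimators": n, "max_depth": d, "lag": l}
    for n, d, l in product(n_estimators, max_depth, lags)
]
```

Each entry in `configs` would then be used to train one candidate tree-based model on its corresponding lagged feature representation.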
<div xmlns="http://www.tei-c.org/ns/1.0"><head>OEP-TT set-up:</head><p>OEP-TT also has a number of hyper-parameters: 𝑀, the size of the pool of tree-based models T (294); 𝜔, the size of the validation time window (25% of the data length); and |𝑆|, the number of finally selected models (6).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">State-of-the-Art Methods Setup</head><p>We compare OEP-TT against State-of-the-Art (SoA) methods for online ensemble pruning, tree-based ensembles, and time series forecasting in general. These models include: Auto-Regressive Integrated Moving Average (ARIMA) <ref type="bibr" target="#b10">[11]</ref>, Exponential Smoothing (ETS) <ref type="bibr" target="#b10">[11]</ref>, Long Short-Term Memory (LSTM) <ref type="bibr" target="#b11">[12]</ref>, Multi-Layer Perceptron (MLP) <ref type="bibr" target="#b11">[12]</ref>, Convolutional Neural Network with LSTM (CNN-LSTM, Bi-LSTM) <ref type="bibr" target="#b12">[13]</ref>, Random Forest (RF) <ref type="bibr" target="#b13">[14]</ref>, Gradient-Boosted Decision Trees (GBDT) <ref type="bibr" target="#b14">[15]</ref>, eXtreme Gradient Boosting (XGBoost) <ref type="bibr" target="#b15">[16]</ref>, and Light Gradient-Boosting Machine (LGBM) <ref type="bibr" target="#b16">[17]</ref>.</p><p>To enable a fair comparison with OEP-TT, we feed these ensemble pruning methods the same pool of tree-based models T that was used for OEP-TT: Ens: ensemble of all the base models in T; OCL <ref type="bibr" target="#b4">[5]</ref>: online drift-aware clustering of the tree-based models in T using covariance-based clustering; OTOP <ref type="bibr" target="#b4">[5]</ref>: online drift-aware ranking of the top best-performing tree-based models using temporal correlation analysis; DEMSC <ref type="bibr" target="#b4">[5]</ref>: Dynamic Ensemble Members Selection using Clustering, which combines online drift-aware ranking of the top best-performing models via temporal correlation analysis with covariance-based clustering; ADE <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref>: a recently developed method for online dynamic construction of ensembles of forecasters, which uses a meta-learning strategy to specialize the tree-based models across the input time series and a sequential weighting schema that automatically deselects ensemble members by setting their weights to zero.</p><p>We also compare OEP-TT to its variants: OEP-TT-ST: static variant of OEP-TT, where pruning is decided at the initial forecasting instant and kept fixed along testing; OEP-TT-Per: pruning is updated periodically in a blind manner (i.e., without considering the occurrence of drift).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1.">Predictive Performance</head><p>Table <ref type="table" target="#tab_4">2</ref> presents the average ranks and their deviations for OEP-TT, its variants, and SoA methods for time series forecasting and online ensemble pruning. For the paired comparison, we compare our method OEP-TT against each of the other methods, counting wins and losses for each dataset using the RMSE scores. We use the non-parametric Wilcoxon Signed-Rank test to compute significant wins and losses, which are presented in parentheses (significance level 0.05). As the results in Table <ref type="table" target="#tab_4">2</ref> show, OEP-TT outperforms almost all the baseline methods in terms of ranks and wins/losses in pairwise comparison. The table reports the comparison (in terms of average rank achieved over 100 datasets) between our method and the baselines: the rank column presents the average rank and its standard deviation across different time series, where an average rank of 1 means the model was the best performing on all time series.</p><p>In the following, we show how OEP-TT supports explainability regarding the reasons behind the selection of specific tree-based models to construct the ensemble at a specific time instant or interval, regarding model performance, and regarding the importance of input lagged time series observations.</p></div>
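The per-dataset win/loss counting can be sketched as below; significance would additionally be assessed with a Wilcoxon Signed-Rank test (e.g. via a statistics package), which is omitted here, and the helper names are illustrative:

```python
import math

def rmse(y, y_hat):
    """Root mean squared error of one forecast series against the ground truth."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def wins_losses(rmse_ours, rmse_baseline):
    """Count per-dataset wins and losses of our method against one baseline,
    comparing the RMSE scores achieved on each dataset."""
    wins = sum(a < b for a, b in zip(rmse_ours, rmse_baseline))
    losses = sum(a > b for a, b in zip(rmse_ours, rmse_baseline))
    return wins, losses
```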
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Explainability Aspects</head><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the clusters of TreeSHAP values of the Saugeen River Flow data set. The dots on each lag value stand for the TreeSHAP values taken by the models belonging to the same cluster, while the line connects the mean values to show the TreeSHAP values for each lag of the representative model selected from each cluster (only for visualization purposes). Note that we show the name of the model and the value of the first hyper-parameter plus the lag value on which it is trained to distinguish between selected models belonging to the same family of tree-based models, e.g., RF200(Lag10) and RF50(Lag7). It can be seen that different clusters exhibit different patterns of lagged-value contributions to the target time series observations. This confirms that our clustering procedure promotes ensemble diversity by enforcing the selection of models that have different modeling paradigms and distinct views on the importance of specific lag values. For example, while models in cluster 6 favor higher lag values and emphasize the contribution of their corresponding value to the output forecast value, models in cluster 5 are built on the assumption of restricting the memory of the models to lower lag values (𝑙 = 3). We can notice that in 3 clusters out of 6, models rely on restricted lagged values (clusters 2, 3, and 5). Even with this limited width of memory, i.e., historical data, these models can excel in terms of predictive performance. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Concluding Remarks and Future Work</head><p>This paper introduces OEP-TT, a novel method for online adaptive pruning of ensembles of tree-based models. Through the use of TreeSHAP values, we are able to gain insight into its decision-making process, both for model selection and for the relevance of the input time series points. We showed the advantages of OEP-TT on 100 real-world datasets, both in terms of predictive performance and of its explainability aspects. In future work, we plan to extend our method to hybrid model pools by using the most efficient Shapley value estimation method for each model family, such as TreeSHAP for tree-based models, DeepSHAP <ref type="bibr" target="#b7">[8]</ref> for neural networks, and KernelSHAP <ref type="bibr" target="#b7">[8]</ref> for the remaining models; to tune the size of the ensemble; and to dive further into the explainability aspects. Given that this is early-stage work, we plan to engage with experts in the field to exchange ideas and gather feedback.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Comparison of TreeSHAP-based models clusters on the Saugeen River Flow data set.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>𝑋 1:𝑡 = {𝑥 1 , 𝑥 2 , • • • , 𝑥 𝑡 } is a sequence of 𝑋 until time 𝑡 and 𝑥 𝑖 is the value of 𝑋 at time 𝑖. Denote with T = {𝑇 1 , 𝑇 2 , • • • , 𝑇 𝑀 } the pool of 𝑀 tree-based models trained to approximate a true unknown function 𝑓 that generated 𝑋. Let 𝑥 ^𝑡+ℎ = (𝑥 ^𝑇 1 𝑡+ℎ , 𝑥 ^𝑇 2 𝑡+ℎ , • • • , 𝑥 ^𝑇 𝑀 𝑡+ℎ ) be the vector of forecast values of 𝑋 at a future time instant 𝑡 + ℎ, ℎ ≥ 1, produced by each of the models in T. An ensemble model 𝑇 ¯T of T at time instant 𝑡 + ℎ can be formally expressed as a convex combination of the forecasts of the models in T: 𝑇 ¯T(𝑥 ^𝑡+ℎ ) = ∑︀ 𝑀 𝑗=1 𝑤 𝑗 𝑡+ℎ 𝑥 ^𝑇 𝑗 𝑡+ℎ , where 𝑤 𝑗 𝑡+ℎ , 𝑗 ∈ [1, 𝑀 ], are the ensemble weights.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>with 𝜔 a provided window size. 𝑋 𝑡𝑟𝑎𝑖𝑛 𝜔 is used for training the models in T and 𝑋 𝑣𝑎𝑙 𝜔 is used to compute the TreeSHAP values. For each tree-based model 𝑇 𝑗 ∈ 𝒯 , for each observation 𝑥 𝑡−𝜔+𝑘 ∈ 𝑋 𝑣𝑎𝑙 𝑙 𝑗 ], where 𝑙 𝑗 is the number of lags on which the model 𝑇 𝑗 is trained. Then, we aggregate absolute SHAP values over all the observations in 𝑋 𝑣𝑎𝑙 𝜔 to acquire SHAP-based lag importance 𝐼 𝑗 𝑖 for each lag 𝑖 ∈ [1, 𝑙 𝑗 ] using the model 𝑇 𝑗 :</figDesc><table /><note>𝜔 with 𝑘 ∈ [1, 𝜔], we compute a TreeSHAP value 𝜑 𝑗 𝑖 (𝑥 𝑡−𝜔+𝑘 ) for each lagged value, i.e., 𝑖 ∈ [1,</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Tree-based Model</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 2</head><label>2</label><figDesc></figDesc><table /></figure>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>⋆ This research has partly been funded by the Federal Ministry of Education and Research of Germany and the state of North-Rhine-Westphalia as part of the Lamarr-Institute for Machine Learning and Artificial Intelligence.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey on concept drift adaptation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Žliobaitė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bifet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pechenizkiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bouchachia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM computing surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="1" to="37" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Explainable online deep neural network selection using adaptive saliency maps for time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning and Knowledge Discovery in Databases</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Oliver</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Pérez-Cruz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Kramer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Read</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Lozano</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="404" to="420" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Explainable online ensemble of deep neural network pruning for time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">111</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Online adaptive multivariate time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mykula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European conference on machine learning and knowledge discovery in databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A drift-based dynamic ensemble members selection using clustering for time series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Priebe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European conference on machine learning and knowledge discovery in databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An actor-critic ensemble aggregation model for time-series forecasting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tavakol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Morik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE ICDE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Jakobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Saadallah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.01124</idno>
		<title level="m">Explainable adaptive tree-based model selection for time series forecasting</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A Unified Approach to Interpreting Model Predictions</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Lundberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-I</forename><surname>Lee</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><forename type="middle">V</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<biblScope unit="page" from="4765" to="4774" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Using dynamic time warping to find patterns in time series</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Berndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clifford</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD workshop</title>
				<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="359" to="370" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Monash time series forecasting archive</title>
		<author>
			<persName><forename type="first">R</forename><surname>Godahewa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bergmeir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">I</forename><surname>Webb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Hyndman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Montero-Manso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems Track on Datasets and Benchmarks</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>Forthcoming</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Box</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Jenkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">C</forename><surname>Reinsel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Ljung</surname></persName>
		</author>
		<title level="m">Time series analysis: forecasting and control</title>
		<imprint>
			<publisher>John Wiley &amp; Sons</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Applying LSTM to time series predictable through time-window approaches</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Gers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Nets WIRN Vietri-01</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="193" to="200" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Time-series forecasting of indoor temperature using pre-trained deep neural networks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Romeu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zamora-Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Botella-Rocamora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pardo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on artificial neural networks</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="451" to="458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A gradient boosting approach to the kaggle load forecasting competition</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Taieb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Hyndman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of forecasting</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="382" to="394" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">XGBoost: extreme gradient boosting</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Benesty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Khotilovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Cano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">R package version</title>
		<imprint>
			<biblScope unit="volume">0</biblScope>
			<biblScope unit="issue">4-2 1</biblScope>
			<biblScope unit="page" from="1" to="4" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">LightGBM: A highly efficient gradient boosting decision tree</title>
		<author>
			<persName><forename type="first">G</forename><surname>Ke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Finley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Arbitrated ensemble for time series forecasting</title>
		<author>
			<persName><forename type="first">V</forename><surname>Cerqueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Torgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European conference on machine learning and knowledge discovery in databases</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="478" to="494" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Arbitrage of forecasting experts</title>
		<author>
			<persName><forename type="first">V</forename><surname>Cerqueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Torgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
