<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Online Explainable Ensemble of Tree Models Pruning for Time Series Forecasting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amal Saadallah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lamarr Institute for Machine Learning and AI</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tree-based models are commonly used in time series forecasting due to their inherent interpretability, which makes them preferable to more complex black-box models. However, simple tree-based models are prone to overfitting, limiting their applicability in real-world scenarios. Ensembles of tree-based models are employed to mitigate this, but ensemble pruning is challenging, especially in the presence of dynamic time series data and concept drift. In this paper, we use TreeSHAP, a tree-specific explainability tool, to perform online tree-based ensemble pruning that adapts dynamically to changes in the time series, addressing the concept drift issue. Empirical evaluations on real-world time series datasets demonstrate that our method performs on par with or better than state-of-the-art techniques. In future research, we plan to automate the determination of the optimal number of clusters for ensemble pruning by leveraging ensemble properties like diversity, accuracy, and stability. This automation aims to enhance both the flexibility and explainability of the model selection process. Given that this work is in its early stages, we seek feedback and collaboration with experts to create a robust and explainable framework for ensemble-based time series forecasting.</p>
      </abstract>
      <kwd-group>
        <kwd>Tree Models</kwd>
        <kwd>Online Ensemble Pruning</kwd>
        <kwd>TreeSHAP</kwd>
        <kwd>Time Series Forecasting</kwd>
        <kwd>Concept-drift</kwd>
        <kwd>Explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Time series forecasting is crucial for real-time planning and decision-making across various
fields like traffic management, weather prediction, and financial markets. However, it is also
one of the most challenging tasks due to the complex and dynamic nature of time series data,
which often involves non-stationary variations and is susceptible to concept drift [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This
makes accurate forecasting inherently difficult, necessitating models that can adapt to changing
data patterns [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2, 3, 4, 5, 6, 7</xref>
        ]. Given these challenges, explainability in forecasting models has
become increasingly important, especially for safety-critical applications. Tree-based models
are often favored for their intrinsic explainability, but identifying appropriate models for specific
time series requires adaptability due to time-varying characteristics. Decision Trees and their
ensembles, like Random Forests and Gradient-boosted Trees, are commonly used for time series
forecasting. However, these models can struggle with dynamic data since they typically operate
in a static manner, not inherently considering variations in the underlying time series. In
addition, combining multiple models into ensembles can improve forecasting accuracy, but
at the cost of explainability. To address these issues, we propose an online ensemble pruning
approach for time series forecasting, where the ensemble members are selected based on an
adaptive clustering procedure that uses TreeSHAP values to group models with similar modeling
paradigms. This methodology not only ensures diversity within the ensemble but also allows for
an explainable selection process by indicating which aspects of the time series data contribute
most to the predictions.
      </p>
      <p>In our future research, one key goal is to automatically determine the optimal number of
clusters, which corresponds to the ideal number of trees or ensemble members in the ensemble.
This would involve using ensemble properties such as diversity, accuracy, and stability to guide
the selection of the most suitable cluster count. By automating this process, we aim to improve
the ensemble’s flexibility and effectiveness in adapting to dynamic time series data. Moreover,
we intend to deepen the explainability aspect of our approach by explicitly demonstrating that
selecting models based on different TreeSHAP values aligns with distinct modeling paradigms
and hypotheses. This could be achieved by visualizing or analyzing how these varying TreeSHAP
values translate into different interpretations of the underlying data, providing insights into
the rationale behind model selection. Given that this is early-stage work, we plan to engage
with experts in the field to exchange ideas and gather feedback. Collaboration with specialists
will be instrumental in refining our methodology for selecting the optimal number of trees
and enhancing explainability. By incorporating diverse perspectives, we hope to develop a
robust and transparent approach that addresses the complexities of time series forecasting while
maintaining clarity in model selection and ensemble pruning. This collaborative effort will
contribute to building a reliable framework for ensemble-based forecasting, with a particular
emphasis on explainability and adaptability.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        Our proposed method uses TreeSHAP for online ensemble pruning via model clustering. First,
we define the notation used. Second, we describe Shapley values with a focus on TreeSHAP
values [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Third, we show how we generate the candidate tree-based models. Finally, we
demonstrate how TreeSHAP values are used for model clustering to allow for efficient ensemble
pruning and how the whole process is made adaptive to the changes in the time series.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Preliminaries</title>
        <p>
A time series $Y$ is a temporal sequence of values, where $y_{1:t} = \{y_1, y_2, \cdots, y_t\}$ is a sequence
of $Y$ until time $t$ and $y_t$ is the value of $Y$ at time $t$. Denote with $\mathcal{T} = \{T_1, T_2, \cdots, T_m\}$ the
pool of $m$ tree-based models trained to approximate a true unknown function $f$ that generated
$Y$. Let $\hat{y}_{t+h} = (\hat{y}^1_{t+h}, \hat{y}^2_{t+h}, \cdots, \hat{y}^m_{t+h})$ be the vector of forecast values of $Y$ at a future time
instant $t+h$, $h \geq 1$ (i.e., of $y_{t+h}$) by each of the models in $\mathcal{T}$. An ensemble model $\bar{f}_{\mathcal{T}}$ of $\mathcal{T}$ at time
instant $t+h$ can be formally expressed as a convex combination of the forecasts of the models in
$\mathcal{T}$: $\bar{f}_{\mathcal{T}}(\hat{y}_{t+h}) = \sum_{i=1}^{m} w^i_{t+h}\,\hat{y}^i_{t+h}$, where $w^i_{t+h}$, $i \in [1, m]$, are the ensemble weights. The weights
are constrained to be positive and sum to one. In addition, it can be seen from the notation that
the weights are time-dependent. This is one of the requirements in online ensemble learning,
where the weights are required to be set in a timely manner to cope with the dynamic nature
of the time series and the time-changing performance of the ensemble members [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. The goal of dynamic online ensemble pruning is to identify the subset of models $\mathcal{S} \subset \mathcal{T}$ that
should compose the ensemble at each time step $t+h$ such that the expected prediction error of
the pruned ensemble is reduced compared to the full ensemble $\bar{f}_{\mathcal{T}}$ for each forecast:
        </p>
        <p>$$\mathbb{E}\big[(y_{t+h} - \bar{f}_{\mathcal{T}}(\hat{y}_{t+h}))^2 \mid y_{1:t+h-1}\big] - \mathbb{E}\big[(y_{t+h} - \bar{f}_{\mathcal{S}}(\hat{y}_{t+h}))^2 \mid y_{1:t+h-1}\big] \geq 0 \quad (1)$$</p>
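        <p>As a minimal illustration of this convex combination, the following Python sketch (illustrative names only; the paper's own experiments were implemented in R) computes an ensemble forecast from the member forecasts and their convex weights:</p>
        <preformat>
# Minimal sketch (illustrative, not the paper's implementation): an ensemble
# forecast as a convex combination of the m member forecasts.
import numpy as np

def ensemble_forecast(member_forecasts: np.ndarray, weights: np.ndarray) -> float:
    """Combine the member forecasts y_hat[i] with convex weights w[i]."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0), "weights must be non-negative"
    assert np.isclose(weights.sum(), 1.0), "weights must sum to one"
    return float(np.dot(weights, member_forecasts))

# Example: three members with uniform weights.
y_hat = np.array([1.2, 0.9, 1.1])
print(ensemble_forecast(y_hat, np.ones(3) / 3))  # 1.0666...
</preformat>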
      </sec>
      <sec id="sec-2-2">
        <title>2.2. TreeSHAP Ensemble Learning</title>
        <sec id="sec-2-2-1">
          <title>2.2.1. Ensemble Pruning</title>
          <p>We divide the time series $y_{1:t}$ into $y_{tr} = \{y_1, y_2, \cdots, y_{t-w}\}$ and
$y_{val} = \{y_{t-w+1}, y_{t-w+2}, \cdots, y_t\}$, with $w$ a provided window size. $y_{tr}$ is used for training the
models in $\mathcal{T}$ and $y_{val}$ is used to compute the TreeSHAP values. For each tree-based model
$T_j \in \mathcal{T}$, for each observation $y_{t-w+i} \in y_{val}$ with $i \in [1, w]$, we compute a TreeSHAP value
$\phi^j_l(y_{t-w+i})$ for each lagged value, i.e., $l \in [1, L_j]$, where $L_j$ is the number of lags on which the
model $T_j$ is trained. Then, we aggregate absolute SHAP values over all the observations in
$y_{val}$ to acquire the SHAP-based lag importance $I^j_l$ for each lag $l \in [1, L_j]$ using the model $T_j$:</p>
          <p>$$I^j_l = \frac{1}{w} \sum_{i=1}^{w} \big|\phi^j_l(y_{t-w+i})\big|, \quad \forall l \in [1, L_j], \; \forall T_j \in \mathcal{T} \quad (2)$$</p>
          <p>
            Each model $T_j \in \mathcal{T}$ can then be characterized by a vector $\mathcal{I}^j = \{I^j_1, I^j_2, \cdots, I^j_{L_j}\}$. The models
can thus be clustered using their SHAP-based lag importance vectors $\mathcal{I}^j$. However, different
models in $\mathcal{T}$ might be trained using different lag values. As a result, the length of the vectors
$\mathcal{I}^j$ can vary between $L_{min}$ and $L_{max}$. Clustering distance measures exist that can handle
vectors of different lengths [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. However, we are mainly interested in grouping models based
on the way they represent the relationship between the input lagged values and the output.
Therefore, we assume that the models that are trained using a lag value $L_j$ lower than $L_{max}$
ignore the importance and the contribution of lagged features that are greater than $L_j$. In
other words, if the model $T_j$ is trained on $L_j \leq L_{max}$, then for each $l$ such that $L_j \leq l \leq L_{max}$, the
value of its corresponding SHAP-based lag importance $I^j_l$ is set to zero. In this manner, we
bring all the vectors $\mathcal{I}^j$ for all the models $T_j \in \mathcal{T}$ to the same length $L_{max}$, and we use
K-means with Euclidean distance for model clustering. Models belonging to different clusters
are expected to have different modeling paradigms of the contributions of different lagged
values to the predictions, which contributes to boosting the ensemble diversity. We select only
cluster representatives to take part in the ensemble: we simply select the closest model to each
cluster center.
          </p>
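          <p>To make the whole pruning step concrete, here is a hedged Python sketch (the paper's experiments were run in R; this illustrative re-implementation relies on the shap and scikit-learn packages, and all function and variable names are assumptions) that computes the lag importances of Eq. (2), zero-pads them to length $L_{max}$, clusters them with K-means, and keeps the model closest to each cluster center:</p>
          <preformat>
# Sketch of the pruning step under stated assumptions: each pool member is a
# fitted tree-based regressor paired with its own validation design matrix,
# whose number of columns equals the number of lags L_j it was trained on.
import numpy as np
import shap
from sklearn.cluster import KMeans

def lag_importance(model, X_val: np.ndarray) -> np.ndarray:
    """Eq. (2): mean absolute TreeSHAP value per lagged input over y_val."""
    phi = shap.TreeExplainer(model).shap_values(X_val)  # shape (w, L_j)
    return np.abs(phi).mean(axis=0)                     # shape (L_j,)

def prune(models, X_vals, L_max, n_clusters):
    """Cluster zero-padded lag-importance vectors; keep one model per cluster."""
    I = np.zeros((len(models), L_max))
    for j, (model, X_val) in enumerate(zip(models, X_vals)):
        imp = lag_importance(model, X_val)
        I[j, :len(imp)] = imp  # importances of lags beyond L_j stay zero
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(I)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Representative: the model whose vector is closest to the cluster center.
        dist = np.linalg.norm(I[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dist)]))
    return selected, I, km.labels_
</preformat>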
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Ensemble Adaptation</title>
          <p>Streaming time series data is prone to significant changes, leading to concept drifts. To account
for these shifts, the selection of ensemble members must be updated, allowing for the inclusion of
models that can better address newly emerging patterns. Concept drift is detected by monitoring
deviations in the mean of the time series over time, using the Hoeffding Bound to evaluate whether
these deviations are significant. If a drift is detected, an alarm is triggered, the TreeSHAP-based
model clustering is updated, and the ensemble is adjusted to reflect the new patterns in the data.</p>
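          <p>The drift test can be sketched as follows (a minimal Python illustration, assuming observations scaled to [0, 1] so that the Hoeffding bound applies; the window mechanics and confidence level are illustrative choices, not the paper's exact procedure):</p>
          <preformat>
# Sketch of Hoeffding-bound drift detection on the series mean (illustrative).
import math

def hoeffding_eps(n, delta=0.05):
    """Mean deviation tolerated with confidence 1 - delta (values in [0, 1])."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def drift_detected(reference, recent, delta=0.05):
    """Alarm when the two window means differ more than both bounds allow."""
    mu_ref = sum(reference) / len(reference)
    mu_new = sum(recent) / len(recent)
    eps = hoeffding_eps(len(reference), delta) + hoeffding_eps(len(recent), delta)
    return abs(mu_ref - mu_new) > eps

# On an alarm, recompute the TreeSHAP-based clustering on the latest window
# and re-select the cluster representatives that form the ensemble.
</preformat>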
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>Our method is denoted in the following as OEP-TT: Online explainable Ensemble Pruning of
Tree models for Time series forecasting.</p>
      <sec id="sec-3-1">
        <title>3.1. Experimental Setup</title>
        <p>
          We use 100 univariate time series datasets from various application domains, including financial,
weather, and synthetic data. These datasets are provided by the Monash Forecasting Repository
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We process each time series $Y$ by using the first 50% for training ($y_{tr}$), the following
25% for validation ($y_{val}$), and the remaining 25% for testing. Due to this way of splitting the
time series, we discard series that are shorter than 250 observations to allow enough training
and validation data. All experiments have been performed on consumer hardware, namely on a
2022 MacBook Pro in R.
        </p>
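        <p>The split can be sketched as follows (illustrative Python; the helper name split_series is hypothetical):</p>
        <preformat>
# Sketch of the chronological 50/25/25 split described above.
def split_series(y):
    """Return (train, validation, test) or None for series under 250 points."""
    n = len(y)
    if n >= 250:
        a, b = n // 2, n // 2 + n // 4
        return y[:a], y[a:b], y[b:]  # first 50%, next 25%, last 25%
    return None  # discard series too short for training and validation
</preformat>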
      </sec>
      <sec id="sec-3-2">
        <title>3.2. OEP-TT Setup</title>
        <p>Tree-based models set-up: We construct a pool $\mathcal{T}$ of tree-based models using different
parameter settings that are summarized in Table 1.</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Tree-based model families, their parameters, and the configuration values used to generate the pool.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Tree-based Model</th><th>Parameters</th><th>Configurations</th></tr>
            </thead>
            <tbody>
              <tr><td>Decision Tree (DT)</td><td>Maximum depth</td><td>{4, 8, 16}</td></tr>
              <tr><td rowspan="3">Random Forest (RF)</td><td>Number of trees</td><td>{50, 100, 150, 200}</td></tr>
              <tr><td>Num. of variables sampled at each split</td><td>{3, 5, 7}</td></tr>
              <tr><td>Minimum size of terminal nodes</td><td>{5, 10, 15}</td></tr>
              <tr><td rowspan="3">Gradient Boosted DT (GBDT)</td><td>Number of trees</td><td>{50, 100, 150, 200}</td></tr>
              <tr><td>Maximum depth of each tree</td><td>{5, 7, 15}</td></tr>
              <tr><td>Shrinkage parameter</td><td>{0.001, 0.01, 0.1}</td></tr>
              <tr><td rowspan="4">eXtreme Gradient Boosting (XGBoost)</td><td>Max number of iterations</td><td>{50, 100, 150, 200}</td></tr>
              <tr><td>Step size of each boosting step</td><td>{0.001, 0.01, 0.1}</td></tr>
              <tr><td>Maximum depth</td><td>{5, 7, 15}</td></tr>
              <tr><td>Metric</td><td>{L1, L2}-regularization</td></tr>
              <tr><td rowspan="2">Light GBM (LGBM)</td><td>Max number of iterations</td><td>{50, 100}</td></tr>
              <tr><td>Maximum depth of each tree</td><td>{5, 7, 15}</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The list of parameters and their value ranges in Table 1 is not exhaustive, and further
parameters and values can be considered to generate more base learners. We also vary the lag
parameter $L$ on which the tree-based models are trained, i.e., $L \in \{3, 5, 7, 10, 15, 20\}$.
Considering different combinations of all the parameters, we train a total of 294 tree-based
models.</p>
        <p>OEP-TT set-up: OEP-TT also has a number of hyper-parameters: $m$, the size of the pool
of tree-based models $\mathcal{T}$: 294; $w$, the size of the validation time window $y_{val}$: 25% of the data
length; and $|\mathcal{S}|$, the number of final selected models: 6.</p>
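        <p>As an illustration of how such a pool can be generated (a hedged Python sketch using scikit-learn; only the Random Forest grid from Table 1 is shown, the other families follow the same pattern, and the helper name rf_pool is hypothetical):</p>
        <preformat>
# Sketch: enumerate the Random Forest slice of the pool from Table 1.
from itertools import product
from sklearn.ensemble import RandomForestRegressor

LAGS = [3, 5, 7, 10, 15, 20]  # lag parameter L varied per model

def rf_pool():
    pool = []
    for n_trees, mtry, min_node, lag in product(
            [50, 100, 150, 200], [3, 5, 7], [5, 10, 15], LAGS):
        model = RandomForestRegressor(
            n_estimators=n_trees,
            max_features=mtry,          # variables sampled at each split
            min_samples_leaf=min_node,  # minimum size of terminal nodes
        )
        pool.append((model, lag))       # pair each model with its lag count L_j
    return pool
</preformat>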
      </sec>
      <sec id="sec-3-3">
        <title>3.3. State-of-the-Art Methods Setup</title>
        <p>
          We compare OEP-TT against State-of-the-Art (SoA) methods for online ensemble pruning,
tree-based ensembles, and time series forecasting in general. These models include: Auto-Regressive
Integrated Moving Average (ARIMA) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], Exponential Smoothing (ETS) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], Long Short-Term
Memory (LSTM) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], Multi-Layer Perceptron (MLP) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], Convolutional Neural Network with
LSTM (CNN-LSTM, Bi-LSTM) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], Random Forest (RF) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], Gradient-Boosted Decision
Trees (GBDT) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], eXtreme Gradient Boosting (XGBoost) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and Light Gradient-Boosting
Machine (LGBM) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>
          To enable a fair comparison with OEP-TT, we feed to these ensemble pruning methods
the same pool of tree-based models $\mathcal{T}$ that was used for OEP-TT: Ens: Ensemble of all the
base models in $\mathcal{T}$; OCL [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: Online drift-aware clustering of the tree-based models in $\mathcal{T}$ using
covariance-based clustering; OTOP [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: Online drift-aware Top best-performing tree-based
models ranking using temporal correlation analysis; DEMSC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: Dynamic Ensemble Members
Selection using Clustering: Online drift-aware Top best-performing models ranking using
temporal correlation analysis combined with covariance-based clustering; ADE [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ]: a method
recently developed for the online dynamic construction of ensembles of forecasters. It uses a
meta-learning strategy that specializes the tree-based models across the input time series, and a
sequential weighting schema automatically selects ensemble members by setting their weights
to zero.
        </p>
        <p>We also compare OEP-TT to its variants: OEP-TT-ST: Static variant of OEP-TT, where pruning
is decided at the initial forecasting instant and kept fixed throughout testing; OEP-TT-Per: Pruning
is updated periodically in a blind manner (i.e., without considering the occurrence of drift).</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <sec id="sec-3-4-1">
          <title>3.4.1. Predictive Performance</title>
          <p>[Results table: wins and losses of OEP-TT against each comparison method.]</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Explainability Aspects</title>
          <p>Figure 1 shows the clusters of TreeSHAP values for the Saugeen River Flow dataset. The dots
on each lag value stand for the TreeSHAP values taken by the models belonging to the same
cluster, while the line connects the mean values to show the TreeSHAP values for each lag
of the representative model selected from each cluster (only for visualization purposes). Note
that we show the name of the model and the value of the first hyper-parameter plus the lag
value on which it is trained to distinguish between selected models belonging to the same
family of tree-based models, e.g., RF200(Lag10) and RF50(Lag7). It can be seen that different
clusters exhibit different patterns of lagged-value contributions to the target time series
observations. This confirms that our clustering procedure promotes ensemble diversity by
enforcing the selection of models that have different modeling paradigms and distinct views
on the importance of specific lag values. For example, while models in cluster 6 favor higher
lag values and emphasize the contribution of their corresponding value to the output forecast
value, models in cluster 5 are built on the assumption of restricting the memory of the models
to lower lag values ($L = 3$). We can notice that in 3 out of 6 clusters, models rely on restricted
lagged values (clusters 2, 3, and 5). Even with this limited width of memory, i.e., historical data,
they can excel in terms of predictive performance.</p>
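          <p>Such a per-cluster view can be reproduced from the importance matrix and cluster labels computed during pruning; the following matplotlib sketch (illustrative names, assuming the matrix I and labels from the pruning sketch in Section 2.2.1) draws one panel per cluster with per-model dots and a cluster-mean line:</p>
          <preformat>
# Sketch of a Figure 1-style plot: TreeSHAP lag importances per cluster.
import matplotlib.pyplot as plt
import numpy as np

def plot_clusters(I, labels, n_clusters, L_max):
    """Dots: lag importances of each model; line: the cluster mean."""
    fig, axes = plt.subplots(2, 3, sharey=True, figsize=(12, 6))
    lags = np.arange(1, L_max + 1)
    for c, ax in enumerate(axes.ravel()[:n_clusters]):
        members = I[labels == c]
        for row in members:
            ax.scatter(lags, row, s=8, alpha=0.4)
        ax.plot(lags, members.mean(axis=0))
        ax.set_title(f"Cluster {c + 1}")
        ax.set_xlabel("Lag value")
    axes[0, 0].set_ylabel("TreeSHAP value")
    plt.tight_layout()
    plt.show()
</preformat>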
          <p>[Figure 1: Clustered TreeSHAP values for the Saugeen River Flow dataset. Six panels (clusters 1-6) plot the TreeSHAP value (0.00-1.00) against the lag value (lag15 down to lag1). Selected models: RF200(Lag10), GBM(Lag5), XGboost(Lag3), RF50(Lag7), LGBM(Lag3), RF150(Lag15).]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Concluding Remarks and Future Work</title>
      <p>
        This paper introduces OEP-TT, a novel method for online adaptive pruning of ensembles of
tree-based models. Through the use of TreeSHAP values, we are able to gain insight into its
decision-making process, both for model selection and for the relevance of the input time
series points. We showed the advantages of OEP-TT on 100 real-world datasets, both in terms of
predictive performance as well as its explainability aspects. In future work, we plan to extend
our method to hybrid model pools by using the most efficient Shapley value estimation methods
for each model family, such as TreeSHAP for tree-based models, DeepSHAP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for Neural
Networks, as well as KernelSHAP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for the remaining models, to tune the size of the ensemble, and to
dive further into the explainability aspects. Given that this is early-stage work, we plan to
engage with experts in the field to exchange ideas and gather feedback.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          , I. Žliobaitė,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pechenizkiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouchachia</surname>
          </string-name>
          ,
          <article-title>A survey on concept drift adaptation</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>46</volume>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Morik</surname>
          </string-name>
          ,
          <article-title>Explainable online deep neural network selection using adaptive saliency maps for time series forecasting</article-title>
          , in: N.
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Pérez-Cruz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kramer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Read</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          <string-name>
            <surname>Lozano</surname>
          </string-name>
          (Eds.),
          <source>Machine Learning and Knowledge Discovery in Databases. Research Track</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Morik</surname>
          </string-name>
          ,
          <article-title>Explainable online ensemble of deep neural network pruning for time series forecasting</article-title>
          ,
          <source>Machine Learning</source>
          <volume>111</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mykula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Morik</surname>
          </string-name>
          ,
          <article-title>Online adaptive multivariate time series forecasting</article-title>
          ,
          <source>in: Joint European conference on machine learning and knowledge discovery in databases</source>
          , Springer,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Morik</surname>
          </string-name>
          ,
          <article-title>A drift-based dynamic ensemble members selection using clustering for time series forecasting</article-title>
          ,
          <source>in: Joint European conference on machine learning and knowledge discovery in databases</source>
          , Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tavakol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Morik</surname>
          </string-name>
          ,
          <article-title>An actor-critic ensemble aggregation model for time-series forecasting</article-title>
          , in: IEEE ICDE,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <article-title>Explainable adaptive tree-based model selection for time series forecasting</article-title>
          ,
          <source>arXiv preprint arXiv:2401.01124</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A Unified Approach to Interpreting Model Predictions</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          , pp.
          <fpage>4765</fpage>
          -
          <lpage>4774</lpage>
          . URL: http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Berndt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clifford</surname>
          </string-name>
          ,
          <article-title>Using dynamic time warping to find patterns in time series</article-title>
          ., in: KDD workshop, volume
          <volume>10</volume>
          ,
          <year>1994</year>
          , pp.
          <fpage>359</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Godahewa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bergmeir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hyndman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Montero-Manso</surname>
          </string-name>
          ,
          <article-title>Monash time series forecasting archive</article-title>
          ,
          <source>in: Neural Information Processing Systems Track on Datasets and Benchmarks</source>
          ,
          <year>2021</year>
          . Forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Box</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Jenkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Reinsel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Ljung</surname>
          </string-name>
          ,
          <article-title>Time series analysis: forecasting and control</article-title>
          , John Wiley &amp; Sons,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Gers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Applying lstm to time series predictable through time-window approaches</article-title>
          ,
          <source>in: Neural Nets WIRN Vietri-01</source>
          , Springer,
          <year>2002</year>
          , pp.
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Romeu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zamora-Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Botella-Rocamora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <article-title>Time-series forecasting of indoor temperature using pre-trained deep neural networks</article-title>
          ,
          <source>in: International conference on artificial neural networks</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>451</fpage>
          -
          <lpage>458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Random forests,
          <source>Machine learning 45</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Taieb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hyndman</surname>
          </string-name>
          ,
          <article-title>A gradient boosting approach to the kaggle load forecasting competition</article-title>
          ,
          <source>International journal of forecasting 30</source>
          (
          <year>2014</year>
          )
          <fpage>382</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Benesty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khotilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , I. Cano,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Xgboost: extreme gradient boosting</article-title>
          ,
          <source>R package version 0.4-2 1</source>
          (
            <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>Lightgbm: A highly efficient gradient boosting decision tree</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cerqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Arbitrated ensemble for time series forecasting</article-title>
          ,
          <source>in: Joint European conference on machine learning and knowledge discovery in databases</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cerqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          , Arbitrage of forecasting experts,
          <source>Machine Learning</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>