<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ICDM58522.2023.00027</article-id>
      <title-group>
        <article-title>Online Explainable Forecasting using Regions of Competence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amal Saadallah</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Jakobs</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>10</volume>
      <fpage>180</fpage>
      <lpage>189</lpage>
      <abstract>
        <p>Several machine learning models have been applied to time series forecasting. However, it is generally acknowledged that none of these models is universally valid for every application and over time. This is due to the complex and changing nature of time series data that may involve non-stationary processes, but can also be explained by the fact that different models have varying expected regions of expertise, so-called Regions of Competence (RoCs), over the time series. Therefore, adequate and adaptive online model selection is often required. In this work, we review online model selection works that exploit the notion of RoCs by summarizing their methods and highlighting their strengths as well as limitations. In particular, we present a taxonomy of these methods and show how they can be exploited both for single model selection and for ensemble pruning. We additionally discuss how RoCs can promote explainability. Finally, we suggest future directions to provide useful research guidance.</p>
      </abstract>
      <kwd-group>
        <kwd>Time Series Forecasting</kwd>
        <kwd>Explainability</kwd>
        <kwd>Model Selection</kwd>
        <kwd>Ensembling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Time series forecasting is considered a key step in making informed decisions in a wide range of
applications [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Several Machine Learning (ML) models have been applied to the time series
forecasting task [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. While certain guidelines, such as task complexity or data size, can help select
a suitable family of ML models for forecasting [7], it is generally acknowledged that no existing ML
model is universally applicable to all forecasting problems. This observation aligns with Wolpert’s No
Free Lunch theorem [8], suggesting that no learning algorithm can be optimal for all learning tasks. In
addition, ML models may demonstrate a time-dependent performance [9, 10]. This means their accuracy
may not remain consistent over time due to the dynamic nature of time series data, which may involve
non-stationary processes and, as a result, be subject to the so-called concept drift phenomenon [11].
Thus, it is clear that different forecasting models have different expected areas of expertise placed over
different parts of the input time series. The part where a specific model outperforms the other candidate
models is referred to as a Region of Competence (RoC) of that model [
        <xref ref-type="bibr" rid="ref5">5, 12, 9</xref>
        ].
      </p>
      <p>
        Several works in the ML literature have used this notion of RoCs either implicitly [
        <xref ref-type="bibr" rid="ref5">5, 13, 12</xref>
        ] or
explicitly [9, 10, 14, 15, 16] to perform online model selection for forecasting that copes with the
time-evolving nature of time series and the fact that models have a certain expected level of competence in
predicting a particular region of the time series. While some of these works have focused on the online
selection of a single model [9, 15, 16], others have built on the assumption that no single model is
an expert all the time, and have proposed to adaptively select and combine multiple forecasting models
into a single model using an ensemble technique [
        <xref ref-type="bibr" rid="ref5">5, 17, 12, 10, 14</xref>
        ]. The selection of individual models
that make up the ensemble is referred to as ensemble pruning [18].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Regions of Competence Computation</title>
      <p>Let Y be a univariate time series and let y_t be the value of that time series at time t. Additionally,
we denote with y_{i:j} the subsequence entailing the values from t = i to t = j with i &lt; j. A Region of
Competence (RoC) for a forecasting model m ∈ P, where P is a set of given models, refers to a set of
subsequences y_{i:j} where m exhibits superior performance compared to all other models in P. However,
pinpointing these subsequences and determining their optimal length through empirical evaluations
is computationally infeasible. Recent literature has leveraged machine learning to approximate these
RoCs. These approaches fall into two main categories. The first category includes model-agnostic
methods that compute RoCs independently of the family of forecasting models. Within this category, two
primary machine learning paradigms have been used, namely pattern matching and meta-learning. The
second category comprises model-specific methods tailored to a particular class of forecasting models,
such as tree-based models or DNNs.</p>
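      <p>As a concrete illustration of the definition above, the following Python sketch performs the brute-force RoC computation for one fixed subsequence length: every subsequence is assigned to the pool model that forecasts its successor with the lowest squared error. The toy pool (a last-value and a window-mean forecaster) and all names are hypothetical, and this exhaustive search over subsequences and lengths is exactly what becomes infeasible at realistic scales.</p>
      <preformat>
```python
# Brute-force sketch of the RoC definition (toy pool, hypothetical names):
# each fixed-length subsequence is assigned to the model that predicts its
# successor value with the lowest squared error.

def compute_rocs(series, models, length):
    """models: dict mapping a name to a callable that maps a window to a forecast."""
    rocs = {name: [] for name in models}
    for start in range(len(series) - length):
        window = series[start:start + length]
        target = series[start + length]
        errors = {name: (target - f(window)) ** 2 for name, f in models.items()}
        best = min(errors, key=errors.get)          # the locally competent model
        rocs[best].append((start, start + length))  # index range of the RoC member
    return rocs

# Toy pool: a naive last-value forecaster and a window-mean forecaster.
models = {"naive": lambda w: w[-1], "mean": lambda w: sum(w) / len(w)}
series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0]
rocs = compute_rocs(series, models, length=3)
```
      </preformat>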
      <p>The main idea of pattern-matching methods is to find, for the current subsequence, the closest
subsequence in the training or validation data. The model exhibiting the best performance on these
patterns is considered the expert, and the current subsequence is treated as one of its RoCs. In [12],
1-Nearest Neighbors is used to determine the closest pattern. In [16], the known data points are first
clustered. Then, each model in P is evaluated on all subsequences from each cluster.
The Mean Squared Error (MSE) for each model is computed, and each cluster is assigned as a Region of
Competence (RoC) of the model that minimizes the computed MSE. This ensures that each cluster is
associated with an expert model that excels in predicting values based on the dominant pattern in the
corresponding cluster.</p>
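      <p>A minimal sketch of this cluster-then-assign scheme follows, with a trivial up/down-trend rule standing in for the actual clustering algorithm of [16] and a hypothetical toy pool: each group of subsequences is scored per model by MSE, and the group becomes a RoC of the model minimizing that MSE.</p>
      <preformat>
```python
# Hedged sketch of cluster-based RoC assignment: subsequences are grouped
# (here by a trivial trend rule standing in for real clustering), each pool
# model is scored per group by MSE, and each group is assigned as a RoC of
# the model minimising that MSE.

def cluster_label(window):
    # Stand-in for a clustering algorithm: rising windows vs. the rest.
    return "up" if window[-1] > window[0] else "down"

def assign_clusters(series, models, length):
    clusters = {}
    for start in range(len(series) - length):
        w = series[start:start + length]
        clusters.setdefault(cluster_label(w), []).append((w, series[start + length]))
    assignment = {}
    for label, members in clusters.items():
        mse = {
            name: sum((t - f(w)) ** 2 for w, t in members) / len(members)
            for name, f in models.items()
        }
        assignment[label] = min(mse, key=mse.get)  # expert for this pattern
    return assignment

models = {"naive": lambda w: w[-1], "mean": lambda w: sum(w) / len(w)}
series = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 3.0, 4.0, 3.0]
experts = assign_clusters(series, models, length=3)
```
      </preformat>
      <p>On this toy series the rising windows favor the naive forecaster, while the oscillating ones favor the mean, so each cluster ends up with a different expert.</p>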
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">17, 5</xref>
        ], meta-learning is used to build models capable of modeling the competence of each candidate
model across the input time series. Their meta-learning approach is composed of an arbitrating
architecture [19] and a mixture of experts [20]. A meta-learner is created for each candidate model.
While each model m ∈ P is trained to model the future values of the time series, its meta-learning associate
z_m is trained to model the error of m. The arbiter z_m can then make predictions regarding the error that
m will incur when predicting the future values of the time series. At test time, the candidate models are
weighted according to their expected degree of competence in y_{t−λ:t−1}, estimated by the predictions
of the meta-learners. These methods can be viewed as an implicit exploitation of the RoC concept as
they do not result in some identified subsequences in the input time series but rather a modeling of the
competence of the models using error estimates.
      </p>
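      <p>The arbitrating mechanism can be sketched as follows, with stub meta-learners (hypothetical constants standing in for trained error-regression models) and a simple normalized inverse-error weighting rule; the cited works define their own weighting functions, so this only illustrates the mechanism.</p>
      <preformat>
```python
# Hedged sketch of arbitrated forecasting: one meta-learner per candidate
# model predicts that model's error on the current input, and candidates are
# weighted by normalised inverse predicted error. Meta-learners here are
# hypothetical stubs; in practice they are regression models trained on the
# base models' historical errors.

def arbitrated_forecast(window, models, meta_learners):
    predicted_errors = {name: meta_learners[name](window) for name in models}
    inv = {name: 1.0 / (e + 1e-8) for name, e in predicted_errors.items()}
    total = sum(inv.values())
    weights = {name: v / total for name, v in inv.items()}
    forecast = sum(weights[name] * models[name](window) for name in models)
    return forecast, weights

models = {"naive": lambda w: w[-1], "mean": lambda w: sum(w) / len(w)}
# Stub meta-learners: "naive" is expected to be far more accurate here.
meta = {"naive": lambda w: 0.1, "mean": lambda w: 0.9}
yhat, weights = arbitrated_forecast([1.0, 2.0, 3.0], models, meta)
```
      </preformat>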
      <p>The works in the model-specific category mainly focus on CNN-based models, comprising 1D-convolutional layers
with varied filter and kernel sizes [21, 10]. More recently, in [14], a hybrid architecture was proposed by
concatenating the convolutional layers to a set of heterogeneous models that can be trained jointly
using a gradient-based approach. These works use the same principle to compute the RoCs.
They suggest a modified version of Grad-CAM [22], which utilizes the spatial information preserved in
convolutional layers to identify important regions of the input. To do so, they start by evaluating
the Mean Squared Error on a specific validation time window V_j, denoted as MSE_{V_j}, for model m.
The objective is to determine the significance of each time point in this window with respect to the
computed error. The last feature-map layer of the CNN is utilized for this purpose. Importance weights
α_u^k associated with MSE_{V_j} are computed for each activation unit u in each generic feature map A^k by
calculating the gradient of MSE_{V_j} relative to u. Then, a global average is computed over all units in
A^k: α^k = (1/U) Σ_u α_u^k, where U is the total number of units in A^k. The weights α^k are used to compute a weighted
combination of all the feature maps for a given measured value of the error MSE_{V_j}. Since we are
mainly interested in highlighting the temporal features contributing most to MSE_{V_j}, ReLU is used to remove
all the negative contributions: H = ReLU(Σ_k α^k A^k). To identify the RoC in V_j that primarily
contributed to MSE_{V_j}, a threshold τ ∈ R on H is used. Note that τ is chosen such that the resulting RoC is smaller than V_j.
Note also that multiple time windows are created from the validation data using a time-sliding approach to evaluate the
performance of the same model on different windows and increase the number of computed RoCs.</p>
      <p>While the methods presented above proved successful in providing an explanation for
the model selection and ensembling processes, the problem of opaque, non-interpretable base models
still persists. Recent work [15] utilized tree-based models to make the model decision process more
transparent, in addition to having an interpretable model selection algorithm. The authors create a
pool of Decision Trees, Random Forests, and Gradient Boosting Trees that are subsequently trained on
the training data. Shapley values, a well-established explainability method for feature attribution, are used to
generate the explanations necessary for the creation of RoCs. Since computing Shapley values is in
general NP-hard [23], the authors utilize TreeSHAP [24], an estimation method designed for
tree-based models that is able to estimate Shapley values in polynomial time. Similar to [9], the loss of
the prediction for each model is explained rather than the prediction itself.</p>
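      <p>The gradient-weighted combination and thresholding step of the Grad-CAM-style computation can be illustrated framework-free as follows. The feature maps and their gradients with respect to the validation MSE are assumed to be precomputed (hand-made lists here stand in for the last convolutional layer), and all names are hypothetical.</p>
      <preformat>
```python
# Framework-free sketch of Grad-CAM-style RoC extraction: gradient-averaged
# importance weights, a ReLU-ed weighted combination of feature maps, and a
# threshold tau selecting the time points that form the RoC.

def extract_roc(feature_maps, gradients, tau):
    # alpha_k: global average of the gradient over all units of feature map k.
    alphas = [sum(g) / len(g) for g in gradients]
    length = len(feature_maps[0])
    # Weighted combination of feature maps; ReLU removes negative contributions.
    heat = [
        max(0.0, sum(a * fm[t] for a, fm in zip(alphas, feature_maps)))
        for t in range(length)
    ]
    # Time points whose importance exceeds tau form the RoC.
    return [t for t in range(length) if heat[t] > tau], heat

feature_maps = [[0.2, 0.9, 0.8, 0.1], [0.1, 0.7, 0.9, 0.0]]
gradients = [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]
roc, heat = extract_roc(feature_maps, gradients, tau=0.5)
```
      </preformat>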
    </sec>
    <sec id="sec-3">
      <title>3. Online Model Selection</title>
      <p>At test time t, after computing the RoCs of the candidate models for selection, an online decision on
model selection for forecasting the value y_t at time t has to be made. Note that the following can be
applied to any future time instant t + h, h ≥ 0. For simplicity of notation, assume h = 0. In this part,
we focus on the works that explicitly compute the RoCs, i.e., result in identified subsequences within
the input time series.</p>
      <p>The models in P are devised such that they use the same λ lagged values of the time series as input.
As a result, at time t, the input subsequence is y_{t−λ:t−1}, t ≥ λ. To perform the selection, the distance
of the input subsequence y_{t−λ:t−1} to the RoCs of each candidate model is measured [9, 12, 16, 15].
Since the length of each RoC can be different from λ (i.e., the length of y_{t−λ:t−1}), Dynamic Time Warping
(DTW) [25] is generally used to measure the similarity between y_{t−λ:t−1} and each RoC member. The
model with the RoC member of smallest DTW distance to y_{t−λ:t−1} is selected for forecasting. In [16],
the RoCs are smoothed using a moving-average process and filtered to keep only the RoCs with length
λ, and DTW is replaced by the Euclidean distance.</p>
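      <p>The distance-based selection step can be sketched as follows, assuming the RoCs are already available: a plain dynamic-programming DTW, after which the model owning the RoC member closest to the current input is selected. The RoC contents and model names below are hypothetical.</p>
      <preformat>
```python
# Sketch of DTW-based single model selection: compute the DTW distance from
# the current input window to every RoC member of every model, and select the
# model owning the closest member.

def dtw(a, b):
    inf = float("inf")
    # cost[i][j]: DTW distance between a[:i] and b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def select_model(window, rocs):
    """rocs: dict mapping a model name to its RoC members (of varying lengths)."""
    return min(
        rocs,
        key=lambda name: min(dtw(window, member) for member in rocs[name]),
    )

rocs = {
    "naive": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0]],
    "mean": [[3.0, 3.0, 3.0]],
}
chosen = select_model([2.0, 3.0, 4.0], rocs)
```
      </preformat>
      <p>Because DTW aligns sequences of unequal lengths, the four-point RoC member can still be matched against the three-point input, which is the reason it is preferred over the Euclidean distance here.</p>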
      <p>Recent works have extended the RoC concept from single model selection to ensemble pruning
[16, 14, 10]. They based their reasoning on the expected error of the ensemble, which can be decomposed at
each time point into a weighted average error term of the individual composing models and an ambiguity
term, which is simply the variance of the ensemble around the weighted mean of its composing models.
The weighted error term reflects the ensemble accuracy, while the ambiguity term reflects the ensemble
diversity [26, 27]. Since one model can have multiple RoCs, the first step is to select one representative
RoC for each model based on the topological closeness to the current input subsequence y_{t−λ:t−1}.
In this manner, selecting the same model multiple times is avoided, which contributes to the ensemble
diversity [16, 14]. In [10], diversity is promoted further by clustering the RoC representatives so that
selecting models with similar competencies or expertise is avoided. The second step consists of ranking
the models by the distance of their representative RoCs to the current input subsequence y_{t−λ:t−1}
to promote accuracy. The top-k closest models are selected to make up the ensemble. The number k is
arbitrarily set to a fixed value in [16], while the authors in [10] derived bounds for both error terms to set
k automatically.</p>
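      <p>The two-step pruning can be sketched as follows, with the Euclidean distance standing in for DTW (assuming equal-length RoC members) and a fixed top-k: first one representative RoC per model is chosen, then the k models whose representatives lie closest to the input are kept. Pool contents and names are hypothetical.</p>
      <preformat>
```python
# Hedged sketch of RoC-based ensemble pruning: pick one representative RoC
# per model (the member closest to the input), rank models by that distance,
# and keep the top-k for the ensemble.

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def prune_ensemble(window, rocs, k):
    representative = {
        name: min(members, key=lambda m: euclid(window, m))
        for name, members in rocs.items()
    }
    ranked = sorted(representative, key=lambda name: euclid(window, representative[name]))
    return ranked[:k]

rocs = {
    "naive": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    "mean": [[3.0, 3.0, 3.0]],
    "drift": [[9.0, 9.0, 9.0]],
}
ensemble = prune_ensemble([2.0, 3.0, 4.0], rocs, k=2)
```
      </preformat>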
    </sec>
    <sec id="sec-4">
      <title>4. Explainability Aspects</title>
      <p>Most studies mentioned earlier emphasized the importance of transparency in models and learning
decisions, including the selection of models [9, 10, 16, 14, 15]. They demonstrated that the computed
Regions of Competence serve as a valuable tool for explaining why a specific forecast value is generated
at a given time and for choosing a particular individual model or ensemble member within a specific
time frame or interval. In essence, these RoCs act as an explanatory mechanism, shedding light on
the decision-making process behind forecast outputs and model selections. More specifically, they are
part of the so-called exemplar-based or prototype-based explainability approaches [28, 29]. Thus, local
explanations, i.e., why a certain model was chosen at time t, can be given by visually inspecting the
RoCs of the candidate models and their distances to the current input subsequence.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work provides a unified view of several methods for online adaptive time series forecasting
that are based on Regions of Competence. These regions not only enable state-of-the-art selection and
ensemble performance but can also be used to gain insights into the selection and ensembling process
on a local as well as a global level.</p>
      <p>One avenue of further research could be concerned with utilizing model-agnostic explainability
methods such as KernelSHAP to enable heterogeneous model selection and ensembling. However, such
a method may be limited by runtime, which would need to be improved to allow for concept drift
adaptation. Another direction of future work might be to improve overall explainability by limiting the
model pool to small, transparent models. An open question remains whether it is possible to select or ensemble
these small models using Regions of Competence in a way that reaches state-of-the-art performance
compared to opaque approaches such as selecting from pools of Deep Neural Networks.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research has been funded by the Federal Ministry of Education and Research of Germany and the
state of North Rhine-Westphalia as part of the Lamarr Institute for Machine Learning and Artificial
Intelligence.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>De Gooijer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hyndman</surname>
          </string-name>
          ,
          <article-title>25 years of time series forecasting</article-title>
          ,
          <source>International journal of forecasting 22</source>
          (
          <year>2006</year>
          )
          <fpage>443</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreira-Matias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sousa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Khiari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jenelius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <article-title>BRIGHT - drift-aware demand predictions for taxi networks</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Godahewa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bergmeir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hyndman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Montero-Manso</surname>
          </string-name>
          ,
          <article-title>Monash time series forecasting archive</article-title>
          ,
          <source>in: Neural Information Processing Systems Track on Datasets and Benchmarks</source>
          ,
          <year>2021</year>
          . Forthcoming.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Atiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Gayar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>El-Shishiny</surname>
          </string-name>
          ,
          <article-title>An empirical comparison of machine learning models for time series forecasting</article-title>
          ,
          <source>Econometric reviews 29</source>
          (
          <year>2010</year>
          )
          <fpage>594</fpage>
          -
          <lpage>621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cerqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Arbitrated ensemble for time series forecasting</article-title>
          ,
          <source>in: Joint European conference on machine learning and knowledge discovery in databases</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zohren</surname>
          </string-name>
          ,
          <article-title>Time-series forecasting with deep learning: a survey</article-title>
          ,
          <source>Philosophical Transactions of the Royal Society A</source>
          <volume>379</volume>
          (
          <year>2021</year>
          )
          <fpage>20200209</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>