Model Reuse with Subgroup Discovery

Hao Song and Peter Flach

Intelligent Systems Laboratory, University of Bristol, United Kingdom
{Hao.Song,Peter.Flach}@bristol.ac.uk

Abstract. In this paper we describe a method to reuse models with Model-Based Subgroup Discovery (MBSD), an extension of the Subgroup Discovery scheme. The task is to predict the number of bikes at a new rental station 3 hours in advance. Instead of training new models with the limited data from these new stations, our approach first selects a number of pre-trained models from old rental stations according to their mean absolute error (MAE). For each selected model, we then perform MBSD to locate subgroups in which the selected model shows deviating prediction performance. Another set of pre-trained models is then selected according to their MAE over these subgroups alone. Finally, predictions are made by averaging the predictions of the models selected in the previous two steps. The experiments show that our method performs better than selecting the trained model with the lowest MAE, and than averaging the low-MAE models.

1 Introduction

In this paper we propose a model reuse approach exploiting Model-Based Subgroup Discovery (MBSD). The general idea of model reuse is to apply models trained in other operating contexts to a new operating context. Such a strategy has two main benefits. Firstly, it can dramatically reduce model training time in the new operating context. Secondly, if the new operating context only has limited data, model reuse can further improve prediction performance, as it essentially extends the scale of the training data by adding training data from other operating contexts.

One major challenge for model reuse is that the patterns in the data can vary across operating contexts. This makes it difficult to directly apply trained models from the training contexts to a new context. For instance, predicting activities of daily living (ADLs) from sensor readings is one of the leading applications of a smart home. However, as both the household and the layout vary from house to house, it is hard to directly use a model trained in one particular house in another house. Recognising and dealing with such variations across operating contexts has therefore become a non-trivial research task for model reuse [1].

In this paper we use a variation of the Subgroup Discovery (SD) scheme [2-4] to help the reused models adapt to the new context. SD is a data mining technique that uses a descriptive model to learn unusual statistics of a target variable in a given data-set. However, traditional SD approaches generally focus on the statistics of a single attribute in a fixed data-set, which makes them less appropriate for model reuse. We therefore propose an extended method, MBSD, in this paper. The main modification is to change the target variable of the SD task from an attribute to the prediction performance of a particular base model on an attribute. Through this modification, MBSD can be used to discover the prediction patterns of a trained model in a new operating context. This helps locate the potential sub-contexts where the trained model can be directly applied, and the potential sub-contexts where other trained models are required.

The experiments are based on the machine learning challenge MoReBikeS, organised by the LMCE 2015 workshop within the ECML-PKDD 2015 conference.
The task is to predict the number of bikes available at a particular rental station 3 hours in advance, given some historical data. In detail, the overall data-set is obtained from 275 bike rental stations located in Valencia, Spain. Every participant gets access to the data of all 275 stations for October 2014, to be used as training data. Six trained linear regression models are also provided for each of stations 1 to 200; these linear regression models were trained on data covering the whole of 2014. The task of the challenge is, by reusing these trained models and the limited training data, to predict the number of available bikes at new bike stations (stations 201 to 275).

The method we use can be briefly described as follows. For any station to be predicted, the one-month training data can be used to select a number of models with good performance (low MAE values); these models are called base models. The assumption here is that each base model is only suitable for some unknown sub-context of the context to be predicted (a sub-context similar to the training context of the base model), and not suitable for other sub-contexts. Under this assumption, we can perform MBSD to discover these sub-contexts and to further select a number of models with good performance restricted to the sub-contexts. These models are denoted as sub-models. Finally, the overall prediction is obtained by averaging the predictions of both base models and sub-models, with some averaging strategy. The experiments show that, with MBSD, the MAE can be further reduced compared to simply averaging the predictions of the base models.

This paper is organised as follows. In Section 2 some preliminaries of Subgroup Discovery are given, and in Section 3 the basic concept of MBSD is introduced. The method to reuse models with MBSD is stated in Section 4. Section 5 presents experiments with the MoReBikeS data. Section 6 concludes the paper.

2 Subgroup Discovery

In this section we give some preliminaries and corresponding notation for SD.

Subgroup Discovery (SD) [2-4] is a data mining technique that learns rules to describe patterns of some attributes in a given data-set. Since the construction of subgroups is driven by some attributes, called target variables, it can be seen as a descriptive model learnt in a supervised way. However, SD still differs from predictive models: in SD we are not aiming to predict the target variable, but to discover interesting patterns with respect to it. Therefore, the definition of an interesting pattern needs to be given. In the existing literature, an interesting pattern often refers to a different class distribution (for binary/nominal target variables), or more generally to an unusual statistic (for binary/nominal/numerical target variables). On the other hand, because such patterns often have small coverage, some literature also defines SD as a model to find patterns that have both large coverage and an unusual statistic.

Mathematically, suppose the data-set contains $N$ instances and $M$ attributes. Traditional SD assumes that one of the $M$ attributes is selected as the target variable; the corresponding value for the $i$th instance is denoted $y_i \in \mathbb{R}$, and the domain of this attribute is denoted $Y = \{y_i\}_{i=1}^{N}$.
The remaining $M-1$ attributes are used as description attributes, denoted $d_i \in \mathbb{R}^{M-1}$; their domain is denoted $D = \{d_i\}_{i=1}^{N}$.

A subgroup is denoted as a function $g : D \to \{0, 1\}$. Hence $g(d_i) = 1$ means that the $i$th instance is covered by this subgroup, and vice versa. We use $G = \{i : g(d_i) = 1\}$ to denote the set of instances covered by the subgroup $g$. The task of (top-$q$) subgroup discovery can then be defined as: given a set of candidate subgroups $\mathcal{G} \subseteq 2^{D}$ and a quality measure $\phi : g \to \mathbb{R}$, find a set of $q$ subgroups $\mathcal{G}_q = \{g_1, \ldots, g_q\}$ such that $\phi(g_1) \geq \phi(g_2) \geq \ldots \geq \phi(g_q)$ and $\forall g_i \in \mathcal{G}_q, \forall g_j \in \mathcal{G} \setminus \mathcal{G}_q : \phi(g_i) \geq \phi(g_j)$.

With respect to the quality measure, since all the quality measures used in this paper can be seen as extensions of the quality measure Continuous Weighted Relative Accuracy (CWRAcc) [5], its definition is given here:

$$\phi_{CWRAcc}(g) = \frac{|G|}{N} \cdot \left( \frac{\sum_{i \in G} y_i}{|G|} - \frac{\sum_{i=1}^{N} y_i}{N} \right) \qquad (1)$$

3 Model-Based Subgroup Discovery

In this section we briefly introduce the concept of MBSD, together with the quality measures and search strategy applied in the following experiments. An example of MBSD applied to a particular bike station is also given.

3.1 Motivation

The motivation of MBSD is to incorporate models into the SD process, so that the resulting subgroups contain richer information. Although this concept is similar to the Exceptional Model Mining (EMM) framework [6, 7], MBSD differs from EMM in the way the models are used. In EMM, a model is trained on each candidate subgroup, and the quality of the subgroup is evaluated by the parameter deviation between the model trained on the subgroup and the model trained on the whole data-set (the global model). In MBSD, on the other hand, only the global model is involved in the discovery process. For each candidate subgroup, its quality is evaluated either according to the likelihood of the global model on the subgroup (for both non-predictive and predictive models), or according to the prediction performance of the global model on the subgroup (for predictive models). Since the purpose of this paper is to reuse models via MBSD, we omit a detailed discussion of the differences between MBSD and EMM. In general, since MBSD only requires a global model, repeated training across candidate subgroups is avoided; this makes MBSD more appropriate for model reuse.

3.2 Quality Measures for Regression Models

As the MBSD tasks in this paper only involve regression, we present four quality measures for regression models. Suppose the target attribute to be predicted is denoted $y_i$ for the $i$th instance, and the prediction made by the base model is denoted $\hat{y}_i$.

The first proposed quality measure, Weighted Relative Mean Absolute Error (WRMAE), is based on the absolute error of the base model, $z_i^{AE} = |\hat{y}_i - y_i|$. This quality measure is designed to find subgroups with large coverage and a higher MAE than the population:

$$\phi_{WRMAE}(f_{base}, G) = \frac{|G|}{N} \cdot \left( \frac{\sum_{i \in G} z_i^{AE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{AE}}{N} \right) \qquad (2)$$

Similarly, if the aim is to find subgroups where the base model tends to have a lower MAE than the population, the negative absolute error $z_i^{NAE} = -|\hat{y}_i - y_i|$ can be applied.
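All four measures in this section share the weighted-relative-mean form of Equation 1, differing only in the per-instance statistic being averaged. The following is a minimal sketch of this shared form in Python with NumPy; the function names and array conventions are our own illustration, not part of any released implementation:

```python
import numpy as np

def weighted_relative_mean(z, g):
    """Quality of subgroup mask g for per-instance statistic z: the
    coverage weight |G|/N times the lift of the subgroup mean of z
    over the population mean (the common form of Eqs. 1-5)."""
    n, n_g = len(z), int(g.sum())
    if n_g == 0:
        return 0.0
    return (n_g / n) * (z[g].mean() - z.mean())

def wrmae(y, y_hat, g):
    """WRMAE (Eq. 2): the statistic is the absolute error of the base model."""
    return weighted_relative_mean(np.abs(y_hat - y), g)
```

Here g is a boolean mask over the instances; the remaining measures are obtained by swapping the absolute error for the negative, over-estimated or under-estimated error defined below.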
The second proposed quality measure, Weighted Relative Mean Negative Absolute Error (WRMNAE), is given as:

$$\phi_{WRMNAE}(f_{base}, G) = \frac{|G|}{N} \cdot \left( \frac{\sum_{i \in G} z_i^{NAE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{NAE}}{N} \right) \qquad (3)$$

Another scenario is to discover the subgroups where the base model tends to over-estimate the target attribute. The quality measure should then be designed according to the over-estimated error:

$$z_i^{OE} = \begin{cases} \hat{y}_i - y_i & \text{if } \hat{y}_i \geq y_i \\ 0 & \text{otherwise} \end{cases}$$

Notice that the under-estimates are forced to zero, so the quality of a subgroup is not inflated by having both high over-estimated and high under-estimated error; subgroups with both kinds of high error can instead be discovered with the quality measure WRMAE. The quality measure Weighted Relative Mean Over-Estimated Error (WRMOE) is given as:

$$\phi_{WRMOE}(f_{base}, G) = \frac{|G|}{N} \cdot \left( \frac{\sum_{i \in G} z_i^{OE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{OE}}{N} \right) \qquad (4)$$

Analogously, the under-estimated error and the corresponding quality measure Weighted Relative Mean Under-Estimated Error (WRMUE) are defined as:

$$z_i^{UE} = \begin{cases} y_i - \hat{y}_i & \text{if } y_i \geq \hat{y}_i \\ 0 & \text{otherwise} \end{cases}$$

$$\phi_{WRMUE}(f_{base}, G) = \frac{|G|}{N} \cdot \left( \frac{\sum_{i \in G} z_i^{UE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{UE}}{N} \right) \qquad (5)$$

3.3 Description Language and Search Strategy

In traditional SD, the description of a subgroup can be built on any attribute other than the target variable. In EMM with predictive models, the description can be built on any attribute except the input and output of the model. This is because the essential aim of SD is to use some attributes to describe the pattern of the target attributes, so the description should avoid using the target attributes themselves. For MBSD with predictive models, however, the description of a subgroup can potentially be built on any attribute in the data-set. The reason is that the pattern MBSD (with predictive models) tries to describe is the prediction pattern of the base model, not the pattern of the attributes.

As with many other logical models, there are many ways to split the hypothesis space to generate the candidate subgroups, and this generally involves fixing the operations on each attribute. In this paper we simply use a conjunction of attribute-value pairs as the description language. For numerical attributes, a pre-processing step divides each numerical attribute into equal-size bins, which are then treated as nominal values.

Since the experiments involve a large number of SD tasks, we also restrict each subgroup to be described by a single attribute from the description attributes; this further reduces the search cost. Also, for each attribute, only the best subgroup described by that attribute is selected, so the top-q subgroups can be seen as subgroups described by q attributes respectively. As only a single attribute is used to describe each subgroup, the search strategy can be seen as a refinement process that adds different values of the corresponding attribute. We further use a greedy covering algorithm to increase the search speed and reduce memory usage. The algorithm proceeds as follows: at each step, the bin with the highest mean value of the target statistic (e.g. AE) is added to the description, and the algorithm terminates once the quality measure becomes smaller than at the previous step. This covering algorithm is similar to a beam search with the beam width fixed to 1, as the refinement is done within the same attribute; a sketch of the search is given below.
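The following is a minimal sketch of this greedy per-attribute search, assuming numerical attributes have already been binned; greedy_covering and its argument names are our own, and the quality function is repeated from the earlier sketch for self-containment:

```python
import numpy as np

def weighted_relative_mean(z, g):
    """Shared quality form of Eqs. 1-5 (as in the earlier sketch)."""
    n_g = int(g.sum())
    return 0.0 if n_g == 0 else (n_g / len(z)) * (z[g].mean() - z.mean())

def greedy_covering(attr, z, quality=weighted_relative_mean):
    """Greedy covering search of Section 3.3 for a single (binned)
    attribute. attr: 1-D array of attribute values; z: per-instance
    statistic (e.g. absolute error for WRMAE). Values are added in
    order of decreasing mean statistic; the search stops as soon as
    the quality drops below that of the previous step."""
    # rank candidate values by the mean statistic inside each value
    order = sorted(np.unique(attr), key=lambda v: -z[attr == v].mean())
    best_q, selected = -np.inf, []
    g = np.zeros(len(attr), dtype=bool)
    for v in order:
        g_new = g | (attr == v)
        q = quality(z, g_new)
        if q < best_q:            # quality decreased: terminate
            break
        best_q, g, selected = q, g_new, selected + [v]
    return selected, g, best_q
```

For example, greedy_covering(weekhour, np.abs(y_hat - y)) would return the selected weekhour values, the corresponding subgroup mask, and its WRMAE quality.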
3.4 MBSD on a Single Bike Station

In the MoReBikeS challenge there are 25 attributes in the data-set in total. Table 1 summarises the information for each attribute in the provided one-month data: name, type (binary, nominal, numerical), number of values, and number of bins configured in the MBSD task (only for numerical attributes).

Although all 25 attributes could be used to construct the candidate subgroups, the attribute bikes is the variable to be predicted and is therefore removed from the description. Also, because the MBSD task is performed for each individual station during October 2014, the attributes station, latitude, longitude, year, month and timestamp can be further excluded. For simplicity, from now on we use model i-j to refer to model j of station i (j = 1 for short, j = 2 for short temp, j = 3 for full, j = 4 for full temp, j = 5 for short full, j = 6 for short full temp).

Table 1: The 25 attributes in the October data-set and their properties.

attribute                    type       number of values  number of bins
station                      nominal    275               NA
latitude                     numerical  275               275
longitude                    numerical  275               275
numDocks                     numerical  19                19
timestamp                    numerical  745               745
year                         numerical  1                 1
month                        numerical  1                 1
day                          numerical  31                31
hour                         numerical  24                24
weekday                      numerical  7                 7
weekhour                     numerical  168               168
isHoliday                    binary     2                 NA
windMaxSpeed.m.s             numerical  28                28
windMeanSpeed.m.s            numerical  16                16
windDirection.grades         numerical  17                17
temperature.C                numerical  142               16
relHumidity.HR               numerical  72                8
airPressure.mb               numerical  283               32
precipitation.l.m2           numerical  1                 1
bikes 3h ago                 numerical  41                41
full profile 3h diff bikes   numerical  17304             32
full profile bikes           numerical  17632             41
short profile 3h diff bikes  numerical  419               32
short profile bikes          numerical  231               32
bikes                        numerical  41                41

For instance, Figure 1 (left) shows the predictions of model 1-1 for station 201 during October 2014, together with the ground truth; Figure 1 (right) gives the empirical distribution of the prediction errors.

Fig. 1: The predictions for station 201 from model 1-1 together with the ground truth (left; October MAE = 2.7514), and the empirical distribution of the errors z = ŷ − y (right).

If MBSD is performed on these predictions with the quality measure WRMAE, the best (rank 1) subgroup is found on the attribute weekhour. The corresponding attribute values are shown in Figure 2 (left). Since we treat this numerical attribute as a nominal one (i.e. the candidate subgroups can contain any combination of attribute values), the selected attribute values look sparse. Nevertheless, some patterns can still be discerned from the figure: for instance, most of the selected values are located around the night hours of each day. Figure 2 (right) gives the empirical distribution of the prediction errors within the subgroup. Compared to Figure 1, the distribution of errors has a significantly higher variance, which indicates a higher MAE.

Fig. 2: The best subgroup found with the quality measure WRMAE (left; MAE(G) = 4.5534), and the empirical distribution of errors within the subgroup (right).

Figure 3 (left) shows the best subgroup found with the quality measure WRMNAE. Since WRMNAE can be seen as a negative version of WRMAE, the best subgroup with WRMNAE is the complement of the best subgroup with WRMAE.

Fig. 3: The best subgroup found with the quality measure WRMNAE (left; MAE(G) = 1.5434), and the empirical distribution of errors within the subgroup (right).
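This per-station analysis can be reproduced by running greedy_covering from the sketch in Section 3.3 once per candidate description attribute and ranking the resulting subgroups. A usage sketch, assuming the station's October data is loaded as a pandas DataFrame df with a column y_hat holding the reused model's predictions (the DataFrame and column names are our own illustration):

```python
import numpy as np
import pandas as pd

# Assumed: df holds one station's October rows; "y_hat" holds the
# reused model's predictions on those rows (both names illustrative).
z = np.abs(df["y_hat"].to_numpy() - df["bikes"].to_numpy())  # AE, for WRMAE

candidates = ["weekhour", "hour", "day"]        # description attributes
results = {}
for col in candidates:
    values, mask, q = greedy_covering(df[col].to_numpy(), z)
    results[col] = (q, values, mask)

# top-q subgroups: the q best attributes, one subgroup per attribute
ranking = sorted(results, key=lambda c: -results[c][0])
```

Swapping z for the negative, over-estimated or under-estimated error yields the corresponding rankings for the other three measures.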
Similarly, we can find subgroups with the quality measures WRMOE and WRMUE. The results (attribute values of the best subgroup and the error distribution within the subgroup) for WRMOE and WRMUE are given in Figure 4 and Figure 5 respectively.

Fig. 4: The best subgroup found with the quality measure WRMOE (left; MAE(G) = 3.9921), and the empirical distribution of errors within the subgroup (right).

Fig. 5: The best subgroup found with the quality measure WRMUE (left; MAE(G) = 4.6019), and the empirical distribution of errors within the subgroup (right).

For all four quality measures the best subgroup is described by the attribute weekhour. However, the description attributes of the top-q subgroups vary across quality measures. The description attributes of the top-5 subgroups for each quality measure are given in Table 2.

Table 2: The description attributes for the top-5 subgroups with each quality measure.

Rank  WRMAE (WRMNAE)              WRMOE                       WRMUE
1     weekhour                    weekhour                    weekhour
2     hour                        full profile 3h diff bikes  hour
3     full profile 3h diff bikes  windMaxSpeed                day
4     day                         hour                        full profile 3h diff bikes
5     full profile bikes          windDirection               full profile bikes

In general, MBSD can be used to find deviating prediction patterns in a given data-set. For regression models, MBSD is set up to use one attribute to describe the data points that the base model tends to predict well or badly. Each attribute used by the subgroups can therefore be seen as sharing some non-linear correlation with the model's prediction. This is similar to attribute selection (e.g. regularisation), but in a non-linear form.

4 Model Reuse with MBSD

In this section we introduce how to reuse trained models with MBSD. The general idea is that, for each deployment context, we can select a set of trained models according to their performance. With MBSD, we can then detect the (patterns of) data points that these models predict well or badly, which can be seen as sub-contexts. A number of sub-models are then selected just for these data points. The final prediction is estimated by averaging the predictions of the base models and sub-models.

4.1 Baseline Method 1

The first baseline method is, for each deployment context, to simply select the one model among the 1200 trained models (200 stations, 6 models per station) that has the lowest MAE on the test station.

4.2 Baseline Method 2

The second baseline method is, for each test station, to rank the 1200 trained models according to their MAE on the test station. The final prediction is then the average of the predictions of the top-n models (the selected models are referred to as base models):

$$\hat{y}_i^n = \frac{1}{n} \sum_{j=1}^{n} f_{base}^{j}(x_i) \qquad (6)$$

Baseline method 2 can be seen as a special case of bootstrap aggregating (bagging), as each station can be treated as a bootstrap sample of a mixed context (the data-set of all bike stations).
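A minimal sketch of baseline method 2, assuming models is a list of the 1200 pre-trained regressors exposing a scikit-learn-style predict method, and X, y is the deployment station's one-month data (all of these names are our own assumptions):

```python
import numpy as np

def select_base_models(models, X, y, n):
    """Rank pre-trained models by MAE on the deployment data and keep
    the top-n; these are the base models of Section 4.2."""
    maes = [np.mean(np.abs(m.predict(X) - y)) for m in models]
    return [models[j] for j in np.argsort(maes)[:n]]

def baseline2_predict(base_models, X):
    """Final prediction: the average of the base-model predictions (Eq. 6)."""
    return np.mean([m.predict(X) for m in base_models], axis=0)
```

Baseline method 1 then corresponds to select_base_models(models, X, y, 1).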
4.3 MBSD-Reuse Method

The proposed method uses MBSD to find the top-q subgroups (subgroups described by q attributes) for each base model of the previous method. A sub-model is then selected according to the MAE within each subgroup:

$$f_{sub}^{j} = \operatorname*{argmin}_{f} \frac{\sum_{i} g_j(d_i) \cdot |y_i - f(x_i)|}{|G_j|} \qquad (7)$$

To combine the predictions of base models and sub-models, the strategy is to use the base model for the data points not covered by the subgroups, and the average of base model and sub-model for the data points within the subgroups. For the jth base model, the mixture model is given as:

$$f_{mix}^{j}(x_i) = \frac{f_{base}^{j}(x_i) + g_j(d_i) \cdot f_{sub}^{j}(x_i)}{1 + g_j(d_i)} \qquad (8)$$

In the case of multiple subgroups (hence multiple sub-models) for each base model (with different ranks or different quality measures), the mixture model with K subgroups is given as:

$$f_{mix}^{j}(x_i) = \frac{f_{base}^{j}(x_i) + \sum_{k=1}^{K} g_{j,k}(d_i) \cdot f_{sub}^{j,k}(x_i)}{1 + \sum_{k=1}^{K} g_{j,k}(d_i)} \qquad (9)$$

Again, the final prediction is obtained by averaging the top-n mixture models:

$$\hat{y}_i^n = \frac{1}{n} \sum_{j=1}^{n} f_{mix}^{j}(x_i) \qquad (10)$$
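A minimal sketch of the combination in Equations 8-10, assuming base_models and, per base model, lists of subgroup functions (each returning a boolean coverage mask over the description data D) and the matching sub-models; the data structures and names here are our own illustration:

```python
import numpy as np

def mixture_predict(base_models, subgroups, sub_models, X, D):
    """Average of the per-base-model mixtures (Eqs. 9-10). subgroups[j][k]
    maps the description data D to a boolean mask g_{j,k}; sub_models[j][k]
    is the model selected to minimise MAE inside that subgroup (Eq. 7)."""
    preds = []
    for j, base in enumerate(base_models):
        num = base.predict(X).astype(float)      # f_base term of Eq. 9
        den = np.ones(len(X))                    # the constant 1 in Eq. 9
        for g, sub in zip(subgroups[j], sub_models[j]):
            mask = g(D)                          # g_{j,k}(d_i)
            num[mask] += sub.predict(X[mask])    # sub-model used only inside g
            den += mask                          # accumulate sum_k g_{j,k}
        preds.append(num / den)                  # mixture f_mix^j (Eq. 9)
    return np.mean(preds, axis=0)                # Eq. 10: average over j
```

Outside all subgroups the denominator stays 1, so the mixture falls back to the base model, exactly as described above.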
5 Experiments

In the experiments, the training data is fixed to the data of the 275 stations during October 2014. As test data, the full-year data of stations 1 to 10 and the 3-month data of stations 226 to 275 are used. In the first experiment (station-oriented), each station is treated as a deployment context. In the second experiment (non-station-oriented), each group of stations (1 to 10, 226 to 275) is treated as a deployment context. In both experiments, the performance is compared among 9 methods: baseline method 1, baseline method 2, MBSD-WRMAE reuse, MBSD-WRMNAE reuse, MBSD-WRMOE reuse, MBSD-WRMUE reuse, MBSD-3-mixture reuse (WRMAE, WRMOE, WRMUE), MBSD-3-mixture reuse (WRMNAE, WRMOE, WRMUE), and MBSD-4-mixture reuse. Up to the top 16 subgroups are used in the prediction, and up to 512 base models are selected and averaged for each deployment context.

The station-oriented error curves for stations 1 to 10 and stations 226 to 275 are given in Figure 6 and Figure 7 respectively; the non-station-oriented error curves for the two groups of stations are shown in Figure 8 and Figure 9 respectively.

Fig. 6: The error curves for stations 1 to 10 (station-oriented), with the top q subgroups adopted for each quality measure (q = 1, 5, 9, 16).

Fig. 7: The error curves for stations 226 to 275 (station-oriented), with the top q subgroups adopted for each quality measure (q = 1, 5, 9, 16).

With respect to the station-oriented approach, baseline method 2 generally beats baseline method 1. This indicates that, when the training data of the deployment context is limited, selecting a set of trained models and averaging their predictions can help reduce the prediction error. As discussed previously, each station can be treated as a bootstrap sample in this scenario, so the baseline is similar to a bagging strategy. However, an important issue for this approach is deciding the number of models to average, as it tends to over-fit quickly as this number grows.

Fig. 8: The error curves for stations 1 to 10 (non-station-oriented), with the top q subgroups adopted for each quality measure (q = 1, 5, 9, 16).

Among the proposed methods, the figures show that MBSD-WRMNAE generally achieves the best performance, except in one case where only the top-1 subgroup is used to predict the group of stations 1 to 10. The good performance of MBSD-WRMNAE can be linked to the error distributions given in the previous section. As Figure 3 (right) shows, only with the quality measure WRMNAE is the error distribution still close to a zero-mean Gaussian, but with less variance than in the population. The subgroup can hence be seen as a less noisy context, which helps the regression model capture better parameters.
On the other hand, especially with large q, the proposed methods tend to reduce the over-fitting effect of baseline method 2. This is mainly because these methods are designed to fit a better model for the data points that are not well predicted by the base models; the effect therefore becomes more significant as q grows, since more sub-models are involved in the prediction. This makes the choice of the number of averaged models less problematic.

With respect to the non-station-oriented approach, the first interesting observation is that, for both groups, the MAE of the baseline methods is significantly lower than in the station-oriented approach. This indicates that treating a set of stations as the deployment context can potentially give better performance, and also that the attribute station might not be the best attribute to separate (describe) deployment contexts. The second observation is that baseline method 2 generally has a higher MAE than baseline method 1 in the non-station-oriented approach. One possible reason is that, since the training data now mixes different stations, simply selecting base models according to their MAE can cause significant over-fitting and hence degrade the performance of the averaged prediction. Since all the proposed methods are essentially built on baseline method 2, although they generally perform better than baseline method 2, their MAE is still higher than that of baseline method 1. However, for q = 16, both MBSD-WRMNAE and MBSD-3-mixture (WRMNAE, WRMOE, WRMUE) still reach a lower MAE than baseline method 1.

Fig. 9: The error curves for stations 226 to 275 (non-station-oriented), with the top q subgroups adopted for each quality measure (q = 1, 5, 9, 16).

6 Conclusion

This paper investigates how SD can be adopted for model reuse. A variation of SD, called Model-Based Subgroup Discovery, is used to detect the predictive patterns (subgroups) of trained models in a new context. A set of sub-models is then selected for these subgroups to construct a mixture model. The experiments show that the proposed method can reduce the MAE of regression models and potentially prevent the over-fitting of averaged models. One further research direction is to develop a model ensemble algorithm with MBSD: since in this paper the trained models were provided, a more interesting research task is to start from preparing base models suitable for later reuse, so that the algorithm covers the whole model reuse procedure.

References
1. Niall Twomey and Peter A. Flach. Context modulation of sensor data applied to activity recognition in smart homes. In LMCE 2014, First International Workshop on Learning over Multiple Contexts, 2014.
2. Willi Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249-271. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.
3. Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Principles of Data Mining and Knowledge Discovery, pages 78-87. Springer, 1997.
4. Nada Lavrač, Branko Kavšek, Peter Flach, and Ljupčo Todorovski. Subgroup discovery with CN2-SD. The Journal of Machine Learning Research, 5:153-188, 2004.
5. Martin Atzmueller and Florian Lemmerich. Fast subgroup discovery for continuous target concepts. In Foundations of Intelligent Systems, pages 35-44. Springer, 2009.
6. Dennis Leman, Ad Feelders, and Arno Knobbe. Exceptional model mining. In Machine Learning and Knowledge Discovery in Databases, pages 1-16. Springer, 2008.
7. Wouter Duivesteijn, Ad J. Feelders, and Arno Knobbe. Exceptional model mining. Data Mining and Knowledge Discovery, pages 1-52, 2013.