<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Laboratory, University of Bristol</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe a method to reuse models with Model-Based Subgroup Discovery (MBSD), which is an extension of the Subgroup Discovery scheme. The task is to predict the number of bikes at a new rental station 3 hours in advance. Instead of training new models with the limited data from these new stations, our approach first selects a number of pre-trained models from old rental stations according to their mean absolute errors (MAE). For each selected model, we then perform MBSD to locate a number of subgroups in which the selected model has a deviating prediction performance. Another set of pre-trained models is then selected according to their MAE over each subgroup only. Finally, the predictions are made by averaging the predictions from the models selected during the previous two steps. The experiments show that our method performs better than selecting the trained model with the lowest MAE, and better than averaging the low-MAE models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper we propose a model reuse approach exploiting Model-Based Subgroup
Discovery (MBSD). The general idea of model reuse is to use trained models from
other operating contexts under a new operating context. Such a strategy has two main
benefits. Firstly, it can dramatically reduce model training time on the new operating
contexts. Secondly, if the new operating context only has limited data, as model reuse
essentially extends the scale of the training data by adding training data from other
operating contexts, it can help further improve the prediction’s performance.</p>
      <p>
        One major challenge for model reuse is that the patterns in the data can vary across
different operating contexts. This makes it difficult to directly apply trained models
from the training contexts to a new context. For instance, predicting the activities of
daily living (ADLs) from sensor readings is one of the leading applications of
a smart home. However, as both the household and the layout vary from house to house,
it is hard to directly apply a model trained in one particular house to
another house. Therefore, recognising and dealing with such variations across
operating contexts has become a non-trivial research task for model reuse [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In this paper, we will use a variation of the Subgroup Discovery (SD) scheme [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
        ]
to help the reused models adapt to the new context. SD is a data mining technique that
uses a descriptive model to learn unusual statistics of a target variable in a given
data-set. However, traditional SD approaches generally focus on the statistics of a single
attribute in a fixed data-set, which makes them less appropriate for model reuse. Therefore, we
propose an extended method, MBSD, in this paper. The main modification is to change
the target variable in the SD task from an attribute into the prediction performance of a
particular base model on that attribute. Through this modification, MBSD can be used to
discover the prediction pattern of a trained model in a new operating context. This can
help locate the potential sub-contexts where the trained model can be directly applied,
and the potential sub-contexts where other trained models are required.
      </p>
      <p>The experiments are based on the machine learning challenge MoReBikeS,
organised by the workshop LMCE 2015 within the conference ECML-PKDD 2015.
The task is to predict the number of bikes available at a particular rental station 3 hours
ahead, given some historical data. In detail, the overall data-set is obtained from 275 bike
rental stations located in Valencia, Spain. All participants get access to
the data for all 275 stations during October 2014 and can use these data as training
data. Six trained linear regression models are also provided for each station from station
1 to station 200. These linear regression models are trained with data that covers
the whole year of 2014. Therefore, the task of the challenge is, by reusing these trained
models and the limited training data, to predict the number of available bikes at some new
bike stations (station 201 to station 275).</p>
      <p>The method we used can be briefly described as follows. For any station to be
predicted, the one-month training data is used to select a number of models with
good performance (low MAE values); these models are called base-models. The
assumption here is that these base-models are suitable only for some unknown sub-context
of the context to be predicted (the sub-context that is similar to the training context of the
base-models), and not suitable for other sub-contexts. Under this assumption,
we can perform MBSD to discover these sub-contexts and to further select a number
of models that perform well specifically within the sub-contexts. These models are
denoted as sub-models. Finally, the overall prediction is obtained by averaging the
predictions from both base-models and sub-models, with some averaging strategy. The
experiments show that, with MBSD, the MAE can be further reduced compared to
simply averaging the predictions of the base-models.</p>
      <p>This paper is organised as follows. In section 2 some preliminaries of Subgroup
Discovery are given and in section 3 the basic concept of MBSD is introduced. The method
to reuse models with MBSD is stated in section 4. Section 5 shows some experiments
with the MoReBikeS data. A conclusion of the whole paper is provided in section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Subgroup Discovery</title>
      <p>In this section we will give some preliminaries and corresponding notations of SD.</p>
      <p>
        Subgroup Discovery (SD) [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
        ] is a data mining technique that learns rules to
describe patterns of some attributes in a given data-set. Since the construction of
subgroups is driven by some attribute, called the target variable, it can be seen as a descriptive
model which is learnt in a supervised way. However, SD still differs from predictive
models, as in SD we are not aiming to predict the target variable, but to discover some
interesting patterns with respect to it. Therefore, the definition of an interesting pattern
needs to be given. In the existing literature, an interesting pattern often refers to a
different class distribution (for a binary/nominal target variable), or in general to an unusual
statistic (for a binary/nominal/numerical target variable). On the other hand, because such
patterns often have a small coverage, some of the literature also defines SD as the task of finding
patterns that have both large coverage and an unusual statistic.
      </p>
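      <p>Mathematically, suppose the data-set contains N instances and M attributes.
Traditional SD assumes that one of the M attributes is selected as the target variable.
The value of this attribute for the ith instance is denoted as <inline-formula><tex-math>y_i \in \mathbb{R}</tex-math></inline-formula>, and the domain of this
attribute is denoted as <inline-formula><tex-math>Y = \{y_i\}_{i=1}^{N}</tex-math></inline-formula>. The remaining <inline-formula><tex-math>M - 1</tex-math></inline-formula> attributes are used as the description
attributes, denoted as <inline-formula><tex-math>d_i \in \mathbb{R}^{M-1}</tex-math></inline-formula>, and the domain of these attributes is denoted as <inline-formula><tex-math>D = \{d_i\}_{i=1}^{N}</tex-math></inline-formula>.</p>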
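      <p>A subgroup is denoted as a function <inline-formula><tex-math>g : D \to \{0, 1\}</tex-math></inline-formula>. Hence <inline-formula><tex-math>g(d_i) = 1</tex-math></inline-formula> means that the
ith instance is covered by this subgroup, and <inline-formula><tex-math>g(d_i) = 0</tex-math></inline-formula> means that it is not. We use <inline-formula><tex-math>G = \{i : g(d_i) = 1\}</tex-math></inline-formula> to
denote the set of instances covered by the subgroup g.</p>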
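      <p>The task of (top q) subgroup discovery can be defined as follows: given a set of candidate
subgroups <inline-formula><tex-math>\mathcal{G} \subseteq 2^{D}</tex-math></inline-formula> and a quality measure <inline-formula><tex-math>f : \mathcal{G} \to \mathbb{R}</tex-math></inline-formula>, find a set of q subgroups
<inline-formula><tex-math>\mathcal{G}_q = \{g_1, \ldots, g_q\}</tex-math></inline-formula> such that <inline-formula><tex-math>f(g_1) \geq f(g_2) \geq \ldots \geq f(g_q)</tex-math></inline-formula>, and <inline-formula><tex-math>\forall g_i \in \mathcal{G}_q, \forall g_j \in \mathcal{G} \setminus \mathcal{G}_q : f(g_i) \geq f(g_j)</tex-math></inline-formula>.</p>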
      <p>
        With respect to the quality measure, since all the quality measures used in this
paper can be seen as extensions of the quality measure Continuous Weighted Relative
Accuracy (CWRAcc) [
        <xref ref-type="bibr" rid="ref5">5</xref>
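        ], the definition of CWRAcc is given here:
        <disp-formula><label>(1)</label><tex-math>f_{CWRAcc}(g) = \frac{|G|}{N} \left( \frac{\sum_{i \in G} y_i}{|G|} - \frac{\sum_{i=1}^{N} y_i}{N} \right)</tex-math></disp-formula>
      </p>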
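      <p>The CWRAcc definition can be illustrated with a minimal sketch. The function name cwracc, the boolean coverage list and the toy values are assumptions made for this example only and are not part of the original method description; an empty subgroup is assumed to have quality 0.</p>
      <preformat><![CDATA[
```python
from typing import Sequence


def cwracc(y: Sequence[float], covered: Sequence[bool]) -> float:
    """Coverage |G|/N times (subgroup mean of y minus population mean of y)."""
    n = len(y)
    members = [yi for yi, c in zip(y, covered) if c]
    if n == 0 or not members:
        return 0.0  # assumption: an empty subgroup gets quality 0
    coverage = len(members) / n
    return coverage * (sum(members) / len(members) - sum(y) / n)


# Hypothetical target values and a subgroup covering the last two instances.
y = [1.0, 2.0, 3.0, 6.0]
g = [False, False, True, True]
quality = cwracc(y, g)  # 0.5 * (4.5 - 3.0)
```
]]></preformat>
      <p>A subgroup that covers the whole data-set has the same mean as the population, so its CWRAcc is zero; the measure rewards subgroups that are both large and unusual.</p>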
    </sec>
    <sec id="sec-3">
      <title>Model-Based Subgroup Discovery</title>
      <p>In this section we will briefly introduce the concept of MBSD, together with the
quality measures and search strategy applied in the following experiments. An example of
MBSD applied to a particular bike station will also be given.</p>
      <sec id="sec-3-1">
        <title>Motivation</title>
        <p>
          The motivation of MBSD is to bring models into an SD process, so that the resulting
subgroups can contain richer information. Although this concept is similar to the
Exceptional Model Mining (EMM) framework [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ], MBSD differs from EMM in
the way the models are used. In EMM, a model is trained on each candidate
subgroup, and the quality of the subgroup is then evaluated by the parameter deviation
between the model trained on the subgroup and the model trained on the whole
data-set (the global model). In MBSD, on the other hand, only the global model is involved
in the discovery process. For each candidate subgroup, the quality of the subgroup is
evaluated either according to the likelihood of the global model under the subgroup
(for both non-predictive and predictive models), or according to the prediction performance
of the global model in the subgroup (for predictive models). Since the purpose of this
paper is to reuse models via MBSD, we omit a detailed discussion of the differences
between MBSD and EMM. In general, since MBSD requires only a global model,
repeated training over different candidate subgroups is avoided, which makes MBSD
more appropriate for model reuse.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Quality Measure for Regression Models</title>
        <p>As the MBSD task in this paper only involves regression, here we only present four quality
measures for regression models.</p>
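        <p>Suppose the target attribute to be predicted is denoted as <inline-formula><tex-math>y_i</tex-math></inline-formula> for the ith instance and
the prediction made by the base-model is denoted as <inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula>. The first proposed quality
measure, Weighted Relative Mean Absolute Error (WRMAE), is based on the absolute error
of the base model, <inline-formula><tex-math>z_i^{AE} = |\hat{y}_i - y_i|</tex-math></inline-formula>. This quality measure is designed to find subgroups
with large coverage and a relatively higher MAE than the population:</p>
        <p><disp-formula><label>(2)</label><tex-math>f_{WRMAE}(f_{base}, G) = \frac{|G|}{N} \left( \frac{\sum_{i \in G} z_i^{AE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{AE}}{N} \right)</tex-math></disp-formula></p>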
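        <p>Similarly, if the aim is to find subgroups where the base model tends to have a lower
MAE than the population, the negative absolute error <inline-formula><tex-math>z_i^{NAE} = -|\hat{y}_i - y_i|</tex-math></inline-formula> can be applied.
The second proposed quality measure, Weighted Relative Mean Negative Absolute Error
(WRMNAE), is given as:</p>
        <p><disp-formula><label>(3)</label><tex-math>f_{WRMNAE}(f_{base}, G) = \frac{|G|}{N} \left( \frac{\sum_{i \in G} z_i^{NAE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{NAE}}{N} \right)</tex-math></disp-formula></p>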
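        <p>Another scenario is to discover the subgroups where the base-model tends to
over-estimate the target attribute. In this case the quality measure should be designed according to
the over-estimated error:
<disp-formula><tex-math><![CDATA[z_i^{OE} = \begin{cases} \hat{y}_i - y_i & \text{if } \hat{y}_i \geq y_i \\ 0 & \text{otherwise} \end{cases}]]></tex-math></disp-formula></p>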
        <p>Notice that the under-estimations are forced to zero here, hence the quality of a
subgroup is not affected by having both a high over-estimated error and a high
under-estimated error. Subgroups with both kinds of high error can instead be discovered
with the quality measure WRMAE. The quality measure Weighted Relative Mean
Over-Estimated Error (WRMOE) is given as:</p>
        <p><disp-formula><label>(4)</label><tex-math>f_{WRMOE}(f_{base}, G) = \frac{|G|}{N} \left( \frac{\sum_{i \in G} z_i^{OE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{OE}}{N} \right)</tex-math></disp-formula></p>
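        <p>Analogously, the under-estimated error and the corresponding quality measure, Weighted
Relative Mean Under-Estimated Error (WRMUE), can be defined as:
<disp-formula><tex-math><![CDATA[z_i^{UE} = \begin{cases} y_i - \hat{y}_i & \text{if } y_i \geq \hat{y}_i \\ 0 & \text{otherwise} \end{cases}]]></tex-math></disp-formula></p>
        <p><disp-formula><label>(5)</label><tex-math>f_{WRMUE}(f_{base}, G) = \frac{|G|}{N} \left( \frac{\sum_{i \in G} z_i^{UE}}{|G|} - \frac{\sum_{i=1}^{N} z_i^{UE}}{N} \right)</tex-math></disp-formula></p>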
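        <p>The four quality measures share one form and differ only in the per-instance error term. The following sketch makes this explicit; the names ERROR_TERMS, weighted_relative_mean and mbsd_quality, as well as the toy data, are assumptions introduced for illustration and do not come from the original implementation.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, Sequence

# Per-instance error terms z_i: pred is the base-model prediction, true the observed value.
ERROR_TERMS: Dict[str, Callable[[float, float], float]] = {
    "WRMAE": lambda pred, true: abs(pred - true),        # absolute error
    "WRMNAE": lambda pred, true: -abs(pred - true),      # negative absolute error
    "WRMOE": lambda pred, true: max(pred - true, 0.0),   # over-estimated error
    "WRMUE": lambda pred, true: max(true - pred, 0.0),   # under-estimated error
}


def weighted_relative_mean(z: Sequence[float], covered: Sequence[bool]) -> float:
    """Coverage times (subgroup mean of z minus population mean of z)."""
    n = len(z)
    members = [zi for zi, c in zip(z, covered) if c]
    if not members:
        return 0.0  # assumption: an empty subgroup gets quality 0
    return (len(members) / n) * (sum(members) / len(members) - sum(z) / n)


def mbsd_quality(measure: str, y_true: Sequence[float],
                 y_pred: Sequence[float], covered: Sequence[bool]) -> float:
    """Quality of a subgroup for a base model, for one of the four measures."""
    term = ERROR_TERMS[measure]
    z = [term(p, t) for p, t in zip(y_pred, y_true)]
    return weighted_relative_mean(z, covered)


# Hypothetical observations, base-model predictions and subgroup coverage.
y_true = [5.0, 5.0, 5.0, 5.0]
y_pred = [5.0, 6.0, 9.0, 2.0]
covered = [False, False, True, True]
scores = {name: mbsd_quality(name, y_true, y_pred, covered) for name in ERROR_TERMS}
```
]]></preformat>
        <p>Because the absolute error of each instance is the sum of its over-estimated and under-estimated errors, the WRMAE of a subgroup equals the sum of its WRMOE and WRMUE, and the WRMNAE is the negation of the WRMAE.</p>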
      </sec>
      <sec id="sec-3-3">
        <title>Description Language and Search Strategy</title>
        <p>In traditional SD, the description of subgroups can be built on any attribute other than
the target variable. In EMM with predictive models, the description of subgroups can
be built on any attribute except the input and output of the model. This is because the
essential aim of SD is to use some other attributes to describe the pattern of the target
attributes, hence the description should avoid using the target attributes. However, for
MBSD with predictive models, the description of subgroups can potentially be built
on any attribute in the data-set. The reason is that the pattern MBSD (with
predictive models) tries to describe is the prediction pattern of the base model, instead
of the pattern of the attributes.</p>
        <p>As with many other logical models, there exist many ways to split the hypothesis space
to generate the candidate subgroups. This generally involves fixing the operations on
each attribute. In this paper we simply use a conjunction of attribute-value pairs
as the description language. For numerical attributes, a pre-processing step
divides each numerical attribute into equal-size bins, which are then treated as nominal
values. Since the experiments involve a large number of SD tasks, we also assume
that each subgroup is described by a single attribute from the description attributes.
This helps further reduce the search cost. Also, for each attribute, only the best
subgroup described by that attribute is selected. Hence the top q subgroups can be
seen as subgroups described by q different attributes.</p>
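        <p>The binning step can be sketched as follows. The text does not state whether the bins have equal width or equal frequency; this sketch assumes equal-width bins, and the function name equal_width_bins and the example values are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each numerical value to a bin index in 0 .. n_bins - 1 (equal-width assumption)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant attribute falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so it is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical values of a numerical attribute, binned into 5 nominal values.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bins = equal_width_bins(values, n_bins=5)
```
]]></preformat>
        <p>After this step each bin index can be treated as one nominal value of the attribute, so a candidate subgroup is simply a set of bin indices of a single attribute.</p>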
        <p>As only a single attribute is used to describe each subgroup, the search strategy
can be seen as a refinement process that adds different values of the corresponding
attribute. Here we further use a greedy covering algorithm to increase the search speed
and reduce the memory usage. The algorithm proceeds as follows. At each step, the remaining bin with the
highest mean value of the target variable (e.g. AE) is added to the description.
The algorithm terminates once the quality measure becomes smaller than at the previous
step. This covering algorithm is generally similar to a beam search algorithm with the
beam width fixed to 1, which is possible because the refinement is done within the same
attribute.</p>
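        <p>The greedy covering search described above can be sketched as follows. This is a simplified reading of the procedure, not the original implementation: the names greedy_cover and top_q_attributes, the attribute names and the data are hypothetical, the per-instance error values z are assumed to be precomputed, and ties in quality are assumed to let the search continue.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def greedy_cover(bins: Sequence[Hashable], z: Sequence[float]) -> Tuple[List[Hashable], float]:
    """Add bins in order of decreasing mean z; stop once the quality decreases."""
    n = len(z)
    pop_mean = sum(z) / n
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for b, zi in zip(bins, z):
        groups[b].append(zi)
    ranked = sorted(groups, key=lambda b: sum(groups[b]) / len(groups[b]), reverse=True)

    chosen: List[Hashable] = []
    total, count, best = 0.0, 0, float("-inf")
    for b in ranked:
        new_total = total + sum(groups[b])
        new_count = count + len(groups[b])
        quality = (new_count / n) * (new_total / new_count - pop_mean)
        if quality < best:
            break  # quality dropped below the previous step: terminate
        chosen.append(b)
        total, count, best = new_total, new_count, quality
    return chosen, best


def top_q_attributes(attribute_bins: Dict[str, Sequence[Hashable]],
                     z: Sequence[float], q: int) -> List[Tuple[str, List[Hashable], float]]:
    """Keep the best subgroup per attribute and return the q best attributes."""
    results = [(name,) + greedy_cover(b, z) for name, b in attribute_bins.items()]
    return sorted(results, key=lambda r: r[2], reverse=True)[:q]


# Hypothetical binned attributes and absolute errors of a base model.
z = [4.0, 6.0, 3.0, 5.0, 1.0, 1.0, 0.0, 0.0]
attribute_bins = {"attr_a": [0, 0, 1, 1, 2, 2, 3, 3], "attr_b": [0, 1, 0, 1, 0, 1, 0, 1]}
best = top_q_attributes(attribute_bins, z, q=1)
```
]]></preformat>
        <p>In this toy example the search on attr_a adds bins 0 and 1 and stops when adding bin 2 would lower the quality, so only one pass over the sorted bins is needed per attribute.</p>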
      </sec>
      <sec id="sec-3-4">
        <title>MBSD with Single Bike Station</title>
        <p>In the MoReBikeS challenge, there are 25 attributes in total in the data-set. Table 1
summarises the information for each attribute in the provided one-month data, such as name,
type (binary, nominal, numerical), number of values, and number of bins configured in
the MBSD task (only for numerical attributes).</p>
        <p>Although all 25 attributes can be used to construct the candidate subgroups,
the attribute bikes is the variable to be predicted and can therefore be removed from the description.
Also, because the MBSD task is performed for each individual station during
Oct 2014, the attributes station, latitude, longitude, year, month, and timestamp can
be further excluded.</p>
        <p>For simplicity, from now on we will use model i-j to refer to model j of station
i (j = 1 for short, j = 2 for short temp, j = 3 for full, j = 4 for full temp, j = 5 for
short full, j = 6 for short full temp).</p>
        <p>[Table 1: attribute, type, number of values, number of bins.]</p>
        <p>For instance, Figure 1 (left) shows the prediction for station 201 from model 1-1
during Oct 2014, together with the ground truth. Figure 1 (right) gives the empirical
distribution of the prediction errors.</p>
        <p>If MBSD is performed with the prediction shown above and the quality measure
WRMAE is applied, the best (rank 1) subgroup is found with the attribute weekhour.
The corresponding attribute values are shown in Figure 2 (left). Since
we treat this numerical attribute as a nominal attribute (i.e. the candidate subgroups can
contain any combination of attribute values), the attribute values found look sparse.
However, some patterns can still be read from the figure. For instance, most
of the attribute values are located around the night of each day. Figure 2 (right) gives
the empirical distribution of the prediction errors within the subgroup. Compared to
Figure 1, the distribution of errors here has a significantly higher variance, which
indicates a higher MAE. Figure 3 (left) shows the best subgroup found with the quality
measure WRMNAE. Since WRMNAE can be seen as a negated version of WRMAE,
the best subgroup with WRMNAE is the complement of the best subgroup with
WRMAE.</p>
        <p>[Figure annotation: MAE = 2.7514]</p>
        <p>Similarly, we can also find subgroups with the quality measure WRMOE and
WRMUE. The results (attribute values of the best subgroup and error distribution within the
subgroup) of WRMOE and WRMUE are given in Figure 4 and Figure 5 respectively.</p>
        <p>It can be seen that, for all 4 quality measures, the best subgroup is described by the
attribute weekhour. However, the description attributes for the top-q subgroups can vary
with different quality measures. The description attributes for the top-5 subgroups with
each quality measure are given in Table 2.</p>
        <p>In general, MBSD can be used to find deviating prediction patterns in a given
data-set. For regression models, MBSD is set to use one attribute to describe the
data points on which the base model tends to predict well or poorly. Therefore, each attribute
used by the subgroups can be seen as sharing some non-linear correlation with the model’s
prediction. This is similar to attribute selection (e.g. regularisation), but in a
non-linear form.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Model Reuse with MBSD</title>
      <p>In this section we introduce how to reuse trained models with MBSD. The general
idea is that, for each deployment context, we select a number of trained models according to
their performance. Then, with MBSD, we further detect the (pattern of) data points
that these models predict well or poorly, which can be seen as a sub-context. A
number of sub-models are then selected just for these data points. The final prediction
is hence estimated by averaging the predictions from the base models and sub-models.</p>
      <p>[Figure: Station 201 with model 1-1, WRMAE; labels: Population, Subgroup; MAE(G) = 4.5534.]</p>
      <sec id="sec-4-1">
        <title>Baseline Method 1</title>
        <p>The first baseline method is, for each deployment context, to simply select the one model from
the 1200 trained models (200 stations, 6 models per station) that has the lowest MAE
on the test station.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Baseline Method 2</title>
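        <p>The second baseline method is, for each test station, to rank the 1200 trained models
according to their MAE on the test station. The final prediction is then the average of
the predictions of the top-n models (the selected models are referred to as base models):
<disp-formula><label>(6)</label><tex-math>\hat{y}_i^{n} = \frac{1}{n} \sum_{j=1}^{n} f_{base}^{j}(x_i)</tex-math></disp-formula>
where <inline-formula><tex-math>f_{base}^{j}</tex-math></inline-formula> denotes the jth-ranked base model and <inline-formula><tex-math>x_i</tex-math></inline-formula> the model input for the ith instance.</p>
        <p>[Figure: Station 201 with model 1-1, WRMOE; labels: Population, Subgroup; MAE(G) = 3.9921.]</p>
        <p>The proposed method is to use MBSD to find the top q subgroups (subgroups described
by q attributes) for each base model in the previous method. Then a sub-model is
selected according to the MAE within each subgroup:
<disp-formula><label>(7)</label><tex-math>f_{sub}^{j} = \arg\min_{f} \frac{\sum_{i=1}^{N} g_j(d_i)\, |y_i - f(x_i)|}{|G_j|}</tex-math></disp-formula></p>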
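        <p>Model selection for the two baselines and for the sub-models can be sketched as follows. The helper names (mae, rank_models, average_top_n, select_sub_model) and the three toy models are hypothetical; models are assumed to be plain callables, and the subgroup passed to select_sub_model is assumed to cover at least one instance.</p>
        <preformat><![CDATA[
```python
from typing import Callable, List, Optional, Sequence

Model = Callable[[float], float]  # assumption: a model maps an input x to a prediction


def mae(model: Model, xs: Sequence[float], ys: Sequence[float],
        mask: Optional[Sequence[bool]] = None) -> float:
    """Mean absolute error, optionally restricted to the instances where mask is True."""
    errors = [abs(y - model(x))
              for i, (x, y) in enumerate(zip(xs, ys))
              if mask is None or mask[i]]
    return sum(errors) / len(errors)


def rank_models(models: Sequence[Model], xs: Sequence[float], ys: Sequence[float],
                mask: Optional[Sequence[bool]] = None) -> List[Model]:
    """Models sorted by increasing MAE; the first entry is baseline method 1."""
    return sorted(models, key=lambda m: mae(m, xs, ys, mask))


def average_top_n(models: Sequence[Model], xs: Sequence[float], ys: Sequence[float],
                  n: int, x_new: float) -> float:
    """Baseline method 2: average the predictions of the n lowest-MAE models."""
    top = rank_models(models, xs, ys)[:n]
    return sum(m(x_new) for m in top) / len(top)


def select_sub_model(models: Sequence[Model], xs: Sequence[float], ys: Sequence[float],
                     covered: Sequence[bool]) -> Model:
    """Sub-model: the model with the lowest MAE inside the subgroup only."""
    return rank_models(models, xs, ys, mask=covered)[0]


def identity(x: float) -> float:
    return x


def plus_one(x: float) -> float:
    return x + 1.0


def double(x: float) -> float:
    return 2.0 * x


# Hypothetical one-month data of a new station and three candidate models.
models = [identity, plus_one, double]
xs, ys = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 6.0, 8.0]
best_single = rank_models(models, xs, ys)[0]
avg_pred = average_top_n(models, xs, ys, n=2, x_new=5.0)
sub_model = select_sub_model(models, xs, ys, covered=[True, True, False, False])
```
]]></preformat>
        <p>In this toy data the model double has the lowest overall MAE, while identity is the best model on the first two instances, which illustrates why a different model may be selected for a subgroup than for the whole context.</p>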
        <p>To combine the predictions from both the base models and sub-models, the
strategy is to use the base models for the data points that are not covered by the
subgroups, and the average of base models and sub-models for the data points within
the subgroups. For the jth base model, the mixture model is given as:
<disp-formula><tex-math>f_{mix}^{j}(x_i) = \frac{f_{base}^{j}(x_i) + g_j(d_i)\, f_{sub}^{j}(x_i)}{1 + g_j(d_i)}</tex-math></disp-formula></p>
        <p>For the case that there are multiple subgroups (hence multiple sub-models) for each
base model (with different ranks or different quality measures), the mixture model (with
K different subgroups) is given as:
<disp-formula><tex-math>f_{mix}^{j}(x_i) = \frac{f_{base}^{j}(x_i) + \sum_{k=1}^{K} g_{j,k}(d_i)\, f_{sub}^{j,k}(x_i)}{1 + \sum_{k=1}^{K} g_{j,k}(d_i)}</tex-math></disp-formula></p>
        <p>Again, we get the final prediction by averaging the top-n mixture models:
<disp-formula><tex-math>\hat{y}_i^{n} = \frac{1}{n} \sum_{j=1}^{n} f_{mix}^{j}(x_i)</tex-math></disp-formula></p>
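        <p>The mixture strategy can be sketched as follows, assuming that base models and sub-models are callables on the model input and that each subgroup is an indicator function on the description attributes. All names and constant-valued toy models below are hypothetical and serve only to show the arithmetic.</p>
        <preformat><![CDATA[
```python
from typing import Any, Callable, List, Sequence, Tuple

Model = Callable[[Any], float]   # maps the model input x_i to a prediction
Subgroup = Callable[[Any], int]  # maps the description d_i to 0 or 1


def mixture_predict(base: Model, subs: Sequence[Tuple[Subgroup, Model]],
                    x: Any, d: Any) -> float:
    """Base prediction averaged with the sub-models whose subgroup covers d."""
    numerator = base(x)
    denominator = 1
    for g, sub_model in subs:
        cover = g(d)
        numerator += cover * sub_model(x)
        denominator += cover
    return numerator / denominator


def final_predict(mixtures: Sequence[Tuple[Model, Sequence[Tuple[Subgroup, Model]]]],
                  x: Any, d: Any) -> float:
    """Average of the mixture models built from the top-n base models."""
    preds: List[float] = [mixture_predict(base, subs, x, d) for base, subs in mixtures]
    return sum(preds) / len(preds)


def night(d: Any) -> int:
    # Hypothetical subgroup on the attribute weekhour.
    return 1 if d["weekhour"] < 6 else 0


def base_a(x: Any) -> float:
    return 10.0


def base_b(x: Any) -> float:
    return 20.0


def sub_a(x: Any) -> float:
    return 4.0


mixtures = [(base_a, [(night, sub_a)]), (base_b, [])]
inside = final_predict(mixtures, x=None, d={"weekhour": 3})    # ((10 + 4) / 2 + 20) / 2
outside = final_predict(mixtures, x=None, d={"weekhour": 12})  # (10 + 20) / 2
```
]]></preformat>
        <p>Outside every subgroup the denominator stays at 1, so the mixture reduces to the base model; inside a subgroup each covering sub-model receives the same weight as the base model.</p>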
        <p>[Figure 7: station-oriented (SO) error curves for station 226 to 275; panel titles: q = 5 and q = 16; horizontal axis: number of models averaged.]</p>
        <p>With respect to the experiments, the training data is fixed to be the data of the 275 stations
during Oct 2014. For testing data, the full-year data of station 1 to station 10 and the
3-month data of station 226 to station 275 are used. In the first experiment
(station-oriented), each station is seen as a deployment context. In the second experiment
(non-station-oriented), each group of stations (1 to 10, 226 to 275) is seen as a deployment context.</p>
        <p>In both experiments, the performance is compared among 9 methods: baseline
method 1, baseline method 2, MBSD-WRMAE reuse, MBSD-WRMNAE reuse,
MBSD-WRMOE reuse, MBSD-WRMUE reuse, MBSD-3-mixture reuse (WRMAE, WRMOE,
WRMUE), MBSD-3-mixture reuse (WRMNAE, WRMOE, WRMUE), and MBSD-4-mixture
reuse. Up to the top 16 subgroups are used in the prediction, and up to 512
base models are selected and averaged for each deployment context.</p>
        <p>The station-oriented error curves for station 1 to station 10 and station 226 to station
275 are given in Figure 6 and Figure 7 respectively. The non-station-oriented error curves for
the two groups of stations are shown in Figure 8 and Figure 9 respectively.</p>
        <p>With respect to the station-oriented approach, it can be seen that baseline method
2 generally beats baseline method 1. This indicates that, when the training data of the
deployment context is limited, selecting a number of trained models and averaging them can
potentially help reduce the prediction error. As previously discussed, in this scenario
each station can be treated as a bootstrap sample, so baseline method 2 is similar to using
the bagging strategy. However, an important issue for this approach is to decide the
number of models to be averaged, as it tends to over-fit quickly as this number gets larger.</p>
        <p>[Figure 8: non-station-oriented (NSO) error curves for station 1 to 10; panel titles: q = 1 and q = 5.]</p>
        <p>For the proposed methods, the figures show that the method MBSD-WRMNAE
generally gets the best performance except in one case, where only the top 1 subgroup
is used to predict the group of stations 1 to 10. The good performance of
MBSD-WRMNAE can be linked to the error distributions given in the previous section.
As Figure 3 (right) shows, only with the quality measure WRMNAE is the error distribution
still close to a Gaussian distribution with zero mean, but with less variance than in the
population. The subgroup can hence be seen as a less noisy context, which helps the
regression model to capture better parameters. On the other hand, it can be seen that,
especially with large q, the proposed methods tend to reduce the over-fitting effect of
baseline method 2. This is mainly because these methods are designed to fit a better
model for the data points that are not well predicted by the base models. Therefore,
the effect becomes more significant as q gets larger, as more
sub-models are involved in the prediction. This makes the choice of the number of averaged
models less problematic.</p>
        <p>With respect to the non-station-oriented method, the first interesting observation
is that, for both groups, the MAE of the baseline method is significantly lower than in the
station-oriented approach. This indicates that treating a set of stations as the deployment
context can potentially help to get better performance. This also means that the attribute
station might not be the best attribute to separate (describe) the deployment context. The
second observation is that baseline method 2 generally has a higher MAE than
baseline method 1 in the non-station-oriented approach. One possible reason could
be that, since the training data is now mixed across different stations, simply selecting base
models according to MAE can cause significant over-fitting and hence lower
the performance of the averaged prediction.</p>
        <p>[Figure 9: non-station-oriented (NSO) error curves for station 226 to 275; panel titles: q = 1 and q = 5.]</p>
        <p>Since all the proposed methods are essentially based on baseline method 2,
although they generally perform better than baseline method 2, their MAE is still
higher than that of baseline method 1. However, in the case q = 16, it can be seen that both
MBSD-WRMNAE and MBSD-3-mixture (WRMNAE, WRMOE, WRMUE) can still
reach an MAE lower than baseline method 1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper investigates how SD can be adopted for model reuse. A variation of SD,
called Model-Based Subgroup Discovery, is used to detect the predictive patterns
(subgroups) of the trained models in the new context. A set of sub-models is then
selected for these subgroups to construct a mixture model. The experiments show that our
proposed method can reduce the MAE of regression models and potentially limit the
over-fitting of averaged models.</p>
      <p>One further research direction is to develop a model ensemble algorithm with MBSD.
Since in this paper the trained models are provided, a more interesting research task
is to start from preparing the base models that can be further reused, so that the
algorithm covers the whole model reuse procedure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Niall</given-names>
            <surname>Twomey</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter A</given-names>
            <surname>Flach</surname>
          </string-name>
          .
          <article-title>Context modulation of sensor data applied to activity recognition in smart homes</article-title>
          .
          <source>In LMCE 2014, First International Workshop on Learning over Multiple Contexts</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Willi</given-names>
            <surname>Klösgen</surname>
          </string-name>
          .
          <article-title>Explora: A multipattern and multistrategy discovery assistant</article-title>
          .
          <source>In Advances in Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>249</fpage>
          -
          <lpage>271</lpage>
          . American Association for Artificial Intelligence, Menlo Park, CA, USA,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Wrobel</surname>
          </string-name>
          .
          <article-title>An algorithm for multi-relational discovery of subgroups</article-title>
          .
          <source>In Principles of Data Mining and Knowledge Discovery</source>
          , pages
          <fpage>78</fpage>
          -
          <lpage>87</lpage>
          . Springer,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Nada</given-names>
            <surname>Lavrač</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Branko</given-names>
            <surname>Kavšek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Flach</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ljupčo</given-names>
            <surname>Todorovski</surname>
          </string-name>
          .
          <article-title>Subgroup discovery with CN2-SD</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>5</volume>
          :
          <fpage>153</fpage>
          -
          <lpage>188</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Atzmueller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Lemmerich</surname>
          </string-name>
          .
          <article-title>Fast subgroup discovery for continuous target concepts</article-title>
          .
          <source>In Foundations of Intelligent Systems</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>44</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Dennis</given-names>
            <surname>Leman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ad</given-names>
            <surname>Feelders</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arno</given-names>
            <surname>Knobbe</surname>
          </string-name>
          .
          <article-title>Exceptional model mining</article-title>
          .
          <source>In Machine Learning and Knowledge Discovery in Databases</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Wouter</given-names>
            <surname>Duivesteijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ad J</given-names>
            <surname>Feelders</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arno</given-names>
            <surname>Knobbe</surname>
          </string-name>
          .
          <article-title>Exceptional model mining</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>52</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>