Learning Model Rules from High-Speed Data Streams

Ezilda Almeida, Carlos Ferreira and João Gama
LIAAD - INESC Porto L.A., Portugal

Abstract

Decision rules are one of the most expressive languages for machine learning. In this paper we present Adaptive Model Rules (AMRules), the first streaming rule learning algorithm for regression problems. In AMRules the antecedent of a rule is a conjunction of conditions on the attribute values, and the consequent is a linear combination of attribute values. Each rule in AMRules uses a Page-Hinkley test to detect changes in the process generating data and reacts to changes by pruning the rule set. In the experimental section we report the results of AMRules on benchmark regression problems and compare the performance of our algorithm with other streaming regression algorithms.

Keywords: Data Streams, Regression, Rule Learning, Change Detection

1 Introduction

Regression analysis is a technique for estimating a functional relationship between a dependent variable and a set of independent variables. It has been widely studied in statistics, machine learning and data mining. Predicting numeric values usually involves complicated regression formulae. Model trees [14] and regression rules [15] are among the most powerful data mining models: trees and rules perform automatic feature selection, are robust to outliers and irrelevant features, exhibit a high degree of interpretability, and are structurally invariant to monotonic transformations of the independent variables. One important aspect of rules is modularity: each rule can be interpreted per se [6].
In the data stream computational model [7], examples are generated sequentially from time-evolving distributions. Learning from data streams requires incremental learning, the use of limited computational resources, and the ability to adapt to changes in the process generating data. In this paper we present Adaptive Model Rules, the first one-pass algorithm for learning regression rule sets from time-evolving streams. AMRules can learn ordered and unordered rules. The antecedent of a rule is a set of literals (conditions based on the attribute values). The consequent of a rule is a function that minimizes the mean square error of the target attribute computed from the set of examples covered by the rule. This function might be either a constant, the mean of the target attribute, or a linear combination of the attributes. Each rule is equipped with an online change detector that monitors the mean square error using the Page-Hinkley test, providing information about the dynamics of the process generating data.

The paper is organized as follows. The next section presents related work on learning regression trees and rules, focusing on streaming algorithms. Section 3 describes the AMRules algorithm in detail. Section 4 presents the experimental evaluation using stationary and time-evolving streams, in which AMRules is compared against other regression systems. The last section presents the lessons learned.

2 Related Work

In this section we analyze the related work along two dimensions: one dimension covers regression algorithms, the other covers incremental learning of regression algorithms.

In regression domains, [14] presented the system M5. It builds multivariate trees using linear models at the leaves; in the pruning phase a linear model is built for each leaf. Later, [5] presented M5′, a rational reconstruction of Quinlan's M5 algorithm. M5′ first constructs a regression tree by recursively splitting the instance space using tests on single attributes that maximally reduce variance in the target variable. After the tree has been grown, a linear multiple regression model is built for every inner node, using the data associated with that node and all the attributes that participate in tests in the subtree rooted at that node. The linear regression models are then simplified by dropping attributes if this results in a lower expected error on future data (more specifically, if the decrease in the number of parameters outweighs the increase in the observed training error). After this has been done, every subtree is considered for pruning. Pruning occurs if the estimated error for the linear model at the root of a subtree is smaller than or equal to the expected error for the subtree. After pruning terminates, M5′ applies a smoothing process that combines the model at a leaf with the models on the path to the root, forming the final model that is placed at the leaf.

Cubist [15] is a rule-based model that is an extension of Quinlan's M5 model tree. A tree is grown whose terminal leaves contain linear regression models. These models are based on the predictors used in previous splits, and there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but it is smoothed by taking into account the prediction of the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is then reduced to a set of rules, which initially are the paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.

2.1 Streaming Regression Algorithms

Many methods can be found in the literature for solving classification tasks on streams, but only a few exist for regression tasks. To the best of our knowledge, there are only two papers on online learning of regression and model trees. In the algorithm of [13] for incremental learning of linear model trees, the splitting decision is formulated as hypothesis testing: the split least likely to occur under the null hypothesis of non-splitting is considered the best one. The linear models are computed using the RLS (Recursive Least Squares) algorithm, whose complexity is quadratic in the dimensionality of the problem. This complexity is further multiplied by a user-defined number of possible splits per numerical attribute, for each of which a separate pair of linear models is updated with each training example and evaluated.
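To make this complexity argument concrete, the following is a minimal sketch of one RLS step for a single linear model. It is our own illustration, not code from [13]; the function name and the forgetting factor are assumptions. The d × d matrix P is what makes each update quadratic in the dimensionality d.

```python
import numpy as np

def rls_update(w, P, x, y, lam=0.99):
    """One Recursive Least Squares step for a linear model y ~ w.x.
    w: weights (d,); P: inverse-covariance estimate (d, d);
    x: input (d,); y: target; lam: forgetting factor (assumed).
    The matrix-vector products make each step O(d^2)."""
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    w = w + k * (y - w @ x)          # correct weights by the a-priori error
    P = (P - np.outer(k, Px)) / lam  # update the inverse covariance
    return w, P
```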
The Fast Incremental Model Tree (FIMT) proposed in [10] is an incremental algorithm for learning any-time model trees from evolving data streams with drift detection. It is based on the Hoeffding tree algorithm, but implements a different splitting criterion, using a standard deviation reduction (SDR) based measure more appropriate for regression problems. The FIMT algorithm incrementally induces model trees by processing each example only once, in the order of arrival. Splitting decisions are made using only a small sample of the data stream observed at each node, following the idea of Hoeffding trees. Another data streaming issue addressed in [10] is the problem of concept drift. Data streaming models capable of dealing with concept drift face two main challenges: how to detect that concept drift has occurred and how to adapt to the change. Change detection in FIMT is carried out using the Page-Hinkley change detection test [11]. Adaptation in FIMT involves growing an alternate subtree from the node in which change was detected.

IBLStreams (Instance Based Learner on Streams) [1] is an extension of MOA that consists of an instance-based learning algorithm for classification and regression problems on data streams. IBLStreams optimizes the composition and size of the case base autonomously. On arrival of a new example (x0, y0), this example is first added to the case base. Moreover, it is checked whether other examples might be removed, either because they have become redundant or because they are outliers. To this end, a set C of examples within a neighborhood of x0 is considered as candidates for removal. This neighborhood is given by the kc nearest neighbors of x0, determined according to a distance measure ∆, and the candidate set C consists of the examples within that neighborhood. The most recent examples are excluded from removal, due to the difficulty of distinguishing potentially noisy data from the beginning of a concept change: unexpected observations should be removed only in the former case, not in the latter.

3 The AMRules Algorithm

The problem of learning model rules from data streams raises several issues. First, the dataset is no longer finite and available prior to learning: it is impossible to store all data in memory and learn from them as a whole. Second, multiple sequential scans over the training data are not allowed. An algorithm must therefore collect the relevant information at the speed it arrives and incrementally make its splitting decisions. Third, the training dataset may consist of data from different distributions. In this section we present an incremental algorithm for learning model rules that addresses these issues, named Adaptive Model Rules from High-Speed Data Streams (AMRules). The pseudo-code of the algorithm is given in Algorithm 1.

Algorithm 1: AMRules Algorithm
  Input:
    S: stream of examples
    ordered-set: boolean flag
    Nmin: minimum number of examples
    λ: threshold of the Page-Hinkley test
    α: the magnitude of changes that are allowed
  Result: RS, a set of decision rules
  begin
    Let RS ← {}
    Let defaultRule: {} → (L ← NULL)
    foreach example (xi, yi) do
      foreach rule r ∈ RS do
        if r covers the example then
          Let ŷi be the prediction of rule r, computed using Lr
          Compute error = (ŷi − yi)²
          Call PHTest(error, α, λ)
          if change is detected then
            Remove the rule from RS
          else
            Update sufficient statistics of r
            Update perceptron of r
            if number of examples in Lr ≥ Nmin then
              r ← ExpandRule(r)
          if ordered-set then BREAK
      if none of the rules in RS covers the example then
        Update sufficient statistics of the default rule
        Update perceptron of the default rule
        if number of examples in L ≥ Nmin then
          RS ← RS ∪ ExpandRule(defaultRule)

The algorithm begins with an empty rule set (RS) and a default rule {} → L, where L is initialized to NULL. L is a data structure used to store the sufficient statistics required to expand a rule and to make predictions. Every time a new training example is available, the algorithm checks, for each rule in the rule set, whether the example is covered by that rule, that is, whether all the literals of the rule are true for the example. The target values of the examples covered by a rule are used to update the sufficient statistics of the rule (L). To detect changes we use the Page-Hinkley (PH) change detection test. If a change is detected, the rule is removed from the rule set; otherwise, the rule might be expanded. The expansion of a rule is considered only after a certain minimum number of examples (Nmin) and is explained in Algorithm 2.

The set of rules is learned in parallel, as described in Algorithm 1. We consider two cases: learning an ordered or an unordered set of rules. In the former case, every example updates the statistics of the first rule that covers it. In the latter case, every example updates the statistics of all the rules that cover it. If an example is not covered by any rule, the default rule is updated.
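As an illustration, the following Python sketch mirrors the control flow of Algorithm 1. It is a simplification under our own assumptions: the Rule class keeps only a running mean as predictor and sufficient statistic, expand_rule is a placeholder for Algorithm 2, and PageHinkley refers to the detector sketched in Section 3.3; together with that class, the snippet runs as written.

```python
class Rule:
    """A rule: a conjunction of literals, simplified sufficient statistics
    (here just a running mean, used as predictor), and a change detector."""
    def __init__(self, literals):
        self.literals = literals              # [(attr_index, op, value), ...]
        self.n, self.mean = 0, 0.0
        self.ph = PageHinkley()               # sketched in Section 3.3

    def covers(self, x):
        return all(x[i] <= v if op == "<=" else x[i] > v
                   for (i, op, v) in self.literals)

    def predict(self, x):
        return self.mean                      # prediction strategy (i)

    def update(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n

def expand_rule(rule):
    """Placeholder for Algorithm 2: pick the best literal via SDR and the
    Hoeffding bound (Section 3.1), then release the sufficient statistics."""
    rule.n, rule.mean = 0, 0.0
    return rule

def amrules(stream, ordered_set, n_min=200):
    rule_set, default_rule = [], Rule([])
    for x, y in stream:
        covered = False
        for rule in list(rule_set):
            if not rule.covers(x):
                continue
            covered = True
            error = (rule.predict(x) - y) ** 2
            if rule.ph.detected_change(error):
                rule_set.remove(rule)         # prune the rule on change
                continue
            rule.update(x, y)
            if rule.n >= n_min:
                expand_rule(rule)
            if ordered_set:
                break                         # ordered rules: first match only
        if not covered:
            default_rule.update(x, y)
            if default_rule.n >= n_min:
                rule_set.append(expand_rule(default_rule))
                default_rule = Rule([])       # we assume the default rule restarts
    return rule_set
```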
3.1 Expansion of a Rule

Before discussing how rules are expanded, we first present the evaluation measure used in the attribute selection process. [10] describe a standard deviation reduction (SDR) measure for determining the merit of a given split, which can be computed efficiently in an incremental way. Given a leaf where a sample S of the dataset of size N has been observed, a hypothetical binary split hA over attribute A would divide the examples in S into two disjoint subsets SL and SR, of sizes NL and NR respectively. The SDR measure of the split hA is given by:

$$SDR(h_A) = sd(S) - \frac{N_L}{N}\, sd(S_L) - \frac{N_R}{N}\, sd(S_R)$$

$$sd(S) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2} = \sqrt{\frac{1}{N}\left(\sum_{i=1}^{N} y_i^2 - \frac{1}{N}\Big(\sum_{i=1}^{N} y_i\Big)^2\right)}$$

To make the actual decision regarding a split, the SDR values measured for the two best potential splits are compared, dividing the second-best value by the best one to generate a ratio $\bar{r}$ in the range 0 to 1. Having a predefined range R for the values of the random variables, the Hoeffding probability bound $\epsilon$ [17] can be used to obtain high-confidence intervals for the true average of a sequence of random variables. The value of $\epsilon$ is calculated using the formula:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

The process of expanding a rule by adding a new condition works as follows. For each attribute Xi, the value of the SDR is computed for each attribute value vj. If the upper bound ($\bar{r}^{+} = \bar{r} + \epsilon$) of the sample average ratio is below 1, then the true mean ratio is also below 1; therefore, with confidence $1-\delta$, the best attribute observed over a portion of the data is really the best attribute. In this case, the rule is expanded with the condition Xa ≤ vj or Xa > vj. However, two splits are often extremely similar, or even identical, in terms of their SDR values, and despite the $\epsilon$ intervals shrinking considerably as more examples are seen, it may still be impossible to choose one split over the other. In these cases a tie-breaking threshold (τ) is used: if $\epsilon$ falls below this threshold and the splitting criterion is still not met, the split is made on the candidate with the higher SDR value and the rule is expanded.

Algorithm 2: ExpandRule: Expanding one Rule
  Input:
    r: one rule
    τ: constant to solve ties
    δ: confidence level
  Result: r′: expanded rule
  begin
    Let Xa be the attribute with the greatest SDR
    Let Xb be the attribute with the second greatest SDR
    Compute ε = sqrt(R² ln(1/δ) / (2n)) (Hoeffding bound)
    Compute r = SDR(Xb) / SDR(Xa) (ratio of the SDR values of the two best splits)
    Compute UpperBound = r + ε
    if UpperBound < 1 ∨ ε < τ then
      Extend r with a new condition based on the best attribute, Xa ≤ vj or Xa > vj
      Release the sufficient statistics of Lr
      r ← r ∪ {Xa ≤ vj or Xa > vj}
    return r
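A compact sketch of the two computations above, under our own assumptions: the statistics are kept as the triple (n, Σy, Σy²), and R is taken as 1 because the SDR ratio lies in [0, 1]; the function names are ours.

```python
import math

def sd_from_stats(n, sum_y, sum_y2):
    """Standard deviation from the sufficient statistics (n, sum y, sum y^2)."""
    return math.sqrt(max(sum_y2 / n - (sum_y / n) ** 2, 0.0))

def sdr(n, sum_y, sum_y2, n_l, sum_y_l, sum_y2_l):
    """Standard deviation reduction of a binary split; the right-hand side
    statistics are obtained by subtraction, so a single pass suffices."""
    n_r, sum_y_r, sum_y2_r = n - n_l, sum_y - sum_y_l, sum_y2 - sum_y2_l
    return (sd_from_stats(n, sum_y, sum_y2)
            - (n_l / n) * sd_from_stats(n_l, sum_y_l, sum_y2_l)
            - (n_r / n) * sd_from_stats(n_r, sum_y_r, sum_y2_r))

def should_expand(sdr_best, sdr_second, n, delta=0.01, tau=0.05, R=1.0):
    """Expansion test of Algorithm 2: expand when ratio + eps < 1, or when
    eps < tau (a tie between near-identical splits)."""
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    return (sdr_second / sdr_best) + eps < 1.0 or eps < tau
```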
3.2 Prediction Strategies

The set of rules learned by AMRules can be ordered or unordered, and the two variants employ different prediction strategies. In the former case, only the first rule that covers an example is used to predict its target value. In the latter, all rules covering the example are used for prediction, and the final prediction is obtained by a weighted vote.

Each rule in AMRules implements three prediction strategies: i) the mean of the target attribute computed from the examples covered by the rule; ii) a linear combination of the independent attributes; iii) an adaptive strategy that chooses between the first two the one with the lower MSE on the previous examples.

Each rule in AMRules contains a linear model, trained using an incremental gradient descent method on the examples covered by the rule. Initially, the weights are set to small random numbers in the range −1 to 1. When a new example arrives, the output is computed using the current weights, and each weight is then updated using the Delta rule: $w_i \leftarrow w_i + \eta(y - \hat{y})x_i$, where $\hat{y}$ is the output, $y$ the real value, and $\eta$ the learning rate.

3.3 Change Detection

AMRules uses the Page-Hinkley (PH) test [12] to monitor the error, and signals a drift when a significant increase of this variable is observed. The PH test is a sequential analysis technique typically used for change detection in signal processing; it is designed to detect a change in the average of a Gaussian signal [11]. The test considers a cumulative variable $m_T$, defined as the accumulated difference between the observed error and the mean of the error up to the current moment:

$$m_T = \sum_{t=1}^{T}(e_t - \bar{e}_T - \alpha), \qquad \bar{e}_T = \frac{1}{T}\sum_{t=1}^{T} e_t$$

where α corresponds to the magnitude of changes that are allowed. The minimum value of this variable is also computed: $M_T = \min(m_t,\; t = 1 \ldots T)$. As a final step, the test monitors the difference between $M_T$ and $m_T$: $PH_T = m_T - M_T$. When this difference is greater than a given threshold (λ), we signal a change in the distribution. The threshold λ depends on the admissible false alarm rate: increasing λ entails fewer false alarms, but may miss or delay the detection of true changes.
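A minimal sketch of the PH detector, maintaining the running mean of the errors incrementally; this is the PageHinkley class assumed by the Algorithm 1 sketch in Section 3, and the default parameter values follow the experimental setup in Section 4.1.

```python
class PageHinkley:
    """Page-Hinkley test: signal a change when m_T - M_T > lambda."""
    def __init__(self, alpha=0.005, lam=50.0):
        self.alpha, self.lam = alpha, lam
        self.n, self.mean = 0, 0.0
        self.m_t, self.min_m = 0.0, float("inf")

    def detected_change(self, error):
        self.n += 1
        self.mean += (error - self.mean) / self.n   # running error mean e_bar
        self.m_t += error - self.mean - self.alpha  # accumulate the deviations
        self.min_m = min(self.min_m, self.m_t)      # M_T
        return self.m_t - self.min_m > self.lam     # PH_T > lambda?
```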
4 Experimental Evaluation

The main goal of this experimental evaluation is to study the behavior of the proposed algorithm in terms of mean absolute error (MAE) and root mean squared error (RMSE). We are interested in studying the following scenarios:

– How to grow the rule set?
  • Update only the first rule that covers a training example. In this case the rule set is ordered, and the corresponding prediction strategy uses only the first rule that covers a test example.
  • Update all the rules that cover a training example. In this case the rule set is unordered, and the corresponding prediction strategy uses a weighted sum of all rules that cover a test example.
– How does AMRules compare against other streaming algorithms?
– How does AMRules compare against other state-of-the-art regression algorithms?
– How do the models learned by AMRules evolve in time-changing streams?

4.1 Experimental Setup

All our algorithms were implemented in Java using the Massive Online Analysis (MOA) data stream software suite [2]. For all the experiments, we set the input parameters of AMRules to Nmin = 200, τ = 0.05 and δ = 0.01. The parameters of the Page-Hinkley test are λ = 50 and α = 0.005. Table 1 summarizes information about the datasets used and reports the learning rate used in the perceptron learning.

Datasets

The experimental datasets include both artificial and real data, with continuous attributes. We use ten regression datasets from the UCI Machine Learning Repository [3] and other sources:

2dplanes: an artificial dataset described in [4].
Ailerons: addresses a control problem, namely flying an F16 aircraft.
Puma8NH and Puma32H: a family of datasets synthetically generated from a realistic simulation of the dynamics of a Unimation Puma 560 robot arm.
Pol: a commercial application described in [18]; the data describes a telecommunication problem.
Elevators: also obtained from the task of controlling an F16 aircraft.
Fried: an artificial dataset used in Friedman (1991) and also described in Breiman (1996, p. 139).
Bank8FM: a family of datasets synthetically generated from a simulation of how bank customers choose their banks.
Kin8nm: concerned with the forward kinematics of an 8-link robot arm.
Airline: uses the data from the Data Expo competition (2009). The dataset consists of a large number of records, containing flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset with nearly 120 million records (11.5 GB memory size) [10].

Table 1 summarizes the number of instances and the number of attributes of each dataset.

Table 1. Summary of datasets

Datasets    # Instances   # Attributes   Learning rate
2dplanes    40768         11             0.01
Ailerons    13750         41             0.01
Puma8NH     8192          9              0.01
Puma32H     8192          32             0.01
Pol         15000         49             0.001
Elevators   8752          19             0.001
Fried       40769         11             0.01
Bank8FM     8192          9              0.01
Kin8nm      8192          9              0.01
Airline     115 Million   11             0.01

All of the results in Tables 2, 3 and 4 are averages of ten-fold cross-validation [16]. Accuracy is measured using the following metrics: mean absolute error (MAE) and root mean squared error (RMSE) [19]. We used two evaluation methods. When no concept drift is assumed, we employ the traditional train-and-test scenario: all algorithms learn from the same training set and the error is estimated on the same test set. In scenarios with concept drift, we use the prequential (predictive sequential) error estimate [8]. This method evaluates a model sequentially: when an example becomes available, the current regression model first makes a prediction and the loss is computed; only after the prediction is the model updated with that example.
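A sketch of this prequential (test-then-train) loop, shown here for MAE; the model interface (predict/learn) is a hypothetical one, standing in for any of the incremental regressors compared below.

```python
def prequential_mae(model, stream):
    """Prequential evaluation: predict on each example before learning
    from it, and accumulate the absolute error."""
    total_abs_err, n = 0.0, 0
    for x, y in stream:
        total_abs_err += abs(model.predict(x) - y)  # test first ...
        model.learn(x, y)                           # ... then train
        n += 1
    return total_abs_err / n
```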
4.2 Experimental Results

In this section we empirically evaluate AMRules. The results are described in four parts: in the first part we compare the AMRules variants; in the second part we compare AMRules against other streaming algorithms; in the third part we compare AMRules against other state-of-the-art regression algorithms; the last part analyzes the behavior of AMRules in the context of time-evolving data streams.

Comparison between AMRules Variants

In this section we focus on two strategies that we found potentially interesting. The first combines expanding only one rule, the rule that triggered first, with a prediction strategy that uses only the first rule that covers a test example; obviously, this approach requires ordered rules (AMRules_o). The second setting employs an unordered rule set, where all covering rules are expanded and the corresponding prediction strategy uses a weighted sum of all rules that cover a test example (AMRules_u).

Ordered rule sets specialize one rule at a time, and as a result they often produce fewer rules than the unordered strategy. However, each ordered rule must be interpreted together with the previous rules and the remaining combinations, which might not be easy in more complex sets. Unordered rule sets are more modular, because each rule can be interpreted alone.

Table 2 summarizes the mean absolute error and the root mean squared error of these variants. Overall, the experimental results point out that unordered rule sets are more competitive than ordered rule sets in terms of MAE and RMSE.

Table 2. Results of ten-fold cross-validation for the AMRules algorithms

            Mean absolute error (variance)        Root mean squared error (variance)
Datasets    AMRules_o        AMRules_u            AMRules_o         AMRules_u
2dplanes    1.23E+00 (0.01)  1.16E+00 (0.01)      1.67E+00 (0.02)   1.52E+00 (0.01)
Ailerons    1.10E-04 (0.00)  1.00E-04 (0.00)      1.90E-04 (0.00)   1.70E-04 (0.00)
Puma8NH     3.21E+00 (0.04)  3.26E+00 (0.02)      4.14E+00 (0.05)   4.28E+00 (0.03)
Puma32H     1.10E-02 (0.00)  1.20E-02 (0.00)      1.60E-02 (0.00)   1.20E-02 (0.00)
Pol         14.0E+00 (25.1)  15.6E+00 (3.70)      23.0E+00 (44.50)  23.3E+00 (4.08)
Elevators   3.50E-03 (0.00)  1.90E-03 (0.00)      4.80E-03 (0.00)   2.20E-03 (0.00)
Fried       2.08E+00 (0.01)  1.13E+00 (0.01)      2.78E+00 (0.08)   1.67E+00 (0.25)
Bank8FM     4.31E-02 (0.00)  4.30E-02 (0.00)      4.80E-02 (0.00)   4.30E-02 (0.00)
Kin8nm      1.60E-01 (0.00)  1.50E-01 (0.00)      2.10E-01 (0.00)   2.00E-01 (0.00)
Comparison with other Streaming Algorithms

We compare the performance of our algorithm with two other streaming algorithms: FIMT, an incremental algorithm for learning model trees [10], and IBLStreams, an extension of MOA that consists of an instance-based learning algorithm for classification and regression problems on data streams [1]. The performance measures for these algorithms are given in Table 3. The comparison of these streaming algorithms shows that AMRules obtains the best overall results.

Table 3. Results of ten-fold cross-validation for the streaming algorithms

            Mean absolute error (variance)
Datasets    AMRules_u        FIMT             IBLStreams
2dplanes    1.16E+00 (0.01)  8.00E-01 (0.00)  1.03E+00 (0.00)
Ailerons    1.00E-04 (0.00)  1.90E-04 (0.00)  3.20E-04 (0.00)
Puma8NH     2.66E+00 (0.01)  3.26E+00 (0.03)  3.27E+00 (0.01)
Puma32H     1.20E-02 (0.00)  7.90E-03 (0.00)  2.20E-02 (0.00)
Pol         15.6E+00 (3.70)  38.2E+00 (0.17)  29.7E+00 (0.55)
Elevators   1.90E-03 (0.00)  3.50E-03 (0.00)  5.00E-03 (0.00)
Fried       1.13E+00 (0.01)  1.72E+00 (0.00)  2.10E+00 (0.00)
Bank8FM     4.30E-02 (0.00)  3.30E-02 (0.00)  7.70E-02 (0.00)
Kin8nm      1.60E-01 (0.00)  1.60E-01 (0.00)  9.50E-01 (0.00)

            Root mean squared error (variance)
Datasets    AMRules_u        FIMT             IBLStreams
2dplanes    1.52E+00 (0.01)  1.00E+00 (0.00)  1.30E+00 (0.00)
Ailerons    1.70E-04 (0.00)  1.00E-09 (0.00)  3.00E-04 (0.00)
Puma8NH     4.28E+00 (0.03)  12.0E+00 (0.63)  3.84E+00 (0.02)
Puma32H     1.00E-04 (0.01)  1.20E-02 (0.00)  2.70E-02 (0.00)
Pol         23.3E+00 (4.08)  1.75E+03 (1383)  50.7E+00 (0.71)
Elevators   2.20E-03 (0.00)  3.00E-05 (0.00)  6.20E-03 (0.00)
Fried       1.67E+00 (0.25)  4.79E+00 (0.01)  2.21E+00 (0.00)
Bank8FM     4.30E-02 (0.00)  2.20E-03 (0.00)  9.60E-02 (0.00)
Kin8nm      2.00E-01 (0.00)  2.10E-01 (0.00)  1.20E-01 (0.00)

Comparison with State-of-the-art Regression Algorithms

Another experiment involving adaptive model rules is shown in Table 4. We compare AMRules with non-incremental regression algorithms available in WEKA [9]; all these experiments are performed using WEKA, with the standard method of ten-fold cross-validation and the same folds for all the algorithms included.

The comparison shows that AMRules is very competitive in terms of MAE and RMSE with all the other methods except M5Rules, and AMRules is faster than all the other algorithms considered in this study. These results were somewhat expected, since these datasets are relatively small for an incremental algorithm.

Table 4. Results of ten-fold cross-validation for AMRules_u and other regression algorithms

            Mean absolute error (variance)
Datasets    AMRules_u        M5Rules          MLPerceptron     LinRegression
2dplanes    1.16E+00 (0.01)  8.00E-01 (0.01)  8.70E-01 (0.01)  1.91E+00 (0.00)
Ailerons    1.00E-04 (0.00)  1.00E-04 (0.00)  1.40E-04 (0.00)  1.10E-04 (0.00)
Puma8NH     3.26E+00 (0.03)  2.46E+00 (0.00)  3.34E+00 (0.17)  3.64E+00 (0.01)
Puma32H     1.20E-02 (0.00)  6.80E-03 (0.00)  2.30E-02 (0.00)  2.00E-02 (0.00)
Pol         15.6E+00 (3.70)  2.79E+00 (0.05)  14.7E+00 (5.53)  26.5E+00 (0.21)
Elevators   1.90E-03 (0.00)  1.70E-03 (0.00)  2.10E-03 (0.00)  2.00E-03 (0.00)
Fried       1.13E+00 (0.01)  1.25E+00 (0.00)  1.35E+00 (0.03)  2.03E+00 (0.00)
Bank8FM     4.30E-02 (0.00)  2.20E-02 (0.00)  2.60E-02 (0.00)  2.90E-02 (0.00)
Kin8nm      1.60E-01 (0.00)  1.30E-01 (0.00)  1.30E-01 (0.00)  1.60E-01 (0.00)

            Root mean squared error (variance)
Datasets    AMRules_u        M5Rules          MLPerceptron     LinRegression
2dplanes    1.52E+00 (0.01)  9.80E-01 (0.01)  1.09E+00 (0.01)  2.37E+00 (0.00)
Ailerons    1.70E-04 (0.00)  2.00E-04 (0.00)  1.71E-04 (0.00)  2.00E-04 (0.00)
Puma8NH     4.28E+00 (0.03)  3.19E+00 (0.01)  4.14E+00 (0.20)  4.45E+00 (0.01)
Puma32H     1.20E-02 (0.00)  8.60E-03 (0.00)  3.10E-02 (0.00)  2.60E-02 (0.00)
Pol         23.3E+00 (4.08)  6.56E+00 (0.45)  20.1E+00 (15.1)  30.5E+00 (0.16)
Elevators   2.20E-03 (0.00)  2.23E-03 (0.00)  2.23E-03 (0.00)  2.29E-03 (0.00)
Fried       1.67E+00 (0.25)  1.60E+00 (0.00)  1.69E+00 (0.04)  2.62E+00 (0.00)
Bank8FM     4.30E-02 (0.00)  3.10E-02 (0.00)  3.40E-02 (0.00)  3.80E-02 (0.00)
Kin8nm      2.00E-01 (0.00)  1.70E-01 (0.00)  1.60E-01 (0.00)  2.00E-01 (0.00)
Evaluation in Time-Evolving Data Streams

In this subsection we first study the evolution of the error measures (MAE and RMSE) and evaluate the change detection method; afterwards, we evaluate the streaming algorithms on a non-stationary streaming real-world problem, the Airline dataset from the DataExpo09 competition.

To simulate drift we use the Fried dataset. The simulations allow us to control the relevant parameters and to evaluate the drift detection. Figures 1 and 2 depict the MAE and RMSE curves of the streaming algorithms on the Fried dataset. These figures also illustrate the point of drift and the points where the change was detected. Only two of the algorithms, FIMT and AMRules, were able to detect a change. Table 5 reports the average results over ten experiments varying the seed of the Fried dataset. We measure the number of leaves for FIMT, the number of rules for AMRules, and the delay (in number of examples) in detecting the drift. The delay gives an indication of how fast the algorithm will be able to start the adaptation strategy. These two algorithms obtained similar results. The general conclusion is that the FIMT and AMRules algorithms are robust and achieve better results than IBLStreams. Figures 3 and 4 show the evolution of the MAE and RMSE of the streaming algorithms on the non-stationary real-world problem. FIMT and AMRules show approximately similar behavior in terms of MAE and RMSE; both exhibit somewhat better performance than IBLStreams, although not significantly different.

Table 5. Average results of the evaluation of change detection over ten experiments

Algorithms   Delay   Size
AMRules      1484    56 (nr. rules)
FIMT         2096    290 (nr. leaves)
IBLStreams   -       -

Fig. 1. Mean absolute error of the streaming algorithms on the Fried dataset.
Fig. 2. Root mean squared error of the streaming algorithms on the Fried dataset.
Fig. 3. Mean absolute error of the streaming algorithms on the Airlines dataset.
Fig. 4. Root mean squared error of the streaming algorithms on the Airlines dataset.

5 Conclusions

Learning regression rules from data streams is an interesting approach that has not been explored by the stream mining community. In this paper, we presented a new regression rule approach for streaming data with change detection. The AMRules algorithm is able to learn very fast, and the only memory it requires is for storing the sufficient statistics of the rules. To the best of our knowledge, there is no other method in the literature that addresses this issue.

AMRules learns ordered and unordered rule sets. The experimental results point out that unordered rule sets, in comparison to ordered rule sets, are more competitive in terms of error metrics (MAE and RMSE). AMRules achieves better results than the other algorithms even for medium-sized datasets. The AMRules algorithm is equipped with explicit change detection mechanisms that signal change points during the learning process; this information is relevant to understanding the dynamics of evolving streams.

Acknowledgments: The authors acknowledge the financial support given by the projects FCT-KDUS (PTDC/EIA/098355/2008) and FCOMP-01-0124-FEDER-010053, the ERDF through the COMPETE Programme, and Portuguese National Funds through FCT within the project FCOMP-01-0124-FEDER-022701.
References

1. A. Shaker and E. Hüllermeier. IBLStreams: a system for instance-based classification and regression on data streams. Evolving Systems, 3:235–249, 2012.
2. A. Bifet, G. Holmes, B. Pfahringer, P. Kranen, H. Kremer, T. Jansen, and T. Seidl. MOA: Massive Online Analysis. Journal of Machine Learning Research (JMLR), 11:1601–1604, 2010.
3. K. E. Blake and C. Merz. UCI repository of machine learning databases, 1999.
4. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
5. E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten. Using model trees for classification. Machine Learning, 32(1):63–76, 1998.
6. J. Fürnkranz, D. Gamberger, and N. Lavrač. Foundations of Rule Learning. Springer, 2012.
7. J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC Press, 2010.
8. J. Gama, R. Sebastião, and P. P. Rodrigues. Issues in evaluation of stream learning algorithms. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 329–338, New York, NY, USA, 2009. ACM.
9. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11:10–18, 2009.
10. E. Ikonomovska, J. Gama, and S. Džeroski. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, 23(1):128–168, 2011.
11. H. Mouss, D. Mouss, N. Mouss, and L. Sefouhi. Test of Page-Hinkley, an approach for fault detection in an agro-alimentary production system. In Proceedings of the Asian Control Conference, volume 2, pages 815–818, 2004.
12. E. S. Page. Continuous inspection schemes. Biometrika, 41(1):100–115, 1954.
13. D. Potts and C. Sammut. Incremental learning of linear model trees. Machine Learning, 61(1-3):5–48, 2005.
14. J. R. Quinlan. Learning with continuous classes. In Australian Joint Conference on Artificial Intelligence, pages 343–348. World Scientific, 1992.
15. J. R. Quinlan. Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 236–243. Morgan Kaufmann, 1993.
16. R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1137–1143, 1995.
17. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
18. S. M. Weiss and N. Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 3:383–403, 1995.
19. C. J. Willmott and K. Matsuura. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30:79–82, 2005.