Sequential Model Building in Symbolic Regression — CEUR-WS Vol-2473, paper 5. Jan Žegklitz, Petr Pošík. PDF: https://ceur-ws.org/Vol-2473/paper5.pdf — DBLP: https://dblp.org/rec/conf/itat/ZegklitzP19
                          Sequential Model Building in Symbolic Regression

                                                       Jan Žegklitz1,2 and Petr Pošík1
                                1   Czech Technical University in Prague, Faculty of Electrical Engineering,
                                           Technická 2, 166 27 Praha 6, Prague, Czech Republic
                                       2 Czech Academy of Sciences, Institute of Computer Science,

                                    Pod Vodárenskou věží 271/2, 182 07 Praha 8, Prague, Czech Republic

Abstract: Symbolic Regression is a supervised learning technique for regression based on Genetic Programming. A popular algorithm is Multi-Gene Genetic Programming, which builds models as a linear combination of a number of components that are all built together. In recent years, however, a different approach has emerged, represented by the Sequential Symbolic Regression algorithm, which builds the model sequentially, one component at a time, with the components combined using a method based on geometric semantic crossover. In this article we show that the SSR algorithm effectively produces a linear combination of components, and we introduce another sequential approach very similar to the classical ensemble method of boosting. All algorithms are compared with MGGP as a baseline on a number of real-world datasets. The results show that the sequential approaches are overall worse than MGGP, both in terms of accuracy and model size.

1    Introduction

Symbolic Regression (SR) is a supervised learning task whose goal is to find a mathematical function (preferably a simple one) of a number of variables that fits the available training data. Regression is a well-known task with a number of well-tested and successful approaches, e.g. neural networks, support vector machines and random forests, to name a few. However, these conventional approaches produce models which, while useful, are often essentially black boxes that are difficult to interpret. One of the goals of SR is to produce "white-box" models in the form of a mathematical expression. The overwhelming majority of SR approaches utilize some form of Genetic Programming (GP) [1].

This article is structured as follows: Section 2 introduces previous research relevant to this article and provides a new view on the main algorithm examined here; in Section 3 we propose a simple boosting-based sequential SR algorithm; Sections 4 and 5 present the experiments and results, respectively, and provide a discussion of these results. Finally, Section 6 concludes the article and proposes further research.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2    Related Work

In this section we briefly revisit important approaches and algorithms related to sequential model building and to the experiments performed later in this article.

First we introduce Multi-Gene Genetic Programming, which is not a sequential algorithm but is important for this article. Then we briefly revisit boosting, a classical machine learning ensemble method. Finally we introduce the most important related research, the Sequential Symbolic Regression algorithm.

2.1    Multi-Gene Genetic Programming

Multi-Gene GP (MGGP) [2, 3, 4] is an extension of classical GP for symbolic regression. The main idea behind MGGP is that each individual is composed of one or more independent trees, called genes, which together form the complete model. To form the complete model, the outputs of the genes are linearly combined, with the coefficients determined using linear regression. In this article, we call this linear combination of genes a "top-level linear combination".

MGGP does not build the model sequentially. However, since we use the concept of a top-level linear combination to put SSR (see the next section) in a similar framework, and since we use the algorithm in the experiments, it needs to be mentioned.

2.2    Boosting

Building predictive models sequentially is a well-known machine learning technique. The prominent example of this approach is the boosting ensemble method [5, 6, 7]. Boosting in the context of least-squares regression (i.e. regression using the squared error as the criterion) works, in simplified form, as follows:

  1. Start with a constant model f0(x) = c, e.g. using the mean target value: f0(x) = ȳ.

  2. Iterate for k = 1, 2, ..., M, where M is the desired number of iterations:

     (a) Compute the residuals ỹ = y − fk−1(x).

     (b) Train a model component h(x) using ỹ as the target values.
     (c) Update the complete model: fk(x) = fk−1(x) + h(x).

  3. fM(x) is the final model.

One possible view is that the later model components correct the mistakes of the previous ones.

2.3    Sequential Symbolic Regression

SSR [8, 9] is a GP-based method that builds the model in a sequential fashion. The overall structure of the SSR method is as follows:

  1. An SR algorithm (e.g. GP) is executed with the training set, producing a model component.

  2. The model component is added to the complete model and the training data is modified.

  3. Unless the stopping criteria are met, go to step 1 and repeat, using the modified training data.

This structure is somewhat similar to boosting – a model component is trained, added to the complete model, and the next one is trained using modified training data. However, SSR does not do boosting. The difference is in step 2, i.e. in how the new component is combined with the other components of the complete model and how the training data is modified.

For the first iteration, the target values are not modified and the new component simply becomes the complete model, as at that time it is the only component. In the subsequent iterations, SSR utilizes the Geometric Semantic Crossover (GSX) from Geometric Semantic Genetic Programming (GSGP) [10] to add the new component to the complete model. GSX performs a convex combination of two functions:

                 f*(x) = r f(x) + (1 − r) f′(x)                (1)

where r is a random number from the interval [0, 1), and f and f′ are the two combined functions.

In SSR, GSX is employed differently than for combining two known functions. Only one of the two functions is known (the component from the previous iteration, i.e. f). SSR therefore performs an "incomplete" GSX, meaning that f′ is only a placeholder for a function to be found in the next iteration using modified training data. The training data is derived by the following reasoning. The result of the GSX should optimally produce zero error, i.e. f*(x) = y. Substituting this into Equation 1 and slightly rearranging it, the following equation is obtained:

                 f′(x) = (y − r f(x)) / (1 − r).               (2)

This is, in effect, a requirement on what the outputs of f′ should look like, i.e. it is a definition of its target values for the next iteration. Therefore the problem is transformed and the next iteration with different training data is started. It is important to note that the convex combinations are "nested" in depth in the second term of the GSX expression. In the last iteration, the f′ placeholder is not replaced with another GSX result but with the function found in that iteration, terminating the nesting. The complete SSR procedure has the following form:

  1. y1 = y, k ← 1

  2. Run the base algorithm with yk as target values, producing a component fk.

  3. Sample a random number rk ∈ [0, 1).

  4. Update the complete model:

     • if k = 1 and it is not the last iteration: f* = rk fk + (1 − rk) f′

     • if k = 1 and it is the last iteration: f* = fk, terminate

     • if k > 1 and it is not the last iteration: f′ ← rk fk + (1 − rk) f′

     • if k > 1 and it is the last iteration: f′ ← fk, terminate

  5. Adjust the target values: yk+1 = (yk − rk fk(x)) / (1 − rk).

  6. k ← k + 1; repeat from step 2 unless a stopping condition is met.

  7. Return f*.

where f′ is the placeholder. The nesting happens in the third case of step 4, where the placeholder is replaced with the GSX of the new function and the placeholder again.

2.4    SSR as a Top-Level Linear Combination

In the previous subsection we briefly described the SSR procedure and how it constructs models via GSX. Now we rewrite the procedure in such a way that the result is a linear combination of expressions, as in MGGP and Boost-GP.

By iterating Equation 1 (with added subscripts to indicate the iteration number) and expanding the multiplication at each step, we can see what the model looks like if it is finished at that iteration, i.e. in the second and fourth case of step 4 of the SSR procedure from the previous subsection (for the sake of simplicity, let r̄k = 1 − rk):

    k = 1:   f1* = f1
    k = 2:   f2* = r1 f1 + r̄1 f2
    k = 3:   f3* = r1 f1 + r̄1 (r2 f2 + r̄2 f3)
                 = r1 f1 + r̄1 r2 f2 + r̄1 r̄2 f3                       (3)
    k = 4:   f4* = r1 f1 + r̄1 r2 f2 + r̄1 r̄2 (r3 f3 + r̄3 f4)
                 = r1 f1 + r̄1 r2 f2 + r̄1 r̄2 r3 f3 + r̄1 r̄2 r̄3 f4
    ...
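This expansion can be maintained incrementally in code. The following is a minimal sketch, not code from the paper: the function name and interface are illustrative, and components are assumed to be kept separately as plain expressions. Each new component rescales the last coefficient by r and appends a coefficient equal to the old value times (1 − r).

```python
def ssr_coefficients(ratios):
    """Coefficients of the top-level linear combination produced by
    SSR after combining len(ratios) + 1 components (Equation 3).
    `ratios` holds the GSX ratios r_1, ..., r_{k-1}; the name
    `ssr_coefficients` is illustrative, not from the paper."""
    coefs = [1.0]                      # k = 1: f*_1 = f_1
    for r in ratios:
        prev = coefs[-1]
        coefs[-1] = prev * r           # previous last component gets * r
        coefs.append(prev * (1 - r))   # new component gets * (1 - r)
    return coefs

# e.g. two iterations with ratios r1, r2 give
# [r1, (1 - r1) * r2, (1 - r1) * (1 - r2)], matching Equation 3
```

Since each update is a convex combination, the coefficients always sum to one, which also makes the diminishing weight of late components easy to see.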
From the above iteration we can see that the model is, in fact, a linear combination of several expressions. We can also see a pattern in the multiplicative coefficients: each time a component fk is added to the model, the coefficient of component fk−1 is multiplied by rk−1, and the coefficient of the new component fk is the previous coefficient of component fk−1 multiplied by r̄k−1. This allows for an efficient implementation using a list of expressions and a corresponding list of their coefficients: adding a component amounts to multiplying one number in the second list and appending an element to the end of both lists. This view also allows further manipulation of the coefficients and model components in a way similar to MGGP, e.g. performing an additional linear regression.

3    Boosting-GP

In the previous section we briefly introduced the Sequential Symbolic Regression algorithm (SSR), which utilizes Geometric Semantic Crossover to sequentially combine the components of the complete model and to derive modified training data for subsequent iterations. We propose an alternative approach that is almost identical to the boosting scheme (see Section 2.2); hence we call it Boost-GP.

The procedure matches that of boosting as already described, except for a few details. In Boost-GP, we do not start with a constant model; instead we start with no prior model at all, and the first component is found using GP as well. The procedure that finds the individual components is a GP-based algorithm. If the underlying algorithm produces a linear combination of several expressions (as MGGP does), those are treated as several components and are added to the model at the same time, summing possible constant offsets. Other than these details, the procedure is exactly the same as that of boosting.

4    Experiments

We perform a series of experiments with the aim of comparing the performance of a number of algorithms, both in terms of model accuracy and complexity. First we describe the algorithms used in the experiments and their settings, then we describe the datasets, and finally we describe the experimental protocol.

4.1    Algorithms

For the experimental evaluation we selected a number of algorithms and their variants based on SSR, boosting with GP, MGGP, and linear regression as a baseline.

LR  The first algorithm, which serves purely as a baseline, is an ordinary least-squares linear regression on the problem's features. Its purpose is to frame all the other algorithms in a well-established context.

MGGP  This algorithm represents the approach that produces models composed of multiple components which are all evolved simultaneously.

1G  The same as MGGP but with individuals limited to a single gene, making it effectively a Scaled Symbolic Regression algorithm [11] (not to be confused with Sequential Symbolic Regression, discussed in this article, which has the same abbreviation SSR). In order to provide similar complexity, the tree depth limit is increased to allow for at least the same number of nodes as in MGGP.

SSR-GP, SSR-1G  The Sequential Symbolic Regression algorithm (see Section 2.3) with ordinary GP and 1G (see above) as the base algorithm, respectively.

nSSR-GP, nSSR-1G  Similar to SSR-GP and SSR-1G but using the normalization and denormalization described in [9].

SSR with final fitting  All the SSR-based algorithms mentioned above (coded identically but suffixed with the word "fit", e.g. "SSR-GP fit") are also used in such a way that, after the algorithm completes, the coefficients of the individual components are recalculated using linear regression in the same manner as in MGGP.

Boost-GP, Boost-1G  The boosting-type sequential approach (see Section 3) with ordinary GP and 1G (see above) as the base algorithm, respectively.

4.2    Algorithm settings

The settings of the algorithms (except for LR, which has no settings) are as follows. The MGGP, SSR and Boost algorithms are set to produce a model of 10 components with a maximum depth of 10. That means that MGGP has the gene number limit set to 10, and the SSR and Boost algorithms are run for 10 iterations. The stopping criterion for MGGP is a wall-clock time limit of 1800 s; for the SSR and Boost algorithms it is 10 iterations, each with a wall-clock time limit of 180 s; therefore all algorithms have an equal amount of time available to find a good solution. The 1G algorithm produces only one component, and in order to have roughly the same number of nodes available, its maximum depth is set to 16. 1G is also run for 1800 s. Other settings common to the algorithms are described in Table 1.

4.3    Datasets

We use four real-world datasets freely available from the UCI repository [12]. The datasets are described in the following paragraphs. Summary information about these datasets is in Table 2.
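For concreteness, the residual-fitting loop that Boost-GP (Section 3) shares with classical boosting can be sketched as follows. This is an illustrative sketch only: `base_fit` stands in for one run of the base GP algorithm, and all names and signatures are assumptions, not code from the paper.

```python
import numpy as np

def boost_fit(X, y, base_fit, n_components=10):
    """Fit components sequentially on the residuals of the model
    built so far (Section 3).  Boost-GP starts with no prior model,
    so the first component is trained directly on the original
    targets.  `base_fit(X, targets)` must return a callable model;
    the interface is illustrative."""
    components = []
    residuals = np.asarray(y, dtype=float).copy()
    for _ in range(n_components):
        h = base_fit(X, residuals)    # stand-in for one GP run
        components.append(h)
        residuals = residuals - h(X)  # next component corrects the rest
    # the complete model is a plain sum of the components
    return lambda X: sum(h(X) for h in components)

# toy base learner: a constant predictor (mean of the current targets)
def mean_fit(X, targets):
    c = targets.mean()
    return lambda X, c=c: np.full(len(X), c)
```

With `mean_fit` as the base learner, the first component already predicts ȳ and all later residuals are zero, so the complete model reduces to the constant model from which classical boosting starts.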
    parameter                          value
    ------------------------------     --------------------------------
    population size                    800
    # of elites                        10 % of population
    tournament size                    4 % of population
    prob. of mutation                  0.2
    prob. of crossover                 0.7
    constant/subtree mutation ratio    0.3/0.7
    σ of constant mutation             10
    function set                       a + b, a − b, a · b,
                                       a/√(1 + b²), e^a, |a|, sin a, a²

           Table 1: Common algorithm settings.

                                       # of samples
    dataset    # of dimensions    training    testing
    -------    ---------------    --------    -------
    ASN              5              1052        451
    CCS              8               721        309
    ENC              8               537        231
    ENH              8               537        231

Table 2: Summary information on the datasets used in the experiments. The division into training and testing samples comes from a random 70:30 split of the original dataset.

ASN  Airfoil self-noise (ASN) is a 5-dimensional dataset describing acoustic pressure for various airfoils during wind tunnel tests.

CCS  Concrete compressive strength (CCS) is an 8-dimensional dataset describing the compressive strength of concrete based on its ingredients.

ENC, ENH  Energy efficiency of cooling/heating (ENC, ENH) are 8-dimensional datasets describing the energy efficiency of cooling and heating buildings.

4.4    Experimental protocol

Each dataset is randomly split into training/testing subsets 25 times. Each algorithm is run once for each split of each dataset, i.e. 25 runs per algorithm per dataset in total. Each of the 25 runs of an algorithm on a dataset uses a different seed for the random number generator used by the algorithm (except for LR, which is fully deterministic).

During the algorithm run, only the training set is used. After the run completes, the models produced by the algorithms are evaluated on the testing set. Performance is measured using the coefficient of determination, or R² score.¹ Model complexity is measured as the number of nodes in the model. The coefficients of the top-level linear combination that combines the components of the model are not counted towards this number. These two measures (performance and complexity) are then aggregated over the 25 runs of each algorithm for each dataset.

¹ R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

All runs were performed on machines of identical configuration that were part of the MetaCentrum grid (see Acknowledgements).

5    Results

In this section we present the results of the experiments. As described in Section 4.4, we focus on the performance (R²) and the complexity (number of nodes) of the models. Test-set performance along with complexity is depicted in Figure 1. Each result is displayed in the form of two crossing lines whose intersection is at the median R² and median number of nodes, and which stretch from the 1st to the 3rd quartile in both measures. SSR algorithms of the same configuration, except for final fitting, are plotted in the same color for easier visual assessment of the effect of final fitting.

5.1    Discussion

From Figure 1 it is clear that MGGP produces the smallest models for all four datasets. We hypothesize that this is caused by the fact that all the components are evolved together, so they can complement each other, while in the other algorithms (except for 1G) each component is forced to evolve on its own. When a component is being evolved, it doesn't "know" about the other components, so the evolution tries to solve the current problem with the single component alone, producing a big component in the process. When it is time for the next component, the previous one is fixed and cannot get smaller anymore.

In terms of test-set performance, MGGP, Boost-GP and Boost-1G are the best performers overall, with (n)SSR-1G fit being close. On the other hand, SSR-GP and nSSR-GP are, overall, the worst performers, performing even worse than LR in some cases. Even though the Boost algorithms are competitive in terms of performance, they are in line with the SSR algorithms regarding the number of nodes, which is much larger than that of MGGP or 1G.

The SSR algorithms, especially the variants without final fitting, perform rather poorly compared to the other algorithms except for 1G and LR. It can be seen that the final fitting improves the performance considerably. We hypothesize that this is caused by the fact that the original coefficients are chosen randomly and from a bounded interval. Even though the target values are modified in such a way that the result should be close to optimum if the new component models the new target values well,
[Figure 1: four complexity-performance panels — (a) ASN, (b) CCS, (c) ENC, (d) ENH — each plotting R² against the number of nodes for 1G, MGGP, Boost-GP, Boost-1G, SSR-GP, SSR-GP fit, SSR-1G, SSR-1G fit, nSSR-GP, nSSR-GP fit, nSSR-1G and nSSR-1G fit.]

Figure 1: Complexity-performance plot of the final models for all four datasets. The horizontal lines are vertically placed at the median R² value and stretch in the horizontal direction from the 1st to the 3rd quartile of the number of nodes. The vertical lines are horizontally placed at the median number of nodes and stretch in the vertical direction from the 1st to the 3rd quartile of R². The intersections show the point of the median number of nodes and median R². The closer the lines are to the upper left corner of the plot, the better (i.e. well performing and small) the models are. The dotted line shows the median R² of LR and the gray band stretches from the 1st to the 3rd quartile of R² of LR.


the contribution of the previous component can be multiplied by a very small number, which makes the new target values not much "easier" than those for the previous components. Also, looking at Equation 3, each component is multiplied by a number of values, each smaller than or equal to 1, and the later a component is added, the more such multipliers it has, so it contributes less and less to the final model. However, SSR-1G (and its fitted version) performs much better. The reason is that with 1G, the component is scaled by a factor, mitigating the just-described issue to some extent.

Another notable pattern is that algorithms using 1G as the base algorithm perform considerably better than those using GP, which is to be expected, since even simple linear scaling can considerably reduce the error, especially if the inner structure of the component is good and only lacks such scaling.

The final interesting pattern is that the normalization in SSR has almost no effect in terms of performance when the base algorithm is 1G, and a rather inconsistent effect when the base algorithm is GP. The reason for the small effect with 1G is again the fact that 1G introduces its own multiplicative and additive constant, so the normalization and denormalization coefficients become redundant to some extent.
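The linear scaling referred to above — fitting one multiplicative and one additive constant to a single component, as 1G does via its top-level linear combination — reduces to an ordinary least-squares fit of a line. A minimal sketch, with illustrative names (not code from the paper):

```python
import numpy as np

def linear_scale(component_out, y):
    """Find a, b minimising sum((y - (a + b * component_out))**2),
    i.e. the optimal additive and multiplicative constants for a
    single model component.  Names are illustrative."""
    b, a = np.polyfit(component_out, y, deg=1)  # slope, intercept
    return a, b
```

When a component has the right shape but the wrong offset and scale, this one-dimensional regression removes that error entirely, which is consistent with the advantage observed for the 1G-based variants.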
5.2   Overfitting and underfitting

Figure 2 displays the results on the training and testing sets side by side. With a few exceptions, it can be said that no algorithm suffered from overfitting, as the testing performance values are about as much smaller than the training ones as is exhibited by linear regression. The most notable exceptions are the Boost algorithms, especially on the ASN and CCS datasets. Boost-1G produced two significant outliers in both datasets, and its other testing values are also generally smaller than the training ones. This is best seen on the CCS dataset, where Boost-1G is the best performer by training performance, but its testing performance is significantly worse than its training performance (Mann-Whitney U test with significance level α = 0.01).
   With the exception of nSSR-GP fit on the CCS dataset, all the SSR algorithms have, overall, the smallest difference between training and testing performance. In some cases the testing performance is even better than the training performance. This could indicate potential underfitting, either of the whole algorithm or of the individual components.

6   Conclusions and Future Work

In this article we reviewed the idea of a sequential approach to symbolic regression. We reviewed an existing approach, called Sequential Symbolic Regression, which is unique in using Geometric Semantic Crossover to combine the individual model components and to derive training data for the evolution of subsequent components.
   We have shown that the models SSR produces are effectively a linear combination of a number of components, similar to MGGP, but constructed by a different approach. We have also proposed a simple boosting-inspired approach which uses the residuals directly as the training target values for the next component and combines the components simply by adding them together.
   We have performed a series of experiments with the algorithms, using the two base algorithms GP and 1G, and different variants of SSR, most importantly with final fitting of the coefficients by linear regression. The results have shown that MGGP, a non-sequential algorithm, is the best or one of the best algorithms on each dataset, while producing by far the smallest models. Of the sequential algorithms, SSR is, overall, the worst, except for the variants using final fitting and 1G as the base, which are comparable to the boosting-inspired algorithms on two of the four datasets.
   Except for the boosting-inspired algorithms in some cases, no algorithm experienced overfitting, which might be useful information for some users; it also shows that big and complex models do not necessarily mean overfitting.

6.1   Future Work

This article investigated the sequential algorithms in their basic form only, using a predefined number of components and a predefined scheme of the evolution of the individual components. However, the sequential algorithms naturally lend themselves to tuning and tweaking. One possible approach is not to switch components regularly but based on some other scheme, e.g. the performance of the current component. Another approach worth investigating might be using full MGGP as the base algorithm in the sequential algorithms, which would mean that multiple components are produced at each stage, not just one. This would allow for a “continuum” of algorithms, with MGGP on one side as a totally non-sequential algorithm, and SSR or the boosting-inspired algorithms described in this article on the other side as totally sequential algorithms.
   Another area worth exploring is the interplay of the complexity of individual components or limits on their size, the number of components, and the amount of computational resources available for a single component. Classical boosting works with so-called weak learners, i.e. models which by themselves are not capable of solving the problem to any reasonable degree, and uses many such models. A similar approach could be used for the sequential approaches reviewed in this article, by setting tighter limits on the complexity and/or computational resources for the evolution of each component.

Acknowledgements

The reported research was supported by the Czech Science Foundation grant No. 17-01251. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

References

 [1] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992. ISBN 0-262-11170-5. URL http://mitpress.mit.edu/books/genetic-programming.

 [2] Mark Hinchliffe, Hugo Hiden, Ben McKay, Mark Willis, Ming Tham, and Geoffery Barton. Modelling chemical process systems using a multi-gene genetic programming algorithm. In Late Breaking Paper, GP’96, pages 56–65, Stanford, USA, 1996.

 [3] Dominic P. Searson, David E. Leahy, and Mark J. Willis. GPTIPS: an open source genetic programming toolbox for multigene symbolic regression. In Proceedings of the International MultiConference of
[Figure 2: Performance box plots of the final models for all four datasets. Each algorithm has two box plots: the left one displays the training-set performance, the right one the testing-set performance. The vertical axis is R². Panels: (a) ASN (two outlier values in testing Boost-1G, at −206 and −2.85, are not shown); (b) CCS (two outlier values in testing Boost-1G, at −127797 and −85.6, and one value in testing nSSR-GP fit, at −2.91, are not shown); (c) ENC; (d) ENH. Algorithms compared: LR, 1G, MGGP, Boost-GP, Boost-1G, SSR-GP, SSR-GP fit, SSR-1G, SSR-1G fit, nSSR-GP, nSSR-GP fit, nSSR-1G, nSSR-1G fit.]


     Engineers and Computer Scientists, volume 1, pages 77–80, March 2010.

 [4] Dominic P. Searson. GPTIPS 2: An Open-Source Software Platform for Symbolic Data Mining, pages 551–573. Springer International Publishing, Cham, 2015. ISBN 978-3-319-20883-1. doi: 10.1007/978-3-319-20883-1_22. URL http://dx.doi.org/10.1007/978-3-319-20883-1_22.

 [5] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990. ISSN 1573-0565. doi: 10.1007/BF00116037. URL http://dx.doi.org/10.1007/BF00116037.

 [6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997. ISSN 0022-0000. doi: 10.1006/jcss.1997.1504. URL http://dx.doi.org/10.1006/jcss.1997.1504.

 [7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag New York, 2nd edition, 2009. ISBN 978-0-387-84858-7.

 [8] Luiz Otávio V. B. Oliveira, Fernando E. B. Otero, Gisele L. Pappa, and Julio Albinati. Sequential Symbolic Regression with Genetic Programming, pages 73–90. Springer International Publishing, Cham, 2015. ISBN 978-3-319-16030-6. doi: 10.1007/978-3-319-16030-6_5. URL https://doi.org/10.1007/978-3-319-16030-6_5.

 [9] L. O. V. B. Oliveira, F. E. B. Otero, L. F. Miranda, and G. L. Pappa. Revisiting the sequential symbolic regression genetic programming. In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), pages 163–168, Oct 2016. doi: 10.1109/BRACIS.2016.039.

[10] Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. Geometric semantic genetic programming. In Carlos A. Coello Coello, Vincenzo Cutello, Kalyanmoy Deb, Stephanie Forrest, Giuseppe Nicosia, and Mario Pavone, editors, Parallel Problem Solving from Nature – PPSN XII, volume 7491 of Lecture Notes in Computer Science, pages 21–31. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-32936-4. doi: 10.1007/978-3-642-32937-1_3. URL http://dx.doi.org/10.1007/978-3-642-32937-1_3.

[11] Maarten Keijzer. Scaled symbolic regression. Genetic Programming and Evolvable Machines, 5(3):259–269, 2004. ISSN 1573-7632. doi: 10.1023/B:GENP.0000030195.77571.f9. URL http://dx.doi.org/10.1023/B:GENP.0000030195.77571.f9.

[12] K. Bache and M. Lichman. UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.