Sequential Model Building in Symbolic Regression

Jan Žegklitz¹,² and Petr Pošík¹

¹ Czech Technical University in Prague, Faculty of Electrical Engineering, Technická 2, 166 27 Praha 6, Czech Republic
² Czech Academy of Sciences, Institute of Computer Science, Pod Vodárenskou věží 271/2, 182 07 Praha 8, Czech Republic

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Symbolic Regression is a supervised learning technique for regression based on Genetic Programming. A popular algorithm is Multi-Gene Genetic Programming, which builds models as a linear combination of a number of components that are all built together. In recent years, however, a different approach has emerged, represented by the Sequential Symbolic Regression algorithm, which builds the model sequentially, one component at a time, combining the components using a method based on geometric semantic crossover. In this article we show that the SSR algorithm effectively produces a linear combination of components, and we introduce another sequential approach very similar to the classical ensemble method of boosting. All algorithms are compared with MGGP as a baseline on a number of real-world datasets. The results show that the sequential approaches are overall worse than MGGP, both in terms of accuracy and model size.

1 Introduction

Symbolic Regression (SR) is a supervised learning task whose goal is to find a mathematical function (preferably a simple one) of a number of variables that fits the available training data. Regression is a well-known task with a number of well-tested and successful approaches, e.g. neural networks, support vector machines and random forests, to name a few. However, these conventional approaches produce models which, while useful, are often essentially black boxes that are difficult to interpret. One of the goals of SR is to produce "white-box" models in the form of a mathematical expression. The overwhelming majority of SR approaches utilize some form of Genetic Programming (GP) [1].

This article is structured as follows: Section 2 introduces previous research relevant to this article and provides a new view of the main algorithm examined here; in Section 3 we propose a simple boosting-based sequential SR algorithm; Sections 4 and 5 present the experiments and results, respectively, and provide a discussion of these results. Finally, Section 6 concludes the article and proposes further research.

2 Related Work

In this section we briefly revisit important approaches and algorithms related to sequential model building and to the experiments performed later in this article. First we introduce Multi-Gene Genetic Programming, which is not a sequential algorithm but is important for this article. Then we briefly revisit boosting, a classical machine learning ensemble method. Finally, we introduce the most important related research, the Sequential Symbolic Regression algorithm.

2.1 Multi-Gene Genetic Programming

Multi-Gene GP (MGGP) [2, 3, 4] is an extension of classical GP for symbolic regression. The main idea behind MGGP is that each individual is composed of one or more independent trees, called genes, which together form the complete model. To make the complete model, the outputs of the genes are linearly combined, with the coefficients determined using linear regression. In this article, we call this linear combination of genes a "top-level linear combination".

MGGP does not build the model sequentially. We mention it here because we use the concept of the top-level linear combination to put SSR (see Section 2.3) in a similar framework, and because we use the algorithm in the experiments.

2.2 Boosting

Building predictive models sequentially is a well-known machine learning technique. The prominent example of this approach is the boosting ensemble method [5, 6, 7]. Boosting in the context of least-squares regression (i.e. regression using the squared error as the criterion) works, in simplified form, as follows:

1. Start with a constant model f0(x) = c, e.g. using the mean target value: f0(x) = ȳ.

2. Iterate for k = 1, 2, ..., M, where M is the desired number of iterations:

   (a) Compute the residuals ỹ = y − fk−1(x).
   (b) Train a model component h(x) using ỹ as the target values.
   (c) Update the complete model: fk(x) = fk−1(x) + h(x).

3. fM(x) is the final model.

One possible view is that the later model components correct the mistakes of the previous ones.
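The least-squares boosting loop above can be sketched in a few lines. This is an illustrative sketch only, not code from the paper: a simple closed-form linear fit stands in for the GP-evolved components, and the names (`fit_linear`, `boost`) are ours.

```python
from statistics import mean

def fit_linear(xs, ys):
    """Least-squares fit of y ~ a*x + b (one feature, closed form)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def boost(xs, ys, fit_component, M=3):
    """Least-squares boosting: start from the mean target value, then
    repeatedly fit a component to the residuals and add it to the model."""
    components = [lambda x, c=mean(ys): c]          # step 1: f0(x) = mean(y)
    for _ in range(M):                              # step 2: iterate
        resid = [y - sum(f(x) for f in components)  # (a) residuals
                 for x, y in zip(xs, ys)]
        components.append(fit_component(xs, resid)) # (b) + (c): train and add
    return lambda x: sum(f(x) for f in components)  # step 3: final model

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]              # y = 2x + 1
model = boost(xs, ys, fit_linear)      # model(1.5) recovers 4.0
```

With a linear base learner the loop converges after one correction; with weak nonlinear learners (as in boosting proper, or GP components later in this article), each iteration removes only part of the residual.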
2.3 Sequential Symbolic Regression

SSR [8, 9] is a GP-based method that builds the model in a sequential fashion. The overall structure of the SSR method is the following:

1. An SR algorithm (e.g. GP) is executed with the training set, producing a model component.
2. The model component is added to the complete model and the training data is modified.
3. Unless stopping criteria are met, go to step 1 and repeat, using the modified training data.

This structure is somewhat similar to boosting – a model component is trained, added to the complete model, and the next one is trained using modified training data. However, SSR does not do boosting. The difference is in step 2, i.e. in how the new component is combined with the other components of the complete model and how the training data is modified.

For the first iteration, the target values are not modified and the new component simply becomes the complete model, as at that time it is the only component. In subsequent iterations, SSR utilizes the Geometric Semantic Crossover (GSX) from Geometric Semantic Genetic Programming (GSGP) [10] to add the new component to the complete model. GSX performs a convex combination of two functions:

    f*(x) = r f(x) + (1 − r) f′(x)    (1)

where r is a random number from the interval [0, 1), and f and f′ are the two combined functions.

In SSR, GSX is employed differently than to combine two known functions: only one of the two functions is known (the component from the previous iteration, i.e. f). SSR therefore performs an "incomplete" GSX, meaning that f′ is only a placeholder for a function to be found in the next iteration using modified training data. The modified training data is derived by the following reasoning. The result of the GSX should optimally have zero error, i.e. f*(x) = y. Substituting this into Equation 1 and slightly rearranging it, the following equation is obtained:

    f′(x) = (y − r f(x)) / (1 − r).    (2)

This is, in effect, a requirement on what the outputs of f′ should look like, i.e. it is a definition of its target values for the next iteration. The problem is thus transformed and the next iteration with different training data is started.

It is important to note that the convex combinations are "nested" in depth in the second term of the GSX expression. In the last iteration, the f′ placeholder is not replaced with another GSX result but with the function found in that iteration, terminating the nesting. The complete SSR procedure has the following form:

1. y1 = y, k ← 1.
2. Run the base algorithm with yk as target values, producing a component fk.
3. Sample a random number rk ∈ [0, 1).
4. Update the complete model:
   • if k = 1 and it is not the last iteration: f* = rk fk + (1 − rk) f′
   • if k = 1 and it is the last iteration: f* = fk, terminate
   • if k > 1 and it is not the last iteration: f′ ← rk fk + (1 − rk) f′
   • if k > 1 and it is the last iteration: f′ ← fk, terminate
5. Adjust the target values: yk+1 = (yk − rk fk(x)) / (1 − rk).
6. k ← k + 1, repeat from step 2 unless a stopping condition is met.
7. Return f*.

where f′ is the placeholder. The nesting happens in the third case of step 4, where the placeholder is replaced with the GSX of the new function and the placeholder again.

2.4 SSR as Top-Level Linear Combination

In the previous subsection we briefly described the SSR procedure and how it constructs models via GSX. Now we rewrite the procedure in such a way that the result is a linear combination of expressions, as in MGGP and Boost-GP. By iterating Equation 1 (with added subscripts to indicate the iteration number) and expanding the multiplication at each step, we can see what the model would look like if it were finished at that iteration, i.e. in the second and fourth cases of step 4 of the SSR procedure from the previous subsection (for the sake of simplicity, let r̄k = 1 − rk):

    k = 1:  f1* = f1
    k = 2:  f2* = r1 f1 + r̄1 f2
    k = 3:  f3* = r1 f1 + r̄1 (r2 f2 + r̄2 f3)
                = r1 f1 + r̄1 r2 f2 + r̄1 r̄2 f3    (3)
    k = 4:  f4* = r1 f1 + r̄1 r2 f2 + r̄1 r̄2 (r3 f3 + r̄3 f4)
                = r1 f1 + r̄1 r2 f2 + r̄1 r̄2 r3 f3 + r̄1 r̄2 r̄3 f4
    ...

From the above iteration we can see that the model is in fact a linear combination of several expressions. We can also see a pattern in the multiplicative coefficients: each time a component fk is added to the model, the coefficient of component fk−1 is multiplied by rk−1, and the coefficient of the new component fk is the (previous) coefficient of component fk−1 multiplied by r̄k−1.
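This coefficient pattern can be maintained incrementally. A minimal sketch of the bookkeeping (ours, not the authors' implementation; `ssr_coefficients` is a hypothetical helper name):

```python
def ssr_coefficients(rs):
    """Expand the nested GSX combinations into the flat coefficient list
    of the top-level linear combination (the pattern of Equation 3).

    rs -- the random numbers r_1, r_2, ... drawn in the non-final iterations.
    """
    coeffs = [1.0]                     # k = 1: f1* = f1
    for r in rs:                       # adding the next component with random r_k
        tail = coeffs[-1]              # coefficient of the current last component
        coeffs[-1] = tail * r          # last coefficient is multiplied by r_k
        coeffs.append(tail * (1 - r))  # new component enters with tail * (1 - r_k)
    return coeffs

coeffs = ssr_coefficients([0.5, 0.5])  # r_1 = r_2 = 0.5
# coeffs == [0.5, 0.25, 0.25]; being a nested convex combination, the
# coefficients always sum to 1
```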
This allows for an efficient implementation using a list of expressions and a corresponding list of their coefficients: adding a component amounts to multiplying one number in the second list and appending an element to the end of both lists. This view also allows further manipulation of the coefficients and model components in a way similar to MGGP, e.g. performing an additional linear regression.

3 Boosting-GP

In the previous section we briefly introduced the Sequential Symbolic Regression algorithm (SSR), which utilizes Geometric Semantic Crossover to sequentially combine the components of the complete model and to derive modified training data for subsequent iterations. We propose an alternative approach that is almost identical to the boosting scheme (see Section 2.2), hence we call it Boost-GP.

The procedure matches that of boosting described above, except for a few details. In Boost-GP, we do not start with a constant model; instead we start with no prior model at all, and the first component is found using GP too. The procedure that finds the individual components is a GP-based algorithm. If the underlying algorithm produces a linear combination of several expressions (as MGGP does), those are treated as several components and are added to the model at the same time, summing possible constant offsets. Other than these details, the procedure is exactly the same as that of boosting.

4 Experiments

We perform a series of experiments with the aim of comparing the performance of a number of algorithms, both in terms of model accuracy and complexity. First we describe the algorithms used in the experiments and their settings, then we describe the datasets, and finally we describe the experimental protocol.

4.1 Algorithms

For the experimental evaluation we selected a number of algorithms and their variants based on SSR, boosting with GP, MGGP, and linear regression as a baseline.

LR The first algorithm, which serves purely as a baseline, is an ordinary least-squares linear regression on the problem's features. Its purpose is to put all the other algorithms into a well-established context.

MGGP This algorithm produces models composed of multiple components which are all evolved simultaneously.

1G The same as MGGP but with individuals limited to a single gene, making it effectively a Scaled Symbolic Regression algorithm [11] (not to be confused with Sequential Symbolic Regression, discussed in this article, which has the same abbreviation SSR). In order to provide similar complexity, the tree depth limit is increased to allow for at least the same number of nodes as in MGGP.

SSR-GP, SSR-1G The Sequential Symbolic Regression algorithm (see Section 2.3) with ordinary GP and 1G (see above) as the base algorithm, respectively.

nSSR-GP, nSSR-1G Similar to SSR-GP and SSR-1G, but using the normalization and denormalization described in [9].

SSR with final fitting All the SSR-based algorithms mentioned above (coded identically but suffixed with the word "fit", e.g. "SSR-GP fit") are also used in such a way that after the algorithm completes, the coefficients of the individual components are recalculated using linear regression, in the same manner as in MGGP.

Boost-GP, Boost-1G The boosting-type sequential approach (see Section 3) with ordinary GP and 1G (see above) as the base algorithm, respectively.

4.2 Algorithm settings

The settings of the algorithms (except for LR, which has no settings) are as follows. The MGGP, SSR and Boost algorithms are set to produce a model of 10 components with a maximum depth of 10. That means that MGGP has the gene number limit set to 10, and the SSR and Boost algorithms are run for 10 iterations. The stopping criterion for MGGP is a wall-clock time limit of 1800 s; for the SSR and Boost algorithms it is the number of 10 iterations, each with a wall-clock time limit of 180 s, so all algorithms have an equal amount of time available to find a good solution. The 1G algorithm produces only one component, and in order to have roughly the same number of nodes available, its maximum depth is set to 16. 1G is also run for 1800 s. Other settings, common to all the algorithms, are listed in Table 1.

    parameter                          value
    population size                    800
    # of elites                        10 % of population
    tournament size                    4 % of population
    prob. of mutation                  0.2
    prob. of crossover                 0.7
    constant/subtree mutation ratio    0.3/0.7
    σ of constant mutation             10
    function set                       a + b, a − b, a · b, a/√(1 + b²),
                                       eᵃ, |a|, sin a, a²

Table 1: Common algorithm settings.

4.3 Datasets

We use four real-world datasets freely available from the UCI repository [12]. The datasets are described in the following paragraphs. Summary information about these datasets is in Table 2.

                                        # of samples
    dataset    # of dimensions    training    testing
    ASN        5                  1052        451
    CCS        8                  721         309
    ENC        8                  537         231
    ENH        8                  537         231

Table 2: Summary information on the datasets used in the experiments. The division into training and testing samples comes from a random 70:30 split of the original dataset.
ASN Airfoil self-noise (ASN) is a 5-dimensional dataset describing acoustic pressure for various airfoils during wind tunnel tests.

CCS Concrete compressive strength (CCS) is an 8-dimensional dataset describing the compressive strength of concrete based on its ingredients.

ENC, ENH Energy efficiency of cooling/heating (ENC, ENH) are 8-dimensional datasets describing the energy efficiency of cooling and heating buildings.

4.4 Experimental protocol

Each dataset is randomly split into training/testing subsets 25 times. Each algorithm is run once for each split of each dataset, i.e. 25 runs per algorithm per dataset in total. Each of the 25 runs of an algorithm on a dataset uses a different seed for the random number generator used by the algorithm (except for LR, which is fully deterministic).

During an algorithm run, only the training set is used. After the run completes, the models produced by the algorithms are evaluated on the testing set. Performance is measured using the coefficient of determination, or R² score.¹ The model complexity is measured as the number of nodes in the model; the coefficients of the top-level linear combination that combines the components of the model are not counted towards this number. These two measures (performance and complexity) are then aggregated over the 25 runs of each algorithm for each dataset.

All runs were performed on machines of identical configuration that were part of the MetaCentrum grid (see Acknowledgements).

¹ R² = 1 − ∑ᵢ(yᵢ − ŷᵢ)² / ∑ᵢ(yᵢ − ȳ)², summed over the test samples.

5 Results

In this section we present the results of the experiments. As described in Section 4.4, we focus on the performance (R²) and the complexity (number of nodes) of the models.

Test-set performance along with complexity is depicted in Figure 1. Each result is displayed in the form of two crossing lines whose intersection is at the median R² and the median number of nodes, and which stretch from the 1st to the 3rd quartile in both measures. SSR algorithms of the same configuration, differing only in final fitting, are plotted in the same color for easier visual assessment of the effect of final fitting.

[Figure 1: Complexity-performance plot of the final models for all four datasets: (a) ASN, (b) CCS, (c) ENC, (d) ENH. The horizontal lines are vertically placed at the median R² value and stretch horizontally from the 1st to the 3rd quartile of the number of nodes. The vertical lines are horizontally placed at the median number of nodes and stretch vertically from the 1st to the 3rd quartile of R². The intersections show the point of median number of nodes and median R². The closer the lines are to the upper left corner of the plot, the better (i.e. well-performing and small) the models are. The dotted line shows the median R² of LR and the gray band stretches from the 1st to the 3rd quartile of R² of LR.]

5.1 Discussion

From Figure 1 it is clear that MGGP produces the smallest models for all four datasets. We hypothesize that this is because all the components are evolved together, so they can complement each other, while in the other algorithms (except for 1G) each component is forced to evolve on its own. While a component is being evolved, it doesn't "know" about the other components, so the evolution tries to solve the whole current problem with the single component and produces a big component in the process. When it is time for the next component, the previous one is fixed and cannot get smaller anymore.

In terms of test-set performance, MGGP, Boost-GP and Boost-1G are the best performers overall, with (n)SSR-1G fit being close. On the other hand, SSR-GP and nSSR-GP are, overall, the worst performers, performing even worse than LR in some cases. Even though the Boost algorithms are competitive in terms of performance, they are in line with the SSR algorithms regarding the number of nodes, which is much larger than that of MGGP or 1G.

The SSR algorithms, especially the variants without final fitting, perform rather poorly compared to the other algorithms except for 1G and LR. It can be seen that the final fitting improves the performance considerably. We hypothesize that this is because the original coefficients are chosen randomly from a bounded interval. Even though the target values are modified in such a way that the result should be close to the optimum if the new component models the new target values well, the contribution of the previous component can be multiplied by a very small number, which makes the new target values not much "easier" than those for the previous components. Also, looking at Equation 3, each component is multiplied by a number of values, each smaller than or equal to 1; the later a component is added, the more such multipliers it has, and therefore it contributes less and less to the final model. However, SSR-1G (and its fitted version) performs much better. The reason is that with 1G, the component is scaled by a factor, mitigating the just-described issue to some extent.

Another notable pattern is that algorithms using 1G as the base algorithm perform considerably better than those using GP, which is to be expected, since even simple linear scaling can considerably reduce the error, especially if the inner structure of the component is good and only lacks such scaling.

The final interesting pattern is that the normalization in SSR has almost no effect on performance when the base algorithm is 1G, and a quite inconsistent effect when the base algorithm is GP. The reason for the small effect with 1G is again the fact that 1G introduces its own multiplicative and additive constants, so the normalization and denormalization coefficients become redundant to some extent.
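The shrinking contribution of later components noted above can also be illustrated numerically: sampling the rk uniformly from [0, 1) and averaging the expanded coefficients of Equation 3 over many trials shows the average coefficient decaying roughly geometrically. This is a small illustrative simulation of ours, not an experiment from the paper:

```python
import random

random.seed(0)
TRIALS, K = 10000, 10
sums = [0.0] * K
for _ in range(TRIALS):
    coeffs = [1.0]                     # coefficient of f1 before any combination
    for _ in range(K - 1):             # add components f2 .. fK
        r = random.random()            # r_k sampled from [0, 1) as in SSR
        tail = coeffs[-1]
        coeffs[-1] = tail * r          # last coefficient shrinks by r_k
        coeffs.append(tail * (1 - r))  # new component enters with tail * (1 - r_k)
    for i, c in enumerate(coeffs):
        sums[i] += c
avg = [s / TRIALS for s in sums]       # average coefficient of the i-th component
# avg decays roughly geometrically: later components contribute less on average
```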
5.2 Overfitting and underfitting

Figure 2 displays the results on the training and testing sets side by side. With a few exceptions, it can be said that no algorithm suffered from overfitting, as the testing performance values are about as much smaller than the training ones as is exhibited by the linear regression. The most notable exceptions are the Boost algorithms, especially on the ASN and CCS datasets. Boost-1G produced two significant outliers on both datasets, and its other testing values are also generally smaller than the training ones. This is best seen on the CCS dataset, where Boost-1G is the best performer by training performance but its testing performance is significantly² worse than its training performance.

With the exception of nSSR-GP fit on the CCS dataset, all the SSR algorithms have, overall, the smallest difference between training and testing performance. In some cases the testing performance is even better than the training performance. This could indicate potential underfitting, either of the whole algorithm or of the individual components.

[Figure 2: Performance box plots of the final models for all four datasets: (a) ASN — two outlier values in testing Boost-1G at −206 and −2.85 are not shown; (b) CCS — two outlier values in testing Boost-1G at −127797 and −85.6 and one value in testing nSSR-GP fit at −2.91 are not shown; (c) ENC; (d) ENH. Each algorithm has two box plots: the left one displays the training set performance, the right one the testing set performance. In the plots for the ASN and CCS datasets some outlier values are not shown for the sake of clarity.]

² Mann-Whitney U test with significance level α = 0.01.

6 Conclusions and Future Work

In this article we reviewed the idea of the sequential approach to symbolic regression. We reviewed an existing approach, called Sequential Symbolic Regression, which is unique in using Geometric Semantic Crossover to combine the individual model components and to derive training data for subsequent component evolution. We have shown that the models SSR produces are effectively a linear combination of a number of components, similar to MGGP, but constructed by a different approach. We have also proposed a simple boosting-inspired approach which uses the residuals directly as the training target values for the next component and combines the components simply by adding them together.

We have performed a series of experiments with the algorithms, using the different base algorithms GP and 1G, and different variants of SSR, most importantly with final fitting of the coefficients by linear regression. The results have shown that MGGP, a non-sequential algorithm, is the best or one of the best algorithms on each dataset while producing by far the smallest models. Of the sequential algorithms, SSR is, overall, the worst, except for the variants using final fitting and 1G as the base, which are comparable to the boosting-inspired algorithms on two of the four datasets.

Except for the boosting-inspired algorithms in some cases, no algorithm experienced overfitting, which might be useful information for some users; it also shows that big and complex models do not necessarily mean overfitting.

6.1 Future Work

This article investigated the sequential algorithms in their basic form only, using a predefined number of components and a predefined scheme for the evolution of the individual components. However, the sequential algorithms naturally lend themselves to tuning and tweaking. One possible approach is not to switch components regularly but based on some other scheme, e.g. the performance of the current component. Another approach worth investigating might be using full MGGP as the base algorithm in the sequential algorithms, which would mean that multiple components are produced at each stage, not just one. This would allow for a "continuum" of algorithms, with MGGP on one side as a totally non-sequential algorithm, and SSR or the boosting-inspired algorithms described in this article on the other side, as totally sequential algorithms.

Another area worth exploring is the interplay between the complexity of individual components (or limits on their size), the number of components, and the amount of computational resources available for a single component. Classical boosting works with so-called weak learners, i.e. models which by themselves are not capable of solving the problem to any reasonable degree, and uses many of them. A similar approach could be used for the sequential approaches reviewed in this article, by setting tighter limits on the complexity and/or computational resources for the evolution of each component.

Acknowledgements

The reported research was supported by the Czech Science Foundation grant No. 17-01251. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

References

[1] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992. ISBN 0-262-11170-5. URL http://mitpress.mit.edu/books/genetic-programming.

[2] Mark Hinchliffe, Hugo Hiden, Ben McKay, Mark Willis, Ming Tham, and Geoffery Barton. Modelling chemical process systems using a multi-gene genetic programming algorithm. In Late Breaking Papers, GP'96, pages 56–65, Stanford, USA, 1996.

[3] Dominic P. Searson, David E. Leahy, and Mark J. Willis. GPTIPS: an open source genetic programming toolbox for multigene symbolic regression. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 1, pages 77–80, March 2010.

[4] Dominic P. Searson. GPTIPS 2: An Open-Source Software Platform for Symbolic Data Mining, pages 551–573. Springer International Publishing, Cham, 2015. ISBN 978-3-319-20883-1. doi: 10.1007/978-3-319-20883-1_22.

[5] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990. ISSN 1573-0565. doi: 10.1007/BF00116037.

[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997. ISSN 0022-0000. doi: 10.1006/jcss.1997.1504.

[7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag New York, 2nd edition, 2009. ISBN 978-0-387-84858-7.

[8] Luiz Otávio V. B. Oliveira, Fernando E. B. Otero, Gisele L. Pappa, and Julio Albinati. Sequential Symbolic Regression with Genetic Programming, pages 73–90. Springer International Publishing, Cham, 2015. ISBN 978-3-319-16030-6. doi: 10.1007/978-3-319-16030-6_5.

[9] L. O. V. B. Oliveira, F. E. B. Otero, L. F. Miranda, and G. L. Pappa. Revisiting the sequential symbolic regression genetic programming. In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), pages 163–168, Oct 2016. doi: 10.1109/BRACIS.2016.039.

[10] Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. Geometric semantic genetic programming. In Carlos A. Coello Coello, Vincenzo Cutello, Kalyanmoy Deb, Stephanie Forrest, Giuseppe Nicosia, and Mario Pavone, editors, Parallel Problem Solving from Nature – PPSN XII, volume 7491 of Lecture Notes in Computer Science, pages 21–31. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-32936-4. doi: 10.1007/978-3-642-32937-1_3.

[11] Maarten Keijzer. Scaled symbolic regression. Genetic Programming and Evolvable Machines, 5(3):259–269, 2004. ISSN 1573-7632. doi: 10.1023/B:GENP.0000030195.77571.f9.

[12] K. Bache and M. Lichman. UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.