<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Evolutionary Geospatial Regression Tree</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Margot Geerts</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seppe vanden Broucke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jochen De Weerdt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Business Informatics and Operations Management, Ghent University</institution>
          ,
          <addr-line>Tweekerkenstraat 2, 9000 Gent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Centre for Information Systems Engineering, KU Leuven</institution>
          ,
          <addr-line>Naamsestraat 69, 3000 Leuven</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tree-based methods have become popular for spatial prediction tasks due to their high accuracy in dividing input spaces into regions with different predictions. However, traditional decision trees perform univariate splits, resulting in rectangular regions. To address this limitation and provide more intuitive and accurate decision boundaries for spatial data, we propose a novel Geospatial Regression Tree (GeoTree) with two multivariate geospatial split types, i.e. oblique and Gaussian splits. Our approach relies on evolutionary algorithms to decide on the optimal split type, chosen among axis-parallel, oblique and Gaussian splits, in each internal node. We conducted a simulation study using five synthetic datasets to demonstrate GeoTree's ability to capture orthogonal, diagonal and ellipse patterns accurately. The results confirm the proposed method's advantage in tree depth and stability. We also tested GeoTree on real-life residential property valuation and soil content data. Our experiments revealed that the geospatial split types in GeoTree improve interpretability and maintain predictive power for price prediction and soil mapping tasks based on X- and Y-coordinates. The resulting decision boundaries are more intuitive spatial patterns on a geographic map of the study area.</p>
      </abstract>
      <kwd-group>
        <kwd>Decision trees</kwd>
        <kwd>Spatial data</kwd>
        <kwd>Evolutionary algorithms</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Property valuation</kwd>
        <kwd>Soil mapping</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Spatial prediction tasks pose a significant challenge due to the unique nature of spatial data. Spatial data is characterized by spatial dependence and spatial heterogeneity, which makes it difficult to analyze. For spatial applications such as soil mapping [1], disease mapping [2, 3], property valuation [4] and land use or cover mapping [5], tree-based models have proven to be appropriate. However, traditional decision trees with univariate splits are limited in their ability to capture spatial patterns. By drawing axis-oblique decision boundaries, the accuracy of the method for spatial data can be significantly improved [6]. Furthermore, introducing multivariate geospatial splits can improve the accuracy of tree-based models for spatial prediction tasks as they allow to learn from X- and Y-coordinates simultaneously.</p>
      <p>The recent advancements in Multivariate Decision Tree (MDT) learning have primarily focused on linear combinations in splits and their optimization. On the other hand, omnivariate trees can allow test conditions to follow any non-linear function. However, most spatial data sets exhibit recognizable patterns. For example, house prices often follow street patterns, where certain (parts of) streets are more expensive than others. Similarly, circular patterns can be observed in historic city centers when zooming out to the city level. For instance, cities such as Brussels, Belgium and London, United Kingdom, are surrounded by a ring road that separates the city center from the suburbs. Moreover, administrative boundaries frequently follow non-rectangular shapes, as well as house prices, as shown in Figure 1. Despite these spatial patterns, previous research in spatial prediction has yet to incorporate them into tree-based models.</p>
      <p>Therefore, we propose a new multivariate decision tree algorithm that integrates three different split types: axis-parallel splits, oblique splits and Gaussian splits. In addition, we introduce a Genetic Algorithm (GA) to generate the oblique and Gaussian candidate splits within the internal nodes. The combination of these multivariate split types with the traditional axis-parallel split enables the proposed method to recognize spatial patterns in geographic location data more accurately and with simpler trees. As a result, the Geospatial Regression Tree (GeoTree) provides more intuitive decision boundaries and less complex trees, making it more interpretable than existing decision tree algorithms. We demonstrate that our new decision tree algorithm can capture spatial patterns more effectively than both traditional regression trees and oblique regression trees for regression tasks. Using synthetic data sets, we showcase the proposed method's efficient pattern learning ability, with limited tree depth and error. In particular, our advanced synthetic data set that reflects real spatial patterns demonstrates GeoTree's superior performance in error, tree depth and stability. Finally, we apply the proposed tree algorithm with the geospatial split types to real-life property valuation and soil mapping data sets, demonstrating similar predictive power compared to existing decision tree algorithms, while significantly enhancing interpretability.</p>
      <p>STRL'23: Second International Workshop on Spatio-Temporal Reasoning and Learning, 21 August 2023, Macao, S.A.R. * Corresponding author: margot.geerts@kuleuven.be (M. Geerts). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>As spatial prediction tasks benefit from the introduction of multivariate splits in tree-based models, the following section introduces MDTs and approaches to MDT learning. MDTs differ from traditional decision trees in the split type used. While traditional decision trees split the data based on a test condition of the form x_i ≤ c, with x_i the i-th feature, MDTs include multivariate conditions, f(x) ≤ c. In [7], multivariate decision tree algorithms are discussed. The authors identify three split types: univariate splits, multivariate splits that consist of linear combinations of features, and multivariate splits that include non-linear combinations. Most researchers have focused on axis-oblique splits, i.e. linear combinations, and have demonstrated the benefits for spatial analysis [6, 1] and classification in general [8]. However, non-linear combinations have been included using quadratic functions and power laws [9, 10]. Further, splits have been generalized even more by using neural networks, decision trees and random forests at internal nodes [11]. Nevertheless, this generalization substantially increases the cost of learning an MDT.</p>
      <p>Instead of non-linear functions with many parameters, our approach therefore relies on a small set of split shapes that match recognizable geospatial patterns.</p>
    </sec>
    <sec id="sec-3">
      <title>3. An Evolutionary Geospatial Regression Tree Algorithm</title>
      <p>Our work focuses on binary decision trees for regression with numerical features. A regression tree takes as input an observation (x_1, ..., x_d, y) where the x_i are the real-valued features, d is the number of features and y is the real-valued target variable. A trained regression tree consists of test conditions of the form x_i ≤ c with c a constant, producing axis-parallel splits. Typically, a top-down greedy approach is used for building the tree.</p>
      <p>This means that the best split, determined by a feature x_i and constant c, is chosen at each node, starting at the root node, effectively splitting each node into two child nodes until the specified depth is reached or the leaf nodes are pure. The best split produces the largest reduction in mean squared error of the predictions. In essence, a regression tree sorts the observations in the training set through the tree, according to the test conditions in the internal nodes, and predicts in each leaf node the average of all observations sorted into that node.</p>
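      <p>To make the greedy procedure concrete, the search for the best axis-parallel split can be sketched as follows. This is an illustrative sketch with hypothetical function names, not the paper's implementation:</p>

```python
import numpy as np

def best_axis_parallel_split(X, y):
    """Exhaustively search features and thresholds for the axis-parallel
    split x_i <= c that maximally reduces the sum of squared errors.
    Illustrative sketch only, not the paper's code."""
    n, d = X.shape
    parent_sse = np.sum((y - y.mean()) ** 2)
    best = (None, None, 0.0)  # (feature index i, threshold c, SSE reduction)
    for i in range(d):
        for c in np.unique(X[:, i]):
            left, right = y[X[:, i] <= c], y[X[:, i] > c]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            child_sse = (np.sum((left - left.mean()) ** 2)
                         + np.sum((right - right.mean()) ** 2))
            gain = parent_sse - child_sse
            if gain > best[2]:
                best = (i, c, gain)
    return best
```

      <p>Splitting each node this way, recursively, yields a standard univariate regression tree.</p>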
      <p>GeoTree’s key innovation lies in the inclusion of two
new bivariate decision tree splits, the oblique split and
a Gaussian split, in addition to the axis-parallel split.</p>
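      <p>As an illustration of these two geospatial split types and of scoring a candidate split by its gain in mean squared error, a minimal sketch follows. All helper names are ours, not the authors' code:</p>

```python
import numpy as np

def oblique_split(points, p1, p2):
    """Side of the hyperplane through p1 and p2: implements a test of the
    form w1*x1 + w2*x2 >= c on an (n, 2) array of points."""
    w = np.array([p2[1] - p1[1], p1[0] - p2[0]])  # normal of the line p1-p2
    c = w @ np.asarray(p1)
    return points @ w >= c

def gaussian_split(points, f1, f2, p3):
    """Ellipse test: the sum of distances to the focal points f1 and f2 is
    compared with the constant fixed by the third point p3. True means the
    point lies on or outside the ellipse."""
    d = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b), axis=-1)
    c = d(p3, f1) + d(p3, f2)
    return d(points, f1) + d(points, f2) >= c

def mse_gain(y, right):
    """Fitness of a candidate split: parent MSE minus the weighted MSEs of
    the two child nodes (right is a boolean mask)."""
    yl, yr = y[~right], y[right]
    mse = lambda v: float(np.mean((v - v.mean()) ** 2)) if len(v) else 0.0
    n = len(y)
    return mse(y) - (len(yl) / n * mse(yl) + len(yr) / n * mse(yr))
```

      <p>A perfect split of a target that takes two distinct values recovers the full parent MSE as gain.</p>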
      <p>The oblique split divides the input space in two regions based on a linear combination of two features, w1 · x1 + w2 · x2 ≥ c. On the other hand, the Gaussian split defines an ellipse in a two-dimensional space, d(x, f1) + d(x, f2) ≥ c. This is supported by the mathematical property that an ellipse consists of all points of which the sum of the distances to both focal points (f1, f2) is equal to a constant. Therefore, the Gaussian split separates input data located inside of the ellipse from input data located on or outside of the ellipse.</p>
      <p>As for oblique splits, finding the best Gaussian split is NP-hard [15]. In addition, a brute-force search that enumerates all possible hyperplanes and ellipses, based on two and three data points respectively, has an exponential cost. Two points are necessary to define a hyperplane and three to define an ellipse, where two points function as focal points and the third is used to calculate the constant c. A GA is employed instead to generate oblique and Gaussian candidate splits in each internal node. We base the fitness function on the evaluation criterion for the decision tree, that is, the gain in mean squared error. As such, the fitness function is defined by</p>
      <p>MSE − ( |ℒ|/N · MSE(ℒ) + |ℛ|/N · MSE(ℛ) ), with MSE(𝒮) = 1/|𝒮| ∑_{i∈𝒮} (y_i − ȳ_𝒮)²,</p>
      <p>where ℒ and ℛ are the sets of observations sorted to the left and right respectively by the candidate split, and N = |ℒ| + |ℛ|.</p>
      <p>Algorithm 1 presents the logic of the GA-based candidate split generation that returns the best split. This algorithm is invoked for both proposed splits separately, as the notion of an individual is specific to the type of split. As mentioned above, an oblique split can be defined by two two-dimensional points and a Gaussian split by three two-dimensional points. Consequently, an individual for an oblique candidate split is a list of four floats, or two points p1 and p2, as can be seen in Figure 2a, to make the hyperplane. An individual for a Gaussian candidate split is defined as a list of six floats, or three points p1, p2 and p3 (see Figure 2b), to generate an ellipse. The population is initialized by randomly drawing n times from the uniform distribution defined by the lower boundaries (lb) and upper boundaries (ub). The upper and lower boundaries are defined by the minimum and maximum values of the two data dimensions. To reflect individuals, these values are repeated to result in a list of four values in case of oblique splits, while Gaussian boundaries require a list of six values. After initialization, the evolution process is repeated for the number of generations (g). First, a ‘Hall of Fame’ variable is updated to retain the individuals that have the highest fitness values from the current population. This elitist approach requires a parameter h to indicate the number of individuals to retain. Then, the current population is modified using selection, crossover and mutation operators. A tournament selection is performed by selecting the fittest individual in t tournaments of size s. Crossover is performed subsequently on two individuals with a probability cxpb. This step is implemented with a blend crossover and the required parameter α which determines the interval to draw new values from based on the parents. Next, the new offspring are mutated with a probability mutpb. Specifically, polynomial bounded mutation is used to ensure individuals range between the boundaries (lb, ub). This operator requires two additional parameters: the probability of each value of the individual to be mutated (indpb) and the crowding degree (η). Lastly, the fittest individuals are chosen among the offspring and the fittest of the previous population (‘Hall of Fame’). After g generations, the fittest individual from the last population is returned as the candidate split. The implementation is based on the Python library DEAP [16].</p>
      <preformat>Algorithm 1: Genetic algorithm for candidate split generation.
Require: n, lb, ub, g, h, t, s, cxpb, α, mutpb, indpb, η
Population ← Repeat(Random(lb, ub), n)
for i = 0 to g do
    HallofFame ← Fittest(Population, h)
    Offspring ← TournamentSelection(Population, t, s)
    Offspring ← Mate(Offspring, cxpb, α)
    Offspring ← Mutate(Offspring, mutpb, indpb, η)
    Population ← SelectBest(Offspring + HallofFame, n)
end for
BestIndividual ← SelectBest(Population, 1)
return BestIndividual</preformat>
      <p>(Figure 2a: oblique split individual consisting of four chromosomes that form two points p1 and p2.)</p>
      <p>The decision tree algorithm performs a greedy search in each internal node by finding the best orthogonal, oblique and Gaussian candidate split and finally choosing the best split among the three candidate splits. Orthogonal candidate splits are efficiently searched by sorting all unique values for each variable in the data set and evaluating corresponding candidate splits. Oblique and Gaussian splits are generated by Algorithm 1 as explained before. The complete GeoTree algorithm is available at https://github.com/margotgeerts/GeoTree.</p>
    </sec>
    <sec id="sec-4a">
      <title>4. Experimental Evaluation</title>
      <p>The performance of GeoTree is demonstrated by first comparing the regression outcomes for five synthetic data sets. Second, we show the benefits of our method in two spatial applications, a real-life housing data set and soil mapping data.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>In order to thoroughly test the benefits of the proposed method with two GA-generated split types, axis-oblique and Gaussian, in combination with axis-orthogonal splits, we consider other orthogonal and oblique regression tree algorithms as baselines. The proposed Geospatial Regression Tree (GeoTree) is compared with a traditional Univariate Regression Tree (SkTree) and an Oblique Regression Tree (ObTree). The baseline methods are implemented using Scikit-learn's decision tree regressor [17] and a HHCart regression tree implementation [18].</p>
        <p>To evaluate the models, we employ a repeated 5-fold set-up, where each fold is evaluated using five repeated experiments, resulting in 25 observations per model. We report the performance metrics on the test set among the five folds using the average of the five iterations. For all models, we use the MSE as the impurity measure. We set the hyperparameters of SkTree and ObTree to their default values. In GeoTree, we use a GA to generate oblique and Gaussian candidate splits. The parameter settings of the GA are shown in Table 1, which result from a grid search on a separate real-life housing data set with spatial features. The boundaries (lb, ub) are defined by the minimum and maximum values of the data.</p>
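        <p>The repeated 5-fold protocol can be reproduced with Scikit-learn's RepeatedKFold, as sketched below on synthetic stand-in data (our own illustration, not the paper's experiment code):</p>

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for X-/Y-coordinates and a target with a spatial step.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 10.0 * (X[:, 0] > 0.5) + rng.normal(scale=2.0, size=200)

# 5 folds x 5 repeats yields the 25 test-set observations per model.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
rmses = []
for train_idx, test_idx in cv.split(X):
    tree = DecisionTreeRegressor(max_depth=4).fit(X[train_idx], y[train_idx])
    rmse = mean_squared_error(y[test_idx], tree.predict(X[test_idx])) ** 0.5
    rmses.append(rmse)
median_rmse = float(np.median(rmses))
```

        <p>Each of the 25 fold-level RMSE values then feeds the box plots and significance tests reported later.</p>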
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Simulation Study</title>
        <p>To assess the performance of GeoTree and the baselines in detecting patterns in data, we generated five synthetic data sets with distinct patterns (see Figure 3). The target variable in the different areas defined in the data sets is normally distributed with a varying mean. The results for the orthogonal data set show that while GeoTree and SkTree can identify the true splits almost perfectly, ObTree struggles to do so. In contrast, GeoTree correctly chooses orthogonal splits among the three split types and replicates the performance of SkTree. Similarly, GeoTree is effective in detecting the diagonal, elliptic, and mixed patterns in the diagonal, ellipse and mixed data sets. In addition, the experiments show that GeoTree can identify diagonal patterns more efficiently than ObTree. Finally, the bivariate bumps data set, a synthetic spatial regression data set adapted from [19], further confirms the superior performance of GeoTree in comparison to the baselines. In summary, the results show that GeoTree achieves lower error and requires smaller trees, thereby enhancing interpretability of the final model in comparison to traditional decision trees. The data sets and results are discussed in more detail in Appendix A.</p>
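        <p>For instance, the ellipse data set can be generated along the lines described in Appendix A.1; the following is a sketch with our own variable names, not the authors' generator:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(size=(4096, 2))  # uniform points, as in all five data sets

# Three random points define the ellipse: two act as focal points, the third
# fixes the constant c (the sum of distances to the foci).
f1, f2, p3 = rng.uniform(size=(3, 2))
c = np.linalg.norm(p3 - f1) + np.linalg.norm(p3 - f2)
inside = (np.linalg.norm(pts - f1, axis=1)
          + np.linalg.norm(pts - f2, axis=1)) < c

# Target: N(30, 2.0^2) inside the ellipse, N(10, 2.0^2) outside.
y = np.where(inside,
             rng.normal(30.0, 2.0, len(pts)),
             rng.normal(10.0, 2.0, len(pts)))
```
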
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experiments on Housing and Soil</title>
        <p>Two spatial applications, house price prediction and soil mapping, show the performance of the proposed method in a real-life setting. The results are discussed with regard to model accuracy, tree depth and size, stability defined by the Interquartile Range (IQR), and the decision boundaries visualized in geographic space.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Housing data set</title>
          <p>A real-life housing data set of the province of Namur in Belgium is used to test the performance of the proposed model in comparison with the baseline methods. The data set contains 17 967 houses with the indexed transaction price and the X- and Y-coordinates. The house prices range between 100 000 and 600 000 euros approximately. For the experiments, the prices are log transformed and the X- and Y-coordinates are used as predictor variables. The test RMSE is presented in a box plot for all models in Figure 4. GeoTree reaches its lowest RMSE at depth 4 with a median RMSE of 0.3293. SkTree reaches a minimum RMSE of 0.3304 at depth 6, and ObTree reaches a value of 0.3285 at depth 8. The difference between the test errors of GeoTree and ObTree is not statistically significant (p-value = 0.1584), based on the corrected repeated k-fold cross-validation test [20]. These observations add to the evidence of the simulation study that the proposed GeoTree has an advantage in depth compared to the baseline methods. A smaller depth in turn increases interpretability. Moreover, these experiments show that the baseline methods have a similar test error. Table 2 also reveals that while the RMSE is similar among the three models, GeoTree is able to achieve this error level using significantly smaller trees, indicated by the lower number of leaf nodes. In addition, the IQR of GeoTree is comparable to the IQR of the baseline methods, illustrated in Figure 4 by the box sizes. Although the box plots show more upward outliers for GeoTree than the baseline trees, this occurs only at depths larger than the optimal depth. Therefore, we can conclude that GeoTree is as stable as the baseline methods.</p>
          <p>Figure 6 presents the decision boundaries by the three methods for the housing data set. While these figures show completely different patterns at first glance, the colors (predictions) in most areas are quite similar. For example, the lower left corner of the geographic map contains an area in bright green. In Figure 6b this area is odd-shaped, in Figure 6c it is rectangular shaped and a green triangle is visible in Figure 6d, according to the split types. Further, the decision boundaries of SkTree and ObTree are more similar, by dividing the space in many horizontally wide and vertically narrow rectangular regions. In contrast, GeoTree draws more ellipse patterns, with almost solely ellipses down the tree, indicated by small ellipses. Nevertheless, larger areas are bounded by orthogonal, diagonal, and ellipse splits. This indicates that introducing ellipse shaped splits is beneficial.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Soil data set</title>
          <p>The first soil mapping data set contains clay percentage at 3418 locations around Lake Tahoe in California, USA, sourced from [21]. Figure 5a exhibits the RMSE on the test set with respect to different depths, with maximum depth 6, for the three decision trees. While all models show little progress in test error, GeoTree achieves its optimal depth at n = 5, and the baselines at n = 6, based on the median RMSE. The median RMSE at these depths is 14.83, 14.5 and 15.18 for GeoTree, SkTree, and ObTree respectively. The RMSE on the test set for GeoTree is not significantly different from SkTree (p-value = 0.753) and ObTree (p-value = 0.6486). The stability of the methods is also comparable, indicated by similarly sized boxes, i.e., IQRs. Table 2 indicates, as for the housing data, that while the RMSE is similar, GeoTree needs on average less than half the number of leaf nodes than the baseline methods. This leads to the conclusion that the interpretability of our proposed method is boosted significantly in comparison to traditional decision trees.</p>
          <p>Furthermore, the decision boundaries in Figure 7 also show much more intuitive patterns in geographic space for soil mapping. In contrast to straight-line separations between areas in the decision boundaries delineated by the baseline decision trees, GeoTree is able to present hotspots of high and low clay percentages in soil defined predominantly by ellipses. Again, some similarities can be found. For example, the decision boundaries in the upper right corner of the map show a yellow hotspot of clay percentage for GeoTree and ObTree. SkTree, on the other hand, seems to have found this pattern as well, but the same area is not as explored regarding the number of splits. Considering the rugged nature of geological phenomena, we conclude that the combination of elliptic splits with orthogonal and oblique splits is better suited to capture patterns in soil contents.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Meuse data set</title>
          <p>Lastly, the proposed method is evaluated on one of the most widely used soil mapping data sets for spatial analysis, the Meuse data set [22]. This data set contains measurements of metals in soil around the Meuse river at 155 locations. As the number of observations is small, a leave-one-out cross-validation set-up is used instead of the 5-fold set-up. Similar to related work [1, 21], the target variable is the log transformed zinc concentration and X- and Y-coordinates are used as predictors. The RMSE on the test set is shown in Figure 5b for the three models with respect to depth n = 1 to 8. The median RMSE for GeoTree is lowest at depth 7 and slightly lower than the lowest median RMSE of the baselines, which is achieved at depth 8. Nevertheless, the variance of these error metrics is high, as can be seen in Figure 5b and Table 2, due to the small size of the data set. On average, GeoTree only requires around 60 leaf nodes to achieve a similar error level as the baselines, compared to 93 and 150 leaf nodes for SkTree and ObTree respectively (Table 2). Figure 8 shows the data and the decision boundaries of the methods for the area including the Meuse prediction grid. Although axis-parallel decision boundaries seem predominant even in Figure 8d, the ability to include Gaussian patterns makes the decision boundaries by GeoTree easier to interpret and more intuitive.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we address the incompatibility of tree-based methods for spatial data by proposing a novel Geospatial Regression Tree algorithm (GeoTree). While decision trees are known for dividing the input space into rectangular regions, this feature is not always advantageous for spatial prediction, as spatial variables exhibit different patterns in reality. To address this issue, previous research has suggested replacing axis-parallel splits with axis-oblique splits. However, we propose a more advanced approach that combines bivariate oblique and Gaussian splits with axis-parallel splits. The combination of split types improves the regression tree's ability to capture geospatial patterns and introduces more intuitive and accurate decision boundaries. We rely on evolutionary algorithms to generate geospatial candidate splits and choose the best split among the three split types at each internal node. Our proposed algorithm is compared with Scikit-learn's univariate regression tree and the HHCart oblique tree using both synthetic data and real-life house price and soil mapping data. The results show that GeoTree outperforms the oblique decision tree on all synthetic data sets, particularly on the diagonal data set. Using real-life spatial applications, we have demonstrated that GeoTree achieves similar error metrics with less complex and more intuitive models. Furthermore, GeoTree models are more shallow and smaller in terms of leaf nodes, which increases interpretability and allows for intuitive visualization of decision boundaries on a geographic map. We have also found that the stability of GeoTree is comparable to that of the baselines, despite its non-analytical approach. In future work, we plan to improve GeoTree by including additional features, extending evaluation, and exploring the potential of ensemble learning. Overall, we believe that the proposed GeoTree algorithm has the potential to advance the field of spatial prediction and improve the interpretability and accuracy of models.</p>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[1] A. B. Møller, A. M. Beucher, N. Pouladi, M. H. Greve, Oblique geographic coordinates as covariates for digital soil mapping, SOIL 6 (2020) 269–289. doi:10.5194/soil-6-269-2020.</p>
      <p>[2] J. Gaudart, B. Poudiougou, S. Ranque, O. Doumbo, Oblique decision trees for spatial pattern detection: optimal algorithm and application to malaria risk, BMC Med. Res. Methodol. 5 (2005) 22. doi:10.1186/1471-2288-5-22.</p>
      <p>[3] J. A. Goungounga, J. Gaudart, M. Colonna, R. Giorgi, Impact of socioeconomic inequalities on geographic disparities in cancer incidence: comparison of methods for spatial disease mapping, BMC Med. Res. Methodol. 16 (2016) 136. doi:10.1186/s12874-016-0228-x.</p>
      <p>[4] Q. Gao, V. Shi, C. Pettit, H. Han, Property valuation using machine learning algorithms on statistical areas in Greater Sydney, Australia, Land Use Policy 123 (2022) 106409. doi:10.1016/j.landusepol.2022.106409.</p>
      <p>[5] Z. Bassa, U. Bob, Z. Szantoi, R. Ismail, Land cover and land use mapping of the iSimangaliso Wetland Park, South Africa: comparison of oblique and orthogonal random forest algorithms, J. Appl. Remote Sens. 10 (2016) 015017. doi:10.1117/1.JRS.10.015017.</p>
      <p>[6] J. Gaudart, N. Graffeo, D. Coulibaly, G. Barbet, S. Rebaudet, N. Dessay, O. K. Doumbo, R. Giorgi, SPODT: An R Package to Perform Spatial Partitioning, J. Stat. Softw. 63 (2015). doi:10.18637/jss.v063.i16.</p>
      <p>[7] L. Canete-Sifuentes, R. Monroy, M. A. Medina-Perez, A Review and Experimental Comparison of Multivariate Decision Trees, IEEE Access 9 (2021) 110451–110479. doi:10.1109/ACCESS.2021.3102239.</p>
      <p>[8] D. Wickramarachchi, B. Robertson, M. Reale, C. Price, J. Brown, HHCART: An oblique decision tree, Comput. Stat. Data Anal. 96 (2016) 12–23. doi:10.1016/j.csda.2015.11.006.</p>
      <p>[9] Y. Dhebar, K. Deb, Interpretable rule discovery through bilevel optimization of split-rules of nonlinear decision trees for classification problems, IEEE Trans. Cybern. 51 (2020) 5573–5584. doi:10.1109/TCYB.2020.3033003.</p>
      <p>[10] S.-C. Ng, K.-S. Leung, Induction of quadratic decision trees using genetic algorithms and k-d trees, WSEAS Trans. Comput. 3 (2004) 839–845.</p>
      <p>[11] A. Magana-Mora, V. B. Bajic, OmniGA: Optimized Omnivariate Decision Trees for Generalizable Classification Models, Sci. Rep. 7 (2017) 3898. doi:10.1038/s41598-017-04281-9.</p>
      <p>[12] J.-Y. Chang, C.-W. Cho, S.-H. Hsieh, S.-T. Chen, Genetic algorithm based fuzzy id3 algorithm, in: ICONIP, Springer, Berlin, 2004, pp. 989–995. doi:10.1007/978-3-540-30499-9_153.</p>
      <p>[13] J. Yoo, L. Sael, Gaussian Soft Decision Trees for Interpretable Feature-Based Classification, in: PAKDD, Springer, Cham, 2021, pp. 143–155. doi:10.1007/978-3-030-75765-6_12.</p>
      <p>[14] R. Rivera-Lopez, J. Canul-Reich, E. Mezura-Montes, M. A. Cruz-Chávez, Induction of decision trees as classification models through metaheuristics, Swarm Evol. Comput. 69 (2022) 101006. doi:10.1016/j.swevo.2021.101006.</p>
      <p>[15] S. K. Murthy, S. Kasif, S. Salzberg, A System for Induction of Oblique Decision Trees, J. Artif. Intell. Res. 2 (1994) 1–32. doi:10.1613/jair.63.</p>
      <p>[16] F.-M. De Rainville, F.-A. Fortin, M.-A. Gardner, M. Parizeau, C. Gagné, DEAP, in: GECCO, ACM Press, New York, USA, 2012, p. 85. doi:10.1145/2330784.2330799.</p>
      <p>[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, others, Scikit-learn: Machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.</p>
      <p>[18] ECNU, Oblique Decision Tree in Python, 2021. URL: https://github.com/zhenlingcn/scikit-obliquetree.</p>
      <p>[19] M. H. Huque, H. D. Bondell, R. J. Carroll, L. M. Ryan, Spatial regression with covariate measurement error: A semiparametric approach, Biometrics 72 (2016) 678–686. doi:10.1111/biom.12474.</p>
      <p>[20] R. R. Bouckaert, E. Frank, Evaluating the replicability of significance tests for comparing learning algorithms, in: PAKDD, Springer, Berlin, 2004, pp. 3–12.</p>
      <p>[21] T. Hengl, M. Nussbaum, M. N. Wright, G. B. Heuvelink, B. Gräler, Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables, PeerJ 2018 (2018). doi:10.7717/peerj.5518.</p>
      <p>[22] E. J. Pebesma, R. S. Bivand, Classes and methods for spatial data in R, R News 5 (2005) 9–13. URL: https://CRAN.R-project.org/doc/Rnews/.</p>
    <sec id="sec-6">
      <title>A. Simulation Study</title>
      <sec id="sec-6-1">
        <title>A.1. Data sets</title>
        <p>We have generated five synthetic data sets to demonstrate the geospatial patterns that GeoTree can detect more accurately (see Figure 3). The data set generation process starts in all five cases by uniformly creating data points in a two-dimensional space. All data sets contain 4096 data points. The target variable generation differs for each data set. The first, orthogonal data set contains two orthogonal boundaries through the middle of the first and second axis, dividing the two-dimensional space in four equal parts (see Figure 3a). The y-values are drawn from N(10, 2.0²), N(20, 2.0²), N(30, 2.0²) and N(40, 2.0²). For the second data set, the diagonal data set, four points in the two-dimensional space are selected to generate two parallel hyperplanes (see Figure 3b). The three respective areas separated by these hyperplanes are assigned y-values from N(40, 2.0²), N(10, 2.0²) and N(20, 2.0²). The third, ellipse data set consists of an ellipse shape created by three random points in the input space (see Figure 3c). The points inside the ellipse are assigned a y-value from N(30, 2.0²), while the y-values for the points outside of the ellipse are drawn from N(10, 2.0²). In the fourth, mixed data set a hyperplane connecting two randomly sampled points divides the space in two areas (see Figure 3d). From the points above the hyperplane, three points are randomly sampled to generate an ellipse. The y-values in the three resulting areas are drawn from N(40, 2.0²), N(10, 2.0²) and N(20, 2.0²). Lastly, we calculate the y-values based on the bivariate bump function from [19], a study in which it was used previously for spatial regression. The y-value for each point i is calculated as y_i = g_i1 · g_i2, with</p>
        <p>1
1 + 
+
︁(
5.5 · − 50· ( − 0.2)2 )︁</p>
        <p>+
︁(
5 · − 25· ( − 0.8)2 )︁
for j = 1, 2 and 1 and 2 the coordinates of point i. The
resulting y-values for the bivariate bumps data set
range between 1 and 40 (see Figure 3e).</p>
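        <p>As an illustration of the process above, the sketch below generates the orthogonal data set and the bivariate bumps target. It is a minimal reconstruction under stated assumptions (a unit-square input domain, a fixed seed, and the bump formula as reconstructed here), not the authors' actual generation code.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096

# Uniformly sample n points in the two-dimensional input space
# (a unit square is assumed here).
coords = rng.uniform(0.0, 1.0, size=(n, 2))

# Orthogonal data set: two boundaries through the middle of each axis
# give four quadrants, each with y drawn from its own N(mu, 2.0^2).
quadrant = 2 * (coords[:, 0] >= 0.5) + (coords[:, 1] >= 0.5)
means = np.array([10.0, 20.0, 30.0, 40.0])
y_orth = rng.normal(means[quadrant], 2.0)

# Bivariate bumps data set: a decaying baseline plus two Gaussian
# bumps centred at 0.2 and 0.8, multiplied across both coordinates.
def bump(c):
    return (1.0 / (1.0 + c)
            + 5.5 * np.exp(-50.0 * (c - 0.2) ** 2)
            + 5.0 * np.exp(-25.0 * (c - 0.8) ** 2))

y_bumps = bump(coords[:, 0]) * bump(coords[:, 1])
```

        <p>With this reconstruction, the product stays roughly between 1 and 40 on the unit square, consistent with the range reported for Figure 3e.</p>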
      </sec>
      <sec id="sec-6-2">
        <title>A.2. Results for orthogonal, diagonal, ellipse, mixed, and bivariate bumps</title>
        <p>The results on the orthogonal data set (Figure 9) show that while GeoTree can replicate SkTree's performance, ObTree's errors are much higher. The decision boundaries of the three models on the orthogonal data set also reveal the perfect ability of GeoTree and SkTree to find the orthogonal patterns, in contrast to ObTree.</p>
        <p>The test RMSE of the three models is presented in box plots in Figure 10 for the diagonal, ellipse, mixed and bivariate bumps data sets. The maximum depth is set for all methods dependent on the data set: 8 for the diagonal data set and 9 for the ellipse and mixed data sets. GeoTree performs significantly better than the baseline trees on the diagonal data set (see Figure 10a). Not only can the GA-based tree find the diagonal patterns with three splits (i.e. depth 2), its median RMSE (2.0066) at this depth is considerably lower than the minima that SkTree (3.2372) and ObTree (3.2788) achieve at reasonable depth. What is apparent is that ObTree does not find the diagonal boundaries with a shallow tree, despite its ability to learn axis-oblique splits. While ObTree's test error continues to improve at depth 9, SkTree achieves a minimum RMSE at depth 8.</p>
        <p>For the ellipse data set, GeoTree's error starts considerably lower at depth 1 than the baseline methods' and continues this trend across depths. This is due to the first split, where GeoTree is able to draw an ellipse in the input space, replicating the ellipse pattern in the data set. While a decreasing trend in SkTree's and ObTree's errors is visible in Figure 10b, they do not reach a similar level. In addition, GeoTree's Interquartile Range (IQR) is smaller than the baseline methods' as of depth 2.</p>
        <p>Figure 10c shows a similar trend as the previous plots: GeoTree's error declines rapidly, while the decline in the error of the baseline methods is more gradual. In addition, SkTree's and ObTree's errors do not reach the same level as GeoTree's error. GeoTree reaches its lowest median test RMSE at depth 4 (2.4685), whereas SkTree reaches a test RMSE of 3.6054 at depth 8 and ObTree's error likewise remains above GeoTree's. On the bivariate bumps data set, the baselines reach their minima only at much greater depth: at depth 11, where ObTree reaches its minimum, GeoTree's median test RMSE is 1.31, and at depth 24, where SkTree reaches its minimum, GeoTree has a value of 0.75. ObTree's IQR of the test error is also considerably higher than that of the other methods.</p>
      </sec>
      <sec id="sec-6-3">
        <title>A.3. Evaluation</title>
        <p>The experiments on the synthetic data sets confirm GeoTree's ability to efficiently detect orthogonal, diagonal and ellipse patterns. Figure 9 and Figure 10 show that GeoTree finds the patterns in fewer splits and reaches a substantially lower test-set error at reasonable depth than the baseline methods SkTree and ObTree. The performance on the mixed data set also shows that GeoTree can correctly combine the different split types. At the same time, the proposed model can perfectly replicate a traditional univariate tree's performance on a data set with orthogonal patterns. Notably, GeoTree also outperforms ObTree on the diagonal data set by a substantial margin. The bivariate bumps data set confirms GeoTree's superior performance on more advanced rounded patterns compared to the baseline methods. Though GeoTree requires a deeper tree to reach its minimum test RMSE, the error is considerably lower. In addition, the proposed method has a lower median test RMSE than the baseline methods at the depth where they reach a minimum. This indicates that GeoTree retains an advantage in depth for the bivariate bumps data set, as for the other synthetic data sets. Despite the added variability in candidate split generation due to the GA, GeoTree's stability is superior to the baseline methods' in most cases, as illustrated by the IQR of the test error.</p>
        <p>(Figure 10 panels: (a) diagonal, (b) ellipse, (c) mixed.)</p>
      </sec>
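      <p>The ellipse patterns that GeoTree captures with a single split can be pictured as a Mahalanobis-distance membership test: points inside an ellipse are routed to one child, points outside to the other. The sketch below is illustrative only; the function name, the center/covariance parameterisation and the threshold are assumptions, not GeoTree's actual Gaussian-split representation.</p>

```python
import numpy as np

def gaussian_split(points, center, cov, threshold=1.0):
    """Route points by an ellipse-shaped region: returns True for points
    outside the ellipse defined by a squared Mahalanobis distance."""
    diff = points - center
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every point to the ellipse center.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return d2 > threshold

# Example: an axis-aligned ellipse around the middle of a unit square.
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(1000, 2))
outside = gaussian_split(pts, center=np.array([0.5, 0.5]),
                         cov=np.array([[0.02, 0.0], [0.0, 0.01]]))
```

      <p>A general covariance matrix would also allow rotated ellipses, which is what makes such splits more expressive than axis-parallel or oblique hyperplanes for rounded spatial regions.</p>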
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>