<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Evolutionary Geospatial Regression Tree</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Margot Geerts</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seppe vanden Broucke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jochen De Weerdt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Business Informatics and Operations Management, Ghent University</institution>
          ,
          <addr-line>Tweekerkenstraat 2, 9000 Gent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Centre for Information Systems Engineering, KU Leuven</institution>
          ,
          <addr-line>Naamsestraat 69, 3000 Leuven</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tree-based methods have become popular for spatial prediction tasks due to their high accuracy in dividing input spaces into regions with different predictions. However, traditional decision trees perform univariate splits, resulting in rectangular regions. To address this limitation and provide more intuitive and accurate decision boundaries for spatial data, we propose a novel Geospatial Regression Tree (GeoTree) with two multivariate geospatial split types, i.e. oblique and Gaussian splits. Our approach relies on evolutionary algorithms to decide on the optimal split type, chosen among axis-parallel, oblique and Gaussian splits, in each internal node. We conducted a simulation study using five synthetic datasets to demonstrate GeoTree's ability to capture orthogonal, diagonal and ellipse patterns accurately. The results confirm the proposed method's advantage in tree depth and stability. We also tested GeoTree on real-life residential property valuation and soil content data. Our experiments revealed that the geospatial split types in GeoTree improve interpretability and maintain predictive power for price prediction and soil mapping tasks based on X- and Y-coordinates. The resulting decision boundaries are more intuitive spatial patterns on a geographic map of the study area.</p>
      </abstract>
      <kwd-group>
        <kwd>Decision trees</kwd>
        <kwd>Spatial data</kwd>
        <kwd>Evolutionary algorithms</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Property valuation</kwd>
        <kwd>Soil mapping</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Spatial prediction tasks pose a significant challenge due to the unique nature of spatial data. Spatial data is characterized by spatial dependence and spatial heterogeneity, which makes it difficult to analyze. For spatial applications such as soil mapping [1], disease mapping [2, 3], property valuation [4] and land use or cover mapping [5], tree-based models have proven to be appropriate. However, traditional decision trees with univariate splits are limited in their ability to capture spatial patterns. By drawing axis-oblique decision boundaries, the accuracy of the method for spatial data can be significantly improved [6]. Furthermore, introducing multivariate geospatial splits can improve the accuracy of tree-based models for spatial prediction tasks as they allow to learn from X- and Y-coordinates simultaneously.</p>
      <p>The recent advancements in Multivariate Decision Tree (MDT) learning have primarily focused on linear combinations in splits and their optimization. On the other hand, omnivariate trees can allow test conditions to follow any non-linear function. However, most spatial data sets exhibit recognizable patterns. For example, house prices often follow street patterns, where certain (parts of) streets are more expensive than others. Similarly, circular patterns can be observed in historic city centers when zooming out to the city level. For instance, cities such as Brussels, Belgium and London, United Kingdom, are surrounded by a ring road that separates the city center from the suburbs. Moreover, administrative boundaries frequently follow non-rectangular shapes, as well as house prices, as shown in Figure 1. Despite these spatial patterns, previous research in spatial prediction has yet to incorporate them into tree-based models.</p>
      <p>Therefore, we propose a new multivariate decision tree algorithm that integrates three different split types: axis-parallel splits, oblique splits and Gaussian splits. In addition, we introduce a Genetic Algorithm (GA) to generate the oblique and Gaussian candidate splits within the internal nodes. The combination of these multivariate split types with the traditional axis-parallel split enables the proposed method to recognize spatial patterns in geographic location data more accurately and with simpler trees. As a result, the Geospatial Regression Tree (GeoTree) provides more intuitive decision boundaries and less complex trees, making it more interpretable than existing decision tree algorithms. We demonstrate that our new decision tree algorithm can capture spatial patterns more effectively than both traditional regression trees and oblique regression trees for regression tasks. Using synthetic data sets, we showcase the proposed method's efficient pattern learning ability, with limited tree depth and error. In particular, our advanced synthetic data set that reflects real spatial patterns demonstrates GeoTree's superior performance in error, tree depth and stability. Finally, we apply the proposed tree algorithm with the geospatial split types to real-life property valuation and soil mapping data sets, demonstrating similar predictive power compared to existing decision tree algorithms, while significantly enhancing interpretability.</p>
      <p>STRL'23: Second International Workshop on Spatio-Temporal Reasoning and Learning, 21 August 2023, Macao, S.A.R. * Corresponding author: margot.geerts@kuleuven.be (M. Geerts). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>As spatial prediction tasks benefit from the introduction of multivariate splits in tree-based models, the following section introduces MDTs and approaches to MDT learning. MDTs differ from traditional decision trees in the split type used. While traditional decision trees split the data based on a test condition of the form x_i ≤ c, with x_i the i-th feature, MDTs include multivariate conditions, f(x) ≤ c. In [7], multivariate decision tree algorithms are discussed. The authors identify three split types: univariate splits, multivariate splits that consist of linear combinations of features, and multivariate splits that include non-linear combinations. Most researchers have focused on axis-oblique splits, i.e. linear combinations, and have demonstrated the benefits for spatial analysis [6, 1] and classification in general [8]. However, non-linear combinations have been included using quadratic functions and power laws [9, 10]. Further, splits have been generalized even more by using neural networks, decision trees and random forests at internal nodes [11]. Nevertheless, this generalization substantially increases the cost of learning an MDT.</p>
      <p>Instead of non-linear functions with many parameters, our approach therefore relies on a small set of split shapes that match recognizable geospatial patterns.</p>
    </sec>
    <sec id="sec-3">
      <title>3. An Evolutionary Geospatial Regression Tree Algorithm</title>
      <p>Our work focuses on binary decision trees for regression with numerical features. A regression tree takes as input an observation (x_1, ..., x_d, y) where the x_i are the real-valued features, d is the number of features and y is the real-valued target variable. A trained regression tree consists of test conditions of the form x_i ≤ c with c a constant, producing axis-parallel splits. Typically, a top-down greedy approach is used for building the tree.</p>
      <p>This means that the best split, determined by a feature x_i and constant c, is chosen at each node, starting at the root node, effectively splitting each node into two child nodes until the specified depth is reached or the leaf nodes are pure. The best split produces the largest reduction in mean squared error of the predictions. In essence, a regression tree sorts the observations in the training set through the tree, according to the test conditions in the internal nodes, and predicts in each leaf node the average of all observations sorted into that node.</p>
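      <p>To make the greedy procedure concrete, the search for the best axis-parallel split can be sketched as follows. This is an illustrative sketch with hypothetical function names, not the paper's implementation:</p>

```python
import numpy as np

def best_axis_parallel_split(X, y):
    """Exhaustively search features and thresholds for the axis-parallel
    split x_i <= c that maximally reduces the sum of squared errors.
    Illustrative sketch only, not the paper's code."""
    n, d = X.shape
    parent_sse = np.sum((y - y.mean()) ** 2)
    best = (None, None, 0.0)  # (feature index i, threshold c, SSE reduction)
    for i in range(d):
        for c in np.unique(X[:, i]):
            left, right = y[X[:, i] <= c], y[X[:, i] > c]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            child_sse = (np.sum((left - left.mean()) ** 2)
                         + np.sum((right - right.mean()) ** 2))
            gain = parent_sse - child_sse
            if gain > best[2]:
                best = (i, c, gain)
    return best
```

      <p>Splitting each node this way, recursively, yields a standard univariate regression tree.</p>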
      <p>GeoTree’s key innovation lies in the inclusion of two
new bivariate decision tree splits, the oblique split and
a Gaussian split, in addition to the axis-parallel split.</p>
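      <p>As an illustration of these two geospatial split types and of scoring a candidate split by its gain in mean squared error, a minimal sketch follows. All helper names are ours, not the authors' code:</p>

```python
import numpy as np

def oblique_split(points, p1, p2):
    """Side of the hyperplane through p1 and p2: implements a test of the
    form w1*x1 + w2*x2 >= c on an (n, 2) array of points."""
    w = np.array([p2[1] - p1[1], p1[0] - p2[0]])  # normal of the line p1-p2
    c = w @ np.asarray(p1)
    return points @ w >= c

def gaussian_split(points, f1, f2, p3):
    """Ellipse test: the sum of distances to the focal points f1 and f2 is
    compared with the constant fixed by the third point p3. True means the
    point lies on or outside the ellipse."""
    d = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b), axis=-1)
    c = d(p3, f1) + d(p3, f2)
    return d(points, f1) + d(points, f2) >= c

def mse_gain(y, right):
    """Fitness of a candidate split: parent MSE minus the weighted MSEs of
    the two child nodes (right is a boolean mask)."""
    yl, yr = y[~right], y[right]
    mse = lambda v: float(np.mean((v - v.mean()) ** 2)) if len(v) else 0.0
    n = len(y)
    return mse(y) - (len(yl) / n * mse(yl) + len(yr) / n * mse(yr))
```

      <p>A perfect split of a target that takes two distinct values recovers the full parent MSE as gain.</p>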
      <p>The oblique split divides the input space in two regions based on a linear combination of two features, w1 · x1 + w2 · x2 ≥ c. On the other hand, the Gaussian split defines an ellipse in a two-dimensional space, d(x, f1) + d(x, f2) ≥ c. This is supported by the mathematical property that an ellipse consists of all points of which the sum of the distances to both focal points (f1, f2) is equal to a constant. Therefore, the Gaussian split separates input data located inside of the ellipse from input data located on or outside of the ellipse.</p>
      <p>As for oblique splits, finding the best Gaussian split is NP-hard [15]. In addition, a brute-force search that enumerates all possible hyperplanes and ellipses, based on two and three data points respectively, has an exponential cost. Two points are necessary to define a hyperplane and three to define an ellipse, where two points function as focal points and the third is used to calculate the constant c. A GA is employed instead to generate oblique and Gaussian candidate splits in each internal node. We base the fitness function on the evaluation criterion for the decision tree, that is, the gain in mean squared error. As such, the fitness function is defined by</p>
      <p>MSE − ( |ℒ|/N · MSE(ℒ) + |ℛ|/N · MSE(ℛ) ), with MSE(𝒮) = 1/|𝒮| ∑_{i∈𝒮} (y_i − ȳ_𝒮)²,</p>
      <p>where ℒ and ℛ are the sets of observations sorted to the left and right respectively by the candidate split, and N = |ℒ| + |ℛ|.</p>
      <p>Algorithm 1 presents the logic of the GA-based candidate split generation that returns the best split. This algorithm is invoked for both proposed splits separately, as the notion of an individual is specific to the type of split. As mentioned above, an oblique split can be defined by two two-dimensional points and a Gaussian split by three two-dimensional points. Consequently, an individual for an oblique candidate split is a list of four floats, or two points p1 and p2, as can be seen in Figure 2a, to make the hyperplane. An individual for a Gaussian candidate split is defined as a list of six floats, or three points p1, p2 and p3 (see Figure 2b), to generate an ellipse. The population is initialized by randomly drawing n times from the uniform distribution defined by the lower boundaries (lb) and upper boundaries (ub). The upper and lower boundaries are defined by the minimum and maximum values of the two data dimensions. To reflect individuals, these values are repeated to result in a list of four values in case of oblique splits, while Gaussian boundaries require a list of six values. After initialization, the evolution process is repeated for the number of generations (g). First, a ‘Hall of Fame’ variable is updated to retain the individuals that have the highest fitness values from the current population. This elitist approach requires a parameter h to indicate the number of individuals to retain. Then, the current population is modified using selection, crossover and mutation operators. A tournament selection is performed by selecting the fittest individual in t tournaments of size s. Crossover is performed subsequently on two individuals with a probability cxpb. This step is implemented with a blend crossover and the required parameter α which determines the interval to draw new values from based on the parents. Next, the new offspring are mutated with a probability mutpb. Specifically, polynomial bounded mutation is used to ensure individuals range between the boundaries (lb, ub). This operator requires two additional parameters: the probability of each value of the individual to be mutated (indpb) and the crowding degree (η). Lastly, the fittest individuals are chosen among the offspring and the fittest of the previous population (‘Hall of Fame’). After g generations, the fittest individual from the last population is returned as the candidate split. The implementation is based on the Python library DEAP [16].</p>
      <preformat>Algorithm 1: Genetic algorithm for candidate split generation.
Require: n, lb, ub, g, h, t, s, cxpb, α, mutpb, indpb, η
Population ← Repeat(Random(lb, ub), n)
for i = 0 to g do
    HallofFame ← Fittest(Population, h)
    Offspring ← TournamentSelection(Population, t, s)
    Offspring ← Mate(Offspring, cxpb, α)
    Offspring ← Mutate(Offspring, mutpb, indpb, η)
    Population ← SelectBest(Offspring + HallofFame, n)
end for
BestIndividual ← SelectBest(Population, 1)
return BestIndividual</preformat>
      <p>(Figure 2a: oblique split individual consisting of four chromosomes that form two points p1 and p2.)</p>
      <p>The decision tree algorithm performs a greedy search in each internal node by finding the best orthogonal, oblique and Gaussian candidate split and finally choosing the best split among the three candidate splits. Orthogonal candidate splits are efficiently searched by sorting all unique values for each variable in the data set and evaluating corresponding candidate splits. Oblique and Gaussian splits are generated by Algorithm 1 as explained before. The complete GeoTree algorithm is available at https://github.com/margotgeerts/GeoTree.</p>
    </sec>
    <sec id="sec-4a">
      <title>4. Experimental Evaluation</title>
      <p>The performance of GeoTree is demonstrated by first comparing the regression outcomes for five synthetic data sets. Second, we show the benefits of our method in two spatial applications, a real-life housing data set and soil mapping data.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>In order to thoroughly test the benefits of the proposed method with two GA-generated split types, axis-oblique and Gaussian, in combination with axis-orthogonal splits, we consider other orthogonal and oblique regression tree algorithms as baselines. The proposed Geospatial Regression Tree (GeoTree) is compared with a traditional Univariate Regression Tree (SkTree) and an Oblique Regression Tree (ObTree). The baseline methods are implemented using Scikit-learn's decision tree regressor [17] and a HHCart regression tree implementation [18].</p>
        <p>To evaluate the models, we employ a repeated 5-fold set-up, where each fold is evaluated using five repeated experiments, resulting in 25 observations per model. We report the performance metrics on the test set among the five folds using the average of the five iterations. For all models, we use the MSE as the impurity measure. We set the hyperparameters of SkTree and ObTree to their default values. In GeoTree, we use a GA to generate oblique and Gaussian candidate splits. The parameter settings of the GA are shown in Table 1, which result from a grid search on a separate real-life housing data set with spatial features. The boundaries (lb, ub) are defined by the minimum and maximum values of the data.</p>
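        <p>The repeated 5-fold protocol can be reproduced with Scikit-learn's RepeatedKFold, as sketched below on synthetic stand-in data (our own illustration, not the paper's experiment code):</p>

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for X-/Y-coordinates and a target with a spatial step.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 10.0 * (X[:, 0] > 0.5) + rng.normal(scale=2.0, size=200)

# 5 folds x 5 repeats yields the 25 test-set observations per model.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
rmses = []
for train_idx, test_idx in cv.split(X):
    tree = DecisionTreeRegressor(max_depth=4).fit(X[train_idx], y[train_idx])
    rmse = mean_squared_error(y[test_idx], tree.predict(X[test_idx])) ** 0.5
    rmses.append(rmse)
median_rmse = float(np.median(rmses))
```

        <p>Each of the 25 fold-level RMSE values then feeds the box plots and significance tests reported later.</p>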
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Simulation Study</title>
        <p>To assess the performance of GeoTree and the baselines in detecting patterns in data, we generated five synthetic data sets with distinct patterns (see Figure 3). The target variable in the different areas defined in the data sets is normally distributed with a varying mean. The results for the orthogonal data set show that while GeoTree and SkTree can identify the true splits almost perfectly, ObTree struggles to do so. In contrast, GeoTree correctly chooses orthogonal splits among the three split types and replicates the performance of SkTree. Similarly, GeoTree is effective in detecting the diagonal, elliptic, and mixed patterns in the diagonal, ellipse and mixed data sets. In addition, the experiments show that GeoTree can identify diagonal patterns more efficiently than ObTree. Finally, the bivariate bumps data set, a synthetic spatial regression data set adapted from [19], further confirms the superior performance of GeoTree in comparison to the baselines. In summary, the results show that GeoTree achieves lower error and requires smaller trees, thereby enhancing interpretability of the final model in comparison to traditional decision trees. The data sets and results are discussed in more detail in Appendix A.</p>
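        <p>For instance, the ellipse data set can be generated along the lines described in Appendix A.1; the following is a sketch with our own variable names, not the authors' generator:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(size=(4096, 2))  # uniform points, as in all five data sets

# Three random points define the ellipse: two act as focal points, the third
# fixes the constant c (the sum of distances to the foci).
f1, f2, p3 = rng.uniform(size=(3, 2))
c = np.linalg.norm(p3 - f1) + np.linalg.norm(p3 - f2)
inside = (np.linalg.norm(pts - f1, axis=1)
          + np.linalg.norm(pts - f2, axis=1)) < c

# Target: N(30, 2.0^2) inside the ellipse, N(10, 2.0^2) outside.
y = np.where(inside,
             rng.normal(30.0, 2.0, len(pts)),
             rng.normal(10.0, 2.0, len(pts)))
```
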
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experiments on Housing and Soil</title>
        <p>Two spatial applications, house price prediction and soil mapping, show the performance of the proposed method in a real-life setting. The results are discussed with regard to model accuracy, tree depth and size, stability defined by the Interquartile Range (IQR), and the decision boundaries visualized in geographic space.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Housing data set</title>
          <p>A real-life housing data set of the province of Namur in Belgium is used to test the performance of the proposed model in comparison with the baseline methods. The data set contains 17 967 houses with the indexed transaction price and the X- and Y-coordinates. The house prices range between 100 000 and 600 000 euros approximately. For the experiments, the prices are log transformed and the X- and Y-coordinates are used as predictor variables. The test RMSE is presented in a box plot for all models in Figure 4. GeoTree reaches its lowest RMSE at depth 4 with a median RMSE of 0.3293. SkTree reaches a minimum RMSE of 0.3304 at depth 6, and ObTree reaches a value of 0.3285 at depth 8. The difference between the test errors of GeoTree and ObTree is not statistically significant (p-value = 0.1584), based on the corrected repeated k-fold cross-validation test [20]. These observations add to the evidence of the simulation study that the proposed GeoTree has an advantage in depth compared to the baseline methods. A smaller depth in turn increases interpretability. Moreover, these experiments show that the baseline methods have a similar test error. Table 2 also reveals that while the RMSE is similar among the three models, GeoTree is able to achieve this error level using significantly smaller trees, indicated by the lower number of leaf nodes. In addition, the IQR of GeoTree is comparable to the IQR of the baseline methods, illustrated in Figure 4 by the box sizes. Although the box plots show more upward outliers for GeoTree than the baseline trees, this occurs only at depths larger than the optimal depth. Therefore, we can conclude that GeoTree is as stable as the baseline methods.</p>
          <p>Figure 6 presents the decision boundaries by the three methods for the housing data set. While these figures show completely different patterns at first glance, the colors (predictions) in most areas are quite similar. For example, the lower left corner of the geographic map contains an area in bright green. In Figure 6b this area is odd-shaped, in Figure 6c it is rectangular shaped and a green triangle is visible in Figure 6d, according to the split types. Further, the decision boundaries of SkTree and ObTree are more similar, by dividing the space in many horizontally wide and vertically narrow rectangular regions. In contrast, GeoTree draws more ellipse patterns, with almost solely ellipses down the tree, indicated by small ellipses. Nevertheless, larger areas are bounded by orthogonal, diagonal, and ellipse splits. This indicates that introducing ellipse shaped splits is beneficial.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Soil data set</title>
          <p>The first soil mapping data set contains clay percentage at 3418 locations around Lake Tahoe in California, USA, sourced from [21]. Figure 5a exhibits the RMSE on the test set with respect to different depths, with maximum depth 6, for the three decision trees. While all models show little progress in test error, GeoTree achieves its optimal depth at n = 5, and the baselines at n = 6, based on the median RMSE. The median RMSE at these depths is 14.83, 14.5 and 15.18 for GeoTree, SkTree, and ObTree respectively. The RMSE on the test set for GeoTree is not significantly different from SkTree (p-value = 0.753) and ObTree (p-value = 0.6486). The stability of the methods is also comparable, indicated by similarly sized boxes, i.e., IQRs. Table 2 indicates, as for the housing data, that while the RMSE is similar, GeoTree needs on average less than half the number of leaf nodes than the baseline methods. This leads to the conclusion that the interpretability of our proposed method is boosted significantly in comparison to traditional decision trees.</p>
          <p>Furthermore, the decision boundaries in Figure 7 also show much more intuitive patterns in geographic space for soil mapping. In contrast to straight-line separations between areas in the decision boundaries delineated by the baseline decision trees, GeoTree is able to present hotspots of high and low clay percentages in soil defined predominantly by ellipses. Again, some similarities can be found. For example, the decision boundaries in the upper right corner of the map show a yellow hotspot of clay percentage for GeoTree and ObTree. SkTree, on the other hand, seems to have found this pattern as well, but the same area is not as explored regarding the number of splits. Considering the rugged nature of geological phenomena, we conclude that the combination of elliptic splits with orthogonal and oblique splits is better suited to capture patterns in soil contents.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Meuse data set</title>
          <p>Lastly, the proposed method is evaluated on one of the most widely used soil mapping data sets for spatial analysis, the Meuse data set [22]. This data set contains measurements of metals in soil around the Meuse river at 155 locations. As the number of observations is small, a leave-one-out cross-validation set-up is used instead of the 5-fold set-up. Similar to related work [1, 21], the target variable is the log transformed zinc concentration and X- and Y-coordinates are used as predictors. The RMSE on the test set is shown in Figure 5b for the three models with respect to depth n = 1 to 8. The median RMSE for GeoTree is lowest at depth 7 and slightly lower than the lowest median RMSE of the baselines, which is achieved at depth 8. Nevertheless, the variance of these error metrics is high, as can be seen in Figure 5b and Table 2, due to the small size of the data set. On average, GeoTree only requires around 60 leaf nodes to achieve a similar error level as the baselines, compared to 93 and 150 leaf nodes for SkTree and ObTree respectively (Table 2). Figure 8 shows the data and the decision boundaries of the methods for the area including the Meuse prediction grid. Although axis-parallel decision boundaries seem predominant even in Figure 8d, the ability to include Gaussian patterns makes the decision boundaries by GeoTree easier to interpret and more intuitive.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we address the incompatibility of tree-based methods for spatial data by proposing a novel Geospatial Regression Tree algorithm (GeoTree). While decision trees are known for dividing the input space into rectangular regions, this feature is not always advantageous for spatial prediction, as spatial variables exhibit different patterns in reality. To address this issue, previous research has suggested replacing axis-parallel splits with axis-oblique splits. However, we propose a more advanced approach that combines bivariate oblique and Gaussian splits with axis-parallel splits. The combination of split types improves the regression tree's ability to capture geospatial patterns and introduces more intuitive and accurate decision boundaries. We rely on evolutionary algorithms to generate geospatial candidate splits and choose the best split among the three split types at each internal node. Our proposed algorithm is compared with Scikit-learn's univariate regression tree and the HHCart oblique tree using both synthetic data and real-life house price and soil mapping data. The results show that GeoTree outperforms the oblique decision tree on all synthetic data sets, particularly on the diagonal data set. Using real-life spatial applications, we have demonstrated that GeoTree achieves similar error metrics with less complex and more intuitive models. Furthermore, GeoTree models are more shallow and smaller in terms of leaf nodes, which increases interpretability and allows for intuitive visualization of decision boundaries on a geographic map. We have also found that the stability of GeoTree is comparable to that of the baselines, despite its non-analytical approach. In future work, we plan to improve GeoTree by including additional features, extending evaluation, and exploring the potential of ensemble learning. Overall, we believe that the proposed GeoTree algorithm has the potential to advance the field of spatial prediction and improve the interpretability and accuracy of models.</p>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[1] A. B. Møller, A. M. Beucher, N. Pouladi, M. H. Greve, Oblique geographic coordinates as covariates for digital soil mapping, SOIL 6 (2020) 269–289. doi:10.5194/soil-6-269-2020.</p>
      <p>[2] J. Gaudart, B. Poudiougou, S. Ranque, O. Doumbo, Oblique decision trees for spatial pattern detection: optimal algorithm and application to malaria risk, BMC Med. Res. Methodol. 5 (2005) 22. doi:10.1186/1471-2288-5-22.</p>
      <p>[3] J. A. Goungounga, J. Gaudart, M. Colonna, R. Giorgi, Impact of socioeconomic inequalities on geographic disparities in cancer incidence: comparison of methods for spatial disease mapping, BMC Med. Res. Methodol. 16 (2016) 136. doi:10.1186/s12874-016-0228-x.</p>
      <p>[4] Q. Gao, V. Shi, C. Pettit, H. Han, Property valuation using machine learning algorithms on statistical areas in Greater Sydney, Australia, Land Use Policy 123 (2022) 106409. doi:10.1016/j.landusepol.2022.106409.</p>
      <p>[5] Z. Bassa, U. Bob, Z. Szantoi, R. Ismail, Land cover and land use mapping of the iSimangaliso Wetland Park, South Africa: comparison of oblique and orthogonal random forest algorithms, J. Appl. Remote Sens. 10 (2016) 015017. doi:10.1117/1.JRS.10.015017.</p>
      <p>[6] J. Gaudart, N. Graffeo, D. Coulibaly, G. Barbet, S. Rebaudet, N. Dessay, O. K. Doumbo, R. Giorgi, SPODT: An R Package to Perform Spatial Partitioning, J. Stat. Softw. 63 (2015). doi:10.18637/jss.v063.i16.</p>
      <p>[7] L. Canete-Sifuentes, R. Monroy, M. A. Medina-Perez, A Review and Experimental Comparison of Multivariate Decision Trees, IEEE Access 9 (2021) 110451–110479. doi:10.1109/ACCESS.2021.3102239.</p>
      <p>[8] D. Wickramarachchi, B. Robertson, M. Reale, C. Price, J. Brown, HHCART: An oblique decision tree, Comput. Stat. Data Anal. 96 (2016) 12–23. doi:10.1016/j.csda.2015.11.006.</p>
      <p>[9] Y. Dhebar, K. Deb, Interpretable rule discovery through bilevel optimization of split-rules of nonlinear decision trees for classification problems, IEEE Trans. Cybern. 51 (2020) 5573–5584. doi:10.1109/TCYB.2020.3033003.</p>
      <p>[10] S.-C. Ng, K.-S. Leung, Induction of quadratic decision trees using genetic algorithms and k-d trees, WSEAS Trans. Comput. 3 (2004) 839–845.</p>
      <p>[11] A. Magana-Mora, V. B. Bajic, OmniGA: Optimized Omnivariate Decision Trees for Generalizable Classification Models, Sci. Rep. 7 (2017) 3898. doi:10.1038/s41598-017-04281-9.</p>
      <p>[12] J.-Y. Chang, C.-W. Cho, S.-H. Hsieh, S.-T. Chen, Genetic algorithm based fuzzy id3 algorithm, in: ICONIP, Springer, Berlin, 2004, pp. 989–995. doi:10.1007/978-3-540-30499-9_153.</p>
      <p>[13] J. Yoo, L. Sael, Gaussian Soft Decision Trees for Interpretable Feature-Based Classification, in: PAKDD, Springer, Cham, 2021, pp. 143–155. doi:10.1007/978-3-030-75765-6_12.</p>
      <p>[14] R. Rivera-Lopez, J. Canul-Reich, E. Mezura-Montes, M. A. Cruz-Chávez, Induction of decision trees as classification models through metaheuristics, Swarm Evol. Comput. 69 (2022) 101006. doi:10.1016/j.swevo.2021.101006.</p>
      <p>[15] S. K. Murthy, S. Kasif, S. Salzberg, A System for Induction of Oblique Decision Trees, J. Artif. Intell. Res. 2 (1994) 1–32. doi:10.1613/jair.63.</p>
      <p>[16] F.-M. De Rainville, F.-A. Fortin, M.-A. Gardner, M. Parizeau, C. Gagné, DEAP, in: GECCO, ACM Press, New York, USA, 2012, p. 85. doi:10.1145/2330784.2330799.</p>
      <p>[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, others, Scikit-learn: Machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.</p>
      <p>[18] ECNU, Oblique Decision Tree in Python, 2021. URL: https://github.com/zhenlingcn/scikit-obliquetree.</p>
      <p>[19] M. H. Huque, H. D. Bondell, R. J. Carroll, L. M. Ryan, Spatial regression with covariate measurement error: A semiparametric approach, Biometrics 72 (2016) 678–686. doi:10.1111/biom.12474.</p>
      <p>[20] R. R. Bouckaert, E. Frank, Evaluating the replicability of significance tests for comparing learning algorithms, in: PAKDD, Springer, Berlin, 2004, pp. 3–12.</p>
      <p>[21] T. Hengl, M. Nussbaum, M. N. Wright, G. B. Heuvelink, B. Gräler, Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables, PeerJ 2018 (2018). doi:10.7717/peerj.5518.</p>
      <p>[22] E. J. Pebesma, R. S. Bivand, Classes and methods for spatial data in R, R News 5 (2005) 9–13. URL: https://CRAN.R-project.org/doc/Rnews/.</p>
    <sec id="sec-6">
      <title>A. Simulation Study</title>
      <sec id="sec-6-1">
        <title>A.1. Data sets</title>
        <p>We have generated five synthetic data sets to demonstrate the geospatial patterns that GeoTree can detect more accurately (see Figure 3). The data set generation process starts in all five cases by uniformly creating data points in a two-dimensional space. All data sets contain 4096 data points. The target variable generation differs for each data set. The first, orthogonal data set contains two orthogonal boundaries through the middle of the first and second axis, dividing the two-dimensional space in four equal parts (see Figure 3a). The y-values are drawn from N(10, 2.0²), N(20, 2.0²), N(30, 2.0²) and N(40, 2.0²). For the second data set, the diagonal data set, four points in the two-dimensional space are selected to generate two parallel hyperplanes (see Figure 3b). The three respective areas separated by these hyperplanes are assigned y-values from N(40, 2.0²), N(10, 2.0²) and N(20, 2.0²). The third, ellipse data set consists of an ellipse shape created by three random points in the input space (see Figure 3c). The points inside the ellipse are assigned a y-value from N(30, 2.0²), while the y-values for the points outside of the ellipse are drawn from N(10, 2.0²). In the fourth, mixed data set a hyperplane connecting two randomly sampled points divides the space in two areas (see Figure 3d). From the points above the hyperplane, three points are randomly sampled to generate an ellipse. The y-values in the three resulting areas are drawn from N(40, 2.0²), N(10, 2.0²) and N(20, 2.0²). Lastly, we calculate the y-values based on the bivariate bump function from [19], a study in which it was used previously for spatial regression. The y-value for each point i is calculated as y_i = g_i1 · g_i2, with</p>
        <p>1
1 + 
+
︁(
5.5 · − 50· ( − 0.2)2 )︁</p>
        <p>+
︁(
5 · − 25· ( − 0.8)2 )︁
for j = 1, 2 and 1 and 2 the coordinates of point i. The
resulting y-values for the bivariate bumps data set
range between 1 and 40 (see Figure 3e).</p>
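        <p>As an illustration of the process above, the sketch below generates the orthogonal data set and the bivariate bumps target. It is a minimal reconstruction under stated assumptions (a unit-square input domain, a fixed seed, and the bump formula as reconstructed here), not the authors' actual generation code.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096

# Uniformly sample n points in the two-dimensional input space
# (a unit square is assumed here).
coords = rng.uniform(0.0, 1.0, size=(n, 2))

# Orthogonal data set: two boundaries through the middle of each axis
# give four quadrants, each with y drawn from its own N(mu, 2.0^2).
quadrant = 2 * (coords[:, 0] >= 0.5) + (coords[:, 1] >= 0.5)
means = np.array([10.0, 20.0, 30.0, 40.0])
y_orth = rng.normal(means[quadrant], 2.0)

# Bivariate bumps data set: a decaying baseline plus two Gaussian
# bumps centred at 0.2 and 0.8, multiplied across both coordinates.
def bump(c):
    return (1.0 / (1.0 + c)
            + 5.5 * np.exp(-50.0 * (c - 0.2) ** 2)
            + 5.0 * np.exp(-25.0 * (c - 0.8) ** 2))

y_bumps = bump(coords[:, 0]) * bump(coords[:, 1])
```

        <p>With this reconstruction, the product stays roughly between 1 and 40 on the unit square, consistent with the range reported for Figure 3e.</p>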
      </sec>
      <sec id="sec-6-2">
        <title>A.2. Results for orthogonal, diagonal, ellipse, mixed, and bivariate bumps</title>
        <p>The results on the orthogonal data set (Figure 9) show that while GeoTree can replicate SkTree's performance, ObTree's errors are much higher. The decision boundaries of the three models on the orthogonal data set also reveal the perfect ability of GeoTree and SkTree to find the orthogonal patterns, in contrast to ObTree.</p>
        <p>The test RMSE of the three models is presented in box plots in Figure 10 for the diagonal, ellipse, mixed and bivariate bumps data sets. The maximum depth is set for all methods dependent on the data set: 8 for the diagonal data set and 9 for the ellipse and mixed data sets. GeoTree performs significantly better than the baseline trees on the diagonal data set (see Figure 10a). Not only can the GA-based tree find the diagonal patterns with three splits (i.e. depth 2), its median RMSE (2.0066) at this depth is considerably lower than the minima that SkTree (3.2372) and ObTree (3.2788) achieve at reasonable depth. What is apparent is that ObTree does not find the diagonal boundaries with a shallow tree, despite its ability to learn axis-oblique splits. While ObTree's test error continues to improve at depth 9, SkTree achieves a minimum RMSE at depth 8.</p>
        <p>For the ellipse data set, GeoTree's error starts considerably lower at depth 1 than the baseline methods' and continues this trend across depths. This is due to the first split, where GeoTree is able to draw an ellipse in the input space, replicating the ellipse pattern in the data set. While a decreasing trend in SkTree's and ObTree's errors is visible in Figure 10b, they do not reach a similar level. In addition, GeoTree's Interquartile Range (IQR) is smaller than the baseline methods' as of depth 2.</p>
        <p>Figure 10c shows a similar trend as the previous plots: GeoTree's error declines rapidly, while the decline in the error of the baseline methods is more gradual. In addition, SkTree's and ObTree's errors do not reach the same level as GeoTree's error. GeoTree reaches its lowest median test RMSE at depth 4 (2.4685), whereas SkTree reaches a test RMSE of 3.6054 at depth 8 and ObTree's error likewise remains above GeoTree's. On the bivariate bumps data set, the baselines reach their minima only at much greater depth: at depth 11, where ObTree reaches its minimum, GeoTree's median test RMSE is 1.31, and at depth 24, where SkTree reaches its minimum, GeoTree has a value of 0.75. ObTree's IQR of the test error is also considerably higher than that of the other methods.</p>
      </sec>
      <sec id="sec-6-3">
        <title>A.3. Evaluation</title>
        <p>The experiments on the synthetic data sets confirm GeoTree's ability to efficiently detect orthogonal, diagonal and ellipse patterns. Figure 9 and Figure 10 show that GeoTree finds the patterns in fewer splits and reaches a substantially lower test-set error at reasonable depth than the baseline methods SkTree and ObTree. The performance on the mixed data set also shows that GeoTree can correctly combine the different split types. At the same time, the proposed model can perfectly replicate a traditional univariate tree's performance on a data set with orthogonal patterns. Notably, GeoTree also outperforms ObTree on the diagonal data set by a substantial margin. The bivariate bumps data set confirms GeoTree's superior performance on more advanced rounded patterns compared to the baseline methods. Though GeoTree requires a deeper tree to reach its minimum test RMSE, the error is considerably lower. In addition, the proposed method has a lower median test RMSE than the baseline methods at the depth where they reach a minimum. This indicates that GeoTree retains an advantage in depth for the bivariate bumps data set, as for the other synthetic data sets. Despite the added variability in candidate split generation due to the GA, GeoTree's stability is superior to the baseline methods' in most cases, as illustrated by the IQR of the test error.</p>
        <p>(Figure 10 panels: (a) diagonal, (b) ellipse, (c) mixed.)</p>
      </sec>
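      <p>The ellipse patterns that GeoTree captures with a single split can be pictured as a Mahalanobis-distance membership test: points inside an ellipse are routed to one child, points outside to the other. The sketch below is illustrative only; the function name, the center/covariance parameterisation and the threshold are assumptions, not GeoTree's actual Gaussian-split representation.</p>

```python
import numpy as np

def gaussian_split(points, center, cov, threshold=1.0):
    """Route points by an ellipse-shaped region: returns True for points
    outside the ellipse defined by a squared Mahalanobis distance."""
    diff = points - center
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every point to the ellipse center.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return d2 > threshold

# Example: an axis-aligned ellipse around the middle of a unit square.
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(1000, 2))
outside = gaussian_split(pts, center=np.array([0.5, 0.5]),
                         cov=np.array([[0.02, 0.0], [0.0, 0.01]]))
```

      <p>A general covariance matrix would also allow rotated ellipses, which is what makes such splits more expressive than axis-parallel or oblique hyperplanes for rounded spatial regions.</p>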
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>