Exploring the Hyperparameters of XGBoost Through
3D Visualizations
Ole-Edvard Ørebæk, Marius Geitle
Østfold University College, Halden, Norway


                                      Abstract
                                      Optimizing hyperparameters is one of the most important and time-consuming activities when
                                      training machine learning models. However, the lack of guidance available to optimization algorithms
                                      means that finding values for these hyperparameters is typically left to black-box methods. Black-box methods can be
                                      made more efficient by incorporating an understanding of where good hyperparameter values might
                                      be located for a specific model. In this paper, we visualize hyperparameter performance-landscapes in
                                      several datasets to discover how the XGBoost algorithm behaves for many combinations of hyperpa-
                                      rameter values across these datasets. Using this knowledge, it might be possible to design more efficient
                                      search strategies for optimizing the hyperparameters of XGBoost.

                                      Keywords
                                      Hyperparameters, hyperparameter optimization, visualizations, performance-landscapes




1. Introduction
Hyperparameter optimization is the task of optimizing machine learning algorithms’ perfor-
mance by tuning the input parameters that influence their training procedure and model ar-
chitecture, referred to as hyperparameters. While essential to most machine learning problems,
hyperparameter optimization is a highly non-trivial task as a given algorithm can have many
hyperparameters of different datatypes, and effective values for these differ from one dataset to
another [1, 2, 3]. This makes it difficult to determine effective values for specific problems and
even more so universally, which in practice results in the use of black-box search algorithms
[2, 1]. Black-box algorithms are designed to find solutions without exploitable knowledge
regarding the problem and are often based on principles similar to brute-forcing or random
guessing. While black-box algorithms for hyperparameter optimization are often quite sophis-
ticated and can be empirically proven to return effective values, they cannot provide much, if
any, insight into what makes these effective compared to others. Obtaining insight into how
different hyperparameter values, individually and in combination, impact performance under
different circumstances would, however, be extremely useful. With such insights, hyperparam-
eter optimization methods can be designed to exploit preexisting knowledge in combination
with black-box methods, which is likely more efficient than pure black-box algorithms.
In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI
2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021) - Stanford
University, Palo Alto, California, USA, March 22-24, 2021.
" oleedvao@hiof.no (O. Ørebæk); mariusge@hiof.no (M. Geitle)
 0000-0001-8468-7409 (O. Ørebæk); 0000-0001-9528-8913 (M. Geitle)
                                    © 2021 Copyright for this paper by its authors.
                                    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
   A useful method of obtaining insight into complex problems is to visualize them, as this
allows them to be comprehensively presented. In this paper, we gain insight into the behavior
of the hyperparameters of the XGBoost algorithm by visualizing and comparing landscapes of
prediction performance generated based on hyperparameter combinations.
   The remainder of the paper is structured as follows: In Section 2, we present related work
relevant to the visualization of performance-landscapes. The theory behind the XGBoost algo-
rithm is outlined in Section 3. In Section 4, we document the methodology used for generating
and visualizing samples of hyperparameter-based performance-landscapes. In Section 5, we
present the findings of comparing the generated landscape-samples, and we discuss these find-
ings in Section 6. Finally, in Section 7, we conclude the paper and discuss future work.


2. Related Work
Visualizing problem subjects can be an effective method of intuitively obtaining many types of
insight, as demonstrated by many articles within machine learning research. For instance,
Li et al. [4] studied the loss landscapes of artificial neural networks through a proposed vi-
sualization method based on the principle of random directions and filter-wise normalization.
With this method, they provided valuable insight into the nature of artificial neural networks;
specifically, how skip connections affect the sharpness and flatness of loss landscapes and
why they are necessary when training very deep networks. Smilkov et al. [5] presented
TensorFlow Playground, a tool that provides users with an intuitive understanding of
neural networks by allowing direct manipulation of the networks through visual representations.
Other papers [6, 7], though not directly focused on visualizations, use them to demonstrate
concepts to the reader.
   Despite their usefulness, research exploring visualizations directly related to hyperparameters
is quite limited. Perhaps the most relevant papers are those that present tools using
visualizations to aid hyperparameter tuning and analysis processes [8, 9]. However,
these papers primarily focus on designing the tools rather than using them in practice to
obtain insight into hyperparameters.
   There are, however, several interesting papers that have investigated performance-landscapes,
though not necessarily in the context of hyperparameters. Performance-landscapes are rele-
vant to our paper because they can be visually analyzed to obtain insights into hyperparame-
ters’ effects. Performance landscapes have previously been primarily investigated in the con-
text of neural network loss. The most prominent example of this is the earlier mentioned paper
by Li et al. [4], which yielded valuable insight in this context. This study also inspired further
studies proposing similar methods [10, 7]. Of these, Fort and Jastrzebski [7], among other
things, demonstrated similarities in the effects of different neural network hyperparameters.


3. XGBoost
XGBoost, developed by Chen & Guestrin [11], is a gradient boosting decision tree algorithm
designed for both regression and classification problems. Being state-of-the-art, the algorithm
is also regularly featured in winning solutions of, e.g., Kaggle1 competitions. XGBoost is trained
through minimizing a regularized objective function (Eq. 1) by iteratively adding base learners
𝑓𝑡 , in the form of decision trees, to an ensemble [12, 11].
                                 \mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)               (1)

Here $\hat{y}_i$ and $y_i$ denote the prediction and the target, and $l$ is a loss function that measures
the difference between them. The $f_t$ that best minimizes the loss between $y_i$ and the previous
iteration's prediction $\hat{y}_i^{(t-1)}$ is greedily added. Additionally, the complexity of the added $f_t$ is
penalized to avoid overfitting, as denoted by the regularization term $\Omega$.
   Much of XGBoost’s regularization and model architecture is defined through hyperparame-
ters. Some of the most impactful are the learning rate, the number of base learners in the en-
semble, and the base learners’ maximum depth. In this paper we refer to these as learning_rate,
n_estimators, and max_depth. The learning_rate originates from the principle of shrinkage [12]
and is a value that scales base learner weights to reduce their individual influence on the en-
semble predictions. n_estimators and max_depth are quite natural regularization parameters,
as they directly influence the architecture of XGBoost’s produced models, and therefore signif-
icantly impact performance. Other hyperparameters include gamma, L1 and L2 regularization,
and various parameters for subsampling and column sampling.
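   For reference, the hyperparameters named above map directly onto XGBoost's scikit-learn wrapper. The following minimal snippet illustrates this mapping with placeholder values (not values used or recommended in this paper):

    # Illustrative values only; each argument corresponds to a hyperparameter named above.
    from xgboost import XGBClassifier

    model = XGBClassifier(
        learning_rate=0.3,     # shrinkage applied to each base learner's contribution
        n_estimators=100,      # number of base learners (boosting rounds) in the ensemble
        max_depth=6,           # maximum depth of each decision tree
        gamma=0.0,             # minimum loss reduction required to make a split
        reg_alpha=0.0,         # L1 regularization on leaf weights
        reg_lambda=1.0,        # L2 regularization on leaf weights
        subsample=1.0,         # row subsampling per boosting round
        colsample_bytree=1.0,  # column subsampling per tree
    )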


4. Visualizing Hyperparameter-based Performance-landscapes
The goal of this paper was to investigate how the hyperparameters of XGBoost affect its
prediction performance on different datasets and to investigate potential similarities between
these effects across datasets. Our motivation was to gain general insight into XGBoost's
hyperparameters, rather than insight limited to individual datasets. To accomplish this, we
compared 3D visualizations of performance-landscapes for each selected dataset, where each
landscape was generated based on a combination of two hyperparameters.

4.1. Generating the Visualization Data
To visualize the hyperparameter performance-landscapes, we selected three of XGBoost’s hy-
perparameters and set their respective value-ranges, as tabulated in Table 1. Generated performance-
landscapes were based on the combination of two of these hyperparameters at a time. This
resulted in three different performance-landscapes for a given dataset, based on the combina-
tion of learning_rate and n_estimators, learning_rate and max_depth, and n_estimators and
max_depth. For the classification datasets, the performance metric used was accuracy, while
for the regression datasets, it was Mean Absolute Error (MAE).
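   As an illustration of this procedure, a single landscape-sample for the learning_rate and n_estimators combination could be generated roughly as in the following simplified sketch (an illustration only, not the exact implementation used in our experiments):

    # Simplified sketch: evaluate a grid of two hyperparameters with cross-validated accuracy,
    # leaving all other hyperparameters at their defaults. For the regression datasets,
    # scoring="neg_mean_absolute_error" would be used instead of accuracy.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    def make_landscape(X, y, lr_values, n_est_values, cv=2):
        scores = np.zeros((len(lr_values), len(n_est_values)))
        for i, lr in enumerate(lr_values):
            for j, n_est in enumerate(n_est_values):
                model = XGBClassifier(learning_rate=lr, n_estimators=int(n_est))
                scores[i, j] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        return scores

    # For example, a 20 x 20 grid over the value ranges in Table 1:
    lr_values = np.linspace(0.1, 2.0, 20)
    n_est_values = np.linspace(1, 500, 20)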

4.1.1. Adaptive Zoom
To reduce the amount of data needed to provide highly detailed visualizations, we developed a
novel algorithm, named Adaptive Zoom, for adaptively generating more data points in regions
   1
       https://www.kaggle.com/
Table 1
Selected hyperparameters and their value ranges.

                         Hyperparameter      Value Range
                         learning_rate       0.1 - 2.0
                         n_estimators        1 - 500
                         max_depth           1 - Number of Dataset Attributes



with better predictive performance, thereby keeping computation time spent on regions of
low performance minimal. The main idea behind the algorithm was to iteratively "zoom" in
on the region of apparent best performance. "Zooming" here refers to adaptively determining
the region of known best performance, based on pre-generated points, and generating more
points within the determined region. Using this algorithm, we could efficiently obtain highly
detailed landscapes by generating large numbers of points only in these specific regions, while
leaving the remaining regions at lower detail.
   The algorithm first identifies the p% best performing hyperparameter configurations in the
landscape. From this set of configurations, the region of best performance, defined by hyperpa-
rameter value-ranges, is then determined by the lowest and highest value per hyperparameter.
Finally, a new landscape-sample can be generated based on the determined best performing
region.
   The Adaptive Zoom algorithm is visually demonstrated in Fig. 1, and its pseudocode is
contained in Fig. 2.
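   For illustration, a single Adaptive Zoom step can be sketched in Python roughly as follows (a simplified reconstruction of the pseudocode in Fig. 2, not the exact implementation used, assuming higher scores indicate better performance):

    import numpy as np

    def adaptive_zoom_ranges(configs, scores, p=0.1, r=50):
        # configs: (n, 2) array of evaluated hyperparameter pairs; scores: (n,) performances.
        # Returns r evenly spaced values per hyperparameter, spanning the best-performing region.
        n_best = max(1, int(len(scores) * p))
        best = configs[np.argsort(scores)[-n_best:]]   # the p% best performing points
        ranges = []
        for h in range(best.shape[1]):                 # one value-range per hyperparameter
            v_min, v_max = best[:, h].min(), best[:, h].max()
            ranges.append(np.linspace(v_min, v_max, r))
        return ranges  # a new landscape-sample is then generated over these ranges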




                             (a)                                         (b)
Figure 1: Visual demonstration of how the Adaptive Zoom algorithm works. The hollow dots represent
the best performing points, and the thick outline represents the area of the grid containing these points.
Fig. (a) illustrates the first iteration of Adaptive Zoom, while Fig. (b) represents the second iteration.



4.1.2. Interpolation
To accurately compare the performance-landscapes' characteristics, we needed the same hy-
perparameter combinations across all landscape-samples. However, due to using Adaptive Zoom,
a varying number of landscape-samples, with varying ranges, was generated for each hyperparameter
                           procedure AdaptiveZoom(X, p, r ∈ ℕ)
                              S ← points X sorted by performance
                              B ← the p% best performing points of S
                              R ← empty array
                              for all hyperparameters h do
                                  v ← values of hyperparameter h among the points in B
                                  v_min ← Min(v)
                                  v_max ← Max(v)
                                  h_lin ← Linspace(v_min, v_max, r)
                                  R ← Append(R, h_lin)
                              end for
                              L ← landscape-sample generated over the ranges in R
                              return L
                           end procedure
Figure 2: Pseudocode for the Adaptive Zoom algorithm.


combination. These samples needed to be united so that the performance, relative to each hy-
perparameter combination, would be represented as a single landscape. To achieve this, we
utilized linear interpolation, a method of using a set of known points to generate new points
of a specified resolution within the known points’ range. For the implementation, we used
scipy.interpolate.griddata2 with the "linear" method, which takes a list of points and a list of
their corresponding values, and returns a grid with a specified resolution of interpolated values.
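   As a simplified illustration of this step (not the exact implementation used), merging the accumulated points into a single interpolated grid could look roughly as follows:

    import numpy as np
    from scipy.interpolate import griddata

    def merge_samples(points, values, resolution=200):
        # points: (n, 2) hyperparameter pairs from all landscape-samples; values: (n,) performances.
        xi = np.linspace(points[:, 0].min(), points[:, 0].max(), resolution)
        yi = np.linspace(points[:, 1].min(), points[:, 1].max(), resolution)
        grid_x, grid_y = np.meshgrid(xi, yi)
        # Linear interpolation onto a regular grid of the specified resolution.
        return griddata(points, values, (grid_x, grid_y), method="linear")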

4.1.3. Landscape Generation
The landscapes, with performance values obtained through two-fold cross-validation, were
generated with standard values for the non-investigated hyperparameters, at an initial resolution of
20 x 20. Further landscape-samples based on Adaptive Zoom were generated at an individual
resolution of 50 x 50. The number of samples for each landscape was determined dynamically
by running Adaptive Zoom until the returned hyperparameter ranges were identical to those of
the previous iteration. Finally, all generated landscape-samples of each hyperparameter com-
bination were merged into a single landscape through linear interpolation.
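   Put together, the generation loop described above can be sketched roughly as follows, reusing the simplified helpers sketched earlier in this section (an illustration only, not the exact implementation used; X and y denote a dataset's features and target):

    import numpy as np

    def flatten_samples(samples):
        # Collect every evaluated (hyperparameter pair, score) from all samples generated so far.
        configs, scores = [], []
        for xs, ys, grid in samples:
            gx, gy = np.meshgrid(xs, ys, indexing="ij")
            configs.append(np.column_stack([gx.ravel(), gy.ravel()]))
            scores.append(grid.ravel())
        return np.vstack(configs), np.concatenate(scores)

    # Initial 20 x 20 sample over the full ranges from Table 1, then 50 x 50 zoomed samples
    # until the returned ranges no longer change from one iteration to the next.
    lr_values, n_est_values = np.linspace(0.1, 2.0, 20), np.linspace(1, 500, 20)
    samples = [(lr_values, n_est_values, make_landscape(X, y, lr_values, n_est_values))]
    prev = None
    while True:
        configs, scores = flatten_samples(samples)
        ranges = adaptive_zoom_ranges(configs, scores, r=50)
        if prev is not None and all(np.array_equal(a, b) for a, b in zip(ranges, prev)):
            break
        samples.append((ranges[0], ranges[1], make_landscape(X, y, ranges[0], ranges[1])))
        prev = ranges
    # Finally, all points are merged into a single landscape, e.g. with merge_samples above.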

4.2. Datasets
To explore effects on performance landscapes, we selected several datasets for both classifica-
tion and regression problems, as tabulated in Table 2. The datasets were selected to be varied
in characteristics such as size and the number of categorical and continuous attributes, to
ensure that the generated landscapes would be as varied as possible. This was important so
that the findings could be generalized and would represent the general relationship between
the hyperparameters and performance as accurately as possible.
   In terms of preprocessing, all datasets were randomly shuffled so that their data was well
dispersed, categorical features with text values were one-hot encoded, and id-columns were
removed.
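   As an illustration, this preprocessing corresponds roughly to the following sketch (the target and id-column names are placeholders, not column names taken from the actual datasets):

    import pandas as pd

    def preprocess(df, target="class", id_columns=("id",)):
        # Shuffle the rows so the contained data is well dispersed.
        df = df.sample(frac=1, random_state=0).reset_index(drop=True)
        # Drop id-columns, which carry no predictive information.
        df = df.drop(columns=[c for c in id_columns if c in df.columns])
        # One-hot encode categorical features with text values; numeric columns are left as-is.
        X = pd.get_dummies(df.drop(columns=[target]))
        y = df[target]
        return X, y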
   2
       https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.griddata.html
Table 2
Datasets selected as the basis of generation and comparison of the visualizations.

                                                                   Attributes
                  Dataset          Instances         Type          Categorical       Continuous
              biodegredation3        1055        Classification        24                17
               contraceptive4        1473        Classification         8                1
               soybean-large5         307        Classification        35                0
                  vehicle6            840        Classification         0                18
                   wdbc7              569        Classification         0                31
              winequality-red8       1599        Classification         0                11
                 auto-mpg9            398         Regression            2                5
                forestfires10         517         Regression           19                10
                 housing11            506         Regression            2                11



5. Findings
To obtain insight into the hyperparameters’ effects on performance, we analyzed the land-
scapes by looking at their general convexity, the relative dominance of the hyperparameters'
impact on performance, the locations of optima, the number of local optima, and general
observations of similarities between the landscapes. The motivation behind analyzing the land-
scapes’ convexity and local optima was to investigate how applicable methods using gradient
descent are to hyperparameter optimization. The dominance of the different hyperparameters
was investigated to gain insight into which hyperparameters are more important to optimize
to achieve the best results. Comparisons of each landscape’s optima were investigated to gain
insight into the predictability and consistency of their locations.
   Note that the landscapes for the regression datasets are flipped to ensure visual consistency.
For this reason, the MAE-values are presented as negative.

5.1. Convexity
We found that the landscapes of the learning_rate and n_estimators combination (Figure 3)
were generally lacking in convexity and were instead quite flat and jagged in shape. There
were, however, some exceptions to this. The landscape of the forestfires dataset had a clear
convex shape in the learning_rate axis, where lower values of learning rates yielded better

    3
      http://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation
    4
      https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice
    5
      https://archive.ics.uci.edu/ml/datasets/Soybean+(Large)
    6
      https://datahub.io/machine-learning/vehicle
    7
      https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
    8
      https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
    9
      https://archive.ics.uci.edu/ml/datasets/auto+mpg
   10
      https://archive.ics.uci.edu/ml/datasets/Forest+Fires
   11
      https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Figure 3: The interpolated performance-landscapes based on the combination of learning_rate and
n_estimators. Panels: (a) vehicle, (b) auto-mpg, (c) forestfires, (d) biodegredation, (e) contraceptive,
(f) soybean, (g) wdbc, (h) winequality-red, (i) housing.




Figure 4: The interpolated performance-landscapes based on the combination of learning_rate and
max_depth. Panels: (a) vehicle, (b) auto-mpg, (c) forestfires, (d) biodegredation, (e) contraceptive,
(f) soybean, (g) wdbc, (h) winequality-red, (i) housing.
Figure 5: The interpolated performance-landscapes based on the combination of n_estimators and
max_depth. Panels: (a) vehicle, (b) auto-mpg, (c) forestfires, (d) biodegredation, (e) contraceptive,
(f) soybean, (g) wdbc, (h) winequality-red, (i) housing.


performance than higher values. The landscapes of biodegredation, wdbc, housing, and auto-
mpg also seemed to have some convexity in this axis, though they were still relatively flat in
overall shape.
   For the landscapes of the learning_rate and max_depth combination (Figure 4), the convexity
varied considerably from one dataset to another. The contraceptive dataset was relatively convex
in the max_depth axis, which was also the case for forestfires and winequality-red. Auto-mpg
and housing were somewhat convex in both the learning_rate and max_depth axes.
   Most of the landscapes of the n_estimators and max_depth combination (Figure 5) were flat
in overall shape. However, contraceptive, winequality-red, and perhaps forestfires, had some
convexity to them.

5.2. Dominance
Based on the landscapes, it seemed that n_estimators had, compared to learning_rate, a con-
siderably larger impact on performance. However, this only seemed to be the case for n_estimators-
values lower than approximately 100, which materialized as a "wall" in the landscapes. This
was apparent for all dataset-relative learning_rate and n_estimators combination landscapes
(Figure 3), except for forestfires, which did not seem to contain this wall. For n_estimators-
values over 100, we found that the combination of learning_rate and n_estimators resulted in
quite jagged landscapes, with learning_rate appearing to be the most dominant. We also ob-
served that jaggedness in the n_estimators axis was not equal for all learning_rate values. Only
the landscapes of the auto-mpg and housing datasets were lacking in visible jaggedness.
   Comparing learning_rate to max_depth (Figure 4), it seems that these hyperparameters tend
to have a nearly equal impact on performance, except for a few datasets, such as forestfires,
contraceptive and winequality-red. For these, max_depth appeared to have a larger effect,
creating a wall similar to those observed in the combinations of learning_rate and n_estimators
(Figure 3). We also found that max_depth tended to stop impacting performance beyond certain
values. These values never seemed to exceed max_depth = 15, and for several datasets they
were observed to be lower.
   For the combination of n_estimators and max_depth (Figure 5), we found that the most
dominant hyperparameter changed from one dataset to another, with there sometimes being
performance walls in the n_estimators axis, sometimes in the max_depth axis, and sometimes
in both. Beyond these performance walls, the two hyperparameters seemed about equally
dominant.

5.3. Optima
For all combinations and datasets except vehicle, the optima were located within the learn-
ing_rate value-range of 0.1 to 1.0. For vehicle, the optima were located around the learning_rate
value of 1.5. We also observed that the optima were always located between the n_estimators
value-range of roughly 100 to 250.
   The optima for each dataset were generally observed to be located around the same learn-
ing_rate value within all relevant hyperparameter combinations (Figures 3 and 4), with winequality-
red being the only exception. For this dataset, the optimum was at an entirely different
learning_rate value for the combination of learning_rate and n_estimators (Figure 3) compared
to learning_rate and max_depth (Figure 4). By comparison, the n_estimators optima seemed
somewhat more variable. The max_depth optima were located at values less than 15.

5.4. Local Optima
In terms of local optima, landscapes based on the combination of learning_rate and n_estimators
(Figure 3) seemed to contain many local optima. These did, however, seem to be predominantly
formed from the influence of learning_rate.
   Local optima were also observed to be plentiful for the combination of learning_rate and
max_depth (Figure 4). Here, it seemed that local optima were about equally based on learn-
ing_rate and max_depth, resulting in much more chaotic landscapes. This was, however, only
the case for max_depth values less than 15 due to the observations outlined in Section 5.2.
   For the combination of n_estimators and max_depth (Figure 5), we observed that local op-
tima seemed less common. Most of the landscapes were relatively flat and only datasets like
biodegredation, contraceptive, vehicle, and winequality-red had anything that could be referred
to as local optima. Even for these, the local optima were small.

5.5. General Observations
For the learning_rate and n_estimators combination (Figure 3), we found that most landscapes
were quite similar in characteristics, such as general shape, convexity, and local optima. There
were, however, some differences between the landscapes of the classification and regression
datasets. Most notably, the regression datasets generally seemed to contain fewer local optima,
and the forestfires landscape had a shape unlike any of the others.
   The different learning_rate and max_depth combination landscapes (Figure 4) also seemed
to have recurring characteristics. There generally seemed to be a sharp dip in performance
close to learning_rate = 2.0 and max_depth = 1. The only landscape lacking this dip was the
one generated from the vehicle dataset. We also found that biodegredation, soybean, and wdbc
had very similar landscapes for the combination of learning_rate and n_estimators, which can
also be said for winequality-red and forestfires.
   We also observed that several datasets resulted in similar landscapes for all hyperparameter
combinations. The clearest example of this was housing and auto-mpg, which seemed nearly
identical. This was also the case for soybean and wdbc. In terms of landscape characteristics,
the two datasets that stood out the most were contraceptive and forestfires. For contraceptive,
max_depth seemed to have a much larger influence on performance than for other datasets,
while for forestfires, n_estimators appeared to have no significant performance impact.


6. Discussion
Our goal with this paper was to obtain insights into how and under what conditions the hyper-
parameters of XGBoost affect its performance by analyzing and comparing hyperparameter-
based performance landscapes for various datasets. The motivation was to use such insights
to aid the efficiency of hyperparameter optimization strategies for XGBoost.
   The findings gave several indications that analyzing the performance-landscape visu-
alizations helped gain insight into effective search ranges of the investigated hyperparam-
eters: learning_rate, n_estimators, and max_depth. For learning_rate, the fact that optima
were within the value-range of 0.1 to 1.0 for all datasets except one, indicates that this range
might generally be the most effective when searching for values of this hyperparameter. Similarly, with
n_estimators, a search range of 100 to 250 may be preferable to other alternatives under 500
estimators, based on the optima generally being located within this range. We also
discovered that n_estimators values less than 100 formed a "wall" of performance-increase,
most notably in landscapes of the learning_rate and n_estimators combinations (Figure 3). This
could imply that n_estimators less than 100 can reasonably be excluded from hyperparameter
searches. Regarding max_depth, we observed that values greater than 15 for this hyperpa-
rameter seemed not to affect XGBoost’s performance, implying that limiting the search range
from 1 to 15 might be reasonable. These implications are useful for standardizing searches
over these hyperparameters, making their optimization more efficient by reducing
computation time.
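   As an illustration of how such standardized ranges could be combined with a black-box search, the following sketch constrains a standard random search (scikit-learn's RandomizedSearchCV) to the ranges suggested by our findings; this is an example, not an optimizer we propose or evaluate in this paper:

    from scipy.stats import randint, uniform
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    # Search ranges constrained to the regions suggested by the findings above.
    param_distributions = {
        "learning_rate": uniform(0.1, 0.9),  # optima were observed between 0.1 and 1.0
        "n_estimators": randint(100, 251),   # optima were observed roughly between 100 and 250
        "max_depth": randint(1, 16),         # no observed effect beyond max_depth = 15
    }
    search = RandomizedSearchCV(XGBClassifier(), param_distributions, n_iter=50, cv=2)
    # search.fit(X, y) then only explores the standardized ranges.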
   By analyzing and comparing the general characteristics between the performance-landscapes,
we also found indications that the general behavior of XGBoost’s hyperparameters is some-
what predictable. Specifically, we found that landscapes based on the same hyperparameter
combination had general similarities across the datasets, and that certain datasets resulted in
almost identical landscapes, optima included, across all combinations. This was observed to
be the case for datasets both similar and dissimilar in general characteristics. Based on this, it
is possible that performance-landscapes based on XGBoost’s hyperparameters can largely be
predicted from aspects of the datasets, though no such specific aspects were obvious from
the findings. However, if such landscape predictions are possible and reliable, it could mean
that effective search methods and hyperparameter values for XGBoost can potentially be made
somewhat deterministic. Based on this assumption, such predictions would be a prime target
to take advantage of when designing future hyperparameter optimization methods relevant to
XGBoost. Worth noting, however, is that this is still unlikely to completely remove the need
for black-box methods, due to the large number of local optima and the general flatness observed
in the landscapes. Nevertheless, a combination of the two would likely be beneficial.
   Compared to earlier studies, our visualization method seemed effective for obtaining hy-
perparameter insight despite its simple premise and implementation. For instance, compared
to previously suggested tools using visualizations for hyperparameter tuning processes [8, 9],
our method does not need to be tied to a specific tuning process but instead focuses on sev-
eral datasets to obtain more general insight into the hyperparameters. It is also apparent that
exploring performance-landscapes, as done in earlier papers investigating neural networks
[4, 7, 10], is useful for investigating the hyperparameters of other algorithms and methods.


7. Conclusion and Future Work
In this paper, we attempted to gain insight into how XGBoost’s hyperparameters affect perfor-
mance by analyzing hyperparameter-based performance-landscape visualizations. The find-
ings indicate that the method of visualizing performance-landscapes, based on combinations
of two hyperparameters at a time, is a useful tool for gaining insight into, e.g., effective ranges
of XGBoost’s hyperparameters. This was derived from, e.g., how optima were generally lo-
cated between 0.1 and 1.0 learning_rate, and 100 and 250 n_estimators. Also, how max_depth
was never observed to have an effect on performance for values greater than 15, and how
n_estimators typically had a wall of performance within the value range of 1 to 100. We also
found indications that the visualization method is effective for gaining insight into the general
and specific behavior of XGBoost’s hyperparameters with different datasets. This was derived
from how we observed that most datasets had common landscape-characteristics and how
some dataset-landscapes were nearly identical regardless of similarity/dissimilarity of dataset-
characteristics. Thus, we conclude that 3D visualization of hyperparameter-based performance-
landscapes is an effective tool for obtaining various types of insight into the behavior of XG-
Boost's hyperparameters, and that these insights can possibly be used to design more efficient
hyperparameter optimization strategies for this algorithm.
   While the findings and their implications are promising, several points of investigation
should be explored in future work. For instance, larger datasets should be investigated to
ensure that the findings are generalizable. During the process of generating the visualizations,
we also noticed that the interpolation method would produce artifacts for certain landscapes.
Minor aspects of the visualizations might therefore be inaccurate and should be fixed. The
utilized visualization method is limited in that it can only visualize two hyperparameters at
a time. However, it might be possible to use hyperparameter-based vectors to generate the
visualizations, similar to how Li et al. [4] did with neural network weights, which would solve
this limitation. As discussed in Section 6, we observed that certain datasets, of similar and
dissimilar characteristics, produced strikingly similar landscapes. The cause of this is likely a
worthwhile point of further research.


References
 [1] F. Hutter, J. Lücke, L. Schmidt-Thieme, Beyond manual tuning of hyperparameters, KI-
     Künstliche Intelligenz 29 (2015) 329–337.
 [2] M. Feurer, F. Hutter, Hyperparameter optimization, in: Automated Machine Learning,
     Springer, Cham, 2019, pp. 3–33.
 [3] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, Journal of
     Machine Learning Research 13 (2012) 281–305.
 [4] H. Li, Z. Xu, G. Taylor, C. Studer, T. Goldstein, Visualizing the loss landscape of neural
     nets, in: Advances in Neural Information Processing Systems, 2018, pp. 6389–6399.
 [5] D. Smilkov, S. Carter, D. Sculley, F. B. Viégas, M. Wattenberg, Direct-manipulation visu-
     alization of deep networks, arXiv preprint arXiv:1708.03788 (2017).
 [6] N. Frosst, N. Papernot, G. Hinton, Analyzing and improving representations with the soft
     nearest neighbor loss, arXiv preprint arXiv:1902.01889 (2019).
 [7] S. Fort, S. Jastrzebski, Large scale structure of neural network loss landscapes, in: Ad-
     vances in Neural Information Processing Systems, 2019, pp. 6706–6714.
 [8] T. Li, G. Convertino, W. Wang, H. Most, T. Zajonc, Y.-H. Tsai, HyperTuner: Visual analytics
     for hyperparameter tuning by professionals, in: Proceedings of the Machine Learning
     from User Interaction for Visualization and Analytics Workshop at IEEE VIS, 2018.
 [9] H. Park, J. Kim, M. Kim, J.-H. Kim, J. Choo, J.-W. Ha, N. Sung, VisualHyperTuner: Visual
     analytics for user-driven hyperparameter tuning of deep neural networks, in: Demo at
     SysML Conference, 2019.
[10] S. Fort, A. Scherlis, The goldilocks zone: Towards better understanding of neural net-
     work loss landscapes, in: Proceedings of the AAAI Conference on Artificial Intelligence,
     volume 33, 2019, pp. 3574–3581.
[11] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the
     22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
     2016, pp. 785–794.
[12] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of
     Statistics (2001) 1189–1232.