Machine learning modeling exploration for under-bark tree bole volume estimation⋆

Maria J. Diamantopoulou¹,∗,†

¹ Aristotle University of Thessaloniki, University Campus 54124, Thessaloniki, Greece

Abstract
This paper investigates the potential of probabilistic and ensemble supervised machine learning modeling strategies to accurately estimate under-bark tree bole volume. For this purpose, primary measurement data from pine trees (Pinus brutia Ten.) in the Seich-Sou suburban forest of Thessaloniki, Greece, were used. The analysis offers a foundation for understanding the performance of both non-parametric modeling approaches. Specifically, the study employed the probabilistic Gaussian Process Regression (GPR) methodology with a radial basis function (RBF) kernel. The ensemble learning technique chosen for investigation was Random Forest regression (RFr), which integrates the bootstrap aggregation methodology and is well known for its ability to predict continuous variables. A cross-validation procedure, combined with an exhaustive grid-search methodology, was employed to determine the optimal hyperparameter combination for each constructed model. Despite the challenge of identifying the optimal combination of the hyperparameters unique to each modeling approach, the results demonstrated that both methodologies, owing to their flexibility, have strong potential to provide reliable estimations of under-bark tree bole diameters and volume. This contributes to the sustainable management of forest resources and highlights potential areas for further exploration and improvement.

Keywords
Gaussian Process Regression, Random Forest regression, pine trees

1. Introduction
Accurately predicting the total volume of trees is crucial for anticipating forest growth and productivity.
To estimate the bole volume by section, formulas derived from the methods developed by Huber, Smalian, and Newton are employed [1]. These techniques require multiple measurements of bole diameters at specific heights, which can be difficult to obtain from standing trees. Directly measuring the under-bark diameters of a tree bole several meters above the ground, which is necessary for calculating the true under-bark bole volume, is unfeasible, as these measurements can only be obtained from a felled tree. To avoid this destructive method, alternative indirect approaches are being explored. Traditionally, regression analysis has been used to estimate various forest attributes. However, the standard regression methodology encounters difficulties due to the need to meet multiple assumptions [2]. Lately, the emerging field of artificial intelligence (AI), including machine learning (ML) techniques, has shown great potential for providing accurate estimations and predictions of biological attributes, even when dealing with noisy data and non-normal distributions, which are common in primary forest measurements. Over the past two decades, there has been increasing interest in utilizing machine learning in forestry [3, 4], driven by its advanced computational capabilities.

⋆ Short Paper Proceedings, Volume I of the 11th International Conference on Information and Communication Technologies in Agriculture, Food & Environment (HAICTA 2024), Karlovasi, Samos, Greece, 17-20 October 2024.
∗ Corresponding author.
† These authors contributed equally.
Email: mdiamant@for.auth.gr; ORCID: 0000-0002-6003-1285
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

In this context, the goal of this study is to accurately estimate and predict the under-bark tree bole volume of pine trees using field measurements that are easily obtainable.
To achieve this, two distinct machine learning approaches were employed: the probabilistic Gaussian Process Regression (GPR) method, known for its effectiveness in handling noisy continuous data, and the Random Forest regression (RFr) technique, an ensemble learning algorithm that enhances overall performance by combining the insights of multiple models.

2. Material and Methods
The ground-truth data were collected from measurements on pine trees (Pinus brutia) within the Seich-Sou suburban forest of Thessaloniki, Greece. This forest covers an area of 3,085.82 ha, with elevations ranging from 100 m to 563 m [5]. Systematic sampling was employed to ensure that all different site classes were represented. Tree measurements included over-bark (doh) and under-bark (duh) diameters at one-meter height intervals starting from 0.3 meters above the ground (do0.3, du0.3, do1.3, du1.3, …, do9.3, du9.3), as well as the total height (h) of the sampled trees. Upon completion of the measurements, a sample of n = 999 measurements was obtained. The under-bark bole volume (vubole) was calculated using Smalian's cross-sectional equation [1]:

$$v_{ubole} = \sum_{i=1}^{k} \left[ \frac{\pi}{4} \cdot \left( \frac{d_{u_i}^{2} + d_{u_{i+1}}^{2}}{2} \right) \cdot l \right] + \frac{\pi}{12} \cdot d_{u_k}^{2} \cdot l_k, \qquad (1)$$

where $d_{u_i}$, i = 1, …, k, are the under-bark diameters of the lower and upper ends of the stem sections, in m, l is the length of each section, in m (here equal to one meter), and $l_k$ is the length of the tree top, in m, with $l_k < l = 1$. The mean and the standard deviation (std) of the observed over-bark and under-bark tree diameters, the total tree height and the calculated under-bark volumes are given in Table 1.
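As a minimal sketch of equation (1), the sectional computation can be written in a few lines of Python. The function name, the argument names and the example diameters are illustrative assumptions, not values from the study. The sketch also assumes that the top cone sits on the last measured diameter.

```python
import math


def smalian_under_bark_volume(diameters_m, section_length_m, top_length_m):
    """Under-bark bole volume (m^3) from Smalian sections plus a conical top.

    diameters_m      : under-bark diameters (m), ordered from stump to top
    section_length_m : length l of every full section (m)
    top_length_m     : length l_k of the tree top above the last diameter (m)
    """
    if len(diameters_m) < 2:
        raise ValueError("at least two diameters are needed for one section")

    # Each section: mean of the two end cross-sectional areas times its length.
    sections = sum(
        math.pi / 4.0 * (d_low ** 2 + d_up ** 2) / 2.0 * section_length_m
        for d_low, d_up in zip(diameters_m, diameters_m[1:])
    )
    # Tree top treated as a cone on the last measured diameter (assumption).
    top_cone = math.pi / 12.0 * diameters_m[-1] ** 2 * top_length_m
    return sections + top_cone


# Hypothetical tree: diameters at 0.3, 1.3, 2.3 and 3.3 m, with a 0.5 m top.
example_volume = smalian_under_bark_volume([0.14, 0.12, 0.10, 0.08], 1.0, 0.5)
print(f"under-bark bole volume: {example_volume:.4f} m^3")
```

Diameters recorded in centimeters, as in Table 1, would need to be divided by 100 before being passed to this function.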
Table 1
Summary statistics of the observed tree bole diameters (cm), total height (m) and calculated under-bark volumes (m³)

| variable | mean | std | variable | mean | std | variable | mean | std | variable | mean | std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| do0.3 | 16.57 | 2.76 | do3.3 | 8.88 | 2.79 | do6.3 | 3.90 | 2.03 | do9.3 | 1.99 | 1.55 |
| du0.3 | 14.03 | 2.42 | du3.3 | 8.41 | 2.59 | du6.3 | 3.64 | 1.98 | du9.3 | 1.82 | 1.47 |
| do1.3 | 13.67 | 2.61 | do4.3 | 7.01 | 2.45 | do7.3 | 3.20 | 1.59 | h | 8.17 | 1.33 |
| du1.3 | 12.16 | 2.36 | du4.3 | 6.65 | 2.32 | du7.3 | 2.95 | 1.56 | vubole | 0.05 | 0.02 |
| do2.3 | 11.28 | 2.62 | do5.3 | 5.13 | 2.23 | do8.3 | 2.67 | 1.48 | | | |
| du2.3 | 10.32 | 2.40 | du5.3 | 4.83 | 2.15 | du8.3 | 2.44 | 1.44 | | | |

2.1. Machine learning modeling approaches
Using a probabilistic supervised machine learning method such as Gaussian process regression (GPR) [6] for estimating under-bark bole volume (vubole) brings significant benefits. This approach incorporates prior knowledge through kernels and provides uncertainty measures for predictions. Furthermore, it works well on small data sets and is more efficient in low-dimensional spaces, which suits the present case study. Generally, GPR is characterized by the mean and covariance of the prior Gaussian process, along with the kernel that defines the relationship between two observations. In this context, the radial basis function (RBF) kernel was employed [7]:

$$k(x_i, x_j) = \sigma_f^{2} \cdot e^{-\frac{\lVert x_i - x_j \rVert^{2}}{2 \cdot ls^{2}}}, \qquad (2)$$

where $\sigma_f^{2}$ is the signal variance, which controls the overall variance of the functions drawn from the Gaussian process, ls is the length scale, which determines how rapidly the correlation between two points diminishes as the distance between them increases, and $\lVert x_i - x_j \rVert^{2}$ is the squared Euclidean distance between $x_i$ and $x_j$. In equation (2), both the length scale (ls) and the signal variance ($\sigma_f^{2}$) are hyperparameters that are critical to the quality of the resulting model and must be properly optimized.
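To make the role of the two hyperparameters in equation (2) concrete, the kernel can be evaluated directly. The following is a minimal sketch: the function name and the three input vectors are hypothetical, while the signal variance 0.05 and length scale 1.1 are the optimal values reported in Table 2.

```python
import math


def rbf_kernel(x_i, x_j, signal_variance, length_scale):
    """RBF kernel of equation (2) for two equally long feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return signal_variance * math.exp(-squared_distance / (2.0 * length_scale ** 2))


# Hypothetical inputs: (stump diameter in cm, total height in m) of three trees.
tree_a = (16.0, 8.0)
tree_b = (16.5, 8.2)   # similar to tree_a
tree_c = (20.0, 10.0)  # clearly different from tree_a

k_same = rbf_kernel(tree_a, tree_a, signal_variance=0.05, length_scale=1.1)
k_near = rbf_kernel(tree_a, tree_b, signal_variance=0.05, length_scale=1.1)
k_far = rbf_kernel(tree_a, tree_c, signal_variance=0.05, length_scale=1.1)
print(k_same, k_near, k_far)
```

The covariance equals the signal variance for identical inputs and decays toward zero as the inputs move apart. A larger length scale slows that decay, which yields smoother fitted functions.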
To tune these hyperparameters, the tree samples were randomly divided into a fitting data set, comprising 70% of the total data, and a testing data set with the remaining 30%. Additionally, the fitting data set was subjected to k-fold cross-validation with k = 5, to check that the constructed model's predictive ability is adequate. The same data division approach was applied to the Random Forest regression model construction as well.

The second non-parametric approach chosen was the RFr, selected in part for its ability to bypass the assumptions inherent in standard regression modeling. This technique is recognized as a robust non-parametric, supervised machine learning algorithm, originally proposed by Breiman [8]. The concept behind this approach is that combining multiple models can better capture the true structure of the data. RFr employs multiple individual models, called decision trees, which are combined into a single model. The goal is to minimize both the variance and bias of the base model, the decision tree, as much as possible within the system. The successful training of the RFr model depends strongly on fine-tuning its hyperparameters, particularly the number of decision trees (ndt), known as learners, and the maximum depth (dmax) of these learners. These hyperparameters are crucial as they govern the complexity of the RFr model. The RFr training utilized the bootstrap aggregation algorithm, commonly known as bagging [8, 9]. Both machine learning methodologies were implemented using the scikit-learn library [10] in the Python programming language [11].

2.2. Evaluation criteria
The evaluation criteria used to assess the suitability of the machine learning models in this study were as follows: a) root mean square error (RMSE), which is the square root of the average squared difference between estimated/predicted and observed values; b) the coefficient of determination (R²), which reflects the proportion of variance in the dependent variable that can be explained by the independent variables; c) bias (BIAS), representing the mean difference between estimated/predicted and observed values; and d) relative sum of square errors (RSSE), which is the percentage (%) ratio of the sum of squared errors (SSE) to the sum of the observed under-bark bole volume values. High model performance is indicated by low RMSE, absolute BIAS, and RSSE values, coupled with high R² values.

3. Results
Given the difficulty of measuring tree bole diameters at different heights, the input variables of the under-bark volume machine learning systems were the diameters located near the ground, which are easy to measure, namely (do0.3), (du0.3), (do1.3) and (du1.3), together with the total height (h) of the trees. The output variable was the under-bark bole volume (vubole). Moreover, these variables are highly correlated with the (vubole) values and contribute most to their configuration. For both the Gaussian process regression and the Random Forest regression models, the required hyperparameters were assessed using the grid-search methodology [12], which resulted in the optimal hyperparameter values presented in Table 2.
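The four evaluation criteria of Section 2.2 can be sketched in plain Python as follows. The function name and the example values are hypothetical. Computing R² as one minus the ratio of the SSE to the total sum of squares is a common convention and an assumption here, since the paper does not state its formula.

```python
import math


def evaluation_criteria(observed, predicted):
    """Return RMSE, R2, BIAS and RSSE% for paired observed/predicted volumes."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]  # predicted - observed
    sse = sum(e * e for e in errors)
    mean_observed = sum(observed) / n
    total_ss = sum((o - mean_observed) ** 2 for o in observed)
    return {
        "RMSE": math.sqrt(sse / n),
        "R2": 1.0 - sse / total_ss,            # assumed definition of R2
        "BIAS": sum(errors) / n,               # mean signed error
        "RSSE%": 100.0 * sse / sum(observed),  # SSE relative to sum of observations
    }


# Hypothetical under-bark volumes (m^3) for five trees.
observed_volumes = [0.031, 0.047, 0.052, 0.068, 0.075]
predicted_volumes = [0.033, 0.045, 0.053, 0.066, 0.078]
criteria = evaluation_criteria(observed_volumes, predicted_volumes)
for name, value in criteria.items():
    print(f"{name}: {value:.5f}")
```

A negative BIAS indicates that the model underestimates the observed volumes on average, while RMSE and RSSE are insensitive to the sign of the errors.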
Table 2
Optimal hyperparameter values for both modeling approaches

| GPR hyperparameter | range | optimal value | RFr hyperparameter | range | optimal value |
|---|---|---|---|---|---|
| $\sigma_f^{2}$ | 0 - 1 | 0.05 | ndt | 1 - 300 | 10 |
| ls | 1 - 5 | 1.1 | dmax | 1 - 10 | 7 |

The evaluation criteria for the constructed models are presented in Table 3. As indicated in the table, both models yield similar outcomes. However, the GPR model provides the more accurate and reliable results for both the fitting and testing data sets.

Table 3
Evaluation criteria for the constructed GPR and RFr models, for both the fitting and testing data sets

| model | data set | RMSE | R² | BIAS | RSSE% |
|---|---|---|---|---|---|
| GPR | fitting | 0.0026 | 0.988 | -0.00002 | 0.0141 |
| GPR | testing | 0.0032 | 0.977 | -0.00009 | 0.0233 |
| RFr | fitting | 0.0028 | 0.986 | -0.00005 | 0.0163 |
| RFr | testing | 0.0038 | 0.974 | -0.00136 | 0.0319 |

The performance of both constructed models was further assessed through 45-degree line plots.

4. Discussion
As a Bayesian regression technique, GPR modeling offers a probabilistic approach to inference, enabling the prediction of not just the expected value of a target variable but also the uncertainty associated with that prediction. A probabilistic prediction with a mean and a variance provides a natural measure of uncertainty in the predictions. As an illustration, the uncertainty in the under-bark bole volume predictions against the total tree height and the stump diameter (the tree bole diameter located 0.3 m above the ground) is shown in Figure 1. Similar plots with similar uncertainty could be produced for all predictors. This evaluation is particularly useful in forestry, where risk assessment is essential for the effective implementation of sustainable forest management.

Figure 1: GPR model performance and its associated uncertainty
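The following is a hedged sketch of how a grid search over the two GPR hyperparameters and the resulting predictive uncertainty could look in scikit-learn. It is not the study's code. The data are synthetic stand-ins, the grid values are merely sampled from the ranges in Table 2, the fixed noise level `alpha` is an assumption, and mapping the signal variance to `ConstantKernel` and the length scale to `RBF` is one plausible reading of equation (2).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the study's measurements): stump diameter (cm)
# and total height (m) drive a form-factor volume plus measurement noise.
rng = np.random.default_rng(0)
n_trees = 200
diameter_cm = rng.normal(16.6, 2.8, n_trees)
height_m = rng.normal(8.2, 1.3, n_trees)
X = np.column_stack([diameter_cm, height_m])
y = 0.3 * np.pi / 4.0 * (diameter_cm / 100.0) ** 2 * height_m
y = y + rng.normal(0.0, 0.002, n_trees)

# 70% fitting / 30% testing split, as described in Section 2.1.
X_fit, X_test, y_fit, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Candidate kernels with both hyperparameters held fixed, so that the grid
# search (not the built-in likelihood optimizer) selects them.
signal_variances = [0.05, 0.25, 0.5, 1.0]
length_scales = [1.0, 1.1, 2.0, 3.0, 5.0]
kernels = [
    ConstantKernel(constant_value=s2, constant_value_bounds="fixed")
    * RBF(length_scale=ls, length_scale_bounds="fixed")
    for s2 in signal_variances
    for ls in length_scales
]

gpr = GaussianProcessRegressor(alpha=1e-4, optimizer=None)  # alpha is assumed
search = GridSearchCV(
    gpr, {"kernel": kernels}, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X_fit, y_fit)

# Predictive mean and standard deviation for the held-out trees.
mean, std = search.best_estimator_.predict(X_test, return_std=True)
print("selected kernel:", search.best_params_["kernel"])
print("first prediction: %.4f +/- %.4f m^3" % (mean[0], 1.96 * std[0]))
```

The per-tree standard deviation returned by `predict` is the quantity that uncertainty plots such as Figure 1 would typically display as a band around the predicted mean.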
The flexible structure of the Random Forest algorithm helps prevent the serious issue of overfitting and enables the system to handle real-world data, which often include challenges such as high variance, outliers, and missing values. However, it is important to note that the further a predicted value is from the range of the fitting data, the less reliable that prediction will be.

Declaration on Generative AI
The author(s) have not employed any Generative AI tools.

References
[1] T. E. Avery, H. E. Burkhart, Forest Measurements, McGraw-Hill, New York, NY, 2002.
[2] N. R. Draper, H. Smith, Applied Regression Analysis, 3rd ed., Wiley, New York, NY, 1998. doi: 10.1002/9781118625590.
[3] M. J. Diamantopoulou, R. Özçelik, H. Yavuz, Tree-bark volume prediction via machine learning: A case study based on black alder's tree-bark production, Comput Electron Agric 151(2018): 431–440. doi: 10.1016/j.compag.2018.06.039.
[4] S. S. Ghosh, U. Khati, S. Kumar, A. Bhattacharya, M. Lavalle, Gaussian process regression-based forest above ground biomass retrieval from simulated L-band NISAR data, Int J Appl Earth Obs Geoinf 118(2023): 103252. doi: 10.1016/j.jag.2023.103252.
[5] FILOTIS - Database for the Natural Environment of Greece. URL: https://filotis.itia.ntua.gr/biotopes/c/AT4011119/.
[6] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, Massachusetts, 2006.
[7] W. Chen, H. Wang, Q. H. Qin, Kernel Radial Basis Functions, in: Computational Mechanics, Springer, Berlin, Heidelberg, 2007. doi: 10.1007/978-3-540-75999-7_147.
[8] L. Breiman, Random Forests, Machine Learning 45(2001): 5–32. doi: 10.1023/A:1010933404324.
[9] A. M. Prasad, L. R. Iverson, A. Liaw, Newer Classification and Regression Techniques: Bagging and Random Forests for Ecological Prediction, Ecosystems 9(2006): 181–199. doi: 10.1007/s10021-005-0054-1.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, et al., Scikit-learn: Machine Learning in Python, J Mach Learn Res 12(2011): 2825–2830. doi: 10.48550/arXiv.1201.0490.
[11] Python Software Foundation, Python Documentation, 2022. URL: http://www.python.org/.
[12] S. M. LaValle, M. S. Branicky, S. R. Lindemann, On the relationship between classical grid search and probabilistic roadmaps, The International Journal of Robotics Research 23(2004): 673–692.