Data-Driven Insights into Deforestation: Predictive
                         Modeling in Colombian Regions
                         Alvaro Hernán Alarcón-López¹,*, Ixent Galpin¹
                         ¹ Universidad de Bogota-Jorge Tadeo Lozano, Bogota, Colombia


                                        Abstract
                                        Deforestation is a critical problem that affects biodiversity, climate patterns, and water quality. This article
                                        presents a predictive model for the deforestation rate in the regions of Colombia using the CRISP-DM methodology.
                                        Historical data from 2015 to 2022 from the Institute of Hydrology, Meteorology, and Environmental Studies
                                        (IDEAM) were used. Correlation analyses were performed and models were trained using Random Forest to
                                        obtain the predictor variables: deforested and regenerated area, stable forest area, net difference in forest cover
                                        area, and change in forest cover area. Additionally, the departments of Atlántico, Sucre, Santander, and Meta
                                        were identified as the regions with the highest deforestation rates. In the prediction process, linear regression
                                        models showed the highest accuracy, with an R² of 1.00. Finally, the importance of segmenting and analyzing
                                        data by region to obtain accurate predictions and take effective corrective measures is highlighted.

                                        Keywords
                                        Deforestation, CRISP-DM methodology, Random Forest, Annual Deforestation Rate, Machine learning


                         1. Introduction
                         Deforestation is a global issue that has garnered significant attention from scientists and
                         environmentalists due to its numerous adverse effects on the environment [1, 2]. The loss of biodiversity and
                         alterations in climate patterns are among the most devastating consequences, profoundly impacting
                         ecosystem health and human well-being. Additionally, deforestation has disrupted watershed dynamics
                         and aquatic ecosystems, contributing to the deterioration of water quality and ecological habitats.
                         While deforestation in Colombia is reportedly at an all-time low [3], it is particularly serious for the
                         country due to its rich biodiversity and vital ecosystems, which are crucial for maintaining global
                         climate stability, water cycles, and the livelihoods of indigenous and local communities. To tackle
                         these challenges, various technologies and analytical methods have been developed to identify and
                         predict deforested areas, providing essential data for devising effective conservation and reforestation
                         strategies [4, 5].
                            In recent years, analytical and classification models have become crucial for understanding and
                         predicting areas affected by deforestation. For instance, drone imagery has been instrumental in
                         accurately identifying deforested areas and monitoring forest regeneration over time. These advanced
                         technologies have significantly enhanced the precision of deforestation detection and have optimized
                         reforestation efforts [6].
                            Similarly, the application of machine learning for classifying satellite images has enabled the precise
                         identification of areas affected by forest fires and other disturbances. Moreover, studies utilizing
                         classification techniques to analyze deforestation trends underscore the importance of continuous
                         monitoring and conservation efforts. These tools and methods have also been employed to predict
                         future trends, aiding in the development of effective strategies for ecosystem preservation.
                            This paper presents a predictive model for deforestation rates in various regions of Colombia using the
                         CRISP-DM methodology, a process model for data mining that has been applied successfully in research and development [7].
                         The objective is to provide a technological tool for the early implementation of corrective actions.

                         ICAIW 2024: Workshops at the 7th International Conference on Applied Informatics 2024, October 24–26, 2024, Viña del Mar, Chile
                         * Corresponding author.
                         Email: alvaroh.alarconl@utadeo.edu.co (A. H. Alarcón-López); ixent@utadeo.edu.co (I. Galpin)
                         ORCID: 0000-0003-4703-1907 (A. H. Alarcón-López); 0000-0001-7020-6328 (I. Galpin)
                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Alvaro Hernán Alarcón-López et al. CEUR Workshop Proceedings                                         165–178


The structure of the work is as follows: Section 2 discusses the impacts of deforestation, including
biodiversity loss, decreased water availability, and increased soil erosion. Section 3 reviews previous
research on the prediction and mitigation of deforestation. In Section 4, the departments in Colombia
with the highest deforestation rates based on forest area are identified, and the data distribution is
analyzed. Section 5 establishes the correlation between predictor and target variables and quantifies
their importance. Section 6 identifies predictor variables by department and generates prediction models.
Section 7 evaluates the performance of each model using appropriate metrics for regression problems.
Finally, Section 9 presents the study’s findings and provides a comprehensive overview of the research.


2. Deforestation
Deforestation can lead to a series of long-term, observable consequences that ultimately result in
serious environmental problems. Its primary impacts include the reduction of biodiversity, the alteration
of ecosystem functioning, and the modification of carbon dynamics [8, 9]. Furthermore,
the loss of natural habitats for numerous species can cause the extinction of endemic flora and fauna,
thereby diminishing the unique biodiversity of each region.
   Another impact directly related to the reduction of forested areas is the significant increase in
global temperatures. This is due to the rise in carbon dioxide levels released into the atmosphere from
deforestation, which contributes to global warming and disrupts climate patterns [10]. As a result, these
changes intensify extreme weather events, adversely affecting human communities and ecosystems
that depend on a stable climate for their survival and well-being.
   Furthermore, deforestation impacts the availability of water in watersheds and alters its flow and
distribution. These changes have significant implications for terrestrial hydrological systems and the
ecosystems that depend on them [11]. Additionally, the loss of forest cover can lead to greater variability
in water flow, resulting in more frequent and severe droughts and floods.
   Another direct consequence of deforestation is the increase in soil erosion rates and the disruption of
nutrient and water cycles, which adversely affect the livelihoods of local communities. This degradation
reduces soil quality and its capacity to support agriculture. This situation underscores the importance
of implementing sustainable forestry, agricultural, and livestock practices to mitigate negative impacts
and safeguard the natural resources essential for human sustenance.
   On the other hand, in South America, deforestation has additional consequences related to the
reduction of glacier recharge, which feeds rivers and returns water to the Amazon. This situation poses
a serious threat to the future of agriculture in various natural regions, as it leads to atypical occurrences
of droughts and floods, thereby increasing the likelihood of environmental and forest disasters [12].
   In the regions of Colombia, deforestation leads to several critical issues, including reduced biodiversity,
altered ecosystem functioning, significant contributions to global temperature rise due to increased
atmospheric carbon dioxide, and disrupted climate patterns [13]. It also affects water availability in
watersheds, altering flow and distribution, and increases soil erosion rates, disrupting nutrient and water
cycles, thus adversely impacting the survival of living organisms. Furthermore, agricultural activities,
such as livestock farming, have exacerbated deforestation rates through unsustainable practices like
burning forest areas to create grazing land for cattle [14].


3. Related Work
Analysis and classification models are paramount for understanding and predicting deforestation and
its associated problems. Numerous studies have been conducted on this topic. One example is the use
of convolutional neural networks (CNN) [15] for multiclass semantic segmentation, which enables
the identification of deforested areas from drone images. The goal of this research was to selectively
distribute seeds and monitor forest regeneration over time [6].
   Another approach is the application of machine learning to satellite image classification to identify
areas affected by wildfires. This method has demonstrated high potential for accurately classifying
such images, utilizing metrics such as precision and average success rate [4]. Additionally, the use of
pre-trained convolutional neural networks (CNNs), combined with clustering algorithms like K-Means,
has enabled the precise identification of damaged forest areas. This provides an effective solution for
labeling satellite data, supporting rapid reforestation efforts [16].
   In another study by Kani et al., Random Forest (RF) classification was used to analyze deforestation trends, revealing a
gradual decrease in forest areas over the years. This study underscores the importance of continuous
monitoring and conservation efforts, emphasizing the need for immediate actions to prevent further
loss of forest areas [5]. By leveraging these models and analytical techniques, it is possible not only
to accurately identify deforested areas but also to predict future trends. This capability contributes
to the development of effective reforestation and ecosystem preservation strategies, enabling more
sustainable and effective forest management.
   In the literature reviewed, no studies were found that develop prediction
models based on time series analysis for Colombia. This work therefore takes a
distinct approach compared to the existing research. This study aims to develop a
methodology to predict the deforestation rate in the regions of Colombia, to enable early corrective
actions. To achieve this, the well-established CRISP-DM methodology, specialized in data mining, is
employed. This methodology is based on a hierarchical model distributed across different development
stages: business understanding, data understanding, data preparation, modeling, and evaluation [7].


4. Data Understanding
The dataset used in this study comprises 561 records covering the 33 departments of Colombia, with
each department represented by 17 records. These records detail deforestation rates and related factors,
including variables such as stable forest area (SFA), deforested area (DA), and annual deforestation rate (ADR), among
others, over various time periods. The dataset includes key departments like Amazonas, Atlántico, Sucre,
Santander, and Meta, enabling a comprehensive analysis of deforestation trends. The even distribution
of data across these regions is essential for evaluating the robustness of the models and the complexity
of the analysis, ensuring that localized deforestation patterns are accurately captured and modeled.
   In this phase, the available and necessary resources are evaluated, and the objective of data mining is
determined. Data from secondary sources are collected and described, and their quality is verified by
statistical analysis, determining attributes and correlations [7]. The data consists of a historical record
of environmental statistics provided by the Institute of Hydrology, Meteorology, and Environmental
Studies (IDEAM). Specifically, two datasets were used: ‘Change in the area covered by natural forest
according to Department Consolidated results between 1990-2022’ and ‘Annual deforestation rate
according to Department Consolidated results between 1990-2022’¹. Complementary variables from
these datasets are used, with the understanding that they have identical time series and that both
contain segmented data from the 33 departments. The analysis of historical annual data from 2005 to
2022 for each of the 33 departments of Colombia is carried out to observe trends and changes over time.
Table 1 presents the data dictionary used for the analysis.
   To develop the prediction model, the annual deforestation rate (ADR) is selected as the dependent
variable. Due to the data dispersion, the models are developed by region. Consequently, the independent
(predictor) variables are defined per department.
   Due to the segmentation of data by departments, it is essential to determine which departments
exhibit a higher deforestation rate relative to the proportion of stable forest area in each region. To
achieve this, a new dataset is generated that presents the calculated averages for each of the variables
by department. Additionally, a column is added to establish the relationship between NDAC (net
deforestation, calculated as the difference between DA and RA) and the stable forest area variable (SFA).
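The averaging and ratio step described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the department labels and all numeric values are hypothetical, and NDAC is assumed to be computed as RA minus DA so that net forest loss comes out negative, in line with the sign of the values reported in Table 2.

```python
from statistics import mean

# Hypothetical yearly records (areas in hectares); NOT IDEAM data.
records = [
    {"department": "A", "SFA": 1000.0, "DA": 30.0, "RA": 5.0},
    {"department": "A", "SFA": 980.0, "DA": 25.0, "RA": 4.0},
    {"department": "B", "SFA": 5000.0, "DA": 40.0, "RA": 20.0},
    {"department": "B", "SFA": 4990.0, "DA": 35.0, "RA": 15.0},
]


def ndac_sfa_ratio_by_department(rows):
    """Average NDAC and SFA per department, then return mean(NDAC) / mean(SFA)."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["department"], []).append(row)

    ratios = {}
    for dept, dept_rows in grouped.items():
        # Assumption: NDAC = RA - DA (negative when more forest is lost than regenerated).
        mean_ndac = mean(r["RA"] - r["DA"] for r in dept_rows)
        mean_sfa = mean(r["SFA"] for r in dept_rows)
        ratios[dept] = mean_ndac / mean_sfa
    return ratios


ratios = ndac_sfa_ratio_by_department(records)
# Most negative ratio first: largest net loss relative to stable forest area.
ranking = sorted(ratios, key=ratios.get)
print(ranking, ratios)
```

Under this sign convention, the departments at the head of `ranking` are those with the largest net forest loss relative to their stable forest area.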
   From this initial analysis, it was determined that the departments of Atlántico, Sucre, Santander, and
Meta exhibited the highest NDAC/SFA ratios. Consequently, these regions had higher deforestation
rates relative to the amount of forest in their territory. For this study, these four departments were
selected, and a summary of the analysis is presented in Table 2.

¹ http://www.ideam.gov.co/web/ecosistemas/bosques-y-recurso-forestal

Table 1
Data dictionary

          Variable    Type                Description
          SFA         Decimal numeric     Stable forest area (ha)
          DA          Decimal numeric     Deforested area (ha)
          RA          Decimal numeric     Regenerated area (ha)
          AWI         Decimal numeric     Area without information (ha)
          PAWI        Decimal numeric     Proportion of area without information (%)
          ADR         Decimal numeric     Annual deforestation rate (%)
          NDAC        Decimal numeric     Net difference of the area covered by forest, period t1:t2
          CFA         Decimal numeric     Change in forest area

Table 2
Higher NDAC/SFA ratios

                                        Department        NDAC/SFA
                                        Atlántico          -0.027893
                                        Sucre              -0.023736
                                        Santander          -0.009413
                                        Meta               -0.008593


  Figure 1 presents an ADR graph for the departments of Atlántico, Sucre, Santander, and Meta,
highlighting distinct deforestation patterns. In Atlántico, the trend is fluctuating with significant peaks
and abrupt drops, indicating periods of intensive deforestation followed by recovery. Sucre exhibits
notable variability, with significant increases and sudden decreases in ADR. Santander displays high
variability characterized by multiple peaks and valleys. In Meta, a decreasing trend in ADR is observed,
suggesting the possible effectiveness of conservation policies; however, sporadic peaks still occur.
  To visualize the distribution of the annual deforestation rate for these four departments, the original
dataset was filtered, and a boxplot was generated. This boxplot allows for a visual comparison of the
variability and distribution of the annual deforestation rate among the selected departments. Figure 2
reveals that none of the distributions contain outliers for the analyzed variable. Additionally, it is noted
that the department of Atlántico has the highest median deforestation rate, while Santander has the
lowest median.
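The outlier check that a boxplot performs visually can also be written out explicitly. The sketch below assumes the conventional whisker rule (points beyond 1.5 times the interquartile range from the quartiles count as outliers); the paper does not state which rule its plotting tool applied. The department labels and ADR values are hypothetical.

```python
from statistics import median, quantiles


def boxplot_summary(values):
    """Return the median, quartiles, and points outside the 1.5 * IQR whiskers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"median": median(values), "q1": q1, "q3": q3, "outliers": outliers}


# Hypothetical annual deforestation rates (%), one list per department.
adr = {
    "Dept X": [0.5, 0.7, 0.9, 1.1, 1.3],
    "Dept Y": [0.1, 0.2, 0.2, 0.3, 2.5],
}
summaries = {dept: boxplot_summary(vals) for dept, vals in adr.items()}
for dept, summary in summaries.items():
    print(dept, summary)
```

In this toy data, "Dept X" has no outliers while the value 2.5 in "Dept Y" falls above the upper whisker; comparing the `median` entries reproduces the kind of ranking read off Figure 2.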


5. Data Preparation
Data selection is performed by defining specific inclusion and exclusion criteria for the IDEAM dataset,
using various methods described in the corresponding section [7]. Initially, records with missing values
(NA) are eliminated and some variables are converted from decimals to integers to ensure consistency
in the analysis. Once the data are clean, we proceed to the construction of derived attributes, such as
the NDAC/SFA ratio, which could serve as an additional predictor variable in the model.
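The cleaning steps listed above (dropping records with missing values, casting decimals to integers, and deriving the NDAC/SFA ratio) might look like the following sketch. The paper does not say which variables were cast to integers, so the choice of `AREA_COLUMNS` here is an assumption, as are the sample values and the convention NDAC = RA - DA.

```python
# Assumed set of columns to validate and cast; the paper does not list them.
AREA_COLUMNS = ("SFA", "DA", "RA")


def prepare(rows):
    """Drop incomplete records, cast areas to integers, and add derived attributes."""
    cleaned = []
    for row in rows:
        if any(row.get(col) is None for col in AREA_COLUMNS):
            continue  # eliminate records with missing values (NA)
        rec = dict(row)
        for col in AREA_COLUMNS:
            rec[col] = round(rec[col])  # decimal -> integer
        if rec["SFA"] == 0:
            continue  # the ratio is undefined without stable forest area
        rec["NDAC"] = rec["RA"] - rec["DA"]  # assumed sign convention
        rec["NDAC_SFA"] = rec["NDAC"] / rec["SFA"]
        cleaned.append(rec)
    return cleaned


# Hypothetical raw records; None marks a missing value.
raw = [
    {"year": 2020, "SFA": 1000.4, "DA": 30.6, "RA": 5.2},
    {"year": 2021, "SFA": None, "DA": 25.0, "RA": 4.0},
    {"year": 2022, "SFA": 980.0, "DA": 20.0, "RA": 6.0},
]
clean = prepare(raw)
print(clean)
```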
   The Random Forest model is selected at this initial stage because it can handle large numbers of
variables and identify the most relevant features among them. Its predictions are comparatively robust
to multicollinearity, and it copes efficiently with high-dimensional data. This approach allows the
dataset to be explored in depth to identify which variables have the greatest impact on the prediction
of the annual deforestation rate.


Figure 1: ADR for departments with highest NDAC/SFA ratio. Panels: (a) Atlántico, (b) Sucre, (c) Santander, (d) Meta.

Figure 2: Annual deforestation rate distribution

   Furthermore, the Random Forest model provides a valuable measure of the importance of features,
which facilitates the identification of the most significant variables for prediction. This capability is
critical in high-dimensional studies, such as deforestation analysis, where it is crucial to determine which
variables have the greatest impact on the results. By prioritizing the most relevant features, Random
Forest helps reduce the risk of overfitting and improves the predictive capability of the model [17].
This measure of importance not only guides model building but also provides a deeper understanding
of the factors driving changes in forest area.




   In this context, correlation plots were generated for all variables, as well as box plots and scatter
plots of the annual rate of deforestation for each of the four departments with the highest NDAC/SFA
ratio. These graphs allowed a clear and detailed visualization of the relationship between the selected
variables and annual deforestation, which confirmed the relevance of the chosen characteristics. Using
these visual methods in combination with the feature importance measure provided by Random Forest
ensures that the model is based on the most robust and reliable predictors available, thus optimizing
its ability to make accurate and useful predictions for deforestation management. Figure 3 shows the
correlation matrix for each of the selected departments.


Figure 3: Spearman correlation matrix for departments with highest NDAC/SFA ratio. Panels: (a) Atlántico, (b) Sucre, (c) Santander, (d) Meta.


5.1. Significant Correlations
In the department of Atlántico, the variables DA, RA, NDAC, NDAC/SFA, and CFA exhibit a high
correlation with ADR (annual deforestation rate (%)). The p-value for each of these variables in relation
to ADR was determined using the Mann-Whitney statistical test and was found to be less than 0.05
for all of them. This result rejects the null hypothesis, indicating that these variables could be strong
predictors. The calculated values are presented in Table 3.
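Since Figure 3 reports Spearman correlations, it may help to see how such a coefficient is obtained: both variables are replaced by their ranks (ties receive the average rank) and the Pearson correlation of the ranks is taken. The plain-Python sketch below illustrates only the coefficient, not the p-values; the sample values are hypothetical. In practice a statistics library such as SciPy (`scipy.stats.spearmanr`) would typically be used instead.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical deforested area (ha) and annual deforestation rate (%).
da = [10, 20, 30, 40, 50]
adr = [0.1, 0.4, 0.3, 0.9, 1.6]
rho = spearman(da, adr)
print(round(rho, 3))
```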
    In the Sucre region, the variables DA, NDAC, CFA, and NDAC/SFA show a high correlation with
ADR (annual deforestation rate). The p-value calculations using the Mann-Whitney test yielded values
lower than 0.05 for each of these variables, leading to the rejection of the null hypothesis. Therefore,
it is concluded that these variables could be strong predictors. The calculated values are presented in
Table 4.




Table 3
P-value for variables with higher correlation - Atlántico

                                   Variables      Mann-Whitney (p-value)
                                      DA                1,71E+07
                                      RA                2,57E+07
                                    NDAC                1,04E+09
                                     CFA                1,04E+09
                                   NDAC/SFA             1,04E+09


Table 4
P-value for variables with higher correlation - Sucre

                                   Variables      Mann-Whitney (p-value)
                                      DA                8,57E+05
                                    NDAC                8,57E+05
                                     CFA                8,57E+05
                                   NDAC/SFA             8,57E+05


  Additionally, in the department of Santander, the variables DA, NDAC, CFA, and NDAC/SFA exhibit
a high correlation with ADR (annual deforestation rate). The p-value calculations using the Mann-
Whitney test yield values below 0.05 for each of these variables, leading to the rejection of the null
hypothesis. This suggests that these variables could be strong predictors. The calculated values are
presented in Table 5.

Table 5
P-value for variables with higher correlation - Santander

                                   Variables      Mann-Whitney (p-value)
                                      DA                8,57E+05
                                    NDAC                8,57E+05
                                     CFA                8,57E+05
                                   NDAC/SFA             8,57E+05


   In the Meta region, the variables SFA, RA, NDAC, and NDAC/SFA show a high correlation with ADR
(annual deforestation rate). The p-value calculations using the Mann-Whitney test yield values below
0.05 for each of these variables, leading to the rejection of the null hypothesis and indicating that these
variables could be strong predictors. The calculated values are presented in Table 6.

Table 6
P-value for variables with higher correlation - Meta

                                   Variables      Mann-Whitney (p-value)
                                     SFA                8,57E+05
                                      RA                8,57E+05
                                    NDAC                8,57E+05
                                   NDAC/SFA             8,57E+05




5.2. Feature Relevance
To confirm and quantify the importance of the variables, a Random Forest model is trained, as this
algorithm combines the results of multiple decision trees to reduce the risk of overfitting
and improve the generalization of the model [18]. This is in contrast to linear and logistic regression,
which do not adequately capture complex, non-linear relationships between variables, and to other
models such as SVM and neural networks, which can be more costly and difficult to interpret. The data
were split into 80% training data and 20% test data, and the evaluation metric used was RMSE. The predictor
variables were defined as SFA, DA, RA, AWI, PAWI, NDAC, CFA, and NDAC/SFA, and the target variable as ADR.
In this way, the aim was to determine the importance of these features for each of the selected
departments. The results obtained can be seen in Figure 4.


Figure 4: Importance ratio of the features. Panels: (a) Atlántico, (b) Sucre, (c) Santander, (d) Meta.


   For the Atlántico department, the Random Forest model achieved an RMSE of 0.0692, indicating good
accuracy due to the relatively small average error. Regarding feature importance, NDAC is identified as
the most important variable, contributing 25.72%. DA also shows significant importance with 21.6%,
followed by CFA at 20.68% and the NDAC/SFA ratio at 18.78%. The remaining features each contribute
less than 5%.
   For the department of Sucre, the Random Forest model achieved an RMSE of 0.0904. The most
important variable is CFA, contributing 54.4%. DA follows with 13.94%, NDAC accounts for 12.83%, and
the NDAC/SFA ratio contributes 8.23%, while the remaining features each have an importance of less
than 5%. For the department of Santander, the RMSE is 0.0227. Here, CFA is again the most important
variable, contributing 63.32%. NDAC follows with 10.74%, DA accounts for 9.01%, and the NDAC/SFA
ratio contributes 7.98%, with the other features each contributing less than 5%.
   Finally, for the Meta department, the Random Forest model achieved an RMSE of 0.0772. The most
important variable is CFA, contributing 31.73%. SFA follows with 22.64%, NDAC accounts for 14.35%,
the NDAC/SFA ratio contributes 14.03%, and DA represents 8.9%. The remaining features each have an
importance of less than 5%.
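The procedure used in this subsection (an 80/20 split, a Random Forest regressor, RMSE on the held-out data, and importances expressed as percentages) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes NumPy and scikit-learn are installed, the data are synthetic, only a subset of the feature names is used as column labels, and the hyperparameters (`n_estimators=200`, `random_state=0`) are arbitrary choices not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["SFA", "DA", "RA", "NDAC", "CFA"]

# Synthetic predictors and target, purely illustrative (NOT IDEAM data).
X = rng.normal(size=(100, len(feature_names)))
y = 0.8 * X[:, 4] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=100)

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# RMSE is the square root of the mean squared error on the held-out data.
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Impurity-based importances sum to 1; scale them to percentages.
importance_pct = {
    name: 100.0 * weight
    for name, weight in zip(feature_names, model.feature_importances_)
}
for name, pct in sorted(importance_pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct:.2f}%")
print(f"RMSE: {rmse:.4f}")
```

Because the synthetic target depends mostly on the column labeled CFA, that feature receives the largest share of importance here; with real data the ranking would reflect each department's records.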




6. Modeling
The technique selection and model development phase is essential in the predictive modeling process,
as it determines the tools and methods that will be used to analyze the data. Therefore, to approach the
problem of forest area change and deforestation rate using the data provided by IDEAM, the technique
that best suits the nature of the problem and the quality of the available data must be chosen. It
is essential to consider that the selected methods must be able to handle both linear and nonlinear
relationships present in the data [19], which will allow capturing the complex patterns inherent to the
deforestation process.
   In the case of IDEAM data, models such as linear regression, decision trees, SVM (Support Vector
Machines), and random forest are selected to predict the ADR (Annual Deforestation Rate). These
models are chosen because of their ability to combine linear and nonlinear techniques, which allows
for capturing diverse and complex patterns in the data.
   Each selected technique offers particular advantages that make it suitable for this type of analysis.
Linear regression is used to identify simple relationships between variables, providing a clear basis for
understanding how certain factors influence the rate of deforestation. On the other hand, decision trees
and random forest models are effective for capturing more complex interactions between variables,
being especially useful when working with data that exhibit nonlinear relationships. In addition, SVM
is especially valuable in high-dimensional scenarios, where the number of variables can complicate
other simpler methods.
   Although there are more advanced techniques, such as neural networks, they are not used in this case
due to their high computational demands and the need for large volumes of data for effective training.
Neural networks are powerful and can capture very complex patterns, but their implementation requires
significant resources and a larger data set than was available. Therefore, it was decided to use models
that offer a balance between predictive capability and computational efficiency. The source code used
to develop these models in Python is available on GitHub², allowing other researchers to reproduce the
results or adapt the techniques to their datasets.

6.1. Atlántico Region Model
The predictive model for the annual rate of deforestation in the Atlántico region is developed after a
detailed analysis using the correlation index, the p-value, and the feature importance ranking. These
analyses allow the identification of the most relevant variables for the model. In this scenario, the
variables NDAC, NDAC/SFA, CFA, DA, and RA are identified as the most effective predictors for
modeling the annual rate of deforestation in the data from the department of Atlántico. These variables
reflect a strong correlation with deforestation, suggesting that the combination of anomalous climatic
factors and the current forest situation is critical for predicting changes in forest cover in the region.

6.2. Sucre Region Model
In the Sucre region, data analysis showed that several variables are essential for predicting the annual
rate of deforestation. The correlation index, p-value, and importance ranking of the characteristics
determined that the variables NDAC, NDAC/SFA, CFA, and DA should be used as predictors. The
integration of these variables into the analysis provides a robust framework that facilitates not only the
accurate prediction of the annual rate of deforestation but also strategic and informed forest management
decisions. This model reflects the specific realities of the region, providing a useful tool for adaptation
processes to environmental and social changes, and strengthening the strategies for the conservation
and sustainable use of forests in Sucre.


² https://github.com/AlvaroHernan/DeforestationPredictive


6.3. Santander Region Model
In the case of the department of Santander, the variables that most influence the annual rate of deforestation
were identified through the analysis of the correlation indexes, the p-values, and the feature importance
ranking. The study concluded that the variables NDAC, NDAC/SFA,
CFA, and DA are the most relevant for the predictive model. Their inclusion in the model provides a
solid basis for understanding the underlying drivers of deforestation, which is essential for developing
effective conservation strategies.

6.4. Meta Region Model
The model designed for the department of Meta identifies the variables NDAC, SFA, NDAC/SFA, CFA, and DA
as essential for predicting the annual rate of deforestation. Besides predicting changes in forest area
accurately, the model supports informed decision-making in natural resource management and the
formulation of adaptive and effective conservation strategies in the Meta region.


7. Results
In the evaluation phase, the models are compared using three metrics: R², Mean Squared Error (MSE), and
Mean Absolute Error (MAE). To make the evaluation more robust, cross-validation is used. This technique
divides the data into multiple subsets, or folds, so that the model is trained and tested on different
partitions of the data. A common approach is 5-fold cross-validation, where the data is split into five
parts; the model is trained on four parts and tested on the remaining one, and the process is repeated
five times so that each part serves once as the test set. This provides a more comprehensive assessment
of the model’s performance, in particular with regard to its generalization capabilities. The metrics are
computed for each fold, and their average values are used to determine the overall accuracy of the
models, offering a more reliable evaluation than a single train-test split. The results for each model,
by department, are presented in Tables 7–10 and interpreted below.
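
The evaluation procedure described above can be sketched with scikit-learn as follows. This is an
illustrative sketch under stated assumptions rather than the study's actual pipeline: the data are
synthetic, the hyperparameters are library defaults, and the shuffling seed is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for one department's predictors and target
# (assumption: four predictors, 80 observations).
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 4))
y = X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.05, size=80)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "SVM": SVR(),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
}

# scikit-learn reports errors as negated scores, so they are flipped below.
scoring = {
    "r2": "r2",
    "mse": "neg_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds.
    results[name] = {
        "R2": scores["test_r2"].mean(),
        "MSE": -scores["test_mse"].mean(),
        "MAE": -scores["test_mae"].mean(),
    }

for name, metrics in results.items():
    print(f"{name:<24} R2={metrics['R2']:.2f} "
          f"MSE={metrics['MSE']:.2f} MAE={metrics['MAE']:.2f}")
```

Averaging over folds yields one row per model, in the same layout as the tables that follow.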

Table 7
Model results - Atlántico

                                     Model                 R²     MSE    MAE
                                Linear Regression         1.00    0.00   0.00
                                  Decision Tree           0.76    0.03   0.11
                                      SVM                 0.27    0.11   0.27
                             Random Forest Regressor      0.91    0.02   0.10


Table 8
Model results - Sucre

                                     Model                  R²    MSE    MAE
                                Linear Regression          1.00   0.00   0.00
                                  Decision Tree            0.80   0.01   0.06
                                      SVM                 -0.16   0.09   0.23
                             Random Forest Regressor       0.71   0.02   0.08


Table 9
Model results - Santander

                                      Model                 R²    MSE    MAE
                                 Linear Regression         1.00   0.00   0.00
                                   Decision Tree           0.89   0.00   0.02
                                       SVM                 0.47   0.01   0.07
                              Random Forest Regressor      0.96   0.00   0.01


Table 10
Model results - Meta

                                      Model                 R²    MSE    MAE
                                 Linear Regression         1.00   0.00   0.00
                                   Decision Tree           0.91   0.01   0.09
                                       SVM                 0.93   0.01   0.08
                              Random Forest Regressor      0.89   0.01   0.07


   The results indicated that the linear regression model was the most accurate in predicting deforestation
rates across all the analyzed departments: Atlántico, Sucre, Santander, and Meta. The model achieved
R² values of 1.00 (to two decimal places) in every department, reflecting its high accuracy in forecasting
deforestation patterns in these regions and demonstrating a superior ability to explain the variability
in deforestation rates.
   In addition to linear regression, the random forest model also showed competitive performance,
especially in the departments of Santander and Atlántico, with R² of 0.96 and 0.91, respectively; in Meta
and Sucre it reached 0.89 and 0.71. This model is known for its ability to capture complex interactions
between variables, which makes it particularly useful in contexts where deforestation patterns are
influenced by multiple interrelated factors. Although the random forest did not outperform the linear
regression model in terms of R², its results were close enough in most departments to consider it a
robust alternative, especially in scenarios where it is desirable to minimize the risk of overfitting.
   On the other hand, decision tree models and support vector machines (SVM) performed less consistently.
For the decision tree model, the R² values ranged from 0.76 to 0.91, indicating that, although effective,
its ability to predict accurately is lower than that of linear regression; it nevertheless exceeded the
random forest in Sucre and Meta. The SVM model, although useful in specific contexts, showed the greatest
limitations, with R² ranging from -0.16 in Sucre to 0.93 in Meta, suggesting that it may not be the best
choice for this type of analysis in regions with complex and highly variable data such as deforestation.
   The analysis of the mean squared error (MSE) and mean absolute error (MAE) supported the conclusions
obtained from R². In all departments, linear regression presented the lowest MSE and MAE values (0.00 in
every case, matched on MSE only by the decision tree and random forest in Santander) and maintained
remarkable consistency across the departmental data sets. This reinforces the idea that linear
regression is not only accurate but also stable in its performance, which is crucial for the
implementation of policies based on its predictions.
   Importantly, the superiority of linear regression could be due to the linear nature of the underlying
relationships between the predictor variables and the deforestation rate. However, the variability in the
performance of the other models across departments also underscores the importance of considering
specific regional characteristics when selecting the most appropriate model.
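
The hypothesis of an underlying linear relationship can be illustrated with a small sketch. It assumes a
target that is an exact linear function of the predictors, which the source does not establish for the
IDEAM data; the coefficients and data are synthetic. Under that assumption, linear regression reaches an
R² of 1.00 on held-out data, while a decision tree, which predicts piecewise-constant values, does not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic predictors; the target is an exact linear function of them
# (assumption made for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = 1.2 * X[:, 0] - 0.7 * X[:, 1] + 0.4 * X[:, 2]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

linear_r2 = r2_score(y_test, linear.predict(X_test))
tree_r2 = r2_score(y_test, tree.predict(X_test))

print(f"Linear regression R2: {linear_r2:.4f}")
print(f"Decision tree R2:     {tree_r2:.4f}")
```

If the annual deforestation rate were derived arithmetically from the selected area variables, a perfect
linear fit would be expected; checking this against the variable definitions is one way to interpret the
R² of 1.00.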
   In summary, the linear regression model emerged as the most effective tool for predicting deforestation
rates in the departments evaluated, providing highly accurate and reliable predictions. The random
forest stood out as a robust alternative, especially in more complex scenarios. The results obtained
underline the importance of a regionalized approach to predictive modeling.




8. Future Work
The incorporation of variables such as temperature, humidity, and forest type into predictive models is
crucial for improving the accuracy of predictions, but it faces significant challenges in terms of obtaining
and managing these data. The quality and availability of accurate and updated information on these
variables can be difficult to guarantee, especially in remote regions or areas with limited infrastructure
for environmental data collection. The limited number of meteorological stations in certain forests and
variability in collection methods can generate inconsistencies that affect the accuracy of the model.
In addition, available historical data may not cover long enough periods to capture long-term trends,
limiting the model’s predictive capability.
   Another significant limitation lies in the temporal and spatial resolution of the data. In many
instances, climatic and forest information is available at a broad scale, making it challenging to conduct
the detailed local analyses required for accurate deforestation modeling. This lack of data granularity
can lead to models that fail to capture critical variations within regions, thus reducing the effectiveness
of conservation strategies based on these predictions. Additionally, the model’s performance was
affected by the inherent variability of the data and inconsistencies across departments. To manage
these inconsistencies, a feature selection process using Random Forest was employed, enabling the
identification of the most relevant variables for each region. Furthermore, separate models were
developed for each department to account for localized factors, improving overall accuracy. Exploring
methods to enhance data collection, such as leveraging remote sensing technology or deploying denser
sensor networks in key areas, is essential to address these limitations and improve future predictions.
   Given these limitations, it is necessary to consider implementing new models that can more effectively
handle the incomplete and sometimes irregular nature of the data. Models such as those based on
deep neural networks or reinforcement learning techniques can be useful for dealing with large data
sets with high dimensionality and possible information gaps. These models can be trained to learn
complex and nonlinear patterns that might be ignored by traditional methods such as linear regression.
In addition, hybrid approaches that combine different techniques, such as the use of Random Forest
algorithms together with time series models, could offer a robust solution by integrating multiple data
sources and providing more reliable and contextualized predictions. Thus, while data collection and
management present significant challenges, the exploration of advanced, adaptive models represents a
promising avenue for improving the accuracy and utility of predictive deforestation models. With an
appropriate approach to data collection and the use of advanced modeling technology, it is possible to
overcome these limitations and move towards more effective and sustainable conservation strategies.
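
One possible form of the hybrid approach mentioned above is to feed lagged values of a series to a Random
Forest and to validate it with time-ordered splits. The sketch below is only an assumption about how such
a combination could look: the helper make_lagged, the choice of three lags, and the synthetic series are
hypothetical, and the eight annual observations available for 2015 to 2022 would be too few for this
setup without additional data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit


def make_lagged(series, n_lags):
    """Build (X, y): each row of X holds the n_lags values preceding y."""
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - n_lags
    X = np.column_stack([series[i:i + n_rows] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y


# Synthetic series standing in for a longer deforestation record.
rng = np.random.default_rng(7)
t = np.arange(48)
series = 1.0 + 0.02 * t + 0.1 * np.sin(t / 3.0) + rng.normal(scale=0.02, size=t.size)

X, y = make_lagged(series, n_lags=3)

# Each split trains on the past and tests on the following block,
# so no future information leaks into training.
fold_mae = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print([round(m, 3) for m in fold_mae])
```

Tree-based models cannot extrapolate beyond the range of target values seen in training, so a strongly
trending series would typically be differenced or detrended before applying this kind of model.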
   We leave the deployment phase in CRISP-DM as future work, as the study is primarily research-
focused. The objective is to explore and validate the model’s accuracy and predictive capabilities, rather
than to implement it in real-world operational systems.


9. Conclusions
The variability of deforestation data among the departments necessitated segmenting the data by
region. This division enabled a more detailed and specific analysis, identifying the departments with the
highest deforestation rates, such as Atlántico, Sucre, Santander, and Meta. This approach facilitates the
development of more accurate predictive models adapted to each department. However, the accuracy
of these models can be influenced by the variable quality of historical data, which underscores the need
to improve data collection for future predictions and to ensure the applicability of models in different
contexts and regions.
   Linear regression models prove highly effective in predicting the annual rate of deforestation in
specific departments. However, variability in data quality and socioeconomic differences between
regions limit the generalizability of the results, suggesting that additional studies should be conducted
before applying these models to other geographic areas or countries.
   The predictive model developed showed high accuracy with metrics such as R², MSE, and MAE.
However, there is a possibility of bias in the data due to variability in the quality of IDEAM’s historical
records, which may affect the accuracy of the predictions. In addition, although the model was effective
in predicting deforestation in Atlántico, Sucre, Santander, and Meta, the results may not be generalizable
to other regions of Colombia or other countries due to environmental and socioeconomic differences.
Caution is advised when applying these models outside the context studied.
   The use of predictive models based on the CRISP-DM methodology has proven effective in predicting
the deforestation rate in different regions of Colombia. Linear regression models, in particular, have
demonstrated high accuracy in predicting the annual deforestation rate. This accuracy enables the early
identification of critical areas and the formulation of appropriate conservation strategies.


References
 [1] J. V. Solórzano, J. F. Mas, J. A. Gallardo-Cruz, Y. Gao, A. F.-M. d. Oca, Deforestation detection using
     a spatio-temporal deep learning approach with synthetic aperture radar and multispectral images
     199 (2023) 87–101. doi:https://doi.org/10.1016/j.isprsjprs.2023.03.017.
 [2] M. Leon, G. Cornejo, M. Calderón, E. González-Carrión, H. Florez, Effect of deforestation on
     climate change: A co-integration and causality approach with time series, Sustainability 14 (2022)
     11303. doi:10.3390/su141811303.
 [3] The Guardian,                  Deforestation in colombia falls to lowest level in 23
     years         (2024).          URL:        https://www.theguardian.com/world/article/2024/jul/10/
     deforestation-in-colombia-falls-to-lowest-level-in-23-years, accessed: 2024-07-11.
 [4] M. Kaselimi, A. Voulodimos, I. Daskalopoulos, N. Doulamis, A. Doulamis, A vision transformer
     model for convolution-free multilabel classification of satellite imagery in deforestation monitoring
     34 (2023) 3299–3307. doi:10.1109/TNNLS.2022.3144791.
 [5] D. C. J. Kani, S. Saudia, Analysis on the performance of machine learning models for forest fire
     prediction, in: 2023 5th International Conference on Smart Systems and Inventive Technology
     (ICSSIT), 2023, pp. 1–5. doi:10.1109/ICSSIT55814.2023.10060870.
 [6] J. Villalobos-Montiel, A. Aguilar-Gonzalez, L. Orona, C. Lozoya, Identifying deforested areas
     through convolutional neural network for drone reforesting, in: 2023 IEEE Conference on Tech-
     nologies for Sustainability (SusTech), 2023, pp. 138–143. doi:10.1109/SusTech57309.2023.
     10129558.
 [7] C. Schröer, F. Kruse, J. M. Gómez, A systematic literature review on applying CRISP-DM process
     model 181 (2021) 526–534. doi:10.1016/j.procs.2021.01.199.
 [8] M. Hrachowitz, M. Stockinger, M. Coenders-Gerrits, R. Van Der Ent, H. Bogena, A. Lücke,
     C. Stumpp, Deforestation reduces the vegetation-accessible water storage in the unsatu-
     rated soil and affects catchment travel time distributions and young water fractions (2020).
     doi:10.5194/hess-2020-293.
 [9] D. Lee, Y. Choi, MultiEarth 2022 deforestation challenge – ForestGump (2022). URL: https://arxiv.
     org/abs/2206.10831v1.
[10] S. Gu, The impact of increasing forest loss areas on the global temperature, and tourism industry
     9 (2023) 42–55. doi:10.9734/ajraf/2023/v9i3205.
[11] R. Kumar, A. Kumar, P. Saikia, Deforestation and forests degradation impacts on the environment,
     in: Environmental Degradation: Challenges and Strategies for Mitigation, Springer International
     Publishing, 2022, pp. 19–46. doi:10.1007/978-3-030-95542-7_2.
[12] M. J. Dourojeanni, ¿es posible detener la deforestación en la amazonía peruana?, in: Desafíos y
     perspectivas de la situación ambiental en el Perú: en el marco de la conmemoración de los 200 años
     de vida republicana, Pontificia Universidad Católica del Perú, 2022, pp. 247–285. doi:10.18800/
     978-9972-674-30-3.013.
[13] A. Manciu, A. Rammig, A. Krause, B. R. Quesada, Impacts of land cover changes and
     global warming on climate in colombia during ENSO events 61 (2023) 111–129. doi:10.1007/
     s00382-022-06545-1.


                                                   177
Alvaro Hernán Alarcón-López et al. CEUR Workshop Proceedings                                   165–178


[14] D. Mejía, M. Díaz, K. Enciso, A. Bravo, F. Florez, S. Burkart, The impact of agricultural credit on
     the cattle inventory and deforestation in colombia: A spatial analysis, 2022. doi:10.21203/rs.3.
     rs-2188032/v1.
[15] T. Kattenborn, J. Leitloff, F. Schiefer, S. Hinz, Review on convolutional neural networks (cnn)
     in vegetation remote sensing, ISPRS journal of photogrammetry and remote sensing 173 (2021)
     24–49.
[16] M. H. Coelho, O. O. Bittencourt, F. Morelli, R. Santos, Método para a classificação de Áreas
     queimadas baseado em aprendizado de máquina automatizado 13 (2022) 029–036. doi:10.14210/
     cotb.v13.p029-036.
[17] A. Bommert, T. Welchowski, M. Schmid, J. Rahnenführer, Benchmark of filter methods for feature
     selection in high-dimensional gene expression survival data, Briefings in Bioinformatics 23 (2022)
     bbab354. doi:10.1093/bib/bbab354.
[18] V. Ignatenko, A. Surkov, S. Koltcov, Random forests with parametric entropy-based information
     gains for classification and regression problems, PeerJ Computer Science 10 (2024) e1775.
[19] J. O. Ogunleye, Predictive data analysis using linear regression and random forest, in: Data
     integrity and data governance, IntechOpen, 2022. doi:10.5772/intechopen.107818.


                                                  178