<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Data-Driven Insights into Deforestation: Predictive Modeling in Colombian Regions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alvaro Hernán Alarcón-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ixent Galpin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Bogota-Jorge Tadeo Lozano</institution>
          ,
          <addr-line>Bogota</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Deforestation is a critical problem that affects biodiversity, climate patterns, and water quality. This article presents a predictive model for the deforestation rate in the regions of Colombia using the CRISP-DM methodology. Historical data from 2015 to 2022 from the Institute of Hydrology, Meteorology, and Environmental Studies (IDEAM) were used. Correlation analyses were performed and models were trained using Random Forest to obtain the predictor variables: deforested and regenerated area, stable forest area, net difference in forest cover area, and change in forest cover area. Additionally, the departments of Atlántico, Sucre, Santander, and Meta were identified as the regions with the highest deforestation rates. In the prediction process, linear regression models showed the highest accuracy, with an R² of 1.00. Finally, the importance of segmenting and analyzing data by region to obtain accurate predictions and take effective corrective measures is highlighted.</p>
      </abstract>
      <kwd-group>
        <kwd>Deforestation</kwd>
        <kwd>CRISP-DM methodology</kwd>
        <kwd>Random Forest</kwd>
        <kwd>Annual Deforestation Rate</kwd>
        <kwd>Machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Deforestation is a global issue that has garnered significant attention from scientists and
environmentalists due to its numerous adverse effects on the environment [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The loss of biodiversity and
alterations in climate patterns are among the most devastating consequences, profoundly impacting
ecosystem health and human well-being. Additionally, deforestation has disrupted watershed dynamics
and aquatic ecosystems, contributing to the deterioration of water quality and ecological habitats.
While deforestation in Colombia is reportedly at an all-time low [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], it is particularly serious for the
country due to its rich biodiversity and vital ecosystems, which are crucial for maintaining global
climate stability, water cycles, and the livelihoods of indigenous and local communities. To tackle
these challenges, various technologies and analytical methods have been developed to identify and
predict deforested areas, providing essential data for devising effective conservation and reforestation
strategies [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        In recent years, analytical and classification models have become crucial for understanding and
predicting areas affected by deforestation. For instance, drone imagery has been instrumental in
accurately identifying deforested areas and monitoring forest regeneration over time. These advanced
technologies have significantly enhanced the precision of deforestation detection and have optimized
reforestation efforts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Similarly, the application of machine learning for classifying satellite images has enabled the precise
identification of areas affected by forest fires and other disturbances. Moreover, studies utilizing
classification techniques to analyze deforestation trends underscore the importance of continuous
monitoring and conservation efforts. These tools and methods have also been employed to predict
future trends, aiding in the development of effective strategies for ecosystem preservation.</p>
      <p>
        This paper presents a predictive model for deforestation rates in various regions of Colombia using the
CRISP-DM methodology, a process model for data mining that has proven successful in research and development [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The objective is to provide a technological tool for the early implementation of corrective actions.
The structure of the work is as follows: Section 2 discusses the impacts of deforestation, including
biodiversity loss, decreased water availability, and increased soil erosion. Section 3 reviews previous
research on the prediction and mitigation of deforestation. In Section 4, the departments in Colombia
with the highest deforestation rates based on forest area are identified, and the data distribution is
analyzed. Section 5 establishes the correlation between predictor and target variables and quantifies
their importance. Section 6 identifies predictor variables by department and generates prediction models.
Section 7 evaluates the performance of each model using appropriate metrics for regression problems.
Finally, Section 9 presents the study’s findings and provides a comprehensive overview of the research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Deforestation</title>
      <p>
        Deforestation can lead to a series of long-term, observable consequences,
ultimately resulting in serious environmental issues. Its primary impacts include the reduction of
biodiversity, alteration of ecosystem functioning, and modification of carbon dynamics [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. Furthermore,
the loss of natural habitats for numerous species can cause the extinction of endemic flora and fauna,
thereby diminishing the unique biodiversity of each region.
      </p>
      <p>
        Another impact directly related to the reduction of forested areas is the significant increase in
global temperatures. This is due to the rise in carbon dioxide levels released into the atmosphere from
deforestation, which contributes to global warming and disrupts climate patterns [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As a result, these
changes intensify extreme weather events, adversely affecting human communities and ecosystems
that depend on a stable climate for their survival and well-being.
      </p>
      <p>
        Furthermore, deforestation impacts the availability of water in watersheds and alters its flow and
distribution. These changes have significant implications for terrestrial hydrological systems and the
ecosystems that depend on them [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Additionally, the loss of forest cover can lead to greater variability
in water flow, resulting in more frequent and severe droughts and floods.
      </p>
      <p>Another direct consequence of deforestation is the increase in soil erosion rates and the disruption of
nutrient and water cycles, which adversely affect the livelihoods of local communities. This degradation
reduces soil quality and its capacity to support agriculture. This situation underscores the importance
of implementing sustainable forestry, agricultural, and livestock practices to mitigate negative impacts
and safeguard the natural resources essential for human sustenance.</p>
      <p>
        On the other hand, in South America, deforestation has additional consequences related to the
reduction of glacier recharge, which feeds rivers and returns water to the Amazon. This situation poses
a serious threat to the future of agriculture in various natural regions, as it leads to atypical occurrences
of droughts and floods, thereby increasing the likelihood of environmental and forest disasters [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        In the regions of Colombia, deforestation leads to several critical issues, including reduced biodiversity,
altered ecosystem functioning, significant contributions to global temperature rise due to increased
atmospheric carbon dioxide, and disrupted climate patterns [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It also affects water availability in
watersheds, altering flow and distribution, and increases soil erosion rates, disrupting nutrient and water
cycles, thus adversely impacting the survival of living organisms. Furthermore, agricultural activities,
such as livestock farming, have exacerbated deforestation rates through unsustainable practices like
burning forest areas to create grazing land for cattle [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        Analysis and classification models are paramount for understanding and predicting deforestation and
its associated problems. Numerous studies have been conducted on this topic. One example is the use
of convolutional neural networks (CNN) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] for multiclass semantic segmentation, which enables
the identification of deforested areas from drone images. The goal of this research was to selectively
distribute seeds and monitor forest regeneration over time [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Another approach is the application of machine learning to satellite image classification to identify
areas affected by wildfires. This method has demonstrated high potential for accurately classifying
such images, utilizing metrics such as precision and average success rate [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Additionally, the use of
pre-trained convolutional neural networks (CNNs), combined with clustering algorithms like K-Means,
has enabled the precise identification of damaged forest areas. This provides an effective solution for
labeling satellite data, supporting rapid reforestation efforts [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        In another study by Kani et al., Random Forest (RF) classification was used to analyze deforestation trends, revealing a
gradual decrease in forest areas over the years. This study underscores the importance of continuous
monitoring and conservation efforts, emphasizing the need for immediate actions to prevent further
loss of forest areas [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. By leveraging these models and analytical techniques, it is possible not only
to accurately identify deforested areas but also to predict future trends. This capability contributes
to the development of effective reforestation and ecosystem preservation strategies, enabling more
sustainable and effective forest management.
      </p>
      <p>
        In the review conducted, no studies were found that attempted to develop prediction
models based on time series analysis for Colombia. This work is therefore novel in this field, as it
takes a distinct approach compared to existing research. This study aims to develop a
methodology to predict the deforestation rate in the regions of Colombia, to enable early corrective
actions. To achieve this, the well-established CRISP-DM methodology for data mining is
employed. This methodology is based on a hierarchical model distributed across different development
stages: business understanding, data understanding, data preparation, modeling, and evaluation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Data Understanding</title>
      <p>The dataset used in this study comprises 561 records across the 33 departments of Colombia, with
each department represented by 17 records. These records detail deforestation rates and related factors,
including variables such as stable forest area (SFA), deforested area (DA), and annual deforestation rate (ADR), among
others, over various time periods. The dataset includes key departments like Amazonas, Atlántico, Sucre,
Santander, and Meta, enabling a comprehensive analysis of deforestation trends. The even distribution
of data across these regions is essential for evaluating the robustness of the models and the complexity
of the analysis, ensuring that localized deforestation patterns are accurately captured and modeled.</p>
      <p>
        In this phase, the available and necessary resources are evaluated, and the objective of data mining is
determined. Data from secondary sources are collected and described, and their quality is verified by
statistical analysis, determining attributes and correlations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The data consist of a historical record
of environmental statistics provided by the Institute of Hydrology, Meteorology, and Environmental
Studies (IDEAM). Specifically, two datasets were used: ‘Change in the area covered by natural forest
according to Department Consolidated results between 1990-2022’ and ‘Annual deforestation rate
according to Department Consolidated results between 1990-2022’<sup>1</sup>. Complementary variables from
these datasets are used, with the understanding that they have identical time series and that both
contain segmented data from the 33 departments. The analysis of historical annual data from 2005 to
2022 for each of the 33 departments of Colombia is carried out to observe trends and changes over time.
Table 1 presents the data dictionary used for the analysis.
      </p>
      <p>To develop the prediction model, the annual deforestation rate (ADR) is selected as the dependent
variable. Due to the data dispersion, the models are developed by region. Consequently, the independent
(predictor) variables are defined separately for each department.</p>
      <p>Due to the segmentation of data by departments, it is essential to determine which departments
exhibit a higher deforestation rate relative to the proportion of stable forest area in each region. To
achieve this, a new dataset is generated that presents the calculated averages for each of the variables
by department. Additionally, a column is added to establish the relationship between NDAC (net
deforestation, calculated as the difference between DA and RA) and the stable forest area variable (SFA).</p>
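      <p>The ratio described above can be illustrated with a minimal Python sketch. It assumes hypothetical yearly records of the form (department, SFA, DA, RA) with invented values that are not IDEAM data; the function name ndac_sfa_by_department is likewise an assumption introduced only for illustration. The sketch averages each variable by department, derives NDAC as DA minus RA, and ranks departments by NDAC/SFA.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from statistics import mean

# Hypothetical yearly records: (department, SFA, DA, RA), areas in hectares.
# The numbers are invented for illustration and are not IDEAM data.
records = [
    ("Atlántico", 1000.0, 60.0, 10.0),
    ("Atlántico", 950.0, 70.0, 20.0),
    ("Amazonas", 100000.0, 400.0, 100.0),
    ("Amazonas", 99700.0, 500.0, 200.0),
]


def ndac_sfa_by_department(rows):
    """Average SFA, DA and RA per department, then derive NDAC and NDAC/SFA."""
    grouped = defaultdict(list)
    for dept, sfa, da, ra in rows:
        grouped[dept].append((sfa, da, ra))

    summary = {}
    for dept, values in grouped.items():
        sfa_avg = mean(v[0] for v in values)
        da_avg = mean(v[1] for v in values)
        ra_avg = mean(v[2] for v in values)
        ndac = da_avg - ra_avg  # net deforestation: DA minus RA
        summary[dept] = {"SFA": sfa_avg, "NDAC": ndac, "NDAC/SFA": ndac / sfa_avg}
    return summary


summary = ndac_sfa_by_department(records)
# Departments ordered from highest to lowest NDAC/SFA ratio.
ranking = sorted(summary, key=lambda d: summary[d]["NDAC/SFA"], reverse=True)
print(ranking)
```
]]></preformat>
      <p>In this toy input, Atlántico ranks first because its net deforestation is large relative to its small stable forest area, even though Amazonas loses more forest in absolute terms.</p>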
      <p>From this initial analysis, it was determined that the departments of Atlántico, Sucre, Santander, and
Meta exhibited the highest NDAC/SFA ratios. Consequently, these regions had higher deforestation
rates relative to the amount of forest in their territory. For this study, these four departments were
selected, and a summary of the analysis is presented in Table 2.</p>
      <p><sup>1</sup> http://www.ideam.gov.co/web/ecosistemas/bosques-y-recurso-forestal</p>
    </sec>
    <sec id="sec-5">
      <title>5. Data Preparation</title>
      <p>
        Data selection is performed by defining specific inclusion and exclusion criteria for the IDEAM dataset,
using various methods described in the corresponding section [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Initially, records with missing values
(NA) are eliminated and some variables are converted from decimals to integers to ensure consistency
in the analysis. Once the data are clean, we proceed to the construction of derived attributes, such as
the NDAC/SFA ratio, which could serve as an additional predictor variable in the model.
      </p>
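      <p>The preparation steps can be sketched with pandas as follows. This is an illustrative sketch only: the column names follow the abbreviations used in this paper, the values are invented, and the choice to round before converting to integers is an assumption, since the exact conversion rule is not stated.</p>
      <preformat><![CDATA[
```python
import pandas as pd

# Hypothetical extract; values are invented and are not IDEAM data.
raw = pd.DataFrame({
    "department": ["Sucre", "Sucre", "Meta", "Meta"],
    "SFA": [1200.4, None, 50000.6, 49800.2],
    "DA": [80.7, 75.0, 900.2, 950.9],
    "RA": [20.2, 18.0, 300.1, 310.4],
})

area_cols = ["SFA", "DA", "RA"]

clean = (
    raw.dropna()                                   # eliminate records with missing values
       .round({c: 0 for c in area_cols})           # assumed rule: round to the nearest unit
       .astype({c: "int64" for c in area_cols})    # decimals converted to integers
)

# Derived attributes: net deforestation and its ratio to stable forest area.
clean["NDAC"] = clean["DA"] - clean["RA"]
clean["NDAC_SFA"] = clean["NDAC"] / clean["SFA"]
print(clean)
```
]]></preformat>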
      <p>The Random Forest model is selected at this initial stage because it can handle large
numbers of variables and identify the most relevant features among them. Unlike some other
models, Random Forest predictions are largely robust to multicollinearity, and the model can efficiently handle data with high
dimensionality. This approach allows exploring the dataset in depth, identifying which
variables have the greatest impact on the prediction of the annual rate of deforestation.</p>
      <p>
        Furthermore, the Random Forest model provides a valuable measure of the importance of features,
which facilitates the identification of the most significant variables for prediction. This capability is
critical in high-dimensional studies, such as deforestation analysis, where it is crucial to determine which
variables have the greatest impact on the results. By prioritizing the most relevant features, Random
Forest helps reduce the risk of overfitting and improves the predictive capability of the model [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
This measure of importance not only guides model building but also provides a deeper understanding
of the factors driving changes in forest area.
      </p>
      <p>In this context, correlation plots were generated for all variables, as well as box plots and scatter
plots of the annual rate of deforestation for each of the four departments with the highest NDAC/SFA
ratio. These graphs allowed a clear and detailed visualization of the relationship between the selected
variables and annual deforestation, which confirmed the relevance of the chosen characteristics. Using
these visual methods in combination with the feature importance measure provided by Random Forest
ensures that the model is based on the most robust and reliable predictors available, thus optimizing
its ability to make accurate and useful predictions for deforestation management. Figure 3 shows the
correlation matrix for each of the selected departments.</p>
      <sec id="sec-5-1">
        <title>5.1. Significant Correlations</title>
        <p>In the department of Atlántico, the variables DA, RA, NDAC, NDAC/SFA, and CFA exhibit a high
correlation with ADR (annual deforestation rate (%)). The p-value for each of these variables in relation
to ADR was determined using the Mann-Whitney statistical test and was found to be less than 0.05
for all of them. This result rejects the null hypothesis, indicating that these variables could be strong
predictors. The calculated values are presented in Table 3.</p>
        <p>In the Sucre region, the variables DA, NDAC, CFA, and NDAC/CFA show a high correlation with
ADR (annual deforestation rate). The p-value calculations using the Mann-Whitney test yielded values
lower than 0.05 for each of these variables, leading to the rejection of the null hypothesis. Therefore,
it is concluded that these variables could be strong predictors. The calculated values are presented in
Table 4.</p>
        <p>Additionally, in the department of Santander, the variables DA, NDAC, CFA, and NDAC/CFA exhibit
a high correlation with ADR (annual deforestation rate). The p-value calculations using the
Mann-Whitney test yield values below 0.05 for each of these variables, leading to the rejection of the null
hypothesis. This suggests that these variables could be strong predictors. The calculated values are
presented in Table 5.</p>
        <p>In the Meta region, the variables SFA, RA, NDAC, and NDAC/CFA show a high correlation with ADR
(annual deforestation rate). The p-value calculations using the Mann-Whitney test yield values below
0.05 for each of these variables, leading to the rejection of the null hypothesis and indicating that these
variables could be strong predictors. The calculated values are presented in Table 6.</p>
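        <p>A check of this kind can be sketched with SciPy. The sketch below is an assumption about how such a test might be set up, not the procedure used in this study: the data are synthetic, Pearson correlation is an assumed choice of correlation index, and pairing a predictor sample directly against the ADR sample is also an assumption. Note that the Mann-Whitney U test compares the distributions of two samples, so its p-value does not by itself measure association.</p>
        <preformat><![CDATA[
```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-in for one department: 17 annual values (not IDEAM data).
da = rng.uniform(50, 150, size=17)                # deforested area
adr = 0.002 * da + rng.normal(0, 0.01, size=17)   # annual deforestation rate (%)

# Strength of the linear association between the predictor and ADR.
r, _ = pearsonr(da, adr)

# Two-sided Mann-Whitney U test on the two samples.
u_stat, p_value = mannwhitneyu(da, adr, alternative="two-sided")

print(f"r = {r:.3f}, U = {u_stat:.1f}, p = {p_value:.3g}")
```
]]></preformat>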
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Feature Relevance</title>
        <p>
          To confirm and quantify the importance of the variables, a Random Forest model is trained, as this
algorithm allows the results of multiple decision trees to be combined to reduce the risk of overfitting
and improve the generalization of the model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This is in contrast to linear and logistic regression,
which do not adequately capture complex, non-linear relationships between variables, and to other
models such as SVM and neural networks, which can be more costly and difficult to interpret. The data
were split into 80% training and 20% test sets, and the evaluation metric used was RMSE. The predictor variables
were defined as SFA, DA, RA, AWI, PAWI, NDAC, CFA, and NDAC/SFA, and the target variable as ADR.
In this way, the aim was to determine the importance of these features for each of the selected
departments. The results obtained can be seen in Figure 4.
        </p>
        <p>Figure 4 panels: (a) Atlántico, (b) Sucre, (c) Santander, (d) Meta.</p>
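        <p>The training and importance-ranking step can be sketched with scikit-learn. This is a minimal illustration under stated assumptions: the data are synthetic, only a subset of the predictors is generated, and the hyperparameters (n_estimators, random_state) are illustrative choices that are not taken from this study.</p>
        <preformat><![CDATA[
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200  # synthetic sample size, chosen only so the split is stable

# Synthetic predictors named after the paper's abbreviations (not IDEAM data).
X = pd.DataFrame({
    "SFA": rng.uniform(1e3, 5e3, n),
    "DA": rng.uniform(10, 100, n),
    "RA": rng.uniform(0, 30, n),
})
X["NDAC"] = X["DA"] - X["RA"]
X["NDAC_SFA"] = X["NDAC"] / X["SFA"]
y = 100 * X["NDAC_SFA"] + rng.normal(0, 0.05, n)  # invented target standing in for ADR

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# RMSE on the held-out 20%.
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Impurity-based feature importances; they sum to 1 across features.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

print(f"RMSE = {rmse:.4f}")
print((importance * 100).round(2))
```
]]></preformat>
        <p>Multiplying the importances by 100 gives percentage contributions of the same kind as those reported per department.</p>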
        <p>For the Atlántico department, the Random Forest model achieved an RMSE of 0.0692, indicating good
accuracy due to the relatively small average error. Regarding feature importance, NDAC is identified as
the most important variable, contributing 25.72%. DA also shows significant importance with 21.6%,
followed by CFA at 20.68% and the NDAC/SFA ratio at 18.78%. The remaining features each contribute
less than 5%.</p>
        <p>For the department of Sucre, the Random Forest model achieved an RMSE of 0.0904. The most
important variable is CFA, contributing 54.4%. DA follows with 13.94%, NDAC accounts for 12.83%, and
the NDAC/SFA ratio contributes 8.23%, while the remaining features each have an importance of less
than 5%. For the department of Santander, the RMSE is 0.0227. Here, CFA is again the most important
variable, contributing 63.32%. NDAC follows with 10.74%, DA accounts for 9.01%, and the NDAC/SFA
ratio contributes 7.98%, with the other features each contributing less than 5%.</p>
        <p>Finally, for the Meta department, the Random Forest model achieved an RMSE of 0.0772. The most
important variable is CFA, contributing 31.73%. SFA follows with 22.64%, NDAC accounts for 14.35%,
the NDAC/SFA ratio contributes 14.03%, and DA represents 8.9%. The remaining features each have an
importance of less than 5%.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Modeling</title>
      <p>
        The technique selection and model development phase is essential in the predictive modeling process,
as it determines the tools and methods that will be used to analyze the data. Therefore, to approach the
problem of forest area change and deforestation rate using the data provided by IDEAM, the technique
that best suits the nature of the problem and the quality of the available data must be chosen. It
is essential to consider that the selected methods must be able to handle both linear and nonlinear
relationships present in the data [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which will allow capturing the complex patterns inherent to the
deforestation process.
      </p>
      <p>In the case of IDEAM data, models such as linear regression, decision trees, SVM (Support Vector
Machines), and random forest are selected to predict the ADR (Annual Deforestation Rate). These
models are chosen because of their ability to combine linear and nonlinear techniques, which allows
for capturing diverse and complex patterns in the data.</p>
      <p>Each selected technique offers particular advantages that make it suitable for this type of analysis.
Linear regression is used to identify simple relationships between variables, providing a clear basis for
understanding how certain factors influence the rate of deforestation. On the other hand, decision trees
and random forest models are effective for capturing more complex interactions between variables,
being especially useful when working with data that exhibit nonlinear relationships. In addition, SVM
is especially valuable in high-dimensional scenarios, where the number of variables can complicate
other simpler methods.</p>
      <p>Although there are more advanced techniques, such as neural networks, they are not used in this case
due to their high computational demands and the need for large volumes of data for effective training.
Neural networks are powerful and can capture very complex patterns, but their implementation requires
significant resources and a larger data set than was available. Therefore, it was decided to use models
that offer a balance between predictive capability and computational efficiency. The source code used
to develop these models in Python is available on GitHub<sup>2</sup>, allowing other researchers to reproduce the
results or adapt the techniques to their datasets.</p>
      <sec id="sec-6-1">
        <title>6.1. Atlántico Region Model</title>
        <p>The predictive model for the annual rate of deforestation in the Atlántico region is developed after a
detailed analysis using the correlation index, the p-value, and the feature importance index. These
analyses allow the identification of the most relevant variables for the model. In this scenario, it is
identified that the variables NDAC, NDAC/SFA, CFA, DA, and RA are the most effective predictors for
modeling the annual rate of deforestation in the data from the Department of Atlántico. These variables
reflect a strong correlation with deforestation, suggesting that the combination of anomalous climatic
factors and the current forest situation is critical for predicting changes in forest cover in the region.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Sucre Region Model</title>
        <p>In the Sucre region, data analysis showed that several variables are essential for predicting the annual
rate of deforestation. The correlation index, p-value, and importance ranking of the characteristics
determined that the variables NDAC, NDAC/SFA, CFA, and DA should be used as predictors. The
integration of these variables into the analysis provides a robust framework that facilitates not only the
accurate prediction of the annual rate of deforestation but also strategic and informed forest management
decisions. This model reflects the specific realities of the region, providing a useful tool for adaptation
processes to environmental and social changes, and strengthening the strategies for the conservation
and sustainable use of forests in Sucre.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Santander Region Model</title>
        <p>In the case of the department of Santander, the variables that most influence the annual rate of
deforestation were identified, thanks to the analysis of the correlation indexes, the p-value, and the classification
of the importance of the characteristics. The study concluded that the variables NDAC, NDAC/SFA,
CFA, and DA are the most relevant for the predictive model. Their inclusion in the model provides a
solid basis for understanding the underlying drivers of deforestation, which is essential for developing
effective conservation strategies.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Meta Region Model</title>
        <p>The model designed for the department of Meta is based on a comprehensive analysis that has identified
the variables NDAC, SFA, NDAC/SFA, CFA, and DA as essential for predicting the annual rate of
deforestation. This model not only provides an accurate prediction of changes in forest area but also
acts as a valuable resource for informed decision-making in natural resource management, offering a
crucial tool for the formulation of adaptive and effective conservation strategies in the Meta region.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Results</title>
      <p>In the evaluation phase, model results are compared using R², Mean Squared Error (MSE), and Mean
Absolute Error (MAE) to assess model accuracy. To enhance the robustness of the evaluation, cross-validation is used. This technique divides
the data into multiple subsets, or folds, ensuring that the model is trained and tested on different
partitions of the data. A common approach is 5-fold cross-validation, where the data is split into five
parts, training the model on four parts and testing it on the remaining one, repeating the process five
times. This provides a more comprehensive assessment of the model’s performance, in particular with
regard to its generalization capabilities. The metrics are computed for each fold, and their average
values are used to determine the
overall accuracy of the models, offering a more reliable evaluation than a simple train-test split. This
section presents the interpretation of the obtained results to extract relevant and significant conclusions.
The results for each model by department are presented in Tables 7–10.</p>
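      <p>The evaluation procedure can be sketched with scikit-learn as follows. The sketch assumes synthetic data with a linear target and default or illustrative hyperparameters, none of which come from this study, so the printed scores say nothing about the results in Tables 7–10; it only shows how R², MSE, and MAE can be averaged over five folds for the four model families.</p>
      <preformat><![CDATA[
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)

# Synthetic data with a linear target (an assumption for illustration only).
X = rng.uniform(0, 1, size=(100, 4))
y = X @ np.array([0.5, -0.2, 0.3, 0.1]) + rng.normal(0, 0.01, 100)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "svm": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# scikit-learn reports error metrics as negated scores, so the sign is flipped below.
scoring = {
    "r2": "r2",
    "mse": "neg_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {
        "r2": scores["test_r2"].mean(),
        "mse": -scores["test_mse"].mean(),
        "mae": -scores["test_mae"].mean(),
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 4) for k, v in metrics.items()})
```
]]></preformat>
      <p>Because the synthetic target is linear by construction, linear regression scores highest here; with real data the ranking depends on the underlying relationships.</p>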
      <p>The results indicated that the linear regression model was the most accurate in predicting deforestation
rates across all the analyzed departments: Atlántico, Sucre, Santander, and Meta. The model achieved
R² values close to 1.0, reflecting its high accuracy in forecasting deforestation patterns in these regions
and demonstrating a superior ability to explain the variability in deforestation rates.</p>
      <p>In addition to linear regression, the random forest model also showed a competitive performance,
especially in the departments of Atlántico and Meta, with R² above 0.91. This model is known for its
ability to capture complex interactions between variables, which makes it particularly useful in contexts
where deforestation patterns are influenced by multiple interrelated factors. Although the random
forest did not outperform the linear regression model in terms of R², its results were close enough to
consider it a robust alternative, especially in scenarios where it is desirable to minimize the risk of
overfitting.</p>
      <p>On the other hand, decision tree models and support vector machines (SVM) presented lower
performance compared to linear regression and random forest. In the case of the decision tree model,
the R² values ranged from 0.76 to 0.91, indicating that, although effective, its ability to predict accurately
is lower than that of the aforementioned models. The SVM model, although useful in specific contexts,
showed the greatest limitations, with R² ranging from -0.16 to 0.93, suggesting that it may not be the best
choice for this type of analysis in regions with complex and highly variable data such as deforestation.</p>
      <p>The analysis of the mean squared error (MSE) and mean absolute error (MAE) supported the
conclusions obtained from R². In all departments, the linear regression not only presented the lowest MSE
and MAE values but also maintained remarkable consistency among the different data sets. This fact
reinforces the idea that linear regression is not only accurate but also stable in its performance, which
is crucial for the implementation of policies based on its predictions. The performance of the models
was carefully interpreted to draw relevant conclusions.</p>
      <p>Importantly, the superiority of linear regression could be due to the linear nature of the underlying
relationships between predictor variables and deforestation rate. However, the slight variability in the
performance of the models in different departments also underscores the importance of considering
specific regional characteristics when selecting the most appropriate model.</p>
      <p>In summary, the linear regression model emerged as the most effective tool for predicting deforestation
rates in the departments evaluated, providing highly accurate and reliable predictions. The random
forest stood out as a robust alternative, especially in more complex scenarios. The results obtained
underline the importance of a regionalized approach to predictive modeling.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Future Work</title>
      <p>The incorporation of variables such as temperature, humidity, and forest type into predictive models is
crucial for improving the accuracy of predictions, but it faces significant challenges in terms of obtaining
and managing these data. The quality and availability of accurate and updated information on these
variables can be difficult to guarantee, especially in remote regions or areas with limited infrastructure
for environmental data collection. The limited number of meteorological stations in certain forests and
variability in collection methods can generate inconsistencies that affect the accuracy of the model.
In addition, available historical data may not cover long enough periods to capture long-term trends,
limiting the model’s predictive capability.</p>
      <p>Another significant limitation lies in the temporal and spatial resolution of the data. In many
instances, climatic and forest information is available at a broad scale, making it challenging to conduct
the detailed local analyses required for accurate deforestation modeling. This lack of data granularity
can lead to models that fail to capture critical variations within regions, thus reducing the effectiveness
of conservation strategies based on these predictions. Additionally, the model’s performance was
affected by the inherent variability of the data and inconsistencies across departments. To manage
these inconsistencies, a feature selection process using Random Forest was employed, enabling the
identification of the most relevant variables for each region. Furthermore, separate models were
developed for each department to account for localized factors, improving overall accuracy. Exploring
methods to enhance data collection, such as leveraging remote sensing technology or deploying denser
sensor networks in key areas, is essential to address these limitations and improve future predictions.</p>
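      <p>The idea of fitting a separate model for each department can be sketched as follows. This is a deliberately simplified illustration: it assumes a single predictor and an ordinary least squares line per department, and it omits the Random Forest feature selection step described above. The helper names (fit_line, fit_per_department) and the records are hypothetical and are not part of the study's implementation or data.</p>
      <preformat>
```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_department(records):
    """Group (department, x, y) records and fit one line per department."""
    grouped = defaultdict(lambda: ([], []))
    for department, x, y in records:
        grouped[department][0].append(x)
        grouped[department][1].append(y)
    return {dept: fit_line(xs, ys) for dept, (xs, ys) in grouped.items()}


# Hypothetical records: (department, deforested area, deforestation rate).
# The numbers are invented for illustration and carry no real units.
records = [
    ("Meta", 1.0, 2.0), ("Meta", 2.0, 4.0), ("Meta", 3.0, 6.0),
    ("Sucre", 1.0, 1.5), ("Sucre", 2.0, 2.0), ("Sucre", 3.0, 2.5),
]

# One (slope, intercept) pair per department.
models = fit_per_department(records)

for department, (slope, intercept) in models.items():
    print(f"{department}: slope={slope:.2f} intercept={intercept:.2f}")
```
      </preformat>
      <p>Keeping one set of coefficients per department lets each region follow its own trend, at the cost of having fewer observations available for each fit.</p>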
      <p>Given these limitations, it is necessary to consider new models that can more effectively
handle the incomplete and sometimes irregular nature of the data. Models such as those based on
deep neural networks or reinforcement learning techniques can be useful for dealing with large,
high-dimensional data sets that may contain information gaps. These models can be trained to learn
complex and nonlinear patterns that might be missed by traditional methods such as linear regression.
In addition, hybrid approaches that combine different techniques, such as Random Forest
algorithms together with time series models, could offer a robust solution by integrating multiple data
sources and providing more reliable and contextualized predictions. Thus, while data collection and
management present significant challenges, the exploration of advanced, adaptive models represents a
promising avenue for improving the accuracy and utility of predictive deforestation models. With an
appropriate approach to data collection and the use of advanced modeling technology, it is possible to
overcome these limitations and move towards more effective and sustainable conservation strategies.</p>
      <p>We leave the deployment phase of CRISP-DM as future work, as the study is primarily
research-focused. The objective is to explore and validate the model’s accuracy and predictive capabilities, rather
than to implement it in real-world operational systems.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions</title>
      <p>The variability of deforestation data among the departments necessitated segmenting the data by
region. This division enabled a more detailed and specific analysis, identifying the departments with the
highest deforestation rates, such as Atlántico, Sucre, Santander, and Meta. This approach facilitates the
development of more accurate predictive models adapted to each department. However, the accuracy
of these models can be influenced by the variable quality of historical data, which underscores the need
to improve data collection for future predictions and to ensure the applicability of models in different
contexts and regions.</p>
      <p>Linear regression models prove highly effective in predicting the annual rate of deforestation in
specific departments. However, variability in data quality and socioeconomic differences between
regions limit the generalizability of the results, suggesting that additional studies should be conducted
before applying these models to other geographic areas or countries.</p>
      <p>The predictive model developed showed high accuracy with metrics such as R², MSE, and MAE.
However, there is a possibility of bias in the data due to variability in the quality of IDEAM’s historical
records, which may affect the accuracy of the predictions. In addition, although the model was effective
in predicting deforestation in Atlántico, Sucre, Santander, and Meta, the results may not be generalizable
to other regions of Colombia or other countries due to environmental and socioeconomic differences.
Caution is advised when applying these models outside the context studied.</p>
      <p>The use of predictive models based on the CRISP-DM methodology has proven effective in predicting
the deforestation rate in different regions of Colombia. Linear regression models, in particular, have
demonstrated high accuracy in predicting the annual deforestation rate. This accuracy enables the early
identification of critical areas and the formulation of appropriate conservation strategies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Solórzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Mas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Gallardo-Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. F.-M. d</surname>
          </string-name>
          . Oca,
          <article-title>Deforestation detection using a spatio-temporal deep learning approach with synthetic aperture radar and multispectral images</article-title>
          <volume>199</volume>
          (
          <year>2023</year>
          )
          <fpage>87</fpage>
          -
          <lpage>101</lpage>
          . doi:10.1016/j.isprsjprs.2023.03.017.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Leon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cornejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Calderón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>González-Carrión</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Florez</surname>
          </string-name>
          ,
          <article-title>Effect of deforestation on climate change: A co-integration and causality approach with time series</article-title>
          ,
          <source>Sustainability</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <elocation-id>11303</elocation-id>
          . doi:10.3390/su141811303.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>The Guardian</collab>
          ,
          <article-title>Deforestation in Colombia falls to lowest level in 23 years</article-title>
          (
          <year>2024</year>
          ). URL: https://www.theguardian.com/world/article/2024/jul/10/deforestation-in-colombia-falls-to-lowest-level-in-23-years, accessed: 2024-07-11.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaselimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Voulodimos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Daskalopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Doulamis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doulamis</surname>
          </string-name>
          ,
          <article-title>A vision transformer model for convolution-free multilabel classification of satellite imagery in deforestation monitoring</article-title>
          <volume>34</volume>
          (
          <year>2023</year>
          )
          <fpage>3299</fpage>
          -
          <lpage>3307</lpage>
          . doi:10.1109/TNNLS.2022.3144791.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. C. J.</given-names>
            <surname>Kani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saudia</surname>
          </string-name>
          ,
          <article-title>Analysis on the performance of machine learning models for forest fire prediction</article-title>
          ,
          <source>in: 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICSSIT55814.2023.10060870.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalobos-Montiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aguilar-Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Orona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lozoya</surname>
          </string-name>
          ,
          <article-title>Identifying deforested areas through convolutional neural network for drone reforesting</article-title>
          ,
          <source>in: 2023 IEEE Conference on Technologies for Sustainability (SusTech)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>138</fpage>
          -
          <lpage>143</lpage>
          . doi:10.1109/SusTech57309.2023.10129558.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schröer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kruse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <article-title>A systematic literature review on applying CRISP-DM process model</article-title>
          <volume>181</volume>
          (
          <year>2021</year>
          )
          <fpage>526</fpage>
          -
          <lpage>534</lpage>
          . doi:10.1016/j.procs.2021.01.199.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrachowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stockinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coenders-Gerrits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Van Der</given-names>
            <surname>Ent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bogena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lücke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stumpp</surname>
          </string-name>
          ,
          <article-title>Deforestation reduces the vegetation-accessible water storage in the unsaturated soil and affects catchment travel time distributions and young water fractions</article-title>
          (
          <year>2020</year>
          ). doi:10.5194/hess-2020-293.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>MultiEarth 2022 deforestation challenge - ForestGump</article-title>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2206.10831v1.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>The impact of increasing forest loss areas on the global temperature, and tourism industry</article-title>
          <volume>9</volume>
          (
          <year>2023</year>
          )
          <fpage>42</fpage>
          -
          <lpage>55</lpage>
          . doi:10.9734/ajraf/2023/v9i3205.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Saikia</surname>
          </string-name>
          ,
          <article-title>Deforestation and forests degradation impacts on the environment</article-title>
          ,
          <source>in: Environmental Degradation: Challenges and Strategies for Mitigation</source>
          , Springer International Publishing,
          <year>2022</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>46</lpage>
          . doi:10.1007/978-3-030-95542-7_2.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Dourojeanni</surname>
          </string-name>
          ,
          <article-title>¿es posible detener la deforestación en la amazonía peruana?</article-title>
          ,
          <source>in: Desafíos y perspectivas de la situación ambiental en el Perú: en el marco de la conmemoración de los 200 años de vida republicana</source>
          , Pontificia Universidad Católica del Perú,
          <year>2022</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>285</lpage>
          . doi:10.18800/978-9972-674-30-3.013.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Manciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rammig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Quesada</surname>
          </string-name>
          ,
          <article-title>Impacts of land cover changes and global warming on climate in Colombia during ENSO events</article-title>
          <volume>61</volume>
          (
          <year>2023</year>
          )
          <fpage>111</fpage>
          -
          <lpage>129</lpage>
          . doi:10.1007/s00382-022-06545-1.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mejía</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Enciso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bravo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Florez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Burkart</surname>
          </string-name>
          ,
          <article-title>The impact of agricultural credit on the cattle inventory and deforestation in Colombia: A spatial analysis</article-title>
          ,
          <year>2022</year>
          . doi:10.21203/rs.3.rs-2188032/v1.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kattenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leitloff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schiefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hinz</surname>
          </string-name>
          ,
          <article-title>Review on convolutional neural networks (cnn) in vegetation remote sensing</article-title>
          ,
          <source>ISPRS Journal of Photogrammetry and Remote Sensing</source>
          <volume>173</volume>
          (
          <year>2021</year>
          )
          <fpage>24</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Bittencourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <article-title>Método para a classificação de Áreas queimadas baseado em aprendizado de máquina automatizado</article-title>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>029</fpage>
          -
          <lpage>036</lpage>
          . doi:10.14210/cotb.v13.p029-036.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bommert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Welchowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rahnenführer</surname>
          </string-name>
          ,
          <article-title>Benchmark of filter methods for feature selection in high-dimensional gene expression survival data</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <elocation-id>bbab354</elocation-id>
          . doi:10.1093/bib/bbab354.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ignatenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Surkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koltcov</surname>
          </string-name>
          ,
          <article-title>Random forests with parametric entropy-based information gains for classification and regression problems</article-title>
          ,
          <source>PeerJ Computer Science</source>
          <volume>10</volume>
          (
          <year>2024</year>
          )
          <elocation-id>e1775</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Ogunleye</surname>
          </string-name>
          ,
          <article-title>Predictive data analysis using linear regression and random forest</article-title>
          ,
          <source>in: Data integrity and data governance</source>
          , IntechOpen,
          <year>2022</year>
          . doi:10.5772/intechopen.107818.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>