1. Introduction

I. Capralova);

Explainable Spatial Modeling of Groundwater Nitrate Concentrations in the Netherlands

Iulia Capralova

i.capralova@student.rug.nl 0

Juan Cardenas-Cartagena

0 0 University of Groningen, Faculty of Science and Engineering , Nijenborgh 9, 9747 AG Groningen , The Netherlands

2025

000 0 0001

Nitrate leaching from soil is a major source of groundwater pollution that threatens both environmental and public health. In the Netherlands, intensive agricultural practices make the country vulnerable to nitrate contamination, requiring constant monitoring of groundwater quality. Since there are few measurement sites, there is a need for interpretable estimation methods via spatial modeling. This paper develops an explainable spatial regression model for estimating nitrate concentrations in groundwater in the Netherlands, using spatial and environmental factors. Three models were tested: Ridge Linear Regression, Random Forest, and Extreme Gradient Boosting (XGBoost). The ensemble of Random Forest and XGBoost explained 66% of the variance in nitrate levels, indicating competitive performance suitable for interpolation and map-based interpretation of spatial nitrate variability in the Netherlands. Key factors influencing nitrate leaching, including soil type, loam content in the soil, and groundwater depth, were identified using both model-specific and model-agnostic methods. Modeled spatial estimated maps show that between 2017 and 2023, nitrate concentrations decreased in the east and northeast, while they increased in the south and parts of the central Netherlands. This modeling framework can support targeted environmental policy decisions aimed at reducing groundwater pollution.

Nitrate leaching groundwater monitoring geospatial analysis spatial regression explainable artificial intelligence

1. Introduction 1.1. Motivation

Nitrate (NO3 – ) is a form of inorganic nitrogen that is found in soil and groundwater. It acts as a primary source of nitrogen for plants, supporting their growth and development [1]. Nitrate occurs naturally in groundwater through processes such as rock weathering and biological activity, resulting in a background level of approximately 2.5 mg/L [2]. However, in recent decades, nitrate levels in groundwater have increased worldwide due to human activities such as heavy use of nitrogen-based fertilizers and animal manure in areas with intensive farming. This causes nitrate to leak from the soil into the groundwater, leading to pollution of water bodies. Today, nitrate leaching is a serious environmental problem in many farming regions across Europe and beyond [3].

The imbalance of nitrate concentration in groundwater has consequences for both nature and human health. When nitrate enters surface waters such as lakes and rivers, it causes eutrophication (i.e., rapid growth of algae) which reduces oxygen levels and disrupts aquatic life [4]. Over time, this contamination afects the quality of drinking water, causing health issues such as methemoglobinemia and increasing the risk of gastric cancer [5, 6, 7, 8].

To protect water quality, the European Union (EU) created the Nitrates Directive (91/676/EEC), which sets a safe limit of 50 mg/L of nitrate concentration in groundwater [9]. It requires each member country to monitor nitrate levels, identify vulnerable areas, and take steps to reduce pollution. While this program has helped to improve the overall situation in many EU countries, in some local areas, nitrate contamination still remains an issue. The Netherlands, as one of the leading agricultural producers in the world, regularly monitors and reports nitrate levels through the Dutch National Minerals Policy Workshop on AI-driven Data Engineering and Reusability for Earth and Space Sciences (DARES’25), co-located with the 28th

CEUR Workshop Proceedings

ceur-ws.org

ISSN1613-0073 Monitoring Program (LMM). Between 1992 and 2019, a 50% reduction in hotspot areas was achieved [10]. Despite this progress, challenges remain. For example, in sandy agricultural regions in the South of the Netherlands, shallow groundwater (5 - 15 meters) still exceeds the EU nitrate limit [11].

These localized spikes in nitrate levels indicate that policies should be tailored to local conditions rather than relying solely on national targets. To achieve this, authorities require an understanding of nitrate distribution across the country. However, monitoring nitrate levels across the country at high resolution and frequency remains a challenge. Taking nitrate samples is both expensive and time-consuming. Thus, it is measured once or twice per year at limited locations. This leaves gaps in our understanding and makes it dificult to identify high-risk areas.

This problem motivates the development and application of data-driven spatial regression models to explain how both natural conditions and human activities contribute to variations in these patterns across regions. These models can help identify polluted areas where measurements are unavailable, support groundwater protection plans, and inform more sustainable farming practices tailored to the needs of specific regions.

Furthermore, this study aligns with the United Nations Sustainable Development Goal 6, which aims to protect and use water resources responsibly. A key target under this goal is to improve water quality by reducing pollution, minimizing the release of hazardous chemicals, and halving the proportion of untreated wastewater [12]

1.2. State-of-the-Art

Nitrate leaching has been spatially estimated using geostatistical methods that use spatial and temporal autocorrelation, relying on Tobler’s First Law of Geography, “everything is related to everything else, but near things are more related than distant things” [13]. Methods such as co-kriging and disjunctive kriging are based on this principle, as they estimate values at unsampled locations by leveraging the similarity between nearby observations. Incorporating temporal information has been shown to reduce estimation uncertainty further and improve spatial risk assessments [14, 15]. However, these methods require a minimum number of observations to compute reliable variograms1, which limits their applicability in regions with sparse monitoring data. Moreover, focusing primarily on geographic proximity can result in omitting important environmental interactions and biogeochemical processes in the nitrogen cycle (see Section 2).

In recent years, machine learning (ML) has become a preferred method for environmental and soil studies. It can handle large datasets and capture complex patterns found in groundwater quality data. Among ML techniques, tree-based models are commonly used for spatial nitrate estimation due to their inherent model-specific interpretability and relatively low computational cost [ 16].

Paper written by Mahlknecht et al. [17] applied Extreme Gradient Boosting (XGBoost) to estimate nitrate concentrations across Mexico, achieving an accuracy of 0.80 despite limited monitoring data. They identified rainfall, elevation, and slope as key predictors and produced spatial risk maps that exposed major nitrate hotspots. Similarly, Covatti et al. [18] used Random Forest (RF) to map nitrate levels in Swiss groundwater, integrating SHAP values to interpret feature importance. The model explained 58% of the nitrate variance. The model revealed that factors like seasonal precipitation and soil organic carbon emerged as strong predictors. In the United States, Ransom et al. [19] developed a 3D XGBoost model to estimate nitrate across two drinking water zones, achieving an 2 of 0.49 on hold-out data. Their analysis showed that well depth, soil and climate characteristics, and the absence of developed land were key factors influencing nitrate levels at the national scale. These studies demonstrate the ability of tree-based models to capture complex nitrate dynamics in the environment.

In the Dutch context, Spijker et al. [20] have developed an RF model to map nitrate leaching from agricultural soils using environmental predictors. The model explained 58% of the variance and highlighted land use and elevation as being key variables. However, the study was limited by its narrow temporal scope (restricted to 2017), reliance on a model-specific interpretability method only, and 1A tool in geostatistics used to analyze the spatial continuity of data by quantifying the degree of dissimilarity between data points as a function of distance. the absence of key factors influencing the nitrogen cycle, such as population pressure, precipitation, temperature, and detailed soil properties, including chemical composition. These limitations are addressed in our study through broader environmental inputs, multi-year analysis, and model-agnostic interpretability methods.

1.3. Contributions

Building on the work of [20], our study aims to advance nitrate leaching estimation in the Netherlands by ( 1 ) increasing the temporal scope of both model training and estimation, (2) incorporating a broader range of environmental and agricultural predictors based on the nitrogen cycle, and (3) applying modelagnostic and model-specific interpretability techniques. Therefore, our research has the following objective: The development of an explainable spatial regression model for estimating nitrate concentrations in groundwater in the Netherlands, using spatial and environmental data from agricultural soils.

2. A Brief Introduction to the Nitrogen Cycle

The nitrogen cycle describes the natural processes through which nitrogen moves between the atmosphere, soil, water, and living organisms, which is summarized in Figure 1 [21]. Its understanding helps in feature selection and data preprocessing steps.

According to Fowler et al. [22], the nitrogen cycle begins when nitrogen enters the soil through atmospheric fixation, where lightning converts N2 into reactive nitrogen compounds that are deposited with rainfall. Moreover, in agricultural areas, natural inputs are enhanced by the application of synthetic fertilizers, as well as organic sources such as livestock manure and urine. Additionally, urban areas primarily contribute through wastewater discharge from sewage systems into soil and water bodies. The efects are seen in densely populated regions, where higher concentrations of people are associated with greater wastewater production, fertilizer use, and nitrogen emissions [23]. These emissions, particularly ammonia from agriculture and nitrogen oxides from trafic and industry, settle onto the land as nitrogen deposition in soil.

Once in the soil, nitrogen undergoes a series of microbial transformations that convert it into forms plants can absorb. These processes are shaped by environmental conditions: soil texture influences oxygen and water flow, temperature afects microbial activity, and moisture levels determine whether nitrification or denitrification is favored. Eventually, part of this nitrate is taken up by plants, with uptake varying depending on the type of vegetation and the depth of the roots. For instance, according to Nouri et al. [24], deep-rooted or fast-growing crops tend to absorb more nitrates.

When the amount of nitrate in the soil exceeds the plant’s uptake capacity, nitrate leaches into groundwater. Since nitrate (NO3 – ) is highly soluble in water and poorly retained by soil particles due to its negative charge, it is easily washed out from the root zone under the influence of precipitation, and groundwater table [22, 25]. Moreover, soil texture determines how water moves through the soil. For example, sandy soils, due to large particles, let water with nitrate pass through quickly [26, 27]. Conversely, the fine texture and low permeability of clay soils restrict water flow, which limits nitrate leaching [28]. Additionally, temperature regulates the activity of bacteria involved in nitrate production, thereby influencing nitrification and denitrification rates [ 29]. Elevation further influences nitrogen dynamics by shaping drainage, runof, and the accumulation of water and nutrients in low-lying areas [30].

Apart from leaching, nitrogen can also be removed from the soil system through denitrification, a microbial process that transforms nitrate into gaseous nitrogen compounds (N2, N2O) released into the atmosphere. This process closes the nitrogen cycle by returning nitrogen to the atmosphere. The overall cycle ofers insights into relevant data and features for developing explainable ML-based estimators for nitrate leaching.

3. Methods

This study models nitrate leaching in agricultural soils across the Netherlands by training regression models on georeferenced nitrate measurements and spatial-temporal features. Once trained, models are applied across a spatial grid, where each geographic location contains corresponding values for all predictors. The outcome is an estimated map, showing the spatial patterns and trends of nitrate leaching across the country. The following sections describe the data exploration analysis, the modeling framework, and the evaluation methods used to assess model performance.

3.1. Data Collection

The target variable is the nitrate concentration in groundwater. Measurements were obtained from well sensors logged in the DINO Loket platform [31]. The samples were collected according to the Dutch protocol NTA 8017 [32], which ensures standardization in sampling procedures and data quality. The sampling frequency for monitoring wells ranges from once to twice a year, which mostly falls in the winter months. Samples taken within city boundaries were excluded from the analysis.

Each well consists of multiple filters, which are screened intervals at diferent depths designed to isolate hydrogeological zones. For consistency and comparability across wells, we use data from Filter 1, the shallowest filter, which directly reflects nitrate leaching from the root zone of agricultural soils.

The predictor and target variables used in this study are summarized in Table 3 in Appendix A.2. Their selection was based on the Nitrogen Cycle (see Section 2). These datasets are derived from a range of sources and measurement techniques, which are described in Appendices A.4 and A.5. In total, the dataset contained 7620 samples and 18 features.

3.1.1. Study area

For this study, the focus is on the entire territory of the Netherlands, which covers an area of approximately 41,543 km2 [33]. It lies within the Rhine–Meuse river basin, which explains relatively shallow groundwater levels [34]. The elevation ranges from 7 meters below sea level (in South Holland) to above 322 meters (in Limburg). The country has polders, which are areas of land taken from the sea or rivers and kept dry by dikes.

Unconsolidated Quaternary sediments of fluviatile, marine, or glacigenic origin dominate the subsurface of the Netherlands. The country is characterized by four main soil types: peat, clay, sand, and loess soils (see Figure 2a). Peat areas are mostly found in low-lying regions, clay soils dominate the coastal and riverine floodplains, sandy (50% of the area) soils are common in the eastern and southern parts of the country [35], and loess soil is found only in the South Netherlands (see Figure 2a).

Approximately 66% of the land is used for agriculture, while around 16% is covered by forests [36, 37]. The remaining area consists of urban settlements, water bodies, and infrastructure.

The Netherlands has a temperate maritime climate, with mild winters (3–6 °C), cool summers (17–20 °C), and a yearly average temperature of around 10 °C. Rainfall is evenly distributed throughout the year, averaging 800–900 mm annually, with most of the precipitation occurring in autumn and winter [38]. These conditions create a predisposition for nitrate leaching, especially after dry summers when nitrogen accumulates in the soil and is flushed out by winter rains [ 39].

3.1.2. Data Inspection

To understand general trends in nitrate leaching and prepare for further analysis, we first look at the available data.

The monitoring network covers a total of 874 unique groundwater locations sampled between 2008 and 2023. Table 2 and Figure 9 in Appendix A provide an overview of the monitoring network. The table summarizes the number of unique sites sampled each year, with coverage peaking in 2015 (591 sites) and reaching its lowest in 2023 (274 sites), while the figure shows the combined spatial distribution of all monitoring sites across the country.

Figure 2b displays the distribution of nitrate concentrations across diferent soil regions: clay, loess, peat, and sand. In clay and peat regions, nitrate levels are low, with median concentrations of 0.05 mg/L and maximum values well below the EU legal limit of 50 mg/L. On the other hand, anaerobic conditions in peat lands support the conversion of nitrate back into nitrogen gas through denitrification. The loess region shows moderate nitrate levels, with a median of 8.37 mg/L, and some values reach the EU limit. In contrast, sandy soils have the highest nitrate concentrations and variability, with a median value of 0.07 mg/L but maximum values exceeding 100 mg/L, more than twice the EU threshold.

Appendix A.6 describes the preprocessing data pipeline designed to prepare training and test datasets for model training and evaluation.

(a) Map of soil regions in the Netherlands.

(b) Distribution of nitrate concentrations across soil regions.

3.2. Model development 3.2.1. Algorithm selection

The goal of this study is to develop an explainable model of nitrate leaching across the Netherlands. For this reason, models with clear interpretations were prioritized. Three diferent models were chosen: • Ridge Linear Regression, used as the baseline model, assumes a linear relationship between predictor and target variables. The resulting coeficients indicate both the direction (positive or negative) and strength of each variable’s relationship with nitrate concentrations. • Random Forest, proposed by Breiman [40], is an ensemble model that constructs a number of decision trees, each trained on random subsets of data and features.

• XGBoost, introduced by Chen and Guestrin [41], is a gradient boosting model that builds decision trees sequentially, where each new tree aims to correct the errors made by previous trees.

To combine the strengths of the best-performing models, an ensemble is created by taking an inverse-variance weighted average of the models’ estimations [42].

All models are implemented using the scikit-learn library [43]. For XGBoost, the scikit-learncompatible XGBRegressor interface from the XGBoost package is used [41].

3.2.2. Performance evaluation

In order to gain diferent perspectives on the model performance, the following evaluation metrics were used: • Coeficient of determination ( 2) measures the proportion of variance in the target variable that is explained by the model. An 2 value closer to 1 indicates better estimation performance: • Root Mean Square Error (RMSE) gives higher weight to large errors by squaring the residuals before averaging and taking the square root: 2( , )̂ = 1 − ∑ =1 ( − ̂ )2 ∑ =1 ( − )̄ 2 RMSE( , )̂ =

∑( − ̂ )2 MAE( , )̂ = ∑ | − ̂ | 1 √ =1 1 =1 • Mean Absolute Error (MAE) calculates the average absolute diference between estimated and actual values:

In all three equations ( 1 ), (2), (3), is the true value for sample , ̂ is estimated value for sample , and is number of samples. Additionally, in ( 1 ) ̄ is the mean of the true values.

3.2.3. Hyperparameter optimization

To control the complexity and regularization strength of the chosen models, a hyperparameter search is performed. Specifically, the

Randomized Search Cross-Validation implementation from scikit-learn [43] is used to sample 60 unique hyperparameter combinations from the predefined search space. Each configuration was evaluated using a 5-fold time series cross-validation scheme ( Time-Series Split) to preserve temporal structure in the data and avoid information leakage between training and validation folds. Model performance during hyperparameter tuning was evaluated using MAE in (3). And 2, in ( 1 ), was used as the refitting criterion for the final model selection.

Appendix B gives an overview of the hyperparameter search space along with the chosen ones for Ridge Linear regression (see Table 4), Random Forest (see Table 5), and XGBoost (see Table 6).

3.2.4. Model training

Once the optimal configuration for each model was identified, the corresponding model was retrained on the whole training set and evaluated on the held-out test set. The objective of the training procedure was to minimize the MAE, in (3). During model selection, 2, in ( 1 ), was used as the refitting metric to prioritize models that explain more variance in the data. At the same time, RMSE, in (2), was also monitored as a secondary evaluation criterion.

To understand how each model generalizes as the amount of training data increases, a learning curve analysis was implemented based on repeated time series cross-validation. Once the best hyperparameters 1–25 ( 1 ) (2) (3) were found for each model, the best-performing estimator was cloned and retrained across multiple increasing subsets of the training data. Specifically, for each training subset fraction, the corresponding data was split using a 10-fold Time-Series Split, and the model was fitted in each fold. Validation and training scores were computed using MAE and were stored across the folds to assess the learning pattern.

Resultant learning curves for Random Forest and XGBoost can be found in Appendix B on Figures 14a, and 14b respectively.

4. Results 4.1. Model results and estimated maps

The performance of each model was evaluated on the held-out test set (882 samples), using the three metrics described in Section 3.2.2.

As shown in Table 1, the Ridge Linear Regression model achieved the lowest performance, with an 2 of 0.23, an MAE of 3.79 mg/L, and an RMSE of 7.58 mg/L. It was outperformed by both Random Forest and XGBoost with 2 values of 0.649 and 0.638, MAEs of 2.10 mg/L and 2.04 mg/L, and RMSEs of 5.13 mg/L and 5.20 mg/L, respectively. Therefore, Random Forest and XGBoost were included in the Ensemble model, with weights of 0.52 and 0.48, respectively. Resultant ensemble achieved the best overall performance, with an 2 of 0.66, an MAE of 2.02 mg/L, and an RMSE of 5.05 mg/L.

To further assess the quality of each model, Figure 3 presents scatter plots comparing the estimated versus actual nitrate concentrations, and Figure 4 demonstrates the corresponding residuals.

In Ridge Linear Regression (see Figures 3a and 4a), estimations align with the diagonal for values below 10 mg/L. For higher concentrations (>15 mg/L), estimated values cluster below the diagonal, showing underestimation. Residuals increase with expected values, reaching up to 50 mg/L, and show a downward trend across the x-axis.

The estimation patterns of Random Forest (see Figures 3b and 4b) and XGBoost (see Figures 3c and 4c) are similar with minor diferences. XGBoost estimates high nitrate concentrations ( >40 mg/L) more accurately, while Random Forest performs better for mid-range values (20-30 mg/L). Residuals in XGBoost, in the range of 20-30 mg/L, are less clustered and more dispersed compared to those in Random Forest. Residual plots for both models show that residuals range approximately from -30 to 40 on the y-axis.

In the Ensemble model (see Figures 3d and 4d), the estimation and residual patterns are similar to those of Random Forest and XGBoost. For several points in the 10-30 mg/L and >40 mg/L ranges, residuals are smaller compared to those in the individual models. Some over- and underestimations present in RF and XGBoost are reduced.

Although the performance gains are small, the ensemble provides a more stable estimation. Thus, it is used for nitrate spatial modeling. Its performance on the test set of the year 2021 can be seen in Figure 5. The Ensemble model captures the spatial pattern of nitrate concentrations across the Netherlands, specifically lower levels in the north and coastal areas, and high concentrations in the southern region of North Brabant (highlighted in the red square).

4.2. Spatial modeling

Ensemble, as the best performing model, was used with the predictor variables to generate nitrate maps at a resolution of 500 by 500 meters for the years 2010, 2017, 2021, and 2023 (see Figure 8).

Most of the Netherlands shows low nitrate concentrations, with values below 5 mg/L. The spatial pattern of increased nitrogen levels in region A (Noord-Brabant and Limburg) and region B (east of Gelderland and Overijssel) remains stable through four maps. However, some changes in nitrate distribution and intensity are still present over time.

Figure 7 shows the diference in estimated nitrate concentrations for the years 2017, 2021, and 2023, each compared to 2010. In 2017 (see Figure 7a), nitrate concentrations increased in areas of region A (Noord-Brabant) and region B (Overijssel and Drenthe). In these areas, nitrate values are up to 10 mg/L higher than in 2010.

In 2021 (see Figure 7b), regions A and B still show an increase in nitrate levels. However, the pattern is diferent compared with 2017 (see Figure 7a). In region A, the increased nitrate remains, but in region B there is a significant decrease.

The pattern in both regions from 2021 is further developed in 2023 (see Figure 7c), with a slight decrease in nitrate levels in Region A and an even greater decline in Region B. However, a new hotspot with elevated nitrate concentrations emerged in the northern part of Drenthe.

4.3. Importance and influence of predictor variables

(a) Model-specific feature importance. (b) Model-agnostic feature importance.

To understand which predictors influence the nitrate leaching, feature importance was evaluated using model-specific and model-agnostic interpretability methods.

Figure 6a presents model-specific measures of feature relevance: Gini importance for both Random Forest and XGBoost, and feature coeficients for Ridge Linear Regression. Scores from each model were normalized to sum to 1, and only features shared across all three models are shown. Among them, “Thick soil type” is highlighted as the most important predictor. The models also agreed on the importance of “Loam content”, “Sand median”, and “Silt content”.

Figure 6b presents the model-agnostic interpretability results based on LIME, Permutation Importance, and SHAP values. Across all three, “Thick soil type” and “Groundwater table” are found to be the most important features, although the importance values difer. For example, LIME gives “Thick soil type” the highest score (around 0.3), while SHAP and Permutation give it slightly lower scores. Moreover, all three methods further agreed on the importance of “Loam content”, “C:N ratio”, and “Elevation”, however, with lower importance scores.

Taken together, both model-specific and model-agnostic approaches identified “Thick soil type” and “Loam content” as important predictors of nitrate leaching. Moreover, the strength of their importance is consistent across methods, with “Thick soil type” receiving the highest scores, and “Loam content” receiving the lowest ones.

5. Discussion 5.1. Interpretation of the results

The main goal of this study was to develop an explainable spatial regression model for estimating nitrate concentrations in groundwater in the Netherlands, using spatial and environmental data from soils.

Among all tested models, the Ensemble demonstrated the best performance, explaining 66% of the variance in nitrate levels with an RMSE of 5.05mg/L, which is considered strong for large-scale applications. It outperforms comparable studies such as Ransom et al. [19] in the United States ( 2 = 0.49), Covatti et al. [18] in Switzerland ( 2 = 0.58, RMSE = 7.8 mg/L), and Spijker et al. [20] in the Netherlands ( 2 = 0.58). As the main model framework was adapted from the approach by Spijker et al. [20], our results imply that increasing the training dataset and introducing new predictor variables, such as detailed soil properties, allowed the models to capture complex interactions more accurately.

From a practical perspective, this level of estimation performance means that the Ensemble model is suitable for spatial screening and groundwater risk assessment but should not be used as a precise indicator, as it consistently underestimates high nitrate concentrations (>30 mg/L). This is a common limitation of regression models [44]. Therefore, policy decisions near regulatory thresholds should not rely solely on model outputs but use them to guide targeted monitoring eforts.

5.2. Nitrate trend through years

Analyzing the spatial nitrate concentration distribution through the years based on Figure 8, it is clear that the lowest levels are found in peat, loess, and clay areas of the west and north of the country. However, as mentioned in Section 4.2, region A (North Brabant and Limburg) and region B (east of Gelderland and Overijssel) consistently show some of the highest modeled nitrate concentrations in the Netherlands, reaching 50 mg/L. This trend can be explained by the presence of sandy soils, which are known for their low capacity to retain water and nutrients [27]. These estimations coincide with [20] nitrate estimations and online reports from the LMM [45].

Based on the diference map of 2017 (see Figure 7a) the situation of high nitrate increased in Regions A and B. This can be explained by the increase in nitrogen surplus after 2015 [ 46] mainly due to higher emissions and greater use of artificial fertilizers, despite more nitrogen being removed with the crops.

In the following years, diference maps in 2021 in Figure 7b and 2023 in Figure 7c show that the increased nitrate concentrations persist in region A, while in region B the levels mostly return to the level of year 2010. This can be explained by the diferences in sandy soils and land use practices. In Sand South, there is a predominance of dry, well-drained sandy soils (16% dry soils) and the relatively low proportion of reclaimed peat soils. Together they lead to faster nitrate leaching compared to wetter regions [47]. In addition, the agricultural structure difers across the sand subregions: Sand South has a lower share of grassland (dairy farms account for 46% of the area), which is associated with higher nitrogen leaching than grassland-dominated areas [48].

The decline in nitrate concentrations in Sand North and Sand Central after 2017 can be attributed to both environmental and agricultural factors. In Sand North, dairy farming accounts for 49%, while in Sand Central, dairy farming accounts for 67% [48, 45]. Both subregions have a relatively high share of grassland on dairy farms, which retains more nitrogen and results in lower nitrate leaching compared to arable land. These agricultural patterns, combined with a higher proportion of reclaimed peat soils in Sand North (49%), promote denitrification. This helps explain why nitrate concentrations peaked in 2017 but subsequently decreased, with Sand North even dropping below the EU threshold of 50 mg/L in recent years [49, 47].

The sharp nitrate increase observed in northern Drenthe in the diference map of 2023 (see Figure 7c) occurred in an area covered by coniferous and deciduous forests. This finding is unexpected, as forests absorb nitrate in large quantities, resulting in low nitrate leaching to groundwater [50]. This anomaly may be explained by the limitations of this study in Section 5.4.

5.3. Analysis of factors

Both model-agnostic and model-specific feature importance analyses identified the main soil classification, “Thick earth soil,” as the most important predictor of nitrate leaching in the Netherlands. Formed over centuries of deep fertilization and tillage, these soils develop a sandy top layer called the screed [51], which makes it easier for water and nutrients to move downward (see Section 2). This increases leaching risk, especially with heavy fertilizer use and irrigation. Although thick soils can store more nitrate, their structure also allows it to reach groundwater quickly. These soils are mainly found in the higher cover sand regions of Brabant, Gelderland, and Overijssel, matching the areas with the highest estimated nitrate levels (see Figure 8).

Loam content also emerged as a key predictor. Loam, a balanced mix of sand, silt, and clay, holds water and nutrients well while still draining, which slows nitrate leaching. In this study, lower loam content was linked to higher nitrate levels, suggesting that sandier, more permeable soils raise leaching risk, consistent with previous findings [ 27].

Groundwater depth was also identified as important by the model-agnostic method. SHAP values show that greater groundwater depth is associated with higher nitrate concentrations, which may seem counterintuitive, since deeper groundwater could allow more nitrate to be filtered through the soil [25]. However, in the Netherlands, deeper groundwater tables are typically found in elevated sandy regions with intensive agriculture and high nitrate inputs, as also noted by Spijker et al. [20]. Thus, groundwater depth here is likely an indicator of broader landscape characteristics.

5.4. Limitations

The spatial mismatch between nitrate and groundwater measurements (see Appendix A.6) may introduce uncertainty, as reflected by the moderate correlation between these variables (maximum correlation coeficient of = 0.6). This suggests that spatial alignment issues could weaken the precision of the model’s relationships. Also, missing land use data for some years was filled using maps from nearby years, which may have hidden short-term changes afecting nitrate leaching. And, while the ensemble model explained 66% of the variance in nitrate concentrations, a substantial proportion of variability remains unexplained. This residual variance likely arises from missing explanatory variables, such as detailed fertilizer application data. While the EU’s Farm Accountancy Data Network (FADN) does collect farm-level information on fertilizer usage, only aggregated data is publicly available [52]. This lack of high-resolution fertilizer data restricts the ability to capture field-level variations in nitrate input. These factors highlight the need for more detailed spatial and temporal data, and the development of more comprehensive models to further improve the estimation of nitrate leaching in Dutch groundwater.

6. Conclusions

This project developed an explainable machine learning models for estimating nitrate concentrations in Dutch groundwater, utilizing a combination of spatial and environmental data. The best performing model is the Ensemble model that could explain 66% of the variance in nitrate concentrations. By integrating both model-specific and model-agnostic interpretability methods, the study identified thick type of soils, loam content, and groundwater depth as the most important predictors of nitrate leaching risk. The results highlight the vulnerability of elevated sandy regions. Moreover, the results show that overall nitrate concentrations remain low across most of the country, with high values only in a few regions, suggesting that the EU regulatory limit has been efective in protecting groundwater quality.

There are two main directions for future research. First, to improve predictive performance and better explain nitrate variability, future models should integrate additional predictors related to local agricultural practices and fertilizer application. Second, in this study the validation methods do not account for spatial patterns. While a time-based split was applied to prevent temporal leakage, the training and test samples remain geographically close to one another. This spatial proximity may introduce correlation, which may result in performance estimates that do not accurately reflect the models’ ability to generalize to new locations. Therefore, a spatial k-fold cross-validation proposed by Pohjankukka et al. [53] should be implemented.

Building on the findings of this project, future development should prioritize decision-support tools based on these machine learning models to guide local nitrate management and policy decisions. Such a tool could help policymakers and citizens to identify areas most at risk and tailor interventions more precisely.

Reproducibility

For information about the datasets utilized, refer to Section 3.1 and Table 3 in Appendix A.2. The code and experiments of this project can be found in the repository: github.com/IuliaCapralova/NitrateSpatial-Modeling-NL.

Declaration on Generative AI

While writing this paper, the authors used ChatGPT (OpenAI) and Grammarly for formatting assistance and to improve the writing style. All suggestions from these tools were reviewed and edited as needed by the authors, who take full responsibility for the final content.

Acknowledgments References

The authors thank Dr. P. Pradhan and Dr. J. Spijker for their discussions on environmental challenges in the Netherlands, Dr. H. de Weerd for his feedback on the analysis presented in this work, and the University of Groningen for providing access to the Hábrók High Performance Computing cluster.

Appendices A. Data insights

This appendix further investigates the dataset by analyzing correlations between target and predictor variables and examining time series and spatial covariates. It also includes an overview of data sources, availability, key summary statistics, and highlights outliers through visualization.

A.1. Spatial distribution of monitoring wells

This section focuses on an overview of the spatial distribution of monitoring locations. Figure 9 shows the combined distribution of all locations on the map, while Table 2 lists the number of unique monitoring locations per year. The wells are spread throughout the country, with higher densities in provinces such as Utrecht, North-Brabant, and North Holland. persons/km2 CBS [56] Nitrogen deposition kg/ha/year

RIVM [57]

Variable

Nitrate (NO3–) Elevation Population Groundwater table Precipitation Temperature Organic matter Land use Soil region Soil type C:N ratio Soil calcic content -idth Soil loam content Sand median diame- m ter Sand silt content Soil acidity Bulk density

A.2. Overview of Variables Used in Modeling A.3. Correlation map

To explore the relationships between nitrate concentrations and the explanatory variables presented in Table 3, a Pearson correlation analysis was conducted. Since Pearson correlation is only meaningful for continuous variables, the analysis was limited to these features. Figure 10 shows the resulting correlation matrix based on the absolute values of pairwise correlations, allowing comparison of both positive and negative relationships in terms of strength.

Unit mg/L m m WUR [54] Rijkswaterstaat [55]

A.4. Exploration of timeseries covariates

Groundwater table Groundwater table was included as a key explanatory variable as its level may accelerate nitrate washing out from the soil [25]. The dataset consisted of groundwater depth measurements collected from monitoring wells between 2008 and 2023. The measurements were obtained from a network of permanent monitoring wells equipped with automated sensors (pressure transducers) that recorded water levels at sub-daily to daily intervals. Although only a few wells had uninterrupted data across all years, the analysis included any well with a continuous (“clean”) segment of measurements within this period, provided the segment exceeded 60 days. Given the high density of wells available, only the longest segment per well was used, and the rest were discarded to avoid redundancy. The raw data from each well often featured variable sampling frequencies, sometimes with intervals shorter than a day and with occasional missing periods. To standardize the data and address missing values, the depth measurements were first resampled to a uniform daily frequency using mean aggregation. Short gaps of up to five days were then filled by linear interpolation, making the time series consistent. In case this gap is exceeded, rows were considered as missing. The distribution of the ifnal groundwater table dataset is shown in Figure 12a.

One challenge in this study was that the locations where groundwater depth was measured did not always match the nitrate measurement sites. To solve this issue, each nitrate measurement was linked to the closest groundwater well within 5 km. This radius was chosen based on the correlation between the distance between wells and groundwater table (see Figure 11). While this method made it possible to use more data, it also added some uncertainty. The correlation between groundwater depth and the distance reaches a maximum value of 0.61, highlighting this limitation.

Temperature and Precipitation Precipitation and temperature were included as explanatory variables to capture the influence of weather conditions on soil processes and nitrate mobility. Daily meteorological data were obtained from the Royal Netherlands Meteorological Institute [62], which operates a network of automatic weather stations (AWSs) that provide synoptic, high-quality measurements validated according to NEN-EN-ISO/IEC 17025 standards. The original data were recorded at an hourly frequency and were resampled to daily averages prior to further processing. Data were available for the period 2008–2023 and were sourced from 18 stations distributed across the area of interest. Each nitrate sampling location to the nearest weather station within a 20 km radius. The distribution of both precipitation and temperature across all stations is shown in Figure 12b and 12c, respectively.

For each unique sampling point, aggregated precipitation and temperature values were computed based on the nearest available station. Next, both temperature and precipitation variables were aggregated using the autocorrelation function (ACF) analysis as described in Appendix A.6.

A.5. Exploration of spatial covariates

Land use The land use data were sourced from the LGN (Landelijk Grondgebruiksbestand Nederland) dataset [60], which provides nationwide, high-resolution maps derived from a combination of satellite imagery and aerial photography. For more recent years, Sentinel-2 satellite data and imagery from the National Satellite Data Portal were utilized, while the 2021 update also included detailed aerial photographs. The spatial resolution of the maps is 5 meters, enabling precise assignment of land use classes to each sampling location. The land use map was available for the following years: 2008, 2012, 2018, 2019, 2020, 2021, 2022, and 2023. For years, lacking a land use map, the closest available year was adopted: Nitrogen (N) deposition Annual spatial maps of nitrogen deposition for the period 2008–2023 were obtained from the national Data on Deposition in the Netherlands (GDN), developed by the Dutch National Institute for Public Health and the Environment [57]. These maps provide estimates of both wet and dry deposition, which are major sources such as ammonia NH3 and nitrogen oxides NOx. Deposition values are modeled using the OPS atmospheric transport model, which integrates oficial emission inventories, land-use information, and year-specific meteorological data. For accuracy, modeled estimates are calibrated against ground-based measurements from the national monitoring network, employing a co-kriging approach that addresses spatial uncertainty and variations in measurement reliability. The resulting calibrated maps, with a spatial resolution of 250 x 250 meters, provide a spatial representation of nitrogen inputs, distribution of which is available in Figure 12d.

Population density Annual gridded population maps for the years 2008–2023 were sourced from Statistics Netherlands (CBS) [33], with a spatial resolution of 500 by 500 meters. The dataset is based on the Dutch Personal Records Database (BRP), which provides register-based demographic information and has replaced traditional census approaches. In accordance with privacy regulations, grid squares containing fewer than five residents are not published. For the purposes of this study, such missing values were imputed with zeros, reflecting uninhabited areas such as forests, water bodies, or agricultural ifelds. The distribution of population density across nitrate sampling locations is shown in Figure 12e. Elevation Elevation data were sourced from the Actueel Hoogtebestand Nederland (AHN) [55], a national digital elevation model produced using laser altimetry from aircraft. The accuracy of the map is provided by GPS corrections and merging overlapping data swaths. The resultant map has 0.5 by 0.5 meter resolution. The elevation distribution of nitrate well locations can be seen in Figure 12f. Soil region The classification of soil regions was based on the Landelijk Meetnet efecten Mestbeleid (LMM) dataset, a national monitoring network established to evaluate the environmental efects of the Dutch manure policy. The LMM dataset established soil regions across the Netherlands according to factors such as dominant soil type, hydrology, and land use. The resultant soil regions can be seen in Figure 2a and nitrate distribution across soil region in Figure 2b.

Soil properties For this study, such soil-related attributes were included as bulk density (see Figure 12g), soil acidity (see Figure 12h), C:N molar ratio (see Figure 12i), calcic content (see Figure 12j), Fe-dith content (see Figure 12k), loam content (see Figure 12l), sand particle diameter (see Figure 12m), and silt content (see Figure 12n).

The properties were derived from the Soil Map of the Netherlands (Bodemkaart van Nederland) [54], which is based on the BIS-4D digital soil dataset [63], both of which are developed under the Basisregistratie Ondergrond (BRO) framework and released in 2024. These datasets integrate field sampling (systematic drillings, laboratory analyses, and expert classification) with geospatial modeling and machine learning, resulting in a 25-meter spatial resolution map. For each variable, values were assigned based on the top 0–8 cm soil layer.

Given the 16-year time frame of this study, the static map was used to represent nitrate levels throughout the entire period. However, it is suficient because most soil properties change over multidecadal timescales (20–50 years or more) [64]. More dynamic properties like organic matter content, pH, and C:N ratio typically vary slowly over 10–20 years under stable land use and management [65]. (a) Groundwater table (b) Precipitation (c) Temperature (d) Nitrogen deposition (e) Population density (f) Elevation (g) Bulk density (h) Soil acidity. (i) C:N molar ratio (j) Calcic content (k) Fe-dith content (l) Loam content in the soil (m) Sand particle diameter (n) Silt content in the soil

A.6. Data preprocessing

This appendix covers the steps involved in data preparation for modeling. It covers such steps as handling missing values, feature engineering, spatial and temporal alignment, train-test split, and scaling.

First, missing values were handled. For the target variable, all rows with missing values were removed, as they appeared to be missing at random. Nitrate levels are irregularly sampled throughout the year, making substantial gaps between measurements (usually several months). However, for the predictor variables, the issue of missing values was resolved using a combination of resampling, imputation, or removal, depending on the data source and measurement frequency. For details, refer to Appendices

A.4 and A.5.

Second, feature engineering steps were performed. It has been shown that variables such as temperature, precipitation, and groundwater table have a long-term impact on the nitrate leaching process [66, 67]. Therefore, these variables were aggregated by a fixed window size, which was uniquely defined for each variable through an autocorrelation function (ACF). It is a statistical tool that measures the strength of the relationship between current values in a time series and past values at various time lags. This way, the patterns in data over time can be identified. The following equation gives the function: ( ) =

− ∑=1 ( − ) (̄ + − ) ̄

∑=1 ( − ) ̄ 2 where, is the autocorrelation at lag , is the value of the time series at time , ̄ is the mean of the time series, and is the total number of observations. This results in a value between -1 and 1 that indicates the strength and direction of correlation at lag . (a) Autocorrelation of daily temperature. A positive (b) Autocorrelation of daily precipitation. A positive correlation is observed up to 73 lags.

correlation is observed up to 4 lags. (c) Autocorrelation of daily groundwater. A positive

correlation is observed up to 55 lags. series values. The dashed line represents 95% confidence bounds.

Based on ACF plots in Figure 13, the following window sizes were determined: ( 1 ) Temperature: 73 days; (2) Precipitation: 4 days; and (3) Groundwater table: 55 days.

Third, data alignment was performed both temporally and spatially. It should be noted that a spatial mismatch happened between the nitrate and groundwater table datasets, meaning that the measurement locations for nitrate and groundwater depth did not always coincide. To resolve this, each nitrate observation was paired with groundwater depth data from the nearest well within a 5 km radius. This choice was based on observed correlations between groundwater levels and distance (see Appendix A.4 Figure 11). A similar strategy was applied for weather data: temperature and precipitation. Values for these variables were taken from the closest weather station within a radius of 20 km of the nitrate monitoring site.

Fourth, while the majority of groundwater samples have low nitrate levels, there are several outliers with high concentrations. The small number of such extreme cases may limit model generalization, as it will have fewer examples to learn from when estimating these rare cases [68]. Therefore, samples with nitrate levels higher than 58 mg/L were removed from the dataset, resulting in 20 outlier data points.

Fifth, the dataset was split chronologically into training and testing sets. The training set consists of 4,439 samples spanning 13 years (2008–2020), while the testing set contains 882 samples from the last 3 years (2021–2023), corresponding to a split of approximately 83/17.

Sixth, data transformations were applied. Categorical variables were one-hot encoded to avoid ranking or ordinal assumptions between them. Continuous variables were normalized using Min-Max

Scaler, bringing them to a [0,1] range using the following equation:

− min max − min , where represents an individual raw value from a given feature in the dataset. min and max denote the minimum and maximum values observed in that feature in the training set, respectively.

Once the entire dataset has undergone all the preprocessing steps, it is ready for use in model training and evaluation, which are described in the following sections.

To generate a map of estimated nitrate levels across years, a 500 x 500 m grid covering the country is created, excluding urban areas. Each grid cell was matched with input features consistent with the training data.

B. Training insights B.1. Hyperparamter values

Hyperparameter Search space

Selected value Selector estimator alpha ( ) {0.1, 1.0, 10, 20, 30} Max features for selector {10, 20, 30, 40} Ridge regression alpha ( ) {0.1, 1.0, 10.0, 15.0} 250 None 2 1 sqrt

1. 2008 data was used for 2009 2. 2012 data was used for

2010 , 2011 , 2013 , and 2014 3. 2018 data was used for 2015 , 2016 , and 2017