ENHANCEMENT OF ACCURACY IN SOIL UREA
                                ESTIMATION USING MACHINE LEARNING TOOLS⋆

                                Sulaxana Vernekar1, †, Marlon Sequeira 2, ∗, †, Sameer M. Patil3, †, Jivan Parab 2, †and Prof.
                                Gourish Naik,2, †
                                1
                                  GVM'S GGPR College of Commerce and Economics
                                2
                                  Electronics Programme, School of Physical and Applied Sciences,Goa University, Goa, India
                                3
                                  Dnyanprassarak Mandals College & Research Centre, Mapusa, Goa India


                                                Abstract
                                                Soil health is vital for a good crop yield. Timely analysis of the available soil nutrients helps conserve
                                                soil fertility and also supports a good crop yield while limiting external inputs to the soil such as
                                                fertilizers and water. The use of AI in agriculture has recently been explored to optimize crop yield, and
                                                machine learning techniques are used to develop smart soil sensing systems that estimate the
                                                distribution of soil nutrients accurately. In this study, a set of 40 spectra in the frequency range of
                                                500 MHz to 1000 MHz was passed to the ParLeS software. PLSR with cross validation in ParLeS gave an
                                                RMSE of 2.87. When Ridge regression, a machine learning method, was applied with the parameter alpha
                                                set to 0.005, an RMSE of 1.02 was obtained. For this dataset, the machine learning method therefore
                                                yields better results than the traditional method. In addition, ParLeS needs a LabVIEW-type
                                                environment and external graphics support, whereas Ridge regression can be implemented in a simple
                                                Python environment, which is now among the most widely used programming languages, and does not
                                                require graphics support.

                                                Keywords
                                                ParLeS, PLSR, Ridge Regression


                                1. Introduction
                                   Agriculture is the backbone of any thriving economy. Advancements in technology have strongly
                                influenced the way agriculture is practiced. Smart farming has paved the way for sustainable
                                agriculture and increased crop productivity. Soil fertility plays an important role in crop
                                production [1]. The nutrients available in the soil strongly influence the crop yield. Cultivating
                                crops continuously without proper analysis of the soil can deteriorate its health and eventually
                                leave the soil arid. Smart farming techniques are based on micro-management of the farm, taking into
                                consideration the spatial and temporal variability exhibited by soil. This enables proper
                                management of external inputs to the soil, such as fertilizer and pesticide application [2]. Proper
                                understanding and knowledge of the soil enables farmers to take sound decisions in crop
                                management, thus enhancing crop productivity [3]. There are several issues that need to be


                                SCCTT-2024: International Symposium on Smart Cities, Challenges, Technologies and Trends, 29th Nov 2024, Delhi, India
                                1∗
                                   Corresponding author.
                                †
                                  These authors contributed equally.
                                    sulaxgoa@gmail.com (S. Vernekar); marlon@unigoa.ac.in (M. Sequeira); sameer@dmscollege.ac.in (S. Patil);
                                jsparab@unigoa.ac.in (J. Parab); gmnaik@unigoa.ac.in(G Naik)
                                   (S. Vernekar); 0000-0002-7462-3492; 0000-0002-1444-5428 (M. Sequeira); 0000-
                                0002-0848-7349, 0009-0007-3674-1942 (S. Patil); (J. Parab)
                                           © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073
tackled in agriculture, such as lack of digitization, food safety issues, ecological problems and an
inefficient agri-food supply chain. Integration of Industry 4.0 in agriculture can greatly improve
productivity, agri-food supply chain efficiency, food safety, and the sustainable use of natural
resources [4].

    Artificial Intelligence (AI) is among the most rapidly growing technologies and is being embedded
into many aspects of human life. In agriculture, AI technologies can be used in precision farming for
soil and irrigation management, weather forecasting, plant growth, disease prediction, and animal
management [5]. With the exponential growth and development of data processing, information
technology, and artificial intelligence, smart farming makes use of cutting-edge innovations to boost
productivity, reduce labor stress, and automate soil and crop management [6]. Smart soil prediction is
a low-cost method of forecasting a soil's performance over a wide range of crops.

    Digital soil mapping (DSM) is used to generate digital maps of the type and quality of soil by
combining soil sensing data with environmental factors [7]. Recent years have seen a significant rise
in the use of DSM in soil science, which can be attributed to the integration of several ideal factors,
including, but not limited to, tremendous interest in quantitative and spatial soil information, the
buildup of databases of predicted or interpreted soil properties together with thoroughly known
environmental factors, and the development of computational methods combined with computer
resources to extract these stores of soil data [8]. Obtaining exact data on soil nutrient composition is
a critical step in the implementation of precision agriculture and DSM is providing a potential
breakthrough [9]. AI tools such as fuzzy systems, decision trees, expert knowledge, machine learning
algorithms, and deep learning methodologies can be used to provide more accurate forecasts and
solutions in DSM [10].

   AI models and DSM have been utilized in soil fertility prediction, offering a decision-making tool
capable of forecasting the best crop based on soil pH, soil nutrients, soil moisture, environmental
variables, and other components [11]. It was observed from a study conducted on prediction of soil
nutrients using spectroscopic data that using Machine Learning (ML) techniques greatly improves
the accuracy of soil nutrient prediction [12]. In another study, ML algorithms were used to find the
relationship between independent and dependent variables in soil data. The independent variables
were moisture, temperature, soil pH and Cation Exchange Capacity (CEC), and the dependent
variables were Nitrogen, Phosphorus and Potassium (NPK). The study found relationships among
Phosphorus, Potassium, soil pH and CEC, and among Nitrogen, soil moisture and temperature [13].

   In a review of machine learning methods for predicting soil properties, agricultural yield, and soil
fertility, it was observed that Random Forest (RF) and deep learning techniques surpass traditional
ML algorithms for soil prediction. Depending on the model's inputs, Random Forest and deep learning
techniques can reliably forecast soil conditions and the crop to be grown. The review also found that
inaccurate data can reduce forecasting precision. Variations in geographical elements, meteorological
conditions, and farming techniques can hamper the generalization of models. Furthermore, selecting
relevant features from numerous influencing factors requires subject expertise and testing [1].
2. Methodology
To obtain the radio-frequency (RF) spectra of the samples, a measurement cell was designed based on
the dielectric behaviour of the sample. The design details of the cell are discussed in [14].


Figure 1: Experimental setup
The experimental setup consists of a cell placed at the centre of an iron box, as shown in Figure 1.
A Signal Hound USB-TG44A tracking generator and a Signal Hound USB-SA124B spectrum analyzer
were used to obtain the RF response. The sample is placed in the cell and the RF signal from the
tracking generator is injected into the cell through the central copper wire. The strength of the signal
reduces due to the dielectric loss offered by the sample solution as the signal propagates towards the
receiver end.

The RF spectrum analyzer connected at the receiver end of the cell captures a signal proportional to
the radiation loss due to the sample solution. The cell can hold 15 ml of liquid. Soil samples were
prepared in the laboratory by mixing five components (urea, potash, sodium chloride, calcium
carbonate and phosphate) in distilled water. A molar solution of each component was prepared, and
the amount of each component to be added to 15 ml of water was calculated. For urea, the required
amount was 225 mg/15 ml. The amounts for the remaining components were calculated in the same
way. The amount of each component to be added is shown in Table 1.

Table 1: Amount of each component for each concentration denotation
      Concentration     Amount (mg/15 ml)
      denotation        Urea      Potash    Phosphate   Lime      Salt
      0.5               112.5     139.7     1890        187.5     109.87
      1                 225       279.4     3780        375       219.74
      1.5               337.5     419.1     5670        562.5     329.61
      2                 450       558.8     7560        750       439.48
      3                 675       838.2     11340       1125      659.25
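
The rows of Table 1 appear to be multiples of the row for denotation 1. The short sketch below reproduces the table under that assumption; the function name `amounts_for` and the reading of the denotation as a plain multiplier are illustrative choices, not something stated by the authors.

```python
# Amounts for denotation 1, in mg per 15 ml, copied from Table 1.
BASE_MG_PER_15ML = {
    "urea": 225.0,
    "potash": 279.4,
    "phosphate": 3780.0,
    "lime": 375.0,
    "salt": 219.74,
}


def amounts_for(denotation):
    """Scale every base amount by the concentration denotation.

    Assumption: the denotation is a simple multiplier of the base amount.
    """
    return {
        name: round(base * denotation, 2)
        for name, base in BASE_MG_PER_15ML.items()
    }


if __name__ == "__main__":
    for denotation in (0.5, 1, 1.5, 2, 3):
        print(denotation, amounts_for(denotation))
```

Under this assumption every entry of Table 1 is reproduced except the salt amount for denotation 3, which comes out as 659.22 instead of the listed 659.25.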
Figure 2: RF spectra of 40 samples in the frequency range 500 MHz to 1000 MHz (x-axis: frequency in MHz; y-axis: absorption in A.U.)

Figure 2 shows the RF spectra of the 40 samples used to build the model for soil urea estimation.
These samples were prepared by adding the components at the different concentrations listed in
Table 1.

Soil urea was estimated from the spectral data of the 40 samples using two methods. The first used
the ParLeS software, which is based on a Partial Least Squares Regression (PLSR) model. The second
used a machine learning algorithm, namely Ridge regression.

ParLeS is a chemometrics software for multivariate modelling and prediction. It provides users with
various algorithm options to transform, preprocess and pretreat spectra. It may be used to implement
principal components analysis (PCA); partial least squares regression (PLSR) with leave-n-out cross
validation; and bootstrap aggregation-PLSR (bagging-PLSR). ParLeS facilitates the implementation
of a large number of preprocessing techniques as well as bagging-PLSR, which can improve the
robustness and accuracy of PLSR models. Other unique features of ParLeS include the provision of a
number of assessment statistics and graphical output as well as a user-friendly interface and
functionality [15].

In this study, a set of 40 spectra in the frequency range of 500 MHz to 1000 MHz was passed into the
ParLeS software. PLSR with leave-n-out cross validation was used for soil urea estimation, with
n = 5. This gave an RMSE of 2.87. A screenshot of the ParLeS software is shown in Figure 3.
Figure 3: Screenshot of ParLeS software

Ridge regression, also known as L2 regularization, is one of the many regularization techniques
applied to linear regression models. Regularization is a statistical method used to prevent errors due
to the overfitting of training data. Ridge regression is specifically tailored to address multicollinearity
in regression analysis, which is crucial when developing machine learning models with many
parameters, particularly when these parameters are significantly weighted.
A standard, multiple-variable linear regression equation is:
                        $$Y = X_0 + B_1 X_1 + B_2 X_2 + \cdots + B_n X_n \qquad (1)$$

In the above equation, Y represents the expected value or the dependent variable, X is the predictor
or independent variable, B denotes the regression coefficient linked to that independent variable, and
X0 is the value of the dependent variable when the independent variable is zero, also referred to as
the y-intercept. It's important to observe how the coefficients illustrate the relationship between the
dependent variable and a specific independent variable. The best-fitting line for a given dataset is
obtained by calculating coefficients for each independent variable that result in the smallest residual
sum of squares (also called the sum of squared errors).

    The Residual sum of squares (RSS) represents how well a linear regression model matches the
training data and is represented by the formula:

                        $$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad (2)$$

This formula is used to calculate the accuracy of model predictions against the expected values in
the training data. If the Residual Sum of Squares (RSS) is zero, it indicates that the model perfectly
predicts the dependent variables. If two or more variables have a strong linear correlation, high-
value coefficients are generated, causing the model's output to be sensitive to minor changes in the
input data. This indicates that the model overfitted on a single training dataset and is unable to
correctly generalise to new test datasets. This causes the model to be unstable.

   Multicollinearity exists when two or more predictors have a near-linear relationship or are highly
correlated, which results in unreliable and unstable estimates of the regression coefficients. Ridge
regression is a procedure for stabilizing the coefficient estimates and reducing the mean square error
by shrinking the coefficients of a model towards zero, at the cost of a small bias, in order to solve the
problems of overfitting or multicollinearity that are normally associated with ordinary least squares
regression.

    Ridge regression corrects for high-value coefficients by introducing a regularization term (often
called the penalty term) into the RSS function. This penalty term denoted as L2, is the sum of the
squares of the model’s coefficients. It is represented in the formulation:


                        $$RSS_{L2} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{P} B_j^2 \qquad (3)$$

The L2 penalty term shrinks all coefficients, which offsets the high-value ones. Ridge regression thus
calculates new coefficients that minimize the penalized residual sum of squares of equation (3) rather
than the RSS alone, thereby reducing overfitting.
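
The effect of the penalty can be checked numerically. The sketch below minimizes equation (3) in closed form, using the standard result that the minimizer is (XᵀX + λI)⁻¹Xᵀy on centred data, and applies it to two nearly identical predictors. The data are synthetic and the function names are illustrative assumptions.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Minimise equation (3); returns (intercept, coefficients).

    Centring X and y keeps the intercept X0 out of the penalty.
    lam = 0 gives ordinary least squares.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    return float(y_mean - x_mean @ coef), coef


def collinear_demo(n=40, seed=0):
    """Two almost identical predictors, i.e. strong multicollinearity."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    x2 = x1 + 0.01 * rng.standard_normal(n)   # nearly a copy of x1
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + 0.5 * rng.standard_normal(n)
    return X, y


if __name__ == "__main__":
    X, y = collinear_demo()
    for lam in (0.0, 0.005, 1.0):
        _, coef = ridge_closed_form(X, y, lam)
        print(f"lambda={lam}: coefficients={coef}, norm={np.linalg.norm(coef):.3f}")
```

The coefficient norm decreases as λ grows, which is the shrinkage described above.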

   Ridge regression does not reduce all coefficients equally; the reduction is proportional to their
original magnitude. As the lambda (λ) parameter increases, coefficients with higher values diminish
more rapidly than those with lower values, resulting in a greater penalty for the former [16].

   In machine learning, ridge regression is used to reduce overfitting that results from model
complexity. Model complexity can be due to a model possessing too many features and features
possessing too much weight. Feature weight refers to a given predictor’s effect on the model output.

In machine learning terms, ridge regression amounts to adding bias into a model for the sake of
decreasing that model’s variance. Bias measures the average difference between predicted values and
true values, and variance measures the difference between predictions across various realizations of
a given model. As bias increases, a model predicts less accurately on a training dataset. As variance
increases, a model predicts less accurately on other datasets. Bias and variance thus reflect model
accuracy on training and test sets respectively. Ridge regression accepts a small increase in bias in
exchange for a larger reduction in variance [16].

Using Ridge regression technique allows control over the bias-variance trade-off. Increasing the
value of λ increases the bias but reduces the variance, while decreasing λ does the opposite. The goal
is to find an optimal λ that balances bias and variance, leading to a model that generalizes well to
new data.
Figure 4: Bias Variance Tradeoff
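
The trade-off can be observed by fitting the same data with several values of λ and comparing the error on the training split with the error on a held-out split. The sketch below does this with scikit-learn on synthetic, correlated features; the data generator, the split ratio and the λ values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    """Root mean square error between two 1-D sequences."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def make_correlated_data(n=40, p=20, seed=0):
    """Synthetic data: p correlated features driven by 3 latent factors."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((n, 3))
    mixing = rng.standard_normal((3, p))
    X = latent @ mixing + 0.1 * rng.standard_normal((n, p))
    y = 2.0 * latent[:, 0] + 0.3 * rng.standard_normal(n)
    return X, y


def sweep_lambda(X, y, lambdas, seed=0):
    """Return (lambda, train RMSE, test RMSE) for each lambda."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    rows = []
    for lam in lambdas:
        model = Ridge(alpha=lam).fit(X_tr, y_tr)   # sklearn calls lambda "alpha"
        rows.append(
            (lam, rmse(y_tr, model.predict(X_tr)), rmse(y_te, model.predict(X_te)))
        )
    return rows


if __name__ == "__main__":
    X, y = make_correlated_data()
    for lam, train_err, test_err in sweep_lambda(X, y, [1e-3, 1e-1, 10.0, 1e3]):
        print(f"lambda={lam:g}  train RMSE={train_err:.3f}  test RMSE={test_err:.3f}")
```

The training error can only grow as λ increases (more bias), while the held-out error typically falls first and then rises again; where that minimum lies depends on the data.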

Selection of an appropriate value for the ridge parameter λ is crucial in ridge regression, as it
directly influences the bias-variance tradeoff and the overall performance of the model. There are
several methods for the selection of the ridge parameter:

1. Cross-Validation
Cross-validation is one of the most popular methods used in the selection of the ridge parameter. In
this method, the dataset is divided into multiple subsets, and the model is trained on some
subsets while being validated on the remaining ones. The process is repeated over multiple
iterations, and the average performance across all iterations is used to determine the optimal
value of λ.

    •   K-Fold Cross-Validation: The dataset is divided into K subsets (folds). The model is
        trained on K-1 folds and validated on the remaining fold. This process is repeated K times,
        with each fold being used as the validation set once. The average performance across all
        folds is used to select λ.

    •   Leave-One-Out Cross-Validation (LOOCV): A special case of K-fold cross-validation
        where K equals the number of observations. Each observation is used as a validation
        set once, and the model is trained on the remaining observations. This method is
        computationally intensive but provides a nearly unbiased estimate of the model’s performance.

2. Grid Search: This method defines a grid of possible values for λ and the ridge regression model
is trained for each value of λ. The performance of the model is evaluated for each value of λ from the
grid and the one with the best performance is then selected as the ridge parameter.

3. Bayesian Optimization:
  Bayesian optimization is used to efficiently explore the space of possible λ values and find the
optimal value. This method can be more efficient than grid search for large search spaces.

4. Information Criteria:
Use information criteria like Akaike Information Criterion (AIC) or Bayesian Information Criterion
(BIC) to select the optimal value of λ. These criteria balance model fit and complexity.

5. Domain Knowledge:
    • Incorporate domain knowledge about the problem to guide the choice of λ. For example, if
      you know that overfitting is a significant concern, you might choose a larger value of λ [17].
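
As an illustration of the first two methods, the sketch below combines a grid of candidate λ values with K-fold cross-validation using scikit-learn's `GridSearchCV`. It is not a description of how the value used in this study was chosen; the data are synthetic and the grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold


def make_correlated_data(n=40, p=20, seed=0):
    """Synthetic data: p correlated features driven by 3 latent factors."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((n, 3))
    X = latent @ rng.standard_normal((3, p)) + 0.1 * rng.standard_normal((n, p))
    y = 2.0 * latent[:, 0] + 0.3 * rng.standard_normal(n)
    return X, y


def select_lambda(X, y, grid, n_splits=5, seed=0):
    """Pick the lambda with the lowest K-fold cross-validated RMSE."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = GridSearchCV(
        Ridge(),
        param_grid={"alpha": list(grid)},          # sklearn calls lambda "alpha"
        scoring="neg_root_mean_squared_error",     # higher is better, so RMSE is negated
        cv=cv,
    )
    search.fit(X, y)
    return search.best_params_["alpha"], -float(search.best_score_)


if __name__ == "__main__":
    X, y = make_correlated_data()
    grid = [0.001, 0.005, 0.01, 0.1, 1.0, 10.0]
    best_lambda, cv_rmse = select_lambda(X, y, grid)
    print("selected lambda:", best_lambda, "cross-validated RMSE:", cv_rmse)
```

Setting `n_splits` equal to the number of samples turns the same search into leave-one-out cross-validation.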


Ridge regression was implemented in the Python programming language using the sklearn
(scikit-learn) package, which provides an Application Programming Interface (API) for this purpose.
The linear_model.Ridge() API is used to implement the ridge regression. The only parameter
supplied to the API is alpha, with a value of 0.005; alpha is equivalent to the λ specified above. The
other parameters keep their default values: copy_X=True, fit_intercept=True, tol=0.0001,
max_iter=None, positive=False, solver='auto', and random_state=None.

The dataset consisting of 40 samples was used for training and testing the ML model using Ridge
regression. With the parameter alpha set to 0.005, the RMSE obtained using this technique was found
to be 1.02.


Figure 5: Typical graph showing the actual and predicted urea values (mg/15ml)
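
For reference, the RMSE values quoted for the two methods can be related to the residual sum of squares of equation (2) through the usual definition below. The text does not state whether the Ridge RMSE was computed on cross-validated or held-out predictions, so the formula is given only as the standard definition, together with the ratio of the two reported values.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}
              = \sqrt{\frac{RSS}{n}},
\qquad
\frac{\mathrm{RMSE}_{\mathrm{PLSR}}}{\mathrm{RMSE}_{\mathrm{Ridge}}}
              = \frac{2.87}{1.02} \approx 2.81
```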
3. Result and Discussion


The analysis using the Ridge technique (a machine learning based tool for regression analysis)
performs well, with an RMSE of 1.02, whereas the traditional PLSR analysis in ParLeS gives an
RMSE of 2.87, which is about 2.8 times higher. The two results are summarized in Table 2. As
mentioned earlier, in addition to the lower error, the algorithm can be implemented on a
simpler computational platform that does not require a LabVIEW back end. The regression
graph in Figure 5 shows good agreement between the actual and predicted values obtained
with the Ridge regression technique.

                         Table 2: RMSE obtained using the two methods

                         Model name    PLSR    Ridge regression
                         RMSE          2.87    1.02


4. Conclusion
In this article we studied the application of the Ridge regression technique to the analysis of
urea in soil, with the aim of better crop productivity. In the past we had done such analysis
using ParLeS (which is proprietary and not open-source software). The results obtained with
Ridge regression were encouraging, with an error as low as 1.02 mg/15 ml. We therefore
conclude that, for this dataset, the Ridge technique is superior to the traditional PLSR
technique of regression analysis.

5. References
   [1] Folorunso, O.; Ojo, O.; Busari, M.; Adebayo, M.; Joshua, A.; Folorunso, D.; Ugwunna, C.O.;
       Olabanjo, O.; Olabanjo, O. 2023. Exploring Machine Learning Models for Soil Nutrient
       Properties Prediction: A Systematic Review. Big Data Cogn. Comput. 7, 113. DOI:
       10.3390/bdcc7020113

   [2] Andreas Kamilaris, Francesc X. Prenafeta-Boldú. 2018. Deep learning in agriculture: A
       survey. Computers and Electronics in Agriculture, 147, 70-90. DOI:
       10.1016/j.compag.2018.02.016.

   [3] Jain, S., Sethia, D. 2023. A Review on Applications of Artificial Intelligence for Identifying
       Soil Nutrients. In: Saini, M.K., Goel, N., Shekhawat, H.S., Mauri, J.L., Singh, D. (eds)
       Agriculture-Centric Computation. ICA 2023. Communications in Computer and Information
       Science, vol 1866. Springer, Cham. DOI: 10.1007/978-3-031-43605-5_6
 [4] Liu, Y., Ma, X., Shu, L., Hancke, G. P., & Abu-Mahfouz, A. M. 2020. From Industry 4.0 to
     Agriculture 4.0: Current Status, Enabling Technologies, and Research Challenges. IEEE
     Transactions on Industrial Informatics, 1–1. DOI:10.1109/tii.2020.3003910

 [5] Shaikh, F.K.; Memon, M.A.; Mahoto, N.A.; Zeadally, S.; Nebhen, J. 2021. Artificial intelligence
     best practices in smart agriculture. IEEE Micro , 42, 17–24.

 [6] Chen, Q.; Li, L.; Chong, C.; Wang, X. AI-enhanced soil management and smart farming. Soil
     Use and Management. 2022, 38, 7–13. DOI: 10.1111/sum.12771

 [7] Dobos, E. 2006. Digital Soil Mapping: As a Support to Production of Functional Maps; Office
     for Official Publication of the European Communities: Luxembourg

 [8] Wadoux, A.M.C.; Minasny, B.; McBratney, A.B. 2020. Machine learning for digital soil
     mapping: Applications, challenges and suggested solutions. Earth-Sci. Rev. , 210, 103359

 [9] Dong, W.; Wu, T.; Sun, Y.; Luo, J. 2018. Digital mapping of soil available phosphorus
     supported by AI technology for precision agriculture. In Proceedings of the 2018 7th
     International Conference on Agro-geoinformatics (Agro-geoinformatics), Hangzhou, China,
     pp. 1–5.

 [10] Khaledian, Y.; Miller, B.A. Selecting appropriate machine learning methods for digital soil
      mapping. Appl. Math. Model. 2020, 81, 401–418.

[11] Shahare, Y.; Gautam, V. Soil Nutrient Assessment and Crop Estimation with Machine
    Learning Method: A Survey. In Cyber Intelligence and Information Retrieval; Springer:
    Berlin/Heidelberg, Germany, 2022; pp. 253–266.

 [12] Trontelj ml., J.; Chambers, O. Machine Learning Strategy for Soil Nutrients Prediction Using
      Spectroscopic Method. Sensors 2021, 21, 4208. DOI: 10.3390/s21124208

 [13] Umm E Farwa, Ahsan Ur Rehman, Qasim Khan, S. ., & Khurram, M. (2020). Prediction of
      Soil Macronutrients Using Machine Learning Algorithm. International Journal of Computer
      (IJC), 38(1), 1–14.

[14] S. R. Vernekar, I. A. P. Nazareth, J. S. Parab and G. M. Naik, "RF spectroscopy technique for
    soil nutrient analysis," 2015 International Conference on Technologies for Sustainable
    Development (ICTSD), Mumbai, 2015, pp. 1-4. doi: 10.1109/ICTSD.2015.7095878

[15] Raphael A. Viscarra Rossel, “ParLeS: Software for chemometric analysis of spectroscopic
    data”, Chemometrics and Intelligent Laboratory Systems, Volume 90, Issue 1, 2008, Pages 72-
    83, ISSN 0169-7439, https://doi.org/10.1016/j.chemolab.2007.06.006.

 [16] Jacob Murel, Eda Kavlakoglu, “What is ridge regression?”, URL:
      www.ibm.com/topics/ridge-regression.

 [17] “What is Ridge Regression?” URL: www.ibm.com/topics/ridge-regression.