Design and Development of Machine Learning Model for Crop
Yield Prediction
Taman Kumar, Kiran Jyoti, and Sandeep K. Singla
Guru Nanak Dev Engineering College (GNDEC), Ludhiana, Punjab, India


                  Abstract
                  Agriculture is one of the major sources of employment in India and a major
                  contributor to its GDP. Machine learning is a recent technology that can be used to
                  support the agriculture sector. This paper focuses on using machine learning
                  techniques to predict wheat crop yield. The regression algorithms used are simple
                  regression, gradient boosting, polynomial regression and random forest. Finally, the
                  predictions of each algorithm are compared with the actual yields.

                  Keywords
                  Crop yield prediction, machine learning, regression.

1. Introduction

    Agriculture is the main source of livelihood for much of India's population. Almost all
industries depend on raw materials produced by agriculture. As a result, agriculture and allied
sectors contribute 15.4% of India's GDP. India is the second-largest producer and seventh-largest
exporter of agricultural goods. The sector has grown strongly since the Green Revolution of 1967.
Crop production depends on many parameters, such as rainfall, irrigation, temperature, other
climate conditions, seed quality, and consumption of NPK (Nitrogen, Phosphorus, Potassium)
fertilizers. Many changes are required in the agricultural domain to strengthen the Indian
economy (Ramesh et al. 2019). Agricultural information can be extracted in two ways: manually, or
with computer and IT tools. However, manual methods have some limitations:
    1. Bias: Manually gathered information reflects one person's perspective, and a single
perspective does not fit every situation.
    2. Time delay: Information that arrives late is no longer useful.
    3. Correctness: To err is human, so there is always a probability of mistakes.
    4. Reliability: All of the above factors reduce the reliability of manual methods.
    On the other hand, technological tools are well known for their precision. The technologies
most commonly used in the agricultural domain at present are:
    1. Machine learning.
    2. Deep learning.
    Machine learning: Machine learning is used in many domains, for example to predict customers'
shopping behavior in malls and to forecast stock market trends, and it is also used in agriculture.
Agriculture involves many processes, such as irrigation scheduling, crop disease management,
by-products and transportation, all of which ultimately affect crop yield. Rather than modelling
these sub-processes individually, we address the main task, crop yield prediction. Crop yield
prediction is one of the challenging problems in precision agriculture, and many models have been
proposed and validated so far. This problem requires the use of several datasets since crop yield
depends on many different factors such as climate, weather, soil, use of fertilizers, and seed
variety. This indicates that crop yield prediction is not a trivial task; instead, it consists of
several complicated steps. Nowadays, crop yield prediction models can estimate the yield, but a
better performance in yield prediction is still desirable (Klompenburg et al. 2020).

International Conference on Emerging Technologies: AI, IoT, and CPS for Science & Technology Applications, September 06–07, 2021,
NITTTR Chandigarh, India
EMAIL: tamankumar0808@gmail.com (A. 1); kiranjyoti@gmail.com (A. 2); sandeepkumar.singla@gmail.com (A. 3)
©2021 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)


2. Literature

                A. Agricultural Information Extraction

    1) Raorane and Kulkarni (2015) used data mining tools in a crop management system, relying
       on regression algorithms. A disadvantage is that the model is not specified.
    2) Kushwaha and Bhattachrya (2015) proposed a method, based on the Agro algorithm, for
       finding the crop best suited to a given piece of land.
    3) Santra et al. (2016) used artificial neural networks, decision trees and regression
       analysis to provide crop information and help increase the yield rate. A drawback is
       that the method is not clearly specified.

                B. Crop Yield Estimation

    1) Kumar et al. (2015) suggested a method for improving crop yield. Classification is used
       and the parameters are compared. A drawback is that the accuracy and performance are
       inadequate.
    2) Babu and Babu (2016) gave a method that provides solutions to farming problems such as
       water and fertilizer use. They also used the Agro algorithm, and accuracy is again a
       problem.
    3) Jain et al. (2017) identified the sequence in which crops should be sown to maximize
       yield. In addition to crop sequencing, they applied machine learning to irrigation and
       crop diseases.
    4) Djodiltachoumy (2017) applied K-means clustering to data from previous years and
       predicted yield from that database. Drawbacks are that only a small amount of data was
       used and that the approach is suitable only for association rules.
    5) Nigam et al. (2019) concluded that random forest regression gives the highest yield
       prediction accuracy. A simple recurrent neural network performs better on rainfall
       prediction, while LSTM is good for temperature prediction.

                C. Machine Learning Algorithms

    1) Khairunniza-Bejo et al. (2014) defined a method using an artificial neural network to
       help farmers solve some of their problems. A disadvantage is that the proposed method
       is very time consuming.
    2) Ramesh and Vardhan (2015) used multiple linear regression to analyze and verify the
       database. A drawback is that this method has low accuracy.
    3) Savla et al. (2015) suggested a framework using normalization, clustering and
       classification to identify crop yield rate zones based on attributes.
    4) Sindhura et al. (2016) also used multiple linear regression to make predictions and
       support decision making in many sectors.

    The literature review shows that crop yield estimation and agricultural information
extraction from ancillary and historical data remain open problems. Various machine learning
models and other algorithms have been used in the past for yield estimation.

3. Methodology


Figure 1: Flowchart of Proposed Methodology
       The steps of the methodology are as follows:
       Step 1: The datasets are collected and processed.
       Step 2: Any impurities in the dataset are removed.
       Step 3: The data is normalized if needed and may be reduced to a smaller volume.
       Step 4: The data is converted into a supported format.
       Step 5: The processed data is stored in a database.
       Step 6: The required method is applied.
       Step 7: The final results are collected.
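
       Step 3 does not say which normalization is used. As a minimal sketch, assuming min-max
       scaling to the range [0, 1] (an assumption, not the authors' stated method), a single
       column could be rescaled as follows. The function name and the rainfall values are
       hypothetical and purely illustrative.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to rescale; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical annual rainfall readings (illustrative values, not from the paper's dataset).
rainfall = [420.0, 510.0, 465.0, 600.0]
scaled = min_max_normalize(rainfall)
print(scaled)
```

       Min-max scaling keeps the ordering of the values and puts variables with different
       units, such as mm/day and kPa, on a common scale before modelling.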
             A. Working of the model:
    i.       Real-time datasets of parameters such as Precipitation, Wheat Crop Yield, NPK
             Consumption, Mean Temperature, Relative Humidity, Surface Pressure and Annual
             Rainfall were collected from authoritative sources such as data.gov.in and
             power.larc.nasa.gov/data-access-viewer. The study area is mainly Punjab, India.
             The variables and their units of measure are given in Table 1:

           Table 1
           Units of Measure
         Variable                      Unit of Measure
         Precipitation                 mm/day
         Relative Humidity             %
         Surface Pressure              kPa
         Mean Temperature              °C
         Mean Wind Speed               m/s
         Earth Skin Temperature        °C
         NPK Consumption               TNT

   ii.      The collected data is preprocessed. Some values were missing ('NA'); each was filled
            with the average of the values immediately above and below it in the same column.
   iii.     Feature selection is applied to extract the important parameters for the modelling
            framework. The correlation between all parameters is computed, and parameters that
            do not affect crop yield are eliminated. The correlations are shown in Figure 2:


Figure 2: Correlation Between Variables
    iv.     The dataset is partitioned into training and testing sets: 80% of the data is used
            to train the model and 20% to test it.
    v.      Several machine learning algorithms, namely Random Forest, Polynomial Regression,
            Gradient Boosting (GBM), Multiple Linear Regression and Linear Regression, are
            applied to the dataset to predict the output.
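
    Steps ii to v can be sketched in plain Python. This is an illustration under stated
    assumptions, not the authors' KNIME workflow: the correlation threshold of 0.3, the
    chronological hold-out of the last 20% of rows, all function names, and the toy data are
    hypothetical. Only the simplest of the listed models, a one-variable linear regression, is
    fitted here.

```python
import math


def fill_missing(column):
    """Fill each None with the mean of the nearest known values above and below it."""
    filled = list(column)
    for i, value in enumerate(column):
        if value is not None:
            continue
        above = next((v for v in reversed(column[:i]) if v is not None), None)
        below = next((v for v in column[i + 1:] if v is not None), None)
        # At the edge of a column only one neighbour exists, so that value is used alone.
        neighbours = [v for v in (above, below) if v is not None]
        filled[i] = sum(neighbours) / len(neighbours)
    return filled


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(features, target, threshold):
    """Keep the features whose absolute correlation with the target reaches the threshold."""
    return [name for name, col in features.items()
            if abs(pearson(col, target)) >= threshold]


def chronological_split(rows, test_fraction=0.2):
    """Hold out the last test_fraction of the rows for testing (assumed split strategy)."""
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


# Toy data (illustrative only): one missing rainfall value and an irrelevant "noise" column.
rainfall = fill_missing([400, None, 500, 550, 600, 650, 700, 750, 800, 850])
noise = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
wheat_yield = [3000 + 2 * r for r in rainfall]

selected = select_features({"rainfall": rainfall, "noise": noise}, wheat_yield, threshold=0.3)
train, test = chronological_split(list(zip(rainfall, wheat_yield)))
slope, intercept = fit_line([r for r, _ in train], [y for _, y in train])
predicted = [slope * r + intercept for r, _ in test]
print(selected, len(train), len(test), predicted)
```

    In this toy example the missing rainfall value is filled from its two neighbours, the noise
    column is dropped because it correlates only weakly with yield, and the line fitted on the
    first 80% of rows is used to predict the held-out 20%.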

          B. Output
   1. The results of the applied machine learning algorithms are compared to evaluate the
      models. The predictions and the actual yields are given in Table 2:

            Table 2
            Comparison of Predictions with Actual Results (all values in kg/ha)
   Actual Results   Random Forest Trees   Gradient Booster   Simple Regression   Polynomial Regression
   4693             4152.03               4474.749           4507                3994.286
   5097             3943.56               4352.341           4507                3772.596
   4724             4149.2                4184.22            4179                3449.539
   5017             3945.25               3933.049           3853                3843.369
   4304             4001.91               4208.408           4179                4038.906
   4583             4277.22               4369.696           4221                3825.896
   5046             4160.33               4224.234           4221                4004.914
   5077             4233.55               4226.763           4207                3716.017
    2. The predicted and actual values for the years 2011 to 2018 are also shown below as a
       bar graph and a line graph:


Figure 3: Bar Graph of Predictions and Actual Results


Figure 4: Line Graph of Predictions and Actual Results

    3. The performance evaluation measures of the applied algorithms, namely Mean Absolute
       Error, Mean Squared Error, Root Mean Squared Error and Mean Absolute Percentage Error,
       are given in Table 3:

   Table 3. Performance Evaluation Measures
   Type of Error                    Random Forest Trees   Gradient Booster   Simple Regression   Polynomial Regression
   Mean Absolute Error              709.744               570.942            583.375             986.935
   Mean Squared Error               597,836.813           440,162.565        452,351.375         1,102,940.261
   Root Mean Squared Error          773.199               663.447            672.571             1050.21
   Mean Absolute Percentage Error   0.144                 0.115              0.118               0.202
     4. The accuracy of the applied models is given in Table 4 (each value matches 100% minus
        the corresponding Mean Absolute Percentage Error in Table 3):

        Table 4. Accuracy of Applied Models
   Random Forest Trees   Gradient Booster   Simple Regression   Polynomial Regression
   85.6%                 88.5%              88.2%               79.8%
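
    The error measures in Table 3 can be recomputed from the values in Table 2 with the standard
    definitions of MAE, MSE, RMSE and MAPE. The sketch below does this in plain Python. Treating
    accuracy as 100% minus MAPE is an inference from the fact that the numbers in Tables 3 and 4
    agree with it, not a definition stated in the paper. The function and variable names are
    illustrative.

```python
import math

# Actual and predicted wheat yields (kg/ha) copied from Table 2.
actual = [4693, 5097, 4724, 5017, 4304, 4583, 5046, 5077]
predictions = {
    "Random Forest Trees": [4152.03, 3943.56, 4149.2, 3945.25,
                            4001.91, 4277.22, 4160.33, 4233.55],
    "Gradient Booster": [4474.749, 4352.341, 4184.22, 3933.049,
                         4208.408, 4369.696, 4224.234, 4226.763],
    "Simple Regression": [4507, 4507, 4179, 3853, 4179, 4221, 4221, 4207],
    "Polynomial Regression": [3994.286, 3772.596, 3449.539, 3843.369,
                              4038.906, 3825.896, 4004.914, 3716.017],
}


def error_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, MAPE (as a fraction) and accuracy as 100 * (1 - MAPE)."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = sum(abs(e) / t for e, t in zip(errors, y_true)) / n
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAPE": mape,
        "accuracy_pct": 100 * (1 - mape),  # assumed relation between Tables 3 and 4
    }


results = {name: error_metrics(actual, pred) for name, pred in predictions.items()}
best_model = min(results, key=lambda name: results[name]["MAPE"])

for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
print("Lowest MAPE:", best_model)
```

    Run on the Table 2 values, this reproduces the entries of Table 3 to the printed precision
    and ranks Gradient Booster first by MAPE.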


4. Conclusion and Future work
        The results show that gradient boosting gives the most accurate predictions (88.5%
     accuracy), followed closely by simple regression (88.2%). The results were obtained using
     the KNIME software; our future work is to develop an application that farmers can operate
     easily.

5.   References
[1] Babu, T. Giri, and Dr G. Anjan Babu. "Big Data Analytics to Produce Big Results in the
     Agricultural Sector." (2016).
[2] Djodiltachoumy, S. "A Model for Prediction of Crop Yield." International Journal of
     Computational Intelligence and Informatics 6, no. 4 (2017).
[3] Ghadge, Rushika, Juilee Kulkarni, Pooja More, Sachee Nene, and R. L. Priya. "Prediction of
     crop yield using machine learning." Int. Res. J. Eng. Technol.(IRJET) 5 (2018).
[4] Huang, Jui-Chan, Kuo-Min Ko, Ming-Hung Shu, and Bi-Min Hsu. "Application and comparison
     of several machine learning algorithms and their integration models in regression problems."
     Neural Computing and Applications 32, no. 10 (2020): 5461-5469.
[5] Jain, Nishit, Amit Kumar, Sahil Garud, Vishal Pradhan, and Prajakta Kulkarni. "Crop selection
     method based on various environmental factors using machine learning." International Research
     Journal of Engineering and Technology (IRJET) 4, no. 2 (2017): 1530-1533.
[6] Kale, Shivani S., and Preeti S. Patil. "A Machine Learning Approach to Predict Crop Yield and
     Success Rate." In 2019 IEEE Pune Section International Conference (PuneCon), pp. 1-5. IEEE,
     2019.
[7] Khairunniza-Bejo, Siti, Samihah Mustaffha, and Wan Ishak Wan Ismail. "Application of
     artificial neural network in predicting crop yield: A review." Journal of Food Science and
     Engineering 4, no. 1 (2014): 1.
[8] Kumar, Rakesh, M. P. Singh, Prabhat Kumar, and J. P. Singh. "Crop Selection Method to
     maximize crop yield rate using machine learning technique." In 2015 international conference on
     smart technologies and management for computing, communication, controls, energy and
     materials (ICSTM), pp. 138-145. IEEE, 2015.
[9] Kushwaha, Ashwani Kumar, and Sweta Bhattachrya. "Crop yield prediction using Agro
     Algorithm in Hadoop." International Journal of Computer Science and Information Technology
     & Security (IJCSITS) 5, no. 2 (2015): 271-274.
[10] Medar, Ramesh, Vijay S. Rajpurohit, and Shweta Shweta. "Crop yield prediction using machine
     learning techniques." In 2019 IEEE 5th International Conference for Convergence in Technology
     (I2CT), pp. 1-5. IEEE, 2019.
[11] Mishra, Subhadra, Debahuti Mishra, and Gour Hari Santra. "Applications of machine learning
     techniques in agricultural crop production: a review paper." Indian Journal of Science and
     Technology 9, no. 38 (2016): 1-14.
[12] Nigam, Aruvansh, Saksham Garg, Archit Agrawal, and Parul Agrawal. "Crop yield prediction
     using machine learning algorithms." In 2019 Fifth International Conference on Image
     Information Processing (ICIIP), pp. 125-130. IEEE, 2019.
[13] Ramesh, D., and B. Vishnu Vardhan. "Analysis of crop yield prediction using data mining
     techniques." International Journal of research in engineering and technology 4, no. 1 (2015): 47-
     473.
[14] Raorane, A. A., and R. V. Kulkarni. "Application of DataMining tool to crop management
     system." Russian Journal of Agricultural and Socio-Economic Sciences 37, no. 1 (2015).
[15] Rajak, Rohit Kumar, Ankit Pawar, Mitalee Pendke, Pooja Shinde, Suresh Rathod, and
     Avinash Devare. "Crop recommendation system to maximize crop yield using machine learning
     technique." International Research Journal of Engineering and Technology 4, no. 12 (2017):
     950-953.
[16] Savla, Anshal, Himtanaya Bhadada, Parul Dhawan, and Vatsa Joshi. "Application of machine
     learning techniques for yield prediction on delineated zones in precision agriculture." IJNCAA
     (2015): 48
[17] Son, Nguyen-Thanh, Chi-Farn Chen, Cheng-Ru Chen, Horng-Yuh Guo, Youg-Sing Cheng, Shu-
     Ling Chen, Huan-Sheng Lin, and Shih-Hsiang Chen. "Machine learning approaches for rice crop
     yield predictions using time-series satellite data in Taiwan." International Journal of Remote
     Sensing 41, no. 20 (2020): 7868-7888.
[18] Sindhura, D., B. Navya Krishna, K. Sai Prasanna Lakshmi, B. Mallikarjun Rao, and J. Rajendra
     Prasad. "Effects of Climate Changes on Agriculture." International Journal of Advanced
     Research in Computer Science and Software Engineering (2016).
[19] Van Klompenburg, Thomas, Ayalew Kassahun, and Cagatay Catal. "Crop yield prediction using
     machine learning: A systematic literature review." Computers and Electronics in Agriculture 177
     (2020): 105709.