A PREDICTION OF RAINFALL DATA BASED ON SUPPORT VECTOR MACHINE WITH STOCHASTIC GRADIENT DESCENT

Dian Puspita Hapsari1, Mohammad Imam Utoyo2, Santi Wulan Purnami3

1 Institute of Technology Adhi Tama, Surabaya, Indonesia, dian.puspita@itats.ac.id
2 Airlangga University, Surabaya, Indonesia, m.i.utoyo@fst.unair.ac.id
3 Institute of Technology Sepuluh Nopember, Surabaya, Indonesia, santi_w@statistika.its.ac.id

ABSTRACT

Predicting rainfall is part of weather forecasting, an activity that is difficult because it depends on many variables. Variables such as wind speed, wind direction, temperature, and humidity vary from time to time, and weather calculations vary with geographical location and atmospheric conditions. An accurate prediction model is therefore needed: we predict rainfall using a Support Vector Machine trained with Stochastic Gradient Descent (SGD-SVM) to replace the linear threshold used in traditional rainfall prediction. The selection of the right model parameters has an important impact on the accuracy of SVM predictions, and SGD is proposed to find the optimal parameters of the SVM. The SGD-SVM algorithm is trained on historical data for rainfall prediction, which yields useful information for governments and communities in making wise and well-informed decisions. The simulation shows that the proposed prediction model achieves much better accuracy than the traditional prediction model. Moreover, the simulation results show the effectiveness of the SVM-SGD model and promise further room for improvement, since more relevant attributes can be used in predicting the dependent variable.

Key words: Support Vector Machine, Stochastic Gradient Descent, Predicting Rainfall.

1. INTRODUCTION

Climate change is creating extreme weather events that are more frequent and more intense in certain locations around the globe. There is evidence that heat waves have intensified, which has contributed to accelerating droughts and extreme flood events. With few exceptions, the general phenomenon is that rainfall intensity has increased, but with a reduced number of wet days. Studies that associate rainfall and temperature are scarce; rainfall extremes have been studied more extensively than temperature extremes (Naveendrakumar et al., 2019).

In the last 20 years, machine learning has been applied to rainfall prediction in several areas and has proven effective at solving these problems. Early work developed data clustering models to overcome problems of processing time and to cluster large data sets (Hartono et al., 2018). The selection of the classifier has a crucial impact on accuracy and efficiency. Artificial neural networks (ANNs) can be used to predict the occurrence of rainfall over a short time span, but models produced by ANNs reach peak performance only in well-defined seasons and lose accuracy in the transition seasons (Esteves et al., 2019). The SVM algorithm has been applied to accurate rainfall-runoff modeling, and open research questions remain for improving the performance of precipitation forecasting systems. The algorithm shows many unique advantages in solving nonlinear, high-dimensional pattern recognition problems on large-scale data (Sehad et al., 2017).
The learning problem of the SVM can be expressed as a convex optimization problem, so we can find the global minimum of the objective function using known, effective algorithms (Young et al., 2017). We propose a prediction model for rainfall forecasts based on a Support Vector Machine with Stochastic Gradient Descent for optimization. Among the types of optimization algorithms for minimizing a loss function, we use a first-order method: the gradient descent method is commonly used to train classifiers by minimizing the error function (Bottou, 2010). Bottou introduced optimization of classifiers with the gradient method for large-scale data training, identifying how optimization problems arise in machine learning and what makes them challenging. We looked for training algorithms with short training time (linear scaling with training set size) but high generalization accuracy. Support Vector Machines (SVMs) are supervised learning algorithms that can be applied in many cases; for this reason, we present most of our algorithms in a general form, to facilitate their derivation for large-scale applications.

This study uses an online BMKG database for classification with 2160 records and the following attributes: Tn: minimum temperature (°C), Tx: maximum temperature (°C), Tavg: average temperature (°C), RH_avg: average humidity (%), RR: rainfall (mm), ss: duration of solar radiation (hours), ff_x: maximum wind speed (m/s), ddd_x: wind direction at maximum speed (°), ff_avg: average wind speed (m/s), ddd_car: most frequent wind direction (°). This data was collected from the Perak Maritime Meteorological Station II, which keeps daily records. The binary label is based on the measured rainfall amount (RR): if a value is present, the day is labeled rain; if no value is present, it is labeled no rain.

In the literature, different variants of gradient descent (GD) and stochastic gradient descent (SGD) have been suggested with the aim of increasing performance in terms of accuracy and convergence speed (Wijnhoven and De With, 2010). In both methods, parameters are updated iteratively to minimize the objective function (Sopyła and Drozda, 2015). To deal with large-scale datasets and to factorize large matrices, different algorithms have been proposed; stochastic gradient descent (SGD) algorithms are simple and efficient and are extensively used for matrix factorization (Sakr et al., 2017).

This paper is organized as follows. In the following section, an introduction to SVM and SGD is presented together with the proposed SVM-SGD algorithm. Section 3 describes the experiments on the rainfall dataset. Section 4 discusses our model, including comparative results between traditional SVM-based prediction models and ours. Section 5 states our conclusions.
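As an illustration of this labeling rule, the following minimal Python sketch derives the binary rain/no-rain target from the RR column; the file name and the exact missing-value handling are our assumptions, not taken from the paper.

```python
import pandas as pd

# Hypothetical export of the BMKG daily records described above.
df = pd.read_csv("bmkg_perak_ii_daily.csv")

# Label rule from the text: a measured rainfall value means rain (1),
# no value means no rain (0).
df["rain"] = (df["RR"].notna() & (df["RR"] > 0)).astype(int)

# Predictors are the remaining meteorological attributes; RR itself is
# excluded because the label is derived from it.
features = ["Tn", "Tx", "Tavg", "RH_avg", "ss",
            "ff_x", "ddd_x", "ff_avg", "ddd_car"]
X, y = df[features], df["rain"]
```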
2. SVM BASED ON SGD

2.1 Support vector machines

Support vector machines are supervised learning models that analyze data for classification and regression analysis. The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963 (Chervonenkis, 2013). In 1992, Boser, Guyon, and Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall. Beyond linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

The SVM tries to find a hyperplane; we call the one that best separates the data the optimal hyperplane. The SVM optimization problem is

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, m \qquad (1)$$

This function is convex in $w$. A function is convex if for every $u, v$ in the domain and every $\lambda \in [0,1]$ we have $f(\lambda u + (1-\lambda)v) \le \lambda f(u) + (1-\lambda) f(v)$. For a convex function, the stationarity condition $\nabla f(x) = 0$ must hold for $x$ to be a minimum: a stationary point is a point where the function stops increasing or decreasing. When there are no constraints, the stationarity condition is simply that the gradient of the objective function is zero; when there are constraints, we use the gradient of the Lagrangian (Luo and Yu, 2006). The primal feasibility condition corresponds to the constraints of the primal problem: it makes sense that they must be enforced to find the minimum of the function under constraints. Similarly, the dual feasibility condition represents the constraints that must be respected in the dual problem.

This formulation of the SVM is called the hard margin SVM. It cannot work when the data is not linearly separable. There are several Support Vector Machine formulations; another formulation, the soft margin SVM, can work when the data is not linearly separable because of outliers. Minimizing the norm of $w$ is a convex optimization problem, which can be solved using the method of Lagrange multipliers. Researchers have developed heuristics to improve this approach, and popular libraries like LIBSVM use an SMO-like algorithm. Although this is the standard way of solving the SVM problem, other methods exist, such as gradient descent and stochastic gradient descent (SGD) (Wijnhoven and De With, 2010), which are particularly suited to online learning and huge datasets (Xu et al., 2015).
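To make the hard-margin problem in Eq. (1) and the kernel trick concrete, here is a minimal Python sketch on synthetic data; the scikit-learn usage and all settings are our illustration, not part of the paper's experiments (a very large C approximates the hard margin).

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable 2-D clusters with labels -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)),
               rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Hard-margin behaviour: a very large C leaves (almost) no slack,
# approximating the constrained problem in Eq. (1).
hard = SVC(kernel="linear", C=1e6).fit(X, y)
print("w =", hard.coef_[0], "b =", hard.intercept_[0])

# Kernel trick: replacing the dot product with an RBF kernel gives a
# non-linear decision boundary in the original input space.
rbf = SVC(kernel="rbf", gamma=0.5).fit(X, y)
print("support vectors per class:", rbf.n_support_)
```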
2.2 Gradient Descent

Gradient Descent is a popular optimization technique in machine learning and deep learning, and it can be used with most, if not all, learning algorithms. A gradient is the slope of a function: the degree of change of one parameter with respect to a change in another. Mathematically, it can be described as the vector of partial derivatives of a function with respect to its inputs, and the cost functions considered here are convex (Wu et al., 2011). Gradient Descent is an iterative method used to find the values of the parameters of a function that minimize the cost function as much as possible. The parameters are initialized to particular values, and from there Gradient Descent runs iteratively, using calculus to find the parameter values that minimize the given cost function (Lin et al., 2011). The algorithm starts from a random point on a function and travels down its slope in steps until it reaches the lowest point of that function. It is useful in cases where the optimum cannot be found by setting the slope of the function to zero. The general idea is to start with a random point (in a parabola example, a random "x") and find a way to update this point at each iteration so that we descend the slope. We are trying to minimize the regularized hinge loss

$$\min_{w,b}\ J(w) = \frac{1}{2} w^T w + \sum_{i=1}^{m} \max\left(0,\ 1 - y_i\, w^T x_i\right) \qquad (2)$$

The steps of the algorithm are as follows. First, pick random initial values for the parameters. Second, find the slope of the objective function with respect to each parameter/feature, i.e., compute the gradient (in the parabola example, differentiate "y" with respect to "x"; with more features x1, x2, etc., take the partial derivative of "y" with respect to each feature). Third, evaluate the gradient by plugging in the current parameter values. Fourth, calculate the step size for each feature as step size = gradient × learning rate, and update the parameters as new parameters = old parameters − step size. Repeat the second through fourth steps until the gradient is almost zero.

The learning rate mentioned above is a flexible parameter which heavily influences the convergence of the algorithm (Schraudolph et al., 2007). Larger learning rates make the algorithm take huge steps down the slope, and it might jump across the minimum point and miss it, so it is often safer to stick to a low learning rate such as 0.01. It can also be shown mathematically that gradient descent takes larger steps down the slope when the starting point is far above the minimum and smaller steps as it approaches the destination, so as to be quick without overshooting.

There are a few downsides to the gradient descent algorithm (Zhang, 2004); consider the amount of computation per iteration. Suppose we have 10,000 data points and 10 features. The sum of squared residuals consists of as many terms as there are data points, so 10,000 terms in our case. We need to compute the derivative of this function with respect to each of the features, so each iteration requires 10,000 × 10 = 100,000 computations. It is common to take 1,000 iterations, giving 100,000 × 1,000 = 100,000,000 computations to complete the algorithm. That is a substantial overhead, and hence gradient descent is slow on huge data. In gradient descent, we compute the gradient using the entire training set. A superficially simple (but in fact far-reaching) alteration is to compute the gradient with respect to a single randomly chosen example. This technique is called stochastic gradient descent (SGD).
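The following NumPy sketch illustrates an SGD update for the hinge-loss objective of Eq. (2); it is our illustration rather than the paper's implementation, and the regularization weight lam (generalizing the ½‖w‖² term) and learning rate eta are assumed hyperparameters.

```python
import numpy as np

def sgd_svm(X, y, lam=0.01, eta=0.001, epochs=100, seed=0):
    """Linear SVM trained by SGD on the hinge-loss objective of Eq. (2).

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):   # one random example per step
            margin = y[i] * (X[i] @ w)
            # Subgradient of lam/2 * ||w||^2 + max(0, 1 - y_i * w.x_i):
            grad = lam * w
            if margin < 1:             # example violates the margin
                grad = grad - y[i] * X[i]
            w -= eta * grad
    return w

# Prediction for new samples: np.sign(X_new @ w)
```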
3. EXPERIMENT

3.1 Data Preprocessing

Data preprocessing is the initial stage of the machine learning process, and because only valid data will produce accurate output, it is a key stage. For this study, we use one year of ground-based meteorological data from the Maritim Perak Station (ID WMO: 96937). The dataset contains atmospheric pressure, sea level pressure, wind direction, wind speed, relative humidity, and precipitation, collected daily over one year. We consider only the relevant information and ignore the rest. Principal component analysis (PCA) is used to reduce the dimensionality of the data, which reduces data processing time and improves the efficiency of the algorithm. We performed a data transformation on the rainfall values. By inspecting the original dataset, we found incorrect entries that do not correspond to the facts.

3.2 Train SVM-SGD

For the assessment of the algorithms' results, cross-validation and external testing were carried out. The dataset was divided into two subsets, training and test, comprising 80% and 20% of the original samples, respectively. The training set takes 80% of the samples drawn randomly from the dataset; the test set consists of the remaining data, containing all the attributes except the rainfall value that the model is supposed to predict. The test set was never used for the training of any of the models.

Table 1. Original meteorological-rainfall data.

Fig. 1. Non-linear rainfall dataset.

Fig. 2. (a) Error rate with learning rate = 0.0001 (first experiment); (b) error rate with learning rate = 0.001 (second experiment).

4. DISCUSSION

The first preprocessing task is to remove abnormal values from the data before any calculation, then to normalize the data to eliminate the effects of differing sample ranges and to smooth the training process. The data is subjected to five-fold cross-validation for a more stable model. Because of the particular nature of precipitation, regional rainfall data are distributed unevenly in time and space, and the days with precipitation are clearly fewer than the total number of samples. To verify the convergence of the proposed Stochastic Gradient Descent SVM algorithm, simulations were run on the rainfall dataset, collected from the ground-based meteorological data of the Maritim Perak Station (ID WMO: 96937). There are 1825 records, split into 1095 training samples and 730 testing samples. To display the performance of different Stochastic Gradient Descent SVM configurations, we employed different learning rates and measured the error rate. For convenience, the learning rates and error rates are plotted in Fig. 2(a), where learning rate = 0.0001 gives error rate = 0.365, and Fig. 2(b), where learning rate = 0.001 gives error rate = 0.364. Generally speaking, the error rates with larger learning rates are lower than those with smaller learning rates. Although the data are pretreated, pretreatment cannot completely remove their impact on the forecast results, and the volatility is greater when the number of samples is small.
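A compact sketch of this preprocessing and evaluation pipeline, written with scikit-learn, is shown below; the number of PCA components, the iteration budget, and the other settings are our assumptions rather than values reported in the paper.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# 80/20 split as in Section 3.2 (X, y prepared as in Section 1).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Compare the two learning rates discussed with Fig. 2.
for eta0 in (0.0001, 0.001):
    model = make_pipeline(
        StandardScaler(),                 # normalization
        PCA(n_components=5),              # assumed number of components
        SGDClassifier(loss="hinge",       # hinge loss = linear SVM
                      learning_rate="constant", eta0=eta0,
                      max_iter=1000, random_state=42),
    )
    scores = cross_val_score(model, X_train, y_train, cv=5)  # five-fold CV
    print(f"eta0={eta0}: mean CV error {1 - scores.mean():.3f}")
```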
We analyzed the results of optimizing the parameters C and g across the compared methods. The SVM-SGD method based on the Stochastic Gradient Descent optimization algorithm has better experimental results, with a higher classification rate. This indicates that the Stochastic Gradient Descent optimization algorithm has a strong ability to optimize the Support Vector Machine parameters C and g, and that its results are more accurate.

5. CONCLUSIONS

The purpose of this study was to investigate the current state of optimization methods based on gradient descent applied to the Support Vector Machine classifier. Based on the findings of this study, it is possible to give an overview of the dominant gradient-descent-based optimization methods. We optimize the model parameters using the SGD algorithm, and the SVM-SGD algorithm is shown to be an effective method for rainfall forecast decisions. The SVM is a machine learning method suited to highly nonlinear problems, a kind of intelligent learning method with a strong theoretical basis. In addition, the SVM formulation places no limit on the dimensionality of the input vectors, which facilitates the handling of meteorological problems involving time, space, and various other factors. The trials conducted show that it is very important to improve accuracy and reduce computation time when learning rainfall datasets, which are large-scale datasets. Further research could apply optimization methods based on fractional derivatives to SVM classifiers. In this way, early warning of weather conditions in an area can be communicated to the government before a natural disaster occurs, as a service using disaster prevention tools. These models provide a good example of the ability of the Support Vector Machine classifier to model weather forecasting with high precision and efficiency.

ACKNOWLEDGEMENT

The authors would like to express their gratitude to the editors and anonymous reviewers for their valuable comments and suggestions, which improved the quality of this paper. We are also grateful to the supervisors of the doctoral program at the Faculty of Science and Technology, Airlangga University.

REFERENCES

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent, In: Proceedings of COMPSTAT 2010 - 19th International Conference on Computational Statistics, Keynote, Invited and Contributed Papers. doi: 10.1007/978-3-7908-2604-3-16.

Chervonenkis, A.Y. (2013). Early history of support vector machines, In: Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik. doi: 10.1007/978-3-642-41136-6_3.

Esteves, J.T., de Souza Rolim, G., and Ferraudo, A.S. (2019). Rainfall prediction methodology with binary multilayer perceptron neural networks, Climate Dynamics. doi: 10.1007/s00382-018-4252-x.

Hartono, O., Sitompul, O.S., Tulus, and Nababan, E.B. (2018). Optimization model of K-means clustering using artificial neural networks to handle class imbalance problem, In: IOP Conference Series: Materials Science and Engineering, 288, 1-9. doi: 10.1088/1757-899X/288/1/012075.

Lin, Y., et al. (2011). Large-scale image classification: Fast feature extraction and SVM training, In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. doi: 10.1109/CVPR.2011.5995477.

Luo, Z.Q., and Yu, W. (2006).
An introduction to convex optimization for communications and signal processing, IEEE Journal on Selected Areas in Communications. doi: 10.1109/JSAC.2006.879347.

Naveendrakumar, G., et al. (2019). South Asian perspective on temperature and rainfall extremes: A review, Atmospheric Research. doi: 10.1016/j.atmosres.2019.03.021.

Sakr, C., et al. (2017). Minimum precision requirements for the SVM-SGD learning algorithm, In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. doi: 10.1109/ICASSP.2017.7952334.

Schraudolph, N.N., Yu, J., and Günter, S. (2007). A stochastic quasi-Newton method for online convex optimization, Journal of Machine Learning Research, 2, 436-443.

Sehad, M., Lazri, M., and Ameur, S. (2017). Novel SVM-based technique to improve rainfall estimation over the Mediterranean region (north of Algeria) using the multispectral MSG SEVIRI imagery, Advances in Space Research. doi: 10.1016/j.asr.2016.11.042.

Sopyła, K., and Drozda, P. (2015). Stochastic gradient descent with Barzilai-Borwein update step for SVM, Information Sciences. doi: 10.1016/j.ins.2015.03.073.

Wijnhoven, R.G.J., and De With, P.H.N. (2010). Fast training of object detection using stochastic gradient descent, In: Proceedings - International Conference on Pattern Recognition. doi: 10.1109/ICPR.2010.112.

Wu, W., et al. (2011). Convergence analysis of online gradient method for BP neural networks, Neural Networks. doi: 10.1016/j.neunet.2010.09.007.

Xu, D., Zhang, H., and Mandic, D.P. (2015). Convergence analysis of an augmented algorithm for fully complex-valued neural networks, Neural Networks. doi: 10.1016/j.neunet.2015.05.003.

Young, C.C., Liu, W.C., and Wu, M.C. (2017). A physically based and machine learning hybrid approach for accurate rainfall-runoff modeling during extreme typhoon events, Applied Soft Computing Journal. doi: 10.1016/j.asoc.2016.12.052.

Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms, In: Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004. doi: 10.1145/1015330.1015332.