<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Information-Communication Technologies &amp; Embedded Systems</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Adaptive Least-Squares Support Vector Machine and Its Online Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yevgeniy Bodyanskiy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Deineko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filip Brodetskyi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danylo Kosmin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics, Artificial Intelligence Department</institution>
          ,
          <addr-line>Nauky av., 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kharkiv National University of Radio Electronics, Department of Informatics</institution>
          ,
          <addr-line>Nauky av., 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>12</volume>
      <issue>2020</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
<p>In this paper, an adaptive learning method for the least-squares support vector machine (LS-SVM) is proposed. The essential feature of this method is that the empirical risk minimization criterion is realized on a sliding window of fixed dimension, which essentially simplifies the numerical implementation of the procedure and allows processing of information generated by non-stationary nonlinear objects. For solving a wide class of tasks like information processing and system and object identification, first of all significantly nonlinear ones under structural and parametric uncertainty, artificial neural networks are widely used because of their universal approximation properties and ability to learn. One of the effective neural networks for solving this task is the least-squares support vector machine, which, however, cannot be used for processing growing data sets, when data are fed to the system in online mode. In the paper, an adaptive approach for learning the LS-SVM in online mode using a "sliding window", which permits solving a wide class of tasks within the common problem of Data Stream Mining, is proposed. The support vector machine is a neural hybrid system that combines learning based on optimization with memory (so-called lazy learning) and realizes the method of empirical risk optimization. The key concept in the synthesis of this network is the support vectors, which form a small subset of the most informative data vectors allocated in the learning process. Support vector machines are really efficient neural networks in conditions of small datasets and provide high-quality approximation.</p>
      </abstract>
      <kwd-group>
<kwd>Neural networks</kwd>
        <kwd>kernel function</kwd>
        <kwd>least-squares support vector machine</kwd>
        <kwd>empirical risk criterion</kwd>
        <kwd>sliding window</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-1a">
      <title>2. Adaptive learning method for the LS-SVM neural network</title>
      <p>The main disadvantage of the conventional SVM [1, 2] is the numerical cumbersomeness of the synaptic weight determination procedure, which reduces to a nonlinear programming problem with inequality constraints whose number is determined by the size of the learning sample. From this point of view, support vector machines based on the least squares method are more effective; however, one way or another, both neural networks process data only in batch mode.</p>
      <p>The transformation realized by the support vector machine can be written in the form
$$\hat{y}(x) = w^T \varphi(x) + w_0, \qquad (1)$$
where $w = (w_1, \ldots, w_j, \ldots, w_h)^T$ and $\varphi(x) = (\varphi_1(x), \ldots, \varphi_h(x))^T$. Its learning (in the case of LS-SVM) [3] is reduced to the simultaneous setting of the activation function centers at the points of the training dataset $x(k)$, $k = 1, 2, \ldots, N$, like in the GRNN [4, 5], and optimization of the quadratic criterion
$$E = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{k=1}^{N} e^2(k)$$
in the presence of a system of $h = N$ linear constraint-equations
$$y(k) - w^T \varphi(x(k)) - w_0 - e(k) = 0, \quad k = 1, 2, \ldots, N,$$
where $\gamma &gt; 0$ is a regularization parameter.</p>
      <p>In the batch mode, the LS-SVM tuning is connected with finding the saddle point of the Lagrange function
$$L(w, w_0, e(k), \lambda(k)) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{k=1}^{N} e^2(k) + \sum_{k=1}^{N} \lambda(k) \left( y(k) - w^T \varphi(x(k)) - w_0 - e(k) \right), \qquad (2)$$
so that, in addition to the synaptic weights $w$ and $w_0$, $N$ indefinite Lagrange multipliers $\lambda(k)$ must be found.</p>
      <p>The system of Kuhn-Tucker equations for the Lagrangian (2) can be written in the form
$$\begin{cases} \nabla_w L = w - \sum_{k=1}^{N} \lambda(k) \varphi(x(k)) = \vec{0}, \\ \dfrac{\partial L}{\partial w_0} = -\sum_{k=1}^{N} \lambda(k) = 0, \\ \dfrac{\partial L}{\partial e(k)} = \gamma e(k) - \lambda(k) = 0, \\ \dfrac{\partial L}{\partial \lambda(k)} = y(k) - w^T \varphi(x(k)) - w_0 - e(k) = 0 \end{cases} \qquad (3)$$
(here $\vec{0}$ is an $(h \times 1)$ vector formed by zeros), or
$$\begin{cases} w = \sum_{k=1}^{N} \lambda(k) \varphi(x(k)), \\ \sum_{k=1}^{N} \lambda(k) = 0, \\ \lambda(k) = \gamma e(k), \\ y(k) - w^T \varphi(x(k)) - w_0 - e(k) = 0. \end{cases}$$</p>
      <p>From the first equation of the system (3) it follows that the synaptic weights depend entirely on the values of the indefinite Lagrange multipliers, in connection with which the LS-SVM training is reduced to their determination, and the system (3) can be rewritten in a compact form:
$$\begin{pmatrix} 0 &amp; I_N^T \\ I_N &amp; \Omega(N) + \gamma^{-1} I_{N,N} \end{pmatrix} \begin{pmatrix} w_0 \\ \Lambda(N) \end{pmatrix} = \begin{pmatrix} 0 \\ Y(N) \end{pmatrix}, \qquad (4)$$
where $\Lambda(N) = (\lambda(1), \ldots, \lambda(k), \ldots, \lambda(N))^T$, $I_N$ is an $(N \times 1)$ vector formed by unities, $Y(N) = (y(1), \ldots, y(k), \ldots, y(N))^T$, $\Omega(N) = \{\Omega_{kj} = \varphi^T(x(k)) \varphi(x(j)) = K(x(k), x(j)),\ k = 1, 2, \ldots, N,\ j = 1, 2, \ldots, N\}$ is an $(N \times N)$ matrix, and $K(x(k), x(j))$ is some kernel function that satisfies the conditions of Mercer’s theorem, often the Gaussian [6]
$$K(x(k), x(j)) = \exp\left( \frac{-\| x(k) - x(j) \|^2}{2\sigma^2} \right). \qquad (5)$$
In this case, the transformation (1) implemented by the support vector machine can be rewritten in the form
$$\hat{y}(x) = \sum_{k=1}^{N} \lambda(k) K(x, x(k)) + w_0,$$
and its parameters $\lambda(k)$, $w_0$ can be found directly from (4) as
$$\begin{pmatrix} w_0 \\ \Lambda(N) \end{pmatrix} = \begin{pmatrix} 0 &amp; I_N^T \\ I_N &amp; \Omega(N) + \gamma^{-1} I_{N,N} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ Y(N) \end{pmatrix} = P(N) \begin{pmatrix} 0 \\ Y(N) \end{pmatrix}. \qquad (6)$$</p>
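      <p>To make the batch procedure concrete, a minimal Python/NumPy sketch of the solution (4)-(6) with the Gaussian kernel (5) is given below; the function and variable names are illustrative, not the exact implementation used in the experiments:</p>
      <preformat>
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # Omega_kj = K(x(k), x(j)) = exp(-||x(k) - x(j)||^2 / (2 sigma^2)), eq. (5)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    # Assemble and solve the linear system (4) for w0 and Lambda(N), eq. (6)
    N = X.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                          # I_N^T
    A[1:, 0] = 1.0                          # I_N
    A[1:, 1:] = gaussian_kernel_matrix(X, sigma) + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))        # (0, Y(N))^T
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                  # w0, Lambda(N)

def lssvm_predict(X_train, lam, w0, X_new, sigma=1.0):
    # y_hat(x) = sum_k lambda(k) K(x, x(k)) + w0
    d2 = (np.sum(X_new**2, axis=1)[:, None]
          + np.sum(X_train**2, axis=1)[None, :]
          - 2.0 * (X_new @ X_train.T))
    return np.exp(-d2 / (2.0 * sigma**2)) @ lam + w0
      </preformat>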
      <p>It is clear that the LS-SVM adaptive learning may be organized based on a numerically simple procedure for inverting the matrix in the right part of the system (6). Rewriting (6) for the $(k + 1)$-th time moment as
$$\begin{pmatrix} w_0 \\ \Lambda(k + 1) \end{pmatrix} = \begin{pmatrix} 0 &amp; I_{k+1}^T \\ I_{k+1} &amp; \Omega(k + 1) + \gamma^{-1} I_{k+1,k+1} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ Y(k + 1) \end{pmatrix} = \begin{pmatrix} P^{-1}(k) &amp; \vec{K}(x(k), x(k + 1)) \\ \vec{K}^T(x(k), x(k + 1)) &amp; 1 + \gamma^{-1} \end{pmatrix}^{-1} \begin{pmatrix} \vec{Y}(k) \\ y(k + 1) \end{pmatrix},$$
where $\vec{K}(x(k), x(k + 1)) = (1, K(x(1), x(k + 1)), \ldots, K(x(k), x(k + 1)))^T$ and $\vec{Y}(k) = (0, Y^T(k))^T$, and applying the Frobenius formula for the inversion of block matrices [7], we obtain a simple expression for calculating the matrix $P(k + 1)$. Denoting for brevity $\vec{K} = \vec{K}(x(k), x(k + 1))$,
$$P(k + 1) = \begin{pmatrix} P(k) + \dfrac{\left( P(k) \vec{K} \right) \left( \vec{K}^T P(k) \right)}{1 + \gamma^{-1} - \vec{K}^T P(k) \vec{K}} &amp; \dfrac{-P(k) \vec{K}}{1 + \gamma^{-1} - \vec{K}^T P(k) \vec{K}} \\ \dfrac{-\vec{K}^T P(k)}{1 + \gamma^{-1} - \vec{K}^T P(k) \vec{K}} &amp; \dfrac{1}{1 + \gamma^{-1} - \vec{K}^T P(k) \vec{K}} \end{pmatrix}. \qquad (7)$$
It is clear that with large volumes of the training set, the inversion of the $(N \times N)$ matrices is much easier to do using formula (7).</p>
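      <p>Under the same illustrative naming, the recursive update (7) can be sketched as follows, where P is the current matrix $P(k)$ and y_aug is the augmented right-hand side $(0, Y^T(k))^T$; this is a sketch under stated assumptions, not the authors' code:</p>
      <preformat>
import numpy as np

def lssvm_add_observation(P, X_train, x_new, y_new, y_aug, gamma=10.0, sigma=1.0):
    # One step of recursive learning: border P(k) with the new observation
    # and invert by the Frobenius (block-inversion) formula (7).
    d2 = np.sum((X_train - x_new)**2, axis=1)
    k_vec = np.concatenate(([1.0], np.exp(-d2 / (2.0 * sigma**2))))  # vec K(x(k), x(k+1))
    Pk = P @ k_vec                        # P is symmetric, so K^T P = (P K)^T
    s = 1.0 + 1.0 / gamma - k_vec @ Pk    # the common denominator in (7)
    n = P.shape[0]
    P_new = np.empty((n + 1, n + 1))
    P_new[:n, :n] = P + np.outer(Pk, Pk) / s
    P_new[:n, n] = -Pk / s
    P_new[n, :n] = -Pk / s
    P_new[n, n] = 1.0 / s
    y_aug_new = np.concatenate((y_aug, [y_new]))   # (0, Y^T(k), y(k+1))^T
    params = P_new @ y_aug_new                     # (w0, Lambda(k+1))^T as in (6)
    return P_new, y_aug_new, params
      </preformat>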
    </sec>
    <sec id="sec-1b">
      <title>3. The LS-SVM neural network learning on the “sliding window”</title>
      <p>It should be kept in mind that the training data set usually grows in time, and so does the number of neurons in the neural network, which will sooner or later lead to the "curse of dimensionality". That is why, in cases where the object that generates the data is non-stationary (medical data sets, time series of electricity consumption forecasting, exchange rates), it is better to organize the information processing on the “sliding window” [8, 9] that includes the $s$ last observations, which, in turn, means that the neural network will be formed by $s$ nodes. For processing non-stationary data, exponential weighting of old information is used more commonly; in our situation, however, it can lead to a significant increase in the number of kernel functions in the hidden layer of the proposed LS-SVM and make the system very bulky.</p>
      <p>In this case, introducing into consideration the $(s \times s)$ kernel function matrix
$$\Omega(k, s) = \left\{ \Omega_{ij}(k, s) = \varphi^T(x(i)) \varphi(x(j)) = K(x(i), x(j), s),\ i = k - s + 1, k - s + 2, \ldots, k;\ j = k - s + 1, k - s + 2, \ldots, k \right\},$$
the transformation implemented by a neural network with a fixed number of neurons can be written in the form
$$\hat{y}(x, k, s) = \sum_{i = k - s + 1}^{k} \lambda(i, s) K(x, x(i), s) + w_0(k, s). \qquad (8)$$</p>
      <p>Thereby, the parameters $\lambda(i, s)$, $w_0(k, s)$ can be found by solving a matrix equation of type (6), where $\Lambda(k, s) = (\lambda(k - s + 1, s), \ldots, \lambda(i, s), \ldots, \lambda(k, s))^T$ and $Y(k, s) = (y(k - s + 1), y(k - s + 2), \ldots, y(k))^T$.</p>
      <p>When the $(k + 1)$-th observation is fed to processing, it should be included into the $\Omega(k + 1, s)$ matrix, and at the same time the observation that was fed to processing at the $(k - s + 1)$-th moment of time should be deleted from this matrix. Herewith
$$\begin{pmatrix} w_0(k + 1, s) \\ \Lambda(k + 1, s) \end{pmatrix} = \begin{pmatrix} 0 &amp; I_s^T \\ I_s &amp; \Omega(k + 1, s) + \gamma^{-1} I_{s,s} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ Y(k + 1, s) \end{pmatrix}, \qquad (9)$$
where $\Lambda(k + 1, s) = (\lambda(k - s + 2, s), \ldots, \lambda(i, s), \ldots, \lambda(k + 1, s))^T$ and $Y(k + 1, s) = (y(k - s + 2), y(k - s + 3), \ldots, y(k + 1))^T$. It is easy to see that the relations (8), (9) are essentially an online adaptive algorithm for learning the neural network of fixed architecture.</p>
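      <p>A minimal sketch of this sliding-window scheme follows; it reuses the hypothetical lssvm_fit helper from section 2 and simply re-solves the $(s \times s)$ system of type (6) at every step:</p>
      <preformat>
import numpy as np
from collections import deque

def sliding_lssvm_step(window_X, window_y, x_new, y_new, s, gamma=10.0, sigma=1.0):
    # Include the (k+1)-th observation and delete the (k-s+1)-th one, then
    # re-solve the (s x s) system on the current window, relations (8), (9).
    window_X.append(x_new)
    window_y.append(y_new)
    if len(window_X) > s:
        window_X.popleft()
        window_y.popleft()
    X = np.array(window_X)
    y = np.array(window_y)
    return lssvm_fit(X, y, gamma=gamma, sigma=sigma)   # w0(k+1, s), Lambda(k+1, s)

# usage on a data stream:
# window_X, window_y = deque(), deque()
# for x_t, y_t in stream:
#     w0, lam = sliding_lssvm_step(window_X, window_y, x_t, y_t, s=50)
      </preformat>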
    </sec>
    <sec id="sec-2">
      <title>4. Experimental results</title>
      <p>In the experimental modeling, the tuning of the LS-SVM model [10] was investigated on several data sets [10, 11, 12, 13], together with the influence of the choice of the model parameters on the results of data processing and classification. For the first example, the artificially generated “Double Donut” data set was taken; it consists of two separated circles, one inside the other. The number of observations, the noise level, and the scale factor between the inner and outer circles were specified when generating the data for the neural system. Figure 1 shows the initial classification of the “Double Donut” data set.</p>
      <p>In the SVM model described earlier, there are two parameters that affect learning. It is also necessary to choose the kernel function for learning the LS-SVM model: a polynomial, radial basis (Gaussian), or linear function can be used for tuning the SVM neural network.</p>
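      <p>Experiments of this kind can be reproduced with scikit-learn [12]; the following hedged sketch generates a similar two-circles data set and compares the three kernels (the data set and parameter values here are illustrative, not the exact experimental configuration):</p>
      <preformat>
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# An artificial two-circles data set similar to "Double Donut":
# n_samples, noise and the inner/outer scale factor are the generator inputs.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", probability=True)
    clf.fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
      </preformat>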
      <p>For the comparison of classification results, linear, radial basis, and polynomial activation functions were used. In Figure 2, the results of classification by the SVM model with the linear activation function are demonstrated.</p>
      <p>Next, the radial basis kernel (5) was used as the activation function, in which the standard width parameter, adjusted manually, was taken; let us note that in the learning process the criterion (10) was optimized.</p>
      <p>Modifying the output layer of the SVM neural network, the probability of occurrence of each of the classes was obtained. Based on the probabilities of belonging to a class, a 3D graph that shows the probability of a given point belonging to one of the classes can be built. Consider the influence of the parameter $\sigma$ from formula (5) on the classification results; the regularization parameter
$$\gamma = \frac{1}{2\sigma^2} \qquad (11)$$
should be selected so that the classifier is neither too general nor overfitted. Increasing this parameter leads to increasing overfitting of the SVM, and the model becomes less general. If this parameter is equal to 0.15, the classification results are shown in Figures 3a and 3b.</p>
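      <p>The probability surfaces mentioned above can be obtained from such a model with the predict_proba method on a regular grid; a short illustrative fragment (assuming the fitted rbf model clf from the previous sketch, with invented grid bounds):</p>
      <preformat>
import numpy as np

# Class-membership probabilities on a regular grid give the 3D probability
# surface of the 2nd class (clf is the fitted rbf model from the sketch above).
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
proba_2nd = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
      </preformat>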
(b) The classification results of the</p>
      <p>SVM with radial basis kernel
(a) The surface of probability of the</p>
      <p>2nd class occurrence</p>
      <p>Thus, it is easy to see that the radial basis kernel was chosen correctly, and, as can be seen in Figure 3, the separating hypersurface is constructed correctly. It should also be noted that all observations outside the outer circle will belong to the first class, and the others to the second. It is interesting to see what happens when the parameter increases: Figure 4 shows the result of tuning the SVM model with the parameter equal to 10. The influence of the penalty parameter is illustrated in Figure 5; here, as an example, the SVM model with the radial basis kernel was used.</p>
      <p>Based on the results obtained in the first experiment, we can conclude that for this sample the penalty parameter does not have much effect. This may be due to the fact that, at these parameter values, the dividing hyperplane is found quickly and the error during training is very small.</p>
      <p>The second part of the experimental modeling was performed on the “Double helix” data set. This data set was generated by mathematical equations and cannot be linearly separated. It represents two helixes, one inside the other. Figure 6 shows what this data set looks like.</p>
      <p>A comparative analysis of classification by the support vector machine with the linear activation function and by the support vector machine with the kernel activation function was carried out. The results of the experiment are presented in Figures 7-10.</p>
      <p>As can be seen in Figure 7, the linear classifier divides the observation space into two halves, minimizing, as far as possible, the classification error. Since the linear classifier does not give the desired result, it is necessary to use the radial basis kernel.</p>
      <p>For the SVM model with the kernel activation function, the probability surface of one of the classes was built. The probability surface of belonging to the first class, with the penalty function and kernel parameters equal to one, is presented in Figure 8.</p>
      <p>As can be seen, the surface does not describe this data set well, as the model parameters are selected incorrectly for good classification. This is due to the fact that small values of the penalty function and radial basis function parameters create, for complex models, an insufficiently trained but overly general model. However, with increased parameters, the model becomes overfitted. The classification results are presented in Figure 9.</p>
      <p>As can be seen in Figure 9, the classification is not done well enough to correctly classify even those observations that were used in the model learning, because the model is too general and cannot create a valid dividing hyperplane, as the model parameters do not allow creating a more complex model.</p>
      <p>To solve this problem, it is necessary to increase the penalty function parameter or the parameter of the radial basis function. Graphs of the probability surfaces of the first class with increasing penalty and radial basis function parameters are presented in Figures 10a and 10b.</p>
      <p>Based on the results presented in Figures 10a and 10b, it can be said that changing both of the parameters greatly influences the classification results. At very large values of the parameters, the model loses the ability to generalize and can classify only those observations that were in the training sample, as presented in Figure 11.</p>
      <p>Figure 10: Graphs of probability surfaces: (a) increasing of the penalty function parameter; (b) increasing of the radial basis function parameter.</p>
      <p>In the next series of experiments, the «Breast Cancer Wisconsin Diagnostic» data set was used. This data set consists of observations that describe breast cancer cases; it contains features that were computed from the digitized results of fine-needle aspirates of breast masses. Also, for each of the observations, the classification attribute with the correct mark is presented: "M" for malignant, "B" for benign. Because this data set is high-dimensional, principal component analysis was used to compress the initial data for visualization. The compression results based on the principal component analysis are shown in Figure 12.</p>
      <p>As can be seen in this figure, the observations are not linearly separable. Next, 3D compression was made for building the separating hyperplane. Figure 13 demonstrates the results of the 3D compression.</p>
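      <p>A sketch of this compression step with scikit-learn (assumed, illustrative code, not the authors' exact pipeline):</p>
      <preformat>
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()                       # target: 0 = malignant, 1 = benign
X_std = StandardScaler().fit_transform(data.data)
X_2d = PCA(n_components=2).fit_transform(X_std)   # 2D compression (Figure 12)
X_3d = PCA(n_components=3).fit_transform(X_std)   # 3D compression (Figure 13)
      </preformat>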
      <p>The classification results for the «Breast Cancer Wisconsin Diagnostic» data set obtained by the LS-SVM with the linear kernel are presented in Table 1.</p>
      <p>As can be seen from these results, the accuracy of the model is 85%; among malignant neoplasms, about 60% were correctly classified, and among benign, about 99.5%. Based on these data, we can say that the classification of malignant neoplasms by the LS-SVM model does not give good enough results, as only slightly more than half of the observations were correctly classified, while the classification of benign neoplasms gives a relatively good result.</p>
      <p>The developed model is not good enough to use, because an accuracy of 85% does not provide a sufficiently adequate result to inform the patient or to make a diagnosis. The error matrix of the LS-SVM with the linear kernel is shown in Figure 14.</p>
      <p>Figure 14: The error matrix of the LS-SVM with the linear kernel: (a) quantitative estimate; (b) percentage estimate.</p>
      <p>Thus, after analyzing these results, it is clear that using the linear kernel for this model is not a good enough solution. To improve the classification results, let us change the activation function and use the radial basis kernel. The results of the classification by the LS-SVM model with the radial basis kernel are presented in Table 2.</p>
      <p>As can be seen from these results, the model with the radial basis kernel gives 100% correct classification, which is practically impossible with real-world data. This fact may indicate that the model is overfitted. In this situation, it is better to divide the initial data set into training and testing sets (the size of the testing set is 20% of the total sample). The results of the classification by the LS-SVM model with the radial basis kernel after retraining are presented in Table 3.</p>
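      <p>The held-out evaluation can be sketched as follows (parameter values are illustrative; X_2d and data are reused from the previous fragment):</p>
      <preformat>
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Hold out 20% of the sample for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_2d, data.target, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print(confusion_matrix(y_test, clf.predict(X_test)))
      </preformat>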
      <p>The accuracy on the testing sample is more than 95%, which is a good indicator, and almost 97% of malignant neoplasms were found correctly. The error matrices are presented in Figure 16.</p>
      <p>The experimental research confirms the effectiveness of the proposed approach for solving the task of Big Data Mining in the situation when the data are sequentially fed to processing in online mode.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>The adaptive learning method for the least squares support vector machine (LS-SVM) neural network with fixed architecture was proposed. The distinctive feature of this method is that the empirical risk minimization criterion is realized on a sliding window of fixed dimension, which simplifies the numerical implementation of the procedure and allows processing information generated by non-stationary objects.</p>
      <p>The main benefit of the investigated approach is that it can be used in situations when observations are fed for processing in online mode from nonlinear and non-stationary objects in the presence of outliers in the input data. Also, the proposed system does not suffer from the curse of dimensionality inherent in SVM and LS-SVM, because the number of radial basis functions in the hidden layer is limited by the size of the sliding window.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Chervonenkis</surname>
          </string-name>
          ,
          <article-title>Pattern Recognition Theory (The Nature of Statistical Learning Theory)</article-title>
          , Nauka, Moscow,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y. V.</given-names>
            <surname>Bodyanskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Deineko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Eze</surname>
          </string-name>
          ,
          <article-title>Kernel fuzzy kohonen's clustering neural network and it's recursive learning</article-title>
          ,
          <source>Automatic Control and Computer Sciences</source>
          <volume>52</volume>
          (
          <year>2018</year>
          )
          <fpage>166</fpage>
          -
          <lpage>174</lpage>
          . doi:10.3103/S0146411618030045.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. A. K.</given-names>
            <surname>Suykens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. V.</given-names>
            <surname>Gestel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Brabanter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Moor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vandewalle</surname>
          </string-name>
          , Least Squares Support Vector Machines, World Scientific, Singapore,
          <year>2002</year>
          . doi:10.1142/5089.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Specht</surname>
          </string-name>
          ,
          <article-title>A general regression neural network</article-title>
          ,
          <source>IEEE Transactions on Neural Networks</source>
          <volume>2</volume>
          (
          <year>1991</year>
          )
          <fpage>568</fpage>
          -
          <lpage>576</lpage>
          . doi:10.1109/72.97934.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y. V.</given-names>
            <surname>Bodyanskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Deineko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. V.</given-names>
            <surname>Kutsenko</surname>
          </string-name>
          ,
          <article-title>On-line kernel clustering based on the general regression neural network and t. kohonen's self-organizing map</article-title>
          ,
          <source>Automatic Control and Computer Sciences</source>
          <volume>51</volume>
          (
          <year>2017</year>
          )
          <fpage>55</fpage>
          -
          <lpage>62</lpage>
          . doi:10.3103/S0146411617010023.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Izonin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gregus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tkachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tkachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kryvinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vitynskyi</surname>
          </string-name>
          ,
          <article-title>Committee of sgtm neural-like structures with rbf kernel for insurance cost prediction task</article-title>
          ,
          <source>in: 2019 IEEE 2nd Ukraine Conference on Electrical and Computer Engineering (UKRCON</source>
          <year>2019</year>
          ), IEEE,
          <year>2019</year>
          , pp.
          <fpage>1037</fpage>
          -
          <lpage>1040</lpage>
          . doi:10.1109/UKRCON.2019.8879905.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F. R.</given-names>
            <surname>Gantmacher</surname>
          </string-name>
          ,
          <source>The Theory of Matrices</source>
          , Chelsea Publishing Company, New York,
          <year>1959</year>
          . doi:10.1126/science.131.3408.1216-a.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Herman-Safar</surname>
          </string-name>
          ,
          <source>Time based cross validation</source>
          ,
          <year>2020</year>
          . URL: https://towardsdatascience.com/time-based-cross-validation-d259b13d42b8.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Peterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Davie</surname>
          </string-name>
          ,
          <source>Computer Networks: A Systems Approach</source>
          , Morgan Kaufmann,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Haykin</surname>
          </string-name>
          ,
          <source>Neural Networks: A Comprehensive Foundation</source>
          , Prentice Hall, Inc., New Jersey,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <source>Jupyter Notebook documentation</source>
          ,
          <year>2020</year>
          . URL: http://jupyter.org/documentation.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <source>Scikit-learn documentation</source>
          ,
          <year>2020</year>
          . URL: http://scikit-learn.org/stable/documentation.html.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <source>NumPy library documentation</source>
          ,
          <year>2020</year>
          . URL: http://www.numpy.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>