=Paper=
{{Paper
|id=Vol-2870/paper122
|storemode=property
|title=Robust Training of ADALINA Based on the Criterion of the Maximum Correntropy in the Presence of Outliers and Correlated Noise
|pdfUrl=https://ceur-ws.org/Vol-2870/paper122.pdf
|volume=Vol-2870
|authors=Oleg Rudenko,Oleksandr Bezsonov
|dblpUrl=https://dblp.org/rec/conf/colins/RudenkoB21
}}
==Robust Training of ADALINA Based on the Criterion of the Maximum Correntropy in the Presence of Outliers and Correlated Noise==
Robust Training of ADALINA Based on the Criterion of the Maximum Correntropy in the Presence of Outliers and Correlated Noise

Oleg Rudenko and Oleksandr Bezsonov

Kharkiv National University of Radio Electronics, Nauky Ave. 14, Kharkiv, 61166, Ukraine

Abstract
In this paper the main relations describing an adaptive multi-step algorithm for training ADALINA are obtained. The use of such an algorithm accelerates the learning process by using information not only about the last cycle, but also about a number of previous cycles. The robustness of the estimates is ensured by the application of the maximum correntropy criterion.

Keywords: ADALINA, optimization, neural network, algorithm, gradient, training, estimation

COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: oleh.rudenko@nure.ua (O. Rudenko); oleksandr.bezsonov@nure.ua (O. Bezsonov)
ORCID: 0000-0003-0859-2015 (O. Rudenko); 0000-0001-6104-4275 (O. Bezsonov)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

ADALINA (Adaptive Linear Element) was the first linear neural network, proposed by B. Widrow and M.E. Hoff as an alternative to the perceptron [1]. Subsequently, this element and the algorithm for its training found fairly wide application in problems of identification, control, filtering, etc. The Widrow-Hoff learning algorithm is the Kaczmarz algorithm for solving systems of linear algebraic equations. The properties of this algorithm as applied to the identification problem are described in sufficient detail in [2]. In [3], the regularized Kaczmarz (Widrow-Hoff) algorithm was used to train ADALINA in the problem of estimating non-stationary parameters. In this paper, a multi-step learning algorithm is considered, namely a recurrent algorithm of the current regression analysis (CRA) method, which accelerates the ADALINA learning process by using information not only about the last cycle (as in the Widrow-Hoff algorithm), but also about a number of previous cycles.

2. The task of the ADALINA training

ADALINA is described by the equation

y_{n+1} = c^T x_{n+1} + \xi_{n+1},   (1)

where y_{n+1} is the observed output signal; x_{n+1} = (x_{1,n+1}, x_{2,n+1}, \ldots, x_{N,n+1})^T is the (N \times 1) vector of input signals; c = (c_1, c_2, \ldots, c_N)^T is the (N \times 1) vector of the required parameters; \xi_{n+1} is noise; n is discrete time.

The task of its training is to determine (estimate) the vector of parameters c and reduces to minimizing some preselected quality functional (identification criterion)

F(e_n) = \sum_{i=1}^{n} \rho(e_i),   (2)

where e_i = y_i - \hat{y}_i; \hat{y}_i = c_{i-1}^T x_i is the output signal of the model; c_i is the estimate of the vector c; \rho(e_i) is some differentiable loss function satisfying the conditions
1) \rho(e_i) \ge 0; 2) \rho(0) = 0; 3) \rho(e_i) = \rho(-e_i); 4) \rho(e_i) \ge \rho(e_j) for |e_i| \ge |e_j|.

The identification task is to find an estimate \hat{c} defined as a solution of the extremal problem

F(e) \to \min,   (3)

or as a solution of the system of equations

\partial F(e)/\partial c_j = \sum_{i=1}^{n} \varphi(e_i)\,\partial e_i/\partial c_j = 0, \quad j = 1, \ldots, N,   (4)

where \varphi(e_i) = \partial \rho(e_i)/\partial e_i is the influence function.

If we introduce the weight function w(e) = \varphi(e)/e, then the system of equations (4) can be written as follows:

\sum_{i=1}^{n} w(e_i) e_i\,\partial e_i/\partial c_j = 0,   (5)

and minimization of functional (2) becomes equivalent to minimization of the weighted quadratic functional most often encountered in practice:

\sum_{i=1}^{n} w(e_i) e_i^2 \to \min.   (6)

When \rho(e_i) = 0.5 e_i^2 is chosen, the influence function \varphi(e_i) = e_i grows linearly with increasing e_i, which explains the instability of the least-squares estimate to outliers and to noise whose distributions have long "tails".
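To make the role of the loss, influence, and weight functions concrete, the following small sketch (purely illustrative, not part of the original paper; the Huber tuning constant k = 1.345 is a conventional choice introduced here only for the example) compares the quadratic loss with Huber's combined loss:

```python
import numpy as np

# Illustrative sketch: influence phi(e) = d rho / de and weight w(e) = phi(e)/e
# for the quadratic loss and for Huber's combined loss.
def phi_quadratic(e):
    return e                       # influence grows linearly -> sensitive to outliers

def phi_huber(e, k=1.345):
    return np.clip(e, -k, k)       # influence is bounded -> robust to heavy tails

def weight(phi, e):
    """w(e) = phi(e)/e, cf. (5), (6); w(0) is set to the limiting value 1."""
    e = np.asarray(e, dtype=float)
    w = np.ones_like(e)
    nz = e != 0
    w[nz] = phi(e[nz]) / e[nz]
    return w

e = np.array([-10.0, -1.0, -0.1, 0.0, 0.1, 1.0, 10.0])
print(weight(phi_quadratic, e))    # all ones: every residual gets full weight
print(weight(phi_huber, e))        # large residuals are strongly down-weighted
```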
A robust M-estimate is an estimate \hat{c} defined as a solution of the extremal problem (3) or of the system of equations (4) in which the loss function \rho(e_i) is chosen to be other than quadratic. There is a fairly large number of functionals that provide robust M-estimates; the most common, however, are the combined functionals proposed by Huber [4] and Hampel [5], consisting of a quadratic part, which ensures optimality of the estimates for a Gaussian distribution, and a modular (absolute-value) part, which makes it possible to obtain a more robust estimate for distributions with heavy tails. However, the efficiency of the obtained robust estimates depends substantially on the numerous parameters used in these criteria, which are selected on the basis of the researcher's experience.

Recently, in solving problems of identification, filtering, etc., robust algorithms obtained not by minimization (3) but by maximization of the correntropy criterion [6–13] have been gaining popularity. These algorithms are simple to implement and efficient.

3. Correntropy and algorithms for its maximization

Correntropy, defined as a localized measure of similarity, has proven to be very effective for obtaining robust estimates because it is less sensitive to outliers [6–13]. For two random variables X and Y the correntropy is defined as

V(X, Y) = M[k_\sigma(X, Y)],   (7)

where k_\sigma(\cdot) is a translation-invariant Mercer kernel and \sigma is the kernel width. The kernel most widely used in calculating the correntropy is the Gaussian function

k_\sigma(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - y)^2}{2\sigma^2}\right).   (8)

Calculating the correntropy requires the joint distribution of the random variables X and Y, which is usually unknown. In practice only a finite number of samples \{x_i, y_i\}, i = 1, 2, \ldots, N, is available; therefore the simplest estimate of the correntropy is calculated as

\hat{V}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} k_\sigma(x_i - y_i).   (9)

In problems of identification, filtering, etc., the correntropy between the desired output signal d_i and the output signal of the model y_i is used as the functional. When Gaussian kernels are used, the optimized functional takes the form

J_{corr}(n) = \frac{1}{N} \sum_{i=n-N+1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{e_i^2}{2\sigma^2}\right),   (10)

where e_i = d_i - y_i is the identification (filtering) error.

The gradient algorithm for optimizing (10) with N = 1 has the form [6–9]

w_{n+1} = w_n + \mu \exp\left(-\frac{e_{n+1}^2}{2\sigma^2}\right) e_{n+1} x_{n+1},   (11)

where \mu is a parameter that affects the convergence rate.

In [12], to suppress impulse noise, a recursive weighted least squares method (RWLS) was proposed that weights each new measurement by

\lambda_{n+1} = \exp\left(-\frac{e_{n+1}^2}{2\sigma^2}\right)   (12)

and has the form

c_{n+1} = c_n + \frac{\lambda_{n+1} P_n x_{n+1}}{\gamma + \lambda_{n+1} x_{n+1}^T P_n x_{n+1}}\left(y_{n+1} - c_n^T x_{n+1}\right),   (13)

P_{n+1} = \frac{1}{\gamma}\left(P_n - \frac{\lambda_{n+1} P_n x_{n+1} x_{n+1}^T P_n}{\gamma + \lambda_{n+1} x_{n+1}^T P_n x_{n+1}}\right).   (14)

Here 0 < \gamma \le 1 is a weighting (forgetting) coefficient. Thus, formula (14) for calculating P_{n+1} uses the approximation

P_{n+1}^{-1} \approx \gamma P_n^{-1} + \lambda_{n+1} x_{n+1} x_{n+1}^T.   (15)

As is known, the introduction of the parameter \gamma into the algorithm is advisable when identifying non-stationary parameters. Another approach to estimating non-stationary parameters is to use a limited number of measurements in the RLS, which leads to the algorithm of the current regression analysis method [14].
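As an illustration of how the weight (12) enters the recursion (13), (14), the sketch below gives one possible implementation (an assumed, simplified version written for this exposition; it is not the authors' code, and the toy data are invented):

```python
import numpy as np

# One step of the correntropy-weighted recursive update (12)-(14): each new
# measurement is weighted by lambda = exp(-e^2 / (2 sigma^2)), so gross outliers
# receive a weight close to zero and barely change the estimate.
def mcc_rls_step(c, P, x, y, sigma=1.0, gamma=1.0):
    """Update the estimate c and matrix P for the model y ~ c^T x."""
    e = y - c @ x                                   # a priori error
    lam = np.exp(-e**2 / (2.0 * sigma**2))          # correntropy weight (12)
    gain = lam * (P @ x) / (gamma + lam * x @ P @ x)
    c_new = c + gain * e                            # cf. (13)
    P_new = (P - np.outer(gain, x @ P)) / gamma     # cf. (14)
    return c_new, P_new

# Toy usage: identify c_true from data with occasional impulsive outliers.
rng = np.random.default_rng(0)
c_true = np.array([1.0, -2.0, 0.5])
c, P = np.zeros(3), 1e3 * np.eye(3)
for n in range(500):
    x = rng.normal(size=3)
    noise = rng.normal(scale=0.1) + (rng.random() < 0.05) * rng.normal(scale=20.0)
    y = c_true @ x + noise
    c, P = mcc_rls_step(c, P, x, y, sigma=1.0, gamma=0.995)
print(c)   # should be close to c_true despite the outliers
```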
4. Recurrent CRA algorithm with correlated noise

Consider the problem of training ADALINA described by equation (1), which in matrix form (after information has been obtained at the (n+1)-th iteration) is written as

Y_{n+1} = X_{n+1} c + \Xi_{n+1},   (16)

where Y_{n+1} = (y_1, y_2, \ldots, y_{n+1})^T is the vector of output signals; X_{n+1} = (x_1, x_2, \ldots, x_{n+1})^T is the matrix of input signals; c = (c_1, c_2, \ldots, c_N)^T is the vector of estimated parameters; \Xi_{n+1} = (\xi_1, \xi_2, \ldots, \xi_{n+1})^T is the vector of noise.

The covariance matrix D_{n+1} of the noise \Xi_{n+1} has the block form

D_{n+1} = M[\Xi_{n+1}\Xi_{n+1}^T] = \begin{pmatrix} D_n & d_n \\ d_n^T & d_{n+1,n+1} \end{pmatrix},

where d_{ij} = M[\xi_i \xi_j]; d_n^T = (d_{n+1,1}, d_{n+1,2}, \ldots, d_{n+1,n}) = M[\xi_{n+1}\Xi_n^T].

As is known, applying the estimate c_{n+1} = (X_{n+1}^T X_{n+1})^{-1} X_{n+1}^T Y_{n+1} to a model with correlated noise gives estimates whose variances are underestimated. The Gauss-Markov estimate obtained by minimizing a quadratic functional has the form

c_{n+1} = (X_{n+1}^T D_{n+1}^{-1} X_{n+1})^{-1} X_{n+1}^T D_{n+1}^{-1} Y_{n+1}.   (17)

The current regression analysis algorithm, which has the form

c_{n+1|L} = (X_{n+1|L}^T X_{n+1|L})^{-1} X_{n+1|L}^T Y_{n+1|L},   (18)

where

Y_{n+1|L} = \begin{pmatrix} Y_{n|L-1} \\ y_{n+1} \end{pmatrix} is the (L \times 1) vector,   (19)

X_{n+1|L} = \begin{pmatrix} X_{n|L-1} \\ x_{n+1}^T \end{pmatrix} is the (L \times N) matrix,   (20)

was proposed in [14]. In [15] a modification of this algorithm that uses a mechanism of forgetting (smoothing) past information is considered. Here L = const (L \ge N) is the algorithm's memory.

By analogy with the Gauss-Markov estimate (17), the following estimate can be obtained:

c_{n+1|L} = (X_{n+1|L}^T D_{n+1|L}^{-1} X_{n+1|L})^{-1} X_{n+1|L}^T D_{n+1|L}^{-1} Y_{n+1|L},   (21)

where D_{n+1|L} = M[\Xi_{n+1|L}\Xi_{n+1|L}^T] is the covariance matrix of the noise on the window of the last L measurements,

D_{n+1|L} = \begin{pmatrix} D_{n|L-1} & d_n \\ d_n^T & d_{n+1,n+1} \end{pmatrix}, \quad d_n = M[\Xi_{n|L-1}\xi_{n+1}].

Since the matrix D_{n+1|L} has a block representation, its inverse is

D_{n+1|L}^{-1} = \begin{pmatrix} D_{n|L-1}^{-1} + \dfrac{D_{n|L-1}^{-1} d_n d_n^T D_{n|L-1}^{-1}}{\alpha_{n+1}} & -\dfrac{D_{n|L-1}^{-1} d_n}{\alpha_{n+1}} \\ -\dfrac{d_n^T D_{n|L-1}^{-1}}{\alpha_{n+1}} & \dfrac{1}{\alpha_{n+1}} \end{pmatrix},

where \alpha_{n+1} = d_{n+1,n+1} - d_n^T D_{n|L-1}^{-1} d_n.

Assume that at the n-th cycle the estimate defined by

\left(X_{n|L}^T D_{n|L}^{-1} X_{n|L}\right) c_{n|L} = X_{n|L}^T D_{n|L}^{-1} Y_{n|L}   (22)

has been obtained. The arrival of new information (adding a new measurement) leads to the calculation of an estimate which, by analogy with (17), can be written as follows:

c_{n+1|L+1} = (X_{n+1|L+1}^T D_{n+1|L+1}^{-1} X_{n+1|L+1})^{-1} X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1},   (23)

where

Y_{n+1|L+1} = \begin{pmatrix} Y_{n|L} \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} y_{n-L+1} \\ Y_{n+1|L} \end{pmatrix} is the ((L+1) \times 1) vector;   (24)

X_{n+1|L+1} = \begin{pmatrix} X_{n|L} \\ x_{n+1}^T \end{pmatrix} = \begin{pmatrix} x_{n-L+1}^T \\ X_{n+1|L} \end{pmatrix} is the ((L+1) \times N) matrix.   (25)

Let us introduce the notation

P_{n+1|L+1}^{-1} = X_{n+1|L+1}^T D_{n+1|L+1}^{-1} X_{n+1|L+1}; \quad P_{n|L}^{-1} = X_{n|L}^T D_{n|L}^{-1} X_{n|L}; \quad P_{n+1|L}^{-1} = X_{n+1|L}^T D_{n+1|L}^{-1} X_{n+1|L},

and calculate

P_{n+1|L+1}^{-1} = X_{n|L}^T D_{n|L}^{-1} X_{n|L} + \frac{1}{\alpha_{n+1}}\left(x_{n+1} - X_{n|L}^T D_{n|L}^{-1} d_n\right)\left(x_{n+1} - X_{n|L}^T D_{n|L}^{-1} d_n\right)^T = P_{n|L}^{-1} + \tilde{x}_{n+1}\tilde{x}_{n+1}^T,

where

\tilde{x}_{n+1} = \frac{x_{n+1} - X_{n|L}^T D_{n|L}^{-1} d_n}{\sqrt{\alpha_{n+1}}}.

Similarly,

X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1} = X_{n|L}^T D_{n|L}^{-1} Y_{n|L} + \tilde{x}_{n+1}\tilde{y}_{n+1},

where

\tilde{y}_{n+1} = \frac{y_{n+1} - d_n^T D_{n|L}^{-1} Y_{n|L}}{\sqrt{\alpha_{n+1}}}.

Adding \tilde{x}_{n+1}\tilde{x}_{n+1}^T c_{n|L} to both sides of (22),

P_{n|L}^{-1} c_{n|L} + \tilde{x}_{n+1}\tilde{x}_{n+1}^T c_{n|L} = X_{n|L}^T D_{n|L}^{-1} Y_{n|L} + \tilde{x}_{n+1}\tilde{x}_{n+1}^T c_{n|L},

and subtracting this augmented relation (22) from (23) (taking into account the expressions for P_{n+1|L+1}^{-1} and X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1}), we obtain

P_{n+1|L+1}^{-1}\left(c_{n+1|L+1} - c_{n|L}\right) = \tilde{x}_{n+1}\left(\tilde{y}_{n+1} - c_{n|L}^T \tilde{x}_{n+1}\right),

or

c_{n+1|L+1} = c_{n|L} + P_{n+1|L+1}\tilde{x}_{n+1}\left(\tilde{y}_{n+1} - c_{n|L}^T \tilde{x}_{n+1}\right),

where

P_{n+1|L+1} = P_{n|L} - \frac{P_{n|L}\tilde{x}_{n+1}\tilde{x}_{n+1}^T P_{n|L}}{1 + \tilde{x}_{n+1}^T P_{n|L}\tilde{x}_{n+1}}.
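The rank-one form of the update derived above is easy to check numerically. The following sketch (illustrative only; the covariance matrix is generated at random) verifies that P^{-1}_{n+1|L+1} = P^{-1}_{n|L} + x̃_{n+1} x̃^T_{n+1} for the whitened regressor x̃_{n+1} defined above:

```python
import numpy as np

# Numerical check of the rank-one update that follows from the block inverse of
# the noise covariance D_{n+1|L+1}. Not the authors' code; data are random.
rng = np.random.default_rng(1)
L, N = 6, 3
X_old = rng.normal(size=(L, N))            # X_{n|L}
x_new = rng.normal(size=N)                 # x_{n+1}
X_new = np.vstack([X_old, x_new])          # X_{n+1|L+1}

A = rng.normal(size=(L + 1, L + 1))
D_new = A @ A.T + (L + 1) * np.eye(L + 1)  # D_{n+1|L+1}, positive definite
D_old = D_new[:L, :L]                      # D_{n|L}
d = D_new[:L, L]                           # cross-covariance vector d_n
alpha = D_new[L, L] - d @ np.linalg.solve(D_old, d)

# whitened regressor of the new measurement
x_tilde = (x_new - X_old.T @ np.linalg.solve(D_old, d)) / np.sqrt(alpha)

P_inv_old = X_old.T @ np.linalg.solve(D_old, X_old)
P_inv_new = X_new.T @ np.linalg.solve(D_new, X_new)
print(np.allclose(P_inv_new, P_inv_old + np.outer(x_tilde, x_tilde)))   # True
```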
When the outdated information received at step n - L + 1 is discarded, we pass from the estimate c_{n+1|L+1} to the estimate c_{n+1|L}. To obtain the corresponding rules for correcting the estimate, we proceed as follows. We use the block representation of the covariance matrix D_{n+1|L+1},

D_{n+1|L+1} = \begin{pmatrix} d_{n-L+1,n-L+1} & d_{n-L+1}^T \\ d_{n-L+1} & D_{n+1|L} \end{pmatrix},

where d_{n-L+1}^T = (d_{n-L+1,n-L+2}, \ldots, d_{n-L+1,n+1}) = M[\xi_{n-L+1}\Xi_{n+1|L}^T], and the representation of the inverse matrix D_{n+1|L+1}^{-1} as

D_{n+1|L+1}^{-1} = \begin{pmatrix} \dfrac{1}{\alpha_{n-L+1}} & -\dfrac{d_{n-L+1}^T D_{n+1|L}^{-1}}{\alpha_{n-L+1}} \\ -\dfrac{D_{n+1|L}^{-1} d_{n-L+1}}{\alpha_{n-L+1}} & D_{n+1|L}^{-1} + \dfrac{D_{n+1|L}^{-1} d_{n-L+1} d_{n-L+1}^T D_{n+1|L}^{-1}}{\alpha_{n-L+1}} \end{pmatrix},

where \alpha_{n-L+1} = d_{n-L+1,n-L+1} - d_{n-L+1}^T D_{n+1|L}^{-1} d_{n-L+1}.

In this case

P_{n+1|L+1}^{-1} = X_{n+1|L+1}^T D_{n+1|L+1}^{-1} X_{n+1|L+1} = X_{n+1|L}^T D_{n+1|L}^{-1} X_{n+1|L} + \tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T = P_{n+1|L}^{-1} + \tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T,

where

\tilde{x}_{n-L+1} = \frac{x_{n-L+1} - X_{n+1|L}^T D_{n+1|L}^{-1} d_{n-L+1}}{\sqrt{\alpha_{n-L+1}}}.

Similarly,

X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1} = X_{n+1|L}^T D_{n+1|L}^{-1} Y_{n+1|L} + \tilde{x}_{n-L+1}\tilde{y}_{n-L+1},

where

\tilde{y}_{n-L+1} = \frac{y_{n-L+1} - d_{n-L+1}^T D_{n+1|L}^{-1} Y_{n+1|L}}{\sqrt{\alpha_{n-L+1}}}.

Subtracting \tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T c_{n+1|L+1} from both sides of (23) gives

P_{n+1|L+1}^{-1} c_{n+1|L+1} - \tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T c_{n+1|L+1} = X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1} - \tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T c_{n+1|L+1}.

Considering that

X_{n+1|L}^T D_{n+1|L}^{-1} X_{n+1|L}\, c_{n+1|L} = X_{n+1|L}^T D_{n+1|L}^{-1} Y_{n+1|L},   (26)

subtraction of relation (23) from (26) (taking into account the expressions for P_{n+1|L+1}^{-1} and X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1}) yields

P_{n+1|L}^{-1}\left(c_{n+1|L} - c_{n+1|L+1}\right) = \tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T c_{n+1|L+1} - \tilde{x}_{n-L+1}\tilde{y}_{n-L+1},

whence

c_{n+1|L} = c_{n+1|L+1} - P_{n+1|L}\tilde{x}_{n-L+1}\left(\tilde{y}_{n-L+1} - c_{n+1|L+1}^T \tilde{x}_{n-L+1}\right),

and since P_{n+1|L}^{-1} = P_{n+1|L+1}^{-1} - \tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T,

P_{n+1|L} = P_{n+1|L+1} + \frac{P_{n+1|L+1}\tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T P_{n+1|L+1}}{1 - \tilde{x}_{n-L+1}^T P_{n+1|L+1}\tilde{x}_{n-L+1}}.

Thus, the algorithm has the form (the first two relations describe the inclusion of newly arrived information, and the next two describe the discarding of outdated information):

c_{n+1|L+1} = c_{n|L} + P_{n+1|L+1}\tilde{x}_{n+1}\left(\tilde{y}_{n+1} - c_{n|L}^T \tilde{x}_{n+1}\right);   (27)

P_{n+1|L+1} = P_{n|L} - \frac{P_{n|L}\tilde{x}_{n+1}\tilde{x}_{n+1}^T P_{n|L}}{1 + \tilde{x}_{n+1}^T P_{n|L}\tilde{x}_{n+1}};   (28)

c_{n+1|L} = c_{n+1|L+1} - P_{n+1|L}\tilde{x}_{n-L+1}\left(\tilde{y}_{n-L+1} - c_{n+1|L+1}^T \tilde{x}_{n-L+1}\right);   (29)

P_{n+1|L} = P_{n+1|L+1} + \frac{P_{n+1|L+1}\tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T P_{n+1|L+1}}{1 - \tilde{x}_{n-L+1}^T P_{n+1|L+1}\tilde{x}_{n-L+1}}.   (30)

If the outdated information is discarded first and the newly received information is included afterwards, the algorithm takes the form

c_{n|L-1} = c_{n|L} - P_{n|L-1}\tilde{x}_{n-L+1}\left(\tilde{y}_{n-L+1} - c_{n|L}^T \tilde{x}_{n-L+1}\right);   (31)

P_{n|L-1} = P_{n|L} + \frac{P_{n|L}\tilde{x}_{n-L+1}\tilde{x}_{n-L+1}^T P_{n|L}}{1 - \tilde{x}_{n-L+1}^T P_{n|L}\tilde{x}_{n-L+1}};   (32)

c_{n+1|L} = c_{n|L-1} + P_{n+1|L}\tilde{x}_{n+1}\left(\tilde{y}_{n+1} - c_{n|L-1}^T \tilde{x}_{n+1}\right);   (33)

P_{n+1|L} = P_{n|L-1} - \frac{P_{n|L-1}\tilde{x}_{n+1}\tilde{x}_{n+1}^T P_{n|L-1}}{1 + \tilde{x}_{n+1}^T P_{n|L-1}\tilde{x}_{n+1}},   (34)

where

\tilde{x}_{n-L+1} = \frac{x_{n-L+1} - X_{n+1|L}^T D_{n+1|L}^{-1} d_{n-L+1}}{\sqrt{\alpha_{n-L+1}}}; \quad \tilde{x}_{n+1} = \frac{x_{n+1} - X_{n|L}^T D_{n|L}^{-1} d_n}{\sqrt{\alpha_{n+1}}};   (35)

\tilde{y}_{n-L+1} = \frac{y_{n-L+1} - d_{n-L+1}^T D_{n+1|L}^{-1} Y_{n+1|L}}{\sqrt{\alpha_{n-L+1}}}; \quad \tilde{y}_{n+1} = \frac{y_{n+1} - d_n^T D_{n|L}^{-1} Y_{n|L}}{\sqrt{\alpha_{n+1}}}.
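For the special case of uncorrelated unit-variance noise (D = I) the whitened variables coincide with the raw ones (x̃ = x, ỹ = y), and relations (27)–(30) reduce to a plain sliding-window RLS. The sketch below (an assumed implementation written for illustration, not the authors' code) shows the add-then-discard structure of the algorithm in this simplified setting:

```python
import numpy as np

# Sliding-window recursion (27)-(30) for the special case D = I.
def add_measurement(c, P, x, y):
    """Include a new measurement: cf. (27), (28)."""
    Px = P @ x
    P_new = P - np.outer(Px, Px) / (1.0 + x @ Px)
    c_new = c + P_new @ x * (y - c @ x)
    return c_new, P_new

def discard_measurement(c, P, x, y):
    """Exclude the oldest measurement: cf. (29), (30)."""
    Px = P @ x
    P_new = P + np.outer(Px, Px) / (1.0 - x @ Px)
    c_new = c - P_new @ x * (y - c @ x)
    return c_new, P_new

# Toy usage with a window of L = 20 measurements.
rng = np.random.default_rng(2)
c_true, L = np.array([0.7, -1.2, 2.0]), 20
X = rng.normal(size=(200, 3))
Y = X @ c_true + 0.05 * rng.normal(size=200)

# initialize on the first L measurements by the batch formula (18)
P = np.linalg.inv(X[:L].T @ X[:L])
c = P @ X[:L].T @ Y[:L]
for n in range(L, 200):
    c, P = add_measurement(c, P, X[n], Y[n])
    c, P = discard_measurement(c, P, X[n - L], Y[n - L])
print(c)   # close to c_true, estimated from the last L measurements only
```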
5. Recurrent CRA algorithm in the presence of outliers and correlated noise

As noted above, the current regression analysis algorithm of the form (18) allows two forms of presenting the estimates, depending on the order in which the newly received and the oldest measurements are used. Let us dwell on this in more detail. Obtaining new information (adding a new measurement) leads to the calculation of an estimate that can be written in the form (23). Since at each cycle the estimate is constructed with L = const, consider the case when the new measurement is added first and the obsolete one is excluded afterwards. The recurrent form of estimate (23) can be obtained by standard methods using the block representation of the vectors and matrices (24), (25), which allows (23) to be rewritten as follows:

c_{n+1|L} = \left(X_{n|L}^T X_{n|L} + x_{n+1}x_{n+1}^T - x_{n-L+1}x_{n-L+1}^T\right)^{-1}\left(X_{n|L}^T Y_{n|L} + x_{n+1}y_{n+1} - x_{n-L+1}y_{n-L+1}\right).   (36)

Let us consider a modification of the current regression analysis algorithm intended for maximizing the correntropy with the weight (12), which, unlike (36), has the form

c_{n+1|L} = \left(X_{n|L}^T X_{n|L} + \lambda_{n+1}x_{n+1}x_{n+1}^T - \lambda_{n-L+1}x_{n-L+1}x_{n-L+1}^T\right)^{-1}\left(X_{n|L}^T Y_{n|L} + \lambda_{n+1}x_{n+1}y_{n+1} - \lambda_{n-L+1}x_{n-L+1}y_{n-L+1}\right),

where \lambda_{n+1} and \lambda_{n-L+1} are the correntropy weights assigned to the newest and to the oldest measurement of the window, respectively (the weights of the measurements remaining inside the window are retained in X_{n|L}^T X_{n|L} and X_{n|L}^T Y_{n|L}).

By designating

P_{n+1|L+1}^{-1} = X_{n+1|L+1}^T X_{n+1|L+1}; \quad P_{n|L}^{-1} = X_{n|L}^T X_{n|L}

and taking into account (24), (25), we have

P_{n+1|L}^{-1} = P_{n|L}^{-1} + \lambda_{n+1}x_{n+1}x_{n+1}^T - \lambda_{n-L+1}x_{n-L+1}x_{n-L+1}^T.   (37)

Applying the matrix inversion lemma to (37), we can obtain, as already noted, two forms of computation: in one, the accumulation of information is performed first (the newly arrived signal x_{n+1} is included) and the outdated information (the signal x_{n-L+1}) is discarded afterwards, and vice versa. The calculation of the matrix and the refinement of the estimate during the accumulation of information are carried out, respectively, by the formulas

P_{n+1|L+1} = P_{n|L} - \frac{\lambda_{n+1}P_{n|L}x_{n+1}x_{n+1}^T P_{n|L}}{1 + \lambda_{n+1}x_{n+1}^T P_{n|L}x_{n+1}};   (38)

c_{n+1|L+1} = c_{n|L} + \frac{\lambda_{n+1}P_{n|L}x_{n+1}}{1 + \lambda_{n+1}x_{n+1}^T P_{n|L}x_{n+1}}\left(y_{n+1} - c_{n|L}^T x_{n+1}\right).   (39)

The relations corresponding to the discarding of obsolete information follow from the fact that P_{n+1|L}^{-1} = X_{n+1|L}^T X_{n+1|L} = P_{n+1|L+1}^{-1} - \lambda_{n-L+1}x_{n-L+1}x_{n-L+1}^T and have the form

P_{n+1|L} = P_{n+1|L+1} + \frac{\lambda_{n-L+1}P_{n+1|L+1}x_{n-L+1}x_{n-L+1}^T P_{n+1|L+1}}{1 - \lambda_{n-L+1}x_{n-L+1}^T P_{n+1|L+1}x_{n-L+1}};   (40)

c_{n+1|L} = c_{n+1|L+1} - \frac{\lambda_{n-L+1}P_{n+1|L+1}x_{n-L+1}}{1 - \lambda_{n-L+1}x_{n-L+1}^T P_{n+1|L+1}x_{n-L+1}}\left(y_{n-L+1} - c_{n+1|L+1}^T x_{n-L+1}\right).   (41)

Thus, the recurrent estimation algorithm obtained by adding new information and then excluding obsolete information is described by relations (38)–(41).
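A possible implementation of the robust recursion (38)–(41) is sketched below for the simplified case of uncorrelated noise; it is illustrative only and is not the authors' code. The weight assigned to a measurement when it enters the window is stored and reused when that measurement is discarded:

```python
import numpy as np

# Correntropy-weighted sliding-window recursion (38)-(41): the new measurement
# enters with weight lambda_{n+1}, the obsolete one leaves with the weight it
# was given when it entered the window.
def mcc_add(c, P, x, y, lam):
    Px = P @ x
    P_new = P - lam * np.outer(Px, Px) / (1.0 + lam * x @ Px)   # (38)
    c_new = c + lam * P_new @ x * (y - c @ x)                   # (39)
    return c_new, P_new

def mcc_discard(c, P, x, y, lam):
    Px = P @ x
    P_new = P + lam * np.outer(Px, Px) / (1.0 - lam * x @ Px)   # (40)
    c_new = c - lam * P_new @ x * (y - c @ x)                   # (41)
    return c_new, P_new

# Toy usage: window of L = 30 measurements, 5 % gross outliers in the noise.
rng = np.random.default_rng(3)
c_true, L, sigma = np.array([1.0, -0.5, 0.3]), 30, 1.0
X = rng.normal(size=(300, 3))
noise = 0.05 * rng.normal(size=300) + (rng.random(300) < 0.05) * rng.normal(scale=15.0, size=300)
Y = X @ c_true + noise

buf = []                               # FIFO of (x, y, lambda) inside the window
c, P = np.zeros(3), 1e3 * np.eye(3)
for n in range(300):
    lam = np.exp(-(Y[n] - c @ X[n])**2 / (2 * sigma**2))        # weight (12)
    c, P = mcc_add(c, P, X[n], Y[n], lam)
    buf.append((X[n], Y[n], lam))
    if len(buf) > L:
        x_old, y_old, lam_old = buf.pop(0)
        c, P = mcc_discard(c, P, x_old, y_old, lam_old)
print(c)   # close to c_true despite the impulsive outliers
```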
6. Parameter selection

There are many ways to choose the kernel size. One of the most commonly used methods of choosing an appropriate kernel width in machine learning is cross-validation. Another fairly simple approach is Silverman's rule of thumb [16]

\sigma = 0.9\, A\, N^{-1/5},   (42)

where A is the smaller of the standard deviation of the data sample and the interquartile range of the data scaled by 1.34, and N is the number of data samples.

As can be seen from (10), the cost function (criterion) of correntropy-based algorithms changes depending on the width \sigma, whose size affects the accuracy of the estimate. Since the reference signals change randomly, a time-varying kernel size has to be used. The rule of thumb proposed by Silverman was applied in [17] as follows:

\sigma = \hat{\sigma}\left(\frac{4}{3n}\right)^{1/5},   (43)

\hat{\sigma}^2 = \frac{1}{L-1}\sum_{i=1}^{L}\left(x_i - \bar{x}_L\right)^2,   (44)

where \hat{\sigma}^2 denotes the sample variance of the signal. These relations were used in [18] to recursively update the kernel size on the basis of the sample variance by the formula

\sigma_{n+1}^2 = \beta\sigma_n^2 + (1 - \beta)\hat{\sigma}^2,   (45)

where \beta (0 < \beta < 1) is close to 1, and \hat{\sigma}^2 is the sample variance of the reference signal x_n. Since \sigma_n^2 is proportional to the variance of the control sample, a noisy impulse sample can cause a large \sigma_n^2, which weakens the stability of the algorithm. Therefore, a threshold is set for \sigma_n^2:

\sigma_n^2 \le \sigma_{th}^2,   (46)

where \sigma_{th}^2 is determined on the basis of the real situation.

In [19], an algorithm for adaptive change of the kernel width is proposed, which is based on the analysis of the rule

\sigma^2 = \frac{1}{2}\max_{i=1,2,\ldots,N} e_i^2.   (47)

When a variable \sigma is chosen, the function f_\sigma will also be variable, and therefore the rule for updating the weights can be controlled. In [20], the correction is considered under the assumption that the kernel width depends linearly on the instantaneous error, i.e.

\sigma_{n+1} = k|e_{n+1}|,   (48)

where k is a positive constant.

In [21], it is proposed to use in the estimation algorithm, instead of the Gaussian weight f_{\sigma^2}(e) = \exp(-e^2/(2\sigma^2)), the function

f_{\sigma^2}(e) = \frac{\eta}{1 + \left(\dfrac{e^2}{2\sigma^2 B}\right)^{2m}}.   (49)

This function has all the necessary properties and is based on the Butterworth filter characteristic. Here \eta is a gain (playing the role of \mu in (11)); m and B are the filter order and bandwidth, respectively. The parameters \eta and m can be either fixed or adaptively changed.

Since the bandwidth B is another parameter that significantly affects f_{\sigma^2}, an attempt is made to adapt it at each iteration on the basis of the analysis of the error e_{n+1}: the quantity B determines whether e_{n+1} is an outlier or not. Therefore, in that work the average of all past error samples is chosen as B at time n. Such a choice of B reduces the influence of outlying error samples but leads to a slowdown in the convergence rate of the estimation algorithm.

In [13], to determine the optimal value of the variable \sigma_n, an optimization problem is solved: the derivative of the weight in (11) with respect to \sigma_{n+1} is equated to zero, which gives

e_{n+1}^a\xi_{n+1} + \xi_{n+1}^2 = \frac{e_{n+1}^2}{\mu\|x_{n+1}\|^2}\exp\left(-\frac{e_{n+1}^2}{2\sigma_{n+1}^2}\right),   (50)

and produces the following expression:

\sigma_{n+1}^2 = \frac{e_{n+1}^2}{2\ln\dfrac{e_{n+1}^2}{\mu\|x_{n+1}\|^2\left(e_{n+1}^a\xi_{n+1} + \xi_{n+1}^2\right)}}.   (51)

Here e_{n+1}^a = \tilde{c}_n^T x_{n+1} is the a priori (noise-free) error, \tilde{c}_n = c - c_n, and e_{n+1} = e_{n+1}^a + \xi_{n+1}. Since information about the realization of the noise \xi_{n+1} is usually absent, this formula cannot be used directly. Therefore, for the practical application of the correction rule for \sigma_{n+1}^2, it is proposed in [13] to replace \xi_{n+1}^2 with the noise variance \sigma_\xi^2 and, furthermore, to assume that the a priori error e_{n+1}^a does not depend on the noise \xi_{n+1}, i.e. that M[e_{n+1}^a\xi_{n+1}] = 0. As noted in that paper, the approximation e_{n+1}^a\xi_{n+1} \approx 0 is quite reasonable, since on average this product is zero. Thus, in its final form the correction rule for \sigma_{n+1}^2 is

\sigma_{n+1}^2 = \frac{e_{n+1}^2}{2\ln\dfrac{e_{n+1}^2}{\mu\sigma_\xi^2\|x_{n+1}\|^2}}.   (52)

For a smooth update of \sigma_{n+1}^2 by the moving-average method [22], the following rule is proposed in that work:

\sigma_{n+1}^2 = \begin{cases} \theta\sigma_n^2 + (1-\theta)\min\left(\dfrac{e_{n+1}^2}{2\ln(1/\beta_{n+1})},\ \sigma_n^2\right), & \text{if } 0 < \beta_{n+1} < 1, \\ \sigma_n^2, & \text{otherwise}, \end{cases}   (53)

where \theta is a smoothing coefficient close to one, and

\beta_{n+1} = \frac{\mu\sigma_\xi^2\|x_{n+1}\|^2}{e_{n+1}^2}.   (54)

As can be seen from (53), to ensure a positive squared kernel width \sigma_{n+1}^2 the suggested kernel value is updated only when 0 < \beta_{n+1} < 1. In addition, it can be seen from (53) that the main role in updating \sigma_{n+1}^2 is played by \beta_{n+1}, which, as follows from (54), depends on the values of e_{n+1}^2, \|x_{n+1}\|^2 and \sigma_\xi^2. In the case of noise with time-varying characteristics, the learning strategy described in [23] can be used to estimate the time-varying noise variance, so the approach proposed in [13] is applicable to non-stationary noise as well.

In [24], a modification of the RLMS is proposed, supplemented by an online recursive scheme for adapting the kernel size based on the analysis of the error values over a number of observations:

\sigma_{m,n+1} = \sigma_{m,n} + \Delta\sigma_{m,n+1},   (55)

\Delta\sigma_{m,n+1} = \frac{1}{N_w}\left(|e_n| - |e_{n-N_w+1}|\right).   (56)

Here N_w is the size of the observation window; the mean error over the window is thus estimated rather roughly, using only the samples at the window's edges. In [25], the following correction scheme for \sigma_{n+1}^2 is proposed:

\sigma_{n+1}^2 = \sigma_n^2 + \sigma_{m,n+1}^2 + \underbrace{N_w^{-1}\left(e_n - \sigma_{m,n+1}\right)^2}_{I} - \underbrace{N_w^{-1}\left(e_{n-N_w+1} - \sigma_{m,n+1}\right)^2}_{II}.   (57)

It should be noted that the terms I and II can be considered as a compensation for the estimation of e_n. To reduce the computational load, this expression can be simplified as follows:

\sigma_{n+1}^2 = \sigma_n^2 + \sigma_{m,n+1}^2.   (58)

The analysis of the above approaches to the selection of the parameter \sigma shows that there is no single rule for choosing this parameter; therefore, in the practical implementation of algorithms based on maximizing the correntropy one should be guided by the recommendations discussed above.
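As a small worked example of the simplest of the rules discussed above, the sketch below implements Silverman's rule of thumb (42) and the recursive, threshold-capped variance update (45), (46); the numerical values of \beta and of the cap are arbitrary demo choices, not values from the paper:

```python
import numpy as np

# Two kernel-width choices: Silverman's rule of thumb (42) and the recursive
# variance-tracking update (45) with an upper threshold on sigma_n^2, cf. (46).
def silverman_width(x):
    x = np.asarray(x, dtype=float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    A = min(np.std(x), iqr / 1.34)          # smaller of std and scaled IQR
    return 0.9 * A * len(x) ** (-1.0 / 5)   # (42)

def recursive_width_sq(sigma_sq, sample_var, beta=0.95, sigma_sq_max=25.0):
    """One step of sigma^2_{n+1} = beta*sigma^2_n + (1-beta)*var_hat, then cap it."""
    sigma_sq = beta * sigma_sq + (1.0 - beta) * sample_var   # (45)
    return min(sigma_sq, sigma_sq_max)                       # threshold (46)

rng = np.random.default_rng(4)
x = rng.normal(size=500)
print(silverman_width(x))

sigma_sq = 1.0
for n in range(100):
    window = rng.normal(size=20) + (n == 50) * 30.0          # one impulsive burst
    sigma_sq = recursive_width_sq(sigma_sq, np.var(window))
print(sigma_sq)    # the cap keeps the burst from inflating the kernel width
```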
7. Conclusion

In this work, the main relations describing an adaptive multi-step algorithm for training ADALINA are obtained, which allows its parameters to be adjusted in real time in the presence of outliers and correlated noise. The use of such an algorithm accelerates the learning process by using information not only about the last cycle (as in the traditional Widrow-Hoff learning algorithm), but also about a number of previous cycles. The robustness of the estimates is ensured by the application of the maximum correntropy criterion.

8. Acknowledgements

The European Commission's support for the production of this publication does not constitute an endorsement of the contents, which reflect the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

9. References

[1] B. Widrow, M. Hoff, Adaptive switching circuits, in: IRE WESCON Convention Record, Part 4, Institute of Radio Engineers, New York, 1960, pp. 96–104.
[2] B.D. Liberol, O.G. Rudenko, A.A. Bessonov, Investigation of the convergence of one-step adaptive identification algorithms, Problems of Control and Informatics, 2018, no. 5, pp. 19–32.
[3] O.G. Rudenko, A.A. Bessonov, Regularized algorithm for learning ADALINA in the problem of estimating non-stationary parameters, Control Systems and Machines, 2019, no. 1, pp. 22–30.
[4] P. Huber, Robustness in Statistics, Mir, Moscow, 1984, 304 p.
[5] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions, John Wiley and Sons, New York, 1986, 526 p.
[6] I. Santamaría, P.P. Pokharel, J.C. Principe, Generalized correlation function: Definition, properties, and application to blind equalization, IEEE Transactions on Signal Processing 54 (6) (2006) 2187–2197. doi:10.1109/TSP.2006.872524.
[7] W. Liu, P.P. Pokharel, J.C. Principe, Correntropy: Properties and applications in non-Gaussian signal processing, IEEE Transactions on Signal Processing, 2007, pp. 5286–5298. doi:10.1109/TSP.2007.896065.
[8] W. Wang, J. Zhao, H. Qu, B. Chen, J.C. Principe, An adaptive kernel width update method of correntropy for channel estimation, in: Proc. IEEE International Conference on Digital Signal Processing (DSP), 2015, pp. 916–920. doi:10.1109/ICDSP.2015.7252010.
[9] A. Gunduz, J.C. Principe, Correntropy as a novel measure for nonlinearity tests, Signal Processing 89 (2009) 14–23. doi:10.1016/j.sigpro.2008.07.005.
[10] Y. Guo, B. Ma, Y. Li, Kernel-width adaption diffusion maximum correntropy algorithm, IEEE Access 4 (2016) 1–14. doi:10.1109/ACCESS.2020.2972905. URL: https://doi.org/10.36227/techrxiv.11842281.v1.
[11] L. Lu, H. Zhao, Active impulsive noise control using maximum correntropy with adaptive kernel size, Mechanical Systems and Signal Processing 87, Part A (2017) 180–191. doi:10.1016/j.ymssp.2016.10.020.
[12] Y. Qi, Y. Wang, J. Zhang, J. Zhu, X. Zheng, Robust deep network with maximum correntropy criterion for seizure detection, BioMed Research International 2014 (2014), Article ID 703816, 10 p. doi:10.1155/2014/703816.
[13] L. Shi, H. Zhao, Y. Zakharov, An improved variable kernel width for maximum correntropy criterion algorithm, IEEE Transactions on Circuits and Systems II: Express Briefs (2018). doi:10.1109/TCSII.2018.2880564.
[14] I.I. Perelman, Operational Identification of Control Objects, Energoizdat, Moscow, 1982, 272 p.
[15] O.G. Rudenko, I.D. Terenkovsky, A. Shtefan, G.A. Oda, Modified algorithm of the current regression analysis in identification and forecasting problems, Radioelectronics and Informatics, 1998, no. 4 (05), pp. 58–61.
[16] B.W. Silverman, Density Estimation for Statistics and Data Analysis, CRC Press, New York, NY, USA, 1986, 176 p.
[17] W. Wertz, Statistical Density Estimation: A Survey, Vandenhoeck and Ruprecht, Goettingen, 1978, 108 p.
[18] Z.C. He, H.H. Ye, E. Li, An efficient algorithm for nonlinear active noise control of impulsive noise, Applied Acoustics 148 (2019) 366–374.
[19] Y. Liu, J. Chen, Correntropy-based kernel learning for nonlinear system identification with unknown noise: an industrial case study, in: Proc. of the 10th IFAC Symposium on Dynamics and Control of Process Systems, 2013, pp. 361–366.
[20] J.C. Munoz, J.H. Chen, Removal of the effects of outliers in batch process data through maximum correntropy estimator, Chemometrics and Intelligent Laboratory Systems, 2012, pp. 53–58.
[21] F. Huang, J. Zhang, S. Zhang, Adaptive filtering under a variable kernel width maximum correntropy criterion, IEEE Transactions on Circuits and Systems II: Express Briefs 64 (10) (2017) 1247–1251.
[22] L. Lu, H. Zhao, Active impulsive noise control using maximum correntropy with adaptive kernel size, Mechanical Systems and Signal Processing 87 (2017) 180–191.
[23] M. Bergamasco, F.D. Rossa, L. Piroddi, Active noise control with on-line estimation of non-Gaussian noise characteristics, Journal of Sound and Vibration 331 (1) (2012) 27–40.
[24] M. Belge, E.L. Miller, A sliding window RLS-like adaptive algorithm for filtering alpha-stable noise, IEEE Signal Processing Letters 7 (2000) 86–89.
[25] A.N. Vazquez, J.A. Garcia, Combination of recursive least-norm algorithms for robust adaptive filtering in alpha-stable noise, IEEE Transactions on Signal Processing 60 (3) (2012) 1478–1482.