Robust Training of ADALINA Based on the Criterion of the
Maximum Correntropy in the Presence of Outliers and
Correlated Noise
Oleg Rudenko and Oleksandr Bezsonov
Kharkiv National University of Radio Electronics, Nauky Ave. 14, Kharkiv, 61166, Ukraine


                 Abstract
                 In this paper, the main relations describing an adaptive multi-step algorithm for training
                 ADALINA are derived. Such an algorithm accelerates the learning process by using
                 information not only about the last cycle, but also about a number of previous cycles. The
                 robustness of the estimates is ensured by the application of the maximum correntropy
                 criterion.

                 Keywords
                 ADALINA, optimization, neural network, algorithm, gradient, training, estimation

1. Introduction
    ADALINA (Adaptive Linear Element) was the first linear neural network, proposed by B. Widrow
and M. E. Hoff as an alternative to the perceptron [1]. Subsequently, this element and the algorithm
for its training found wide application in problems of identification, control, filtering, etc. The
Widrow-Hoff learning algorithm is the Kaczmarz algorithm for solving systems of linear algebraic
equations. The properties of this algorithm in the identification problem are described in sufficient
detail in [2]. In [3], the regularized Kaczmarz (Widrow-Hoff) algorithm was used to train ADALINA
in the problem of estimating non-stationary parameters. In the present paper, a multi-step learning
algorithm is considered, namely a recurrent algorithm of the current regression analysis method
(TPA), which accelerates the ADALINA learning process by using information not only about the
last cycle (as in the Widrow-Hoff algorithm) but also about a number of previous cycles.

2. The task of the ADALINA training
    ADALINA is described by the equation
                                                        y n1  c T xn1   n1 ,                                          (1)

where yn 1 – observed output signal; xn1  ( x1,n1 , x2,n1 ,..x N ,n1 )T – vector of the input signals
 N  1 ; c   (c1 , c2 ,..c N )T – is the vector of the required parameters N  1 ;  n 1 – noise; n – discrete
time.
                                                                                                           
  The task of its training is to determine (estimate) the vector of parameters c and is reduced to
minimizing some preselected quality functional (identification criterion)
                                                                                     n
                                                                      F en     ei  ,                                 (2)
                                                                                    i 1



COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: oleh.rudenko@nure.ua (O. Rudenko); oleksandr.bezsonov@nure.ua (O. Bezsonov)
ORCID: 0000-0003-0859-2015 (O. Rudenko); 0000-0001-6104-4275 (O. Bezsonov)
            © 2021 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)
where e_i = y_i - \hat{y}_i; \hat{y}_i = c_{i-1}^T x_i is the output signal of the model; c is the estimate of
the vector c^{*}; \rho(e_i) is some differentiable loss function satisfying the conditions:
        1) \rho(e_i) \ge 0;
        2) \rho(0) = 0;
        3) \rho(e_i) = \rho(-e_i);
        4) \rho(e_i) \ge \rho(e_j)  for  |e_i| \ge |e_j|.
   The identification task is to find an estimate \hat{c} defined either as a solution to the extremal
problem

        F(c) \to \min_{c},                                                                            (3)

or as a solution to the system of equations

        \frac{\partial F(e)}{\partial c_j} = \sum_{i=1}^{n} \varphi(e_i) \frac{\partial e_i}{\partial c_j} = 0,  j = 1, 2, ..., N,          (4)

where \varphi(e_i) = \frac{\partial \rho(e_i)}{\partial e_i} is the influence function.
   If we introduce the weight function \omega(e) = \varphi(e)/e, then the system of equations (4) can be
written as follows:

        \sum_{i=1}^{n} \omega(e_i) e_i \frac{\partial e_i}{\partial c_j} = 0,                         (5)

and minimization of the functional (2) will be equivalent to minimization of the weighted quadratic
functional most often encountered in practice:

        \min \sum_{i=1}^{n} \omega(e_i) e_i^2.                                                        (6)

    When \rho(e_i) = 0.5 e_i^2 is chosen, the influence function is \varphi(e_i) = e_i, i.e. it grows linearly
with increasing e_i, which explains the sensitivity of the least-squares estimate to outliers and to noise
whose distributions have long "tails".
    A robust M-estimate is an estimate \hat{c} defined as a solution to the extremal problem (3) or to the
system of equations (4), but with the loss function \rho(e_i) chosen to be other than quadratic.
    There is a fairly large number of functionals that provide robust M-estimates; the most common,
however, are the combined functionals proposed by Huber [4] and Hampel [5], which consist of a
quadratic part, ensuring optimality of the estimates for a Gaussian distribution, and a modular
(absolute-value) part, which yields estimates that are more robust to distributions with heavy tails.
However, the efficiency of the resulting robust estimates depends substantially on the numerous
parameters used in these criteria, which are selected on the basis of the researcher's experience.
    Recently, in problems of identification, filtering, etc., robust algorithms obtained not by
minimizing (3) but by maximizing the correntropy criterion [6-13] have been gaining popularity.
These algorithms are simple to implement and efficient.
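    As an illustration of how such a combined loss reweights the residuals in (6), below is a minimal
Python sketch of the Huber weight function \omega(e) = \varphi(e)/e; the helper name and the threshold
value k are illustrative choices, not taken from the paper.

```python
import numpy as np

def huber_weight(e, k=1.345):
    """Weight omega(e) = phi(e)/e for the Huber loss: quadratic for |e| <= k,
    modular (linear) for |e| > k, so large residuals get down-weighted."""
    e = np.asarray(e, dtype=float)
    return np.where(np.abs(e) <= k, 1.0, k / np.abs(e))

print(huber_weight([0.5, 2.0, 10.0]))   # approx. [1.0, 0.67, 0.13]
```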

3. Correntropy and algorithms for its maximization
   Correntropy, defined as a localized measure of similarity, has proven to be very effective for
obtaining robust estimates due to the fact that it is less sensitive to outliers [6–13].
   For two random variables X and Y, the correntropy is defined as

        V(X, Y) = M\{k_{\sigma}(X, Y)\},                                                              (7)

where k_{\sigma}(\cdot) is a shift-invariant Mercer kernel; \sigma is the kernel width; M\{\cdot\} denotes
mathematical expectation.
   The function most widely used in calculating the correntropy is the Gaussian kernel, defined by the
formula

        k_{\sigma}(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - y)^2}{2\sigma^2} \right).          (8)

   When calculating the correntropy, it is necessary to know the joint distribution of the random
variables X and Y, which is usually unknown. In practice, only a finite number of samples
\{x_i, y_i\}, i = 1, 2, ..., N, is available. Therefore, the simplest estimate of the correntropy is calculated
as follows:

        \hat{V}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} k_{\sigma}(x_i - y_i).                             (9)
   In problems of identification, filtering, etc., the correntropy between the required output signal d_i
and the output signal of the model y_i is used as the functional. When Gaussian kernels are used, the
optimized functional takes the form

        J_{corr}(n) = \frac{1}{\sqrt{2\pi}\,\sigma} \frac{1}{N} \sum_{i=n-N+1}^{n} \exp\left( -\frac{e_i^2}{2\sigma^2} \right),          (10)

where e_i = d_i - y_i is the identification (filtering) error.
   The gradient algorithm for optimizing (10) with N = 1 looks as follows [6-9]:

        w_{n+1} = w_n + \mu \exp\left( -\frac{e_{n+1}^2}{2\sigma^2} \right) e_{n+1} x_{n+1},          (11)

where \mu is a parameter that affects the convergence rate.
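   To make the role of the exponential factor in (11) concrete, the following minimal NumPy sketch
implements one such gradient step and applies it to synthetic data contaminated by impulsive outliers;
the step size mu, the kernel width sigma and the toy data are illustrative assumptions.

```python
import numpy as np

def mcc_gradient_step(w, x, d, mu=0.05, sigma=1.0):
    """One update of the gradient algorithm (11): the error is weighted by
    exp(-e^2 / (2*sigma^2)), so large (outlier) errors barely move w."""
    e = d - w @ x                            # identification error e_{n+1}
    weight = np.exp(-e**2 / (2 * sigma**2))
    return w + mu * weight * e * x

# Toy usage: identify a 3-parameter ADALINA-type model from noisy data.
rng = np.random.default_rng(0)
c_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for n in range(2000):
    x = rng.normal(size=3)
    d = c_true @ x + 0.01 * rng.normal()
    if rng.random() < 0.05:                  # occasional impulsive outlier
        d += rng.normal(scale=50.0)
    w = mcc_gradient_step(w, x, d)
print(np.round(w, 2))                        # close to c_true despite outliers
```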
   In [12], to suppress impulse noise, a recurrent weighted least-squares method (RWLS) was
proposed, in which each measurement enters with the weight

        \omega_{n+1} = \exp\left( -\frac{e_{n+1}^2}{2\sigma^2} \right)                                (12)

and which has the form

        c_{n+1} = c_n + \frac{\omega_{n+1} P_n x_{n+1}}{\lambda + \omega_{n+1} x_{n+1}^T P_n x_{n+1}} \left( y_{n+1} - c_n^T x_{n+1} \right),          (13)

        P_{n+1} = \frac{1}{\lambda} \left[ P_n - \frac{\omega_{n+1} P_n x_{n+1} x_{n+1}^T P_n}{\lambda + \omega_{n+1} x_{n+1}^T P_n x_{n+1}} \right].          (14)

   Here 0 < \lambda \le 1 is the forgetting factor.
   Thus, when deriving formula (14) for calculating P_{n+1}, the relation

        P_{n+1}^{-1} = \lambda P_n^{-1} + \omega_{n+1} x_{n+1} x_{n+1}^T                              (15)

is used.
   As is known, the introduction of the parameter \lambda into the algorithm is advisable when
identifying non-stationary parameters.
   Another approach to estimating non-stationary parameters is to use a limited number of
measurements in the RLS, which leads to the algorithm of the current regression analysis method [14].
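   A compact sketch of one step of the recursion (12)-(14) is shown below; the forgetting factor lam
and the kernel width sigma are illustrative values, and P is assumed to be initialized, e.g., as a large
multiple of the identity matrix.

```python
import numpy as np

def rwls_step(c, P, x, y, lam=0.99, sigma=1.0):
    """One step of the correntropy-weighted recursive LS (12)-(14)."""
    e = y - c @ x                                   # a priori error
    w = np.exp(-e**2 / (2 * sigma**2))              # weight (12)
    denom = lam + w * x @ P @ x
    c_new = c + (w * P @ x / denom) * e             # estimate update (13)
    P_new = (P - np.outer(w * P @ x, x @ P) / denom) / lam   # matrix update (14)
    return c_new, P_new
```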

4. Recurrent TPA algorithm with correlated noise
    Consider the problem of training the ADALINA described by equation (1), which in matrix form
(after the information obtained at the (n+1)-th iteration) is written as

        Y_{n+1} = X_{n+1} c^{*} + \Xi_{n+1},                                                          (16)

where Y_{n+1} = (y_1, y_2, ..., y_{n+1})^T is the vector of output signals;
    X_{n+1} = (x_1, x_2, ..., x_{n+1})^T is the matrix of input signals;
    c^{*} = (c_1^{*}, c_2^{*}, ..., c_N^{*})^T is the vector of estimated parameters;
    \Xi_{n+1} = (\xi_1, \xi_2, ..., \xi_{n+1})^T is the vector of noise.
    The covariance matrix of the noise \Xi_{n+1} (of order n+1) has the following form:

        D_{n+1} = M\{\Xi_{n+1} \Xi_{n+1}^T\} =
        \begin{pmatrix}
            d_{1,1}   & d_{1,2}   & \cdots & d_{1,n}   & d_{1,n+1}   \\
            d_{2,1}   & d_{2,2}   & \cdots & d_{2,n}   & d_{2,n+1}   \\
            \cdots    & \cdots    & \cdots & \cdots    & \cdots      \\
            d_{n+1,1} & d_{n+1,2} & \cdots & d_{n+1,n} & d_{n+1,n+1}
        \end{pmatrix}
        = \begin{pmatrix} D_n & d_n \\ d_n^T & d_{n+1,n+1} \end{pmatrix},

where d_{ij} = M\{\xi_i \xi_j\}; d_n^T = (d_{n+1,1}, d_{n+1,2}, ..., d_{n+1,n}) = M\{\xi_{n+1} \Xi_n^T\}.
    As is known, applying the least-squares estimate

        c_{n+1} = \left( X_{n+1}^T X_{n+1} \right)^{-1} X_{n+1}^T Y_{n+1}

to a model with correlated noise gives estimates whose variances are underestimated.
    The Gauss-Markov estimate, obtained by minimizing the corresponding weighted quadratic
functional, has the form

        c_{n+1} = \left( X_{n+1}^T D_{n+1}^{-1} X_{n+1} \right)^{-1} X_{n+1}^T D_{n+1}^{-1} Y_{n+1}.          (17)

    The current regression analysis algorithm, which has the form

        c_{n+1|L} = \left( X_{n+1|L}^T X_{n+1|L} \right)^{-1} X_{n+1|L}^T Y_{n+1|L},                  (18)

where

        Y_{n+1|L} = \begin{pmatrix} Y_{n|L-1} \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} y_{n-L+2} \\ Y_{n+1|L-1} \end{pmatrix}   is the (L x 1) vector,        (19)

        X_{n+1|L} = \begin{pmatrix} X_{n|L-1} \\ x_{n+1}^T \end{pmatrix} = \begin{pmatrix} x_{n-L+2}^T \\ X_{n+1|L-1} \end{pmatrix}   is the (L x N) matrix,      (20)

was proposed in [14]. In [15], a modification of this algorithm using a mechanism of forgetting past
information (smoothing) is considered. Here L = const (L \ge N) is the algorithm's memory.
    By analogy with the Gauss-Markov estimate (17), the following estimate can be obtained:

        c_{n+1|L} = \left( X_{n+1|L}^T D_{n+1|L}^{-1} X_{n+1|L} \right)^{-1} X_{n+1|L}^T D_{n+1|L}^{-1} Y_{n+1|L},          (21)

where D_{n+1|L} = M\{\Xi_{n+1|L} \Xi_{n+1|L}^T\} is the covariance matrix of the noise within the
current window, with elements d_{ij} = M\{\xi_i \xi_j\}.
    When the window is extended by one measurement, the corresponding covariance matrix has the
block representation

        D_{n+1|L+1} = \begin{pmatrix} D_{n|L} & d_n \\ d_n^T & d_{n+1,n+1} \end{pmatrix},
        d_n^T = \left( d_{n+1,n-L+1}, d_{n+1,n-L+2}, ..., d_{n+1,n} \right) = M\{\xi_{n+1} \Xi_{n|L}^T\},

so its inverse can be written as

        D_{n+1|L+1}^{-1} = \begin{pmatrix}
            D_{n|L}^{-1} + \dfrac{D_{n|L}^{-1} d_n d_n^T D_{n|L}^{-1}}{\alpha_{n+1}} & -\dfrac{D_{n|L}^{-1} d_n}{\alpha_{n+1}} \\
            -\dfrac{d_n^T D_{n|L}^{-1}}{\alpha_{n+1}} & \dfrac{1}{\alpha_{n+1}}
        \end{pmatrix},

where \alpha_{n+1} = d_{n+1,n+1} - d_n^T D_{n|L}^{-1} d_n.
    Assume that at the n-th cycle the estimate c_{n|L} satisfying

        \left( X_{n|L}^T D_{n|L}^{-1} X_{n|L} \right) c_{n|L} = X_{n|L}^T D_{n|L}^{-1} Y_{n|L}        (22)

has been obtained.
    The arrival of new information (the addition of a new measurement) leads to the calculation of an
estimate which, by analogy with (17), can be written as follows:

        c_{n+1|L+1} = \left( X_{n+1|L+1}^T D_{n+1|L+1}^{-1} X_{n+1|L+1} \right)^{-1} X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1},          (23)

where

        Y_{n+1|L+1} = \begin{pmatrix} Y_{n|L} \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} y_{n-L+1} \\ Y_{n+1|L} \end{pmatrix}   is the ((L+1) x 1) vector;       (24)

        X_{n+1|L+1} = \begin{pmatrix} X_{n|L} \\ x_{n+1}^T \end{pmatrix} = \begin{pmatrix} x_{n-L+1}^T \\ X_{n+1|L} \end{pmatrix}   is the ((L+1) x N) matrix.     (25)

    Let us introduce the notation

        P_{n+1|L+1}^{-1} = X_{n+1|L+1}^T D_{n+1|L+1}^{-1} X_{n+1|L+1};
        P_{n|L}^{-1} = X_{n|L}^T D_{n|L}^{-1} X_{n|L};
        P_{n+1|L}^{-1} = X_{n+1|L}^T D_{n+1|L}^{-1} X_{n+1|L}

and calculate P_{n+1|L+1}^{-1}. Using the block representation of D_{n+1|L+1}^{-1} given above, we
obtain

        P_{n+1|L+1}^{-1} = X_{n|L}^T D_{n|L}^{-1} X_{n|L}
                           + \frac{X_{n|L}^T D_{n|L}^{-1} d_n d_n^T D_{n|L}^{-1} X_{n|L}}{\alpha_{n+1}}
                           - \frac{x_{n+1} d_n^T D_{n|L}^{-1} X_{n|L}}{\alpha_{n+1}}
                           - \frac{X_{n|L}^T D_{n|L}^{-1} d_n x_{n+1}^T}{\alpha_{n+1}}
                           + \frac{x_{n+1} x_{n+1}^T}{\alpha_{n+1}}
                         = P_{n|L}^{-1} + \tilde{x}_{n+1} \tilde{x}_{n+1}^T,

where

        \tilde{x}_{n+1} = \frac{x_{n+1} - X_{n|L}^T D_{n|L}^{-1} d_n}{\sqrt{\alpha_{n+1}}}.

    Similarly,

        X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1} = X_{n|L}^T D_{n|L}^{-1} Y_{n|L} + \tilde{x}_{n+1} \tilde{y}_{n+1},

where

        \tilde{y}_{n+1} = \frac{y_{n+1} - Y_{n|L}^T D_{n|L}^{-1} d_n}{\sqrt{\alpha_{n+1}}}.

    Adding \tilde{x}_{n+1} \tilde{x}_{n+1}^T c_{n|L} to both sides of (22) gives

        P_{n+1|L+1}^{-1} c_{n|L} = X_{n|L}^T D_{n|L}^{-1} Y_{n|L} + \tilde{x}_{n+1} \tilde{x}_{n+1}^T c_{n|L},

and subtracting this relation from the normal equations of (23) (taking into account the expressions for
P_{n+1|L+1}^{-1} and X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1}), we obtain

        P_{n+1|L+1}^{-1} \left( c_{n+1|L+1} - c_{n|L} \right) = \tilde{x}_{n+1} \left( \tilde{y}_{n+1} - c_{n|L}^T \tilde{x}_{n+1} \right),

or

        c_{n+1|L+1} = c_{n|L} + P_{n+1|L+1} \tilde{x}_{n+1} \left( \tilde{y}_{n+1} - c_{n|L}^T \tilde{x}_{n+1} \right),

where, by the matrix inversion lemma,

        P_{n+1|L+1} = P_{n|L} - \frac{P_{n|L} \tilde{x}_{n+1} \tilde{x}_{n+1}^T P_{n|L}}{1 + \tilde{x}_{n+1}^T P_{n|L} \tilde{x}_{n+1}}.
    When the outdated information received at the (n-L+1)-th step is discarded, we pass from the
estimate c_{n+1|L+1} to the estimate c_{n+1|L}. To obtain the corresponding correction rules, we
proceed as follows.
    We use the block representation of the covariance matrix D_{n+1|L+1} in which the discarded
measurement is separated out:

        D_{n+1|L+1} = \begin{pmatrix} d_{n-L+1,n-L+1} & d_{n-L+1}^T \\ d_{n-L+1} & D_{n+1|L} \end{pmatrix},

where

        d_{n-L+1}^T = \left( d_{n-L+1,n-L+2}, d_{n-L+1,n-L+3}, ..., d_{n-L+1,n+1} \right) = M\{\xi_{n-L+1} \Xi_{n+1|L}^T\},

and the corresponding representation of the inverse matrix D_{n+1|L+1}^{-1}:

        D_{n+1|L+1}^{-1} = \begin{pmatrix}
            \dfrac{1}{\alpha_{n-L+1}} & -\dfrac{d_{n-L+1}^T D_{n+1|L}^{-1}}{\alpha_{n-L+1}} \\
            -\dfrac{D_{n+1|L}^{-1} d_{n-L+1}}{\alpha_{n-L+1}} & D_{n+1|L}^{-1} + \dfrac{D_{n+1|L}^{-1} d_{n-L+1} d_{n-L+1}^T D_{n+1|L}^{-1}}{\alpha_{n-L+1}}
        \end{pmatrix},

where \alpha_{n-L+1} = d_{n-L+1,n-L+1} - d_{n-L+1}^T D_{n+1|L}^{-1} d_{n-L+1}.
    In this case

        P_{n+1|L+1}^{-1} = X_{n+1|L+1}^T D_{n+1|L+1}^{-1} X_{n+1|L+1} = P_{n+1|L}^{-1} + \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T,

where

        \tilde{x}_{n-L+1} = \frac{x_{n-L+1} - X_{n+1|L}^T D_{n+1|L}^{-1} d_{n-L+1}}{\sqrt{\alpha_{n-L+1}}}.

    Similarly,

        X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1} = X_{n+1|L}^T D_{n+1|L}^{-1} Y_{n+1|L} + \tilde{x}_{n-L+1} \tilde{y}_{n-L+1},

where

        \tilde{y}_{n-L+1} = \frac{y_{n-L+1} - Y_{n+1|L}^T D_{n+1|L}^{-1} d_{n-L+1}}{\sqrt{\alpha_{n-L+1}}}.

    Subtracting \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T c_{n+1|L+1} from both sides of the normal
equations of (23) gives

        \left( P_{n+1|L+1}^{-1} - \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T \right) c_{n+1|L+1} = X_{n+1|L+1}^T D_{n+1|L+1}^{-1} Y_{n+1|L+1} - \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T c_{n+1|L+1}.

    Taking into account that

        \left( X_{n+1|L}^T D_{n+1|L}^{-1} X_{n+1|L} \right) c_{n+1|L} = X_{n+1|L}^T D_{n+1|L}^{-1} Y_{n+1|L},          (26)

and subtracting (26) from this relation (with the expressions for P_{n+1|L}^{-1} and
X_{n+1|L}^T D_{n+1|L}^{-1} Y_{n+1|L} taken into account), we obtain

        P_{n+1|L}^{-1} \left( c_{n+1|L+1} - c_{n+1|L} \right) = \tilde{x}_{n-L+1} \left( \tilde{y}_{n-L+1} - c_{n+1|L+1}^T \tilde{x}_{n-L+1} \right),

from where

        c_{n+1|L} = c_{n+1|L+1} - P_{n+1|L} \tilde{x}_{n-L+1} \left( \tilde{y}_{n-L+1} - c_{n+1|L+1}^T \tilde{x}_{n-L+1} \right).

    Since

        P_{n+1|L}^{-1} = P_{n+1|L+1}^{-1} - \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T,

the matrix inversion lemma gives

        P_{n+1|L} = P_{n+1|L+1} + \frac{P_{n+1|L+1} \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T P_{n+1|L+1}}{1 - \tilde{x}_{n-L+1}^T P_{n+1|L+1} \tilde{x}_{n-L+1}}.
    Thus, the algorithm has the form (the first two relations describe the inclusion of the newly arrived
information, and the next two describe the discarding of the outdated information):

        c_{n+1|L+1} = c_{n|L} + P_{n+1|L+1} \tilde{x}_{n+1} \left( \tilde{y}_{n+1} - c_{n|L}^T \tilde{x}_{n+1} \right);          (27)

        P_{n+1|L+1} = P_{n|L} - \frac{P_{n|L} \tilde{x}_{n+1} \tilde{x}_{n+1}^T P_{n|L}}{1 + \tilde{x}_{n+1}^T P_{n|L} \tilde{x}_{n+1}};          (28)

        c_{n+1|L} = c_{n+1|L+1} - P_{n+1|L} \tilde{x}_{n-L+1} \left( \tilde{y}_{n-L+1} - c_{n+1|L+1}^T \tilde{x}_{n-L+1} \right);          (29)

        P_{n+1|L} = P_{n+1|L+1} + \frac{P_{n+1|L+1} \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T P_{n+1|L+1}}{1 - \tilde{x}_{n-L+1}^T P_{n+1|L+1} \tilde{x}_{n-L+1}}.          (30)

    If the outdated information is discarded first and the newly received information is included
afterwards, the algorithm takes the form

        c_{n|L-1} = c_{n|L} - P_{n|L-1} \tilde{x}_{n-L+1} \left( \tilde{y}_{n-L+1} - c_{n|L}^T \tilde{x}_{n-L+1} \right);          (31)

        P_{n|L-1} = P_{n|L} + \frac{P_{n|L} \tilde{x}_{n-L+1} \tilde{x}_{n-L+1}^T P_{n|L}}{1 - \tilde{x}_{n-L+1}^T P_{n|L} \tilde{x}_{n-L+1}};          (32)

        c_{n+1|L} = c_{n|L-1} + P_{n+1|L} \tilde{x}_{n+1} \left( \tilde{y}_{n+1} - c_{n|L-1}^T \tilde{x}_{n+1} \right);          (33)

        P_{n+1|L} = P_{n|L-1} - \frac{P_{n|L-1} \tilde{x}_{n+1} \tilde{x}_{n+1}^T P_{n|L-1}}{1 + \tilde{x}_{n+1}^T P_{n|L-1} \tilde{x}_{n+1}},          (34)

where

        \tilde{x}_{n-L+1} = \frac{x_{n-L+1} - X_{n+1|L}^T D_{n+1|L}^{-1} d_{n-L+1}}{\sqrt{\alpha_{n-L+1}}};
        \tilde{x}_{n+1} = \frac{x_{n+1} - X_{n|L}^T D_{n|L}^{-1} d_n}{\sqrt{\alpha_{n+1}}};
        \tilde{y}_{n-L+1} = \frac{y_{n-L+1} - Y_{n+1|L}^T D_{n+1|L}^{-1} d_{n-L+1}}{\sqrt{\alpha_{n-L+1}}};          (35)
        \tilde{y}_{n+1} = \frac{y_{n+1} - Y_{n|L}^T D_{n|L}^{-1} d_n}{\sqrt{\alpha_{n+1}}}.

5. Recurrent TPA algorithm in the presence of outliers and correlated noise
   As noted above, the current regression analysis algorithm, which has the form (18), allows two
forms of presenting the estimates, depending on the order in which the information about the newly
received measurement and the oldest one is used.
   Let us dwell on this in more detail.
   Obtaining new information (adding a new measurement) leads to the calculation of an estimate
which can be written in the form (23). Since at each cycle the estimate is constructed with L = const,
consider the case when the new measurement is added first and the obsolete one is excluded
afterwards.
   The recurrent form of the estimate (23) can be obtained by standard methods using the block
representation of the vectors and matrices (24), (25), which allows the windowed estimate to be
rewritten as follows:

        c_{n+1|L} = \left( X_{n|L}^T X_{n|L} + x_{n+1} x_{n+1}^T - x_{n-L+1} x_{n-L+1}^T \right)^{-1}
                    \left( X_{n|L}^T Y_{n|L} + x_{n+1} y_{n+1} - x_{n-L+1} y_{n-L+1} \right).          (36)
   Let us consider a modification of the current regression analysis algorithm intended to maximize
the correntropy with the weights (12), which, unlike (36), has the form

        c_{n+1|L} = \left( X_{n|L}^T X_{n|L} + \omega_{n+1} x_{n+1} x_{n+1}^T - \omega_{n-L+1} x_{n-L+1} x_{n-L+1}^T \right)^{-1}
                    \left( X_{n|L}^T Y_{n|L} + \omega_{n+1} x_{n+1} y_{n+1} - \omega_{n-L+1} x_{n-L+1} y_{n-L+1} \right).

   Denoting by P_{n|L} and P_{n+1|L} the inverses of the bracketed matrices at the n-th and (n+1)-th
steps, respectively, and taking into account (24), (25), we have

        P_{n+1|L}^{-1} = P_{n|L}^{-1} + \omega_{n+1} x_{n+1} x_{n+1}^T - \omega_{n-L+1} x_{n-L+1} x_{n-L+1}^T.          (37)
   Applying the matrix inversion lemma to (37), one can obtain, as already noted, two forms of
computation: in one, the accumulation of information (the newly arrived signal x_{n+1}) is performed
first and the outdated information (the signal x_{n-L+1}) is discarded afterwards, and vice versa. Thus,
the calculation of the matrix and the refinement of the estimate when accumulating information are
performed, respectively, according to the formulas

        P_{n+1|L+1} = P_{n|L} - \frac{\omega_{n+1} P_{n|L} x_{n+1} x_{n+1}^T P_{n|L}}{1 + \omega_{n+1} x_{n+1}^T P_{n|L} x_{n+1}};          (38)

        c_{n+1|L+1} = c_{n|L} + \frac{\omega_{n+1} P_{n|L} x_{n+1}}{1 + \omega_{n+1} x_{n+1}^T P_{n|L} x_{n+1}} \left( y_{n+1} - c_{n|L}^T x_{n+1} \right).          (39)

   The relations corresponding to the discarding of the obsolete information follow from the fact that

        P_{n+1|L}^{-1} = P_{n+1|L+1}^{-1} - \omega_{n-L+1} x_{n-L+1} x_{n-L+1}^T

and look as follows:

        P_{n+1|L} = P_{n+1|L+1} + \frac{\omega_{n-L+1} P_{n+1|L+1} x_{n-L+1} x_{n-L+1}^T P_{n+1|L+1}}{1 - \omega_{n-L+1} x_{n-L+1}^T P_{n+1|L+1} x_{n-L+1}};          (40)

        c_{n+1|L} = c_{n+1|L+1} - \frac{\omega_{n-L+1} P_{n+1|L+1} x_{n-L+1}}{1 - \omega_{n-L+1} x_{n-L+1}^T P_{n+1|L+1} x_{n-L+1}} \left( y_{n-L+1} - c_{n+1|L+1}^T x_{n-L+1} \right).          (41)

   Thus, the recurrent estimation algorithm obtained by first adding the new information and then
excluding the obsolete information is described by relations (38)-(41).
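   A minimal sketch of one cycle of (38)-(41) is given below; the kernel width sigma is an illustrative
constant and, as a simplification, both weights (12) are recomputed from the current estimate.

```python
import numpy as np

def mcc_sliding_rls_cycle(c, P, x_new, y_new, x_old, y_old, sigma=1.0):
    """One cycle of the correntropy-weighted sliding-window RLS (38)-(41):
    the new sample is included first, then the oldest one is discarded."""
    # correntropy-induced weights (12) for the incoming and outgoing samples
    w_new = np.exp(-(y_new - c @ x_new) ** 2 / (2 * sigma ** 2))
    w_old = np.exp(-(y_old - c @ x_old) ** 2 / (2 * sigma ** 2))

    # (38)-(39): accumulate the new measurement
    Px = P @ x_new
    denom = 1.0 + w_new * x_new @ Px
    c = c + (w_new * Px / denom) * (y_new - c @ x_new)
    P = P - w_new * np.outer(Px, Px) / denom

    # (40)-(41): discard the obsolete measurement
    Px = P @ x_old
    denom = 1.0 - w_old * x_old @ Px
    c = c - (w_old * Px / denom) * (y_old - c @ x_old)
    P = P + w_old * np.outer(Px, Px) / denom
    return c, P
```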
6. Parameter \sigma selection
   There are many ways to choose the optimal kernel size. One of the most commonly used methods
of choosing an appropriate kernel width in machine learning is cross-validation. Another fairly simple
approach is Silverman's rule of thumb [16]:

        \sigma = 0.9 A N^{-1/5},                                                                      (42)

where A is the smaller of the standard deviation of the data sample and the interquartile range of the
data scaled by 1.34, and N is the number of data samples.
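   For reference, a small NumPy helper implementing (42); the function name is ours.

```python
import numpy as np

def silverman_bandwidth(x):
    """Kernel width by Silverman's rule of thumb (42)."""
    x = np.asarray(x, dtype=float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))   # interquartile range
    a = min(x.std(ddof=1), iqr / 1.34)
    return 0.9 * a * len(x) ** (-1 / 5)

print(silverman_bandwidth(np.random.default_rng(1).normal(size=500)))
```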
   As can be seen from (10), the cost function (criterion) of correntropy-based algorithms changes
depending on the width \sigma, whose value affects the accuracy of the estimate. Since the reference
signals change randomly, a time-varying kernel size has to be applied.
   The rule of thumb proposed by Silverman was applied in [17] as follows:

        \sigma = \left( \frac{4}{3n} \right)^{1/5} \hat{\sigma},                                      (43)

        \hat{\sigma}^2 = \frac{1}{L-1} \left( \sum_{i=1}^{L} x_i^2 - L \bar{x}^2 \right),             (44)

where \hat{\sigma}^2 denotes the sample variance of the signal.
   These relations were used in [18] to recursively update the kernel size on the basis of the sample
variance using the formula

        \sigma_{n+1}^2 = \beta \sigma_n^2 + (1 - \beta) \hat{\sigma}^2,                               (45)

where \beta (0 < \beta < 1) is close to 1, and \hat{\sigma}^2 is the sample variance of the reference
signal x_n. Since \sigma_n^2 is proportional to the variance of the reference sample, a noisy impulse
sample can cause a large \sigma_n^2, which weakens the stability of the algorithm. Therefore, in that
work the threshold \varepsilon is set for \sigma_n^2:

        \sigma_n^2 \le \varepsilon,                                                                   (46)

where \varepsilon is determined based on the real situation.
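   A minimal sketch of the update (45) with the threshold (46) applied as a cap; beta and eps are
illustrative values.

```python
import numpy as np

def update_kernel_var(sigma2, x_win, beta=0.95, eps=10.0):
    """Recursive kernel-size update (45) with the threshold (46)."""
    sigma2_new = beta * sigma2 + (1 - beta) * np.var(x_win, ddof=1)
    return min(sigma2_new, eps)          # keep sigma^2 below the threshold
```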
   In [19], an algorithm for adaptive adjustment of the kernel width is proposed, based on the analysis
of the following rule:

        \frac{\max |e_i|}{2\sigma^2}, \quad i = 1, 2, ..., N.                                         (47)

   When a variable \sigma is chosen, the function f(\sigma^2) also becomes variable, and therefore the
rule for updating the weights can be controlled.
   In [20], the case of correcting \sigma is considered under the assumption that the kernel width
depends linearly on the instantaneous error, i.e.

        \sigma_{n+1} = k |e_{n+1}|,                                                                   (48)

where k is a positive constant.
                                                                                       
   In [21], it is proposed to use the following function f(\sigma^2) in the estimation algorithm:

        f(\sigma^2) = \frac{\kappa \left( 1 - \exp(-\sigma^2) \right)}{1 + \left( \sigma^2 / B \right)^{2m}}.          (49)

   This function has all the necessary properties and is based on the Butterworth filter. In (49),
\kappa is a gain, while m and B are the filter order and bandwidth, respectively. The parameters
\kappa and m can be either fixed or adaptively changed. Since the bandwidth B is another parameter
that significantly affects f(\sigma^2), an attempt is made to adapt it at each iteration based on the
analysis of the error e_{n+1}. The quantity B determines whether \xi_{n+1} is an outlier or not.
Therefore, in that work the average of all past error samples is taken as B at time n. Such a choice of B
reduces the influence of outliers among the error samples but slows down the convergence of the
estimation algorithm.
   In [13], to determine the optimal value of the variable \sigma_n, an optimization problem is solved.
Setting the derivative of (11) with respect to \sigma_n equal to zero gives

        \frac{e_{n+1}^2 - \sigma_{n+1}^2 - e_{n+1}^a \xi_{n+1}}{\mu \|x_{n+1}\|^2 e_{n+1}^2} = \exp\left( -\frac{e_{n+1}^2}{2\sigma_{n+1}^2} \right),          (50)

which produces the following expression:

        \sigma_{n+1}^2 = -\frac{e_{n+1}^2}{2 \ln \dfrac{e_{n+1}^2 - \sigma_{n+1}^2 - e_{n+1}^a \xi_{n+1}}{\mu \|x_{n+1}\|^2 e_{n+1}^2}}.          (51)

   Here e_{n+1}^a = \tilde{c}_n^T x_{n+1} is the a priori error (\tilde{c}_n = c^* - c_n);
e_{n+1} = e_{n+1}^a + \xi_{n+1}.
   Since information about the realization of the noise \xi_{n+1} is usually unavailable, this formula
cannot be used directly. Therefore, for the practical application of the correction rule for
\sigma_{n+1}^2, it is proposed in that paper to replace \sigma_{n+1}^2 in the right-hand side of (51) by
the noise variance \sigma_{\xi}^2; furthermore, it is assumed that the a priori error e_{n+1}^a does not
depend on the noise \xi_{n+1}, i.e. M\{e_{n+1}^a \xi_{n+1}\} = 0. As noted in that paper, the
approximation e_{n+1}^a \xi_{n+1} \approx 0 is quite reasonable, since on average this product is zero.
Thus, in the final form, the correction rule for \sigma_{n+1}^2 is as follows:

        \sigma_{n+1}^2 = -\frac{e_{n+1}^2}{2 \ln \dfrac{e_{n+1}^2 - \sigma_{\xi}^2}{\mu \|x_{n+1}\|^2 e_{n+1}^2}}.          (52)
    For a smooth update of \sigma_{n+1}^2, the following rule based on the moving-average method [22]
is proposed in that work:

        \sigma_{n+1}^2 = \begin{cases} \lambda \sigma_n^2 + (1 - \lambda) \min\left\{ -\dfrac{e_{n+1}^2}{2 \ln \gamma_{n+1}}, \; \sigma_n^2 \right\}, & \text{if } 0 < \gamma_{n+1} < 1, \\ \sigma_n^2, & \text{otherwise}, \end{cases}          (53)

where \lambda is a smoothing coefficient close to one, and

        \gamma_{n+1} = \frac{e_{n+1}^2 - \sigma_{\xi}^2}{\mu \|x_{n+1}\|^2 e_{n+1}^2}.                (54)

   As can be seen from (53), to provide a positive squared kernel width \sigma_{n+1}^2, the kernel
value is updated only when 0 < \gamma_{n+1} < 1. In addition, it can be seen from (53) that the major
role in the update of \sigma_{n+1}^2 is played by \gamma_{n+1}, which, as follows from (54), depends
on the values of e_{n+1}^2, \|x_{n+1}\|^2 and \sigma_{\xi}^2. In the case of noise with time-varying
characteristics, the learning strategy described in [23] can be used to estimate the time-varying noise
variance. Thus, the approach proposed in that paper is applicable to non-stationary noise as well.
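   A literal transcription of the rules (53)-(54) as reconstructed above; mu, sigma_xi2 and lam are
illustrative values.

```python
import numpy as np

def update_sigma2(sigma2, e, x, mu=0.05, sigma_xi2=0.01, lam=0.95):
    """Kernel-width correction (53)-(54) as reconstructed above."""
    if e == 0.0:
        return sigma2
    gamma = (e**2 - sigma_xi2) / (mu * float(np.dot(x, x)) * e**2)
    if 0.0 < gamma < 1.0:
        candidate = -e**2 / (2.0 * np.log(gamma))   # positive, since ln(gamma) < 0
        return lam * sigma2 + (1.0 - lam) * min(candidate, sigma2)
    return sigma2
```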
    In [24], a modification of the RLS is proposed, supplemented by an online recursive scheme for
adapting the kernel size based on the analysis of the error values over a window of observations:

        \sigma_{m,n+1} = \sigma_{m,n} + \Delta \sigma_{m,n+1},                                        (55)

where

        \Delta \sigma_{m,n+1} = \frac{1}{N_w} \left( |e_n| - |e_{n-N_w+1}| \right).                   (56)

   Here N_w is the size of the observation window. In that paper, e_n is estimated rather roughly,
using only the samples at the edges of the window.
   In [25], the following correction scheme is proposed for \sigma_{n+1}^2:

        \sigma_{n+1}^2 = \sigma_n^2 + \Delta\sigma_{m,n+1}^2 + \underbrace{\frac{1}{N_w} \left( |e_n| - \sigma_{m,n+1} \right)^2}_{I} - \underbrace{\frac{1}{N_w} \left( |e_{n-N_w+1}| - \sigma_{m,n+1} \right)^2}_{II}.          (57)

   It should be noted that the terms I and II can be considered as compensation for the estimation of
e_n. To reduce the computational load, this expression can be simplified as follows:

        \sigma_{n+1}^2 = \sigma_n^2 + \Delta\sigma_{m,n+1}^2.                                          (58)
   Analysis of the above approaches to selecting the parameter \sigma shows that there is no single rule
for choosing this parameter; therefore, in the practical implementation of algorithms based on
maximizing correntropy, one should be guided by the recommendations discussed above.

7. Conclusion
    In this work, the main relations describing an adaptive multi-step algorithm for training ADALINA
are obtained, which allows its parameters to be adjusted in real time in the presence of outliers and
correlated noise. The use of such an algorithm accelerates the learning process by using information
not only about the last cycle (as in the traditional Widrow-Hoff learning algorithm) but also about a
number of previous cycles. The robustness of the estimates is ensured by the application of the
maximum correntropy criterion.

8. Acknowledgements
   The European Commission's support for the production of this publication does not constitute an
endorsement of the contents, which reflect the views only of the authors, and the Commission cannot
be held responsible for any use which may be made of the information contained therein.

9. References

[1] B. Widrow, M. Hoff, Adaptive switching circuits, IRE WESCON Convention Record. Part 4.
    New York: Institute of Radio Engineers, 1960, p. 96–104.
[2] B.D. Liberol, O.G. Rudenko, A.A. Bessonov, Investigation of the convergence of one-step
    adaptive identification algorithms, Problems of Control and Informatics, 2018, no. 5, pp. 19–32.
[3] O.G. Rudenko, A.A. Bessonov, Regularized algorithm for learning ADALINA in the problem of
    estimating non-stationary parameters, Control Systems and Machines, 2019, no. 1, pp. 22–30.
[4] P. Huber, Robustness in statistics. – M.: Mir, 1984, 304 p.
[5] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics. The Approach
    Based on Influence Functions. – NY: John Wiley and Sons, 1986, 526 p.
[6] I. Santamaría, P.P. Pokharel, J.C. Principe, Generalized Correlation Function: Definition,
    Properties, and Application to Blind Equalization, IEEE Trans. on Signal Processing, vol. 54,
    no. 6, 2006, pp. 2187–2197. DOI: 10.1109/TSP.2006.872524
[7] W. Liu, P.P. Pokharel, J.C. Principe, Correntropy: Properties and Applications in Non-Gaussian
    Signal Processing, IEEE Trans. on Signal Processing, 2007, pp. 5286–5298. DOI:
    10.1109/TSP.2007.896065
[8] W. Wang, J. Zhao, H. Qu, B. Chen, J.C. Principe, An adaptive kernel width update method of
     correntropy for channel estimation, IEEE International Conference on Digital Signal Processing
     (DSP), 2015, pp. 916–920. DOI:10.1109 / ICDSP.2015.7252010
[9] A. Gunduz, J.C. Principe, Correntropy as a novel measure for nonlinearity tests, Signal
     Processing, 2009, vol. 89, pp. 14–23. URL: https://doi.org/10.1016/j.sigpro.2008.07.005
[10] Y. Guo, B. Ma, Y. Li, Kernel-Width Adaption Diffusion Maximum Correntropy Algorithm,
     IEEE Access, 2016, vol. 4, pp. 1–14. DOI: 10.1109/ACCESS.2020.2972905. URL:
     https://doi.org/10.36227/techrxiv.11842281.v1
[11] L. Lu, H. Zhao, Active impulsive noise control using maximum correntropy withadaptive kernel
     size, Mechanical Systems and Signal Processing, 2017, v. 87, Part A., pp. 180–191. URL:
     https://doi.org/10.1016/j.ymssp.2016.10.020
[12] Y. Qi, Y. Wang, J. Zhang, J. Zhu, X. Zheng, Robust Deep Network with Maximum Correntropy
     Criterion for Seizure Detection, BioMed Research International. Volume 2014, Article ID
     703816, 10 p. URL:http://dx.doi.org/10.1155/2014/703816
[13] L. Shi, H. Zhao, Y. Zakharov, An Improved Variable Kernel Width for Maximum Correntropy
     Criterion Algorithm, IEEE Trans. on Circuits and Systems II: Express Briefs, 2018, 5p. DOI:
     10.1109 / TCSII.2018.2880564
[14] I.I. Perelman, Operational identification of control objects, M .: Energoizdat, 1982, 272 p.
[15] O.G. Rudenko, I.D. Terenkovsky, A. Shtefan, G.A. Oda, Modified algorithm of the current
     regression analysis in identification and forecasting problems, Radioelectronics and Informatics,
     1998, No. 4 (05), pp. 58–61.
[16] B.W. Silverman, Density Estimation for Statistics and Data Analysis, vol. 3: CRC Press: New
     York, NY, USA, 1986, 176 p.
[17] W. Wertz, Statistical Density estimation: A survey, Goettingen: Vandenhoeck and Ruprecht,
     1978, 108 p.
[18] Z.C. Hea, H.H. Yea, E. Lib, An efficient algorithm for Nonlinear Active Noise Control of
     Impulsive Noise, Applied Acoustics, 2019, Vol. 148, pp. 366–374.
[19] Y. Liu, J. Chen, Correntropy-based kernel learning for nonlinear system identification with
     unknown noise: an industrial case study, Proc. of the 10th IFAC Symposium on Dynamics and
     Control of Process Systems, 2013, pp. 361–366.
[20] J.C. Munoz, J.H. Chen, Removal of the effects of outliers in batch process data through
     maximum correntropy estimator, Chemom. Intell. Lab. Syst., 2012, pp. 53–58.
[21] F. Huang, J. Zhang, S. Zhang, Adaptive filtering under a variable kernel width maximum
     correntropy criterion, IEEE Transactions on Circuits and Systems II: Express Briefs, 2017,
     vol. 64, no. 10, pp. 1247–1251.
[22] L. Lu, H. Zhao, Active impulsive noise control using maximum correntropy with adaptive kernel
     size, Mechanical Systems and Signal Processing, 2017, vol. 87, pp. 180–191.
[23] M. Bergamasco, F.D. Rossa, L. Piroddi, Active noise control with on-line estimation of non-
     Gaussian noise characteristics, J. Sound Vib., 2012, 331 (1), pp. 27–40.
[24] M. Belge, E.L. Miller, A sliding window RLS-like adaptive algorithm for filtering alpha-stable
     noise, IEEE Signal Process. Lett., 2000, vol. 7., pp. 86–89.
[25] A.N. Vazquez, J.A. Garcia, Combination of recursive least-norm algorithms for robust adaptive
     filtering in alpha-stable noise, IEEE Trans. Signal Process, 2012, vol. 60 (3), pp. 1478–1482.