Chebyshev methods for optimization of large systems on stiff target functionals with weakly filled structured Hessian matrices

Igor G. Chernorutsky, Peter the Great St. Petersburg Polytechnic University, 29 Polytechnicheskaya str., St. Petersburg, 195251, igcher1946@mail.ru
Nikita V. Voinov, Peter the Great St. Petersburg Polytechnic University, 29 Polytechnicheskaya str., St. Petersburg, 195251, voinov@ics2.ecd.spbstu.ru
Anna V. Rubtsova, Peter the Great St. Petersburg Polytechnic University, 29 Polytechnicheskaya str., St. Petersburg, 195251, annarub2011@yandex.ru
Lina P. Kotlyarova, Peter the Great St. Petersburg Polytechnic University, 29 Polytechnicheskaya str., St. Petersburg, 195251, lina1305@yandex.ru
Olga V. Aleksandrova, Peter the Great St. Petersburg Polytechnic University, 29 Polytechnicheskaya str., St. Petersburg, 195251, o.v.alexandrov@yandex.ru

Abstract: We analyze matrix gradient methods for optimizing large systems of arbitrary nature with respect to target functionals whose Hessian matrices have a special character. It is assumed that the Hessian matrix is sparse and structured (its nonzero elements occupy fixed positions). The latter condition is typical for the optimization of large systems consisting of interconnected subsystems of lower dimension. We argue that the use of standard Newton-type methods is difficult because of the possible sign indefiniteness of the Hessian matrices (non-convexity) and the need to store full-size high-order matrices. In the procedures under consideration, memory savings are achieved by storing only sparse matrices in packed form.

Keywords: gradient methods, relaxation functions, non-convex problems, stiff functionals.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

To solve the unconstrained minimization problem

  J(x) → min, x ∈ R^n, J ∈ C²(R^n),

a class of matrix gradient methods is considered:

  x^{k+1} = x^k − H_k(G_k, h_k) ∇J(x^k),  h_k ∈ R¹,   (1)

where G_k = ∇²J(x^k) and H_k is a matrix function of G_k.

This class includes as particular cases such classical procedures as gradient descent, the Levenberg-Marquardt methods, Newton's method, and methods with exponential relaxation (ER-methods). We present an approach based on the concept of the relaxation function [1] for constructing and analyzing non-traditional gradient methods with the Chebyshev relaxation function. The main assumptions used in the construction of this class of methods are:

- high dimension of the argument vector x (n >> 1) and the undesirability of storing a full-size (n×n) matrix in the computer's memory;
- a high stiffness degree [1, 2] of the functional J(x) in a wide range of variation of the argument;
- convexity of the minimized functional J(x) is guaranteed only in a neighborhood of the minimum point.

If the stiffness degree of the optimality criterion J(x) is sufficiently high, then standard computational tools are not very effective for the reasons stated in [1, 2]. Newton-type methods, as is known [3-5], are not intended for solving non-convex problems; moreover, they lose their effectiveness under high stiffness. The use of algorithms from the class of well-known Levenberg-Marquardt methods for a functional J(x) with a high stiffness degree is complicated by the algorithmic selection of the corresponding regularization parameter. Most often in this situation it is recommended to use various non-matrix forms of the conjugate gradient (CG) method. However, as will be shown below, the class of matrix gradient schemes (1) contains algorithms that are more efficient for the problems under consideration than the CG methods.
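As an illustration of the general scheme (1), the following minimal Python sketch (our own illustration, not code from the paper) shows how particular methods of the family are obtained by choosing the matrix function H; the helper names and the step size h are assumptions made for the example.

```python
import numpy as np

def matrix_gradient_step(x, grad_J, hess_J, make_H):
    """One step of scheme (1): x_{k+1} = x_k - H(G_k) * grad J(x_k),
    where the choice of make_H selects a particular method of the family."""
    g = grad_J(x)                      # gradient at the current point
    G = hess_J(x)                      # Hessian G_k = grad^2 J(x_k)
    return x - make_H(G) @ g

# Two classical members of the family, written as choices of H (illustrative):
def H_gradient_descent(G, h=1e-2):     # H = h*E  -> ordinary gradient descent
    return h * np.eye(G.shape[0])

def H_newton(G):                       # H = G^{-1} -> Newton's method (requires G > 0)
    return np.linalg.inv(G)
```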
2 Multiparameter Optimization Tasks

We define large systems as systems described by models with a large number of controlled parameters. An optimized system may be represented as a set of smaller interconnected subsystems (Fig. 1).

Figure 1 – The complex of interrelated subsystems (y_j are the output characteristics of the subsystems)

The requirements for the output parameters of the system (specifications) are given in the form of a system of inequalities:

  y_j(x_j, x_q) ≤ t_j,  j ∈ [1 : q−1];  y_q(x_q) ≤ t_q,   (2)

where x_j is the n_j-dimensional local vector of controlled parameters of the j-th subsystem, and x_q is the n_q-dimensional vector of controlled parameters affecting all q output parameters and interconnecting the individual subsystems of the optimized system. The dimension of the full vector of controlled parameters x = [x_1, x_2, …, x_q] is equal to

  n = \sum_{i=1}^{q} n_i.   (3)

Using the known technique of reducing the solution of systems of inequalities to optimization problems, we obtain a target functional of the form

  J(x) = \sum_{j=1}^{q} U_j(x_j, x_q) → min,  x ∈ R^n.   (4)

Indeed, the system of inequalities y_i(x) ≤ t_i is equivalent to the system

  z_i(x) = t_i − y_i(x) ≥ 0.

The corresponding optimization problem is

  J_1(x) = min_i z_i(x) → max_x.

This maximin problem can be reduced to an equivalent formulation:

  J_2(x) = max_i exp(−α z_i(x)) → min_x.

For a sufficiently large α the last problem is equivalent to

  J_3(x) = \sum_{i=1}^{m} exp(−α z_i(x)) → min_x.

This is the main target functional used for solving systems of inequalities. It is non-convex and has the form (4). Functionals of the form (4) also arise in other formulations of optimization problems for real systems, so problem (4) has a rather general character.

Further, we will consider methods for solving problem (4) under the following additional assumptions:

- solving the analysis task for the optimized system requires significant computational costs; therefore, in the optimization process it is required to minimize the number of calls to the calculation of the values J(x);
- the fill coefficient γ of the matrix G(x) = ∇²J(x) is small; usually we can assume γ ~ 1/q.

It is easy to establish that the structure of the matrix G(x) does not depend on the point x:

  G(x) = \begin{pmatrix}
    G_{11} & 0      & \cdots & 0      & G_{1q} \\
    0      & G_{22} & \cdots & 0      & G_{2q} \\
    0      & 0      & G_{33} & \cdots & G_{3q} \\
    \vdots &        &        & \ddots & \vdots \\
    G_{q1} & G_{q2} & G_{q3} & \cdots & G_{qq}
  \end{pmatrix}.

The submatrices G_{ij} have dimensions n_i × n_j, and the total number of nonzero elements is

  \sum_{i=1}^{q} n_i^2 + 2 n_q \sum_{i=1}^{q-1} n_i.

Thus, taking into account the symmetry of the matrix G(x), it is necessary to store in the computer's memory

  \sum_{i=1}^{q} \frac{n_i^2 + n_i}{2} + n_q \sum_{i=1}^{q-1} n_i

nonzero elements. The necessary information about storage schemes for sparse matrices is widely presented in the literature [6].
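The following short sketch (our illustration; the block sizes and the helper name arrow_pattern are assumptions) builds the bordered block-diagonal sparsity pattern of G(x) with scipy.sparse and checks the nonzero count against the formula given above.

```python
import numpy as np
import scipy.sparse as sp

def arrow_pattern(block_sizes):
    """Sparsity pattern of the Hessian G(x) for (4): diagonal blocks G_jj plus a
    border of blocks G_jq, G_qj coupling every subsystem to the shared vector x_q.

    block_sizes = [n_1, ..., n_q]; only the positions of nonzeros matter here,
    so the blocks are simply filled with ones."""
    q = len(block_sizes)
    n = sum(block_sizes)
    nq = block_sizes[-1]
    offsets = np.cumsum([0] + block_sizes)          # start index of each block
    G = sp.lil_matrix((n, n))
    for j in range(q):
        a, b = offsets[j], offsets[j + 1]
        G[a:b, a:b] = 1.0                           # diagonal block G_jj
        G[a:b, n - nq:] = 1.0                       # border block G_jq
        G[n - nq:, a:b] = 1.0                       # border block G_qj
    return G.tocsr()

sizes = [3, 4, 2, 3]                                # n_1..n_q, with x_q of size 3
G = arrow_pattern(sizes)
# the nonzero count agrees with sum(n_i^2) + 2*n_q*sum_{i<q}(n_i) from the text
print(G.nnz, sum(s * s for s in sizes) + 2 * sizes[-1] * sum(sizes[:-1]))
```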
3 Methods with Chebyshev relaxation functions

Let the eigenvalues λ_i(G_k) ∈ [−m, M], M ≥ m ≥ 0. According to the above assumptions and the requirements for relaxation functions formulated in [1], the most rational method should have a relaxation function R(λ) whose values sharply decrease from R = 1 at λ = 0 and remain small throughout the entire range [0, M]. Conversely, for λ < 0 the function R(λ) should increase rapidly. In addition, the matrix function H corresponding to R(λ) must be constructed without matrix-by-matrix multiplications in order to preserve the sparsity of the matrix G_k = ∇²J(x^k).

We show that, up to a multiplier, the shifted Chebyshev polynomials of the second kind P_s(λ) can be used as such R(λ); they satisfy the known recurrence relations

  P_1(λ) = 1,  P_2(λ) = 2(1 − 2λ);  P_{s+1}(λ) = 2(1 − 2λ) P_s(λ) − P_{s−1}(λ).   (5)

The graph of the dependence P_s(λ)/s for several values of s is shown in Fig. 2.

Figure 2 – Chebyshev relaxation function

Taking R(λ) = P_L(λ)/L with a sufficiently large value of L, we obtain an arbitrarily fast relaxation of any addend in the representation of the approximating paraboloid [1]

  f(x^{k+1}) = \frac{1}{2} \sum_{i=1}^{n} β_{i,k}^2 λ_i R^2(λ_i),   (6)

where

  x^k = \sum_{i=1}^{n} β_{i,k} u_i

is the decomposition of the current vector in the eigenvectors u_i of the matrix G_k. This statement follows from the known fact of uniform convergence of the sequence {P_s(λ)/s} to zero as s → ∞ on the open interval (0, 1). Further, we assume that the eigenvalues of the matrix G_k are normalized to the interval (0, 1). For this it suffices to consider the matrix G/||G|| instead of the matrix G, and the vector g/||G|| instead of the gradient vector g.

The dependence H(λ) corresponding to the adopted R(λ) has the form

  H(λ) = [1 − R(λ)]/λ = [1 − P_L(λ)/L]/λ.   (7)

Designing methods (1) directly with the function (7) is possible, but it requires solving at each step k large linear systems of equations with a sparse matrix. Below it is shown that there are more effective implementation techniques. It follows from (7) that H(λ) is a polynomial of degree L−2, while R(λ) has degree L−1. Therefore, to implement the matrix gradient method with the indicated function H(λ), generally speaking, there is no need to solve linear systems. The method takes the form

  x^{k+1} = x^k − (α_1 E + α_2 G_k + … + α_{L−1} G_k^{L−2}) g^k = x^k − H(G_k) g^k.   (8)

The implementation of method (8) can be based on techniques for calculating the coefficients α_i for various degrees L. In this case, the number L should be chosen from the condition of the fastest decrease of J(x). An alternative, more economical approach based on other considerations is discussed below.

For the function

  H_s(λ) = α_1 + α_2 λ + … + α_{s−1} λ^{s−2},  s = 2, 3, …

from (5) we can obtain the recurrence relation

  (s + 1) H_{s+1} = 2s(1 − 2λ) H_s − (s − 1) H_{s−1} + 4s;  H_1 = 0, H_2 = 2,  s ∈ [2 : L−1].   (9)

Consequently, we have

  x^{k+1}[s+1] = x^k − H_{s+1} g^k = x^k − (2s/(s+1))(E − 2G_k) H_s g^k + ((s−1)/(s+1)) H_{s−1} g^k − (4s/(s+1)) g^k,  s ∈ [2 : L−1],

or, denoting Δ_s = H_s g^k,

  Δ_{s+1} = (2s/(s+1))(E − 2G_k) Δ_s − ((s−1)/(s+1)) Δ_{s−1} + (4s/(s+1)) g^k;  Δ_1 = 0, Δ_2 = 2 g^k,  s ∈ [2 : L−1],   (10)

where x^{k+1}[s] = x^k − Δ_s is the s-th approximation to the vector x^{k+1} = x^{k+1}[L].

Thus, with a fixed quadratic approximation f(x) of the functional J(x) in the neighborhood of x = x^k, we can move from P_s to P_{s+1} at the cost of one multiplication of the matrix E − 2G_k by a vector, fully using the sparsity of the matrix G_k and without additional gradient calculations. The efficiency of algorithm (9) for large values of the stiffness coefficient [1, 2] is determined by the relaxation factors for the small eigenvalues of the matrix G_k.
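A minimal sketch of recurrence (10), assuming the eigenvalues of G_k have already been normalized to (0, 1); this is our illustration rather than the authors' code, and it relies only on sparse matrix-vector products.

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_direction(G, g, L):
    """Builds Delta_L = H_L(G) g by recurrence (10), so that x_{k+1} = x_k - Delta_L.
    G is the sparse symmetric Hessian with spectrum normalized to (0, 1),
    g the normalized gradient.  Each step costs one product (E - 2G) @ vector."""
    n = g.shape[0]
    A = sp.identity(n, format="csr") - 2.0 * G      # E - 2 G_k, keeps the sparsity of G
    d_prev = np.zeros(n)                            # Delta_1 = 0
    d = 2.0 * g                                     # Delta_2 = 2 g
    for s in range(2, L):
        d_next = ((2.0 * s / (s + 1)) * (A @ d)
                  - ((s - 1.0) / (s + 1)) * d_prev
                  + (4.0 * s / (s + 1)) * g)
        d_prev, d = d, d_next
    return d                                        # Delta_L
```

With L ≈ 1.3√η (see the RELCH algorithm below) the returned vector Δ_L serves as the advance vector of one descent cycle.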
Consider the positive part of the spectrum (λ ≥ 0), which is especially important in the neighbourhood of the optimum, where the matrix G(x) is positive definite. The main advantage of the method with R_s(λ) = P_s(λ)/s is that already at small s there is a noticeable suppression of the addends in (6) over a wide range of λ values. Below are the values R_s* of the internal maximum of |R_s(λ)| and the boundaries of the ranges λ′_s ≤ λ ≤ λ″_s in which |R_s(λ)| ≤ R_s*:

  s          3      4      5      6      7      8
  R_s*       0.333  0.272  0.250  0.239  0.233  0.230
  λ′_s       0.147  0.092  0.061  0.044  0.033  0.025
  λ″_s       0.853  0.908  0.939  0.956  0.967  0.975
  −R′_s(0)   5.30   10.0   16.0   23.3   32.0   42.0

In the left part of the spectrum (λ ≤ 0) we have R_s(λ) ≥ 1 + R′_s(0)·λ ≥ 1; therefore, the values of the derivatives R′_s(0) in the last row of the table characterize the relaxation multipliers for the negative terms in (6). The derivatives R′_s(0) can be calculated on the basis of the following recurrence relations:

  P′_1(0) = 0,  P′_2(0) = −4;  P′_{s+1}(0) = 2 P′_s(0) − 4s − P′_{s−1}(0);  R′_L(0) = P′_L(0)/L.

The values λ′_s, λ″_s for s ≥ 8 (λ ≥ 0) can be calculated using the asymptotic formulas

  λ′_s ≈ 1.63/s²,  λ″_s ≈ 1 − λ′_s,   (11)

with R_s* ≈ 0.22. Relation (11) is obtained from the following representation of the Chebyshev polynomials:

  P_L(λ)/L = \frac{\sin Lθ}{L \sin θ},  λ = \sin^2(θ/2),  θ ∈ [0, π],  λ ∈ [0, 1].

Indeed, for sufficiently small λ we have

  P_L(λ)/L ≈ \frac{\sin\sqrt{ξ}}{\sqrt{ξ}},  ξ = 4L²λ.

Setting x = √ξ, we get F(ξ) = Φ(x) = sin x / x. We have |F(ξ)| ≤ |F(ξ_cr)| for ξ ≥ ξ_cr, where

  ξ_cr = x_cr² ≈ 6.523,  |F(ξ_cr)| = |Φ(x_cr)| ≈ 0.22.

Thus, setting ξ_cr = 4L²λ_cr, we obtain the following statement: if for the smallest positive eigenvalue λ_m the inequality

  ξ_min = 4L²λ_m ≥ ξ_cr ≈ 6.523

holds, that is, if

  λ_m ≥ 6.523/(4L²) ≈ 1.63/L²,   (12)

then for all λ ≥ λ_m we have |R_L(λ)| ≤ 0.22. Relation (11) follows from (12).

The enlarged scheme of the algorithm based on relation (10) can be implemented as the following sequence of steps. It is assumed that all task variables are properly normalized. We also suppose that the variables are numbered in some optimal way ensuring efficient storage of the sparse matrix E − 2G_k in the computer's memory.

4 RELCH Algorithm

Step 1. Set the starting point x; calculate J := J(x); set L, which determines the number of recalculations by formula (10) (the prior choice of L is discussed below).
Step 2. Calculate g := ∇J(x), G := ∇²J(x); set g := g/||G||, G := G/||G||; τ := 1.
Step 3. Using formula (10), build Δ_L; put x_t := x − τΔ_L.
Step 4. Calculate J_t := J(x_t). If J_t ≥ J, go to Step 5; otherwise go to Step 6.
Step 5. Put τ := τ/2, x_t := x − τΔ_L and go to Step 4.
Step 6. Put x := x_t, J := J_t and go to Step 2.

The stopping criterion is not specified here. As a rule, the calculations end when the specified number of evaluations of the functional has been exhausted or when the algorithm is explicitly stopped. The number L of recalculations by formula (10) is a parameter set by the user. According to (11), it is initially advisable to take

  L = \sqrt{1.63 η} ≈ 1.3 \sqrt{η},

where η is an estimate of the stiffness (ravine) degree of the minimized functional. With this choice of L, the relaxation multipliers in the positive part of the spectrum are guaranteed to be less than 0.23. When designing algorithmic procedures for selecting L, it is necessary to take into account that the sequence {J_s}, where J_s = J(x^{k+1}[s]), does not, in general, decrease monotonically with growing s. In Step 5 of the algorithm, regulation of the norm of the advance vector is applied in order to prevent the local quadratic model of the functional from leaving its domain of validity.
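A compact sketch of Steps 1-6 (our illustration, reusing the chebyshev_direction routine sketched above); the crude upper bound used for ||G||, the fixed cycle budget and the lower bound on τ are assumptions, not part of the published algorithm.

```python
import numpy as np

def relch(x, J, grad_J, hess_J, eta, n_cycles=10):
    """Sketch of the RELCH algorithm.  J, grad_J, hess_J evaluate the functional,
    its gradient and its sparse (CSR) Hessian; eta estimates the stiffness degree."""
    L = max(2, int(round(1.3 * np.sqrt(eta))))   # prior choice of L (Section 4)
    Jx = J(x)
    for _ in range(n_cycles):                    # termination: fixed cycle budget (assumption)
        g, G = grad_J(x), hess_J(x)              # Step 2
        norm_G = abs(G).max() * G.shape[0]       # rough upper bound for ||G|| (assumption)
        g, G = g / norm_G, G / norm_G            # normalize the spectrum into (0, 1)
        delta = chebyshev_direction(G, g, L)     # Step 3: Delta_L by formula (10)
        tau = 1.0
        while True:
            x_t = x - tau * delta
            J_t = J(x_t)
            if J_t < Jx:                         # Step 4: accept as soon as J decreases
                break
            tau *= 0.5                           # Step 5: halve the step multiplier
            if tau < 1e-12:                      # safeguard not in the paper (assumption)
                return x
        x, Jx = x_t, J_t                         # Step 6
    return x
```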
5 Convergence Characteristics

We give an estimate of the efficiency of method (10) in comparison with the conjugate gradient (CG) methods. For problems of large dimension (when the number of iterations is smaller than the dimension), the convergence of CG methods can be guaranteed only at the rate of a geometric progression, even for strongly convex quadratic functionals.

We consider the case f(x) = ½⟨Gx, x⟩, G > 0, and estimate the rate of convergence of the CG method to the minimum point x* = 0. The iterate x^k obtained by the CG method can be represented as

  x^k = (E + c_1 G + c_2 G² + … + c_k G^k) x^0 = P_k(G) x^0,

where P_k(G) is a matrix polynomial of degree k. Moreover, it follows from the properties of the CG method that at each iteration the coefficients c_1, …, c_k of the polynomial P_k(G) take values that minimize f(x^k), which differs only by a multiplier from the error function. In other words, the k-th approximation minimizes f(x) among the vectors x^0 + V, where V is an element of the subspace spanned by the vectors Gx^0, G²x^0, …, G^k x^0.

Suppose

  x^0 = \sum_{i=1}^{n} β_{i,0} u_i,

where {u_i} is an orthonormal basis of eigenvectors of the matrix G. Then

  x^k = P_k(G) \sum_{i=1}^{n} β_{i,0} u_i = \sum_{i=1}^{n} β_{i,0} P_k(λ_i) u_i,  P_k(0) = 1,

  f(x^k) = ½⟨Gx^k, x^k⟩ = ½ \sum_{i=1}^{n} β_{i,0}^2 P_k^2(λ_i) λ_i.   (13)

Hence we have

  ||x^0||² = \sum_{i=1}^{n} β_{i,0}^2,

  ||x^k||² = \sum_{i=1}^{n} β_{i,0}^2 P_k^2(λ_i) ≤ max_i P_k^2(λ_i) · ||x^0||².   (14)

As the polynomial P_k(λ) we choose the polynomial closest to optimal: the one least deviating from zero on the interval [m, M] containing all the eigenvalues of the positive definite matrix G and normalized so that P_k(0) = 1. By the linear change of variables

  λ = \frac{M + m}{2} − \frac{M − m}{2} t

the task is reduced to constructing a polynomial least deviating from zero on the interval t ∈ [−1, 1] and taking the value 1 at the point t_0 = (M + m)/(M − m) (corresponding to λ = 0). The solution of the last problem is given by the polynomial

  \bar{T}_k(t) = \frac{T_k(t)}{T_k(t_0)},

where T_k(t) = cos(k arccos t) is the Chebyshev polynomial. At the same time

  max_{−1 ≤ t ≤ 1} |\bar{T}_k(t)| = \frac{1}{T_k(t_0)} max_{−1 ≤ t ≤ 1} |T_k(t)|.

Obviously max_{−1 ≤ t ≤ 1} |T_k(t)| = 1, and therefore

  L_k = max_λ |P_k(λ)| = max_t |\bar{T}_k(t)| = \frac{1}{T_k(t_0)},  λ ∈ [m, M], t ∈ [−1, 1].

Since the representation

  T_k(t) = 0.5 [ (t + \sqrt{t^2 − 1})^k + (t − \sqrt{t^2 − 1})^k ]

is valid, we obtain

  L_k = 2 [ ( (\sqrt{M} + \sqrt{m}) / (\sqrt{M} − \sqrt{m}) )^k + ( (\sqrt{M} − \sqrt{m}) / (\sqrt{M} + \sqrt{m}) )^k ]^{−1} ≤ 2 ( (\sqrt{M} − \sqrt{m}) / (\sqrt{M} + \sqrt{m}) )^k.

For sufficiently large k (k ≥ k_0) we have

  L_k ≤ 2 ( (\sqrt{M} − \sqrt{m}) / (\sqrt{M} + \sqrt{m}) )^k = 2 ( (\sqrt{η} − 1) / (\sqrt{η} + 1) )^k ≈ 2 (1 − 2/\sqrt{η})^k,  η = M/m.   (15)

From (13) and (14) we get ||x^k|| ≤ L_k ||x^0||, or

  ||x^k|| ≤ 2 q^k ||x^0||,  k ≥ k_0,   (16)

where q = 1 − 2/√η. Thus, the convergence of the CG method at the rate of a geometric progression is proved. The exact value of L_k, valid for any k, is

  L_k = 2 [ (1 + 2/(\sqrt{η} − 1))^k + (1 − 2/(\sqrt{η} + 1))^k ]^{−1},  L_0 = 1.

From (16) it follows that for η >> 1 the convergence can be very slow.
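To illustrate how pessimistic the bound (15)-(16) becomes for stiff problems, here is a small sketch (our illustration; the chosen values of η and k are arbitrary) that evaluates the exact factor L_k and the bound 2q^k.

```python
import numpy as np

def cg_error_factors(eta, k):
    """Exact Chebyshev factor L_k and its bound 2*q^k from (15)-(16), eta = M/m."""
    r = np.sqrt(eta)
    a = (r + 1.0) / (r - 1.0)                  # = 1 + 2/(sqrt(eta) - 1)
    L_exact = 2.0 / (a**k + a**(-k))
    q = 1.0 - 2.0 / r
    return L_exact, 2.0 * q**k

# e.g. for eta = 10**4 the guaranteed factor after k = 100 iterations is still ~0.27
for k in (10, 50, 100, 200):
    print(k, cg_error_factors(1e4, k))
```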
The "finiteness" of the CG method — the exact solution of the problem of minimizing a quadratic function (paraboloid) in n steps, where n is the dimension of the search space — manifests itself only after a sufficiently large number of iterations. The degree of the polynomial P_k(λ) in (13) then becomes equal to n, and the optimal choice of this polynomial reduces to placing its n roots at the points λ_1, λ_2, …, λ_n, which leads to an exact solution of the task (f(x^n) = 0). It is easy to see that for a quadratic function of general form f(x) = ½⟨Gx, x⟩ − ⟨b, x⟩ + c the estimate (16) turns into

  ||x^k − x*|| ≤ 2 q^k ||x^0 − x*||,   (17)

where x* is the optimal point, which in the general case does not coincide with the origin of coordinates.

An important feature of RELCH-type algorithms is that the corresponding relaxation multipliers are determined only by the number of iterations L and the stiffness degree η of the problem, regardless of the dimension n. At the same time, in the schemes of CG methods, on the order of n iterations are required to complete each descent cycle; otherwise, according to (17), the rate of convergence may be very low. In addition, each iteration of the CG method, even in the quadratic case, requires a new gradient calculation, that is, additional computational costs for evaluating the target functional.

We will further assume that the RELCH algorithm is implemented with a constant L = 1.3√η, so that the relaxation multipliers in the region λ > 0 do not exceed 0.23. We consider the problem of minimizing a quadratic functional f(x) = ½⟨Gx, x⟩ with a positive definite matrix G and estimate the number of evaluations of f(x) required to reach a control vector x̄ with norm ||x̄|| ≤ 0.23 by the CG method and by the RELCH algorithm from a starting point x^0 with ||x^0|| = 1. When the point x̄ is reached, the whole situation repeats, so the comparative efficiency estimates obtained below are of a rather general character. We assume that two-sided finite-difference relations are used to calculate the derivatives, which in the following analysis gives additional advantages to the CG method.

To reach the vector x̄, the RELCH algorithm needs to calculate at the point x^0 a weakly filled Hessian matrix and the gradient vector ∇f(x^0). With fill coefficient γ, this requires about 2γn² evaluations of f. Then L = 1.3√η recalculations by formula (10) are performed, which do not require additional evaluations of the objective functional f. To obtain the vector x̄, the CG method will need N iterations, where N is determined from the condition ||x^N|| ≈ 2q^N ≤ 0.23, that is, N ≈ −2.2/ln q. To perform each iteration it is necessary to update the gradient vector, which, in the general case of two-sided finite-difference approximations of the derivatives, requires 2n evaluations of f(x). The total number of evaluations of f is −4.4n/ln q.

The relative benefit in the number of f evaluations of the RELCH method compared to the CG method is given by the function

  ψ(η) = −2.2/(γ n ln q).

Obviously, as η → ∞ we have q(η) → 1 and ψ(η) → ∞. Typical values of ψ for γ = 0.01 and n = 1000 are given below:

  η      100    1000   1500   10^4   10^5
  ψ(η)   1.0    3.4    4.0    11.0   35.0

Thus, to obtain comparable results at η = 10^4, the RELCH algorithm requires approximately 11 times fewer evaluations of f than the CG method. However, it should be taken into account that as η increases, the number L of recalculations by formula (10) grows; this can lead to an increase in the influence of computational errors in the calculation of Δ_s for large s.
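A two-line check of the benefit function ψ(η) (our illustration; it approximately reproduces the table above, up to rounding of the constant 2.2):

```python
import numpy as np

def psi(eta, gamma, n):
    """Relative benefit of RELCH over CG in f-evaluations:
    psi(eta) = -2.2 / (gamma * n * ln q), q = 1 - 2/sqrt(eta)."""
    q = 1.0 - 2.0 / np.sqrt(eta)
    return -2.2 / (gamma * n * np.log(q))

# approximately reproduces the table for gamma = 0.01, n = 1000
for eta in (100, 1000, 1500, 1e4, 1e5):
    print(eta, round(psi(eta, 0.01, 1000), 1))
```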
Example. We consider the model task of minimizing a quadratic functional f(x) with n = 200, η = 1500, γ = 0.025. For definiteness, we assume that a single evaluation of f(x) is equivalent to performing 10²·n floating-point multiplications, and that the execution time of a single multiplication on some device is t_y = 3·10⁻⁵ s. One evaluation of f(x) then takes t_f = 0.6 s of processor time. To calculate ∇f and ∇²f by the general finite-difference formulas we need, respectively,

  t_∇ = 2n t_f ≈ 4 min,  t_Γ = 2γn² t_f ≈ 20 min.

The number of recalculations by formula (10) is L = 1.3√η ≈ 50. At each recalculation a weakly filled matrix E − 2G_k is multiplied by a vector Δ_s, which requires γn² t_y ≈ 3·10⁻² s of computer time. The time for constructing the vector Δ_50, without taking into account the calculation of ∇f and ∇²f, is about 50·3·10⁻² = 1.5 s and may be neglected.

As a result, constructing the control vector x̄ with ||x̄|| ≤ 0.23 by the RELCH method takes about t_Γ ≈ 20 min of machine time. The CG method will cost, respectively, ψ(1500)·20 min ≈ 1.3 hours. When the RELCH algorithm is applied again from the constructed vector x̄, we obtain a vector whose norm does not exceed 0.23||x̄||, and so on. Therefore, if the corresponding sequence of vectors is denoted by {x̄_m}, its norm decreases according to the law of a geometric progression, ||x̄_m|| ≤ d^m ||x^0||, where d ≤ 0.23 regardless of the values of η and n.

An important additional advantage of the RELCH algorithm in comparison with the CG method is its rather high efficiency in the non-convex case, since the relaxation function of the method in the left half-plane lies entirely in the admissible region and the relaxation multipliers for λ ≤ 0 grow rapidly in absolute value when passing from R_s to R_{s+1}. The growth characteristics were given above.

6 Conclusion

The described class of matrix gradient methods has shown in practice a sufficiently high performance under high stiffness and non-convexity of the objective functionals. When optimizing large systems, it is possible to use efficient "packed" storage forms for the matrices of second derivatives, which significantly reduces the requirements on computer memory. At the same time, the method retains its main characteristics for systems of small dimension, competing with the main optimization procedures of nonlinear programming.

Acknowledgements

The work was financially supported by the Ministry of Education and Science of the Russian Federation in the framework of the Federal Targeted Program for Research and Development in Priority Areas of Advancement of the Russian Scientific and Technological Complex for 2014-2020 (No. 14.584.21.0022, ID RFMEFI58417X0022).

References

[1] I. Chernorutskiy, P. Drobintsev, V. Kotlyarov and N. Voinov. A New Approach to Generation and Analysis of Gradient Methods Based on Relaxation Function. Proceedings of the 2017 UKSim-AMSS 19th International Conference on Modelling and Simulation (UKSim 2017), pp. 83-88, May 2018.
[2] I. G. Chernorutsky. Algorithmic problems of stiff optimization. St. Petersburg Polytechnic University Journal of Engineering Science and Technology, no. 6:141-152, 2012.
[3] Yu. E. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, August 2006.
[4] M. Avriel. Nonlinear Programming: Analysis and Methods. Mineola, NY: Dover Publications, 2003.
[5] P. Deuflhard. Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms. Springer Series in Computational Mathematics, vol. 35. Springer, 2004.
[6] S. Pissanetsky. Sparse Matrix Technology. Moscow: Mir, 1988 (Russian translation).