=Paper=
{{Paper
|id=Vol-2514/paper17
|storemode=property
|title=Chebyshev methods for optimization of large systems on stiff target functionals with weakly filled structured hessian matrices
|pdfUrl=https://ceur-ws.org/Vol-2514/paper17.pdf
|volume=Vol-2514
|authors=Igor Chernorutsky,Nikita Voinov,Anna Rubtsova,Lina Kotlyarova,Olga Aleksandrova
}}
==Chebyshev methods for optimization of large systems on stiff target functionals with weakly filled structured Hessian matrices==
Igor G. Chernorutsky
Peter the Great St. Petersburg Polytechnic University
29, Polytechnicheskaya str., St. Petersburg, 195251
igcher1946@mail.ru
Nikita V. Voinov
Peter the Great St. Petersburg Polytechnic University
29, Polytechnicheskaya str., St. Petersburg, 195251
voinov@ics2.ecd.spbstu.ru
Anna V. Rubtsova
Peter the Great St. Petersburg Polytechnic University
29, Polytechnicheskaya str., St. Petersburg, 195251
annarub2011@yandex.ru
Lina P. Kotlyarova
Peter the Great St. Petersburg Polytechnic University
29, Polytechnicheskaya str., St. Petersburg, 195251
lina1305@yandex.ru
Olga V. Aleksandrova
Peter the Great St. Petersburg Polytechnic University
29, Polytechnicheskaya str., St. Petersburg, 195251
o.v.alexandrov@yandex.ru
Abstract: We analyze matrix gradient methods for optimizing large systems of arbitrary nature with respect to target functionals whose Hessian matrices have a special character. It is assumed that the Hessian matrix is sparse and structured (its nonzero elements occupy fixed positions). The latter condition is typical for the optimization of large systems consisting of interconnected subsystems of lower dimension. We argue that the use of standard Newton-type methods is difficult because of the possible sign indefiniteness of the Hessian matrices (non-convexity) and the need to store full-size high-order matrices. In the procedures under consideration, memory savings are achieved by storing only sparse matrices in packed form.
Keywords: gradient methods, relaxation functions, non-convex problems, stiff functionals.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Introduction
To solve the unconstrained minimization problem
J(x) → min, x ∈ R^n, J ∈ C^2(R^n),
a class of matrix gradient methods is considered:
x^{k+1} = x^k − h_k H(G_k) ∇J(x^k), h_k ∈ R^1, (1)
where G_k = ∇²J(x^k) and H is a matrix function of G_k.
This class of methods includes, as particular cases, classical procedures such as gradient descent, the Levenberg-Marquardt methods, Newton's method, and methods with exponential relaxation (ER-methods).
We present an approach based on the concept of the relaxation function [1] for constructing and analyzing nontraditional gradient methods with the Chebyshev relaxation function. The main assumptions used in the construction of this class of methods are:
- high dimension of the argument vector x (n >> 1) and the undesirability of storing a full-size (n×n) matrix in the computer's memory;
- a high stiffness degree [1, 2] of the functional J(x) in a wide range of variation of the argument;
- convexity of the minimized functional J(x) is guaranteed only in a neighborhood of the minimum point.
If the stiffness degree of the optimality criterion J(x) is sufficiently high, then standard computational tools are not very effective, for the reasons stated in [1, 2]. Newton-type methods, as is known [3-5], are not intended for solving non-convex problems; moreover, they lose their effectiveness under conditions of high stiffness. The use of algorithms from the class of well-known Levenberg-Marquardt methods for a functional J(x) with a high stiffness degree is complicated by the algorithmic selection of the corresponding regularization parameter.
Most often in this situation, it is recommended to use various non-matrix forms of the conjugate gradient method (SG methods). However, it will be shown below that the class of matrix gradient schemes (1) contains algorithms that are more efficient for the considered problems than the SG methods.
2 Multiparameter Optimization Tasks
We define large systems as systems described by models with a large number of controlled parameters.
An optimized system may be represented as a set of smaller interconnected subsystems (Fig. 1).
Figure 1 – The complex of interrelated subsystems
Y denotes the output characteristics of the subsystems.
The requirements for the output parameters of the system (specifications) are given in the form of a system of inequalities:
y_j(x_j, x_q) ≥ t_j, j ∈ [1 : q − 1]; y_q(x_q) ≥ t_q, (2)
where x_j is the n_j-dimensional local vector of controlled parameters, and x_q is the n_q-dimensional vector of controlled parameters affecting all q output parameters and interconnecting the individual subsystems of the system being optimized. The dimension of the full vector of controlled parameters x = [x_1, x_2, …, x_q] is equal to
n = Σ_{i=1}^q n_i. (3)
Using the known technique of reducing the solution of inequality systems to optimization tasks, we obtain the target functional of the form
J(x) = Σ_{j=1}^q U_j(x_j, x_q) → min, x ∈ R^n. (4)
If the system of inequalities y_i(x) ≥ t_i is given, it is equivalent to the system
z_i(x) = t_i − y_i(x) ≤ 0.
The corresponding optimization task is
J_1(x) = max_i z_i(x) → min_x.
It is convex (when the z_i(x) are convex) and can be reduced to an equivalent formulation that allows for non-convexity:
J_2(x) = max_i exp(α z_i(x)) → min_x.
For a sufficiently large α, the last task is equivalent to the following:
J_3(x) = Σ_{i=1}^m exp(α z_i(x)) → min_x.
This is the main target functional in solving systems of inequalities. It is non-convex and has the form (4).
Functionals (4) also arise in other formulations of optimization tasks of real systems, therefore the problem (4) has a
rather general character.
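To make the construction concrete, here is a minimal sketch (Python; the model y_i(x) = x_i^2, the thresholds t_i and the value of the aggregation parameter α are purely illustrative assumptions, not taken from the paper) of how the inequality residuals are folded into the exponential target functional J_3:

```python
import numpy as np

# Hypothetical residuals z_i(x) = t_i - y_i(x) for a toy model y_i(x) = x_i^2.
def residuals(x, t):
    return t - x**2

def J3(x, t, alpha=10.0):
    """Exponential aggregation of the inequality residuals:
    J3(x) = sum_i exp(alpha * z_i(x)).
    For a large alpha, minimizing J3 drives max_i z_i(x) down, i.e. pushes
    all specifications y_i(x) >= t_i towards fulfilment."""
    return np.sum(np.exp(alpha * residuals(x, t)))

t = np.array([1.0, 2.0, 0.5])
print(J3(np.array([0.5, 0.5, 0.5]), t))  # specifications violated: large value
print(J3(np.array([1.5, 2.0, 1.0]), t))  # specifications satisfied: small value
```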
Further, we will consider methods for solving problem (4) under the following additional assumptions:
- solving the analysis task for the optimized system requires significant computational cost; therefore, in the optimization process it is required to minimize the number of evaluations of J(x);
- the fill coefficient ψ of the matrix G(x) = ∇²J(x) is small; usually one can assume ψ ~ 1/q.
It is easy to establish that the structure of the matrix G(x) does not depend on the point x:
G(x) = | G_11  0     0     …  G_1q |
       | 0     G_22  0     …  G_2q |
       | 0     0     G_33  …  G_3q |
       | …     …     …     …  …    |
       | G_q1  G_q2  G_q3  …  G_qq |.
The submatrices G_ij have dimensions n_i × n_j, and the total number of nonzero elements is
Σ_{i=1}^q n_i^2 + 2 n_q Σ_{i=1}^{q−1} n_i.
Thus, taking into account the symmetry of the matrix G(x), it is necessary to store in the computer's memory
Σ_{i=1}^q (n_i^2 + n_i)/2 + n_q Σ_{i=1}^{q−1} n_i
nonzero elements. The necessary information about storage schemes for sparse matrices is widely presented in the literature on sparse matrix technology [6].
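The packed storage mentioned here can be organized in many ways; the following minimal sketch (Python/NumPy, an illustrative assumption rather than the authors' implementation) keeps only the diagonal blocks G_ii and the border blocks G_iq and shows the matrix-vector product that the method needs, touching only the stored nonzero blocks:

```python
import numpy as np

class ArrowheadHessian:
    """Packed storage of G: diagonal blocks G_ii (i = 1..q) and border blocks
    G_iq (i < q). By symmetry G_qi = G_iq^T, so only these blocks are kept."""

    def __init__(self, diag_blocks, border_blocks):
        self.diag = diag_blocks        # q dense (n_i x n_i) arrays
        self.border = border_blocks    # q-1 dense (n_i x n_q) arrays
        self.sizes = [D.shape[0] for D in diag_blocks]
        self.offsets = np.cumsum([0] + self.sizes)

    def matvec(self, v):
        """y = G v using only the stored nonzero blocks."""
        q = len(self.diag)
        y = np.zeros_like(v)
        vq = v[self.offsets[q - 1]:self.offsets[q]]
        yq = self.diag[q - 1] @ vq
        for i in range(q - 1):
            lo, hi = self.offsets[i], self.offsets[i + 1]
            y[lo:hi] = self.diag[i] @ v[lo:hi] + self.border[i] @ vq
            yq += self.border[i].T @ v[lo:hi]      # contribution of G_qi = G_iq^T
        y[self.offsets[q - 1]:] = yq
        return y

# Usage with random symmetric blocks (n_1 = n_2 = 3, n_q = 2):
rng = np.random.default_rng(0)
diag = [a + a.T for a in (rng.standard_normal((3, 3)),
                          rng.standard_normal((3, 3)),
                          rng.standard_normal((2, 2)))]
border = [rng.standard_normal((3, 2)), rng.standard_normal((3, 2))]
G = ArrowheadHessian(diag, border)
print(G.matvec(np.ones(8)))
```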
3 Methods with Chebyshev relaxation functions
Let the eigenvalues λ_i(G_k) ∈ [−m, M], M ≥ m ≥ 0. According to the above assumptions and the requirements for relaxation functions formulated in [1], the most rational method should have a relaxation function R(λ) whose values sharply decrease from R = 1 at λ = 0 and remain small throughout the rest of the range (0, M]. Conversely, for λ < 0 the function R(λ) should increase rapidly. In addition, the matrix function H corresponding to R(λ) must be constructed without matrix multiplications, in order to preserve the sparsity of the matrix G_k = ∇²J(x^k).
We show that, up to a constant multiplier, the shifted Chebyshev polynomials of the second kind P_s(λ) can be used as such an R(λ); they satisfy the known recurrence relations
P_1(λ) = 1, P_2(λ) = 2(1 − 2λ); P_{s+1}(λ) = 2(1 − 2λ)P_s(λ) − P_{s−1}(λ). (5)
Figure 2 – Chebyshev relaxation function
Graphs of P_s(λ)/s for several values of s are shown in Fig. 2. Setting R(λ) = P_L(λ)/L with a sufficiently large value of L, we obtain an arbitrarily fast relaxation of any term in the representation of the approximating paraboloid [1]
f(x^{k+1}) = (1/2) Σ_{i=1}^n ξ_{i,k}^2 λ_i R^2(λ_i), (6)
where
x^k = Σ_{i=1}^n ξ_{i,k} u_i
is the decomposition of the current vector over the eigenvectors u_i of the matrix G_k.
This statement follows from the convergence of the sequence {P_s(λ)/s} to zero as s → ∞ at every point of the open interval (0, 1). Further, we assume that the eigenvalues of the matrix G_k are normalized to the interval (0, 1). For this it suffices to consider the matrix G/||G|| instead of the matrix G, and the vector g/||G|| instead of the gradient vector g.
The dependence H(λ) corresponding to the adopted R(λ) has the form
H(λ) = [1 − R(λ)]/λ = [1 − P_L(λ)/L]/λ. (7)
Constructing methods (1) directly with the function (7) is possible, but it would require solving, at each step k, large linear systems of equations with a sparse matrix. Below it is shown that there are more effective implementation techniques.
From (7) it follows that H(λ) is a polynomial of degree L − 2, while R(λ) has degree L − 1. Therefore, to implement the matrix gradient method with the indicated function H(λ), there is, generally speaking, no need to solve linear systems. The method is as follows:
x^{k+1} = x^k − [α_1 E + α_2 G_k + … + α_{L−1} G_k^{L−2}] g^k = x^k − H(G_k) g^k. (8)
The implementation of method (8) can be based on techniques for computing the coefficients α_i for various degrees L. In this case, the number L should be chosen from the condition of the fastest decrease of J(x). An alternative, more economical approach based on other considerations is discussed below.
For the functions
H_s(λ) = α_1 + α_2 λ + … + α_{s−1} λ^{s−2}, s = 2, 3, …,
from (5) we can get the recurrence relation
(s + 1)H_{s+1} = 2s(1 − 2λ)H_s − (s − 1)H_{s−1} + 4s; H_1 = 0, H_2 = 2, s ∈ [2 : L − 1]. (9)
Consequently, we have
x^{k+1}[s+1] = x^k − H_{s+1}(G_k) g^k = x^k − (2s/(s+1)) (E − 2G_k) H_s(G_k) g^k + ((s−1)/(s+1)) H_{s−1}(G_k) g^k − (4s/(s+1)) g^k, s ∈ [2 : L − 1],
or
Δ_{s+1} = x^{k+1}[s+1] − x^k = (2s/(s+1)) (E − 2G_k) Δ_s − ((s−1)/(s+1)) Δ_{s−1} − (4s/(s+1)) g^k;
Δ_1 = 0, Δ_2 = −2g^k, s ∈ [2 : L − 1]. (10)
Here x^{k+1}[s] is the s-th approximation to the vector x^{k+1} = x^{k+1}[L].
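A minimal sketch (Python/NumPy, our own illustration rather than the authors' code) of the inner recurrence (10); `matvec_E_minus_2G` stands for the product with (E − 2G_k), which in practice would use the packed storage of Section 2, and the dense cross-check at the end only verifies that the recurrence reproduces −H_L(G_k)g^k with H from (7):

```python
import numpy as np

def chebyshev_step(matvec_E_minus_2G, g, L):
    """Build Delta_L from recurrence (10):
    Delta_{s+1} = (2s/(s+1))(E - 2G)Delta_s - ((s-1)/(s+1))Delta_{s-1} - (4s/(s+1))g,
    Delta_1 = 0, Delta_2 = -2g. One product with (E - 2G) per inner step."""
    d_prev = np.zeros_like(g)      # Delta_1
    d_cur = -2.0 * g               # Delta_2
    for s in range(2, L):          # s = 2, ..., L-1
        d_next = (2.0 * s / (s + 1)) * matvec_E_minus_2G(d_cur) \
                 - ((s - 1.0) / (s + 1)) * d_prev - (4.0 * s / (s + 1)) * g
        d_prev, d_cur = d_cur, d_next
    return d_cur                   # Delta_L

# Cross-check against the closed form -H_L(G)g on a small dense example.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
G = A @ A.T
G = G / np.linalg.norm(G)                       # eigenvalues normalized into (0, 1)
g = rng.standard_normal(5)
L = 8
delta = chebyshev_step(lambda v: v - 2.0 * (G @ v), g, L)
w, U = np.linalg.eigh(G)
P_prev, P_cur = np.ones_like(w), 2.0 * (1.0 - 2.0 * w)          # P_1, P_2 from (5)
for s in range(2, L):
    P_prev, P_cur = P_cur, 2.0 * (1.0 - 2.0 * w) * P_cur - P_prev
H_diag = (1.0 - P_cur / L) / w                                   # H_L(lambda) from (7)
print(np.allclose(delta, -U @ (H_diag * (U.T @ g))))             # True
```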
Thus, with a fixed quadratic approximation f(x) of the functional J(x) in the neighborhood of x = x^k, we can pass from P_s to P_{s+1} at the cost of a single multiplication of the matrix (E − 2G_k) by a vector, fully exploiting the sparsity of the matrix G_k and without additional gradient calculations. The efficiency of algorithm (9) for large values of the stiffness coefficient [1, 2] is determined by the relaxation factors for the small eigenvalues of the matrix G_k. Consider the positive part of the spectrum (λ ≥ 0), which is especially important in the neighbourhood of the optimum, where the matrix G(x) is positive definite. The main advantage of the method with R_s(λ) = P_s(λ)/s is that already at small s there is a noticeable suppression of the terms of (6) over a wide range of values of λ. Below are the values R_s* of the internal maximum of |R_s(λ)| and the boundaries β_s, γ_s of the ranges β_s ≤ λ ≤ γ_s on which |R_s(λ)| ≤ R_s*:
s          3      4      5      6      7      8
R_s*       0.333  0.272  0.250  0.239  0.233  0.230
β_s        0.147  0.092  0.061  0.044  0.033  0.025
γ_s        0.853  0.908  0.939  0.956  0.967  0.975
−R′_s(0)   5.30   10.0   16.0   23.3   32.0   42.0
In the left part of the spectrum (λ ≤ 0) we have
R_s(λ) ≥ 1 = R_s(0),
therefore the values of the derivatives R′_s(0) in the last row of the table characterize the relaxation multipliers for the negative terms in (6). The derivatives R′_s(0) can be calculated on the basis of the following recurrence relations:
P′_1(0) = 0, P′_2(0) = −4; P′_{s+1}(0) = 2P′_s(0) − 4s − P′_{s−1}(0); R′_L(0) = P′_L(0)/L.
The values β_s, γ_s for s ≥ 8 (λ ≥ 0) can be calculated using the asymptotic formulas
β_s ≈ 1.63/s^2, γ_s ≈ 1 − β_s; (11)
in this case R_s* ≈ 0.22.
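The table and the asymptotics (11) are easy to verify numerically; the following short Python sketch (our own check, not part of the paper) evaluates R_s(λ) by the recurrence (5) on a fine grid and R′_s(0) by the recurrence above:

```python
import numpy as np

def R(s, lam):
    # R_s(lambda) = P_s(lambda)/s via the recurrence (5), s >= 2
    p_prev, p_cur = np.ones_like(lam), 2.0 * (1.0 - 2.0 * lam)   # P_1, P_2
    for _ in range(s - 2):
        p_prev, p_cur = p_cur, 2.0 * (1.0 - 2.0 * lam) * p_cur - p_prev
    return p_cur / s

lam = np.linspace(1e-4, 1.0 - 1e-4, 200001)
for s in range(3, 9):
    r = np.abs(R(s, lam))
    is_max = (r[1:-1] >= r[:-2]) & (r[1:-1] >= r[2:])    # interior local maxima of |R_s|
    R_star = r[1:-1][is_max].max()
    inside = np.where(r <= R_star)[0]
    beta, gamma = lam[inside[0]], lam[inside[-1]]
    dp_prev, dp_cur = 0.0, -4.0                          # P'_1(0), P'_2(0)
    for j in range(2, s):
        dp_prev, dp_cur = dp_cur, 2.0 * dp_cur - 4.0 * j - dp_prev
    print(s, round(R_star, 3), round(beta, 3), round(gamma, 3), round(-dp_cur / s, 2))
# The printed rows reproduce the table above up to small rounding differences;
# for s = 8 they give roughly 0.230, 0.025, 0.975, 42.0, with beta_8 close to 1.63/8^2.
```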
The relation (11) is obtained from the following representation of the Chebyshev polynomials:
P_L(λ)/L = sin(Lθ)/(L sin θ), λ = sin^2(θ/2), θ ∈ [0, π], λ ∈ [0, 1].
Indeed, for sufficiently small λ we have
P_L(λ)/L ≈ sin(√μ)/√μ, μ = 4L^2 λ.
If we set x = √μ, we get
F(μ) = Φ(x) = sin x / x.
We have
|F(μ)| ≤ F(μ_cr) for μ ≥ μ_cr,
where
μ_cr = x_cr^2 ≈ 6.523, Φ(x_cr) ≈ 0.22.
Thus, if we suppose μ_cr ≈ 4L^2 λ_cr, we get the following statement: if for the smallest (positive) eigenvalue λ_m the inequality
μ_min = 4L^2 λ_m ≥ μ_cr ≈ 6.523
is fulfilled, that is, if
λ_m ≥ 6.523/(4L^2) ≈ 1.63/L^2, (12)
then for all λ ≥ λ_m we will have |R_L(λ)| ≤ 0.22. From (12) follows (11).
The enlarged scheme of the algorithm based on the relation (10) can be implemented using the following sequence of steps. It is assumed that all task variables are properly normalized. We also suppose that the variables are numbered in some optimal way, ensuring efficient storage of the sparse matrix (E − 2G_k) in the computer's memory.
4 RELCH Algorithm
Step 1. Set the starting point x; calculate J := J(x); set L, which determines the number of recalculations by formula (10) (on the a priori choice of L, see below).
Step 2. Calculate g := ∇J(x), G := ∇²J(x); put g := g/||G||, G := G/||G||; τ := 1.
Step 3. Using formula (10), build Δ_L; put x_t := x + Δ_L.
Step 4. Calculate J_t := J(x_t). If J_t ≥ J, go to step 5, otherwise go to step 6.
Step 5. Put τ := τ/2, x_t := x + τΔ_L and go to step 4.
Step 6. Put x := x_t, J := J_t and go to step 2.
The criterion for terminating the process is not specified here. As a rule, the calculations end when the specified number of evaluations of the functional has been exhausted or when the algorithm is stopped explicitly. The number of recalculations L by formula (10) is a parameter set by the user. According to (11), it is initially advisable to take
L = √(1.63 η) ≈ 1.3 √η,
where η is an estimate of the degree of ravine (stiffness) of the minimized functional. With this choice of L, the relaxation multipliers in the positive part of the spectrum are guaranteed to be less than 0.23. When designing algorithmic methods for choosing L, it is necessary to take into account that the sequence {J_s}, where J_s = J(x^{k+1}[s]), will not, in general, decrease monotonically as s increases. In step 5 of the algorithm, the regulation of the norm of the advance vector is applied in order to prevent the step from leaving the domain of validity of the local quadratic model of the functional.
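A compact sketch of the whole RELCH scheme (Python/NumPy; the oracles `J`, `grad`, `hess`, the crude stopping rule and the use of dense matrices instead of packed storage are simplifying assumptions of ours):

```python
import numpy as np

def relch(J, grad, hess, x0, eta_est, max_evals=10_000):
    """Steps 1-6 above with L = 1.3*sqrt(eta_est) recalculations by formula (10)."""
    x, Jx, n_evals = x0.copy(), J(x0), 1
    L = max(3, int(round(1.3 * np.sqrt(eta_est))))            # Step 1
    while n_evals < max_evals:
        g, G = grad(x), hess(x)                               # Step 2
        nrm = np.linalg.norm(G)
        g, G, tau = g / nrm, G / nrm, 1.0
        d_prev, d_cur = np.zeros_like(g), -2.0 * g            # Step 3: Delta_1, Delta_2
        for s in range(2, L):
            d_prev, d_cur = d_cur, (2.0 * s / (s + 1)) * (d_cur - 2.0 * (G @ d_cur)) \
                - ((s - 1.0) / (s + 1)) * d_prev - (4.0 * s / (s + 1)) * g
        xt = x + d_cur                                        # x_t := x + Delta_L
        while True:
            Jt = J(xt); n_evals += 1                          # Step 4
            if Jt < Jx:
                x, Jx = xt, Jt                                # Step 6: accept the point
                break
            tau *= 0.5                                        # Step 5: halve the step
            if tau < 1e-12:
                return x, Jx                                  # no decrease at all: stop
            xt = x + tau * d_cur
    return x, Jx

# Usage on a stiff diagonal quadratic with condition number eta = 1e4.
n, eta = 50, 1.0e4
A = np.diag(np.geomspace(1.0, eta, n))
x_min, J_min = relch(lambda x: 0.5 * x @ (A @ x), lambda x: A @ x, lambda x: A,
                     x0=np.ones(n), eta_est=eta, max_evals=200)
print(J_min)
```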
5 Convergence Characteristics
We give an estimate of the efficiency of the method (10) in comparison with the conjugate gradient methods (SG methods). For tasks of large dimension (when the number of iterations is less than the dimension), the convergence of SG methods can be guaranteed only at the rate of a geometric progression, even for strongly convex quadratic functionals.
We consider the case
f(x) = (1/2)⟨Gx, x⟩, G > 0,
and estimate the rate of convergence of the SG method to the minimum point x* = 0.
The iterate x^k obtained by the SG method can be represented as
x^k = (E + c_1 G + c_2 G^2 + … + c_k G^k) x^0 = P_k(G) x^0,
where P_k(G) is a matrix polynomial of degree k. Moreover, it follows from the properties of the SG method that at each iteration the coefficients c_1, …, c_k of the polynomial P_k(G) take values that minimize f(x^k), which differs only by a multiplier from the error function. In other words, the k-th approximation minimizes f(x^k) among the vectors x^0 + V, where the vector V is an element of the subspace spanned by the vectors Gx^0, G^2 x^0, …, G^k x^0.
Suppose
x^0 = Σ_{i=1}^n ξ_{i,0} u_i,
where {u_i} is an orthonormal basis of eigenvectors of the matrix G; then we get
x^k = P_k(G) Σ_{i=1}^n ξ_{i,0} u_i = Σ_{i=1}^n ξ_{i,0} P_k(λ_i) u_i, P_k(0) = 1,
f(x^k) = (1/2)⟨Gx^k, x^k⟩ = (1/2) Σ_{i=1}^n ξ_{i,0}^2 P_k^2(λ_i) λ_i. (13)
Hence, we have
||x^0||^2 = Σ_{i=1}^n ξ_{i,0}^2,
||x^k||^2 = Σ_{i=1}^n ξ_{i,0}^2 P_k^2(λ_i) ≤ max_i P_k^2(λ_i) ||x^0||^2. (14)
As the polynomial P_k(λ) we choose the polynomial closest to optimal, namely the one deviating least from zero on the interval [m, M] that contains all the eigenvalues of the positive definite matrix G and is normalized so that P_k(0) = 1.
By the linear change of variables
t = (M + m − 2λ)/(M − m)
the task is reduced to constructing a polynomial least deviating from zero on the interval t ∈ [−1, 1] and taking at the point t_0 = (M + m)/(M − m) (corresponding to λ = 0) the value 1. The solution of the last problem is given by the polynomial
P_k(λ) = T_k(t(λ))/T_k(t_0),
where T_k(t) = cos(k arccos t) is the Chebyshev polynomial. At the same time
max_{−1 ≤ t ≤ 1} |T_k(t)/T_k(t_0)| = (1/T_k(t_0)) max_{−1 ≤ t ≤ 1} |T_k(t)|.
It is obvious that
max_{−1 ≤ t ≤ 1} |T_k(t)| = 1,
that is why
L_k = max_λ |P_k(λ)| = (1/T_k(t_0)) max_t |T_k(t)| = 1/T_k(t_0), λ ∈ [m, M], t ∈ [−1, 1].
Since the representation
T_k(t) = 0.5[(t + √(t^2 − 1))^k + (t − √(t^2 − 1))^k]
is valid, we obtain
L_k = 2[((√M + √m)/(√M − √m))^k + ((√M − √m)/(√M + √m))^k]^{−1} ≤ 2((√M − √m)/(√M + √m))^k.
For sufficiently large k (k ≥ k_0) we have
L_k ≈ 2((√M − √m)/(√M + √m))^k = 2((√η − 1)/(√η + 1))^k = 2(1 − 2/(√η + 1))^k, η = M/m. (15)
From (13) and (14) we get
||x^k|| ≤ L_k ||x^0||,
or
||x^k|| ≤ 2q^k ||x^0||, k ≥ k_0, (16)
where q = 1 − 2/(√η + 1). Thus, the convergence of the SG method with the speed of a geometric progression is proved.
The exact value of L_k, valid for any k, is equal to
L_k = 2[(1 + 2/(√η − 1))^k + (1 − 2/(√η + 1))^k]^{−1}, L_0 = 1.
From (16) it follows that when η >> 1, convergence can be very slow.
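The gap between the exact L_k and the bound in (16) can be seen directly; since (1 + 2/(√η − 1)) = 1/q, the exact expression equals 2q^k/(1 + q^{2k}). A small Python check (ours):

```python
import numpy as np

eta = 1.0e4                                    # stiffness degree M/m
q = 1.0 - 2.0 / (np.sqrt(eta) + 1.0)           # as in (16)
for k in (1, 10, 50, 200, 1000):
    L_exact = 2.0 * q**k / (1.0 + q**(2 * k))  # exact L_k rewritten through q
    print(k, L_exact, 2.0 * q**k)              # the bound 2*q^k exceeds L_k by a factor < 2
```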
The "finiteness" of the SG method, that is, the exact solution of the problem of minimizing a quadratic function (paraboloid) in n steps, where n is the dimension of the search space, manifests itself only after a sufficiently large number of iterations. The degree of the polynomial P_k(λ) in (13) then becomes equal to n, and the optimal choice of this polynomial reduces to placing its n roots at the points λ_1, λ_2, …, λ_n, which leads to an exact solution of the task (f(x^n) = 0).
It is easy to see that for a quadratic function of general form
f(x) = (1/2)⟨Gx, x⟩ − ⟨b, x⟩ + c
the estimate (16) transforms into
||x^k − x*|| ≤ 2q^k ||x^0 − x*||, (17)
where x* is the optimal point, which in the general case does not coincide with the origin of coordinates.
An important feature of RELCH-type algorithms is that the corresponding relaxation multipliers are determined only by the number of iterations L and by the stiffness degree η of the problem, regardless of the dimension n. At the same time, in the schemes of SG methods, on the order of n iterations are required to complete each cycle of descent; otherwise, according to (17), the rate of convergence may be very low. In addition, each iteration of the SG method, even in the quadratic case, requires a new gradient calculation, that is, additional computational costs for evaluating the target functional. We will further assume that the RELCH algorithm is implemented with a constant L = 1.3√η, so that the relaxation multipliers in the region λ > 0 do not exceed 0.23.
We will consider the problem of minimizing a quadratic functional f(x) = (1/2)⟨Gx, x⟩ with a positive definite matrix G. Let us estimate the number of evaluations of f(x) required to reach a control vector x′ with norm ||x′|| ≤ 0.23 by the SG method and by the RELCH algorithm from a starting point x^0 with ||x^0|| = 1. Once the point x′ is reached, the whole situation repeats; therefore, the comparative efficiency estimates obtained below are of a rather general character.
We will assume that two-sided finite-difference relations are used to calculate the derivatives, which in the following analysis gives an additional advantage to the SG method.
To reach the vector x′, the RELCH algorithm needs to calculate at the point x^0 a weakly filled Hessian matrix and the gradient vector ∇f(x^0). With fill coefficient ψ, this requires about 2ψn^2 evaluations of f. After that, L = 1.3√η recalculations by formula (10) are carried out, which require no additional evaluations of the objective functional f.
To obtain the vector x′, the SG method needs N iterations, where N is determined from
||x^N|| ≈ 2q^N ≤ 0.23,
that is, N ≈ −2.2/ln q. To perform each iteration, the gradient vector must be updated, which in the general case of two-sided finite-difference approximations of the derivatives requires 2n evaluations of f(x). The total number of evaluations of f is therefore about −4.4n/ln q. The relative gain in the number of f evaluations of the RELCH method compared to the SG method is thus about −2.2/(ψ n ln q). It is obvious that as η → ∞ we have q(η) → 1, and the gain grows without bound. Typical values of the gain for ψ = 0.01 and n = 1000 are given below:
η       100    1000   1500   10^4   10^5
gain    1.0    3.4    4.0    11.0   35.0
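This row of gains can be checked directly (Python sketch; the variable names are ours), using q(η) = (√η − 1)/(√η + 1) from (15) and (16):

```python
import numpy as np

psi, n = 0.01, 1000                         # fill coefficient and dimension used above
for eta in (100, 1000, 1500, 1.0e4, 1.0e5):
    q = (np.sqrt(eta) - 1.0) / (np.sqrt(eta) + 1.0)
    gain = -2.2 / (psi * n * np.log(q))     # ratio of SG to RELCH evaluations of f
    print(int(eta), round(gain, 1))
# prints roughly 1.1, 3.5, 4.3, 11.0, 34.8, close to the tabulated values
```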
Thus, to obtain comparable results at η = 10^4, the RELCH algorithm requires approximately 11 times fewer evaluations of f than the SG method. However, it should be taken into account that as η increases, the number L of recalculations by formula (10) grows. This can lead to an increased influence of computational errors in the calculation of Δ_s for large values of s.
Example. We will consider the model task of minimizing a quadratic functional f(x) with n = 200, η = 1500, ψ = 0.025. For definiteness, we assume that the time of a single evaluation of f(x) is equivalent to performing 10^2·n floating-point multiplication operations, and that a single multiplication on some device conditionally takes t_mult = 3·10^{−5} s. The calculation of one value of f(x) then takes t_f ≈ 0.6 s of processor time. Calculating ∇f and ∇²f by the general finite-difference formulas requires, respectively, about 2n·t_f ≈ 4 min and 2ψn^2·t_f ≈ 20 min. The number of recalculations by formula (10) is L = 1.3√η ≈ 50. At each recalculation, the weakly filled matrix E − 2G_k is multiplied by a vector Δ_s, which requires about ψn^2·t_mult ≈ 3·10^{−2} s of computer time. The time to construct the vector Δ_50, excluding the calculation of ∇f and ∇²f, is therefore about 50·3·10^{−2} ≈ 1.5 s and may be neglected.
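The arithmetic of this example can be reproduced in a few lines (Python sketch; the variable names are ours):

```python
n, eta, psi = 200, 1500, 0.025
t_mult = 3e-5                          # seconds per floating-point multiplication
t_f = 1e2 * n * t_mult                 # one evaluation of f(x): about 0.6 s
t_grad = 2 * n * t_f                   # two-sided differences for grad f: 240 s = 4 min
t_hess = 2 * psi * n**2 * t_f          # weakly filled Hessian: 1200 s = 20 min
L = round(1.3 * eta ** 0.5)            # recalculations by formula (10): about 50
t_matvec = psi * n**2 * t_mult         # one product (E - 2G_k) * Delta_s: 0.03 s
print(t_f, t_grad / 60, t_hess / 60, L, t_matvec, L * t_matvec)
```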
As a result, constructing the control vector x′ with ||x′|| ≤ 0.23 by the RELCH method takes about 20 min of machine time. The SG method will cost, respectively, about 4.0·(20 min) ≈ 1.3 hours, where 4.0 is the gain value for η = 1500 given above.
When the RELCH algorithm is applied again, starting from the constructed vector x′, we obtain a vector x″ with ||x″|| ≤ 0.23||x′||, and so on. Therefore, if we denote the corresponding sequence of vectors by {x_m}, the norm of these vectors decreases according to the law of geometric progression, ||x_m|| ≤ d^m ||x^0||, where d < 0.23 regardless of the values of η and n.
An important additional advantage of the RELCH algorithm in comparison with the SG method is its rather high efficiency in the non-convex case, since the relaxation function of the method in the left half-plane lies entirely in the admissible region, and the relaxation multipliers for λ < 0 grow rapidly in absolute value when passing from s to s + 1. The growth characteristics were given above.
6 Conclusion
The described class of matrix gradient methods has shown in practice a sufficiently high performance under conditions of high stiffness and non-convexity of the objective functionals. When optimizing large systems, it is possible to use efficient "packed" storage forms for the matrices of second derivatives, which significantly reduces the requirements on computer memory. At the same time, the method retains its main characteristics for small systems as well, competing with the main optimization procedures of nonlinear programming.
Acknowledgements
The work was financially supported by the Ministry of Education and Science of the Russian Federation in the framework
of the Federal Targeted Program for Research and Development in Priority Areas of Advancement of the Russian Scientific
and Technological Complex for 2014-2020 (No. 14.584.21.0022, ID RFMEFI58417X0022).
References
[1] I. Chernorutskiy, P. Drobintsev, V. Kotlyarov and N. Voinov. A New Approach to Generation and Analysis of
Gradient Methods Based on Relaxation Function. Proceedings - 2017 UKSim-AMSS 19th International Conference
on Modelling and Simulation, UKSim 2017:83-88, May 2018.
[2] I. G. Chernorutsky. Algorithmic problems of stiff optimization. St. Petersburg Polytechnic University Journal of Engineering Science and Technology, no. 6:141-152, 2012.
[3] Yu. E. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical
Programming, 108(1):177-205, August 2006.
[4] M. Avriel. Nonlinear Programming: Analysis and Methods. Mineola, NY: Dover Publications, 2003.
[5] P. Deuflhard. Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms. Springer Series in Computational Mathematics, Vol. 35. Springer, 2004.
[6] S. Pissanetzky. Sparse Matrix Technology. Moscow: Mir, 1988.