<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Boris T. Polyak</string-name>
          <email>boris@ipu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey A. Tremba</string-name>
          <email>atremba@ipu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>V.A. Trapeznikov Institute of Control Sciences RAS</institution>
          ,
          <addr-line>Profsoyuznaya 65, 117997 Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>827</fpage>
      <lpage>830</lpage>
      <abstract>
        <p>The Newton method is a well-known tool for solving finite-dimensional systems of equations. The pure Newton-Raphson method has at least a quadratic convergence rate, but its convergence radius is often limited. The damped version of the Newton method uses a smaller step-size in the same direction, with a larger convergence ball but a linear convergence rate. We propose a mixed step-size strategy, incorporating both the quadratic convergence rate and a wide (in some cases global) convergence radius. The method can be used for under-determined equations and Banach-space equations. We also present a modification of the proposed method requiring no a-priori knowledge of problem constants. The method may also be used for solving a class of non-convex optimization problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>We aim to explore solvability conditions of generic equation</title>
        <p>P (x) = 0;
with operator P : Rn → Rm; m ≤ n. Most results of this note are valid for Banach spaces as well, but we stick
to the finite-dimensional setting for simplicity. Through the paper we assume existence of derivative P ′ of the
operator in a ball Bρ = {x : ∥x − x0∥ ≤ }. We also assume that the derivative is Lipschitz with constant L in
the ball:</p>
        <p>∥P ′(xa) − P ′(xb)∥ ≤ L∥xa − xb∥; ∀xa; xb ∈ Bρ:
Solvability of the equation (1) can be expressed in terms of conditions on a given point x0 : P (x0) ̸= 0 solely, or
in a set around condition point, e.g. Bρ.</p>
        <p>These additional conditions are related to the non-degeneracy of P , which may be expressed through its derivative
P ′. There are two types of conditions: point-wise</p>
        <p>∥P ′(x0)T h∥∗ ≥ μ0∥h∥∗,  μ0 &gt; 0,  ∀h ∈ Rm, (3)</p>
        <p>and set-wise</p>
        <p>∥P ′(x)T h∥∗ ≥ μ∥h∥∗,  μ &gt; 0,  ∀h ∈ Rm, ∀x ∈ Bρ. (4)</p>
        <p>Both conditions are generalizations of the notion “existence of an inverse matrix”, and are valid for under-determined
(also Banach) systems of equations. In turn, this notion can be generalized to non-differentiable and
infinite-dimensional settings and is known as “metric regularity”. Thus the algorithms discussed in this note are applicable
to Banach spaces as well.</p>
        <p>For example, let the norms in the image and pre-image spaces be Euclidean. Then the Lipschitz constant of the derivative
is coupled with the spectral operator norm, and (3) simply states that the least singular value of the matrix P ′(x0)
is positive: μ0 = σm(P ′(x0)) &gt; 0. While the point-wise property is easier to calculate, it leads to conservative,
but elegant results. The best known one is the Newton-Kantorovich theorem, related to convergence of the Newton method
started at x0.</p>
        <p>The main goal of this note is to explore how the convergence properties of the Newton method depend on the “operational”
ball size ρ. It appears that if we modify the Newton method itself, then we can achieve considerable and provable
improvement over known results.</p>
        <sec id="sec-1-1-1">
          <title>Newton-Kantorovich and Mysovskikh Theorems</title>
          <p>One of the most practical generic algorithms for solving equations is the Newton (Newton-Raphson) method, which uses
the idea of operator linearization. Let us write it in a slightly more general form than usual, e.g. as the constrained step of
[Hager, 1993]:</p>
          <p>zk = arg min ∥z∥ subject to P ′(xk)z = P (xk),  xk+1 = xk − zk. (5)</p>
          <p>Vector zk is also called the Newton direction. This scheme embraces both standard and under-determined systems.
It has an explicit expression in the n = m case with non-degenerate derivative, zk = (P ′(xk))−1P (xk). Also, in the m ≠ n
Euclidean case it can be expressed via the Moore-Penrose pseudo-inverse as zk = P ′(xk)†P (xk), by [Ben-Israel, 1966].</p>
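          <p>For illustration, a minimal Python sketch of one step of (5) under Euclidean norms (not code from the paper): numpy's lstsq returns the minimum-norm solution of the under-determined linear constraint, i.e. exactly the pseudo-inverse step above. The callables P and dP are hypothetical user-supplied functions.</p>
          <preformat>
# Sketch of one pure Newton step (5) under Euclidean norms: np.linalg.lstsq
# returns the minimum-norm solution z_k of P'(x_k) z = P(x_k),
# which equals the pseudo-inverse step z_k = P'(x_k)^+ P(x_k).
import numpy as np

def newton_step(P, dP, x):
    """One pure Newton step x_{k+1} = x_k - z_k.

    P  : callable, residual vector of shape (m,)
    dP : callable, Jacobian P'(x) of shape (m, n), m &lt;= n
    """
    z, *_ = np.linalg.lstsq(dP(x), P(x), rcond=None)  # minimal-norm z_k
    return x - z
</preformat>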
          <p>It is known that if the Newton method converges, then its convergence rate is quadratic: ∥P (xk+1)∥ ≤ c1∥P (xk)∥².
The famous Newton-Kantorovich theorem imposes semi-local conditions at one point x0, ensuring that Newton
method (5) converges [Kantorovich et al, 1982]. The theorem initially assumes existence of the inverse operator
(P ′(x))−1 and a bounded second derivative; these can be relaxed to condition (3) and Lipschitz condition (2). The following result is a slight modification of
[Robinson, 1972, Theorem 2].</p>
          <p>Proposition 1. Assume (2) and (3) hold. If ∥P (x0)∥ ≤ s and</p>
          <p>h = (L/μ0²) s &lt; 1/2,</p>
          <p>then Newton method (5) converges to a solution of (1).</p>
          <p>A negligible difference is that Robinson used a more general setup, with the norm of the right inverse operator
∥(P ′(x0))right−1∥ as a constant. It appears that this norm is equal to 1/μ0. The review [Galántai, 2000] includes
other means for the solution of under-determined equations, including the generalized inverse, outer inverses of P ′, etc.
The Proposition's statement is in the same form as the Newton-Kantorovich theorem, with the following changes: the condition
on the inverse of the derivative operator is replaced with condition (3), and the condition on the second derivative is replaced
by (2). Kantorovich demonstrated that the theorem is unimprovable without further assumptions on the operator
P at points other than x0. There are multiple results on how exactly the method converges, cf. the historical review
[Yamamoto, 2000].</p>
          <p>Constant h can be improved if we assume strong non-degeneracy not only at x0, but on the whole ball Bρ, i.e.
having assumed (4) instead of (3). It is described by another famous theorem by [Mysovskikh, 1949]
<xref ref-type="bibr" rid="ref5">(Newton-Mysovskikh theorem, see also [Kantorovich et al, 1982])</xref>, and updated by [Polyak, 1964] to Banach and non-exact
cases. Applied to Newton method (5), an excerpt from [Polyak, 1964, Corollary 1] takes the form:</p>
          <p>Proposition 2. Assume (2) and (4) hold. Let ∥P (x0)∥ ≤ s. Then if</p>
          <p>h = (L/μ²) s &lt; 2 and r1 = (2μ/L) H0(h/2) &lt; ρ,</p>
          <p>then Newton method (5) converges to a solution x∗, and</p>
          <p>∥x∗ − xk∥ ≤ (2μ/L) Hk(h/2).</p>
          <p>Here we used the sum-of-double-exponentials function Hk(γ) = ∑ℓ=k∞ γ^(2^ℓ), defined for γ ∈ [0, 1). Below we also use
the inverse of the first function in the series: ∆ : [0, ∞) → [0, 1), s.t. ∆(H0(γ)) ≡ γ, γ ∈ [0, 1). All these functions
are monotonically increasing and are easily computed with needed accuracy.</p>
          <p>In the paper we study the convergence radius of the Newton method, i.e. conditions of type ∥P (x0)∥ ≤ s, maximizing
the image radius s. This also determines a convergence ball in image space.</p>
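          <p>A small numerical sketch (an assumed helper, not code from the paper): Hk(γ) summed directly until the terms underflow, and ∆ recovered by bisection, using the monotonicity noted above.</p>
          <preformat>
# H_k(gamma) = sum over l >= k of gamma**(2**l), for 0 &lt;= gamma &lt; 1,
# and its inverse Delta with Delta(H_0(gamma)) = gamma, found by bisection.

def H(gamma, k=0, tol=1e-16):
    total, l = 0.0, k
    while True:
        term = gamma ** (2 ** l)
        total += term
        if term &lt; tol:
            return total
        l += 1

def Delta(h, tol=1e-12):
    lo, hi = 0.0, 1.0 - 1e-9   # H_0 is increasing on [0, 1)
    while hi - lo &gt; tol:
        mid = 0.5 * (lo + hi)
        if H(mid) &lt; h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
</preformat>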
          <p>This is very useful in the study of equation variability. Consider a function g : Rn → Rm and a known solution
xa : g(xa) = 0. The problem of interest is to study the range of right-hand sides y of the equation</p>
          <p>g(x) = y, (6)</p>
          <p>such that the equation still has a solution. Another problem is to estimate the size of a small ball image {g(x) : x ∈ Bε}.
This problem is trivially converted to (1) by substitution and a shift of coordinates: P (x) ≡ g(x) − y, x0 ≡ xa,
with ∥P (x0)∥ = ∥y∥. If equation (1) has a solution (proven, say, by starting the Newton method from x0!) for all
∥P (x0)∥ ≤ s, then equation (6) has a solution for any y : ∥y∥ ≤ s. Note that g(·) has exactly the same derivative
as P (·).</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>In order to estimate convergence radius, conditions of Proposition 1 can be explicitly inverted.</title>
        <p>Corollary 1. Under conditions (2), (3), if
(6)
(7)
then Newton method (5) converges to a solution of (1).</p>
        <p>In both cases the theoretical applicability of the pure (non-damped) Newton method is limited by a constant (the maximal
convergence radius is μ0²/(2L) or 2μ²/L), independently of the operational ball radius ρ. In the main Section 2 we propose an
algorithm which increases the convergence ball under the same conditions as Proposition 2.</p>
        <sec id="sec-1-2-1">
          <title>Damped Newton Method</title>
          <p>Another very popular and practical choice is to make non-unit step-size in Newton direction, introducing damping
parameter k ≤ 1.</p>
          <p>The Newton method is tightly connected with the equivalent minimization problem ∥P (x)∥ → min,
and the Newton direction is a descent direction for this functional. Thus for appropriately small step-sizes γk one has
∥P (xk − γkzk)∥ ≤ ∥P (xk)∥, and {xk} is a relaxation sequence for the functional. There are many strategies for
the step-size γ, e.g. the well-known Armijo backtracking rule γi = γ0 · 2^(−i), shrinkage γi = γ0 · c2^i, 0 &lt; c2 &lt; 1, or
direct line-search [Ortega &amp; Rheinboldt, 2000]. Typically the damping step-size search terminates as soon as the following
condition is satisfied for some constant c:</p>
          <p>∥P (xk − γkzk)∥ ≤ (1 − cγk)∥P (xk)∥.</p>
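          <p>A hedged sketch of such a damped iteration with the Armijo-type rule γi = γ0 · 2^(−i) and the termination test above (the constant c and the floor on γ are illustrative choices, not values from the paper):</p>
          <preformat>
# Damped Newton method with Armijo-style backtracking on the step-size:
# accept gamma once ||P(x - gamma z)|| &lt;= (1 - c gamma) ||P(x)|| holds.
import numpy as np

def damped_newton(P, dP, x, c=0.1, gamma0=1.0, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        r = P(x)
        if np.linalg.norm(r) &lt;= tol:
            break
        z, *_ = np.linalg.lstsq(dP(x), r, rcond=None)  # Newton direction
        gamma = gamma0
        while (np.linalg.norm(P(x - gamma * z))
               &gt; (1 - c * gamma) * np.linalg.norm(r) and gamma &gt; 1e-12):
            gamma *= 0.5                               # gamma_i = gamma_0 2^-i
        x = x - gamma * z
    return x
</preformat>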
          <p>The damped Newton method may be globally convergent, but its convergence rate is typically linear at first,
∥P (xk+1)∥ ≤ c2∥P (xk)∥, eventually switching to the quadratic convergence rate. This was first demonstrated
by [Pshenichnyi, 1970]. Unfortunately, explicit conditions for the convergence radius and convergence rate were
not studied. We also mention the recent paper [Nesterov, 2007], where appropriate damping is paired with a modified
Newton direction, obtained from an auxiliary regularized problem.</p>
          <p>In this note we provide a simple and “natural” strategy for the step-size choice. We then prove both
the quadratic convergence rate of the resulting algorithm and a larger convergence radius. Our basic method uses
specific constants of the problem, which may be unknown a-priori. To overcome this, we propose a modification
of the algorithm which is adaptive to these constants. We demonstrate both methods on systems of quadratic
equations, allowing simpler statements. It turned out that the same strategy was proposed by [Deuflhard, 2004],
resulting in interesting “global” results, but the convergence radius was still unexplored.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Choosing Step-Size for Newton Method</title>
      <p>We consider vector equation (1) with m ≤ n, with initial point x0 and (damped) Newton method (7). The key
observation is the following: under assumptions (2) and (4) the norm of the Newton direction vector is limited as
∥zk∥ ≤ (1/μ)∥P (xk)∥ on Bρ. Then the target functional ∥P (x)∥, reflecting the equation residual, has the upper estimate</p>
      <p>∥P (xk+1)∥ ≤ (|1 − γ| + (γ²L/(2μ²)) ∥P (xk)∥) ∥P (xk)∥,</p>
      <p>obtained by linearization due to differentiability on Bρ. The optimal choice minimizing this estimate is</p>
      <p>γk = min {1, μ²/(L∥P (xk)∥)}. (8)</p>
      <p>It is one of two possible choices; the other uses only the Lipschitz constant L in the step-size calculation [Polyak &amp; Tremba].</p>
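      <p>A minimal sketch of the resulting method, assuming the constants μ and L (or their estimates) are available; the step-size (8) is computed from the current residual norm:</p>
      <preformat>
# Newton method (7) with the proposed step-size rule (8):
# gamma_k = min(1, mu**2 / (L * ||P(x_k)||)).
import numpy as np

def newton_mixed_step(P, dP, x, mu, L, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        r = P(x)
        pk = np.linalg.norm(r)
        if pk &lt;= tol:
            break
        z, *_ = np.linalg.lstsq(dP(x), r, rcond=None)  # minimal-norm direction
        gamma = min(1.0, mu**2 / (L * pk))             # rule (8)
        x = x - gamma * z
    return x
</preformat>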
      <p>The step-size depends on the value ∥P (xk)∥, and if it is small enough, the method becomes pure Newton method (5).</p>
      <p>Theorem. Let (2) and (4) hold. If ∥P (x0)∥ is within the convergence radius expressed by condition (9),
then Newton algorithm (7) with step-size (8) converges to a solution x∗ of (1). Also</p>
      <p>∥P (xk)∥ ≤ { ∥P (x0)∥ − (μ²/(2L)) k,  k &lt; kmax;  (2μ²/L) · 2^(−2^(k−kmax)),  k ≥ kmax,</p>
      <p>∥x∗ − xk∥ ≤ { (μ/L) (kmax − k + 2H0(1/2)),  k &lt; kmax;  (2μ/L) Hk−kmax(1/2),  k ≥ kmax, (10)</p>
      <p>where</p>
      <p>kmax = max { 0, ⌈(2L/μ²) ∥P (x0)∥ − 2⌉ }. (11)</p>
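      <p>The bounds are straightforward to evaluate numerically; a small sketch (for given constants μ, L and initial residual p0, following the reconstruction above) of kmax from (11) and the residual estimate:</p>
      <preformat>
# Iteration-count bound (11) and the residual estimate for given mu, L, p0.
import math

def k_max(p0, mu, L):
    return max(0, math.ceil(2 * L * p0 / mu**2 - 2))

def residual_bound(k, p0, mu, L):
    km = k_max(p0, mu, L)
    if k &lt; km:
        return p0 - mu**2 / (2 * L) * k                # linear phase
    return 2 * mu**2 / L * 2.0 ** -(2 ** (k - km))     # quadratic phase
</preformat>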
      <p>This theorem's statement is twofold. First, the maximal number of non-unit step lengths is given explicitly. It
allows predicting when the linear convergence rate switches to the quadratic one. Of course, the total convergence rate is still
quadratic. Second, the last part of (9) states a potentially unlimited (or even global) convergence radius.</p>
      <p>Sketch of proof. Step length (8) results in a non-increasing sequence of ∥P (xk)∥ (actually, monotonically decreasing
whenever ∥P (xk)∥ &gt; 0). In the first phase γk &lt; 1, and ∥P (xk+1)∥ ≤ ∥P (xk)∥ − μ²/(2L), until the threshold ∥P (xk)∥ ≤ μ²/L is
met. This limits the number of steps as in (11). In the second phase, γk ≡ 1, and the standard analysis of Newton steps is
valid.</p>
      <p>The crucial part of the proof is estimating the distances ∥x0 − xk∥ (and eventually ∥x0 − x∗∥), which should guarantee
that the sequence {xk} stays within the operational ball Bρ. This distance implicitly depends (through kmax) on the initial
condition, as in (10). Inverting this condition by limiting ∥x0 − x∗∥ ≤ ρ leads to the convergence radius (9).</p>
      <sec id="sec-1-8">
        <title>Main drawback of the step length choice (8) is dependence on constants L; modification of the method, which omits this requirements. or its estimates. We propose a</title>
        <p>The idea is very simple. We are to notice that both constants L; enter step-size formulae (8) in combined
fashion as 2=L, and thus may be estimated by backtracking. Let = 2=L be “true” constant. Assume that
we have some initial value 0. There it can be either smaller than , or greater.</p>
        <p>If we start algorithm with 0 ≤ , then exist constants Lb ≥ L; b ≤ , such that b2=Lb = 0 and properties (2)
and (4) hold for Lb and b. Thus Theorem 2 is valid with respect to these constants.</p>
        <p>Other case of 0 ≥ is trickier. Here we check whether P (xk − kzk) sufficiently decreases or not. If it is
decreasing well enough, we accept this step-size, otherwise update 0 → 1 &lt; 0. Let’s demonstrate, how exactly
it is done.
2.1.1</p>
        <sec id="sec-1-8-1">
          <title>Algorithm (Constant-Adaptive Newton Method)</title>
          <p>As mentioned, the algorithm takes as input initial estimate 0 and scalar multiplier 0 &lt; q &lt; 1. The algorithm
is initialized with counter k = 0 and number p0 = ∥P (x0)∥.</p>
          <p>zk = arg</p>
          <p>min
P ′(xk)z=P (xk)</p>
          <p>∥z∥;
k = min {1; k };</p>
          <p>pk
pk+1 = ∥P (xk − kzk)∥:
k &lt; 1 and pk+1 &lt; pk − 2k ;
k = 1 and pk+1 &lt;</p>
          <p>1
2 k</p>
          <p>pk;
xk+1 = xk − kzk;
holds, then go to Step 5. Otherwise
4. apply update rule k ← q k and return to Step 2 without increasing counter.</p>
          <p>set k+1 =</p>
          <p>k, increase counter k ← k + 1, and go to Step 1.</p>
          <p>Note that in contrast to known algorithm, we use backtracking search over parameter of the method, but not
over step length.</p>
          <p>Meanwhile the algorithm behaves nice in practice, its convergence analysis becomes complicated, and Theorem
is not applicable directly.
2.2</p>
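          <p>A hedged Python sketch of the whole procedure, following Steps 1-5 as reconstructed above (the floor on λ is an illustrative safeguard, not part of the algorithm):</p>
          <preformat>
# Constant-adaptive Newton method: backtracking on the combined constant
# lambda ~ mu**2/L (Steps 1-5 above), not on the step length itself.
import numpy as np

def adaptive_newton(P, dP, x, lam=1.0, q=0.5, tol=1e-10, max_iter=100):
    pk = np.linalg.norm(P(x))
    for _ in range(max_iter):
        if pk &lt;= tol:
            break
        z, *_ = np.linalg.lstsq(dP(x), P(x), rcond=None)      # Step 1
        while True:
            gamma = min(1.0, lam / pk)                        # Step 2
            p_next = np.linalg.norm(P(x - gamma * z))
            if gamma &lt; 1.0 and p_next &lt; pk - lam / 2:         # Step 3 (damped)
                break
            if gamma == 1.0 and p_next &lt; pk**2 / (2 * lam):   # Step 3 (pure)
                break
            lam *= q                                          # Step 4
            if lam &lt; 1e-16:                                   # safeguard only
                break
        x = x - gamma * z                                     # Step 5
        pk = p_next
    return x
</preformat>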
        </sec>
        <sec id="sec-1-8-2">
          <title>Norm Variability</title>
        <p>There is a specific property of under-determined systems of equations: the solution of the linear constraint equation</p>
        <p>P ′(xk)z = P (xk)</p>
        <p>in (5) (and (7)) is non-unique, in contrast to the standard (n = m) system of equations. Thus the choice of norm in
the pre-image space affects the method's trajectory {xk}. It appears that a proper norm leads to very interesting results,
e.g. solution sparsity. On the other hand, the problem's constants L, μ also depend on the norms and rarely
can be calculated explicitly or estimated. A discussion of this topic is worth another article.</p>
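        <p>As an illustration of the norm choice (a sketch assuming SciPy, not from the paper): with the ℓ1 norm in the pre-image space, the constrained step becomes a linear program via the standard split z = u − v, and its solutions tend to be sparse.</p>
        <preformat>
# Minimal-l1-norm Newton direction: min ||z||_1 s.t. A z = b,
# as an LP with z = u - v, u, v >= 0 (solutions tend to be sparse).
import numpy as np
from scipy.optimize import linprog

def l1_newton_direction(A, b):
    m, n = A.shape
    c = np.ones(2 * n)                       # minimize sum(u) + sum(v)
    A_eq = np.hstack([A, -A])                # A (u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v
</preformat>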
        <p>The scalar problem (1) with P : Rn → R is not an optimization problem, but can be expressed as the non-differentiable
optimization problem |P (x)| → min. It is a non-convex problem in most cases. The Newton direction in the Euclidean
setup coincides with the gradient or anti-gradient of |P (·)|, and the method resembles a variant of gradient descent
(or ascent) for the function, which is differentiable everywhere except the solution set P (x) = 0. Step-size (8)
results in a quadratic convergence rate for the problem under assumptions (2), (4). If the operational ball Bρ is the
whole space, i.e. ρ = ∞, then the convergence is global.</p>
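        <p>A hypothetical scalar example under Euclidean norms: for m = 1 the minimal-norm direction reduces to zk = P (x)∇P (x)/∥∇P (x)∥², i.e. a signed gradient step for |P (·)|.</p>
        <preformat>
# Scalar under-determined example: one equation, two unknowns.
import numpy as np

P = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0])   # circle of radius 2
dP = lambda x: np.array([[2 * x[0], 2 * x[1]]])     # 1 x 2 Jacobian

x = np.array([3.0, 1.0])
for _ in range(10):
    g = dP(x)[0]
    z = P(x)[0] * g / (g @ g)    # equals np.linalg.pinv(dP(x)) @ P(x)
    x = x - z                    # pure Newton step (5)
print(x, P(x))                   # x approaches the solution circle
</preformat>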
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Conclusions</title>
      <p>We studied the convergence radius (the range of P (x0)) for Newton method applicability. For the specifically chosen
step-length strategy (8) we extended the convergence radius. The strategy preserves the quadratic convergence rate of the
method. It is suitable both for standard and under-determined systems of equations (or Banach-space equations). To
employ this strategy, a-priori knowledge of the problem constants is needed; we therefore proposed an adaptive variant of the
algorithm.</p>
      <sec id="sec-2-1">
        <title>Acknowledgments</title>
        <sec id="sec-2-1-1">
          <title>This work was supported by Russian Science Foundation, Project 16-11-10015. [Ben-Israel, 1966] Ben-Israel, A. (1966). A Newton-Raphson method for the solution of systems of equations.</title>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [
          <string-name>
            <surname>Ben-Israel</surname>
          </string-name>
          ,
          <year>1966</year>
          ]
          <article-title>Ben-</article-title>
          <string-name>
            <surname>Israel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>1966</year>
          ).
          <article-title>A Newton-Raphson method for the solution of systems of equations</article-title>
          .
          <source>Journal of Mathematical Analysis and Applications</source>
          ,
          <volume>15</volume>
          ,
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Deuflhard</source>
          , 2004]
          <string-name>
            <surname>Deuflhard P.</surname>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Newton Methods for Nonlinear Problems</article-title>
          . Affine Invariance and
          <string-name>
            <given-names>Adaptive</given-names>
            <surname>Algorithms</surname>
          </string-name>
          . Berlin: Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Gal´antai, 2000]
          <article-title>Gal´antai</article-title>
          <string-name>
            <surname>A.</surname>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>The theory of Newtons method</article-title>
          .
          <source>Journal of Computational and Applied Mathematics</source>
          ,
          <volume>124</volume>
          ,
          <fpage>25</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Hager</source>
          , 1993] Hager,
          <string-name>
            <surname>W. W.</surname>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>Analysis and Implementation of a Dual Algorithm for Constrained Optimization</article-title>
          .
          <source>Journal of Optimization Theory and Applications</source>
          ,
          <volume>79</volume>
          (
          <issue>3</issue>
          ),
          <fpage>427</fpage>
          -
          <lpage>462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Kantorovich et al,
          <year>1982</year>
          ] Kantorovich,
          <string-name>
            <given-names>L.V.</given-names>
            &amp;
            <surname>Akilov</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.P.</surname>
          </string-name>
          (
          <year>1982</year>
          ).
          <source>Functional Analysis. 2nd ed</source>
          . Oxford: Pergamon Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Mysovskikh</source>
          , 1949] Mysovskikh,
          <string-name>
            <surname>I.</surname>
          </string-name>
          (
          <year>1949</year>
          ).
          <article-title>On convergence of Newton's method</article-title>
          .
          <source>Trudy Mat. Inst. Steklov</source>
          ,
          <volume>28</volume>
          ,
          <fpage>145</fpage>
          -
          <lpage>147</lpage>
          . (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Nesterov</source>
          , 2007] Nesterov,
          <string-name>
            <surname>Yu.</surname>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Modified Gauss-Newton scheme with worst case guarantees for global performance</article-title>
          .
          <source>Optimization Methods and Software</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          ),
          <fpage>469</fpage>
          -
          <lpage>483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>[Ortega &amp; Rheinboldt</source>
          , 2000] Ortega,
          <string-name>
            <given-names>J. M.</given-names>
            &amp;
            <surname>Rheinboldt</surname>
          </string-name>
          ,
          <string-name>
            <surname>W. C.</surname>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Iterative Solution of Nonlinear Equations in Several Variables</article-title>
          . San Diego, California: Academic Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Polyak</source>
          , 1964] Polyak,
          <string-name>
            <surname>B. T.</surname>
          </string-name>
          (
          <year>1964</year>
          ).
          <article-title>Gradient methods for solving equations and inequalities</article-title>
          .
          <source>USSR Computational Mathematics and Mathematical Phys</source>
          .
          <volume>4</volume>
          (
          <issue>6</issue>
          ),
          <fpage>17</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>