<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Boris T. Polyak</string-name>
          <email>boris@ipu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey A. Tremba</string-name>
          <email>atremba@ipu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>V.A. Trapeznikov Institute of Control Sciences RAS</institution>
          ,
          <addr-line>Profsoyuznaya 65, 117997 Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>827</fpage>
      <lpage>830</lpage>
      <abstract>
        <p>The Newton method is a well-known tool for solving finite-dimensional systems of equations. The pure Newton-Raphson method has at least a quadratic convergence rate, but its convergence radius is often limited. The damped version of the Newton method uses a smaller step-size in the same direction, with a larger convergence ball but a linear convergence rate. We propose a mixed step-size strategy, incorporating both the quadratic convergence rate and a wide (in some cases global) convergence radius. The method can be used for under-determined equations and Banach-space equations. We also present a modification of the proposed method requiring no a-priori knowledge of problem constants. The method may also be used for solving a class of non-convex optimization problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>We aim to explore solvability conditions of generic equation</title>
        <p>P (x) = 0;
with operator P : Rn → Rm; m ≤ n. Most results of this note are valid for Banach spaces as well, but we stick
to the finite-dimensional setting for simplicity. Through the paper we assume existence of derivative P ′ of the
operator in a ball Bρ = {x : ∥x − x0∥ ≤ }. We also assume that the derivative is Lipschitz with constant L in
the ball:</p>
        <p>∥P ′(xa) − P ′(xb)∥ ≤ L∥xa − xb∥; ∀xa; xb ∈ Bρ:
Solvability of the equation (1) can be expressed in terms of conditions on a given point x0 : P (x0) ̸= 0 solely, or
in a set around condition point, e.g. Bρ.</p>
        <p>These additional conditions are related to the non-degeneracy of P , which may be expressed through its derivative
P ′. There are two types of conditions: point-wise</p>
        <p>∥P ′(x0)T h∥∗ ≥ μ0∥h∥∗,  μ0 &gt; 0,  ∀h ∈ Rm, (3)</p>
        <p>and set-wise</p>
        <p>∥P ′(x)T h∥∗ ≥ μ∥h∥∗,  μ &gt; 0,  ∀h ∈ Rm, ∀x ∈ Bρ. (4)</p>
        <p>Both conditions are generalizations of the notion “existence of an inverse matrix”, and are valid for under-determined
(also Banach) systems of equations. In turn, this notion can be generalized to non-differentiable and
infinite-dimensional settings and is known as “metric regularity”. Thus the algorithms discussed in this note are applicable
to Banach spaces as well.</p>
        <p>For example, let the norms in the image and pre-image spaces be Euclidean. Then the Lipschitz constant of the derivative
is coupled with the spectral operator norm, and (3) simply states that the least singular value of the matrix P ′(x0)
is positive: μ0 = σm(P ′(x0)) &gt; 0. While the point-wise property is easier to calculate, it leads to conservative,
but elegant results. The best known one is the Newton-Kantorovich theorem, related to convergence of the Newton method
started at x0.</p>
        <p>The main goal of this note is to explore how the convergence properties of the Newton method depend on the “operational”
ball size ρ. It appears that if we modify the Newton method itself, then we can achieve considerable and provable
improvement over known results.</p>
        <sec id="sec-1-1-1">
          <title>Newton-Kantorovich and Mysovskikh Theorems</title>
          <p>One of the most practical generic algorithms for solving equations is the Newton (Newton-Raphson) method, which uses
the idea of operator linearization. Let us write it in a slightly more general form than usual, e.g. as the constrained step of
[Hager, 1993]:</p>
          <p>zk = arg min ∥z∥ subject to P ′(xk)z = P (xk),  xk+1 = xk − zk. (5)</p>
          <p>Vector zk is also called the Newton direction. This scheme embraces both standard and under-determined systems.
It has an explicit expression in the n = m case with non-degenerate derivative, zk = (P ′(xk))−1P (xk). Also, in the m ≠ n
Euclidean case it can be expressed via the Moore-Penrose pseudo-inverse as zk = P ′(xk)†P (xk), by [Ben-Israel, 1966].</p>
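          <p>For illustration, a minimal Python sketch of one step of (5) under Euclidean norms (not code from the paper): numpy's lstsq returns the minimum-norm solution of the under-determined linear constraint, i.e. exactly the pseudo-inverse step above. The callables P and dP are hypothetical user-supplied functions.</p>
          <preformat>
# Sketch of one pure Newton step (5) under Euclidean norms: np.linalg.lstsq
# returns the minimum-norm solution z_k of P'(x_k) z = P(x_k),
# which equals the pseudo-inverse step z_k = P'(x_k)^+ P(x_k).
import numpy as np

def newton_step(P, dP, x):
    """One pure Newton step x_{k+1} = x_k - z_k.

    P  : callable, residual vector of shape (m,)
    dP : callable, Jacobian P'(x) of shape (m, n), m &lt;= n
    """
    z, *_ = np.linalg.lstsq(dP(x), P(x), rcond=None)  # minimal-norm z_k
    return x - z
</preformat>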
          <p>It is known that if the Newton method converges, then its convergence rate is quadratic: ∥P (xk+1)∥ ≤ c1∥P (xk)∥².
The famous Newton-Kantorovich theorem imposes semi-local conditions at one point x0, ensuring that Newton
method (5) converges [Kantorovich et al, 1982]. The theorem initially assumes existence of the inverse operator
(P ′(x))−1 and a bounded second derivative; these can be relaxed to condition (3) and Lipschitz condition (2). The following result is a slight modification of
[Robinson, 1972, Theorem 2].</p>
          <p>Proposition 1. Assume (2) and (3) hold. If ∥P (x0)∥ ≤ s and</p>
          <p>h = (L/μ0²) s &lt; 1/2,</p>
          <p>then Newton method (5) converges to a solution of (1).</p>
          <p>A negligible difference is that Robinson used a more general setup, with the norm of the right inverse operator
∥(P ′(x0))right−1∥ as a constant. It appears that this norm is equal to 1/μ0. The review [Galántai, 2000] includes
other means for the solution of under-determined equations, including the generalized inverse, outer inverses of P ′, etc.
The Proposition's statement is in the same form as the Newton-Kantorovich theorem, with the following changes: the condition
on the inverse of the derivative operator is replaced with condition (3), and the condition on the second derivative is replaced
by (2). Kantorovich demonstrated that the theorem is unimprovable without further assumptions on the operator
P at points other than x0. There are multiple results on how exactly the method converges, cf. the historical review
[Yamamoto, 2000].</p>
          <p>Constant h can be improved if we assume strong non-degeneracy not only at x0, but on the whole ball Bρ, i.e.
having assumed (4) instead of (3). It is described by another famous theorem by [Mysovskikh, 1949]
<xref ref-type="bibr" rid="ref5">(Newton-Mysovskikh theorem, see also [Kantorovich et al, 1982])</xref>, and updated by [Polyak, 1964] to Banach and non-exact
cases. Applied to Newton method (5), an excerpt from [Polyak, 1964, Corollary 1] takes the form:</p>
          <p>Proposition 2. Assume (2) and (4) hold. Let ∥P (x0)∥ ≤ s. Then if</p>
          <p>h = (L/μ²) s &lt; 2 and r1 = (2μ/L) H0(h/2) &lt; ρ,</p>
          <p>then Newton method (5) converges to a solution x∗, and</p>
          <p>∥x∗ − xk∥ ≤ (2μ/L) Hk(h/2).</p>
          <p>Here we used the sum-of-double-exponentials function Hk(γ) = ∑ℓ=k∞ γ^(2^ℓ), defined for γ ∈ [0, 1). Below we also use
the inverse of the first function in the series: ∆ : [0, ∞) → [0, 1), s.t. ∆(H0(γ)) ≡ γ, γ ∈ [0, 1). All these functions
are monotonically increasing and are easily computed with needed accuracy.</p>
          <p>In the paper we study the convergence radius of the Newton method, i.e. conditions of type ∥P (x0)∥ ≤ s, maximizing
the image radius s. This also determines a convergence ball in image space.</p>
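          <p>A small numerical sketch (an assumed helper, not code from the paper): Hk(γ) summed directly until the terms underflow, and ∆ recovered by bisection, using the monotonicity noted above.</p>
          <preformat>
# H_k(gamma) = sum over l >= k of gamma**(2**l), for 0 &lt;= gamma &lt; 1,
# and its inverse Delta with Delta(H_0(gamma)) = gamma, found by bisection.

def H(gamma, k=0, tol=1e-16):
    total, l = 0.0, k
    while True:
        term = gamma ** (2 ** l)
        total += term
        if term &lt; tol:
            return total
        l += 1

def Delta(h, tol=1e-12):
    lo, hi = 0.0, 1.0 - 1e-9   # H_0 is increasing on [0, 1)
    while hi - lo &gt; tol:
        mid = 0.5 * (lo + hi)
        if H(mid) &lt; h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
</preformat>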
          <p>This is very useful in the study of equation variability. Consider a function g : Rn → Rm and a known solution
xa : g(xa) = 0. The problem of interest is to study the range of right-hand sides y of the equation</p>
          <p>g(x) = y, (6)</p>
          <p>such that the equation still has a solution. Another problem is to estimate the size of a small ball image {g(x) : x ∈ Bε}.
This problem is trivially converted to (1) by substitution and a shift of coordinates: P (x) ≡ g(x) − y, x0 ≡ xa,
with ∥P (x0)∥ = ∥y∥. If equation (1) has a solution (proven, say, by starting the Newton method from x0!) for all
∥P (x0)∥ ≤ s, then equation (6) has a solution for any y : ∥y∥ ≤ s. Note that g(·) has exactly the same derivative
as P (·).</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>In order to estimate convergence radius, conditions of Proposition 1 can be explicitly inverted.</title>
        <p>Corollary 1. Under conditions (2), (3), if
(6)
(7)
then Newton method (5) converges to a solution of (1).</p>
        <p>In both cases the theoretical applicability of the pure (non-damped) Newton method is limited by a constant (the maximal
convergence radius is μ0²/(2L) or 2μ²/L), independently of the operational ball radius ρ. In the main Section 2 we propose an
algorithm which increases the convergence ball under the same conditions as Proposition 2.</p>
        <sec id="sec-1-2-1">
          <title>Damped Newton Method</title>
          <p>Another very popular and practical choice is to make non-unit step-size in Newton direction, introducing damping
parameter k ≤ 1.</p>
          <p>The Newton method is tightly connected with the equivalent minimization problem ∥P (x)∥ → min,
and the Newton direction is a descent direction for this functional. Thus for appropriately small step-sizes γk one has
∥P (xk − γkzk)∥ ≤ ∥P (xk)∥, and {xk} is a relaxation sequence for the functional. There are many strategies for
the step-size γ, e.g. the well-known Armijo backtracking rule γi = γ0 · 2^(−i), shrinkage γi = γ0 · c2^i, 0 &lt; c2 &lt; 1, or
direct line-search [Ortega &amp; Rheinboldt, 2000]. Typically the damping step-size search terminates as soon as the following
condition is satisfied for some constant c:</p>
          <p>∥P (xk − γkzk)∥ ≤ (1 − cγk)∥P (xk)∥.</p>
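          <p>A hedged sketch of such a damped iteration with the Armijo-type rule γi = γ0 · 2^(−i) and the termination test above (the constant c and the floor on γ are illustrative choices, not values from the paper):</p>
          <preformat>
# Damped Newton method with Armijo-style backtracking on the step-size:
# accept gamma once ||P(x - gamma z)|| &lt;= (1 - c gamma) ||P(x)|| holds.
import numpy as np

def damped_newton(P, dP, x, c=0.1, gamma0=1.0, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        r = P(x)
        if np.linalg.norm(r) &lt;= tol:
            break
        z, *_ = np.linalg.lstsq(dP(x), r, rcond=None)  # Newton direction
        gamma = gamma0
        while (np.linalg.norm(P(x - gamma * z))
               &gt; (1 - c * gamma) * np.linalg.norm(r) and gamma &gt; 1e-12):
            gamma *= 0.5                               # gamma_i = gamma_0 2^-i
        x = x - gamma * z
    return x
</preformat>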
          <p>The damped Newton method may be globally convergent, but its convergence rate is typically linear at first,
∥P (xk+1)∥ ≤ c2∥P (xk)∥, eventually switching to the quadratic convergence rate. This was first demonstrated
by [Pshenichnyi, 1970]. Unfortunately, explicit conditions for the convergence radius and convergence rate were
not studied. We also mention the recent paper [Nesterov, 2007], where appropriate damping is paired with a modified
Newton direction, obtained from an auxiliary regularized problem.</p>
          <p>In this note we provide a simple and “natural” strategy for the step-size choice. We then prove both
the quadratic convergence rate of the resulting algorithm and a larger convergence radius. Our basic method uses
specific constants of the problem, which may be unknown a-priori. To overcome this, we propose a modification
of the algorithm which is adaptive to these constants. We demonstrate both methods on systems of quadratic
equations, allowing simpler statements. It turned out that the same strategy was proposed by [Deuflhard, 2004],
resulting in interesting “global” results, but the convergence radius was still unexplored.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Choosing Step-Size for Newton Method</title>
      <p>We consider vector equation (1) with m ≤ n, with initial point x0 and (damped) Newton method (7). The key
observation is the following: under assumptions (2) and (4) the norm of the Newton direction vector is limited as
∥zk∥ ≤ (1/μ)∥P (xk)∥ on Bρ. Then the target functional ∥P (x)∥, reflecting the equation residual, has the upper estimate</p>
      <p>∥P (xk+1)∥ ≤ (|1 − γ| + (γ²L/(2μ²)) ∥P (xk)∥) ∥P (xk)∥,</p>
      <p>obtained by linearization due to differentiability on Bρ. The optimal choice minimizing this estimate is</p>
      <p>γk = min {1, μ²/(L∥P (xk)∥)}. (8)</p>
      <p>It is one of two possible choices; the other uses only the Lipschitz constant L in the step-size calculation [Polyak &amp; Tremba].</p>
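      <p>A minimal sketch of the resulting method, assuming the constants μ and L (or their estimates) are available; the step-size (8) is computed from the current residual norm:</p>
      <preformat>
# Newton method (7) with the proposed step-size rule (8):
# gamma_k = min(1, mu**2 / (L * ||P(x_k)||)).
import numpy as np

def newton_mixed_step(P, dP, x, mu, L, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        r = P(x)
        pk = np.linalg.norm(r)
        if pk &lt;= tol:
            break
        z, *_ = np.linalg.lstsq(dP(x), r, rcond=None)  # minimal-norm direction
        gamma = min(1.0, mu**2 / (L * pk))             # rule (8)
        x = x - gamma * z
    return x
</preformat>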
      <p>The step-size depends on the value ∥P (xk)∥, and if it is small enough, the method becomes pure Newton method (5).</p>
      <p>Theorem. Let (2) and (4) hold. If ∥P (x0)∥ is within the convergence radius expressed by condition (9),
then Newton algorithm (7) with step-size (8) converges to a solution x∗ of (1). Also</p>
      <p>∥P (xk)∥ ≤ { ∥P (x0)∥ − (μ²/(2L)) k,  k &lt; kmax;  (2μ²/L) · 2^(−2^(k−kmax)),  k ≥ kmax,</p>
      <p>∥x∗ − xk∥ ≤ { (μ/L) (kmax − k + 2H0(1/2)),  k &lt; kmax;  (2μ/L) Hk−kmax(1/2),  k ≥ kmax, (10)</p>
      <p>where</p>
      <p>kmax = max { 0, ⌈(2L/μ²) ∥P (x0)∥ − 2⌉ }. (11)</p>
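      <p>The bounds are straightforward to evaluate numerically; a small sketch (for given constants μ, L and initial residual p0, following the reconstruction above) of kmax from (11) and the residual estimate:</p>
      <preformat>
# Iteration-count bound (11) and the residual estimate for given mu, L, p0.
import math

def k_max(p0, mu, L):
    return max(0, math.ceil(2 * L * p0 / mu**2 - 2))

def residual_bound(k, p0, mu, L):
    km = k_max(p0, mu, L)
    if k &lt; km:
        return p0 - mu**2 / (2 * L) * k                # linear phase
    return 2 * mu**2 / L * 2.0 ** -(2 ** (k - km))     # quadratic phase
</preformat>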
      <p>This theorem's statement is twofold. First, the maximal number of non-unit step lengths is given explicitly. It
allows predicting when the linear convergence rate switches to the quadratic one. Of course, the total convergence rate is still
quadratic. Second, the last part of (9) states a potentially unlimited (or even global) convergence radius.</p>
      <p>Sketch of proof. Step length (8) results in a non-increasing sequence of ∥P (xk)∥ (actually, monotonically decreasing
whenever ∥P (xk)∥ &gt; 0). In the first phase γk &lt; 1, and ∥P (xk+1)∥ ≤ ∥P (xk)∥ − μ²/(2L), until the threshold ∥P (xk)∥ ≤ μ²/L is
met. This limits the number of steps as in (11). In the second phase, γk ≡ 1, and the standard analysis of Newton steps is
valid.</p>
      <p>The crucial part of the proof is estimating the distances ∥x0 − xk∥ (and eventually ∥x0 − x∗∥), which should guarantee
that the sequence {xk} stays within the operational ball Bρ. This distance implicitly depends (through kmax) on the initial
condition, as in (10). Inverting this condition by limiting ∥x0 − x∗∥ ≤ ρ leads to the convergence radius (9).</p>
      <sec id="sec-1-8">
        <title>Main drawback of the step length choice (8) is dependence on constants L; modification of the method, which omits this requirements. or its estimates. We propose a</title>
        <p>The idea is very simple. We are to notice that both constants L; enter step-size formulae (8) in combined
fashion as 2=L, and thus may be estimated by backtracking. Let = 2=L be “true” constant. Assume that
we have some initial value 0. There it can be either smaller than , or greater.</p>
        <p>If we start algorithm with 0 ≤ , then exist constants Lb ≥ L; b ≤ , such that b2=Lb = 0 and properties (2)
and (4) hold for Lb and b. Thus Theorem 2 is valid with respect to these constants.</p>
        <p>Other case of 0 ≥ is trickier. Here we check whether P (xk − kzk) sufficiently decreases or not. If it is
decreasing well enough, we accept this step-size, otherwise update 0 → 1 &lt; 0. Let’s demonstrate, how exactly
it is done.
2.1.1</p>
        <sec id="sec-1-8-1">
          <title>Algorithm (Constant-Adaptive Newton Method)</title>
          <p>As mentioned, the algorithm takes as input initial estimate 0 and scalar multiplier 0 &lt; q &lt; 1. The algorithm
is initialized with counter k = 0 and number p0 = ∥P (x0)∥.</p>
          <p>zk = arg</p>
          <p>min
P ′(xk)z=P (xk)</p>
          <p>∥z∥;
k = min {1; k };</p>
          <p>pk
pk+1 = ∥P (xk − kzk)∥:
k &lt; 1 and pk+1 &lt; pk − 2k ;
k = 1 and pk+1 &lt;</p>
          <p>1
2 k</p>
          <p>pk;
xk+1 = xk − kzk;
holds, then go to Step 5. Otherwise
4. apply update rule k ← q k and return to Step 2 without increasing counter.</p>
          <p>set k+1 =</p>
          <p>k, increase counter k ← k + 1, and go to Step 1.</p>
          <p>Note that in contrast to known algorithm, we use backtracking search over parameter of the method, but not
over step length.</p>
          <p>Meanwhile the algorithm behaves nice in practice, its convergence analysis becomes complicated, and Theorem
is not applicable directly.
2.2</p>
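          <p>A hedged Python sketch of the whole procedure, following Steps 1-5 as reconstructed above (the floor on λ is an illustrative safeguard, not part of the algorithm):</p>
          <preformat>
# Constant-adaptive Newton method: backtracking on the combined constant
# lambda ~ mu**2/L (Steps 1-5 above), not on the step length itself.
import numpy as np

def adaptive_newton(P, dP, x, lam=1.0, q=0.5, tol=1e-10, max_iter=100):
    pk = np.linalg.norm(P(x))
    for _ in range(max_iter):
        if pk &lt;= tol:
            break
        z, *_ = np.linalg.lstsq(dP(x), P(x), rcond=None)      # Step 1
        while True:
            gamma = min(1.0, lam / pk)                        # Step 2
            p_next = np.linalg.norm(P(x - gamma * z))
            if gamma &lt; 1.0 and p_next &lt; pk - lam / 2:         # Step 3 (damped)
                break
            if gamma == 1.0 and p_next &lt; pk**2 / (2 * lam):   # Step 3 (pure)
                break
            lam *= q                                          # Step 4
            if lam &lt; 1e-16:                                   # safeguard only
                break
        x = x - gamma * z                                     # Step 5
        pk = p_next
    return x
</preformat>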
        </sec>
        <sec id="sec-1-8-2">
          <title>Norm Variability</title>
        <p>There is a specific property of under-determined systems of equations: the solution of the linear constraint equation</p>
        <p>P ′(xk)z = P (xk)</p>
        <p>in (5) (and (7)) is non-unique, in contrast to the standard (n = m) system of equations. Thus the choice of norm in
the pre-image space affects the method's trajectory {xk}. It appears that a proper norm leads to very interesting results,
e.g. solution sparsity. On the other hand, the problem's constants L, μ also depend on the norms and rarely
can be calculated explicitly or estimated. A discussion of this topic is worth another article.</p>
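        <p>As an illustration of the norm choice (a sketch assuming SciPy, not from the paper): with the ℓ1 norm in the pre-image space, the constrained step becomes a linear program via the standard split z = u − v, and its solutions tend to be sparse.</p>
        <preformat>
# Minimal-l1-norm Newton direction: min ||z||_1 s.t. A z = b,
# as an LP with z = u - v, u, v >= 0 (solutions tend to be sparse).
import numpy as np
from scipy.optimize import linprog

def l1_newton_direction(A, b):
    m, n = A.shape
    c = np.ones(2 * n)                       # minimize sum(u) + sum(v)
    A_eq = np.hstack([A, -A])                # A (u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v
</preformat>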
        <p>The scalar problem (1) with P : Rn → R is not an optimization problem, but can be expressed as the non-differentiable
optimization problem |P (x)| → min. It is a non-convex problem in most cases. The Newton direction in the Euclidean
setup coincides with the gradient or anti-gradient of |P (·)|, and the method resembles a variant of gradient descent
(or ascent) for the function, which is differentiable everywhere except the solution set P (x) = 0. Step-size (8)
results in a quadratic convergence rate for the problem under assumptions (2), (4). If the operational ball Bρ is the
whole space, i.e. ρ = ∞, then the convergence is global.</p>
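        <p>A hypothetical scalar example under Euclidean norms: for m = 1 the minimal-norm direction reduces to zk = P (x)∇P (x)/∥∇P (x)∥², i.e. a signed gradient step for |P (·)|.</p>
        <preformat>
# Scalar under-determined example: one equation, two unknowns.
import numpy as np

P = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0])   # circle of radius 2
dP = lambda x: np.array([[2 * x[0], 2 * x[1]]])     # 1 x 2 Jacobian

x = np.array([3.0, 1.0])
for _ in range(10):
    g = dP(x)[0]
    z = P(x)[0] * g / (g @ g)    # equals np.linalg.pinv(dP(x)) @ P(x)
    x = x - z                    # pure Newton step (5)
print(x, P(x))                   # x approaches the solution circle
</preformat>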
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Conclusions</title>
      <p>We studied the convergence radius (the range of P (x0)) for Newton method applicability. For the specifically chosen
step-length strategy (8) we extended the convergence radius. The strategy preserves the quadratic convergence rate of the
method. It is suitable both for standard and under-determined systems of equations (or Banach-space equations). To
employ this strategy, a-priori knowledge of the problem constants is needed; we therefore proposed an adaptive variant of the
algorithm.</p>
      <sec id="sec-2-1">
        <title>Acknowledgments</title>
        <sec id="sec-2-1-1">
          <title>This work was supported by Russian Science Foundation, Project 16-11-10015. [Ben-Israel, 1966] Ben-Israel, A. (1966). A Newton-Raphson method for the solution of systems of equations.</title>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [
          <string-name>
            <surname>Ben-Israel</surname>
          </string-name>
          ,
          <year>1966</year>
          ]
          <article-title>Ben-</article-title>
          <string-name>
            <surname>Israel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>1966</year>
          ).
          <article-title>A Newton-Raphson method for the solution of systems of equations</article-title>
          .
          <source>Journal of Mathematical Analysis and Applications</source>
          ,
          <volume>15</volume>
          ,
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Deuflhard</source>
          , 2004]
          <string-name>
            <surname>Deuflhard P.</surname>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Newton Methods for Nonlinear Problems</article-title>
          . Affine Invariance and
          <string-name>
            <given-names>Adaptive</given-names>
            <surname>Algorithms</surname>
          </string-name>
          . Berlin: Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Gal´antai, 2000]
          <article-title>Gal´antai</article-title>
          <string-name>
            <surname>A.</surname>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>The theory of Newtons method</article-title>
          .
          <source>Journal of Computational and Applied Mathematics</source>
          ,
          <volume>124</volume>
          ,
          <fpage>25</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Hager</source>
          , 1993] Hager,
          <string-name>
            <surname>W. W.</surname>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>Analysis and Implementation of a Dual Algorithm for Constrained Optimization</article-title>
          .
          <source>Journal of Optimization Theory and Applications</source>
          ,
          <volume>79</volume>
          (
          <issue>3</issue>
          ),
          <fpage>427</fpage>
          -
          <lpage>462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Kantorovich et al,
          <year>1982</year>
          ] Kantorovich,
          <string-name>
            <given-names>L.V.</given-names>
            &amp;
            <surname>Akilov</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.P.</surname>
          </string-name>
          (
          <year>1982</year>
          ).
          <source>Functional Analysis. 2nd ed</source>
          . Oxford: Pergamon Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Mysovskikh</source>
          , 1949] Mysovskikh,
          <string-name>
            <surname>I.</surname>
          </string-name>
          (
          <year>1949</year>
          ).
          <article-title>On convergence of Newton's method</article-title>
          .
          <source>Trudy Mat. Inst. Steklov</source>
          ,
          <volume>28</volume>
          ,
          <fpage>145</fpage>
          -
          <lpage>147</lpage>
          . (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Nesterov</source>
          , 2007] Nesterov,
          <string-name>
            <surname>Yu.</surname>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Modified Gauss-Newton scheme with worst case guarantees for global performance</article-title>
          .
          <source>Optimization Methods and Software</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          ),
          <fpage>469</fpage>
          -
          <lpage>483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>[Ortega &amp; Rheinboldt</source>
          , 2000] Ortega,
          <string-name>
            <given-names>J. M.</given-names>
            &amp;
            <surname>Rheinboldt</surname>
          </string-name>
          ,
          <string-name>
            <surname>W. C.</surname>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Iterative Solution of Nonlinear Equations in Several Variables</article-title>
          . San Diego, California: Academic Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Polyak</source>
          , 1964] Polyak,
          <string-name>
            <surname>B. T.</surname>
          </string-name>
          (
          <year>1964</year>
          ).
          <article-title>Gradient methods for solving equations and inequalities</article-title>
          .
          <source>USSR Computational Mathematics and Mathematical Phys</source>
          .
          <volume>4</volume>
          (
          <issue>6</issue>
          ),
          <fpage>17</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>