Quasi-Newton method




Quasi-Newton methods are methods used to find either zeroes or local maxima and minima of functions, as an alternative to Newton's method. They can be used if the Jacobian or Hessian is unavailable or is too expensive to compute at every iteration. The "full" Newton's method requires the Jacobian in order to search for zeroes, or the Hessian in order to find extrema.





Search for zeros: root finding


Newton's method to find zeroes of a function $g$ of multiple variables is given by $x_{n+1} = x_n - [J_g(x_n)]^{-1} g(x_n)$, where $[J_g(x_n)]^{-1}$ is the left inverse of the Jacobian matrix $J_g(x_n)$ of $g$ evaluated at $x_n$.


Strictly speaking, any method that replaces the exact Jacobian $J_g(x_n)$ with an approximation is a quasi-Newton method. For instance, the chord method (where $J_g(x_n)$ is replaced by $J_g(x_0)$ for all iterations) is a simple example. The methods given below for optimization refer to an important subclass of quasi-Newton methods, secant methods.[1]
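As an illustration, a minimal sketch of the chord method in Python is given below; the test system, starting point, and tolerance are illustrative choices, not part of the original text. The Jacobian is evaluated once at $x_0$ and reused at every iteration.

```python
import numpy as np

def g(x):
    # Example system: g(x) = 0 has a root at (1, 1) -- illustrative only.
    return np.array([x[0]**2 + x[1]**2 - 2.0,
                     x[0] - x[1]])

def jacobian_g(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [1.0, -1.0]])

def chord_method(g, jac, x0, tol=1e-10, max_iter=100):
    """Quasi-Newton root finding: reuse the Jacobian evaluated at x0."""
    x = np.asarray(x0, dtype=float)
    J0 = jac(x)                          # evaluated once, reused every iteration
    for _ in range(max_iter):
        x = x - np.linalg.solve(J0, g(x))
        if np.linalg.norm(g(x)) < tol:
            break
    return x

print(chord_method(g, jacobian_g, x0=[2.0, 0.5]))  # approximately [1., 1.]
```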


Using methods developed to find extrema in order to find zeroes is not always a good idea, as the majority of the methods used to find extrema require that the matrix that is used is symmetric. While this holds in the context of the search for extrema, it rarely holds when searching for zeroes. Broyden's "good" and "bad" methods are two quasi-Newton methods commonly used to find zeroes of general systems of equations. Other methods that can be used are the column-updating method, the inverse column-updating method, the quasi-Newton least squares method and the quasi-Newton inverse least squares method.


More recently quasi-Newton methods have been applied to find the solution of multiple coupled systems of equations (e.g. fluid–structure interaction problems or interaction problems in physics). They allow the solution to be found by solving each constituent system separately (which is simpler than the global system) in a cyclic, iterative fashion until the solution of the global system is found.[1][2]



Search for extrema: optimization


Noting that the search for a minimum or maximum of a scalar-valued function is nothing other than the search for the zeroes of the gradient of that function, quasi-Newton methods can be readily applied to find extrema of a function. In other words, if $g$ is the gradient of $f$, then searching for the zeroes of the vector-valued function $g$ corresponds to the search for the extrema of the scalar-valued function $f$; the Jacobian of $g$ now becomes the Hessian of $f$. The main difference is that the Hessian matrix is a symmetric matrix, unlike the Jacobian when searching for zeroes. Most quasi-Newton methods used in optimization exploit this property.


In optimization, quasi-Newton methods (a special case of variable-metric methods) are algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on Newton's method to find the stationary point of a function, where the gradient is 0. Newton's method assumes that the function can be locally approximated as a quadratic in the region around the optimum, and uses the first and second derivatives to find the stationary point. In higher dimensions, Newton's method uses the gradient and the Hessian matrix of second derivatives of the function to be minimized.


In quasi-Newton methods the Hessian matrix does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian.
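In one dimension this reduces to the classical secant method applied to the first derivative; a minimal sketch of that special case (with an illustrative test function) is:

```python
def secant_minimize(fprime, x0, x1, tol=1e-12, max_iter=50):
    """Find a stationary point of f by applying the secant method to f'.

    The finite difference (fprime(x1) - fprime(x0)) / (x1 - x0) plays the
    role of the Hessian approximation B_k in higher dimensions.
    """
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        b = (g1 - g0) / (x1 - x0)        # scalar "secant" Hessian estimate
        x0, x1 = x1, x1 - g1 / b         # Newton-like step with b instead of f''
        if abs(x1 - x0) < tol:
            break
    return x1

# Illustrative example: f(x) = x**4 - 3*x**2 + 2, so f'(x) = 4*x**3 - 6*x.
print(secant_minimize(lambda x: 4 * x**3 - 6 * x, x0=1.0, x1=1.5))
# approximately sqrt(1.5) = 1.2247..., a local minimum of f
```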


The first quasi-Newton algorithm was proposed by William C. Davidon, a physicist working at Argonne National Laboratory. He developed the first quasi-Newton algorithm in 1959: the DFP updating formula, which was later popularized by Fletcher and Powell in 1963, but is rarely used today. The most common quasi-Newton algorithms are currently the SR1 formula (for "symmetric rank-one"), the BHHH method, the widespread BFGS method (suggested independently by Broyden, Fletcher, Goldfarb, and Shanno, in 1970), and its low-memory extension L-BFGS. The Broyden class is a linear combination of the DFP and BFGS methods.


The SR1 formula does not guarantee the update matrix to maintain positive-definiteness and can be used for indefinite problems. Broyden's method does not require the update matrix to be symmetric and is used to find the root of a general system of equations (rather than the gradient) by updating the Jacobian (rather than the Hessian).


One of the chief advantages of quasi-Newton methods over Newton's method is that the Hessian matrix (or, in the case of quasi-Newton methods, its approximation) $B$ does not need to be inverted. Newton's method, and its derivatives such as interior point methods, require the Hessian to be inverted, which is typically implemented by solving a system of linear equations and is often quite costly. In contrast, quasi-Newton methods usually generate an estimate of $B^{-1}$ directly.
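To make the contrast concrete, the following sketch compares the two ways of computing the step on a randomly generated positive-definite matrix (an illustrative stand-in for an exact Hessian and a maintained inverse approximation; not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)    # a positive-definite "Hessian"
H = np.linalg.inv(B)           # stand-in for a maintained inverse approximation
grad = rng.standard_normal(n)

# Newton-type step: solve a linear system with the (approximate) Hessian B
# (roughly O(n^3) for a dense matrix).
dx_newton = np.linalg.solve(B, -grad)

# Quasi-Newton step with a directly maintained inverse approximation H:
# only a matrix-vector product (roughly O(n^2)).
dx_quasi = -H @ grad

print(np.allclose(dx_newton, dx_quasi))  # True: same step, cheaper to apply
```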


As in Newton's method, one uses a second-order approximation to find the minimum of a function $f(x)$. The Taylor series of $f(x)$ around an iterate is


$$f(x_k + \Delta x) \approx f(x_k) + \nabla f(x_k)^{\mathrm{T}} \, \Delta x + \tfrac{1}{2} \Delta x^{\mathrm{T}} B \, \Delta x,$$

where $\nabla f$ is the gradient, and $B$ an approximation to the Hessian matrix. The gradient of this approximation (with respect to $\Delta x$) is


$$\nabla f(x_k + \Delta x) \approx \nabla f(x_k) + B \, \Delta x,$$

and setting this gradient to zero (which is the goal of optimization) provides the Newton step:


$$\Delta x = -B^{-1} \nabla f(x_k).$$

The Hessian approximation $B$ is chosen to satisfy


$$\nabla f(x_k + \Delta x) = \nabla f(x_k) + B \, \Delta x,$$

which is called the secant equation (the Taylor series of the gradient itself). In more than one dimension $B$ is underdetermined. In one dimension, solving for $B$ and applying the Newton step with the updated value is equivalent to the secant method. The various quasi-Newton methods differ in their choice of the solution to the secant equation (in one dimension, all the variants are equivalent). Most methods (but with exceptions, such as Broyden's method) seek a symmetric solution ($B^{\mathrm{T}} = B$); furthermore, the variants listed below can be motivated by finding an update $B_{k+1}$ that is as close as possible to $B_k$ in some norm; that is, $B_{k+1} = \operatorname{argmin}_B \|B - B_k\|_V$, where $V$ is some positive-definite matrix that defines the norm. An approximate initial value, commonly a scalar multiple of the identity matrix $B_0 = \beta I$, is often sufficient to achieve rapid convergence; note that $B_0$ should be positive-definite. The unknown $x_k$ is updated by applying the Newton step calculated using the current approximate Hessian matrix $B_k$:




  • $\Delta x_k = -\alpha_k B_k^{-1} \nabla f(x_k)$, with $\alpha_k$ chosen to satisfy the Wolfe conditions;


  • $x_{k+1} = x_k + \Delta x_k$;

  • The gradient is computed at the new point, $\nabla f(x_{k+1})$, and

$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

is used to update the approximate Hessian $B_{k+1}$, or directly its inverse $H_{k+1} = B_{k+1}^{-1}$ using the Sherman–Morrison formula.


  • A key property of the BFGS and DFP updates is that if $B_k$ is positive-definite and $\alpha_k$ is chosen to satisfy the Wolfe conditions, then $B_{k+1}$ is also positive-definite.
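A minimal sketch of this iteration in Python is given below; the Rosenbrock test function, the tolerance, and the curvature safeguard are illustrative choices, not part of the original text. It uses SciPy's Wolfe-condition line search and updates the inverse Hessian approximation with the BFGS formula from the table that follows.

```python
import numpy as np
from scipy.optimize import line_search  # Wolfe-condition line search

def quasi_newton(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimal BFGS-style quasi-Newton loop (illustrative, not production code)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                        # H_0: positive-definite initial guess
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                       # quasi-Newton search direction
        alpha = line_search(f, grad, x, p, gfk=g)[0]
        if alpha is None:                # line search failed; take a unit step
            alpha = 1.0
        dx = alpha * p
        x_new = x + dx
        g_new = grad(x_new)
        y = g_new - g
        ys = y @ dx
        if ys > 1e-12:                   # curvature safeguard; skip update otherwise
            rho = 1.0 / ys
            V = np.eye(n) - rho * np.outer(dx, y)
            # BFGS update of the inverse Hessian approximation (see table below)
            H = V @ H @ V.T + rho * np.outer(dx, dx)
        x, g = x_new, g_new
    return x

# Illustrative use on the Rosenbrock function
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
rosen_grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                                 200 * (x[1] - x[0]**2)])
print(quasi_newton(rosen, rosen_grad, x0=[-1.2, 1.0]))  # approximately [1., 1.]
```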

The most popular update formulas are listed below; each entry gives the update of the Hessian approximation $B_{k+1}$ and, where available, of its inverse $H_{k+1} = B_{k+1}^{-1}$.


  • BFGS:
    $$B_{k+1} = B_k + \frac{y_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k} - \frac{B_k \Delta x_k (B_k \Delta x_k)^{\mathrm{T}}}{\Delta x_k^{\mathrm{T}} B_k \Delta x_k}$$
    $$H_{k+1} = \left(I - \frac{\Delta x_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}\right) H_k \left(I - \frac{y_k \Delta x_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}\right) + \frac{\Delta x_k \Delta x_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}$$

  • Broyden:
    $$B_{k+1} = B_k + \frac{y_k - B_k \Delta x_k}{\Delta x_k^{\mathrm{T}} \Delta x_k} \, \Delta x_k^{\mathrm{T}}$$
    $$H_{k+1} = H_k + \frac{(\Delta x_k - H_k y_k) \Delta x_k^{\mathrm{T}} H_k}{\Delta x_k^{\mathrm{T}} H_k y_k}$$

  • Broyden family:
    $$B_{k+1} = (1 - \varphi_k) B_{k+1}^{\text{BFGS}} + \varphi_k B_{k+1}^{\text{DFP}}, \qquad \varphi_k \in [0, 1]$$

  • DFP:
    $$B_{k+1} = \left(I - \frac{y_k \Delta x_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}\right) B_k \left(I - \frac{\Delta x_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}\right) + \frac{y_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}$$
    $$H_{k+1} = H_k + \frac{\Delta x_k \Delta x_k^{\mathrm{T}}}{\Delta x_k^{\mathrm{T}} y_k} - \frac{H_k y_k y_k^{\mathrm{T}} H_k}{y_k^{\mathrm{T}} H_k y_k}$$

  • SR1:
    $$B_{k+1} = B_k + \frac{(y_k - B_k \Delta x_k)(y_k - B_k \Delta x_k)^{\mathrm{T}}}{(y_k - B_k \Delta x_k)^{\mathrm{T}} \Delta x_k}$$
    $$H_{k+1} = H_k + \frac{(\Delta x_k - H_k y_k)(\Delta x_k - H_k y_k)^{\mathrm{T}}}{(\Delta x_k - H_k y_k)^{\mathrm{T}} y_k}$$

Other methods are Pearson's method, McCormick's method, the Powell symmetric Broyden (PSB) method and Greenstadt's method.[1]
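The tabulated formulas translate directly into code. As an illustration (a sketch with hypothetical helper names, not part of the original text), the SR1 and DFP updates of the inverse approximation $H_k$ can be written as:

```python
import numpy as np

def sr1_inverse_update(H, dx, y):
    """SR1 update of the inverse Hessian approximation (see table above)."""
    v = dx - H @ y
    return H + np.outer(v, v) / (v @ y)

def dfp_inverse_update(H, dx, y):
    """DFP update of the inverse Hessian approximation (see table above)."""
    Hy = H @ y
    return (H + np.outer(dx, dx) / (dx @ y)
              - np.outer(Hy, Hy) / (y @ Hy))

# Either function could replace the BFGS update in the loop sketched earlier,
# e.g. H = sr1_inverse_update(H, dx, y).
```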



Relationship to matrix inversion


When $f$ is a convex quadratic function with positive-definite Hessian $B$, one would expect the matrices $H_k$ generated by a quasi-Newton method to converge to the inverse Hessian $H = B^{-1}$. This is indeed the case for the class of quasi-Newton methods based on least-change updates.[3]
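A classical special case is the BFGS update with exact line searches: on an $n$-dimensional strongly convex quadratic it terminates in at most $n$ steps and, provided the steps taken are linearly independent, yields $H_n = B^{-1}$. The sketch below checks this numerically on a randomly generated quadratic (the test problem and its dimension are illustrative choices, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # positive-definite Hessian of the quadratic
b = rng.standard_normal(n)

grad = lambda x: A @ x - b            # gradient of f(x) = 0.5 x^T A x - b^T x

x = np.zeros(n)
H = np.eye(n)
for _ in range(n):                    # n steps suffice for an n-dimensional quadratic
    g = grad(x)
    p = -H @ g
    alpha = -(g @ p) / (p @ A @ p)    # exact line search for a quadratic
    dx = alpha * p
    y = grad(x + dx) - g
    rho = 1.0 / (y @ dx)
    V = np.eye(n) - rho * np.outer(dx, y)
    H = V @ H @ V.T + rho * np.outer(dx, dx)   # BFGS inverse update
    x = x + dx

print(np.allclose(H, np.linalg.inv(A)))        # True (up to rounding)
print(np.allclose(x, np.linalg.solve(A, b)))   # True: x is the minimizer
```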



Notable implementations


Implementations of quasi-Newton methods are available in many programming languages. Notable implementations include:




  • GNU Octave uses a form of BFGS in its fsolve function, with trust region extensions.


  • Mathematica includes quasi-Newton solvers.[4]

  • The NAG Library contains several routines[5] for minimizing or maximizing a function[6] which use quasi-Newton algorithms.

  • In MATLAB's Optimization Toolbox, the fminunc function uses (among other methods) the BFGS quasi-Newton method.[7] Many of the constrained methods of the Optimization Toolbox use BFGS and the variant L-BFGS.[8]


  • R's general-purpose optimizer routine optim uses the BFGS method when called with method="BFGS".[9]


  • SciPy's scipy.optimize module provides the fmin_bfgs routine, and its scipy.optimize.minimize function includes, among other methods, a BFGS implementation;[10] a minimal usage example is shown after this list.
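For example, a minimal call to SciPy's BFGS implementation (the starting point is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the Rosenbrock function with SciPy's BFGS implementation.
result = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_der, method="BFGS")
print(result.x)         # approximately [1., 1.]
print(result.hess_inv)  # final approximation of the inverse Hessian
```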



See also




  • BFGS method

    • L-BFGS

    • OWL-QN



  • Broyden's method

  • DFP updating formula

  • Newton's method

  • Newton's method in optimization

  • Quasi-Newton least squares method

  • SR1 formula



References





  1. ^ a b c Haelterman, Rob (2009). "Analytical study of the least squares quasi-Newton method for interaction problems". PhD Thesis, Ghent University. Retrieved 2014-08-14.


  2. ^ Rob Haelterman, Dirk Van Eester, Daan Verleyen (2015). "Accelerating the solution of a physics model inside a tokamak using the (Inverse) Column Updating Method". Journal of Computational and Applied Mathematics. 279: 133–144. doi:10.1016/j.cam.2014.11.005.


  3. ^ Robert Mansel Gower; Peter Richtarik (2015). "Randomized Quasi-Newton Updates are Linearly Convergent Matrix Inversion Algorithms". arXiv:1602.01768 [math.NA].


  4. ^ http://reference.wolfram.com/mathematica/tutorial/UnconstrainedOptimizationQuasiNewtonMethods.html


  5. ^ The Numerical Algorithms Group. "Keyword Index: Quasi-Newton". NAG Library Manual, Mark 23. Retrieved 2012-02-09.


  6. ^ The Numerical Algorithms Group. "E04 – Minimizing or Maximizing a Function" (PDF). NAG Library Manual, Mark 23. Retrieved 2012-02-09.


  7. ^ http://www.mathworks.com/help/toolbox/optim/ug/fminunc.html


  8. ^ http://www.mathworks.com/help/toolbox/optim/ug/brnoxzl.html


  9. ^ [1]


  10. ^ http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html




Further reading



  • Bonnans, J. F.; Gilbert, J. Ch.; Lemaréchal, C.; Sagastizábal, C. A. (2006). Numerical Optimization: Theoretical and Practical Aspects (2nd ed.). Springer. ISBN 978-3-540-35445-1.


  • Fletcher, Roger (1987), Practical methods of optimization (2nd ed.), New York: John Wiley & Sons, ISBN 978-0-471-91547-8.


  • Nocedal, Jorge; Wright, Stephen J. (1999). "Quasi-Newton Methods". Numerical Optimization. New York: Springer. pp. 192–221. ISBN 0-387-98793-2.


  • Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. (2007). "Section 10.9. Quasi-Newton or Variable Metric Methods in Multidimensions". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.


  • Scales, L. E. (1985). Introduction to Non-Linear Optimization. New York: MacMillan. pp. 84–106. ISBN 0-333-32552-4.







