
5. Methods for Unconstrained Optimization


Introduction

A principal advantage of Newton’s method is that it converges rapidly when the current estimate of the variables is close to the solution. However, it also has disadvantages, and overcoming these disadvantages has led to many other techniques.

In particular, Newton’s method can

  1. fail to converge
  1. converge to a point that is not a stationary point

However, this kind of problem can be overcome by using strategies that guarantee progress toward the solution at every iteration, such as the line search and trust-region strategies.

The cost of Newton’s method can also be a concern. It requires the derivation, computation, and storage of the second derivative matrix, and the solution of a system of linear equations.

Luckily, large problems are frequently sparse, and taking advantage of sparsity can greatly reduce the computational cost of Newton’s method.

We will introduce methods that are compromises on Newton’s method and that reduce one or more of these costs.

  1. Quasi-Newton method: the most widely used Newton-type method for problems of moderate size
  1. Steepest-descent method: an old and widely known method whose costs are low but whose performance is usually bad
    💡
    It illustrates the dangers of compromising too much when using Newton’s method

Steepest-Descent Method

This method is the simplest Newton-type method for nonlinear optimization. The price for this simplicity is that the method is hopelessly inefficient at solving most problems.

Advantage

  • does not require the computation of second derivatives

Disadvantage

  • It has a linear rate of convergence. Therefore, it is slower than Newton’s method
💡
Sometimes it is so slow that the step $x_{k+1} - x_k$ falls below machine epsilon; in that case, the method fails to make further progress.

The steepest-descent method computes the search direction from

$$p_k = -\nabla f(x_k)^T$$

and then uses a line search to determine

$$x_{k+1} = x_k + \alpha_k p_k$$
💡
Note that the steepest-descent method already satisfies sufficient descent conditions and gradient-relatedness conditions.
💡
Two successive search directions in steepest descent are orthogonal under an exact line search
💡
Convergence rate : linear

The formula for the search direction can be derived in two ways.

First derivation

If we approximate the Hessian $\nabla^2 f(x_k)$ by the identity matrix $I$, the Newton equation becomes

$$Ip = -\nabla f(x_k)^T \;\Leftrightarrow\; p = -\nabla f(x_k)^T$$
💡
This kind of Hessian approximation is the basis of the quasi-Newton methods.

Second derivation

By using the first-order approximation

$$f(x_k + p) \approx f(x_k) + \nabla f(x_k)p$$

The intuitive idea is to minimize this approximation to obtain the search direction. However, this approximation does not have a finite minimum in general. Instead, the search direction is computed by minimizing a scaled version of this approximation

$$\min_{p \ne 0}\frac{\nabla f(x_k)p}{\|p\|\,\|\nabla f(x_k)\|}$$

The solution, up to a positive scaling, is $p_k = -\nabla f(x_k)^T$.
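To make the method concrete, here is a minimal sketch of steepest descent with a backtracking (Armijo) line search. The Rosenbrock test function, the Armijo parameters, and the stopping tolerance are illustrative assumptions, not something fixed by the text above.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10000):
    """Minimize f using the steepest-descent direction p_k = -grad(x_k)
    and a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stationary-point test
            break
        p = -g                               # steepest-descent direction
        fx = f(x)
        alpha, c, rho = 1.0, 1e-4, 0.5       # Armijo parameters (illustrative choices)
        while f(x + alpha * p) > fx + c * alpha * (g @ p):
            alpha *= rho                     # shrink the step until sufficient decrease
        x = x + alpha * p
    return x

# Example: the Rosenbrock function, a standard test problem (assumed here for illustration)
rosenbrock = lambda x: 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
rosen_grad = lambda x: np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                                 200 * (x[1] - x[0]**2)])

print(steepest_descent(rosenbrock, rosen_grad, [-1.2, 1.0]))  # creeps slowly toward (1, 1)
```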

Quasi-Newton Methods

Quasi-Newton methods are among the most widely used methods for nonlinear optimization, in particular when the Hessian is difficult or expensive to compute.

There are many different quasi-Newton methods, but they are all based on approximating the Hessian $\nabla^2 f(x_k)$ by another matrix $B_k$ that is available at a lower cost. Then the search direction is obtained by solving

$$B_k p = -\nabla f(x_k)^T$$

Therefore, this is equivalent to minimizing the quadratic model

$$\min_{p}\; f(x_k) + \nabla f(x_k)p + \frac{1}{2}p^T B_k p$$
💡
The various quasi-Newton methods differ in the choice of $B_k$
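To see why solving $B_k p = -\nabla f(x_k)^T$ is equivalent to minimizing this quadratic model, set the gradient of the model with respect to $p$ to zero (assuming $B_k$ is symmetric positive definite so that the stationary point is a minimizer, and using the row-vector convention for $\nabla f$):

$$\nabla_p\Big[f(x_k) + \nabla f(x_k)p + \tfrac{1}{2}p^T B_k p\Big] = \nabla f(x_k)^T + B_k p = 0 \;\;\Longrightarrow\;\; B_k p = -\nabla f(x_k)^T$$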

Advantage

  1. $B_k$ can be found using only first-derivative information
  1. the search direction can be computed using only $O(n^2)$ operations

Disadvantage

  1. The methods do not converge quadratically, but they can converge super-linearly
    💡
    But, at the precision of computer arithmetic, there is not much practical difference between the two rates of convergence.
  1. Still require matrix storage, so they are not normally used to solve large problems.

These methods are generalizations of a method for one-dimensional problems called the secant method. The secant method uses the approximation

$$f''(x_k) \approx \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}$$

Similarly,

$$\nabla^2 f(x_k)(x_k - x_{k-1}) \approx \nabla f(x_k) - \nabla f(x_{k-1})$$

From this, we can obtain the condition used to define the quasi-Newton approximations $B_k$:

$$B_k(x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1})$$

We will call this the secant condition.
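As an illustration, here is a minimal sketch of the one-dimensional secant idea applied to minimization: Newton's update for solving $f'(x) = 0$ is used, with $f''(x_k)$ replaced by the secant approximation above. The test function and starting points are assumptions made only for this example.

```python
def secant_newton_1d(fprime, x_prev, x_curr, tol=1e-10, max_iter=100):
    """Minimize a 1-D function by applying Newton's method to f'(x) = 0,
    with f''(x_k) replaced by the secant approximation
    (f'(x_k) - f'(x_{k-1})) / (x_k - x_{k-1})."""
    for _ in range(max_iter):
        g_prev, g_curr = fprime(x_prev), fprime(x_curr)
        approx_second = (g_curr - g_prev) / (x_curr - x_prev)   # secant approximation of f''
        x_prev, x_curr = x_curr, x_curr - g_curr / approx_second
        if abs(x_curr - x_prev) < tol:                          # last step is small enough
            break
    return x_curr

# Example: minimize f(x) = x^4 - 3x^2 + x, whose derivative is f'(x) = 4x^3 - 6x + 1
print(secant_newton_1d(lambda x: 4 * x**3 - 6 * x + 1, 2.0, 1.5))  # local minimizer near x ≈ 1.14
```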

💡
Note that the secant condition is not enough to define $B_k$ uniquely: it gives only $n$ equations, while $B_k$ has $n^2$ unknown entries. Therefore, additional conditions must be imposed.

Suppose $f$ is a quadratic function, $f(x) = \frac{1}{2}x^TQx - c^Tx$. In this case

$$\nabla f(x_k) - \nabla f(x_{k-1}) = Q(x_k - x_{k-1}), \qquad \nabla^2 f(x_k) = Q$$

so the Hessian matrix $Q$ satisfies the secant condition. Intuitively, we are asking that the approximation $B_k$ mimic the behavior of the Hessian matrix when it multiplies $x_k - x_{k-1}$.

💡
This interpretation is quite natural: since the method is built on a second-order approximation, it is to be expected that the Hessian of a quadratic function satisfies the secant condition exactly.

Although this interpretation is precise only for quadratic functions, it holds in an approximate way for general nonlinear functions. In other words, we only ask the approximation $B_k$ to imitate the effect of the Hessian matrix along the particular direction $x_k - x_{k-1}$.
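A quick numerical check of this statement for a quadratic, using a randomly generated symmetric positive-definite $Q$ (the matrix, the vector $c$, and the points are illustrative assumptions; the gradient is written here as the column vector $Qx - c$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)         # a symmetric positive-definite Q for the example
c = rng.standard_normal(n)

grad = lambda x: Q @ x - c           # gradient of f(x) = 1/2 x^T Q x - c^T x

x_prev, x_curr = rng.standard_normal(n), rng.standard_normal(n)
lhs = Q @ (x_curr - x_prev)                      # Hessian times the step
rhs = grad(x_curr) - grad(x_prev)                # difference of gradients
print(np.allclose(lhs, rhs))                     # True: the Hessian satisfies the secant condition
```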

For simplicity, let $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. The secant condition then becomes

$$B_k s_{k-1} = y_{k-1}$$

An example of a quasi-Newton approximation is given by the formula

$$B_{k+1} = B_k + \frac{[y_k - B_k s_k][y_k - B_k s_k]^T}{[y_k - B_k s_k]^T s_k}$$

Properties of quasi-Newton method

  1. The update formula guarantees that the secant condition $B_{k+1}s_k = y_k$ is satisfied regardless of how $B_k$ is chosen
    💡
    This means that the second-order approximation reproduces the gradient at $x_{k-1}$ exactly.
  1. The new approximation $B_{k+1}$ is obtained by modifying the old approximation $B_k$. To start a quasi-Newton method, some initial approximation $B_0$ must be specified. Often $B_0 = I$ is used, but it is reasonable and often advantageous to supply a better initial approximation if one can be obtained with little effort.
  1. The new approximation $B_{k+1}$ must be close to $B_k$
    💡
    Different algorithms arise depending on the criterion used to define “close”.
  1. The search direction can be computed in $O(n^2)$ operations. Normally the computational cost of solving a system of linear equations is $O(n^3)$, but in this case it is possible to derive formulas that update a Cholesky factorization of $B_k$ directly, which keeps the per-iteration cost at $O(n^2)$ (see the sketch after this list).
    💡
    With a factorization available, the search direction can be computed via back substitution.
    Cholesky decomposition
    In linear algebra, the Cholesky decomposition or Cholesky factorization (pronounced /ʃəˈlɛski/ shə-LES-kee) is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, which is useful for efficient numerical solutions, e.g., Monte Carlo simulations. It was discovered by André-Louis Cholesky for real matrices, and posthumously published in 1924.[1] When it is applicable, the Cholesky decomposition is roughly twice as efficient as the LU decomposition for solving systems of linear equations.[2]
    https://en.wikipedia.org/wiki/Cholesky_decomposition
    Triangular matrix
    In mathematics, a triangular matrix is a special kind of square matrix. A square matrix is called lower triangular if all the entries above the main diagonal are zero. Similarly, a square matrix is called upper triangular if all the entries below the main diagonal are zero.
    https://en.wikipedia.org/wiki/Triangular_matrix#Forward_and_back_substitution
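As a rough sketch of the last point: once a Cholesky factor of $B_k$ is available, the search direction can be obtained by one forward and one back substitution, each costing $O(n^2)$. The matrices below are placeholders for illustration; in an actual quasi-Newton code, the factor itself would be updated rather than recomputed.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def quasi_newton_direction(B, grad_x):
    """Solve B p = -grad_x using a Cholesky factorization B = L L^T:
    forward substitution for L z = -grad_x, then back substitution for L^T p = z."""
    L = cholesky(B, lower=True)                     # O(n^3) if computed from scratch;
                                                    # quasi-Newton methods update L instead
    z = solve_triangular(L, -grad_x, lower=True)    # forward substitution: O(n^2)
    p = solve_triangular(L.T, z, lower=False)       # back substitution: O(n^2)
    return p

# Illustrative example with a small positive-definite B
B = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
print(quasi_newton_direction(B, g))                 # same result as np.linalg.solve(B, -g)
```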

All quasi-Newton methods have the form

$$B_{k+1} = B_k + [\text{something}]$$

The [something] term represents an update to the old approximation $B_k$, and so a formula for a quasi-Newton approximation is often referred to as an update formula.

Symmetric Rank-one update

In this algorithm, we want to find a $B_{k+1}$ that is close to $B_k$ in the sense that their difference has rank one (a small numerical sketch of the update follows the list below):

$$\text{rank}(B_{k+1} - B_k) = 1$$
  • Formula
    $$B_{k+1} = B_k + \frac{[y_k - B_k s_k][y_k - B_k s_k]^T}{[y_k - B_k s_k]^T s_k}$$
  • It satisfies the secant condition.
  • This is the only symmetric rank-one update that satisfies the secant condition
  • We cannot guarantee that $B_{k+1}$ is positive definite
  • Super efficient to calculate
  • Derivation
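A minimal sketch of the symmetric rank-one update, together with a numerical check that the secant condition $B_{k+1}s_k = y_k$ holds. The skip test for a small denominator is a common safeguard added here as an assumption, and the test data are random.

```python
import numpy as np

def sr1_update(B, s, y, skip_tol=1e-8):
    """Symmetric rank-one (SR1) update of the Hessian approximation B.
    The update is skipped when the denominator is too small (a common safeguard)."""
    r = y - B @ s                                   # residual of the secant condition
    denom = r @ s
    if abs(denom) < skip_tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B                                    # skip: update would be unstable
    return B + np.outer(r, r) / denom               # rank-one correction

# Check the secant condition B_{k+1} s = y on random data
rng = np.random.default_rng(1)
n = 3
B = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
B_new = sr1_update(B, s, y)
print(np.allclose(B_new @ s, y))                    # True (when the update is not skipped)
```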

BFGS

In this algorithm, we want to find a $B_{k+1}$ that is close to $B_k$ in the sense of a weighted Frobenius norm of the difference of the inverses

$$\|W(B_{k+1}^{-1} - B_k^{-1})W\|_F$$
💡
$W$ is a suitably chosen weighting (scaling) matrix

Symmetry is not the only property that can be imposed. Since the Hessian matrix at the solution will normally be positive definite, it is reasonable to ask that the matrices $B_k$ be positive definite as well.

There is no rank-one update formula that maintains both symmetry and positive definiteness of the Hessian approximation. However, there are infinitely many rank-two formulas that do this. The most widely used formula, and the one considered to be most effective, is the BFGS update formula

$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$
💡
It also satisfies the secant condition
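A minimal sketch of the BFGS update. The curvature test $y_k^T s_k > 0$, which is what allows $B_{k+1}$ to stay positive definite, is a standard safeguard included here as an assumption rather than something stated above.

```python
import numpy as np

def bfgs_update(B, s, y, curvature_tol=1e-10):
    """BFGS update of the Hessian approximation B.
    Positive definiteness is preserved only when the curvature condition y^T s > 0 holds,
    so the update is skipped otherwise (a standard safeguard)."""
    if y @ s <= curvature_tol:
        return B                                   # skip: curvature condition violated
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Check the secant condition B_{k+1} s = y on random data
rng = np.random.default_rng(2)
n = 3
B = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s > 0:
    B_new = bfgs_update(B, s, y)
    print(np.allclose(B_new @ s, y))               # True
```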

Broyden class

$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \phi\,(s_k^T B_k s_k)\,v_k v_k^T$$

where $v_k = \dfrac{y_k}{y_k^T s_k} - \dfrac{B_k s_k}{s_k^T B_k s_k}$

  • $\phi = 0$ : BFGS
  • $\phi = 1$ : Davidon, Fletcher, and Powell (DFP) update, older and less efficient than BFGS

    In the DFP algorithm, we want to find a $B_{k+1}$ that is close to $B_k$ in the sense of a weighted Frobenius norm of the difference

    $$\|W(B_{k+1} - B_k)W\|_F$$
    💡
    $W$ is a suitably chosen weighting (scaling) matrix
    💡
    This approach is very similar to BFGS, but it is less stable in practice. This is understandable: what the method ultimately uses is the inverse of the Hessian, not the Hessian itself, and it is the change in the inverse that the BFGS criterion controls.
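Finally, a minimal sketch of the Broyden-class update with the parameter $\phi$, so that $\phi = 0$ reproduces BFGS and $\phi = 1$ reproduces DFP; the random test data simply check that every member of the family satisfies the secant condition.

```python
import numpy as np

def broyden_class_update(B, s, y, phi=0.0):
    """Broyden-class update of the Hessian approximation B: phi = 0 gives BFGS, phi = 1 gives DFP."""
    Bs = B @ s
    sBs = s @ Bs                                   # s^T B s
    ys = y @ s                                     # y^T s
    v = y / ys - Bs / sBs
    return (B - np.outer(Bs, Bs) / sBs
              + np.outer(y, y) / ys
              + phi * sBs * np.outer(v, v))

# All members of the family satisfy the secant condition B_{k+1} s = y (since v^T s = 0)
rng = np.random.default_rng(3)
n = 3
B = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
for phi in (0.0, 0.5, 1.0):
    print(np.allclose(broyden_class_update(B, s, y, phi) @ s, y))   # True for each phi
```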
