
5. Methods for Unconstrained Optimization


Introduction

A principal advantage of Newton’s method is that it converges rapidly when the current estimate of the variables is close to the solution. However, it also has disadvantages, and overcoming these disadvantages has led to many other techniques.

In particular, Newton’s method can

  1. fail to converge
  1. converge to a point that is not a stationary point

However, this kind of problem can be overcome by using strategies that guarantee progress toward the solution at every iteration, such as the line search and trust-region strategies.

The cost of Newton’s method can also be a concern. It requires the derivation, computation, and storage of the second derivative matrix, and the solution of a system of linear equations.

Luckily, large problems are frequently sparse, and taking advantage of sparsity can greatly reduce the computational cost of Newton’s method.

We will introduce methods that are compromises on Newton’s method and that reduce one or more of these costs.

  1. Quasi-Newton method: the most widely used Newton-type method for problems of moderate size
  1. Steepest-descent method: an old and widely known method whose costs are low but whose performance is usually bad
    💡
    It illustrates the dangers of compromising too much when using Newton’s method

Steepest-Descent Method

This method is the simplest Newton-type method for nonlinear optimization. The price for this simplicity is that the method is hopelessly inefficient at solving most problems.

Advantage

  • does not require the computation of second derivatives

Disadvantage

  • It has a linear rate of convergence. Therefore, it is slower than Newton’s method
💡
Sometimes it is so slow that the step $x_{k+1} - x_k$ falls below machine epsilon; in that case, the method fails to make further progress.

The steepest-descent method computes the search direction from

$$p_k = -\nabla f(x_k)^T$$

and then uses a line search to determine

$$x_{k+1} = x_k + \alpha_k p_k$$
💡
Note that the steepest-descent method already satisfies sufficient descent conditions and gradient-relatedness conditions.
💡
Two successive search directions in steepest descent are orthogonal under an exact line search
💡
Convergence rate : linear

The formula for the search direction can be derived in two ways.

First derivation

If we approximate the Hessian $\nabla^2 f(x_k)$ by the identity matrix $I$, the Newton equation becomes

$$Ip = -\nabla f(x_k)^T \;\Leftrightarrow\; p = -\nabla f(x_k)^T$$
💡
This kind of Hessian approximation is the basis of the quasi-Newton methods.

Second derivation

By using the first-order approximation

$$f(x_k + p) \approx f(x_k) + \nabla f(x_k)p$$

The intuitive idea is to minimize this approximation to obtain the search direction. However, this approximation does not have a finite minimum in general. Instead, the search direction is computed by minimizing a scaled version of this approximation

$$\min_{p \ne 0}\frac{\nabla f(x_k)p}{\|p\|\,\|\nabla f(x_k)\|}$$

The solution, up to a positive scaling, is $p_k = -\nabla f(x_k)^T$.
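To make the method concrete, here is a minimal sketch of steepest descent with a backtracking (Armijo) line search. The Rosenbrock test function, the Armijo parameters, and the stopping tolerance are illustrative assumptions, not something fixed by the text above.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10000):
    """Minimize f using the steepest-descent direction p_k = -grad(x_k)
    and a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stationary-point test
            break
        p = -g                               # steepest-descent direction
        fx = f(x)
        alpha, c, rho = 1.0, 1e-4, 0.5       # Armijo parameters (illustrative choices)
        while f(x + alpha * p) > fx + c * alpha * (g @ p):
            alpha *= rho                     # shrink the step until sufficient decrease
        x = x + alpha * p
    return x

# Example: the Rosenbrock function, a standard test problem (assumed here for illustration)
rosenbrock = lambda x: 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
rosen_grad = lambda x: np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                                 200 * (x[1] - x[0]**2)])

print(steepest_descent(rosenbrock, rosen_grad, [-1.2, 1.0]))  # creeps slowly toward (1, 1)
```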

Quasi-Newton Methods

Quasi-Newton methods are among the most widely used methods for nonlinear optimization, in particular when the Hessian is difficult or expensive to compute.

There are many different quasi-Newton methods, but they are all based on approximating the Hessian $\nabla^2 f(x_k)$ by another matrix $B_k$ that is available at a lower cost. Then the search direction is obtained by solving

$$B_k p = -\nabla f(x_k)^T$$

Therefore, this is equivalent to minimizing the quadratic model

$$\min_{p}\; f(x_k) + \nabla f(x_k)p + \frac{1}{2}p^T B_k p$$
💡
The various quasi-Newton methods differ in the choice of $B_k$
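To see why solving $B_k p = -\nabla f(x_k)^T$ is equivalent to minimizing this quadratic model, set the gradient of the model with respect to $p$ to zero (assuming $B_k$ is symmetric positive definite so that the stationary point is a minimizer, and using the row-vector convention for $\nabla f$):

$$\nabla_p\Big[f(x_k) + \nabla f(x_k)p + \tfrac{1}{2}p^T B_k p\Big] = \nabla f(x_k)^T + B_k p = 0 \;\;\Longrightarrow\;\; B_k p = -\nabla f(x_k)^T$$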

Advantage

  1. $B_k$ can be found using only first-derivative information
  1. the search direction can be computed using only $O(n^2)$ operations

Disadvantage

  1. The methods do not converge quadratically, but they can converge super-linearly
    💡
    But, at the precision of computer arithmetic, there is not much practical difference between the two rates of convergence.
  1. Still require matrix storage, so they are not normally used to solve large problems.

These methods are generalizations of a method for one-dimensional problems called the secant method. The secant method uses the approximation

$$f''(x_k) \approx \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}$$

Similarly,

$$\nabla^2 f(x_k)(x_k - x_{k-1}) \approx \nabla f(x_k) - \nabla f(x_{k-1})$$

From this, we can obtain the condition used to define the quasi-Newton approximations $B_k$:

$$B_k(x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1})$$

We will call this the secant condition.
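As an illustration, here is a minimal sketch of the one-dimensional secant idea applied to minimization: Newton's update for solving $f'(x) = 0$ is used, with $f''(x_k)$ replaced by the secant approximation above. The test function and starting points are assumptions made only for this example.

```python
def secant_newton_1d(fprime, x_prev, x_curr, tol=1e-10, max_iter=100):
    """Minimize a 1-D function by applying Newton's method to f'(x) = 0,
    with f''(x_k) replaced by the secant approximation
    (f'(x_k) - f'(x_{k-1})) / (x_k - x_{k-1})."""
    for _ in range(max_iter):
        g_prev, g_curr = fprime(x_prev), fprime(x_curr)
        approx_second = (g_curr - g_prev) / (x_curr - x_prev)   # secant approximation of f''
        x_prev, x_curr = x_curr, x_curr - g_curr / approx_second
        if abs(x_curr - x_prev) < tol:                          # last step is small enough
            break
    return x_curr

# Example: minimize f(x) = x^4 - 3x^2 + x, whose derivative is f'(x) = 4x^3 - 6x + 1
print(secant_newton_1d(lambda x: 4 * x**3 - 6 * x + 1, 2.0, 1.5))  # local minimizer near x ≈ 1.14
```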

💡
Note that the secant condition is not enough to define $B_k$ uniquely: it gives only $n$ equations, while $B_k$ has $n^2$ unknown entries. Therefore, additional conditions must be imposed.

Suppose $f$ is a quadratic function, $f(x) = \frac{1}{2}x^TQx - c^Tx$. In this case

$$\nabla f(x_k) - \nabla f(x_{k-1}) = Q(x_k - x_{k-1}), \qquad \nabla^2 f(x_k) = Q$$

so the Hessian matrix $Q$ satisfies the secant condition. Intuitively, we are asking that the approximation $B_k$ mimic the behavior of the Hessian matrix when it multiplies $x_k - x_{k-1}$.

💡
This interpretation is quite natural: since the method is built on a second-order approximation, it is to be expected that the Hessian of a quadratic function satisfies the secant condition exactly.

Although this interpretation is precise only for quadratic functions, it holds in an approximate way for general nonlinear functions. In other words, we only ask the approximation $B_k$ to imitate the effect of the Hessian matrix along the particular direction $x_k - x_{k-1}$.
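A quick numerical check of this statement for a quadratic, using a randomly generated symmetric positive-definite $Q$ (the matrix, the vector $c$, and the points are illustrative assumptions; the gradient is written here as the column vector $Qx - c$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)         # a symmetric positive-definite Q for the example
c = rng.standard_normal(n)

grad = lambda x: Q @ x - c           # gradient of f(x) = 1/2 x^T Q x - c^T x

x_prev, x_curr = rng.standard_normal(n), rng.standard_normal(n)
lhs = Q @ (x_curr - x_prev)                      # Hessian times the step
rhs = grad(x_curr) - grad(x_prev)                # difference of gradients
print(np.allclose(lhs, rhs))                     # True: the Hessian satisfies the secant condition
```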

For simplicity, let $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. The secant condition then becomes

$$B_k s_{k-1} = y_{k-1}$$

An example of a quasi-Newton approximation is given by the formula

$$B_{k+1} = B_k + \frac{[y_k - B_k s_k][y_k - B_k s_k]^T}{[y_k - B_k s_k]^T s_k}$$

Properties of quasi-Newton method

  1. The update formula guarantees that the secant condition $B_{k+1}s_k = y_k$ is satisfied regardless of how $B_k$ is chosen
    💡
    This means that the second-order approximation reproduces the gradient at $x_{k-1}$ exactly.
  1. The new approximation $B_{k+1}$ is obtained by modifying the old approximation $B_k$. To start a quasi-Newton method, some initial approximation $B_0$ must be specified. Often $B_0 = I$ is used, but it is reasonable and often advantageous to supply a better initial approximation if one can be obtained with little effort.
  1. The new approximation $B_{k+1}$ must be close to $B_k$
    💡
    Different algorithms arise depending on the criterion used to define “close”.
  1. The search direction can be computed in $O(n^2)$ operations. Normally the computational cost of solving a system of linear equations is $O(n^3)$, but in this case it is possible to derive formulas that update a Cholesky factorization of $B_k$ directly, which keeps the per-iteration cost at $O(n^2)$ (see the sketch after this list).
    💡
    With a factorization available, the search direction can be computed via back substitution.
    Cholesky decomposition
    In linear algebra, the Cholesky decomposition or Cholesky factorization (pronounced /ʃəˈlɛski/ shə-LES-kee) is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, which is useful for efficient numerical solutions, e.g., Monte Carlo simulations. It was discovered by André-Louis Cholesky for real matrices, and posthumously published in 1924.[1] When it is applicable, the Cholesky decomposition is roughly twice as efficient as the LU decomposition for solving systems of linear equations.[2]
    https://en.wikipedia.org/wiki/Cholesky_decomposition
    Triangular matrix
    In mathematics, a triangular matrix is a special kind of square matrix. A square matrix is called lower triangular if all the entries above the main diagonal are zero. Similarly, a square matrix is called upper triangular if all the entries below the main diagonal are zero.
    https://en.wikipedia.org/wiki/Triangular_matrix#Forward_and_back_substitution
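As a rough sketch of the last point: once a Cholesky factor of $B_k$ is available, the search direction can be obtained by one forward and one back substitution, each costing $O(n^2)$. The matrices below are placeholders for illustration; in an actual quasi-Newton code, the factor itself would be updated rather than recomputed.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def quasi_newton_direction(B, grad_x):
    """Solve B p = -grad_x using a Cholesky factorization B = L L^T:
    forward substitution for L z = -grad_x, then back substitution for L^T p = z."""
    L = cholesky(B, lower=True)                     # O(n^3) if computed from scratch;
                                                    # quasi-Newton methods update L instead
    z = solve_triangular(L, -grad_x, lower=True)    # forward substitution: O(n^2)
    p = solve_triangular(L.T, z, lower=False)       # back substitution: O(n^2)
    return p

# Illustrative example with a small positive-definite B
B = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
print(quasi_newton_direction(B, g))                 # same result as np.linalg.solve(B, -g)
```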

All quasi-Newton methods have the form

$$B_{k+1} = B_k + [\text{something}]$$

The [something] term represents an update to the old approximation $B_k$, and so a formula for a quasi-Newton approximation is often referred to as an update formula.

Symmetric Rank-one update

In this algorithm, we want to find a $B_{k+1}$ that is close to $B_k$ in the sense that their difference has rank one (a small numerical sketch of the update follows the list below):

$$\text{rank}(B_{k+1} - B_k) = 1$$
  • Formula
    $$B_{k+1} = B_k + \frac{[y_k - B_k s_k][y_k - B_k s_k]^T}{[y_k - B_k s_k]^T s_k}$$
  • It satisfies the secant condition.
  • This is the only symmetric rank-one update that satisfies the secant condition
  • We cannot guarantee that $B_{k+1}$ is positive definite
  • Super efficient to calculate
  • Derivation
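A minimal sketch of the symmetric rank-one update, together with a numerical check that the secant condition $B_{k+1}s_k = y_k$ holds. The skip test for a small denominator is a common safeguard added here as an assumption, and the test data are random.

```python
import numpy as np

def sr1_update(B, s, y, skip_tol=1e-8):
    """Symmetric rank-one (SR1) update of the Hessian approximation B.
    The update is skipped when the denominator is too small (a common safeguard)."""
    r = y - B @ s                                   # residual of the secant condition
    denom = r @ s
    if abs(denom) < skip_tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B                                    # skip: update would be unstable
    return B + np.outer(r, r) / denom               # rank-one correction

# Check the secant condition B_{k+1} s = y on random data
rng = np.random.default_rng(1)
n = 3
B = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
B_new = sr1_update(B, s, y)
print(np.allclose(B_new @ s, y))                    # True (when the update is not skipped)
```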

BFGS

In this algorithm, we want to find a $B_{k+1}$ that is close to $B_k$ in the sense of a weighted Frobenius norm of the difference of the inverses

$$\|W(B_{k+1}^{-1} - B_k^{-1})W\|_F$$
💡
$W$ is a suitably chosen weighting (scaling) matrix

Symmetry is not the only property that can be imposed. Since the Hessian matrix at the solution will normally be positive definite, it is reasonable to ask that the matrices $B_k$ be positive definite as well.

There is no rank-one update formula that maintains both symmetry and positive definiteness of the Hessian approximation. However, there are infinitely many rank-two formulas that do this. The most widely used formula, and the one considered to be most effective, is the BFGS update formula

$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$
💡
It also satisfies the secant condition
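A minimal sketch of the BFGS update. The curvature test $y_k^T s_k > 0$, which is what allows $B_{k+1}$ to stay positive definite, is a standard safeguard included here as an assumption rather than something stated above.

```python
import numpy as np

def bfgs_update(B, s, y, curvature_tol=1e-10):
    """BFGS update of the Hessian approximation B.
    Positive definiteness is preserved only when the curvature condition y^T s > 0 holds,
    so the update is skipped otherwise (a standard safeguard)."""
    if y @ s <= curvature_tol:
        return B                                   # skip: curvature condition violated
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Check the secant condition B_{k+1} s = y on random data
rng = np.random.default_rng(2)
n = 3
B = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s > 0:
    B_new = bfgs_update(B, s, y)
    print(np.allclose(B_new @ s, y))               # True
```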

Broyden class

$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \phi\,(s_k^T B_k s_k)\,v_k v_k^T$$

where $v_k = \dfrac{y_k}{y_k^T s_k} - \dfrac{B_k s_k}{s_k^T B_k s_k}$

  • $\phi = 0$ : BFGS
  • $\phi = 1$ : Davidon, Fletcher, and Powell (DFP) update, older and less efficient than BFGS

    In the DFP algorithm, we want to find a $B_{k+1}$ that is close to $B_k$ in the sense of a weighted Frobenius norm of the difference

    $$\|W(B_{k+1} - B_k)W\|_F$$
    💡
    $W$ is a suitably chosen weighting (scaling) matrix
    💡
    This approach is very similar to BFGS, but it is less stable in practice. This is understandable: what the method ultimately uses is the inverse of the Hessian, not the Hessian itself, and it is the change in the inverse that the BFGS criterion controls.
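Finally, a minimal sketch of the Broyden-class update with the parameter $\phi$, so that $\phi = 0$ reproduces BFGS and $\phi = 1$ reproduces DFP; the random test data simply check that every member of the family satisfies the secant condition.

```python
import numpy as np

def broyden_class_update(B, s, y, phi=0.0):
    """Broyden-class update of the Hessian approximation B: phi = 0 gives BFGS, phi = 1 gives DFP."""
    Bs = B @ s
    sBs = s @ Bs                                   # s^T B s
    ys = y @ s                                     # y^T s
    v = y / ys - Bs / sBs
    return (B - np.outer(Bs, Bs) / sBs
              + np.outer(y, y) / ys
              + phi * sBs * np.outer(v, v))

# All members of the family satisfy the secant condition B_{k+1} s = y (since v^T s = 0)
rng = np.random.default_rng(3)
n = 3
B = np.eye(n)
s, y = rng.standard_normal(n), rng.standard_normal(n)
for phi in (0.0, 0.5, 1.0):
    print(np.allclose(broyden_class_update(B, s, y, phi) @ s, y))   # True for each phi
```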
