5. Methods for Unconstrained Optimization
---
Introduction
A principal advantage of Newton’s method is that it converges rapidly when the current estimate of the variables is close to the solution. However, it also has disadvantages, and overcoming these disadvantages has led to many other techniques.
In particular, Newton’s method can
- fail to converge
- converge to a point that is not a stationary point
However, this kind of problem can be overcome by using strategies that guarantee progress toward the solution at every iteration, such as the line search and trust-region strategies.
The cost of Newton’s method can also be a concern. It requires the derivation, computation, and storage of the second derivative matrix, and the solution of a system of linear equations.
Luckily, large problems are frequently sparse, and taking advantage of sparsity can greatly reduce the computational cost of Newton’s method.
We will introduce methods that are compromises on Newton's method, each reducing one or more of these costs.
- Quasi-Newton method: the most widely used Newton-type method for problems of moderate size
- Steepest-descent method: an old and widely known method whose costs are low but whose performance is usually poor. 💡It illustrates the dangers of compromising too much on Newton's method
Steepest-Descent Method
This method is the simplest Newton-type method for nonlinear optimization. The price for this simplicity is that the method is hopelessly inefficient at solving most problems.
Advantage
- does not require the computation of second derivatives
Disadvantage
- It has a linear rate of convergence, so it is slower than Newton's method
The steepest-descent method computes the search direction from
$$p_k = -\nabla f(x_k)$$
and then uses a line search to determine the step length $\alpha_k$ in $x_{k+1} = x_k + \alpha_k p_k$.
💡Under an exact line search, consecutive search directions are orthogonal.
The formula for the search direction can be derived in two ways.
First derivation
If we approximate the Hessian by the identity matrix, Newton's equation $\nabla^2 f(x_k)\, p_k = -\nabla f(x_k)$ reduces to $p_k = -\nabla f(x_k)$; in this sense the steepest-descent method can be viewed as the simplest quasi-Newton method.
Second derivation
By using the first-order approximation
$$f(x_k + p) \approx f(x_k) + \nabla f(x_k)^T p,$$
the intuitive idea is to minimize this approximation to obtain the search direction. However, this approximation does not have a finite minimum in general. Instead, the search direction is computed by minimizing a scaled version of this approximation,
$$\min_p \; f(x_k) + \nabla f(x_k)^T p + \tfrac{1}{2} p^T p.$$
The solution is $p_k = -\nabla f(x_k)$.
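To make the method concrete, here is a minimal steepest-descent sketch in Python; the quadratic test function, the Armijo constant, and the halving backtracking rule are illustrative choices, not part of the notes above.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=1000):
    """Minimal steepest-descent sketch with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stop when the gradient is (nearly) zero
            break
        p = -g                             # steepest-descent direction p_k = -grad f(x_k)
        alpha = 1.0
        # Backtracking line search: shrink alpha until a sufficient-decrease condition holds.
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p                  # x_{k+1} = x_k + alpha_k p_k
    return x

# Example: a convex quadratic whose exact minimizer is (1, -2).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -3.0])
f = lambda x: 0.5 * x @ Q @ x - c @ x
grad = lambda x: Q @ x - c
print(steepest_descent(f, grad, np.zeros(2)))   # approaches [1., -2.]
```

Even on this mildly conditioned quadratic, the iterates approach the solution only at a linear rate, which is exactly the compromise discussed above.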
Quasi-Newton Methods

Quasi-Newton methods are among the most widely used methods for nonlinear optimization, in particular when the Hessian is hard to compute.
There are many different quasi-Newton methods, but they are all based on approximating the Hessian $\nabla^2 f(x_k)$ by another matrix $B_k$ that is available at lower cost. Then the search direction is obtained by solving
$$B_k p_k = -\nabla f(x_k).$$
When $B_k$ is positive definite, this is equivalent to minimizing the quadratic model
$$\min_p \; f(x_k) + \nabla f(x_k)^T p + \tfrac{1}{2} p^T B_k p.$$
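To spell out the equivalence (assuming $B_k$ is symmetric positive definite), setting the gradient of the quadratic model to zero recovers exactly the linear system above:
$$\nabla_p \Big[ f(x_k) + \nabla f(x_k)^T p + \tfrac{1}{2} p^T B_k p \Big] = \nabla f(x_k) + B_k p = 0 \quad\Longleftrightarrow\quad B_k p = -\nabla f(x_k).$$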
Advantage
- $B_k$ can be found using only first-derivative information
- The search direction can be computed using only $O(n^2)$ arithmetic operations per iteration
Disadvantage
- The methods do not converge quadratically, but they can converge superlinearly. 💡At the precision of computer arithmetic, there is not much practical difference between these two rates of convergence.
- They still require matrix storage, so they are not normally used to solve large problems.
These methods are generalizations of a method for one-dimensional problems called the secant method. The secant method uses the approximation
$$f''(x_k) \approx \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}.$$
Similarly, in $n$ dimensions we can ask that
$$\nabla^2 f(x_k)\,(x_k - x_{k-1}) \approx \nabla f(x_k) - \nabla f(x_{k-1}).$$
From this, we obtain the condition used to define the quasi-Newton approximation $B_k$:
$$B_k (x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1}).$$
We will call this the secant condition.
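A small Python sketch of the one-dimensional secant method may help; the test function and the starting points are illustrative, and the iteration simply replaces $f''$ in Newton's method by the difference quotient above.

```python
def secant_method(fprime, x0, x1, tol=1e-10, max_iter=50):
    """1-D secant method for f'(x) = 0: Newton's method with f'' replaced by a difference quotient."""
    for _ in range(max_iter):
        b = (fprime(x1) - fprime(x0)) / (x1 - x0)   # b ~ f''(x_1), the one-dimensional 'B_k'
        x0, x1 = x1, x1 - fprime(x1) / b            # quasi-Newton step in one dimension
        if abs(x1 - x0) < tol:
            break
    return x1

# Example: minimize f(x) = x^4 - 3x^2 + x by solving f'(x) = 4x^3 - 6x + 1 = 0.
print(secant_method(lambda x: 4 * x**3 - 6 * x + 1, x0=1.0, x1=1.5))
```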
The secant condition does not determine $B_k$ uniquely, because it provides only $n$ equations for the $n^2$ unknown entries of $B_k$. Therefore, additional conditions must be imposed.
Suppose $f$ is a quadratic function, $f(x) = \tfrac{1}{2} x^T Q x - c^T x$. In this case $\nabla f(x) = Qx - c$, so
$$Q (x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1}),$$
i.e., the Hessian matrix $Q$ satisfies the secant condition exactly. Intuitively, we are asking that the approximation $B_k$ mimic the behavior of the Hessian matrix when it multiplies $x_k - x_{k-1}$.
Although this interpretation is precise only for quadratic functions, it holds in an approximate way for general nonlinear functions. Note that we are only asking the approximation $B_k$ to imitate the effect of the Hessian matrix along one particular direction, $x_k - x_{k-1}$.
For simplicity, let $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. The secant condition then becomes
$$B_{k+1} s_k = y_k.$$
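The quadratic case above is easy to check numerically; in this sketch the random matrix $Q$, the vector $c$, and the two points are arbitrary illustrative data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)          # symmetric positive-definite Hessian of the quadratic
c = rng.standard_normal(4)
grad = lambda x: Q @ x - c           # gradient of f(x) = 0.5 x^T Q x - c^T x

x_old, x_new = rng.standard_normal(4), rng.standard_normal(4)
s = x_new - x_old                    # s_k
y = grad(x_new) - grad(x_old)        # y_k
print(np.allclose(Q @ s, y))         # True: the exact Hessian satisfies the secant condition
```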
An example of a quasi-Newton approximation is given by the symmetric rank-one formula discussed later in this section.
Properties of quasi-Newton methods
- The secant condition will be satisfied regardless of how $B_k$ is chosen. 💡This means the quadratic approximation reproduces the gradient at the previous point exactly.
- The new approximation $B_{k+1}$ is obtained by modifying the old approximation $B_k$. To start a quasi-Newton method, some initial approximation $B_0$ must be specified. Often $B_0 = I$ is used, but it is reasonable, and often advantageous, to supply a better initial approximation if one can be obtained with little effort.
- The new approximation $B_{k+1}$ must be close to $B_k$. 💡Depending on how "close" is defined, different quasi-Newton algorithms arise.
- The search direction can be computed in $O(n^2)$ operations. Normally the computational cost of solving a system of linear equations is $O(n^3)$, but here it is possible to derive formulas that update a Cholesky factorization of $B_k$ directly. 💡With a factorization available, the search direction can be computed via forward and back substitution.
Cholesky decomposition: a factorization of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose; when applicable, it is roughly twice as efficient as the LU decomposition for solving systems of linear equations. https://en.wikipedia.org/wiki/Cholesky_decomposition
Triangular matrix: a square matrix is lower triangular if all entries above the main diagonal are zero, and upper triangular if all entries below the main diagonal are zero. https://en.wikipedia.org/wiki/Triangular_matrix#Forward_and_back_substitution
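As an illustration of the substitution remark, the sketch below solves $B_k p_k = -\nabla f(x_k)$ from a Cholesky factor using SciPy. Note that `cho_factor` here recomputes the factorization from scratch, whereas a real quasi-Newton code would update an existing factorization in $O(n^2)$ after each rank-one or rank-two change; the matrix and gradient are made-up data.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

n = 5
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
B_k = A @ A.T + n * np.eye(n)        # a symmetric positive-definite quasi-Newton approximation
g_k = rng.standard_normal(n)         # stands in for grad f(x_k)

factor = cho_factor(B_k)             # B_k = L L^T
p_k = cho_solve(factor, -g_k)        # forward + back substitution, O(n^2) once the factor is known
print(np.allclose(B_k @ p_k, -g_k))  # True: p_k solves B_k p_k = -grad f(x_k)
```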
All quasi-Newton methods have the form
$$B_{k+1} = B_k + (\text{update}),$$
where the update term modifies the old approximation $B_k$; a formula for a quasi-Newton approximation is therefore often referred to as an update formula.
Symmetric Rank-one update
In this update, we look for a $B_{k+1}$ that is close to $B_k$ in the sense that their difference is a symmetric matrix of rank one.
- Formula: $$B_{k+1} = B_k + \frac{[y_k - B_k s_k][y_k - B_k s_k]^T}{[y_k - B_k s_k]^T s_k}$$
- It satisfies the secant condition.
- This is the only symmetric rank-one update that satisfies the secant condition
- We cannot guarantee that it is positive definite
- The update is very cheap to compute
Derivation
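A sketch of the standard derivation (assuming $(y_k - B_k s_k)^T s_k \neq 0$): posit a symmetric rank-one correction and impose the secant condition.
$$
\begin{aligned}
B_{k+1} &= B_k + \sigma\, u u^T \quad\text{(symmetric, rank one)}\\
B_{k+1} s_k = y_k \;&\Longrightarrow\; \sigma\,(u^T s_k)\, u = y_k - B_k s_k \\
&\Longrightarrow\; u = y_k - B_k s_k \ \text{(up to scaling)}, \qquad \sigma = \frac{1}{(y_k - B_k s_k)^T s_k},
\end{aligned}
$$
which yields the formula above.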
BFGS
In this update, we look for a $B_{k+1}$ that is close to $B_k$ as measured by a (weighted) Frobenius norm of their difference.
Symmetry is not the only property that can be imposed. Since the Hessian matrix at the solution will normally be positive definite, it is reasonable to ask that the matrices $B_k$ be positive definite as well.
There is no rank-one update formula that maintains both symmetry and positive definiteness of the Hessian approximation. However, there are infinitely many rank-two formulas that do this. The most widely used formula, and the one considered most effective, is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update formula
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}.$$
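Putting the pieces together, here is a minimal BFGS quasi-Newton sketch in Python. The Rosenbrock test function, the Armijo backtracking parameters, the curvature-based skipping rule, and the use of `np.linalg.solve` (instead of an updated Cholesky factorization) are all illustrative simplifications.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation, with s = x_{k+1}-x_k, y = grad_{k+1}-grad_k."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def quasi_newton_bfgs(f, grad, x0, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                       # B_0 = I
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)           # search direction: B_k p_k = -grad f(x_k)
        alpha = 1.0
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):   # Armijo backtracking
            alpha *= 0.5
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        if y @ s > 1e-12:                    # only update when curvature is positive (keeps B positive definite)
            B = bfgs_update(B, s, y)
        x = x_new
    return x

# Example: the Rosenbrock function, whose minimizer is (1, 1).
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
print(quasi_newton_bfgs(f, grad, np.array([-1.2, 1.0])))   # close to [1., 1.]
```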
Broyden class
The BFGS and DFP updates are members of a one-parameter family of updates, the Broyden class:
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \phi\,(s_k^T B_k s_k)\, v_k v_k^T,
\qquad \text{where } v_k = \frac{y_k}{y_k^T s_k} - \frac{B_k s_k}{s_k^T B_k s_k}$$
- $\phi = 0$: BFGS
- $\phi = 1$: the Davidon-Fletcher-Powell (DFP) update, which is older and generally less efficient than BFGS
In the DFP update, we look for a $B_{k+1}$ that is close to $B_k$ as measured by the weighted Frobenius norm of the difference,
$$\| W (B_{k+1} - B_k) W \|_F.$$
💡$W$ is a special weighting matrix chosen so that the update remains positive definite. 💡The derivation is very similar to that of BFGS, but DFP is less stable in practice; the difference between the two is that one derivation works with the Hessian approximation and the other with its inverse.
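A short Python sketch of the Broyden-class update; the random data is illustrative. It also checks that the secant condition $B_{k+1} s_k = y_k$ holds for every value of $\phi$ (the $\phi$-term contributes nothing when multiplied by $s_k$, since $v_k^T s_k = 0$).

```python
import numpy as np

def broyden_class_update(B, s, y, phi):
    """Broyden-class update: phi = 0 gives BFGS, phi = 1 gives DFP."""
    Bs = B @ s
    v = y / (y @ s) - Bs / (s @ Bs)
    return (B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
            + phi * (s @ Bs) * np.outer(v, v))

# Illustrative data; flip the sign of y if needed so the curvature condition y^T s > 0 holds.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = A @ A.T + 3 * np.eye(3)
s = rng.standard_normal(3)
y = rng.standard_normal(3)
if y @ s < 0:
    y = -y
for phi in (0.0, 0.5, 1.0):
    B_new = broyden_class_update(B, s, y, phi)
    print(phi, np.allclose(B_new @ s, y))    # True for every phi: the secant condition is preserved
```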