3. Guaranteeing Descent
Newton’s Methods for Minimization
- The basic idea can be derived using the first-order necessary optimality condition
- Alternatively, consider second-order Taylor series approximation
Let q(p) = f(x) + ∇f(x)ᵀp + (1/2) pᵀ∇²f(x)p
for any given x.
Proof: setting ∇q(p) = ∇f(x) + ∇²f(x)p = 0 gives p = −[∇²f(x)]⁻¹∇f(x).
Therefore, the Newton iteration is xk+1 = xk − [∇²f(xk)]⁻¹∇f(xk) (a minimal code sketch of this iteration appears at the end of this subsection).
- Why don’t we just use the 1st-order approximation?
→ Note that we want to minimize f, not q. In addition, the 1st-order approximation is not bounded below, so we use the 2nd-order approximation instead. Moreover, the 2nd-order approximation is convex when ∇²f(x) is positive definite, so we can find its optimal point analytically.
- Convergence rate (when it converges, under assumptions on f and the starting point)
- Quadratic: Newton’s method has a quadratic rate of convergence except in degenerate cases → it converges faster and faster as it approaches the solution
The exact condition for quadratic convergence
Let f be a real-valued function of n variables defined on an open convex set S. Assume that ∇²f is Lipschitz continuous on S, that is,
‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖
for all x, y ∈ S and for some constant L < ∞. Consider the sequence {xk} generated by
xk+1 = xk − [∇²f(xk)]⁻¹∇f(xk).
Let x∗ be a minimizer of f(x) in S and assume that ∇²f(x∗) is positive definite. If ‖x0 − x∗‖ is sufficiently small, then {xk} converges quadratically to x∗.
- Super-linear: if the search direction approaches the Newton direction in the limit, the convergence rate is super-linear
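As a concrete illustration of the basic iteration above, here is a minimal sketch in Python; the test function f, its derivatives, and the starting point below are illustrative choices (not from the text), and printing the gradient norm at each step makes the quadratic convergence visible.

```python
import numpy as np

# Illustrative test function (not from the text): f(x) = x1^4 + x1*x2 + (1 + x2)^2
def f(x):
    return x[0]**4 + x[0] * x[1] + (1 + x[1])**2

def grad(x):
    return np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])

def hess(x):
    return np.array([[12 * x[0]**2, 1.0],
                     [1.0,          2.0]])

x = np.array([0.75, -1.25])            # starting point chosen near the minimizer
for k in range(6):
    g = grad(x)
    p = np.linalg.solve(hess(x), -g)   # Newton direction: solve  ∇²f(x) p = -∇f(x)
    x = x + p                          # naive step with αk = 1
    print(k, f(x), np.linalg.norm(grad(x)))  # the gradient norm roughly squares each step
```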
Problems of Naive Newton’s Method
Why not simply keep using the basic version of Newton's step?
In fact, when it works, it works well.
However
- It may not converge at all (or fail outright if the Hessian is singular, i.e. not invertible)
- It doesn’t look for a minimum, just a stationary point.
- The basic form is computationally heavy
→ Computing the Newton step requires solving a linear system with the Hessian (or inverting it), which costs O(n³) operations per iteration.
It is possible to reduce these costs by using algorithms that compromise on Newton’s method. Virtually all of these algorithms get by with only first derivative calculations. Most of these algorithms avoid solving a linear system and reduce the cost of using the Newton formula to O(n²) or less. The methods designed for solving large problems reduce the storage requirements to O(n).
One of the algorithms that rely only on first-derivative calculations is the gradient descent method. It is widely used to find local optima of differentiable functions.
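For instance, a bare-bones gradient descent loop needs only the gradient; the fixed step size alpha and the stopping tolerance below are illustrative choices (a practical implementation would use a line search).

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.05, tol=1e-6, max_iter=10000):
    """Bare-bones sketch: repeatedly step along the negative gradient with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:    # stop once the gradient is (nearly) zero
            break
        x = x - alpha * g
    return x
```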
There are also quasi-Newton methods, which reduce the cost of the Newton formula without solving a linear system from scratch at each iteration. Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm are known to reduce the complexity of Newton’s method while maintaining similar performance.
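As a quick illustration, and assuming SciPy is available, BFGS can be invoked through scipy.optimize.minimize by supplying only the objective and its gradient (the test function is the same illustrative one used in the Newton sketch above).

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0]**4 + x[0] * x[1] + (1 + x[1])**2

def grad(x):
    return np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])

res = minimize(f, x0=[0.75, -1.25], jac=grad, method="BFGS")
print(res.x, res.nit)   # approximate minimizer and the number of iterations used
```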
However, if we keep in mind that almost all of these methods are compromises on Newton's method, the relationships among these methods become clearer, and it becomes easier to understand their relative merits.
Ensuring Descent (Guaranteeing Descent)
In the general iterative algorithm xk+1 = xk + αk pk, the basic Newton step uses αk = 1 and pk = −[∇²f(xk)]⁻¹∇f(xk).
Suppose we would like to have an improvement in every iteration. This is possible if the search direction pk is a descent direction, i.e. ∇f(xk)ᵀpk < 0.
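To see why the Newton direction qualifies whenever the Hessian is positive definite (a standard one-line argument, included here for completeness):

```latex
\nabla f(x_k)^T p_k
  = -\,\nabla f(x_k)^T \bigl[\nabla^2 f(x_k)\bigr]^{-1} \nabla f(x_k) < 0
  \quad \text{whenever } \nabla^2 f(x_k) \succ 0 \text{ and } \nabla f(x_k) \neq 0,
```

since the inverse of a positive definite matrix is itself positive definite.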
But what if the Hessian is not positive definite at that point? Then we have to replace ∇²f(xk) with something similar that is a positive definite matrix.
Note that this problem occurs because we used a 2nd-order approximation of the function, so it could be avoided by using only the 1st-order approximation. However, that approach has problems of its own (we will examine them when we study the steepest descent method).
Justification
- We can obtain a descent direction based on the Newton direction (the latter has fast convergence, if it converges)
→ Note that we can't always be sure that it converges.
- At the solution to the optimization problem, ∇²f(x∗) will usually be positive definite (it is always positive semidefinite), so that the Hessian will normally only be replaced at points distant from the solution.
Now we will look at three technical details involved in choosing a search direction:
- Modified Matrix Factorization: restoring positive definiteness of the Hessian
- Sufficient descent: an angle condition between the search direction and the gradient
- Gradient related: the norm of the search direction cannot become too much smaller than that of the gradient
Modified Matrix Factorization
Suppose ∇²f(xk) is not positive definite. Since it is a real symmetric matrix, we already know that it is diagonalizable with real eigenvalues and real eigenvectors.
The problem is that it has a negative eigenvalue. To remedy this, we want to make it positive definite by adding a perturbation.
First, we choose a diagonal matrix E (for example E = τI with τ > −λmin(∇²f(xk))) and add it to ∇²f(xk).
By Weyl’s inequality, every eigenvalue of ∇²f(xk) + E is at least λmin(∇²f(xk)) + λmin(E), so we can guarantee that ∇²f(xk) + E has only positive eigenvalues.
Since E is a symmetric matrix, ∇²f(xk) + E is also symmetric, and therefore ∇²f(xk) + E is always diagonalizable.
By using this matrix, we find pk such that (∇²f(xk) + E) pk = −∇f(xk), which is now guaranteed to be a descent direction.
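A minimal sketch of this safeguard in Python; the shift rule E = τI and the margin delta below are illustrative choices (practical implementations usually rely on a modified Cholesky factorization rather than an explicit eigenvalue computation).

```python
import numpy as np

def modified_newton_direction(H, g, delta=1e-8):
    """Solve (H + E) p = -g, where E = tau * I is chosen so that H + E is positive definite."""
    lam_min = np.linalg.eigvalsh(H).min()   # smallest eigenvalue of the symmetric matrix H
    tau = 0.0 if lam_min > delta else delta - lam_min
    E = tau * np.eye(H.shape[0])            # diagonal perturbation E = tau * I
    return np.linalg.solve(H + E, -g)       # descent direction, since H + E is positive definite
```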
Sufficient descent
The problem is that pk might become arbitrarily close to being orthogonal to ∇f(xk) while still remaining a descent direction, and thus the algorithm would make little progress toward a solution.
To ensure against this we assume that
−∇f(xk)ᵀpk / (‖∇f(xk)‖ ‖pk‖) ≥ ϵ
for all k, where ϵ > 0 is some specified tolerance.
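In code, this angle condition can be checked directly; the default tolerance below is an arbitrary illustrative value.

```python
import numpy as np

def sufficient_descent(g, p, eps=1e-4):
    """Angle condition: -g·p >= eps * ||g|| * ||p||, keeping p bounded away from orthogonality to g."""
    return -g @ p >= eps * np.linalg.norm(g) * np.linalg.norm(p)
```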
Gradient related
The search directions are said to be gradient related if
‖pk‖ ≥ m ‖∇f(xk)‖
for all k, where m > 0 is some constant.
This condition states that the norm of the search direction cannot become too much smaller than that of the gradient.
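The corresponding check is equally short; the constant m below is an arbitrary illustrative value.

```python
import numpy as np

def gradient_related(g, p, m=1e-3):
    """Gradient-relatedness: ||p|| >= m * ||g||, so the step cannot vanish while the gradient is large."""
    return np.linalg.norm(p) >= m * np.linalg.norm(g)
```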