3. Guaranteeing Descent
Newton’s Method for Minimization
- The basic idea can be derived using the first-order necessary optimality condition ∇f(x) = 0
- Alternatively, consider the second-order Taylor series approximation f(x+p) ≈ f(x) + ∇f(x)p + ½pᵀ∇²f(x)p
Let q(p) = f(x) + ∇f(x)p + ½pᵀ∇²f(x)p
and solve min_p q(p) for any given x.
Proof
d/dp { f(x) + ∇f(x)p + ½pᵀ∇²f(x)p } = 0 ⇒ ∇f(x)ᵀ + ∇²f(x)p = 0 ⇒ p = −[∇²f(x)]⁻¹∇f(x)ᵀ. Therefore,
x_{k+1} = x_k − [∇²f(x_k)]⁻¹∇f(x_k)ᵀ
💡The quadratic function q provides a new interpretation of Newton’s method for minimizing f.
💡The Newton step is usually obtained by solving a linear system of equations rather than by computing the inverse of the Hessian.
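For concreteness, here is a minimal NumPy sketch of this iteration. The Rosenbrock test function, the tolerance, and the iteration cap are illustrative assumptions; the key point is that the step is obtained by solving the linear system ∇²f(x_k)p = −∇f(x_k)ᵀ rather than by forming the inverse.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Basic Newton iteration: x_{k+1} = x_k - [∇²f(x_k)]⁻¹ ∇f(x_k)ᵀ."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve ∇²f(x_k) p = -∇f(x_k) instead of computing the inverse.
        p = np.linalg.solve(hess(x), -g)
        x = x + p
    return x

# Illustrative example: f(x, y) = (1 - x)² + 100(y - x²)²  (Rosenbrock)
def rosen_grad(v):
    x, y = v
    return np.array([-2*(1 - x) - 400*x*(y - x**2), 200*(y - x**2)])

def rosen_hess(v):
    x, y = v
    return np.array([[2 - 400*(y - 3*x**2), -400*x],
                     [-400*x,                200.0]])

print(newton_minimize(rosen_grad, rosen_hess, [-1.2, 1.0]))  # should print ≈ [1. 1.]
```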
- Why don’t we just use the 1st-order approximation?
→ Note that our real goal is to minimize f, not q; q is only a local model. Moreover, the 1st-order approximation is unbounded below, so we use the 2nd-order approximation instead. When the Hessian is positive definite, the 2nd-order approximation is convex, so we can find its minimizer analytically.
💡But this kind of approximation requires at least a positive semi-definite Hessian at that point. If the Hessian is not positive semi-definite, we perturb it to make it positive definite. This technique is known as the modified Newton's method (see Modified Matrix Factorization below).
- Convergence rate (if it converges, under assumptions on f and on the starting point)
- Quadratic : Newton’s method has a quadratic rate of convergence except in degenerate cases → it converges faster and faster (a short numerical check is given after this list)
The exact condition for quadratic convergence
Let f be a real-valued function of n variables defined on an open convex set S. Assume that ∇²f is Lipschitz continuous on S, that is,
∥∇²f(x) − ∇²f(y)∥ ≤ L∥x − y∥ for all x, y ∈ S and for some constant L < ∞. Consider the sequence {x_k} generated by
x_{k+1} = x_k − [∇²f(x_k)]⁻¹∇f(x_k)
Let x* be a minimizer of f(x) in S and assume that ∇²f(x*) is positive definite. If ∥x_0 − x*∥ is sufficiently small, then {x_k} converges quadratically to x*.
💡Basically, the proof relies on the 2nd-order Taylor approximation.
Proof
- Super-linear : if the Newton direction is approached in the limit
- Newton’s method is sensitive to the initial starting point. Normally, it works well when the Hessian is positive definite.
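As a quick numerical check of the quadratic rate, here is a small sketch on the 1-D function f(x) = eˣ − 2x, whose unique minimizer is x* = ln 2 (the function and the starting point are arbitrary illustrative choices).

```python
import numpy as np

# f(x) = exp(x) - 2x has f'(x) = exp(x) - 2 and f''(x) = exp(x) > 0,
# so its unique minimizer is x* = ln(2).
x, x_star = 1.5, np.log(2.0)
for k in range(6):
    x = x - (np.exp(x) - 2.0) / np.exp(x)   # Newton step: x - f'(x)/f''(x)
    print(k, abs(x - x_star))               # the error roughly squares each step
```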

Problems of Naive Newton’s Method
Why not simply keep using the basic version of Newton's step?
In fact, when it works, it works well.
However:
- It may not converge at all (or may fail outright if the Hessian is singular, i.e. not invertible)
- It doesn’t look for a minimum, just a stationary point (see the small example after this list).
💡In other words, the solution obtained by Newton’s method may be a saddle point rather than a local minimum; the method merely finds a point where the derivative of the function is zero. This is consistent with the earlier observation that the first-order condition alone is not enough to claim a local optimum.
- The basic form is computationally heavy
→ Forming the Hessian and solving the n×n Newton system (or computing an inverse) requires O(n³) work in general.
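As a small illustration of the second point, consider f(x, y) = x² − y² (an indefinite quadratic chosen only for illustration). A single Newton step from any starting point lands exactly on the stationary point (0, 0), which is a saddle point rather than a minimum.

```python
import numpy as np

# f(x, y) = x**2 - y**2: gradient (2x, -2y), Hessian diag(2, -2) is indefinite.
H = np.diag([2.0, -2.0])
x = np.array([3.0, 1.0])                # arbitrary starting point
g = np.array([2*x[0], -2*x[1]])
x_new = x - np.linalg.solve(H, g)       # one Newton step
print(x_new)                            # [0. 0.] -- a saddle point, not a minimum
```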
It is possible to reduce these costs by using algorithms that compromise on Newton’s method. Virtually all of these algorithms get by with only first-derivative calculations. Most of these algorithms avoid solving a linear system and reduce the cost of using the Newton formula to O(n²) or less. The methods designed for solving large problems reduce the storage requirements to O(n).
One of the algorithms that rely only on first-derivative calculations is the Gradient Descent method. It is widely used to find local optima of differentiable functions.
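A minimal sketch of gradient descent with a fixed step size (the step size, tolerance, and the quadratic test function here are illustrative assumptions; in practice the step size is usually chosen by a line search):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """First-derivative-only iteration: x_{k+1} = x_k - alpha * ∇f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# Illustrative example: f(x, y) = x**2 + 2*y**2, minimized at the origin.
print(gradient_descent(lambda v: np.array([2*v[0], 4*v[1]]), [5.0, 3.0]))
# approaches the origin (up to the gradient tolerance)
```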
Also, there are methods like Quasi-Newton methods, which reduce the cost of the Newton formula without solving a linear system. Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm are known to reduce the complexity of Newton's method while maintaining similar performance.
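The following is a rough sketch of BFGS with a simple backtracking line search, intended only to show the flavor of the method: it maintains an approximation H to the inverse Hessian built from gradient differences alone, so each iteration needs neither second derivatives nor a linear solve. The tolerances, line-search constants, and the Rosenbrock example are assumptions.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """Sketch of BFGS: update an inverse-Hessian approximation H from
    gradient differences only (no Hessian evaluations, no linear solves)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                          # initial inverse-Hessian estimate
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                         # quasi-Newton search direction
        alpha, fx = 1.0, f(x)
        # simple backtracking line search (Armijo condition)
        while f(x + alpha * p) > fx + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p
        x_new, g_new = x + s, grad(x + s)
        y = g_new - g
        sy = s @ y
        if sy > 1e-12:                     # curvature condition holds -> update H
            rho = 1.0 / sy
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS inverse-Hessian update
        x, g = x_new, g_new
    return x

# Illustrative example: the Rosenbrock function again.
f = lambda v: (1 - v[0])**2 + 100*(v[1] - v[0]**2)**2
grad = lambda v: np.array([-2*(1 - v[0]) - 400*v[0]*(v[1] - v[0]**2),
                           200*(v[1] - v[0]**2)])
print(bfgs(f, grad, [-1.2, 1.0]))          # should print ≈ [1. 1.]
```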
However, if we keep in mind that almost all of these methods are compromises on Newton's method, the relationships among these methods become clearer, and it becomes easier to understand their relative merits. In this sense, Newton's method can be regarded as the ideal method for solving minimization problems.
Ensuring Descent (Guaranteeing Descent)
In the general iterative algorithm x_{k+1} = x_k + α_k p_k, Newton's method corresponds to α_k = 1 and p_k = −[∇²f(x_k)]⁻¹∇f(x_k)ᵀ. For this direction to be reliable, ∇²f(x_k) must be positive definite. Suppose we would like to have an improvement in every iteration. This is possible if the search direction is a descent direction.
💡Indeed, if ∇²f(x_k) is positive definite then so is its inverse, and ∇f(x_k)p_k = −∇f(x_k)[∇²f(x_k)]⁻¹∇f(x_k)ᵀ < 0 whenever ∇f(x_k) ≠ 0, so the Newton direction is automatically a descent direction at such points.

But what if the Hessian is not positive definite at that point? Then we have to replace ∇²f(x_k) with something similar that is a positive definite matrix.
Note that this kind of problem arises because we used the 2nd-order approximation of the function. It could be avoided by using only the 1st-order approximation, but that approach has internal problems of its own. (We will look at these problems when we study the steepest descent method.)
💡Note, however, that this modification does not guarantee that the method always converges.
Justification
- We can obtain a descent direction based on the Newton direction (the latter has fast convergence, if it converges)
→ Note that we can't always be sure that it converges
- At the solution of the optimization problem, ∇²f(x*) will usually be positive definite (it is always positive semidefinite), so the Hessian will normally only be replaced at points distant from the solution.
Now we will look at three technical details for choosing the search direction:
- Modified Matrix Factorization : related to the positive definiteness of the Hessian
- Sufficient descent : angle condition
- Gradient related : norm of the search direction cannot become too much smaller than that of the gradient
Modified Matrix Factorization
Suppose ∇²f(x_k) is not positive definite. Since it is a real symmetric matrix, we already know that it is diagonalizable with real eigenvalues.
The problem is that it has a nonpositive eigenvalue. To remedy this, we want to make it positive definite by adding a perturbation.
First, we choose a diagonal matrix E and add it to ∇²f(x_k).
By Weyl's inequality, if the smallest entry of E exceeds the magnitude of the most negative eigenvalue of ∇²f(x_k), we can guarantee that ∇²f(x_k) + E has only positive eigenvalues.
Since E is a symmetric matrix, ∇²f(x_k) + E is also symmetric, and therefore it is always diagonalizable.
By using this matrix, we want to find p_k such that
(∇²f(x_k) + E) p_k = −∇f(x_k)ᵀ
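One common way to realize this in practice (a sketch; the specific choice E = τI and the doubling schedule for τ are assumptions, not the only option) is to increase τ until a Cholesky factorization of ∇²f(x_k) + τI succeeds, and then reuse the factorization to compute the search direction.

```python
import numpy as np

def modified_newton_direction(hess, grad_vec, tau0=1e-3):
    """Return p solving (∇²f + E) p = -∇f with E = τI, where τ ≥ 0 is
    increased until ∇²f + τI is positive definite (detected via Cholesky)."""
    n = hess.shape[0]
    tau = 0.0
    while True:
        try:
            # Cholesky succeeds iff the matrix is positive definite.
            L = np.linalg.cholesky(hess + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau = max(2 * tau, tau0)       # enlarge the perturbation and retry
    # Solve (L Lᵀ) p = -∇f via the two triangular factors.
    y = np.linalg.solve(L, -grad_vec)
    return np.linalg.solve(L.T, y)

# Illustrative example: an indefinite "Hessian" with eigenvalues 3 and -1.
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])
g = np.array([1.0, -1.0])
p = modified_newton_direction(H, g)
print(p, g @ p)                            # g·p < 0, so p is a descent direction
```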
Sufficient descent
The problem is that p_k might become arbitrarily close to being orthogonal to ∇f(x_k) while still remaining a descent direction, and thus the algorithm would make little progress toward a solution.
To ensure against this, we assume that
−∇f(x_k)p_k / (∥p_k∥ ∥∇f(x_k)∥) ≥ ϵ
for all k, where ϵ > 0 is some specified tolerance.
💡This is called the angle condition: it keeps the angle between p_k and −∇f(x_k) bounded away from 90°. (It is checked, together with the gradient-related condition, in the sketch at the end of this section.)
Gradient related
The search directions are said to be gradient related if
∥p_k∥ ≥ m∥∇f(x_k)∥
for all k, where m > 0 is some constant.
This condition states that the norm of the search direction cannot become too much smaller than that of the gradient.
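Here is a minimal sketch of how the two safeguards could be checked for a candidate direction; the helper name and the tolerance values ε and m are assumptions chosen for illustration.

```python
import numpy as np

def is_acceptable_direction(p, g, eps=1e-4, m=1e-6):
    """Check the two safeguards on a candidate search direction p at a point
    whose gradient is g (for a minimization problem)."""
    # Sufficient descent (angle condition): -pᵀg ≥ ε ‖p‖ ‖g‖,
    # i.e. p stays bounded away from being orthogonal to the gradient.
    sufficient_descent = -(p @ g) >= eps * np.linalg.norm(p) * np.linalg.norm(g)
    # Gradient related: ‖p‖ ≥ m ‖g‖, i.e. the search direction cannot become
    # much smaller than the gradient.
    gradient_related = np.linalg.norm(p) >= m * np.linalg.norm(g)
    return sufficient_descent and gradient_related

g = np.array([1.0, 0.0])
print(is_acceptable_direction(np.array([-1.0, 0.0]), g))   # True  (steepest-descent direction)
print(is_acceptable_direction(np.array([-1e-5, 1.0]), g))  # False (nearly orthogonal to g)
```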