
3. Guaranteeing Descent


Newton’s Methods for Minimization

  • The basic idea can be derived using the first-order necessary optimality condition
    \nabla f(x) = 0
  • Alternatively, consider the second-order Taylor series approximation
    f(x+p) \approx f(x) + \nabla f(x)p + \frac{1}{2}p^T\nabla^2 f(x)p

    Let q(p) = f(x) + \nabla f(x)p + \frac{1}{2}p^T\nabla^2 f(x)p

    \min_p q(p)

    for any given x

    • Proof
      \frac{d}{dp}\left\{ f(x) + \nabla f(x)p + \frac{1}{2}p^T\nabla^2 f(x)p \right\} = 0 \\ \Rightarrow \nabla f(x)^T + \nabla^2 f(x)p = 0 \\ \Rightarrow p = -[\nabla^2 f(x)]^{-1}\nabla f(x)^T

      Therefore,

      x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T
    💡
    The quadratic function q provides a new interpretation of Newton’s method for minimizing f.
    💡
    The Newton step is usually obtained by solving a linear system of equations rather than by computing the inverse of the Hessian (see the sketch at the end of this section).
  • Why don’t we just use a first-order approximation?

    → Note that we want to minimize f, not q. In addition, a first-order approximation is unbounded below, so we use the second-order approximation instead. Moreover, when the Hessian is positive definite, the second-order approximation is a convex quadratic, so we can easily find its minimizer analytically.

    💡
    But this kind of approximation requires the Hessian to be at least positive semi-definite at that point. If the Hessian is not positive definite, we perturb it to make it positive definite; this is the idea behind modified Newton methods (see the modified matrix factorization below).
  • Convergence rate (if it converges, under suitable assumptions on f and the starting point)
    • Quadratic: Newton’s method has a quadratic rate of convergence except in degenerate cases

      It converges faster and faster: roughly, the number of correct digits doubles at each iteration.

      • The exact condition for quadratic convergence

        Let f be a real-valued function of n variables defined on an open convex set S. Assume that \nabla^2 f is Lipschitz continuous on S, that is,

        \|\nabla^2 f(x) - \nabla^2 f(y)\| \le L\|x-y\|

        for all x, y \in S and for some constant L < \infty. Consider the sequence \{x_k\} generated by

        x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)

        Let x^* be a minimizer of f(x) in S and assume that \nabla^2 f(x^*) is positive definite. If \|x_0 - x^*\| is sufficiently small, then \{x_k\} converges quadratically to x^*.

        💡
        Basically, the proof is based on a second-order Taylor expansion around x^*.

    • Superlinear, if the search direction approaches the Newton direction in the limit
💡
However, we cannot guarantee that Newton’s method always converges; this is why it is sensitive to the initial starting point. It normally works well when the Hessian is positive definite near the solution.
Visually Explained: Newton's Method in Optimization
We take a look at Newton's method, a powerful technique in Optimization. We explain the intuition behind it, and we list some of its pros and cons. No background required beyond basic linear algebra and calculus.
https://www.youtube.com/watch?v=W7S94pq5Xuo
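To make the iteration concrete, here is a minimal sketch of the pure Newton iteration in Python. The assumptions are mine: NumPy is available, and the test function f(x, y) = e^x - x + e^y - y, whose unique minimizer is the origin, is chosen purely for illustration. As noted above, the step is obtained by solving a linear system rather than inverting the Hessian.

```python
import numpy as np

# Test function (my own choice): f(x, y) = e^x - x + e^y - y.
# Its unique minimizer is (0, 0), where the Hessian is the identity (positive definite).
def grad(v):
    return np.exp(v) - 1.0          # elementwise gradient

def hess(v):
    return np.diag(np.exp(v))       # diagonal Hessian

x = np.array([2.0, 1.5])            # starting point, reasonably close to x* = (0, 0)
for k in range(8):
    # Newton step: solve H p = -g instead of forming the inverse of H
    p = np.linalg.solve(hess(x), -grad(x))
    x = x + p                       # full step, i.e. alpha_k = 1
    print(k, np.linalg.norm(x))     # error ||x_k - x*||, since x* = 0
```

The printed error roughly squares from one iteration to the next once the iterate is close to the minimizer, which is exactly the quadratic behavior described above.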

Problems of Naive Newton’s Method

Why not simply keep using the basic version of Newton's step?

x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T

In fact, when it works, it works well.

However

  • It may not converge at all (or it may fail outright if the Hessian is singular, i.e., not invertible)
  • It doesn’t look for a minimum, just a stationary point.
    💡
    In other words, the solution obtained by Newton’s method may be a saddle point rather than a local minimum: the method merely finds a point where the gradient vanishes. This is consistent with what we saw earlier, namely that the first-order condition alone is not enough to claim a local optimum.
  • The basic form is computationally heavy

    → Solving the Newton system (or computing the inverse of the Hessian) requires on the order of O(n^3) operations per iteration.

It is possible to reduce these costs by using algorithms that compromise on Newton’s method. Virtually all of these algorithms get by with only first derivative calculations. Most of these algorithms avoid solving a linear system and reduce the cost of using the Newton formula to O(n^2) or less. The methods designed for solving large problems reduce the storage requirements to O(n).

One of the algorithms that rely only on first derivative calculations is the gradient descent method. It is widely used to find local optima of differentiable functions.

There are also quasi-Newton methods, which reduce the cost of the Newton formula without solving a full linear system at every iteration. Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm are known to reduce the cost of Newton’s method while maintaining similar performance.
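As a small illustration of this point, here is a hedged sketch that minimizes a Rosenbrock-type test function with BFGS, supplying only the function value and its gradient. The assumptions are mine: SciPy is installed, and the test function and starting point are chosen for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock-type test function (my own choice); its minimizer is (1, 1).
f = lambda v: (v[0] - 1) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1) - 400 * v[0] * (v[1] - v[0] ** 2),
                           200 * (v[1] - v[0] ** 2)])

# BFGS needs only f and its gradient; it builds a Hessian approximation internally.
res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method='BFGS')
print(res.x)   # approximately [1, 1]
```

Note that the Hessian never appears: the second-order information is approximated from successive gradients.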

However, if we keep in mind that almost all of these methods are compromises on Newton's method, the relationships among these methods become clearer, and it becomes easier to understand their relative merits.

💡
Note: the gradient descent method generally converges more slowly than Newton’s method, since Newton’s method exploits second-derivative information. In addition, gradient descent is sensitive to the learning rate.
💡
Actually, Newton’s method is almost never used in its classical form, but it is still important because it represents an ideal method for solving minimization problems.

Ensuring Descent (Guaranteeing Descent)

In the general iterative algorithm,

x_{k+1} = x_k + \alpha_k p_k

where \alpha_k = 1, p_k = -[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T

💡
The key point is that we build the second-order approximation around the current point and take its minimizer. For this second-order approximation to work well, the Hessian must be positive definite.

Suppose we would like to have an improvement in every iteration. This is possible if the search direction is a descent direction

\begin{align}\nabla f(x_k) p_k &= -\nabla f(x_k)[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T < 0\\ &\Leftarrow [\nabla^2 f(x_k)]^{-1} \text{ is positive definite} \\ &\Leftrightarrow \nabla^2 f(x_k) \text{ is positive definite}\end{align}
💡
This means that we cannot guarantee the search direction is a descent direction unless the Hessian is positive definite. If the Hessian is not positive definite, the second-order approximation does not work well (think about f(x_1, x_2) = x_1^4 - x_2^4; see the sketch below).

But what if the Hessian is not positive definite at that point? Then we have to replace \nabla^2 f(x_k) with something similar that is a positive definite matrix.

Note that this kind of problem occurs because we used a second-order approximation of the function. Therefore, it can be avoided by using only a first-order approximation. However, that approach has its own problems (we will examine them when we study the steepest descent method).

💡
Note that even though this approach guarantees that the function value decreases, we still cannot guarantee that the iteration always converges.
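To see the failure concretely, here is a small sketch based on the example f(x_1, x_2) = x_1^4 - x_2^4 mentioned above. The assumptions are mine: NumPy is available, and the evaluation point (1, 2) is chosen for illustration.

```python
import numpy as np

# f(x1, x2) = x1^4 - x2^4 at the point (1, 2): the Hessian is indefinite here.
x = np.array([1.0, 2.0])
g = np.array([4 * x[0] ** 3, -4 * x[1] ** 3])        # gradient: (4, -32)
H = np.diag([12 * x[0] ** 2, -12 * x[1] ** 2])       # Hessian: diag(12, -48)
p = np.linalg.solve(H, -g)                           # "Newton" direction

print(np.dot(g, p))   # positive (= 20), so p is NOT a descent direction
```

Because the Hessian has a negative eigenvalue, the Newton direction points uphill at this point, which is precisely why a positive definite (possibly modified) Hessian is required.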

Justification

  1. We can obtain a descent direction based on the Newton direction (the latter has fast convergence, if it converges)

    → Note that we cannot always be sure that it converges

  2. At the solution to the optimization problem, \nabla^2 f(x^*) will usually be positive definite (it is always positive semidefinite), so the Hessian will normally only be replaced at points distant from the solution.

Now we will look at three technical details involved in choosing a search direction:

  1. Modified Matrix Factorization: related to the positive definiteness of the Hessian
  2. Sufficient descent: the angle condition
  3. Gradient related: the norm of the search direction cannot become too much smaller than that of the gradient

Modified Matrix Factorization

Suppose \nabla^2 f(x_k) is not positive definite. Since it is a real symmetric matrix, we already know that it is diagonalizable and has real eigenvalues and eigenvectors.

The problem is that it has a negative (or zero) eigenvalue. To remedy this problem, we want to make it positive definite by adding a perturbation.

First, we choose a diagonal matrix E and add it to \nabla^2 f(x_k).

By Weyl’s inequality, if the entries of E are chosen large enough, we can guarantee that \nabla^2 f(x_k) + E has only positive eigenvalues.

Since E is a symmetric matrix, \nabla^2 f(x_k) + E is also a symmetric matrix and therefore always diagonalizable.

By using this matrix, we want to find p such that

[\nabla^2 f(x_k) + E]p = - \nabla f(x_k)^T \\ \text{or} \\ p = - [\nabla^2 f(x_k) + E]^{-1}\nabla f(x_k)^T
Weyl's inequality
In linear algebra, Weyl's inequality is a theorem about the changes to eigenvalues of an Hermitian matrix that is perturbed. It can be used to estimate the eigenvalues of a perturbed Hermitian matrix.
https://en.wikipedia.org/wiki/Weyl's_inequality
💡
Note: once we perturb the Hessian, we cannot guarantee that its eigenvectors remain the same.
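As one concrete way to construct such a perturbation, here is a minimal sketch. The assumptions are mine: NumPy is available, E is taken to be a multiple of the identity \tau I, and the strategy of increasing \tau until the Cholesky factorization succeeds is just one common choice, not the only possible E.

```python
import numpy as np

def modified_newton_direction(H, g, beta=1e-3):
    """Compute p from (H + tau*I) p = -g, increasing tau until H + tau*I is positive definite."""
    # Initial shift: zero if the diagonal is already positive, otherwise enough to lift it above zero
    min_diag = np.min(np.diag(H))
    tau = 0.0 if min_diag > 0 else beta - min_diag
    while True:
        try:
            # Cholesky succeeds exactly when H + tau * I is positive definite
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            break
        except np.linalg.LinAlgError:
            tau = max(2 * tau, beta)
    # Solve (H + tau * I) p = -g using the triangular factor L (L L^T p = -g)
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)

H = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite "Hessian"
g = np.array([1.0, 1.0])
p = modified_newton_direction(H, g)
print(np.dot(g, p))                       # negative: p is a descent direction
```

Since the factorization only succeeds for positive definite matrices, the loop terminates with a positive definite \nabla^2 f(x_k) + E, and the resulting direction satisfies \nabla f(x_k) p < 0 whenever the gradient is nonzero.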

Sufficient descent

The problem is that p_k might become arbitrarily close to being orthogonal to \nabla f(x_k) while still remaining a descent direction, and thus the algorithm would make little progress toward a solution.

To guard against this, we assume that

-\frac{\nabla f(x_k) p_k}{\|p_k\|\,\|\nabla f(x_k)\|} \ge \epsilon > 0

for all k, where \epsilon > 0 is some specified tolerance.

💡
The left-hand side of the above inequality is the cosine of the angle between the negative gradient and the search direction. For this reason, this condition is often referred to as the angle condition.
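A minimal sketch of how this test might be implemented (assumptions mine: NumPy is available, the function name is hypothetical, and \epsilon = 10^{-4} is an arbitrary illustrative tolerance):

```python
import numpy as np

def sufficient_descent(g, p, eps=1e-4):
    """Angle condition: the cosine of the angle between -g and p stays bounded away from zero."""
    return -np.dot(g, p) / (np.linalg.norm(p) * np.linalg.norm(g)) >= eps

g = np.array([1.0, 0.0])
print(sufficient_descent(g, np.array([-1.0, 0.0])))    # True: steepest-descent direction
print(sufficient_descent(g, np.array([-1e-6, 1.0])))   # False: nearly orthogonal to the gradient
```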

Gradient related

The search directions are said to be gradient related if

\|p_k\| \ge m\|\nabla f(x_k)\|

for all k, where m > 0 is some constant.

💡
Note that m is independent of the choice of k.

This condition states that the norm of the search direction cannot become too much smaller than that of the gradient.
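A corresponding sketch of this check (assumptions mine: NumPy is available, the function name is hypothetical, and m = 10^{-2} is an arbitrary constant that is fixed once and reused for every iteration k):

```python
import numpy as np

def gradient_related(g, p, m=1e-2):
    """The search direction must not become much shorter than the gradient."""
    return np.linalg.norm(p) >= m * np.linalg.norm(g)

g = np.array([3.0, 4.0])                              # ||g|| = 5
print(gradient_related(g, np.array([0.5, 0.0])))      # True:  0.5  >= 0.05
print(gradient_related(g, np.array([1e-4, 0.0])))     # False: 1e-4 <  0.05
```

One common safeguard is to test both this condition and the angle condition on the candidate p_k and fall back to, for example, the steepest-descent direction when either fails.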
