
3. Guaranteeing Descent


Newton’s Methods for Minimization

  • The basic idea can be derived using the first-order necessary optimality condition
    \nabla f(x) = 0
  • Alternatively, consider the second-order Taylor series approximation
    f(x+p) \approx f(x) + \nabla f(x)p + \frac{1}{2}p^T\nabla^2 f(x)p

    Let q(p) = f(x) + \nabla f(x)p + \frac{1}{2}p^T\nabla^2 f(x)p

    \min_p q(p)

    for any given x

    • Proof
      \frac{d}{dp}\left\{ f(x) + \nabla f(x)p + \frac{1}{2}p^T\nabla^2 f(x)p \right\} = 0 \\ \Rightarrow \nabla f(x)^T + \nabla^2 f(x)p = 0 \\ \Rightarrow p = -[\nabla^2 f(x)]^{-1}\nabla f(x)^T

      Therefore,

      x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T
    💡
    The quadratic function q provides a new interpretation of Newton’s method for minimizing f.
    💡
    The Newton step is usually obtained by solving a linear system of equations rather than by computing the inverse of the Hessian (see the sketch at the end of this section).
  • Why don’t we just use a first-order approximation?

    → Note that we want to minimize f, not q. In addition, a first-order approximation is unbounded below, so we use the second-order approximation instead. Moreover, when the Hessian is positive definite, the second-order approximation is a convex quadratic, so we can easily find its minimizer analytically.

    💡
    But this kind of approximation requires the Hessian to be at least positive semi-definite at that point. If the Hessian is not positive definite, we perturb it to make it positive definite; this is the idea behind modified Newton methods (see the modified matrix factorization below).
  • Convergence rate (if it converges, under suitable assumptions on f and the starting point)
    • Quadratic: Newton’s method has a quadratic rate of convergence except in degenerate cases

      It converges faster and faster: roughly, the number of correct digits doubles at each iteration.

      • The exact condition for quadratic convergence

        Let f be a real-valued function of n variables defined on an open convex set S. Assume that \nabla^2 f is Lipschitz continuous on S, that is,

        \|\nabla^2 f(x) - \nabla^2 f(y)\| \le L\|x-y\|

        for all x, y \in S and for some constant L < \infty. Consider the sequence \{x_k\} generated by

        x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)

        Let x^* be a minimizer of f(x) in S and assume that \nabla^2 f(x^*) is positive definite. If \|x_0 - x^*\| is sufficiently small, then \{x_k\} converges quadratically to x^*.

        💡
        Basically, the proof is based on a second-order Taylor expansion around x^*.

    • Superlinear, if the search direction approaches the Newton direction in the limit
💡
However, we cannot guarantee that Newton’s method always converges; this is why it is sensitive to the initial starting point. It normally works well when the Hessian is positive definite near the solution.
Visually Explained: Newton's Method in Optimization
We take a look at Newton's method, a powerful technique in Optimization. We explain the intuition behind it, and we list some of its pros and cons. No background required beyond basic linear algebra and calculus.
https://www.youtube.com/watch?v=W7S94pq5Xuo
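To make the iteration concrete, here is a minimal sketch of the pure Newton iteration in Python. The assumptions are mine: NumPy is available, and the test function f(x, y) = e^x - x + e^y - y, whose unique minimizer is the origin, is chosen purely for illustration. As noted above, the step is obtained by solving a linear system rather than inverting the Hessian.

```python
import numpy as np

# Test function (my own choice): f(x, y) = e^x - x + e^y - y.
# Its unique minimizer is (0, 0), where the Hessian is the identity (positive definite).
def grad(v):
    return np.exp(v) - 1.0          # elementwise gradient

def hess(v):
    return np.diag(np.exp(v))       # diagonal Hessian

x = np.array([2.0, 1.5])            # starting point, reasonably close to x* = (0, 0)
for k in range(8):
    # Newton step: solve H p = -g instead of forming the inverse of H
    p = np.linalg.solve(hess(x), -grad(x))
    x = x + p                       # full step, i.e. alpha_k = 1
    print(k, np.linalg.norm(x))     # error ||x_k - x*||, since x* = 0
```

The printed error roughly squares from one iteration to the next once the iterate is close to the minimizer, which is exactly the quadratic behavior described above.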

Problems of Naive Newton’s Method

Why not simply keep using the basic version of Newton's step?

x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T

In fact, when it works, it works well.

However

  • It may not converge at all (or it may fail outright if the Hessian is singular, i.e., not invertible)
  • It doesn’t look for a minimum, just a stationary point.
    💡
    In other words, the solution obtained by Newton’s method may be a saddle point rather than a local minimum: the method merely finds a point where the gradient vanishes. This is consistent with what we saw earlier, namely that the first-order condition alone is not enough to claim a local optimum.
  • The basic form is computationally heavy

    → Solving the Newton system (or computing the inverse of the Hessian) requires on the order of O(n^3) operations per iteration.

It is possible to reduce these costs by using algorithms that compromise on Newton’s method. Virtually all of these algorithms get by with only first derivative calculations. Most of these algorithms avoid solving a linear system and reduce the cost of using the Newton formula to O(n^2) or less. The methods designed for solving large problems reduce the storage requirements to O(n).

One of the algorithms that rely only on first derivative calculations is the gradient descent method. It is widely used to find local optima of differentiable functions.

There are also quasi-Newton methods, which reduce the cost of the Newton formula without solving a full linear system at every iteration. Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm are known to reduce the cost of Newton’s method while maintaining similar performance.
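As a small illustration of this point, here is a hedged sketch that minimizes a Rosenbrock-type test function with BFGS, supplying only the function value and its gradient. The assumptions are mine: SciPy is installed, and the test function and starting point are chosen for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock-type test function (my own choice); its minimizer is (1, 1).
f = lambda v: (v[0] - 1) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1) - 400 * v[0] * (v[1] - v[0] ** 2),
                           200 * (v[1] - v[0] ** 2)])

# BFGS needs only f and its gradient; it builds a Hessian approximation internally.
res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method='BFGS')
print(res.x)   # approximately [1, 1]
```

Note that the Hessian never appears: the second-order information is approximated from successive gradients.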

However, if we keep in mind that almost all of these methods are compromises on Newton's method, the relationships among these methods become clearer, and it becomes easier to understand their relative merits.

💡
Note: the gradient descent method generally converges more slowly than Newton’s method, since Newton’s method exploits second-derivative information. In addition, gradient descent is sensitive to the learning rate.
💡
Actually, Newton’s method is almost never used in its classical form, but it is still important because it represents an ideal method for solving minimization problems.

Ensuring Descent (Guaranteeing Descent)

In the general iterative algorithm,

x_{k+1} = x_k + \alpha_k p_k

where \alpha_k = 1, p_k = -[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T

💡
The key point is that we build the second-order approximation around the current point and take its minimizer. For this second-order approximation to work well, the Hessian must be positive definite.

Suppose we would like to have an improvement in every iteration. This is possible if the search direction is a descent direction

\begin{align}\nabla f(x_k) p_k &= -\nabla f(x_k)[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)^T < 0\\ &\Leftarrow [\nabla^2 f(x_k)]^{-1} \text{ is positive definite} \\ &\Leftrightarrow \nabla^2 f(x_k) \text{ is positive definite}\end{align}
💡
This means that we cannot guarantee the search direction is a descent direction unless the Hessian is positive definite. If the Hessian is not positive definite, the second-order approximation does not work well (think about f(x_1, x_2) = x_1^4 - x_2^4; see the sketch below).

But what if the Hessian is not positive definite at that point? Then we have to replace \nabla^2 f(x_k) with something similar that is a positive definite matrix.

Note that this kind of problem occurs because we used a second-order approximation of the function. Therefore, it can be avoided by using only a first-order approximation. However, that approach has its own problems (we will examine them when we study the steepest descent method).

💡
Note that even though this approach guarantees that the function value decreases, we still cannot guarantee that the iteration always converges.
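To see the failure concretely, here is a small sketch based on the example f(x_1, x_2) = x_1^4 - x_2^4 mentioned above. The assumptions are mine: NumPy is available, and the evaluation point (1, 2) is chosen for illustration.

```python
import numpy as np

# f(x1, x2) = x1^4 - x2^4 at the point (1, 2): the Hessian is indefinite here.
x = np.array([1.0, 2.0])
g = np.array([4 * x[0] ** 3, -4 * x[1] ** 3])        # gradient: (4, -32)
H = np.diag([12 * x[0] ** 2, -12 * x[1] ** 2])       # Hessian: diag(12, -48)
p = np.linalg.solve(H, -g)                           # "Newton" direction

print(np.dot(g, p))   # positive (= 20), so p is NOT a descent direction
```

Because the Hessian has a negative eigenvalue, the Newton direction points uphill at this point, which is precisely why a positive definite (possibly modified) Hessian is required.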

Justification

  1. We can obtain a descent direction based on the Newton direction (the latter has fast convergence, if it converges)

    → Note that we cannot always be sure that it converges

  2. At the solution to the optimization problem, \nabla^2 f(x^*) will usually be positive definite (it is always positive semidefinite), so the Hessian will normally only be replaced at points distant from the solution.

Now we will look at three technical details involved in choosing a search direction:

  1. Modified Matrix Factorization: related to the positive definiteness of the Hessian
  2. Sufficient descent: the angle condition
  3. Gradient related: the norm of the search direction cannot become too much smaller than that of the gradient

Modified Matrix Factorization

Suppose \nabla^2 f(x_k) is not positive definite. Since it is a real symmetric matrix, we already know that it is diagonalizable and has real eigenvalues and eigenvectors.

The problem is that it has a negative (or zero) eigenvalue. To remedy this problem, we want to make it positive definite by adding a perturbation.

First, we choose a diagonal matrix E and add it to \nabla^2 f(x_k).

By Weyl’s inequality, if the entries of E are chosen large enough, we can guarantee that \nabla^2 f(x_k) + E has only positive eigenvalues.

Since E is a symmetric matrix, \nabla^2 f(x_k) + E is also a symmetric matrix and therefore always diagonalizable.

By using this matrix, we want to find p such that

[\nabla^2 f(x_k) + E]p = - \nabla f(x_k)^T \\ \text{or} \\ p = - [\nabla^2 f(x_k) + E]^{-1}\nabla f(x_k)^T
Weyl's inequality
In linear algebra, Weyl's inequality is a theorem about the changes to eigenvalues of an Hermitian matrix that is perturbed. It can be used to estimate the eigenvalues of a perturbed Hermitian matrix.
https://en.wikipedia.org/wiki/Weyl's_inequality
💡
Note: once we perturb the Hessian, we cannot guarantee that its eigenvectors remain the same.
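As one concrete way to construct such a perturbation, here is a minimal sketch. The assumptions are mine: NumPy is available, E is taken to be a multiple of the identity \tau I, and the strategy of increasing \tau until the Cholesky factorization succeeds is just one common choice, not the only possible E.

```python
import numpy as np

def modified_newton_direction(H, g, beta=1e-3):
    """Compute p from (H + tau*I) p = -g, increasing tau until H + tau*I is positive definite."""
    # Initial shift: zero if the diagonal is already positive, otherwise enough to lift it above zero
    min_diag = np.min(np.diag(H))
    tau = 0.0 if min_diag > 0 else beta - min_diag
    while True:
        try:
            # Cholesky succeeds exactly when H + tau * I is positive definite
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            break
        except np.linalg.LinAlgError:
            tau = max(2 * tau, beta)
    # Solve (H + tau * I) p = -g using the triangular factor L (L L^T p = -g)
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)

H = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite "Hessian"
g = np.array([1.0, 1.0])
p = modified_newton_direction(H, g)
print(np.dot(g, p))                       # negative: p is a descent direction
```

Since the factorization only succeeds for positive definite matrices, the loop terminates with a positive definite \nabla^2 f(x_k) + E, and the resulting direction satisfies \nabla f(x_k) p < 0 whenever the gradient is nonzero.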

Sufficient descent

The problem is that p_k might become arbitrarily close to being orthogonal to \nabla f(x_k) while still remaining a descent direction, and thus the algorithm would make little progress toward a solution.

To guard against this, we assume that

-\frac{\nabla f(x_k) p_k}{\|p_k\|\,\|\nabla f(x_k)\|} \ge \epsilon > 0

for all k, where \epsilon > 0 is some specified tolerance.

💡
The left-hand side of the above inequality is the cosine of the angle between the negative gradient and the search direction. For this reason, this condition is often referred to as the angle condition.
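A minimal sketch of how this test might be implemented (assumptions mine: NumPy is available, the function name is hypothetical, and \epsilon = 10^{-4} is an arbitrary illustrative tolerance):

```python
import numpy as np

def sufficient_descent(g, p, eps=1e-4):
    """Angle condition: the cosine of the angle between -g and p stays bounded away from zero."""
    return -np.dot(g, p) / (np.linalg.norm(p) * np.linalg.norm(g)) >= eps

g = np.array([1.0, 0.0])
print(sufficient_descent(g, np.array([-1.0, 0.0])))    # True: steepest-descent direction
print(sufficient_descent(g, np.array([-1e-6, 1.0])))   # False: nearly orthogonal to the gradient
```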

Gradient related

The search directions are said to be gradient related if

\|p_k\| \ge m\|\nabla f(x_k)\|

for all k, where m > 0 is some constant.

💡
Note that m is independent of the choice of k.

This condition states that the norm of the search direction cannot become too much smaller than that of the gradient.
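A corresponding sketch of this check (assumptions mine: NumPy is available, the function name is hypothetical, and m = 10^{-2} is an arbitrary constant that is fixed once and reused for every iteration k):

```python
import numpy as np

def gradient_related(g, p, m=1e-2):
    """The search direction must not become much shorter than the gradient."""
    return np.linalg.norm(p) >= m * np.linalg.norm(g)

g = np.array([3.0, 4.0])                              # ||g|| = 5
print(gradient_related(g, np.array([0.5, 0.0])))      # True:  0.5  >= 0.05
print(gradient_related(g, np.array([1e-4, 0.0])))     # False: 1e-4 <  0.05
```

One common safeguard is to test both this condition and the angle condition on the candidate p_k and fall back to, for example, the steepest-descent direction when either fails.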
