
4. Guaranteeing Convergence

Guaranteeing Convergence

The auxiliary techniques that are used to guarantee convergence attempt to rein in the optimization method when it is in danger of getting out of control, and they also try to avoid intervening when the optimization method is performing effectively.

The term globalization strategy is used to distinguish the method used for the new estimate of the solution from the method for computing the search direction.

  1. Method for computing the search direction: derived from the Taylor series, which is a local approximation to the function.
  1. Selecting the new estimate of the solution: designed to guarantee global convergence, that is, convergence from any starting point.
    💡
    Note that this means convergence to a stationary point.

If the underlying optimization method produces good search directions, as is often the case with Newton’s method on well-conditioned problems, then the globalization strategies will act merely as a safety net. For a method that produces less effective search directions, such as a linear conjugate-gradient method, they can be a major contributor to the practical success of a method.

We will discuss two major types of globalization strategy:

  1. Line search method
  1. Trust-region method: Will not be discussed in this article

Line Search Method

Let $x_k$ be the current estimate of a minimizer of $f$, and let $p_k$ be the search direction at the point $x_k$. Then the new estimate of the solution is defined by the formula

$$x_{k + 1} = x_k + \alpha_k p_k$$

where the step length $\alpha_k$ is some scalar chosen so that

$$f(x_{k + 1}) < f(x_k)$$

Note that we assume that $p_k$ is a descent direction at $x_k$. Therefore,

$$p_k^T \nabla f(x_k) < 0$$

This should be guaranteed by the algorithm used to compute the search direction.

💡
Guaranteeing descent is a matter of choosing $p_k$; in the line search method, the remaining question is how to choose $\alpha_k$.

If $p_k$ is a descent direction, then $f(x_k + \alpha p_k) < f(x_k)$, at least for small positive values of $\alpha$.

The technique is called a line search because a search for a new point $x_{k + 1}$ is carried out along the line $y(\alpha) = x_k + \alpha p_k$.
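
To make the reduction to one dimension concrete, here is a minimal Python sketch (the names `f`, `grad_f`, `x_k`, and `p_k` are placeholders of mine for the objective, its gradient, the current iterate, and the search direction):

```python
import numpy as np

def phi(f, x_k, p_k, alpha):
    """The one-dimensional function searched along the line y(alpha) = x_k + alpha * p_k."""
    return f(x_k + alpha * p_k)

def is_descent_direction(grad_f, x_k, p_k):
    """Check the descent condition p_k^T grad f(x_k) < 0, i.e. phi'(0) < 0."""
    return float(p_k @ grad_f(x_k)) < 0.0

# Tiny usage example on f(x) = ||x||^2 with the steepest-descent direction
f = lambda x: float(x @ x)
grad_f = lambda x: 2.0 * x
x_k = np.array([1.0, -2.0])
p_k = -grad_f(x_k)
print(is_descent_direction(grad_f, x_k, p_k))  # True
print(phi(f, x_k, p_k, 0.1) < f(x_k))          # True for this small alpha
```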

Intuitively we would like to choose $\alpha_k$ as the solution to

$$\argmin_{\alpha > 0} f(x_k + \alpha p_k)$$
💡
It is too expensive to solve this problem exactly, so an approximate minimizer is accepted instead.

Note that we cannot always be sure whether the iterates converge to a stationary point.

💡
Whether they do depends on how the step size is determined.

An example of failing to converge to a stationary point

Consider the minimization problem

$$\min_{x} x^2$$

with initial value $x_0 = -3$. At each iteration, we use the search direction $p_k = 1$ with step length $\alpha_k = 2^{-k}$. Hence

$$x_{k + 1} = x_k + 2^{-k}$$

Therefore, $f(x_{k + 1}) < f(x_k)$. In addition, it is easy to show that $p_k$ is always a descent direction.

However, despite its simplicity, this algorithm does not converge to a stationary point. Since $x_k = -3 + \sum_{i=0}^{k-1} 2^{-i} = -1 - 2^{1-k}$, we have

$$\lim_{k \to \infty} x_k = -1$$

but $f'(-1) = -2 \ne 0$.

💡
This example shows that it is not enough just to reduce the function value at each iteration.
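
A few lines of Python reproduce this behavior (a sketch; the iteration count of 60 is an arbitrary choice of mine):

```python
x = -3.0                     # x_0
for k in range(60):
    p = 1.0                  # search direction p_k = 1
    alpha = 2.0 ** (-k)      # step length alpha_k = 2^{-k}
    assert 2 * x * p < 0     # p_k is a descent direction: f'(x_k) p_k < 0
    x = x + alpha * p        # f(x_{k+1}) < f(x_k) at every step

print(x)      # ~ -1.0: the iterates approach -1, not the minimizer 0
print(2 * x)  # f'(-1) = -2 != 0, so the limit is not a stationary point
```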

Guaranteeing convergence to a stationary point

One way to guarantee convergence is to make additional assumptions.

  1. Assumptions on the search direction $p_k$
    1. it produces sufficient descent
    1. it is gradient related
    💡
    These conditions can normally be guaranteed by making slight modifications to the method used to compute the search direction. Techniques for doing this are discussed in the context of specific methods.
  1. Assumptions on the step length $\alpha_k$
    1. it produces a sufficient decrease in the function $f$
    1. it is not too small (this is ensured by backtracking)
    💡
    The Armijo condition combined with backtracking is sometimes called the Armijo line search.

Sufficient decrease (Armijo condition)

This condition requires that

$$f(x_k + \alpha_k p_k) \le f(x_k) + \mu \alpha_k p_k^T \nabla f(x_k)$$

where $\mu$ is some scalar satisfying $0 < \mu < 1$.

💡
Note that $f(x_k) + \alpha_k p_k^T \nabla f(x_k)$ is just the first-order Taylor approximation of $f(x_k + \alpha_k p_k)$ around $x_k$.

This condition ensures that the function value decreases by a certain proportion at each iteration, guiding the algorithm towards the optimal solution.

If $\alpha$ is small, the linear approximation will be good, and the sufficient decrease condition will be satisfied. However, if $\alpha$ is large, the decrease predicted by the linear approximation may differ greatly from the actual decrease in $f$, and the condition can be violated.

💡
In other words, when updating $x_k$, the step size is accepted only as long as the function value actually decreases by a sufficient amount.
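
Written as code, the Armijo check is a one-liner (a sketch assuming NumPy vectors; `f`, `grad_f`, and the default `mu = 1e-4`, a common practical choice, are assumptions of mine):

```python
def armijo_satisfied(f, grad_f, x_k, p_k, alpha, mu=1e-4):
    """Sufficient decrease: f(x_k + alpha*p_k) <= f(x_k) + mu*alpha*p_k^T grad f(x_k)."""
    return f(x_k + alpha * p_k) <= f(x_k) + mu * alpha * (p_k @ grad_f(x_k))
```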

Backtracking

The Armijo condition alone does not forbid tiny step sizes.

Let $p_k$ be a search direction satisfying the sufficient descent condition. Define $\alpha_k$ to be the first element of the sequence

$$1, \frac{1}{2}, \frac{1}{4}, \dots, 2^{-i}, \dots$$

for which the Armijo condition is fulfilled. The Armijo condition combined with backtracking is sometimes called the Armijo line search.

💡
A more efficient alternative to backtracking is based on the Wolfe conditions.
💡
With backtracking, we do not take smaller steps than necessary.
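
A minimal backtracking routine might look like the following sketch (again assuming NumPy vectors; the function name, the default `mu = 1e-4`, and the cap on the number of halvings are my additions):

```python
def backtracking_line_search(f, grad_f, x_k, p_k, mu=1e-4, max_halvings=50):
    """Return the first alpha in 1, 1/2, 1/4, ... that satisfies the Armijo condition."""
    fx = f(x_k)
    slope = p_k @ grad_f(x_k)  # p_k^T grad f(x_k); negative for a descent direction
    alpha = 1.0
    for _ in range(max_halvings):
        if f(x_k + alpha * p_k) <= fx + mu * alpha * slope:
            return alpha       # Armijo fulfilled; alpha is not unnecessarily small
        alpha *= 0.5
    return alpha               # safeguard; rarely reached with a true descent direction
```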

Conclusion

By adding some assumptions and using the above techniques, we guarantee that

$$\nabla f(x_k) \to 0$$

Let $f$ be a real-valued function of $n$ variables. Let $x_0$ be a given initial point and define $\{x_k\}$ by $x_{k + 1} = x_k + \alpha_k p_k$, where $p_k$ is a vector of dimension $n$ and $\alpha_k \ge 0$ is a scalar. Assume that

  1. the set $S = \{x : f(x) \le f(x_0)\}$ is bounded

    → This ensures that the function takes on its minimum value at a finite point

    💡
    For example, this rules out cases such as $f(x) = e^x$.
  1. $\nabla f$ is Lipschitz continuous for all $x$
  1. the vectors $p_k$ satisfy a sufficient descent condition
  1. the search directions are gradient related and have bounded norm:
    $$\|p_k\| \le M, \quad \forall k \in \mathbb{N}$$
  1. the scalar $\alpha_k$ is chosen as the first element of the sequence $\{1, \frac{1}{2}, \frac{1}{4}, \dots\}$ to satisfy the sufficient decrease condition
    $$f(x_k + \alpha_k p_k) \le f(x_k) + \mu \alpha_k p_k^T \nabla f(x_k)$$

    where $0 < \mu < 1$

Then

$$\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$$
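
As an illustration of this conclusion (my own example, not from the text), steepest descent combined with the `backtracking_line_search` sketch above drives the gradient norm toward zero on a small quadratic whose level set $S$ is bounded and whose gradient is Lipschitz continuous:

```python
import numpy as np

A = np.diag([1.0, 10.0])

def f(x):
    return 0.5 * float(x @ A @ x)   # convex quadratic: bounded level sets

def grad_f(x):
    return A @ x                    # Lipschitz continuous gradient

x = np.array([3.0, -4.0])            # arbitrary starting point x_0
for k in range(1000):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-8:    # stop once the gradient is essentially zero
        break
    p = -g                          # steepest descent: gradient related, sufficient descent
    alpha = backtracking_line_search(f, grad_f, x, p)
    x = x + alpha * p

print(k, np.linalg.norm(grad_f(x)))  # the gradient norm is driven toward 0
```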