
6. Least-Squares Data Fitting

What is the least-squares method?

The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of $n$ points $(x_i, y_i)$, where $x_i$ is an independent variable and $y_i$ is a dependent variable whose value is found by observation.

The model function has the form $f(x, \beta)$, where the $m$ adjustable parameters are held in the vector $\beta$.

The goal is to find the parameter values for the model that best fit the data. The fit of a model to a data point is measured by its residual, defined as the difference between the observed value of the dependent variable and the value predicted by the model:

$$r_i = y_i - f(x_i, \beta)$$

The least-squares method finds the optimal parameter values by minimizing the sum of squared residuals, $S$:

$$S := \sum_{i = 1}^n r_i^2$$
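
As a minimal illustration (the model, data, and parameter values below are made up, not from the text), this is how the residuals and the sum $S$ would be computed in NumPy for a hypothetical model $f(x, \beta) = \beta_1 e^{\beta_2 x}$:

```python
import numpy as np

# Hypothetical model f(x, beta) = beta1 * exp(beta2 * x)
def model(x, beta):
    return beta[0] * np.exp(beta[1] * x)

# Made-up observations (x_i, y_i)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.6, 7.5, 20.1])

beta = np.array([1.0, 1.0])   # candidate parameter values
r = y - model(x, beta)        # residuals r_i = y_i - f(x_i, beta)
S = np.sum(r**2)              # sum of squared residuals
print(r, S)
```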

Solving the least-squares problem

The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains $m$ parameters, there are $m$ gradient equations:

$$\frac{\partial S}{\partial \beta_j} = 2\sum_i r_i\frac{\partial r_i}{\partial \beta_j} = 0, \quad j = 1, \dots, m$$

Since

$$r_i = y_i - f(x_i, \beta),$$

the gradient equations become

$$-2\sum_i r_i\frac{\partial f(x_i, \beta)}{\partial \beta_j} = 0, \quad j = 1, \dots, m$$
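
As a quick worked special case (not part of the original derivation): for the straight-line model $f(x, \beta) = \beta_1 + \beta_2 x$ we have $\frac{\partial f}{\partial \beta_1} = 1$ and $\frac{\partial f}{\partial \beta_2} = x_i$, so the two gradient equations reduce to the familiar normal equations

$$\sum_{i = 1}^n (y_i - \beta_1 - \beta_2 x_i) = 0, \qquad \sum_{i = 1}^n (y_i - \beta_1 - \beta_2 x_i)\, x_i = 0$$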

Linear least squares

A regression model is linear when the model is a linear combination of the parameters:

$$f(x, \beta) = \sum_{j = 1}^m \beta_j\phi_j(x)$$

where each basis function $\phi_j$ is a function of $x$ only.

Let

$$X_{ij} = \phi_j(x_i), \qquad Y_i = y_i$$

We can compute the least-squares loss in the following way. Note that $D$ is the set of all data.

$$\begin{align}L(D, \beta) &= \|Y - X\beta\|^2 = (Y-X\beta)^T(Y-X\beta)\\ &= Y^TY - Y^TX\beta -\beta^TX^TY + \beta^TX^TX\beta\end{align}$$

The gradient of the loss is:

$$\frac{\partial L}{\partial\beta} = -2X^TY + 2X^TX\beta$$

Setting the gradient of the loss to zero and solving for $\beta$, we get

$$-2X^TY + 2X^TX\beta = 0 \;\Rightarrow\; X^TY = X^TX\beta$$

So we obtain the solution

$$\hat \beta = (X^TX)^{-1}X^TY$$
💡
$X^TX$ is not always invertible. It is invertible exactly when $X$ has linearly independent columns.
💡
In addition, the same result can be proved using the best-approximation (orthogonal projection) argument.
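
A minimal NumPy sketch of this closed-form solution, using a hypothetical polynomial basis $\phi_j(x) = x^{j-1}$ and made-up data (in practice `np.linalg.lstsq`, which solves the problem via an SVD, is numerically preferable to forming $X^TX$ explicitly):

```python
import numpy as np

# Made-up data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.7, 5.8, 11.2, 17.9])

# Design matrix X_ij = phi_j(x_i) with phi_j(x) = x**(j-1)  (quadratic fit, m = 3)
X = np.vander(x, N=3, increasing=True)

# Normal equations: (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer alternative
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```
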
Gradient
In vector calculus, the gradient of a scalar-valued differentiable function of several variables is the vector field whose value at a point gives the direction and the rate of fastest increase. If the gradient of a function is non-zero at a point $p$, the direction of the gradient is the direction in which the function increases most quickly from $p$, and the magnitude of the gradient is the rate of increase in that direction, the greatest absolute directional derivative. A point where the gradient is the zero vector is known as a stationary point. The gradient thus plays a fundamental role in optimization theory, where it is used to maximize a function by gradient ascent.
https://en.wikipedia.org/wiki/Gradient

Non-linear least squares

In general, there is no closed-form solution, so numerical algorithms are used to find the parameter values $\beta$ that minimize the objective. Most algorithms involve choosing initial values for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation:

$$\beta_j^{k + 1} = \beta_j^k + \Delta \beta_j$$

To align with the textbook's notation, from now on

$$\begin{align} r_i &= f_i(\beta) \\ F(\beta) &= (f_1(\beta), f_2(\beta), \dots, f_m(\beta))\end{align}$$

Let

$$f(\beta) := \frac{1}{2}\sum_{i = 1}^m f_i(\beta)^2 = \frac{1}{2}F(\beta)^TF(\beta)$$
💡
We have scaled the problem by $\frac{1}{2}$ to make its derivative simpler.

Therefore, our objective function can be written as

$$\min_{\beta} f(\beta)$$

By using the chain rule,

$$\nabla f(\beta) = \nabla F(\beta)\, F(\beta)$$
  • Proof

    Actually,

    $$\nabla f(\beta) = \sum_{i = 1}^m \nabla f_i(\beta)\, f_i(\beta)$$

    Let,

    $$\nabla F(\beta) = [\nabla f_1(\beta), \nabla f_2(\beta), \dots, \nabla f_m(\beta)]$$

    So,

    $$\begin{align} \nabla F(\beta)F(\beta) &= [\nabla f_1(\beta), \nabla f_2(\beta), \dots, \nabla f_m(\beta)] \begin{bmatrix} f_1(\beta) \\ f_2(\beta) \\ \vdots \\ f_m(\beta)\end{bmatrix} \\ &= \sum_{i = 1}^m \nabla f_i(\beta)\, f_i(\beta) \\ &= \nabla f(\beta)\end{align}$$

By using the chain rule again,

$$\nabla^2 f(\beta) = \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$
  • Proof

    As we have seen in the previous proof,

    $$\begin{align} \nabla f(\beta) &= \sum_{i = 1}^m \nabla f_i(\beta)\, f_i(\beta) \\ &= \begin{bmatrix}\sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_1}f_i(\beta) \\ \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_2}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_n}f_i(\beta)\end{bmatrix}\end{align}$$

    where $\beta = (\beta_1, \beta_2, \dots, \beta_n)$ and $n$ is the number of parameters.

    That means

    $$\frac{\partial f}{\partial \beta_k} = \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_k}f_i(\beta)$$

    So,

    $$\begin{bmatrix} \frac{\partial^2 f}{\partial \beta_1\partial \beta_k} \\ \frac{\partial^2 f}{\partial \beta_2\partial \beta_k} \\ \vdots \\ \frac{\partial^2 f}{\partial \beta_n\partial \beta_k}\end{bmatrix} = \begin{bmatrix}\sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_n} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_n\partial \beta_k}f_i(\beta)\end{bmatrix}$$

    We can split each term separately

    $$\begin{bmatrix} \frac{\partial^2 f}{\partial \beta_1\partial \beta_k} \\ \frac{\partial^2 f}{\partial \beta_2\partial \beta_k} \\ \vdots \\ \frac{\partial^2 f}{\partial \beta_n\partial \beta_k}\end{bmatrix} = \begin{bmatrix} \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} \\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_n}\end{bmatrix} + \begin{bmatrix} \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_n\partial \beta_k}f_i(\beta)\end{bmatrix}$$

    When we look carefully,

    $$\frac{\partial^2 f_i}{\partial \beta_l\partial \beta_k} = [\nabla^2 f_i(\beta)]_{l,k}$$

    Therefore,

    $$\begin{bmatrix} \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_n\partial \beta_k}f_i(\beta)\end{bmatrix} = \sum_{i = 1}^m \text{col}_k(\nabla^2 f_i(\beta))\, f_i(\beta)$$

    In addition, for each $i$,

    $$\begin{bmatrix} \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_n}\end{bmatrix} = \begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_n}\end{bmatrix}\frac{\partial f_i}{\partial \beta_k} = \nabla f_i(\beta)\,\frac{\partial f_i}{\partial \beta_k}$$

    Therefore, stacking the columns $k = 1, \dots, n$,

    $$\begin{align} \nabla^2 f(\beta) &= \sum_{i = 1}^m \begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_n}\end{bmatrix}\begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} & \frac{\partial f_i}{\partial\beta_2} & \cdots & \frac{\partial f_i}{\partial\beta_n}\end{bmatrix} + \sum_{i = 1}^m\begin{bmatrix} \text{col}_1(\nabla^2 f_i(\beta)) & \text{col}_2(\nabla^2 f_i(\beta)) & \cdots & \text{col}_n(\nabla^2 f_i(\beta)) \end{bmatrix} f_i(\beta) \\ &= \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\, \nabla^2 f_i(\beta)\end{align}$$
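
As a sanity check (not from the text), this identity can be verified numerically on a small hypothetical model, here $f_i(\beta) = y_i - \beta_1 e^{\beta_2 x_i}$ with made-up data, by comparing the analytic expression against a central-difference Hessian:

```python
import numpy as np

# Hypothetical residuals f_i(beta) = y_i - beta[0]*exp(beta[1]*x_i), with made-up data
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.2, 3.1, 7.9])

def F(beta):                          # residual vector F(beta)
    return y - beta[0] * np.exp(beta[1] * x)

def f(beta):                          # objective f(beta) = 0.5 * F(beta)^T F(beta)
    r = F(beta)
    return 0.5 * r @ r

def num_hessian(func, beta, h=1e-4):  # central-difference Hessian of a scalar function
    n = len(beta)
    H = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            ea, eb = np.eye(n)[a] * h, np.eye(n)[b] * h
            H[a, b] = (func(beta + ea + eb) - func(beta + ea - eb)
                       - func(beta - ea + eb) + func(beta - ea - eb)) / (4 * h**2)
    return H

beta = np.array([1.0, 0.8])

# grad F(beta): columns are the gradients of the f_i (rows: d/d beta_1, d/d beta_2)
e = np.exp(beta[1] * x)
gradF = np.vstack([-e, -beta[0] * x * e])

# Per-residual Hessians nabla^2 f_i(beta)
H_fi = [np.array([[0.0, -xi * np.exp(beta[1] * xi)],
                  [-xi * np.exp(beta[1] * xi), -beta[0] * xi**2 * np.exp(beta[1] * xi)]])
        for xi in x]

H_analytic = gradF @ gradF.T + sum(fi * Hi for fi, Hi in zip(F(beta), H_fi))
print(np.allclose(H_analytic, num_hessian(f, beta), atol=1e-4))  # expect True
```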

Optimization Conditions

Recall

  1. First-order necessary condition

    If $\beta^*$ is a local minimum, then $\nabla f(\beta^*) = 0$.

  2. Second-order necessary condition

    If $\beta^*$ is a local minimum, then $\nabla^2 f(\beta^*)$ is positive semi-definite.

  3. Second-order sufficient condition

    If $\nabla f(\beta^*) = 0$ and $\nabla^2 f(\beta^*)$ is positive definite, then $\beta^*$ is a strict local minimum.

💡
In general, there is no necessary and sufficient condition, except for convex problems.

Suppose we have a perfect-fitting solution, i.e. $F(\beta^*) = 0$. Are the optimality conditions satisfied?

It is trivial that

$$\nabla f(\beta^*) = \nabla F(\beta^*)F(\beta^*) = 0$$

What about the Hessian?

$$\nabla^2 f(\beta^*) = \nabla F(\beta^*)\nabla F(\beta^*)^T + \sum_{i = 1}^m f_i(\beta^*)\nabla^2 f_i(\beta^*)$$

But we already know that $f_i(\beta^*) = 0$ for all $i = 1, \dots, m$.

Therefore,

$$\nabla^2 f(\beta^*) = \nabla F(\beta^*)\nabla F(\beta^*)^T$$

and, for all $v \in \mathbb{R}^n$ (where $n$ is the number of parameters),

$$\begin{align}v^T\nabla^2 f(\beta^*)v &= v^T\nabla F(\beta^*)\nabla F(\beta^*)^Tv \\ &= (\nabla F(\beta^*)^Tv)^T(\nabla F(\beta^*)^Tv) \\ &= \|\nabla F(\beta^*)^Tv\|^2 \\ &\ge 0\end{align}$$

Therefore, $\nabla^2 f(\beta^*)$ is positive semi-definite.

💡
If $\nabla F(\beta^*)^T$ has full column rank, $\nabla^2 f(\beta^*)$ is positive definite.

Gauss-Newton Method

Recall that Newton’s method is

$$\nabla^2 f(\beta)\, p = -\nabla f(\beta)$$

Replace

$$\nabla f(\beta) = \nabla F(\beta)\, F(\beta)$$

and

$$\nabla^2 f(\beta) = \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

We get

$$\Big[\nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)\Big]p = -\nabla F(\beta)F(\beta)$$

However,

$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

requires the Hessians of the individual functions $f_i$. If the current point is near the solution, where $F(\beta^*) = 0$, this means that

$$f_i(\beta) \approx 0, \quad \forall i$$

The Gauss-Newton method uses this approximation directly and drops the second-order term.

Therefore,

$$\nabla F(\beta)\nabla F(\beta)^T p = -\nabla F(\beta)F(\beta)$$
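
A minimal sketch of the Gauss-Newton iteration in NumPy, assuming a hypothetical residual function and its Jacobian (the model $y \approx \beta_1 e^{\beta_2 x}$ and the data are made up, and there is no line search or other safeguard):

```python
import numpy as np

# Made-up data for the hypothetical model y ~ beta1 * exp(beta2 * x)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.8, 4.4, 7.5])

def residuals(beta):              # F(beta), shape (m,)
    return y - beta[0] * np.exp(beta[1] * x)

def jacobian(beta):               # J[i, j] = d f_i / d beta_j, so grad F(beta) = J^T
    e = np.exp(beta[1] * x)
    return np.column_stack([-e, -beta[0] * x * e])

beta = np.array([1.0, 0.5])       # initial guess
for _ in range(20):
    F = residuals(beta)
    J = jacobian(beta)
    # Gauss-Newton step: (J^T J) p = -J^T F
    p = np.linalg.solve(J.T @ J, -J.T @ F)
    beta = beta + p
    if np.linalg.norm(p) < 1e-10:
        break

print(beta)
```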

Levenberg-Marquardt Method (LM Algorithm)

If the residuals are large, the Gauss-Newton method can perform poorly.

💡
In addition, if the Jacobian of $F$ is not of full rank at the solution, it can also perform poorly.

One approach to remedy this kind of problem is to use some approximation to the second term in the formula for the Hessian matrix:

$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

The oldest and simplest of these approximations is

$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta) \approx \lambda I$$

where $\lambda \ge 0$.

Therefore,

$$[\nabla F(\beta)\nabla F(\beta)^T + \lambda I]\, p = -\nabla F(\beta)F(\beta)$$

This is referred to as the Levenberg-Marquardt method.
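
Below is a minimal sketch of the damped step together with one simple heuristic for adjusting $\lambda$ (increase it when a step fails to reduce the objective, decrease it when the step is accepted). The `residuals`/`jacobian` arguments are assumed to behave like the hypothetical functions in the Gauss-Newton sketch above; in practice a library routine such as `scipy.optimize.least_squares` is preferable:

```python
import numpy as np

def levenberg_marquardt(residuals, jacobian, beta0, lam=1e-3, iters=50):
    """Minimal LM loop: solve (J^T J + lam*I) p = -J^T F and adapt lam."""
    beta = np.asarray(beta0, dtype=float)
    cost = 0.5 * residuals(beta) @ residuals(beta)
    for _ in range(iters):
        F, J = residuals(beta), jacobian(beta)
        p = np.linalg.solve(J.T @ J + lam * np.eye(len(beta)), -J.T @ F)
        new_beta = beta + p
        new_cost = 0.5 * residuals(new_beta) @ residuals(new_beta)
        if new_cost < cost:       # step accepted: move and relax the damping
            beta, cost, lam = new_beta, new_cost, lam * 0.5
        else:                     # step rejected: increase the damping
            lam *= 2.0
        if np.linalg.norm(p) < 1e-10:
            break
    return beta

# Example (using the hypothetical residuals/jacobian from the Gauss-Newton sketch):
# beta_hat = levenberg_marquardt(residuals, jacobian, np.array([1.0, 0.5]))
```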
