What is the least squares method?
The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of $n$ points $(x_i, y_i)$, where $x_i$ is an independent variable and $y_i$ is a dependent variable whose value is found by observation. The model function has the form $f(x, \beta)$, where the $m$ adjustable parameters are held in the vector $\beta$. The goal is to find the parameter values that best fit the data. The fit of a model to a data point is measured by its residual, defined as the difference between the observed value of the dependent variable and the value predicted by the model:

$$r_i = y_i - f(x_i, \beta)$$

The least-squares method finds the optimal parameter values by minimizing the sum of squared residuals, $S$:
$$S := \sum_{i = 1}^n r_i^2$$
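As a concrete illustration, here is a minimal Python sketch that evaluates the residuals and the sum of squares $S$ for a given parameter vector; the data and the simple linear model $f(x, \beta) = \beta_0 + \beta_1 x$ are made up for the example.

```python
import numpy as np

# Hypothetical data set (x_i, y_i), i = 1..n
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

def model(x, beta):
    # Example model f(x, beta) = beta_0 + beta_1 * x
    return beta[0] + beta[1] * x

def sum_of_squares(beta):
    # Residuals r_i = y_i - f(x_i, beta) and S = sum_i r_i^2
    r = y - model(x, beta)
    return np.sum(r ** 2)

print(sum_of_squares(np.array([1.0, 2.0])))  # S for beta = (1, 2)
```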
Solving the least squares problem
The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains $m$ parameters, there are $m$ gradient equations:

$$\frac{\partial S}{\partial \beta_j} = 2\sum_i r_i\frac{\partial r_i}{\partial \beta_j} = 0, \quad j = 1, \dots, m$$

Since
$$r_i = y_i - f(x_i, \beta)$$

the gradient equations become
$$-2\sum_i r_i\frac{\partial f(x_i, \beta)}{\partial \beta_j} = 0, \quad j = 1, \dots, m$$

Linear least squares
A regression model is linear when it is a linear combination of the parameters:
$$f(x, \beta) = \sum_{j = 1}^m \beta_j\phi_j(x)$$

where each $\phi_j$ is a function of $x$.
Let
$$X_{ij} = \phi_j(x_i), \qquad Y_i = y_i$$

We can compute the least squares solution in the following way. Note that $D$ denotes the set of all data.
$$\begin{aligned}L(D, \beta) &= \|Y - X\beta\|^2 = (Y-X\beta)^T(Y-X\beta)\\ &= Y^TY - Y^TX\beta -\beta^TX^TY + \beta^TX^TX\beta\end{aligned}$$

The gradient of the loss is:
$$\frac{\partial L}{\partial\beta} = -2X^TY + 2X^TX\beta$$

Setting the gradient of the loss to zero and solving for $\beta$, we get
$$-2X^TY + 2X^TX\beta = 0 \;\Rightarrow\; X^TY = X^TX\beta$$

so we obtain the solution
$$\hat \beta = (X^TX)^{-1}X^TY$$

💡
$X^TX$ is not always invertible. It is invertible precisely when $X$ has linearly independent columns.
💡
In addition, this result can be proved using the best approximation method.
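Below is a small Python sketch of the closed-form solution above. The data and the basis $\phi = (1, x, x^2)$ are hypothetical; in practice, solving the normal equations (or using `np.linalg.lstsq`) is preferable to forming $(X^TX)^{-1}$ explicitly.

```python
import numpy as np

# Hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.7, 5.8, 10.9, 17.2])

# Design matrix X_ij = phi_j(x_i) with basis phi = (1, x, x^2)
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and numerically more robust:
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```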
Gradient
In vector calculus, the gradient of a scalar-valued differentiable function $f$ of several variables is the vector field whose value at a point gives the direction and the rate of fastest increase. The gradient transforms like a vector under a change of basis of the space of variables of $f$. If the gradient of a function is non-zero at a point $p$, the direction of the gradient is the direction in which the function increases most quickly from $p$, and the magnitude of the gradient is the rate of increase in that direction, the greatest absolute directional derivative. Further, a point where the gradient is the zero vector is known as a stationary point. The gradient thus plays a fundamental role in optimization theory, where it is used to maximize a function by gradient ascent. In coordinate-free terms, the gradient of a function $f$ may be defined by

$$df = \nabla f \cdot d\mathbf{r}$$

where $df$ is the total differential of $f$.
https://en.wikipedia.org/wiki/Gradient
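As a quick worked example (not part of the quoted article):

$$f(x, y) = x^2 + 3y^2, \qquad \nabla f(x, y) = \begin{bmatrix} 2x \\ 6y \end{bmatrix}, \qquad \nabla f(1, 1) = \begin{bmatrix} 2 \\ 6 \end{bmatrix}$$

At the point $(1, 1)$ the function increases fastest in the direction of $(2, 6)$, at rate $\|(2, 6)\| = 2\sqrt{10}$.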
Non-linear least squares
In general, there is no closed-form solution, so numerical algorithms are used to find the parameter values $\beta$ that minimize the objective. Most algorithms involve choosing initial values for the parameters; the parameters are then refined iteratively, that is, the values are obtained by successive approximation:
$$\beta_j^{k + 1} = \beta_j^k + \Delta \beta_j$$

To align with the textbook, from now on we write
$$\begin{aligned} r_i &= f_i(\beta) \\ F(\beta) &= (f_1(\beta), f_2(\beta), \dots, f_m(\beta))\end{aligned}$$

(Here $m$ denotes the number of residual functions.) Let
$$f(\beta) := \frac{1}{2}\sum_{i = 1}^m f_i(\beta)^2 = \frac{1}{2}F(\beta)^TF(\beta)$$

💡
We have scaled the problem by $\frac{1}{2}$ to make its derivative simpler.
Therefore, our objective function can be written as
$$\min_{\beta} f(\beta)$$

By using the chain rule,
$$\nabla f(\beta) = \nabla F(\beta) F(\beta)$$

Proof
Actually,
$$\nabla f(\beta) = \sum_{i = 1}^m \nabla f_i(\beta)f_i(\beta)$$

Let
$$\nabla F(\beta) = [\nabla f_1(\beta), \nabla f_2(\beta), \dots, \nabla f_m(\beta)]$$

So,
$$\begin{aligned} \nabla F(\beta)F(\beta) &= [\nabla f_1(\beta), \nabla f_2(\beta), \dots, \nabla f_m(\beta)] \begin{bmatrix} f_1(\beta) \\ f_2(\beta) \\ \vdots \\ f_m(\beta)\end{bmatrix} \\ &= \sum_{i = 1}^m \nabla f_i(\beta)f_i(\beta) \\ &= \nabla f(\beta)\end{aligned}$$

By using the chain rule again,
$$\nabla^2 f(\beta) = \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

Proof
As we have seen in the previous proof,
$$\begin{aligned} \nabla f(\beta) &= \sum_{i = 1}^m \nabla f_i(\beta)f_i(\beta) \\ &= \begin{bmatrix}\sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_1}f_i(\beta) \\ \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_2}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_j}f_i(\beta)\end{bmatrix}\end{aligned}$$

where $\beta = (\beta_1, \beta_2, \dots, \beta_j)$.
That means
$$\frac{\partial f}{\partial \beta_k} = \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_k}f_i(\beta)$$

Differentiating this with respect to each $\beta_l$ and applying the product rule, the $k$-th column of the Hessian is
$$\begin{bmatrix} \frac{\partial^2 f}{\partial \beta_1\partial \beta_k} \\ \frac{\partial^2 f}{\partial \beta_2\partial \beta_k} \\ \vdots \\ \frac{\partial^2 f}{\partial \beta_j\partial \beta_k}\end{bmatrix} = \begin{bmatrix}\sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_j} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_j\partial \beta_k}f_i(\beta)\end{bmatrix}$$

We can split each term separately:
$$\begin{bmatrix} \frac{\partial^2 f}{\partial \beta_1\partial \beta_k} \\ \frac{\partial^2 f}{\partial \beta_2\partial \beta_k} \\ \vdots \\ \frac{\partial^2 f}{\partial \beta_j\partial \beta_k}\end{bmatrix} = \sum_{i = 1}^m\begin{bmatrix} \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_j}\end{bmatrix} + \begin{bmatrix} \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_j\partial \beta_k}f_i(\beta)\end{bmatrix}$$

Looking carefully at the second term,
$$\frac{\partial^2 f_i}{\partial \beta_l\partial \beta_k} = [\nabla^2 f_i(\beta)]_{l,k}$$

Therefore,
$$\begin{bmatrix} \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_j\partial \beta_k}f_i(\beta)\end{bmatrix} = \sum_{i = 1}^m \text{col}_k(\nabla^2 f_i(\beta))f_i(\beta)$$

In addition, for each $i$,
$$\begin{bmatrix} \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_j}\end{bmatrix} = \begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_j}\end{bmatrix}\frac{\partial f_i}{\partial \beta_k} = \nabla f_i(\beta)\,\frac{\partial f_i}{\partial \beta_k}$$

Collecting the columns $k = 1, \dots, j$, we therefore get
$$\begin{aligned} \nabla^2 f(\beta) &= \sum_{i = 1}^m\begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_j}\end{bmatrix}\begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} & \frac{\partial f_i}{\partial\beta_2} & \cdots & \frac{\partial f_i}{\partial\beta_j} \end{bmatrix} + \sum_{i = 1}^m\begin{bmatrix} \text{col}_1(\nabla^2 f_i(\beta)) & \text{col}_2(\nabla^2 f_i(\beta)) & \cdots & \text{col}_j(\nabla^2 f_i(\beta)) \end{bmatrix}f_i(\beta) \\ &= \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta) \nabla^2 f_i(\beta)\end{aligned}$$
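The identities above can be checked numerically. Here is a minimal Python sketch that verifies the gradient identity $\nabla f(\beta) = \nabla F(\beta)F(\beta)$ by comparing it against a finite-difference gradient of $f$; the residual functions are made up purely for the check, and the Hessian identity can be verified the same way.

```python
import numpy as np

# A small residual vector F(beta) = (f_1(beta), ..., f_m(beta));
# the specific residuals below are hypothetical, chosen only for the check.
def F(beta):
    b0, b1 = beta
    t = np.array([0.5, 1.0, 1.5])
    return b0 * np.exp(b1 * t) - np.array([1.0, 2.0, 3.5])

def f(beta):
    # f(beta) = 1/2 F(beta)^T F(beta)
    r = F(beta)
    return 0.5 * r @ r

def grad_F_fd(beta, eps=1e-6):
    # nabla F(beta): the matrix whose columns are nabla f_i(beta),
    # estimated here by central finite differences
    n, m = len(beta), len(F(beta))
    G = np.zeros((n, m))
    for k in range(n):
        e = np.zeros(n); e[k] = eps
        G[k] = (F(beta + e) - F(beta - e)) / (2 * eps)
    return G

beta = np.array([0.8, 0.9])
G = grad_F_fd(beta)
identity_grad = G @ F(beta)   # claim: equals nabla f(beta)

# Direct finite-difference gradient of f for comparison
num_grad = np.array([
    (f(beta + np.array([1e-6, 0.0])) - f(beta - np.array([1e-6, 0.0]))) / 2e-6,
    (f(beta + np.array([0.0, 1e-6])) - f(beta - np.array([0.0, 1e-6]))) / 2e-6,
])

print(np.allclose(identity_grad, num_grad, atol=1e-4))  # expected: True
```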
Optimization Conditions
Recall:
First-order necessary condition: If $\beta^*$ is a local minimum, then $\nabla f(\beta^*) = 0$.
Second-order necessary condition: If $\beta^*$ is a local minimum, then $\nabla^2 f(\beta^*)$ is positive semi-definite.
Second-order sufficient condition: If $\nabla f(\beta^*) = 0$ and $\nabla^2 f(\beta^*)$ is positive definite, then $\beta^*$ is a strict local minimum.
💡
In general, there is no condition that is both necessary and sufficient, except for convex problems.
Suppose we have a perfect-fitting solution $\beta^*$, i.e. $F(\beta^*) = 0$. Are the optimality conditions satisfied?
It is trivial that
$$\nabla f(\beta^*) = \nabla F(\beta^*)F(\beta^*) = 0$$

What about the Hessian?
$$\nabla^2 f(\beta^*) = \nabla F(\beta^*)\nabla F(\beta^*)^T + \sum_{i = 1}^m f_i(\beta^*)\nabla^2 f_i(\beta^*)$$

But we already know that $f_i(\beta^*) = 0$ for all $i = 1, \dots, m$.
Therefore,
$$\nabla^2 f(\beta^*) = \nabla F(\beta^*)\nabla F(\beta^*)^T$$

For any vector $v$ in the parameter space,

$$\begin{aligned}v^T\nabla^2 f(\beta^*)v &= v^T\nabla F(\beta^*)\nabla F(\beta^*)^Tv \\ &= (\nabla F(\beta^*)^Tv)^T(\nabla F(\beta^*)^Tv) \\ &= \|\nabla F(\beta^*)^Tv\|^2 \\ &\ge 0\end{aligned}$$
Therefore, $\nabla^2 f(\beta^*)$ is positive semi-definite.
💡
If $\nabla F(\beta^*)^T$ has full column rank, then $\nabla^2 f(\beta^*)$ is positive definite.
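A quick numerical illustration of these two facts, with a made-up $\nabla F(\beta^*)$: the Hessian at a perfect fit, $\nabla F\,\nabla F^T$, always has non-negative eigenvalues, and they are all strictly positive only when $\nabla F^T$ has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nabla F(beta^*): columns are the gradients of the residuals
# at a zero-residual solution.
G_full = rng.standard_normal((2, 5))             # G^T has full column rank (generically)
G_deficient = np.vstack([G_full[0], G_full[0]])  # second row duplicates the first -> rank 1

for G in (G_full, G_deficient):
    H = G @ G.T                      # Hessian at a perfect fit
    eigvals = np.linalg.eigvalsh(H)  # >= 0 up to rounding; all > 0 only in the full-rank case
    print(eigvals)
```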
Gauss-Newton Method
Recall that Newton's method is
$$\nabla^2 f(\beta)p = -\nabla f(\beta)$$

Replace
$$\nabla f(\beta) = \nabla F(\beta) F(\beta)$$

and
$$\nabla^2 f(\beta) = \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

We get
$$\left[\nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)\right]p = -\nabla F(\beta)F(\beta)$$

However, the term
$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

requires the Hessians of the individual functions $f_i$, which can be costly to compute. If the current point is near a zero-residual solution (where $F(\beta^*) = 0$), then
$$f_i(\beta) \approx 0, \quad \forall i$$

The Gauss-Newton method uses this approximation directly and drops the second-order term.
Therefore,
$$\nabla F(\beta)\nabla F(\beta)^T p = -\nabla F(\beta)F(\beta)$$
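Below is a minimal Python sketch of the Gauss-Newton iteration above, with a hypothetical exponential model and data, and no line search or other safeguards (so it assumes a reasonable starting point).

```python
import numpy as np

# Hypothetical data for the model y = b0 * exp(b1 * x)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.6, 7.4])

def residuals(beta):
    # f_i(beta) = y_i - b0 * exp(b1 * x_i)
    return y - beta[0] * np.exp(beta[1] * x)

def grad_F(beta):
    # nabla F(beta): columns are the gradients nabla f_i(beta)
    e = np.exp(beta[1] * x)
    return np.vstack([-e, -beta[0] * x * e])

def gauss_newton(beta, iters=10):
    for _ in range(iters):
        G, F = grad_F(beta), residuals(beta)
        # Gauss-Newton step: [nabla F nabla F^T] p = -nabla F F
        p = np.linalg.solve(G @ G.T, -G @ F)
        beta = beta + p
    return beta

print(gauss_newton(np.array([0.5, 0.5])))  # for this data the fit should be roughly (1, 1)
```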
Levenberg-Marquardt Method (LM Algorithm)
If the residuals are large, the Gauss-Newton method can perform poorly.
💡
In addition, if the Jacobian of $F$ is not of full rank at the solution, it can also perform poorly.
One approach to remedy this kind of problem is to use some approximation to the second term in the formula for the Hessian matrix,
$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

The oldest and simplest of these approximations is
$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta) \approx \lambda I$$

where $\lambda \ge 0$.
Therefore,
$$[\nabla F(\beta)\nabla F(\beta)^T + \lambda I]p = -\nabla F(\beta)F(\beta)$$

This is referred to as the Levenberg-Marquardt method.
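Below is a minimal Python sketch of the Levenberg-Marquardt step, reusing the same hypothetical exponential model and data as before; the simple doubling/halving rule for $\lambda$ is just one common heuristic, not the only choice.

```python
import numpy as np

# Hypothetical data for the model y = b0 * exp(b1 * x)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.6, 7.4])

def residuals(beta):
    # f_i(beta) = y_i - b0 * exp(b1 * x_i)
    return y - beta[0] * np.exp(beta[1] * x)

def grad_F(beta):
    # nabla F(beta): columns are the gradients of the residuals
    e = np.exp(beta[1] * x)
    return np.vstack([-e, -beta[0] * x * e])

def levenberg_marquardt(beta, lam=1.0, iters=50):
    cost = 0.5 * residuals(beta) @ residuals(beta)
    for _ in range(iters):
        G, F = grad_F(beta), residuals(beta)
        # Damped normal equations: [nabla F nabla F^T + lambda I] p = -nabla F F
        p = np.linalg.solve(G @ G.T + lam * np.eye(len(beta)), -G @ F)
        new_beta = beta + p
        new_cost = 0.5 * residuals(new_beta) @ residuals(new_beta)
        if new_cost < cost:   # accept the step and relax the damping
            beta, cost, lam = new_beta, new_cost, lam * 0.5
        else:                 # reject the step and increase the damping
            lam *= 2.0
    return beta

print(levenberg_marquardt(np.array([0.5, 0.5])))
```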