What is the least squares method?
The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of $n$ points $(x_i, y_i)$, where $x_i$ is an independent variable and $y_i$ is a dependent variable whose value is found by observation. The model function has the form $f(x, \beta)$, where the $m$ adjustable parameters are held in the vector $\beta$. The goal is to find the parameter values that best fit the data. The fit of a model to a data point is measured by its residual, defined as the difference between the observed value of the dependent variable and the value predicted by the model:

$$r_i = y_i - f(x_i, \beta)$$

The least-squares method finds the optimal parameter values by minimizing the sum of squared residuals, $S$:
$$S := \sum_{i = 1}^n r_i^2$$
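As a concrete illustration, here is a minimal Python sketch that evaluates the residuals and the sum of squares $S$ for a given parameter vector; the data and the simple linear model $f(x, \beta) = \beta_0 + \beta_1 x$ are made up for the example.

```python
import numpy as np

# Hypothetical data set (x_i, y_i), i = 1..n
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

def model(x, beta):
    # Example model f(x, beta) = beta_0 + beta_1 * x
    return beta[0] + beta[1] * x

def sum_of_squares(beta):
    # Residuals r_i = y_i - f(x_i, beta) and S = sum_i r_i^2
    r = y - model(x, beta)
    return np.sum(r ** 2)

print(sum_of_squares(np.array([1.0, 2.0])))  # S for beta = (1, 2)
```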
Solving the least squares problem
The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains $m$ parameters, there are $m$ gradient equations:

$$\frac{\partial S}{\partial \beta_j} = 2\sum_i r_i\frac{\partial r_i}{\partial \beta_j} = 0, \quad j = 1, \dots, m$$

Since
$$r_i = y_i - f(x_i, \beta)$$

the gradient equations become
$$-2\sum_i r_i\frac{\partial f(x_i, \beta)}{\partial \beta_j} = 0, \quad j = 1, \dots, m$$

Linear least squares
A regression model is linear when it is a linear combination of the parameters:
$$f(x, \beta) = \sum_{j = 1}^m \beta_j\phi_j(x)$$

where each $\phi_j$ is a function of $x$.
Let
$$X_{ij} = \phi_j(x_i), \qquad Y_i = y_i$$

We can compute the least squares solution in the following way. Note that $D$ denotes the set of all data.
$$\begin{aligned}L(D, \beta) &= \|Y - X\beta\|^2 = (Y-X\beta)^T(Y-X\beta)\\ &= Y^TY - Y^TX\beta -\beta^TX^TY + \beta^TX^TX\beta\end{aligned}$$

The gradient of the loss is:
$$\frac{\partial L}{\partial\beta} = -2X^TY + 2X^TX\beta$$

Setting the gradient of the loss to zero and solving for $\beta$, we get
$$-2X^TY + 2X^TX\beta = 0 \;\Rightarrow\; X^TY = X^TX\beta$$

so we obtain the solution
$$\hat \beta = (X^TX)^{-1}X^TY$$

💡
$X^TX$ is not always invertible. It is invertible precisely when $X$ has linearly independent columns.
💡
In addition, this result can be proved using the best approximation method.
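Below is a small Python sketch of the closed-form solution above. The data and the basis $\phi = (1, x, x^2)$ are hypothetical; in practice, solving the normal equations (or using `np.linalg.lstsq`) is preferable to forming $(X^TX)^{-1}$ explicitly.

```python
import numpy as np

# Hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.7, 5.8, 10.9, 17.2])

# Design matrix X_ij = phi_j(x_i) with basis phi = (1, x, x^2)
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and numerically more robust:
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```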
Gradient
In vector calculus, the gradient of a scalar-valued differentiable function $f$ of several variables is the vector field whose value at a point gives the direction and the rate of fastest increase. The gradient transforms like a vector under a change of basis of the space of variables of $f$. If the gradient of a function is non-zero at a point $p$, the direction of the gradient is the direction in which the function increases most quickly from $p$, and the magnitude of the gradient is the rate of increase in that direction, the greatest absolute directional derivative. Further, a point where the gradient is the zero vector is known as a stationary point. The gradient thus plays a fundamental role in optimization theory, where it is used to maximize a function by gradient ascent. In coordinate-free terms, the gradient of a function $f$ may be defined by

$$df = \nabla f \cdot d\mathbf{r}$$

where $df$ is the total differential of $f$.
https://en.wikipedia.org/wiki/Gradient
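As a quick worked example (not part of the quoted article):

$$f(x, y) = x^2 + 3y^2, \qquad \nabla f(x, y) = \begin{bmatrix} 2x \\ 6y \end{bmatrix}, \qquad \nabla f(1, 1) = \begin{bmatrix} 2 \\ 6 \end{bmatrix}$$

At the point $(1, 1)$ the function increases fastest in the direction of $(2, 6)$, at rate $\|(2, 6)\| = 2\sqrt{10}$.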
Non-linear least squares
In general, there is no closed-form solution, so numerical algorithms are used to find the parameter values $\beta$ that minimize the objective. Most algorithms involve choosing initial values for the parameters; the parameters are then refined iteratively, that is, the values are obtained by successive approximation:
$$\beta_j^{k + 1} = \beta_j^k + \Delta \beta_j$$

To align with the textbook, from now on we write
$$\begin{aligned} r_i &= f_i(\beta) \\ F(\beta) &= (f_1(\beta), f_2(\beta), \dots, f_m(\beta))\end{aligned}$$

(Here $m$ denotes the number of residual functions.) Let
$$f(\beta) := \frac{1}{2}\sum_{i = 1}^m f_i(\beta)^2 = \frac{1}{2}F(\beta)^TF(\beta)$$

💡
We have scaled the problem by $\frac{1}{2}$ to make its derivative simpler.
Therefore, our objective function can be written as
$$\min_{\beta} f(\beta)$$

By using the chain rule,
$$\nabla f(\beta) = \nabla F(\beta) F(\beta)$$

Proof
Actually,
$$\nabla f(\beta) = \sum_{i = 1}^m \nabla f_i(\beta)f_i(\beta)$$

Let
$$\nabla F(\beta) = [\nabla f_1(\beta), \nabla f_2(\beta), \dots, \nabla f_m(\beta)]$$

So,
$$\begin{aligned} \nabla F(\beta)F(\beta) &= [\nabla f_1(\beta), \nabla f_2(\beta), \dots, \nabla f_m(\beta)] \begin{bmatrix} f_1(\beta) \\ f_2(\beta) \\ \vdots \\ f_m(\beta)\end{bmatrix} \\ &= \sum_{i = 1}^m \nabla f_i(\beta)f_i(\beta) \\ &= \nabla f(\beta)\end{aligned}$$

By using the chain rule again,
$$\nabla^2 f(\beta) = \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

Proof
As we have seen in the previous proof,
$$\begin{aligned} \nabla f(\beta) &= \sum_{i = 1}^m \nabla f_i(\beta)f_i(\beta) \\ &= \begin{bmatrix}\sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_1}f_i(\beta) \\ \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_2}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_j}f_i(\beta)\end{bmatrix}\end{aligned}$$

where $\beta = (\beta_1, \beta_2, \dots, \beta_j)$.
That means
$$\frac{\partial f}{\partial \beta_k} = \sum_{i = 1}^m \frac{\partial f_i}{\partial \beta_k}f_i(\beta)$$

Differentiating this with respect to each $\beta_l$ and applying the product rule, the $k$-th column of the Hessian is
$$\begin{bmatrix} \frac{\partial^2 f}{\partial \beta_1\partial \beta_k} \\ \frac{\partial^2 f}{\partial \beta_2\partial \beta_k} \\ \vdots \\ \frac{\partial^2 f}{\partial \beta_j\partial \beta_k}\end{bmatrix} = \begin{bmatrix}\sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_j} + \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_j\partial \beta_k}f_i(\beta)\end{bmatrix}$$

We can split each term separately:
$$\begin{bmatrix} \frac{\partial^2 f}{\partial \beta_1\partial \beta_k} \\ \frac{\partial^2 f}{\partial \beta_2\partial \beta_k} \\ \vdots \\ \frac{\partial^2 f}{\partial \beta_j\partial \beta_k}\end{bmatrix} = \sum_{i = 1}^m\begin{bmatrix} \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_j}\end{bmatrix} + \begin{bmatrix} \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_j\partial \beta_k}f_i(\beta)\end{bmatrix}$$

Looking carefully at the second term,
$$\frac{\partial^2 f_i}{\partial \beta_l\partial \beta_k} = [\nabla^2 f_i(\beta)]_{l,k}$$

Therefore,
$$\begin{bmatrix} \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_1\partial \beta_k}f_i(\beta)\\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_2\partial \beta_k}f_i(\beta)\\ \vdots \\ \sum_{i = 1}^m\frac{\partial^2 f_i}{\partial\beta_j\partial \beta_k}f_i(\beta)\end{bmatrix} = \sum_{i = 1}^m \text{col}_k(\nabla^2 f_i(\beta))f_i(\beta)$$

In addition, for each $i$,
$$\begin{bmatrix} \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_k}\frac{\partial f_i}{\partial\beta_j}\end{bmatrix} = \begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_j}\end{bmatrix}\frac{\partial f_i}{\partial \beta_k} = \nabla f_i(\beta)\,\frac{\partial f_i}{\partial \beta_k}$$

Collecting the columns $k = 1, \dots, j$, we therefore get
$$\begin{aligned} \nabla^2 f(\beta) &= \sum_{i = 1}^m\begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} \\ \frac{\partial f_i}{\partial\beta_2} \\ \vdots \\ \frac{\partial f_i}{\partial\beta_j}\end{bmatrix}\begin{bmatrix}\frac{\partial f_i}{\partial\beta_1} & \frac{\partial f_i}{\partial\beta_2} & \cdots & \frac{\partial f_i}{\partial\beta_j} \end{bmatrix} + \sum_{i = 1}^m\begin{bmatrix} \text{col}_1(\nabla^2 f_i(\beta)) & \text{col}_2(\nabla^2 f_i(\beta)) & \cdots & \text{col}_j(\nabla^2 f_i(\beta)) \end{bmatrix}f_i(\beta) \\ &= \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta) \nabla^2 f_i(\beta)\end{aligned}$$
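The identities above can be checked numerically. Here is a minimal Python sketch that verifies the gradient identity $\nabla f(\beta) = \nabla F(\beta)F(\beta)$ by comparing it against a finite-difference gradient of $f$; the residual functions are made up purely for the check, and the Hessian identity can be verified the same way.

```python
import numpy as np

# A small residual vector F(beta) = (f_1(beta), ..., f_m(beta));
# the specific residuals below are hypothetical, chosen only for the check.
def F(beta):
    b0, b1 = beta
    t = np.array([0.5, 1.0, 1.5])
    return b0 * np.exp(b1 * t) - np.array([1.0, 2.0, 3.5])

def f(beta):
    # f(beta) = 1/2 F(beta)^T F(beta)
    r = F(beta)
    return 0.5 * r @ r

def grad_F_fd(beta, eps=1e-6):
    # nabla F(beta): the matrix whose columns are nabla f_i(beta),
    # estimated here by central finite differences
    n, m = len(beta), len(F(beta))
    G = np.zeros((n, m))
    for k in range(n):
        e = np.zeros(n); e[k] = eps
        G[k] = (F(beta + e) - F(beta - e)) / (2 * eps)
    return G

beta = np.array([0.8, 0.9])
G = grad_F_fd(beta)
identity_grad = G @ F(beta)   # claim: equals nabla f(beta)

# Direct finite-difference gradient of f for comparison
num_grad = np.array([
    (f(beta + np.array([1e-6, 0.0])) - f(beta - np.array([1e-6, 0.0]))) / 2e-6,
    (f(beta + np.array([0.0, 1e-6])) - f(beta - np.array([0.0, 1e-6]))) / 2e-6,
])

print(np.allclose(identity_grad, num_grad, atol=1e-4))  # expected: True
```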
Optimization Conditions
Recall:
First-order necessary condition: If $\beta^*$ is a local minimum, then $\nabla f(\beta^*) = 0$.
Second-order necessary condition: If $\beta^*$ is a local minimum, then $\nabla^2 f(\beta^*)$ is positive semi-definite.
Second-order sufficient condition: If $\nabla f(\beta^*) = 0$ and $\nabla^2 f(\beta^*)$ is positive definite, then $\beta^*$ is a strict local minimum.
💡
In general, there is no condition that is both necessary and sufficient, except for convex problems.
Suppose we have a perfect-fitting solution $\beta^*$, i.e. $F(\beta^*) = 0$. Are the optimality conditions satisfied?
It is trivial that
$$\nabla f(\beta^*) = \nabla F(\beta^*)F(\beta^*) = 0$$

What about the Hessian?
$$\nabla^2 f(\beta^*) = \nabla F(\beta^*)\nabla F(\beta^*)^T + \sum_{i = 1}^m f_i(\beta^*)\nabla^2 f_i(\beta^*)$$

But we already know that $f_i(\beta^*) = 0$ for all $i = 1, \dots, m$.
Therefore,
$$\nabla^2 f(\beta^*) = \nabla F(\beta^*)\nabla F(\beta^*)^T$$

For any vector $v$ in the parameter space,

$$\begin{aligned}v^T\nabla^2 f(\beta^*)v &= v^T\nabla F(\beta^*)\nabla F(\beta^*)^Tv \\ &= (\nabla F(\beta^*)^Tv)^T(\nabla F(\beta^*)^Tv) \\ &= \|\nabla F(\beta^*)^Tv\|^2 \\ &\ge 0\end{aligned}$$
Therefore, $\nabla^2 f(\beta^*)$ is positive semi-definite.
💡
If $\nabla F(\beta^*)^T$ has full column rank, then $\nabla^2 f(\beta^*)$ is positive definite.
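A quick numerical illustration of these two facts, with a made-up $\nabla F(\beta^*)$: the Hessian at a perfect fit, $\nabla F\,\nabla F^T$, always has non-negative eigenvalues, and they are all strictly positive only when $\nabla F^T$ has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nabla F(beta^*): columns are the gradients of the residuals
# at a zero-residual solution.
G_full = rng.standard_normal((2, 5))             # G^T has full column rank (generically)
G_deficient = np.vstack([G_full[0], G_full[0]])  # second row duplicates the first -> rank 1

for G in (G_full, G_deficient):
    H = G @ G.T                      # Hessian at a perfect fit
    eigvals = np.linalg.eigvalsh(H)  # >= 0 up to rounding; all > 0 only in the full-rank case
    print(eigvals)
```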
Gauss-Newton Method
Recall that Newton's method is
$$\nabla^2 f(\beta)p = -\nabla f(\beta)$$

Replace
$$\nabla f(\beta) = \nabla F(\beta) F(\beta)$$

and
$$\nabla^2 f(\beta) = \nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

We get
$$\left[\nabla F(\beta)\nabla F(\beta)^T + \sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)\right]p = -\nabla F(\beta)F(\beta)$$

However, the term
$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

requires the Hessians of the individual functions $f_i$, which can be costly to compute. If the current point is near a zero-residual solution (where $F(\beta^*) = 0$), then
$$f_i(\beta) \approx 0, \quad \forall i$$

The Gauss-Newton method uses this approximation directly and drops the second-order term.
Therefore,
$$\nabla F(\beta)\nabla F(\beta)^T p = -\nabla F(\beta)F(\beta)$$
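Below is a minimal Python sketch of the Gauss-Newton iteration above, with a hypothetical exponential model and data, and no line search or other safeguards (so it assumes a reasonable starting point).

```python
import numpy as np

# Hypothetical data for the model y = b0 * exp(b1 * x)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.6, 7.4])

def residuals(beta):
    # f_i(beta) = y_i - b0 * exp(b1 * x_i)
    return y - beta[0] * np.exp(beta[1] * x)

def grad_F(beta):
    # nabla F(beta): columns are the gradients nabla f_i(beta)
    e = np.exp(beta[1] * x)
    return np.vstack([-e, -beta[0] * x * e])

def gauss_newton(beta, iters=10):
    for _ in range(iters):
        G, F = grad_F(beta), residuals(beta)
        # Gauss-Newton step: [nabla F nabla F^T] p = -nabla F F
        p = np.linalg.solve(G @ G.T, -G @ F)
        beta = beta + p
    return beta

print(gauss_newton(np.array([0.5, 0.5])))  # for this data the fit should be roughly (1, 1)
```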
Levenberg-Marquardt Method (LM Algorithm)
If the residuals are large, the Gauss-Newton method can perform poorly.
💡
In addition, if the Jacobian of $F$ is not of full rank at the solution, it can also perform poorly.
One approach to remedy this kind of problem is to use some approximation to the second term in the formula for the Hessian matrix,
$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta)$$

The oldest and simplest of these approximations is
$$\sum_{i = 1}^m f_i(\beta)\nabla^2 f_i(\beta) \approx \lambda I$$

where $\lambda \ge 0$.
Therefore,
$$[\nabla F(\beta)\nabla F(\beta)^T + \lambda I]p = -\nabla F(\beta)F(\beta)$$

This is referred to as the Levenberg-Marquardt method.
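Below is a minimal Python sketch of the Levenberg-Marquardt step, reusing the same hypothetical exponential model and data as before; the simple doubling/halving rule for $\lambda$ is just one common heuristic, not the only choice.

```python
import numpy as np

# Hypothetical data for the model y = b0 * exp(b1 * x)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.6, 7.4])

def residuals(beta):
    # f_i(beta) = y_i - b0 * exp(b1 * x_i)
    return y - beta[0] * np.exp(beta[1] * x)

def grad_F(beta):
    # nabla F(beta): columns are the gradients of the residuals
    e = np.exp(beta[1] * x)
    return np.vstack([-e, -beta[0] * x * e])

def levenberg_marquardt(beta, lam=1.0, iters=50):
    cost = 0.5 * residuals(beta) @ residuals(beta)
    for _ in range(iters):
        G, F = grad_F(beta), residuals(beta)
        # Damped normal equations: [nabla F nabla F^T + lambda I] p = -nabla F F
        p = np.linalg.solve(G @ G.T + lam * np.eye(len(beta)), -G @ F)
        new_beta = beta + p
        new_cost = 0.5 * residuals(new_beta) @ residuals(new_beta)
        if new_cost < cost:   # accept the step and relax the damping
            beta, cost, lam = new_beta, new_cost, lam * 0.5
        else:                 # reject the step and increase the damping
            lam *= 2.0
    return beta

print(levenberg_marquardt(np.array([0.5, 0.5])))
```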