
7. Support Vector Machine

Functional Margin and Geometric Margin
What is the difference

The key difference between the functional margin and the geometric margin is that the geometric margin is the functional margin normalized by $\|w\|$: it measures the actual Euclidean distance from a data point to the decision boundary and is invariant to rescaling of $(w, b)$, whereas the functional margin is not. The geometric margin is often used as a measure of the generalization performance of a linear classifier, as it captures how well the classifier can separate new data points that are similar to the training data.

Functional Margin

Given a training example $(x^{(i)}, y^{(i)})$, we define the functional margin of $(w, b)$ with respect to the training example as

$$\hat r^{(i)} = y^{(i)}(w^Tx^{(i)} + b)$$

Note that if $y^{(i)} = 1$, then for the functional margin to be large (i.e. for our prediction to be confident and correct), we need $w^Tx^{(i)} + b$ to be a large positive number. Conversely, if $y^{(i)} = -1$, then for the functional margin to be large, we need $w^Tx^{(i)} + b$ to be a large negative number.

💡
$y^{(i)}(w^Tx^{(i)} + b) > 0$ means that our prediction on this example is correct (i.e. a large functional margin represents a confident and correct prediction).

Given a training set $S = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$, we can define the functional margin of $(w, b)$ with respect to $S$ to be the smallest of the functional margins of the individual training examples. Denoted by $\hat r$, this can be written as

$$\hat r = \min_{i = 1, \dots, m}\hat r^{(i)}$$
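
For concreteness, here is a minimal NumPy sketch (with a toy data set and an arbitrarily chosen $(w, b)$ of my own, not from the original notes) that computes the functional margin of each training example and of the whole training set:

```python
import numpy as np

# Toy training set: rows of X are the x^(i), labels y^(i) are in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-0.5, -1.0]])
y = np.array([1, 1, -1, -1])

# An arbitrary candidate classifier (w, b)
w = np.array([1.0, 1.0])
b = -0.5

# Functional margin of each example: y^(i) * (w^T x^(i) + b)
functional_margins = y * (X @ w + b)
print(functional_margins)        # [3.5 3.5 3.5 2. ]

# Functional margin of (w, b) with respect to the whole set: the minimum
print(functional_margins.min())  # 2.0
```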

However, there is one property of the functional margin that makes it a poor measure of confidence: if we replace $w$ with $2w$ and $b$ with $2b$, then since $g(w^Tx + b) = g(2w^Tx + 2b)$, the output of the classifier does not change (the classifier, through $g$, depends only on the sign of $w^Tx + b$, not on its magnitude), yet the functional margin doubles.

💡
where $g(z) = 1$ if $z \ge 0$, and $g(z) = -1$ otherwise

This means that we can scale $w$ and $b$ without changing the result of the classifier. Intuitively, it might therefore make sense to impose some sort of normalization condition such as $\|w\|_2 = 1$ (i.e. we can replace $w$ and $b$ with $w/\|w\|_2$ and $b/\|w\|_2$ respectively). This is where the concept of the geometric margin arises.

Geometric margin

We want to show that the geometric margin is not only the normalized functional margin but also the actual geometric distance between a data point and the decision boundary. Let $A$ denote the point $x^{(i)}$, and let $B$ be its projection onto the decision boundary; since $w/\|w\|$ is the unit normal to the boundary, $B$ is given by $x^{(i)} - r^{(i)}\cdot w/\|w\|$, where $r^{(i)}$ is the distance we are looking for. Because the point $B$ lies on the decision boundary, it satisfies

$$w^T\left(x^{(i)} - r^{(i)}\frac{w}{\|w\|}\right) + b = 0$$

Solving for $r^{(i)}$ yields

$$r^{(i)} = \left(\frac{w}{\|w\|}\right)^Tx^{(i)} + \frac{b}{\|w\|}$$
💡
Therefore, the geometric margin should be interpreted both as the normalized functional margin and as the geometric distance between a data point and the decision boundary. (More generally, to cover negative examples as well, the geometric margin is defined as $r^{(i)} = y^{(i)}\left(\left(\frac{w}{\|w\|}\right)^Tx^{(i)} + \frac{b}{\|w\|}\right)$, i.e. $r^{(i)} = \hat r^{(i)}/\|w\|$.)

Therefore, the geometric margin is invariant to rescaling of the parameters.
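
A quick NumPy check (same hypothetical toy setup as above) illustrates this invariance: scaling $(w, b)$ by a positive constant doubles the functional margin but leaves the geometric margin unchanged.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-0.5, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -0.5

def functional_margin(w, b):
    # smallest functional margin over the training set
    return (y * (X @ w + b)).min()

def geometric_margin(w, b):
    # geometric margin = functional margin / ||w||
    return functional_margin(w, b) / np.linalg.norm(w)

print(functional_margin(w, b), functional_margin(2 * w, 2 * b))  # 2.0  4.0
print(geometric_margin(w, b), geometric_margin(2 * w, 2 * b))    # ~1.414  ~1.414
```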

In addition, given a training set $S = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$, we also define the geometric margin of $(w, b)$ with respect to $S$ to be the smallest of the geometric margins of the individual training examples:

$$r = \min_{i = 1, \dots, m}r^{(i)}$$

The optimal margin classifier

Given a training set, we want to find the optimal decision boundary that maximizes the geometric margin. We will assume that we are given a training set that is linearly separable.

💡
If the training set is not linearly separable, the problem can still be handled by using regularization. This concept will be discussed later.

So, we want to solve the following optimization problem

$$\max_{r, w, b} \; r \\ \text{subject to } y^{(i)}(w^Tx^{(i)} + b) \ge r, \quad i = 1,\dots, m \\ \text{and } \|w\| = 1$$

But this optimization problem has the "$\|w\| = 1$" constraint, which is not convex. So, we reformulate the above optimization problem as follows

$$\max_{\hat r, w, b}\frac{\hat r}{\|w\|} \\ \text{subject to } y^{(i)}(w^Tx^{(i)} + b) \ge \hat r, \quad i = 1,\dots, m$$

However, the objective function $\frac{\hat r}{\|w\|}$ is still non-convex, so we need to go a step further. Recall our earlier discussion that we can impose an arbitrary scaling constraint on $w$ and $b$ without changing anything. Using this fact, we require the functional margin of $(w, b)$ with respect to the training set to be $1$:

$$\hat r = 1$$

Therefore, the optimization problem is reformulated as follows

$$\max_{w, b}\frac{1}{\|w\|} \\ \text{subject to } y^{(i)}(w^Tx^{(i)} + b) \ge 1, \quad i = 1,\dots, m$$

However, the objective function $\frac{1}{\|w\|}$ is still not convex, so we take one more step.

It is easy to show that maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\|^2$, which is convex. Finally, therefore, we can derive the optimization problem as follows

$$\min_{w, b}\frac{1}{2}\|w\|^2 \\ \text{subject to } y^{(i)}(w^Tx^{(i)} + b) \ge 1, \quad i = 1,\dots, m$$
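
As a concrete illustration, the following sketch solves this primal problem directly with the cvxpy library on a hypothetical linearly separable toy set; the library choice and the data are my own assumptions, not part of the original notes.

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-0.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min 1/2 ||w||^2  subject to  y^(i) (w^T x^(i) + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
# The geometric margin of the resulting classifier is 1 / ||w||
print(1.0 / np.linalg.norm(w.value))
```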

Although the above problem can be solved by quadratic programming, by using Lagrange duality we can apply kernels, which allow optimal margin classifiers to work efficiently in very high dimensional spaces.

💡
In addition, algorithms based on the dual form typically do much better than generic QP software.

Lagrange duality

Consider a problem of the following form:

$$\min_w f(w) \\ \text{subject to } h_i(w) = 0, \quad i = 1,\dots, l$$

This problem can be solved with the method of Lagrange multipliers, in which we define the Lagrangian to be

$$\mathcal L(w, \beta) = f(w) + \sum_{i = 1}^l\beta_ih_i(w)$$

Here, the $\beta_i$'s are called the Lagrange multipliers. We would then find and set $\mathcal L$'s partial derivatives to zero:

$$\frac{\partial \mathcal L}{\partial w_i} = 0, \quad \frac{\partial \mathcal L}{\partial \beta_i} = 0$$

and solve for $w$ and $\beta$. The above optimization problem only has equality constraints. What if the optimization problem also has inequality constraints, such as

$$\min_w f(w) \\ \text{subject to } g_i(w) \le 0, \quad i = 1,\dots, k \\ h_i(w) = 0, \quad i = 1, \dots, l$$

To solve it, we start by defining the generalized Lagrangian.

$$\mathcal L(w, \alpha, \beta) = f(w) + \sum_{i = 1}^k \alpha_ig_i(w) + \sum_{i = 1}^l\beta_ih_i(w)$$

Consider the quantity

$$\theta_\mathcal P(w) = \max_{\alpha, \beta:\alpha_i\ge 0}\mathcal L(w, \alpha, \beta)$$

where $\mathcal P$ means primal. We can easily find that

$$\theta_\mathcal P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies primal constraints} \\ \infty & \text{otherwise}\end{cases}$$

Indeed, if $w$ violates some constraint, we can drive $\mathcal L(w, \alpha, \beta)$ to $\infty$ by sending the corresponding multiplier to $+\infty$; if $w$ satisfies all the constraints, the maximum is attained with $\alpha_ig_i(w) = 0$ while the $\beta_ih_i(w)$ terms vanish, leaving just $f(w)$. Therefore, our original optimization problem can be reformulated as follows

$$p^{*} = \min_w\theta_{\mathcal P}(w) = \min_w\max_{\alpha, \beta : \alpha_i\ge 0}\mathcal L(w, \alpha, \beta)$$

Now, let's look at a slightly different problem. Define

$$\theta_\mathcal D(\alpha, \beta) = \min_w \mathcal L(w, \alpha, \beta)$$

where $\mathcal D$ means dual. We can define the dual optimization problem:

$$d^* = \max_{\alpha, \beta:\alpha_i\ge0}\theta_\mathcal D(\alpha, \beta) = \max_{\alpha, \beta:\alpha_i\ge0}\min_w\mathcal L(w, \alpha, \beta)$$
💡
The key difference between the original optimization problem and the dual optimization problem is the order of min and max.

Normally, $d^* \le p^*$. However, under certain conditions, we will have

$$d^* = p^*$$
💡
Under such conditions, we can solve the original optimization problem by solving the dual optimization problem.

The conditions are the following:

  1. $f$ and the $g_i$'s are convex, and the $h_i$'s are affine.
  2. The constraints $g_i$ are strictly feasible; this means that there exists some $w$ so that $g_i(w) < 0$ for all $i$.

Under the above assumptions, there must exist $w^*, \alpha^*, \beta^*$ so that $w^*$ is the solution to the primal problem, $\alpha^*, \beta^*$ are the solutions to the dual problem, and moreover $p^* = d^* = \mathcal L(w^*, \alpha^*, \beta^*)$. Moreover, $w^*, \alpha^*,$ and $\beta^*$ satisfy the KKT conditions, which are as follows:

$$\frac{\partial}{\partial w_i}\mathcal L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, n \\ \frac{\partial}{\partial \beta_i}\mathcal L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, l \\ \alpha_i^*g_i(w^*) = 0, \quad i = 1, \dots, k \\ g_i(w^*) \le 0, \quad i = 1, \dots, k \\ \alpha_i^*\ge 0, \quad i = 1,\dots, k$$

Moreover, if some $w^*, \alpha^*, \beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and dual problems.

💡
This means that the converse is also true.

More specifically, one of the most important of the KKT conditions is

$$\alpha_i^*g_i(w^*) = 0, \quad i = 1, \dots, k$$

which is called the KKT dual complementarity condition. It implies that if $\alpha_i^* > 0$, then $g_i(w^*) = 0$ (i.e. the inequality constraint is active, holding with equality rather than strict inequality).

💡
This condition is the reason the SVM has only a small number of support vectors.
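
As a tiny worked example (my own, not from the original notes), consider minimizing $f(w) = w^2$ subject to the single inequality constraint $g(w) = 1 - w \le 0$. The Lagrangian is $\mathcal L(w, \alpha) = w^2 + \alpha(1 - w)$ with $\alpha \ge 0$. Stationarity gives $2w - \alpha = 0$ and complementarity gives $\alpha(1 - w) = 0$. If $\alpha = 0$, then $w = 0$, which violates $1 - w \le 0$; hence $\alpha > 0$, so $1 - w = 0$. Therefore $w^* = 1$ and $\alpha^* = 2$: the constraint is active exactly as the dual complementarity condition predicts, just as the active (margin-1) constraints in the SVM correspond to the support vectors.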

Revisit the optimal margin classifier

As I mentioned, the final optimization problem we want to solve is

$$\min_{w, b}\frac{1}{2}\|w\|^2 \\ \text{subject to } y^{(i)}(w^Tx^{(i)} + b) \ge 1, \quad i = 1,\dots, m$$

We can write the constraints as

$$g_i(w) = -y^{(i)}(w^Tx^{(i)} + b) + 1 \le 0$$

Therefore, by the KKT dual complementarity condition, $\alpha_i > 0$ only for the training examples that have functional margin exactly equal to 1.

The points with the smallest margins are exactly the ones closest to the decision boundary. Thus, only the $\alpha_i$'s corresponding to the training examples with the smallest margins, the three examples that lie exactly on the margin in the figure, are non-zero at the optimal solution. These points are called the support vectors.

💡
The fact is that the number of support vectors can be much smaller than the size of the training set. (So, it is computationally efficient)

When we construct the Lagrangian for our optimization problem we have:

$$\mathcal L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i = 1}^m \alpha_i\big[y^{(i)}(w^Tx^{(i)} + b) - 1\big]$$

Let's find the dual form of the problem. To do so, we need to first minimize $\mathcal L(w, b, \alpha)$ with respect to $w$ and $b$, to get $\theta_\mathcal D$.

$$\nabla_w\mathcal L(w, b, \alpha) = w - \sum_{i = 1}^m \alpha_iy^{(i)}x^{(i)} = 0 \\ \nabla_b\mathcal L(w, b, \alpha) = \sum_{i = 1}^m \alpha_iy^{(i)} = 0$$

This implies that

$$w = \sum_{i = 1}^m\alpha_iy^{(i)}x^{(i)}$$

By using these equations, we can reformulate the original Lagrangian as follows

$$\mathcal L(w, b,\alpha) = \sum_{i = 1}^m\alpha_i - \frac{1}{2}\sum_{i, j = 1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j(x^{(i)})^Tx^{(j)}$$

Therefore, we obtain the following dual optimization problem:

$$d^* = \max_\alpha \theta_\mathcal D(\alpha) = \max_{\alpha}\Big[\sum_{i = 1}^m\alpha_i - \frac{1}{2}\sum_{i, j = 1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j(x^{(i)})^Tx^{(j)}\Big] \\ \text{subject to } \alpha_i\ge 0, \quad i = 1,\dots, m \\ \text{and } \sum_{i = 1}^m\alpha_iy^{(i)} = 0$$
💡
The original optimization problem satisfies the conditions for strong duality because $f$ and the $g_i$ are convex functions and the constraints $g_i$ are strictly feasible. So, we can solve the original optimization problem via the dual optimization problem.
💡
The dual optimization problem contains an inner-product term, and it is this term that lets us apply the kernel trick.

After we find $\alpha^*$, we can easily derive $w^*$ by using $w = \sum_{i = 1}^m\alpha_iy^{(i)}x^{(i)}$. In addition, we can also find $b^*$ by using the following facts

$$\min_{y' = 1} w^{*T}x' + b^* = 1, \qquad \max_{y' = -1} w^{*T}x' + b^* = -1$$

which together give

$$b^* = -\frac{\max_{y' = -1} w^{*T}x' + \min_{y' = 1} w^{*T}x'}{2}$$

Suppose we've fit our model's parameters to a training set, and now wish to make a prediction at a new input point $x$. We would then calculate $w^{*T}x + b^*$, and predict $y = 1$ if and only if this quantity is bigger than zero. By using $w = \sum_{i = 1}^m\alpha_iy^{(i)}x^{(i)}$, we can derive

$$w^{*T}x + b^* = \Big(\sum_{i = 1}^m\alpha_i^*y^{(i)}x^{(i)}\Big)^Tx + b^* \\ = \sum_{i = 1}^m\alpha_i^*y^{(i)}\langle x^{(i)}, x\rangle + b^*$$

Hence, if we've found the $\alpha_i$'s, then in order to make a prediction we only have to calculate a quantity that depends on the inner products between $x$ and the points in the training set. Moreover, we saw earlier that the $\alpha_i$'s will all be zero except for the support vectors. So we really only need to find the inner products between $x$ and the support vectors.

💡
By using the dual optimization problem, we are able to write the entire algorithm in terms of only inner products between input feature vectors. Exploiting this fact, we can use kernels to learn efficiently in very high dimensional spaces.
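
The following sketch (my own illustration, again using cvxpy and hypothetical toy data) solves the dual problem, recovers $w^*$ and $b^*$ from the $\alpha_i$'s, and makes a prediction using only inner products with the support vectors.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-0.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)
# Dual objective: sum(alpha) - 1/2 * || sum_i alpha_i y^(i) x^(i) ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
sv = a > 1e-6                   # support vectors have alpha_i > 0
w = (a * y) @ X                 # w* = sum_i alpha_i y^(i) x^(i)
b = -((X[y == -1] @ w).max() + (X[y == 1] @ w).min()) / 2  # b* from the margin conditions

# Prediction for a new x uses only inner products with the support vectors
x_new = np.array([1.0, 0.5])
score = np.sum(a[sv] * y[sv] * (X[sv] @ x_new)) + b
print(np.sign(score))
```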

Meaning of Basis function

A basis function is a function used to construct a particular function, or set of functions, as a linear combination of basis functions. Let $f$ be the function we want to find. We can construct this function as a linear combination of basis functions:

$$f(x) = \sum_{i = 0}^{L}\theta_i\phi_i(x) = \theta^T\phi(x)$$
💡
Note: $\phi_1, \cdots, \phi_L$ are linearly independent.

Since $\text{span}\{\phi_1, \cdots, \phi_L\}$ is isomorphic to $\mathbb R^L$, we can think of $\phi$ as a map

$$\phi : \mathbb R^d \to \mathbb R^L$$

So, in the special case where the basis functions are themselves linear in $x$, we can write

$$\phi(x) = Ax$$

for some matrix $A \in \mathbb R^{L\times d}$ (in general, $\phi$ is a nonlinear feature map).

And we take the cost function for each example to be

$$\forall i \in [1, n], \quad g(x_i \mid y_i, \theta) = \frac{1}{2}(y_i - f(x_i))^2 = \frac{1}{2}(y_i - \theta^T\phi(x_i))^2$$

when we have $n$ data points.

We want to find the optimal $\theta$ (i.e. we regard this cost as a function of $\theta$):

$$\nabla_\theta g = -(y_i - \theta^T\phi(x_i))\phi(x_i)$$

Batch gradient descent

So, the gradient descent update for $\theta$ is the following (batch gradient descent case)

$$\theta := \theta - \alpha\sum_{i = 1}^n\nabla_\theta g(x_i) = \theta + \alpha\sum_{i = 1}^n(y_i - \theta^T\phi(x_i))\phi(x_i)$$

Stochastic gradient descent

For i in range(1, n):

$$\theta := \theta - \alpha\nabla_\theta g(x_i) = \theta + \alpha(y_i - \theta^T\phi(x_i))\phi(x_i)$$
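
A minimal NumPy sketch of both updates, using a hypothetical polynomial basis $\phi(x) = (1, x, x^2)$ and synthetic data (the basis, data, and learning rates are my own choices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Hypothetical basis functions: phi(x) = (1, x, x^2)
    return np.array([1.0, x, x ** 2])

# Synthetic 1-D data roughly following y = 1 + 2x - x^2
xs = rng.uniform(-1, 1, size=50)
ys = 1 + 2 * xs - xs ** 2 + 0.05 * rng.normal(size=50)
Phi = np.stack([phi(x) for x in xs])   # n x L design matrix

# Batch gradient descent: use the gradient summed over all n examples
theta = np.zeros(3)
lr = 0.01
for _ in range(500):
    theta = theta + lr * Phi.T @ (ys - Phi @ theta)

# Stochastic gradient descent: update with one example at a time
theta_sgd = np.zeros(3)
lr = 0.05
for _ in range(50):
    for i in range(len(xs)):
        theta_sgd = theta_sgd + lr * (ys[i] - theta_sgd @ Phi[i]) * Phi[i]

print(theta)       # both should be roughly [1, 2, -1]
print(theta_sgd)
```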

Conclusion

Using basis functions can help increase the dimensionality of the input space, which can make it easier to find decision boundaries or capture more complex relationships between the inputs and outputs.

The example above shows an input space that is $d$-dimensional, while the feature space (the range of $\phi$) is $L$-dimensional.

Kernel Trick

Actually, we can express $\theta$ as a linear combination of $\phi(x_1), \cdots, \phi(x_n)$, the feature vectors generated by the input values.

💡
Note: we don't know whether $\phi(x_1), \cdots, \phi(x_n)$ are linearly independent. We only know that the basis functions are independent.

Suppose $\theta$ is currently a linear combination of the feature vectors, say $\theta = \sum_{i=1}^n\beta_i\phi(x^{(i)})$. When we update $\theta$ with one more batch gradient descent step, we can easily find that

$$\theta := \theta + \alpha \sum_{i = 1}^n(y^{(i)} - \theta^T\phi(x^{(i)}))\phi(x^{(i)}) \\ = \sum_{i = 1}^n\beta_i \phi(x^{(i)}) + \alpha \sum_{i = 1}^n(y^{(i)} - \theta^T\phi(x^{(i)}))\phi(x^{(i)}) \\ = \sum_{i = 1}^n \{\beta_i + \alpha (y^{(i)} - \theta^T\phi(x^{(i)}))\}\phi(x^{(i)})$$

Letting $\beta_i := \beta_i + \alpha (y^{(i)} - \theta^T\phi(x^{(i)}))$, we find that $\theta$ can still be expressed as a linear combination of $\phi(x_1), \cdots, \phi(x_n)$.

💡
For this to hold from the start, we initialize the parameter $\theta = 0$ (i.e. all $\beta_i = 0$).

Since $\theta$ can be expressed as a linear combination of $\phi(x_1), \cdots, \phi(x_n)$, the update for $\beta_i$ becomes

$$\beta_i := \beta_i + \alpha\Big(y^{(i)} - \sum_{j = 1}^n\beta_j\phi(x^{(j)})^T\phi(x^{(i)})\Big) \\ = \beta_i + \alpha\Big(y^{(i)} - \sum_{j = 1}^n\beta_j \langle \phi(x^{(j)}), \phi(x^{(i)}) \rangle\Big)$$
💡
Actually, we want to find the optimal value of $\theta$. But $\theta$ can be expressed as a linear combination of $\phi(x_1), \cdots, \phi(x_n)$ with coefficients $\beta_1, \cdots, \beta_n$ respectively. So finding the optimal $\beta$ is equivalent to finding the optimal value of $\theta$.

So, from now on, we focus on finding an optimal value of $\beta$.

Why we have to do this procedure

  1. $\langle \phi(x^{(j)}), \phi(x^{(i)}) \rangle$ can be pre-calculated once for all pairs of $i$ and $j$.
  2. In addition, we don't have to compute $\phi(x^{(i)})$ explicitly when we calculate $\langle \phi(x^{(j)}), \phi(x^{(i)}) \rangle$, because a kernel function can often compute this inner product directly from $x^{(j)}$ and $x^{(i)}$, as in the sketch below.
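
Here is a minimal sketch (my own, continuing the synthetic setup above) of the kernelized update: the Gram matrix $K_{ij} = \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$ is precomputed once via a kernel function, and only $\beta$ is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(a, b):
    # Hypothetical kernel equal to <phi(a), phi(b)> for phi(x) = (1, x, x^2)
    return 1.0 + a * b + (a * b) ** 2

xs = rng.uniform(-1, 1, size=50)
ys = 1 + 2 * xs - xs ** 2 + 0.05 * rng.normal(size=50)
n = len(xs)

# Precompute the Gram matrix K[i, j] = <phi(x_i), phi(x_j)> once
K = np.array([[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)])

beta = np.zeros(n)   # theta = sum_j beta_j phi(x_j), initialized to zero
lr = 0.05
for _ in range(50):
    for i in range(n):
        # beta_i := beta_i + alpha * (y_i - sum_j beta_j <phi(x_j), phi(x_i)>)
        beta[i] += lr * (ys[i] - K[i] @ beta)

# Predict at a new point using only kernel evaluations (no explicit phi)
x_new = 0.3
pred = sum(beta[j] * kernel(xs[j], x_new) for j in range(n))
print(pred)          # roughly 1 + 2*0.3 - 0.3**2 = 1.51
```
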
Regularization and the non-separable case

The derivation of the SVM as presented so far assumed that the data is linearly separable. However, we can't guarantee that it always will be. In addition, the original SVM formulation is very sensitive to outliers.

The figure above shows that just one outlier can change the decision boundary dramatically. To make the algorithm work for non-linearly separable datasets, as well as be less sensitive to outliers, we reformulate our optimization (using $\ell_1$ regularization) as follows

$$\min_{w, b,\xi}\frac{1}{2}\|w\|^2 + C\sum_{i = 1}^m\xi_i \\ \text{subject to } y^{(i)}(w^Tx^{(i)} + b) \ge 1 - \xi_i, \quad i = 1,\dots, m \\ \text{and } \xi_i\ge 0, \quad i = 1, \dots, m$$
💡
By doing so, examples are permitted to have functional margin less than 1, and if an example has functional margin $1-\xi_i$ (with $\xi_i > 0$), we pay a cost: the objective function increases by $C\xi_i$.

As before, we can form the Lagrangian:

$$\mathcal L(w, b,\xi, \alpha, r) = \frac{1}{2}w^Tw + C\sum_{i = 1}^m\xi_i - \sum_{i = 1}^m \alpha_i\big[y^{(i)}(w^Tx^{(i)} + b) - 1 + \xi_i\big] - \sum_{i = 1}^m r_i\xi_i$$

After we calculate the dual form, we can derive

$$\max_\alpha \sum_{i = 1}^m\alpha_i - \frac{1}{2}\sum_{i, j = 1}^m y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle \\ \text{subject to } 0\le\alpha_i\le C, \quad i = 1, \dots, m \\ \text{and } \sum_{i = 1}^m\alpha_iy^{(i)} = 0$$
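
A hedged sketch of this $\ell_1$-regularized (soft-margin) primal with cvxpy, on hypothetical toy data containing one outlier; the value of $C$ and the data are arbitrary choices of mine.

```python
import cvxpy as cp
import numpy as np

# Toy data where the third positive point is an outlier
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.5, -1.0], [-1.0, -2.0], [-0.5, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
m, d = X.shape
C = 1.0

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(m)   # slack variables

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(xi.value)   # slack is nonzero only for examples with functional margin < 1
```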