
4. Classification and Logistic Regression


Classification

In classification, the target values $y$ take on only a small number of discrete values.

  • Difference between classification and regression

    In classification, the target takes only a small number of discrete values, whereas in regression the target can take continuous values.

Given $x^{(i)}$, the corresponding target value $y^{(i)}$ is referred to as the label for the training example. (A label represents some category.)

  • Spam detection
  • Salmon/sea-bass classification

Decision Boundary

Suppose that there exist two classes $C_1$ and $C_2$.

A classifier assigns a new data point $x$ to $C_1$ only if $x \in R_1$.

($R_1$ is the decision region for $C_1$.)

A classification function

$h(x) = k \text{ if } x \in R_k$
  • What can we do using a classification function?

    The output of the classification function is the class whose decision region contains the input.

Decision boundaries are boundaries between decision regions.

(In multiple dimensions, decision boundaries are surfaces.)

Discriminant Function

$x \in C_k \text{ if } f_k(x) > f_j(x) \text{ for all } j \neq k$
  • What is the purpose of the discriminant function?

    The value of the discriminant function serves as a criterion for deciding which class an input belongs to.

  • Example

    $f_k(x) = P(C_k|x)$

  • Procedure
    1. Define the discriminant functions.
    2. Using the discriminant functions, define the classification function.

The classifier is defined by $h(x) = \arg\max_k f_k(x)$ ($h$: classification function).

💡
If you compose a discriminant function with any monotonically increasing function, the result is still a valid discriminant function: the $\arg\max$, and hence the classifier, is unchanged, as the sketch below illustrates.
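
A minimal sketch of this invariance (the scores are made up): applying $\log$, a monotonically increasing function, to the discriminant values leaves the $\arg\max$, and hence the predicted class, unchanged.

```python
import numpy as np

# Hypothetical discriminant values f_k(x) for three classes.
scores = np.array([0.2, 0.5, 0.3])

h = np.argmax(scores)              # classify with f_k
h_log = np.argmax(np.log(scores))  # classify with log(f_k), log is monotone

assert h == h_log                  # same predicted class (class index 1)
```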
  • So, what is the discriminant function in supervised learning?

    The discriminant function can take many forms depending on the specific algorithm or model being used.

    In a probabilistic model, the discriminant function can be represented as $p(y|x, \theta)$.

    However, in some non-probabilistic models, such as support vector machines, the discriminant function takes a different form and is not directly related to a posterior probability distribution.

Bayes Decision Rule

By Bayes' rule,

$f_k(x) = P(C_k|x) = \frac{P(x|C_k)P(C_k)}{P(x)} = \frac{P(x|C_k)P(C_k)}{\sum_{j=1}^{c} P(x|C_j)P(C_j)}$
  • $P(x|C_k)$

    class-conditional density

  • $P(C_k)$

    class prior

  • $\sum_{j=1}^{c} P(x|C_j)P(C_j)$

    normalization factor

  • Why can we ignore the normalization factor?

    It is always positive and its value does not depend on the class $k$, so it does not change the $\arg\max$; we can therefore ignore it when using the MAP rule, as the sketch below shows.

    By using discriminant functions, we can draw the decision boundary.
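
A minimal sketch of the MAP rule, assuming made-up 1-D Gaussian class-conditional densities and priors; the normalization factor is the same for every class, so it is simply dropped.

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])    # P(C_1), P(C_2) (made up)
means = np.array([0.0, 2.0])     # means of the class-conditional Gaussians
stds = np.array([1.0, 1.0])      # standard deviations

def classify(x):
    # Unnormalized posteriors: class-conditional density times class prior.
    # The normalization factor sum_j P(x|C_j)P(C_j) would not change the argmax.
    scores = norm.pdf(x, means, stds) * priors
    return np.argmax(scores)

print(classify(0.5))   # -> 0 (class C_1)
print(classify(2.5))   # -> 1 (class C_2)
```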

Sigmoid Function

  • Sigmoid function
    $g(z) = \frac{1}{1+e^{-z}}$
  • Properties of the sigmoid function
    1. $g'(z) = g(z)(1-g(z))$
    2. $g(-z) = 1-g(z)$
  • What can we do using the sigmoid function?

    The value of the sigmoid function can be used as a discriminant function in binary classification problems (see the sketch below):

    $h_\theta(x) = g(\theta^T x) = \frac{1}{1+ e^{-\theta^T x}}$
    • (Note to self: need to confirm whether this should be $f$ rather than $h$!)
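
A minimal sketch checking the two properties above numerically and evaluating the hypothesis $h_\theta(x)$ (the `theta` and `x` values are made up):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Property 1: g'(z) = g(z)(1 - g(z)), checked against a finite difference.
eps = 1e-6
numeric_grad = (g(z + eps) - g(z - eps)) / (2 * eps)
assert np.allclose(numeric_grad, g(z) * (1 - g(z)))

# Property 2: g(-z) = 1 - g(z).
assert np.allclose(g(-z), 1 - g(z))

# Hypothesis for binary classification: h_theta(x) = g(theta^T x).
theta, x = np.array([0.5, -1.0]), np.array([2.0, 1.0])
print(g(theta @ x))    # interpreted as P(y = 1 | x; theta); here 0.5
```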

Binary classification

This is related to the binary cross-entropy loss.
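
A minimal sketch of that connection (labels and predictions made up): the negative log-likelihood of the logistic model is exactly the binary cross-entropy.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # -(1/n) * sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])              # true labels
y_hat = np.array([0.9, 0.2, 0.7])          # sigmoid outputs
print(binary_cross_entropy(y, y_hat))      # ~0.2284
```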

Newton's Method

Given $f:\mathbb{R} \to \mathbb{R}$, Newton's method aims to find a value of $\theta$ such that $f(\theta) = 0$.

Newton's method (also known as the Newton–Raphson method) is a root-finding algorithm that produces successively better approximations to the roots (or zeroes) of a real-valued function.
https://en.wikipedia.org/wiki/Newton's_method

It iteratively performs the following update:

  • Formula
    $\theta := \theta - \frac{f(\theta)}{f'(\theta)}$

The maximum of the likelihood occurs where its derivative is zero.

So we find this point using Newton's method, by setting

$f(\theta) = l'(\theta)$
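
A minimal sketch on a toy Bernoulli log-likelihood $l(p) = k\log p + (n-k)\log(1-p)$ (the counts are made up): applying the update with $f = l'$ converges to the MLE $k/n$.

```python
k, n = 7, 10                        # made-up data: 7 successes in 10 trials

def l_prime(p):                     # f(p)  = l'(p)
    return k / p - (n - k) / (1 - p)

def l_double_prime(p):              # f'(p) = l''(p)
    return -k / p**2 - (n - k) / (1 - p)**2

p = 0.5                             # initial guess
for _ in range(6):
    p = p - l_prime(p) / l_double_prime(p)   # theta := theta - f/f'

print(p)                            # -> 0.7, the MLE k/n
```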

Why do we use it?

It achieves faster convergence.

There is no parameter such as a step size to tune.

Since the training results of standard stochastic gradient descent are strongly affected by the step size, Newton's method has an advantage here.

However, it is quite costly, since the inverse of the Hessian matrix must be computed.

Another interpretation

Each step optimizes a local quadratic approximation of the objective, as sketched below.

Since convexity must be guaranteed, the Hessian matrix must be guaranteed to be positive definite.
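
A short sketch of this interpretation, using the standard second-order expansion (a generic derivation, not specific to this post):

```latex
% Second-order Taylor expansion of l around the current iterate \theta_t:
l(\theta) \approx l(\theta_t)
    + \nabla l(\theta_t)^\top (\theta - \theta_t)
    + \tfrac{1}{2} (\theta - \theta_t)^\top H (\theta - \theta_t)
% Setting the gradient of this quadratic to zero gives the multivariate
% Newton update, the vector analogue of \theta := \theta - f(\theta)/f'(\theta):
\theta_{t+1} = \theta_t - H^{-1} \nabla l(\theta_t)
```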

$\hat y$ : the prediction of the logistic function

IRLS is not a closed-form solution.

However, each step is very similar to the normal equation (see the sketch below).

  • Both $S$ and $b$ are updated as $\theta$ is updated.
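
A minimal IRLS sketch for logistic regression on made-up synthetic data, assuming the post's notation: $S$ is the diagonal weight matrix with entries $\hat y_i(1-\hat y_i)$ and $b$ is the working response; both are recomputed from the current $\theta$ at every step, which is why this is iterative rather than closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # bias + feature
true_theta = np.array([1.0, 2.0])                          # generating coefficients
y = (rng.random(100) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

theta = np.zeros(2)
for _ in range(10):
    y_hat = 1 / (1 + np.exp(-X @ theta))     # logistic predictions
    s = y_hat * (1 - y_hat)                  # diagonal entries of S
    b = X @ theta + (y - y_hat) / s          # working response
    # Weighted normal equation: theta = (X^T S X)^{-1} X^T S b.
    theta = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * b))

print(theta)   # close to the generating coefficients [1.0, 2.0]
```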
