
4. Classification and Logistic Regression


Classification

In classification, the target values $y$ take on only a small number of discrete values.

  • Difference between classification and regression

    In classification, the target takes only a small number of discrete values, whereas in regression the target can take continuous values.

Given $x^{(i)}$, the corresponding target value $y^{(i)}$ is referred to as the label for the training example. (A label represents some category.)

  • Spam detection
  • Salmon/sea-bass classification

Decision Boundary

Suppose that there exist two classes $C_1$ and $C_2$.

A classifier assigns a new data point $x$ to $C_1$ only if $x \in R_1$.

($R_1$ is the decision region for $C_1$.)

A classification function

$h(x) = k \text{ if } x \in R_k$
  • What can we do using a classification function?

    The output of the classification function is the class whose decision region contains the input.

Decision boundaries are boundaries between decision regions.

(In multiple dimensions, decision boundaries are surfaces.)

Discriminant Function

$x \in C_k \text{ if } f_k(x) > f_j(x) \text{ for all } j \neq k$
  • What is the purpose of the discriminant function?

    The value of the discriminant function serves as a criterion for deciding which class an input belongs to.

  • Example

    $f_k(x) = P(C_k|x)$

  • Procedure
    1. Define the discriminant functions.
    2. Using the discriminant functions, define the classification function.

The classifier is defined by $h(x) = \arg\max_k f_k(x)$ ($h$: classification function).

💡
If you compose a discriminant function with any monotonically increasing function, the result is still a valid discriminant function: the $\arg\max$, and hence the classifier, is unchanged, as the sketch below illustrates.
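
A minimal sketch of this invariance (the scores are made up): applying $\log$, a monotonically increasing function, to the discriminant values leaves the $\arg\max$, and hence the predicted class, unchanged.

```python
import numpy as np

# Hypothetical discriminant values f_k(x) for three classes.
scores = np.array([0.2, 0.5, 0.3])

h = np.argmax(scores)              # classify with f_k
h_log = np.argmax(np.log(scores))  # classify with log(f_k), log is monotone

assert h == h_log                  # same predicted class (class index 1)
```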
  • So, what is the discriminant function in supervised learning?

    The discriminant function can take many forms depending on the specific algorithm or model being used.

    In a probabilistic model, the discriminant function can be represented as $p(y|x, \theta)$.

    However, in some non-probabilistic models, such as support vector machines, the discriminant function takes a different form and is not directly related to a posterior probability distribution.

Bayes Decision Rule

By Bayes' rule,

$f_k(x) = P(C_k|x) = \frac{P(x|C_k)P(C_k)}{P(x)} = \frac{P(x|C_k)P(C_k)}{\sum_{j=1}^{c} P(x|C_j)P(C_j)}$
  • $P(x|C_k)$

    class-conditional density

  • $P(C_k)$

    class prior

  • $\sum_{j=1}^{c} P(x|C_j)P(C_j)$

    normalization factor

  • Why can we ignore the normalization factor?

    It is always positive and its value does not depend on the class $k$, so it does not change the $\arg\max$; we can therefore ignore it when using the MAP rule, as the sketch below shows.

    By using discriminant functions, we can draw the decision boundary.
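
A minimal sketch of the MAP rule, assuming made-up 1-D Gaussian class-conditional densities and priors; the normalization factor is the same for every class, so it is simply dropped.

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])    # P(C_1), P(C_2) (made up)
means = np.array([0.0, 2.0])     # means of the class-conditional Gaussians
stds = np.array([1.0, 1.0])      # standard deviations

def classify(x):
    # Unnormalized posteriors: class-conditional density times class prior.
    # The normalization factor sum_j P(x|C_j)P(C_j) would not change the argmax.
    scores = norm.pdf(x, means, stds) * priors
    return np.argmax(scores)

print(classify(0.5))   # -> 0 (class C_1)
print(classify(2.5))   # -> 1 (class C_2)
```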

Sigmoid Function

  • Sigmoid function
    $g(z) = \frac{1}{1+e^{-z}}$
  • Properties of the sigmoid function
    1. $g'(z) = g(z)(1-g(z))$
    2. $g(-z) = 1-g(z)$
  • What can we do using the sigmoid function?

    The value of the sigmoid function can be used as a discriminant function in binary classification problems (see the sketch below):

    $h_\theta(x) = g(\theta^T x) = \frac{1}{1+ e^{-\theta^T x}}$
    • (Note to self: need to confirm whether this should be $f$ rather than $h$!)
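
A minimal sketch checking the two properties above numerically and evaluating the hypothesis $h_\theta(x)$ (the `theta` and `x` values are made up):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Property 1: g'(z) = g(z)(1 - g(z)), checked against a finite difference.
eps = 1e-6
numeric_grad = (g(z + eps) - g(z - eps)) / (2 * eps)
assert np.allclose(numeric_grad, g(z) * (1 - g(z)))

# Property 2: g(-z) = 1 - g(z).
assert np.allclose(g(-z), 1 - g(z))

# Hypothesis for binary classification: h_theta(x) = g(theta^T x).
theta, x = np.array([0.5, -1.0]), np.array([2.0, 1.0])
print(g(theta @ x))    # interpreted as P(y = 1 | x; theta); here 0.5
```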

Binary classification

This is related to the binary cross-entropy loss.
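
A minimal sketch of that connection (labels and predictions made up): the negative log-likelihood of the logistic model is exactly the binary cross-entropy.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # -(1/n) * sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])              # true labels
y_hat = np.array([0.9, 0.2, 0.7])          # sigmoid outputs
print(binary_cross_entropy(y, y_hat))      # ~0.2284
```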

Newton's Method

Given $f:\mathbb{R} \to \mathbb{R}$, Newton's method aims to find a value of $\theta$ such that $f(\theta) = 0$.

Newton's method (also known as the Newton–Raphson method) is a root-finding algorithm that produces successively better approximations to the roots (or zeroes) of a real-valued function.
https://en.wikipedia.org/wiki/Newton's_method

It iteratively performs the following update:

  • Formula
    $\theta := \theta - \frac{f(\theta)}{f'(\theta)}$

The maximum of the likelihood occurs where its derivative is zero.

So we find this point using Newton's method, by setting

$f(\theta) = l'(\theta)$
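
A minimal sketch on a toy Bernoulli log-likelihood $l(p) = k\log p + (n-k)\log(1-p)$ (the counts are made up): applying the update with $f = l'$ converges to the MLE $k/n$.

```python
k, n = 7, 10                        # made-up data: 7 successes in 10 trials

def l_prime(p):                     # f(p)  = l'(p)
    return k / p - (n - k) / (1 - p)

def l_double_prime(p):              # f'(p) = l''(p)
    return -k / p**2 - (n - k) / (1 - p)**2

p = 0.5                             # initial guess
for _ in range(6):
    p = p - l_prime(p) / l_double_prime(p)   # theta := theta - f/f'

print(p)                            # -> 0.7, the MLE k/n
```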

Why do we use it?

It achieves faster convergence.

There is no parameter such as a step size to tune.

Since the training results of standard stochastic gradient descent are strongly affected by the step size, Newton's method has an advantage here.

However, it is quite costly, since the inverse of the Hessian matrix must be computed.

Another interpretation

Each step optimizes a local quadratic approximation of the objective, as sketched below.

Since convexity must be guaranteed, the Hessian matrix must be guaranteed to be positive definite.
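
A short sketch of this interpretation, using the standard second-order expansion (a generic derivation, not specific to this post):

```latex
% Second-order Taylor expansion of l around the current iterate \theta_t:
l(\theta) \approx l(\theta_t)
    + \nabla l(\theta_t)^\top (\theta - \theta_t)
    + \tfrac{1}{2} (\theta - \theta_t)^\top H (\theta - \theta_t)
% Setting the gradient of this quadratic to zero gives the multivariate
% Newton update, the vector analogue of \theta := \theta - f(\theta)/f'(\theta):
\theta_{t+1} = \theta_t - H^{-1} \nabla l(\theta_t)
```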

$\hat y$ : the prediction of the logistic function

IRLS is not a closed-form solution.

However, each step is very similar to the normal equation (see the sketch below).

  • Both $S$ and $b$ are updated as $\theta$ is updated.
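
A minimal IRLS sketch for logistic regression on made-up synthetic data, assuming the post's notation: $S$ is the diagonal weight matrix with entries $\hat y_i(1-\hat y_i)$ and $b$ is the working response; both are recomputed from the current $\theta$ at every step, which is why this is iterative rather than closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # bias + feature
true_theta = np.array([1.0, 2.0])                          # generating coefficients
y = (rng.random(100) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

theta = np.zeros(2)
for _ in range(10):
    y_hat = 1 / (1 + np.exp(-X @ theta))     # logistic predictions
    s = y_hat * (1 - y_hat)                  # diagonal entries of S
    b = X @ theta + (y - y_hat) / s          # working response
    # Weighted normal equation: theta = (X^T S X)^{-1} X^T S b.
    theta = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * b))

print(theta)   # close to the generating coefficients [1.0, 2.0]
```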
