In the case of classification, the target values y take on only a small number of discrete values.
Difference between classification and regression
In classification, the targets take only a small number of discrete values, whereas in regression the targets can be continuous.
Given x^(i), the corresponding target value y^(i) is referred to as the label for the training example. (A label represents some category.)
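For a concrete picture, here is a tiny sketch (with made-up values, purely for illustration) contrasting discrete classification labels with continuous regression targets:

```python
# Toy training examples (x^(i), y^(i)); the numbers are made up for illustration.
# Classification: each label y^(i) is one of a small set of discrete categories.
X_clf = [[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]]
y_clf = ["cat", "dog", "cat"]

# Regression: each target y^(i) is a continuous value.
X_reg = [[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]]
y_reg = [1.73, 2.41, 1.08]
```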
Decision Boundary
Suppose that there exist two classes C1 and C2
A classifier assigns a new data point x to C1 only if x ∈ R1.
(R1 is the decision region for C1)
A classification function
What can we do using a classification function?
The output of the classification function is the class whose decision region contains the input value.
Decision boundaries are boundaries between decision regions
(In multi-dimensional spaces, decision boundaries are surfaces.)
Discriminant Function
What is the purpose of the discriminant function?
The value of the discriminant function can be used as a criterion to determine which class an input belongs to.
Example
f_k(x) = P(C_k ∣ x)
Procedure
Define a discriminant function f_k for each class.
By using discriminant functions, we can define a classification function.
The classifier is defined by h(x) = argmax_k f_k(x) (h: classification function).
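A minimal sketch of this procedure, assuming hypothetical discriminant functions f_k (here, Gaussian log-densities with unit variance plus log-priors, with values chosen just for illustration):

```python
import numpy as np

# Hypothetical discriminant functions f_k(x) for three classes (illustrative values).
means = [0.0, 2.0, 5.0]
priors = [0.5, 0.3, 0.2]

def f(k, x):
    """Discriminant value of class k at input x."""
    return -0.5 * (x - means[k]) ** 2 + np.log(priors[k])

def h(x):
    """Classification function: h(x) = argmax_k f_k(x)."""
    scores = [f(k, x) for k in range(len(means))]
    return int(np.argmax(scores))

print(h(1.2))  # the class whose decision region contains x = 1.2
```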
So, what is the discriminant function in supervised learning?
The discriminant function can take many forms depending on the specific algorithm or model being used.
In a probabilistic model, the discriminant function can be represented as the posterior p(y ∣ x, θ).
However, in some non-probabilistic models, such as support vector machines, the discriminant function takes a different form and is not directly related to a posterior probability distribution.
Bayes Decision Rule
By Bayes' decision rule,
P(C_k ∣ x) = P(x ∣ C_k) P(C_k) / ∑_{j=1}^{c} P(x ∣ C_j) P(C_j)
P(x ∣ C_k): class-conditional density
P(C_k): class prior
∑_{j=1}^{c} P(x ∣ C_j) P(C_j): normalization factor
Why can we ignore the normalization factor?
It is always positive and it is the same for every class (its value does not depend on C_k), so we can ignore the normalization factor when we use the MAP method.
By using discriminant functions, we can draw the decision boundary.
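A small sketch of the MAP decision under assumed 1-D Gaussian class-conditional densities and priors (all numbers illustrative); note that the normalization factor is never computed, since it cancels in the argmax over classes:

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D Gaussian class-conditional densities P(x | C_k) and class priors P(C_k).
class_params = [(-1.0, 1.0), (2.0, 0.5)]   # (mean, std) for C1 and C2, illustrative values
priors = [0.6, 0.4]

def map_classify(x):
    """argmax_k P(x | C_k) P(C_k); the normalization factor is the same for every class."""
    unnormalized = [norm.pdf(x, mu, sigma) * p
                    for (mu, sigma), p in zip(class_params, priors)]
    return int(np.argmax(unnormalized))

print(map_classify(0.3))  # predicted class index for x = 0.3
```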
Sigmoid Function
Sigmoid function: g(x) = 1 / (1 + e^(−x))
Properties of the sigmoid function
g′(x) = g(x)(1 − g(x))
g(−x) = 1 − g(x)
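A short derivation of the first property, using g(x) = 1 / (1 + e^(−x)):

```latex
g'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}}
      = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
      = g(x)\,\bigl(1 - g(x)\bigr)
```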
What can we do using the sigmoid function?
We can use the value of the sigmoid function as a discriminant function in binary classification problems.
(Need to confirm whether this should be f rather than h!)
Binary classification
Related to binary cross entropy.
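A minimal sketch (with arbitrarily chosen weights w and b, purely for illustration) of using the sigmoid output as the discriminant value in binary classification and scoring it with binary cross entropy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear model z = w^T x + b; the weights are illustrative only.
w, b = np.array([1.5, -0.7]), 0.2

def predict_proba(x):
    """Sigmoid output, used as the discriminant value for P(y = 1 | x)."""
    return sigmoid(np.dot(w, x) + b)

def classify(x):
    """Binary classification: decide class 1 when the discriminant value exceeds 0.5."""
    return int(predict_proba(x) >= 0.5)

def binary_cross_entropy(y, p, eps=1e-12):
    """BCE loss between label y in {0, 1} and predicted probability p."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x, y = np.array([0.5, 1.0]), 1
p = predict_proba(x)
print(classify(x), binary_cross_entropy(y, p))
```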
Newton's Method
Given f : R → R, the goal is to find a value of θ such that f(θ) = 0.
It iteratively performs the following update:
θ := θ − f(θ) / f′(θ)
The maximum of the likelihood occurs where its derivative is 0.
So we find this point with Newton's method, by setting
f(θ) = ℓ′(θ)
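A tiny sketch on an assumed toy log-likelihood ℓ(θ) = −(θ − 3)², where ℓ′(θ) = −2(θ − 3); applying the Newton update to f = ℓ′ converges to the maximizer θ = 3:

```python
def l_prime(theta):
    """l'(theta) for the toy log-likelihood l(theta) = -(theta - 3)**2."""
    return -2.0 * (theta - 3.0)

def l_double_prime(theta):
    """l''(theta), i.e. the derivative of f = l'."""
    return -2.0

theta = 0.0
for _ in range(10):
    # Newton update on f(theta) = l'(theta): theta := theta - f(theta) / f'(theta)
    theta = theta - l_prime(theta) / l_double_prime(theta)

print(theta)  # converges to 3.0, the maximizer of l
```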
Why do we use it?
It achieves faster convergence than gradient descent (quadratic convergence near the optimum).
No hyperparameter such as a step size is needed.
Since the result of training with standard stochastic gradient descent is heavily affected by the step size, Newton's method has an advantage here.
However, it is quite costly because the inverse of the Hessian matrix must be computed.
Another interpretation
Each Newton step optimizes the quadratic (second-order Taylor) approximation of the objective at the current point.
Since the quadratic approximation must be convex, the Hessian matrix must be guaranteed to be positive definite.
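Written out, the quadratic-approximation view: the Newton step jumps to the stationary point of the second-order Taylor expansion of the objective J, and that point is a minimizer exactly when the Hessian is positive definite (the convex case noted above):

```latex
J(\theta) \approx J(\theta_t)
  + \nabla J(\theta_t)^{\top}(\theta - \theta_t)
  + \tfrac{1}{2}(\theta - \theta_t)^{\top} H(\theta_t)\,(\theta - \theta_t),
\qquad
\theta_{t+1} = \theta_t - H(\theta_t)^{-1}\,\nabla J(\theta_t)
```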