Computer Science/Machine learning

8. Neural Networks

728x90

Linear Predictor for Binary Classification

Decision rule can be described by

g(x)= sign(f(x)) = \begin{cases} +1 &\text{if }w^Tx + b \ge 0\\ -1 & \text{if }w^Tx + b < 0 \end{cases}

Perceptron

A single-layer neural network with a thresholding activation function

💡

weight는 학습결과에 의해서 update된다.

Perceptron Criterion

따라서 좋은 parameter에 대한 기준이 필요함

그래서 각 샘플 i에 대해서

y_iw^Tx_i > 0, \forall i \in \{1, \cdots, n\}

💡

부호가 같으면 양수가 되므로 이렇게 식을 잡은 것이다. 위 식을 perceptron criterion 이라고 부른다. (사실상 functional margin이랑 수식이 거의 비슷하다.

그래서 objective function to be minimized를 다음과 같이 정의하게 된다.

\mathcal J(w) = -\sum_{(x_i, y_i) \in \mathcal M(w)}y_iw^tx_i

이때, $\mathcal M(w) = \{(x_i, y_i) |y_iw^Tx_i < 0\}$ : 즉 misclassified samples under w

\nabla_w\mathcal J(w) = - \sum_{(x_i, y_i) \in \mathcal M (w)}y_ix_i

Batch gradient descent의 식은 다음과 같다.

w := w + \alpha \sum_{(x_i, y_i) \in \mathcal M (w)}y_ix_i

Stocastic gradient descent의 식은 다음과 같다.

w := w + \alpha y_ix_i\\ \text{when miss-classified}

💡

Perceptron은 SGD는 Batch gradient descent든 misclassified sample에 대해서만 gradient descent를 진행해주었다. (위 이론이 처음 나왔을 당시에는 이렇게 고려했으므로 토달지 말 것)

💡

Logistic regression : activation function이 sigmoid

Perceptron : activation function이 sign-function

Limitation

Lineary separable한 경우에는 convergence를 반드시 보장할 수 있다. 하지만, non-separable case의 경우에는 single perceptron으로는 해결할 수 없다.

따라서 Multi-layer perceptron (MLP)로 나아가게 된다. 왜냐하면 MLP는 non-linear decision boundary를 가질 수 있기 때문이다.

이게 가능하다는 것은 Universal Approximation Theorem에 의해서 보장할 수 있다.

Any continuous function $f :[0, 1]^d\to [0, 1]$ can be approximated arbitrarily well by a neural network with at least 1 hidden layer with a finite number of weights

💡

직관적으로는 모든 measurable function은 simple function들의 선형결합으로 표현될 수 있다는 것과 관련이 있어보이긴 한다.

Deep Learning

Deep learning can be thought as a composition of differentiable functions.

F(x) \approx f_{W_L}(f_{W_{L-1}}(\cdots f_{W_1}(x)\cdots))

→ The representation power of MLP is from composing functions

💡

사실상 학습은 optimization을 미분을 이용하여 찾아나가는 과정에 불과하다.

이때 문제가 될 수 있는 지점은 미분 불가능한 함수 에 대한 고려이다. 사실 이전의 perceptron이 사용한 sign 함수도 관련해서 문제가 발생할 수 있다. 하지만 실질적으로 이전의 perceptron에서는 크게 문제가 발생하지 않았다.

왜냐하면 추론의 단계에서는 $sign(w^Tx)$ 를 사용했지만, loss function을 정의할 때는 sign 함수를 쓰지 않았기 때문이다. 하지만 이러한 방식은 multi-layer perceptron에서는 더 이상 사용하기 힘들어졌다.

💡

Inference와 cost function을 구분해야 한다. sign function을 activation function으로 잡은 경우, cost function은 직접적으로 sign과 관련이 없다.

그래서 non-linear하면서 대부분의 정의역에서 기울기가 0이 아닌 함수를 찾기 시작하였고 대표적인 예시가 다음 6가지이다.

💡

반드시 activation funciton은 non-linear해야한다. 그렇지 않으면 결국 linear 함수로 귀결될 것이기 때문이다.

FC layer

$x_i$ : input feature (scalar)

Optimization

\mathcal L = \sum_{i = 1}^N \mathcal l(y_n, h_L)

이때, $h_i$ : output of $i$ ’th hidden layers

따라서 모든 레이어에 대해서

$\frac{\partial L}{\partial W_l} , \frac{\partial L}{\partial b_l}$

→ closed form solution을 구하는 것은 non-feasible

그래서 back propagation을 구하는 것!

Back-propagation

Squared Error for Regression

\epsilon = \frac{1}{N}\sum_{i = 1}^N \|y_n - \hat y_n \|^2

Cross Entropy Error for Classification

Binary class

\epsilon = \frac{1}{N}\sum_{n = 1}^N[-y_n\log\hat y_n - (1-y_n)\log(1-\hat y_n)]

Multiple class

\epsilon = \frac{1}{N}\sum_{n = 1}^N\sum_{k = 1}^K[-y_{n, k}\log\hat y_{n, k}]

사실상 binary classification의 경우 true label은 0, 1 아니면 1, 0

→ True label은 해당 class만 1임 (살짝 더 정리해볼 것)

→ 사실상 정답 레이블만 one-hot encoding되어있을 것

→ predicted label은 binary의 경우 logistic, multi의 경우 softmax를 통해 확률로 변환

→ KL divergence를 통해 empirical distribution과의 차이를 줄인다고 생각해주면 됨

💡

Binary class의 경우 한 개의 class의 label값이 0이라 저렇게 찢어지는 것이고, Multiple class의 경우에는 one-hot-encoding 방식이라 시그마 한번에 처리해도 된다.

Derivation

Softmax layer for classification

\hat y_k = \frac{\exp(w_k^Th+b_k)}{\sum_j\exp(w_j^Th+b_j)}

→ 기존에 다 진행하다가 classification의 경우 맨 마지막에 softmax layer를 넣어주면 됨.

KL Divergence

KL(p||q) = \sum_x p(x)\log\frac{p(x)}{q(x)}

여기서 p(x)logp(x)는 상수이고, 실질적으로는 KL을 낮추는 것은 cross entropy를 낮추는 것과 같다.

Cross entropy

$p :$ true distribution, $q$ : estimated distribution

H(p, q) = \mathbb E_p[-\log q] = -\sum_x p(x)\log q(x)

💡

즉 Cross entropy는 실제 분포인

p

를 통해

q

를 모델링하는 것이 목표이다. 즉

p

와

q

의 차이를 줄이고자 하는 것이다.

Relation between KL and Cross entorpy

KL(p||q) = \sum_{x}p(x)\log\frac{p(x)}{q(x)} \\ = \sum_{x}p(x)\log p(x) - \sum_{x}q(x)\log p(x)

Since $p$ is a true distribution, so it is constant and non-negative.

Minimizing the $KL(p||q)$ is equivalent to minimizing cross-entropy loss.

Contents

새소식

인기 검색어