Computer Science/Machine learning

5. MLE, MAP, Bayesian inference


Abstract

First, do not confuse the frequentist and Bayesian viewpoints. If you do, concepts such as MLE and MAP can become seriously confusing. Frequentists do not treat unknown parameters as random variables; they consider them fixed constants.

Second, MLE and MAP are both essentially frequentist-style point-estimation methods, and each can be seen as a special case of Bayesian inference under certain assumptions.

What is the difference between frequentist and bayesian?

Frequentist

In frequentist statistics, probabilities are interpreted as limiting frequencies: the proportion of times an event occurs in the long run, as the number of repetitions of the experiment goes to infinity. Accordingly, the unknown parameter(s) are treated as fixed but unknown constants. This is the way probability is usually taught first.

Bayesian

Bayesian inference is a statistical approach that allows us to update our beliefs about uncertain quantities, such as model parameters or hypotheses. In the Bayesian framework, probability is interpreted as a measure of uncertainty or belief, rather than simply a frequency or proportion of some event occurring in the long run.

What is the meaning of prior, posterior, likelihood in Bayesian?

The Bayesian approach treats the parameter(s) as random variable(s) with an associated prior distribution $f(\theta)$ that expresses our beliefs about the parameter(s) before observing the data. The posterior distribution $f(\theta|\mathcal D)$ is then obtained by updating the prior distribution with the observed data using Bayes' theorem. In addition, the likelihood function $f(\mathcal D|\theta)$ represents our belief about the data, given a certain value of the parameter.

💡
The likelihood function is not a probability distribution, but rather a function that describes how likely the observed data are under different values of the model parameters.

So what is the Bayesian inference?

The procedure of bayesian inference is as follows

  1. Define the prior distribution. Before observing any data, we represent our beliefs about the parameters. This distribution describes our uncertainty about the parameters before seeing any data.
  2. Once we have observed some data, we use the likelihood function to describe how likely the data are under different values of the parameters. The likelihood function is a function of the data and the parameters and represents the probability of observing the data given a specific set of parameter values.
  3. Using the prior distribution and the likelihood function, calculate the posterior distribution.
  4. The posterior distribution can also be used as the prior distribution for the next round of inference, allowing us to iteratively update our beliefs as we collect more data.
  5. Make predictions about new data. Note that when we predict new data, we do not choose a specific value of the parameters.
💡
Importantly, in Bayesian inference we do not choose a specific value of the parameters as in classical (frequentist) inference. Instead, we integrate over all possible values of the parameters using the posterior distribution. MLE and MAP produce point estimates of the parameters; in full Bayesian inference, by contrast, we do not usually reduce the posterior to a single point estimate.
Bayesian Inference — Intuition and Example
with Python Code
https://towardsdatascience.com/bayesian-inference-intuition-and-example-148fd8fb95d6

Explanation of the 4th procedure

Suppose we have observed a data set $D_n = \{x_1, \cdots, x_n\}$ and we have a prior distribution $p(\theta)$. We can update the prior distribution after we calculate the posterior distribution.

$$p(\theta|D_n) = \frac{p(D_n|\theta)\,p(\theta)}{p(D_n)}$$

If we then observe a new data point $x_{n+1}$, we can update our posterior distribution using the same formula.

$$p(\theta|D_{n+1}) = \frac{p(D_{n+1}|\theta)\,p(\theta)}{p(D_{n+1})} = \frac{p(x_{n+1}|\theta)\,p(\theta|D_n)}{p(x_{n+1}|D_n)} \text{ (by i.i.d.)}$$

Since $p(D_n)$ and $p(x_{n+1}|D_n)$ are just normalization factors, we can ignore them when working out the shape of the posterior.

The conclusion is that the prior is updated each time we observe a new data point.

Explanation of the 5th procedure

Once we have obtained a posterior distribution over the model parameters using Bayesian updating, we can use this distribution to make predictions about new data.

The general idea is to use the posterior distribution to compute the predictive distribution, which represents our uncertainty about the outcome given new data.

Suppose we have observed a data set $D_n = \{x_1, \cdots, x_n\}$ and we have a posterior distribution $p(\theta|D_n)$.

Our goal is to determine the distribution, referred to as the predictive distribution, of a new observation $x_{n+1}$, given the data $D_n$ that has already been observed.

$$p(x_{n+1}|D_n) = \int p(x_{n+1}|\theta)\,p(\theta|D_n)\,d\theta$$

This formula expresses the uncertainty about the new observation $x_{n+1}$ in terms of the uncertainty about the parameter $\theta$, taking into account the information provided by the previous data set $D_n$. It can be understood as a weighted average of $p(x_{n+1}|\theta)$ over all values of $\theta$, weighted by the posterior.
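A minimal numerical sketch of steps 4 and 5 (my own illustrative choices: a Bernoulli model, a flat prior, and a discretized parameter grid that approximates the integral by a sum):

```python
import numpy as np

# Bernoulli model p(x | theta) = theta^x * (1 - theta)^(1 - x), theta on a grid.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / theta.size        # flat prior over the grid

data = [1, 0, 1, 1, 0, 1, 1, 1]                 # observed coin flips (illustrative)

posterior = prior.copy()
for x in data:
    likelihood = theta**x * (1 - theta)**(1 - x)
    posterior = likelihood * posterior          # the prior is updated by each new observation
    posterior /= posterior.sum()                # normalization factor p(x_{n+1} | D_n)

# Predictive probability that the next observation is 1:
# p(x_{n+1} = 1 | D_n) = integral of p(x = 1 | theta) * p(theta | D_n) d(theta)
pred_next_is_one = np.sum(theta * posterior)
print(pred_next_is_one)   # a weighted average of theta under the posterior
```

Note that no single value of $\theta$ is ever chosen; the prediction averages over the whole posterior.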

Maximum Likelihood Estimation (MLE)

$$\argmax_\theta \mathcal L(\theta | \mathcal D) = \argmax_\theta p(\mathcal D|\theta) = \argmax_\theta \prod_{i = 1}^{n} p(x_i|\theta) \text{ (by i.i.d.)}$$

The goal of MLE is to find the values of the parameters that maximize the likelihood function, which is a function of the observed data and the unknown parameters.
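A minimal sketch of MLE in practice (a Gaussian model of my own choosing), comparing the closed-form estimate with a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # data from an assumed Gaussian

# Closed-form MLE for a Gaussian: sample mean and sample standard deviation (ddof=0).
mu_mle, sigma_mle = x.mean(), x.std()

# The same estimate obtained by maximizing the log-likelihood numerically.
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(mu_mle, sigma_mle)
print(res.x[0], np.exp(res.x[1]))               # should agree closely with the closed form
```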

Supervised learning case

In supervised learning, the likelihood function is as follows:

$$\argmax_\theta \prod_{i = 1}^{n} p(y_i|x_i, \theta)$$

Frequentist

The frequentist interpretation of the likelihood function is that it measures the probability of observing the data under a given, fixed parameter value (the true value being fixed but unknown).

Bayesian

In Bayesian perspective, Maximum Likelihood Estimation (MLE) can be seen as a special case of Bayesian inference where we assume a flat or uniform prior distribution over the parameters of the model.

Unlike in Bayesian inference, the prior distribution is not updated. Instead, we focus on maximizing the likelihood function to estimate the parameters that best fit the observed data.

Interpretation of MLE (By using KL divergence)

Suppose that each sample of the dataset $\mathcal D = \{x_i\}_{i = 1}^{n}$ is drawn independently from an underlying distribution $p(x|\theta)$.

Empirical distribution
$$\tilde p(x) = \frac{1}{n}\sum_{i = 1}^{n}\delta(x - x_i)$$

The CDF of this distribution is a step function that jumps up by $1/n$ at each of the $n$ data points.

Entropy (Information theory)

The Shannon entropy is restricted to random variables taking discrete values. The corresponding formula for a continuous random variable with probability density function $f(x)$ with finite or infinite support $\mathbb X$ on the real line is defined as

$$H(X) := \mathbb E_X[-\log f(X)] = -\int_{\mathbb X} f(x)\log f(x)\, dx$$
Kullback-Leibler (KL) divergence

KL divergence is a measure of the difference between two probability distributions. Specifically, it measures how much information is lost when using one distribution to approximate another. Note that the KL divergence is not symmetric, and it is always non-negative.

💡
Intuitively, the KL divergence measures the amount of information lost when using $q(x)$ to approximate $p(x)$.
$$KL(p\,||\,q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx$$
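A minimal sketch (two discrete distributions of my own choosing) that computes the KL divergence in both directions, illustrating that it is non-negative and not symmetric:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # illustrative discrete distributions
q = np.array([0.2, 0.3, 0.5])

def kl(p, q):
    # KL(p || q) = sum_x p(x) * log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

print(kl(p, q))   # KL(p || q)
print(kl(q, p))   # KL(q || p) -- generally a different value
```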

The KL divergence is a fundamental tool in many areas of statistics and machine learning.

  1. Bayesian inference

    The KL divergence is often used as a measure of the difference between a prior distribution and a posterior distribution in Bayesian inference. Specifically, it can be used to quantify how much the data “updates” the prior distribution to yield the posterior distribution.

  2. Model selection

    The KL divergence can be used to compare different models by measuring the difference between their estimated probability distributions. For example, suppose we have two models, M1 and M2, and we want to choose between them based on some data D. We can compute the KL divergence between the estimated posterior distributions for M1 and M2, and choose the model that yields the smaller KL divergence.

    In other words, if

    $$KL(p(\theta|D, M_1)\,||\,p(\theta|D, M_2)) < KL(p(\theta|D, M_2)\,||\,p(\theta|D, M_1))$$

    choose model M1; otherwise, choose model M2.

    • Why do we choose the model that yields the smaller KL divergence?

      $KL(p(\theta|D, M_1)\,||\,p(\theta|D, M_2))$ measures the difference between the estimated posterior distribution of M1 and the reference distribution (i.e., the estimated posterior distribution of M2), relative to the estimated posterior distribution of M1.

      So, if $KL(p(\theta|D, M_1)\,||\,p(\theta|D, M_2)) < KL(p(\theta|D, M_2)\,||\,p(\theta|D, M_1))$, we choose model M1, because the estimated posterior distribution of M1 is more similar to the estimated posterior distribution of M2 than the other way around.

Interpretation of MLE

The MLE approach is actually equivalent to finding the probability distribution such that

$$\argmin_\theta KL(\tilde p\,||\,p_\theta)$$

where $p_\theta(x) = p(x|\theta)$.

  • Proof
    $$\begin{aligned}
    \argmin_\theta KL(\tilde p\,||\,p_\theta) &= \argmin_\theta \int \tilde p(x) \log\frac{\tilde p(x)}{p_\theta(x)}\,dx \\
    &= \argmin_\theta \left(-H(\tilde p) - \int \tilde p(x)\log p_\theta(x)\,dx\right) \\
    &= \argmax_\theta \int \frac{1}{n}\sum_{i = 1}^{n} \delta(x - x_i)\log p_\theta(x)\,dx \quad (\text{since } H(\tilde p) \text{ is independent of } \theta) \\
    &= \argmax_\theta \frac{1}{n}\sum_{i = 1}^{n}\log p(x_i|\theta) \\
    &= \theta_{MLE}
    \end{aligned}$$
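A small numerical check of this equivalence (a Bernoulli model and illustrative data, both my own choices): since $KL(\tilde p\,||\,p_\theta)$ differs from the negative average log-likelihood only by the $\theta$-independent entropy of $\tilde p$, the $\theta$ that maximizes the average log-likelihood over a grid is the MLE.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # illustrative Bernoulli samples
theta = np.linspace(0.01, 0.99, 99)

# Average log-likelihood (1/n) * sum_i log p(x_i | theta)
avg_loglik = np.array([np.mean(np.log(t**x * (1 - t)**(1 - x))) for t in theta])

# Minimizing KL(p_tilde || p_theta) is the same as maximizing the average log-likelihood,
# because the entropy term -H(p_tilde) does not depend on theta.
print(theta[np.argmax(avg_loglik)])             # ~ sample mean of x, i.e. the MLE
print(x.mean())
```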

Maximum a Posteriori (MAP)

$$\argmax_\theta p(\theta|\mathcal D) = \argmax_\theta p(\mathcal D|\theta)\,p(\theta) = \argmax_\theta \left(\prod_{i = 1}^{n} p(x_i|\theta)\right) p(\theta) \text{ (by i.i.d.)}$$

The goal of MAP is to find the parameters that maximize the posterior distribution. In addition, we can think of the prior as a regularizer.

Supervised learning case

In supervised learning, the posterior is as follows:

$$\theta_{MAP} = \argmax_\theta \left(\prod_{i = 1}^{n} p(y_i|x_i, \theta)\right) p(\theta|x_{1:n}) = \argmax_\theta \left(\prod_{i = 1}^{n} p(y_i|x_i, \theta)\right) p(\theta) \quad (\because \text{the inputs } x_i \text{ and } \theta \text{ are independent})$$
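A minimal sketch of "the prior as a regularizer" (my own assumptions: a linear-Gaussian likelihood and a zero-mean Gaussian prior on the weights), where the MAP solution coincides with ridge regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)       # y_i | x_i, w ~ N(x_i^T w, sigma^2)

sigma2 = 0.1**2                                  # noise variance (assumed known here)
tau2 = 1.0                                       # prior variance: w ~ N(0, tau2 * I)
lam = sigma2 / tau2                              # regularization strength implied by the prior

# MAP estimate = argmax_w  sum_i log p(y_i | x_i, w) + log p(w)
#              = argmin_w  ||y - Xw||^2 + lam * ||w||^2   (ridge regression)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map)
```

A stronger prior (smaller tau2) means a larger lam, i.e. heavier regularization toward zero.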

Frequentist

From a frequentist perspective, the parameter is a fixed but unknown constant, so the prior is not treated as a probability distribution over the parameter. Instead, a prior-like term is typically treated as a fixed penalty whose strength is a hyper-parameter.

Bayesian

In Bayesian perspective, the parameter is considered a random variable, and this uncertainty is quantified in terms of the posterior distribution, which combines the likelihood function with prior information about the parameter.

From this viewpoint, the prior distribution reflects our prior beliefs or knowledge about the parameter. So, unlike in the frequentist approach, the prior is a genuine probability distribution.

Conjugate prior

Definition

A conjugate prior is a prior distribution such that, for a given likelihood function, the resulting posterior distribution belongs to the same family of distributions as the prior. In other words, for some likelihood functions, if you choose a certain prior, the posterior ends up being in the same family as the prior. Such a prior is called a conjugate prior.

How does the conjugate prior help?

When you know that your prior is a conjugate prior, you can skip the posterior = likelihood * prior computation. Furthermore, if your prior distribution has a closed-form expression, you already know what the maximum posterior is going to be.

For example, the beta distribution is a conjugate prior to the binomial likelihood. This means that, during the modeling phase, we already know the posterior will also be a beta distribution. Therefore, after carrying out more experiments, you can compute the posterior simply by adding the number of acceptances and rejections to the existing parameters $\alpha, \beta$ respectively, instead of multiplying the likelihood with the prior distribution.

That is, we can calculate the posterior simply by adding the number of successes and failures ($n$ = successes + failures) to the existing parameters $\alpha$ and $\beta$, respectively.
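A minimal sketch of this closed-form update (the specific $\alpha$, $\beta$ and counts are illustrative):

```python
from scipy.stats import beta

alpha, beta_param = 2.0, 2.0        # Beta(alpha, beta) prior
successes, failures = 30, 10        # new experiment results (illustrative)

# Conjugacy: the posterior is Beta(alpha + successes, beta + failures) --
# no explicit likelihood * prior multiplication is needed.
alpha_post = alpha + successes
beta_post = beta_param + failures

posterior = beta(alpha_post, beta_post)
print(posterior.mean())             # posterior mean of the success probability
print((alpha_post - 1) / (alpha_post + beta_post - 2))   # posterior mode, i.e. the MAP estimate
```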

As a data/ML scientist, your model is never complete. You have to update your model as more data come in (and that’s why we use Bayesian Inference).

As you saw, the computations in Bayesian Inference can be heavy or sometimes even intractable. However, if we could use the closed-form formula of the conjugate prior, the computation becomes very light.

Conjugate Prior Explained
With examples & proofs
https://towardsdatascience.com/conjugate-prior-explained-75957dc80bfb

List of conjugate priors

Beta posterior (related to the Bernoulli family of likelihoods)

  1. Beta prior * Bernoulli likelihood → Beta posterior
  2. Beta prior * Binomial likelihood → Beta posterior
  3. Beta prior * Negative Binomial likelihood → Beta posterior
  4. Beta prior * Geometric likelihood → Beta posterior

Gamma posterior (related to the Poisson and exponential likelihoods)

  1. Gamma prior * Poisson likelihood → Gamma posterior
  2. Gamma prior * Exponential likelihood → Gamma posterior

Normal posterior (Normal family)

  1. Normal prior * Normal likelihood → Normal posterior

This is why these three distributions (Beta, Gamma, and Normal) are used a lot as priors.

Estimator

Estimator
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished. For example, the sample mean is a commonly used estimator of the population mean.
https://en.wikipedia.org/wiki/Estimator
  • What is an estimator?

    An estimator is a rule for calculating an estimate of a given quantity based on observed data.

Definitions

Suppose $\hat{\theta}$ is an estimator of $\theta$ ($\theta$ : true parameter).

Bias : $E_\mathcal D[\hat{\theta} - \theta]$ (when the bias is zero, we say $\hat{\theta}$ is an unbiased estimator)

Variance : $E_\mathcal D[(\hat{\theta} - E_\mathcal D[\hat{\theta}])^2]$ (it indicates how far, on average, the collection of estimates is from the expected value of the estimates)

Risk : $bias^2 + variance$ (a criterion for comparing two estimators of the same quantity)
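A minimal sketch illustrating these definitions by Monte Carlo over repeated datasets (Gaussian data and the biased variance estimator are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, trials = 10, 100_000

estimates = np.empty(trials)
for t in range(trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=n)
    estimates[t] = x.var()                      # MLE variance estimator (divides by n), biased

bias = estimates.mean() - true_var              # E_D[theta_hat - theta]
variance = estimates.var()                      # E_D[(theta_hat - E_D[theta_hat])^2]
risk = bias**2 + variance                       # bias^2 + variance
print(bias, variance, risk)
```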
