Computer Science/Machine learning

5. MLE, MAP, Bayesian inference


Abstract

First, do not confuse the frequentist and Bayesian viewpoints. If you do, concepts such as MLE and MAP can become seriously confusing. Frequentists do not treat unknown parameters as random variables; they consider them fixed constants.

Second, MLE and MAP are both essentially frequentist-style point-estimation methods, and each can be seen as a special case of Bayesian inference under certain assumptions.

What is the difference between frequentist and bayesian?

Frequentist

In frequentist statistics, probabilities are interpreted as limiting frequencies: the proportion of times an event occurs in the long run, as the number of repetitions of the experiment goes to infinity. Accordingly, the unknown parameter(s) are treated as fixed but unknown constants. This is the way probability is usually taught first.

Bayesian

Bayesian inference is a statistical approach that allows us to update our beliefs about uncertain quantities, such as model parameters or hypotheses. In the Bayesian framework, probability is interpreted as a measure of uncertainty or belief, rather than simply a frequency or proportion of some event occurring in the long run.

What is the meaning of prior, posterior, likelihood in Bayesian?

The Bayesian approach treats the parameter(s) as random variable(s) with an associated prior distribution $f(\theta)$ that expresses our beliefs about the parameter(s) before observing the data. The posterior distribution $f(\theta|\mathcal D)$ is then obtained by updating the prior distribution with the observed data using Bayes' theorem. In addition, the likelihood function $f(\mathcal D|\theta)$ represents our belief about the data, given a certain value of the parameter.

💡
The likelihood function is not a probability distribution, but rather a function that describes how likely the observed data are under different values of the model parameters.

So what is the Bayesian inference?

The procedure of bayesian inference is as follows

  1. Define the prior distribution. Before observing any data, we represent our beliefs about the parameters. This distribution describes our uncertainty about the parameters before seeing any data.
  2. Once we have observed some data, we use the likelihood function to describe how likely the data are under different values of the parameters. The likelihood function is a function of the data and the parameters and represents the probability of observing the data given a specific set of parameter values.
  3. Using the prior distribution and the likelihood function, calculate the posterior distribution.
  4. The posterior distribution can also be used as the prior distribution for the next round of inference, allowing us to iteratively update our beliefs as we collect more data.
  5. Make predictions about new data. Note that when we predict new data, we do not choose a specific value of the parameters.
💡
Importantly, in Bayesian inference we do not choose a specific value of the parameters as in classical (frequentist) inference. Instead, we integrate over all possible values of the parameters using the posterior distribution. MLE and MAP produce point estimates of the parameters; in full Bayesian inference, by contrast, we do not usually reduce the posterior to a single point estimate.
Bayesian Inference — Intuition and Example
with Python Code
https://towardsdatascience.com/bayesian-inference-intuition-and-example-148fd8fb95d6

Explanation of the 4th procedure

Suppose we have observed a data set $D_n = \{x_1, \cdots, x_n\}$ and we have a prior distribution $p(\theta)$. We can update the prior distribution after we calculate the posterior distribution.

$$p(\theta|D_n) = \frac{p(D_n|\theta)\,p(\theta)}{p(D_n)}$$

If we then observe a new data point $x_{n+1}$, we can update our posterior distribution using the same formula.

$$p(\theta|D_{n+1}) = \frac{p(D_{n+1}|\theta)\,p(\theta)}{p(D_{n+1})} = \frac{p(x_{n+1}|\theta)\,p(\theta|D_n)}{p(x_{n+1}|D_n)} \text{ (by i.i.d.)}$$

Since $p(D_n)$ and $p(x_{n+1}|D_n)$ are just normalization factors, we can ignore them when working out the shape of the posterior.

The conclusion is that the prior is updated each time we observe a new data point.

Explanation of the 5th procedure

Once we have obtained a posterior distribution over the model parameters using Bayesian updating, we can use this distribution to make predictions about new data.

The general idea is to use the posterior distribution to compute the predictive distribution, which represents our uncertainty about the outcome given new data.

Suppose we have observed a data set $D_n = \{x_1, \cdots, x_n\}$ and we have a posterior distribution $p(\theta|D_n)$.

Our goal is to determine the distribution, referred to as the predictive distribution, of a new observation $x_{n+1}$, given the data $D_n$ that has already been observed.

$$p(x_{n+1}|D_n) = \int p(x_{n+1}|\theta)\,p(\theta|D_n)\,d\theta$$

This formula expresses the uncertainty about the new observation $x_{n+1}$ in terms of the uncertainty about the parameter $\theta$, taking into account the information provided by the previous data set $D_n$. It can be understood as a weighted average of $p(x_{n+1}|\theta)$ over all values of $\theta$, weighted by the posterior.
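A minimal numerical sketch of steps 4 and 5 (my own illustrative choices: a Bernoulli model, a flat prior, and a discretized parameter grid that approximates the integral by a sum):

```python
import numpy as np

# Bernoulli model p(x | theta) = theta^x * (1 - theta)^(1 - x), theta on a grid.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / theta.size        # flat prior over the grid

data = [1, 0, 1, 1, 0, 1, 1, 1]                 # observed coin flips (illustrative)

posterior = prior.copy()
for x in data:
    likelihood = theta**x * (1 - theta)**(1 - x)
    posterior = likelihood * posterior          # the prior is updated by each new observation
    posterior /= posterior.sum()                # normalization factor p(x_{n+1} | D_n)

# Predictive probability that the next observation is 1:
# p(x_{n+1} = 1 | D_n) = integral of p(x = 1 | theta) * p(theta | D_n) d(theta)
pred_next_is_one = np.sum(theta * posterior)
print(pred_next_is_one)   # a weighted average of theta under the posterior
```

Note that no single value of $\theta$ is ever chosen; the prediction averages over the whole posterior.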

Maximum Likelihood Estimation (MLE)

$$\argmax_\theta \mathcal L(\theta | \mathcal D) = \argmax_\theta p(\mathcal D|\theta) = \argmax_\theta \prod_{i = 1}^{n} p(x_i|\theta) \text{ (by i.i.d.)}$$

The goal of MLE is to find the values of the parameters that maximize the likelihood function, which is a function of the observed data and the unknown parameters.
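A minimal sketch of MLE in practice (a Gaussian model of my own choosing), comparing the closed-form estimate with a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # data from an assumed Gaussian

# Closed-form MLE for a Gaussian: sample mean and sample standard deviation (ddof=0).
mu_mle, sigma_mle = x.mean(), x.std()

# The same estimate obtained by maximizing the log-likelihood numerically.
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(mu_mle, sigma_mle)
print(res.x[0], np.exp(res.x[1]))               # should agree closely with the closed form
```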

Supervised learning case

In supervised learning, the likelihood function is as follows:

$$\argmax_\theta \prod_{i = 1}^{n} p(y_i|x_i, \theta)$$

Frequentist

The frequentist interpretation of the likelihood function is that it measures the probability of observing the data under a given, fixed parameter value (the true value being fixed but unknown).

Bayesian

In Bayesian perspective, Maximum Likelihood Estimation (MLE) can be seen as a special case of Bayesian inference where we assume a flat or uniform prior distribution over the parameters of the model.

Unlike in Bayesian inference, the prior distribution is not updated. Instead, we focus on maximizing the likelihood function to estimate the parameters that best fit the observed data.

Interpretation of MLE (By using KL divergence)

Suppose that each sample of the dataset $\mathcal D = \{x_i\}_{i = 1}^{n}$ is drawn independently from an underlying distribution $p(x|\theta)$.

Empirical distribution
$$\tilde p(x) = \frac{1}{n}\sum_{i = 1}^{n}\delta(x - x_i)$$

The CDF of this distribution is a step function that jumps up by $1/n$ at each of the $n$ data points.

Entropy (Information theory)

The Shannon entropy is restricted to random variables taking discrete values. The corresponding formula for a continuous random variable with probability density function $f(x)$ with finite or infinite support $\mathbb X$ on the real line is defined as

$$H(X) := \mathbb E_X[-\log f(X)] = -\int_{\mathbb X} f(x)\log f(x)\, dx$$
Kullback-Leibler (KL) divergence

KL divergence is a measure of the difference between two probability distributions. Specifically, it measures how much information is lost when using one distribution to approximate another. Note that the KL divergence is not symmetric, and it is always non-negative.

💡
Intuitively, the KL divergence measures the amount of information lost when using $q(x)$ to approximate $p(x)$.
$$KL(p\,||\,q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx$$
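A minimal sketch (two discrete distributions of my own choosing) that computes the KL divergence in both directions, illustrating that it is non-negative and not symmetric:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # illustrative discrete distributions
q = np.array([0.2, 0.3, 0.5])

def kl(p, q):
    # KL(p || q) = sum_x p(x) * log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

print(kl(p, q))   # KL(p || q)
print(kl(q, p))   # KL(q || p) -- generally a different value
```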

The KL divergence is a fundamental tool in many areas of statistics and machine learning.

  1. Bayesian inference

    The KL divergence is often used as a measure of the difference between a prior distribution and a posterior distribution in Bayesian inference. Specifically, it can be used to quantify how much the data “updates” the prior distribution to yield the posterior distribution.

  2. Model selection

    The KL divergence can be used to compare different models by measuring the difference between their estimated probability distributions. For example, suppose we have two models, M1 and M2, and we want to choose between them based on some data D. We can compute the KL divergence between the estimated posterior distributions for M1 and M2, and choose the model that yields the smaller KL divergence.

    In other words, if

    $$KL(p(\theta|D, M_1)\,||\,p(\theta|D, M_2)) < KL(p(\theta|D, M_2)\,||\,p(\theta|D, M_1))$$

    choose model M1; otherwise, choose model M2.

    • Why do we choose the model that yields the smaller KL divergence?

      $KL(p(\theta|D, M_1)\,||\,p(\theta|D, M_2))$ measures the difference between the estimated posterior distribution of M1 and the reference distribution (i.e., the estimated posterior distribution of M2), relative to the estimated posterior distribution of M1.

      So, if $KL(p(\theta|D, M_1)\,||\,p(\theta|D, M_2)) < KL(p(\theta|D, M_2)\,||\,p(\theta|D, M_1))$, we choose model M1, because the estimated posterior distribution of M1 is more similar to the estimated posterior distribution of M2 than the other way around.

Interpretation of MLE

The MLE approach is actually equivalent to finding the probability distribution such that

$$\argmin_\theta KL(\tilde p\,||\,p_\theta)$$

where $p_\theta(x) = p(x|\theta)$.

  • Proof
    $$\begin{aligned}
    \argmin_\theta KL(\tilde p\,||\,p_\theta) &= \argmin_\theta \int \tilde p(x) \log\frac{\tilde p(x)}{p_\theta(x)}\,dx \\
    &= \argmin_\theta \left(-H(\tilde p) - \int \tilde p(x)\log p_\theta(x)\,dx\right) \\
    &= \argmax_\theta \int \frac{1}{n}\sum_{i = 1}^{n} \delta(x - x_i)\log p_\theta(x)\,dx \quad (\text{since } H(\tilde p) \text{ is independent of } \theta) \\
    &= \argmax_\theta \frac{1}{n}\sum_{i = 1}^{n}\log p(x_i|\theta) \\
    &= \theta_{MLE}
    \end{aligned}$$
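A small numerical check of this equivalence (a Bernoulli model and illustrative data, both my own choices): since $KL(\tilde p\,||\,p_\theta)$ differs from the negative average log-likelihood only by the $\theta$-independent entropy of $\tilde p$, the $\theta$ that maximizes the average log-likelihood over a grid is the MLE.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # illustrative Bernoulli samples
theta = np.linspace(0.01, 0.99, 99)

# Average log-likelihood (1/n) * sum_i log p(x_i | theta)
avg_loglik = np.array([np.mean(np.log(t**x * (1 - t)**(1 - x))) for t in theta])

# Minimizing KL(p_tilde || p_theta) is the same as maximizing the average log-likelihood,
# because the entropy term -H(p_tilde) does not depend on theta.
print(theta[np.argmax(avg_loglik)])             # ~ sample mean of x, i.e. the MLE
print(x.mean())
```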

Maximum a Posteriori (MAP)

$$\argmax_\theta p(\theta|\mathcal D) = \argmax_\theta p(\mathcal D|\theta)\,p(\theta) = \argmax_\theta \left(\prod_{i = 1}^{n} p(x_i|\theta)\right) p(\theta) \text{ (by i.i.d.)}$$

The goal of MAP is to find the parameters that maximize the posterior distribution. In addition, we can think of the prior as a regularizer.

Supervised learning case

In supervised learning, the posterior is as follows:

$$\theta_{MAP} = \argmax_\theta \left(\prod_{i = 1}^{n} p(y_i|x_i, \theta)\right) p(\theta|x_{1:n}) = \argmax_\theta \left(\prod_{i = 1}^{n} p(y_i|x_i, \theta)\right) p(\theta) \quad (\because \text{the inputs } x_i \text{ and } \theta \text{ are independent})$$
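A minimal sketch of "the prior as a regularizer" (my own assumptions: a linear-Gaussian likelihood and a zero-mean Gaussian prior on the weights), where the MAP solution coincides with ridge regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)       # y_i | x_i, w ~ N(x_i^T w, sigma^2)

sigma2 = 0.1**2                                  # noise variance (assumed known here)
tau2 = 1.0                                       # prior variance: w ~ N(0, tau2 * I)
lam = sigma2 / tau2                              # regularization strength implied by the prior

# MAP estimate = argmax_w  sum_i log p(y_i | x_i, w) + log p(w)
#              = argmin_w  ||y - Xw||^2 + lam * ||w||^2   (ridge regression)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map)
```

A stronger prior (smaller tau2) means a larger lam, i.e. heavier regularization toward zero.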

Frequentist

From a frequentist perspective, the parameter is a fixed but unknown constant, so the prior is not treated as a probability distribution over the parameter. Instead, a prior-like term is typically treated as a fixed penalty whose strength is a hyper-parameter.

Bayesian

In Bayesian perspective, the parameter is considered a random variable, and this uncertainty is quantified in terms of the posterior distribution, which combines the likelihood function with prior information about the parameter.

From this viewpoint, the prior distribution reflects our prior beliefs or knowledge about the parameter. So, unlike in the frequentist approach, the prior is a genuine probability distribution.

Conjugate prior

Definition

A conjugate prior is a prior distribution such that, for a given likelihood function, the resulting posterior distribution belongs to the same family of distributions as the prior. In other words, for some likelihood functions, if you choose a certain prior, the posterior ends up being in the same family as the prior. Such a prior is called a conjugate prior.

How does the conjugate prior help?

When you know that your prior is a conjugate prior, you can skip the posterior = likelihood * prior computation. Furthermore, if your prior distribution has a closed-form expression, you already know what the maximum posterior is going to be.

For example, the beta distribution is a conjugate prior to the binomial likelihood. This means that, during the modeling phase, we already know the posterior will also be a beta distribution. Therefore, after carrying out more experiments, you can compute the posterior simply by adding the number of acceptances and rejections to the existing parameters $\alpha, \beta$ respectively, instead of multiplying the likelihood with the prior distribution.

That is, we can calculate the posterior simply by adding the number of successes and failures ($n$ = successes + failures) to the existing parameters $\alpha$ and $\beta$, respectively.
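A minimal sketch of this closed-form update (the specific $\alpha$, $\beta$ and counts are illustrative):

```python
from scipy.stats import beta

alpha, beta_param = 2.0, 2.0        # Beta(alpha, beta) prior
successes, failures = 30, 10        # new experiment results (illustrative)

# Conjugacy: the posterior is Beta(alpha + successes, beta + failures) --
# no explicit likelihood * prior multiplication is needed.
alpha_post = alpha + successes
beta_post = beta_param + failures

posterior = beta(alpha_post, beta_post)
print(posterior.mean())             # posterior mean of the success probability
print((alpha_post - 1) / (alpha_post + beta_post - 2))   # posterior mode, i.e. the MAP estimate
```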

As a data/ML scientist, your model is never complete. You have to update your model as more data come in (and that’s why we use Bayesian Inference).

As you saw, the computations in Bayesian Inference can be heavy or sometimes even intractable. However, if we could use the closed-form formula of the conjugate prior, the computation becomes very light.

Conjugate Prior Explained
With examples & proofs
https://towardsdatascience.com/conjugate-prior-explained-75957dc80bfb

List of conjugate priors

Beta posterior (related to the Bernoulli family of likelihoods)

  1. Beta prior * Bernoulli likelihood → Beta posterior
  2. Beta prior * Binomial likelihood → Beta posterior
  3. Beta prior * Negative Binomial likelihood → Beta posterior
  4. Beta prior * Geometric likelihood → Beta posterior

Gamma posterior (related to the Poisson and exponential likelihoods)

  1. Gamma prior * Poisson likelihood → Gamma posterior
  2. Gamma prior * Exponential likelihood → Gamma posterior

Normal posterior (Normal family)

  1. Normal prior * Normal likelihood → Normal posterior

This is why these three distributions (Beta, Gamma, and Normal) are used a lot as priors.

Estimator

Estimator
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished. For example, the sample mean is a commonly used estimator of the population mean.
https://en.wikipedia.org/wiki/Estimator
  • What is an estimator?

    An estimator is a rule for calculating an estimate of a given quantity based on observed data.

Definitions

Suppose $\hat{\theta}$ is an estimator of $\theta$ ($\theta$ : true parameter).

Bias : $E_\mathcal D[\hat{\theta} - \theta]$ (when the bias is zero, we say $\hat{\theta}$ is an unbiased estimator)

Variance : $E_\mathcal D[(\hat{\theta} - E_\mathcal D[\hat{\theta}])^2]$ (it indicates how far, on average, the collection of estimates is from the expected value of the estimates)

Risk : $bias^2 + variance$ (a criterion for comparing two estimators of the same quantity)
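A minimal sketch illustrating these definitions by Monte Carlo over repeated datasets (Gaussian data and the biased variance estimator are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, trials = 10, 100_000

estimates = np.empty(trials)
for t in range(trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=n)
    estimates[t] = x.var()                      # MLE variance estimator (divides by n), biased

bias = estimates.mean() - true_var              # E_D[theta_hat - theta]
variance = estimates.var()                      # E_D[(theta_hat - E_D[theta_hat])^2]
risk = bias**2 + variance                       # bias^2 + variance
print(bias, variance, risk)
```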
