5. MLE, MAP, Bayesian inference
Abstract
First, do not confuse the frequentist and Bayesian viewpoints; otherwise, concepts such as MLE and MAP can become seriously confusing. Frequentists do not think of unknown parameters as random variables; instead, they consider them constants.
Second, MLE and MAP are both basically frequentist methods that can be seen as special cases of Bayesian inference under certain assumptions.
What is the difference between frequentist and Bayesian?
Frequentist
In frequentist statistics, probabilities are interpreted as limiting frequencies: the proportion of times an event occurs in the long run, as the number of repetitions of the experiment goes to infinity. Accordingly, frequentists treat the unknown parameter(s) as fixed but unknown constants. This is, in fact, the way we have usually learned probability.
Bayesian
Bayesian inference is a statistical approach that allows us to update our beliefs about uncertain quantities, such as model parameters or hypotheses. In the Bayesian framework, probability is interpreted as a measure of uncertainty or belief, rather than simply a frequency or proportion of some event occurring in the long run.
What is the meaning of prior, posterior, likelihood in Bayesian?
The Bayesian approach treats the parameter(s) as random variable(s) with an associated prior distribution f(θ) that expresses our beliefs about the parameter(s) before observing the data. The likelihood function f(D∣θ) represents how plausible the observed data are, given a certain value of the parameter. The posterior distribution f(θ∣D) is then obtained by updating the prior with the observed data using Bayes' theorem:
f(θ∣D) = f(D∣θ)f(θ) / f(D)
So what is the Bayesian inference?
The procedure of Bayesian inference is as follows:
- Define a prior distribution. Before observing any data, we represent our beliefs about the parameters. This distribution describes our uncertainty about the parameters before seeing any data.
- Once we have observed some data, we use the likelihood function to describe how likely the data are under different values of the parameters. The likelihood function is a function of the data and the parameters and represents the probability of observing the data given a specific set of parameter values.
- By using the prior distribution and the likelihood function, we calculate the posterior distribution.
- The posterior distribution can also be used as the prior distribution for the next round of inference, allowing us to iteratively update our beliefs as we collect more data.
- Make predictions about new data. Note that when we predict new data, we do not choose a specific value of the parameters; instead we average over the whole posterior.
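The steps above can be sketched numerically. Everything below (the coin-flip data, the grid of candidate parameter values, the variable names) is an illustrative assumption for the example, not part of the original text:

```python
# A minimal numerical sketch of the procedure above, assuming a coin whose
# bias theta is unknown. All names (theta_grid, belief, update) are illustrative.

# Step 1: prior -- uniform belief over a grid of candidate theta values.
theta_grid = [i / 100 for i in range(1, 100)]
prior = [1.0 / len(theta_grid)] * len(theta_grid)

def update(belief, observation):
    """One round of Bayes' rule: posterior ∝ likelihood * prior."""
    # Step 2: likelihood of a single flip (1 = heads) under each candidate theta.
    likelihood = [t if observation == 1 else 1 - t for t in theta_grid]
    # Step 3: multiply by the current belief and normalize.
    unnorm = [l * b for l, b in zip(likelihood, belief)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Step 4: the posterior after each data point is the prior for the next one.
belief = prior
for x in [1, 1, 0, 1]:  # observed flips: 3 heads, 1 tail
    belief = update(belief, x)

# Posterior mean of theta (close to the Beta(4, 2) mean 2/3).
post_mean = sum(t * b for t, b in zip(theta_grid, belief))
```

Because the data are processed one point at a time, the loop makes the "posterior becomes the next prior" step explicit.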
Explanation of the 4th procedure
Suppose we have observed a data set Dn = {x1, ⋯, xn} and we have a prior distribution p(θ). By Bayes' theorem, the posterior distribution is
p(θ∣Dn) = p(Dn∣θ)p(θ) / p(Dn).
If we then observe a new data point xn+1, we can update our posterior distribution using the same formula, with p(θ∣Dn) playing the role of the prior:
p(θ∣Dn+1) = p(xn+1∣θ)p(θ∣Dn) / p(xn+1∣Dn).
Since p(Dn) and p(xn+1∣Dn) are just normalization factors, we can ignore them when comparing parameter values. The conclusion is that the prior is updated every time we observe a new data point.
Explanation of the 5th procedure
Once we have obtained a posterior distribution over the model parameters using Bayesian updating, we can use this distribution to make predictions about new data.
The general idea is to use the posterior distribution to compute the predictive distribution, which represents our uncertainty about the outcome given new data.
Suppose we have observed a data set Dn = {x1, ⋯, xn} and we have a posterior distribution p(θ∣Dn). Our goal is to determine the distribution of a new observation xn+1 given the already observed data Dn; this is referred to as the predictive distribution:
p(xn+1∣Dn) = ∫ p(xn+1∣θ) p(θ∣Dn) dθ.
This formula expresses the uncertainty about the new observation xn+1 in terms of the uncertainty about the parameter θ, taking into account the information provided by the previous data set Dn. It can be understood as a weighted average of p(xn+1∣θ) over all values of θ, weighted by the posterior.
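The weighted-average reading of the predictive distribution can be sketched numerically. The Beta(4, 2) posterior used here is a hypothetical example (it corresponds to a uniform prior and 3 heads in 4 coin flips); all names are assumptions for the illustration:

```python
from math import gamma

# A sketch of the predictive distribution as a weighted average over theta.
# The Beta(4, 2) posterior is an illustrative assumption for this example.

def beta_pdf(t, a, b):
    """Density of the Beta(a, b) posterior p(theta | D_n)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * t ** (a - 1) * (1 - t) ** (b - 1)

# Approximate p(x_{n+1} = heads | D_n) = ∫ theta * p(theta | D_n) dtheta
# by a Riemann sum: a weighted average of theta under the posterior.
step = 1 / 1000
grid = [i * step for i in range(1, 1000)]
weights = [beta_pdf(t, 4, 2) * step for t in grid]

p_heads = sum(t * w for t, w in zip(grid, weights))  # close to 2/3
```

No single θ is chosen: every candidate value contributes to the prediction in proportion to its posterior weight.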
Maximum Likelihood Estimation (MLE)
The goal of MLE is to find the values of the parameters that maximize the likelihood function, which is a function of the observed data and the unknown parameters.
Supervised learning case
In supervised learning with data D = {(x1, y1), ⋯, (xn, yn)}, the likelihood function is
L(θ) = ∏i p(yi∣xi, θ),
assuming the examples are drawn i.i.d.
Frequentist
The frequentist interpretation of the likelihood function is that it measures the probability of observing the data under a given candidate value of the parameter, where the true parameter is a fixed but unknown constant.
Bayesian
From a Bayesian perspective, Maximum Likelihood Estimation (MLE) can be seen as a special case of Bayesian inference where we assume a flat or uniform prior distribution over the parameters of the model.
Unlike full Bayesian inference, the prior distribution is never updated. Instead, we focus on maximizing the likelihood function to estimate the parameters that best fit the observed data.
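As a concrete, hypothetical illustration, a brute-force search over the Bernoulli likelihood recovers the closed-form MLE; the data and function names below are invented for the example:

```python
import math

# A minimal MLE sketch for a Bernoulli model, assuming i.i.d. coin flips.
# The data are illustrative; the point is that a brute-force search over
# the likelihood recovers the closed-form Bernoulli MLE k/n.

data = [1, 0, 1, 1, 0, 1, 1, 1]  # 6 heads in 8 flips

def log_likelihood(theta, data):
    """log p(D | theta) for i.i.d. Bernoulli observations."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)

# Brute-force maximization over a grid of candidate parameter values.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))

closed_form = sum(data) / len(data)  # the known Bernoulli MLE: k / n = 0.75
```

Working with the log-likelihood rather than the likelihood avoids numerical underflow and does not change the maximizer.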
Interpretation of MLE (By using KL divergence)
Suppose that each sample of the dataset D = {xi}i=1..n is drawn independently from an underlying distribution p(x∣θ).
Empirical distribution
The empirical distribution places probability mass 1/n on each observed data point:
p̂(x) = (1/n) Σi=1..n δ(x − xi),
where δ is the Dirac delta. The CDF of this distribution is a step function that jumps up by 1/n at each of the n data points.
Entropy (Information theory)
The Shannon entropy H(X) = −Σx p(x) log p(x) is restricted to random variables taking discrete values. The corresponding quantity for a continuous random variable with probability density function f(x), with finite or infinite support X (the set where f(x) > 0) on the real line, is the differential entropy
h(f) = −∫X f(x) log f(x) dx.
Kullback-Leibler (KL) divergence
KL divergence is a measure of the difference between two probability distributions:
KL(p∣∣q) = ∫ p(x) log( p(x)/q(x) ) dx.
Specifically, it measures how much information is lost when using one distribution q to approximate another distribution p. Note that KL divergence is not symmetric and is non-negative.
The KL divergence is a fundamental tool in many areas of statistics and machine learning.
- Bayesian inference
The KL divergence is often used as a measure of the difference between a prior distribution and a posterior distribution in Bayesian inference. Specifically, it can be used to quantify how much the data “updates” the prior distribution to yield the posterior distribution.
- Model selection
The KL divergence can be used to compare different models by measuring the difference between their estimated probability distributions. For example, suppose we have two models, M1 and M2, and we want to choose between them based on some data D. We can compute the KL divergence between the estimated posterior distributions for M1 and M2, and choose the model that yields the smaller KL divergence. In other words, if
KL(p(θ∣D,M1)∣∣p(θ∣D,M2)) < KL(p(θ∣D,M2)∣∣p(θ∣D,M1)),
we choose model M1; otherwise, we choose model M2.
Why we have to choose the model that yields the smaller KL divergence?
KL(p(θ∣D,M1)∣∣p(θ∣D,M2)) measures the difference between the estimated posterior distribution of M1 and the reference distribution (i.e., the estimated posterior distribution of M2), weighted by the estimated posterior distribution of M1.
So, if KL(p(θ∣D,M1)∣∣p(θ∣D,M2)) < KL(p(θ∣D,M2)∣∣p(θ∣D,M1)), we choose model M1, because the estimated posterior distribution of M1 is more similar to the estimated posterior distribution of M2 than the other way around.
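A tiny numerical sketch, with made-up discrete distributions p and q, shows both stated properties (non-negativity and asymmetry) directly:

```python
import math

# A small sketch of KL divergence between two discrete distributions,
# illustrating that it is non-negative and not symmetric.
# The distributions p and q are arbitrary examples.

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) (with 0 * log 0 = 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

kl_pq = kl(p, q)  # information lost when q approximates p
kl_qp = kl(q, p)  # generally a different number: KL is not symmetric
```

Note that KL(p∣∣p) is exactly zero, since every log ratio vanishes.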
Interpretation of MLE
The MLE approach is actually equivalent to finding the probability distribution pθ(x) = p(x∣θ) that is closest, in KL divergence, to the empirical distribution p̂ of the data.
Proof
argminθ KL(p̂∣∣pθ) = argminθ [ Σx p̂(x) log p̂(x) − Σx p̂(x) log pθ(x) ]
= argmaxθ Σx p̂(x) log pθ(x)
= argmaxθ (1/n) Σi=1..n log pθ(xi).
The first term does not depend on θ, and the last expression is exactly the average log-likelihood. Hence maximizing the likelihood is the same as minimizing the KL divergence from the empirical distribution to the model distribution.
Maximum a Posteriori (MAP)
The goal of MAP is to find the parameters that maximize the posterior distribution:
θ_MAP = argmaxθ p(θ∣D) = argmaxθ p(D∣θ)p(θ).
In addition, we can think of the prior as a regularizer: taking logarithms, the log-prior term log p(θ) penalizes parameter values that are implausible a priori.
Supervised learning case
In supervised learning, the posterior is proportional to
p(θ∣D) ∝ ∏i p(yi∣xi, θ) · p(θ),
so MAP maximizes the likelihood of the labels times the prior.
Frequentist
From a frequentist perspective, the parameter itself is a fixed but possibly unknown constant. The prior is therefore not interpreted as a probability distribution over the parameter; it is typically treated as a fixed hyper-parameter that acts as a regularization term.
Bayesian
From a Bayesian perspective, the parameter is considered a random variable, and this uncertainty is quantified by the posterior distribution, which combines the likelihood function with prior information about the parameter.
The prior distribution reflects our prior beliefs or knowledge about the parameter; so, unlike in the frequentist approach, the prior is a genuine probability distribution.
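A hedged sketch of MAP for a Bernoulli parameter with a Beta prior (all data and hyper-parameters invented for illustration) shows the prior acting as a regularizer:

```python
import math

# A sketch of MAP estimation for a Bernoulli parameter with a Beta(a, b) prior.
# The prior acts as a regularizer, pulling the estimate away from the raw MLE
# toward the prior mode. Data and hyper-parameters are illustrative.

data = [1, 1, 1, 1, 1, 1, 0, 1]  # 7 heads in 8 flips
k, n = sum(data), len(data)
a, b = 2.0, 2.0  # Beta(2, 2) prior, mode at 0.5

def log_posterior(theta):
    """log p(theta | D) up to a constant: log-likelihood + log-prior."""
    log_lik = k * math.log(theta) + (n - k) * math.log(1 - theta)
    log_prior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return log_lik + log_prior

grid = [i / 1000 for i in range(1, 1000)]
theta_map = max(grid, key=log_posterior)

theta_mle = k / n                                # 0.875, no regularization
closed_form_map = (k + a - 1) / (n + a + b - 2)  # (7 + 1) / (8 + 2) = 0.8
```

The MAP estimate sits between the MLE and the prior mode, which is exactly the regularizing effect described above.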
Conjugate prior
Definition
A conjugate prior is a prior distribution that belongs to the same family of distributions as the posterior distribution, so that the posterior is in the same family as the prior. In other words, for some likelihood functions, if you choose a certain prior, the posterior ends up being in the same family of distributions as the prior. Such a prior is called a conjugate prior.
How does the conjugate prior help?
When you know that your prior is conjugate, you can skip the full posterior ∝ likelihood × prior computation. Furthermore, since the posterior then has a closed-form expression, you already know what the maximum of the posterior is going to be.
For example, the beta distribution is a conjugate prior to the binomial likelihood, which means that during the modeling phase we already know the posterior will also be a beta distribution. Therefore, after carrying out more experiments, you can compute the posterior simply by adding the number of acceptances and rejections to the existing parameters α and β, respectively, instead of multiplying the likelihood with the prior distribution.
As a data/ML scientist, your model is never complete. You have to update your model as more data come in (and that is why we use Bayesian inference).
As you saw, the computations in Bayesian inference can be heavy or sometimes even intractable. However, if we can use the closed-form formula of a conjugate prior, the computation becomes very light.
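The "just add the counts" update can be sketched as follows; the prior parameters and the experiment counts are made up for illustration:

```python
# A sketch of the conjugate Beta-Binomial update described above: instead of
# multiplying the likelihood by the prior, we simply add the counts of
# acceptances and rejections to the Beta parameters (alpha, beta).

alpha, beta = 2.0, 2.0  # Beta(2, 2) prior

def update(alpha, beta, acceptances, rejections):
    """Beta prior + Binomial likelihood -> Beta posterior (conjugacy)."""
    return alpha + acceptances, beta + rejections

# First experiment: 8 acceptances, 2 rejections.
alpha, beta = update(alpha, beta, 8, 2)
# More data arrives later: 5 acceptances, 5 rejections.
alpha, beta = update(alpha, beta, 5, 5)

posterior_mean = alpha / (alpha + beta)  # mean of Beta(15, 9) = 15/24
```

Each call replaces an integral with two additions, which is what makes conjugate updating so light as data keep arriving.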
List of conjugate priors
Beta posterior (related to the Bernoulli family)
- Beta prior * Bernoulli likelihood → Beta posterior
- Beta prior * Binomial likelihood → Beta posterior
- Beta prior * Negative Binomial likelihood → Beta posterior
- Beta prior * Geometric likelihood → Beta posterior
Gamma posterior (related to the exponential family)
- Gamma prior * Poisson likelihood → Gamma posterior
- Gamma prior * Exponential likelihood → Gamma posterior
Normal posterior (Normal family)
- Normal prior * Normal likelihood (with known variance) → Normal posterior
This is why these three distributions (Beta, Gamma, and Normal) are used a lot as priors.
Estimator
What is an estimator?
An estimator is a rule for calculating an estimate of a given quantity based on observed data.
Definitions
Suppose θ^ is an estimator of θ (θ : true parameter).
Bias: ED[θ^ − θ] (when the bias is zero, we say θ^ is an unbiased estimator)
Variance: ED[(θ^ − ED[θ^])²] (it indicates how far, on average, the collection of estimates is from the expected value of the estimates)
Risk: bias² + variance (a criterion for comparing two estimators of the same quantity)
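These definitions can be checked by simulation. The two estimators compared below, the biased (divide by n) and unbiased (divide by n − 1) variance estimators, are a standard illustration chosen here as an assumption:

```python
import random

# A simulation sketch of bias: we compare the biased MLE variance estimator
# (divide by n) with the unbiased one (divide by n - 1) on N(0, 1) samples.
# Sample size and trial count are arbitrary choices for the illustration.

random.seed(0)
true_var = 1.0  # variance of the N(0, 1) sampling distribution
n, trials = 5, 20000

biased, unbiased = [], []
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased.append(ss / n)          # MLE of the variance
    unbiased.append(ss / (n - 1))  # unbiased sample variance

def bias(estimates, truth):
    """E_D[theta_hat - theta], approximated by averaging over many datasets."""
    return sum(estimates) / len(estimates) - truth

bias_mle = bias(biased, true_var)    # close to -true_var / n = -0.2
bias_unb = bias(unbiased, true_var)  # close to 0
```

Averaging over many simulated datasets approximates the expectation ED[·] in the definitions above.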