# Bayesian Inference

In the classical approach, a probability is regarded as a long-run frequency or propensity. We take random samples from an infinite population, and the parameter $\theta$ is fixed but unknown.

In the Bayesian approach, a probability is a subjective degree of belief. Parameters themselves are regarded as random, and beliefs are updated based on data.

## Ingredients

• Model for the data given some known parameters, $f_{\left.X\right|p}\left(\left.x\right|p\right)$.
• Prior distribution of parameters, $f_{p}\left(p\right)$.
• Bayes' Theorem: $\underset{\text{posterior distribution}}{\underbrace{f_{\left.p\right|X}\left(\left.p\right|x\right)}}=\frac{f_{X,p}\left(x,p\right)}{f_{X}\left(x\right)}=\frac{\overset{\text{likelihood function}}{\overbrace{f_{\left.X\right|p}\left(\left.x\right|p\right)}}\cdot\overset{\text{prior distribution}}{\overbrace{f_{p}\left(p\right)}}}{\int f_{\left.X\right|p}\left(\left.x\right|p\right)f_{p}\left(p\right)dp}$

In this framework, estimating a parameter means finding $f_{\left.p\right|X}$ for some data $x_{1},\dots,x_{n}$. Unlike in the classical approach, parameters are now distributed according to the density $f_{p}\left(\cdot\right)$, although we'll keep writing them in lowercase. In actual estimation, this density may encode the researcher's past experience or data from previous experiments. In practice, it is often chosen to be relatively uninformative.

The posterior distribution is the updated distribution of the parameters, conditional on data. We may use the distribution to generate point estimates (by taking expectations of the parameters, for example).

# Example: Coin Tossing

• Likelihood: $f\left(\left.x\right|p\right)=p^{x}\left(1-p\right)^{1-x}1\left(x\in\left\{ 0,1\right\} \right)$
• Prior Belief: $f\left(p\right)=1\left(p\in\left[0,1\right]\right)$
• Joint Distribution: $f\left(x,p\right)=f\left(\left.x\right|p\right)f\left(p\right)=p^{x}\left(1-p\right)^{1-x}$
• Marginal distribution of $X$:

\begin{aligned} f\left(x\right) & =\int_{p\in\left[0,1\right]}f\left(\left.x\right|p\right)f\left(p\right)dp\\ & =\int_{p\in\left[0,1\right]}f\left(x,p\right)dp\\ & =\int_{0}^{1}p^{x}\left(1-p\right)^{1-x}dp\\ & =\frac{1}{2},\quad\text{for }x\in\left\{ 0,1\right\} .\end{aligned}

(You do not need to evaluate this integral in general; since we only need its value at $x\in\left\{ 0,1\right\}$, one can replace the integrand by $px+\left(1-p\right)\left(1-x\right)$, which coincides with it at those points, and obtain the same result.)

So,

$f\left(\left.p\right|x\right)=\frac{p^{x}\left(1-p\right)^{1-x}}{\frac{1}{2}}=2p^{x}\left(1-p\right)^{1-x}.$

To see how this posterior depends on data, consider the case where we observe $x=1$. Then, $f\left(\left.p\right|1\right)=2p$, and our estimate for $p$ may be

$\left(\left.\widehat{p}\right|X=1\right)=E\left(\left.p\right|X=1\right)=\int_{0}^{1}p2pdp=\left.\frac{2}{3}p^{3}\right|_{0}^{1}=\frac{2}{3}.$

Now, suppose we observe $x=0$. Then, $f\left(\left.p\right|0\right)=2\left(1-p\right)$, and our estimate for $p$ becomes

$\left(\left.\widehat{p}\right|X=0\right)=E\left(\left.p\right|X=0\right)=\int_{0}^{1}p2\left(1-p\right)dp=\left.2\left(\frac{p^{2}}{2}-\frac{p^{3}}{3}\right)\right|_{0}^{1}=\frac{1}{3}.$

As expected, the point estimate for $p$ increases with $X$ .
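The two point estimates can also be checked numerically. Below is a minimal sketch (the grid size and function name are illustrative) that approximates $E\left(\left.p\right|X=x\right)=\int_{0}^{1}p\,f\left(\left.p\right|x\right)dp$ with a midpoint Riemann sum:

```python
# Numerical check of the coin-toss posterior means under the uniform prior.
# Posterior density: f(p|x) = 2 * p**x * (1 - p)**(1 - x) on [0, 1].
def posterior_mean(x, n_grid=100_000):
    """Approximate E(p | X = x) with a midpoint Riemann sum on [0, 1]."""
    total = 0.0
    for i in range(n_grid):
        p = (i + 0.5) / n_grid            # midpoint of the i-th grid cell
        density = 2 * p**x * (1 - p) ** (1 - x)
        total += p * density / n_grid     # p * f(p|x) * dp
    return total

print(posterior_mean(1))  # ≈ 2/3
print(posterior_mean(0))  # ≈ 1/3
```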

Comparing the posterior pdfs for $x=0$ and $x=1$: the posterior density of $p$ is higher for large values of $p$ when $x=1$, but lower for large values of $p$ when $x=0$.

# A more general example

• Likelihood: $f\left(\left.x\right|p\right)=p^{x}\left(1-p\right)^{1-x}1\left(x\in\left\{ 0,1\right\} \right)$, as before.
• Prior Belief: $f\left(p\right)=Beta\left(\alpha,\beta\right)=\frac{\Gamma\left(\alpha+\beta\right)}{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)}p^{\alpha-1}\left(1-p\right)^{\beta-1}$. (Note that $Beta\left(1,1\right)=U\left(0,1\right)$.)

Recall that $E\left(p\right)=\frac{\alpha}{\alpha+\beta}$ and $Var\left(p\right)=\frac{\alpha\beta}{\left(\alpha+\beta\right)^{2}\left(1+\alpha+\beta\right)}$.

The posterior probability when $x=1$ is given by:

$f\left(\left.p\right|x=1\right)=\frac{\frac{\Gamma\left(\alpha+\beta\right)}{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)}p^{\alpha-1}\left(1-p\right)^{\beta-1}p}{f_{X}\left(1\right)},\quad\text{where }f_{X}\left(1\right)=\int_{0}^{1}f_{\left.X\right|p}\left(\left.1\right|p\right)f_{p}\left(p\right)dp.$

Notice that $f_{X}\left(x\right)$ does not depend on $p$, the random variable of the posterior $f\left(\left.p\right|x=1\right)$. In fact, the denominator is only required to ensure that $f\left(\left.p\right|x=1\right)$ integrates to one.

The implication is that we can ignore the denominator and focus on the part that does depend on $p$, the kernel of $f_{\left.p\right|X}$. We end up with the following identity:

$f\left(\left.p\right|X\right)\propto p^{\alpha}\left(1-p\right)^{\beta-1}$

where $\propto$ can be read as “is proportional to”, and $p^{\alpha}\left(1-p\right)^{\beta-1}$ is the kernel.

Inspection of the kernel implies that $f\left(\left.p\right|x=1\right)$ is a Beta distribution with parameters $\alpha+1$ and $\beta$.

The mean and variance of $\left.p\right|x=1$ are given by

\begin{aligned} E\left(\left.p\right|x=1\right) & =\frac{\alpha+1}{\alpha+\beta+1}.\\ Var\left(\left.p\right|x=1\right) & =\frac{\left(\alpha+1\right)\beta}{\left(\alpha+\beta+1\right)^{2}\left(\alpha+\beta+2\right)}.\end{aligned}

It is often easy to identify the distribution by its kernel. In our example, it is clear that

$f_{X}\left(1\right)=\frac{\Gamma\left(\alpha+\beta\right)}{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)}\cdot\frac{\Gamma\left(\alpha+1\right)\Gamma\left(\beta\right)}{\Gamma\left(\alpha+\beta+1\right)}=\frac{\alpha}{\alpha+\beta},$

so that we end up with a Beta distribution that integrates to one.
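The Gamma-function identity above can be verified numerically. A small sketch (the parameter values are arbitrary) using Python's `math.gamma`:

```python
from math import gamma, isclose

# Check that f_X(1) = [G(a+b)/(G(a)G(b))] * [G(a+1)G(b)/G(a+b+1)]
# collapses to a / (a + b), the prior mean, for several Beta(a, b) priors.
def marginal_x1(a, b):
    prior_const = gamma(a + b) / (gamma(a) * gamma(b))
    kernel_integral = gamma(a + 1) * gamma(b) / gamma(a + b + 1)
    return prior_const * kernel_integral

for a, b in [(1.0, 1.0), (2.0, 5.0), (0.5, 0.5)]:
    assert isclose(marginal_x1(a, b), a / (a + b))
print("f_X(1) = alpha / (alpha + beta) confirmed")
```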

# Conjugate Priors

You may have noticed that we started out with a Beta prior and ended with a Beta posterior; the only difference between the two distributions is the parameters.

In this case, we say that the Beta distribution is a conjugate prior for the Bernoulli likelihood, i.e., when using a Bernoulli likelihood, starting with a Beta prior leads to a Beta posterior. This is extremely convenient, since it allows us to simply update the distribution's parameters.

When this does not hold, one has to keep track of a posterior distribution that becomes more and more complicated as more data are fed into it. Lists of conjugate priors for common likelihoods are widely available.
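In code, conjugate updating for the Beta-Bernoulli pair reduces to incrementing two counters. A minimal sketch (the function name is illustrative):

```python
# Conjugate updating for the Beta-Bernoulli pair: each 0/1 observation
# only shifts the two Beta parameters, never the functional form.
def beta_bernoulli_update(alpha, beta, data):
    """Return the posterior Beta parameters after observing 0/1 draws."""
    for x in data:
        alpha += x        # a success raises alpha by one
        beta += 1 - x     # a failure raises beta by one
    return alpha, beta

# Uniform prior Beta(1, 1), then three successes and one failure:
a, b = beta_bernoulli_update(1, 1, [1, 0, 1, 1])
print(a, b)  # → 4 2, i.e. a Beta(4, 2) posterior with mean 4/6
```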

# Example: Normal Distribution

Let $\sigma^{2}$ be known, and

• Likelihood: $f_{\left.X\right|\mu}=N\left(\mu,1\right)$.
• Prior: $f_{\mu}=N\left(0,100\right)$.

Focusing on the kernels, we know that

\begin{aligned} f_{\left.\mu\right|X} & \propto\exp\left(-\frac{1}{2}\left(x-\mu\right)^{2}\right)\exp\left(-\frac{1}{2\cdot100}\mu^{2}\right)\\ & =\exp\left\{ -\frac{1}{2}\left(x^{2}+\mu^{2}-2x\mu+\frac{\mu^{2}}{100}\right)\right\} \\ & \propto\exp\left\{ -\frac{1}{2}\left[\frac{101}{100}\left(\mu^{2}-2\mu\frac{100}{101}x+\left(\frac{100}{101}x\right)^{2}-\left(\frac{100}{101}x\right)^{2}\right)\right]\right\} \\ & \propto\exp\left\{ -\frac{\left(\mu-\frac{100}{101}x\right)^{2}}{2\cdot\frac{100}{101}}\right\} \end{aligned}

where the proportionality signs absorb terms that depend only on $x$ being added to or removed from the exponent.

The result above implies that

\begin{aligned} f_{\left.\mu\right|X} & \propto\exp\left\{ -\frac{\left(\mu-\frac{100}{101}x\right)^{2}}{2\cdot\frac{100}{101}}\right\} \\ \Downarrow\\ f_{\left.\mu\right|X} & =N\left(\frac{100}{101}x,\frac{100}{101}\right).\end{aligned}

Our result is a consequence of the fact that the normal distribution is a conjugate prior for a normal likelihood with known variance.
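The closed form can be checked against a brute-force normalization of the unnormalized posterior on a grid. A sketch (the grid bounds and size are arbitrary choices):

```python
from math import exp

# Grid check of the normal-normal posterior: with likelihood N(mu, 1) and
# prior N(0, 100), the posterior mean for one observation x should be 100x/101.
def grid_posterior_mean(x, lo=-50.0, hi=50.0, n=200_000):
    norm = mean = 0.0
    for i in range(n):
        mu = lo + (hi - lo) * (i + 0.5) / n
        w = exp(-0.5 * (x - mu) ** 2) * exp(-mu**2 / 200)  # unnormalized posterior
        norm += w
        mean += mu * w
    return mean / norm

print(grid_posterior_mean(2.0))  # ≈ 100 * 2 / 101 ≈ 1.9802
```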

# "Counterexample"

Consider now a case where conjugate priors are not used.

• Likelihood: $f_{\left.X\right|\mu}=N\left(\mu,1\right)$
• Prior: $f_{\mu}=Beta\left(4,5\right)$

In this case,

$f_{\left.\mu\right|X}\propto f_{\left.X\right|\mu}f_{\mu}=\exp\left(-\frac{1}{2}\left(x-\mu\right)^{2}\right)\mu^{3}\left(1-\mu\right)^{4}1\left(\mu\in\left(0,1\right)\right).$

This expression is already complex for a single observation of $x$, and it does not match any standard distribution. As more observations are added, the posterior becomes even more unwieldy. When conjugate priors are used, by contrast, only the parameters evolve. The use of conjugate priors will be clear in the next section.
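Without conjugacy, even basic posterior summaries require numerical work. A sketch of computing the posterior mean under this Beta(4, 5) prior (the grid size and data value are arbitrary):

```python
from math import exp

# Unnormalized non-conjugate posterior: N(mu, 1) likelihood times Beta(4, 5) prior.
def unnorm_posterior(mu, x):
    if not 0.0 < mu < 1.0:
        return 0.0
    return exp(-0.5 * (x - mu) ** 2) * mu**3 * (1 - mu) ** 4

# Even the normalizing constant has no closed form here, so the posterior
# mean is computed by brute force with a midpoint grid on (0, 1).
def nonconj_posterior_mean(x, n=100_000):
    norm = mean = 0.0
    for i in range(n):
        mu = (i + 0.5) / n
        w = unnorm_posterior(mu, x)
        norm += w
        mean += mu * w
    return mean / norm

print(nonconj_posterior_mean(0.9))  # pulled up from the prior mean 4/9 toward the data
```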

# Multiple Observations

In the normal case for a single observation, the following holds:

$\left.\begin{array}{c} f_{\left.X\right|\mu}=N\left(\mu,\sigma^{2}\right)\\ f_{\mu}=N\left(\mu_{0},\sigma_{0}^{2}\right) \end{array}\right\} \Rightarrow f_{\left.\mu\right|X}=N\left(\frac{\sigma_{0}^{2}}{\sigma^{2}+\sigma_{0}^{2}}x+\frac{\sigma^{2}}{\sigma^{2}+\sigma_{0}^{2}}\mu_{0},\left(\frac{1}{\sigma^{2}}+\frac{1}{\sigma_{0}^{2}}\right)^{-1}\right)$

Notice the following results:

• The posterior mean is a weighted average of the prior mean $\mu_{0}$ and the data $x$.
• The posterior variance does not depend on $x$ (this is a property of the Normal).
• Estimation: a large $\sigma_{0}^{2}$ is often preferred, since it makes the prior less informative and lets the data dominate.
• Setting $\sigma_{0}^{2}=\infty$ corresponds to an uninformative (improper) prior, which is not a well-defined distribution.

Finally, notice that this result can be used to provide a generalization for the case of multiple observations.

## 2 Observations

Note that

\begin{aligned} f\left(\left.\mu\right|x_{1},x_{2}\right) & =\frac{f\left(\mu,x_{1},x_{2}\right)}{f\left(x_{1},x_{2}\right)}\\ & =\frac{f\left(\left.x_{1}\right|\mu,x_{2}\right)f\left(\left.x_{2}\right|\mu\right)f\left(\mu\right)}{f\left(x_{1},x_{2}\right)}\\ & =\frac{f\left(\left.x_{1}\right|\mu\right)f\left(\left.x_{2}\right|\mu\right)f\left(\mu\right)}{f\left(x_{1},x_{2}\right)}\\ & \propto f\left(\left.x_{1}\right|\mu\right)f\left(\left.x_{2}\right|\mu\right)f\left(\mu\right)\end{aligned}

where the third equality follows from the fact that the data is a random sample, so $x_{1}$ and $x_{2}$ are independent given $\mu$.

We can take advantage of this result by calculating $f\left(\left.\mu\right|x_{1},x_{2}\right)$ sequentially:

First, we calculate

$f_{\left.\mu\right|x_{1}}\propto f_{\left.x_{1}\right|\mu}\cdot f_{\mu}\Rightarrow f_{\left.\mu\right|x_{1}}=N\left(\frac{\sigma_{0}^{2}}{\sigma^{2}+\sigma_{0}^{2}}x_{1}+\frac{\sigma^{2}}{\sigma^{2}+\sigma_{0}^{2}}\mu_{0},\left(\frac{1}{\sigma^{2}}+\frac{1}{\sigma_{0}^{2}}\right)^{-1}\right)$

Then, we use the result, $f_{\left.\mu\right|x_{1}}$, as the prior when updating for $x_{2}$, i.e.,

$f_{\left.\mu\right|x_{1},x_{2}}\propto\underset{\text{likelihood}}{\underbrace{f_{\left.x_{2}\right|\mu}}}\underset{\text{prior}}{\underbrace{f_{\left.\mu\right|x_{1}}}}$

s.t.

$f_{\left.\mu\right|x_{1},x_{2}}=N\left(\frac{\sigma_{0}^{2}}{\sigma^{2}+2\sigma_{0}^{2}}\left(x_{1}+x_{2}\right)+\frac{\sigma^{2}}{\sigma^{2}+2\sigma_{0}^{2}}\mu_{0},\left(\frac{2}{\sigma^{2}}+\frac{1}{\sigma_{0}^{2}}\right)^{-1}\right)$

## $n$ Observations

For $x_{1},\dots,x_{n}$, it follows that

$f_{\left.\mu\right|x_{1}..x_{n}}=N\left(\frac{\sigma_{0}^{2}}{\sigma^{2}+n\sigma_{0}^{2}}\sum_{i=1}^{n}x_{i}+\frac{\sigma^{2}}{\sigma^{2}+n\sigma_{0}^{2}}\mu_{0},\left(\frac{n}{\sigma^{2}}+\frac{1}{\sigma_{0}^{2}}\right)^{-1}\right).$

Notice that, unlike in the counterexample above, the expression for the posterior distribution remains "stable" even for a very large number of observations: only the parameters change.
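The $n$-observation formula can be reproduced by iterating the one-observation update, which is precisely the computational payoff of conjugacy. A sketch (the data values are arbitrary):

```python
from math import isclose

# Sequentially apply the one-observation normal update; the posterior
# stays normal throughout and only (mean, variance) evolve.
def sequential_posterior(mu0, var0, xs, var_lik):
    m, v = mu0, var0
    for x in xs:
        v_new = 1.0 / (1.0 / var_lik + 1.0 / v)  # posterior variance
        m = v_new * (x / var_lik + m / v)        # precision-weighted mean
        v = v_new
    return m, v

sigma2, sigma02, mu0 = 1.0, 100.0, 0.0
xs = [0.2, 1.1, -0.3, 0.8, 0.5]
m, v = sequential_posterior(mu0, sigma02, xs, sigma2)

# Closed-form n-observation posterior from the text:
n, s = len(xs), sum(xs)
m_closed = sigma02 / (sigma2 + n * sigma02) * s + sigma2 / (sigma2 + n * sigma02) * mu0
v_closed = 1.0 / (n / sigma2 + 1.0 / sigma02)
assert isclose(m, m_closed) and isclose(v, v_closed)
print(m, v)
```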

# Theorem: Bernstein-von Mises

Let $\widehat{\theta}_{B}$ be the Bayesian point estimator (i.e., $\widehat{\theta}_{B}=E\left(\left.\theta\right|X\right)$) and $\widehat{\theta}_{ML}$ be the maximum likelihood estimator. Then,

$\sqrt{n}\left(\widehat{\theta}_{B}-\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)^{-1}\right),\text{ where }\theta_{0}\text{ is the true value of }\theta.$

and

$\sqrt{n}\left(\widehat{\theta}_{B}-\widehat{\theta}_{ML}\right)\overset{p}{\rightarrow}0$

The second result is relatively striking: it tells us that the difference between the Bayes and ML estimators converges in probability to zero even after scaling by $\sqrt{n}$.

In practice, researchers can use an estimate of $I\left(\theta\right)^{-1}$ based on the variance implied by $f_{\left.\theta\right|X}$ for hypothesis testing. In relatively complicated cases, priors need not belong to conjugate families; numerical methods are then used, including taking draws from the posterior via the Gibbs sampler and the Metropolis-Hastings algorithm.
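The second convergence result can be illustrated by simulation. A sketch with Bernoulli data and a Beta(1, 1) prior, for which the posterior mean is $(1+s)/(n+2)$ after $s$ successes in $n$ draws, while the MLE is $s/n$ (the seed, $p$, and sample sizes are arbitrary choices):

```python
from math import sqrt
import random

# The scaled gap sqrt(n) * |posterior mean - MLE| shrinks as n grows,
# consistent with sqrt(n) * (theta_B - theta_ML) -> 0 in probability.
random.seed(0)
p_true = 0.3
for n in [100, 10_000, 1_000_000]:
    s = sum(random.random() < p_true for _ in range(n))
    bayes = (1 + s) / (n + 2)   # posterior mean under a Beta(1, 1) prior
    mle = s / n                 # maximum likelihood estimate
    print(n, sqrt(n) * abs(bayes - mle))
```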