# Point Estimation (cont.)

## Example: Uniform

Suppose $X_{i}\overset{iid}{\sim}U\left(0,\theta\right)$ where $\theta\gt 0$ is unknown.

The likelihood function equals $L\left(\left.\theta\right|x_{1}..x_{n}\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=\Pi_{i=1}^{n}\frac{1}{\theta}1\left(0\leq x_{i}\leq\theta\right)$

Since $x_{i}$’s are draws from $U\left(0,\theta\right)$, the $0\leq x_{i}$ constraint will always be satisfied. However, since we are uncertain about the true value of $\theta$, the upper constraint may be binding.

This yields the following likelihood: $\Pi_{i=1}^{n}\frac{1}{\theta}1\left(0\leq x_{i}\leq\theta\right)=\Pi_{i=1}^{n}\frac{1}{\theta}1\left(x_{i}\leq\theta\right)=\frac{1}{\theta^{n}}1\left(x_{\left(n\right)}\leq\theta\right)$

Notice that $L\left(\left.\cdot\right|x_{1}..x_{n}\right)$ is not differentiable at $\theta=x_{\left(n\right)}$. We separate the problem:

• $L\left(\left.\cdot\right|x_{1}..x_{n}\right)=0$ if $\theta\lt x_{\left(n\right)}$; this reflects the impossibility of generating a value above $\theta$.
• $L\left(\left.\cdot\right|x_{1}..x_{n}\right)=\frac{1}{\theta^{n}}$ if $\theta\geq x_{\left(n\right)}$; this is decreasing in $\theta$, so the constraint binds, and $\widehat{\theta}_{ML}=x_{\left(n\right)}$.

Notice that the maximum likelihood estimator differs from the method of moments estimator: $\widehat{\theta}_{ML}=x_{\left(n\right)}$ while $\widehat{\theta}_{MM}=2\overline{x}$.

Unlike with the method of moments, we can never obtain an estimate such that $x_{i}\gt \widehat{\theta}_{ML}$ for some observation $i$.

However, as we will discuss later, there is some bad news.

The fact that we can never obtain $\widehat{\theta}_{ML}\gt \theta_{0}$, where $\theta_{0}$ is the true value of parameter $\theta$, means that the maximum likelihood estimator is likely to systematically underestimate the true parameter value.
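This bias can be checked with a short Monte Carlo sketch (the values `theta0 = 2` and `n = 20` are arbitrary choices): for $U\left(0,\theta_{0}\right)$ the sample maximum has expectation $\frac{n}{n+1}\theta_{0}$, so $\widehat{\theta}_{ML}$ falls short of $\theta_{0}$ on average, while $\widehat{\theta}_{MM}=2\overline{x}$ is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 20, 5000

# Draw `reps` samples of size n from U(0, theta0).
x = rng.uniform(0.0, theta0, size=(reps, n))

theta_ml = x.max(axis=1)         # maximum likelihood: sample maximum
theta_mm = 2.0 * x.mean(axis=1)  # method of moments: twice the sample mean

# theta_ml never exceeds theta0 and is biased downward (E = n/(n+1) * theta0);
# theta_mm is unbiased but can land on either side of theta0.
print(theta_ml.mean())  # close to 20/21 * 2, i.e. about 1.905
print(theta_mm.mean())  # close to 2.0
```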

# Evaluating Estimators

A good estimator of $\theta$ is close to $\theta$ in some probabilistic sense. For reasons of convenience, the leading criterion is the mean squared error:

The mean squared error (MSE) of an estimator $\widehat{\theta}$ of $\theta\in\Theta\subseteq\mathbb{R}$ is a function (of $\widehat{\theta}$) given by

$MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right]$

where, from here on, we use the notation $E_{\theta}\left[\cdot\right]=E_{\theta}\left[\left.\cdot\right|\theta\right]$; that is, the subscript indicates the variable to be conditioned on (previously, it denoted the variable of integration). So,

$MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right]=E\left[\left.\left(\theta-\widehat{\theta}\right)^{2}\right|\theta\right]$

The interpretation is that the MSE gives us the expected quadratic difference between our estimator and a specific value of $\theta$, which we usually take to be the true value.

MSE is mostly popular due to its tractability. When $\theta$ is a vector of parameters, we employ the vector version instead: $MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right).\left(\theta-\widehat{\theta}\right)^{'}\right]$

The vector version of the MSE produces a matrix. To compare two estimator vectors, we compare these matrices: we say one MSE is lower than another if the difference of the matrices is positive semi-definite (i.e., $z'.M.z\geq0,\,\forall z$). We will confine ourselves to the scalar case most of the time.
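As an illustration of this comparison (a sketch with made-up matrices, not estimators derived in the text), positive semi-definiteness of the difference can be checked via its eigenvalues:

```python
import numpy as np

# Hypothetical MSE matrices for two vector estimators (made-up numbers).
mse1 = np.array([[2.0, 0.3],
                 [0.3, 1.5]])
mse2 = np.array([[1.0, 0.2],
                 [0.2, 0.8]])

# mse2 is "lower" than mse1 if mse1 - mse2 is positive semi-definite,
# i.e., all eigenvalues of the symmetric difference are non-negative.
diff = mse1 - mse2
eigenvalues = np.linalg.eigvalsh(diff)
print(eigenvalues)  # both non-negative => estimator 2 has lower MSE
```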

If $MSE_{\theta}\left(\widehat{\theta}_{1}\right)\gt MSE_{\theta}\left(\widehat{\theta}_{2}\right)$ for all values of $\theta$, we are tempted to say that $\widehat{\theta}_{2}$ is better, since it is on average closer to $\theta$, whatever value it has. However, we may feel differently if $\widehat{\theta}_{2}$ systematically underestimates (or overestimates) $\theta$.

In order to take this into account we now introduce the concept of bias:

$Bias_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left(\theta-\widehat{\theta}\right)$

Whenever $E_{\theta}\left(\theta-\widehat{\theta}\right)=0$ - or equivalently, $E_{\theta}\left(\widehat{\theta}\right)=\theta$ - we say estimator $\widehat{\theta}$ is unbiased.

What follows is a fundamental result about the decomposition of the MSE:

$MSE_{\theta}\left(\widehat{\theta}\right)=Var_{\theta}\left(\widehat{\theta}\right)+Bias_{\theta}\left(\widehat{\theta}\right)^{2}$

This means that, among estimators with a given MSE, there is a tradeoff between bias and variance.

The proof of the result is obtained by adding and subtracting $E_{\theta}\left(\widehat{\theta}\right)$:

\begin{aligned} MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right] & =E_{\theta}\left[\left(\theta-\widehat{\theta}+E_{\theta}\left(\widehat{\theta}\right)-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}\right]\\ & =E_{\theta}\left[\left(\widehat{\theta}-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}\right]+\underset{=\left(\theta-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}}{\underbrace{E_{\theta}\left[\left(E_{\theta}\left(\widehat{\theta}\right)-\theta\right)^{2}\right]}}+\underset{=0}{\underbrace{2E_{\theta}\left[\left(\widehat{\theta}-E_{\theta}\left(\widehat{\theta}\right)\right)\left(E_{\theta}\left(\widehat{\theta}\right)-\theta\right)\right]}}\\ & =Var_{\theta}\left(\widehat{\theta}\right)+Bias_{\theta}\left(\widehat{\theta}\right)^{2}\end{aligned}

The cross term vanishes because $E_{\theta}\left(\widehat{\theta}\right)-\theta$ is a constant and $E_{\theta}\left[\widehat{\theta}-E_{\theta}\left(\widehat{\theta}\right)\right]=0$.
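The decomposition can also be verified numerically. Below is a sketch using the biased estimator $x_{\left(n\right)}$ from the uniform example (the parameter values are arbitrary); for sample moments the identity holds exactly, up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 2.0, 10, 200_000

# Biased estimator of theta0 from the uniform example: the sample maximum.
x = rng.uniform(0.0, theta0, size=(reps, n))
est = x.max(axis=1)

mse = np.mean((est - theta0) ** 2)
var = est.var()                       # np.var uses ddof=0, matching the identity
bias_sq = (est.mean() - theta0) ** 2

# MSE = Var + Bias^2
print(mse, var + bias_sq)
```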

We now define an efficient estimator:

Let $W$ be a collection of estimators of $\theta\in\Theta$. An estimator $\widehat{\theta}$ is efficient relative to $W$ if $MSE_{\theta}\left(\widehat{\theta}\right)\leq MSE_{\theta}\left(w\right),\,\forall\theta\in\Theta,\,\forall w\in W$.

In order to find a “best” estimator, we have to restrict $W$ in some way (otherwise, we can often find many estimators with equal MSE, by exploiting the bias/variance tradeoff).

# Minimum Variance Estimators

We usually focus our attention on unbiased estimators: those that, on average, produce the correct result.

The collection of unbiased estimators is $W_{u}=\left\{ w:\,Bias_{\theta}\left(w\right)=0,\,Var_{\theta}\left(w\right)\lt \infty,\,\forall\theta\in\Theta\right\}$.

So, if $\widehat{\theta}\in W_{u}$, then $MSE_{\theta}\left(\widehat{\theta}\right)=Var_{\theta}\left(\widehat{\theta}\right).$

We can now define a type of minimum variance estimator:

An estimator $\widehat{\theta}\in W_{u}$ of $\theta$ is a uniform minimum-variance unbiased (UMVU) estimator of $\theta$ if it is efficient relative to $W_{u}$.

The minimum-variance unbiased part of UMVU should be clear: of the unbiased estimators, $\widehat{\theta}$ is “MVU” if it achieves the lowest variance. The “uniform” part simply means that $\widehat{\theta}$ is unbiased and minimum variance for all values that $\theta$ may take. It is MVU if $\theta=4$, and if $\theta=-3$, etc.

It is often possible to identify UMVU estimators. The tool for doing this is the Rao-Blackwell theorem. Before presenting it, we need to introduce an additional concept.

# Sufficient Statistics

Let $X_{1}..X_{n}$ be a random sample from a distribution with pmf/pdf $f\left(\left.\cdot\right|\theta\right)$, where $\theta\in\Theta$ is unknown.

A statistic $T=T\left(X_{1}..X_{n}\right)$ is a sufficient statistic for parameter $\theta$ if the conditional pmf/pdf of $\left(X_{1}..X_{n}\right)$ given $T$ does not depend on $\theta$, i.e.,

$f\left(\left.X\right|\theta,T\right)=f\left(\left.X\right|T\right)$.

The reason why we are interested in sufficient statistics will be clear once we present the Rao-Blackwell theorem. However, it is worth thinking a bit about the meaning of sufficient statistics first.

Intuitively, sufficient statistics summarize all the useful information in a sample that can be used to characterize $\theta$. Once conditioned upon, a sufficient statistic is so informative about $\theta$ that nothing left in the sample is useful for learning about $\theta$.

## Intuitive Example: Uniform

Consider the uniform sample pdf

$f_{X_{1}..X_{n}}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(X_{1}\leq\theta\wedge X_{2}\leq\theta\wedge...\wedge X_{n}\leq\theta\right)$.

In order to estimate $\theta$, one could write down the likelihood function based on $f_{X}\left(\left.x\right|\theta\right)$, which uses each observation in the sample. However, note that it is sufficient to have information about the maximum observation, $X_{\left(n\right)}$.

The previous pdf can also be written as $f_{X}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(X_{\left(n\right)}\leq\theta\right)$, meaning that a researcher employing maximum likelihood will obtain the same estimate of $\theta$ independently of whether she observes the whole sample or simply the sample maximum.

Indeed, $X_{\left(n\right)}$ is a sufficient statistic for $\theta$: it summarizes all the information about $\theta$ contained in the sample.
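A small sketch of this point (sample size and grid are arbitrary choices): the log-likelihood computed from the full sample and the one computed from the sample maximum alone coincide, so they share the same maximizer over any grid of candidate values of $\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0, size=15)  # sample from U(0, theta) with theta = 2

def loglik_full(theta, x):
    # log of prod_i (1/theta) * 1(x_i <= theta), using every observation
    return -len(x) * np.log(theta) if (x <= theta).all() else -np.inf

def loglik_max(theta, x_max, n):
    # the same likelihood written in terms of the sample maximum only
    return -n * np.log(theta) if x_max <= theta else -np.inf

grid = np.linspace(0.5, 4.0, 2000)
lf = np.array([loglik_full(t, x) for t in grid])
lm = np.array([loglik_max(t, x.max(), len(x)) for t in grid])
print(grid[lf.argmax()], grid[lm.argmax()])  # same maximizer
```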

## Example: Normal

The pdf of a normal random variable with variance 1 is $f_{\left.X_{i}\right|\mu}\left(x\right)=\frac{1}{\sqrt{2\pi}}\exp\left\{ -\frac{1}{2}\left(x-\mu\right)^{2}\right\}$.

We also know that the mean of $n$ such normal random variables - each with variance 1 - is distributed as $\overline{X}\sim N\left(\mu,\frac{1}{n}\right)$.

Let us now check whether $\overline{X}$ is a sufficient statistic for $\mu$.

We can obtain $f_{\left.X_{i}\right|\mu,\overline{X}}$ by using the conditional normal formula

\begin{aligned} \mu_{\left.X_{i}\right|\overline{X}} & =\mu_{X_{i}}+\frac{\sigma_{12}}{\sigma_{\overline{X}}^{2}}\left(\overline{x}-\mu_{\overline{X}}\right)\\ \sigma_{\left.X_{i}\right|\overline{X}}^{2} & =\sigma_{X_{i}}^{2}-\frac{\sigma_{12}^{2}}{\sigma_{\overline{X}}^{2}}\end{aligned}

where

$\sigma_{12}$ denotes the covariance between $X_{i}$ and $\overline{X}$.

This covariance equals

$\sigma_{12}=Cov\left(X_{i},\overline{X}\right)=Cov\left(X_{i},\frac{1}{n}\sum_{j=1}^{n}X_{j}\right)=0+0+...+\frac{1}{n}Cov\left(X_{i},X_{i}\right)=\frac{1}{n}$.

In addition, $\mu_{X_{i}}=\mu_{\overline{X}}=\mu$, $\sigma_{X_{i}}^{2}=1$, and $\sigma_{\overline{X}}^{2}=\frac{1}{n}$ such that

$f_{\left.X_{i}\right|\mu,\overline{X}}\left(x\right)=N\left(\mu_{\left.X_{i}\right|\overline{X}},\sigma_{\left.X_{i}\right|\overline{X}}^{2}\right)=N\left(\mu+\left(\overline{x}-\mu\right),1-\frac{\frac{1}{n^{2}}}{\frac{1}{n}}\right)=N\left(\overline{x},1-\frac{1}{n}\right)$, which does not depend on $\mu$.

A similar calculation for the joint conditional density of the sample shows that $f_{\left.X_{1}..X_{n}\right|\mu,\overline{X}}\left(x\right)=f_{\left.X_{1}..X_{n}\right|\overline{X}}\left(x\right)$: the distribution of the sample, once conditioned on $\overline{X}$, does not depend on $\mu$.

Hence, $\overline{X}$ is a sufficient statistic for $\mu$, and the effect of $\mu$ on a maximum likelihood estimator - through its role in generating the data - takes place only through its effect on $\overline{X}$.
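A Monte Carlo sketch consistent with this result (parameter values are arbitrary): since $\left.X_{1}\right|\overline{X}=\overline{x}\sim N\left(\overline{x},1-\frac{1}{n}\right)$, the residual $X_{1}-\overline{X}$ should have mean 0 and variance $1-\frac{1}{n}$, and be uncorrelated with $\overline{X}$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 5.0, 4, 200_000

# Each row is a sample of n iid N(mu, 1) draws.
x = rng.normal(mu, 1.0, size=(reps, n))
xbar = x.mean(axis=1)
resid = x[:, 0] - xbar

# Consistent with X_1 | Xbar = xbar ~ N(xbar, 1 - 1/n):
print(resid.mean())                    # close to 0
print(resid.var())                     # close to 1 - 1/4 = 0.75
print(np.corrcoef(resid, xbar)[0, 1])  # close to 0
```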

# Rao-Blackwell Theorem

The Rao-Blackwell theorem allows us to take an existing estimator and create a (weakly) more efficient one. Doing so requires a sufficient statistic.

The theorem states the following:

Let $\widehat{\theta}\in W_{u}$ and let $T$ be a sufficient statistic for $\theta$.

Then,

• $\widetilde{\theta}=E\left(\left.\widehat{\theta}\right|T\right)\in W_{u}$
• $Var_{\theta}\left(\widetilde{\theta}\right)\leq Var_{\theta}\left(\widehat{\theta}\right),\,\forall\theta\in\Theta$

The new estimator $\widetilde{\theta}$ is the expected value of a previous one, $\widehat{\theta}$, conditioning on statistic $T$. As we will see, the conditioning preserves the mean (so that if $\widehat{\theta}$ is unbiased, so is $\widetilde{\theta}$), and reduces variance.

Let us first open up the formula for the new estimator:

\begin{aligned} \widetilde{\theta}\left(x\right)=E\left(\left.\widehat{\theta}\right|T\right) & =\int_{-\infty}^{\infty}\widehat{\theta}\left(x\right)f_{\left.X\right|\theta,T}\left(x\right)dx\\ & =\int_{-\infty}^{\infty}\widehat{\theta}\left(x\right)f_{\left.X\right|T}\left(x\right)dx\end{aligned}

where the second equality follows from the fact that $T$ is a sufficient statistic. This clarifies why we require a sufficient statistic $T$ to apply the Rao-Blackwell theorem: if $T$ were not sufficient, the expectation $E\left(\left.\widehat{\theta}\right|T\right)$ would produce a function of $\theta$, which by definition cannot be an estimator.

We now prove the theorem:

• $E_{\theta}\left(\widetilde{\theta}\right)=E_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\underset{L.I.E.}{\underbrace{=}}E_{\theta}\left(\widehat{\theta}\right)\underset{\widehat{\theta}\in W_{u}}{\underbrace{=}}\theta.$
• $Var_{\theta}\left(\widetilde{\theta}\right)=Var_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\underset{C.V.I.}{\underbrace{=}}Var_{\theta}\left(\widehat{\theta}\right)-E_{\theta}\left(Var\left(\left.\widehat{\theta}\right|T\right)\right)$. Because $E_{\theta}\left(Var\left(\left.\widehat{\theta}\right|T\right)\right)\geq0$, $Var_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\leq Var_{\theta}\left(\widehat{\theta}\right)$.

The operation of producing an estimator via the conditional expectation on a sufficient statistic is often called Rao-Blackwellization.
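A classic illustration in the uniform setting (not worked out in the text, so treat it as a sketch): $\widehat{\theta}=2X_{1}$ is unbiased for $\theta$, and Rao-Blackwellizing it on the sufficient statistic $T=X_{\left(n\right)}$ yields $\widetilde{\theta}=E\left(\left.2X_{1}\right|X_{\left(n\right)}\right)=\frac{n+1}{n}X_{\left(n\right)}$, which keeps the mean but has far smaller variance:

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, n, reps = 2.0, 10, 200_000

x = rng.uniform(0.0, theta0, size=(reps, n))
crude = 2.0 * x[:, 0]             # unbiased, but uses only the first observation
rb = (n + 1) / n * x.max(axis=1)  # E(2 X_1 | X_(n)) = (n+1)/n * X_(n)

# Both estimators are unbiased; Rao-Blackwellization slashes the variance.
print(crude.mean(), rb.mean())  # both close to theta0 = 2
print(crude.var(), rb.var())    # rb has a much smaller variance
```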

# Factorization Theorem

As we saw in the example of the normal distribution, it can be tedious to verify that a statistic is sufficient for a parameter. Luckily, the factorization theorem makes this easy, provided the pmf/pdf of the sample is available:

Let $X_{1}..X_{n}$ be a random sample from a distribution with pmf/pdf $f\left(\left.\cdot\right|\theta\right)$, where $\theta\in\Theta$ is unknown.

A statistic $T=T\left(X_{1}..X_{n}\right)$ is sufficient for $\theta$ if and only if there exist functions $g\left(\cdot\right)$ and $h\left(\cdot\right)$ s.t. $\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=g\left(\left.T\left(x_{1},...,x_{n}\right)\right|\theta\right).h\left(x_{1},...,x_{n}\right)$ for every $\left(x_{1}..x_{n}\right)\in\mathbb{R}^{n}$ and every $\theta\in\Theta$.

## Example: Uniform

Suppose $X_{i}\sim U\left(0,\theta\right)$ such that the joint pdf equals

$\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=\underset{g\left(\left.x_{\left(n\right)}\right|\theta\right)}{\underbrace{\frac{1}{\theta^{n}}.1\left(x_{\left(n\right)}\leq\theta\right)}}.\underset{h\left(x_{1}..x_{n}\right)}{\underbrace{1\left(x_{\left(1\right)}\geq0\right)}}$

Hence, $x_{\left(n\right)}$ is a sufficient statistic for $\theta$. One intuition for this result is that the maximization of the likelihood function w.r.t. $\theta$ depends only on $x_{\left(n\right)}$, since $h\left(x_{1}..x_{n}\right)$ does not depend on $\theta$ and therefore does not affect the estimator.