Full Lecture 9

From Significant Statistics

Point Estimation (cont.)

Example: Uniform

Suppose [math]X_{i}\overset{iid}{\sim}U\left(0,\theta\right)[/math] where [math]\theta\gt 0[/math] is unknown.

The likelihood function equals [math]L\left(\left.\theta\right|x_{1}..x_{n}\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=\Pi_{i=1}^{n}\frac{1}{\theta}1\left(0\leq x_{i}\leq\theta\right)[/math]

Since [math]x_{i}[/math]’s are draws from [math]U\left(0,\theta\right)[/math], the [math]0\leq x_{i}[/math] constraint will always be satisfied. However, since we are uncertain about the true value of [math]\theta[/math], the upper constraint may be binding.

This yields the following likelihood: [math]\Pi_{i=1}^{n}\frac{1}{\theta}1\left(0\leq x_{i}\leq\theta\right)=\Pi_{i=1}^{n}\frac{1}{\theta}1\left(x_{i}\leq\theta\right)=\frac{1}{\theta^{n}}1\left(x_{\left(n\right)}\leq\theta\right)[/math]

Notice that [math]L\left(\left.\cdot\right|x_{1}..x_{n}\right)[/math] is not differentiable at [math]\theta=x_{\left(n\right)}[/math]. We separate the problem:

  • [math]L\left(\left.\cdot\right|x_{1}..x_{n}\right)=0[/math] if [math]\theta\lt x_{\left(n\right)}[/math]; this reflects the impossibility of generating a value above [math]\theta[/math].
  • [math]L\left(\left.\cdot\right|x_{1}..x_{n}\right)=\frac{1}{\theta^{n}}[/math] if [math]\theta\geq x_{\left(n\right)}[/math]; this branch is decreasing in [math]\theta[/math], so the constraint is active and [math]\widehat{\theta}_{ML}=x_{\left(n\right)}[/math].

Notice that the maximum likelihood estimator is different from the method of moments estimator: [math]\widehat{\theta}_{ML}=x_{\left(n\right)}[/math] while [math]\widehat{\theta}_{MM}=2\overline{x}[/math].

Unlike with the method of moments, we can never obtain an estimate such that [math]x_{i}\gt \widehat{\theta}_{ML}[/math] for some observation.

However, as we will discuss later, there is some bad news.

The fact that we can never obtain [math]\widehat{\theta}_{ML}\gt \theta_{0}[/math], where [math]\theta_{0}[/math] is the true value of parameter [math]\theta[/math], means that the maximum likelihood estimator is likely to systematically underestimate the true parameter value.
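This underestimation is easy to see in a small simulation. The sketch below compares the two estimators over many replications; the true value [math]\theta_{0}=5[/math] and sample size [math]n=20[/math] are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 5.0          # "true" parameter value (assumed for the simulation)
n, reps = 20, 100_000

x = rng.uniform(0.0, theta0, size=(reps, n))
mle = x.max(axis=1)          # maximum likelihood estimator: x_(n)
mm = 2.0 * x.mean(axis=1)    # method of moments estimator: 2 * sample mean

print(mle.mean())  # below theta0: the MLE systematically underestimates
print(mm.mean())   # close to theta0: the MM estimator is unbiased
```

The simulated mean of the MLE settles near [math]\theta_{0}\frac{n}{n+1}[/math], consistent with the systematic underestimation described above.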

Evaluating Estimators

A good estimator of [math]\theta[/math] is close to [math]\theta[/math] in some probabilistic sense. For reasons of convenience, the leading criterion is the mean squared error:

The mean squared error (MSE) of an estimator [math]\widehat{\theta}[/math] of [math]\theta\in\Theta\subseteq\mathbb{R}[/math] is the function of [math]\theta[/math] given by

[math]MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right][/math]
where, from here on, we use the notation [math]E_{\theta}\left[\cdot\right]=E\left[\left.\cdot\right|\theta\right][/math], that is, the subscript indicates the variable to be conditioned on (previously, it indicated the variable of integration). So,

[math]MSE_{\theta}\left(\widehat{\theta}\right)=E\left[\left.\left(\theta-\widehat{\theta}\right)^{2}\right|\theta\right].[/math]
The interpretation is that MSE gives us the expected quadratic difference between our estimator and a specific value of [math]\theta[/math], which we usually assume to be some true value.

MSE is mostly popular due to its tractability. When [math]\theta[/math] is a vector of parameters, we employ the vector version instead:

[math]MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)\left(\theta-\widehat{\theta}\right)'\right][/math]
The vector version of the MSE produces a matrix. To compare two estimator vectors, we compare these matrices. Namely, we say an MSE is lower than another if the difference of the matrices is positive semi-definite (i.e., [math]z'Mz\geq0,\,\forall z[/math]). We will confine ourselves to the scalar case most of the time.
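As a numerical illustration of the matrix comparison (the MSE matrices below are made-up values), positive semi-definiteness of the difference can be checked via its eigenvalues:

```python
import numpy as np

# Hypothetical MSE matrices for two vector estimators (illustrative values only)
M1 = np.array([[2.0, 0.5],
               [0.5, 1.5]])
M2 = np.array([[1.0, 0.2],
               [0.2, 0.8]])

D = M1 - M2
# The second estimator has (weakly) lower MSE if D is positive semi-definite,
# i.e., all eigenvalues of the symmetric matrix D are nonnegative
eigs = np.linalg.eigvalsh(D)
print(eigs.min() >= 0)  # True
```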

If [math]MSE_{\theta}\left(\widehat{\theta}_{1}\right)\gt MSE_{\theta}\left(\widehat{\theta}_{2}\right)[/math] for all values of [math]\theta[/math], we are tempted to say that [math]\widehat{\theta}_{2}[/math] is better, since it is on average closer to [math]\theta[/math], whatever value it has. However, we may feel differently if [math]\widehat{\theta}_{2}[/math] systematically underestimates (or overestimates) [math]\theta[/math], whereas [math]\widehat{\theta}_{1}[/math] is on average correct.

In order to take this into account, we introduce the concept of bias:

[math]Bias_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left(\widehat{\theta}\right)-\theta[/math]
Whenever [math]E_{\theta}\left(\theta-\widehat{\theta}\right)=0[/math] - or equivalently, [math]E_{\theta}\left(\widehat{\theta}\right)=\theta[/math] - we say estimator [math]\widehat{\theta}[/math] is unbiased.

What follows is a fundamental result about the decomposition of the MSE:

[math]MSE_{\theta}\left(\widehat{\theta}\right)=Var_{\theta}\left(\widehat{\theta}\right)+Bias_{\theta}\left(\widehat{\theta}\right)^{2}[/math]
This means that, among estimators with a given MSE, there is a tradeoff between bias and variance.

The proof of the result is obtained by adding and subtracting [math]E_{\theta}\left(\widehat{\theta}\right)[/math]:

[math]\begin{aligned} MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right] & =E_{\theta}\left[\left(\theta-\widehat{\theta}+E_{\theta}\left(\widehat{\theta}\right)-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}\right]\\ & =E_{\theta}\left[\left(\widehat{\theta}-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}\right]+\underset{=\left(\theta-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}}{\underbrace{E_{\theta}\left[\left(E_{\theta}\left(\widehat{\theta}\right)-\theta\right)^{2}\right]}}+\underset{=0}{\underbrace{2\left(\theta-E_{\theta}\left(\widehat{\theta}\right)\right)E_{\theta}\left[E_{\theta}\left(\widehat{\theta}\right)-\widehat{\theta}\right]}}\\ & =Var_{\theta}\left(\widehat{\theta}\right)+Bias_{\theta}\left(\widehat{\theta}\right)^{2}\end{aligned}[/math]
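The decomposition can also be checked numerically. The sketch below uses a deliberately biased estimator of the mean of a normal sample; the shrinkage factor 0.9, the true mean, and the sample sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0          # "true" parameter value (assumed)
n, reps = 10, 200_000

# A deliberately biased estimator of theta0: shrink the sample mean toward 0
x = rng.normal(theta0, 1.0, size=(reps, n))
est = 0.9 * x.mean(axis=1)

mse = np.mean((est - theta0) ** 2)
var = est.var()
bias2 = (est.mean() - theta0) ** 2
print(mse, var + bias2)   # the two agree, up to floating-point error
```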

We now define an efficient estimator:

Let [math]W[/math] be a collection of estimators of [math]\theta\in\Theta[/math]. An estimator [math]\widehat{\theta}[/math] is efficient relative to [math]W[/math] if

[math]MSE_{\theta}\left(\widehat{\theta}\right)\leq MSE_{\theta}\left(w\right),\,\forall\theta\in\Theta,\,\forall w\in W[/math].

In order to find a “best” estimator, we have to restrict [math]W[/math] in some way (otherwise, we can often find many estimators with equal MSE, by exploiting the bias/variance tradeoff).

Minimum Variance Estimators

We usually focus our attention on unbiased estimators: those that, on average, produce the correct result.

The collection of unbiased estimators is

[math]W_{u}=\left\{ w:\,Bias_{\theta}\left(w\right)=0,\,Var_{\theta}\left(w\right)\lt \infty,\,\forall\theta\in\Theta\right\}[/math].

So, if [math]\widehat{\theta}\in W_{u}[/math], then [math]MSE_{\theta}\left(\widehat{\theta}\right)=Var_{\theta}\left(\widehat{\theta}\right).[/math]

We can now define a type of minimum variance estimator:

An estimator [math]\widehat{\theta}\in W_{u}[/math] of [math]\theta[/math] is a uniform minimum-variance unbiased (UMVU) estimator of [math]\theta[/math] if it is efficient relative to [math]W_{u}[/math].

The “minimum-variance unbiased” part of UMVU should be clear: of the unbiased estimators, [math]\widehat{\theta}[/math] is “MVU” if it achieves the lowest variance. The “uniform” part means that [math]\widehat{\theta}[/math] is minimum variance and unbiased for all values that [math]\theta[/math] may take: it is MVU if [math]\theta=4[/math], if [math]\theta=-3[/math], etc.

It is often possible to identify UMVU estimators. The tool to do this is the Rao-Blackwell theorem. Before we do so, we need to introduce an additional concept.

Sufficient Statistics

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta[/math] is unknown.

A statistic [math]T=T\left(X_{1}..X_{n}\right)[/math] is a sufficient statistic for parameter [math]\theta[/math] if the conditional pmf/pdf of [math]\left(X_{1}..X_{n}\right)[/math] given [math]T[/math] does not depend on [math]\theta[/math], i.e.,

[math]f\left(\left.x_{1}..x_{n}\right|\theta,T=t\right)=f\left(\left.x_{1}..x_{n}\right|T=t\right).[/math]
The reason we are interested in sufficient statistics will be clear once we present the Rao-Blackwell theorem. However, it is worth thinking a bit about the meaning of sufficient statistics first.

Intuitively, sufficient statistics are the portion of the data that carries all of the information relevant for [math]\theta[/math]; in particular, they are all one needs to calculate the maximum likelihood estimator of [math]\theta[/math].

Intuitive Example: Uniform

In order to estimate [math]\theta[/math] via maximum likelihood, one writes down the likelihood function of the sample:

[math]f_{X_{1}..X_{n}}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(x_{1}\leq\theta\wedge x_{2}\leq\theta\wedge...\wedge x_{n}\leq\theta\right).[/math]

We can rewrite the pdf of the sample as:

[math]f_{X_{1}..X_{n}}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(\max_{i=1..n}x_{i}\leq\theta\right).[/math]
From the expression above, it is clear that the likelihood, and hence the MLE, depends on the sample only through the maximum observation, not the whole sample.

Hence, [math]\max_{i=1..n}X_{i}[/math] is a sufficient statistic for [math]\theta[/math].

Remark: What does [math]f\left(\left.X\right|\theta,T\right)[/math] mean?

At this point, you have worked with conditional pdfs. For example, you may have worked with a pdf [math]f_{X|Y}[/math], where [math]X[/math] and [math]Y[/math] are random variables.

However, the expression [math]f\left(\left.X\right|\theta,T\right)[/math] means something slightly different. The reason is that [math]T=T\left(X_{1}..X_{n}\right)[/math] is a function of the random variables [math]X_{1}..X_{n},[/math] i.e., it's a function of the sample.

The simplest way to see this is to consider the following joint pmf:

                      [math]X_{1}[/math]
                      0      1      2
[math]X_{2}[/math]   0      5%     7%     8%
                      1      20%    5%     5%
                      2      15%    25%    10%

and suppose we would like to calculate the pmf

[math]f\left(\left.x_{1},x_{2}\right|X_{1}+X_{2}=1\right).[/math]
Clearly, this pmf is obtained by dividing the probability of observing each combination of [math]X_{1},X_{2}[/math] - for the cases where [math]X_{1}+X_{2}=1[/math] - by [math]P\left(X_{1}+X_{2}=1\right)[/math], whereas the remaining elements will be equal to zero. This yields the conditional pmf:

                      [math]X_{1}[/math]
                      0      1      2
[math]X_{2}[/math]   0      0      26%    0
                      1      74%    0      0
                      2      0      0      0

where [math]7\% \div (20\%+7\%) = 26\% [/math] and [math]20\% \div (20\%+7\%) = 74\%.[/math] In this example, [math]T\left(X_{1}..X_{n}\right)=X_{1}+X_{2}.[/math]
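The conditional pmf above can be reproduced mechanically. The sketch below builds the joint table and conditions on the event [math]X_{1}+X_{2}=1[/math]:

```python
import numpy as np

# Joint pmf of (X1, X2) from the table: rows index X2 = 0,1,2; cols index X1 = 0,1,2
p = np.array([[0.05, 0.07, 0.08],
              [0.20, 0.05, 0.05],
              [0.15, 0.25, 0.10]])

x1, x2 = np.meshgrid([0, 1, 2], [0, 1, 2])
mask = (x1 + x2 == 1)             # the event T = X1 + X2 = 1

# Zero out cells outside the event, then divide by P(X1 + X2 = 1)
cond = np.where(mask, p, 0.0) / p[mask].sum()
print(np.round(cond, 2))          # 26% at (X2=0, X1=1), 74% at (X2=1, X1=0)
```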

It follows that

[math]f\left(\left.x\right|T\left(X\right)=t\right)=\frac{f\left(x\right)1\left(T\left(x\right)=t\right)}{f_{T}\left(t\right)}.[/math]
This result is valid for both pmfs and pdfs.

The result is a bit different from the one we are used to when conditioning [math]X[/math] on [math]Y[/math], for example. The reason is that here we are not conditioning [math]X[/math] on a different random variable; rather, we are interested in the pdf of the random vector [math]X[/math], conditional on it satisfying the equality [math]T\left(X\right)=t.[/math]


Rao-Blackwell Theorem

The Rao-Blackwell theorem allows us to take an existing estimator, and create a more efficient one. In order to do this, one requires a sufficient statistic.

The theorem states the following:

Let [math]\widehat{\theta}\in W_{u}[/math] and let [math]T[/math] be a sufficient statistic for [math]\theta[/math].

Then:
  • [math]\widetilde{\theta}=E\left(\left.\widehat{\theta}\right|T\right)\in W_{u}[/math]
  • [math]Var_{\theta}\left(\widetilde{\theta}\right)\leq Var_{\theta}\left(\widehat{\theta}\right),\,\forall\theta\in\Theta[/math]

The new estimator [math]\widetilde{\theta}[/math] is the expected value of a previous one, [math]\widehat{\theta}[/math], conditioning on statistic [math]T[/math]. As we will see, the conditioning preserves the mean (so that if [math]\widehat{\theta}[/math] is unbiased, so is [math]\widetilde{\theta}[/math]), and reduces variance.

Let us first open up the formula for the new estimator:

[math]\begin{aligned} \widetilde{\theta}=E\left(\left.\widehat{\theta}\right|T=t\right) & =\int_{-\infty}^{\infty}\widehat{\theta}\left(x\right)f_{\left.X\right|\theta,T=t}\left(x\right)dx\\ & =\int_{-\infty}^{\infty}\widehat{\theta}\left(x\right)f_{\left.X\right|T=t}\left(x\right)dx\end{aligned}[/math]

where the second equality follows from the fact that [math]T[/math] is a sufficient statistic. This clarifies why we require a sufficient statistic [math]T[/math] to apply the Rao-Blackwell theorem: if [math]T[/math] were not sufficient, the expectation [math]E\left(\left.\widehat{\theta}\right|T\right)[/math] would in general produce a function of [math]\theta[/math], which cannot be an estimator by definition.

We now prove the theorem:

  • [math]E_{\theta}\left(\widetilde{\theta}\right)=E_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\underset{L.I.E.}{\underbrace{=}}E_{\theta}\left(\widehat{\theta}\right)\underset{\widehat{\theta}\in W_{u}}{\underbrace{=}}\theta.[/math]
  • [math]Var_{\theta}\left(\widetilde{\theta}\right)=Var_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\underset{C.V.I.}{\underbrace{=}}Var_{\theta}\left(\widehat{\theta}\right)-E_{\theta}\left(Var\left(\left.\widehat{\theta}\right|T\right)\right)[/math]. Because [math]E_{\theta}\left(Var\left(\left.\widehat{\theta}\right|T\right)\right)\geq0[/math], [math]Var_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\leq Var_{\theta}\left(\widehat{\theta}\right)[/math].

The operation of producing an estimator via the conditional expectation on a sufficient statistic is often called Rao-Blackwellization.
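As a concrete sketch of Rao-Blackwellization (a Bernoulli example of our own choosing, not from the lecture): start from the unbiased but crude estimator [math]\widehat{\theta}=X_{1}[/math]; the statistic [math]T=\sum_{i}X_{i}[/math] is sufficient, and by symmetry [math]E\left(\left.X_{1}\right|T=t\right)=t/n[/math], the sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.3          # "true" success probability (assumed)
n, reps = 15, 200_000

x = rng.binomial(1, theta0, size=(reps, n))
crude = x[:, 0].astype(float)      # unbiased but noisy: theta_hat = X_1
t = x.sum(axis=1)                  # sufficient statistic T = sum of X_i
rb = t / n                         # Rao-Blackwellized: E(X_1 | T = t) = t/n

print(crude.mean(), rb.mean())     # both approximately theta0 (unbiased)
print(crude.var(), rb.var())       # variance drops after Rao-Blackwellization
```

The conditioning preserves unbiasedness while shrinking the variance by roughly a factor of [math]n[/math] in this case.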

Factorization Theorem

As we saw in the example of the Normal distribution, it can be tedious to find a sufficient statistic for a parameter. Luckily, the factorization theorem makes it easy, provided the pmf/pdf of the sample is available:

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta[/math] is unknown.

A statistic [math]T=T\left(X_{1}..X_{n}\right)[/math] is sufficient for [math]\theta[/math] if and only if there exist functions [math]g\left(\cdot\right)[/math] and [math]h\left(\cdot\right)[/math] s.t.

[math]f\left(\left.x_{1}..x_{n}\right|\theta\right)=g\left(\left.T\left(x_{1}..x_{n}\right)\right|\theta\right)\cdot h\left(x_{1}..x_{n}\right)[/math]
for every [math]\left(x_{1}..x_{n}\right)\in\mathbb{R}^{n}[/math] and every [math]\theta\in\Theta.[/math]

One way to understand the result above is that, if we were to maximize the likelihood, only the first factor would be relevant, since the second is a constant that does not depend on the parameter.

Example: Uniform

Suppose [math]X_{i}\sim U\left(0,\theta\right)[/math] such that the joint pdf equals

[math]f\left(\left.x_{1}..x_{n}\right|\theta\right)=\underset{g\left(\left.T\left(x_{1}..x_{n}\right)\right|\theta\right)}{\underbrace{\frac{1}{\theta^{n}}1\left(x_{\left(n\right)}\leq\theta\right)}}\cdot\underset{h\left(x_{1}..x_{n}\right)}{\underbrace{1\left(x_{\left(1\right)}\geq0\right)}}[/math]
From the factorization theorem, [math]x_{\left(n\right)}[/math] is a sufficient statistic for [math]\theta[/math]. Again, notice that the maximization of the likelihood function w.r.t. [math]\theta[/math] will only depend on [math]x_{\left(n\right)}[/math], since [math]h\left(x_{1}..x_{n}\right)[/math] is a constant that will not affect the estimator.
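The factorization can be verified numerically on a small sample; the sample values and the candidate values of [math]\theta[/math] below are arbitrary:

```python
import numpy as np

def joint_pdf(x, theta):
    # Joint pdf of an iid U(0, theta) sample, as a product of the marginals
    inside = np.all((0 <= x) & (x <= theta))
    return inside / theta ** len(x)

def g(t, theta, n):
    # g(T(x) | theta) from the factorization: depends on x only through t = x_(n)
    return (t <= theta) / theta ** n

def h(x):
    # h(x_1..x_n): free of theta
    return float(np.all(x >= 0))

x = np.array([0.4, 2.1, 1.3])
for theta in (2.5, 3.0, 5.0):
    # the two sides of the factorization agree for every theta
    assert joint_pdf(x, theta) == g(x.max(), theta, len(x)) * h(x)
print("factorization verified")
```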