# Point Estimation (cont.)

## Example: Uniform

Suppose $X_{i}\overset{iid}{\sim}U\left(0,\theta\right)$ where $\theta\gt 0$ is unknown.

The likelihood function equals $L\left(\left.\theta\right|x_{1}..x_{n}\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=\Pi_{i=1}^{n}\frac{1}{\theta}1\left(0\leq x_{i}\leq\theta\right)$

Since $x_{i}$’s are draws from $U\left(0,\theta\right)$, the $0\leq x_{i}$ constraint will always be satisfied. However, since we are uncertain about the true value of $\theta$, the upper constraint may be binding.

This yields the following likelihood: $\Pi_{i=1}^{n}\frac{1}{\theta}1\left(0\leq x_{i}\leq\theta\right)=\Pi_{i=1}^{n}\frac{1}{\theta}1\left(x_{i}\leq\theta\right)=\frac{1}{\theta^{n}}1\left(x_{\left(n\right)}\leq\theta\right)$

Notice that $L\left(\left.\cdot\right|x_{1}..x_{n}\right)$ is not differentiable at $\theta=x_{\left(n\right)}$. We separate the problem:

• $L\left(\left.\cdot\right|x_{1}..x_{n}\right)=0$ if $\theta\lt x_{\left(n\right)}$; this reflects the impossibility of generating a value above $\theta$.
• $L\left(\left.\cdot\right|x_{1}..x_{n}\right)=\frac{1}{\theta^{n}}$ if $\theta\geq x_{\left(n\right)}$; this is decreasing in $\theta$, so the constraint binds, and $\widehat{\theta}_{ML}=x_{\left(n\right)}$.

Notice that the maximum likelihood estimator differs from the method of moments estimator: $\widehat{\theta}_{ML}=x_{\left(n\right)}$ while $\widehat{\theta}_{MM}=2\overline{x}$.

Unlike with the method of moments, we can never obtain an estimate such that $x_{i}\gt \widehat{\theta}_{ML}$ for some observation $i$.

However, as we will discuss later, there is some bad news.

The fact that we can never obtain $\widehat{\theta}_{ML}\gt \theta_{0}$, where $\theta_{0}$ is the true value of parameter $\theta$, means that the maximum likelihood estimator is likely to systematically underestimate the true parameter value.
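This bias can be checked with a short Monte Carlo sketch (the values `theta0 = 2` and `n = 20` are arbitrary choices): for $U\left(0,\theta_{0}\right)$ the sample maximum has expectation $\frac{n}{n+1}\theta_{0}$, so $\widehat{\theta}_{ML}$ falls short of $\theta_{0}$ on average, while $\widehat{\theta}_{MM}=2\overline{x}$ is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 20, 5000

# Draw `reps` samples of size n from U(0, theta0).
x = rng.uniform(0.0, theta0, size=(reps, n))

theta_ml = x.max(axis=1)         # maximum likelihood: sample maximum
theta_mm = 2.0 * x.mean(axis=1)  # method of moments: twice the sample mean

# theta_ml never exceeds theta0 and is biased downward (E = n/(n+1) * theta0);
# theta_mm is unbiased but can land on either side of theta0.
print(theta_ml.mean())  # close to 20/21 * 2, i.e. about 1.905
print(theta_mm.mean())  # close to 2.0
```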

# Evaluating Estimators

A good estimator of $\theta$ is close to $\theta$ in some probabilistic sense. For reasons of convenience, the leading criterion is the mean squared error:

The mean squared error (MSE) of an estimator $\widehat{\theta}$ of $\theta\in\Theta\subseteq\mathbb{R}$ is a function (of $\widehat{\theta}$) given by

$MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right]$

where, from here on, we use the notation $E_{\theta}\left[\cdot\right]=E_{\theta}\left[\left.\cdot\right|\theta\right]$; that is, the subscript indicates the variable to be conditioned on (previously, it denoted the variable of integration). So,

$MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right]=E\left[\left.\left(\theta-\widehat{\theta}\right)^{2}\right|\theta\right]$

The interpretation is that the MSE gives us the expected quadratic difference between our estimator and a specific value of $\theta$, which we usually take to be the true value.

MSE is mostly popular due to its tractability. When $\theta$ is a vector of parameters, we employ the vector version instead: $MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right).\left(\theta-\widehat{\theta}\right)^{'}\right]$

The vector version of the MSE produces a matrix. To compare two estimator vectors, we compare these matrices: we say one MSE is lower than another if the difference of the matrices is positive semi-definite (i.e., $z'.M.z\geq0,\,\forall z$). We will confine ourselves to the scalar case most of the time.
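As an illustration of this comparison (a sketch with made-up matrices, not estimators derived in the text), positive semi-definiteness of the difference can be checked via its eigenvalues:

```python
import numpy as np

# Hypothetical MSE matrices for two vector estimators (made-up numbers).
mse1 = np.array([[2.0, 0.3],
                 [0.3, 1.5]])
mse2 = np.array([[1.0, 0.2],
                 [0.2, 0.8]])

# mse2 is "lower" than mse1 if mse1 - mse2 is positive semi-definite,
# i.e., all eigenvalues of the symmetric difference are non-negative.
diff = mse1 - mse2
eigenvalues = np.linalg.eigvalsh(diff)
print(eigenvalues)  # both non-negative => estimator 2 has lower MSE
```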

If $MSE_{\theta}\left(\widehat{\theta}_{1}\right)\gt MSE_{\theta}\left(\widehat{\theta}_{2}\right)$ for all values of $\theta$, we are tempted to say that $\widehat{\theta}_{2}$ is better, since it is on average closer to $\theta$, whatever value it has. However, we may feel differently if $\widehat{\theta}_{2}$ systematically underestimates (or overestimates) $\theta$.

In order to take this into account we now introduce the concept of bias:

$Bias_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left(\theta-\widehat{\theta}\right)$

Whenever $E_{\theta}\left(\theta-\widehat{\theta}\right)=0$ - or equivalently, $E_{\theta}\left(\widehat{\theta}\right)=\theta$ - we say estimator $\widehat{\theta}$ is unbiased.

What follows is a fundamental result about the decomposition of the MSE:

$MSE_{\theta}\left(\widehat{\theta}\right)=Var_{\theta}\left(\widehat{\theta}\right)+Bias_{\theta}\left(\widehat{\theta}\right)^{2}$

This means that, among estimators with a given MSE, there is a tradeoff between bias and variance.

The proof of the result is obtained by adding and subtracting $E_{\theta}\left(\widehat{\theta}\right)$:

\begin{aligned} MSE_{\theta}\left(\widehat{\theta}\right)=E_{\theta}\left[\left(\theta-\widehat{\theta}\right)^{2}\right] & =E_{\theta}\left[\left(\theta-\widehat{\theta}+E_{\theta}\left(\widehat{\theta}\right)-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}\right]\\ & =E_{\theta}\left[\left(\widehat{\theta}-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}\right]+\underset{=\left(\theta-E_{\theta}\left(\widehat{\theta}\right)\right)^{2}}{\underbrace{E_{\theta}\left[\left(E_{\theta}\left(\widehat{\theta}\right)-\theta\right)^{2}\right]}}+\underset{=0}{\underbrace{2E_{\theta}\left[\left(\widehat{\theta}-E_{\theta}\left(\widehat{\theta}\right)\right)\left(E_{\theta}\left(\widehat{\theta}\right)-\theta\right)\right]}}\\ & =Var_{\theta}\left(\widehat{\theta}\right)+Bias_{\theta}\left(\widehat{\theta}\right)^{2}\end{aligned}

The cross term vanishes because $E_{\theta}\left(\widehat{\theta}\right)-\theta$ is a constant and $E_{\theta}\left[\widehat{\theta}-E_{\theta}\left(\widehat{\theta}\right)\right]=0$.
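The decomposition can also be verified numerically. Below is a sketch using the biased estimator $x_{\left(n\right)}$ from the uniform example (the parameter values are arbitrary); for sample moments the identity holds exactly, up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 2.0, 10, 200_000

# Biased estimator of theta0 from the uniform example: the sample maximum.
x = rng.uniform(0.0, theta0, size=(reps, n))
est = x.max(axis=1)

mse = np.mean((est - theta0) ** 2)
var = est.var()                       # np.var uses ddof=0, matching the identity
bias_sq = (est.mean() - theta0) ** 2

# MSE = Var + Bias^2
print(mse, var + bias_sq)
```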

We now define an efficient estimator:

Let $W$ be a collection of estimators of $\theta\in\Theta$. An estimator $\widehat{\theta}$ is efficient relative to $W$ if $MSE_{\theta}\left(\widehat{\theta}\right)\leq MSE_{\theta}\left(w\right),\,\forall\theta\in\Theta,\,\forall w\in W$.

In order to find a “best” estimator, we have to restrict $W$ in some way (otherwise, we can often find many estimators with equal MSE, by exploiting the bias/variance tradeoff).

# Minimum Variance Estimators

We usually focus our attention on unbiased estimators: those that, on average, produce the correct result.

The collection of unbiased estimators is $W_{u}=\left\{ w:\,Bias_{\theta}\left(w\right)=0,\,Var_{\theta}\left(w\right)\lt \infty,\,\forall\theta\in\Theta\right\}$.

So, if $\widehat{\theta}\in W_{u}$, then $MSE_{\theta}\left(\widehat{\theta}\right)=Var_{\theta}\left(\widehat{\theta}\right).$

We can now define a type of minimum variance estimator:

An estimator $\widehat{\theta}\in W_{u}$ of $\theta$ is a uniform minimum-variance unbiased (UMVU) estimator of $\theta$ if it is efficient relative to $W_{u}$.

The minimum-variance unbiased part of UMVU should be clear: of the unbiased estimators, $\widehat{\theta}$ is “MVU” if it achieves the lowest variance. The “uniform” part simply means that $\widehat{\theta}$ is unbiased and minimum variance for all values that $\theta$ may take. It is MVU if $\theta=4$, and if $\theta=-3$, etc.

It is often possible to identify UMVU estimators. The tool for doing this is the Rao-Blackwell theorem. Before presenting it, we need to introduce an additional concept.

# Sufficient Statistics

Let $X_{1}..X_{n}$ be a random sample from a distribution with pmf/pdf $f\left(\left.\cdot\right|\theta\right)$, where $\theta\in\Theta$ is unknown.

A statistic $T=T\left(X_{1}..X_{n}\right)$ is a sufficient statistic for parameter $\theta$ if the conditional pmf/pdf of $\left(X_{1}..X_{n}\right)$ given $T$ does not depend on $\theta$, i.e.,

$f\left(\left.X\right|\theta,T\right)=f\left(\left.X\right|T\right)$.

The reason why we are interested in sufficient statistics will be clear once we present the Rao-Blackwell theorem. However, it is worth thinking a bit about the meaning of sufficient statistics first.

Intuitively, sufficient statistics summarize all the useful information in a sample that can be used to characterize $\theta$. Once conditioned upon, a sufficient statistic is so informative about $\theta$ that nothing left in the sample is useful for learning about $\theta$.

## Intuitive Example: Uniform

Consider the uniform sample pdf

$f_{X_{1}..X_{n}}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(X_{1}\leq\theta\wedge X_{2}\leq\theta\wedge...\wedge X_{n}\leq\theta\right)$.

In order to estimate $\theta$, one could write down the likelihood function based on $f_{X}\left(\left.x\right|\theta\right)$, which uses each observation in the sample. However, note that it is sufficient to have information about the maximum observation, $X_{\left(n\right)}$.

The previous pdf can also be written as $f_{X}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(X_{\left(n\right)}\leq\theta\right)$, meaning that a researcher employing maximum likelihood will obtain the same estimate of $\theta$ independently of whether she observes the whole sample or simply the sample maximum.

Indeed, $X_{\left(n\right)}$ is a sufficient statistic for $\theta$: it summarizes all the information about $\theta$ contained in the sample.
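A small sketch of this point (sample size and grid are arbitrary choices): the log-likelihood computed from the full sample and the one computed from the sample maximum alone coincide, so they share the same maximizer over any grid of candidate values of $\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0, size=15)  # sample from U(0, theta) with theta = 2

def loglik_full(theta, x):
    # log of prod_i (1/theta) * 1(x_i <= theta), using every observation
    return -len(x) * np.log(theta) if (x <= theta).all() else -np.inf

def loglik_max(theta, x_max, n):
    # the same likelihood written in terms of the sample maximum only
    return -n * np.log(theta) if x_max <= theta else -np.inf

grid = np.linspace(0.5, 4.0, 2000)
lf = np.array([loglik_full(t, x) for t in grid])
lm = np.array([loglik_max(t, x.max(), len(x)) for t in grid])
print(grid[lf.argmax()], grid[lm.argmax()])  # same maximizer
```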

## Example: Normal

The pdf of a normal random variable with variance 1 is $f_{\left.X_{i}\right|\mu}\left(x\right)=\frac{1}{\sqrt{2\pi}}\exp\left\{ -\frac{1}{2}\left(x-\mu\right)^{2}\right\}$.

We also know that the mean of $n$ such normal random variables - each with variance 1 - is distributed as $\overline{X}\sim N\left(\mu,\frac{1}{n}\right)$.

Let us now check whether $\overline{X}$ is a sufficient statistic for $\mu$.

We can obtain $f_{\left.X_{i}\right|\mu,\overline{X}}$ by using the conditional normal formula

\begin{aligned} \mu_{\left.X_{i}\right|\overline{X}} & =\mu_{X_{i}}+\frac{\sigma_{12}}{\sigma_{\overline{X}}^{2}}\left(\overline{x}-\mu_{\overline{X}}\right)\\ \sigma_{\left.X_{i}\right|\overline{X}}^{2} & =\sigma_{X_{i}}^{2}-\frac{\sigma_{12}^{2}}{\sigma_{\overline{X}}^{2}}\end{aligned}

where

$\sigma_{12}$ denotes the covariance between $X_{i}$ and $\overline{X}$.

This covariance equals

$\sigma_{12}=Cov\left(X_{i},\overline{X}\right)=Cov\left(X_{i},\frac{1}{n}\sum_{j=1}^{n}X_{j}\right)=0+0+...+\frac{1}{n}Cov\left(X_{i},X_{i}\right)=\frac{1}{n}$.

In addition, $\mu_{X_{i}}=\mu_{\overline{X}}=\mu$, $\sigma_{X_{i}}^{2}=1$, and $\sigma_{\overline{X}}^{2}=\frac{1}{n}$ such that

$f_{\left.X_{i}\right|\mu,\overline{X}}\left(x\right)=N\left(\mu_{\left.X_{i}\right|\overline{X}},\sigma_{\left.X_{i}\right|\overline{X}}^{2}\right)=N\left(\mu+\left(\overline{x}-\mu\right),1-\frac{\frac{1}{n^{2}}}{\frac{1}{n}}\right)=N\left(\overline{x},1-\frac{1}{n}\right)$, which does not depend on $\mu$.

A similar calculation for the joint conditional density of the sample shows that $f_{\left.X_{1}..X_{n}\right|\mu,\overline{X}}\left(x\right)=f_{\left.X_{1}..X_{n}\right|\overline{X}}\left(x\right)$: the distribution of the sample, once conditioned on $\overline{X}$, does not depend on $\mu$.

Hence, $\overline{X}$ is a sufficient statistic for $\mu$, and the effect of $\mu$ on a maximum likelihood estimator - through its role in generating the data - takes place only through its effect on $\overline{X}$.
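A Monte Carlo sketch consistent with this result (parameter values are arbitrary): since $\left.X_{1}\right|\overline{X}=\overline{x}\sim N\left(\overline{x},1-\frac{1}{n}\right)$, the residual $X_{1}-\overline{X}$ should have mean 0 and variance $1-\frac{1}{n}$, and be uncorrelated with $\overline{X}$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 5.0, 4, 200_000

# Each row is a sample of n iid N(mu, 1) draws.
x = rng.normal(mu, 1.0, size=(reps, n))
xbar = x.mean(axis=1)
resid = x[:, 0] - xbar

# Consistent with X_1 | Xbar = xbar ~ N(xbar, 1 - 1/n):
print(resid.mean())                    # close to 0
print(resid.var())                     # close to 1 - 1/4 = 0.75
print(np.corrcoef(resid, xbar)[0, 1])  # close to 0
```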

# Rao-Blackwell Theorem

The Rao-Blackwell theorem allows us to take an existing estimator and create a (weakly) more efficient one. Doing so requires a sufficient statistic.

The theorem states the following:

Let $\widehat{\theta}\in W_{u}$ and let $T$ be a sufficient statistic for $\theta$.

Then,

• $\widetilde{\theta}=E\left(\left.\widehat{\theta}\right|T\right)\in W_{u}$
• $Var_{\theta}\left(\widetilde{\theta}\right)\leq Var_{\theta}\left(\widehat{\theta}\right),\,\forall\theta\in\Theta$

The new estimator $\widetilde{\theta}$ is the expected value of a previous one, $\widehat{\theta}$, conditioning on statistic $T$. As we will see, the conditioning preserves the mean (so that if $\widehat{\theta}$ is unbiased, so is $\widetilde{\theta}$), and reduces variance.

Let us first open up the formula for the new estimator:

\begin{aligned} \widetilde{\theta}\left(x\right)=E\left(\left.\widehat{\theta}\right|T\right) & =\int_{-\infty}^{\infty}\widehat{\theta}\left(x\right)f_{\left.X\right|\theta,T}\left(x\right)dx\\ & =\int_{-\infty}^{\infty}\widehat{\theta}\left(x\right)f_{\left.X\right|T}\left(x\right)dx\end{aligned}

where the second equality follows from the fact that $T$ is a sufficient statistic. This clarifies why we require a sufficient statistic $T$ to apply the Rao-Blackwell theorem: if $T$ were not sufficient, the expectation $E\left(\left.\widehat{\theta}\right|T\right)$ would produce a function of $\theta$, which by definition cannot be an estimator.

We now prove the theorem:

• $E_{\theta}\left(\widetilde{\theta}\right)=E_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\underset{L.I.E.}{\underbrace{=}}E_{\theta}\left(\widehat{\theta}\right)\underset{\widehat{\theta}\in W_{u}}{\underbrace{=}}\theta.$
• $Var_{\theta}\left(\widetilde{\theta}\right)=Var_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\underset{C.V.I.}{\underbrace{=}}Var_{\theta}\left(\widehat{\theta}\right)-E_{\theta}\left(Var\left(\left.\widehat{\theta}\right|T\right)\right)$. Because $E_{\theta}\left(Var\left(\left.\widehat{\theta}\right|T\right)\right)\geq0$, $Var_{\theta}\left(E\left(\left.\widehat{\theta}\right|T\right)\right)\leq Var_{\theta}\left(\widehat{\theta}\right)$.

The operation of producing an estimator via the conditional expectation on a sufficient statistic is often called Rao-Blackwellization.
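A classic illustration in the uniform setting (not worked out in the text, so treat it as a sketch): $\widehat{\theta}=2X_{1}$ is unbiased for $\theta$, and Rao-Blackwellizing it on the sufficient statistic $T=X_{\left(n\right)}$ yields $\widetilde{\theta}=E\left(\left.2X_{1}\right|X_{\left(n\right)}\right)=\frac{n+1}{n}X_{\left(n\right)}$, which keeps the mean but has far smaller variance:

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, n, reps = 2.0, 10, 200_000

x = rng.uniform(0.0, theta0, size=(reps, n))
crude = 2.0 * x[:, 0]             # unbiased, but uses only the first observation
rb = (n + 1) / n * x.max(axis=1)  # E(2 X_1 | X_(n)) = (n+1)/n * X_(n)

# Both estimators are unbiased; Rao-Blackwellization slashes the variance.
print(crude.mean(), rb.mean())  # both close to theta0 = 2
print(crude.var(), rb.var())    # rb has a much smaller variance
```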

# Factorization Theorem

As we saw in the example of the normal distribution, it can be tedious to verify that a statistic is sufficient for a parameter. Luckily, the factorization theorem makes this easy, provided the pmf/pdf of the sample is available:

Let $X_{1}..X_{n}$ be a random sample from a distribution with pmf/pdf $f\left(\left.\cdot\right|\theta\right)$, where $\theta\in\Theta$ is unknown.

A statistic $T=T\left(X_{1}..X_{n}\right)$ is sufficient for $\theta$ if and only if there exist functions $g\left(\cdot\right)$ and $h\left(\cdot\right)$ s.t. $\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=g\left(\left.T\left(x_{1},...,x_{n}\right)\right|\theta\right).h\left(x_{1},...,x_{n}\right)$ for every $\left(x_{1}..x_{n}\right)\in\mathbb{R}^{n}$ and every $\theta\in\Theta$.

## Example: Uniform

Suppose $X_{i}\sim U\left(0,\theta\right)$ such that the joint pdf equals

$\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=\underset{g\left(\left.x_{\left(n\right)}\right|\theta\right)}{\underbrace{\frac{1}{\theta^{n}}.1\left(x_{\left(n\right)}\leq\theta\right)}}.\underset{h\left(x_{1}..x_{n}\right)}{\underbrace{1\left(x_{\left(1\right)}\geq0\right)}}$

Hence, $x_{\left(n\right)}$ is a sufficient statistic for $\theta$. One intuition for this result is that the maximization of the likelihood function w.r.t. $\theta$ depends only on $x_{\left(n\right)}$, since $h\left(x_{1}..x_{n}\right)$ does not depend on $\theta$ and therefore does not affect the estimator.