Full Lecture 10

Finding UMVU estimators

In the previous lecture, we introduced the Rao-Blackwell theorem, which can be used to reduce the variance of an existing estimator while preserving its mean. This immediately implies that UMVU estimators need to be based on sufficient statistics: otherwise, one could Rao-Blackwellize such an estimator by conditioning on a sufficient statistic (one always exists, since the whole sample is itself sufficient) and obtain an estimator that is at least as efficient.

While the Rao-Blackwell theorem is useful for finding a more efficient estimator, we still lack a method that produces an UMVU estimator. It turns out that Rao-Blackwellization yields the unique UMVU under certain conditions, which are given by the Lehmann-Scheffé Theorem.

Lehmann-Scheffé Theorem

Let [math]T[/math] be a sufficient and complete statistic for [math]\theta[/math]. Then, if [math]\widehat{\theta}[/math] is unbiased,

[math]\widetilde{\theta}=E_{\theta}\left(\left.\widehat{\theta}\right|T\right)[/math] is the unique UMVU.

We will define what it means for a statistic to be complete soon.

A question that immediately arises is whether UMVU estimators are always unique. The answer is yes. This can be shown by contradiction: if two distinct UMVU estimators existed, their arithmetic mean would also be unbiased, and its variance would be no larger than their common (minimum) variance, with equality only if the two estimators coincide almost surely.
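
As a sketch of this argument (filling in the calculation, with [math]v[/math] denoting the common minimum variance of the two candidates [math]\widehat{\theta}_{1}[/math] and [math]\widehat{\theta}_{2}[/math]):

[math]\begin{aligned} Var_{\theta}\left(\frac{\widehat{\theta}_{1}+\widehat{\theta}_{2}}{2}\right) & =\frac{1}{4}\left(Var_{\theta}\left(\widehat{\theta}_{1}\right)+Var_{\theta}\left(\widehat{\theta}_{2}\right)+2Cov_{\theta}\left(\widehat{\theta}_{1},\widehat{\theta}_{2}\right)\right)\\ & =\frac{v+Cov_{\theta}\left(\widehat{\theta}_{1},\widehat{\theta}_{2}\right)}{2}\leq v,\end{aligned}[/math]

since the Cauchy-Schwarz inequality gives [math]Cov_{\theta}\left(\widehat{\theta}_{1},\widehat{\theta}_{2}\right)\leq v[/math], with equality only when [math]\widehat{\theta}_{1}=\widehat{\theta}_{2}[/math] almost surely.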

The Lehmann-Scheffé Theorem shows that Rao-Blackwellizing an unbiased estimator based on a sufficient and complete statistic provides the UMVU.

The intuition is as follows: if Rao-Blackwellization based on a given type of statistic always yields the same estimator, then that estimator must be the UMVU, because any other UMVU candidate could be Rao-Blackwellized into this same estimator, whose variance is no higher than the candidate's.


Complete Statistic

A statistic [math]T[/math] is complete for [math]\theta\in\Theta[/math] if for any (measurable) function [math]g\left(\cdot\right)[/math],

if

[math]E_{\theta}\left(g\left(T\right)\right)=0,\,\forall\theta\in\Theta[/math]

then

[math]P_{\theta}\left(g\left(T\right)=0\right)=1,\,\forall\theta\in\Theta[/math]

where

[math]P_{\theta}\left(\cdot\right)[/math] is a probability function parameterized by [math]\theta[/math]. In other words, if [math]T[/math] is complete, then the expectation of [math]g\left(T\right)[/math] equals zero for all [math]\theta[/math] only if [math]g\left(T\right)=0[/math] almost everywhere.
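
To see what completeness rules out, consider a standard non-example (not worked in class): take [math]X\sim U\left(-\theta,\theta\right)[/math] with [math]\theta\gt 0[/math], [math]T=X[/math], and [math]g\left(t\right)=t[/math]. Then [math]E_{\theta}\left(g\left(T\right)\right)=E_{\theta}\left(X\right)=0,\,\forall\theta\gt 0[/math], yet [math]P_{\theta}\left(g\left(T\right)=0\right)=P_{\theta}\left(X=0\right)=0\neq1[/math], so [math]T=X[/math] is not complete for [math]\theta[/math].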

The intuition of how the Lehmann-Scheffé theorem produces the UMVU through Rao-Blackwellization is the following: Suppose we have two unbiased estimators, [math]w_{1}\left(T\right)[/math] and [math]w_{2}\left(T\right)[/math], obtained through Rao-Blackwellization via a sufficient and complete statistic [math]T[/math]. Because both estimators are unbiased, and because they are both based on the complete statistic [math]T[/math], they must be the same.

Formally, [math]E_{\theta}\left(\underset{g\left(T\right)}{\underbrace{w_{1}\left(T\right)-w_{2}\left(T\right)}}\right)=0,\,\forall\theta\in\Theta[/math], and completeness then gives [math]P_{\theta}\left(g\left(T\right)=0\right)=1,\,\forall\theta\in\Theta[/math]. So, completeness of [math]T[/math] implies that two unbiased estimators [math]w_{1}\left(T\right)[/math] and [math]w_{2}\left(T\right)[/math] are actually the same estimator (almost surely).

In other words, there exists only one unbiased estimator that is based on a complete statistic.

While complete statistics may fail to exist, we have learned that when they do exist, they can be used through Rao-Blackwellization to produce the UMVU.

Example: Uniform

Suppose [math]X_{i}\overset{iid}{\sim}U\left(0,\theta\right)[/math] where [math]\theta[/math] is unknown. We have shown that [math]X_{\left(n\right)}[/math] is a sufficient statistic. Now we show that it is complete, by calculating

[math]E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)[/math]

and showing that if it equals zero for all values of [math]\theta\in\Theta[/math], then

[math]P_{\theta}\left(g\left(X_{\left(n\right)}\right)=0\right)=1.[/math]

We proceed by deriving the pdf of [math]X_{\left(n\right)}[/math], [math]f_{X_{\left(n\right)}}\left(x\right)=\frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1},\,0\leq x\leq\theta[/math], and calculating [math]E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right):[/math]

[math]\begin{aligned} E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right) & =0\\ \Leftrightarrow\int_{0}^{\theta}g\left(x\right)f_{X_{\left(n\right)}}\left(x\right)dx & =0\\ \Leftrightarrow\int_{0}^{\theta}g\left(x\right)\frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}dx & =0\end{aligned}[/math]

It is not clear whether the integral above implies that [math]g\left(x\right)=0[/math] almost everywhere. For one, if [math]g\left(x\right)[/math] is allowed to be positive and negative in different regions, the areas under the curve could offset to yield an integral equal to zero. If such a nonzero [math]g[/math] existed, the statistic would not be complete.

A typical approach to tackle this problem is to differentiate both sides of the equation above:

[math]\begin{aligned} & E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)=0,\,\forall\theta\in\Theta\\ \Rightarrow & \frac{d}{d\theta}E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)=\frac{d}{d\theta}0,\,\forall\theta\in\Theta\\ \Leftrightarrow & \frac{d}{d\theta}E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)=0,\,\forall\theta\in\Theta\end{aligned}[/math]

Why can we differentiate both sides? This is clearly not always valid. For example, consider the equation [math]x^{2}=5.[/math] Taking derivatives on both sides would yield [math]2x=0,[/math] which is not consistent with the initial equation.

The reason we can differentiate both sides of [math]E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)=0,\,\forall\theta\in\Theta[/math] is that the identity holds for every value of [math]\theta[/math]. This type of equation is called a functional equation: it holds over the entire domain of the function [math]\theta\mapsto E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)[/math].

Continuing the derivation, we obtain:

[math]\begin{aligned} & \frac{d}{d\theta}E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)=0\\ \Leftrightarrow & \frac{d}{d\theta}\left(\int_{0}^{\theta}g\left(x\right)\frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}dx\right)=0\\ \Leftrightarrow & \frac{d}{d\theta}\left[\left(\frac{n}{\theta^{n}}\right)\int_{0}^{\theta}g\left(x\right)x^{n-1}dx\right]=0\end{aligned}[/math]

Now, since both the factor [math]\frac{n}{\theta^{n}}[/math] and the integral (through its upper limit) depend on [math]\theta[/math], the product and Leibniz rules give

[math]\begin{aligned} & -\frac{n^{2}}{\theta^{n+1}}\int_{0}^{\theta}g\left(x\right)x^{n-1}dx+\frac{n}{\theta^{n}}g\left(\theta\right)\theta^{n-1}=0\\ \Leftrightarrow & -\frac{n}{\theta}\underset{=E_{\theta}\left(g\left(X_{\left(n\right)}\right)\right)=0}{\underbrace{\int_{0}^{\theta}g\left(x\right)\frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}dx}}+\frac{n}{\theta}g\left(\theta\right)=0\\ \Leftrightarrow & \frac{n}{\theta}g\left(\theta\right)=0,\,\forall\theta\in\Theta\\ \Leftrightarrow & g\left(\theta\right)=0,\,\forall\theta\in\Theta\end{aligned}[/math]

The result above holds for all values [math]\theta\gt 0[/math]. Hence, we have shown that a zero expectation for all [math]\theta[/math] forces [math]g\left(\theta\right)=0[/math] on the whole parameter space, i.e., [math]g\left(X_{\left(n\right)}\right)=0[/math] almost surely, implying that [math]X_{\left(n\right)}[/math] is complete.
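
To see the Lehmann-Scheffé machinery at work in this example, here is a small Monte Carlo sketch (not part of the lecture; the sample size and true [math]\theta[/math] are arbitrary choices). It compares the crude unbiased estimator [math]2\bar{X}[/math] with its Rao-Blackwellization based on [math]X_{\left(n\right)}[/math], which works out to [math]\frac{n+1}{n}X_{\left(n\right)}[/math]: both should be (approximately) unbiased, but the latter should have the smaller variance.

 import numpy as np
 rng = np.random.default_rng(0)
 theta, n, reps = 2.0, 10, 200_000             # arbitrary true parameter, sample size, replications
 x = rng.uniform(0.0, theta, size=(reps, n))   # reps independent samples of size n from U(0, theta)
 est_crude = 2.0 * x.mean(axis=1)              # unbiased but not based on the sufficient statistic
 est_umvu = (n + 1) / n * x.max(axis=1)        # Rao-Blackwellized estimator, E(2*Xbar | X_(n))
 for name, est in [("2*Xbar", est_crude), ("(n+1)/n*X_(n)", est_umvu)]:
     print(f"{name:>14}: mean = {est.mean():.4f}, variance = {est.var():.5f}")
 # Benchmarks: Var(2*Xbar) = theta^2/(3n),  Var((n+1)/n * X_(n)) = theta^2/(n(n+2))
 print("theory:", theta**2 / (3 * n), theta**2 / (n * (n + 2)))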

Exponential Family

It turns out that the exponential family provides a direct way to derive sufficient and complete statistics.

Recall: A family [math]\left\{ f\left(\left.\cdot\right|\theta\right):\theta\in\Theta\right\}[/math] of pmfs/pdfs is an exponential family if [math]f\left(\left.x\right|\theta\right)=h\left(x\right)c\left(\theta\right)\exp\left\{ \sum_{i=1}^{K}\omega_{i}\left(\theta\right)t_{i}\left(x\right)\right\} ,\,x\in\mathbb{R},\,\theta\in\Theta[/math]

The following holds:

  • [math]T\left(X_{1},...,X_{n}\right)=\left(\sum_{i=1}^{n}t_{1}\left(X_{i}\right),...,\sum_{i=1}^{n}t_{K}\left(X_{i}\right)\right)[/math] is sufficient for [math]\theta[/math].
  • [math]T[/math] is complete if [math]\left\{ \left(\omega_{1}\left(\theta\right),...,\omega_{K}\left(\theta\right)\right)^{'}:\theta\in\Theta\right\}[/math] contains an open set in [math]\mathbb{R}^{K}[/math].

The first condition is simple: the statistic formed by summing each [math]t_{j}\left(\cdot\right)[/math] over the observations [math]X_{i}[/math] is sufficient for the parameter vector [math]\theta[/math].

The second condition requires that the set of values taken by [math]\left(\omega_{1}\left(\theta\right),...,\omega_{K}\left(\theta\right)\right)^{'}[/math] as [math]\theta[/math] ranges over [math]\Theta[/math] contains an open set in [math]\mathbb{R}^{K}[/math].

For example, if these turned out to be [math]\left(\mu,\mu^{2}\right)^{'}[/math] for a normal distribution, then even though [math]\mu\in\left(-\infty,\infty\right)[/math], the set [math]\left\{ \left(\mu,\mu^{2}\right)^{'}:\mu\in\mathbb{R}\right\}[/math] is a curve that does not contain an open set in [math]\mathbb{R}^{2}[/math], so the criterion does not guarantee that [math]T[/math] is complete.
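
Conversely, a standard case where the criterion does apply (worked here for illustration, not taken from the lecture): for [math]X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)[/math] with both parameters unknown,

[math]f\left(\left.x\right|\mu,\sigma^{2}\right)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{ -\frac{\mu^{2}}{2\sigma^{2}}\right\} \exp\left\{ \frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}x^{2}\right\}[/math]

so that [math]\left(\omega_{1},\omega_{2}\right)^{'}=\left(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}}\right)^{'}[/math], whose range [math]\mathbb{R}\times\left(-\infty,0\right)[/math] is an open set in [math]\mathbb{R}^{2}[/math]; hence [math]T=\left(\sum_{i=1}^{n}X_{i},\sum_{i=1}^{n}X_{i}^{2}\right)[/math] is sufficient and complete.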

Example: Bernoulli

Suppose [math]X_{i}\overset{iid}{\sim}Ber\left(p\right)[/math], where

[math]p\in\left(0,1\right)[/math] is unknown.

Its marginal pmf is

[math]f\left(\left.x\right|p\right)=p^{x}\left(1-p\right)^{1-x}1\left(x\in\left\{ 0,1\right\} \right)[/math]

such that

[math]h\left(x\right)=1\left(x\in\left\{ 0,1\right\} \right);c\left(p\right)=1-p;\omega\left(p\right)=\log\left(\frac{p}{1-p}\right);t\left(x\right)=x[/math]

From the first property, we have

[math]\Sigma_{i=1}^{n}t\left(X_{i}\right)=\Sigma_{i=1}^{n}X_{i}[/math] is a sufficient statistic (i.e., the total number of 1s).

For completeness, notice that we have

[math]\left\{ \omega\left(p\right):p\in\left(0,1\right)\right\} =\left\{ \log\left(\frac{p}{1-p}\right):p\in\left(0,1\right)\right\} =\left\{ \log\left(r\right):r\in\left(0,+\infty\right)\right\} =\left(-\infty,+\infty\right)[/math],

which contains an open interval in [math]\mathbb{R}[/math].

Hence, [math]\Sigma_{i=1}^{n}x_{i}[/math] is complete.

Finally, since [math]\frac{\sum_{i=1}^{n}X_{i}}{n}[/math] is unbiased and a function of the complete sufficient statistic, the Lehmann-Scheffé Theorem implies that [math]\widehat{p}_{ML}=\widehat{p}_{MM}=\frac{\sum_{i=1}^{n}X_{i}}{n}[/math] is UMVU.
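
Equivalently, the same estimator arises by Rao-Blackwellizing a crude unbiased estimator such as [math]\widehat{p}=X_{1}[/math] (a quick check, filling in a step not worked above): by symmetry of the [math]X_{i}[/math] given their sum,

[math]E\left(\left.X_{1}\right|T=t\right)=\frac{1}{n}\sum_{i=1}^{n}E\left(\left.X_{i}\right|T=t\right)=\frac{1}{n}E\left(\left.\sum_{i=1}^{n}X_{i}\right|T=t\right)=\frac{t}{n},[/math]

so conditioning on [math]T=\sum_{i=1}^{n}X_{i}[/math] recovers the sample mean.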


Cramér-Rao Lower Bound (CRLB)

It is possible to provide a meaningful lower bound on the variance of an estimator. (An example of a meaningless bound is zero.)

Let [math]X_{1},...,X_{n}[/math] be a random sample from a distribution with marginal pmf/pdf (i.e., that of a single observation) [math]f\left(\left.\cdot\right|\theta\right)[/math].

Under some regularity conditions (the estimator has finite variance and differentiation under the integral sign is allowed),

[math]Var_{\theta}\left(\widehat{\theta}\right)\geq\frac{\left(\frac{d}{d\theta}E_{\theta}\left(\widehat{\theta}\right)\right)^{2}}{nE_{\theta}\left[\left(\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right)^{2}\right]}=\frac{\left(\frac{d}{d\theta}E_{\theta}\left(\widehat{\theta}\right)\right)^{2}}{nVar_{\theta}\left[\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right]},\,\forall\theta\in\Theta.[/math]

This is the scalar version; an analogous multivariate version exists as well.

Notice that when [math]\widehat{\theta}[/math] is unbiased, [math]\frac{d}{d\theta}E_{\theta}\left(\widehat{\theta}\right)=\frac{d}{d\theta}\theta=1[/math], and we obtain the simplified inequality

[math]Var_{\theta}\left(\widehat{\theta}\right)\geq\frac{1}{nE_{\theta}\left[\left(\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right)^{2}\right]}=\frac{1}{nVar_{\theta}\left[\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right]},\,\forall\theta\in\Theta.[/math]

This result presents a few striking features.

First, the log-likelihood of a single observation shows up in the denominator.

Second, it is not evaluated at a particular data point [math]x_{i}[/math]. Rather, the expectation over [math]X_{i}[/math] of its squared derivative w.r.t. [math]\theta[/math] is taken.

Third, the last equality follows from a result which we only sketch here:

[math]Var_{\theta}\left[\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right]=E_{\theta}\left[\left(\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right)^{2}\right][/math].

This is a special property of the log-likelihood function, which follows because [math]E_{\theta}\left[\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right]=0[/math].
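
A quick justification of the zero-mean property (under the same regularity conditions that allow differentiating under the integral sign): since [math]\int f\left(\left.x\right|\theta\right)dx=1[/math] for all [math]\theta[/math],

[math]0=\frac{d}{d\theta}\int f\left(\left.x\right|\theta\right)dx=\int\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)dx=\int\left(\frac{\partial}{\partial\theta}\log\,f\left(\left.x\right|\theta\right)\right)f\left(\left.x\right|\theta\right)dx=E_{\theta}\left[\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right],[/math]

and a random variable with mean zero has variance equal to its second moment.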

Fisher Information

We denote the denominator, [math]I\left(\theta\right)=nE_{\theta}\left[\left(\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right)^{2}\right][/math], as the Fisher information of the sample; its reciprocal is the CRLB, the lower bound on the variance of unbiased estimators.
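
As a concrete check (worked here for illustration, anticipating the first case below): for [math]X_{i}\overset{iid}{\sim}Ber\left(p\right)[/math],

[math]\frac{\partial}{\partial p}\log\,f\left(\left.X_{i}\right|p\right)=\frac{\partial}{\partial p}\left[X_{i}\log\,p+\left(1-X_{i}\right)\log\left(1-p\right)\right]=\frac{X_{i}-p}{p\left(1-p\right)},[/math]

so [math]E_{p}\left[\left(\frac{\partial}{\partial p}\log\,f\left(\left.X_{i}\right|p\right)\right)^{2}\right]=\frac{Var_{p}\left(X_{i}\right)}{p^{2}\left(1-p\right)^{2}}=\frac{1}{p\left(1-p\right)}[/math], and the CRLB equals [math]\frac{p\left(1-p\right)}{n}=Var_{p}\left(\frac{\sum_{i=1}^{n}X_{i}}{n}\right)[/math]: the sample mean attains the bound.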

CRLB: Possible Cases

The CRLB is a weak bound, in the sense that an UMVU may fail to reach it.

Three possible cases can occur:

  • The CRLB is applicable and attainable:
    • Estimating [math]p[/math] when [math]X_{i}\sim Ber\left(p\right)[/math]
    • Estimating [math]\mu[/math] when [math]X_{i}\sim N\left(\mu,\sigma^{2}\right)[/math] with [math]\sigma^{2}[/math] known.
  • The CRLB is applicable, but not attainable:
    • Estimating [math]\sigma^{2}[/math] when [math]X_{i}\sim N\left(\mu,\sigma^{2}\right)[/math]: the UMVU is [math]\widehat{\sigma^{2}}=s^{2}[/math], yet [math]Var_{\sigma^{2}}\left(s^{2}\right)=\frac{2\sigma^{4}}{n-1}\gt \frac{2\sigma^{4}}{n}[/math], the latter being the CRLB.
  • The CRLB is not applicable:
    • Estimating [math]\theta[/math] when [math]X_{i}\sim U\left(0,\theta\right)[/math]: the regularity conditions fail because the support depends on [math]\theta[/math]. Here [math]Var_{\theta}\left(\widehat{\theta}_{UMVU}\right)=\frac{1}{n\left(n+2\right)}\theta^{2}[/math], while a naive CRLB calculation gives [math]\frac{\theta^{2}}{n}[/math] (or even [math]\infty[/math], depending on how the density is differentiated), which the UMVU beats, as shown in the calculation below.
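
To see where the naive [math]\frac{\theta^{2}}{n}[/math] figure comes from (a quick calculation for illustration; it is not a valid bound here): on the support, [math]f\left(\left.x\right|\theta\right)=\frac{1}{\theta}[/math], so

[math]\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)=-\frac{1}{\theta},\quad nE_{\theta}\left[\left(\frac{\partial}{\partial\theta}\log\,f\left(\left.X_{i}\right|\theta\right)\right)^{2}\right]=\frac{n}{\theta^{2}},[/math]

giving a nominal bound of [math]\frac{\theta^{2}}{n}\gt \frac{\theta^{2}}{n\left(n+2\right)}=Var_{\theta}\left(\widehat{\theta}_{UMVU}\right)[/math]. The contradiction is resolved by noting that the support of [math]U\left(0,\theta\right)[/math] depends on [math]\theta[/math], so differentiation under the integral sign is not permitted and the CRLB does not apply.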