Lecture 9. D) Sufficient Statistics
Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta[/math] is unknown.
A statistic [math]T=T\left(X_{1}..X_{n}\right)[/math] is a sufficient statistic for parameter [math]\theta[/math] if the conditional pmf/pdf of [math]\left(X_{1}..X_{n}\right)[/math] given [math]T[/math] does not depend on [math]\theta[/math], i.e.,
[math]f\left(\left.X\right|\theta,T\right)=f\left(\left.X\right|T\right).[/math]
The reason we are interested in sufficient statistics will be clear once we present the Rao-Blackwell theorem. However, it is worth thinking a bit about the meaning of sufficient statistics first.
A sufficient statistic summarizes the portion of the data that is needed to compute the maximum likelihood estimator of [math]\theta[/math]: the MLE depends on the sample only through a sufficient statistic.
Intuitive Example: Uniform
Suppose [math]X_{1}..X_{n}[/math] is a random sample from the Uniform[math]\left(0,\theta\right)[/math] distribution. In order to estimate [math]\theta[/math] via maximum likelihood, one writes down the likelihood function of the sample:
[math]f_{X_{1}..X_{n}}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(x_{1}\leq\theta\wedge x_{2}\leq\theta\wedge...\wedge x_{n}\leq\theta\right).[/math]
We can rewrite the pdf of the sample as:
[math]f_{X_{1}..X_{n}}\left(\left.x\right|\theta\right)=\frac{1}{\theta^{n}}1\left(\max_{i=1..n}x_{i}\leq\theta\right).[/math]
From the expression above, it is clear that the likelihood, and therefore the MLE, depends on the sample only through the maximum observation and not on the whole sample.
Hence, [math]\max_{i=1..n}X_{i}[/math] is a sufficient statistic for [math]\theta[/math].
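To see this numerically, here is a minimal simulation sketch (the sample size, the true value of [math]\theta[/math], and the candidate values of [math]\theta[/math] are arbitrary illustrative choices) showing that the likelihood of a Uniform[math]\left(0,\theta\right)[/math] sample can be evaluated from the sample maximum alone:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0                       # arbitrary "true" parameter for the simulation
x = rng.uniform(0.0, theta_true, 50)   # random sample of size n = 50

def likelihood(theta, sample):
    """Uniform(0, theta) likelihood: (1/theta^n) * 1(all observations <= theta)."""
    n = len(sample)
    return (1.0 / theta ** n) * float(np.max(sample) <= theta)

# Evaluating the likelihood at any theta only requires max(x), not the full sample:
t = np.max(x)
for theta in [2.5, 3.0, 3.5]:
    print(theta, likelihood(theta, x), (1.0 / theta ** len(x)) * float(t <= theta))

# The MLE is the smallest theta with positive likelihood, i.e. the sample maximum:
print("MLE =", t)
```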
Remark: What does [math]f\left(\left.X\right|\theta,T\right)[/math] mean?
At this point, you have worked with conditional pdfs. For example, you may have worked with a pdf [math]f_{X|Y}[/math], where [math]X[/math] and [math]Y[/math] are random variables.
However, the expression [math]f\left(\left.X\right|\theta,T\right)[/math] means something slightly different. The reason is that [math]T=T\left(X_{1}..X_{n}\right)[/math] is a function of the random variables [math]X_{1}..X_{n},[/math] i.e., it's a function of the sample.
The simplest way to see this is to consider the following joint pmf:
|  | [math]X_{1}=0[/math] | [math]X_{1}=1[/math] | [math]X_{1}=2[/math] |
|---|---|---|---|
| [math]X_{2}=0[/math] | 5% | 7% | 8% |
| [math]X_{2}=1[/math] | 20% | 5% | 5% |
| [math]X_{2}=2[/math] | 15% | 25% | 10% |
and suppose we would like to calculate the pmf
[math]f_{X_{1},X_{2}|X_{1}+X_{2}=1}.[/math]
This pmf is obtained by dividing the probability of each combination of [math]X_{1},X_{2}[/math] with [math]X_{1}+X_{2}=1[/math] by [math]P\left(X_{1}+X_{2}=1\right)[/math], while the remaining entries are set to zero. This yields the conditional pmf:
|  | [math]X_{1}=0[/math] | [math]X_{1}=1[/math] | [math]X_{1}=2[/math] |
|---|---|---|---|
| [math]X_{2}=0[/math] | 0 | 26% | 0 |
| [math]X_{2}=1[/math] | 74% | 0 | 0 |
| [math]X_{2}=2[/math] | 0 | 0 | 0 |
where [math]7\%\div\left(20\%+7\%\right)\approx26\%[/math] and [math]20\%\div\left(20\%+7\%\right)\approx74\%.[/math] In this example, [math]T\left(X_{1},X_{2}\right)=X_{1}+X_{2}.[/math]
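As a quick numerical check, the following sketch (the helper name conditional_given_T is just illustrative) reproduces this calculation by conditioning the joint pmf from the table on the event [math]X_{1}+X_{2}=1[/math]:

```python
# Joint pmf of (X1, X2) from the table above: joint[(x1, x2)] = P(X1 = x1, X2 = x2).
joint = {
    (0, 0): 0.05, (1, 0): 0.07, (2, 0): 0.08,
    (0, 1): 0.20, (1, 1): 0.05, (2, 1): 0.05,
    (0, 2): 0.15, (1, 2): 0.25, (2, 2): 0.10,
}

def conditional_given_T(joint_pmf, T, t):
    """Conditional pmf of (X1, X2) given the event T(X1, X2) = t."""
    p_t = sum(p for x, p in joint_pmf.items() if T(*x) == t)  # P(T = t)
    return {x: (p / p_t if T(*x) == t else 0.0) for x, p in joint_pmf.items()}

cond = conditional_given_T(joint, lambda x1, x2: x1 + x2, 1)
print(round(cond[(1, 0)], 2))  # 0.07 / 0.27 ≈ 0.26
print(round(cond[(0, 1)], 2))  # 0.20 / 0.27 ≈ 0.74
```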
It follows that
[math]f_{X|T}\left(x|t\right)=\frac{f_{X}\left(x\right)}{f_{T}\left(t\right)}\cdot1\left(T\left(x\right)=t\right).[/math]
This result is valid for both pmfs and pdfs.
This result looks a bit different from the one we are used to when we condition [math]X[/math] on a separate random variable [math]Y[/math], for example. The reason is that here we are not conditioning [math]X[/math] on a different random variable. Rather, we are interested in the pdf of a random vector [math]X[/math], conditional on it satisfying the equality [math]T\left(X\right)=t.[/math]
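To see how this formula connects back to the uniform example, take [math]T=\max_{i=1..n}X_{i}[/math]. Using the standard fact that the pdf of the maximum of a Uniform[math]\left(0,\theta\right)[/math] sample is [math]f_{T}\left(t\right)=\frac{nt^{n-1}}{\theta^{n}}[/math] for [math]0\leq t\leq\theta[/math], the formula gives, for any [math]x[/math] with [math]\max_{i=1..n}x_{i}=t[/math],
[math]f_{X|T}\left(x|t\right)=\frac{f_{X}\left(x\right)}{f_{T}\left(t\right)}=\frac{1/\theta^{n}}{nt^{n-1}/\theta^{n}}=\frac{1}{nt^{n-1}},[/math]
which does not depend on [math]\theta[/math]. This confirms, directly from the definition, that the sample maximum is a sufficient statistic for [math]\theta[/math].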