Lecture 7. B) Statistics

From Significant Statistics

Statistics

Let [math]X_{1},\ldots,X_{n}[/math] be a random sample and let [math]T:\mathbb{R}^{n}\rightarrow\mathbb{R}^{k}[/math] be a function (for some [math]k\geq 1[/math]).

The random variable [math]Y=T\left(X_{1},\ldots,X_{n}\right)[/math] is called a statistic, and its distribution is called the sampling distribution of [math]Y[/math].

Some Examples

  • The sample mean is [math]\overline{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}[/math].
  • The sample variance is [math]s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}[/math].
  • The sample standard deviation is [math]s=\sqrt{s^{2}}[/math].
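The three formulas above translate directly into code. The following is a minimal sketch in Python using only the standard library; the function names and the small data set are illustrative choices, not part of the lecture.

```python
import math

def sample_mean(xs):
    """Sample mean: (1/n) * sum of the observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with the 1/(n-1) factor."""
    n = len(xs)
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

# Hypothetical data, chosen only so the arithmetic is easy to check by hand.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(sample_mean(data))                 # 5.0
print(round(sample_variance(data), 4))   # 4.5714 (that is, 32/7)
print(sample_std(data))
```

Here the squared deviations from the mean sum to 32, and dividing by n - 1 = 7 gives the sample variance.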

Notice that each of the statistics above is a random variable. Each random sample of [math]X[/math]s will yield a slightly different sample mean, sample variance, etc.
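To see this randomness concretely, one can draw a few random samples and evaluate the same statistic on each. The sketch below assumes standard normal draws and a sample size of 20, both arbitrary choices made for illustration; the function `T` is a hypothetical statistic with k = 2 that returns the pair (sample mean, sample variance).

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible

def T(xs):
    """A statistic with k = 2: (sample mean, sample variance)."""
    # statistics.variance uses the 1/(n-1) factor.
    return statistics.mean(xs), statistics.variance(xs)

# Five independent random samples of size n = 20 from a standard normal
# distribution (the distribution and sizes are illustrative assumptions).
n = 20
results = [T([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(5)]

for xbar, s2 in results:
    print(f"sample mean = {xbar:+.3f}, sample variance = {s2:.3f}")
```

Each printed pair is one realization of the statistic; repeating this many times would trace out its sampling distribution.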

At this point you may be wondering about the [math]\frac{1}{n-1}[/math] factor in the formula for the sample variance. We will explain that shortly.

The statistics above are random variables in their own right. They too have moments. Here are a few:

Expected Sample Mean

  • [math]E\left(\overline{X}\right)=E\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)=\frac{1}{n}\sum_{i=1}^{n}\underset{=\mu}{\underbrace{E\left(X_{i}\right)}}=\frac{n\mu}{n}=\mu.[/math]

Variance of the Sample Mean

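  • [math]Var\left(\overline{X}\right)=Var\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)=\frac{1}{n^{2}}Var\left(\sum_{i=1}^{n}X_{i}\right)=\frac{1}{n^{2}}\sum_{i=1}^{n}Var\left(X_{i}\right)=\frac{n\sigma^{2}}{n^{2}}=\frac{\sigma^{2}}{n}[/math]. The third equality uses the independence of the [math]X_{i}[/math]: the variance of a sum of independent random variables is the sum of their variances.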

The variance result is interesting and intuitive: as we increase the sample size, the variance of the sample mean decreases. For example, suppose you repeatedly take 100 draws of [math]X_{i}[/math] and calculate the mean each time. (In Excel, each column would contain 100 draws of [math]X_{i}[/math], and a final row would hold the mean of each column.) The variance of these means decreases with the number of draws per column (in our case, 100). If we increased the number of draws to 1,000,000, the column means would probably be very similar to one another, and so their variance would be far smaller.

The result on [math]Var\left(\overline{X}\right)[/math] tells us the specific rate at which the variance of the mean decreases with [math]n[/math].
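The two moment results for the sample mean can be checked by simulation, in the spirit of the spreadsheet experiment described above. This is a rough sketch, assuming normal draws with [math]\mu=1[/math] and [math]\sigma=2[/math]; the helper `simulate_means`, the sample sizes, and the number of repetitions are all illustrative assumptions.

```python
import random
import statistics

def simulate_means(n, reps, mu, sigma, seed):
    """Return `reps` sample means, each computed from n independent draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]

mu, sigma, reps = 1.0, 2.0, 2000
summary = {}
for n in (4, 16, 64):
    means = simulate_means(n, reps, mu, sigma, seed=n)
    avg_of_means = statistics.fmean(means)    # should be near mu
    var_of_means = statistics.variance(means)  # should be near sigma^2 / n
    summary[n] = (avg_of_means, var_of_means, sigma ** 2 / n)
    print(f"n={n:3d}  mean of means={avg_of_means:.3f}  "
          f"variance of means={var_of_means:.4f}  sigma^2/n={sigma**2 / n:.4f}")
```

The simulated variance of the means should shrink roughly by a factor of four each time n is quadrupled, matching the [math]\frac{\sigma^{2}}{n}[/math] rate, while the average of the means stays close to [math]\mu[/math].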

Consider also the following well-known result, which we derive in detail:

Expectation of [math]s^{2}[/math]

  • [math]E\left(s^{2}\right)=E\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}\right][/math]

[math]=\frac{1}{n-1}E\left[\sum_{i=1}^{n}\left(X_{i}^{2}+\overline{X}^{2}-2X_{i}\overline{X}\right)\right][/math]

[math]=\frac{1}{n-1}E\left[n\overline{X}^{2}+\sum_{i=1}^{n}\left(X_{i}^{2}\right)-2\overline{X}\sum_{i=1}^{n}\left(\frac{n}{n}X_{i}\right)\right][/math]

[math]=\frac{1}{n-1}E\left[n\overline{X}^{2}+\sum_{i=1}^{n}\left(X_{i}^{2}\right)-2n\overline{X}^{2}\right][/math]

[math]=\frac{1}{n-1}E\left[\sum_{i=1}^{n}\left(X_{i}^{2}\right)-n\overline{X}^{2}\right][/math]

[math]=\frac{1}{n-1}\left(nE\left(X_{i}^{2}\right)-nE\left(\overline{X}^{2}\right)\right)[/math]

[math]=\frac{1}{n-1}\left(n\left(\mu^{2}+\sigma^{2}\right)-n\left(\mu^{2}+\frac{\sigma^{2}}{n}\right)\right)[/math]

[math]=\frac{n\sigma^{2}-\sigma^{2}}{n-1}=\sigma^{2}.[/math]

We have used the fact that [math]E\left(X_{i}^{2}\right)=Var\left(X_{i}\right)+E\left(X_{i}\right)^{2}[/math] and [math]E\left(\overline{X}^{2}\right)=Var\left(\overline{X}\right)+E\left(\overline{X}\right)^{2}[/math].

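It may be surprising that [math]E\left(s^{2}\right)=\sigma^{2}[/math], given that the denominator of [math]s^{2}[/math] is [math]n-1[/math] rather than [math]n[/math]. The reason this works is that the draws of [math]X_{i}[/math] tend to be closer to their own average [math]\left(\overline{X}\right)[/math] than to the true population mean, [math]E\left(X\right)=\mu[/math]. Indeed, [math]\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}=\sum_{i=1}^{n}\left(X_{i}-\mu\right)^{2}-n\left(\overline{X}-\mu\right)^{2}[/math], so the squared deviations around [math]\overline{X}[/math] never sum to more than those around [math]\mu[/math]. As a result, we need a smaller denominator than we would if we knew the true population mean.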
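A short simulation can make the role of the [math]n-1[/math] denominator visible. The sketch below assumes normal draws with [math]\sigma^{2}=4[/math] and a deliberately small sample size of 5; these values, the helper `sum_sq_dev`, and the variable names are illustrative assumptions. It averages the sum of squared deviations over many samples and divides by [math]n-1[/math] and by [math]n[/math] in turn.

```python
import random

def sum_sq_dev(xs):
    """Sum of squared deviations of xs from their own sample mean."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

rng = random.Random(1)  # fixed seed so the run is reproducible
n, reps, sigma = 5, 20000, 2.0  # true variance is sigma^2 = 4

total = 0.0
for _ in range(reps):
    xs = [rng.gauss(0.0, sigma) for _ in range(n)]
    total += sum_sq_dev(xs)

avg_unbiased = total / (reps * (n - 1))  # average of s^2 (denominator n-1)
avg_biased = total / (reps * n)          # same sums, denominator n

print(f"average with n-1: {avg_unbiased:.3f}  (theory: {sigma**2:.3f})")
print(f"average with n:   {avg_biased:.3f}  (theory: {sigma**2 * (n - 1) / n:.3f})")
```

Dividing by [math]n[/math] gives an average near [math]\frac{n-1}{n}\sigma^{2}[/math], which underestimates [math]\sigma^{2}[/math]; dividing by [math]n-1[/math] removes that shortfall.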