Full Lecture 14

From Significant Statistics

Lecture 14

Convergence

In this lecture we will focus on extremely useful results that occur when we consider large sample sizes. In order to analyze these results, we need to introduce a few concepts related to convergence of sequences of random variables.

A sequence of random variables [math]X_{n}[/math] converges to a random variable [math]X[/math]:

  • in probability if [math]\lim_{n\rightarrow\infty}P\left(\left|X_{n}-X\right|\geq\varepsilon\right)=0,\,\forall\varepsilon\gt 0[/math]
  • almost surely if [math]P\left(\lim_{n\rightarrow\infty}\left|X_{n}-X\right|\gt \varepsilon\right)=0,\,\forall\varepsilon\gt 0[/math]
  • in quadratic mean if [math]\lim_{n\rightarrow\infty}E\left[\left(X_{n}-X\right)^{2}\right]=0[/math]

The concepts above also apply when the random variable [math]X[/math] is a constant; in this case, we often denote it by [math]\mu[/math].

Some convergence concepts are stronger than others. The following facts are useful:

  • [math]X_{n}\overset{q.m.}{\rightarrow}X\Rightarrow X_{n}\overset{p}{\rightarrow}X[/math]
  • [math]X_{n}\overset{a.s.}{\rightarrow}X\Rightarrow X_{n}\overset{p}{\rightarrow}X[/math]
  • Quadratic mean convergence does not imply, nor is it implied, by almost sure convergence

Example: Convergence in Probability vs. Almost Sure Convergence

Let

[math]X_{n}=\begin{cases} 1, & \text{with probability }\frac{1}{n}\\ 0, & \text{with probability }1-\frac{1}{n} \end{cases}[/math]

[math]Y_{n}=\begin{cases} 1, & \text{with probability }\frac{1}{n^{2}}\\ 0, & \text{with probability }1-\frac{1}{n^{2}} \end{cases}[/math]

Do these random sequences converge in probability and/or almost surely to zero?

We first check that both sequences converge in probability to zero. For [math]X_{n}[/math], we require

[math]\begin{aligned} & \lim_{n\rightarrow\infty}P\left(\left|X_{n}-0\right|\geq\varepsilon\right)=0\\ \Leftrightarrow & \lim_{n\rightarrow\infty}P\left(X_{n}\geq\varepsilon\right)=0\end{aligned}[/math]

If [math]\varepsilon\gt 1[/math], the condition above is always satisfied, since [math]X_{n}\in\left\{ 0,1\right\}[/math].

For [math]\varepsilon\in\left(0,1\right)[/math], we have [math]\begin{aligned} & \lim_{n\rightarrow\infty}P\left(X_{n}\geq\varepsilon\right)=0\\ \Leftrightarrow & \lim_{n\rightarrow\infty}P\left(X_{n}=1\right)=0\\ \Leftrightarrow & \lim_{n\rightarrow\infty}\frac{1}{n}=0\end{aligned}[/math]

which does indeed hold. The same method can be used to prove convergence in probability for [math]Y_{n}[/math].

Now, consider convergence almost surely.

For [math]X_{n}[/math], we require

[math]\begin{aligned} & P\left(\lim_{n\rightarrow\infty}\left|X_{n}-0\right|\gt \varepsilon\right)=0\\ \Leftrightarrow & P\left(\lim_{n\rightarrow\infty}X_{n}\gt \varepsilon\right)=0\\ \Rightarrow & P\left(\lim_{n\rightarrow\infty}X_{n}=1\right)=0\end{aligned}[/math]

where the last line follows from the fact that [math]X_{n}\in\left\{ 0,1\right\}[/math]: the condition can only be satisfied if the probability of [math]X_{n}[/math] equaling 1 vanishes as [math]n\rightarrow\infty[/math].

We will approach this problem indirectly.

Consider the sum, starting at a very high [math]n[/math], of the probability that [math]X_{i}=1[/math]: [math]\sum_{i=n}^{\infty}P\left(X_{i}=1\right)=\sum_{i=n}^{\infty}\frac{1}{i}[/math].

If this sum diverges, then even starting from a very high [math]n[/math], the event [math]X_{i}=1[/math] still occurs with non-negligible probability, so ones keep appearing infinitely often. If, on the other hand, the sum converges, then ones eventually stop appearing altogether (this dichotomy is made precise by the Borel–Cantelli lemmas, assuming the terms of the sequence are independent). Notice that

[math]\sum_{i=n}^{\infty}P\left(X_{i}=1\right)=\sum_{i=n}^{\infty}\frac{1}{i}=\infty[/math]

and

[math]\sum_{i=n}^{\infty}P\left(Y_{i}=1\right)=\sum_{i=n}^{\infty}\frac{1}{i^{2}}\lt \infty[/math]

(We don’t prove the results above here.)

While both sequences converge in probability to zero, only [math]Y_{n}[/math] converges almost surely. The reason is that the probability of observing [math]X_{n}=1[/math] decays so slowly that the sum of subsequent probabilities diverges, while the probability of observing [math]Y_{n}=1[/math] decays fast enough that the sum of subsequent probabilities converges.
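As a sanity check (a simulation, not a proof; the function and variable names are illustrative), we can draw both sequences in Python and count how often the value 1 still appears far out in the sequence:

```python
import random

random.seed(0)

def count_late_ones(prob_of_one, n_start, n_end):
    """Count how many indices n in [n_start, n_end) draw the value 1."""
    return sum(1 for n in range(n_start, n_end) if random.random() < prob_of_one(n))

# X_n = 1 with probability 1/n: ones keep arriving far out in the sequence,
# because the sum of 1/n diverges.
x_ones = count_late_ones(lambda n: 1.0 / n, 10, 1_000_000)

# Y_n = 1 with probability 1/n^2: past a certain point, ones essentially
# never occur, because the sum of 1/n^2 converges.
y_ones = count_late_ones(lambda n: 1.0 / n ** 2, 10, 1_000_000)

print(x_ones, y_ones)
```

Typically `x_ones` is around [math]\ln\left(10^{5}\right)\approx11.5[/math] (the expected number of ones over this range), while `y_ones` is almost always zero.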

Estimator Consistency

Because estimators are statistics, the convergence concepts above apply to them as well.

We say that an estimator [math]\widehat{\theta}[/math] is consistent if

[math]\widehat{\theta}\left(X_{1}..X_{n}\right)\overset{p}{\rightarrow}\theta,\forall\theta\in\Theta[/math],

where [math]X_{1}..X_{n}[/math] is a sequence of random variables (usually, data).

For example, under standard regularity conditions, the maximum-likelihood estimator is consistent, i.e., [math]\widehat{\theta}_{ML}\overset{p}{\rightarrow}\theta_{0}[/math].


Theorem: (Weak) Law of Large Numbers

Let [math]X_{1}..X_{n}[/math] be a sequence of uncorrelated random variables, where [math]E\left(X_{i}\right)=\mu[/math] and [math]Var\left(X_{i}\right)=\sigma^{2}\lt \infty[/math].

Then,

[math]\overline{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\overset{p}{\rightarrow}\mu[/math]

This result is probably not very surprising, and in a sense, we have been using it intuitively for the method-of-moments estimator.

The sample mean, as [math]n[/math] increases, converges in probability to the population mean [math]E\left(X_{i}\right)[/math].

The proof follows from Chebyshev’s inequality:

[math]Pr\left(\left|\overline{X}_{n}-\mu\right|\geq\varepsilon\right)\leq\frac{Var\left(\overline{X}_{n}\right)}{\varepsilon^{2}}=\frac{\sigma^{2}}{n\varepsilon^{2}}[/math] such that [math]\lim_{n\rightarrow\infty}\,Pr\left(\left|\overline{X}_{n}-\mu\right|\geq\varepsilon\right)\leq\lim_{n\rightarrow\infty}\,\frac{\sigma^{2}}{n\varepsilon^{2}}=0,\,\forall\varepsilon\gt 0.[/math]
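The weak law can be illustrated with a small simulation (a sketch, assuming Uniform(0, 1) draws; the names are illustrative): the Monte Carlo estimate of [math]P\left(\left|\overline{X}_{n}-\mu\right|\geq\varepsilon\right)[/math] shrinks as [math]n[/math] grows.

```python
import random

random.seed(1)

def sample_mean(n):
    # Sample mean of n draws from Uniform(0, 1); the population mean is 0.5.
    return sum(random.random() for _ in range(n)) / n

# Estimate P(|mean - mu| >= eps) by Monte Carlo for increasing n.
eps, reps = 0.05, 500
fracs = {}
for n in (10, 100, 1000):
    fracs[n] = sum(1 for _ in range(reps) if abs(sample_mean(n) - 0.5) >= eps) / reps
print(fracs)
```

The estimated probability drops toward zero as [math]n[/math] increases, as the theorem predicts.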

Notice that we did not restrict [math]X_{i}[/math] to be a random sample. If we are willing to impose that condition, we can prove a stronger result:

Strong Law of Large Numbers

Let [math]X_{1}..X_{n}[/math] be a random sample with mean [math]\mu[/math]. Then,

[math]\overline{X}_{n}\overset{a.s.}{\rightarrow}\mu[/math]

Note that in this case [math]\sigma^{2}[/math] need not exist (only the mean is required), so convergence in quadratic mean may not hold even though almost sure convergence does.


Convergence in Distribution

A sequence of random variables [math]X_{n}[/math] converges in distribution to a random variable [math]Y[/math] if, at all continuity points of [math]F_{Y}\left(y\right)[/math],

[math]\lim_{n\rightarrow\infty}\,F_{X_{n}}\left(y\right)=F_{Y}\left(y\right)[/math]

Convergence in distribution is very intuitive: As [math]n\rightarrow\infty[/math], the distributions of [math]X_{n}[/math] converge point-by-point to that of [math]Y[/math].

Example

The reason we require the limit condition to hold only at continuity points of [math]F_{Y}[/math] can be illustrated with an example (the full justification is a bit deeper, but we won’t concern ourselves with that).

Let [math]f_{X_{n}}\left(x\right)=\begin{cases} n, & 0\lt x\lt \frac{1}{n}\\ 0, & \text{otherwise} \end{cases}[/math]

Clearly, as [math]n\rightarrow\infty[/math], [math]X_{n}[/math] is approaching a mass point at zero, so we may want it to converge in distribution to [math]Y[/math], where [math]P\left(Y=0\right)=1[/math]. However, the value of [math]F_{X_{n}}\left(0\right)[/math] is zero for all [math]n[/math], and so [math]\lim_{n\rightarrow\infty}F_{X_{n}}\left(0\right)\neq F_{Y}\left(0\right)=1[/math].

By not requiring the limit condition to hold at the discontinuity point [math]y=0[/math], we do obtain convergence in distribution, i.e., [math]X_{n}\overset{d}{\rightarrow}Y[/math].
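A minimal numerical sketch of this example (the function name is illustrative): the CDF of [math]X_{n}[/math] converges to 1 at any continuity point [math]y\gt 0[/math], but stays at 0 at the discontinuity point [math]y=0[/math].

```python
def F_Xn(y, n):
    # CDF of X_n ~ Uniform(0, 1/n): density n on (0, 1/n), zero elsewhere.
    if y <= 0:
        return 0.0
    return min(n * y, 1.0)

# At a continuity point y > 0 of F_Y, F_{X_n}(y) -> F_Y(y) = 1:
print([F_Xn(0.01, n) for n in (10, 100, 1000)])   # -> [0.1, 1.0, 1.0]
# ...but at the discontinuity y = 0, F_{X_n}(0) = 0 for every n, while F_Y(0) = 1:
print([F_Xn(0.0, n) for n in (10, 100, 1000)])    # -> [0.0, 0.0, 0.0]
```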


Slutsky’s Theorem

Slutsky’s Theorem provides some nice results that apply to convergence in distribution:

If a sequence [math]X_{n}[/math] converges in distribution to [math]X[/math], and a sequence [math]Y_{n}[/math] converges in probability to a constant [math]c[/math], then

  • [math]X_{n}.Y_{n}\overset{d}{\rightarrow}c.X[/math]
  • [math]X_{n}+Y_{n}\overset{d}{\rightarrow}c+X[/math]
  • [math]\frac{X_{n}}{Y_{n}}\overset{d}{\rightarrow}\frac{X}{c}[/math] if [math]c\neq0[/math]

The results above also hold if [math]X_{n}[/math] converges in probability, in which case the conclusions apply to convergence in probability as well.
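Slutsky’s theorem can also be checked by simulation (a sketch with illustrative names; note that the theorem does not require [math]X_{n}[/math] and [math]Y_{n}[/math] to be independent):

```python
import random
import statistics

random.seed(2)

mu, sigma = 0.5, (1.0 / 12) ** 0.5  # mean and sd of Uniform(0, 1)

def slutsky_draw(n):
    u = [random.random() for _ in range(n)]
    xbar = sum(u) / n
    x_n = n ** 0.5 * (xbar - mu) / sigma  # converges in distribution to N(0, 1)
    y_n = xbar                            # converges in probability to c = 0.5
    # X_n and Y_n here are dependent, which Slutsky's theorem allows.
    return x_n + y_n                      # should be approximately 0.5 + N(0, 1)

samples = [slutsky_draw(1000) for _ in range(1000)]
print(round(statistics.mean(samples), 2), round(statistics.pstdev(samples), 2))
```

The sample mean and standard deviation of `samples` come out close to 0.5 and 1, matching [math]c+X[/math] with [math]c=0.5[/math] and [math]X\sim N\left(0,1\right)[/math].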


Central Limit Theorem

Let [math]X_{1}..X_{n}[/math] be a random sample with mean [math]\mu[/math], variance [math]\sigma^{2}[/math], and a moment generating function [math]M_{X}\left(t\right)[/math] that exists and is twice differentiable in a neighborhood of zero.

Then,

[math]\frac{\sqrt{n}\left(\overline{X}_{n}-\mu\right)}{\sigma}\overset{d}{\rightarrow}N\left(0,1\right)[/math]

This is a striking result: whatever the distribution of the [math]X_{i}[/math], when [math]n[/math] is large, the standardized sample mean converges in distribution to the standard normal distribution.

Proof

We first define the standardized variable [math]Y_{i}=\frac{X_{i}-\mu}{\sigma}[/math], s.t.

[math]M_{Y}^{'}\left(0\right)=E\left(Y\right)=0[/math] and [math]M_{Y}^{''}\left(0\right)=Var\left(Y\right)=1[/math].

Notice that [math]\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}=\frac{n}{\sqrt{n}}\cdot\frac{\overline{X}_{n}-\mu}{\sigma}=\frac{\sqrt{n}\left(\overline{X}_{n}-\mu\right)}{\sigma}[/math], which is our statistic of interest.

So, if we show that the m.g.f. of [math]\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}[/math] converges to the m.g.f. of [math]N\left(0,1\right)[/math] (an argument that is valid because we have assumed that [math]M_{X}\left(t\right)[/math] exists in a neighborhood of zero), this will imply

[math]\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}=\frac{\sqrt{n}\left(\overline{X}_{n}-\mu\right)}{\sigma}\overset{d}{\rightarrow}N\left(0,1\right)[/math] (recall that the m.g.f. identifies the distribution of [math]Z[/math] only if [math]M_{Z}\left(t\right)[/math] exists in a neighborhood of zero).

Now, notice that if [math]Z_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}[/math],

[math]\begin{aligned}M_{Z_{n}}\left(t\right) & =E\left[\exp\left(t\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}\right)\right]=E\left[\prod_{i=1}^{n}\exp\left(t\frac{Y_{i}}{\sqrt{n}}\right)\right]\\ & \underset{(i.i.d.)}{=}\prod_{i=1}^{n}E\left[\exp\left(t\frac{Y_{i}}{\sqrt{n}}\right)\right]=\prod_{i=1}^{n}M_{Y_{i}}\left(\frac{t}{\sqrt{n}}\right)=M_{Y}\left(\frac{t}{\sqrt{n}}\right)^{n}\end{aligned}[/math].

Let us now expand [math]M_{Y}\left(\frac{t}{\sqrt{n}}\right)[/math] around zero:

[math]M_{Y}\left(\frac{t}{\sqrt{n}}\right)=M_{Y}\left(0\right)+M_{Y}^{'}\left(0\right)\frac{t}{\sqrt{n}}+\frac{M_{Y}^{''}\left(0\right)}{2}\left(\frac{t}{\sqrt{n}}\right)^{2}+R_{Y}\left(\frac{t}{\sqrt{n}}\right)[/math]

By Taylor’s theorem, the remainder [math]R_{Y}\left(\frac{t}{\sqrt{n}}\right)[/math] vanishes faster than [math]\frac{t^{2}}{n}[/math] as [math]n[/math] approaches infinity. We will ignore it from now on, but the following steps can be carried out precisely while accounting for it.

Notice also that [math]M_{Y}\left(0\right)=1,[/math] [math]M_{Y}^{'}\left(0\right)=E\left(Y\right)=0[/math] and [math]M_{Y}^{''}\left(0\right)=Var\left(Y\right)=1[/math], s.t. we obtain

[math]M_{Y}\left(\frac{t}{\sqrt{n}}\right)\simeq1+0+\frac{t^{2}}{2n}[/math]

When [math]n[/math] is large, we obtain [math]\lim_{n\rightarrow\infty}\left[M_{Y}\left(\frac{t}{\sqrt{n}}\right)\right]^{n}=\lim_{n\rightarrow\infty}\left(1+\frac{t^{2}}{2n}+\underset{\rightarrow0}{\underbrace{R_{Y}\left(\frac{t}{\sqrt{n}}\right)}}\right)^{n}=\lim_{n\rightarrow\infty}\left(1+\frac{t^{2}}{2n}\right)^{n}=\text{e}^{\frac{t^{2}}{2}}.[/math]

The result is the m.g.f. of [math]N\left(0,1\right)[/math], thereby proving the CLT.
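A quick simulation illustrates the theorem (a sketch, assuming Exponential(1) draws; the names are illustrative): even for a heavily skewed distribution, the standardized sample mean behaves like [math]N\left(0,1\right)[/math].

```python
import random
import statistics

random.seed(3)

def z_stat(n):
    # Standardized sample mean of Exponential(1) draws (mu = 1, sigma = 1):
    # sqrt(n) * (xbar - mu) / sigma, which the CLT says is approximately N(0, 1).
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return n ** 0.5 * (xbar - 1.0)

zs = [z_stat(500) for _ in range(2000)]
# Fraction of draws inside the standard normal 95% interval (-1.96, 1.96).
coverage = sum(1 for z in zs if -1.96 <= z <= 1.96) / len(zs)
print(round(statistics.mean(zs), 2), round(coverage, 2))
```

The mean of the simulated statistics is close to 0 and the coverage is close to 0.95, as a standard normal would give.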


Delta Method

The delta method allows us to approximate the distribution of the transformation of a random variable, as long as [math]n[/math] is large.

The delta method states that if a sequence of random variables [math]X_{n}[/math] satisfies [math]\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)[/math] then for any function [math]g\left(\cdot\right)[/math] continuously differentiable in a neighborhood of [math]\mu[/math] with derivative [math]g^{'}\left(\mu\right)[/math],

[math]\sqrt{n}\left(g\left(X_{n}\right)-g\left(\mu\right)\right)\overset{d}{\rightarrow}N\left(0,g^{'}\left(\mu\right)^{2}.\sigma^{2}\right)[/math]

Proof

The proof involves expanding [math]g\left(X_{n}\right)[/math] around [math]\mu[/math]:

[math]g\left(X_{n}\right)=g\left(\mu\right)+\left(X_{n}-\mu\right)g^{'}\left(\mu\right)+R\left(X_{n}\right)[/math].

We can rewrite this version of the Taylor expansion using the mean value theorem, which yields the alternative version

[math]g\left(X_{n}\right)=g\left(\mu\right)+\left(X_{n}-\mu\right)g^{'}\left(\tilde{X}_{n}\right),\text{ for some }\tilde{X}_{n}\text{ between }X_{n}\text{ and }\mu[/math].

Moving [math]g\left(\mu\right)[/math] to the left-hand side yields [math]g\left(X_{n}\right)-g\left(\mu\right)=\left(X_{n}-\mu\right)g^{'}\left(\tilde{X}_{n}\right)[/math]

Because [math]\tilde{X}_{n}[/math] lies between [math]X_{n}[/math] and [math]\mu[/math], [math]\tilde{X}_{n}\overset{p}{\rightarrow}\mu[/math] if [math]X_{n}\overset{p}{\rightarrow}\mu[/math], and the latter holds here because [math]\sqrt{n}\left(X_{n}-\mu\right)[/math] converging in distribution implies [math]X_{n}-\mu\overset{p}{\rightarrow}0[/math].

Remembering that

[math]\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)[/math],

then,

[math]\underset{\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)}{\underbrace{\sqrt{n}\left(X_{n}-\mu\right)}}.\underset{\overset{p}{\rightarrow}g^{'}\left(\mu\right)}{\underbrace{g^{'}\left(\tilde{X}_{n}\right)}}\Rightarrow\sqrt{n}\left(g\left(X_{n}\right)-g\left(\mu\right)\right)\overset{d}{\rightarrow}N\left(0,g^{'}\left(\mu\right)^{2}.\sigma^{2}\right)[/math]

To obtain the result above, we first used the facts that [math]\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)[/math] and that [math]g^{'}\left(\tilde{X}_{n}\right)\overset{p}{\rightarrow}g^{'}\left(\mu\right)[/math]. By Slutsky’s theorem, the product converges in distribution to the distribution of [math]g^{'}\left(\mu\right).\sigma.Z[/math], where [math]Z\sim N\left(0,1\right)[/math]. Finally, the last result follows from the fact that [math]\sqrt{n}\left(X_{n}-\mu\right)g^{'}\left(\tilde{X}_{n}\right)=\sqrt{n}\left(g\left(X_{n}\right)-g\left(\mu\right)\right)[/math].

This result is often useful for deriving the distribution of a transformation of an estimator. For example, while estimating a model, it may be convenient to constrain a parameter estimate of [math]\theta[/math] to be positive. One way to accomplish this is to estimate some parameter [math]\widehat{\beta}[/math], s.t. [math]\widehat{\theta}=\exp\left(\widehat{\beta}\right)[/math]. For any value of [math]\widehat{\beta}[/math], applying the exponential yields a positive estimate [math]\widehat{\theta}[/math]. If we know that [math]\sqrt{n}\left(\widehat{\beta}-\beta\right)[/math] converges in distribution to [math]N\left(0,\sigma^{2}\right)[/math], then the delta method yields the asymptotic distribution of [math]\widehat{\theta}[/math]. To be clear, in this case the sequence of random variables would be the data [math]X_{1}..X_{n}[/math], such that [math]\widehat{\beta}_{n}=\widehat{\beta}\left(X_{1}..X_{n}\right)[/math], for example.
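As a sketch of this example (with illustrative parameter values [math]\beta=0.3[/math], [math]\sigma=1[/math], and illustrative names), a simulation confirms the delta-method prediction that the standard deviation of [math]\sqrt{n}\left(\widehat{\theta}-\exp\left(\beta\right)\right)[/math] is approximately [math]g^{'}\left(\beta\right)\sigma=\exp\left(\beta\right)\sigma[/math]:

```python
import math
import random
import statistics

random.seed(4)

beta, sigma, n = 0.3, 1.0, 1000

def scaled_error():
    # beta_hat is the sample mean of n Normal(beta, sigma^2) draws, so
    # sqrt(n) * (beta_hat - beta) is exactly N(0, sigma^2) here.
    beta_hat = sum(random.gauss(beta, sigma) for _ in range(n)) / n
    # theta_hat = exp(beta_hat); the delta method predicts
    # sqrt(n) * (theta_hat - exp(beta)) ->d N(0, exp(beta)^2 * sigma^2).
    return n ** 0.5 * (math.exp(beta_hat) - math.exp(beta))

vals = [scaled_error() for _ in range(1000)]
predicted_sd = math.exp(beta) * sigma  # g'(beta) * sigma, about 1.35
print(round(statistics.pstdev(vals), 2), round(predicted_sd, 2))
```

The simulated standard deviation matches the delta-method prediction closely.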


Somewhat Pedantic Remark on Notation

You may have noticed that statements like

[math]\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)[/math]

seem overly complicated. It may seem simpler to write the statement above as

[math]\frac{\sqrt{n}\left(X_{n}-\mu\right)}{\sigma}\overset{d}{\rightarrow}N\left(0,1\right)[/math]

or even

[math]X_{n}\overset{d}{\rightarrow}N\left(\mu,\frac{\sigma^{2}}{n}\right)[/math].

While the second statement is perfectly fine, the last statement is problematic. The reason is that the right-hand side, [math]N\left(\mu,\frac{\sigma^{2}}{n}\right)[/math], is the limiting distribution of the left-hand side as [math]n\rightarrow\infty[/math], and so should not depend on [math]n[/math]. So, try to keep [math]n[/math] on the left-hand side. If you really want to have it on the right-hand side, you can use the alternative notation,

[math]X_{n}\overset{\sim}{\sim}N\left(\mu,\frac{\sigma^{2}}{n}\right).[/math]