# Convergence

In this lecture we will focus on extremely useful results that occur when we consider large sample sizes. In order to analyze these results, we need to introduce a few concepts related to convergence of sequences of random variables.

A sequence of random variables $X_{n}$ converges to a random variable $X$:

• in probability if $\lim_{n\rightarrow\infty}P\left(\left|X_{n}-X\right|\geq\varepsilon\right)=0,\,\forall\varepsilon\gt 0$
• almost surely if $P\left(\lim_{n\rightarrow\infty}\left|X_{n}-X\right|\gt \varepsilon\right)=0,\,\forall\varepsilon\gt 0$
• in quadratic mean if $\lim_{n\rightarrow\infty}E\left[\left(X_{n}-X\right)^{2}\right]=0$

The concepts above also apply when the limit $X$ is a constant. In this case, we often denote it by $\mu$.

Some convergence concepts are stronger than others. The following facts are useful:

• $X_{n}\overset{q.m.}{\rightarrow}X\Rightarrow X_{n}\overset{p}{\rightarrow}X$
• $X_{n}\overset{a.s.}{\rightarrow}X\Rightarrow X_{n}\overset{p}{\rightarrow}X$
• Quadratic mean convergence neither implies, nor is implied by, almost sure convergence
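
The first implication in the list above can be checked in one line. A sketch, using Markov's inequality applied to the non-negative random variable $\left(X_{n}-X\right)^{2}$ (the inequality itself is not stated in these notes): for any $\varepsilon\gt 0$,

$P\left(\left|X_{n}-X\right|\geq\varepsilon\right)=P\left(\left(X_{n}-X\right)^{2}\geq\varepsilon^{2}\right)\leq\frac{E\left[\left(X_{n}-X\right)^{2}\right]}{\varepsilon^{2}}$

If $X_{n}\overset{q.m.}{\rightarrow}X$, the right-hand side goes to zero as $n\rightarrow\infty$, so the left-hand side does too, which is exactly convergence in probability.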

## Example: Convergence in Probability vs. Almost Sure Convergence

Let

$X_{n}=\begin{cases} 1, & \text{with probability }\frac{1}{n}\\ 0, & \text{with probability }1-\frac{1}{n} \end{cases}$

$Y_{n}=\begin{cases} 1, & \text{with probability }\frac{1}{n^{2}}\\ 0, & \text{with probability }1-\frac{1}{n^{2}} \end{cases}$

Do these random sequences converge in probability and/or almost surely to zero?

We first check that both sequences converge in probability to zero. For $X_{n}$, we require

\begin{aligned} & \lim_{n\rightarrow\infty}P\left(\left|X_{n}-0\right|\geq\varepsilon\right)=0\\ \Leftrightarrow & \lim_{n\rightarrow\infty}P\left(X_{n}\geq\varepsilon\right)=0\end{aligned}

If $\varepsilon\gt 1$, the condition above is always satisfied, since $X_{n}\in\left\{ 0,1\right\}$.

For $\varepsilon\in\left(0,1\right]$, we have \begin{aligned} & \lim_{n\rightarrow\infty}P\left(X_{n}\geq\varepsilon\right)=0\\ \Leftrightarrow & \lim_{n\rightarrow\infty}P\left(X_{n}=1\right)=0\\ \Leftrightarrow & \lim_{n\rightarrow\infty}\frac{1}{n}=0\end{aligned}

which does indeed hold. The same method can be used to prove convergence in probability for $Y_{n}$.

Now, consider convergence almost surely.

For $X_{n}$, we require

\begin{aligned} & P\left(\lim_{n\rightarrow\infty}\left|X_{n}-0\right|\gt \varepsilon\right)=0\\ \Leftrightarrow & P\left(\lim_{n\rightarrow\infty}X_{n}\gt \varepsilon\right)=0\\ \Rightarrow & P\left(\lim_{n\rightarrow\infty}X_{n}=1\right)=0\end{aligned}

where the last line follows from the fact that $X_{n}\in\left\{ 0,1\right\}$, and that the condition can only be satisfied if the probability of $X_{n}$ equaling 1 vanishes as $n\rightarrow\infty$.

We will approach this problem indirectly.

Consider the sum, starting at a very high $n$, of the probabilities that $X_{i}=1$: $\sum_{i=n}^{\infty}P\left(X_{i}=1\right)=\sum_{i=n}^{\infty}\frac{1}{i}$.

If this sum diverges, the probabilities shrink too slowly: when the draws are independent (an assumption we make here), the second Borel-Cantelli lemma implies that ones keep occurring no matter how far out we start, so the sequence cannot settle at zero. If, on the other hand, the sum converges, the first Borel-Cantelli lemma implies that, with probability one, only finitely many ones occur, so the sequence eventually equals zero. Notice that

$\sum_{i=n}^{\infty}P\left(X_{i}=1\right)=\sum_{i=n}^{\infty}\frac{1}{i}=\infty$

and

$\sum_{i=n}^{\infty}P\left(Y_{i}=1\right)=\sum_{i=n}^{\infty}\frac{1}{i^{2}}\lt \infty$

(We don’t prove the results above here.)
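
Although the two results are not proved here, they are easy to check numerically. The sketch below truncates each tail sum at a finite upper limit; the starting index `start = 100` and the helper name `tail_sum` are illustrative choices, not part of the original notes.

```python
def tail_sum(term, start, stop):
    """Sum term(i) for i = start, ..., stop (inclusive)."""
    return sum(term(i) for i in range(start, stop + 1))


start = 100  # hypothetical "very high n" at which the tail begins

for stop in (10**3, 10**4, 10**5):
    harmonic = tail_sum(lambda i: 1 / i, start, stop)     # terms P(X_i = 1)
    squares = tail_sum(lambda i: 1 / i**2, start, stop)   # terms P(Y_i = 1)
    print(f"stop={stop:>6}  sum 1/i = {harmonic:.4f}  sum 1/i^2 = {squares:.6f}")
```

The first column keeps growing (roughly like the logarithm of the upper limit), while the second stays below $\frac{1}{99}$ however far the upper limit is pushed.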

While both sequences converge in probability to zero, only $Y_{n}$ converges almost surely. The reason is that the probability of observing $X_{n}=1$ shrinks too slowly: however far out we start, the sum of the subsequent probabilities diverges, so (with independent draws) ones keep appearing. The probability of observing $Y_{n}=1$ vanishes fast enough that the sum of subsequent probabilities converges, so ones eventually stop appearing.

## Estimator Consistency

Because estimators are statistics, we can apply the definitions we have found to them as well.

We say that an estimator $\widehat{\theta}$ is consistent if

$\widehat{\theta}\left(X_{1},\ldots,X_{n}\right)\overset{p}{\rightarrow}\theta,\,\forall\theta\in\Theta$,

where $X_{1},\ldots,X_{n}$ is a sequence of random variables (usually, data).

For example, under suitable regularity conditions, the maximum-likelihood estimator is consistent, i.e., $\widehat{\theta}_{ML}\overset{p}{\rightarrow}\theta_{0}$, where $\theta_{0}$ denotes the true parameter value.

# Theorem: Law of Large Numbers

Let $X_{1},\ldots,X_{n}$ be a sequence of uncorrelated random variables, where $E\left(X_{i}\right)=\mu$ and $Var\left(X_{i}\right)=\sigma^{2}\lt \infty$.

Then,

$\overline{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\overset{p}{\rightarrow}\mu$

This result is probably not very surprising, and in a sense, we have been using it intuitively for the method-of-moments estimator.

The sample mean, as $n$ increases, converges in probability to the population mean $E\left(X_{i}\right)$.

The proof follows from Chebychev’s inequality:

$Pr\left(\left|\overline{X}_{n}-\mu\right|\geq\varepsilon\right)\leq\frac{Var\left(\overline{X}_{n}\right)}{\varepsilon^{2}}=\frac{\sigma^{2}}{n\varepsilon^{2}}$ such that $\lim_{n\rightarrow\infty}\,Pr\left(\left|\overline{X}_{n}-\mu\right|\geq\varepsilon\right)\leq\lim_{n\rightarrow\infty}\,\frac{\sigma^{2}}{n\varepsilon^{2}}=0,\,\forall\varepsilon\gt 0.$
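
The bound in the proof can be compared with simulated frequencies. The sketch below is only an illustration: it assumes independent Uniform(0, 1) draws (so $\mu=\frac{1}{2}$ and $\sigma^{2}=\frac{1}{12}$), and the tolerance `eps`, the number of replications `reps`, and the function names are arbitrary choices.

```python
import random


def sample_mean(n, rng):
    """Mean of n independent Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n


def exceedance_frequency(n, eps, reps, rng, mu=0.5):
    """Fraction of replications in which |sample mean - mu| >= eps."""
    hits = sum(abs(sample_mean(n, rng) - mu) >= eps for _ in range(reps))
    return hits / reps


rng = random.Random(0)
mu, sigma2 = 0.5, 1 / 12   # mean and variance of Uniform(0, 1)
eps, reps = 0.05, 2000     # illustrative tolerance and replication count

for n in (25, 100, 400):
    freq = exceedance_frequency(n, eps, reps, rng, mu)
    bound = min(1.0, sigma2 / (n * eps**2))   # Chebyshev bound, capped at 1
    print(f"n={n:>4}  empirical={freq:.4f}  Chebyshev bound={bound:.4f}")
```

The empirical frequency should sit below the bound for every $n$, and both shrink as $n$ grows. The bound is typically quite loose; it is enough to prove the theorem, but it is not a good approximation of the actual probability.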

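Notice that we did not restrict $X_{i}$ to be a random sample: the proof only uses the common mean, the finite variance, and the fact that $Var\left(\overline{X}_{n}\right)=\frac{\sigma^{2}}{n}$, which holds whenever the $X_{i}$ are uncorrelated. If we are willing to assume a random sample, we can prove a stronger result: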

## Strong Law of Large Numbers

Let $X_{1},\ldots,X_{n}$ be a random sample with mean $\mu$. Then,

$\overline{X}_{n}\overset{a.s.}{\rightarrow}\mu$

In this case, $\sigma^{2}$ may not exist, which means that convergence in quadratic mean may not hold.
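
To see a case where the mean exists but the variance does not, the sketch below follows the running mean of a single simulated path. It assumes Pareto draws with shape `alpha = 1.5` (minimum value 1), for which the mean is $\frac{\alpha}{\alpha-1}=3$ and the variance is infinite; this distribution is an illustrative choice, not one used in the notes.

```python
import random


def running_means(draws):
    """Running sample means X-bar_1, X-bar_2, ... of an iterable of draws."""
    total, means = 0.0, []
    for k, x in enumerate(draws, start=1):
        total += x
        means.append(total / k)
    return means


rng = random.Random(0)
alpha = 1.5      # shape: mean = alpha / (alpha - 1) = 3, variance is infinite
n = 100_000      # length of the simulated path (illustrative)
path = running_means(rng.paretovariate(alpha) for _ in range(n))

for k in (100, 1_000, 10_000, 100_000):
    print(f"n={k:>7}  running mean = {path[k - 1]:.3f}")
```

The running mean drifts toward 3, but with occasional jumps caused by very large draws, so convergence is visibly slower than in the finite-variance case.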

# Convergence in Distribution

A sequence of random variables $X_{n}$ converges in distribution to a random variable $Y$ if, at all continuity points of $F_{Y}\left(y\right)$,

$\lim_{n\rightarrow\infty}\,F_{X_{n}}\left(y\right)=F_{Y}\left(y\right)$

Convergence in distribution is very intuitive: As $n\rightarrow\infty$, the distributions of $X_{n}$ converge point-by-point to that of $Y$.

## Example

The reason we require the limit condition to hold only at continuity points of $F_{Y}$ can be illustrated with an example (the specific reasoning is a bit deeper, but we won’t concern ourselves with that).

Let $f_{X_{n}}\left(x\right)=\begin{cases} n, & 0\lt x\lt \frac{1}{n}\\ 0, & \text{otherwise} \end{cases}$

Clearly, as $n\rightarrow\infty$, $X_{n}$ is approaching a mass point at zero, so we may want it to converge in distribution to $Y$, where $P\left(Y=0\right)=1$. However, the value of $F_{X_{n}}\left(0\right)$ is zero for all $n$, and so $\lim_{n\rightarrow\infty}F_{X_{n}}\left(0\right)=0\neq F_{Y}\left(0\right)=1$.

By not requiring that the limit condition hold at $y=0$, the only discontinuity point of $F_{Y}$, we do satisfy convergence in distribution, i.e., $X_{n}\overset{d}{\rightarrow}Y$.
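
The example can be checked by evaluating the two distribution functions directly. In the sketch below, `cdf_xn` and `cdf_y` are hypothetical helper names for $F_{X_{n}}$ (uniform on $\left(0,\frac{1}{n}\right)$) and $F_{Y}$ (point mass at zero).

```python
def cdf_xn(y, n):
    """CDF of X_n, uniform on (0, 1/n): F(y) = n*y clipped to [0, 1]."""
    return min(max(n * y, 0.0), 1.0)


def cdf_y(y):
    """CDF of the point mass at zero: F(y) = 1 if y >= 0, else 0."""
    return 1.0 if y >= 0 else 0.0


for y in (-0.01, 0.0, 0.01):
    values = [cdf_xn(y, n) for n in (1, 10, 100, 1000)]
    print(f"y={y:+.2f}  F_Xn(y) for n=1,10,100,1000: {values}  F_Y(y)={cdf_y(y)}")
```

At $y=-0.01$ and $y=0.01$ the values of $F_{X_{n}}$ reach those of $F_{Y}$ once $n$ is large enough, while at $y=0$ they stay at zero for every $n$ even though $F_{Y}\left(0\right)=1$.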

# Slutsky’s Theorem

Slutsky’s Theorem provides some nice results that apply to convergence in distribution:

If a sequence $X_{n}$ converges in distribution to $X$, and a sequence $Y_{n}$ converges in probability to a constant $c$, then

• $X_{n}\cdot Y_{n}\overset{d}{\rightarrow}c\cdot X$
• $X_{n}+Y_{n}\overset{d}{\rightarrow}c+X$
• $\frac{X_{n}}{Y_{n}}\overset{d}{\rightarrow}\frac{X}{c}$ if $c\neq0$

The results above also hold if $X_{n}$ converges in probability, in which case the implications also apply to convergence in probability.
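
A standard use of the third result is replacing an unknown standard deviation by an estimate. As a sketch, suppose that $\sqrt{n}\left(\overline{X}_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)$ and that some estimator $S_{n}$ (for instance the sample standard deviation; both conditions are assumptions of this example) satisfies $S_{n}\overset{p}{\rightarrow}\sigma$ with $\sigma\neq0$. Then

$\frac{\sqrt{n}\left(\overline{X}_{n}-\mu\right)}{S_{n}}\overset{d}{\rightarrow}\frac{N\left(0,\sigma^{2}\right)}{\sigma}=N\left(0,1\right)$

so the statistic has the same limit distribution as if $\sigma$ were known.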

# Central Limit Theorem

Let $X_{1},\ldots,X_{n}$ be a random sample with mean $\mu$ and variance $\sigma^{2}$, whose moment generating function $M_{X}\left(t\right)$ exists and is twice differentiable in a neighborhood of zero.

Then,

$\frac{\sqrt{n}\left(\overline{X}_{n}-\mu\right)}{\sigma}\overset{d}{\rightarrow}N\left(0,1\right)$

This is a striking result. When $n$ is large, the distribution of the standardized sample mean is approximately standard normal, whatever the distribution of the $X_{i}$ (as long as the conditions above hold).
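
A small simulation makes the statement concrete. The sketch below assumes Exponential(1) data, which is strongly skewed and has $\mu=\sigma=1$; the sample size `n`, the number of replications `reps`, and the function names are illustrative choices.

```python
import math
import random


def standardized_mean(n, rng, mu=1.0, sigma=1.0):
    """sqrt(n) * (sample mean - mu) / sigma for n Exponential(1) draws."""
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma


def normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


rng = random.Random(0)
n, reps = 100, 4000   # illustrative sample size and replication count
draws = [standardized_mean(n, rng) for _ in range(reps)]

for z in (-1.0, 0.0, 1.0):
    empirical = sum(d <= z for d in draws) / reps
    print(f"z={z:+.1f}  empirical CDF={empirical:.3f}  Phi(z)={normal_cdf(z):.3f}")
```

The empirical distribution function of the standardized means should be close to $\Phi$ at each point, with small discrepancies that reflect both simulation noise and the skewness of the underlying data at finite $n$.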

## Proof

We first define the standardized variable $Y_{i}=\frac{X_{i}-\mu}{\sigma}$, so that

$M_{Y}^{'}\left(0\right)=E\left(Y_{i}\right)=0$ and $M_{Y}^{''}\left(0\right)=E\left(Y_{i}^{2}\right)=Var\left(Y_{i}\right)=1$.

Notice that $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{X_{i}-\mu}{\sigma}=\frac{n}{\sqrt{n}}\cdot\frac{\overline{X}_{n}-\mu}{\sigma}=\frac{\sqrt{n}\left(\overline{X}_{n}-\mu\right)}{\sigma}$, which is our statistic of interest.

So, if we show that the m.g.f. of $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}$ converges to the m.g.f. of $N\left(0,1\right)$ then, because we have assumed that $M_{X}\left(t\right)$ exists in a neighborhood of zero, this will imply

$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}=\frac{\sqrt{n}\left(\overline{X}_{n}-\mu\right)}{\sigma}\overset{d}{\rightarrow}N\left(0,1\right)$ (Recall that the m.g.f. only identifies the distribution of $Z$ if $M_{Z}\left(t\right)$ exists in a neighborhood of zero).

Now, notice that if $Z_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}$,

\begin{aligned} M_{Z_{n}}\left(t\right) & =E\left[\exp\left(t\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Y_{i}\right)\right]=E\left[\prod_{i=1}^{n}\exp\left(t\frac{Y_{i}}{\sqrt{n}}\right)\right]\\ & \underset{(i.i.d.)}{=}\prod_{i=1}^{n}E\left[\exp\left(t\frac{Y_{i}}{\sqrt{n}}\right)\right]=\prod_{i=1}^{n}M_{Y_{i}}\left(\frac{t}{\sqrt{n}}\right)=\left[M_{Y}\left(\frac{t}{\sqrt{n}}\right)\right]^{n}\end{aligned}

Let us now expand $M_{Y}\left(\frac{t}{\sqrt{n}}\right)$ around zero:

$M_{Y}\left(\frac{t}{\sqrt{n}}\right)=M_{Y}\left(0\right)+M_{Y}^{'}\left(0\right)\frac{t}{\sqrt{n}}+\frac{M_{Y}^{''}\left(0\right)}{2}\left(\frac{t}{\sqrt{n}}\right)^{2}+R_{Y}\left(\frac{t}{\sqrt{n}}\right)$

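By Taylor’s theorem, the remainder $R_{Y}\left(\frac{t}{\sqrt{n}}\right)$ is of smaller order than $\left(\frac{t}{\sqrt{n}}\right)^{2}$, so that $n\cdot R_{Y}\left(\frac{t}{\sqrt{n}}\right)\rightarrow0$ as $n$ approaches infinity. We will ignore it from now on, but it is possible to prove precisely that it vanishes in the following steps.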

Notice also that $M_{Y}\left(0\right)=1,$ $M_{Y}^{'}\left(0\right)=E\left(Y\right)=0$ and $M_{Y}^{''}\left(0\right)=Var\left(Y\right)=1$, s.t. we obtain

$M_{Y}\left(\frac{t}{\sqrt{n}}\right)\simeq1+0+\frac{t^{2}}{2n}$

When $n$ is large, we obtain $\lim_{n\rightarrow\infty}\left[M_{Y}\left(\frac{t}{\sqrt{n}}\right)\right]^{n}=\lim_{n\rightarrow\infty}\left(1+\frac{t^{2}}{2n}+\underset{\rightarrow0}{\underbrace{R_{Y}\left(\frac{t}{\sqrt{n}}\right)}}\right)^{n}=\lim_{n\rightarrow\infty}\left(1+\frac{t^{2}}{2n}\right)^{n}=\text{e}^{\frac{t^{2}}{2}}.$

The result is the m.g.f. of $N\left(0,1\right)$, thereby proving the CLT.
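
The final limit in the proof, $\left(1+\frac{t^{2}}{2n}\right)^{n}\rightarrow\text{e}^{\frac{t^{2}}{2}}$, is easy to verify numerically. The sketch below ignores the remainder term, as the proof does; the function name `mgf_limit_gap` is an illustrative choice.

```python
import math


def mgf_limit_gap(t, n):
    """|(1 + t^2/(2n))^n - exp(t^2/2)|: the gap to the N(0, 1) m.g.f. at t."""
    return abs((1.0 + t**2 / (2.0 * n)) ** n - math.exp(t**2 / 2.0))


for n in (10, 1_000, 100_000):
    print(f"n={n:>7}  gap at t=1: {mgf_limit_gap(1.0, n):.2e}")
```

The gap shrinks roughly in proportion to $\frac{1}{n}$ for a fixed $t$.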

# Delta Method

The delta method allows us to approximate the distribution of the transformation of a random variable, as long as $n$ is large.

The delta method states that if a sequence of random variables $X_{n}$ satisfies $\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)$ then for any function $g\left(\cdot\right)$ continuously differentiable in a neighborhood of $\mu$ with derivative $g^{'}\left(\mu\right)$,

$\sqrt{n}\left(g\left(X_{n}\right)-g\left(\mu\right)\right)\overset{d}{\rightarrow}N\left(0,g^{'}\left(\mu\right)^{2}\cdot\sigma^{2}\right)$

## Proof

The proof involves expanding $g\left(X_{n}\right)$ around $\mu$:

$g\left(X_{n}\right)=g\left(\mu\right)+\left(X_{n}-\mu\right)g^{'}\left(\mu\right)+R\left(X_{n}\right)$.

We can rewrite this version of the Taylor expansion using the mean value theorem, which yields the alternative version

$g\left(X_{n}\right)=g\left(\mu\right)+\left(X_{n}-\mu\right)g^{'}\left(\tilde{X}_{n}\right)$, for some $\tilde{X}_{n}$ between $X_{n}$ and $\mu$.

Moving $g\left(\mu\right)$ to the lhs yields $g\left(X_{n}\right)-g\left(\mu\right)=\left(X_{n}-\mu\right)g^{'}\left(\tilde{X}_{n}\right)$

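Because $\tilde{X}_{n}$ lies between $X_{n}$ and $\mu$, $\tilde{X}_{n}\overset{p}{\rightarrow}\mu$ if $X_{n}\overset{p}{\rightarrow}\mu$. The latter holds here: $X_{n}-\mu=\frac{1}{\sqrt{n}}\cdot\sqrt{n}\left(X_{n}-\mu\right)$, where $\frac{1}{\sqrt{n}}\rightarrow0$ and the second factor converges in distribution, so by Slutsky’s theorem the product converges in distribution, and therefore in probability, to zero.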

Remembering that

$\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)$,

then,

$\underset{\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)}{\underbrace{\sqrt{n}\left(X_{n}-\mu\right)}}\cdot\underset{\overset{p}{\rightarrow}g^{'}\left(\mu\right)}{\underbrace{g^{'}\left(\tilde{X}_{n}\right)}}\Rightarrow\sqrt{n}\left(g\left(X_{n}\right)-g\left(\mu\right)\right)\overset{d}{\rightarrow}N\left(0,g^{'}\left(\mu\right)^{2}\cdot\sigma^{2}\right)$

To obtain the result above, we first used the facts that $\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)$ and that $g^{'}\left(\tilde{X}_{n}\right)\overset{p}{\rightarrow}g^{'}\left(\mu\right)$, which follows from $\tilde{X}_{n}\overset{p}{\rightarrow}\mu$ and the continuity of $g^{'}$ near $\mu$. By Slutsky’s theorem, the product converges in distribution to the distribution of $g^{'}\left(\mu\right)\cdot\sigma\cdot Z$, where $Z\sim N\left(0,1\right)$. Finally, the last result follows from the fact that $\sqrt{n}\left(X_{n}-\mu\right)g^{'}\left(\tilde{X}_{n}\right)=\sqrt{n}\left(g\left(X_{n}\right)-g\left(\mu\right)\right)$.

This result is often useful to calculate the distribution of a transformation of an estimator. For example, while estimating a model, it may be convenient to constrain a parameter estimate of $\theta$ to be positive. One way to accomplish this is to estimate some parameter $\widehat{\beta}$, s.t. $\widehat{\theta}=\exp\left(\widehat{\beta}\right)$. For any value of $\widehat{\beta}$, we will obtain a positive estimate $\widehat{\theta}$ by applying the exponential. If we know that $\sqrt{n}\left(\widehat{\beta}-\beta\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)$, with $\theta=\exp\left(\beta\right)$, then the delta method with $g=\exp$ yields the large-sample distribution of $\widehat{\theta}$: $\sqrt{n}\left(\widehat{\theta}-\theta\right)\overset{d}{\rightarrow}N\left(0,\exp\left(\beta\right)^{2}\cdot\sigma^{2}\right)$. To be clear, in this case the estimator $\widehat{\beta}_{n}$, which is itself a function of the data, plays the role of the sequence $X_{n}$ in the statement of the delta method.
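
The exponential example can be checked by simulation. The sketch below makes several assumptions that are not in the notes: the values `beta = 0.5`, `sigma = 1.0` and `n = 400` are hypothetical, and $\widehat{\beta}$ is drawn directly from $N\left(\beta,\frac{\sigma^{2}}{n}\right)$, as would be the case for the mean of a normal sample.

```python
import math
import random
import statistics

rng = random.Random(0)
beta, sigma = 0.5, 1.0    # hypothetical true value and asymptotic std. dev.
n, reps = 400, 20_000     # illustrative sample size and replication count

# Assumption: beta_hat is exactly N(beta, sigma^2 / n), e.g. a normal sample mean.
scaled = []
for _ in range(reps):
    beta_hat = rng.gauss(beta, sigma / math.sqrt(n))
    theta_hat = math.exp(beta_hat)                    # g(beta_hat) with g = exp
    scaled.append(math.sqrt(n) * (theta_hat - math.exp(beta)))

simulated_var = statistics.variance(scaled)
delta_var = math.exp(beta) ** 2 * sigma**2            # g'(beta)^2 * sigma^2

print(f"simulated variance    = {simulated_var:.3f}")
print(f"delta-method variance = {delta_var:.3f}")
```

The two variances should be close, with the simulated one slightly larger because the exponential is convex and $n$ is finite.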

# Somewhat Pedantic Remark on Notation

You may have noticed that statements like

$\sqrt{n}\left(X_{n}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)$

seem overly complicated. It may seem simpler to write the statement above as

$\frac{\sqrt{n}\left(X_{n}-\mu\right)}{\sigma}\overset{d}{\rightarrow}N\left(0,1\right)$

or even

$X_{n}\overset{d}{\rightarrow}N\left(\mu,\frac{\sigma^{2}}{n}\right)$.

While the second statement is perfectly fine, the last statement is problematic. The reason is that the right-hand side, $N\left(\mu,\frac{\sigma^{2}}{n}\right)$, is supposed to be the limiting distribution of the left-hand side as $n\rightarrow\infty$, and so should not depend on $n$. So, try to keep $n$ on the left-hand side. If you really want to have it on the right-hand side, you can use the alternative notation,

$X_{n}\overset{\sim}{\sim}N\left(\mu,\frac{\sigma^{2}}{n}\right).$