Full Lecture 15

Asymptotic Properties of ML Estimators

Up to now, we have derived the distributions of several test statistics. This has been a relatively tedious process, however.

For example, the hypothesis test for the [math]\mu[/math] parameter in the normal distribution depends on whether [math]\sigma^{2}[/math] is known or not. In general, the distribution of more complicated tests may be extremely challenging to find.

It turns out that, as long as [math]n[/math] is large, we can use a nice property of the maximum likelihood estimator.

Let [math]X_{1},\ldots,X_{n}[/math] be a random sample with pdf [math]f\left(\left.x\right|\theta\right)[/math]. Under a few regularity conditions,

[math]\sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)^{-1}\right)[/math]

where [math]I\left(\theta_{0}\right)[/math] is the Fisher information at [math]\theta=\theta_{0}[/math].
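For example, if [math]X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)[/math] with [math]\sigma^{2}[/math] known, then [math]\widehat{\mu}_{ML}=\overline{x}[/math] and [math]I\left(\mu\right)=\frac{1}{\sigma^{2}}[/math], so the theorem reduces to the familiar [math]\sqrt{n}\left(\overline{x}-\mu\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}\right)[/math].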

Proof

First,

  • Denote [math]l\left(\theta\right)=l\left(\left.\theta\right|x_{1},\ldots,x_{n}\right)=\sum_{i=1}^{n}\log\left(f\left(\left.x_{i}\right|\theta\right)\right)[/math].
  • Let [math]I\left(\theta\right)=E_{\theta}\left[\left(l_{1}^{'}\left(\theta\right)\right)^{2}\right]=-E_{\theta}\left[l_{1}^{''}\left(\theta\right)\right]=Var_{\theta}\left(l_{1}^{'}\left(\theta\right)\right)[/math], where [math]l_{1}\left(\theta\right)=l_{1}\left(\left.\theta\right|x_{1}\right)[/math] is the log-likelihood for one observation.

We expand the first derivative of the log-likelihood function around [math]\theta_{0}[/math]:

[math]l^{'}\left(\theta\right)=l^{'}\left(\theta_{0}\right)+\left(\theta-\theta_{0}\right)l^{''}\left(\theta_{0}\right)+\frac{\left(\theta-\theta_{0}\right)^{2}}{2}l^{'''}\left(\theta^{*}\right),\,\theta^{*}\in\left(\theta,\theta_{0}\right)[/math]

Now, we evaluate the expansion at [math]\theta=\widehat{\theta}_{ML}[/math]:

[math]\begin{aligned} & \underset{=0}{\underbrace{l^{'}\left(\widehat{\theta}_{ML}\right)}}=l^{'}\left(\theta_{0}\right)+\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{''}\left(\theta_{0}\right)+\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{2}l^{'''}\left(\theta^{*}\right),\,\theta^{*}\in\left(\widehat{\theta}_{ML},\theta_{0}\right)\\ \Leftrightarrow & \widehat{\theta}_{ML}-\theta_{0}=\frac{-l^{'}\left(\theta_{0}\right)}{l^{''}\left(\theta_{0}\right)+\frac{1}{2}\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{'''}\left(\theta^{*}\right)}\\ \Leftrightarrow & \sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)=\frac{\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)}{-\frac{1}{n}l^{''}\left(\theta_{0}\right)-\frac{1}{2n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{'''}\left(\theta^{*}\right)}\end{aligned}[/math]

(Here [math]\theta^{*}[/math] lies between [math]\widehat{\theta}_{ML}[/math] and [math]\theta_{0}[/math], and the first derivative vanishes because [math]\widehat{\theta}_{ML}[/math] maximizes the log-likelihood.)

Under the assumption that [math]l^{'''}\left(\theta^{*}\right)[/math] is “well-behaved” near [math]\theta_{0}[/math], so that the last term in the denominator vanishes as [math]\widehat{\theta}_{ML}\overset{p}{\rightarrow}\theta_{0}[/math], notice that

[math]\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)\right)=\sqrt{n}\overline{W}\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)\right)[/math] where [math]W_{i}=\frac{\partial}{\partial\theta}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)[/math].

To prove this first result, by the central limit theorem it suffices to show that [math]E\left(W_{i}\right)=0[/math] and [math]Var\left(W_{i}\right)=I\left(\theta_{0}\right)[/math].

Notice that

[math]\begin{aligned} E\left(W_{i}\right) & =E\left(\left.\frac{\partial}{\partial\theta}\log\,f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}\right)\\ & =E\left(\frac{\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}}{f\left(\left.x\right|\theta_{0}\right)}\right)\\ & =\int_{-\infty}^{\infty}\frac{\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}}{f\left(\left.x\right|\theta_{0}\right)}f\left(\left.x\right|\theta_{0}\right)dx\\ & =\int_{-\infty}^{\infty}\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}dx\\ & =0\end{aligned}[/math]

The last identity follows from the fact that [math]\int_{-\infty}^{\infty}f\left(\left.x\right|\theta\right)dx=1[/math], i.e., the value of the integral w.r.t. [math]x[/math] does not change with [math]\theta[/math], such that

[math]\begin{aligned} & \frac{d}{d\theta}\int_{-\infty}^{\infty}f\left(\left.x\right|\theta\right)dx=\frac{d}{d\theta}1\\ \Leftrightarrow & \int_{-\infty}^{\infty}\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)dx=0\end{aligned}[/math]

And since [math]E\left(W_{i}\right)=0[/math], we have [math]Var\left(W_{i}\right)=E\left(W_{i}^{2}\right)[/math], which is exactly the expression that defines the Fisher information (Lecture 10, Cramér-Rao Lower Bound).

As for the denominator, notice that by the law of large numbers, [math]-\frac{1}{n}l^{''}\left(\theta_{0}\right)=-\frac{1}{n}\sum_{i=1}^{n}\left.\frac{\partial^{2}}{\partial\theta^{2}}\log\,f\left(\left.x_{i}\right|\theta\right)\right|_{\theta=\theta_{0}}\overset{p}{\rightarrow}-E_{\theta_{0}}\left[l_{1}^{''}\left(\theta_{0}\right)\right]=I\left(\theta_{0}\right)[/math]

By Slutsky’s theorem, since the numerator satisfies

[math]\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)\right)[/math]

and the denominator satisfies

[math]-\frac{1}{n}l^{''}\left(\theta_{0}\right)\overset{p}{\rightarrow}I\left(\theta_{0}\right),[/math]

the ratio converges in distribution: dividing the [math]N\left(0,I\left(\theta_{0}\right)\right)[/math] limit by the constant [math]I\left(\theta_{0}\right)[/math] scales the variance by [math]I\left(\theta_{0}\right)^{-2}[/math], such that

[math]\sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)^{-1}\right).[/math]
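As an illustration (not part of the derivation above), here is a minimal simulation sketch, assuming NumPy, for the Poisson model treated below, where [math]\widehat{\theta}_{ML}=\overline{x}[/math] and [math]I\left(\theta_{0}\right)^{-1}=\theta_{0}[/math]:

```python
import numpy as np

# Monte Carlo sketch: check that sqrt(n) * (theta_hat - theta0) is
# approximately N(0, I(theta0)^{-1}) for the Poisson model, where
# theta_hat = x_bar and I(theta0)^{-1} = theta0.
rng = np.random.default_rng(0)
theta0, n, reps = 6.0, 200, 20_000

x = rng.poisson(theta0, size=(reps, n))
theta_hat = x.mean(axis=1)             # ML estimator for each replication
z = np.sqrt(n) * (theta_hat - theta0)  # should be approximately N(0, theta0)

# Sample variance of z should be close to I(theta0)^{-1} = theta0 = 6
print(np.var(z), theta0)
```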


Some Implications

  • If one has access to [math]I\left(\theta_{0}\right)^{-1}[/math], or one can approximate it via some estimator, say [math]\widehat{I\left(\theta_{0}\right)}[/math] or [math]I\left(\widehat{\theta}_{ML}\right)[/math], one can use the normal distribution for hypothesis testing as long as [math]n[/math] is large.
  • The point above applies even if one does not know the exact finite-sample distribution of the test statistic.
  • It can be shown that the LRT, Wald, and LM tests are asymptotically equivalent. The proof, perhaps unsurprisingly, uses Taylor expansions.


Example: Hypothesis Test

Let [math]X_{1},\ldots,X_{n}[/math] be a random sample from the Poisson distribution

[math]f\left(\left.x\right|\theta\right)=\frac{\theta^{x}\exp\left(-\theta\right)}{x!},\,x\in\left\{ 0,1,...\right\}[/math]

and let us test [math]H_{0}:\theta=6[/math] vs. [math]H_{1}:\theta\neq6[/math] at the 10% level.

Let

[math]n=100[/math] and [math]\sum_{i=1}^{n}x_{i}=500[/math].

[math]\widehat{\theta}_{ML}[/math] calculation

[math]l\left(\theta\right)=\sum_{i=1}^{n}\log\left(\frac{\theta^{x_{i}}\exp\left(-\theta\right)}{x_{i}!}\right)=\sum_{i=1}^{n}\left(x_{i}\log\left(\theta\right)-\theta-\log\left(x_{i}!\right)\right)[/math]

Taking the first-order condition (foc):

[math]\begin{aligned} foc\left(\theta\right):\, & \frac{\sum x_{i}}{\theta}-n=0\\ \Leftrightarrow & \widehat{\theta}_{ML}=\frac{\sum x_{i}}{n}=\frac{500}{100}=5.\end{aligned}[/math]

We can also verify that we have found a maximum via the second-order condition (soc):

[math]soc\left(\theta\right):\,-\frac{\sum x_{i}}{\theta^{2}}\lt 0.[/math]

Information Matrix

The Fisher information for a single observation is:

[math]I\left(\theta\right)=-E_{\theta}\left(l_{1}^{''}\left(\theta\right)\right)=-E_{\theta}\left(-\frac{x_{i}}{\theta^{2}}\right)=\frac{\theta}{\theta^{2}}=\frac{1}{\theta},[/math]

using [math]E_{\theta}\left(x_{i}\right)=\theta[/math]. (We could have also used [math]E_{\theta}\left(l_{1}^{'}\left(\theta\right)^{2}\right)[/math], and would have obtained the same result.)

Finally, define the following two estimates of the single-observation Fisher information:

[math]\begin{aligned} I_{1} & =I\left(\widehat{\theta}_{ML}\right)=\frac{1}{5}.\\ I_{2} & =I\left(\theta_{0}\right)=\frac{1}{6}.\end{aligned}[/math]

Tests

The LR test is given by [math]2\left(l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right)\simeq17.6.[/math]

The Wald test is given by [math]\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{Var\left(\widehat{\theta}_{ML}\right)}\simeq\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{\left(n\cdot I_{2}\right)^{-1}}=100\cdot\frac{1}{6}\cdot\left(5-6\right)^{2}=\frac{100}{6}\simeq16.7.[/math] (Evaluating the information at [math]\widehat{\theta}_{ML}[/math] instead, i.e., using [math]I_{1}[/math], gives [math]100\cdot\frac{1}{5}\cdot\left(5-6\right)^{2}=20[/math]; both versions are asymptotically valid.)

The LM test is given by [math]\frac{l^{'}\left(\theta_{0}\right)^{2}}{nI\left(\theta_{0}\right)}=\frac{\left(\frac{\sum x_{i}}{\theta_{0}}-n\right)^{2}}{\frac{100}{6}}=\frac{6\left(\frac{500}{6}-100\right)^{2}}{100}\simeq16.67.[/math]

Using the fact that [math]-\frac{1}{n}l^{''}\left(\theta_{0}\right)\overset{p}{\rightarrow}I\left(\theta_{0}\right)[/math], we could have instead used [math]-l^{''}\left(\theta_{0}\right)[/math] as an approximation of [math]nI\left(\theta_{0}\right)[/math] in the denominator:

[math]\frac{l^{'}\left(\theta_{0}\right)^{2}}{-l^{''}\left(\theta_{0}\right)}=\frac{\left(\frac{\sum x_{i}}{\theta_{0}}-n\right)^{2}}{\frac{\sum x_{i}}{\theta_{0}^{2}}}=\frac{\left(\frac{500}{6}-100\right)^{2}}{\frac{500}{36}}=20.[/math]

In all cases, the test statistics exceed the [math]\chi_{\left(1\right)}^{2}[/math] critical value of [math]2.71[/math] associated with a Type I error rate of 10%, so we reject [math]H_{0}[/math] at the 10% level.
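These numbers are straightforward to verify numerically. The following sketch (assuming NumPy and SciPy; the variable names are my own) reproduces all four statistics and the critical value:

```python
import numpy as np
from scipy import stats

# Numerical check of the Poisson example: n = 100, sum(x_i) = 500, H0: theta = 6.
n, sum_x, theta0 = 100, 500, 6.0
theta_hat = sum_x / n                            # ML estimate = 5

def loglik(theta):
    # Poisson log-likelihood up to the additive constant -sum(log(x_i!))
    return sum_x * np.log(theta) - n * theta

lr = 2 * (loglik(theta_hat) - loglik(theta0))    # LR:   ~ 17.7
wald = n * (theta_hat - theta0) ** 2 / theta0    # Wald with I_2 = 1/theta0: ~ 16.7
score = sum_x / theta0 - n                       # l'(theta0)
lm = score ** 2 / (n / theta0)                   # LM:   ~ 16.7
lm_alt = score ** 2 / (sum_x / theta0 ** 2)      # LM with -l''(theta0): = 20

crit = stats.chi2.ppf(0.90, df=1)                # critical value ~ 2.71
print(lr, wald, lm, lm_alt, crit)
```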


Example: Exponential Distribution

Let [math]X_{1},\ldots,X_{n}[/math] be a random sample with density [math]f\left(\left.x\right|\lambda\right)=\lambda\exp\left(-\lambda x\right),\,x\gt 0.[/math]

The maximum likelihood estimate is [math]\widehat{\lambda}_{ML}=\frac{1}{\overline{x}}[/math], and the Fisher information is [math]I\left(\lambda\right)=\frac{1}{\lambda^{2}}[/math]. In large samples,

[math]\sqrt{n}\left(\widehat{\lambda}_{ML}-\lambda\right)\overset{d}{\rightarrow}N\left(0,\lambda^{2}\right)[/math]

A confidence interval with asymptotic level 0.95 exploits this result, estimating the asymptotic standard deviation [math]\lambda[/math] by [math]\widehat{\lambda}_{ML}[/math]: [math]CI=\left(\widehat{\lambda}_{ML}-1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}},\widehat{\lambda}_{ML}+1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}}\right).[/math]
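For instance, with hypothetical values [math]n=100[/math] and [math]\overline{x}=2[/math] (numbers not from the lecture), [math]\widehat{\lambda}_{ML}=0.5[/math] and the interval is [math]0.5\pm1.96\cdot\frac{0.5}{10}=\left(0.402,0.598\right)[/math].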

This confidence interval can be obtained by test inversion.

Consider the test problem [math]H_{0}:\lambda=\lambda_{0}[/math] vs. [math]H_{1}:\lambda\neq\lambda_{0}[/math].

The Wald test statistic is given by

[math]T_{W}=n\left(\widehat{\lambda}_{ML}-\lambda_{0}\right)^{2}\underset{=1/\widehat{\lambda}_{ML}^{2}}{\underbrace{I\left(\widehat{\lambda}_{ML}\right)}}\sim\chi_{\left(1\right)}^{2}[/math]

And we reject the null hypothesis if [math]T_{W}\gt 1.96^{2}[/math] in order to obtain a test with [math]\alpha=0.05[/math]. This leads to the 95% confidence interval above.
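Explicitly, [math]\lambda_{0}[/math] is not rejected exactly when it lies in that interval:

[math]n\left(\widehat{\lambda}_{ML}-\lambda_{0}\right)^{2}\frac{1}{\widehat{\lambda}_{ML}^{2}}\leq1.96^{2}\Leftrightarrow\left|\widehat{\lambda}_{ML}-\lambda_{0}\right|\leq1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}}\Leftrightarrow\lambda_{0}\in\left(\widehat{\lambda}_{ML}-1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}},\widehat{\lambda}_{ML}+1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}}\right)[/math]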


Example: Multiple Parameters

Let [math]X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)[/math] where [math]\mu[/math] and [math]\sigma^{2}[/math] are unknown.

Suppose we have [math]n=100[/math] observations s.t. [math]\overline{x}=1[/math] and [math]\frac{\Sigma x_{i}^{2}}{n}=6[/math], and face the testing problem [math]H_{0}:\,\sigma^{2}=4[/math] vs. [math]H_{1}:\,\sigma^{2}\neq4[/math] at the 0.1 level.

Log-Likelihood

[math]l\left(\theta\right)=\sum_{i=1}^{n}\left(-\frac{1}{2}\log\left(\sigma^{2}\right)-\frac{x_{i}^{2}-2\mu x_{i}+\mu^{2}}{2\sigma^{2}}\right)[/math]

where the additive constant [math]-\frac{n}{2}\log\left(2\pi\right)[/math] is omitted, as it affects neither the derivatives nor the test statistics.

First-Order Conditions:

[math]\begin{aligned} \frac{\partial l\left(\theta\right)}{\partial\mu}= & \sum_{i=1}^{n}\frac{x_{i}-\mu}{\sigma^{2}}\\ \frac{\partial l\left(\theta\right)}{\partial\sigma^{2}}= & \sum_{i=1}^{n}-\frac{1}{2\sigma^{2}}+\frac{x_{i}^{2}-2\mu x_{i}+\mu^{2}}{2\sigma^{4}}\end{aligned}[/math]

Solving the focs, we get

[math]\begin{aligned} \widehat{\mu}_{ML}= & \overline{x}=1.\\ \widehat{\sigma^{2}}_{ML}= & \frac{\sum x_{i}^{2}}{n}-\left(\frac{\sum x_{i}}{n}\right)^{2}=5.\end{aligned}[/math]

Information Matrix

The information matrix for a single observation equals

[math]I_{1}=-E\left[\begin{array}{cc} \frac{\partial^{2}}{\partial\mu^{2}}l_{1}\left(\theta\right) & \frac{\partial^{2}}{\partial\mu\partial\sigma^{2}}l_{1}\left(\theta\right)\\ \frac{\partial^{2}}{\partial\sigma^{2}\partial\mu}l_{1}\left(\theta\right) & \frac{\partial^{2}}{\partial\left(\sigma^{2}\right)^{2}}l_{1}\left(\theta\right) \end{array}\right]=-E\left[\begin{array}{cc} -\frac{1}{\sigma^{2}} & -\frac{x_{i}-\mu}{\sigma^{4}}\\ -\frac{x_{i}-\mu}{\sigma^{4}} & \frac{1}{2\sigma^{4}}-\frac{\left(x_{i}-\mu\right)^{2}}{\sigma^{6}} \end{array}\right].[/math]

Taking expectations (the expectation operator applies to each entry of the matrix, using [math]E\left(x_{i}-\mu\right)=0[/math] and [math]E\left(x_{i}-\mu\right)^{2}=\sigma^{2}[/math]) yields

[math]I_{1}=\left[\begin{array}{cc} \frac{1}{\sigma^{2}} & 0\\ 0 & \frac{1}{2\sigma^{4}} \end{array}\right][/math]

We now evaluate the information matrix at the null hypothesis value [math]\sigma_{0}^{2}=4[/math] as well as at the maximum likelihood estimate (in both cases with [math]\mu[/math] evaluated at [math]\widehat{\mu}_{ML}=1[/math]):

  • [math]I_{1}\left(\widehat{\mu}_{ML},\sigma_{0}^{2}\right)=I\left(1,4\right)=\left[\begin{array}{cc} \frac{1}{4} & 0\\ 0 & \frac{1}{32} \end{array}\right][/math]
  • [math]I_{1}\left(\widehat{\mu}_{ML},\widehat{\sigma^{2}}_{ML}\right)=I\left(1,5\right)=\left[\begin{array}{cc} \frac{1}{5} & 0\\ 0 & \frac{1}{50} \end{array}\right][/math]

Confidence Interval

Note that [math]Var\left(\begin{array}{c} \sqrt{n}\left(\widehat{\mu}-\mu\right)\\ \sqrt{n}\left(\widehat{\sigma^{2}}-\sigma^{2}\right) \end{array}\right)=I_{1}^{-1}=\left[\begin{array}{cc} \sigma^{2} & 0\\ 0 & 2\sigma^{4} \end{array}\right].[/math]

(Operation [math]I^{-1}[/math] does not mean taking the inverse of each element of matrix [math]I[/math]; it is the matrix inverse, which for a diagonal matrix happens to coincide with inverting the diagonal entries.)

Hence, [math]\widehat{Var}\left(\sqrt{n}\left(\widehat{\sigma^{2}}_{ML}-\sigma^{2}\right)\right)=\left.2\sigma^{4}\right|_{\sigma^{2}=5}=50[/math], and the resulting CI is

[math]CI:\,\left(\widehat{\sigma^{2}}_{ML}-1.96\sqrt{\frac{50}{n}},\widehat{\sigma^{2}}_{ML}+1.96\sqrt{\frac{50}{n}}\right)=\left(3.61,6.39\right).[/math]

Wald Test

For the Wald test,

[math]T_{W}=\frac{\left(\widehat{\sigma^{2}}_{ML}-\sigma_{0}^{2}\right)^{2}}{\left(n\cdot\left[I_{1}\right]_{22}\left(\widehat{\mu}_{ML},\widehat{\sigma^{2}}_{ML}\right)\right)^{-1}}=\frac{1^{2}}{\left(100\cdot\frac{1}{50}\right)^{-1}}=2.[/math]

The null hypothesis is not rejected, since the critical value is 2.71.

LM Test

For the LM test,

[math]\begin{aligned} T_{LM} & =\frac{1}{n}\left[l^{'}\left(\widetilde{\theta}_{0}\right)\right]^{'}I_{1}^{-1}\left(\widetilde{\theta}_{0}\right)\left[l^{'}\left(\widetilde{\theta}_{0}\right)\right]\\ & =\frac{1}{n}\left[\begin{array}{c} \sum_{i=1}^{n}\frac{x_{i}-\widetilde{\mu}}{\sigma_{0}^{2}}\\ \sum_{i=1}^{n}-\frac{1}{2\sigma_{0}^{2}}+\frac{\left(x_{i}-\widetilde{\mu}\right)^{2}}{2\sigma_{0}^{4}} \end{array}\right]^{'}\left[\begin{array}{cc} \sigma_{0}^{2} & 0\\ 0 & 2\sigma_{0}^{4} \end{array}\right]\left[\begin{array}{c} \sum_{i=1}^{n}\frac{x_{i}-\widetilde{\mu}}{\sigma_{0}^{2}}\\ \sum_{i=1}^{n}-\frac{1}{2\sigma_{0}^{2}}+\frac{\left(x_{i}-\widetilde{\mu}\right)^{2}}{2\sigma_{0}^{4}} \end{array}\right]\\ & =\frac{1}{100}\cdot32\cdot\left(3.125\right)^{2}=3.125,\end{aligned}[/math]

where [math]\widetilde{\theta}_{0}=\left(\widetilde{\mu},\sigma_{0}^{2}\right)=\left(1,4\right)[/math] and [math]\widetilde{\mu}=\overline{x}=1[/math] is the restricted ML estimate of [math]\mu[/math] (the null does not constrain [math]\mu[/math]). At [math]\widetilde{\mu}=\overline{x}[/math], the first component of the score is zero, and the second equals [math]-\frac{100}{8}+\frac{500}{32}=3.125[/math],

such that we reject [math]H_{0}[/math], since [math]3.125\gt 2.71[/math].

LR Test

For the LR test,

[math]\begin{aligned} l\left(\mu=1,\sigma^{2}=4\right) & =-131.81\\ l\left(\mu=1,\widehat{\sigma^{2}}=5\right) & =-130.47\end{aligned}[/math]

s.t.

[math]T_{LR}=2\left(l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right)=2.69.[/math]

and we do not reject the null hypothesis.

Clearly, in this example, the three tests disagree: the LM test rejects the null hypothesis while the Wald and LR tests do not. Evidently, [math]n[/math] is not large enough for the asymptotic equivalence of the tests to take hold.
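As a final numerical check, the following sketch (assuming NumPy; the variable names are my own) recomputes the confidence interval and all three statistics from the summary statistics [math]n=100[/math], [math]\overline{x}=1[/math], [math]\frac{\Sigma x_{i}^{2}}{n}=6[/math]:

```python
import numpy as np

# Numerical check of the normal example: n = 100, x_bar = 1,
# sum(x_i^2)/n = 6, H0: sigma^2 = 4, 10% level (chi2(1) critical value 2.71).
n, x_bar, m2 = 100, 1.0, 6.0
s2_hat = m2 - x_bar ** 2            # ML estimate of sigma^2 = 5
s2_0 = 4.0
ssr = n * (m2 - x_bar ** 2)         # sum (x_i - x_bar)^2 = 500

# 95% confidence interval for sigma^2: 5 +/- 1.96 * sqrt(50/n)
half = 1.96 * np.sqrt(2 * s2_hat ** 2 / n)
ci = (s2_hat - half, s2_hat + half)

def loglik(s2):
    # log-likelihood at mu = x_bar, up to the constant -(n/2) log(2*pi)
    return -0.5 * n * np.log(s2) - ssr / (2 * s2)

t_wald = (s2_hat - s2_0) ** 2 * n / (2 * s2_hat ** 2)   # Wald: = 2
score = -n / (2 * s2_0) + ssr / (2 * s2_0 ** 2)         # sigma^2-score = 3.125
t_lm = (2 * s2_0 ** 2 / n) * score ** 2                 # LM:   = 3.125
t_lr = 2 * (loglik(s2_hat) - loglik(s2_0))              # LR:   ~ 2.69

print(ci, t_wald, t_lm, t_lr)
```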