Full Lecture 15
Asymptotic Properties of ML Estimators
Up to now, we have derived the distributions of several test statistics. This has been a relatively tedious process, however.
For example, the hypothesis test for the [math]\mu[/math] parameter in the normal distribution depends on whether [math]\sigma^{2}[/math] is known or not. In general, the distribution of more complicated tests may be extremely challenging to find.
It turns out that, as long as [math]n[/math] is large, we can use a nice property of the maximum likelihood estimator.
Let [math]X_{1},\ldots,X_{n}[/math] be a random sample with pdf [math]f\left(\left.x\right|\theta_{0}\right)[/math]. Under a few regularity conditions,
[math]\sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)^{-1}\right)[/math]
where [math]I\left(\theta_{0}\right)[/math] is the Fisher information for a single observation.
Proof
First,
- Denote [math]l\left(\theta\right)=l\left(\left.\theta\right|x_{1}..x_{n}\right)=\sum_{i=1}^{n}\log\left(f\left(\left.x_{i}\right|\theta\right)\right)[/math].
- Let [math]I\left(\theta\right)=E_{\theta}\left[\left(l_{1}^{'}\left(\theta\right)\right)^{2}\right]=-E_{\theta}\left[l_{1}^{''}\left(\theta\right)\right]=Var_{\theta}\left(l_{1}^{'}\left(\theta\right)\right)[/math] where [math]l_{1}\left(\theta\right)=l_{1}\left(\left.\theta\right|x_{1}\right)[/math] is the log-likelihood for one observation.
We expand the first derivative of the log-likelihood function around [math]\theta_{0}[/math]:
[math]l^{'}\left(\theta\right)=l^{'}\left(\theta_{0}\right)+\left(\theta-\theta_{0}\right)l^{''}\left(\theta_{0}\right)+\frac{\left(\theta-\theta_{0}\right)^{2}}{2}l^{'''}\left(\theta^{*}\right),\,\theta^{*}\in\left(\theta,\theta_{0}\right)[/math]
Now, we evaluate the expansion at [math]\theta=\widehat{\theta}_{ML}[/math]:
[math]\begin{aligned} & \underset{=0}{\underbrace{l^{'}\left(\widehat{\theta}_{ML}\right)}}=l^{'}\left(\theta_{0}\right)+\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{''}\left(\theta_{0}\right)+\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{2}l^{'''}\left(\theta^{*}\right),\,\theta^{*}\in\left(\widehat{\theta}_{ML},\theta_{0}\right)\\ \Leftrightarrow & \widehat{\theta}_{ML}-\theta_{0}=\frac{-l^{'}\left(\theta_{0}\right)}{l^{''}\left(\theta_{0}\right)+\frac{1}{2}\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{'''}\left(\theta^{*}\right)}\\ \Leftrightarrow & \sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)=\frac{\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)}{-\frac{1}{n}l^{''}\left(\theta_{0}\right)-\frac{1}{2n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{'''}\left(\theta^{*}\right)}\end{aligned}[/math]
Under the assumption that [math]l^{'''}\left(\theta^{*}\right)[/math] is “well-behaved” around [math]\theta_{0}[/math], so that the remainder term can be ignored, notice that
[math]\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)\right)=\sqrt{n}\overline{W}\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)\right)[/math] where [math]W_{i}=\frac{\partial}{\partial\theta}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)[/math].
To prove this first result, it suffices, by the central limit theorem, to show that [math]E\left(W_{i}\right)=0[/math] and [math]Var\left(W_{i}\right)=I\left(\theta_{0}\right)[/math].
Notice that
[math]\begin{aligned} E\left(W_{i}\right) & =E\left(\left.\frac{\partial}{\partial\theta}\log\,f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}\right)\\ & =E\left(\frac{\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}}{f\left(\left.x\right|\theta_{0}\right)}\right)\\ & =\int_{-\infty}^{\infty}\frac{\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}}{f\left(\left.x\right|\theta_{0}\right)}f\left(\left.x\right|\theta_{0}\right)dx\\ & =\int_{-\infty}^{\infty}\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}dx\\ & =0\end{aligned}[/math]
The last identity follows from the fact that [math]\int_{-\infty}^{\infty}f\left(\left.x\right|\theta\right)dx=1[/math], i.e., the value of the integral w.r.t. [math]x[/math] does not change with [math]\theta[/math], such that
[math]\begin{aligned} & \frac{d}{d\theta}\int_{-\infty}^{\infty}f\left(\left.x\right|\theta\right)dx=\frac{d}{d\theta}1\\ \Leftrightarrow & \int_{-\infty}^{\infty}\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)dx=0\end{aligned}[/math]
And the variance expression is exactly the one used to define the Fisher information (Lecture 10, c) Cramer-Rao Lower Bound), so [math]Var\left(W_{i}\right)=I\left(\theta_{0}\right)[/math].
As for the denominator, notice that by the law of large numbers, [math]-\frac{1}{n}l^{''}\left(\theta_{0}\right)=-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}}{\partial\theta^{2}}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)\overset{p}{\rightarrow}I\left(\theta_{0}\right)[/math] since [math]I\left(\theta_{0}\right)=-E_{\theta_{0}}\left[l_{1}^{''}\left(\theta_{0}\right)\right][/math].
By Slutsky’s theorem, we obtain that the ratio
[math]\frac{\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)\right)}{-\frac{1}{n}l^{''}\left(\theta_{0}\right)\overset{p}{\rightarrow}I\left(\theta_{0}\right)}[/math]
converges such that
[math]\sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)^{-1}\right).[/math]
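This result is easy to check by simulation. Below is a minimal Monte Carlo sketch (assuming Python with numpy; the Poisson model anticipates the example below, and the sample sizes and seed are arbitrary choices). For the Poisson model, [math]\widehat{\theta}_{ML}=\overline{x}[/math] and [math]I\left(\theta\right)=\frac{1}{\theta}[/math], so the theorem predicts [math]\sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)\approx N\left(0,\theta_{0}\right)[/math].

```python
import numpy as np

# Monte Carlo check of sqrt(n) * (theta_hat - theta_0) ->d N(0, I(theta_0)^{-1}).
# For the Poisson model, theta_hat_ML = x_bar and I(theta) = 1/theta,
# so the limiting variance is I(theta_0)^{-1} = theta_0.
rng = np.random.default_rng(0)
theta0, n, reps = 6.0, 200, 10_000

samples = rng.poisson(theta0, size=(reps, n))
theta_hat = samples.mean(axis=1)          # ML estimate in each replication
z = np.sqrt(n) * (theta_hat - theta0)

print(z.mean())   # close to 0
print(z.var())    # close to I(theta_0)^{-1} = theta_0 = 6
```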
Some Implications
- If one has access to [math]I\left(\theta_{0}\right)^{-1}[/math], or one can approximate it via some estimator, say [math]\widehat{I\left(\theta_{0}\right)}[/math] and/or [math]I\left(\widehat{\theta}_{ML}\right)[/math], one can use the normal distribution for hypothesis testing as long as [math]n[/math] is large.
- The point above applies even if one does not know the exact finite-sample distribution of the test statistic.
- It can be shown that the LRT, Wald, and LM tests are asymptotically equivalent. The proof, perhaps unsurprisingly, uses Taylor expansion.
Example: Hypothesis Test
Let [math]X_{1},\ldots,X_{n}[/math] be a random sample with Poisson distribution
[math]f\left(\left.x\right|\theta\right)=\frac{\theta^{x}\exp\left(-\theta\right)}{x!},\,x\in\left\{ 0,1,...\right\}[/math]
and let us test [math]H_{0}:\theta=6[/math] vs. [math]H_{1}:\theta\neq6[/math] at the 10% level.
Let
[math]n=100[/math] and [math]\sum_{i=1}^{n}x_{i}=500[/math].
[math]\widehat{\theta}_{ML}[/math] calculation
[math]l\left(\theta\right)=\sum_{i=1}^{n}\log\left(\frac{\theta^{x_{i}}\exp\left(-\theta\right)}{x_{i}!}\right)=\sum_{i=1}^{n}\left(x_{i}\log\left(\theta\right)-\theta-\log\left(x_{i}!\right)\right)[/math]
Taking the first-order condition (foc):
[math]\begin{aligned} foc\left(\theta\right):\, & \frac{\sum x_{i}}{\theta}-n=0\\ \Leftrightarrow & \widehat{\theta}_{ML}=\frac{\sum x_{i}}{n}=\frac{500}{100}=5.\end{aligned}[/math]
We can also verify that we have found a maximum:
[math]soc\left(\theta\right):\,-\frac{\sum x_{i}}{\theta^{2}}\lt 0.[/math]
Information Matrix
The single observation information matrix is:
[math]I\left(\theta\right)=-E_{\theta}\left(l_{1}^{''}\left(\theta\right)\right)=-E_{\theta}\left(-\frac{X_{i}}{\theta^{2}}\right)=\frac{\theta}{\theta^{2}}=\frac{1}{\theta}.[/math]
(We could have also used [math]E_{\theta}\left(l_{1}^{'}\left(\theta\right)^{2}\right)[/math], and would have obtained the same result.)
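The same derivative-and-expectation computation can be reproduced symbolically. A small sketch (assuming sympy; it relies on [math]l_{1}^{''}[/math] being linear in [math]x[/math], so that taking the expectation amounts to substituting [math]E\left(X\right)=\theta[/math]):

```python
import sympy as sp

x, theta = sp.symbols("x theta", positive=True)

# Log-pmf of a single Poisson observation.
log_f = x * sp.log(theta) - theta - sp.log(sp.factorial(x))

hess = sp.diff(log_f, theta, 2)   # l_1''(theta) = -x / theta**2
# l_1'' is linear in x, so E[l_1''(X)] = l_1''(E[X]) with E[X] = theta.
info = -hess.subs(x, theta)

print(sp.simplify(info))          # 1/theta
```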
Finally, define the following two estimators for the information matrix (for a single observation):
[math]\begin{aligned} I_{1} & =I\left(\widehat{\theta}_{ML}\right)=\frac{1}{5}.\\ I_{2} & =I\left(\theta_{0}\right)=\frac{1}{6}.\end{aligned}[/math]
Tests
The LR test is given by [math]2\left(l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right)\simeq17.68.[/math]
The Wald test is given by [math]\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{Var\left(\widehat{\theta}_{ML}\right)}\simeq\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{\left(nI_{1}\right)^{-1}}=\frac{100\left(6-5\right)^{2}}{5}=20.[/math]
The LM test is given by [math]\frac{l^{'}\left(\theta_{0}\right)^{2}}{nI\left(\theta_{0}\right)}=\frac{\left(\frac{\sum x_{i}}{\theta_{0}}-n\right)^{2}}{\frac{100}{6}}=\frac{6\left(\frac{500}{6}-100\right)^{2}}{100}\simeq16.67.[/math]
Using the fact that [math]-\frac{1}{n}l^{''}\left(\theta_{0}\right)\overset{p}{\rightarrow}I\left(\theta_{0}\right)[/math], we could have instead used [math]-l^{''}\left(\theta_{0}\right)[/math] as an approximation of the denominator [math]nI\left(\theta_{0}\right)[/math]:
[math]\frac{l^{'}\left(\theta_{0}\right)^{2}}{-l^{''}\left(\theta_{0}\right)}=\frac{\left(\frac{\sum x_{i}}{\theta_{0}}-n\right)^{2}}{\frac{\sum x_{i}}{\theta_{0}^{2}}}=\frac{\left(\frac{500}{6}-100\right)^{2}}{\frac{500}{36}}=20.[/math]
In all cases, the test statistics exceed the [math]\chi_{\left(1\right)}^{2}[/math] critical value of [math]2.71[/math] associated with a type I error rate of 10%, so the null hypothesis is rejected.
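All three statistics can be reproduced directly from the sufficient statistics. A short sketch (assuming numpy and scipy; the [math]\log\left(x_{i}!\right)[/math] terms do not depend on [math]\theta[/math] and cancel in the likelihood ratio, so they are dropped):

```python
import numpy as np
from scipy.stats import chi2

n, S, theta0 = 100, 500.0, 6.0    # n and sum(x_i) from the example
theta_hat = S / n                 # ML estimate, = 5

def loglik(theta):
    # Poisson log-likelihood up to the additive constant -sum(log(x_i!)).
    return S * np.log(theta) - n * theta

lr = 2 * (loglik(theta_hat) - loglik(theta0))            # ~ 17.68
wald = n * (1 / theta_hat) * (theta_hat - theta0) ** 2   # n * I_1 * (...)^2 = 20
score = S / theta0 - n                                   # l'(theta_0)
lm = score ** 2 / (n / theta0)                           # n * I(theta_0) = 100/6; ~ 16.67

print(lr, wald, lm)
print(chi2.ppf(0.90, df=1))                              # critical value, ~ 2.71
```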
Example: Exponential Distribution
Let [math]X_{1},\ldots,X_{n}[/math] be a random sample with density [math]f\left(\left.x\right|\lambda\right)=\lambda\exp\left(-\lambda x\right).[/math]
The maximum likelihood estimate is [math]\widehat{\lambda}_{ML}=\frac{1}{\overline{x}}[/math], and [math]l_{1}^{''}\left(\lambda\right)=-\frac{1}{\lambda^{2}}[/math], so [math]I\left(\lambda\right)=\frac{1}{\lambda^{2}}[/math]. In large samples,
[math]\sqrt{n}\left(\widehat{\lambda}_{ML}-\lambda\right)\overset{d}{\rightarrow}N\left(0,\lambda^{2}\right)[/math]
A confidence interval with asymptotic level 0.95 exploits this result, plugging [math]\widehat{\lambda}_{ML}[/math] into the variance: [math]\begin{aligned} CI & =\left(\widehat{\lambda}_{ML}-1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}},\widehat{\lambda}_{ML}+1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}}\right)\end{aligned}.[/math]
This confidence interval can be obtained by test inversion.
Consider the test problem [math]H_{0}:\lambda=\lambda_{0}[/math] vs. [math]H_{1}:\lambda\neq\lambda_{0}[/math].
The Wald test statistic is given by
[math]T_{W}=n\left(\widehat{\lambda}_{ML}-\lambda_{0}\right)^{2}\underset{=1/\widehat{\lambda}_{ML}^{2}}{\underbrace{I\left(\widehat{\lambda}_{ML}\right)}}\sim\chi_{\left(1\right)}^{2}[/math]
And we reject the null hypothesis if [math]T_{W}\gt 1.96^{2}[/math] in order to obtain a test with [math]\alpha=0.05[/math]. Inverting this test (collecting all [math]\lambda_{0}[/math] that are not rejected) leads to the 95% confidence interval above.
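A quick numerical sketch of the interval and of the test-inversion logic (assuming numpy; the true [math]\lambda[/math], [math]n[/math], and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n = 2.0, 400
x = rng.exponential(scale=1 / lam, size=n)   # numpy parameterizes by scale = 1/lambda

lam_hat = 1 / x.mean()                       # ML estimate
half = 1.96 * lam_hat / np.sqrt(n)           # 1.96 * sqrt(I(lam_hat)^{-1} / n)
ci = (lam_hat - half, lam_hat + half)

# Test inversion: lambda_0 lies inside the CI iff the Wald test does not reject.
lam0 = 2.0
t_wald = n * (lam_hat - lam0) ** 2 / lam_hat ** 2
print(ci, t_wald <= 1.96 ** 2)
```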
Example: Multiple Parameters
Let [math]X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)[/math] where [math]\mu[/math] and [math]\sigma^{2}[/math] are unknown.
Suppose we have [math]n=100[/math] observations s.t. [math]\overline{x}=1[/math] and [math]\frac{\Sigma x_{i}^{2}}{n}=6[/math] and face the testing problem [math]H_{0}:\,\sigma^{2}=4[/math] vs. [math]H_{1}:\,\sigma^{2}\neq4[/math] at the 0.1 level.
Log-Likelihood
[math]l\left(\theta\right)=\sum_{i=1}^{n}\left(-\frac{1}{2}\log\left(\sigma^{2}\right)-\frac{x_{i}^{2}-2\mu x_{i}+\mu^{2}}{2\sigma^{2}}\right)[/math] (dropping the constant [math]-\frac{n}{2}\log\left(2\pi\right)[/math])
First-Order Conditions:
[math]\begin{aligned} \frac{\partial l\left(\theta\right)}{\partial\mu}= & \sum_{i=1}^{n}\frac{x_{i}-\mu}{\sigma^{2}}\\ \frac{\partial l\left(\theta\right)}{\partial\sigma^{2}}= & \sum_{i=1}^{n}-\frac{1}{2\sigma^{2}}+\frac{x_{i}^{2}-2\mu x_{i}+\mu^{2}}{2\sigma^{4}}\end{aligned}[/math]
Solving the focs, we get
[math]\begin{aligned} \widehat{\mu}_{ML}= & \overline{x}=1.\\ \widehat{\sigma^{2}}_{ML}= & \frac{\sum x_{i}^{2}}{n}-\left(\frac{\sum x_{i}}{n}\right)^{2}=5.\end{aligned}[/math]
Information Matrix
The information matrix for a single observation equals
[math]I_{1}=-E\left[\begin{array}{cc} \frac{\partial^{2}}{\partial\mu^{2}}l_{1}\left(\theta\right) & \frac{\partial^{2}}{\partial\mu\partial\sigma^{2}}l_{1}\left(\theta\right)\\ \frac{\partial^{2}}{\partial\sigma^{2}\partial\mu}l_{1}\left(\theta\right) & \frac{\partial^{2}}{\partial\left(\sigma^{2}\right)^{2}}l_{1}\left(\theta\right) \end{array}\right]=-E\left[\begin{array}{cc} -\frac{1}{\sigma^{2}} & -\frac{x_{i}-\mu}{\sigma^{4}}\\ -\frac{x_{i}-\mu}{\sigma^{4}} & \frac{1}{2\sigma^{4}}-\frac{\left(x_{i}-\mu\right)^{2}}{\sigma^{6}} \end{array}\right].[/math]
Taking expectations (the expectation operator applies to each member of the matrix) yields
[math]I_{1}=\left[\begin{array}{cc} \frac{1}{\sigma^{2}} & 0\\ 0 & \frac{1}{2\sigma^{4}} \end{array}\right][/math]
We now calculate the information matrix at the null hypothesis as well as at the maximum likelihood estimate:
- [math]I_{1}\left(\widehat{\mu}_{ML},\sigma_{0}^{2}\right)=I_{1}\left(1,4\right)=\left[\begin{array}{cc} \frac{1}{4} & 0\\ 0 & \frac{1}{32} \end{array}\right][/math]
- [math]I_{1}\left(\widehat{\mu}_{ML},\widehat{\sigma^{2}}_{ML}\right)=I_{1}\left(1,5\right)=\left[\begin{array}{cc} \frac{1}{5} & 0\\ 0 & \frac{1}{50} \end{array}\right][/math]
Confidence Interval
Note that, asymptotically, [math]Var\left(\begin{array}{c} \sqrt{n}\left(\widehat{\mu}-\mu\right)\\ \sqrt{n}\left(\widehat{\sigma^{2}}-\sigma^{2}\right) \end{array}\right)=I_{1}^{-1}=\left[\begin{array}{cc} \sigma^{2} & 0\\ 0 & 2\sigma^{4} \end{array}\right].[/math]
(Operation [math]I^{-1}[/math] does not mean taking the inverse of the elements of matrix [math]I[/math]; it’s the matrix inverse operation, which coincides with the element-wise inverse here only because [math]I_{1}[/math] is diagonal.)
Hence, [math]\widehat{Var}\left(\sqrt{n}\left(\widehat{\sigma^{2}}_{ML}-\sigma^{2}\right)\right)=\left.2\sigma^{4}\right|_{\sigma^{2}=5}=50[/math], and the resulting CI is
[math]CI:\,\left(\widehat{\sigma^{2}}_{ML}-1.96\sqrt{\frac{50}{n}},\widehat{\sigma^{2}}_{ML}+1.96\sqrt{\frac{50}{n}}\right)=\left(3.61,6.39\right).[/math]
Wald Test
For the Wald test,
[math]T_{W}=\frac{\left(\widehat{\sigma^{2}}_{ML}-\sigma_{0}^{2}\right)^{2}}{\left[I_{1}^{-1}\left(\widehat{\mu}_{ML},\widehat{\sigma^{2}}_{ML}\right)\right]_{22}/n}=\frac{1^{2}}{\left(\frac{50}{100}\right)}=2.[/math]
The null hypothesis is not rejected, since the critical value is 2.71.
LM Test
For the LM test, we evaluate the score at the restricted estimate [math]\theta_{0}=\left(\mu_{0},\sigma_{0}^{2}\right)=\left(\widehat{\mu}_{ML},4\right)=\left(1,4\right)[/math]:
[math]\begin{aligned} T_{LM} & =\frac{1}{n}\left[l^{'}\left(\theta_{0}\right)\right]^{'}I_{1}^{-1}\left(\theta_{0}\right)\left[l^{'}\left(\theta_{0}\right)\right]\\ & =\frac{1}{n}\left[\begin{array}{c} \sum_{i=1}^{n}\frac{x_{i}-\mu_{0}}{\sigma_{0}^{2}}\\ \sum_{i=1}^{n}-\frac{1}{2\sigma_{0}^{2}}+\frac{\left(x_{i}-\mu_{0}\right)^{2}}{2\sigma_{0}^{4}} \end{array}\right]^{'}\left[\begin{array}{cc} \sigma_{0}^{2} & 0\\ 0 & 2\sigma_{0}^{4} \end{array}\right]\left[\begin{array}{c} \sum_{i=1}^{n}\frac{x_{i}-\mu_{0}}{\sigma_{0}^{2}}\\ \sum_{i=1}^{n}-\frac{1}{2\sigma_{0}^{2}}+\frac{\left(x_{i}-\mu_{0}\right)^{2}}{2\sigma_{0}^{4}} \end{array}\right]\\ & =3.125.\end{aligned}[/math]
Since 3.125 exceeds the critical value of 2.71, we reject [math]H_{0}[/math].
LR Test
For the LR test,
[math]\begin{aligned} l\left(\mu=1,\sigma^{2}=4\right) & =-131.81\\ l\left(\mu=1,\widehat{\sigma^{2}}=5\right) & =-130.47\end{aligned}[/math]
s.t.
[math]T_{LR}=2\left(l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right)=2.69.[/math]
and we do not reject the null hypothesis.
Clearly, in this example, [math]n[/math] is not large enough for the three asymptotically equivalent tests to agree: the LM test rejects the null hypothesis while the Wald and LR tests do not.
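As with the Poisson example, the three statistics can be verified from the sufficient statistics alone. A sketch (assuming numpy), using [math]\overline{x}=1[/math], [math]\frac{\Sigma x_{i}^{2}}{n}=6[/math], and [math]n=100[/math]:

```python
import numpy as np

n, xbar, m2 = 100, 1.0, 6.0    # sample size, x_bar, sum(x_i^2)/n
s2_0 = 4.0                     # null value of sigma^2

mu_hat = xbar                  # = 1
s2_hat = m2 - xbar ** 2        # = 5
ssr = n * (m2 - xbar ** 2)     # sum of (x_i - mu_hat)^2 = 500

def loglik(s2):
    # Log-likelihood at mu = mu_hat, dropping the constant -n/2 * log(2*pi).
    return -n / 2 * np.log(s2) - ssr / (2 * s2)

wald = (s2_hat - s2_0) ** 2 / (2 * s2_hat ** 2 / n)   # = 2
# Score for sigma^2 at (1, 4); the mu-component of the score is 0 at mu_hat,
# so only the sigma^2 component contributes to the quadratic form.
score = -n / (2 * s2_0) + ssr / (2 * s2_0 ** 2)
lm = score ** 2 * (2 * s2_0 ** 2) / n                 # = 3.125
lr = 2 * (loglik(s2_hat) - loglik(s2_0))              # ~ 2.69

print(wald, lm, lr)   # only the LM statistic exceeds the 2.71 critical value
```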