# Asymptotic Properties of ML Estimators

Up to now, we have derived the distributions of several test statistics. This has been a relatively tedious process, however.

For example, the hypothesis test for the $\mu$ parameter in the normal distribution depends on whether $\sigma^{2}$ is known or not. In general, the distribution of more complicated tests may be extremely challenging to find.

It turns out that, as long as $n$ is large, we can rely on the asymptotic normality of the maximum likelihood estimator.

Let $X_{1},\dots,X_{n}$ be a random sample with pdf $f\left(\left.x\right|\theta_{0}\right)$. Under a few regularity conditions,

$\sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)^{-1}\right)$

where $I\left(\theta_{0}\right)$ is the Fisher information.

## Proof

First,

• Denote $l\left(\theta\right)=l\left(\left.\theta\right|x_{1},\dots,x_{n}\right)=\sum_{i=1}^{n}\log\left(f\left(\left.x_{i}\right|\theta\right)\right)$.
• Let $I\left(\theta\right)=E_{\theta}\left[\left(l_{1}^{'}\left(\theta\right)\right)^{2}\right]=-E_{\theta}\left[l_{1}^{''}\left(\theta\right)\right]=Var_{\theta}\left(l_{1}^{'}\left(\theta\right)\right)$ where $l_{1}\left(\theta\right)=l_{1}\left(\left.\theta\right|x_{1}\right)$ is the log-likelihood for one observation.

We expand the first derivative of the log-likelihood function around $\theta_{0}$:

$l^{'}\left(\theta\right)=l^{'}\left(\theta_{0}\right)+\left(\theta-\theta_{0}\right)l^{''}\left(\theta_{0}\right)+\frac{\left(\theta-\theta_{0}\right)^{2}}{2}l^{'''}\left(\theta^{*}\right),\,\theta^{*}\in\left(\theta,\theta_{0}\right)$

Now, we evaluate the expansion at $\theta=\widehat{\theta}_{ML}$:

\begin{aligned} & \underset{=0}{\underbrace{l^{'}\left(\widehat{\theta}_{ML}\right)}}=l^{'}\left(\theta_{0}\right)+\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{''}\left(\theta_{0}\right)+\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{2}l^{'''}\left(\theta^{*}\right),\,\theta^{*}\in\left(\widehat{\theta}_{ML},\theta_{0}\right)\\ \Leftrightarrow & \widehat{\theta}_{ML}-\theta_{0}=\frac{-l^{'}\left(\theta_{0}\right)}{l^{''}\left(\theta_{0}\right)+\frac{1}{2}\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{'''}\left(\theta^{*}\right)}\\ \Leftrightarrow & \sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)=\frac{\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)}{-\frac{1}{n}l^{''}\left(\theta_{0}\right)-\frac{1}{2n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)l^{'''}\left(\theta^{*}\right)}\end{aligned}

Under the assumption that $l^{'''}\left(\theta^{*}\right)$ is bounded in a neighborhood of $\theta_{0}$ (so that the remainder term, which carries a factor of $\frac{1}{n}$ and a factor of $\widehat{\theta}_{ML}-\theta_{0}\overset{p}{\rightarrow}0$, can be ignored asymptotically), notice that

$\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)\right)=\sqrt{n}\overline{W}\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)\right)$ where $W_{i}=\frac{\partial}{\partial\theta}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)$.

To prove this first result, it suffices to show that $E\left(W_{i}\right)=0$ and $Var\left(W_{i}\right)=I\left(\theta_{0}\right)$; the convergence then follows from the central limit theorem.

Notice that

\begin{aligned} E\left(W_{i}\right) & =E\left(\left.\frac{\partial}{\partial\theta}\log\,f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}\right)\\ & =E\left(\frac{\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}}{f\left(\left.x\right|\theta_{0}\right)}\right)\\ & =\int_{-\infty}^{\infty}\frac{\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}}{f\left(\left.x\right|\theta_{0}\right)}f\left(\left.x\right|\theta_{0}\right)dx\\ & =\int_{-\infty}^{\infty}\left.\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)\right|_{\theta=\theta_{0}}dx\\ & =0\end{aligned}

The last identity follows from the fact that $\int_{-\infty}^{\infty}f\left(\left.x\right|\theta\right)dx=1$, i.e., the value of the integral w.r.t. $x$ does not change with $\theta$, such that

\begin{aligned} & \frac{d}{d\theta}\int_{-\infty}^{\infty}f\left(\left.x\right|\theta\right)dx=\frac{d}{d\theta}1\\ \Leftrightarrow & \int_{-\infty}^{\infty}\frac{\partial}{\partial\theta}f\left(\left.x\right|\theta\right)dx=0\end{aligned}

The variance expression is exactly the one used to define the Fisher information (Lecture 10c, Cramér–Rao Lower Bound): since $E\left(W_{i}\right)=0$, $Var\left(W_{i}\right)=E\left(W_{i}^{2}\right)=I\left(\theta_{0}\right)$.

As for the denominator, notice that by the law of large numbers, $-\frac{1}{n}l^{''}\left(\theta_{0}\right)=-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}}{\partial\theta^{2}}\log\,f\left(\left.x_{i}\right|\theta_{0}\right)\overset{p}{\rightarrow}-E_{\theta_{0}}\left[l_{1}^{''}\left(\theta_{0}\right)\right]=I\left(\theta_{0}\right)$

By Slutsky’s theorem, since the numerator satisfies

$\frac{1}{\sqrt{n}}l^{'}\left(\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,I\left(\theta_{0}\right)\right)$

and the denominator satisfies

$-\frac{1}{n}l^{''}\left(\theta_{0}\right)\overset{p}{\rightarrow}I\left(\theta_{0}\right),$

the ratio converges such that

$\sqrt{n}\left(\widehat{\theta}_{ML}-\theta_{0}\right)\overset{d}{\rightarrow}N\left(0,\frac{I\left(\theta_{0}\right)}{I\left(\theta_{0}\right)^{2}}\right)=N\left(0,I\left(\theta_{0}\right)^{-1}\right).$
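This asymptotic normality is easy to check by simulation. The sketch below (all parameter values are illustrative) draws repeated samples from an exponential distribution, for which the single-observation information is $I\left(\lambda\right)=\frac{1}{\lambda^{2}}$, and verifies that the standard deviation of $\sqrt{n}\left(\widehat{\lambda}_{ML}-\lambda_{0}\right)$ is close to $I\left(\lambda_{0}\right)^{-1/2}=\lambda_{0}$:

```python
import math
import random

random.seed(42)

lam0 = 2.0   # true rate parameter (illustrative)
n = 500      # sample size
reps = 2000  # number of Monte Carlo replications

# For Exp(lambda), the MLE is 1 / (sample mean) and I(lambda) = 1 / lambda^2,
# so sqrt(n) * (mle - lam0) should be approximately N(0, lam0^2).
scaled = []
for _ in range(reps):
    sample = [random.expovariate(lam0) for _ in range(n)]
    mle = 1.0 / (sum(sample) / n)
    scaled.append(math.sqrt(n) * (mle - lam0))

mean = sum(scaled) / reps
sd = math.sqrt(sum((z - mean) ** 2 for z in scaled) / (reps - 1))
print(f"sample sd of sqrt(n)*(mle - lam0): {sd:.3f} (theory: {lam0:.3f})")
```

With these settings the simulated standard deviation should land close to the theoretical value of 2, illustrating the theorem without deriving the finite-sample distribution of $\widehat{\lambda}_{ML}$.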

# Some Implications

• If one has access to $I\left(\theta_{0}\right)^{-1}$, or can approximate it via a consistent estimator such as $I\left(\widehat{\theta}_{ML}\right)^{-1}$, one can use the normal distribution for hypothesis testing as long as $n$ is large.
• The point above applies, even if one does not know the exact distribution of the test.
• It can be shown that the LRT, Wald, and LM tests are asymptotically equivalent. The proof, perhaps unsurprisingly, uses Taylor expansion.

# Example: Hypothesis Test

Let $X_{1},\dots,X_{n}$ be a random sample from the Poisson distribution

$f\left(\left.x\right|\theta\right)=\frac{\theta^{x}\exp\left(-\theta\right)}{x!},\,x\in\left\{ 0,1,...\right\}$

and let us test $H_{0}:\theta=6$ vs. $H_{1}:\theta\neq6$ at the 10% level.

Let

$n=100$ and $\sum_{i=1}^{n}x_{i}=500$.

## $\widehat{\theta}_{ML}$ calculation

$l\left(\theta\right)=\sum_{i=1}^{n}\log\left(\frac{\theta^{x_{i}}\exp\left(-\theta\right)}{x_{i}!}\right)=\sum_{i=1}^{n}\left(x_{i}\log\left(\theta\right)-\theta-\log\left(x_{i}!\right)\right)$

Taking the foc:

\begin{aligned} foc\left(\theta\right):\, & \frac{\sum x_{i}}{\theta}-n=0\\ \Leftrightarrow & \widehat{\theta}_{ML}=\frac{\sum x_{i}}{n}=\frac{500}{100}=5.\end{aligned}

We can also verify that we have found a maximum:

$soc\left(\theta\right):\,-\frac{\sum x_{i}}{\theta^{2}}\lt 0.$

## Information Matrix

The single-observation information is:

$I\left(\theta\right)=-E_{\theta}\left(l_{1}^{''}\left(\theta\right)\right)=-E_{\theta}\left(-\frac{X_{i}}{\theta^{2}}\right)=\frac{\theta}{\theta^{2}}=\frac{1}{\theta}.$

(We could have also used $E_{\theta}\left(l^{'}\left(\theta\right)^{2}\right)$, and would have obtained the same result.)

Finally, define the following two estimators for the information matrix (for a single observation):

\begin{aligned} I_{1} & =I\left(\widehat{\theta}_{ML}\right)=\frac{1}{5}.\\ I_{2} & =I\left(\theta_{0}\right)=\frac{1}{6}.\end{aligned}

## Tests

The LR test is given by $2\left(l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right)\simeq17.68.$

The Wald test is given by $\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{Var\left(\widehat{\theta}_{ML}\right)}\simeq\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{\left(n\,I_{1}\right)^{-1}}=\frac{100\left(5-6\right)^{2}}{5}=20.$

The LM test is given by $\frac{l^{'}\left(\theta_{0}\right)^{2}}{nI\left(\theta_{0}\right)}=\frac{\left(\frac{\sum x_{i}}{\theta_{0}}-n\right)^{2}}{\frac{100}{6}}=\frac{6\left(\frac{500}{6}-100\right)^{2}}{100}\simeq16.67.$

Using the fact that $-\frac{1}{n}l^{''}\left(\theta_{0}\right)\overset{p}{\rightarrow}I\left(\theta_{0}\right)$, so that $-l^{''}\left(\theta_{0}\right)\approx nI\left(\theta_{0}\right)$, we could have instead used $-l^{''}\left(\theta_{0}\right)$ as an approximation of the denominator:

$\frac{l^{'}\left(\theta_{0}\right)^{2}}{-l^{''}\left(\theta_{0}\right)}=\frac{\left(\frac{\sum x_{i}}{\theta_{0}}-n\right)^{2}}{\frac{\sum x_{i}}{\theta_{0}^{2}}}=\frac{\left(\frac{500}{6}-100\right)^{2}}{\frac{500}{36}}=20.$

In all cases, the test statistics exceed the $\chi_{\left(1\right)}^{2}$ critical value of $2.706$ associated with a Type I error rate of 10%.
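The three statistics above can be reproduced from the summary statistics alone; a minimal sketch (the $\log\left(x_{i}!\right)$ terms are dropped from the log-likelihood, since they cancel in the LR statistic):

```python
import math

n = 100
s = 500            # sum of the observations
theta_hat = s / n  # MLE, = 5
theta0 = 6.0       # null value

# Log-likelihood up to the additive constant -sum(log(x_i!)),
# which cancels in the LR statistic.
def loglik(theta):
    return s * math.log(theta) - n * theta

lr = 2 * (loglik(theta_hat) - loglik(theta0))
wald = n * (theta_hat - theta0) ** 2 / theta_hat  # 1/I(theta_hat) = theta_hat
score = s / theta0 - n                            # l'(theta0)
lm = score ** 2 / (n / theta0)                    # n * I(theta0) = n/theta0

crit = 2.706  # chi2(1) critical value at the 10% level
print(f"LR={lr:.2f}, Wald={wald:.2f}, LM={lm:.2f}")
print("reject H0:", [t > crit for t in (lr, wald, lm)])
```

All three statistics land between 16 and 20, comfortably above the critical value, so the rejection decision does not depend on which test is used here.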

# Example: Exponential Distribution

Let $X_{1},\dots,X_{n}$ be a random sample with density $f\left(\left.x\right|\lambda\right)=\lambda\exp\left(-\lambda x\right),\,x\geq0.$

The maximum likelihood estimate is $\widehat{\lambda}_{ML}=\frac{1}{\overline{x}}$. In large samples,

$\sqrt{n}\left(\widehat{\lambda}_{ML}-\lambda\right)\overset{d}{\rightarrow}N\left(0,\lambda^{2}\right),$ since the single-observation information is $I\left(\lambda\right)=-E\left(l_{1}^{''}\left(\lambda\right)\right)=\frac{1}{\lambda^{2}}$.

A confidence interval with asymptotic level 0.95 exploits this result: \begin{aligned} CI & =\left(\widehat{\lambda}_{ML}-1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}},\widehat{\lambda}_{ML}+1.96\frac{\widehat{\lambda}_{ML}}{\sqrt{n}}\right)\end{aligned}.

This confidence interval can be obtained by test inversion.

Consider the test problem $H_{0}:\lambda=\lambda_{0}$ vs. $H_{1}:\lambda\neq\lambda_{0}$.

The Wald test statistic is given by

$T_{W}=n\left(\widehat{\lambda}_{ML}-\lambda_{0}\right)^{2}\underset{=1/\widehat{\lambda}_{ML}^{2}}{\underbrace{I\left(\widehat{\lambda}_{ML}\right)}}\sim\chi_{\left(1\right)}^{2}$

And we reject the null hypothesis if $T_{W}\gt 1.96^{2}$ in order to obtain a test with $\alpha=0.05$. This leads to the 95% confidence interval above.
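As a numerical illustration of the test inversion (the sample values here are hypothetical, not from the text):

```python
# Hypothetical data: n = 100 observations with sample mean 2.0 (illustrative values).
n = 100
xbar = 2.0
lam_hat = 1.0 / xbar  # MLE of the rate parameter

# The Wald test of H0: lambda = lam0 rejects when n*(lam_hat - lam0)^2 / lam_hat^2
# exceeds 1.96^2; inverting this inequality gives lam_hat +/- 1.96*lam_hat/sqrt(n).
half_width = 1.96 * lam_hat / n ** 0.5
ci = (lam_hat - half_width, lam_hat + half_width)
print(f"95% CI for lambda: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval contains exactly those $\lambda_{0}$ for which the Wald test fails to reject, which is what test inversion means.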

# Example: Multiple Parameters

Let $X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)$ where $\mu$ and $\sigma^{2}$ are unknown.

Suppose we have $n=100$ observations s.t. $\overline{x}=1$ and $\frac{\Sigma x_{i}^{2}}{n}=6$ and face the testing problem $H_{0}:\,\sigma^{2}=4$ vs. $H_{1}:\,\sigma^{2}\neq4$ at the 0.1 level.

## Log-Likelihood

$l\left(\theta\right)=\sum_{i=1}^{n}\left(-\frac{1}{2}\log\left(\sigma^{2}\right)-\frac{x_{i}^{2}-2\mu x_{i}+\mu^{2}}{2\sigma^{2}}\right)$ (dropping the additive constant $-\frac{n}{2}\log\left(2\pi\right)$, which does not affect the maximization).

First-Order Conditions:

\begin{aligned} \frac{\partial l\left(\theta\right)}{\partial\mu}= & \sum_{i=1}^{n}\frac{x_{i}-\mu}{\sigma^{2}}\\ \frac{\partial l\left(\theta\right)}{\partial\sigma^{2}}= & \sum_{i=1}^{n}-\frac{1}{2\sigma^{2}}+\frac{x_{i}^{2}-2\mu x_{i}+\mu^{2}}{2\sigma^{4}}\end{aligned}

Solving the focs, we get

\begin{aligned} \widehat{\mu}_{ML}= & \overline{x}=1.\\ \widehat{\sigma^{2}}_{ML}= & \frac{\sum x_{i}^{2}}{n}-\left(\frac{\sum x_{i}}{n}\right)^{2}=5.\end{aligned}

## Information Matrix

The information matrix for a single observation equals

$I_{1}=-E\left[\begin{array}{cc} \frac{\partial^{2}}{\partial\mu^{2}}l_{1}\left(\theta\right) & \frac{\partial^{2}}{\partial\mu\partial\sigma^{2}}l_{1}\left(\theta\right)\\ \frac{\partial^{2}}{\partial\sigma^{2}\partial\mu}l_{1}\left(\theta\right) & \frac{\partial^{2}}{\partial\left(\sigma^{2}\right)^{2}}l_{1}\left(\theta\right) \end{array}\right]=-E\left[\begin{array}{cc} -\frac{1}{\sigma^{2}} & -\frac{x_{i}-\mu}{\sigma^{4}}\\ -\frac{x_{i}-\mu}{\sigma^{4}} & \frac{1}{2\sigma^{4}}-\frac{\left(x_{i}-\mu\right)^{2}}{\sigma^{6}} \end{array}\right].$

Taking expectations (the expectation operator applies to each member of the matrix) yields

$I_{1}=\left[\begin{array}{cc} \frac{1}{\sigma^{2}} & 0\\ 0 & \frac{1}{2\sigma^{4}} \end{array}\right]$

We now calculate the information matrix at the null hypothesis as well as at the value of its maximum likelihood estimate:

• $I_{1}\left(\widehat{\mu}_{ML},\sigma_{0}^{2}\right)=I\left(1,4\right)=\left[\begin{array}{cc} \frac{1}{4} & 0\\ 0 & \frac{1}{32} \end{array}\right]$
• $I_{1}\left(\widehat{\mu}_{ML},\widehat{\sigma^{2}}_{ML}\right)=I\left(1,5\right)=\left[\begin{array}{cc} \frac{1}{5} & 0\\ 0 & \frac{1}{50} \end{array}\right]$

## Confidence Interval

Note that $Var\left(\begin{array}{c} \sqrt{n}\left(\widehat{\mu}-\mu\right)\\ \sqrt{n}\left(\widehat{\sigma^{2}}-\sigma^{2}\right) \end{array}\right)=I_{1}^{-1}=\left[\begin{array}{cc} \sigma^{2} & 0\\ 0 & 2\sigma^{4} \end{array}\right].$

(Operation $I^{-1}$ does not mean taking the inverse of the elements of matrix $I$; it’s the matrix inverse operation, which coincides with the matrix of the inverses in this case.)

Hence, $\widehat{Var}\left(\sqrt{n}\left(\widehat{\sigma^{2}}_{ML}-\sigma^{2}\right)\right)=\left.2\sigma^{4}\right|_{\sigma^{2}=5}=50$, and the resulting CI is

$CI:\,\left(\widehat{\sigma^{2}}_{ML}-1.96\sqrt{\frac{50}{n}},\widehat{\sigma^{2}}_{ML}+1.96\sqrt{\frac{50}{n}}\right)=\left(3.61,\,6.39\right).$
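Numerically, with $n=100$:

```python
n = 100
sigma2_hat = 5.0                # MLE of sigma^2
avar_hat = 2 * sigma2_hat ** 2  # estimated asymptotic variance 2*sigma^4 = 50

half_width = 1.96 * (avar_hat / n) ** 0.5
ci = (sigma2_hat - half_width, sigma2_hat + half_width)
print(f"95% CI for sigma^2: ({ci[0]:.2f}, {ci[1]:.2f})")
```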

## Wald Test

For the Wald test,

$T_{W}=\frac{\left(\widehat{\sigma^{2}}_{ML}-\sigma_{0}^{2}\right)^{2}}{\left[I_{1}^{-1}\left(\widehat{\theta}_{ML}\right)\right]_{22}/n}=\frac{1^{2}}{\frac{50}{100}}=2.$

The null hypothesis is not rejected, since the critical value is 2.71.

## LM Test

For the LM test, note that $\mu$ is unrestricted under $H_{0}$, so the score is evaluated at the restricted estimate $\left(\mu_{0},\sigma_{0}^{2}\right)=\left(\overline{x},4\right)=\left(1,4\right)$:

\begin{aligned} T_{LM} & =\frac{1}{n}\left[l^{'}\left(\theta_{0}\right)\right]^{'}I^{-1}\left(\theta_{0}\right)\left[l^{'}\left(\theta_{0}\right)\right]=\\ & =\frac{1}{n}\left[\begin{array}{c} \sum_{i=1}^{n}\frac{x_{i}-\mu_{0}}{\sigma_{0}^{2}}\\ \sum_{i=1}^{n}-\frac{1}{2\sigma_{0}^{2}}+\frac{\left(x_{i}-\mu_{0}\right)^{2}}{2\sigma_{0}^{4}} \end{array}\right]^{'}\left[\begin{array}{cc} \sigma_{0}^{2} & 0\\ 0 & 2\sigma_{0}^{4} \end{array}\right]\left[\begin{array}{c} \sum_{i=1}^{n}\frac{x_{i}-\mu_{0}}{\sigma_{0}^{2}}\\ \sum_{i=1}^{n}-\frac{1}{2\sigma_{0}^{2}}+\frac{\left(x_{i}-\mu_{0}\right)^{2}}{2\sigma_{0}^{4}} \end{array}\right]\\ & =3.125.\end{aligned}

such that we reject $H_{0}$.

## LR Test

For the LR test,

\begin{aligned} l\left(\mu=1,\sigma^{2}=4\right) & =-131.81\\ l\left(\mu=1,\widehat{\sigma^{2}}=5\right) & =-130.47\end{aligned}

s.t.

$T_{LR}=2\left(l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right)=2.69.$

and we do not reject the null hypothesis.
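All three statistics in this example can be reproduced from the summary statistics alone; a minimal sketch:

```python
import math

# Summary statistics from the example.
n = 100
xbar = 1.0
m2 = 6.0                     # (1/n) * sum(x_i^2)
sigma2_0 = 4.0               # null value
sigma2_hat = m2 - xbar ** 2  # MLE, = 5

# Sum of squared deviations around mu_hat = xbar.
ssd = n * (m2 - xbar ** 2)   # = 500

# Log-likelihood in sigma^2 at mu = xbar (constants dropped; they cancel in LR).
def loglik(sigma2):
    return -0.5 * n * math.log(sigma2) - ssd / (2 * sigma2)

t_lr = 2 * (loglik(sigma2_hat) - loglik(sigma2_0))

# Wald: asymptotic variance of sqrt(n)*(sigma2_hat - sigma2) is 2*sigma^4, at the MLE.
t_wald = (sigma2_hat - sigma2_0) ** 2 / (2 * sigma2_hat ** 2 / n)

# LM: score for sigma^2 at (xbar, sigma2_0); the mu-score is zero at mu = xbar.
score = -n / (2 * sigma2_0) + ssd / (2 * sigma2_0 ** 2)
t_lm = score ** 2 * (2 * sigma2_0 ** 2) / n

print(f"LR={t_lr:.2f}, Wald={t_wald:.2f}, LM={t_lm:.3f}")
```

The $\mu$-component of the score vanishes at $\mu=\overline{x}$, so only the $\sigma^{2}$-component contributes to the quadratic form in the LM statistic.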

Clearly, in this example, $n$ is not large enough for the three tests to agree: the LM test rejects $H_{0}$, while the Wald and LR tests do not.