Full Lecture 12


Statistical Tests

In the previous class we introduced hypothesis testing, and discussed how to set the critical value of a test. Here, we will discuss the selection of the test itself.

There are three main test statistics used in hypothesis testing. We will discuss their optimality later. The test statistics are

  • Likelihood Ratio
  • Lagrange Multiplier
  • Wald Test

All of these tests reject when [math]T\left(X_{1}..X_{n}\right)\gt c[/math], where [math]T[/math] is the test statistic and [math]c[/math] is the critical value.


Likelihood Ratio Test

Let [math]X_{1}..X_{n}[/math] be a random sample with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta\subseteq\mathbb{R}[/math], and consider the test

[math]H_{0}:\,\theta=\theta_{0}\,\text{vs. }H_{1}:\,\theta\neq\theta_{0}[/math]

Recall that the log-likelihood function is [math]l\left(\left.\theta\right|x_{1}..x_{n}\right)=\sum_{i=1}^{n}\text{log}\left(f\left(\left.x_{i}\right|\theta\right)\right)[/math]

The Likelihood Ratio test (LRT) statistic is

[math]T_{LR}\left(X_{1}..X_{n}\right)=2\left[l\left(\left.\widehat{\theta}_{ML}\right|X_{1}..X_{n}\right)-l\left(\left.\theta_{0}\right|X_{1}..X_{n}\right)\right][/math]

To calculate this statistic, we take twice the difference of the log-likelihoods evaluated at the ML estimator and at the value specified by the null hypothesis (assuming [math]H_{0}[/math] is a simple hypothesis).

The result is the test statistic that will be compared with the cutoff point (aka, the critical value).
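
To make the computation concrete, here is a minimal sketch in Python, assuming the [math]N\left(\theta,1\right)[/math] model used in the example later in this lecture; the simulated sample, the seed, and the use of numpy/scipy are illustrative assumptions, not part of the lecture.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # hypothetical N(theta, 1) sample

def loglik(theta, x):
    # log-likelihood of a N(theta, 1) sample evaluated at theta
    return np.sum(norm.logpdf(x, loc=theta, scale=1.0))

theta0 = 0.0             # value specified by the null hypothesis
theta_ml = x.mean()      # ML estimator of the mean when the variance is known
T_LR = 2.0 * (loglik(theta_ml, x) - loglik(theta0, x))
print(T_LR)              # compare this value with the critical value c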

A Few Important Notes

  • We have just produced a test statistic, which will yield a value whenever we apply it to data. However, we have not yet said anything about its distribution; we will discuss this later.
  • If [math]H_{0}[/math] is composite, then we calculate [math]T_{LR}\left(X_{1}..X_{n}\right)=2\left[l\left(\left.\widehat{\theta}_{ML}\right|X_{1}..X_{n}\right)-\text{sup}_{\theta_{0}\in\Theta_{0}}l\left(\left.\theta_{0}\right|X_{1}..X_{n}\right)\right][/math] instead.
  • The presence of the [math]2[/math] will be explained later. Clearly, it does not affect the test itself, since removing it simply requires scaling the critical value by one half.
  • The LRT is motivated by the Neyman-Pearson Lemma, which we will discuss later.
  • The intuition for the test statistic is as follows. Suppose [math]T_{LR}[/math] is very large. This means that the log-likelihood of the sample at its most likely value, [math]\widehat{\theta}_{ML}[/math], is far above its value at [math]\theta_{0}[/math]. If this difference is large, [math]\theta[/math] is unlikely to equal [math]\theta_{0}[/math], and we reject the null hypothesis.


Lagrange Multiplier Test

The Lagrange Multiplier (LM) test statistic is given by

[math]T_{LM}\left(X_{1}..X_{n}\right)=\frac{\left[\frac{\partial}{\partial\theta}l\left(\left.\theta_{0}\right|X_{1}..X_{n}\right)\right]^{2}}{-\frac{\partial^{2}}{\partial\theta^{2}}l\left(\left.\theta_{0}\right|X_{1}..X_{n}\right)}.[/math]

The motivation for the LM test is that it approximates the LRT. Unlike the LRT, though, the LM test does not require estimation! It suffices to evaluate the first and second derivatives of the log-likelihood function at the parameter value specified by the null hypothesis.

This test is often known as the score test, because we often refer to [math]\frac{\partial}{\partial\theta}l\left(\left.\theta\right|X_{1}..X_{n}\right)[/math] as the score function.

A Brief Note

  • Notation [math]\frac{\partial}{\partial\theta}l\left(\left.\theta_{0}\right|X_{1}..X_{n}\right)[/math] means [math]\left.\frac{\partial}{\partial\theta}l\left(\left.\theta\right|X_{1}..X_{n}\right)\right|_{\theta=\theta_{0}}.[/math]
  • Notice that the expectation of the denominator of the LM test equals the Fisher information, [math]I\left(\theta_{0}\right)[/math].
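
As a hedged sketch under the same hypothetical [math]N\left(\theta,1\right)[/math] model as above, the LM statistic only needs the score and the second derivative at [math]\theta_{0}[/math], which have the closed forms [math]l'\left(\theta_{0}\right)=\sum_{i}\left(x_{i}-\theta_{0}\right)[/math] and [math]l''\left(\theta_{0}\right)=-n[/math]:

import numpy as np

def lm_statistic(x, theta0):
    # score function l'(theta0) for the N(theta, 1) model
    score = np.sum(x - theta0)
    # minus the second derivative, -l''(theta0); for this model it is simply n
    neg_hessian = len(x)
    return score**2 / neg_hessian   # equals n * (xbar - theta0)**2 here

No estimation step is needed: everything is evaluated at [math]\theta_{0}[/math].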


Wald Test

The Wald test statistic is given by

[math]T_{w}\left(X_{1}..X_{n}\right)=\frac{\left(\widehat{\theta}_{ML}-\theta_{0}\right)^{2}}{\left[-\frac{\partial^{2}}{\partial\theta^{2}}l\left(\left.\widehat{\theta}_{ML}\right|X_{1}..X_{n}\right)\right]^{-1}}[/math].

This test can also be motivated as an approximation to the LRT. The principle of the Wald test is to reject the null hypothesis when [math]\left|\widehat{\theta}_{ML}-\theta_{0}\right|[/math] is large.

Dividing by [math]Var\left(\widehat{\theta}_{ML}\right)[/math] increases the rejection rate as the uncertainty about [math]\widehat{\theta}_{ML}[/math] decreases. (If the estimator is more precise, then we reject for smaller distances.)
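
Continuing the same hypothetical [math]N\left(\theta,1\right)[/math] sketch, the Wald statistic divides the squared distance between [math]\widehat{\theta}_{ML}[/math] and [math]\theta_{0}[/math] by the estimated variance [math]\left[-l''\left(\widehat{\theta}_{ML}\right)\right]^{-1}[/math], which here equals [math]\frac{1}{n}[/math]:

import numpy as np

def wald_statistic(x, theta0):
    theta_ml = np.mean(x)      # ML estimator of the mean
    var_hat = 1.0 / len(x)     # [-l''(theta_ml)]^{-1} for the N(theta, 1) model
    return (theta_ml - theta0)**2 / var_hat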


Example: LRT, Normal

Suppose [math]X_{i}\overset{iid}{\sim}N\left(\mu,1\right)[/math], and the test problem is

[math]H_{0}:\mu=0\,vs.\,H_{1}:\mu\neq0[/math]

The LRT can be written as

[math]\begin{aligned} 2\left[\max_{\theta}\,\left\{ l\left(\left.\theta\right|X_{1}..X_{n}\right)\right\} -l\left(\left.\theta_{0}\right|X_{1}..X_{n}\right)\right]\\ =2\left(\max_{\theta}\,\left\{ \sum_{i=1}^{n}\log\left[\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\left(x_{i}-\theta\right)^{2}}{2}\right)\right]\right\} -\sum_{i=1}^{n}\log\left[\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\left(x_{i}-\theta_{0}\right)^{2}}{2}\right)\right]\right)\\ =2\left(\max_{\theta}\,\left\{ \sum_{i=1}^{n}-\frac{\left(x_{i}-\theta\right)^{2}}{2}\right\} +\sum_{i=1}^{n}\frac{\left(x_{i}-\theta_{0}\right)^{2}}{2}\right)\end{aligned}[/math].

To continue, first notice that

[math]\begin{aligned} & \frac{\partial}{\partial\theta}\sum_{i=1}^{n}-\frac{\left(x_{i}-\theta\right)^{2}}{2}=0\\ \Leftrightarrow & \sum_{i=1}^{n}\left(x_{i}-\theta\right)=0\\ \Leftrightarrow & \sum_{i=1}^{n}x_{i}=n\theta\\ \Rightarrow & \widehat{\theta}_{ML}=\frac{1}{n}\sum_{i=1}^{n}x_{i}=\overline{x}\end{aligned}[/math]

s.t.

[math]\begin{aligned} & 2\left(\max_{\theta}\,\left\{ \sum_{i=1}^{n}-\frac{\left(x_{i}-\theta\right)^{2}}{2}\right\} +\sum_{i=1}^{n}\frac{\left(x_{i}-\theta_{0}\right)^{2}}{2}\right)\\ = & 2\sum_{i=1}^{n}\left(-\frac{\left(x_{i}-\overline{x}\right)^{2}}{2}+\frac{\left(x_{i}-\theta_{0}\right)^{2}}{2}\right)\\ = & \sum_{i=1}^{n}\left(-x_{i}^{2}+2x_{i}\overline{x}-\overline{x}^{2}+x_{i}^{2}-2x_{i}\theta_{0}+\theta_{0}^{2}\right)\\ = & n\left(\theta_{0}^{2}-2\overline{x}\theta_{0}+\overline{x}^{2}\right)+\sum_{i=1}^{n}\left(-x_{i}^{2}+2x_{i}\overline{x}-2\overline{x}^{2}+x_{i}^{2}\right)\\ = & n\left(\theta_{0}^{2}-2\overline{x}\theta_{0}+\overline{x}^{2}\right)+2n\overline{x}^{2}-2n\overline{x}^{2}\\ = & n\left(\overline{x}-\theta_{0}\right)^{2}\end{aligned}[/math]

So, the LRT becomes “reject [math]H_{0}[/math] iff [math]n\left(\overline{x}-\theta_{0}\right)^{2}\gt c[/math]”.

One can verify that the LM and Wald approaches yield the same test (we will show this later).

[math]\chi^{2}[/math] Distribution

Since we know that [math]\overline{X}[/math] follows a normal distribution, and that the sum of squares of independent standard normal random variables follows a chi-square distribution, let us derive the distribution of our test statistic.

First, notice that under [math]H_{0}[/math], [math]\overline{X}\sim N\left(\theta_{0},\frac{1}{n}\right)[/math] such that [math]\overline{X}-\theta_{0}\sim N\left(0,\frac{1}{n}\right)[/math] and [math]\sqrt{n}\left(\overline{X}-\theta_{0}\right)\sim N\left(0,1\right)[/math].

Define [math]Z=\sqrt{n}\left(\overline{X}-\theta_{0}\right)[/math], and notice that [math]Z^{2}[/math] is our original test statistic, such that [math]n\left(\overline{X}-\theta_{0}\right)^{2}\sim\chi_{\left(1\right)}^{2}[/math], where the subscript [math]\left(1\right)[/math] is the number of squared standard normals summed in our statistic (the degrees of freedom).

It is often challenging to calculate the distribution of test statistics. In these cases, computers come in handy (we will cover this later).
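
In this example, though, the null distribution is known exactly, and its quantiles are readily available in software. A minimal sketch, assuming scipy.stats.chi2 and a made-up sample:

import numpy as np
from scipy.stats import chi2

x = np.array([0.5, -0.2, 1.1, 0.3, -0.7, 0.9])   # hypothetical data
theta0 = 0.0

T = len(x) * (x.mean() - theta0)**2   # the LRT statistic n*(xbar - theta0)^2
c = chi2.ppf(0.95, df=1)              # critical value of a 5%-level chi^2(1) test
p_value = chi2.sf(T, df=1)            # P(chi^2(1) > T)
reject = T > c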

For completeness only, the density of the chi-square distribution is given by

[math]f_{Z^{2}}\left(\left.x\right|k\right)=\frac{1}{2^{\frac{k}{2}}\Gamma\left(\frac{k}{2}\right)}x^{\frac{k}{2}-1}\exp\left(-\frac{x}{2}\right)1\left(x\gt 0\right)[/math].
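
As a quick, hedged sanity check on this formula (assuming scipy's standard chi-square parameterization), one can compare it numerically with scipy.stats.chi2.pdf:

import numpy as np
from scipy.special import gamma
from scipy.stats import chi2

def chi2_pdf(x, k):
    # density of the chi-square distribution with k degrees of freedom
    return (x > 0) * x**(k / 2 - 1) * np.exp(-x / 2) / (2**(k / 2) * gamma(k / 2))

x = np.linspace(0.1, 10.0, 5)
print(np.allclose(chi2_pdf(x, k=1), chi2.pdf(x, df=1)))   # expected: True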


Test equivalence

We now show that both the LM and Wald tests approximate the LRT test.

  • The LM test is obtained by expanding the LRT via Taylor expansion around [math]\theta_{0}[/math].
  • The Wald test is obtained by expanding the LRT via Taylor expansion around [math]\widehat{\theta}_{ML}[/math].

For simplicity, below we will write [math]l\left(\theta\right)[/math] rather than [math]l\left(\left.\theta\right|x_{1}..x_{n}\right).[/math]


Equivalence Between LRT and LM Tests

One can expand the log-likelihood function around [math]\theta_{0}[/math]:

[math]l\left(\theta\right)\simeq l\left(\theta_{0}\right)+l'\left(\theta_{0}\right)\left(\theta-\theta_{0}\right)+\frac{l''\left(\theta_{0}\right)}{2}\left(\theta-\theta_{0}\right)^{2}[/math]

We will show that applying this approximation to the LRT yields the LM statistic.

In order to calculate the maximum log-likelihood, we maximize w.r.t. [math]\theta[/math], by use of the first-order condition:

[math]\begin{aligned} foc\left(\theta\right):\,\frac{d}{d\theta}l\left(\theta\right)=0 & \Leftrightarrow l'\left(\theta_{0}\right)+l''\left(\theta_{0}\right)\left(\theta-\theta_{0}\right)=0\\ & \Leftrightarrow\theta^{*}=\theta_{0}-\frac{l'\left(\theta_{0}\right)}{l''\left(\theta_{0}\right)}\end{aligned}[/math]

so that the maximized log-likelihood is approximately equal to

[math]\begin{aligned} l\left(\theta^{*}\right) & =l\left(\theta_{0}\right)+l'\left(\theta_{0}\right)\left(\theta^{*}-\theta_{0}\right)+\frac{l''\left(\theta_{0}\right)}{2}\left(\theta^{*}-\theta_{0}\right)^{2}\\ & =l\left(\theta_{0}\right)+l'\left(\theta_{0}\right)\left(-\frac{l'\left(\theta_{0}\right)}{l''\left(\theta_{0}\right)}\right)+\frac{l''\left(\theta_{0}\right)}{2}\left(-\frac{l'\left(\theta_{0}\right)}{l''\left(\theta_{0}\right)}\right)^{2}\\ & =l\left(\theta_{0}\right)-\frac{l'\left(\theta_{0}\right)^{2}}{l''\left(\theta_{0}\right)}+\frac{l'\left(\theta_{0}\right)^{2}}{2l''\left(\theta_{0}\right)}\\ & =l\left(\theta_{0}\right)-\frac{l'\left(\theta_{0}\right)^{2}}{2l''\left(\theta_{0}\right)}\end{aligned}[/math]

The [math]LRT[/math] test is given by [math]2\left[l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right][/math].

Plugging in our approximation of the maximum log-likelihood yields

[math]\begin{aligned} & 2\left[l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right]\\ \simeq & 2\left[l\left(\theta^{*}\right)-l\left(\theta_{0}\right)\right]\\ = & 2\left[l\left(\theta_{0}\right)-\frac{l'\left(\theta_{0}\right)^{2}}{2l''\left(\theta_{0}\right)}-l\left(\theta_{0}\right)\right]\\ = & -\frac{l'\left(\theta_{0}\right)^{2}}{l''\left(\theta_{0}\right)}\end{aligned}.[/math]

This is the LM statistic, since [math]-\frac{l'\left(\theta_{0}\right)^{2}}{l''\left(\theta_{0}\right)}=\frac{l'\left(\theta_{0}\right)^{2}}{-l''\left(\theta_{0}\right)}[/math]. If the denominator is instead written with the Fisher information, note that [math]-l''\left(\theta_{0}\right)[/math] approximates [math]I\left(\theta_{0}\right)[/math]; we will discuss this later.

Notice that using the “2” in the LRT test was useful to establish the approximate equality between the LRT and the LM test.


Equivalence Between LRT and Wald Tests

In this case, we expand the log-likelihood function around [math]\widehat{\theta}_{ML}[/math]:

[math]\begin{aligned} l\left(\theta\right) & =l\left(\widehat{\theta}_{ML}\right)+\underset{=0}{\underbrace{l^{'}\left(\widehat{\theta}_{ML}\right)}}\left(\theta-\widehat{\theta}_{ML}\right)+\frac{l^{''}\left(\widehat{\theta}_{ML}\right)}{2}\left(\theta-\widehat{\theta}_{ML}\right)^{2}\\ & =l\left(\widehat{\theta}_{ML}\right)+\frac{l^{''}\left(\widehat{\theta}_{ML}\right)}{2}\left(\theta-\widehat{\theta}_{ML}\right)^{2}\end{aligned}[/math]

Now, plugging this result evaluated at [math]\theta_{0}[/math] into the LRT yields:

[math]\begin{aligned} 2\left[l\left(\widehat{\theta}_{ML}\right)-l\left(\theta_{0}\right)\right] & \simeq2\left[l\left(\widehat{\theta}_{ML}\right)-\left(l\left(\widehat{\theta}_{ML}\right)+\frac{l^{''}\left(\widehat{\theta}_{ML}\right)}{2}\left(\theta_{0}-\widehat{\theta}_{ML}\right)^{2}\right)\right]\\ & =-l^{''}\left(\widehat{\theta}_{ML}\right)\left(\theta_{0}-\widehat{\theta}_{ML}\right)^{2}\\ & =\frac{\left(\theta_{0}-\widehat{\theta}_{ML}\right)^{2}}{\left[-l^{''}\left(\widehat{\theta}_{ML}\right)\right]^{-1}}\end{aligned}[/math]

which yields the Wald test.

Notice that the log-likelihood of the Normal distribution is quadratic in [math]\mu[/math], so these three procedures produce exactly the same test for [math]\mu[/math] (when [math]\sigma^{2}[/math] is known), because we have used quadratic approximations to the log-likelihood function.
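
To see how good these approximations are when the log-likelihood is not quadratic, here is a hedged numeric sketch under a hypothetical Exponential model with rate [math]\theta[/math] (not part of the lecture); the three statistics are close, but no longer identical:

import numpy as np

rng = np.random.default_rng(seed=2)
theta0 = 1.0
x = rng.exponential(scale=1.0 / 1.2, size=200)   # hypothetical data, true rate 1.2
n, s = len(x), np.sum(x)

# log-likelihood of f(x|theta) = theta * exp(-theta * x): n*log(theta) - theta*s
def loglik(theta):
    return n * np.log(theta) - theta * s

theta_ml = n / s                                   # ML estimator of the rate

T_LR = 2.0 * (loglik(theta_ml) - loglik(theta0))
T_LM = (n / theta0 - s)**2 / (n / theta0**2)       # score^2 / (-l''(theta0))
T_W = (theta_ml - theta0)**2 * (n / theta_ml**2)   # (theta_ml - theta0)^2 * (-l''(theta_ml))

print(T_LR, T_LM, T_W)   # similar, but not equal, values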

Finally, when the null hypothesis is composite, it is usually possible to construct an LM and Wald test.


Optimal Tests

We first provide a couple of definitions regarding tests:

  • A test has level [math]\alpha[/math] if [math]\beta\left(\theta\right)\leq\alpha[/math] [math]\forall\theta\in\Theta_{0}[/math].
  • The size of a test is [math]\sup_{\theta\in\Theta_{0}}\,\beta\left(\theta\right)[/math].

Notice that a test has multiple levels. A 5% level test has at most a 5% probability of a type 1 error. However, this test is also a level 10% test, a level 15% test, etc.

A test with size [math]5\%[/math] means that the highest probability of a type 1 error (over the potential values of the true parameter [math]\theta[/math] in [math]\Theta_{0}[/math]) is exactly 5%.

Neyman Approach

In order to define test optimality, we will follow the Neyman approach. We select a class [math]C[/math] of tests with level [math]\alpha[/math] (e.g., [math]\alpha=0.05[/math]). Then, we minimize the probability of a type 2 error (for all possible values of [math]\theta\in\Theta_{1}[/math]), subject to the constraint that the level of the test is fixed at [math]\alpha[/math].

Test Optimality (UMP)

First, let the testing problem be

[math]H_{0}:\theta\in\Theta_{0}\,vs.\,H_{1}:\theta\in\Theta_{1}[/math]

and let [math]C[/math] be a collection of tests.

A test in [math]C[/math] with power function [math]\beta\left(\cdot\right)[/math] is a uniformly most powerful (UMP) [math]C[/math] test if

[math]\beta\left(\theta\right)\geq\beta^{*}\left(\theta\right),\,\forall\,\theta\in\Theta_{1}[/math] for every [math]\beta^{*}[/math] corresponding to a test in [math]C[/math].

The UMP test has the lowest type 2 error probability among tests with level [math]\alpha[/math].

Finding a UMP test by hand is challenging. This is where the Neyman-Pearson lemma comes in.


Neyman-Pearson Lemma

Let [math]X[/math] be a random sample with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], and consider the problem [math]H_{0}:\theta=\theta_{0}\,vs.\,H_{1}:\theta=\theta_{1}[/math]

The test that rejects [math]H_{0}[/math] iff [math]f\left(\left.X\right|\theta_{1}\right)\gt k\cdot f\left(\left.X\right|\theta_{0}\right)[/math] for some [math]k\gt 0[/math] is a UMP level [math]\alpha[/math] test of [math]H_{0}[/math], where

[math]\alpha=P_{\theta_{0}}\left(f\left(\left.X\right|\theta_{1}\right)\gt k\cdot f\left(\left.X\right|\theta_{0}\right)\right)[/math].

This lemma shows that if one considers the case of a simple null vs. alternative hypothesis, then it is possible to obtain the UMP test.

It also shows that if a test is UMP, then it takes the form above.

At this point, you may wonder why we care about UMP tests. After all, for any candidate test, one can always find a constant that yields a given probability of a type 1 error at [math]\theta=\theta_{0}[/math].

The reason we care is that the UMP test dominates all others in terms of type 2 errors, i.e., for all values [math]\theta\in\Theta_{1}[/math]: the UMP test minimizes the probability of a type 2 error uniformly.
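
As a hedged sketch of the lemma in action, consider the hypothetical simple problem [math]H_{0}:\theta=0\,vs.\,H_{1}:\theta=1[/math] for a single [math]N\left(\theta,1\right)[/math] observation. The likelihood ratio [math]\frac{f\left(\left.X\right|1\right)}{f\left(\left.X\right|0\right)}=\exp\left(X-\frac{1}{2}\right)[/math] is increasing in [math]X[/math], so rejecting when the ratio exceeds [math]k[/math] is the same as rejecting when [math]X[/math] exceeds a cutoff, and the level-[math]\alpha[/math] cutoff is a standard normal quantile:

import numpy as np
from scipy.stats import norm

alpha = 0.05
theta0, theta1 = 0.0, 1.0

def likelihood_ratio(x):
    # f(x|theta1) / f(x|theta0) for a single N(theta, 1) observation
    return norm.pdf(x, loc=theta1) / norm.pdf(x, loc=theta0)

# exp(x - 1/2) > k  <=>  x > log(k) + 1/2, so pick the cutoff on x directly:
x_cut = norm.ppf(1 - alpha, loc=theta0)   # P_{theta0}(X > x_cut) = alpha
k = np.exp(x_cut - 0.5)                   # equivalent threshold on the ratio

# the UMP level-alpha test rejects H_0 iff likelihood_ratio(x) > k, i.e. iff x > x_cut
print(x_cut, k)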

Relationship with LRT

Consider the case [math]k\geq1[/math]. We can rewrite the test above as

[math]f\left(\left.X\right|\theta_{1}\right)\gt k\cdot f\left(\left.X\right|\theta_{0}\right)\Leftrightarrow\frac{\max_{\theta\in\left\{ \theta_{0},\theta_{1}\right\} }\,f\left(\left.X\right|\theta\right)}{f\left(\left.X\right|\theta_{0}\right)}\gt k[/math],

where the second expression is the likelihood ratio (in unlogged form) that underlies the LRT. Notice that when the left-hand inequality is satisfied, so is the right-hand one, and vice versa, for any [math]k\geq1[/math].

So, for simple null and alternative hypotheses, the LRT yields the UMP test. For more complicated testing problems, a UMP test may not exist, but we will still be able to apply some optimality concept.