# Hypothesis Testing

The goal of hypothesis testing is to select a subset of the parameter space $\Theta$.

Set $\Theta$ is first partitioned into disjoint subsets, $\Theta_{0}$ and $\Theta_{1}$, where $\Theta_{1}=\Theta\backslash\Theta_{0}$.

Then, we decide on a rule for choosing between $\Theta_{0}$ and $\Theta_{1}$.

## Some Terminology

• A hypothesis is a statement about $\theta$.
• Null Hypothesis: $H_{0}:\theta\in\Theta_{0}$.
• Alternative Hypothesis: $H_{1}:\theta\in\Theta_{1}$.
• Maintained Hypothesis: $H:\theta\in\Theta$.

The goal of hypothesis testing is to decide for the null or the alternative hypothesis. Throughout the procedure, the maintained hypothesis is assumed.

A typical formulation of a hypothesis test is: $H_{0}:\theta\in\Theta_{0}\,vs.\,H_{1}:\theta\in\Theta_{1}$

## Example: Normal

Suppose $X_{i}\overset{iid}{\sim}N\left(\mu,1\right)$, where $\mu\geq0$ (maintained hypothesis) is unknown. The aim is to test whether $\mu=0$.

Notice that we can write the problem down in two equivalent formulations:

$H_{0}:\mu=0\,vs.\,H_{1}:\mu\gt 0$

or

$H_{0}:\mu\gt 0\,vs.\,H_{1}:\mu=0$

It is usually easier to work with the formulation that has a simple null hypothesis. The null (alternative) hypothesis is simple if $\Theta_{0}$ ($\Theta_{1}$) is a singleton. Otherwise, it is composite.

# Testing Procedure

Suppose $X_{1}..X_{n}$ is a random sample with a pmf/pdf $f\left(\left.\cdot\right|\theta\right)$ where $\theta\in\Theta$ is unknown. Consider the test

$H_{0}:\theta\in\Theta_{0}\,vs.\,H_{1}:\theta\in\Theta_{1}$

A testing procedure is a rule for choosing between $H_{0}$ and $H_{1}.$

For example, in the normal example above, if the data come out relatively low, then we may opt for hypothesis $\mu=0$; whereas we may opt for the alternative hypothesis that $\mu\gt 0$ if the data are relatively high.

There is no single obvious way to define the decision rule. However, any rule can be described by the region of the data space in which it supports one of the hypotheses.

Let $C\subseteq\mathbb{R}^{n}$. The rule "reject $H_{0}$ iff $\left(X_{1}..X_{n}\right)\in C$" (i.e., if the data fall in the region) is a testing procedure with critical region $C$.

## Example: Normal

Suppose $X_{i}\overset{iid}{\sim}N\left(\mu,1\right)$, where $\mu\geq0$ is unknown, and $H_{0}:\mu=0\,vs.\,H_{1}:\mu\gt 0.$

A possible decision rule is to reject $H_{0}$ if $\overline{X}\gt 2$. Of course, we could have selected a different right-hand side, like 3, or even a function of $n$ to account for the fact that larger samples produce more precise means.

For our current example, the critical region is given by $C=\left\{ \left(X_{1}..X_{n}\right)^{'}:\frac{\sum_{i=1}^{n}X_{i}}{n}\gt 2\right\}$.

• We call $\frac{\sum_{i=1}^{n}X_{i}}{n}$ the test-statistic: It’s a statistic that will be used to decide the test result.
• We call $2$ the test threshold or the critical value.
• One can practically always write tests as $T\left(X\right)\gt c$.

Hypothesis testing involves choosing a test statistic (left-hand side) and a critical value (right-hand side). Depending on the data, the condition is either satisfied or not (so, a test produces a binary outcome).
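As a minimal sketch of the rule "reject $H_{0}$ iff $\overline{X}\gt 2$", the function below computes the test statistic and compares it with the critical value. The function name `reject_null` and the two small data sets are hypothetical, made up purely for illustration.

```python
from statistics import fmean


def reject_null(sample, critical_value=2.0):
    """Return True when the test statistic (the sample mean) exceeds the critical value."""
    test_statistic = fmean(sample)  # T(X): left-hand side of the rule
    return test_statistic > critical_value  # c: right-hand side of the rule


# Hypothetical data, invented for illustration only.
low_sample = [0.3, -0.5, 1.1, 0.2]   # sample mean 0.275
high_sample = [2.4, 3.1, 1.9, 2.8]   # sample mean 2.55

print(reject_null(low_sample))   # False: the data do not fall in the critical region
print(reject_null(high_sample))  # True: the data fall in the critical region
```

The binary return value mirrors the fact that a test produces a binary outcome.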

We will first discuss critical value selection (for generic tests or specific ones). I.e., we will focus on the right-hand side. In the next lecture, we discuss test selection.

But before we get in too deep, something completely different.

# Variation on the Theme

Let’s stop for a second, and ask “why are we even doing this”? Why is it so important to determine whether $\mu=0$ or $\mu\gt 0$? Why not simply estimate $\mu$ through maximum likelihood or the method of moments, and use whatever information we obtained? If we estimate $\mu=0.3$, well, then maybe $\mu$ is indeed 0.3 for all we know.

The debate about hypothesis testing dates back to the early 20th century, primarily between Ronald Fisher and Jerzy Neyman, and continued for several decades. A significant part of the debate was philosophical. What we teach and study today is a product of that debate, taking mostly the approach of Neyman and his co-author Egon Pearson, but still informed by ideas from Ronald Fisher. Fisher’s motivation was in part whether to maintain or reject a currently-held scientific hypothesis. If the data disagreed with the current hypothesis sufficiently - in some formal way - then one could do away with it. In contrast, Neyman and Pearson’s approach pits two hypotheses against each other, and favors one or the other.

Over time, Neyman and Pearson’s approach gained momentum, probably due to the amount of formal tools used, as well as due to the Neyman-Pearson lemma, which establishes a form of optimality when selecting a test. The approach is agnostic in terms of the scientific method. In practice though, natural sciences employ this method conservatively: A current theory is disproved if - statistically speaking - the chance that the observations from an experiment disagree with it simply because of randomness is very, very small; yet, we observe a disagreement in the data. (For example, our theory predicted that the chance of observing a sample mean higher than 0 was 0.00001%; yet we observed it.)

In the social sciences, there exists a mild debate about how conservative hypothesis testing should be. Because humans are so volatile, data about their behavior is not always good enough to convincingly disprove a theory. And clearly, social sciences face challenges in replicating experiments while keeping conditions completely stable. So, some authors have proposed that the dichotomous approach of lending support to one theory or another by partitioning the parameter space into two is inadequate for the social sciences. Instead, researchers should keep track of the parameters estimated over time, in different studies and experiments, and use the full information instead.

Some related debates still take place: For example, does the dichotomous approach provide too much freedom/too much incentive for scientists to interfere with experimental results?

The point of this section is to try to sensitize you to the fact that, despite its mathematical language, hypothesis testing is a tool that does the job you ask of it. By studying it and understanding it well, you will be able to decide whether it solves the particular problem you face, whether it needs a tweak or two, or whether it is completely inapplicable/inappropriate.

Keep in mind though, that the theory is intricate and sometimes deceiving. The people I’ve talked to who know the most in the world about this area of statistics say as much. You may also want to keep in mind that many people know only a bit about it, yet will speak as if they knew a lot.

With that, let’s proceed into the jewel of the modern scientific method.

# Type 1 and Type 2 Errors

The critical value is a fundamental aspect of hypothesis testing. In the previous normal example, we chose a critical value of $2$. Was this a good idea? Surely the probability of selecting the right hypothesis is not 100% either way.

For any critical value, there will be cases where we will say that $\mu=0$ when in reality $\mu\gt 0$, and the converse will also happen. If we were always able to pick the right hypothesis, then there would be no uncertainty.

Let us first organize the possible cases of “hit and miss” in the following table:

| Truth (rows) / Decision (columns) | $H_{0}$ | $H_{1}$ |
| --- | --- | --- |
| $H_{0}$ | $\unicode{x2714}$ | Type 1 error |
| $H_{1}$ | Type 2 error | $\unicode{x2714}$ |

The main diagonal is simple to memorize: If $H_{0}$ is true and we opt for $H_{0}$ (or if $H_{1}$ is true and... you get the point), then no errors were made. If we decide for $H_{1}$ when $H_{0}$ is true, then we have committed a type 1 error. If we decide for $H_{0}$ when $H_{1}$ is true, then we committed a type 2 error.
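The four cases can also be encoded as a small helper, which may help with memorizing the names. This is only a sketch: the function name `classify_outcome` and the string labels are illustrative choices, not standard terminology from any library.

```python
def classify_outcome(null_is_true, null_is_rejected):
    """Map (truth, decision) to the name of the outcome."""
    if null_is_true and null_is_rejected:
        return "type 1 error"      # H0 true, but we decided for H1
    if not null_is_true and not null_is_rejected:
        return "type 2 error"      # H1 true, but we decided for H0
    return "correct decision"      # decision matches the truth


# Enumerate all four combinations of truth and decision.
for null_is_true in (True, False):
    for null_is_rejected in (False, True):
        print(null_is_true, null_is_rejected,
              classify_outcome(null_is_true, null_is_rejected))
```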

## Example: Normal

Let’s use the normal example from before. After all, we need to get used to these relatively artificial error names.

In our example,

$H_{0}:\mu=0$ and $H_{1}:\mu\gt 0$.

• If we reject $\mu=0$ when it was true, then we commit a type 1 error.
• If we accept $\mu=0$ and it was false, then we commit a type 2 error.

So, provided an error was made, type 1 error happens when rejecting the null, type 2 error occurs when accepting the null. Rehearse this: Type 1 error, reject the null.... Type 1 error, reject the null...

Often, we select the critical value so that $P_{\theta_{0}}\left(\text{type 1 error}\right)\leq5\%$. One interpretation typical in the sciences is that we would like to be conservative, and only rarely reject the null hypothesis erroneously.

The 5% threshold used above is arbitrary. For example, in the natural sciences where experiments can be reproduced with high precision, we may use $p=0.003$ or even lower.

We will talk about the probability of committing a type 2 error in the next lecture. As a preview, we will minimize that probability, constrained by the fact that the probability of a type 1 error cannot surpass 5% (or some other established level).

An incredibly useful tool to analyze this problem further is the power function (and then some graphs).

## A Quick Note on Notation

Statements like $P_{\theta_{0}}\left(\text{type 1 error}\right)\leq5\%$ will be frequent from here on.

When using the subscript notation $P_{\theta_{0}}\left(\cdot\right)$, it may appear that we mean $P\left(\cdot|\theta=\theta_{0}\right)$. The issue with this statement is that we do not consider $\theta$ to be a random variable (rather, it is the true value of the parameter, a constant), so it does not make sense to condition on it. Hence, the use of the subscript notation.

The statement $P_{\theta_{0}}\left(\cdot\right)$ can be interpreted as "we will replace $\theta$ by $\theta_{0}$ eventually."

# Power Function

The power function of a test with critical region $C$ is the function $\beta:\Theta\rightarrow\left[0,1\right]$ given by

$\beta\left(\theta\right)=P_{\theta}\left[\left(X_{1}..X_{n}\right)'\in C\right]=P_{\theta}\left(\text{reject }H_{0}\right),\theta\in\Theta$

So, the power function returns a probability. It is super convenient because it summarizes type 1 and type 2 errors in a single function.

To see this, note that

\begin{aligned} P_{\theta}\left(\text{type 1 error}\right) & =\beta\left(\theta\right),\,\theta\in\Theta_{0}\\ P_{\theta}\left(\text{type 2 error}\right) & =1-\beta\left(\theta\right),\,\theta\in\Theta_{1}\end{aligned}

From here on, we will work with examples.

## Example 1: Normal

Let $X_{i}\overset{iid}{\sim}N\left(\mu,1\right)$, and consider the test

$H_{0}:\mu=0\,vs.\,H_{1}:\mu\gt 0$,

where the maintained hypothesis is $\mu\in\left[0,\infty\right)$.

Suppose we decide to reject $H_{0}$ if $\overline{X}\gt 0$ (i.e., the critical value is 0). We’ll discuss whether this is a sensible rule later.

The power function is given by

$\beta\left(\theta\right)=P_{\theta}\left(\text{reject }H_{0}\right),\theta\in\Theta$.

Since we know that $\overline{X}\sim N\left(\mu,\frac{1}{n}\right)$, we can further write

$\beta\left(\theta\right)=P_{\mu}\left(\text{reject }H_{0}\right)=P_{\mu}\left(\overline{X}\gt 0\right)=1-P_{\mu}\left(\overline{X}\leq0\right)=1-F_{\overline{X},\mu}\left(0\right)=1-\Phi\left(\frac{0-\mu}{\sqrt{\frac{1}{n}}}\right)=1-\Phi\left(-\sqrt{n}\mu\right)$.

Because we know that

\begin{aligned} P_{\theta}\left(\text{type 1 error}\right) & =\beta\left(\theta\right),\,\theta\in\Theta_{0}\\ P_{\theta}\left(\text{type 2 error}\right) & =1-\beta\left(\theta\right),\,\theta\in\Theta_{1}\end{aligned}

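It follows that (with a bit of abuse of notation, and writing $\mu_{0}=0$ for the value of $\mu$ under the null hypothesis)

$P_{\mu}\left(\text{type 1 error}\right)=\beta\left(\mu=\mu_0\right)=1-\Phi\left(-\sqrt{n}\cdot0\right)=1-\Phi\left(0\right)=50\%.$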

$P_{\mu}\left(\text{type 2 error}\right)=1-\beta\left(\mu\neq\mu_0\right)=1-\left(1-\Phi\left(-\sqrt{n}\mu\right)\right) = \Phi\left(-\sqrt{n}\mu\right)\text{ for }\mu\neq\mu_{0}$.

Let’s plot this power function, for $n=20$, as a function of the parameter of interest, $\mu$:
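A minimal sketch for tabulating the values behind such a plot, using only Python's standard library. It evaluates $\beta\left(\mu\right)=1-\Phi\left(\sqrt{n}\left(c-\mu\right)\right)$ with $c=0$ and $n=20$; the function name `power` and the grid of $\mu$ values are arbitrary choices made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def power(mu, n=20, c=0.0):
    """beta(mu) = P_mu(sample mean > c) = 1 - Phi(sqrt(n) * (c - mu))."""
    return 1.0 - PHI(sqrt(n) * (c - mu))


# The grid below is an arbitrary choice for display purposes.
mu_grid = [0.0, 0.1, 0.25, 0.5, 1.0]
power_values = [power(mu) for mu in mu_grid]

for mu, beta in zip(mu_grid, power_values):
    print(f"mu = {mu:4.2f}   beta(mu) = {beta:.4f}")
```

The first row corresponds to the null hypothesis ($\mu=0$), the remaining rows to points in the alternative.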

The first noticeable feature is that the function is increasing. We will interpret the function more carefully soon.

Under the null hypothesis, $\mu=0$. In this case, the probability of committing a type 1 error is $50\%$. This means that if we repeated this exact experiment many times with $\mu=0$, and drew a new value $\overline{x}$ each time, our rule would reject the null hypothesis approximately 50% of the time.

Is this surprising? The magical 50% figure is a consequence of the chosen test statistic and the cutoff of zero of our admittedly arbitrary rule. The random variable of interest in our test (i.e., our test statistic) is the sample mean. We obtained this result because samples from a $N\left(0,1\right)$ distribution produce sample means (each calculated with $n=20$) above zero 50% of the time. Whether this rejection rate of the null hypothesis is appropriate depends on the application.
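The "repeat the experiment many times" reading can be checked by simulation. The sketch below assumes 10,000 repetitions and a fixed seed, both arbitrary choices; the function name `rejection_rate` is hypothetical. It draws samples of size $n=20$ from $N\left(\mu,1\right)$ and records how often the rule $\overline{X}\gt 0$ rejects.

```python
import random
from statistics import fmean


def rejection_rate(mu, n=20, c=0.0, repetitions=10_000, seed=0):
    """Fraction of simulated samples whose mean exceeds the critical value c."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]  # X_i ~ N(mu, 1)
        if fmean(sample) > c:
            rejections += 1
    return rejections / repetitions


rate_under_null = rejection_rate(mu=0.0)
print(rate_under_null)  # close to 0.5, up to simulation noise
```

Because this is a Monte Carlo estimate, the printed number fluctuates around 0.5 with the seed and the number of repetitions.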

What does the power function tell us?

At $\mu=0$, it tells us the probability of rejecting $H_{0}$ when it is true, i.e., the probability of a type 1 error under our rule. Under $H_{1}$ ($\mu\gt 0$), it gives us one minus the probability of committing a type 2 error. In other words, at each specific value $\mu\gt 0$, it tells us the probability of accepting the alternative hypothesis when it is indeed correct.

We would like to produce a power function that equals zero at $\mu=0$, and equals 1 at $\mu\gt 0$. However, as long as $\overline{X}$ is uncertain, we cannot ensure this.

Intuitively, we want to minimize the power function at the null set, and maximize it at the alternative set.

# Setting the Critical Value

Suppose we would like the probability of a type 1 error to equal 5% exactly. The way to do this is to start with some fictitious threshold $c$, write down the probability of a type 1 error, set it equal to 5%, and solve for $c$.

We rewrite our rule to reject $H_{0}$ when $\overline{X}\gt c$.

Then, $P_{\mu}\left(\text{type 1 error}\right)=P_{\mu}\left(\overline{X}\gt c\right)=1-P_{\mu}\left(\overline{X}\leq c\right)=1-\Phi\left(\frac{c-\mu}{\sqrt{\frac{1}{n}}}\right)$.

Under the null hypothesis that $\mu=0$, and for $n=20$, we want $P_{\mu}\left(\text{type 1 error}\right)=1-\Phi\left(c\sqrt{20}\right)=0.05$, i.e., $c=\Phi^{-1}\left(0.95\right)/\sqrt{20}$. Since $\Phi^{-1}$ has no elementary closed form, we evaluate it numerically: $\Phi^{-1}\left(0.95\right)\simeq1.645$, so $c\simeq0.368$.

At this value, $P_{\mu}\left(\text{type 1 error}\right)=0.05$.
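The numerical solution can be reproduced with the standard normal quantile function. This is a sketch using `statistics.NormalDist` from Python's standard library; the function names `critical_value` and `type_1_error` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def critical_value(alpha=0.05, n=20):
    """Solve 1 - Phi(c * sqrt(n)) = alpha for c, under mu = 0 and unit variance."""
    return STD_NORMAL.inv_cdf(1.0 - alpha) / sqrt(n)


def type_1_error(c, n=20):
    """P(sample mean > c) when mu = 0 and the variance of each X_i is 1."""
    return 1.0 - STD_NORMAL.cdf(c * sqrt(n))


c = critical_value()
print(round(c, 3))                 # 0.368
print(type_1_error(0.0))           # 0.5, the earlier rule with critical value 0
print(round(type_1_error(c), 4))   # 0.05
```

Comparing the last two lines shows the drop in the type 1 error probability when the critical value moves from 0 to about 0.368.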

Is it intuitive that as we increase the critical value (from zero in the previous example to 0.368 in this one), the probability of a type 1 error decreases (from 50% to 5%)?

Fix $\mu=0$, and imagine drawing multiple sample means. Clearly, fewer of them fall above $c=0.368$ than above $c=0$. Given that we reject $H_{0}$ when $\overline{X}\gt c$, the probability of rejection decreases as $c$ increases. With $n=20$, at $c=0.368$, it is approximately 0.05.

## A note on $n$

You may feel a bit uncomfortable about the fact that we fixed $n$ in these examples. Clearly, $n$ affects the choice of critical value to use. For example, if $n$ were very large, a very small deviation of $\overline{X}$ from zero could already justify rejecting the null hypothesis.

When $n$ is given, this is not an issue. However, when the researcher has the ability to set $n$, then she has two degrees of freedom, $c$ and $n$. This can be useful in experimental design and data collection. For example, if one has good information about $\sigma^{2}$, one may select $c$ and $n$ to fix the probability of a type 1 error and the probability of a type 2 error (at a specific value of $\mu$) simultaneously, since these two requirements form a system of two equations in two unknowns.
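A sketch of how such a system might be solved for the normal example. The targets below (5% type 1 error at $\mu=0$, 20% type 2 error at a hypothetical alternative $\mu_{1}=0.5$, known $\sigma=1$) are assumptions chosen for illustration, and `design_test` is a hypothetical helper name. The two equations are $c\sqrt{n}/\sigma=\Phi^{-1}\left(1-\alpha\right)$ and $\left(c-\mu_{1}\right)\sqrt{n}/\sigma=-\Phi^{-1}\left(1-\text{type 2 target}\right)$; subtracting them gives $\sqrt{n}$, and $n$ is rounded up to an integer.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def design_test(alpha, type_2_target, mu_alt, sigma=1.0):
    """Pick n and c for the rule 'reject H0: mu = 0 when the sample mean exceeds c'."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_power = STD_NORMAL.inv_cdf(1.0 - type_2_target)
    # Subtracting the two equations: mu_alt * sqrt(n) / sigma = z_alpha + z_power.
    n = ceil(((z_alpha + z_power) * sigma / mu_alt) ** 2)
    # With n rounded up, set c so that the type 1 error equals alpha exactly.
    c = z_alpha * sigma / sqrt(n)
    # Type 2 error actually achieved at mu_alt with the rounded n.
    type_2 = STD_NORMAL.cdf((c - mu_alt) * sqrt(n) / sigma)
    return n, c, type_2


n, c, type_2 = design_test(alpha=0.05, type_2_target=0.20, mu_alt=0.5)
print(n, round(c, 3), round(type_2, 3))
```

Rounding $n$ up means the achieved type 2 error is at most the target rather than exactly equal to it.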

## Composite $H_{0}$

We could have designed a test with a composite null hypothesis. In this case, the bound on the probability of a type 1 error has to hold at every $\mu$ in the null set, so we could select $c$, for example, by solving the problem

\begin{aligned} \max_{c\in\mathbb{R}} & \,\sup_{\mu\in\Theta_{0}}P_{\mu}\left(\overline{X}\gt c\right)\\ s.t. & \sup_{\mu\in\Theta_{0}}P_{\mu}\left(\overline{X}\gt c\right)\leq0.05,\end{aligned}

where $\Theta_{0}$ is the set of $\mu$’s contemplated in the null hypothesis.

However, designing a test with a simple null hypothesis is simpler, because we require equality to the specified rejection probability at a single value of $\mu$.
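To illustrate the composite case, the sketch below assumes a hypothetical null set $\Theta_{0}=\left[0,0.1\right]$ (not taken from the examples above) with $n=20$ and unit variance. Since $P_{\mu}\left(\overline{X}\gt c\right)=1-\Phi\left(\sqrt{n}\left(c-\mu\right)\right)$ is increasing in $\mu$, the largest rejection probability over $\Theta_{0}$ occurs at its upper end, so it suffices to impose the 5% requirement there. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def rejection_probability(mu, c, n=20):
    """P_mu(sample mean > c) for X_i ~ N(mu, 1)."""
    return 1.0 - STD_NORMAL.cdf(sqrt(n) * (c - mu))


def composite_critical_value(mu_upper, alpha=0.05, n=20):
    """Critical value when the null set is an interval with upper end mu_upper."""
    # The rejection probability increases in mu, so the bound binds at mu_upper.
    return mu_upper + STD_NORMAL.inv_cdf(1.0 - alpha) / sqrt(n)


mu_upper = 0.1  # hypothetical upper end of the null set [0, 0.1]
c = composite_critical_value(mu_upper)

# Check the bound on a grid of points inside the null set.
null_grid = [mu_upper * k / 10 for k in range(11)]
worst_case = max(rejection_probability(mu, c) for mu in null_grid)
print(round(c, 3), round(worst_case, 4))
```

The grid check is only a sanity check; the monotonicity argument is what guarantees the bound over the whole interval.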