Statistical hypothesis testing

This is what people want:

  1. State a hypothesis.
  2. Gather data.
  3. Find out the probability that the hypothesis is true, given the data.

This is not how statistical hypothesis testing works. It's more like this:

  1. State a hypothesis.
  2. Gather data.
  3. Reason about the likeliness of data assuming that the hypothesis is true.
    1. If the data is likely given the hypothesis, we fail to reject the hypothesis.
    2. If the data is unlikely given the hypothesis, we reject the hypothesis.

How do we measure the likeliness or “unlikeliness” of the data? With a probability, and a threshold which, by convention, has been fixed at .05. A century of controversy ensues.

Details about hypothesis testing

Some teaching materials about hypothesis testing seem to jump directly to “computing test statistics”, say the z-score, and getting the results as quickly as possible, with no indication as to why the z-score is used or why the test has a “right tail” or “two tails”. Here I will try to show the reasoning behind every step.

Let's work on a simple example of the kind we can find everywhere. We read that the average height of people in the world is 150 cm. We want to challenge this assumption. So we state a hypothesis:

$$ H_0: \mu = 150 $$

This is the null hypothesis, the status quo, which we are trying to challenge. The alternative hypothesis could be that people are actually taller, so:

$$ H_1: \mu > 150 $$

Now gather data from $n=1000$ people… and get a sample average $\bar{x}$.

How would we interpret $\bar{x}$? Let's say $\bar{x} = 160$. Do we immediately conclude that $H_1$ is true? What about $\bar{x} = 200$? That would be overwhelming evidence, right? But how about $\bar{x} = 151$: does that support $H_1$, or is it only due to chance that we got a value slightly greater than $150$? Or $\bar{x} = 149$, for that matter: one centimeter should be within the expected variation from a random sample, shouldn't it? Hey, even $\bar{x} = 220$ doesn't allow jumping to conclusions: what if we were really unlucky and sampled from a country with lots of really tall people?

No matter the value, there is a chance of making a wrong inference from $\bar{x}$. Namely:

  1. We could get $\bar{x} > 150$ even if $H_0$ is true.
  2. We could get $\bar{x} \leq 150$ even if $H_0$ is not true.
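Both kinds of misleading sample are easy to reproduce in a small simulation. The sketch below is only an illustration under assumptions that are not part of the example so far: heights are normal with a standard deviation of 30 cm, each sample has 10 people, and "not $H_0$" is taken to mean a true mean of 160 cm. The function and variable names are made up.

```python
import random
from statistics import fmean

def sample_mean(rng, true_mean, sigma, n):
    """Average height of n people drawn from a normal(true_mean, sigma)."""
    return fmean([rng.gauss(true_mean, sigma) for _ in range(n)])

rng = random.Random(42)             # fixed seed so the run is repeatable
SIGMA, N, TRIALS = 30, 10, 10_000   # assumed values, for illustration only

# Scenario 1: H0 is true (mu = 150), yet the sample mean exceeds 150.
above_under_h0 = sum(
    sample_mean(rng, 150, SIGMA, N) > 150 for _ in range(TRIALS)
) / TRIALS

# Scenario 2: H0 is false (here mu = 160), yet the sample mean is <= 150.
below_under_h1 = sum(
    sample_mean(rng, 160, SIGMA, N) <= 150 for _ in range(TRIALS)
) / TRIALS

print(f"mean > 150 although H0 is true:   {above_under_h0:.3f}")
print(f"mean <= 150 although H0 is false: {below_under_h1:.3f}")
```

Under these assumptions the first fraction should come out near one half, and the second should be clearly above zero, so neither scenario can be dismissed.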

These scenarios may be surprising, but they are not impossible, and we can quantify how probable they are. For instance, let's say we get $\bar{x} = 160$. We are really, really tempted to reject $H_0$ once and for all. But… could it be due to chance? What is the probability of getting $\bar{x} = 160$ or something more extreme (i.e. greater) if $H_0$ is true?

$$ \begin{aligned} P(\bar{x} \geq 160 | H_0 \: true) &= P(\bar{x} \geq 160 | \mu = 150) \\ &= P\left(\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \geq \frac{160-\mu}{\sigma/\sqrt{n}} \middle| \mu = 150\right) \\ &= P\left(Z \geq \frac{160-\mu}{\sigma/\sqrt{n}} \middle| \mu = 150\right) \end{aligned} $$

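In the second line we subtract the mean and divide by $\sigma/\sqrt{n}$, the standard deviation of the sample mean (its standard error), to form a quantity $Z$ whose distribution is $N(0,1)$. This is simply a fact about the normal distribution (of course, we are assuming that heights are normal here…). For the sake of simplicity, assume the standard deviation $\sigma=30$ is known. With $n=1000$ the standard error would be below 1 cm and the threshold for $Z$ would be about $10.5$, which leaves a probability that is zero for all practical purposes. To keep the numbers interesting, let's pretend the sample had only $n=10$ people. So:

$$ \begin{aligned} P(\bar{x} \geq 160 | H_0 \: true) &= P\left(Z \geq \frac{160-\mu}{\sigma/\sqrt{n}} \middle| \mu = 150\right) \\ &= P\left(Z \geq \frac{160-150}{30/\sqrt{10}}\right) \\ &= P(Z \geq 1.05) \\ &\approx 0.15 \end{aligned} $$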
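The arithmetic is easy to check numerically. The sketch below uses only the Python standard library, and the function name is made up for illustration. It evaluates the formula for a sample of 10 people, which is the sample size that yields $z \approx 1.05$, and also for $n=1000$. The tail probability is computed with the complementary error function, using the identity $P(Z \geq z) = \frac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})$, so that very small values are not rounded to zero.

```python
from math import erfc, sqrt

def right_tail_z_test(sample_mean, mu0, sigma, n):
    """Return (z, p) where p = P(sample mean >= observed | mu = mu0).

    Assumes normally distributed data with known standard deviation sigma.
    """
    standard_error = sigma / sqrt(n)        # std. deviation of the sample mean
    z = (sample_mean - mu0) / standard_error
    p = 0.5 * erfc(z / sqrt(2))             # P(Z >= z) for a standard normal Z
    return z, p

z_small, p_small = right_tail_z_test(160, 150, 30, 10)
z_large, p_large = right_tail_z_test(160, 150, 30, 1000)

print(f"n = 10:   z = {z_small:.2f}, p = {p_small:.3f}")
print(f"n = 1000: z = {z_large:.2f}, p = {p_large:.1e}")
```

With 10 people the result is $z \approx 1.05$ and a probability of about $0.15$. With 1000 people the same 10 cm gap gives $z \approx 10.5$ and a vanishingly small probability, which shows how strongly the answer depends on the sample size.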

Yay, we have a result! The probability of getting a sample mean at least as large as the one observed, assuming $H_0$ is true, equals $0.15$!

But… how does this value help us in deciding whether we should reject $H_0$? Well, $0.15$ sounds like a considerable value, close to the probability of getting any given number by rolling a die. Maybe we shouldn't take chances and decide that, maybe, $H_0$ is true? (Or rather, fail to reject $H_0$. You know, because assuming $H_0$ is true would explain the data.) We would certainly fail to reject $H_0$ if we had obtained, say, $0.99$ (“if the data is very likely when $H_0$ is true, then surely $H_0$ holds, huh?”). And if we had obtained $0.00001$, we would be thinking that the data is unlikely when $H_0$ is true, so we would reject $H_0$, right?

This $0.15$ value we obtained is called the p-value. Some statistician guy said: if the p-value is less than $0.05$, reject $H_0$. Otherwise, fail to reject it.

This threshold $\alpha=0.05$ is called the significance level of the test. It is the probability of mistakenly rejecting $H_0$ when $H_0$ is in fact true (a Type I error).
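A simulation makes the meaning of the significance level concrete: if $H_0$ is true and we reject whenever the p-value falls below $\alpha$, we should wrongly reject in roughly a fraction $\alpha$ of experiments. This is a minimal sketch, assuming normal heights with known $\sigma=30$ and a small, arbitrary sample size of 10 to keep it fast. All names are illustrative.

```python
import random
from math import erfc, sqrt
from statistics import fmean

MU0, SIGMA, ALPHA = 150, 30, 0.05
N = 10            # assumed sample size, kept small so the loop runs quickly
TRIALS = 20_000   # number of simulated experiments

def p_value(sample_mean):
    """Right-tail p-value of a z-test of H0: mu = MU0 with known SIGMA."""
    z = (sample_mean - MU0) / (SIGMA / sqrt(N))
    return 0.5 * erfc(z / sqrt(2))

rng = random.Random(0)   # fixed seed so the run is repeatable
false_rejections = 0
for _ in range(TRIALS):
    # H0 is true in every simulated experiment: the real mean is MU0.
    heights = [rng.gauss(MU0, SIGMA) for _ in range(N)]
    if p_value(fmean(heights)) < ALPHA:
        false_rejections += 1

rate = false_rejections / TRIALS
print(f"H0 rejected in {rate:.3f} of experiments where H0 was true")
```

The printed rate should land close to $0.05$. Choosing a smaller `ALPHA` lowers it, at the price of making it harder to reject $H_0$ when it really is false.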

The fallacy

(will write about Jacob Cohen's paper “The Earth Is Round (p < .05)”).