
P-Values Part 1: The Concept

  • Cory Stasko
  • Jul 25, 2016
  • 5 min read

Introduction

P-values provide us with a never-ending stream of confusion. From high school classrooms to the top research labs, p-values are infamous for being difficult to understand. Unfortunately, they are also central to how science is conducted and evaluated. Presented with this dilemma, the American Statistical Association set about developing a definition of a p-value that is both accurate and comprehensible. Their result was reported in March:

Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

Oh well, it was worth a shot.

Actually, I think it's worth another shot. However, I don't think it can be done in one sentence, or even one paragraph. Despite its ubiquity, the p-value is incredibly difficult to define for someone unfamiliar with statistics. It is not only an internal tool that depends upon many other statistical ideas, but it also violates our intuition for how probabilities should work.

So I'm going to try to explain it with a layered approach. I'll start with the core -- its purpose rather than its mechanics -- and iteratively add detail and complexity in order to paint an increasingly complete picture of the p-value, both in statistics and in society. Feel free to stop reading when you feel satisfied. Note that the explanation continues into a second post.

Description of a p-value

At its most basic, a p-value is a way of communicating the robustness of a scientific result. Typically, a lower p-value indicates that the result is more robust.

The lower the p-value, the more reason there is to believe that the finding was not simply an over-interpretation of noise in the data, but a real discovery of a real pattern in the world. This is what the phrase "statistical significance" is supposed to indicate. For example, in high-energy physics, the criterion for "evidence of a particle" is p ≤ 0.003. So measurements that produce a p-value less than 0.003 would be said to provide statistically significant evidence of a particle.
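To make the idea a bit more concrete, here is a minimal sketch (in Python, with made-up numbers) of one common way a p-value can be computed: pretend the group labels don't matter, shuffle them many times, and count how often the shuffled difference is at least as extreme as the one actually observed. This is only an illustration of the "equal to or more extreme" idea, not the only way p-values are calculated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up measurements for two small groups.
group_a = np.array([5.1, 4.8, 5.4, 5.0, 5.3])
group_b = np.array([4.6, 4.9, 4.5, 4.7, 5.0])
observed_diff = group_a.mean() - group_b.mean()

# Null model: the group labels don't matter. Shuffle the labels many times
# and record the mean difference each shuffle produces purely by chance.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
chance_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    chance_diffs.append(shuffled[:n_a].mean() - shuffled[n_a:].mean())
chance_diffs = np.abs(np.array(chance_diffs))

# The p-value: the fraction of chance differences at least as extreme
# as the one we actually observed.
p_value = np.mean(chance_diffs >= abs(observed_diff))
print(f"observed difference: {observed_diff:.2f} | p-value: {p_value:.3f}")
```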

The p-value is not, however, the probability that the finding is a false positive (a finding erroneously obtained by over-interpreting noise in the data). Many people make this mistake and therefore read into p-values a meaning they don't carry. The lower the p-value, the lower the risk of a false positive, but the p-value is not itself the probability of a false positive. So the bottom line is this: tests of statistical significance and the p-values they produce are helpful in demonstrating the relative robustness of scientific findings, but they cannot provide the probability that a given theory or finding is correct.
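To see why that distinction matters, here is a hypothetical simulation. All of the numbers below are made-up assumptions for illustration: the effect size, the sample size, and the premise that only 10% of the hypotheses being tested are actually real. Even when every "significant" result clears the p < 0.05 bar, a large share of those results can still turn out to be false positives.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 10_000
n_per_group = 30
true_effect = 0.5        # assumed size of a real difference (in sd units)
fraction_real = 0.10     # assume only 10% of tested hypotheses are real

is_real = rng.random(n_experiments) < fraction_real
p_values = np.empty(n_experiments)
for i in range(n_experiments):
    shift = true_effect if is_real[i] else 0.0
    group_a = rng.normal(0.0, 1.0, n_per_group)
    group_b = rng.normal(shift, 1.0, n_per_group)
    p_values[i] = stats.ttest_ind(group_a, group_b).pvalue

# Among the experiments that reached p < 0.05, how many had no real effect?
significant = p_values < 0.05
false_positive_share = np.mean(~is_real[significant])
print(f"'significant' findings that are actually false positives: "
      f"{false_positive_share:.0%}")
```

Under these assumptions, the share of significant findings that are false positives is far larger than 5%, which is exactly why a p-value cannot be read as the probability that a finding is wrong.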

Scientific Motivation

Before giving a somewhat more technical explanation of what a p-value actually is, I first need to give some background. The problem that leads us to statistical constructs like p-values is rather simple: we want to explain how the world works (from particle physics to psychology) with firm rules, but there's lots of variation when we actually make observations. There's a disconnect between the mess of measuring the real world and the elegant formulas we want to record into textbooks, or give to policymakers. Sometimes, unexplained variation means that your hypothesis is wrong or incomplete. But other times, you are on the cusp of making a real discovery, but evidence of it is masked by noise.

The challenge we face is to separate the signal from the noise. We try to do this in three steps: 1) design the best-controlled experiment possible that will capture minimal noise, 2) collect as much data as possible to study, and 3) use the appropriate statistical tools to distinguish the signal of a particular discovery from the noise of everything else. P-values are just one tool used in that third step.

Consider the following example. A group of scientists is investigating the possible difference in average height between male and female members of a mammalian species that is very similar to humans -- except that they vary widely in how hairy they are. Some have curly hair that rises a foot or more above their head. The scientists don't want to include hair in their measurements, and have a few options for how to deal with it.

  1. Ignore the issue and measure to the top of the hair anyway.

  2. Compress the hair against the person's head when measuring their height.

  3. Approximate by eye where the top of each person's head is.

  4. Shave everyone's head and then measure.

In this example, the height is the signal, the hair is the noise, and the four methodologies represent possible ways to control the experiment such that it captures more of the signal and less of the noise.

Science and Statistics

Each of those four methods yields a set of data the scientists can then analyze, looking for the difference in height between the genders. They will use statistics and will produce a p-value in the course of each analysis. It is possible that, even under method #1, the data would indicate that there is a significant difference in heights. Suppose that, on average, males of this species are four feet tall and females are ten feet tall. Then the noise that hair-height contributed to the overall-height measurement wouldn't be very problematic, because the females are just so much taller that the difference is easy to spot. The p-value would be quite small, despite the hair problem, giving the scientists an easy answer. In fact, with a difference that obvious, they probably wouldn't have bothered doing the study at all unless someone was funding it.

Instead, let's suppose the actual difference in height is, on average, 2 inches. In order of the four methodologies, the scientists might find p-values like this: 0.42, 0.12, 0.09, and 0.02. Those p-values indicate that, as you go down the list of methodologies, the results increasingly point to a statistically significant difference in heights between the genders, because the p-values are getting smaller. The first p-value (0.42) says that the scientists found nothing; there's no reason to believe there's a difference in height between the genders based on the results of the first experiment. The second and third (0.12 and 0.09) are weak indications of a possible result, but are somewhat more compelling when taken together. The last p-value (0.02) is significant by the conventional 0.05 threshold, so the scientists could probably publish their findings in whichever hairy humanoid journal is in vogue.
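If you want to watch that pattern emerge on its own, here is a rough simulation of the hairy-humanoid study. The 2-inch true difference comes from the example above; the amount of noise each measurement method adds is a made-up assumption, and the exact p-values will differ from the ones quoted, but the trend -- less noise, smaller p-value -- should hold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50                   # individuals measured per gender
true_difference = 2.0    # inches -- the real signal from the example
body_sd = 3.0            # ordinary person-to-person height variation

# Hypothetical extra noise (inches) that hair adds under each methodology,
# from "measure over the hair" down to "shave and measure".
hair_noise = {"1. measure over the hair": 8.0,
              "2. compress the hair":     4.0,
              "3. eyeball the scalp":     3.0,
              "4. shave first":           1.0}

for method, hair_sd in hair_noise.items():
    males = rng.normal(70.0, body_sd, n) + rng.normal(0.0, hair_sd, n)
    females = rng.normal(70.0 + true_difference, body_sd, n) + rng.normal(0.0, hair_sd, n)
    p = stats.ttest_ind(females, males).pvalue
    print(f"{method}: p = {p:.3f}")
```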

Sometimes it isn't possible to implement a more controlled methodology, however. Budgets, time, and technology all constrain the improvement of experiments. What then? The other common way to collect data that will give you more statistically significant results is simply to collect more data. The equations used to calculate p-values all prominently feature n, the sample size. The larger the sample size, the more opportunities you have to find and confirm a signal. If you have very few measurements, it's impossible to know what's signal and what's noise. With more measurements, it's much easier to see what changes (noise) and what stays the same (signal). The calculation of a p-value accounts for this principle, so when a real effect is present, a larger sample size almost always produces a smaller p-value and a more significant result.
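Here is the same kind of sketch for sample size, again with made-up numbers: holding the real 2-inch difference fixed and measuring ever more individuals per group, the p-value tends to shrink.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_difference = 2.0    # inches -- the same small real effect as above
noise_sd = 5.0           # combined biological and measurement variation

# Same experiment, run with ever-larger samples. The signal is fixed,
# so piling on data usually drives the p-value down.
for n in (10, 50, 200, 1000):
    males = rng.normal(70.0, noise_sd, n)
    females = rng.normal(70.0 + true_difference, noise_sd, n)
    p = stats.ttest_ind(females, males).pvalue
    print(f"n = {n:>4} per group: p = {p:.4f}")
```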

The discussion continues in subsequent blog posts...
