P-Values Part 5: Common Mistakes
- Cory Stasko
- Aug 2, 2016
- 6 min read
Picking up from where we left off with our discussion of error rates -- and the important but surprisingly challenging task of picking the correct denominator -- I will now highlight a couple of classic issues that repeatedly trip people up when interpreting probabilities.
Base Rate Fallacy
First, the base rate fallacy. Take the example of the FBI attempting to detect which Syrian refugees entering the U.S. are terrorists. Suppose 50,000 refugees will flee from Syria to the U.S. in 2016 and 10 of them (0.02%) will be terrorists. Ideally, the FBI would like to detect all of them, and not cause unnecessary trouble for anyone else. However, the screening methods currently being used are not perfect; suppose that 90% of terrorists are correctly identified, and 95% of incoming non-terrorists are correctly identified.
So here's the fallacy. You work at the FBI, and are reviewing the results of a screen for a particular refugee. The screen says he's a terrorist; what is the probability that he's actually a terrorist? Many people would say 90%, thinking "that's the probability that terrorists are correctly identified." But that's not the right probability to use; that 90% already assumes the person is a terrorist, so it doesn't make sense to use it to figure out whether he is one. What we want is the probability that, given the screen says he is a terrorist, he actually is. To find that, we need Bayesian statistics. First, we find that overall, the probability of being classified as a terrorist (regardless of the truth) is 5.017%. Note that this is much higher than the actual terrorist rate of 0.02%. Then, we use Bayes' Theorem to find the conditional probability that this person actually is a terrorist, given that he was classified that way. The probability is (0.9)*(0.0002)/(0.05017) = 0.36%. Yep, that's right, there's only about a 0.36% chance that he's actually a terrorist, given that the screen flags him.
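For concreteness, here's a minimal sketch of that calculation in Python, using only the numbers from the example above:

```python
# Bayes' Theorem applied to the screening example above.
p_terrorist = 10 / 50_000              # base rate: 0.02%
p_flag_given_terrorist = 0.90          # terrorists correctly flagged
p_flag_given_innocent = 0.05           # non-terrorists incorrectly flagged

# Total probability of being flagged, regardless of the truth (~5.017%).
p_flag = (p_flag_given_terrorist * p_terrorist
          + p_flag_given_innocent * (1 - p_terrorist))

# P(terrorist | flagged): about 0.36%, not 90%.
p_terrorist_given_flag = p_flag_given_terrorist * p_terrorist / p_flag
print(f"{p_flag:.3%}, {p_terrorist_given_flag:.2%}")  # 5.017%, 0.36%
```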

Why? The problem is that the vast majority of refugees are not terrorists. In statistics lingo, there is a highly skewed base rate. Therefore, the 5% of incoming non-terrorists who are incorrectly flagged becomes a huge problem: 5% of 49,990 non-terrorists means about 2,500 false positive screens. Meanwhile, 90% of the 10 real terrorists produces only 9 true positives. Here's a nice little check you can do: 9/(2,500+9) = 0.36%. Yay, statistics works!
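The same sanity check, done in code with expected counts instead of probabilities:

```python
# The same sanity check with expected counts instead of probabilities.
true_positives = 0.90 * 10         # real terrorists correctly flagged: 9
false_positives = 0.05 * 49_990    # non-terrorists incorrectly flagged: ~2,500
print(f"{true_positives / (true_positives + false_positives):.2%}")  # 0.36%
```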
This base rate fallacy is a potential problem in medical diagnostics. An oncologist interpreting a screen for a rare cancer might face numbers very much like those the FBI agent sees: 50,000 people are tested per year and 10 of them actually have the rare cancer; 90% of the cancers are correctly diagnosed, and 95% of non-cancer patients are correctly diagnosed. At first glance, those probabilities sound pretty good. But because of the lopsided base rate, the probability that a patient with a positive screen actually has the cancer is again only about 0.36%. Thankfully, physicians often do incorporate base rates into their assessment of diagnostic results.
Prosecutor's Fallacy
The prosecutor's fallacy has a similar structure to the base rate fallacy (and there's also a version for defense attorneys!). In a criminal trial, a prosecutor may introduce DNA evidence that matches DNA from the suspect to DNA found at the scene with a certain probability. Specifically, the expert witness describing the DNA evidence for the prosecution may say that the probability of finding this evidence if the accused were innocent is tiny, say 0.3%. That is effectively a p-value. The fallacy occurs when someone then says or believes that the probability the accused is innocent, given that the evidence was found, is also 0.3%. Those are two very different things.
In one potential case, call it case A, the accused is innocent and the erroneously damning evidence is found by coincidence. Case A is a false positive event. Both probabilities -- the p-value actually produced by the forensic analysis and the probability people mistake it for -- put case A in the numerator. The difference between the two probabilities is (again) the denominator. So effectively we are (again) distinguishing between the False Positive Rate and the False Discovery Rate.
So, is that 0.3% the rate of false positive cases per case of innocence, or per case with this evidence? Since we have a case with this evidence, that's what the judge and jury want the denominator to be. Then they would know the probability of innocence, given the present scenario. But that would be the False Discovery Rate, which cannot be calculated using frequentist statistics. Instead, 0.3% is the rate of false positive cases per case of innocence, or the False Positive Rate. So 3 out of every 1,000 cases of innocence would produce DNA evidence like this. That's not that helpful.
In fact, given how many people are innocent of this crime, there are many, many innocent suspects who would match the DNA evidence to the same extent as the accused individual on trial. So the manner in which suspects are identified becomes important. Did the police cast a wide net? Search an entire DNA database for a match? Or did they identify this suspect based on other evidence, and then test his DNA to confirm their suspicions? The strength of that separate evidence is extremely important for putting this p-value of 0.003 in context. So how would you combine multiple pieces of information to produce a single estimate of the probability of guilt? Yep, Bayesian statistics.
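To make the prior's importance concrete, here's a hypothetical sketch; the pool of 10,000 potential suspects and the assumption that the guilty person always matches are made-up numbers for illustration, not figures from any real case:

```python
# Hypothetical wide-net search of a DNA database (illustrative numbers only).
pool_size = 10_000               # assumed: people whose DNA gets compared
n_guilty = 1                     # assumed: one true perpetrator in the pool
p_match_given_innocent = 0.003   # the expert witness's 0.3% figure
p_match_given_guilty = 1.0       # assumed: the guilty person always matches

expected_innocent_matches = p_match_given_innocent * (pool_size - n_guilty)
expected_guilty_matches = p_match_given_guilty * n_guilty

# P(innocent | DNA match) -- nowhere near 0.3%.
p_innocent_given_match = expected_innocent_matches / (
    expected_innocent_matches + expected_guilty_matches)
print(f"{p_innocent_given_match:.0%}")  # ~97%
```

If independent evidence had already narrowed the pool to a handful of people, the same 0.3% would be far more damning -- which is exactly the prior-dependence that Bayesian reasoning makes explicit.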
Unifying Mistakes

Hopefully you notice the similarity between these examples. The base rate fallacy, the prosecutor's fallacy, and the classic p-value misinterpretation are statistically quite similar. Researchers, doctors, lawyers -- they're all tempted to make the same kinds of mistakes. Specifically, all the examples described thus far largely boil down to these two forms of wishful thinking:
1. Assuming that the two classes being compared are equally common when in fact they are not -- also known as the base rate fallacy. Forgetting to account for the relative rates of cancer and non-cancer patients, or the relative rates of people innocent and guilty of a particular crime, or the relative rates of terrorists and non-terrorists, or the relative likelihood of the null and alternative hypotheses. All of these are examples of the same easy mistake.
2. Assuming that the error probability you have is the one relevant to what you care about. This is the question of denominators. The probability of falsely predicting cancer given that the patient doesn't have cancer is not the same as the probability of falsely predicting cancer given that a physician is looking at a positive test result. Physicians are in the latter scenario: they are given a test result. They are never given the truth about the patient, yet they sometimes apply the error rate built on that denominator nonetheless.
Bayesian statistics can often correct both of these mistakes, as I've shown. To address the first mistake, relative rates of cancer or terrorism can be estimated and properly accounted for. For the second mistake, we simply must use the correct denominator: positive test results. But there's one case in which these solutions aren't available: the researcher's misinterpreted p-value.
The Frequentist Interpretation
We cannot estimate (at least not traditionally) the relative likelihood of hypotheses because we can't go out and count how many times the null and alternative hypotheses each appear in the same way we count the number of patients with and without cancer. But that doesn't mean we're allowed to just assume they are equally likely! That's another example of mistake #1. In truth, the null and alternative hypotheses could have any relative likelihood.
Suppose an experiment produces a p-value of 0.03, meaning the null hypothesis would produce results at least as extreme as those observed only 3% of the time. But what if the null hypothesis is extremely likely, and the alternative is not? It's like the non-cancer patients or the non-terrorists: they produce false positives at a fairly low rate, but are extremely common. As a result, most of the apparent positives you get are actually false positives, despite the low p-value. In this case, we have a low p-value producing a high False Discovery Rate, so clearly it's wrong to think they're the same thing.
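To see how that plays out numerically, here's a hypothetical sketch; the 99% chance that the null is true and the 80% power are assumed numbers chosen purely for illustration:

```python
# Hypothetical: a "significant" result under a lopsided base rate.
# Assumed numbers for illustration only.
p_null_true = 0.99   # assumed: the null hypothesis is true 99% of the time
alpha = 0.03         # treat the 0.03 p-value as the significance threshold
power = 0.80         # assumed: chance of a positive when the alternative is true

false_positives = alpha * p_null_true        # true nulls that look significant
true_positives = power * (1 - p_null_true)   # real effects that look significant

# False Discovery Rate: the share of "significant" results that are false alarms.
fdr = false_positives / (false_positives + true_positives)
print(f"{fdr:.0%}")  # ~79%, despite the "low" p-value of 0.03
```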
Of course, it could also be the other way around. The null hypothesis could be extremely unlikely, making the p-value effectively much stronger than 0.03. At the end of the day, the relative likelihood of the two hypotheses could be anything. Stated differently, in the Frequentist world we don't know the base rate, because a hypothesis -- a universal rule -- cannot have a base rate. It's either true or not. Once. Without a base rate, we cannot use Bayes' Theorem to convert between error rates with different denominators. In other words, we're stuck with the non-intuitive error rate produced by a hypothesis test: false positive events given the null hypothesis. When we're so ambitious as to attempt to explain a universal rule, the best we can get is a difficult-to-interpret p-value.
You might say that Bayesians are able to make clearer statements because they're not attempting to explain as much.
[Note: I plan on adding additional posts with some more technical discussion of hypothesis testing, because there are a few conceptually minor but technically important complications that I should mention.]