Why p values can’t tell you what you need to know and what to do about it

Published October 18, 2020

This is a transcript of the talk that I gave to the RIOT science club on 1st October 2020. The video of the talk is on YouTube . The transcript was very kindly made by Chris F Carroll, but I have modified it a bit here to increase clarity. Links to the original talk appear throughout.

This is the original video of the talk.

My title slide is a picture of UCL’s front quad, taken on the day that it was the starting point for the second huge march that attempted to stop the Iraq war. That’s a good example of the folly of believing things that aren’t true.

“Today I speak to you of war. A war that has pitted statistician against statistician for nearly 100 years. A mathematical conflict that has recently come to the attention of the normal people and these normal people look on in fear, in horror, but mostly in confusion because they have no idea why we’re fighting.”

Kristin Lennox (Director of Statistical Consulting, Lawrence Livermore National Laboratory)

That sums up a lot of what’s been going on. The problem is that there is near unanimity among statisticians that p values don’t tell you what you need to know but statisticians themselves haven’t been able to agree on a better way of doing things.

This talk is about the probability that if we claim to have made a discovery we’ll be wrong. This is what people very frequently want to know. And that is not the p value. You want to know the probability that you’ll make a fool of yourself by claiming that an effect is real when in fact it’s nothing but chance.

Just to be clear, what I’m talking about is how you interpret the results of a single unbiased experiment. Unbiased in the sense the experiment is randomized, and all the assumptions made in the analysis are exactly true. Of course in real life false positives can arise in any number of other ways: faults in the randomization and blinding, incorrect assumptions in the analysis, multiple comparisons, p hacking and so on, and all of these things are going to make the risk of false positives even worse. So in a sense what I’m talking about is your minimum risk of making a false positive even if everything else were perfect.

The conclusion of this talk will be:

If you observe a p value close to 0.05 and conclude that you’ve discovered something, then the chance that you’ll be wrong is not 5%, but is somewhere between 20% and 30% depending on the exact assumptions you make. If the hypothesis was an implausible one to start with, the false positive risk will be much higher.

There’s nothing new about this at all. This was written by a psychologist in 1966.

The major point of this paper is that the test of significance does not provide the information concerning phenomena characteristically attributed to it, and that a great deal of mischief has been associated with its use.

Bakan, D. (1966) Psychological Bulletin, 66 (6), 423 – 237

Bakan went on to say this is already well known, but if so it’s certainly not well known, even today, by many journal editors or indeed many users.

The p value

Let’s start by defining the p value. An awful lot of people can’t do this but even if you can recite it, it’s surprisingly difficult to interpret it.

I’ll consider it in the context of comparing two independent samples to make it a bit more concrete. So the p value is defined thus:

If there were actually no effect -for example if the true means of the two samples were equal, so the difference was zero -then the probability of observing a value for the difference between means which is equal to or greater than that actually observed is called the p value.

Now there’s at least five things that are dodgy with that, when you think about it. It sounds very plausible but it’s not.

“If there are actually no effect …”: first of all this implies that the denominator for the probability is the number of cases in which there is no effect and this is not known.
“… or greater than…” : why on earth should we be interested in values that haven’t been observed? We know what the effect size that was observed was, so why should we be interested in values that are greater than that which haven’t been observed?
It doesn’t compare the hypothesis of no effect with anything else. This is put well by Sellke et al in 2001, “knowing that the data are rare when there is no true difference [that’s what the p value tells you] is of little use unless one determines whether or not they are also rare when there is a true difference”. In order to understand things properly, you’ve got to have not only the null hypothesis but also an alternative hypothesis.
Since the definition assumes that the null hypothesis is true, it’s obvious that it can’t tell us about the probability that the null hypothesis is true.
The definition invites users to make the error of the transposed conditional. That sounds a bit fancy but it’s very easy to say what it is.

The probability that you have four legs given that you’re a cow is high but the probability that you’re a cow given that you’ve got four legs is quite low many animals that have four legs that aren’t cows.
Take a legal example. The probability of getting the evidence given that you’re guilty may be known. (It often isn’t of course — but that’s the sort of thing you can hope to get). But it’s not what you want. What you want is the probability that you’re guilty given the evidence.
The probability you’re catholic given that you’re the pope is probably very high, but the probability you’re a pope given that you’re a catholic is very low.

So now to the nub of the matter.

The probability of the observations given that the null hypothesis is the p value. But it’s not what you want. What you want is the probability that the null hypothesis is true given the observations.

The first statement is a deductive process; the second process is inductive and that’s where the problems lie. These probabilities can be hugely different and transposing the conditional simply doesn’t work.

The False Positive Risk

The false positive risk avoids these problems. Define the false positive risk as follows.

If you declare a result to be “significant” based on a p value after doing a single unbiased experiment, the False Positive Risk is the probability that your result is in fact a false positive.

That, I maintain, is what you need to know. The problem is that in order to get it, you need Bayes’ theorem and as soon as that’s mentioned, contention immediately follows.

Bayes’ theorem

Suppose we call the null-hypothesis H₀, and the alternative hypothesis H₁. For example, H₀ can be that the true effect size is zero and H₁ can be the hypothesis that there’s a real effect, not just chance. Bayes’ theorem states that the odds on H₁ being true, rather than H₀ , after you’ve done the experiment are equal to the likelihood ratio times the odds on there being a real effect before the experiment:

In general we would want a Bayes’ factor here, rather than the likelihood ratio, but under my assumptions we can use the likelihood ratio, which is a much simpler thing [explanation here].

The likelihood ratio represents the evidence supplied by the experiment. It’s what converts the prior odds to the posterior odds, in the language of Bayes’ theorem. The likelihood ratio is a purely deductive quantity and therefore uncontentious. It’s the probability of the observations if there’s a real effect divided by the probability of the observations if there’s no effect.

Notice a simplification you can make: if the prior odds equal 1, then the posterior odds are simply equal to the likelihood ratio. “Prior odds of 1” means that it’s equally probable before the experiment that there was an effect or that there’s no effect. Put another way, prior odds of 1 means that the prior probability of H₀ and of H₁ are equal: both are 0.5. That’s probably the nearest you can get to declaring equipoise.

Comparison: Consider Screening Tests

I wrote a statistics textbook in 1971 [download it here] which by and large stood the test of time but the one thing I got completely wrong was the limitations of p values. Like many other people I came to see my errors through thinking about screening tests. These are very much in the news at the moment because of the COVID-19 pandemic. The illustration of the problems they pose which follows is now quite commonplace.

Suppose you test 10,000 people and that 1 in a 100 of those people have the condition, e.g. Covid-19, and 99 don’t have it. The prevalence in the population you’re testing is 1 in a 100. So you have 100 people with the condition and 9,900 who don’t. If the specificity of the test is 95%, you get 5% false positives.

This is very much like a null-hypothesis test of significance. But you can’t get the answer without considering the alternative hypothesis, which null-hypothesis significance tests don’t do. So now add the upper arm to the Figure above.

You’ve got 1% (so that’s 100 people) who have the condition, so if the sensitivity of the test is 80% (that’s like the power of a significance test) then you get to the total number of positive tests is 80 plus 495 and the proportion of tests that are false is 495 false positives divided by the total number of positives, which is 86%. A test that gives 86% false positives is pretty disastrous. It is not 5%! Most people are quite surprised by that when they first come across it.

Now look at significance tests in a similar way

Now we can do something similar for significance tests (though the parallel is not exact, as I’ll explain).

Suppose we do 1,000 tests and in 10% of them there’s a real effect, and in 90% of them there is no effect. If the significance level, so-called, is 0.05 then we get 5% false positive tests, which is 45 false positives.

But that’s as far as you can go with a null-hypothesis significance test. You can’t tell what’s going on unless you consider the other arm. If the power is 80% then we get 80 true positive tests and 20 false negative tests, so the total number of positive tests is 80 plus 45 and the false positive risk is the number of false positives divided by the total number of positives which is 36 percent.

So the p value is not the false positive risk. And the type 1 error rate is not the false positive risk.

The difference between them lies not in the numerator, it lies in the denominator. In the example above, of the 900 tests in which the null-hypothesis was true, there were 45 false positives. So looking at it from the classical point of view, the false positive risk would turn out to be 45 over 900 which is 0.05 but that’s not what you want. What you want is the total number of false positives, 45, divided by the total number of positives (45+80), which is 0.36.

The p value is NOT the probability that your results occurred by chance. The false positive risk is.

A complication: “p-equals” vs “p-less-than”

But now we have to come to a slightly subtle complication. It’s been around since the 1930s and it was made very explicit by Dennis Lindley in the 1950s. Yet it is unknown to most people which is very weird. The point is that there are two different ways in which we can calculate the likelihood ratio and therefore two different ways of getting the false positive risk.

A lot of writers including Ioannidis and Wacholder and many others use the “p less than” approach. That’s what that tree diagram gives you. But it is not what is appropriate for interpretation of a single experiment. It underestimates the false positive risk.

What we need is the “p equals” approach, and I’ll try and explain that now.

Suppose we do a test and we observe p = 0.047 then all we are interested in is, how tests behave that come out with p = 0.047. We aren’t interested in any other different p value. That p value is now part of the data. The tree diagram approach we’ve just been through gave a false positive risk of only 6%, if you assume that the prevalence of true effects was 0.5 (prior odds of 1). 6% isn’t much different from 5% so it might seem okay.

But the tree diagram approach, although it is very simple, still asks the wrong question. It looks at all tests that gives p ≤ 0.05, the “p-less-than” case. If we observe p = 0.047 then we should look only at tests that give p = 0.047 rather than looking at all tests which come out with p ≤ 0.05. If you’re doing it with simulations of course as in my 2014 paper then you can’t expect any tests to give exactly 0.047; what you can do is look at all the tests that come out with p in a narrow band around there, say 0.045 ≤ p ≤ 0.05.

This approach gives a different answer from the tree diagram approach. If you look at only tests that give p values between 0.045 and 0.05, the false positive risk turns out to be not 6% but at least 26%.

I say at least, because that assumes a prior probability of there being a real effect of 50:50. If only 10% of the experiments had a real effect of (a prior of 0.1 in the tree diagram) this rises to 76% of false positives. That really is pretty disastrous. Now of course the problem is you don’t know this prior probability.

The problem with Bayes theorem is that there exists an infinite number of answers. Not everyone agrees with my approach, but it is one of the simplest.

The likelihood-ratio approach to comparing two hypotheses

The likelihood ratio -that is to say, the relative probabilities of observing the data given two different hypotheses, is the natural way to compare two hypotheses. For example, in our case one hypothesis is the zero effect (that’s the null-hypothesis) and the other hypothesis is that there’s a real effect of the observed size. That’s the maximum likelihood estimate of the real effect size. Notice that we are not saying that the effect size is exactly zero; but rather we are asking whether a zero effect explains the observations better than a real effect.

Now this amounts to putting a “lump” of probability on there being a zero effect. If you put a prior probability of 0.5 for there being a zero effect, you’re saying the prior odds are 1. If you are willing to put a lump of probability on the null-hypothesis, then there are several methods of doing that. They all give similar results to mine within a factor of two or so.

Putting a lump of probability on their being a zero effect, for example a prior probability of 0.5 of there being zero effect, is regarded by some people as being over-sceptical (though others might regard 0.5 as high, given that most bright ideas are wrong).

E.J. Wagenmakers summed it up in a tweet:

“at least Bayesians attempt to find an approximate answer to the right question instead of struggling to interpret an exact answer to the wrong question [that’s the p value]”.

Some results.

The 2014 paper used simulations, and that’s a good way to see what’s happening in particular cases. But to plot curves of the sort shown in the next three slides we need exact calculations of FPR and how to do this was shown in the 2017 paper (see Appendix for details).

Comparison of p-equals and p-less-than approaches

The slide at slide at 26:05 is designed to show the difference between the “p-equals” and the “p-less than” cases.

On each diagram the dashed red line is the “line of equality”: that’s where the points would lie if the p value were the same as the false positive risk. You can see that in every case the blue lines -the false positive risk -is greater than the p value. And for any given observed p value, the p-equals approach gives a bigger false positive risk than the p-less-than approach. For a prior probability of 0.5 then the false positive risk is about 26% when you’ve observed p = 0.05.

So from now on I shall use only the “p-equals” calculation which is clearly what’s relevant to a test of significance.

The false positive risk as function of the observed p value for different sample sizes

Now another set of graphs (slide at 27:46), for the false positive risk as a function of the observed p value, but this time we’ll vary the number in each sample. These are all for comparing two independent samples.

The curves are red for n = 4 ; green for n = 8 ; blue for n = 16.

The top row is for an implausible hypothesis with a prior of 0.1, the bottom row for a plausible hypothesis with a prior of 0.5.

The left column shows arithmetic plots; the right column shows the same curves in log-log plots, The power these lines correspond to is:

n = 4 (red) has power 22%
n = 8 (green) has power 46%
n = 16 (blue) one has power 78%

Now you can see these behave in a slightly curious way. For most of the range it’s what you’d expect: n = 4 gives you a higher false positive risk than n = 8 and that still higher than n = 16 the blue line.

The curves behave in an odd way around 0.05; they actually begin to cross, so the false positive risk for p values around 0.05 is not strongly dependent on sample size.

But the important point is that in every case they’re above the line of equality, so the false positive risk is much bigger than the p value in any circumstance.

False positive risk as a function of sample size (i.e. of power)

Now the really interesting one (slide at 29:34). When I first did the simulation study I was challenged by the fact that the false positive risk actually becomes 1 if the experiment is a very powerful one. That seemed a bit odd.

The plot here is the false positive risk FPR₅₀ which I define as “the false positive risk for prior odds of 1, i.e. a 50:50 chance of being a real effect or not a real effect.

Let’s just concentrate on the p = 0.05 curve (blue). Notice that, because the number per sample is changing, the power changes throughout the curve. For example on the p = 0.05 curve for n = 4 (that’s the lowest sample size plotted), power is 0.22, but if we go to the other end of the curve, n = 64 (the biggest sample size plotted), the power is 0.9999. That’s something not achieved very often in practice.

But how is it that p = 0.05 can give you a false positive risk which approaches 100%? Even with p = 0.001 the false positive risk will eventually approach 100% though it does so later and more slowly.

In fact this has been known for donkey’s years. It’s called the Jeffreys-Lindley paradox, though there’s nothing paradoxical about it. In fact it’s exactly what you’d expect. If the power is 99.99% then you expect almost every p value to be very low. Everything is detected if we have a high power like that. So it would be very rare, with that very high power, to get a p value as big as 0.05. Almost every p value will be much less than 0.05, and that’s why observing a p value as big as 0.05 would, in that case, provide strong evidence for the null-hypothesis. Even p = 0.01 would provide strong evidence for the null hypothesis when the power is very high because almost every p value would be much less than 0.01.

This is a direct consequence of using the p-equals definition which I think is what’s relevant for testing hypotheses. So the Jeffreys-Lindley phenomenon makes absolute sense.

In contrast, if you use the p-less-than approach, the false positive risk would decrease continuously with the observed p value. That’s why, if you have a big enough sample (high enough power), even the smallest effect becomes “statistically significant”, despite the fact that the odds may favour strongly the null hypothesis. [Here, ‘the odds’ means the likelihood ratio calculated by the p-equals method.]

A real life example

Now let’s consider an actual practical example. The slide shows a study of transcranial electromagnetic stimulation published in Science magazine (so a bit suspect to begin with).

The study concluded (among other things) that an improved associated memory performance was produced by transcranial electromagnetic stimulation, p = 0.043. In order to find out how big the sample sizes were I had to dig right into the supplementary material. It was only 8. Nonetheless let’s assume that they had an adequate power and see what we make of it.

In fact it wasn’t done in a proper parallel group way, it was done as ‘before and after’ the stimulation, and sham stimulation, and it produces one lousy asterisk. In fact most of the paper was about functional magnetic resonance imaging, memory was mentioned only as a subsection of Figure 1, but this is what was tweeted out because it sounds more dramatic than other things and it got a vast number of retweets. Now according to my calculations p = 0.043 means there’s at least an 18% chance that it’s false positive.

How better might we express the result of this experiment?

We should say, conventionally, that the increase in memory performance was 1.88 ± 0.85 (SEM) with confidence interval 0.055 to 3.7 (extra words recalled on a baseline of about 10). Thus p = 0.043. But then supplement this conventional statement with

This implies a false positive risk, FPR₅₀, (i.e. the probability that the results occurred by chance only) of at least 18%, so the result is no more than suggestive.

There are several other ways you can put the same idea. I don’t like them as much because they all suggest that it would be helpful to create a new magic threshold at FPR₅₀ = 0.05, and that’s as undesirable as defining a magic threshold at p = 0.05. For example you could say that the increase in performance gave p = 0.043, and in order to reduce the false positive risk to 0.05 it would be necessary to assume that the prior probability of there being a real effect was 81%. In other words, you’d have to be almost certain that there was a real effect before you did the experiment before that result became convincing. Since there’s no independent evidence that that’s true, the result is no more than suggestive.

Or you could put it this way: the increase in performance gave p = 0.043. In order to reduce the false positive risk to 0.05 it would have been necessary to observe p = 0.0043, so the result is no more than suggestive.

The reason I now prefer the first of these possibilities is because the other two involve an implicit threshold of 0.05 for the false positive risk and that’s just as daft as assuming a threshold of 0.05 for the p value.

The web calculator

Scripts in R are provided with all my papers. For those who can’t master R Studio, you can do many of the calculations very easily with our web calculator [for latest links please go to http://www.onemol.org.uk/?page_id=456]. There are three options : if you want to calculate the false positive risk for a specified p value and prior, you enter the observed p value (e.g. 0.049), the prior probability that there’s a real effect (e.g. 0.5), the normalized effect size (e.g. 1 standard deviation) and the number in each sample. All the numbers cited here are based on an effect size if 1 standard deviation, but you can enter any value in the calculator. The output panel updates itself automatically.

web-calc

We see that the false positive risk for the p-equals case is 0.26 and the likelihood ratio is 2.8 (I’ll come back to that in a minute).

Using the web calculator or using the R programs which are provided with the papers, this sort of table can be very quickly calculated.

The top row shows the results if we observe p = 0.05. The prior probability that you need to postulate to get a 5% false positive risk would be 87%. You’d have to be almost ninety percent sure there was a real effect before the experiment, in order to to get a 5% false positive risk. The likelihood ratio comes out to be about 3; what that means is that your observations will be about 3 times more likely if there was a real effect than if there was no effect. 3:1 is very low odds compared with the 19:1 odds which you might incorrectly infer from p = 0.05. The false positive risk for a prior of 0.5 (the default value) which I call the FPR_50,would be 27% when you observe p = 0.05.

In fact these are just directly related to each other. Since the likelihood ratio is a purely deductive quantity, we can regard FPR₅₀ as just being a transformation of the likelihood ratio and regard this as also a purely deductive quantity. For example, 1 / (1 + 2.8) = 0.263, the FPR₅₀. But in order to interpret it as a posterior probability then you do have to go into Bayes’ theorem. If the prior probability of a real effect was only 0.1 then that would correspond to a 76% false positive risk when you’ve observed p = 0.05.

If we go to the other extreme, when we observe p = 0.001 (bottom row of the table) the likelihood ratio is 100 -notice not 1000, but 100 -and the false positive risk, FPR₅₀ , would be 1%. That sounds okay but if it was an implausible hypothesis with only a 10% prior chance of being true (last column of Table), then the false positive risk would be 8% even when you observe p = 0.001: even in that case it would still be above 5%. In fact, to get the FPR down to 0.05 you’d have to observe p = 0.00043, and that’s good food for thought.

So what do you do to prevent making a fool of yourself?

Never use the words significant or non-significant and then don’t use those pesky asterisks please, it makes no sense to have a magic cut off. Just give a p value.
Don’t use bar graphs. Show the data as a series of dots.
Always remember, it’s a fundamental assumption of all significance tests that the treatments are randomized. When this isn’t the case, you can still calculate a test but you can’t expect an accurate result. This is well-illustrated by thinking about randomisation tests.
So I think you should still state the p value and an estimate of the effect size with confidence intervals but be aware that this tells you nothing very direct about the false positive risk. The p value should be accompanied by an indication of the likely false positive risk. It won’t be exact but it doesn’t really need to be; it does answer the right question. You can for example specify the FPR₅₀, the false positive risk based on a prior probability of 0.5. That’s really just a more comprehensible way of specifying the likelihood ratio. You can use other methods, but they all involve an implicit threshold of 0.05 for the false positive risk. That isn’t desirable.

So p = 0.04 doesn’t mean you discovered something, it means it might be worth another look. In fact even p = 0.005 can under some circumstances be more compatible with the null-hypothesis than with there being a real effect.

We must conclude, however reluctantly, that Ronald Fisher didn’t get it right. Matthews (1998) said,

“the plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning boloney into breakthroughs and flukes into funding”.

Robert Matthews Sunday Telegraph, 13 September 1998.

But it’s not quite fair to blame R. A. Fisher because he himself described the 5% point as a “quite a low standard of significance”.

Questions & Answers

Q: “There are lots of competing ideas about how best to deal with the issue of statistical testing. For the non-statistician it is very hard to evaluate them and decide on what is the best approach. Is there any empirical evidence about what works best in practice? For example, training people to do analysis in different ways, and then getting them to analyze data with known characteristics. If not why not? It feels like we wouldn’t rely so heavily on theory in e.g. drug development, so why do we in stats?

A: The gist: why do we rely on theory and statistics? Well, we might as well say, why do we rely on theory in mathematics? That’s what it is! You have concrete theories and concrete postulates. Which you don’t have in drug testing, that’s just empirical.

Q: Is there any empirical evidence about what works best in practice, so for example training people to do analysis in different ways? and then getting them to analyze data with known characteristics and if not why not?

A: Why not: because you never actually know unless you’re doing simulations what the answer should be. So no, it’s not known which works best in practice. That being said, simulation is a great way to test out ideas. My 2014 paper used simulation, and it was only in the 2017 paper that the maths behind the 2014 results was worked out. I think you can rely on the fact that a lot of the alternative methods give similar answers. That’s why I felt justified in using rather simple assumptions for mine, because they’re easier to understand and the answers you get don’t differ greatly from much more complicated methods.

In my 2019 paper there’s a comparison of three different methods, all of which assume that it’s reasonable to test a point (or small interval) null-hypothesis (one that says that treatment effect is exactly zero), but given that assumption, all the alternative methods give similar answers within a factor of two or so. A factor of two is all you need: it doesn’t matter if it’s 26% or 52% or 13%, the conclusions in real life are much the same.

So I think you might as well use a simple method. There is an even simpler one than mine actually, proposed by Sellke et al. (2001) that gives a very simple calculation from the p value and that gives a false positive risk of 29 percent when you observe p = 0.05. My method gives 26%, so there’s no essential difference between them. It doesn’t matter which you use really.

Q: The last question gave an example of training people so maybe he was touching on how do we teach people how to analyze their data and interpret it accurately. Reporting effect sizes and confidence intervals alongside p values has been shown to improve interpretation in teaching contexts. I wonder whether in your own experience that you have found that this helps as well? Or can you suggest any ways to help educators, teachers, lecturers, to help the next generation of researchers properly?

A: Yes I think you should always report the observed effect size and confidence limits for it. But be aware that confidence intervals tell you exactly the same thing as p values and therefore they too are very suspect. There’s a simple one-to-one correspondence between p values and confidence limits. So if you use the criterion, “the confidence limits exclude zero difference” to judge whether there’s a real effect you’re making exactly the same mistake as if you use p ≤ 0.05 to to make the judgment. So they they should be given for sure, because they’re sort of familiar but you do need, separately, some sort of a rough estimate of the false positive risk too.

Q: I’m struggling a bit with the “p equals” intuition. How do you decide the band around 0.047 to use for the simulations? Presumably the results are very sensitive to this band. If you are using an exact p value in a calculation rather than a simulation, the probability of exactly that p value to many decimal places will presumably become infinitely small. Any clarification would be appreciated.

A: Yes, that’s not too difficult to deal with: you’ve got to use a band which is wide enough to get a decent number in. But the result is not at all sensitive to that: if you make it wider, you’ll get larger numbers in both numerator and denominator so the result will be much the same. In fact, that’s only a problem if you do it by simulation. If you do it by exact calculation it’s easier. To do a 100,000 or a million t-tests with my R script in simulation, doesn’t take long. But it doesn’t depend at all critically on the width of the interval; and in any case it’s not necessary to do simulations, you can do the exact calculation.

Q: Even if an exact calculation can’t be done—it probably can—you can get a better and better approximation by doing more simulations and using narrower and narrower bands around 0.047?

A: Yes, the larger the number of simulated tests that you do, the more accurate the answer. I did check it with a million occasionally. But once you’ve done the maths you can get exact answers much faster. The slide at 53:17 shows how you do the exact calculation.

• The Student’s t value along the bottom
• Probability density at the side
• The blue line is the distribution you get under the null-hypothesis, with a mean of 0 and a standard deviation of 1 in this case.
• So the red areas are the rejection areas for a t-test.
• The green curve is the t distribution (it’s a non-central t-distribution which is what you need in this case) for the alternative hypothesis.
• The yellow area is the power of the test, which here is 78%
• The orange area is (1 – power) so it’s 22%

The p-less-than calculation considers all values in the red area or in the yellow area as being positives. The p-equals calculation uses not the areas, but the ordinates here, the probability densities. The probability (density) of getting a t value of 2.04 under the null hypothesis is y₀ = 0.053. And the probability (density) under the alternative hypothesis is y₁ = 0.29. It’s true that the probability of getting t = 2.04 exactly is infinitesimally small (the area of an infinitesimally narrow band around t = 2.04) but the ratio if the two infinitesimally small probabilities is perfectly well-define). so for the p-equals approach, the likelihood ratio in favour of the alternative hypothesis would be L₁₀ = y₁ / 2y₀ (the factor of 2 arises because of the two red tails) and that gives you a likelihood ratio of 2.8. That corresponds to an FPR₅₀ of 26% as we explained. That’s exactly what you get from simulation. I hope that was reasonably clear. It may not have been if you aren’t familiar with looking at those sorts of things.

Q: To calculate FPR₅₀ -false positive risk for a 50:50 prior -I need to assume an effect size. Which one do you use in the calculator? Would it make sense to calculate FPR₅₀ for a range of effect sizes?

A: Yes if you use the web calculator or the R scripts then you need to specify what the normalized effect size is. You can use your observed one. If you’re trying to interpret real data, you’ve got an estimated effect size and you can use that. For example when you’ve observed p = 0.05 that corresponds to a likelihood ratio of 2.8 when you use the true effect size (that’s known when you do simulations). All you’ve got is the observed effect size. So they’re not the same of course. But you can easily show with simulations, that if you use the observed effect size in place of the the true effect size (which you don’t generally know) then that likelihood ratio goes up from about 2.8 to 3.6; it’s around 3, either way. You can plug your observed normalised effect size into the calculator and you won’t be led far astray. This shown in section 5 of the 2017 paper (especially section 5.1).

Q: Consider hypothesis H₁ versus H₂ which is the interpretation to go with?

A: Well I’m not quite clear still what the two interpretations the questioner is alluding to are but I shouldn’t rely on the p value. The most natural way to compare two hypotheses is the calculate the likelihood ratio.

You can do a full Bayesian analysis. Some forms of Bayesian analysis can give results that are quite similar to the p values. But that can’t possibly be generally true because are defined differently. Stephen Senn produced an example where there was essentially no problem with p value, but that was for a one-sided test with a fairly bizarre prior distribution.

In general in Bayes, you specify a prior distribution of effect sizes, what you believe before the experiment. Now, unless you have empirical data for what that distribution is, which is very rare indeed, then I just can’t see the justification for that. It’s bad enough making up the probability that there’s a real effect compared with there being no real effect. To make up a whole distribution just seems to be a bit like fantasy.

Mine is simpler because by considering a point null-hypothesis and a point alternative hypothesis, what in general would be called Bayes’ factors become likelihood ratios. Likelihood ratios are much easier to understand than Bayes’ factors because they just give you the relative probability of observing your data under two different hypotheses. This is a special case of Bayes’ theorem. But as I mentioned, any approach to Bayes’ theorem which assumes a point null hypothesis gives pretty similar answers, so it doesn’t really matter which you use.

There was edition of the American Statistician last year which had 44 different contributions about “the world beyond p = 0.05″. I found it a pretty disappointing edition because there was no agreement among people and a lot of people didn’t get around to making any recommendation. They said what was wrong, but didn’t say what you should do in response. The one paper that I did like was the one by Benjamin & Berger. They recommended their false positive risk estimate (as I would call it; they called it something different but that’s what it amounts to) and that’s even simpler to calculate than mine. It’s a little more pessimistic, it can give a bigger false positive risk for a given p value, but apart from that detail, their recommendations are much the same as mine. It doesn’t really matter which you choose.

Q: If people want a procedure that does not too often lead them to draw wrong conclusions, is it fine if they use a p value?

A: No, that maximises your wrong conclusions, among the available methods! The whole point is, that the false positive risk is a lot bigger than the p value under almost all circumstances. Some people refer to this as the p value exaggerating the evidence; but it only does so if you incorrectly interpret the p value as being the probability that you’re wrong. It certainly is not that.

Q: Your thoughts on, there’s lots of recommendations about practical alternatives to p values. Most notably the Nature piece that was published last year—something like 400 signatories—that said that we should retire the p value. Their alternative was to just report effect sizes and confidence intervals. Now you’ve said you’re not against anything that should be standard practice, but I wonder whether this alternative is actually useful, to retire the p value?

A: I don’t think the 400 author piece in Nature recommended ditching p values at all. It recommended ditching the 0.05 threshold, and just stating a p value. That would mean abandoning the term “statistically significant” which is so shockingly misleading for the reasons I’ve been talking about. But it didn’t say that you shouldn’t give p values, and I don’t think it really recommended an alternative. I would be against not giving p values because it’s the p value which enables you to calculate the equivalent false positive risk which would be much harder work if people didn’t give the p value.

If you use the false positive risk, you’ll inevitably get a larger false negative rate. So, if you’re using it to make a decision, other things come into it than the false positive risk and the p value. Namely, the cost of missing an effect which is real (a false negative), and the cost of getting a false positive. They both matter. If you can estimate the costs associated with either of them, then then you can draw some sort of optimal conclusion.

Certainly the costs of getting false positives or rather low for most people. In fact, there may be a great advantage to your career to publish a lot of false positives, unfortunately. This is the problem that the RIOT science club is dealing with I guess.

Q: What about changing the alpha level? To tinker with the alpha level has been popular in the light of the replication crisis, to make it even a more difficult test pass when testing your hypothesis. Some people have said that it should be 0.005 should be the threshold.

A: Daniel Benjamin said that and a lot of other authors. I wrote to them about that and they said that they didn’t really think it was very satisfactory but it would be better than the present practice. They regarded it as a sort of interim thing.

It’s true that you would have fewer false positives if you did that, but it’s a very crude way of treating the false positive risk problem. I would much prefer to make a direct estimate, even though it’s rough, of the false positive risk rather than just crudely reducing to p = 0.005. I do have a long paragraph in one of the papers discussing this particular thing {towards the end of Conclusions in the 2017 paper).

If you were willing to assume a 50:50 prior chance of there being a real effect the p = 0.005 would correspond to FPR50 = 0.034, which sounds satisfactory (from Table, above, or web calculator).

But if, for example, you are testing a hypothesis about teleportation or mind-reading or homeopathy then you probably wouldn’t be willing to give a prior of 50% to that being right before the experiment. If the prior probability of there being a real effect were 0.1, rather than 0.5, the Table above shows that observation of p = 0.005 would suggest, in my example, FPR = 0.24 and a 24% risk of a false positive would still be disastrous. In this case you would have to have observed p = 0.00043 in order to reduce the false positive risk to 0.05.

So no fixed p value threshold will cope adequately with every problem.

Twitter

Tagged P values, seminars, statistics

5 Responses to Why p values can’t tell you what you need to know and what to do about it

Luis O. Romero says:

November 1, 2021 at 22:47

Thank you so much for this post!

Reply
Mircea Zloteanu says:

January 29, 2022 at 23:34

I have nothing but admiration for you as a researcher and an educator. You have forced me to consider my approach to science and knowledge more closely, and allowed me to gain new insights.

Reply
Margaretha Hendrickx says:

December 21, 2023 at 18:27

Have you considered that all this troublesome confusion about what to do with p-values may be correlated with equally confused thinking about the termporal relation between logical and mathematical operations? That is, the question whether it is universally true that logical operations are temporally prior to mathematical operations?By logical operations I mean the art of working with schemas of distinctions in a self/other aware disciplined and organized manner.My very best, Margaretha H.

Reply
- David Colquhoun says:
  
  December 21, 2023 at 21:41
  
  @Margarethe H
  I really have difficulty in understanding what you are proposing. Perhaps that’s because I’m not a professional mathematician. In the sort of problem that I’ve encountered,, you certainly envisage a physical model for the system (in my case, an ion channel) which you’re observing, and then express the model mathematically.
  I don’t see much analogy between such procedures and the Bayes vs frequentist wars. I guess that the physical model casts light on the sort of prior probabilities that a Bayesian might propose. In the case of estimation of a chemical rate constant it seems sensible to use a uniform prior over the range from zero, up to the fastest value that’s physically feasible. On the other hand, if you are looking at the difference between the means of two independent samples,it makes more sense to compare two precise hypotheses, that the true difference is zero, compared with the hypothesis that the true value has its most likely value (that which you observed).
Margaretha Hendrickx says:

January 6, 2024 at 00:24

Hi David, you are pointing out something universal about the human condition about which many are in denial and which is most elegantly expressed in the context of statistics: the cost of making the wrong decision. In decision science, one discovered that some people are very risk averse and become paralyzed just thinking about this possibility. Hence they prefer not to think at all.https://plato.stanford.edu/entries/risk/You are one of the few people who in his own way realizes the significance of this problem.The Academy ought to pay much more attention to this issue at a much more organized manner than it has done until now.Best wishes for the new year.

Reply