It explains it quite clearly and gives a worked example.

]]>I’m very glad that you recovered. But no cancer (if that’s what you had) kills 100% of its victims. You were one of the lucky ones, but it’s most unlikely that your recovery had anything to do with Gerson nonsense or IV vitamin C. There is certainly a financial payoff for the people who sell these things. And of course they’d make much more money if they’d just arrange a proper trial to test their claims. The only reason they don’t is that they must know the chances of passing such a test are very low. Why run the risk of losing your income? It’s really very easy: just produce the evidence and you’d have the world at your feet.

The point that you make about diet is interesting. Please take a look at http://www.dcscience.net/?p=6300. It’s a great deal more complicated than you seem to think. The WCRF report looked at a vast range of foods and none had a large protective effect or a large harmful effect. A few years ago it was fat that killed you. That proved to be wrong, and now the fashionable villain is sugar. That’s probably wrong too. It’s one of the great mysteries: some aspect of lifestyle seems to influence cancer risk, but nobody knows what.

I think you exaggerate what’s known. When you read the original papers, and know enough to understand them, you realise how little is known for certain. It makes me angry that quacks are so willing to prey on the desperate. If they had any conscience they’d do some proper research.

]]>If we see a study result of 9/10, then this study is an element of the set of studies of 10 subjects with an outcome of 9 events, the pooled estimate for them all being approximately 0.9 (or 10/12 ≈ 0.833 after applying Laplace’s correction for small proportions). If this study were repeated with 10 observations, what is the probability that the future result would be 0/10, 1/10, 2/10 and so on up to 10/10? We could estimate the probability of each of these results from the binomial distribution. The probability of ‘replicating’ the result in future with a result of >5/10, based on a past population proportion of approximately 0.9, would be roughly the same as 1 − P, where P is the probability of getting a result of ≥9/10 if the 10 had been selected from a ‘null’ hypothetical population of 0.5.
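The arithmetic described above can be sketched with the Python standard library. All the figures (the 9/10 result, Laplace’s correction, the null of 0.5, the >5/10 replication bar) are taken from the comment itself:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials, success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Laplace's correction applied to the observed 9/10:
p_est = (9 + 1) / (10 + 2)          # 10/12 ≈ 0.833

# Probability of each possible result 0/10 .. 10/10 in a repeat study of 10
probs = [binom_pmf(k, 10, p_est) for k in range(11)]

# Probability of 'replicating' with a result of more than 5/10
p_replicate = sum(probs[6:])        # ≈ 0.985

# One-sided P value: probability of ≥9/10 under a 'null' population of 0.5
p_value = sum(binom_pmf(k, 10, 0.5) for k in range(9, 11))   # 11/1024 ≈ 0.0107

print(round(p_replicate, 3), round(1 - p_value, 3))
```

As the comment suggests, the probability of replication (≈0.985) does come out roughly the same as 1 − P (≈0.989).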

In other words, we could regard the ‘P value’ as approximately the probability of failing to replicate a result greater than the mean in the same direction, given the study values. It must be emphasised that this would be a probability of non-replication using the same numerical study result, NOT the probability of not getting the TRUE result within some bound. This probability would not change if the study seemed perfect in all other respects. However, if there were other contradictory study results, if the methodology was dubious, if the subjects were very different from the reader’s subjects, if the write-up was poor or vague, or if there was a suspicion that unflattering results had been hidden, then the estimated subjective probability of non-replication would become higher and the probability of replication lower. In other words, the ‘P value’ would be an ‘objective’ lower limit on the probability of non-replication, and would be only the first hurdle. If the ‘P value’ was high, then the test of replication would fail at this first ‘objective’ hurdle. Confidence intervals would be interpreted in the same way. The study would have a preliminary 95% or 99% chance of being replicated within the confidence limits, but this probability would fall if imperfections were found in the conduct of the study. It might be interesting to set up a calibration curve of the subjective probability of replication against the actual frequency of replication for a series of studies.
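The calibration curve proposed at the end could be set up along the following lines; the (stated probability, replicated?) pairs below are invented purely for illustration:

```python
# Sketch of the suggested calibration: stated (subjective) probability of
# replication versus the observed frequency of replication, in 0.2-wide bins.
# These (probability, replicated?) pairs are invented purely for illustration.
pairs = [
    (0.95, True), (0.90, True), (0.90, False), (0.80, True),
    (0.70, True), (0.60, False), (0.50, False), (0.30, False),
]

bins = {}                                    # lower bin edge -> (replicated, total)
for stated, replicated in pairs:
    edge = int(stated * 10) // 2 * 2 / 10    # 0.0, 0.2, 0.4, 0.6 or 0.8
    n_rep, n_tot = bins.get(edge, (0, 0))
    bins[edge] = (n_rep + replicated, n_tot + 1)

for edge in sorted(bins):
    n_rep, n_tot = bins[edge]
    print(f"stated {edge:.1f}-{edge + 0.2:.1f}: replicated {n_rep}/{n_tot}")
```

A well-calibrated set of subjective probabilities would show observed frequencies close to the middle of each bin.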

Most people would arrive at their own probability of replication intuitively from the first hurdle of the P value or confidence interval, but those who wished to apply Bayesian (or other) calculations by combining their subjective probabilities with the objective probabilities would be free to do so. However, the true test would be for the study to be repeated independently; if it replicated successfully, the probability of further replication would be far higher next time. None of this claims that an estimate can be made of the probability of the true result, only of the probability of getting the same result again on repeating the study. This is what is in the scientist’s mind when about to publish, and in every reader’s mind later. We could set the first hurdle much higher, say at a confidence interval of at least 99% or a ‘P value’ of no more than 1%, to minimise possible disillusionment. Also, instead of counting as replication any difference between the two treatments of just over 50%, the bar could be set higher.

]]>If in a cross-over study 9/10 patients were found to be better on drug ‘A’, this could be a chance selection of 10 subjects from a population where only 50% of the total were actually better on drug A (the null hypothesis). The probability of selecting 9/10 under these circumstances would be very small. If 9/10 was the actual (positive) study result, then the results __without__ the actual (positive) result, and thus ‘negative’, would be 0/10, 1/10, 2/10 and so on up to 10/10, but excluding 9/10. Similarly, a population __without__ 50% being better on drug A would be any from 0% to 100%, excluding 50%. Analysing a result of ‘9/10 and not 9/10’ together with a population of ‘50% better’ and ‘not 50% better’ using Bayes’ theorem would be valid but very difficult, and would tell us little in the end. Instead, we currently use a result of 9/10 combined with more extreme hypothetical results (e.g. 10/10 in this example) to calculate a ‘P value’, which we then use as an arbitrary index of reproducibility.

I agree with you; there surely has to be a better way in order to reduce the hazards of significance testing.

]]>“calculation of the ‘false discovery rate’ using Bayes rule cannot be applied to ‘P’ value outputs as ‘specificities’ and ‘power’ calculation outputs as ‘sensitivities’”

I don’t know why you say that. The examples that I give are based on simple counting, and I can’t see any part of them that’s contentious.

I carefully avoided using the term “Bayes” because that is associated with subjective probabilities and a great deal of argument.

I liked Stephen Senn’s first comment on Twitter (the Twitter stream is storified here). He said “I may have to write a paper ‘You may believe you are NOT a Bayesian but you’re wrong’”. I maintain that the analysis here may bear a formal similarity to a Bayesian argument, but it is free of the more contentious parts of the Bayesian approach. The arguments that I have used contain no subjective probabilities, and are an application of obvious rules of conditional probability.

The classical example of Bayesian argument is the assessment of the evidence of the hypothesis that the earth goes round the sun. The probability of this hypothesis being true, given some data, must be subjective since it’s not possible to imagine a population of solar systems, some of which are heliocentric and some of which are not.

I argue that the problem of testing a series of drugs to see whether or not their effects differ from a control group is quite different. It’s easy to imagine a large number of candidate drugs, some of which are active (a fraction *P*(*real*), say) and some of which aren’t. So the prevalence (or prior, if you must) is a perfectly well-defined probability, which could be determined with sufficient effort. If you test one drug at random, the probability of it being active is *P*(*real*). It’s no different from the probability of picking a black ball from an urn that contains a fraction *P*(*real*) of black balls, to use the statisticians’ favourite example.
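The urn argument can be put into numbers by simple counting. The prevalence, significance level and power below are illustrative assumptions, not figures from the comment:

```python
# False discovery rate when screening many candidate drugs, by simple counting.
# The prevalence, significance level and power are illustrative assumptions.
p_real = 0.10   # fraction of candidate drugs that are truly active
alpha  = 0.05   # significance level: false positive rate among inactive drugs
power  = 0.80   # probability that a truly active drug gives a 'significant' result

true_pos  = p_real * power          # active drugs correctly declared 'significant'
false_pos = (1 - p_real) * alpha    # inactive drugs wrongly declared 'significant'

false_discovery_rate = false_pos / (true_pos + false_pos)
print(round(false_discovery_rate, 2))   # 0.36: over a third of 'discoveries' are false
```

With these assumed numbers, over a third of ‘significant’ results would be false discoveries, even though every individual test used P < 0.05.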

P values and confidence intervals relate to the means of a group of measurements or proportions. The ‘likelihoods’ in these cases may be probability densities. However, in the case of P values, the ‘likelihood’ represented by P is not about the result actually seen but about the result actually seen OR other more extreme hypothetical results that were NOT seen. The P value is thus the probability of one standardised hypothetical finding conditional on another standardised hypothetical finding. This provides a consistent index of reproducibility. Furthermore, when we calculate ‘power’ we base it on a hypothetical estimated finite ‘real’ value H_R. But this finite value H_R is not the complement of the finite value H_0 (i.e. p(H_0) ≠ 1 − p(H_R)), and so we cannot apply Bayes’ rule to it, which is:

p(H_0 | V) = [p(H_0) × p(V | H_0)] / [p(H_0) × p(V | H_0) + p(¬H_0) × p(V | ¬H_0)]

where ¬H_0 is the complement of H_0, so that p(¬H_0) = 1 − p(H_0).

I think therefore that a calculation of the ‘false discovery rate’ using Bayes rule cannot be applied to ‘P’ value outputs as ‘specificities’ and ‘power’ calculation outputs as ‘sensitivities’ unless I have misunderstood the reasoning.

However, I do agree that a 5% P value, or a 95% confidence interval that only just excludes a difference of zero, does not suggest a high probability of replication (i.e. a low false discovery rate when repeated by others). This is because repeat results of barely more than a zero difference would not be regarded as ‘replication’. So how similar must a result be to the original to ‘replicate’ it? The Bayesians may of course incorporate their own subjective priors. We would also consider other factors (e.g. the soundness of the methods, the clarity of the write-up, and the absence of hidden biases such as selective reporting, where the less flattering studies are omitted) when estimating the probability of replication (i.e. the false discovery rate when readers repeat the study).

]]>This discussion keeps raising important issues. A common cause of cognitive impairment is ‘vascular dementia’ due to ‘mini-strokes’, suggested by ‘scarring’ on a CT scan in someone with cognitive impairment (indicated by symptoms, e.g. being unable to find the way home, or by a poor score on a questionnaire test). Unlike AD, its risk or progression can be reduced by anti-platelet therapy (e.g. aspirin) and statins. This is an example of how ‘stratifying’ a diagnosis (e.g. ‘dementia’ into AD, vascular dementia, etc.) advances knowledge. The NNT for one patient to benefit from such treatment will vary.

A high NNT is now being called ‘over-treatment’ due to ‘over-diagnosis’. The finger of blame is pointed at pharmaceutical companies trying to maximise sales, and at their corrupting influence on scientists and doctors. However, I think that much of the blame lies with ignorance, e.g. the imposition of over-simplified cut-off points, resulting in large numbers of patients being given an unnecessary diagnosis and suffering adverse effects through over-treatment. I get blank looks and apparent incomprehension when I try to explain this to representatives from NICE, statisticians, evidence-based-medicine advocates and journal editors, who have been advocating these methods for decades and are perhaps reluctant to admit they were wrong all along.

For example, I have shown that when screening diabetics for micro-albuminuria, the number needed to treat (NNT) for one patient to benefit when the albumin excretion rate (AER) is 19 mcg/min (below 20, so ‘normal’ and not treated) is obviously similar to when it is 21 (above 20, so ‘abnormal’ and treated). The NNT is about 200 to prevent one patient progressing to nephropathy (an AER of greater than 200 mcg/min) within 2 years! The risk of nephropathy is about 1/200 until the AER rises above 40, when the risk begins to rise. At present about 1/3 of the patients diagnosed with ‘micro-albuminuria’ and being treated have an AER between 20 and 40 mcg/min. See: Llewelyn DEH, Garcia-Puig J. How different urinary albumin excretion rates can predict progression to nephropathy and the effect of treatment in hypertensive diabetics. J Renin Angiotensin Aldosterone System 2004;5:141- and http://www.bmj.com/content/344/bmj.e1238?sso=.
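As a rough check on that figure: the NNT is the reciprocal of the absolute risk reduction, so with a 2-year baseline risk of about 1/200 (as stated above), an NNT of about 200 follows even on the generous assumption that treatment prevents every single case:

```python
# NNT is the reciprocal of the absolute risk reduction (ARR).
# The 2-year baseline risk of nephropathy of about 1/200 is from the text;
# assuming (generously, for illustration) that treatment prevents every case:
risk_untreated = 1 / 200
risk_treated   = 0.0
arr = risk_untreated - risk_treated   # absolute risk reduction
nnt = 1 / arr
print(round(nnt))   # 200: about 200 patients treated for one to benefit
```

Any less-than-perfect treatment effect would make the NNT higher still.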

I don’t think that I was being at all complacent. I said that confidence intervals don’t tell you your false discovery rate. But I really like the way you put it.

]]>Actually, in the example I give, 98% of those who test negative get the correct diagnosis (since 99% really are negative, that’s not surprising). But if only 14% of those who test positive are actually ill, and they are not warned that they have an 86% chance of being fine, that’s an awful lot of people who will be terrified unnecessarily. That seems rather cruel to me.
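A figure like the 14% quoted here can be reproduced by simple counting. The 1% prevalence is from the example above; the sensitivity and specificity used below are assumptions chosen for illustration:

```python
# Positive predictive value of a screening test, by counting 10,000 people.
# The 1% prevalence is from the example above; the sensitivity (0.80) and
# specificity (0.95) are assumptions chosen for illustration.
n = 10_000
prevalence, sensitivity, specificity = 0.01, 0.80, 0.95

ill  = n * prevalence             # 100 people really ill
well = n * (1 - prevalence)       # 9,900 people really well

true_pos  = ill * sensitivity           # 80 ill people test positive
false_pos = well * (1 - specificity)    # ≈ 495 well people also test positive

ppv = true_pos / (true_pos + false_pos)
print(round(ppv, 2))   # 0.14: only ~14% of positives are actually ill
```

Even with a fairly accurate test, the rarity of the condition means most positives are false: the other ~86% are fine.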

]]>Actually, if you try the SAGE test, you will see that a poor score on it can’t even be classed as a clinical ‘sign’ like high blood pressure (something informative, but not evident to the patient). Consistently poor scores would reveal a degree of impairment that is almost certainly already evident to the subject – in other words, a symptom. The subject may be in some sort of denial about this, but whatever the cause (stress, drugs, visual or linguistic issues, infarcts, dementia, etc.) it is something they should confront constructively with doctors and family. A ‘false positive’ is not someone who actually doesn’t have a problem, but someone whose problem doesn’t match whatever it is being correlated with. It is of course quite wrong to bill such a failed test as a diagnosis or definite predictor of Alzheimer’s. No scientist or medic is doing this. But it is scary if people who would fail this test are not facing up to their problem: possibly even endangering others in cognitively challenging situations like driving. There is a lot more to this than just whether there is an effective medical treatment available for the worst possible condition (maybe AD) that may be causing poor test performance.

]]>“In other words, the number needed to treat for one to benefit is probably surprisingly high in modern medicine”

Indeed it is. That’s something I go on about a lot. Statins terrible, analgesics poor, and “cough medicines” and “tonics” that don’t work at all. Regression to the mean is not just a problem for quacks.
