
Jump to follow-up

On Sunday 23 September, we recorded an interview with Rosi Sexton. Ever since I got to know her, I’ve been impressed by her polymathy. She’s a musician, a mathematician and a champion athlete, and now an osteopath: certainly an unusual combination. You can read about her on her Wikipedia page: https://en.wikipedia.org/wiki/Rosi_Sexton.

The video is long and wide-ranging, so I’ll give some bookmarks, in case you don’t want to watch it all. (And please excuse my garish London marathon track suit.)

Rosi recently started to take piano lessons again, after a 20-year break. She plays Chopin in the introduction, and Prokofiev and Schubert at 17:37 – 20:08. They are astonishingly good, given the time that’s elapsed since she last played seriously.

We started corresponding in 2011, about questions concerning evidence and alternative medicine as well as sports. Later we talked about statistics too: her help is acknowledged in my 2017 paper about p values. And discussions with her gave rise to the slide at 26:00 in my video on that topic.

Rosi’s accomplishments in MMA have been very well-documented and my aim was to concentrate on her other achievements. Nonetheless we inevitably had to explore the reasons why a first class mathematician chose to spend 14 years of her life in such a hard sport. I’m all for people taking risks if they want to. I have more sympathy for her choice than many of my friends, having myself spent time doing boxing, rugby, flying, sailing, long distance running, and mountain walking. I know how they can provide a real relief from the pressures of work.

The interview starts by discussing when she started music (piano, age 6) and how she became interested in maths. In her teens, she was doing some quite advanced maths: she relates later (at 1:22:50) how she took on holiday some of Raymond Smullyan’s books on mathematical logic at the age of 15 or 16. She was also playing the piano and the cello in the Reading Youth Orchestra, and became an Associate of the London College of music at 17. And at 14 she started Taekwondo, which she found helpful in dealing with teenage demons.

She was so good at maths that she was accepted at Trinity College, Cambridge where she graduated with 1st class hons. And then went on to a PhD, at Manchester. It was during her PhD that she became interested in MMA. We talk at 23:50 about why she abandoned maths (there’s a glimpse of some of her maths at 24:31), and devoted herself to MMA until she retired from that in 2014. In the meantime she took her fifth degree, in osteopathy, in 2010. She talks about some of her teenage demons at 28:00.

Many of my sceptical friends regard all osteopaths as quacks. Some certainly are. I asked Rosi about this at 38:40 and her responses can’t be faulted. She agrees that it’s rarely possible to know whether the treatments she uses are effective or whether the patient would have improved anyway. She understands regression to the mean. We discussed the problem of responders and non-responders. She appreciates that it’s generally not possible to tell whether or not they exist (for more on this, see Stephen Senn’s work). Even the best RCT tells us only about the average response. Not all osteopaths are the same.

We talk about the problems of doping and of trans competitors in sports at 49:30, and about the perception of contact sports at 59:32. Personally I have no problem with people competing in MMA, boxing or rugby, if that’s what they want to do. Combat sports are the civilised alternative to war. It isn’t the competitors that I worry about, it’s the fans.

At 1:14:28 we discussed how little is known about the long-term dangers of contact sports. The possible dangers of concussion led to a discussion of Russell’s paradox at 1:20:40.

I asked why she’s reluctant to criticise publicly things like acupuncture or “craniosacral therapy” (at 1:25:00). I found her answers quite convincing.

At 1:43:50, there’s a clip taken from a BBC documentary of Rosi’s father speaking about his daughter’s accomplishments, her perfectionism and her search for happiness.

Lastly, at 1:45:27, there’s a section headed “A happy new beginning”. It documents Rosi’s 40th birthday treat, when she with her new partner, Stephen Caudwell, climbed the highest climbing wall in the world, the Luzzone dam. After they walked down at the end of the climb, they got engaged.

I wish them both a very happy future.

Postscript. Rosi now runs the Combat Sports Clinic. They have recently produced a video about neck strength training, designed to help people who do contact sports: things like rugby, boxing, muay thai and MMA. I’ve seen only the preview, but there is certainly nothing quackish about it. It’s about strength training.

Jump to follow-up

If you are not a pharmacologist or physiologist, you may never have heard of Bernard Ginsborg. I first met him in 1960. He was a huge influence on me and a great friend. I’m publishing this here because the Physiological Society has published only a brief obituary.

Bernard & Andy

Bernard with his wife, Andy (Andrina).

You can download the following documents.

I’ll post my own recollections of Bernard here.

Bernard Ginsborg was a lecturer in the Pharmacology department in Edinburgh when I joined that department in 1960, as a PhD student.

I recall vividly our first meeting in the communal tea room: smallish in stature, large beard and umbrella. My first reaction was ‘is this chap ever going to stop talking?’. My second reaction followed quickly: this chap has an intellect like nobody I’d encountered before.

I’d been invited to Edinburgh by Walter Perry, who had been external examiner for my first degrees in Leeds. In my 3rd year viva, he’d asked me to explain the difference between confidence limits and fiducial limits. Of course I couldn’t answer, and spent much of my 4th year trying to find out. I didn’t succeed, but produced a paper that must have impressed him. He and W.E. Brocklehurst were my PhD supervisors. I saw Perry only when he dropped into my lab for a cigarette between committee meetings, but he treated me very well. He got me a Scottish Hospitals Endowment Trust scholarship which paid twice the usual MRC salary for a PhD student, and he made me an honorary lecturer so that I could join the magnificent staff club on Chambers Street (now gone), where I met, among many others, Peter Higgs, of boson fame.

I very soon came to wish that Bernard was my supervisor rather than Perry. I loved his quantitative approach. A physicist was more appealing to me than a medic.  We spent a lot of time talking and I learnt a huge amount from him.  I had encountered some of Bernard Katz’s papers in my 4th undergraduate year, and realised they were something special, but I didn’t know enough about electrophysiology to appreciate them fully. Bernard explained it all to me.  His 1967 review, Ion movements in junctional transmission, is a classic: still worth reading by anyone interested in electrophysiology. Bernard’s mathematical ability was invaluable to me when, during my PhD, I was wrestling with the equations for diffusion in a cylinder with binding (see appendix here).

The discussions in the communal tea room were hugely educational. Dick Barlow and R.P. Stephenson were especially interesting, I soon came to realise that Bernard had a better grasp on quantitative ideas about receptors than either of them. His use of Laplace transforms to solve simultaneous differential equations in a 1974 paper was my first introduction to them, and that proved very useful to me later. Those discussions laid the ground for a life-long interest in the topic for me.

After I left the pharmacology department in 1964, contact became more intermittent for a while. I recall vividly a meeting held in Crans sur Sierre, Switzerland in 1977. The meetings there were good, despite having started as golfing holidays for J. Murdoch Ritchie and Joe Hoffman. There was a certain amount of tension between Bernard and Charles F Stevens, the famous US electrophysiologist. Alan Hawkes and I had just postulated that the unitary event in ion channel opening at the neuromuscular junction was a short burst of openings rather than a single opening. This contradicted the postulate by Anderson & Stevens (1973) that binding of the agonist was very fast compared with the channel opening and shutting. At the time our argument was theoretical: it wasn’t confirmed experimentally until the early 80s. Bernard was chairing a session and he tried repeatedly to get Stevens to express an opinion on our ideas, but failed.

At dinner, Stevens was holding court: he expressed the view that rich people shouldn’t pay tax because there were too few of them and it cost more to get them to pay up than it was worth. He sat back to wait for the angry protestations of the rest of the people at the table. He hadn’t reckoned with Bernard, who said how much he agreed, and that, by the same token, the police shouldn’t waste time trying to catch murderers: there were too few of them and it wasted too much police time. The argument was put eloquently as only Bernard could do. Stevens, who I suspect had not met Bernard before, was uncharacteristically speechless. He had no idea what had hit him. It was a moment to savour.

M+BLG 1977
May 1977, Crans sur Sierre, Switzerland.

For those who knew Bernard, it was another example of his ability to argue eloquently for any proposition whatsoever. I’d been impressed by his speech on how the best way to teach about drugs was to teach them in alphabetical order: it would make as much sense as any other way of categorising them. Usually there was just enough truth in these propositions to make the listener who hadn’t heard him in action before wonder, for a moment, whether he was serious. The picture shows him with my wife, Margaret, at the top of a mountain during a break in the meeting. He’d walked up, putting those of us who’d taken the train to shame.

In 1982, Alan Hawkes and I published a paper with the title “On the stochastic properties of bursts of single ion channel openings and of clusters of bursts.” It was 59 pages long with over 400 equations, most of which used matrix notation. After it had been accepted, I discovered that Bernard had taken on the heroic job of reviewing it. This came to light when I got a letter from him that consisted of two matrices which, when multiplied out, revealed his role.

For many years Bernard invited me to Edinburgh to talk to undergraduates about receptors and electrophysiology. (I’ve often wondered if that’s why more of our postdocs came from Glasgow than from Edinburgh during that time.) It was on one such visit in 1984 that I got a phone call to say that my wife, Margaret, had collapsed on the railway station at Walton-on-Thames while 6 months pregnant, and had been taken to St Peter’s Hospital in Chertsey. The psychodrama of our son’s birth has been documented elsewhere. A year later we came to Edinburgh once again. The pictures taken then show Bernard looking impishly happy, as he very often did, in his Edinburgh house in Magdala Crescent. The high rooms were lined with books, all of which he seemed to have read. His intellect was simply dazzling.

BLG 1985


 December 19th 1985. Magdala Crescent, Edinburgh

The following spring we visited again, this time with our son Andrew, aged around 15 months. We went with Bernard and Andy to the Edinburgh Botanic gardens. Andrew, who was still not walking, crawled away rapidly up a grassy slope. Andy said, “Don’t worry, when he gets to the top he’ll stop and look back for you.” She was a child psychologist, so we believed her. Andrew, however, disappeared from sight over the brow of the hill.

During these visits, we stayed with Bernard and Andy at their Edinburgh house.

The experience of staying with them was like being exposed to an effervescent intellectual fountain. It’s hard to think of a better matched couple.

After Bernard retired in 1985, he took no further interest in science.  For him, it was a chance to spend time on his many other interests.  After he went to live in France, contact became more intermittent. Occasional emails were exchanged.  It was devastating to hear about the death of Andy in 2013.  The last time that I saw both of them was in 2008, at John Kelly’s house.  He was barely changed from the day that I met him in 1960.

Bernard was a legend.  It’s hard to believe that he’s no longer here.

Kelly's 2008

BLG+cat 2008
Bernard in 2008 at John Kelly’s house.

Lastly, here is a picture taken at the 2009 meeting of the British Pharmacological Society, held in Edinburgh.

At the British Pharm. Soc. meeting, 2009. Left to right: DC, BLG, John Kelly, Mark Evans, Anthony Harmer



Jump to follow-up

On Monday evening (8th January 2018), I got an email from Ben van der Merwe, a UCL student who works as a reporter for the student newspaper, London Student.  He said

“Our investigation has found a ring of academic psychologists associated with Richard Lynn’s journal Mankind Quarterly to be holding annual conferences at UCL. This includes the UCL psychologist professor James Thompson”.

He asked me for comment about the “London Conference on Intelligence”. His piece came out on Wednesday 10th January. It was a superb piece of investigative journalism.  On the same day, Private Eye published a report on the same topic.

I had never heard about this conference, but it quickly became apparent that it was a forum for old-fashioned eugenicists of the worst kind.  Perhaps it isn’t surprising that neither I, nor anyone else at UCL that I’ve spoken to had heard of these conferences because they were surrounded by secrecy.  According to the Private Eye report:

“Attendees were only told the venue at the last minute and asked not to share the information"

The conference appears to have been held at least twice before. The programmes for the 2015 conference [download pdf] and the 2016 conference [download pdf] are now available, but weren’t public at the time.   They have the official UCL logo across the top despite the fact that Thompson has been only an honorary lecturer since 2007.

LCI header

A room was booked for the conference through UCL’s external room booking service. The abstracts are written in the style of a regular conference. It’s possible that someone with no knowledge of genetics (as is likely to be the case for room-booking staff) might not have spotted the problem.

The huge problems are illustrated by the London Student piece, which identifies many close connections between conference speakers and far-right, and neo-nazi hate groups.

"[James Thompson’s] political leanings are betrayed by his public Twitter account, where he follows prominent white supremacists including Richard Spencer (who follows him back), Virginia Dare, American Renaissance, Brett Stevens, the Traditional Britain Group, Charles Murray and Jared Taylor.”

“Thompson is a frequent contributor to the Unz Review, which has been described as “a mix of far-right and far-left anti-Semitic crackpottery,” and features articles such as ‘America’s Jews are Driving America’s Wars’ and ‘What to do with Latinos?’.

His own articles include frequent defences of the idea that women are innately less intelligent than men (1, 2, 3,and 4), and an analysis of the racial wage gap which concludes that “some ethnicities contribute relatively little,” namely “blacks.”

“By far the most disturbing of part of Kirkegaard’s internet presence, however, is a blog-post in which he justifies child rape. He states that a ‘compromise’ with paedophiles could be:

“having sex with a sleeping child without them knowing it (so, using sleeping medicine. If they don’t notice it is difficult to see how they cud be harmed, even if it is rape. One must distinguish between rape becus the other was disconsenting (wanting to not have sex), and rape becus the other is not consenting, but not disconsenting either.”

The UCL Students’ Union paper, Cheesegrater, lists some of James Thompson’s tweets, including some about brain size in women.

It’s interesting that these came to light on the same day that I learned that the first person to show that there was NO correlation  between brain size and intelligence was Dr Alice Lee, in 1901 [download pdf].
Alice Lee was the first woman to get a PhD in mathematics from UCL and she was working in the Galton laboratory, under Karl Pearson. Pearson was a great statistician but also a eugenicist.  It was good to learn that he supported women in science at a time when that was almost unknown.

What’s been done so far?

After I’d warned UCL of the impending scandal, they had time to do some preliminary investigation. An official UCL announcement appeared on the same day (10 Jan, 2018) as the articles were published.

“Our records indicate the university was not informed in advance about the speakers and content of the conference series, as it should have been for the event to be allowed to go ahead”

“We are an institution that is committed to free speech but also to combatting racism and sexism in all forms.”

"We have suspended approval for any further conferences of this nature by the honorary lecturer and speakers pending our investigation into the case."

That is about as good as can be expected. It remains to be seen why the true nature of the conferences was not spotted, and why someone like James Thompson was an honorary senior lecturer at UCL. Watch this space.

How did it happen?

Two videos that feature Thompson are easily found. One, from 2010, is on the UCLTV channel. And in March 2011, a BBC World News video featured Thompson.

But both of these videos are about his views on disaster psychology (Chilean miners, and Japanese earthquake, respectively). Neither gives any hint of his extremist political views. To discover them you’d have to delve into his twitter account (@JamesPsychol) or his writings on the unz site.  It’s not surprising that they were missed.

I hope we’ll know more soon about how these meetings slipped under the radar.  Until recently, they were very secret.  But then six videos of talks at the 2017 meeting were posted on the web, by the organisers themselves. Perhaps they were emboldened by the presence of an apologist for neo-nazis in the White House, and by the government’s support for Toby Young, who wrote in support of eugenics. The swing towards far-right views in the UK, in the USA and in Poland, Hungary and Turkey, has seen a return to public discussions of views that have been thought unspeakable since the 1930s. See, for example, this discussion of eugenics by Spectator editor Fraser Nelson with Toby Young, under the alarming heading "Eugenics is back".

The London Conference on Intelligence channel used the UCL logo, and it was still public on 10th January. It had only 49 subscribers. By 13th January it had been taken down (apparently by its authors). But it still has a private playlist with four videos which have been viewed only 36 times (some of those views were mine). Before it vanished, I made a copy of Emil Kirkegaard’s talk, for the record.

youtube channel

Freedom of speech

Incidents like this pose difficult problems, especially given UCL’s past history. Galton and Pearson supported the idea of eugenics at the beginning of the 20th century, as did George Bernard Shaw. But modern geneticists at the Galton lab have been at the forefront in showing that these early ideas were simply wrong.

UCL has, in the past, rented rooms for conferences of homeopaths. Their ideas are deluded and sometimes dangerous, but not illegal. I don’t think they should be arrested, but I’d much prefer that their conferences were not at UCL.

A more serious case occurred on 26 February 2008. The student Islamic Society invited  representatives of the radical Islamic creationist, Adnan Oktar, to speak at UCL. They were crowing that the talk would be held in the Darwin lecture theatre (built in the place formerly occupied by Charles Darwin’s house on Gower Street). In the end, the talk was allowed to go ahead, but it was moved by the then provost to the Gustave Tuck lecture theatre, which is much smaller, and which was built from a donation by the former president of the Jewish Historical Society. See more accounts here, here and here. It isn’t known what was said, so there is no way to tell whether it was illegal, or just batty.

It is very hard to draw the line between hate talk and freedom of speech. There was probably nothing illegal about what was said at the Intelligence Conferences. It was just bad science, used to promote deeply distasteful ideas.

Although, in principle, renting a room doesn’t imply any endorsement, in practice all crackpot organisations love to use the name of UCL to promote their cause. That alone is sufficient reason to tell these people to find somewhere else to promote their ideas.

Follow up in the media

For a day or two the media were full of the story. It was reported, for example, in the Guardian and elsewhere. Click to play the interview.

The real story

Recently some people have demanded that the names of Galton and Pearson should be expunged from UCL.

There would be a case for that if their 19th-century eugenic ideas were still celebrated, just as there is a case for removing statues that celebrate Confederate generals in the southern USA. Their ideas about measurement and statistics are justly celebrated; their ideas about eugenics are not.

On the contrary, it is modern genetics, done in part by people in the Galton lab, that has shown the wrongness of 19th century views on race. If you want to know the current views of the Galton lab, try these.  They could not be further from Thompson’s secretive pseudoscience.

Steve Jones’ 2015 lecture “Nature, nurture or neither: the view from the genes”,

or “A matter of life and death: To condemn the study of complex genetic issues as eugenics is to wriggle out of an essential debate".

Or check the writing of UCL alumnus, Adam Rutherford: “Why race is not a thing, according to genetics”,

or, from Rutherford’s 2017 article

“We’ve known for many years that genetics has profoundly undermined the concept of race”

“more and more these days, racists and neo-Nazis are turning to consumer genetics to attempt to prove their racial purity and superiority. They fail, and will always fail, because no one is pure anything.”

“the science that Galton founded in order to demonstrate racial hierarchies had done precisely the opposite”

Or read this terrific account of current views by Jacob A Tennessen “Consider the armadillos".

These are accounts of what geneticists now think. Science has shown that views expressed at the London Intelligence Conference are those of a very small lunatic fringe of pseudo-scientists. But they are already being exploited by far-right politicians.

It would not be safe to ignore them.


15 January 2018. The involvement of Toby Young

The day after this was posted, my attention was drawn to a 2018 article by the notorious Toby Young. In it he confirms the secretiveness of the conference organisers.

“I discovered just how cautious scholars in this field can be when I was invited to attend a two-day conference on intelligence at University College London by the academic and journalist James Thompson earlier this year. Attendees were only told the venue at the last minute – an anonymous antechamber at the end of a long corridor called ‘Lecture Room 22’ – and asked not to share the information with anyone else.”

More importantly, it shows that Toby Young has failed utterly to grasp the science.

“You really have to be pretty stubborn to dispute that general cognitive ability is at least partly genetically based.”

There is nobody who denies this. The point is that the interaction of nature and nurture is far more subtle than Young believes, and that makes attempts to separate them quantitatively futile. He really should educate himself by looking at the accounts listed above (under The real story).

16 January 2018. How UCL has faced its history

Before the current row about the “London Intelligence Conference”, UCL had faced up frankly to its role in the development of eugenics. That role began at the height of Empire, in the 19th century, and continued into the early part of the 20th century. The word “eugenics” has not been used at UCL since it fell into the gravest disrepute in the 1930s. Not, that is, until James Thompson and Toby Young brought it back. The history has been related by the curator and science historian Subhadra Das. You can read about it, and listen to episodes of her podcast, at “Bricks + Mortals, A history of eugenics told through buildings“. Or you can listen to her whole podcast.

Although Subhadra Das describes Galton as “the Victorian scientist that you’ve never heard of”, I was certainly well aware of his ideas before I first came to UCL (in 1964). But at that time, I thought of Karl Pearson only as a statistician, and I doubt if I’d even heard of Flinders Petrie. Learning about their roles was a revelation.

17 January 2018.

Prof Semir Zeki has pointed out to me that it’s not strictly true to say “the word ‘eugenics’ has not been used at UCL since it fell into the gravest disrepute in the 1930s”. It’s true to say that nobody advocated it, but the chair of Eugenics was not renamed the chair of Human Genetics until 1963. This certainly didn’t imply approval. Zeki tells me that its holder, Lionel Penrose, mentioned his distaste for the title, saying that it was a hangover from the past and should be changed.

Jump to follow-up

Today we went to see the film Goodbye Christopher Robin. It was very good. Like most people, I read the Pooh books as a child.

Image from Wikipedia

I got interested in their author, A.A. Milne, when I discovered that he’d done a mathematics degree at Cambridge. So had my scientific hero, A.V. Hill, and (through Twitter) I met AV’s granddaughter, Alison Hill. I learned that AV loved to quote A.A. Milne’s poem, OBE.


I know a Captain of Industry,
Who made big bombs for the R.F.C.,
And collared a lot of £ s. d.–
And he–thank God!–has the O.B.E.

I know a Lady of Pedigree,
Who asked some soldiers out to tea,
And said “Dear me!” and “Yes, I see”–
And she–thank God!–has the O.B.E.

I know a fellow of twenty-three,
Who got a job with a fat M.P.–
(Not caring much for the Infantry.)
And he–thank God!–has the O.B.E.

I had a friend; a friend, and he
Just held the line for you and me,
And kept the Germans from the sea,
And died–without the O.B.E.
Thank God!
He died without the O.B.E.

This poem clearly reflects Milne’s experience in WW1. He was at the Battle of the Somme, despite describing himself as a pacifist.  In the film he’s portrayed as suffering from PTSD (shell shock as it used to be called).  The sound of a balloon popping could trigger a crisis.  He was from a wealthy background. He, and his wife Daphne, employed a nanny and maid.

Milne’s first book for children, When We Were Very Young, came out in 1924, when his son, Christopher Robin, was four. The nanny is, in some ways, the hero of the film. It was she, not his parents, who looked after Christopher Robin, and the child loved her. In contrast, his parents were distant and uncommunicative.

By today’s standards, Christopher Robin’s upbringing looks almost like child neglect. One can only speculate about how much his father’s PTSD was to blame for this. But his mother had no such excuse.  It seems likely to me that part of the blame attaches to the fact that Milne was brought up as an “English gentleman”.  Looking after children was a job for nannies, not parents.   Milne went to a private school (Westminster), and Christopher Robin was sent to private schools. At 13 he was sent away from his parents, to Stowe school, where he suffered a lot of bullying.  That is a problem that’s endemic and it’s particularly bad in private boarding schools.

I have seen it at first hand. I went to what was known at the time as a direct grant school, and I was a day boy.  But the school did its best to ape a private school. It was a cold and cruel place. Once, I came off my bike and went head first into a sandstone wall.  While recovering in the matron’s room I looked at some of the books there. They were mostly ancient boys’ stories that lauded the virtues of the British Empire. Even at 13, I was horrified.

After he reached the age of 9, Christopher Robin resented increasingly what he came to see as his parents’ exploitation of his childhood.  After WW2, Christopher Robin got married but his parents didn’t approve of his choice. He became estranged from his parents, and went to run a bookshop in Dartmouth (Devon).  Once his father died, he did not see his mother during the 15 years that passed before her death. Even when she was on her deathbed, she refused to see her son. 

It’s a sad story, and the film conveys that well.  I wonder whether it might have been different if it were not for the horrors of WW1 and the horrors of the upbringing of English gentlemen.

It would be good to think that things were better now. They are better, but the old problems haven’t vanished. The UK is still ruled largely by graduates from Oxford and Cambridge. They take mostly white kids from expensive private schools. These institutions specialise in giving people confidence that exceeds their abilities.   Now the UK is mocked across the world for its refusal to modernise and for the delusions of empire that are brexit. The New York Times commented

if what the Brexiteers want is to return Britain to a utopia they have devised by splicing a few rose-tinted memories of the 1950s together with an understanding of imperial history derived largely from images on vintage biscuit tins,

Just look at this recent New Yorker cover. Look at Jacob Rees-Mogg. And look at Brexit.




Jump to follow-up

This piece is almost identical with today’s Spectator Health article.

This week there has been enormously wide coverage in the press for one of the worst papers on acupuncture that I’ve come across. As so often, the paper showed the opposite of what its title and press release claimed. For another stunning example of this sleight of hand, try Acupuncturists show that acupuncture doesn’t work, but conclude the opposite: journal fails, about a paper published in the British Journal of General Practice.

Presumably the wide coverage was a result of the hyped-up press release issued by the journal, BMJ Acupuncture in Medicine. That is not the British Medical Journal of course, but it is, bafflingly, published by the BMJ Press group, and if you subscribe to press releases from the real BMJ, you also get them from Acupuncture in Medicine. The BMJ group should not be mixing up press releases about real medicine with press releases about quackery. There seems to be something about quackery that’s clickbait for the mainstream media.

As so often, the press release was shockingly misleading: It said

Acupuncture may alleviate babies’ excessive crying

Needling twice weekly for 2 weeks reduced crying time significantly

This is totally untrue. Here’s why.

Luckily the Science Media Centre was on the case quickly: read their assessment.

The paper made the most elementary of all statistical mistakes. It failed to make allowance for the jelly bean problem.

The paper lists 24 different tests of statistical significance and focusses attention on three that happen to give a P value (just) less than 0.05, and so were declared to be "statistically significant". If you do enough tests, some are bound to come out “statistically significant” by chance. They are false positives, and the conclusions are as meaningless as “green jelly beans cause acne” in the cartoon. This is called P-hacking and it’s a well-known cause of problems. It was evidently beyond the wit of the referees to notice this naive mistake. It’s very doubtful whether there is anything happening but random variability.
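For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The function names are mine, and it assumes, for simplicity, that the 24 tests are independent (they won’t be exactly, in practice, but the point stands):

```python
import random

def prob_at_least_one_significant(n_tests=24, alpha=0.05):
    """Chance that at least one of n independent tests of TRUE null
    hypotheses comes out 'significant' purely by chance."""
    return 1 - (1 - alpha) ** n_tests

# The same thing by simulation: when every null hypothesis is true,
# P values are uniformly distributed on (0, 1).
def simulate(n_trials=100_000, n_tests=24, alpha=0.05, seed=1):
    random.seed(seed)
    hits = sum(
        any(random.random() < alpha for _ in range(n_tests))
        for _ in range(n_trials)
    )
    return hits / n_trials

p_spurious = prob_at_least_one_significant(24)
```

With 24 tests at the conventional 0.05 level, the chance of at least one spurious “significant” result is about 71 percent, even when nothing whatever is going on.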

And that’s before you even get to the problem of the weakness of the evidence provided by P values close to 0.05. There’s at least a 30% chance of such values being false positives, even if it were not for the jelly bean problem, and a lot more than 30% if the hypothesis being tested is implausible. I leave it to the reader to assess the plausibility of the hypothesis that a good way to stop a baby crying is to stick needles into the poor baby.

If you want to know more about P values, try YouTube, or here, or here.


[xkcd’s “jelly bean” cartoon]

One of the people asked for an opinion on the paper was George Lewith, the well-known apologist for all things quackish. He described the work as a "good sized fastidious well conducted study ….. The outcome is clear", thus showing an ignorance of statistics that would shame an undergraduate.

On the Today Programme, I was interviewed by the formidable John Humphrys, along with the mandatory member of the flat-earth society whom the BBC seems to feel obliged to invite along for "balance". In this case it was a professional acupuncturist, Mike Cummings, who is an associate editor of the journal in which the paper appeared. Perhaps he’d read the Science Media Centre’s assessment before he came on, because he said, quite rightly, that

"in technical terms the study is negative" "the primary outcome did not turn out to be statistically significant"

to which Humphrys retorted, reasonably enough, “So it doesn’t work”. Cummings’ response to this was a lot of bluster about how unfair it was for NICE to expect a treatment to perform better than placebo. It was fascinating to hear Cummings admit that the press release by his own journal was simply wrong.

Listen to the interview here

Another obvious flaw of the study is the nature of the control group. It is not stated very clearly, but it seems that the baby was left alone with the acupuncturist for 10 minutes. A far better control would have been to have the baby cuddled by its mother, or by a nurse. That’s what was used by Olafsdottir et al (2001) in a study which showed that cuddling worked just as well as another form of quackery, chiropractic, at stopping babies crying.

Manufactured doubt is a potent weapon of the alternative medicine industry. It’s the same tactic as was used by the tobacco industry. You scrape together a few lousy papers like this one and use them to pretend that there’s a controversy. For years the tobacco industry used this tactic to try to persuade people that cigarettes didn’t give you cancer, and that nicotine wasn’t addictive. The mainstream media obligingly invite representatives of the industry, who convey to the reader or listener that there is a controversy, when there isn’t.

Acupuncture is no longer controversial. It just doesn’t work: see Acupuncture is a theatrical placebo: the end of a myth. Try to imagine a pill that had been subjected to well over 3000 trials without anyone producing convincing evidence for a clinically useful effect. It would have been abandoned years ago. But by manufacturing doubt, the acupuncture industry has managed to keep its product in the news. Every paper on the subject ends with the words "more research is needed". No it isn’t.

Acupuncture is a pre-scientific idea that was moribund everywhere, even in China, until it was revived by Mao Zedong as part of the appalling Great Proletarian Cultural Revolution. Now it is big business in China, and 100 percent of the clinical trials that come from China are positive.

If you believe them, you’ll truly believe anything.


29 January 2017

Soon after the Today programme in which we both appeared, the acupuncturist, Mike Cummings, posted his reaction to the programme. I thought it worth posting the original version in full. Its petulance and abusiveness are quite remarkable.

I thank Cummings for giving publicity to the video of our appearance, and for referring to my Wikipedia page. I leave it to the reader to judge my competence, and his, in the statistics of clinical trials. And it’s odd to be described as a "professional blogger" when the 400+ posts on dcscience.net don’t make a penny; in fact they cost me money. In contrast, he is the salaried medical director of the British Medical Acupuncture Society.

It’s very clear that he has no understanding of the error of the transposed conditional, nor even of the multiple comparison problem (and neither, it seems, does he know the meaning of the word ‘protagonist’).

I ignored his piece, but several friends complained to the BMJ for allowing such abusive material on their blog site. As a result a few changes were made. The “baying mob” is still there, but the Wikipedia link has gone. I thought that readers might be interested to read the original unexpurgated version. It shows, better than I ever could, the weakness of the arguments of the alternative medicine community. To quote Upton Sinclair:

“It is difficult to get a man to understand something, when his salary depends upon his not understanding it.”

It also shows that the BBC still hasn’t learned the lessons in Steve Jones’ excellent “Review of impartiality and accuracy of the BBC’s coverage of science“. Every time I appear in such a programme, they feel obliged to invite a member of the flat earth society to propagate their make-believe.

Acupuncture for infantile colic – misdirection in the media or over-reaction from a sceptic blogger?

26 Jan, 17 | by Dr Mike Cummings

So there has been a big response to this paper press released by BMJ on behalf of the journal Acupuncture in Medicine. The response has been influenced by the usual characters – retired professors who are professional bloggers and vocal critics of anything in the realm of complementary medicine. They thrive on oiling up and flexing their EBM muscles for a baying mob of fellow sceptics (see my ‘stereotypical mental image’ here). Their target in this instant is a relatively small trial on acupuncture for infantile colic.[1] Deserving of being press released by virtue of being the largest to date in the field, but by no means because it gave a definitive answer to the question of the efficacy of acupuncture in the condition. We need to wait for an SR where the data from the 4 trials to date can be combined.
On this occasion I had the pleasure of joining a short segment on the Today programme on BBC Radio 4 led by John Humphreys. My protagonist was the ever-amusing David Colquhoun (DC), who spent his short air-time complaining that the journal was even allowed to be published in the first place. You can learn all about DC care of Wikipedia – he seems to have a surprisingly long write up for someone whose profession career was devoted to single ion channels, perhaps because a significant section of the page is devoted to his activities as a quack-busting blogger. So why would BBC Radio 4 invite a retired basic scientist and professional sceptic blogger to be interviewed alongside one of the journal editors – a clinician with expertise in acupuncture (WMA)? At no point was it made manifest that only one of the two had ever been in a position to try to help parents with a baby that they think cries excessively. Of course there are a lot of potential causes of excessive crying, but I am sure DC would agree that it is unlikely to be attributable to a single ion channel.

So what about the research itself? I have already said that the trial was not definitive, but it was not a bad trial. It suffered from under-recruiting, which meant that it was underpowered in terms of the statistical analysis. But it was prospectively registered, had ethical approval and the protocol was published. Primary and secondary outcomes were clearly defined, and the only change from the published protocol was to combine the two acupuncture groups in an attempt to improve the statistical power because of under recruitment. The fact that this decision was made after the trial had begun means that the results would have to be considered speculative. For this reason the editors of Acupuncture in Medicine insisted on alteration of the language in which the conclusions were framed to reflect this level of uncertainty.

DC has focussed on multiple statistical testing and p values. These are important considerations, and we could have insisted on more clarity in the paper. P values are a guide and the 0.05 level commonly adopted must be interpreted appropriately in the circumstances. In this paper there are no definitive conclusions, so the p values recorded are there to guide future hypothesis generation and trial design. There were over 50 p values reported in this paper, so by chance alone you must expect some to be below 0.05. If one is to claim statistical significance of an outcome at the 0.05 level, ie a 1:20 likelihood of the event happening by chance alone, you can only perform the test once. If you perform the test twice you must reduce the p value to 0.025 if you want to claim statistical significance of one or other of the tests. So now we must come to the predefined outcomes. They were clearly stated, and the results of these are the only ones relevant to the conclusions of the paper. The primary outcome was the relative reduction in total crying time (TC) at 2 weeks. There were two significance tests at this point for relative TC. For a statistically significant result, the p values would need to be less than or equal to 0.025 – neither was this low, hence my comment on the Radio 4 Today programme that this was technically a negative trial (more correctly ‘not a positive trial’ – it failed to disprove the null hypothesis ie that the samples were drawn from the same population and the acupuncture intervention did not change the population treated). Finally to the secondary outcome – this was the number of infants in each group who continued to fulfil the criteria for colic at the end of each intervention week. There were four tests of significance so we need to divide 0.05 by 4 to maintain the 1:20 chance of a random event ie only draw conclusions regarding statistical significance if any of the tests resulted in a p value at or below 0.0125. 
Two of the 4 tests were below this figure, so we say that the result is unlikely to have been chance alone in this case. With hindsight it might have been good to include this explanation in the paper itself, but as editors we must constantly balance how much we push authors to adjust their papers, and in this case the editor focussed on reducing the conclusions to being speculative rather than definitive. A significant result in a secondary outcome leads to a speculative conclusion that acupuncture ‘may’ be an effective treatment option… but further research will be needed etc…

Now a final word on the 3000 plus acupuncture trials that DC loves to mention. His point is that there is no consistent evidence for acupuncture after over 3000 RCTs, so it clearly doesn’t work. He first quoted this figure in an editorial after discussing the largest, most statistically reliable meta-analysis to date – the Vickers et al IPDM.[2] DC admits that there is a small effect of acupuncture over sham, but follows the standard EBM mantra that it is too small to be clinically meaningful without ever considering the possibility that sham (gentle acupuncture plus context of acupuncture) can have clinically relevant effects when compared with conventional treatments. Perhaps now the best example of this is a network meta-analysis (NMA) using individual patient data (IPD), which clearly demonstrates benefits of sham acupuncture over usual care (a variety of best standard or usual care) in terms of health-related quality of life (HRQoL).[3]

30 January 2017

I got an email from the BMJ asking me to take part in a BMJ Head-to-Head debate about acupuncture. I did one of these before, in 2007, but it generated more heat than light (the only good thing to come out of it was the joke about leprechauns). So here is my polite refusal.


Thanks for the invitation. Perhaps you should read the piece that I wrote after the Today programme.

Why don’t you do these Head to Heads about genuine controversies? To do them about homeopathy or acupuncture is to fall for the “manufactured doubt” stratagem that was used so effectively by the tobacco industry to promote smoking. It’s the favourite tool of snake oil salesmen too, and the BMJ should see that and not fall for their tricks.

Such pieces might be good clickbait, but they are bad medicine and bad ethics.

All the best


The last email of Stefan Grimm has had more views than any other post on this blog: “Publish and perish at Imperial College London: the death of Stefan Grimm”. Since then it’s been viewed more than 210,000 times. The day after it was posted, the server failed under the load.

Since then, I have posted two follow-up pieces. On December 23, 2014: “Some experiences of life at Imperial College London. An external inquiry is needed after the death of Stefan Grimm“. Of course there was no external inquiry.

And on April 9, 2015, after the coroner’s report, and after Imperial’s internal inquiry, “The death of Stefan Grimm was “needless”. And Imperial has done nothing to prevent it happening again“.

On September 24th 2015, I posted a memorial on the first anniversary of his death. It included some of Grimm’s drawings that his mother and sister sent to me.

That tragedy led to two actions by Imperial, the metrics report (2015) and the bullying report (2016).

Let’s look at the outcomes.

The 2015 metrics report

In February 2015 an investigation was set up into the use of metrics to evaluate people. In December 2015 a report was produced: Application and Consistency of Approach in the Use of Performance Metrics. This was an internal enquiry, so one didn’t expect very much from it. Out of 1338 academic staff surveyed at the College, only 309 (23% of the total) responded (another 217 started the survey but did not submit anything). One can only speculate about the low return. It could be that the other 77% of staff were happy, or it could be that they were frightened to give their opinions. It’s true that some departments use few if any metrics to assess people, so one wouldn’t expect strong responses from them.

My position is clear: metrics don’t measure the quality of science, in fact they corrupt science.

This is not Imperial’s view though. The report says:

5.1 In seeking to form a view on performance metrics, we started from the premise that, whatever their benefits or deficiencies, performance metrics pervade UK universities. From REF to NSS via the THE and their attendant league tables, universities are measured and ranked in many dimensions and any view of performance metrics has to be formed in this context.

In other words, they simply acquiesce in the use of measures that demonstrably don’t do what’s claimed for them.

Furthermore, the statement that “performance metrics pervade UK universities” is not entirely true. At UCL we were told in 2015:

“We will evaluate the quality of staff contributions appropriately, focusing on the quality of individual research outputs and their impact rather than quantity or journal-level metrics.”

And one of the comments quoted in Imperial’s report says

“All my colleagues at MIT and Harvard etc tell me they reject metrics because they lead to mediocre candidates. If Imperial really wants to be a leader, it has to be bold enough to judge based on quality.”

It is rather shameful that only five UK universities (out of 114 or so) have signed the San Francisco Declaration on Research Assessment (DORA). I’m very happy that UCL is one of them, along with Sussex, Manchester, Birmingham and Liverpool. Imperial has not signed.

Imperial’s report concludes

“each department should develop profiles of its academic staff based on a series of published (ie open and transparent [perhaps on the College intranet]:”

There seems to be a word missing here. Presumably this means “open and transparent metrics“.

The gist of the report seems to be that departments can carry on doing what they want, as long as they say what it is. That’s not good enough, in my opinion.

A review of Imperial College’s institutional culture and its impact on gender equality

Unlike the metrics report, this one was external: that’s good. But, unlike the metrics report, it is secret: that’s bad.

The report was written by Alison Phipps (Director of Gender Studies and Reader in Sociology University of Sussex). But all that’s been released is an 11 page summary, written by Imperial, not by the authors of the report. When I asked Phipps for a copy of the whole report I was told

“Unfortunately we cannot share the full report – this is an internal document to Imperial, and we have to protect our research participants who told us their stories on this basis.”

It’s not surprising that the people who told their stories are afraid of repercussions. But it’s odd that their stories are concealed from everyone but the people who are in a position to punish them.

The report seems to have been commissioned because of this incident.

“The university apologised to the women’s rugby team after they were left playing to an empty stadium when the coaches ferrying spectators back to campus were allowed to leave early.”

“a member of staff was overheard saying that they did not care “how those fat girls” got home,”

But the report wasn’t restricted to sexism. It covered the whole culture at Imperial. One problem was that only 127 staff and 85 students participated. There is no way to tell whether those who didn’t respond were happy or whether they were scared.

Here are some quotations from Imperial’s own summary of the secret report.

“For most, the meaning was restricted to excellence in research despite the fact that the College’s publicised mission statement gives equal prominence to research and education in the excellence context”

“Participants saw research excellence in metricised terms, positioning the College as a top-level player within the UK and in the world.”

Words used by those critical of Imperial’s culture included ” ‘cutthroat’, ‘intimidating’, ‘blaming’ and ‘arrogant’ “.

“Many participants in the survey and other methods felt that the external focus on excellence had emphasised internal competition rather than collaboration. This competition was noted as often being individualistic and adversarial. ”

“It was felt that there was an all-consuming focus on academic performance, and negative attitudes towards those who did not do well or who were not as driven as others. There was a reported lack of community spirit in the College’s culture including departments being ‘played off against each other’”

“The research findings noted comments that the lack of communal space on the campus had contributed to a lack of a community spirit. It was suggested that the College had ‘an impersonal culture’ and groups could therefore self-segregate in the absence of mechanisms for them to connect. ”

“There were many examples given to the researchers of bullying and discriminatory behaviour towards staff and students. These examples predominantly reflected hierarchies in work or study arrangements. ”

“The researchers reported that many of the participants linked it with the ‘elite’ white masculinity of the majority population, although a few examples of unacceptable behaviour by female staff and students were also cited. Examples of misogynistic and homophobic conduct were given and one interviewee expressed concern that the ‘ingrained misogyny’ at Imperial was so deep that it had become normal.”

“Although the College describes itself as a supportive environment, and many positive examples of that support were cited, a number of participants felt that senior management would turn a blind eye to poor behaviour if the individual involved was of value to the College.”

“Despite Imperial’s ‘no tolerance’ stance on harassment and bullying and initiatives such as ‘Have Your Say’, the researchers heard that people did not ‘speak up’ about many issues, ranging from discrimination and abuse to more subtle practices that leave people feeling vulnerable, unheard or undermined.”

“Relations between PIs and contract researchers were especially difficult, and often gendered as the PI was very often a man and the researcher a woman.”

“It was reported that there was also a clear sense of staff and students feeling afraid to speak up about issues and not receiving clear information or answers due to unclear institutional processes and one-way communication channels.”

“This representation of Imperial College as machine rather than organism resonated with observations on a culture of fear and silence, and the lack of empathy and community spirit at the College.”

“Some of the participants identified a surface commitment to diversity and representation but a lack of substantive system processes to support this. The obstacles to participation in the way of doing things at Imperial, and the associated issues of fear and insecurity, were reported as leading to feelings of hopelessness, demotivation, and low morale among some staff and students.”

“Some participants felt that Athena SWAN had merely scratched the surface of issues or had just provided a veneer which concealed continuing inequalities and that events such as the annual Athena SWAN lecture were little more than a ‘box ticking exercise.’”

The conclusions are pretty weak: e.g.

“They [the report’s authors] urged the College to implement changes that would ensure that its excellence in research is matched by excellence in other areas.”

Of course, Imperial College says that it will fix the problems. “Imperial’s provost, James Stirling, said that the institution must do better and was committed to gender equality”.

But that is exactly what they said in 2003

“The rector [then Richard Sykes] acknowledged the findings that came out of the staff audit – Imperial College – A Good Place to Work? – undertaken in August 2002.”

“He reinforced the message that harassment or bullying would not be tolerated in the College, and promised commitment from Council members and the Executive Committee for their continuing support to equal opportunities.”

This was eleven years before the pressure applied to Stefan Grimm caused him to take his own life. As always, it sounds good. But it seems that, thirteen years later, Imperial is going through exactly the same exercise.

It would be interesting to know whether Imperial’s Department of Medicine is still adopting the same cruel assessment methods as it was in 2007. Other departments at Imperial have never used such methods. It’s a continual source of bafflement to me that medicine, the caring profession, seems to care less for its employees than most other departments.

Other universities

Imperial is certainly not unique in having these problems. They are endemic. For example, Queen Mary, Kings College London and Warwick University have had similar problems, among many others.

Managers must learn that organisations function better when employees have good morale and are happy to work. Once again, I quote Scott Berkun (The Myths of Innovation, 2007).

“Creation is sloppy; discovery is messy; exploration is dangerous. What’s a manager to do? The answer in general is to encourage curiosity and accept failure. Lots of failure.”

All big organisations are much the same: dissent is squashed and punished. Committees are set up. Fine-sounding statements are issued. But nothing much changes.

It should not be so.


Jump to follow-up

The "supplement" industry is a scam that dwarfs all other forms of alternative medicine. Sales are worth over $100 billion a year, a staggering sum. But the claims they make are largely untrue: plain fraudulent. Although the industry’s advertisements like to claim "naturalness". in fact most of the synthetic vitamins are manufactured by big pharma companies. The pharmaceutical industry has not been slow to cash in on an industry in which unverified claims can be made with impunity.

When I saw Hotshot advertised, "a proprietary formulation of organic ingredients" that is alleged to cure or prevent muscle cramps, I would have assumed that it was just another scam. Then I saw that the people behind it were two very highly-regarded scientists, Rod MacKinnon and Bruce Bean, both of whom I have met.

The Hotshot website gives this background.

"For Dr. Rod MacKinnon, a Nobel Prize-winning neuroscientist/endurance athlete, the invention of HOTSHOT was personal.

After surviving life threatening muscle cramps while deep sea kayaking off the coast of Cape Cod, he discovered that existing cramp remedies – that target the muscle – didn’t work. Calling upon his Nobel Prize-winning expertise on ion channels, Rod reasoned that preventing and treating cramps began with focusing on the nerve, not the muscle.

Five years of scientific research later, Rod has perfected HOTSHOT, the kick-ass, proprietary formulation of organic ingredients, powerful enough to stop muscle cramps where they start. At the nerve.

Today, Rod’s genius solution has created a new category in sports nutrition: Neuro Muscular Performance (NMP). It’s how an athlete’s nerves and muscles work together in an optimal way. HOTSHOT boosts your NMP to stop muscle cramps. So you can push harder, train longer and finish stronger."  

For a start, it’s pretty obvious that MacKinnon has not spent the last five years developing a cure for cramp. His publications don’t even mention the topic. Neither do Bruce Bean’s.

I’d like to thank Bruce Bean for answering some questions I put to him. He said it’s "designed to be as strong as possible in activating TRPV1 and TRPA1 channels". After some hunting I found that it contains

Filtered Water, Organic Cane Sugar, Organic Gum Arabic, Organic Lime Juice Concentrate, Pectin, Sea Salt, Natural Flavor, Organic Stevia Extract, Organic Cinnamon, Organic Ginger, Organic Capsaicin

The first ingredient is sugar: "the 1.7oz shot contains enough sugar to make a can of Coke blush with 5.9 grams per ounce vs. 3.3 per ounce of Coke".[ref].

The TRP (transient receptor potential) receptors form a family of 28 related ion channels. Their physiology is far from being well understood, but they are thought to be important for mediating taste and pain. The TRPV1 channel is also known as the receptor for capsaicin (found in chilli peppers). The TRPA1 channel responds to the active principle in wasabi.

I’m quite happy to believe that most cramp is caused by unsynchronised activity of motor nerves, causing muscle fibres to contract in an uncoordinated way (though it isn’t really known that this is the usual mechanism, or what triggers it in the first place). The problem is that there is no good reason at all to think that stimulating TRP receptors in the gastro-intestinal tract will stop, within a minute or so, the activity of motor nerves in the spinal cord.

But, as always, there is no point in discussing mechanisms until we are sure that there is a phenomenon to be explained. What is the actual evidence that Hotshot either prevents or cures cramps, as claimed? The Hotshot website has a page about Our Science. Its title is The Truth about Muscle Cramps. That’s not a good start, because it’s well known that nobody understands cramp.

So follow the link to See our Scientific Studies. It has three references; two are to unpublished work. The third is not about Hotshot, but about pickle juice. This was also the only reference sent to me by Bruce Bean. Its title is ‘Reflex Inhibition of Electrically Induced Muscle Cramps in Hypohydrated Humans’, Miller et al., 2010 [Download pdf]. Since it’s the only published work, it’s worth looking at in detail.

Miller et al. is not about exercise-induced cramp, but about a cramp-like condition that can be induced by electrical stimulation of a muscle in the sole of the foot (flexor hallucis brevis). The intention of the paper was to investigate anecdotal reports that pickle juice can prevent or stop cramps. It was a small study (only 10 subjects). After the subjects had been dehydrated, a cramp was induced electrically, and two seconds after it started they drank either pickle juice or distilled water. They weren’t asked about pain: the extent of cramp was judged from electromyograph records. At least a week later, the test was repeated with the other drink (the order in which the drinks were given was randomised). So it was a crossover design.

There was no detectable difference between water and pickle juice in the intensity of the cramp. But the duration of the cramp was said to be shorter. The mean duration after water was 133.7 ± 15.9 s and the mean duration after pickle juice was 84.6 ± 18.5 s. An unpaired t test gives P = 0.075. However, each subject had both treatments: the mean reduction in duration was 49.1 ± 14.6 s, and a paired t test gives P = 0.008. This is close to the 3-standard-deviation difference which I recommended as a minimal criterion, so what could possibly go wrong?
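The headline P value can be reproduced from the summary statistics alone. A minimal sketch in Python (it assumes, as the paper implies, that the ± figures are standard errors of the mean and that n = 10):

```python
from scipy import stats

n = 10                              # subjects (crossover: each had both drinks)
mean_diff, sem_diff = 49.1, 14.6    # mean ± SEM of within-subject differences (s)

t = mean_diff / sem_diff            # t = 3.36 on n - 1 = 9 degrees of freedom
p = 2 * stats.t.sf(t, df=n - 1)     # two-tailed paired t test
print(f"t = {t:.2f}, P = {p:.3f}")  # P ≈ 0.008, as reported
```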

The result certainly suggests that pickle juice might reduce the duration of cramps, but it’s far from conclusive, for the following reasons. First, it must have been very obvious indeed to the subjects whether they were drinking water or pickle juice. Secondly, paired t tests are not the right way to analyse crossover experiments, as explained here. Unfortunately the 10 individual differences are not given, so there is no way to judge the consistency of the responses. Thirdly, two outcomes were measured (intensity and duration), and no correction was made for multiple comparisons. Finally, P = 0.008 is convincing evidence only if you assume that there was a roughly 50:50 chance of the pickle-juice folklore being right before the experiment was started. For most folk remedies, that would be a pretty implausible assumption. The vast majority of folk remedies turn out to be useless when tested properly.
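The multiple-comparison point is easy to quantify. A minimal sketch of the simplest (Bonferroni) correction, assuming only the two reported outcomes were tested:

```python
# Bonferroni correction: with m outcomes tested, multiply each p-value by m
# (capped at 1). Here m = 2: cramp intensity and cramp duration were measured.
m = 2
p_reported = 0.008                 # paired t test on cramp duration
p_adjusted = min(1.0, m * p_reported)
print(p_adjusted)                  # 0.016: still below 0.05, but weaker evidence
```

Bonferroni is conservative, but it makes the point: the more outcomes you measure, the less a single “significant” p-value means.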

Nevertheless, the results are sufficiently suggestive that it might be worth testing Hotshot properly. One might have expected that to have been done before marketing started. It wasn’t.

Bruce Bean tells me that they tried it on friends, who said that it worked. Perhaps that’s not so surprising: there can be no condition more susceptible to self-deception than muscle cramps, because of regression to the mean.

They found a business partner, Flex Pharma, and MacKinnon set up a company. Let’s see how they are doing.

Flex Pharma

The hyperbole in the advertisements for Hotshot is entirely legal in the USA. The infamous 1994 Dietary Supplement Health and Education Act (DSHEA) allows almost any claim to be made for herbs etc. as long as they are described as a "dietary supplement". All the seller has to do is add in the small print:

"These statements have not been evaluated by the Food and Drug Administration. This product is not intended to diagnose, treat, cure or prevent any disease".

Of course medical claims are made: it’s sold to prevent and treat muscle cramp (and I can’t even find the weasel words on the web site).

As well as Hotshot, Flex Pharma is also testing a drug, FLX-787, a TRP receptor agonist of undisclosed structure. It hopes to get FDA approval for the treatment of nocturnal leg cramps (NLCs) and for the treatment of spasticity in multiple sclerosis (MS) and amyotrophic lateral sclerosis (ALS) patients. It would be great if it works, but we still don’t know whether it does.

The financial press doesn’t seem to be very optimistic. When Flex Pharma was launched on the stock market at the beginning of 2015, its initial public offering raised $86.4 million, at $16 per share. The biotech boom of the previous few years was still strong. By 2016, the outlook seemed less rosy. The investment advice site Seeking Alpha published a scathing evaluation in June 2016. Its title was "Flex Pharma: What A Load Of Cramp". It contains some remarkably astute assessments of the pharmacology, as well as of the financial risks. The summary reads thus:

  • We estimate FLKS will burn at least 40 million of its $84 million in cash this year on clinical trials for FLX-787 and marketing spend for its new cramp supplement called “HOTSHOT.”
  • Based on its high cash burn, we expect a large, dilutive equity raise is likely over the next 12 months.
  • We believe the company’s recent study on nocturnal leg cramps (NLCs) may be flawed. We also highlight risks to its lead drug candidate, FLX-787, that we believe investors are currently overlooking.
  • We highlight several competitive available alternatives to FLKS’s cramp products that we believe investors have not factored into current valuation.
  • Only 2.82% of drugs from companies co-founded by CEO Westphal have achieved FDA approval.

The last bullet point refers to Flex Pharma’s CEO, Christoph Westphal MD PhD (described by Fierce Biotech as a "serial biotech entrepreneur"). Only two out of his 71 requests for FDA approval have been successful.

On October 13th 2016 it was reported that early trials of FLX-787 had been disappointing. The shares plunged.


On October 17th 2016, Seeking Alpha posted another evaluation: “Flex Pharma Has Another Cramp”. So did StreetInsider.com. Neither was optimistic. The former made the point (see above) that crossover trials are not what should be done. In fact the FDA has required that regular parallel RCTs be done before FLX-787 can be approved.


Drug discovery is hard and expensive. The record for small molecule discovery has not been good in the last few decades. Many new introductions have, at best, marginal efficacy, and at worst may do more harm than good. For conditions in which understanding of causes is poor or non-existent, it’s impossible to design new drugs rationally. There are only too many such conditions: from low back pain to almost anything that involves the brain, knowledge of causes is fragmentary to non-existent. This leads guidance bodies to clutch at straws. Disappointing as this is, it’s not for want of trying. And it’s not surprising. Serious medical research hasn’t been going on for long and the systems are very complicated.

But this is no excuse for pretending that things work on the basis of the flimsiest of evidence. Bruce Bean advised me to try Hotshot on friends, and says that it doesn’t work for everybody. This is precisely what one is told by homeopaths, and just about every other sort of quack. Time and time again, that sort of evidence has proved to be misleading.

I have the greatest respect for the science that’s published by both Bruce Bean and Rod MacKinnon. I guess that they aren’t familiar with the sort of evidence that’s required to show that a new treatment works. That isn’t solved by describing a treatment as a "dietary supplement".

I’ll confess that I’m a bit disappointed by their involvement with Flex Pharma, a company that makes totally unjustified claims. Or should one just say caveat emptor?


Before posting this, I sent it to Bruce Bean to be checked. Here was his response, which I’m posting in full (hoping not to lose a friend).

"Want to be UK representative for Hotshot? Sample on the way!"

"I do not see anything wrong with the facts. I have a different opinion – that it is perfectly appropriate to have different standards of proof of efficacy for consumer products made from general-recognized-as-safe ingredients and for an FDA-approved drug. I’d be happy for the opportunity to post something like the following your blog entry (and suffer any consequent further abuse) if there is an opportunity".  

  " I think it would be unfair to lump Hotshot with “dietary supplements” targeted to exploit the hopes of people with serious diseases who are desperate for magic cures. Hotshot is designed and marketed to athletes who experience exercise-induced cramping that can inhibit their training or performance – hardly a population of desperate people susceptible of exploitation. It costs only a few dollars for someone to try it. Lots of people use it regularly and find it helpful. I see nothing wrong with this and am glad that something that I personally found helpful is available for others to try. "

     " Independently of Hotshot, Flex Pharma is hoping to develop treatments for cramping associated with diseases like ALS, MS, and idiopathic nocturnal leg cramps. These treatments are being tested in rigorous clinical trials that will be reviewed by the FDA. As with any drug development it is very expensive to do the clinical trials and there is no guarantee of success. I give credit to the investors who are underwriting the effort. The trials are openly publicly reported. I would note that Flex Pharma voluntarily reported results of a recent trial for night leg cramps that led to a nearly 50% drop in the stock price. I give the company credit for that openness and for spending a lot of money and a lot of effort to attempt to develop a treatment to help people – if it can pass the appropriately high hurdle of FDA approval."

     " On Friday, I sent along 8 bottles of Hotshot by FedEx, correctly labeled for customs as a commercial sample. Of course, I’d be delighted if you would agree to act as UK representative for the product but absent that, it should at least convince you that the TRP stimulators are present at greater than homeopathic doses. If you can find people who get exercise-induced cramping that can’t be stretched out, please share with them."

6 January 2017

It seems that more than one Nobel prizewinner is willing to sell their name to dodgy businesses. The MIT Tech Review tweeted a link to “Imagine Albert Einstein getting paid to put his picture on a tin of anti-wrinkle cream”. No fewer than seven Nobel prizewinners have lent their names to a “supplement” pill that’s claimed to prolong your life. Needless to say, there isn’t the slightest reason to think it works. What possesses these people beats me. Here are their names.

Aaron Ciechanover (Cancer Biology, Technion – Israel Institute of Technology).

Eric Kandel (Neuroscience, Columbia University).

Jack Szostak (Origins of Life & Telomeres, Harvard University).

Martin Karplus (Complex Chemical Systems, Harvard University).

Sir Richard Roberts (Biochemistry, New England Biolabs).

Thomas Südhof (Neuroscience, Stanford University).

Paul Modrich (Biochemistry, Duke University School of Medicine).

Then there’s the Amway problem. Watch this space.

‘We know little about the effect of diet on health. That’s why so much is written about it’. That is the title of a post in which I advocate the view, put by John Ioannidis, that remarkably little is known about the health effects of individual nutrients. That ignorance has given rise to a vast industry selling advice that has little evidence to support it.

The 2016 Conference of the so-called "College of Medicine" had the title "Food, the Forgotten Medicine". This post gives some background information about some of the speakers at this event. I’m sorry if it appears to be too ad hominem, but the only way to judge the meeting is via the track records of the speakers.



Quite a lot has been written here about the "College of Medicine". It is the direct successor of the Prince of Wales’ late, unlamented, Foundation for Integrated Health. But unlike the latter, its name disguises its promotion of quackery. Originally it was going to be called the “College of Integrated Health”, but that wasn’t sufficiently deceptive so the name was dropped.

For the history of the organisation, see

The new “College of Medicine” arising from the ashes of the Prince’s Foundation for Integrated Health

Don’t be deceived. The new “College of Medicine” is a fraud and delusion

The College of Medicine is in the pocket of Crapita Capita. Is Graeme Catto selling out?

The conference programme (download pdf) is a masterpiece of bait and switch. It is a mixture of very respectable people, and outright quacks. The former are invited to give legitimacy to the latter. The names may not be familiar to those who don’t follow the antics of the magic medicine community, so here is a bit of information about some of them.

The introduction to the meeting was by Michael Dixon and Catherine Zollman, both veterans of the Prince of Wales Foundation, and both devoted enthusiasts for magic medicine. Zollman even believes in the battiest of all forms of magic medicine, homeopathy (download pdf), for which she totally misrepresents the evidence. Zollman now works at the Penny Brohn centre in Bristol. She’s also linked to the "Portland Centre for Integrative Medicine", which is run by Elizabeth Thompson, another advocate of homeopathy. It came into being after NHS Bristol shut down the Bristol Homeopathic Hospital, on the very good grounds that it doesn’t work.

Now, like most magic medicine, it is privatised. The Penny Brohn shop will sell you a wide range of expensive and useless "supplements". For example, Biocare Antioxidant capsules at £37 for 90. Biocare make several unjustified claims for their benefits. Among other unnecessary ingredients, they contain a very small amount of green tea. That’s a favourite of "health food addicts", and it was the subject of a recent paper that contains one of the daftest statistical solecisms I’ve ever encountered:

"To protect against type II errors, no corrections were applied for multiple comparisons".

If you don’t understand that, try this paper. The results are almost certainly false positives, despite the fact that it appeared in Lancet Neurology. It’s yet another example of broken peer review.

It’s been known for decades now that “antioxidant” is no more than a marketing term. There is no evidence of benefit, and large doses can be harmful. This obviously doesn’t worry the College of Medicine.

Margaret Rayman was the next speaker. She’s a real nutritionist. Mixing the real with the crackpots is a standard bait and switch tactic.

Eleni Tsiompanou came next. She runs yet another private "wellness" clinic, which makes all the usual exaggerated claims. She seems to have an obsession with Hippocrates (hint: medicine has moved on since then). Dr Eleni’s Joy Biscuits may or may not taste good, but their health-giving properties are make-believe.

Andrew Weil, from the University of Arizona, gave the keynote address. He’s described as "one of the world’s leading authorities on Nutrition and Health". That description alone is sufficient to show the fantasy land in which the College of Medicine exists. He’s a typical supplement salesman, presumably very rich. There is no excuse for not knowing about him. It was 1988 when Arnold Relman (who was editor of the New England Journal of Medicine) wrote A Trip to Stonesville: Some Notes on Andrew Weil, M.D..

“Like so many of the other gurus of alternative medicine, Weil is not bothered by logical contradictions in his argument, or encumbered by a need to search for objective evidence.”

This blog has mentioned his more recent activities, many times.

Alex Richardson, of Oxford Food and Behaviour Research (a charity, not part of the university), is an enthusiast for omega-3, a favourite of the supplement industry. She has published several papers that show little evidence of effectiveness. That looks entirely honest. On the other hand, the charity’s News section contains many links to the notorious supplement industry lobby site, Nutraingredients, one of the least reliable sources of information on the web (I get their newsletter, a constant source of hilarity and raised eyebrows). I find this worrying for someone who claims to be evidence-based. I’m told that her charity is funded largely by the supplement industry (though I can’t find any mention of that on the web site).

Stephen Devries was a new name to me. You can infer what he’s like from the fact that he has been endorsed by Andrew Weil, and that his address is the "Institute for Integrative Cardiology" ("integrative" is the latest euphemism for quackery). Never trust any talk with a title that contains "the truth about". His was called "The scientific truth about fats and sugars". In a video, he claims that diet has been shown to reduce heart disease by 70%, which gives you a good idea of his ability to assess evidence. But the claim doubtless helps to sell his books.

Prof Tim Spector, of King’s College London, was next. As far as I know he’s a perfectly respectable scientist, albeit one with books to sell. But his talk is now online, and it was a bit like listening to a born-again microbiome enthusiast. He seemed too impressed by the PREDIMED study, despite its statistical unsoundness, which was pointed out by Ioannidis. Little evidence was presented, though at least he was more sensible than the audience about the uselessness of multivitamin tablets.

Simon Mills talked on “Herbs and spices. Using Mother Nature’s pharmacy to maintain health and cure illness”. He’s a herbalist who has featured here many times. I can recommend especially his video about Hot and Cold herbs as a superb example of fantasy science.

Annie Anderson is Professor of Public Health Nutrition and Founder of the Scottish Cancer Prevention Network. She’s a respectable nutritionist and public health person, albeit with the customary disregard of problems of causality.

Patrick Holden is chair of the Sustainable Food Trust. He promotes "organic farming". Much though I dislike the cruelty of factory farms, the "organic" industry is largely a way of making food more expensive with no health benefits.

The Michael Pittilo 2016 Student Essay Prize was awarded after lunch. Pittilo has featured frequently on this blog as a result of his execrable promotion of quackery – see, in particular, A very bad report: gamma minus for the vice-chancellor.

Nutritional advice for patients with cancer. This discussion involved three people: Professor Robert Thomas, Consultant Oncologist, Addenbrookes and Bedford Hospitals; Dr Clare Shaw, Consultant Dietitian, Royal Marsden Hospital; and Dr Catherine Zollman, GP and Clinical Lead, Penny Brohn UK.

Robert Thomas came to my attention when I noticed that he, as a regular cancer consultant, had spoken at a meeting of the quack charity “YestoLife”, and when I saw that he was scheduled to speak at another quack conference. After I’d written to him to point out the track records of some of the people at the meeting, he withdrew from one of them. See The exploitation of cancer patients is wicked. Carrot juice for lunch, then die destitute. The influence seems to have been temporary though. He continues to lend respectability to many dodgy meetings. He edits the Cancernet web site, which lends credence to bizarre treatments like homeopathy and crystal healing. It used to sell hair mineral analysis, a well-known phony diagnostic method whose main purpose is to sell you expensive “supplements”. They still sell the “Cancer Risk Nutritional Profile” for £295.00, despite the fact that it provides no proven benefits.

Robert Thomas designed a food "supplement", Pomi-T: capsules that contain Pomegranate, Green tea, Broccoli and Curcumin. Oddly, he seems still to subscribe to the antioxidant myth. Even the supplement industry admits that that’s a lost cause, but that doesn’t stop its use in marketing. The one randomised trial of these pills for prostate cancer was inconclusive. Prostate Cancer UK says "We would not encourage any man with prostate cancer to start taking Pomi-T food supplements on the basis of this research". Nevertheless it’s promoted on Cancernet.co.uk and widely sold. The Pomi-T site boasts about the (inconclusive) trial, but says "Pomi-T® is not a medicinal product".

There was a cookery demonstration by Dale Pinnock, "the medicinal chef". The programme does not tell us whether he made his signature dish, the "Famous Flu Fighting Soup". Needless to say, there isn’t the slightest reason to believe that his soup has the slightest effect on flu.

In summary, the whole meeting was devoted to exaggerating vastly the effect of particular foods. It also acted as advertising for people with something to sell. Much of it was outright quackery, with a leavening of more respectable people, a standard part of the bait-and-switch methods used by all quacks in their attempts to make themselves sound respectable. I find it impossible to tell how much the participants actually believe what they say, and how much it’s a simple commercial drive.

The thing that really worries me is why someone like Phil Hammond supports this sort of thing by chairing their meetings (as he did for the "College of Medicine’s" direct predecessor, the Prince’s Foundation for Integrated Health). His defence of the NHS has made him something of a hero to me. He assured me that he’d asked people to stick to evidence. In that he clearly failed. I guess they must pay well.


This is my version of a post which I was asked to write for the Independent. It’s been published, though so many changes were made by the editor that I’m posting the original here (below).

Superstition is rife in all sports. Mostly it does no harm, and it might even have a placebo effect that’s sufficient to make a difference of 0.01%. That might just get you a medal. But what does matter is that superstition has given rise to an army of charlatans who are only too willing to sell their magic medicine to athletes, most of whom are not nearly as rich as Phelps.

So much has been said about cupping during the last week that it’s hard to say much that’s original. Yesterday I did six radio interviews and two for TV, and today Associated Press TV came to film a piece about it. Everyone else must have been on holiday. The only one I’ve checked was the piece on the BBC News channel. That one didn’t seem to go too badly, so it’s here.

BBC news coverage

It starts with the usual lengthy, but uninformative, pictures of someone being cupped. The cupper in this case was actually a chiropractor, Rizwhan Suleman. Chiropractic is, of course, a totally different form of alternative medicine, and its value has been totally discredited in the wake of the Simon Singh case. It’s not unusual for people to sell different therapies with conflicting beliefs. Truth is irrelevant. Once you’ve believed one impossible thing, it seems that the next ones become quite easy.

The presenter, Victoria Derbyshire, gave me a fair chance to debunk it afterwards.

Nevertheless, the programme suffered from the usual pretence that there is a controversy about the medical value of cupping. There isn’t. But despite Steve Jones’ excellent report to the BBC Trust, the media insist on giving equal time to flat-earth advocates. The report, (Review of impartiality and accuracy of the BBC’s coverage of science) was no doubt commissioned with good intentions, but it’s been largely ignored.

Still worse, the BBC News Channel, when it repeated the item (its cycle time is quite short) showed only Rizwhan Suleman and cut out my comments altogether. This is not false balance. It’s no balance whatsoever. A formal complaint has been sent. It is not the job of the BBC to provide free advertising to quacks.

After this, a friend drew my attention to a much worse programme on the subject.

The Jeremy Vine show on BBC Radio 2, at 12.00 on August 10th, 2016, was presented by Vanessa Feltz. It was beyond appalling. There was absolutely zero attempt at balance, false or otherwise. The guest was described as being an "expert" on cupping. He was Yusef Noden, of the London Hijama Clinic, who "trained and qualified with the Hijama & Prophetic Medicine Institute". No doubt he’s a nice bloke, but he really could use a first year course in physiology. His words were pure make-believe. His repeated statements about "withdrawing toxins" are well known to be absolutely untrue. It was embarrassing to listen to. If you really want to hear it, here is an audio recording.

The Jeremy Vine show

This programme is one of the worst cases I’ve heard of the BBC mis-educating the public by providing free advertising for quite outrageous quackery. Another complaint will be submitted. The only form of opposition was a few callers who pointed out the nonsense, mixed with callers who endorsed it. That is not, by any stretch of the imagination, fair and balanced.

It’s interesting that, although cupping is often associated with Traditional Chinese Medicine, neither of the proponents in these two shows was Chinese: both were Muslim. This should not be surprising, since neither cupping nor acupuncture is exclusively Chinese. Similar myths have arisen in many places. My first encounter with this particular branch of magic medicine was when I was asked to make a podcast for “Things Unseen”, in which I debated with a Muslim hijama practitioner and an Indian Ayurvedic practitioner. It’s even harder to talk sense to practitioners of magic medicine who believe that god is on their side, as well as believing that selling nonsense is a good way to make a living.

An excellent history of the complex emergence of similar myths in different parts of the world has been published by Ben Kavoussi, under the title "Acupuncture is astrology with needles".

Now the original version of my blog for the Independent.

Cupping: Michael Phelps and Gwyneth Paltrow may be believers, but the truth behind it is what really sucks

The sight of Olympic swimmer, Michael Phelps, with bruises on his body caused by cupping resulted in something of a media feeding-frenzy this week. He’s a great athlete so cupping must be responsible for his performance, right?  Just as cupping must be responsible for the complexion of an earlier enthusiast, Gwyneth Paltrow.

The main thing in common between Phelps and Paltrow is that they both have a great deal of money, and neither has much interest in how you distinguish truth from myth.  They can afford to indulge any whim, however silly.

And cupping is pretty silly. It’s a pre-scientific medical practice that started in a time when there was no understanding of physiology, much like bloodletting. Indeed one version does involve a bit of bloodletting.  Perhaps bloodletting is the best argument against the belief that it’s ancient wisdom, so it must work. It was a standard part of medical treatment for hundreds of years, and killed countless people.

It is desperately implausible that putting suction cups on your skin would benefit anything, so it’s not surprising that there is no worthwhile empirical evidence that it does.  The Chinese version of cupping is related to acupuncture and, unlike cupping, acupuncture has been very thoroughly tested. Over 3000 trials have failed to show any benefit that’s big enough to benefit patients. Acupuncture is no more than a theatrical placebo.  And even its placebo effects are too small to be useful.

At least it’s likely that cupping usually does no lasting damage. We don’t know for sure, because in the world of alternative medicine there is no system for recording bad effects (and there is a vested interest in not reporting them). In extreme cases, it can leave holes in your skin that pose a serious danger of infection, but most people probably end up with just broken capillaries and bruises. Why would anyone want that? The answer seems to be a mixture of wishful thinking about the benefits and vastly exaggerated claims made by the people who sell the product.

It’s typical that the sales people can’t even agree on what the benefits are alleged to be.  If selling to athletes, the claim may be that it relieves pain, or that it aids recovery, or that it increases performance.  Exactly the same cupping methods are sold to celebs with the claim that their beauty will be improved because cupping will “boost your immune system”.  This claim is universal in the world of make-believe medicine, when the salespeople can think of nothing else. There is no surer sign of quackery.  It means nothing whatsoever.  No procedure is known to boost your immune system.  And even if anything did, it would be more likely to cause inflammation and blood clots than to help you run faster or improve your complexion.

It’s certainly most unlikely that sucking up bits of skin into evacuated jars would have any noticeable effect on blood flow in underlying muscles, and so increase your performance.  The salespeople would undoubtedly benefit from a first year physiology course.

Needless to say, they haven’t actually tried measuring blood flow, or performance. To do that might reduce sales. As Kate Carter said recently, “Eating jam out of those jars would probably have a more significant physical impact”.

The problem with all sports medicine is that tiny effects could make a difference. When three hour endurance events end with a second or so separating the winner from the rest, that is an effect of less than 0.01%.   Such tiny effects will never be detectable experimentally.  That leaves the door open to every charlatan to sell miracle treatments that might just work.  If, like steroids, they do work, there is a good chance that they’ll harm your health in the long run.

You might be better off eating the jam.

Here is a very small selection of the many excellent accounts of cupping on the web.

There have been many good blogs. The mainstream media have, on the whole, been dire. Here are three that I like.

In July 2016, Orac posted in ScienceBlogs. "What’s the harm? Cupping edition". He used his expertise as a surgeon to explain the appalling wounds that can be produced by excessive cupping.


Photo from news.com.au

Timothy Caulfield wrote "Olympic debunk!". He’s Chair in Health Law and Policy at the University of Alberta, and the author of Is Gwyneth Paltrow Wrong About Everything?

“The Olympics are a wonderful celebration of athletic performance. But they have also become an international festival of sports pseudoscience. It will take an Olympic–sized effort to fight this bunk and bring a win to the side of evidence-based practice.”

Jennifer Raff wrote Pseudoscience is common among elite athletes outside of the Olympics too…and it makes me furious. She works on the genomes of modern and ancient people at the University of Kansas and, as though that were not a full-time job for most people, she writes blogs and books, and is also "training (and occasionally competing) in Muay Thai, boxing, BJJ, and MMA".

"I’m completely unsurprised to find that pseudoscience is common among the elite athletes competing in the Olympics. I’ve seen similar things rampant in the combat sports world as well."

What she writes makes perfect sense. Just don’t bother with the comments section which is littered with Trump-like post-factual comments from anonymous conspiracy theorists.


Of all types of alternative medicine, acupuncture is the one that has received the most approval from regular medicine. The benefit of that is that it’s been tested more thoroughly than most others. The result is now clear. It doesn’t work. See the evidence in Acupuncture is a theatrical placebo.

This blog has documented many cases of misreported tests of acupuncture, often from people who have a financial interest in selling it. Perhaps the most egregious spin came from the University of Exeter. It was published in a normal journal, and endorsed by the journal’s editor, despite showing clearly that acupuncture didn’t even have much placebo effect.

Acupuncture got a boost in 2009 from, of all unlikely sources, the National Institute for Health and Care Excellence (NICE). The judgements of NICE on the benefit/cost ratio of treatments are usually very good. But the guidance group that they assembled to judge treatments for low back pain was atypically incompetent when it came to assessment of evidence. They recommended acupuncture as one option. At the time I posted “NICE falls for Bait and Switch by acupuncturists and chiropractors: it has let down the public and itself”. That was soon followed by two more posts:

“NICE fiasco, part 2. Rawlins should withdraw guidance and start again”,


“The NICE fiasco, Part 3. Too many vested interests, not enough honesty”.

At the time, NICE was being run by Michael Rawlins, an old friend. No doubt he was unaware of the bad guidance until it was too late and he felt obliged to defend it.

Although the 2008 guidance referred only to low back pain, it gave an opening for acupuncturists to penetrate the NHS. Like all quacks, they are experts at bait and switch. The penetration of quackery was exacerbated by the privatisation of physiotherapy services to organisations like Connect Physical Health, which have little regard for evidence but a good eye for sales. If you think that’s an exaggeration, read "Connect Physical Health sells quackery to NHS".

When David Haslam took over the reins at NICE, I was optimistic that the question would be revisited (it turned out that he was aware of this blog). I was not disappointed. This time the guidance group had much more critical members.

The new draft guidance on low back pain was released on 24 March 2016. The final guidance will not appear until September 2016, but last time the final version didn’t differ much from the draft.

Despite modern imaging methods, it still isn’t possible to pinpoint the precise cause of low back pain (LBP) so diagnoses are lumped together as non-specific low back pain (NSLBP).

The summary guidance is explicit.

“1.2.8 Do not offer acupuncture for managing non-specific low back pain with or without sciatica.”

The evidence is summarised in section 13.6 of the main report (page 493). There is a long list of other proposed treatments that are not recommended.

Because low back pain is so common, and so difficult to treat, many treatments have been proposed. Many of them, including acupuncture, have proved to be clutching at straws. It’s to the great credit of the new guidance group that they have resisted that temptation.

Among the other "do not offer" treatments are

  • imaging (except in specialist setting)
  • belts or corsets
  • foot orthotics
  • acupuncture
  • ultrasound
  • TENS or PENS
  • opioids (for acute or chronic LBP)
  • antidepressants (SSRI and others)
  • anticonvulsants
  • spinal injections
  • spinal fusion for NSLBP (except as part of a randomised controlled trial)
  • disc replacement

At first sight, the new guidance looks like an excellent clear-out of the myths that surround the treatment of low back pain.

The positive recommendations that are made are all for things that have modest effects (at best). For example, “Consider a group exercise programme”, and “Consider manipulation, mobilisation”. The use of the word “consider”, rather than “offer”, seems to be NICE-speak: an implicit suggestion that it doesn’t work very well. My only criticism of the report is that it doesn’t say sufficiently bluntly that non-specific low back pain is largely an unsolved problem. Most of the improvement that’s seen is probably a result of that most deceptive phenomenon, regression to the mean.

One pain specialist put it to me thus: “Think of the billions spent on back pain research over the years in order to reach the conclusion that nothing much works – shameful really.” Well, perhaps not shameful: it isn’t for want of trying. It’s just a very difficult problem. But pretending that there are solutions doesn’t help anyone.


This post arose from a recent meeting at the Royal Society. It was organised by Julie Maxton to discuss the application of statistical methods to legal problems. I found myself sitting next to an Appeal Court Judge who wanted more explanation of the ideas. Here it is.

Some preliminaries

The papers that I wrote recently were about the problems associated with the interpretation of screening tests and tests of significance. They don’t allude to legal problems explicitly, though the problems are the same in principle. They are all open access. The first appeared in 2014.

Since the first version of this post, in March 2016, I’ve written two more papers and some popular pieces on the same topic. There’s a list of them at http://www.onemol.org.uk/?page_id=456. I also made a video for YouTube of a recent talk.

In these papers I was interested in the false positive risk (also known as the false discovery rate) in tests of significance. It turned out to be alarmingly large. That has serious consequences for the credibility of the scientific literature. In legal terms, the false positive risk means the proportion of cases in which, on the basis of the evidence, a suspect is found guilty when in fact they are innocent. That has even more serious consequences.

Although most of what I want to say can be said without much algebra, it would perhaps be worth getting two things clear before we start.

The rules of probability.

(1) To get any understanding, it’s essential to understand the rules of probability and, in particular, the idea of conditional probabilities. One source is my old book, Lectures on Biostatistics (now free): the account on pages 19 to 24 gives a pretty simple (I hope) description of what’s needed. Briefly, a vertical line is read as “given”, so Prob(evidence | not guilty) means the probability that the evidence would be observed given that the suspect was not guilty.

(2) Another potential confusion in this area is the relationship between odds and probability. The relationship between the probability of an event and the odds on that event can be illustrated by an example. If the probability of being right-handed is 0.9, then the probability of not being right-handed is 0.1. That means that 9 people out of 10 are right-handed and one person in 10 is not; in other words, for every person who is not right-handed there are 9 who are. Thus the odds that a randomly-selected person is right-handed are 9 to 1. In symbols this can be written

\[ \mathrm{probability=\frac{odds}{1 + odds}} \]

In the example, the odds on being right-handed are 9 to 1, so the probability of being right-handed is 9 / (1+9) = 0.9.


Conversely,

\[ \mathrm{odds =\frac{probability}{1 - probability}} \]

In the example, the probability of being right-handed is 0.9, so the odds of being right-handed are 0.9 / (1 – 0.9) = 0.9 / 0.1 = 9 (to 1).
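These two conversions are easy to check numerically. Here is a minimal Python sketch (the function names are my own, chosen for illustration):

```python
def odds_from_probability(p):
    """Odds on an event, given its probability (requires 0 <= p < 1)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Probability of an event, given the odds on it."""
    return odds / (1 + odds)

# The right-handedness example: probability 0.9 corresponds to odds of 9 to 1.
print(odds_from_probability(0.9))   # 9 (to 1), up to rounding error
print(probability_from_odds(9))     # 0.9
```

Applying one function and then the other gets you back where you started, which is a quick way to convince yourself that the two formulas above are consistent.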

With these preliminaries out of the way, we can proceed to the problem.

The legal problem

The first problem lies in the fact that the answer depends on Bayes’ theorem. Although that was published in 1763, statisticians are still arguing about how it should be used to this day.  In fact whenever it’s mentioned, statisticians tend to revert to internecine warfare, and forget about the user.

Bayes’ theorem can be stated in words as follows

\[ \mathrm{\text{posterior odds ratio} = \text{prior odds ratio} \times \text{likelihood ratio}} \]

“Posterior odds ratio” means the odds that the person is guilty, relative to the odds that they are innocent, in the light of the evidence, and that’s clearly what one wants to know.  The “prior odds” are the odds that the person was guilty before any evidence was produced, and that is the really contentious bit.

Sometimes the need to specify the prior odds has been circumvented by using the likelihood ratio alone, but, as shown below, that isn’t a good solution.

The analogy with the use of screening tests to detect disease is illuminating.

Screening tests

A particularly straightforward application of Bayes’ theorem is in screening people to see whether or not they have a disease.  It turns out, in many cases, that screening gives a lot more wrong results (false positives) than right ones.  That’s especially true when the condition is rare (the prior odds that an individual suffers from the condition is small).  The process of screening for disease has a lot in common with the screening of suspects for guilt. It matters because false positives in court are disastrous.

The screening problem is dealt with in sections 1 and 2 of my paper, or on this blog (and here). A bit of animation helps the slides, so you may prefer the YouTube version (it deals with screening tests up to 8’45”).
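For readers who prefer numbers to slides, here is a toy screening calculation in Python. The prevalence, sensitivity and specificity below are invented for illustration; they are not taken from the paper.

```python
prevalence = 0.01     # 1% of those screened actually have the condition
sensitivity = 0.80    # Prob(test positive | condition present)
specificity = 0.95    # Prob(test negative | condition absent)

n = 10_000                                             # people screened
diseased = n * prevalence                              # about 100 people
true_positives = diseased * sensitivity                # about 80
false_positives = (n - diseased) * (1 - specificity)   # about 495

# Fraction of positive tests that are wrong: the false positive risk.
fpr = false_positives / (true_positives + false_positives)
print(f"false positive risk = {fpr:.2f}")   # 0.86: most positives are false
```

Even with a respectable-looking test, some 86% of the positives are false, simply because the condition is rare: exactly the phenomenon described above.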

The rest of my paper applies similar ideas to tests of significance.  In that case the prior probability is the probability that there is in fact a real effect, or, in the legal case, the probability that the suspect is guilty before any evidence has been presented. This is the slippery bit of the problem both conceptually, and because it’s hard to put a number on it.

But the examples below show that to ignore it, and to use the likelihood ratio alone, could result in many miscarriages of justice.

In the discussion of tests of significance, I took the view that it is not legitimate (in the absence of good data to the contrary) to assume any prior probability greater than 0.5. To do so would presume you know the answer before any evidence was presented.  In the legal case a prior probability of 0.5 would mean assuming that there was a 50:50 chance that the suspect was guilty before any evidence was presented. A 50:50 probability of guilt before the evidence is known corresponds to a prior odds ratio of 1 (to 1). If that were true, the likelihood ratio would be a good way to represent the evidence, because the posterior odds ratio would be equal to the likelihood ratio.

It could be argued that 50:50 represents some sort of equipoise, but in the example below it is clearly too high, and if the true prior is less than 50:50, use of the likelihood ratio alone runs a real risk of convicting an innocent person.

The following example is modified slightly from section 3 of a book chapter by Mortera and Dawid (2008). Philip Dawid is an eminent statistician who has written a lot about probability and the law, and he’s a member of the legal group of the Royal Statistical Society.

My version of the example removes most of the algebra, and uses different numbers.

Example: The island problem

The “island problem” (Eggleston 1983, Appendix 3) is an imaginary example that provides a good illustration of the uses and misuses of statistical logic in forensic identification.

A murder has been committed on an island, cut off from the outside world, on which 1001 (= N + 1) inhabitants remain. The forensic evidence at the scene consists of a measurement, x, on a “crime trace” characteristic, which can be assumed to come from the criminal. It might, for example, be a bit of the DNA sequence from the crime scene.

Say, for the sake of example, that the probability of a random member of the population having characteristic x is P = 0.004 (i.e. 0.4% ), so the probability that a random member of the population does not have the characteristic is 1 – P = 0.996. The mainland police arrive and arrest a random islander, Jack. It is found that Jack matches the crime trace. There is no other relevant evidence.

How should this match evidence be used to assess the claim that Jack is the murderer? We shall consider three arguments that have been used to address this question. The first is wrong. The second and third are right. (For illustration, we have taken N = 1000, P = 0.004.)

(1) Prosecutor’s fallacy

Prosecuting counsel, arguing according to his favourite fallacy, asserts that the probability that Jack is guilty is 1 – P , or 0.996, and that this proves guilt “beyond a reasonable doubt”.

The probability that Jack would show characteristic x if he were not guilty is P = 0.004, i.e. Prob(Jack has x | not guilty) = 0.004. The prosecutor transposes this conditional: he treats 0.004 as though it were the probability that Jack is not guilty, given the evidence, and so concludes that the probability of guilt is 1 – 0.004 = 0.996.

But Prob(evidence | not guilty) is not what we want. What we need is the probability that Jack is guilty, given the evidence, Prob(Jack is guilty | Jack has characteristic x).

To mistake the former for the latter is the prosecutor’s fallacy, or the error of the transposed conditional.

Dawid gives an example that makes the distinction clear.

“As an analogy to help clarify and escape this common and seductive confusion, consider the difference between “the probability of having spots, if you have measles” – which is close to 1 – and “the probability of having measles, if you have spots” – which, in the light of the many alternative possible explanations for spots, is much smaller.”

(2) Defence counter-argument

Counsel for the defence points out that, while the guilty party must have characteristic x, he isn’t the only person on the island to have it. Among the remaining N = 1000 innocent islanders, 0.4% have characteristic x, so the number who have it will be NP = 1000 × 0.004 = 4. Hence the total number of islanders with the characteristic must be 1 + NP = 5. The match evidence means that Jack must be one of these 5 people, but does not otherwise distinguish him from the other four. Since just one of the five is guilty, the probability that it is Jack is 1/5, or 0.2 – very far from being “beyond all reasonable doubt”.

(3) Bayesian argument

The probability of Jack having characteristic x (the evidence) would be Prob(evidence | guilty) = 1 if Jack were guilty, but if Jack were not guilty it would be 0.4%, i.e. Prob(evidence | not guilty) = P. Hence the likelihood ratio in favour of guilt, on the basis of the evidence, is

\[ LR=\frac{\text{Prob(evidence } | \text{ guilty})}{\text{Prob(evidence }|\text{ not guilty})} = \frac{1}{P}=250 \]

In words, the evidence would be 250 times more probable if Jack were guilty than if he were innocent.  While this seems strong evidence in favour of guilt, it still does not tell us what we want to know, namely the probability that Jack is guilty in the light of the evidence: Prob(guilty | evidence), or, equivalently, the odds ratio – the odds of guilt relative to the odds of innocence, given the evidence.

To get that we must multiply the likelihood ratio by the prior odds on guilt, i.e. the odds on guilt before any evidence is presented. It’s often hard to get a numerical value for this. But in our artificial example, it is possible. We can argue that, in the absence of any other evidence, Jack is no more nor less likely to be the culprit than any other islander, so that the prior probability of guilt is 1/(N + 1), corresponding to prior odds on guilt of 1/N.

We can now apply Bayes’ theorem to obtain the posterior odds on guilt:

\[ \text {posterior odds} = \text{prior odds} \times LR = \left ( \frac{1}{N}\right ) \times \left ( \frac{1}{P} \right )= 0.25 \]

Thus the odds of guilt in the light of the evidence are 4 to 1 against. The corresponding posterior probability of guilt is

\[ \text{Prob}( \text{guilty } | \text{ evidence})= \frac{1}{1+NP}= \frac{1}{1+4}=0.2 \]

This is quite small –certainly no basis for a conviction.

This result is exactly the same as that given by the defence counter-argument (see above). That argument was the simpler of the two: it didn’t use Bayes’ theorem explicitly, though the theorem was implicit in the reasoning. The advantage of the defence argument is that it looks simpler. The advantage of the explicitly Bayesian argument is that it makes the assumptions clearer.

In summary, the prosecutor’s fallacy suggested, quite wrongly, that the probability that Jack was guilty was 0.996. The likelihood ratio was 250, which also seems to suggest guilt, but it doesn’t give us the probability that we need. In stark contrast, the defence counsel’s argument and, equivalently, the Bayesian argument suggested that the probability of Jack’s guilt was 0.2, or odds of 4 to 1 against guilt. The potential for wrong conviction is obvious.
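The three arguments can be reproduced in a few lines of Python, using the numbers from the example (N = 1000, P = 0.004):

```python
N = 1000     # innocent islanders, besides the one guilty person
P = 0.004    # Prob(a random islander has characteristic x)

# (1) Prosecutor's fallacy: quotes 1 - P as though it were Prob(guilty | evidence).
prosecutor = 1 - P                                 # 0.996 -- wrong

# (3) Bayesian argument: prior odds times the likelihood ratio.
LR = 1 / P                                         # 250
prior_odds = 1 / N                                 # Jack no more suspect than any islander
posterior_odds = prior_odds * LR                   # 0.25, i.e. 4 to 1 against guilt
posterior_prob = posterior_odds / (1 + posterior_odds)   # 0.2

# (2) Defence counter-argument gives the same answer directly.
defence = 1 / (1 + N * P)                          # 1/5 = 0.2

print(prosecutor, LR, posterior_odds, posterior_prob, defence)
```

The defence and Bayesian routes agree exactly, as the text argues, and both are a world away from the prosecutor’s 0.996.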


Although this argument uses an artificial example that is simpler than most real cases, it illustrates some important principles.

(1) The likelihood ratio is not a good way to evaluate evidence, unless there is good reason to believe that there is a 50:50 chance that the suspect is guilty before any evidence is presented.

(2) In order to calculate what we need, Prob(guilty | evidence), you need numerical values for how common the possession of characteristic x (the evidence) is in the whole population of possible suspects (a reasonable value might be estimated in the case of DNA evidence). We also need to know the size of that population. In the island example this was 1000, but in general it would be hard to specify, and any answer might well be contested by an advocate who understood the problem.

These arguments lead to four conclusions.

(1) If a lawyer uses the prosecutor’s fallacy, (s)he should be told that it’s nonsense.

(2) If a lawyer advocates conviction on the basis of the likelihood ratio alone, (s)he should be asked to justify the implicit assumption that there was a 50:50 chance that the suspect was guilty before any evidence was presented.

(3) If a lawyer uses the defence counter-argument or, equivalently, the version of the Bayesian argument given here, (s)he should be asked to justify the numerical values given to the prevalence of x in the population (P) and to the size of that population (N).  A range of values of P and N could be used, to provide a range of possible values of the final result, the probability that the suspect is guilty in the light of the evidence.

(4) The example that was used is the simplest possible case.  For more complex cases it would be advisable to ask a professional statistician. Some reliable people can be found at the Royal Statistical Society’s section on Statistics and the Law.

If you do ask a professional statistician, and they present you with a lot of mathematics, you should still ask these questions about precisely what assumptions were made, and ask for an estimate of the range of uncertainty in the value of Prob(guilty | evidence) which they produce.
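The range-of-values exercise suggested in conclusion (3) is easy to automate. A sketch in Python (the particular ranges for N and P are invented, purely to show the idea):

```python
def prob_guilty(N, P):
    """Posterior Prob(guilty | evidence) for the island-type argument,
    with prior odds 1/N and likelihood ratio 1/P."""
    return 1 / (1 + N * P)

# Tabulate the posterior over plausible ranges of N and P.
for N in (100, 1000, 10_000):
    for P in (0.001, 0.004, 0.01):
        print(f"N = {N:6d}, P = {P:.3f}: Prob(guilty | evidence) = {prob_guilty(N, P):.3f}")
```

The posterior ranges from about 0.9 (N = 100, P = 0.001) down to about 0.01 (N = 10,000, P = 0.01), which makes the point: the conclusion depends strongly on numbers that are likely to be contested.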

Postscript: real cases

Another paper by Philip Dawid, Statistics and the Law, is interesting because it discusses some recent real cases: for example the wrongful conviction of Sally Clark because of the wrong calculation of the statistics for Sudden Infant Death Syndrome.

On Monday 21 March, 2016, Dr Waney Squier was struck off the medical register by the General Medical Council because they claimed that she misrepresented the evidence in cases of Shaken Baby Syndrome (SBS).

This verdict was questioned by many lawyers, including Michael Mansfield QC and Clive Stafford Smith, in a letter, “General Medical Council behaving like a modern inquisition”.

The latter has already written “This shaken baby syndrome case is a dark day for science – and for justice”.

The evidence for SBS is based on the existence of a triad of signs (retinal bleeding, subdural bleeding and encephalopathy). It seems likely that these signs will be present if a baby has been shaken, i.e. Prob(triad | shaken) is high. But this is irrelevant to the question of guilt. For that we need Prob(shaken | triad). As far as I know, the data needed to calculate what matters are just not available.

It seems that the GMC may have fallen for the prosecutor’s fallacy. Or perhaps the establishment won’t tolerate arguments. One is reminded, once again, of the definition of clinical experience: "Making the same mistakes with increasing confidence over an impressive number of years." (from A Sceptic’s Medical Dictionary by Michael O’Donnell, BMJ Publishing, 1997).

Appendix (for nerds). Two forms of Bayes’ theorem

The form of Bayes’ theorem given at the start is expressed in terms of odds ratios. The same rule can be written in terms of probabilities. (This was the form used in the appendix of my paper.) For those interested in the details, it may help to define explicitly these two forms.

In terms of probabilities, the probability of guilt in the light of the evidence (what we want) is

\[ \text{Prob(guilty } | \text{ evidence}) = \text{Prob(evidence } | \text{ guilty}) \frac{\text{Prob(guilty })}{\text{Prob(evidence })} \]

In terms of odds ratios, the odds ratio on guilt, given the evidence (which is what we want) is

\[ \frac{ \text{Prob(guilty } | \text{ evidence)}} {\text{Prob(not guilty } | \text{ evidence)}} =
\left ( \frac{ \text{Prob(guilty)}} {\text {Prob(not guilty)}} \right )
\left ( \frac{ \text{Prob(evidence } | \text{ guilty)}} {\text{Prob(evidence } | \text{ not guilty)}} \right ) \]

or, in words,

\[ \text{posterior odds of guilt } =\text{prior odds of guilt} \times \text{likelihood ratio} \]

This is the precise form of the equation that was given in words at the beginning.

A derivation of the equivalence of these two forms is sketched in a document which you can download.
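In outline (this is only a sketch, not the downloadable document itself), the equivalence follows from writing the probability form of Bayes’ theorem once for “guilty” and once for “not guilty”:

\[ \text{Prob(guilty } | \text{ evidence)} = \text{Prob(evidence } | \text{ guilty)} \frac{\text{Prob(guilty)}}{\text{Prob(evidence)}} \]

\[ \text{Prob(not guilty } | \text{ evidence)} = \text{Prob(evidence } | \text{ not guilty)} \frac{\text{Prob(not guilty)}}{\text{Prob(evidence)}} \]

Dividing the first equation by the second, the common factor Prob(evidence) cancels, leaving posterior odds = prior odds × likelihood ratio: the odds-ratio form.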


23 March 2016

It’s worth pointing out the following connection between the legal argument (above) and tests of significance.

(1) The likelihood ratio works only when there is a 50:50 chance that the suspect is guilty before any evidence is presented (so the prior probability of guilt is 0.5, or, equivalently, the prior odds ratio is 1).

(2) The false positive rate in significance testing is close to the P value only when the prior probability of a real effect is 0.5, as shown in section 6 of the P value paper.

However there is another twist in the significance testing argument. The statement above is right if we take as a positive result any P < 0.05. If we want to interpret a value of P = 0.047 in a single test, then, as explained in section 10 of the P value paper, we should restrict attention to only those tests that give P close to 0.047. When that is done the false positive rate is 26% even when the prior is 0.5 (and much bigger than 30% if the prior is smaller – see the extra Figure). That justifies the assertion that if you claim to have discovered something because you have observed P = 0.047 in a single test, then there is a chance of at least 30% that you’ll be wrong. Is there, I wonder, any legal equivalent of this argument?
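The simple “any P < 0.05 counts as positive” case in statement (2) can be sketched with the standard false-positive-risk arithmetic. The power value of 0.8 is an assumption chosen for illustration:

```python
def false_positive_risk(prior, alpha=0.05, power=0.8):
    """Fraction of 'positive' (p < alpha) tests that are false positives,
    for a given prior probability that a real effect exists."""
    false_pos = alpha * (1 - prior)   # no real effect, but p < alpha anyway
    true_pos = power * prior          # real effect, correctly detected
    return false_pos / (false_pos + true_pos)

print(f"{false_positive_risk(prior=0.5):.3f}")   # about 0.06: close to alpha
print(f"{false_positive_risk(prior=0.1):.3f}")   # about 0.36: alarmingly large
```

With a prior of 0.5 the false positive risk is close to the significance level, as statement (2) says; with a lower prior it is far larger. (The 26% figure in the text comes from the more stringent calculation that conditions on P being close to 0.047, which this sketch does not attempt.)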

Jump to follow-up

“Statistical regression to the mean predicts that patients selected for abnormalcy will, on the average, tend to improve. We argue that most improvements attributed to the placebo effect are actually instances of statistical regression.”

“Thus, we urge caution in interpreting patient improvements as causal effects of our actions and should avoid the conceit of assuming that our personal presence has strong healing powers.”

McDonald et al., (1983)

In 1955, Henry Beecher published "The Powerful Placebo". I was in my second undergraduate year when it appeared, and for many decades after that I took it literally. Beecher looked at 15 studies and found that, on average, 35% of patients got "satisfactory relief" when given a placebo. This number became embedded in pharmacological folklore. He also mentioned that the relief provided by placebo was greatest in patients who were most ill.

Consider the common experiment in which a new treatment is compared with a placebo, in a double-blind randomised controlled trial (RCT). It’s common to call the responses measured in the placebo group the placebo response. But that is very misleading, and here’s why.

The responses seen in the group of patients that are treated with placebo arise from two quite different processes. One is the genuine psychosomatic placebo effect. This effect gives genuine (though small) benefit to the patient. The other contribution comes from the get-better-anyway effect. This is a statistical artefact and it provides no benefit whatsoever to patients. There is now increasing evidence that the latter effect is much bigger than the former.

How can you distinguish between real placebo effects and the get-better-anyway effect?

The only way to measure the size of genuine placebo effects is to compare in an RCT the effect of a dummy treatment with the effect of no treatment at all. Most trials don’t have a no-treatment arm, but enough do that estimates can be made. For example, a Cochrane review by Hróbjartsson & Gøtzsche (2010) looked at a wide variety of clinical conditions. Their conclusion was:

“We did not find that placebo interventions have important clinical effects in general. However, in certain settings placebo interventions can influence patient-reported outcomes, especially pain and nausea, though it is difficult to distinguish patient-reported effects of placebo from biased reporting.”

In some cases, the placebo effect is barely there at all. In a non-blind comparison of acupuncture and no acupuncture, the responses were essentially indistinguishable (despite what the authors and the journal said). See "Acupuncturists show that acupuncture doesn’t work, but conclude the opposite"

So the placebo effect, though a real phenomenon, seems to be quite small. In most cases it is so small that it would be barely perceptible to most patients. Most of the reason why so many people think that medicines work when they don’t isn’t a result of the placebo response, but it’s the result of a statistical artefact.

Regression to the mean is a potent source of deception

The get-better-anyway effect has a technical name, regression to the mean. It has been understood since Francis Galton described it in 1886 (see Senn, 2011 for the history). It is a statistical phenomenon, and it can be treated mathematically (see references, below). But when you think about it, it’s simply common sense.
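The common-sense argument can be turned into a toy simulation (all numbers are invented for illustration, not taken from any of the cited papers). Each simulated patient has a stable long-run pain score plus random day-to-day fluctuation; patients "enrol" only on a bad day, and are re-measured later with no treatment at all:

```python
import random

random.seed(1)   # reproducible illustration

TRUE_MEAN = 5.0   # each patient's long-run average pain (0-10 scale)
SD = 2.0          # day-to-day fluctuation
THRESHOLD = 7.0   # only days this bad lead to seeking treatment

improvements = []
for _ in range(10_000):
    at_enrolment = TRUE_MEAN + random.gauss(0, SD)
    if at_enrolment > THRESHOLD:                    # selected for abnormalcy
        later = TRUE_MEAN + random.gauss(0, SD)     # untreated follow-up
        improvements.append(at_enrolment - later)

mean_improvement = sum(improvements) / len(improvements)
print(f"apparent improvement with no treatment: {mean_improvement:.1f} points")
```

The apparent "improvement" (about 3 points on this scale) is pure selection artefact: nothing whatever was done to the patients.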

You tend to go for treatment when your condition is bad, and when you are at your worst, a bit later you’re likely to be better anyway. The great biologist Peter Medawar comments thus.

"If a person is (a) poorly, (b) receives treatment intended to make him better, and (c) gets better, then no power of reasoning known to medical science can convince him that it may not have been the treatment that restored his health"
(Medawar, P.B. (1969:19). The Art of the Soluble: Creativity and originality in science. Penguin Books: Harmondsworth).

This is illustrated beautifully by measurements made by McGorry et al., (2001). Patients with low back pain recorded their pain (on a 10 point scale) every day for 5 months (they were allowed to take analgesics ad lib).

The results for four patients are shown in their Figure 2. On average the scores stay fairly constant over five months, but they fluctuate enormously, with different patterns for each patient. Painful episodes that last for 2 to 9 days are interspersed with periods of lower pain or none at all. It is very obvious that if these patients had gone for treatment at the peak of their pain, then a while later they would feel better, even if they were not actually treated. And if they had been treated, the treatment would have been declared a success, despite the fact that the patient derived no benefit whatsoever from it. This entirely artefactual benefit would be biggest for the patients that fluctuate the most (e.g. those in panels a and d of the Figure).

Figure 2 from McGorry et al, 2000. Examples of daily pain scores over a 6-month period for four participants. Note: Dashes of different lengths at the top of a figure designate an episode and its duration.

The effect is illustrated well by an analysis of 118 trials of treatments for non-specific low back pain (NSLBP), by Artus et al., (2010). The time course of pain (rated on a 100 point visual analogue pain scale) is shown in their Figure 2. There is a modest improvement in pain over a few weeks, but this happens regardless of what treatment is given, including no treatment whatsoever.


FIG. 2 Overall responses (VAS for pain) up to 52-week follow-up in each treatment arm of included trials. Each line represents a response line within each trial arm. Red: index treatment arm; Blue: active treatment arm; Green: usual care/waiting list/placebo arms. ____: pharmacological treatment; – – – -: non-pharmacological treatment; . . .. . .: mixed/other. 

The authors comment

"symptoms seem to improve in a similar pattern in clinical trials following a wide variety of active as well as inactive treatments.", and "The common pattern of responses could, for a large part, be explained by the natural history of NSLBP".

In other words, none of the treatments work.

This paper was brought to my attention through the blog run by the excellent physiotherapist, Neil O’Connell. He comments

"If this finding is supported by future studies it might suggest that we can’t even claim victory through the non-specific effects of our interventions such as care, attention and placebo. People enrolled in trials for back pain may improve whatever you do. This is probably explained by the fact that patients enrol in a trial when their pain is at its worst which raises the murky spectre of regression to the mean and the beautiful phenomenon of natural recovery."

O’Connell has discussed the matter in a recent paper, O’Connell (2015), from the point of view of manipulative therapies. That’s an area where there has been resistance to doing proper RCTs, with many people saying that it’s better to look at “real world” outcomes. This usually means looking at how a patient changes after treatment. The hazards of this procedure are obvious from Artus et al., Fig 2, above. It maximises the risk of being deceived by regression to the mean. As O’Connell commented

"Within-patient change in outcome might tell us how much an individual’s condition improved, but it does not tell us how much of this improvement was due to treatment."

In order to eliminate this effect it’s essential to do a proper RCT with control and treatment groups tested in parallel. When that’s done, the control group shows the same regression to the mean as the treatment group, and any additional response in the latter can confidently be attributed to the treatment. Anything short of that is whistling in the wind.

Needless to say, the suboptimal methods are most popular in areas where real effectiveness is small or non-existent. This, sad to say, includes low back pain. It also includes just about every treatment that comes under the heading of alternative medicine. Although these problems have been understood for over a century, it remains true that

"It is difficult to get a man to understand something, when his salary depends upon his not understanding it."
Upton Sinclair (1935)

Responders and non-responders?

One excuse that’s commonly used when a treatment shows only a small effect in proper RCTs is to assert that the treatment actually has a good effect, but only in a subgroup of patients ("responders") while others don’t respond at all ("non-responders"). For example, this argument is often used in studies of anti-depressants and of manipulative therapies. And it’s universal in alternative medicine.

There’s a striking similarity between the narrative used by homeopaths and those who are struggling to treat depression. The pill may not work for many weeks. If the first sort of pill doesn’t work try another sort. You may get worse before you get better. One is reminded, inexorably, of Voltaire’s aphorism "The art of medicine consists in amusing the patient while nature cures the disease".

There is only a handful of cases in which a clear distinction can be made between responders and non-responders. Most often what’s observed is a smear of different responses to the same treatment – and the greater the variability, the greater the chance of being deceived by regression to the mean.

For example, Thase et al., (2011) looked at responses to escitalopram, an SSRI antidepressant. They attempted to divide patients into responders and non-responders. An example (Fig 1a in their paper) is shown.

Thase fig 1a

The evidence for such a bimodal distribution is certainly very far from obvious. The observations are just smeared out. Nonetheless, the authors conclude

"Our findings indicate that what appears to be a modest effect in the grouped data – on the boundary of clinical significance, as suggested above – is actually a very large effect for a subset of patients who benefited more from escitalopram than from placebo treatment. "

I guess that interpretation could be right, but it seems more likely to be a marketing tool. Before you read the paper, check the authors’ conflicts of interest.

The bottom line is that analyses that divide patients into responders and non-responders are reliable only if that can be done before the trial starts. Retrospective analyses are unreliable and unconvincing.

Some more reading

Senn, 2011 provides an excellent introduction (and some interesting history). The subtitle is

"Here Stephen Senn examines one of Galton’s most important statistical legacies – one that is at once so trivial that it is blindingly obvious, and so deep that many scientists spend their whole career being fooled by it."

The examples in this paper are extended in Senn (2009), “Three things that every medical writer should know about statistics”. The three things are regression to the mean, the error of the transposed conditional and individual response.

You can read slightly more technical accounts of regression to the mean in McDonald & Mazzuca (1983) "How much of the placebo effect is statistical regression" (two quotations from this paper opened this post), and in Stephen Senn (2015) "Mastering variation: variance components and personalised medicine". In 1988 Senn published some corrections to the maths in McDonald (1983).

The trials that were used by Hróbjartsson & Gøtzsche (2010) to investigate the comparison between placebo and no treatment were looked at again by Howick et al., (2013), who found that in many of them the difference between treatment and placebo was also small. Most of the treatments did not work very well.

Regression to the mean is not just a medical deceiver: it’s everywhere

Although this post has concentrated on deception in medicine, it’s worth noting that the phenomenon of regression to the mean can cause wrong inferences in almost any area where you look at change from baseline. A classical example concerns the effectiveness of speed cameras. They tend to be installed after a spate of accidents, and if the accident rate is particularly high in one year it is likely to be lower the next year, regardless of whether a camera has been installed or not. To find the true reduction in accidents caused by installation of speed cameras, you would need to choose several similar sites and allocate them at random to have a camera or no camera. As in clinical trials, looking at the change from baseline can be very deceptive.

Statistical postscript

Lastly, remember that even if you avoid all of these hazards of interpretation, and your test of significance gives P = 0.047, that does not mean you have discovered something. There is still a risk of at least 30% that your ‘positive’ result is a false positive. This is explained in Colquhoun (2014), "An investigation of the false discovery rate and the misinterpretation of p-values". I’ve suggested that one way to solve this problem is to use different words to describe P values: something like this.

P > 0.05 very weak evidence
P = 0.05 weak evidence: worth another look
P = 0.01 moderate evidence for a real effect
P = 0.001 strong evidence for real effect

But notice that if your hypothesis is implausible, even these criteria are too weak. For example, if the treatment and placebo are identical (as would be the case if the treatment were a homeopathic pill) then it follows that 100% of positive tests are false positives.


12 December 2015

It’s worth mentioning that the question of responders versus non-responders is closely-related to the classical topic of bioassays that use quantal responses. In that field it was assumed that each participant had an individual effective dose (IED). That’s reasonable for the old-fashioned LD50 toxicity test: every animal will die after a sufficiently big dose. It’s less obviously right for ED50 (effective dose in 50% of individuals). The distribution of IEDs is critical, but it has very rarely been determined. The cumulative form of this distribution is what determines the shape of the dose-response curve for fraction of responders as a function of dose. Linearisation of this curve, by means of the probit transformation used to be a staple of biological assay. This topic is discussed in Chapter 10 of Lectures on Biostatistics. And you can read some of the history on my blog about Some pharmacological history: an exam from 1959.

Every day one sees politicians on TV assuring us that nuclear deterrence works because no nuclear weapon has been exploded in anger since 1945. They clearly have no understanding of statistics.

With a few plausible assumptions, we can easily calculate that the time until the next bomb explodes could be as little as 20 years.

Be scared, very scared.

The first assumption is that bombs go off at random intervals. Since we have had only one so far (counting Hiroshima and Nagasaki as a single event), this can’t be verified. But given the large number of small influences that control when a bomb explodes (whether in war or by accident), it is the natural assumption to make. The assumption is given some credence by the observation that the intervals between wars are random [download pdf].

If the intervals between bombs are random, that implies that the distribution of the length of the intervals is exponential in shape. The nature of this distribution has already been explained in an earlier post about the random lengths of time for which a patient stays in an intensive care unit. If you haven’t come across an exponential distribution before, please look at that post before moving on.

All that we know is that 70 years have elapsed since the last bomb, so the interval until the next one must be greater than 70 years. The probability that a random interval is longer than 70 years can be found from the cumulative form of the exponential distribution.

If we denote the true mean interval between bombs as $\mu$, then the probability that an interval is longer than 70 years is

\[ \text{Prob}\left( \text{interval > 70}\right)=\exp{\left(\frac{-70}{\mu}\right)} \]

We can get a lower 95% confidence limit (call it $\mu_\mathrm{lo}$) for the mean interval between bombs by the argument used in Lectures on Biostatistics, section 7.8 (page 108). If we imagine that $\mu_\mathrm{lo}$ were the true mean, we want it to be such that there is a 2.5% chance that we observe an interval that is greater than 70 years. That is, we want to solve

\[ \exp{\left(\frac{-70}{\mu_\mathrm{lo}}\right)} = 0.025\]

That’s easily solved by taking natural logs of both sides, giving

\[ \mu_\mathrm{lo} = \frac{-70}{\ln{\left(0.025\right)}}= 19.0\text{ years}\]

A similar argument leads to an upper confidence limit, $\mu_\mathrm{hi}$, for the mean interval between bombs, by solving

\[ \exp{\left(\frac{-70}{\mu_\mathrm{hi}}\right)} = 0.975\]
\[ \mu_\mathrm{hi} = \frac{-70}{\ln{\left(0.975\right)}}= 2765\text{ years}\]

If the worst case were true, and the mean interval between bombs was 19 years, then the distribution of the time to the next bomb would have an exponential probability density function, $f(t)$,

\[ f(t) = \frac{1}{19} \exp{\left(\frac{-t}{19}\right)} \]

There would be a 50% chance that the waiting time until the next bomb would be less than the median of this distribution, $-19 \ln(0.5) = 13.2$ years.
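The arithmetic above is only a couple of lines of code to check (a Python sketch; the post does these calculations by hand):

```python
import math

years = 70  # observed bomb-free interval

mu_lo = -years / math.log(0.025)    # lower 95% limit for the mean interval
mu_hi = -years / math.log(0.975)    # upper 95% limit
median_worst = mu_lo * math.log(2)  # median wait if the true mean were mu_lo

print(round(mu_lo, 1), round(mu_hi), round(median_worst, 1))  # 19.0 2765 13.2
```

Note that the median of an exponential distribution is $\mu \ln 2$, which is where the 13.2 years comes from.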


In summary, the observation that there has been no explosion for 70 years implies that the mean time until the next explosion lies (with 95% confidence) between 19 years and 2765 years. If it were 19 years, there would be a 50% chance that the waiting time to the next bomb could be less than 13.2 years. Thus there is no reason at all to think that nuclear deterrence works well enough to protect the world from incineration.

Another approach

My statistical colleague, the ace probabilist Alan Hawkes, suggested a slightly different approach to the problem, via likelihood. The likelihood of a particular value of the interval between bombs is defined as the probability of making the observation(s), given a particular value of $\mu$. In this case, there is one observation, that the interval between bombs is more than 70 years. The likelihood, $L\left(\mu\right)$, of any specified value of $\mu$ is thus

\[L\left(\mu\right)=\text{Prob}\left( \text{interval > 70 | }\mu\right) = \exp{\left(\frac{-70}{\mu}\right)} \]

A plot of this function (graph on right) shows that it increases continuously with $\mu$, so the maximum likelihood estimate of $\mu$ is infinity. An infinite wait until the next bomb is perfect deterrence.


But again we need confidence limits for this. Since the upper limit is infinite, the appropriate thing to calculate is a one-sided lower 95% confidence limit. This is found by solving

\[ \exp{\left(\frac{-70}{\mu_\mathrm{lo}}\right)} = 0.05\]

which gives

\[ \mu_\mathrm{lo} = \frac{-70}{\ln{\left(0.05\right)}}= 23.4\text{ years}\]
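Again, one line of code checks the result (Python, as in the earlier sketch):

```python
import math

# One-sided lower 95% limit from the likelihood approach
mu_lo_one_sided = -70 / math.log(0.05)
print(round(mu_lo_one_sided, 1))  # 23.4
```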


The first approach gives 95% confidence limits for the average time until we get incinerated as 19 years to 2765 years. The second approach gives the lower limit as 23.4 years. There is no important difference between the two methods of calculation. This shows that the bland assurances of politicians that “nuclear deterrence works” are not justified.

It is not the purpose of this post to predict when the next bomb will explode, but rather to point out that the available information tells us very little about that question. This seems important to me because it contradicts directly the frequent assurances that deterrence works.

The only consolation is that, since I’m now 79, it’s unlikely that I’ll live long enough to see the conflagration.

Anyone younger than me would be advised to get off their backsides and do something about it, before you are destroyed by innumerate politicians.


While talking about politicians and war it seems relevant to reproduce Peter Kennard’s powerful image of the Iraq war.


and with that, to quote the comment made by Tony Blair’s aide, Lance Price


It’s a bit like my feeling about priests doing the twelve stations of the cross. Politicians and priests masturbating at the expense of kids getting slaughtered (at a safe distance, of course).


Chalkdust is a magazine published by students of maths from UCL Mathematics department. Judging by its first issue, it’s an excellent vehicle for popularisation of maths. I have a piece in the second issue

You can view the whole second issue on line, or download a pdf of the whole issue. Or a pdf of my bit only: On the Perils of P values.

The piece started out as another exposition of the interpretation of P values, but the whole of the first part turned into an explanation of the principles of randomisation tests. It beats me why anybody still does a Student’s t test. The idea of randomisation tests is very old. They are as powerful as t tests when the assumptions of the latter are fulfilled but a lot better when the assumptions are wrong (in the jargon, they are uniformly-most-powerful tests).

Not only that, but you need no mathematics to do a randomisation test, whereas you need a good deal of mathematics to follow Student’s 1908 paper. And the randomisation test makes transparently clear that random allocation of treatments is a basic and essential assumption that’s necessary for the validity of any test of statistical significance.

I made a short video that explains the principles behind the randomisation tests, to go with the printed article (a bit of animation always helps).

When I first came across the principles of randomisation tests, I was entranced by the simplicity of the idea. Chapters 6 – 9 of my old textbook were written to popularise them. You can find much more detail there.

In fact it’s only towards the end that I reiterate the idea that P values don’t answer the question that experimenters want to ask, namely:- if I claim I have made a discovery because P is small, what’s the chance that I’ll be wrong?

If you want the full story on that, read my paper. The story it tells is not very original, but it still isn’t known to most experimenters (because most statisticians still don’t teach it on elementary courses). The paper must have struck a chord because it’s had over 80,000 full text views and more than 10,000 pdf downloads. It reached an altmetric score of 975 (since when it has been mysteriously declining). That’s gratifying, but it is also a condemnation of the use of metrics. The paper is not original and it’s quite simple, yet it’s had far more "impact" than anything to do with my real work.

If you want simpler versions than the full paper, try this blog (part 1 and part 2), or the Youtube video about misinterpretation of P values.

The R code for doing 2-sample randomisation tests

You can download a pdf file that describes the two R scripts. There are two different R programs.

One re-samples randomly a specified number of times (the default is 100,000 times, but you can do any number). Download two_sample_rantest.R

The other uses every possible sample: in the case of two samples of 10 observations, it gives the distribution for all 184,756 ways of selecting 10 observations from 20. Download 2-sample-rantest-exact.R
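The scripts themselves are in R, but the exact version is short enough to sketch in Python too. The data here are invented, purely to show the shape of the calculation:

```python
import itertools

# Two invented samples; the downloadable R scripts do the same job on real data
a = [5.1, 4.8, 6.0, 5.5, 4.9]
b = [6.2, 6.8, 5.9, 7.1, 6.5]
observed = sum(b) / len(b) - sum(a) / len(a)

pooled = a + b
diffs = []
# Exact test: consider every way of calling 5 of the 10 values 'group A'
for combo in itertools.combinations(range(10), 5):
    ga = [pooled[i] for i in combo]
    gb = [pooled[i] for i in range(10) if i not in combo]
    diffs.append(sum(gb) / 5 - sum(ga) / 5)

# Two-tailed P: fraction of reallocations at least as extreme as observed
p = sum(abs(d) >= abs(observed) - 1e-9 for d in diffs) / len(diffs)
print(len(diffs))  # 252 possible reallocations of 5 from 10
```

The resampling version is the same idea, except that instead of enumerating every combination you shuffle the pooled values some large number of times.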

The launch party

Today the people who organise Chalkdust magazine held a party in the mathematics department at UCL. The editorial director is a graduate student in maths, Rafael Prieto Curiel. He was, at one time in the Mexican police force (he said he’d suffered more crime in London than in Mexico City). He, and the rest of the team, are deeply impressive. They’ve done a terrific job. Support them.

The party cakes

Rafael Prieto doing the introduction

Rafael Prieto doing the introduction

pic 3
Rafael Prieto and me

I got the T shirt

Decoding the T shirt

The top line is "i" because that’s the usual symbol for the square root of -1.

The second line is one of many equations that describe a heart shape. It can be plotted by calculating a matrix of values of the left-hand side for a range of values of x and y. Then plot the contour through the values of x and y for which the left-hand side is equal to 1. Download R script for this. (Method suggested by Rafael Prieto Curiel.)
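The same recipe works for any implicit curve. As a sketch (in Python with NumPy; the particular heart equation here is a well-known one and is my assumption, with the constant moved across so the contour of interest sits at zero):

```python
import numpy as np

# One well-known heart curve: (x^2 + y^2 - 1)^3 = x^2 * y^3
# Evaluate left side minus right side on a grid; the contour where the
# result crosses zero traces out the heart.
x = np.linspace(-1.5, 1.5, 401)
y = np.linspace(-1.5, 1.5, 401)
X, Y = np.meshgrid(x, y)
Z = (X**2 + Y**2 - 1)**3 - X**2 * Y**3

# With matplotlib the plot would be: plt.contour(X, Y, Z, levels=[0])
print(Z.min() < 0 < Z.max())  # True: the zero contour exists on this grid
```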



5 November 2015

The Mann-Whitney test

I was stimulated to write this follow-up because yesterday I was asked by a friend to comment on the fact that five different tests all gave identical P values, P = 0.0079. The paper in question was in Science magazine (see Fig. 1), so it wouldn’t surprise me if the statistics were done badly, but in this case there is an innocent explanation.

The Chalkdust article, and the video, are about randomisation tests done using the original observed numbers, so look at them before reading on. There is a more detailed explanation in Chapter 9 of Lectures on Biostatistics. Before it became feasible to do this sort of test, there was a simpler, and less efficient, version in which the observations were ranked in ascending order, and the observed values were replaced by their ranks. This was known as the Mann-Whitney test. It had the virtue that because all the ‘observations’ were now integers, the number of possible results of resampling was limited, so it was possible to construct tables to allow one to get a rough P value. Of course, replacing observations by their ranks throws away some information, and now that we have computers there is no need to use a Mann-Whitney test ever. But that’s what was used in this paper.

In the paper (Fig 1) comparisons are made between two groups (assumed to be independent) with 5 observations in each group. The 10 observations are just the ranks, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

To do the randomisation test we select 5 of these numbers at random for sample A, and the other 5 are sample B. (Of course this supposes that the treatments were applied randomly in the real experiment, which is unlikely to be true.) In fact there are only 10!/(5!.5!) = 252 possible ways to select a sample of 5 from 10, so it’s easy to list all of them. In the case where there is no overlap between the groups, one group will contain the smallest observations (ranks 1, 2, 3, 4, 5), and the other group will contain the highest observations, ranks 6, 7, 8, 9, 10.

In this case, the sum of the ‘observations’ in group A is 15, and the sum for group B is 40. These add to the sum of the first 10 integers, 10(10 + 1)/2 = 55. The mean of the randomisation distribution (which corresponds to a difference between means of zero) is 55/2 = 27.5.

There are two ways of getting an allocation as extreme as this (first group low, as above, or second group low, the other tail of the distribution). The two tailed P value is therefore 2/252 = 0.0079. This will be the result whenever the two groups don’t overlap, regardless of the numerical values of the observations. It’s the smallest P value the test can produce with 5 observations in each group.
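The whole calculation takes only a few lines to verify (a Python sketch, though the post’s own scripts are in R):

```python
import itertools

ranks = range(1, 11)           # the 10 observations are just their ranks
sums = [sum(c) for c in itertools.combinations(ranks, 5)]  # all 252 allocations

observed = 15                  # group A = ranks 1..5: no overlap between groups
centre = sum(ranks) / 2        # 27.5, the mean of the randomisation distribution

# Two-tailed: count allocations at least as far from 27.5 as the observed sum
extreme = sum(abs(s - centre) >= abs(observed - centre) for s in sums)
print(len(sums), extreme, extreme / len(sums))  # 252 2 0.007936507936507936
```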

The whole randomisation distribution looks like this


In this case, the abscissa is the sum of the ranks in sample A, rather than the difference between means for the two groups (the latter is easily calculated from the former). The red line shows the observed value, 15. There is only one way to get a total of 15 for group A: it must contain the lowest 5 ranks (group A = 1, 2, 3, 4, 5). There is also only one way to get a total of 16 (group A = 1, 2, 3, 4, 6), and there are two ways of getting a total of 17 (group A = 1, 2, 3, 4, 7, or 1, 2, 3, 5, 6). But there are 20 different ways of getting a sum of 27 or 28 (which straddle the mean, 27.5). The printout (.txt file) from the R program that was used to generate the distribution is as follows.

Randomisation test: exact calculation all possible samples

INPUTS: exact calculation: all possible samples
Total number of combinations = 252
number obs per sample = 5
sample A 1 2 3 4 5
sample B 6 7 8 9 10

sum for sample A= 15
sum for sample B = 40
mean for sample A= 3
mean for sample B = 8
Observed difference between sums (A-B) -25
Observed difference between means (A-B) -5
SD for sample A = 1.581139
SD for sample B = 1.581139
mean and SD for randomisation dist = 27.5 4.796662
quantiles for ran dist (0.025, 0.975) 18.275 36.725
Area equal to or less than observed diff 0.003968254
Area equal to or greater than minus observed diff 0.003968254
Two-tailed P value 0.007936508

Result of t test
P value (2 tail) 0.001052826
confidence interval 2.693996 7.306004

Some problems. Figure 1 alone shows 16 two-sample comparisons, but no correction for multiple comparisons seems to have been made. A crude Bonferroni correction would require replacement of a P = 0.05 threshold with P = 0.05/16 = 0.003. None of the 5 tests that gave P = 0.0079 reaches this level (of course the whole idea of a threshold level is absurd anyway).

Furthermore, even a single test that gave P = 0.0079 would be expected to have a false positive rate of around 10 percent.

Jump to follow-up

Today, 25 September, is the first anniversary of the needless death of Stefan Grimm. This post is intended as a memorial.

He should be remembered, in the hope that some good can come from his death.


On 1 December 2014, I published the last email from Stefan Grimm, under the title “Publish and perish at Imperial College London: the death of Stefan Grimm“. Since then it’s been viewed 196,000 times. The day after it was posted, the server failed under the load.

Since then, I have posted two follow-up pieces. On December 23, 2014 “Some experiences of life at Imperial College London. An external inquiry is needed after the death of Stefan Grimm“. Of course there was no external inquiry.

And on April 9, 2015, after the coroner’s report, and after Imperial’s internal inquiry, "The death of Stefan Grimm was “needless”. And Imperial has done nothing to prevent it happening again".

The tragedy featured in the introduction of the HEFCE report on the use of metrics.

“The tragic case of Stefan Grimm, whose suicide in September 2014 led Imperial College to launch a review of its use of performance metrics, is a jolting reminder that what’s at stake in these debates is more than just the design of effective management systems.”

“Metrics hold real power: they are constitutive of values, identities and livelihoods ”

I had made no attempt to contact Grimm’s family, because I had no wish to intrude on their grief. But in July 2015, I received, out of the blue, a hand-written letter from Stefan Grimm’s mother. She is now 80 and living in Munich. I was told that his father, Dieter Grimm, had died of cancer when he was only 59. I also learned that Stefan Grimm was distantly related to Wilhelm Grimm, one of the Gebrüder Grimm.

The letter was very moving indeed. It said "Most of the infos about what happened in London, we got from you, what you wrote in the internet".

I responded as sympathetically as I could, and got a reply which included several of Stefan’s drawings, and then more from his sister. The drawings were done while he was young. They show amazing talent, but by the age of 25 he was too busy with science to exploit his artistic talents.

With his mother’s permission, I reproduce ten of his drawings here, as a memorial to a man whose needless death was attributable to the very worst of the UK university system. He was killed by mindless and cruel "performance management", imposed by Imperial College London. The initial reaction of Imperial gave little hint of an improvement. I hope that their review of the metrics used to assess people will be a bit more sensible.

His real memorial lies in his published work, which continues to be cited regularly after his death.

His drawings are a reminder that there is more to human beings than getting grants. And that there is more to human beings than science.

Click the picture for an album of ten of his drawings. In the album there are also pictures of two books that were written for children by Stefan’s father, Dieter Grimm.


Dated Christmas Eve, 1979 (age 16)



Well well. It seems that Imperial are having an "HR Showcase: Supporting our people" on 15 October. And the introduction is being given by none other than Professor Martin Wilkins, the very person whose letter to Grimm must bear some responsibility for his death. I’ll be interested to hear whether he shows any contrition. I doubt whether any employees will dare to ask pointed questions at this meeting, but let’s hope they do.
