I'm currently reading Ben Goldacre's latest book, Bad Pharma. If you want to know what I think of it, you'll be able to read my review of it in EMWA's journal, Medical Writing, in due course, after I've finished reading it. But today, I want to share a few thoughts on interim analyses of clinical trials, prompted by one section of the book (pages 184-186, if you have the book and want to look up the section).
Interim analyses are an important feature of the design of clinical trials. It's unethical to continue a trial longer than necessary. If, for example, you are doing a placebo-controlled clinical trial in 1000 patients, and you already have excellent evidence after treating the first 200 patients that the new treatment is better than placebo, then it would be unethical to continue to randomise patients to placebo after that point. Similarly, if you have evidence that the new treatment is actually worse than placebo, then it would be unethical to continue to randomise patients to the new treatment. So in either case, you would want to stop the trial.
This sounds simple in theory, but in practice, it's a lot more difficult. What counts as "excellent evidence"? What you don't want to do is stop a study early when the treatment is a little bit better than placebo, because it might just have been a fluke. It turns out that there are statistical rules you can apply to determine when to stop the study, which have been discussed at length in the statistical literature. There's some information about this on the CONSORT website if you want to read up on it.
So in practice, what we need to do is to set pre-specified rules for any interim analysis. You then do an interim analysis at the specified time, and if the results are good enough (or bad enough) to meet the stopping criteria, you stop the study early, and otherwise you carry on. Applying pre-specified rules with appropriate statistical adjustments guards against stopping early on the basis of a fluke.
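To see why the statistical adjustment matters, here's a quick Python sketch (my own illustration, not anything from the book) of what happens if you peek at the data repeatedly with *no* adjustment. It simulates trials with no real treatment effect at all, tests at a nominal p < 0.05 at each of five interim looks, and counts how often a trial would have "stopped early" purely on a fluke. All the parameters (50 patients per arm per look, five looks, a 20% event rate) are arbitrary choices for illustration.

```python
# Why unadjusted interim looks invite stopping on a fluke: simulate trials
# with NO real treatment effect and test at nominal p < 0.05 at each look.
# All parameters here are illustrative assumptions, not from the post.
import random
from statistics import NormalDist

random.seed(2)
norm = NormalDist()

def z_two_prop(e1, e2, n):
    """Two-proportion z-test on event counts e1, e2 out of n per arm."""
    pooled = (e1 + e2) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return 0.0 if se == 0 else (e1 / n - e2 / n) / se

def trial_ever_significant(n_per_look=50, looks=5, p_event=0.2, alpha=0.05):
    """True if any of the interim looks crosses the unadjusted threshold."""
    e1 = e2 = n = 0
    z_crit = norm.inv_cdf(1 - alpha / 2)   # two-sided 0.05 -> |z| > 1.96
    for _ in range(looks):
        e1 += sum(random.random() < p_event for _ in range(n_per_look))
        e2 += sum(random.random() < p_event for _ in range(n_per_look))
        n += n_per_look
        if abs(z_two_prop(e1, e2, n)) > z_crit:
            return True                    # would have "stopped early"
    return False

false_stops = sum(trial_ever_significant() for _ in range(2000)) / 2000
print(f"false-positive rate with 5 unadjusted looks: {false_stops:.3f}")
```

Even though each individual look uses the conventional 5% threshold, the chance of crossing it at *some* look is well above 5%, which is exactly why pre-specified rules use much stricter interim thresholds.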
Anyway, that's the background, now back to Ben Goldacre's book. Ben tells us that interim analyses are a problem. Although he acknowledges that it is sometimes ethically necessary to stop trials early, and that statistical stopping rules can indeed guard against chance variation, he also claims that stopping trials early "will probably pollute the data".
So why does he say this?
He cites a 2010 study by Bassler et al, which compared trials that stopped early with trials that continued until their planned completion. Bassler et al concluded "Truncated RCTs were associated with greater effect sizes than RCTs not stopped early." In other words, truncated RCTs appeared to be biased, reporting treatment effects larger than the true ones.
That does make it sound like allowing trials to stop early may bias the results we see, but there's a subtle problem with Bassler et al's study, which means that I don't think it supports the assertion in Ben Goldacre's book that early stopping of trials "will probably pollute the data".
Here's the problem. Bassler et al searched specifically for trials that stopped early because they had already found convincing evidence of benefit. It's pretty clear that studies that find large beneficial effects at their interim analysis are more likely to stop early. But it's important to remember here that trials that stop early because of convincing evidence of benefit are only a subset of trials that stop early. Some trials also stop early for the opposite reason, namely that the active treatment turns out to be worse than the control (or at least so unlikely to prove better than the control that it's not worth continuing). So only looking at one set of trials that stopped early will have biased their results.
To gain some more insight into this, I have run some simulations. I simulated 10,000 clinical trials, of varying sizes (mean size 454, range 50 to 2058), in which patients were randomised in a 1:1 ratio to a treatment group or a control group. In all trials, patients had a 20% probability of a bad outcome if they were in the control group, and 15% probability of a bad outcome if they were in the treatment group. So an unbiased estimate of the treatment effect should show a relative risk of 0.75.
I then simulated an interim analysis halfway through each trial. If the results were significantly in favour of active treatment with P < 0.001, then I assumed the trial would have been stopped for efficacy. If they were in favour of control with P < 0.1, then I assumed it would have been stopped for futility. There are, of course, many different stopping rules in use (and indeed I looked at a range of them, with similar results), but these ones will do for our purposes.
Technically, the proper way to calculate the combined relative risk from the studies would be by meta-analysis, but since I'm just doing this for a quick and dirty blogpost and not for an academic publication, I simply looked at the mean relative risk from the different studies. I doubt that it would make much difference to the results.
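For the curious, the "proper" pooling mentioned above would look something like the following fixed-effect, inverse-variance meta-analysis of log relative risks. The event counts here are made-up illustrative numbers, not output from the simulation, and a real analysis would also consider a random-effects model.

```python
# Minimal fixed-effect (inverse-variance) meta-analysis of log relative
# risks. The trial counts below are invented for illustration only.
import math

# (treatment events, treatment n, control events, control n) per trial
trials = [(15, 100, 20, 100), (30, 220, 41, 220), (8, 60, 12, 60)]

num = den = 0.0
for a, n1, c, n2 in trials:
    log_rr = math.log((a / n1) / (c / n2))
    var = 1/a - 1/n1 + 1/c - 1/n2      # standard variance of a log RR
    w = 1 / var                        # inverse-variance weight
    num += w * log_rr
    den += w

pooled_rr = math.exp(num / den)
se = math.sqrt(1 / den)
ci = (math.exp(num / den - 1.96 * se), math.exp(num / den + 1.96 * se))
print(f"pooled RR {pooled_rr:.3f} (95% CI {ci[0]:.3f} to {ci[1]:.3f})")
```

The key difference from a simple mean is that larger, more precise trials get more weight, which is why the two approaches can give slightly different answers.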
So, of our 10,000 trials, 109 stopped early for efficacy and 58 stopped early for futility, with the remaining 9833 carrying on to completion. If we look at the mean relative risk of all the trials if they'd continued to completion, it would have been 0.768, with a 95% CI of 0.764 to 0.771 (you'll note that's slightly larger than 0.75, which I suspect is a result of taking a simple mean rather than doing a proper meta-analysis, but you'll also note that it's close enough to make little difference). If, however, we assume that the trials that met stopping criteria didn't continue to completion and reported results at the interim analysis, then the relative risk was 0.771 (95% CI 0.767 to 0.776). This is almost identical to the "true" result we'd have obtained if all studies had continued to completion (and in fact shows a slightly smaller treatment effect).
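In case it's not obvious where confidence intervals like those come from, here's how a 95% CI for a simple mean of per-trial relative risks can be computed with a normal approximation. The numbers fed in are random placeholders, not the simulation's actual output.

```python
# 95% CI for a simple mean via the normal approximation. The input
# values are stand-in RR estimates, not the simulation's real output.
import math
import random

random.seed(3)
rrs = [random.gauss(0.77, 0.15) for _ in range(500)]   # placeholder RRs

mean = sum(rrs) / len(rrs)
sd = math.sqrt(sum((x - mean) ** 2 for x in rrs) / (len(rrs) - 1))
se = sd / math.sqrt(len(rrs))                          # standard error of the mean
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"mean RR {mean:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```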
Now, if we compare the studies that stopped early for efficacy with those that completed, we do indeed see greater treatment effects. Of the studies that completed, the mean relative risk was 0.767, and of the studies that terminated early for efficacy, the mean relative risk was 0.356. This is a dramatic difference, is concordant with Bassler et al's findings, and is also completely what we'd expect: just by chance, some trials will find greater treatment effects than others, and those that do are more likely to stop early for efficacy.
However, before we conclude that those studies will bias the literature, we have to remember that they are just one side of the coin. Studies can also stop early for futility. If we look at the studies that stopped early for futility, then the mean relative risk was 2.30. This is a pretty dramatic underestimate of the true treatment effect, which counteracts the overestimate of the treatment effect we see from the studies that stopped early for efficacy.
So overall, I don't believe that allowing studies to stop early creates any overall bias of treatment effects. The bias that Bassler et al found was because of their own study design, in which they selected a biased sample. Yes, any one trial that has stopped early (either for efficacy or futility) is likely to have biased results. But when you take them all together, the biases cancel out.
Update 15 October:
It has been pointed out to me offline that my analysis may be over-simplifying to some extent. In the analysis I described above, I simply calculated relative risks in the normal way, whereas in fact, since results from interim analyses are known to be biased, there are statistical methods for adjusting for those biases (see, for example, this paper). Had I adjusted for the bias in that way, then my results might have been different.
However, I'm not sure this makes much difference to my conclusions, for 2 reasons. First, I suspect such methods of adjustment may be rarely used in practice. I had a look at the 5 most recent truncated studies from Bassler's paper, and none of them reported using any such adjustment. Given that more recent trials probably have better methods than older trials, that suggests that methods of adjusting for bias from interim analyses are not commonly used.
Second, even if such methods were used, we would still be left with a situation in which we have a set of trials that terminated early for efficacy and a set that terminated early for futility. Of course, if the methods were used and worked as well as they should, then there should be no bias (which of course was not what Bassler et al found, reinforcing my suspicion that adjustment methods are not often used). But even if they didn't adjust for all the bias, we would still expect the two sets to cancel out.
The only situation that would invalidate my results is if we imagine that trials that stop early for futility commonly use bias adjustment methods, while trials that stop early for efficacy seldom do. While I can't rule that out, it does seem rather unlikely.