Randomized Controlled Trials: Evidence Biased Psychiatry
By David Healy, MD MRCPsych
Introduction
A new drug gets introduced to the market. It has been approved after stringent scrutiny by the FDA, which requires ever more convincing evidence that it works and that it is safe. The new treatment will always cost more than the old treatments, but even on the cost front, many would argue that we have entered an era in which placebo-controlled clinical trials demonstrate that new treatments, in contrast to older ones, actually do work, and that if we just stick to treatments that really work, costs should fall. Besides, it always seems to happen these days that when new and costly antidepressants or antipsychotics are put through an economic model based on the figures from clinical trials and a range of assumptions provided by experts, the model demonstrates that these new drugs costing thousands of dollars a year are in fact cheaper than treatments costing $100 per year or less. So where could the problems lie? Why do we seem to be so slow in reaching the new medical utopia towards which companies and others assure us we are heading?
Treatment Effects & Treatment Effectiveness
The first problem with this scenario lies in a set of misunderstandings about what RCTs show and what regulators are doing when they let a drug on the market.
When they were introduced, randomized controlled trials (RCTs) were a significant step forward. These trials work by first assuming there is no difference between a new and an old or a placebo treatment. This is called the null hypothesis. It means that standard RCTs are in fact set up to show that treatments do not work, rather than to demonstrate that treatments work. They were designed to stop therapeutic bandwagons in their tracks, and to stop quacks peddling worthless treatments to patients made vulnerable and desperate by their illness.
The first use of a controlled trial in psychiatry for instance was to demonstrate that cortisone did not work for schizophrenia. Recently controlled trials have shown that debriefing after trauma, which had all but become a social movement, does not work.
So what happens when a treatment like Thorazine back in the 1950s or Prozac now doesn’t fail in an RCT? If it does better than placebo, can it then be said to work?
The strict answer to this question is no. When Thorazine was shown to be different from placebo, what this meant was that it did something. Its critics might still argue that the something that it does might not be a good thing, but they cannot argue that the claims that it is doing something are just the wishful thinking of those who make a living out of using it.
Now, however, when RCTs are used, the results seem to fuel therapeutic bandwagons rather than stop them in their tracks. Clinical trial results are sold as evidence that Prozac for instance works (actually does good) rather than as evidence that a treatment like Prozac has an effect (which may be put to good use in judicious hands). There is no philosophical or methodological basis for this development.
Where do RCTs come from?
Randomized trials originated within epidemiology. When epidemiologists in the 19th century were attempting to establish what caused certain infectious disorders, they sampled huge sections of the population, controlling for the effects of age, sex, social class, ethnic group and other variables. Trying to establish whether a drug produces a change is the same problem in principle as trying to decide whether a bacterium produces a change. The problem here is that if it takes tens of thousands of patients to really control for all the things that might influence the outcome of treatment, the testing of drugs would become very difficult. Randomization is a means of greatly reducing the numbers of patients needed.
However, many epidemiologists had and continue to have misgivings about the capacity of randomization to substitute for sampling a whole population. When a drug or an infection is studied in tens or hundreds of thousands of patients, it is reasonable to suggest that the results may hold for the population of the United States as a whole. But is it as reasonable to suggest this when perhaps fewer than a hundred patients have been studied in a randomized protocol? Many experts think not.
The problems are made much worse in company sponsored RCTs. These studies recruit what the FDA terms "samples of convenience." These patient samples might bear little resemblance to the type of patient who will ultimately end up on the drug. The FDA regards such trials as internally valid but not necessarily externally valid. What this means is that the trial can provide evidence of a treatment signal, but no claims should be made on the basis of such studies about what will happen when the treatment is given widely.
The bottom line is that a majority of current trials in any area of medicine have the power to disconfirm the null hypothesis – that treatment does not differentiate from placebo – but this evidence does not support extrapolations to the likely effectiveness of treatment. Such extrapolations at present can only be based on clinical judgment. When treatments work, the condition being treated vanishes, and we don’t need randomized controlled trials to see this happening.
Rabbits from Hats?
To make matters worse, it is even harder in psychiatry to move from a demonstration that a treatment does something to knowing what will happen if this treatment is used widely. The reason for this has to do with the endpoints used in clinical trials of psychotropic agents. In other diseases a patient who was going to die lives, or something biological clearly changes – a high blood pressure or high lipid levels fall. But in psychiatric trials, the changes we depend on are changes in rating scale scores rather than demonstrations of a return to work, reduced mortality or eliminated germs.
There are four completely different sets of rating scales that could be used in trials. First there are observer based disease specific rating scales, such as the HAM-D, where a clinician rates a patient on items that the clinician wants to see change, which may not be the things that bother the patient. Second are patient based disease specific rating scales, such as the Beck Depression Inventory (BDI), which again contain features which clinicians may think important but others may not regard as critical. Third there are observer based non-disease specific scales; these are rating scales designed to measure a patient’s overall or global functioning. Fourth, there are patient based non-disease specific scales of global functioning; these are ordinarily called Quality of Life (QoL) scales.
If a psychiatric treatment really worked, one might expect the effect of treatment to show up on rating scales from all four domains of measurement. As a matter of fact, however, there is not a single antipsychotic or antidepressant that has been demonstrated to have consistent beneficial effects across all these domains. In the case of the antidepressants, demonstrations of treatment effects have largely been on the basis of clinician ratings on instruments like the Hamilton Depression Scale. Even these have been inconsistent and can only be demonstrated in approximately half of the studies undertaken.
The work of Weissman and colleagues on social adaptation or global functioning shows that while antidepressants may lead to symptomatic improvements, the broader functioning of the patient may not normalize for a long time afterwards. In the case of trials with SSRI antidepressants, QoL scales have been used in possibly up to 100 trials, with data from fewer than 10 reported. When patients are allowed to rate the outcomes, these drugs have simply not been shown to work.
If a treatment routinely produced the right kind of changes on all four sets of rating scales, there would still remain the problem of factoring in recent evidence of withdrawal syndromes (Tranter and Healy 1998) before extrapolating from any demonstration that the treatment did something to claims that it in fact works. If a stabilized patient relapses on withdrawal, the final outcome may be worse than non-treatment. There is in fact a substantial amount of evidence that patients who recover on placebo treatments are least likely to relapse. The bottom line is that treating and stopping treatment is in general not the same as not treating in the first instance, and we rarely know enough about the natural history of either the treated or untreated states to be sure we really know what we are doing.
After the introduction of Thorazine in the mid-1950s, the NIMH convened a major conference aimed at ensuring that a proper scientific basis was laid for the new field, one that would ensure the evaluation of new treatments and foster the right kind of basic science research that would lead to new and better Thorazines. The conference adopted recommendations to use RCTs and rating scales. One of the few dissenting voices was that of Nathan Kline, who argued that in the absence of real evidence that patients got better (that they left hospital or returned to work), rating scales and trials posed a risk. It would be very simple, he said, to produce a version of the rabbit out of the hat trick – which of course involves putting the rabbit into the hat in the first instance.
RCTs & Dumbing Down
There are further problems with the current evidence base. Like other epidemiological studies, RCTs essentially provide evidence of associations rather than evidence of cause and effect. Just as studies for instance of smoking and lung cancer or diet and cardiac disorders point to a link between events rather than an explanation of how or why these events may be linked, so too it is with drug studies. RCTs link drugs to a therapeutic outcome but in so doing they have probably for many clinicians obscured the mechanisms whereby these events are linked. They deflect attention away from what the drug actually does to bring about the association.
For example, in the case of the antidepressants, clinical trials suggest that a group of very different drugs, which almost certainly bring about their benefits by producing distinctive functional effects, at the end of the day produce a final common outcome. But the SSRIs were in fact synthesized in the first instance to do something functionally different to the older tricyclic agents. Arguing that the trial evidence shows that these agents all "work" diverts attention from the question of how they are working. Arguing that all these drugs "work" and therefore it doesn’t matter which drug I give this patient in front of me shows little or no understanding of either the drugs or the patients. Through what functional effects does a noradrenergic selective agent like desipramine bring about its benefits compared to an SSRI like Prozac?
Preclinical work suggests that desipramine for instance is energy enhancing, while Prozac and Zoloft and Paxil are serenic (anxiolytic). But our recent mesmerized focus on RCTs has obscured these distinctions in clinical practice. Clinicians today are increasingly likely to prescribe without knowing what potentially beneficial effects an agent produces and this is increasingly less likely to be either rational or good practice. If we don’t know what these diverse agents do to get depressed patients better, how can we know which of them to select to give the patient in front of us?
Now for the More Complex Cases
Depression is a relatively simple condition compared to manic-depressive disease or schizophrenia. In bipolar disorders, the problem gets even more complex. No one rating scale can be used in a condition which cycles from one pole to its polar opposite. Using frequency of episodes as an endpoint, thousands of patients would have to be recruited across multiple centers and sustained within an experimental protocol for years in order to produce a convincing demonstration of prophylaxis. This cannot simply be done. Even the resources of the largest pharmaceutical companies have not been enough to make trials like this happen. As a result, the use of anticonvulsants, sometimes called "mood-stabilizers", in mood disorders is underpinned by evidence of a treatment effect in depression or in mania but not by evidence of effects on manic-depressive disease. In the same way, there is little evidence on the extent to which antipsychotics work for schizophrenia over and above their treatment effect in acute psychotic states and in some maintenance studies.
Or consider the case of the hypnotics, which in some respects are even simpler than the antidepressants. In this case, RCT evidence may show that a hypnotic has a clear effect without any need to employ a rating scale. Patients, however, may not wish to take such treatments. In this sense, despite evidence that the treatment can be said to work in one dimension of value, a hypnotic may not work for a sub-group of patients in other dimensions.
In this case, further trials are called for, to establish how much such a treatment is valued. Trials of this sort are never undertaken for hypnotics. In the case of sleep and hypnotics, however, people are probably confident enough in their own judgment to ignore their clinician or any expert if need be. In the case of anxiety, depression, manic-depression or schizophrenia, the situation is more ambiguous, patients are more vulnerable, and a good clinician acting on behalf of a patient should know something about the extent to which treatments actually are valued. But there is no evidence of this sort.
But it’s been licensed by the FDA!
A common misunderstanding is that the fact that drugs get on the market only after the FDA has reviewed them means that the FDA must be convinced that these drugs have been shown to work.
In fact, a regulator facing a new drug has a job similar to that of a regulator who, faced with a yellow material, has to decide whether it is butter or colored lard mislabeled as butter. If minimal criteria for butter are met, the regulator must let the substance on the market. Regulators make no judgment as to whether this is good or bad butter and no judgment as to whether eating butter is good for you. It is clinicians or consumer organizations that should make judgments like that.
In the case of drugs, if a company can show, even if only in a minority of trials, that it is simply not correct to say their drug doesn’t have an effect in depression, the regulators are not in a position to keep this drug off the market. Faced with trials of Zoloft, showing effectively only one convincing result in 5 studies, Paul Leber of the FDA put it as follows: "how do we interpret ... two positive results in the context of several more studies that fail to demonstrate that effect? I am not sure I have an answer to that but I am not sure that the law requires me to have an answer to that – fortunately or unfortunately. That would mean, in a sense, that the sponsor could just do studies until the cows come home until he gets two of them that are statistically significant by chance alone, walks them out and says he has met the criteria."
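Leber's worry about "chance alone" can be put into rough numbers. The sketch below is illustrative only and is not drawn from any FDA document: it assumes independent trials of a drug with no real effect, each with a 5% chance of a spuriously significant result, and asks how likely it is that at least two trials come out "positive" as more are run. The function name and the 5% threshold are assumptions made for the illustration.

```python
from math import comb


def prob_at_least_k_significant(n_trials, k=2, alpha=0.05):
    """Chance that at least k of n_trials independent trials reach
    significance when the drug truly does nothing (null hypothesis true).

    alpha is the assumed per-trial false-positive rate.
    """
    # Binomial probability of seeing fewer than k false positives.
    p_fewer = sum(
        comb(n_trials, i) * alpha**i * (1 - alpha) ** (n_trials - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# How the chance of "two positive trials" grows with the number of trials run.
for n in (2, 5, 10, 20, 40):
    print(f"{n:>2} trials: {prob_at_least_k_significant(n):.3f}")
```

Under these assumptions the probability rises steadily with the number of trials a sponsor is prepared to run, which is the point of the remark about doing studies until the cows come home. It says nothing about drugs that do have a real effect, where positive trials would be more frequent.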
Marketing The Evidence
The problems outlined above are in a very real sense academic. In the real world, the problems with the evidence facing clinicians are even graver. First, clinical trials that do not favor a company’s interest are frequently not reported. This leads to a situation where the greatest single determinant of outcome of a published study appears to be its sponsorship. Second, as mentioned above there is no obligation on companies to report all the data from within trials that are published. In the case of the SSRIs, for example, there has been almost universal non-reporting of Quality of Life data. Finally, there is an over reporting of favorable studies. At international meetings and in peer-reviewed journals, senior experts in the field who have had no participation in a study present data from company trials in a manner that leaves others who might want to meta-analyze the results confused as to how many trials there actually have been. A recent estimate has been that this process leads to a 25% over-estimate of the efficacy of new antipsychotics for instance.
Aside from the under reporting, selective reporting and over reporting, an ever increasing proportion of the literature on treatments is ghost written. This was once thought to apply primarily to material appearing in journal supplements as the proceedings of satellite symposia or consensus conferences. However in a recent analysis of articles appearing in mainstream journals such as JAMA, the New England Journal of Medicine, and the BMJ, we have shown that up to 50% of the articles in many therapeutic fields are ghostwritten.
It is common for philosophers and sociologists of science to investigate the emergence and dominance of what are called paradigms in science. None of these philosophers or sociologists appears to have hitherto considered the possibility that the convergence of views among experts constituting a paradigm might stem from the fact that a common set of articles gets produced in communication agencies with the names of various experts almost randomly attached as appropriate for the occasion.
This has clear implications for the sociology of science but does any of this have any significance for clinical practice? Surely clinicians are trained to critically review papers and assess the literature. Indeed their duty under prescription only arrangements is to determine the true hazards of new agents and distinguish hype from genuine advances.
Unfortunately prescription only arrangements also mean that the full weight of the pharmaceutical industry is brought to bear on a very small number of purchasers. It would be a mistake to believe that this weight will be without influence. While the risk of dependence on benzodiazepines is clearly a therapeutic problem, the wholesale switch from the use of tranquilizers in the 1980s to antidepressants in the 1990s with the same patients being diagnosed as anxiety disorders in one decade and depressive disorders in another, stemmed to a considerable extent from the marketing power of pharmaceutical companies channeled through prescription only arrangements.
Now, since 9/11, the same patients are being re-diagnosed as anxiety disorders, to be treated with SSRIs. All that companies seem to think is needed to sell the pass to both clinicians and consumers is to re-brand these drugs as anxiolytics rather than tranquilizers. A hundred million dollars of advertising will do the rest.
In the case of the antipsychotics, an earlier generation of weakly neuroleptic antipsychotics was replaced with a generation of neuroleptics. The past 5 years, however, have seen a wholesale switch from neuroleptics back to a group of compounds which in terms of receptor profile and efficacy are indistinguishable from the first generation of antipsychotics such as chlorpromazine, chlorprothixene and levomepromazine. Neither of these switches can be justified on the basis of clinical trial evidence. Newer agents such as olanzapine and risperidone have in fact a greater number of suicides, deaths and suicide attempts linked to them in their pre-licensing trials than any other psychotropic drugs in history, but who gets to hear about this?
Business or Science?
RCTs produce main effects and side effects. By convention, the main effect of antidepressants is taken to be on mood, and effects for example on sexual functioning are designated side effects. In fact, sexual functioning may be more reliably affected by an SSRI than mood. Where up to 200 patients may be needed to demonstrate a treatment effect for an SSRI in depression, as few as 12 patients may be needed to demonstrate efficacy for premature ejaculation. Evidence of the potentially beneficial effects of SSRIs on aspects of sexual functioning, such as premature ejaculation, was kept almost entirely out of the public domain by companies for two decades. This should make it clear that the designation of a main effect of the compound is an essentially arbitrary decision, related to company economics and far from value-free.
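The contrast between 200 and 12 patients can be read as a statement about the size of the effect. As a rough sketch, assuming a two-arm trial with equal arms, a continuous outcome, a two-sided 5% significance level and 80% power (none of which is specified above, and the patient numbers are taken here as trial totals), the standard normal-approximation formula gives the smallest standardized difference a trial of a given size can reliably detect. The function name is hypothetical, and the approximation is crude for very small trials.

```python
from math import sqrt
from statistics import NormalDist


def detectable_effect_size(total_patients, alpha=0.05, power=0.80):
    """Smallest standardized mean difference (in standard deviations)
    that a two-arm trial of this total size can detect.

    Normal approximation with equal arms; alpha is two-sided.
    All defaults are illustrative assumptions.
    """
    n_per_arm = total_patients / 2
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


for total in (200, 12):
    d = detectable_effect_size(total)
    print(f"{total:>3} patients: detectable effect of about {d:.2f} SD")
```

On these assumptions, a trial that needs 200 patients is chasing an effect several times smaller than one that can be shown with 12, which is the sense in which the "side effect" may be the more reliable of the two.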
In 1860, faced with the then medical arsenal, Oliver Wendell Holmes stated that: ‘I firmly believe that if the whole materia medica as now used were to be sunk to the bottom of the sea, it would be all the better for mankind and all the worse for the fishes’. The perception now is that new evaluative methods have pushed out bad medicines from the arsenal. In fact, there is every reason to suspect that RCTs are pushing good therapies out of health care. Psychiatric units which once had active occupational therapy units and social programs are now reduced to boring sterile places where only things that have been "shown to work" happen. Patients are not exercised, nor taken out on social activities, nor involved in art, music or other therapies. If they leave hospital for psychosocial reasons, it is likely to be because of boredom.
One reason for this is that RCTs – as currently interpreted – allied to the patenting system, provide evidence that can be used for lobbying purposes. In contrast, other non-specific approaches will remain like placebo undeniably but unprovably effective and as a result unsponsored.
Much of the above could be countenanced if RCTs had done something to restrain therapeutic zeal (the furor therapeuticus). There is little evidence for this. In recent years there has been a mass medicalisation of a range of nervous conditions in primary care. Only time will tell how appropriate such medicalisation is. But what is clearly inappropriate is the current lack of monitoring of the therapeutic impact of intervening in these conditions. In practice, based on weak evidence of treatment effects, we have done a great deal to detect such conditions and advocate that subjects are given treatment but we have done little to monitor whether treatment has in fact delivered the desired result.
Because these agents have been shown by RCTs to "work", we have promoted a situation, virtually free of warnings, where primary care prescribers and others, besieged by the mass of community nervous problems and all but impotent to do much for these, have been trapped by the weight of supposed scientific evidence into indiscriminately handing out psychotropic agents on a massive scale, and increasingly to children.
There have been moves in recent years by leading medical journals to encourage companies to publish all their data. The implication appears to be that if only all the data is published the field will become scientific. In fact, publication of all the data will only produce acceptable business practice in contrast to the currently unacceptable business practice. The systematic concealment of data about a new car for instance would constitute bad business practice rather than bad science. It will take considerably more than more transparent publication practices to produce good science. Good science will only result from studies that are designed to answer scientific questions rather than from ones designed to support regulatory applications or market penetration.
Coda
We recently reported the first results of a study in North Wales which was undertaken against a background of a population that has been stable over a 100-year period in terms of population numbers, age, cohorts, ethnic mix and rurality. This demonstrated that there has been a three-fold increase in the rate of detentions into psychiatric services, and a 15-fold increase in the rate of admissions, since the introduction of the psychotropic drugs. The inter-illness intervals for bipolar disorders appear to have got shorter rather than longer, despite the availability of supposedly prophylactic treatments. Overall, patients with all psychiatric conditions now appear to spend a greater amount of time in a service bed than they would have done 50 or 100 years ago. Such findings are compatible with our treatments having effects, which may be used judiciously, but in many instances are probably not being used to their best advantage. These findings are incompatible with our treatments being effective in practice for a majority of the patients to whom they are given.
At the NIMH conference following the introduction of Thorazine, Nathan Kline and one other figure expressed doubts. The other was Ed Evarts. Evarts suggested that had fever therapy and later penicillin not been discovered as a treatment for GPI, Thorazine would have ended up being used for dementia paralytica – tertiary syphilis/GPI. And the research methods being proposed, which we have now come to rely on exclusively for dementia praecox (schizophrenia) and manic-depressive illness, would have demonstrated how useful Thorazine was for GPI. The failure of cases of GPI to clear up in response to Thorazine would have justified the production of an ever-increasing number of essentially similar agents. A research and therapy establishment would have arisen on the back of these efforts, and Evarts predicted this would have actively inhibited the discovery of a treatment that really worked for dementia paralytica, such as penicillin.
The example of GPI and penicillin demonstrates that everybody knows when a treatment really works without the need for RCTs – the problem vanishes. Notwithstanding this, we work in an era which, for a range of reasons, puts great store on evidence-based medicine. RCTs and the evidence derived from them embodied in guidelines have become a solution for complexity and a substitute for wisdom, and in some cases a substitute for common sense.
There is however one advantage in the new arrangements. The first antipsychotics and antidepressants led to the emergence of antipsychiatry and a questioning of the legitimacy of psychiatry. Such a scenario is unlikely to be repeated. The market development plans of drug companies for recent and future generations of psychotropic agents include the establishment of or penetration of patient support groups. Psychiatrists who might once have been vilified when they advocated new physical treatments to patient groups are more likely to find themselves vilified now if they fail to endorse enthusiastically the latest treatments.
A growing string of academic freedom cases, the most famous being the sacking of Nancy Olivieri from the University of Toronto which was linked to the publication of clinical trials results inconvenient to a sponsoring pharmaceutical company, demonstrate that fashionable treatments increasingly pose dilemmas that go beyond any problems in the evidence base or in the way that evidence is marketed.
References
Bisson JL, Jenkins PL, Alexander J, Bannister C (1997), Randomised controlled trial of psychological debriefing for victims of acute burn trauma. British Journal of Psychiatry,171, 78-81.
Evarts E (1959), A discussion of the relevance of effects of drugs on animal behavior to the possible effects of drugs on psychopathological processes in man. In Psychopharmacology: Problems in Evaluation. (eds J Cole & R Gerard) Publication 583, pp 284-306. Washington D.C.: National Academy of Sciences/National Research Council.
Freemantle N, Mason J, Phillips T, Anderson IM (2000). Predictive value of pharmacological activity for the relative efficacy of antidepressant drugs. Meta-regression analysis. British Journal of Psychiatry, 177, 292-302.
Gilbody SM, Song F (2000) Publication bias and the integrity of psychiatry research. Psychological Medicine, 30, 253-258
Healy D (1997) The Antidepressant Era. Cambridge Ma: Harvard University Press
Healy D (2000) The assessment of outcome in depression. Measures of social functioning. Reviews in Contemporary Pharmacotherapy 11, 295-301.
Healy D (2001). The Creation of Psychopharmacology. Cambridge, Mass, Harvard University Press.
Healy D (2001b). Treating More Patients Than Ever Before. Lecture delivered at Hannah 6th International Conference on the History of Psychiatry, Toronto April 17th (available on request).
Healy D, Nutt D (1997). British Association for Psychopharmacology Consensus on Childhood and Learning Disabilities Psychopharmacology. J Psychopharmacology (1998), 11, 291-294
Healy D, Savage M, Michael P, Harris M, Cattell D, Carter M, McMonagle T, Sohler N, Susser E (2001). Psychiatric service utilisation: 1896 & 1996 compared. Psychological Medicine 31, 779-790.
Holmes OW (1891). Medical Essays 1842-1882. (Cited in JH Young) Pure Food, Princeton, p 19. Princeton University Press.
Huston P, Moher D (1996) Redundancy, disaggregation and the integrity of medical research. Lancet 347, 1024-1026.
Jick S, Dean AD, Jick H (1995) Antidepressants and suicide. British Medical Journal 310, 215-218.
Pedersen V, Bogeso K (1998). Drug Hunting. In Healy D, The Psychopharmacologists, Arnold, London, pp 561-580.
Raphael B, Meldrum L, McFarlane AC (1995) Does debriefing after psychological trauma work? Time for randomised controlled trials. British Medical Journal, 310, 1479-1480.
Rees WL (1997) The place of controlled trials in the development of psychopharmacology. History of Psychiatry, 8, 1-20.
Rennie D (1999) Fair conduct and fair reporting of clinical trials. JAMA, 282, 1766-1768.
Tranter R, Healy D (1998) Neuroleptic discontinuation syndromes. Journal of Psychopharmacology, 12, 306-311.
Viguera AC, Baldessarini RJ, Hegarty JD, van Kammen DP, Tohen M (1997) Clinical risk following abrupt and gradual withdrawal of maintenance neuroleptic treatment. Archives of General Psychiatry, 54, 49-55.
Waldinger M, Hengeveld MH, Zwinderman AH (1994) Paroxetine treatment of premature ejaculation: a double-blind randomized placebo-controlled study. American Journal of Psychiatry, 151, 1377-1379.
Weissman MM, Klerman GL, Paykel ES, Prusoff B & Hanson B (1974). Treatment effects on the social adjustment of depressed patients. Arch Gen Psychiatry, 30, 771-778
David Healy MD MRCPsych
Director
North Wales Department of Psychological Medicine
Hergest Unit
Bangor
Wales LL57 2PW
Tel: 01248-384452
Fax: 01249-371397