The Truth Wears Off

The excerpts from The New Yorker (below) may be read as a companion piece to the Atlantic article profiling Dr. John Ioannidis.    

However, the opening example that Lehrer uses to make the point that initial impressive positive research findings often weaken in time, perpetuates a myth about the new antipsychotics, a myth promulgated by psychiatry which more than any other medical discipline is tainted by close ties to drug manufacturers.

Based on the original FDA approval documents, we dispute the myth that the new antipsychotic drugs had ever demonstrated "a dramatic decrease in the subjects’ psychiatric symptoms."

In fact, as early as the time of FDA approval for marketing, Dr. Paul Leber, (then) chief of the FDA’s Division of Neuropharmacological Drug review, concluded in an internal memorandum (1996) that the antipsychotic, Zyprexa (for example) has not shown clinical efficacy only as "proof in principle." Furthermore, the premarketing trials were found to be deficient: 

"inappropriate design"

"inappropriate sample of patients”

“ill-suited titration;" “high incidence of dropouts”

“the evidence of efficacy submitted to the FDA provided only "proof in principle" of the drug’s acute antipsychotic action.”

See: www.fda.gov/downloads/Drugs/DrugSafety/UCM173472.pdf

 

 

THE NEW YORKER

The Truth Wears Off

Is there something wrong with the scientific method?

by Jonah Lehrer December 13, 2010

Many results that are rigorously proved and accepted start shrinking in later studies.

Many results that are rigorously proved and accepted start shrinking in later studies.

EXCERPTS:

On September 18, 2007, a few dozen neuroscientists, psychiatrists, and drug-company executives gathered in a hotel conference room in Brussels to hear some startling news. It had to do with a class of drugs known as atypical or second-generation antipsychotics, which came on the market in the early nineties. The drugs, sold under brand names such as Abilify, Seroquel, and Zyprexa, had been tested on schizophrenics in several large clinical trials, all of which had demonstrated a dramatic decrease in the subjects’ psychiatric symptoms. As a result, second-generation antipsychotics had become one of the fastest-growing and most profitable pharmaceutical classes. By 2001, Eli Lilly’s Zyprexa was generating more revenue than Prozac. It remains the company’s top-selling drug.

But the data presented at the Brussels meeting made it clear that something strange was happening: the therapeutic power of the drugs appeared to be steadily waning. A recent study showed an effect that was less than half of that documented in the first trials, in the early nineteen-nineties. Many researchers began to argue that the expensive pharmaceuticals weren’t any better than first-generation antipsychotics, which have been in use since the fifties. “In fact, sometimes they now look even worse,” John Davis, a professor of psychiatry at the University of Illinois at Chicago, told me.

Before the effectiveness of a drug can be confirmed, it must be tested and tested again. Different scientists in different labs need to repeat the protocols and publish their results. The test of replicability, as it’s known, is the foundation of modern research. Replicability is how the community enforces itself. It’s a safeguard for the creep of subjectivity. Most of the time, scientists know what results they want, and that can influence the results they get. The premise of replicability is that the scientific community can correct for these flaws.

But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. This phenomenon doesn’t yet have an official name, but it’s occurring across a wide range of fields, from psychology to ecology. In the field of medicine, the phenomenon seems extremely widespread, affecting not only antipsychotics but also therapies ranging from cardiac stents to Vitamin E and antidepressants: Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades.

For many scientists, the effect is especially troubling because of what it exposes about the scientific process. If replication is what separates the rigor of science from the squishiness of pseudoscience, where do we put all these rigorously validated findings that can no longer be proved? Which results should we believe? Francis Bacon, the early-modern philosopher and pioneer of the scientific method, once declared that experiments were essential, because they allowed us to “put nature to the question.” But it appears that nature often gives us different answers.

Jonathan Schooler was a young graduate student at the University of Washington in the nineteen-eighties when he discovered a surprising new fact about language and memory. At the time, it was widely believed that the act of describing our memories improved them. But, in a series of clever experiments, Schooler demonstrated that subjects shown a face and asked to describe it were much less likely to recognize the face when shown it later than those who had simply looked at it. Schooler called the phenomenon “verbal overshadowing.”

The study turned him into an academic star. Since its initial publication, in 1990, it has been cited more than four hundred times. Before long, Schooler had extended the model to a variety of other tasks, such as remembering the taste of a wine, identifying the best strawberry jam, and solving difficult creative puzzles. In each instance, asking people to put their perceptions into words led to dramatic decreases in performance.

But while Schooler was publishing these results in highly reputable journals, a secret worry gnawed at him: it was proving difficult to replicate his earlier findings. “I’d often still see an effect, but the effect just wouldn’t be as strong,” he told me. “It was as if verbal overshadowing, my big new idea, was getting weaker.” At first, he assumed that he’d made an error in experimental design or a statistical miscalculation. But he couldn’t find anything wrong with his research. He then concluded that his initial batch of research subjects must have been unusually susceptible to verbal overshadowing. (John Davis, similarly, has speculated that part of the drop-off in the effectiveness of antipsychotics can be attributed to using subjects who suffer from milder forms of psychosis which are less likely to show dramatic improvement.) “It wasn’t a very satisfying explanation,” Schooler says. “One of my mentors told me that my real mistake was trying to replicate my work. He told me doing that was just setting myself up for disappointment.” XXXXXXX

…….In 2004, Schooler embarked on an ironic imitation of Rhine’s research: he tried to replicate this failure to replicate. In homage to Rhine’s interests, he decided to test for a parapsychological phenomenon known as precognition. The experiment itself was straightforward: he flashed a set of images to a subject and asked him or her to identify each one. Most of the time, the response was negative—the images were displayed too quickly to register. Then Schooler randomly selected half of the images to be shown again. What he wanted to know was whether the images that got a second showing were more likely to have been identified the first time around. Could subsequent exposure have somehow influenced the initial results? Could the effect become the cause?

The craziness of the hypothesis was the point: Schooler knows that precognition lacks a scientific explanation. But he wasn’t testing extrasensory powers; he was testing the decline effect. “At first, the data looked amazing, just as we’d expected,” Schooler says. “I couldn’t believe the amount of precognition we were finding. But then, as we kept on running subjects, the effect size”—a standard statistical measure—“kept on getting smaller and smaller.” The scientists eventually tested more than two thousand undergraduates. “In the end, our results looked just like Rhine’s,” Schooler said. “We found this strong paranormal effect, but it disappeared on us.”

The most likely explanation for the decline is an obvious one: regression to the mean. As the experiment is repeated, that is, an early statistical fluke gets cancelled out. The extrasensory powers of Schooler’s subjects didn’t decline—they were simply an illusion that vanished over time. And yet Schooler has noticed that many of the data sets that end up declining seem statistically solid—that is, they contain enough data that any regression to the mean shouldn’t be dramatic. “These are the results that pass all the tests,” he says. “The odds of them being random are typically quite remote, like one in a million. This means that the decline effect should almost never happen. But it happens all the time! Hell, it’s happened to me multiple times.” And this is why Schooler believes that the decline effect deserves more attention: its ubiquity seems to violate the laws of statistics. “Whenever I start talking about this, scientists get very nervous,” he says. “But I still want to know what happened to my results. Like most scientists, I assumed that it would get easier to document my effect over time. I’d get better at doing the experiments, at zeroing in on the conditions that produce verbal overshadowing. So why did the opposite happen? I’m convinced that we can use the tools of science to figure this out. First, though, we have to admit that we’ve got a problem.” XXXXXXXXXXXXXXXXX

……….Biologist Michael Jennions, argues that the decline effect is largely a product of publication bias, or the tendency of scientists and scientific journals to prefer positive data over null results, which is what happens when no effect is found. The bias was first identified by the statistician Theodore Sterling, in 1959, after he noticed that ninety-seven per cent of all published psychological studies with statistically significant data found the effect they were looking for. A “significant” result is defined as any data point that would be produced by chance less than five per cent of the time. This ubiquitous test was invented in 1922 by the English mathematician Ronald Fisher, who picked five per cent as the boundary line, somewhat arbitrarily, because it made pencil and slide-rule calculations easier. Sterling saw that if ninety-seven per cent of psychology studies were proving their hypotheses, either psychologists were extraordinarily lucky or they published only the outcomes of successful experiments. In recent years, publication bias has mostly been seen as a problem for clinical trials, since pharmaceutical companies are less interested in publishing results that aren’t favorable. But it’s becoming increasingly clear that publication bias also produces major distortions in fields without large corporate incentives, such as psychology and ecology.

While publication bias almost certainly plays a role in the decline effect, it remains an incomplete explanation. For one thing, it fails to account for the initial prevalence of positive results among studies that never even get submitted to journals. It also fails to explain the experience of people like Schooler, who have been unable to replicate their initial data despite their best efforts. Richard Palmer, a biologist at the University of Alberta, who has studied the problems surrounding fluctuating asymmetry, suspects that an equally significant issue is the selective reporting of results—the data that scientists choose to document in the first place. Palmer’s most convincing evidence relies on a statistical tool known as a funnel graph. When a large number of studies have been done on a single subject, the data should follow a pattern: studies with a large sample size should all cluster around a common value—the true result—whereas those with a smaller sample size should exhibit a random scattering, since they’re subject to greater sampling error. This pattern gives the graph its name, since the distribution resembles a funnel.

The funnel graph visually captures the distortions of selective reporting. For instance, after Palmer plotted every study of fluctuating asymmetry, he noticed that the distribution of results with smaller sample sizes wasn’t random at all but instead skewed heavily toward positive results. Palmer has since documented a similar problem in several other contested subject areas. “Once I realized that selective reporting is everywhere in science, I got quite depressed,” Palmer told me. “As a researcher, you’re always aware that there might be some nonrandom patterns, but I had no idea how widespread it is.” In a recent review article, Palmer summarized the impact of selective reporting on his field: “We cannot escape the troubling conclusion that some—perhaps many—cherished generalities are at best exaggerated in their biological significance and at worst a collective illusion nurtured by strong a-priori beliefs often repeated.”

Palmer emphasizes that selective reporting is not the same as scientific fraud. XXXXXXX

Our beliefs are a form of blindness.

John Ioannidis, an epidemiologist at Stanford University, argues that such distortions are a serious issue in biomedical research. “These exaggerations are why the decline has become so common,” he says. “It’d be really great if the initial studies gave us an accurate summary of things. But they don’t. And so what happens is we waste a lot of money treating millions of patients and doing lots of follow-up studies on other themes based on results that are misleading.” In 2005, Ioannidis published an article in the Journal of the American Medical Association that looked at the forty-nine most cited clinical-research studies in three major medical journals. Forty-five of these studies reported positive results, suggesting that the intervention being tested was effective. Because most of these studies were randomized controlled trials—the “gold standard” of medical evidence—they tended to have a significant impact on clinical practice, and led to the spread of treatments such as hormone replacement therapy for menopausal women and daily low-dose aspirin to prevent heart attacks and strokes. Nevertheless, the data Ioannidis found were disturbing: of the thirty-four claims that had been subject to replication, forty-one per cent had either been directly contradicted or had their effect sizes significantly downgraded.

The situation is even worse when a subject is fashionable. In recent years, for instance, there have been hundreds of studies on the various genes that control the differences in disease risk between men and women. These findings have included everything from the mutations responsible for the increased risk of schizophrenia to the genes underlying hypertension. Ioannidis and his colleagues looked at four hundred and thirty-two of these claims. They quickly discovered that the vast majority had serious flaws. But the most troubling fact emerged when he looked at the test of replication: out of four hundred and thirty-two claims, only a single one was consistently replicable. “This doesn’t mean that none of these claims will turn out to be true,” he says. “But, given that most of them were done badly, I wouldn’t hold my breath.”

According to Ioannidis, the main problem is that too many researchers engage in what he calls “significance chasing,” or finding ways to interpret the data so that it passes the statistical test of significance—the ninety-five-per-cent boundary invented by Ronald Fisher. “The scientists are so eager to pass this magical test that they start playing around with the numbers, trying to find anything that seems worthy,” Ioannidis says. In recent years, Ioannidis has become increasingly blunt about the pervasiveness of the problem. One of his most cited papers has a deliberately provocative title: “Why Most Published Research Findings Are False.”

The problem of selective reporting is rooted in a fundamental cognitive flaw, which is that we like proving ourselves right and hate being wrong. “It feels good to validate a hypothesis,” Ioannidis said. “It feels even better when you’ve got a financial interest in the idea or your career depends upon it. And that’s why, even after a claim has been systematically disproven”—he cites, for instance, the early work on hormone replacement therapy, or claims involving various vitamins—“you still see some stubborn researchers citing the first few studies that show a strong effect. They really want to believe that it’s true.”

That’s why Schooler argues that scientists need to become more rigorous about data collection before they publish. XXXXX

 

Read full New Yorker article at: http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer