A Deep Dive Into the Blockbuster Study That Called Into Doubt a Lot of Psych Research


Yesterday, Science published a blockbuster article about the state of, well, science. Since 2011, a group called the Open Science Collaboration, headed by the psychologist Brian Nosek, has been working to replicate 100 studies previously published in leading psychology journals — that is, to conduct these experiments again to see whether the same results would pop up. (The researchers set out to replicate even more, but for various reasons the final number was culled to 100.)

What they found was rather alarming: More than half of the studies the researchers attempted to redo failed to produce the same results, and even among the ones that did replicate, the findings were significantly less impressive than what was published the first time around. Some findings that had garnered widespread attention and sexy media headlines, including one on how women’s menstrual cycles affect their mate preferences and another on how a belief in free will influences people’s ethical decisions, failed to replicate (more on these in a bit).

To understand why all this matters — and what it says about the exciting new science findings flashing into your news feeds seemingly every day — requires a bit of background. Here’s an explainer.

I’ve been hearing a lot about this “replication crisis” lately. Can you explain what the deal is in a non-nerdy way?

Basically, there’s an increasing awareness that many of the scientific findings we’ve accepted as “true” may be less sturdy than everyone thinks. It turns out that there are many ways that researchers, claiming to have found a given effect (“effect” just means that X does something to Y — ibuprofen reduces headache pain, for example), may have overstated or fabricated the effect they’re reporting.

Overstated effects? Fabricated ones? Sounds like there’s massive fraud afoot!

No one really thinks that. While instances of true nefariousness such as all-out data fabrication do occur, there’s a general consensus that it’s pretty rare (for one thing, most people aren’t that unethical; for another, if you’re caught, that’s it for your scientific career). What’s more common is that subtle forms of bias and pernicious incentives creep into the scientific process.

The short version is that the more cool results you publish as a scientist, the more likely you are to make a name for yourself, and this leads to problems. So, sometimes researchers might run a bunch of experiments and only report the ones that produce cool stuff. Other times, they might hypothesize an effect before they run their experiment, discover another, conflicting one when they run the experiment, and then rewrite what they were looking for so it looks like they had a coherent theory the whole time. Yet other times, they might use p-hacking to nudge a given finding from statistical insignificance to statistical significance.

P-hacking?

Yeah, this one’s important enough to warrant a quick (nontechnical) statistical segue. To oversimplify a bit, let’s say I want to figure out whether a coin is fair — that is, whether it’s equally likely it’ll come up heads (H) or tails (T). I toss it 100,000 times and record the results. Now, here and there, because of how probability works, I’ll get a sequence like HHHHHHHHH or TTTTTTTTTTTTT. Taken out of context, a sequence like this looks suspicious. Within the context of 100,000 tosses, it’s pretty meh.
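To see just how meh those streaks are, here’s a minimal Python sketch (my own illustration, not anything from the OSC paper) that simulates 100,000 fair tosses and checks whether a streak of nine or more identical outcomes shows up; the toss count and streak length are just the numbers from the example above.

```python
import random

def longest_streak(flips):
    """Length of the longest run of identical outcomes in a list of tosses."""
    best = current = 1
    for prev, curr in zip(flips, flips[1:]):
        current = current + 1 if curr == prev else 1
        best = max(best, current)
    return best

random.seed(0)  # fixed seed so the illustration is reproducible
trials, n_tosses = 100, 100_000

# How often does a perfectly fair coin produce a streak of 9 or more
# heads (or tails) somewhere in 100,000 tosses?
hits = sum(
    longest_streak([random.random() < 0.5 for _ in range(n_tosses)]) >= 9
    for _ in range(trials)
)
print(f"A streak of 9+ showed up in {hits} of {trials} simulated runs")
# In practice it shows up in essentially every run: long streaks are
# exactly what a fair coin is supposed to do over 100,000 tosses.
```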

What sometimes happens in science is that researchers, consciously or unconsciously (unconsciously sounds like a cop-out, but unconscious bias is as big a problem in science as it is in any other human endeavor), will select which of their data to use in a way that overblows the WHOA-ness of their findings. For example, if I didn’t publish the results of all 100,000 coin tosses, but rather picked and chose from the data in a way that highlighted those like the HHHHHHHHH and TTTTTTTTTTTTT sequences of tosses, while downplaying more normal-looking ones, I’d be (again, oversimplifying a bit) engaging in a form of p-hacking.
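And here’s a companion sketch of the cherry-picking itself, under the same made-up coin-toss setup (again, an illustration of the general idea, not the OSC’s analysis): test all 100,000 tosses honestly and the p-value is unremarkable, but quietly report only the most heads-heavy 200-toss stretch and the very same fair coin suddenly looks “significantly” biased.

```python
import math
import random

def two_sided_p(heads, n):
    """Approximate two-sided p-value for 'this coin is fair',
    via the normal approximation to the binomial."""
    z = (heads - 0.5 * n) / math.sqrt(0.25 * n)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)  # fixed seed so the illustration is reproducible
flips = [random.random() < 0.5 for _ in range(100_000)]  # True = heads

# Honest analysis: use every toss you collected.
p_honest = two_sided_p(sum(flips), len(flips))

# "Hacked" analysis: slide a 200-toss window across the data, keep only
# the window with the most heads, and report its p-value as if that
# slice were the whole experiment.
window = 200
heads_in_window = sum(flips[:window])
most_heads = heads_in_window
for i in range(window, len(flips)):
    heads_in_window += flips[i] - flips[i - window]
    most_heads = max(most_heads, heads_in_window)
p_hacked = two_sided_p(most_heads, window)

print(f"All 100,000 tosses:            p = {p_honest:.3f}")
print(f"Most extreme 200-toss stretch: p = {p_hacked:.6f}")
# The honest p-value is usually unremarkable; the cherry-picked one looks
# like strong evidence of a loaded coin, even though the coin is fair.
```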

It sounds like there are a lot of ways researchers might publish shady findings. How do you fix this problem? Is that where replication comes in?

The only surefire way to gain confidence in a given finding is to test it over and over. Throughout science, the findings that have hardened into “facts” (nothing is really ever a “fact,” in science, since the scientific method leaves open the possibility that something newer and better will come along and improve upon a given idea that everyone had accepted) have been tested so frequently, from so many different angles, that there’s very little remaining doubt that the basics, at least, are true. Evolution and Einstein’s findings about relativity, for example, have been so thoroughly tested that it’s very hard to come up with a story in which they’re fundamentally wrong.

But many other sorts of claims haven’t actually endured this much rigor. They haven’t been tested over and over and over. Part of the problem is that in science, especially if you’re a young researcher, you get way more points for discovering a new thing than for checking to make sure an old thing holds up; as a result, the practice of replicating old findings has been slow to catch on. And when old findings don’t get replicated, they just sit there, treated like they’re “true,” even though they shouldn’t be. Plus, whatever initial burst of publicity they get (more on this in a bit) is seen as the last word; these studies don’t get the skeptical treatment any scientific claim should be met with.

So this Open Science Collaboration paper is the first big effort of its kind to reproduce old psychology findings in a systematic way?

Yes. That’s why it’s a big deal. Now, in the past there have been statistical estimates of how many studies are likely to fail to hold up. Back in 2005, for example, John P. A. Ioannidis of Stanford Medical School did some statistical noodling to argue that more than half of all research claims — across all fields — are false. Actually reproducing a study, though, is difficult. To do it right, you need to gather a lot of information about how the original study was run — information that’s not always readily available, and which sometimes entails peppering the authors of the original with a bunch of questions. Then you need to actually conduct the study again, analyze the results, and so on. It’s time-consuming.

And Nosek’s team did all that?

Yes, and they did it in a particularly transparent way. The authors write:

The replication protocol articulated the process of selecting the study and key effect from the available articles, contacting the original authors for study materials, preparing a study protocol and analysis plan, obtaining review of the protocol by the original authors and other members within the present project, registering the protocol publicly, conducting the replication, writing the final report, and auditing the process and analysis for quality control. Project coordinators facilitated each step of the process and maintained the protocol and project resources. Replication materials and data were required to be archived publicly in order to maximize transparency, accountability, and reproducibility of the project.

There’s a reason this whole thing took years to complete, and why the paper has dozens of co-authors and volunteers listed.

Which papers did the researchers attempt to reproduce?

To keep things somewhat self-contained, they took a big handful of papers from three leading journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition. Excluding the replications that had not concluded in time for the paper’s publication, 100 replications were conducted in total.

Let’s get to the good stuff: What did they find?

Of the original 100 studies, almost all of them (97) found a statistically significant result. Only 36 percent of the replications did, though. Moreover, the replications detected effect sizes only about half as strong, on average, as those detected in the original papers. Overall, just 39 percent of the originally published effects “were subjectively rated [by Nosek’s team] to have replicated the original result.”

That sounds bad.

In one sense, it clearly is: That is an alarming number of failed replications given that these are high-quality journals. What would the numbers look like if these papers had been pulled from the many subpar journals that litter the research landscape? Plus, remember that the replicators went out of their way to ape the original procedures. If they’d been able to make judgment calls about the original methods — that is, to say, “Well, the way they originally did this is likely to produce false-positive results, so we should tweak the method” — there’s a chance the results would be even more damning. So it’s hard to read this paper and not feel like psychological science is failing, on a fundamental level, to publish robust results that won’t fall apart with just a bit of diligent prodding.

On the other hand, this conversation has been growing in volume for a while. It’s part of the reason things like the Center for Open Science exist. Practices like p-hacking are open secrets in scientific communities, and just about everyone knows there’s a need for reform. There are some signs of progress: Just back in June, the COS announced big, comprehensive guidelines for conducting good, reproducible research that many leading journals seem to be taking seriously (though it’s too early to know whether they’ll put their money where their mouth is).

Crucially, this latest paper offers a step-by-step explanation of what a big replication effort looks like. Hopefully it’ll lend some momentum to make replication — and transparency and openness in general — an entrenched part of the scientific process. So while these results are bleak, there are many tweaks that can be made to the scientific method itself to ensure that a version of this study that runs 100 years from now comes up with a much higher rate of successful replication.

I’m a cranky person who likes to blame the media for everything. What’s my angle here?

Even just during the years it took to complete this replication effort, it seems like there’s been an explosion in the number of outlets covering science (including this one). Science garners clicks, to put it bluntly — and particularly so when there are big, exciting, sexy findings to report.

But the process by which science gets reported is a bit messed up, from the bottom up. We’ve already seen the ways individual researchers are incentivized to cut corners a little. But research institutions, mostly universities, also have troubling incentives — they want to get themselves and their researchers noticed. Hence, the awful press releases.

The awful press releases?

Yes, the awful press releases. Every day, countless universities send countless press releases hyping new findings to journalists around the country. It’s shocking how often these press releases overhype the findings — there have been times when I’ve been sent a press release stating X, only to open up the actual study itself and find no evidence for that claim. To take one recent example, multiple outlets recently reported that new research suggests Tetris can help curb addictions, but really, when you read the actual paper in question, there’s no reason to think that.

But many journalists lack training in evaluating scientific claims, and many of them face frantic pressure to produce a boatload of content. If you need to get something up on your website, and you have in front of you a press release from a legitimate university claiming an exciting finding, why wouldn’t you just run with it? So you have researchers pressured to finesse findings to make them look stronger than they are, press folks pressured to promote these shaky findings, and journalists pressured to publish stories about them. It’s not a good brew, and even if it wasn’t a direct part of what the OSC team studied, it’s part of the story here — as a researcher, it’s easier than ever to get your sexy-but-shoddy result noticed by media outlets.

And some of the papers the Open Science Collaboration failed to replicate had gotten a bit of media attention? Are they examples of this bad “brew”?

Yes and yes. The Times highlights some of the papers that failed to fully replicate:

Among them was one on free will. It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test … [Another] was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so.

The second time around, researchers found “weaker effects,” as the Times put it, in both studies. That doesn’t mean they’re false, of course — but it also casts the confident way in which the original findings were reported in a new light. (Nature had the snazzy headline “Fertile wives find single men sexy.” The Association for Psychological Science headed its press release, in part: “Destined to Cheat? New research finds free will can keep us honest.”) It leads to natural questions about what differently structured experiments about these same phenomena would find. Now, it’s understandable why these findings got attention. Both of them just feel right and satisfying, and they cover the sort of territory — sex and bad behavior — that people love to read about. Plus, maybe, in the long run, further studies on these subjects will find that these claims do in fact hold up.

But the point is that a single finding should never be taken as evidence, full-stop, of a given phenomenon — especially findings that make us smile or nod. Everyone is falling into that trap too often. There are reasons other than scientific worthwhileness that some studies find their way into our Facebook feeds, while others languish in obscurity or are never published at all. 

(Update: Aaron Bornstein, a neuroscience postdoc at Princeton, pointed out to me via Twitter that there’s another way to interpret the fact that the researchers only looked at highly regarded journals. These journals, he said, are exactly the ones a researcher will submit their sexiest, most attention-getting findings to. So that could actually mean that if you examined less-prestigious journals, you’d get a higher rate of successful replication. But there’s no way to know without actually trying.)
