Psychology’s Favorite Tool for Measuring Racism Isn’t Up to the Job

Perhaps no new concept from the world of academic psychology has taken hold of the public imagination more quickly and profoundly in the 21st century than implicit bias — that is, forms of bias which operate beyond the conscious awareness of individuals. That’s in large part due to the blockbuster success of the so-called implicit association test, which purports to offer a quick, easy way to measure how implicitly biased individual people are. When Hillary Clinton famously mentioned implicit bias during her first debate with Donald Trump, many people knew what she was talking about because the IAT has spread the concept so far and wide. It’s not a stretch to say that the IAT is one of the most famous psychological instruments created in recent history, and that it has been the subject of more recent fascination and acclaim than just about anything else to come out of the field of social psychology.

Since the IAT was first introduced almost 20 years ago, its architects, as well as the countless researchers and commentators who have enthusiastically embraced it, have offered it as a way to reveal to test-takers what amounts to a deep, dark secret about who they are: They may not feel racist, but in fact, the test shows that in a variety of intergroup settings, they will act racist. This notion, and the data surrounding it, have fed into a very neat narrative explaining bias and racial justice in modern America. Sure, explicit measures of racism have been in decline for a while in the United States. It’s less socially acceptable than ever to say that black people and white people shouldn’t get married, or that black people are less intelligent than white people (though, to be sure, a solid minority of Americans still endorses such views). And yet, more than a half-century after the end of Jim Crow, all sorts of racial discrepancies persist: On average, darker-skinned people have less access to solid education, housing, and health care than lighter-skinned ones, and face various other forms of discrimination. The IAT suggests that, having addressed many of the most outrageous and explicit forms of public discrimination, our progress toward genuine racial equality may be continually stalled or undone by implicit bias.

That is, many IAT proponents argue that if people who don’t feel like they discriminate do, in fact, discriminate, that could explain those disparate outcomes. Maybe some white cops who claim racial empathy are still, deep down, more likely to pull the trigger in an ambiguous situation involving a black suspect than a white one. Maybe white real-estate agents who are proud Obama voters conjure up thin excuses — excuses that feel legitimate to them — to avoid renting nice units to black families. And the data produced by the IAT suggests that a solid majority of Americans hold implicit biases against marginalized groups — which means they are likely to commit acts of implicit bias against these groups. Most white people in America, for example, hold what is frequently called anti-black implicit bias, and a significant minority of African-Americans do, too. All this implicit bias could be having a dire impact on society: As the co-creators of the test have written, “given the relatively small proportion of people who are overtly prejudiced and how clearly it is established that automatic race preference [as measured by the IAT] predicts discrimination, it is reasonable to conclude not only that implicit bias is a cause of Black disadvantage but also that it plausibly plays a greater role than does explicit bias in explaining the discrimination that contributes to Black disadvantage.”

Those co-creators are Mahzarin Banaji, currently the chair of Harvard University’s psychology department, and Anthony Greenwald, a highly regarded social psychology researcher at the University of Washington. The duo introduced the test to the world at a 1998 press conference in Seattle — the accompanying press release noted that they had collected data suggesting that 90–95 percent of Americans harbored the “roots of unconscious prejudice.” The public immediately took notice: Since then, the IAT has been mostly treated as a revolutionary, revelatory piece of technology, garnering overwhelmingly positive media coverage. In addition to countless writeups in newspapers and magazines, the test was covered with something like awe in Malcolm Gladwell’s best-selling Blink: The Power of Thinking Without Thinking — “The IAT is more than just an abstract measure of attitudes,” he wrote. “It’s a powerful predictor of how we act in certain kinds of spontaneous situations.” — and in NPR social-science correspondent Shankar Vedantam’s The Hidden Brain. New York Times columnist Nicholas Kristof is a big fan and mentions the test regularly. “It’s sobering to discover that whatever you believe intellectually, you’re biased about race, gender, age or disability,” he wrote in 2015. A 2001 article in the American Psychological Society’s Bulletin magazine described the IAT as “A Revolution in Social Psychology.” The IAT wasn’t the first tool researchers developed to measure implicit bias, which was a subject social psychologists had been interested in for a while, but it was the first to fully catch on and launch the concept into the mainstream.

That’s in part because Banaji, Greenwald, and the other architects and proponents have touted the test’s importance and potential for addressing racial discrimination ever since it was introduced. Along the way, they’ve made some very bold, attention-getting claims about what the IAT can do, such as in this paragraph from the introduction to Banaji and Greenwald’s 2013 book Blindspot: Hidden Biases of Good People:

[T]he automatic White preference expressed on the Race IAT is now established as signaling discriminatory behavior. It predicts discriminatory behavior even among research participants who earnestly (and, we believe, honestly) espouse egalitarian beliefs. That last statement may sound like a self-contradiction, but it’s an empirical truth. Among research participants who describe themselves as racially egalitarian, the Race IAT has been shown, reliably and repeatedly, to predict discriminatory behavior that was observed in the research.

Maybe the biggest driver of the IAT’s popularity and visibility, though, is the fact that anyone can take the test on the Project Implicit website, which launched shortly after the test was unveiled and which is hosted by Harvard University. The test’s architects reported that, by October 2015, more than 17 million individual test sessions had been completed on the website. As will become clear, learning one’s IAT results is, for many people, a very big deal that changes how they view themselves and their place in the world.

Given all this excitement, it might feel safe to assume that the IAT really does measure people’s propensity to commit real-world acts of implicit bias against marginalized groups, and that it does so in a dependable, clearly understood way. After all, the test is hosted by Harvard, endorsed and frequently written about by some of the top social psychologists and science journalists in the country, and is currently seen by many as the most sophisticated way to talk about the complicated, fraught subject of race in America.

Unfortunately, none of that is true. A pile of scholarly work, some of it published in top psychology journals and most of it ignored by the media, suggests that the IAT falls far short of the quality-control standards normally expected of psychological instruments. The IAT, this research suggests, is a noisy, unreliable measure that correlates far too weakly with any real-world outcomes to be used to predict individuals’ behavior — even the test’s creators have now admitted as much. The history of the test suggests it was released to the public and excitedly publicized long before it had been fully validated in the rigorous, careful way normally demanded by the field of psychology. In fact, there’s a case to be made that Harvard shouldn’t be administering the test in its current form, in light of its shortcomings and its potential to mislead people about their own biases. There’s also a case to be made that the IAT went viral not for solid scientific reasons, but simply because it tells us such a simple, pat story about how racism works and can be fixed: that deep down, we’re all a little — or a lot — racist, and that if we measure and study this individual-level racism enough, progress toward equality will ensue.

***

Before we get ahead of ourselves, it’s important to understand where implicit association tests dealing with race and ethnicity fit into the broader IAT universe. Ever since the IAT was first introduced, there have been many “flavors” of it. Some IATs claim to measure people’s implicit bias for one brand over another (such tests are used for corporate market research), others to measure implicit bias for skinny people over overweight people, yet others implicit bias for young people over older ones, and so on.

For the purposes of this article, though, “the IAT,” unless otherwise specified, refers specifically to the broad class of IATs dealing with race or ethnicity. The race IAT occupies a special place in the broader constellation of implicit-bias research: Many of the researchers who have focused on the shortcomings of the race IAT do so because it appears to be far and away the most famous, frequently administered, and societally relevant flavor of the test. This is especially true in the United States, given our racial history, and it’s not an accident that when Banaji and Greenwald introduced the IAT, they focused on black/white race relations. (Elsewhere, different types of race relations have attracted more attention — German researchers, for example, have used the IAT to test how ethnic Germans respond to German-sounding names versus Turkish-sounding ones, since Turks are the largest ethnic minority group in Germany and face various challenges to fair treatment and full integration into German society.) It’s important, though, to recognize that not every critique of the race IAT applies to other varieties of the test.

This is also probably as good a place as any to introduce some names that are going to pop up again and again and again. Since the IAT was first unveiled, a huge number of researchers have published studies about the test — it’s an area of research that has absolutely exploded, as a Google Scholar search reveals. But the academic controversy over the rigorousness of the IAT has been fought most fiercely, albeit not exclusively, between a core group of IAT proponents and a core group of critics. The proponents include Banaji and Greenwald, the creators of the test, Brian Nosek (who has also carved out his own niche as a leading open-science advocate at the University of Virginia), and John Jost of New York University. The critics include Wharton School professor Philip Tetlock, a big name in behavioral science best known for studying why some people are better at making predictions than others, Hart Blanton of the University of Connecticut (a methods expert who is currently a visiting scholar at the University of Texas at Austin), Gregory Mitchell of the UVA school of law, Fred Oswald of Rice University, Hal Arkes of Ohio State University, and James Jaccard of NYU.

Finally, it’s important to separate out two different questions: whether the IAT accurately measures implicit bias, and how large a role implicit bias, as compared to other factors, plays in generating discriminatory outcomes. There’s been a great deal of confusion on this front, and as we’ll see, this confusion has damaged not only the public’s understanding of racism, but also social psychology’s ability to study it productively.

Those preambles out of the way, here’s how a typical race IAT works: You sit down at a computer where you are shown a series of images and/or words. First, you’re instructed to hit ‘i’ when you see a “good” term like pleasant, or to hit ‘e’ when you see a “bad” one like tragedy. Then, hit ‘i’ when you see a black face, and hit ‘e’ when you see a white one. Easy enough, but soon things get slightly more complex: Hit ‘i’ when you see a good word or an image of a black person, and ‘e’ when you see a bad word or an image of a white person. Then the categories flip to black/bad and white/good. As you peck away at the keyboard, the computer measures your reaction times, which it plugs into an algorithm. That algorithm, in turn, generates your score.

If you were quicker to associate good words with white faces than good words with black faces, and/or slower to associate bad words with white faces than bad words with black ones, then the test will report that you have a slight, moderate, or strong “preference for white faces over black faces,” or some similar language. You might also find you have an anti-white bias, though that is significantly less common. By the normal scoring conventions of the test, positive scores indicate bias against the out-group, while negative ones indicate bias against the in-group.
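
To make the mechanics concrete, here is a minimal sketch, in Python, of the scoring logic just described. This is not Project Implicit’s actual algorithm (which has changed over time, as discussed below); the reaction times are invented, and the feedback cutoffs are illustrative.

```python
# A minimal sketch of the scoring logic described above. NOT Project
# Implicit's actual algorithm; all numbers are invented for illustration.
import statistics

def iat_score(white_good_rts, black_good_rts):
    """Standardized difference between mean reaction times (in ms) on
    the white+good pairing block and the black+good pairing block."""
    mean_diff = statistics.mean(black_good_rts) - statistics.mean(white_good_rts)
    pooled_sd = statistics.stdev(white_good_rts + black_good_rts)
    return mean_diff / pooled_sd  # positive: slower when black pairs with good

def feedback(score):
    """Map a score onto the kind of verbal label test-takers receive.
    The cutoffs here are illustrative."""
    if score < 0.15:
        return "little to no preference"
    if score < 0.35:
        return "slight preference for white faces"
    if score < 0.65:
        return "moderate preference for white faces"
    return "strong preference for white faces"

# A test-taker who is ~100ms slower when 'black' shares a key with 'good':
white_good = [620, 650, 600, 640, 610]
black_good = [720, 750, 700, 740, 710]
print(feedback(iat_score(white_good, black_good)))  # "strong preference..."
```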

The rough idea is that, as humans, we have an easier time connecting concepts that are already tightly linked in our brains, and a tougher time connecting concepts that aren’t. The longer it takes to connect “black” and “good” relative to “white” and “good,” the thinking goes, the more your unconscious biases favor white people over black people. If you take the IAT in a diversity-training session, all this will likely be explained to you after you receive your score. You’ll probably also be told just how common implicit biases against minority groups are — perhaps the oft-cited figure that 75 percent of test-takers show an implicit preference for white people over black people will be invoked — and how they can worm their way into our everyday interactions with members of other races and all sorts of decision-making processes, helping to fuel discriminatory outcomes.

There’s a certain intuitive appeal to this story. But here’s where science taps us on the shoulder and reminds us that intuition isn’t enough. Here’s where science asks whether it’s fair to say that an average difference in reaction time on the order of a couple hundred milliseconds constitutes evidence of “implicit bias” against a certain group. This has been a source of major confusion — ever since the IAT was introduced, a great deal of media coverage has suggested that getting a high IAT score means one is implicitly biased against a minority group. That isn’t necessarily true, though, because all the IAT measures on its own is reaction times to different stimuli. Implicit bias, as the term is generally understood, is a psychological phenomenon that causes someone to act in a discriminatory manner in real-world settings. The IAT claims to be measuring implicit bias, but that’s just a claim — there’s no reason to automatically think it’s a true one. What if someone who scores high on the IAT never acts in a biased manner? Can a bias be a bias if it only exists in the context of a very specific test result, but never bubbles out into the real world? If “implicit bias” were defined as “the state of having received a high score on an IAT test,” no one would care about implicit bias, or about the IAT. But since the IAT was first introduced, it has been claimed, over and over, that it does accurately measure implicit bias in a way that is relevant to real-world behavior.

The only way to evaluate how accurately the IAT actually measures the sort of implicit bias everyone cares about, then, is to look to the research that has been published on the test. But before doing so it’s important to zoom out a bit to the broader question of how psychologists prove a given instrument, whether one developed to measure depression, narcissism, or anything else, is accurate enough to be useful for real-world purposes. There’s an entire field of psychology, psychometrics, dedicated to the creation and validation of psychological instruments, and instruments are judged based on whether they exceed certain broadly agreed-upon statistical benchmarks. The most important benchmarks pertain to a test’s reliability — that is, the extent to which the test has a reasonably low amount of measurement error (every test has some) — and to its validity, or the extent to which it is measuring what it claims to be measuring. A good psychological instrument needs both.

The IAT, it turns out, has serious issues on both the reliability and validity fronts, which is surprising given its popularity and the very exciting claims that have been made about its potential to address racism. That’s what the research says, at least, and it raises serious questions about how the IAT became such a social-science darling in the first place.

Take the concept of test-retest reliability, which measures the extent to which a given instrument will produce similar results if you take it, wait a bit, and then take it again. Different instruments have different test-retest reliabilities. A tape measure has high test-retest reliability because if you measure someone’s height, wait two weeks, and measure it again, you’ll get very similar results. The measurement procedure of grabbing an ice cube from your freezer and seeing how many ice cubes tall your friend is would have much lower test-retest reliability, because different ice cubes might be of different sizes; it’s easier to make errors when counting how many ice cubes tall your friend is; and so forth.

This is a bedrock psychometric feature of many psychological instruments; test-retest reliability is often one of the first things a psychologist will look for when deciding whether to use a given tool. That’s particularly true if it’s the sort of test that is designed to provide important information about someone based on a single test-taking session. If a depression test, for example, has the tendency to tell people they’re severely depressed and at risk of suicidal ideation on Monday, but essentially free of depression on Tuesday, that’s not a useful test. It’s safe to say, based on how the IAT is used and marketed, that most lay people who are familiar with the test imagine that it provides useful information based on a single session.

Test-retest reliability is expressed with a variable known as r, which ranges from 0 to 1. To gloss over some of the gory statistical details, r = 1 means that if a given test is administered multiple times to the same group of people, it will rank them in exactly the same order every time. Hypothetically, if the IAT had a test-retest reliability of r = 1, and you administered the test to ten people over and over and over, they’d be placed in the same order, least to most implicitly biased, every time. At the other end of the spectrum, when r = 0, that means the ranking shifts every time the test is administered, completely at random. The person ranked most biased after the first test would, after the second test, be equally likely to appear in any of the ten available slots. Overall, the closer you get to r = 0, the closer the instrument in question is to, in effect, a random-number generator rather than a remotely useful means of measuring whatever it is you’re trying to measure.
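
A quick simulation makes the idea tangible. In the sketch below (with all numbers invented for illustration), each person has a stable “true score,” and each test session adds independent random noise on top of it; the more noise, the lower the test-retest correlation between two sessions.

```python
# A toy simulation of test-retest reliability, assuming each session's
# score is a stable "true score" plus independent measurement noise.
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(0, 1, 10_000)  # one stable value per person

def one_session(noise_sd):
    return true_scores + rng.normal(0, noise_sd, true_scores.size)

for noise_sd in (0.1, 1.0, 3.0):
    r = np.corrcoef(one_session(noise_sd), one_session(noise_sd))[0, 1]
    print(f"noise sd = {noise_sd}: test-retest r ~ {r:.2f}")

# Output (approximately): r ~ 0.99, 0.50, 0.10. Little noise keeps the
# ranking stable between sessions; heavy noise reshuffles it almost randomly.
```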

What constitutes an acceptable level of test-retest reliability? It depends a lot on context, but, generally speaking, researchers are comfortable if a given instrument hits r = .8 or so. The IAT’s architects have reported that overall, when you lump together the IAT’s many different varieties, from race to disability to gender, it has a test-retest reliability of about r = .55. By the normal standards of psychology, this puts these IATs well below the threshold of being useful in most practical, real-world settings.

But what about the race IAT in particular? After all, it is a very specific, very important category of IAT used in many educational and diversity-training contexts — not to mention the fact that millions of people have taken it on Project Implicit. Surprisingly, there’s a serious dearth of published information on test-retest reliability of the race IAT specifically. “It’s kind of odd that after almost two decades, the researchers promoting this measure have not so much as posted test-retest reliability of commonly used IATs via their many web resource pages, much less in publication,” said Hart Blanton in an email. “Look up common clinical, organizational and educational assessment tools and you will typically find such information readily available to the consumer.” Greenwald acknowledged in an email of his own that “no one has yet undertaken a study of” the race IAT’s test-retest reliability, though, as he pointed out, “no one” includes both the test’s proponents and its critics.

The individual results that have been published, though, suggest the race IAT’s test-retest reliability is far too low for it to be safe to use in real-world settings. In a 2007 chapter on the IAT, for example, Kristin Lane, Banaji, Nosek, and Greenwald included a table (Table 3.2) running down the test-retest reliabilities for the race IAT that had been published to that point: r = .32 in a study consisting of four race IAT sessions conducted with two weeks between each; r = .65 in a study in which two tests were conducted 24 hours apart; and r = .39 in a study in which the two tests were conducted during the same session (but in which one used names and the other used pictures). In 2014, using a large sample, Yoav Bar-Anan and Nosek reported a race IAT test-retest reliability of r = .4 (Table 2). Calvin Lai, a postdoctoral fellow at Harvard who is the director of research at Project Implicit, ran the numbers from some of his own data, and came up with similar results. “If I had to estimate for immediate test-retest now, it would be r ~= .35,” he wrote in an email. “If it was over longer time periods, I would revise my estimate downward although I’m uncertain about how much.” (In emails, Greenwald argued that Lai’s figures should be adjusted upward using the so-called Spearman-Brown formula to account for the fact that they stemmed from IATs that weren’t full-length, but Blanton strongly pushed back on that claim. I emailed a few statisticians asking them to arbitrate the dispute and basically got a hung jury.) (Update: Lai emailed me after this article went up and said that in light of research published since he provided me with the original estimate, he’d now estimate the true value to be in the neighborhood of r = .42.)
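
For the curious, the Spearman-Brown formula at the center of that dispute is simple: it predicts the reliability of a test lengthened by a factor k, given the reliability r of the shorter version. The numbers in the sketch below are invented to show the shape of Greenwald’s argument, not the actual figures the two sides were arguing over.

```python
# The Spearman-Brown formula predicts the reliability of a test
# lengthened by a factor k from the reliability r of the shorter test.
# Illustrative numbers only -- not the actual figures in dispute.

def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

# If a half-length IAT showed test-retest r = .35, a full-length version
# (k = 2) would be predicted to come in around r = .52:
print(round(spearman_brown(0.35, 2), 2))
```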

What all these numbers mean is that there doesn’t appear to be any published evidence that the race IAT has test-retest reliability that is close to acceptable for real-world evaluation. If you take the test today, and then take it again tomorrow — or even in just a few hours — there’s a solid chance you’ll get a very different result. That’s extremely problematic given that in the wild, whether on Project Implicit or in diversity-training sessions, test-takers are administered the test once, given their results, and then told what those results say about them and their propensity to commit biased acts. (It should be said that there are still certain consistent patterns: Most white people, for example, score positively on the black-white IAT, supposedly signaling the presence of anti-black implicit bias.)

As for validity, over and over the IAT’s proponents have made confident statements about the test’s ability to predict behavior. In the quote from Blindspot excerpted above, for example, Banaji and Greenwald explicitly claim that the test does a better job predicting behavior than explicit measures like feeling thermometers, in which people numerically “rate” their feelings toward different groups — an idea echoed on the IAT’s FAQ page. This is an absolutely crucial claim, and much of the IAT’s cultural and academic gravitas flows directly out of it. If the IAT can’t predict discriminatory behavior, and can’t do so more accurately than explicit measures, then it’s a lot less useful and interesting than its proponents have made it out to be. A major conceit of the test, after all, is that it reveals hidden biases that can pop up in people who explicitly renounce discriminatory beliefs or intent.

Anthony Greenwald and Mahzarin Banaji, co-creators of the implicit association test Photo: Harvard University News Office/Delacorte Press

In statistical terms, the architects of the IAT claimed, for a long time, that there is a meaningful correlation between two variables: someone’s IAT score (call it x) and how much implicit bias they exhibit in intergroup settings (call it y). Generally speaking, researchers measure the extent to which two variables are correlated by examining how much of the variation in one variable, y, is explained by changes in the other, x. The more two variables are correlated in this manner, the more meaningful a connection might exist between them.

Might is key here: As the Stats 101 saying goes, “Correlation does not imply causation.” But if you have a clear hypothesis about how two things connect to one another — in this case, The higher someone scores on the IAT (x), the higher the likelihood they will act in a discriminatory manner (y) — the first step toward proving that hypothesis is to establish that there is, in fact, a meaningful correlation between x and y. If you have a data set of IAT scores and observed behavior, and you find that as you move from lower to higher IAT scores those scores correspond with a steadily increasing prevalence of discriminatory behavior, that means that IAT scores explain a good amount of the variance in discriminatory behavior, in the parlance of statistics. The more variance in behavior the IAT explains, the more it can credibly claim to be measuring implicit bias — and the more validity it has.
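
The arithmetic here is worth seeing once: variance explained is simply the square of the correlation coefficient, which is why respectable-sounding correlations can translate into small amounts of explained variance. The r values in the sketch below are chosen for illustration; r = .235 corresponds to the “about 5.5 percent” figure that comes up later.

```python
# Variance explained is the square of the correlation coefficient.
# Values chosen for illustration; r = .235 corresponds to the
# "about 5.5 percent" figure discussed below.
for r in (0.235, 0.55, 0.80):
    print(f"r = {r}: explains {r ** 2:.1%} of the variance in behavior")
```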

It’s not easy to measure discriminatory behavior in a lab setting, but researchers have come up with a variety of methods for doing so. You can see whether a white participant interacts differently with a white as compared to a black experimenter, for example. Or you can have them complete a simulated hiring task in which they choose whom to give an “interview” by comparing résumés that are effectively identical, except that some have stereotypically white names and others have stereotypically black names. These methods aren’t perfect, and many of them have been critiqued on various grounds — particularly those dealing with “microbehaviors” like blinking or body language, where the connection to real-world outcomes is a lot more questionable than it is for experiments dealing with simulated hiring tasks and so forth — but that’s the toolbox social psychologists are working with.

IAT researchers have published a bunch of studies that the test’s main proponents claim show reasonably strong links between IAT scores and behavior on both race and other issues, and in my correspondence with Banaji and the test’s other core group of architects and proponents, they sent me some of them. If you look around, you can find IAT papers correlating IAT scores with all sorts of important behavioral outcomes. Some of those papers are pretty interesting, like a recently published National Bureau of Economic Research one that purports to show that “When minority cashiers [in a French grocery chain], but not majority cashiers, are scheduled to work with managers who are biased (as determined by an Implicit Association Test), they are absent more often, spend less time at work, scan items more slowly, and take more time between customers.” Tetlock, for his part, made a point of emphasizing that despite being a staunch critic of the race IAT, he finds some of the research linking other IATs to behavior to be interesting. For instance, he said he was impressed by work published by Matthew K. Nock of Harvard on the connection between scores on a self-injury IAT and real-world self-harm. “I think much more needs to be done to establish validity in this domain, but that work seems promising,” he said. (Here’s a “brief report” from the American Journal of Psychiatry summarizing one such study that Nock published with Banaji.)

But when it comes to social-scientific claims, exciting-seeming studies only get you so far. These days, especially in light of psychology’s replication crisis, a bunch of studies showing something to be true isn’t automatically viewed as sufficient evidence that that thing is true. Researchers have realized that all sorts of biases can creep into the process of conceptualizing, conducting, and publishing studies, and sometimes these biases lead to the overhyping of intriguing-seeming but ultimately spurious findings. This has happened over and over again, contributing to what has become a moment of grim soul-searching among many research psychologists — particularly social psychologists, whose field has experienced a particularly worrying string of failed replications.

That’s why one of the current gold standards for assessing whether a given effect is “real” is meta-analysis, or the process of collecting all the studies you can find on a given question and, in effect, averaging their results. This, the thinking goes, can reduce experimenter error and bias. It isn’t perfect, but it’s a much better method than relying on any handpicked collection of individual studies. And when you use meta-analyses to examine the question of whether IAT scores predict discriminatory behavior accurately enough for the test to be useful in real-world settings, the answer is: No. Race IAT scores are weak predictors of discriminatory behavior.

We know this because of a protracted meta-analytical back-and-forth that has played out in the pages of the Journal of Personality and Social Psychology, a flagship publication in the field of psychology. Since 2009, a team of the IAT’s architects — Greenwald, Nosek, Banaji, and others, with different names on different papers — have duked it out with some of the test’s leading critics: Oswald, Mitchell, Blanton, Jaccard, and Tetlock.

The arguments and subarguments get pretty complicated and technical, but two important points stand out. One is that the most IAT-friendly numbers, published in a 2009 meta-analysis lead-authored by Greenwald, which found fairly unimpressive correlations (race IAT scores accounted for about 5.5 percent of the variation in discriminatory behavior in lab settings, and other intergroup IAT scores accounted for about 4 percent of the variance in discriminatory behavior in lab settings), were based on some fairly questionable methodological decisions on the part of the authors. The Oswald team, in a meta-analysis of their own published in 2013, argued convincingly that Greenwald and his colleagues had overestimated the correlations between IAT scores and discriminatory behavior by including studies that didn’t actually measure discriminatory behavior, such as those which found a link between high IAT scores and certain brain patterns (these studies, in fact, found some of the highest correlations). The Oswald group also claimed — again, convincingly — that the Greenwald team took a questionable approach to handling so-called ironic IAT effects, or published findings in which high IAT scores correlated with better behavior toward out-group than in-group members, the theory being that implicitly biased individuals were overcompensating. Greenwald and his team counted both ironic and standard effects as evidence of a meaningful IAT–behavior correlation, which, in effect, allowed the IAT to double-dip at the validity bowl: Unless the story being told is extremely pretzel-like, it can’t be true that high IAT scores predict both better and worse behavior toward members of minority groups. If one study finds a correlation between IAT scores and discriminatory behavior against out-group members, and another, similarly sized study finds a similarly sized correlation between IAT scores and discriminatory behavior against in-group members, for meta-analytic purposes those two studies should average out to a correlation of about zero. That isn’t what the Greenwald team did — instead, they in effect added the two correlations as though they were pointing in the same direction.
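
The statistical point is easy to demonstrate with a toy example. Two equally sized studies with equal-and-opposite correlations should roughly cancel out when averaged, not reinforce each other. (Real meta-analyses weight by sample size and average on a transformed scale; with equal samples, the cancellation works out the same.)

```python
# A toy version of the "double-dipping" critique, with invented effect sizes.
study_a = +0.20  # higher IAT scores track more discrimination against the out-group
study_b = -0.20  # "ironic" result: higher IAT scores track better treatment

print((study_a + study_b) / 2)            # averaging signed effects: 0.0
print((abs(study_a) + abs(study_b)) / 2)  # counting both as support: 0.2
```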

The second, more important point to emerge from this years-long meta-analytic melee is that both critics and proponents of the IAT now agree that the statistical evidence is simply too lacking for the test to be used to predict individual behavior. That’s not to say the two teams don’t still disagree on many issues — they do, and as we’ll see there’s some genuine bad blood — but on this point, the architects have effectively conceded. They did so in 2015: The psychometric issues with race and ethnicity IATs, Greenwald, Banaji, and Nosek wrote in one of their responses to the Oswald team’s work, “render them problematic to use to classify persons as likely to engage in discrimination.” In that same paper, they noted that “attempts to diagnostically use such measures for individuals risk undesirably high rates of erroneous classifications.” In other words: You can’t use the IAT to tell individuals how likely they are to commit acts of implicit bias. To Blanton, this is something of a smoking gun: “This concession undermines the entire premise of their webpage,” he said. “Their webpage delivers psychological diagnoses that even they now admit are too filled with error to be meaningful.”

Now, there’s still some debate over exactly how low the correlation between race IAT scores and behavior is. One important upcoming meta-analysis, which we’ll return to later in another context, found that such scores can explain less than 1 percent of the variance observed in discriminatory behavior. The researchers Rickard Carlsson and Jens Agerström, in a meta-analysis of their own published in the Scandinavian Journal of Psychology last year, pinned the figure at about 2 percent — but argued that the extant research is of such low statistical quality it’s impossible to draw any meaningful conclusions from it. “Attempting to meta-analytically test the correlation between IAT and discrimination thus appears futile,” they wrote. “We are, essentially, chasing noise, and simply cannot expect any strong, or even moderate, correlations, based on the current literature.”

Philip Tetlock, one of the IAT’s more outspoken critics

But at a certain level, it doesn’t matter whether the “real” value here is 1 percent or 2 percent or (less likely) 3 percent: The point is that the key experts involved in IAT research no longer claim that the IAT can be used to predict individual behavior. In this sense, the IAT has simply failed to deliver on a promise it has been making since its inception — that it can reveal otherwise hidden propensities to commit acts of racial bias. There’s no evidence it can.

***

In examining the history of the IAT, it’s clear that early on, the test’s architects and most enthusiastic proponents got ahead of themselves in their claims that the IAT accurately measured implicit bias, never fully grappling with the possibility that the test captures other things in addition to, or instead of, implicit bias. But again: All the test itself measures is differences in reaction times, and if those reaction-time differences haven’t been proven to predict real-world behavior, it doesn’t make sense to tag someone with a high IAT score as “implicitly biased,” except in a very trivial sense of the term.

When the IAT was introduced in 1998, the reaction-time differences hadn’t yet been connected to any real-world outcomes, and they wouldn’t be until the first paper linking IAT scores to behavior was published in 2001. And yet between the test’s introduction and that first study, there was a spate of media coverage that failed to ask the same skeptical questions about the IAT that one should ask about any new psychological instrument. This helped boost the test’s reputation as a reliable, valid barometer of implicit bias when, by normal scientific standards, the jury was very much still out.

Over and over, that early coverage — and early statements from Banaji and Greenwald — prematurely implied that there was a connection between IAT scores and real-world outcomes. The University of Washington press release announcing the introduction of the IAT, for example, described the IAT as a “psychological tool that measures unconscious components of prejudice.” Early coverage from the Associated Press and the Times and countless other outlets all echoed the idea that the IAT measured something that had implications for real-world manifestations of prejudice and discrimination.

To be fair, there were definitely hints of caution sprinkled here and there — “the researchers note that no evidence links a person’s performance on the test with attitudes or behavior in the outside world,” reported the Times. But overall, any lay reader would have come away from early coverage of the IAT convinced that the test had vital real-world implications for understanding racism. Banaji and Greenwald contributed to this idea. In a talk she gave at the American Psychological Society’s convention, Banaji described the invention of the IAT as similar to the invention of the telescope — it had ushered in a revolution for how we see the world. And in a March 19, 2000, episode of Dateline, Greenwald noted that “If a police officer is going to shoot two-tenths of a second faster at an African-American than a European-American, well, that could be a matter of life and death.” The implication was clear: The IAT can predict absolutely vital behavioral outcomes. (The transcript for that episode doesn’t appear to be Google-able, but it can be found on Nexis.)

It’s important to be fair here. Greenwald and Banaji weren’t bandying about homeopathic remedies or astrological insights. Rather, from their point of view, the IAT fit neatly into preexisting theories of prejudice and intergroup relations they had been working on for an extended period. They thought they were onto something big, in other words. And researchers have every right to tout their exciting findings to the public, of course.

But they also have a responsibility to not get ahead of the available evidence. In reality, what Greenwald and Banaji had found around the turn of the millennium were certain predictable patterns in how quickly different sorts of people responded to different sorts of stimuli. Majority groups tended to score higher than minority groups on the IAT, for example. That’s interesting in its own right, but at the time Greenwald and Banaji certainly hadn’t established any solid, real-world connection between these scores and any observable marker of discriminatory behavior. And yet the test’s legend grew.

Part of the problem was that some of the early papers that did claim to find a link between IAT scores and discriminatory behavior had backbreaking problems that wouldn’t be discovered until much later on. That first paper from 2001, for example, had a major impact when it was published by Allen R. McConnell and Jill Leibold, and has since been cited heavily. But a group of six researchers that included Blanton, Mitchell, and Tetlock eventually uncovered serious methodological problems with it, which they highlighted in a 2009 article. Those problems effectively overturn the paper’s main finding of a correlation between IAT scores and discriminatory behavior (though the authors contest that). Another influential paper, published in 2007 by the researchers Jeremy Heider and John Skowronski, also reported impressive findings over the course of two studies. But as Blanton and Mitchell explained in 2011, both studies were riddled with crippling errors: The first one simply excluded data that would have shown the researchers had in fact discovered no link between IAT scores and discriminatory behavior, while the second was based on IAT data that the authors admitted had been partially fabricated (“Heider indicated it was an overzealous undergraduate,” said Blanton in an email). Other IAT studies, too, have been conducted in sloppy and misleading manners, Blanton and his colleagues have discovered over the years. But because of the lag time between publication and debunking, for a while there was the illusion that the IAT’s most impressive claims rested on sound empirical footing. Many people, particularly members of the public not up on the latest literature, seem to still believe this.

But there have always been alternate potential explanations for what the IAT really measures. From early on, skeptics of Greenwald and Banaji’s claims have highlighted the possibility that the test doesn’t really, or doesn’t only, capture implicit bias; in 2004, for example, Hal Arkes and Tetlock published a paper entitled “Would Jesse Jackson ‘Fail’ the Implicit Association Test?” in which they argued that it could be the case that people who are more familiar with certain stereotypes score higher on the IAT, whether or not they unconsciously endorse those stereotypes in any meaningful way. Along those same lines, some researchers have suggested that it could be the case that those who empathize with out-group members, and are therefore well aware of the negative treatment and stereotypes they are victimized by, have an easier time forming the quick negative associations with minority groups that the IAT interprets as implicit bias against those groups.

It would be wrong to say that the architects of the IAT and their colleagues ignored the possibility of these sorts of alternate explanations entirely — in one study from 2000, for example, Banaji, Greenwald, and some colleagues examined whether white people’s familiarity with white people as opposed to black people could be driving high IAT scores, and concluded that it wasn’t sufficient to account for them. But they also don’t appear to have fully explored alternate explanations for what the IAT measures, and they certainly didn’t linger much on this possibility when they were promoting the test.

Other researchers, though, have looked closely at the possibility that the IAT isn’t just measuring implicit bias. And they have indeed shown in multiple studies that high IAT scores may sometimes be artifacts of empathy for an out-group, and/or familiarity with negative stereotypes against that group, rather than indicating any sort of deep-seated unconscious endorsement of those associations. In 2006, for example, Eric Luis Uhlmann, Victoria Brescoll, and Betsy Levy Paluck published the results of a very clever study they conducted on undergraduates in the Journal of Experimental Social Psychology. In one experiment, the participants “were randomly assigned to either (1) associate the novel group Noffians with words related to oppression and the novel group Fasites with words related to privilege,” or the reverse — Noffians privileged, Fasites oppressed — and were then given a race IAT, but with Fasites and Noffians standing in for white and black. As it turned out, “participants were faster to associate Noffians with ‘Bad’ after being conditioned to associate Noffians with oppression, victimization, and discrimination.” In other words, the experimenters were able to easily induce what the IAT would interpret as “implicit bias” against Noffians simply by forming an association between them and downtroddenness in general.

That isn’t a perfect analogy to what’s going on with the IAT, of course, since terms like white and black are deeply embedded, culturally reinforced categories rather than novel ones like Noffian. But it’s still a striking result, as is another study published in that same journal by the psychologists Michael Andreychik and Michael Gill. That study dealt with these questions in a slightly more real-world way. Andreychik and Gill focused on the difference between so-called external or internal explanations for why certain out-groups are disadvantaged. In the case of African-Americans, for example, external explanations highlight “forces outside the group, such as the centuries of injustice they endured,” while internal explanations focus on the group’s own shortcomings — often, in this context, by referencing racist stereotypes about black people being lazy, less intelligent, and so forth. Past research has shown that those who endorse external explanations for disparate outcomes tend, unsurprisingly, to express more compassion and empathy for the groups in question. That is, if you believe a group hasn’t had a fair shake, you are more likely to feel bad about the fact that they occupy a lower level within society’s hierarchy.

Andreychik and Gill found that for those students who endorsed external explanations for the plight of African-Americans or a novel group, or who were induced to do so, high IAT scores correlated with greater degrees of explicitly reported compassion and empathy for those groups. For those who rejected, or were induced to reject, external explanations, the correlation was exactly reversed: High IAT scores predicted lower empathy and compassion. In other words, the IAT appeared to indicate very different things for people who did or didn’t accept external explanations for black people’s lower standing in society. This suggests that sometimes high IAT scores indicate that someone feels high degrees of empathy and compassion toward African-Americans, and believes that the group hasn’t been treated fairly. Now, it could be that such people also have high amounts of implicit bias, but it’s striking how easily IAT scores can be manipulated with interventions that don’t really have anything to do with implicit bias.

And those are just two examples of the many published instances in which the IAT appears to be measuring something other than implicit bias. One study appeared to demonstrate the existence of a “stereotype threat”-style effect in the IAT — whites who were more concerned about appearing racist were scored as more “biased” by the test. In the initial version of the IAT’s scoring algorithm, there was even a correlation between cognitive-processing speed and IAT score, the researchers Sam G. McFarland and Zachary Crouch found: Those who were a bit cognitively slower got higher IAT scores, meaning they were told they were more biased than faster test-takers. “An older person is going to possibly be told they’re more racist than a younger person,” said Blanton, “or a person who does crossword puzzles or plays computer games will possibly be told they’re less racist compared to someone who doesn’t.”

To account for this and other errors, in 2007 the test’s architects adopted a new algorithm. But when Blanton and his colleagues popped that algorithm’s hood and poked around, they found that it contained a very weird quirk of its own — a potentially serious one. “The new problem is that anyone who concentrates and is really consistent [in their responses] is going to come across as extremely biased,” explained Blanton. In a 2014 article in the journal Assessment, Blanton, James Jaccard, and Christopher N. Burrows reported that since the algorithm changed, the less statistically “noisy” a given IAT test session is — think someone taking the test in a loud room, or with a bad hangover, or when their mind is elsewhere — the more the test will report the test-taker is biased against the out-group.

Blanton said that he has never seen a psychological instrument in which less statistical noise predictably biases the results upward or downward. “What should happen is that as you remove random noise, you just get a better estimate of [the thing being measured],” he explained. Blanton provided a surprising example of how this plays out in test sessions, according to his team’s math: If a race IAT test-taker is exactly 1 millisecond faster on each and every white/good as compared to black/bad trial, they “will get the most extreme label,” he said. That is, the test will tell them they are extremely implicitly biased despite their having exhibited almost zero bias in their actual performance. That’s an extreme example, of course, but Blanton says he’s confident this algorithmic quirk is “affecting real-world results,” and in the Assessment paper he and his colleagues published the results of a bunch of simulated IAT sessions which demonstrated as much.
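
To see the mechanism concretely, consider a stylized score computed the way the new algorithm roughly works: the difference in mean reaction times divided by the standard deviation of all the reaction times. (The real algorithm also filters trials and penalizes errors; the sketch below, with invented numbers, is a simplification.) Hold the mean difference at a single millisecond, shrink the trial-to-trial variability, and the score balloons, just as Blanton describes.

```python
# A stylized illustration of the quirk Blanton describes, assuming a
# D-style score: (difference in mean latencies) / (sd of all latencies).
# A simplification of the real algorithm; all numbers are invented.
import statistics

def block(mean_ms, spread_ms, n=60):
    # Half the trials spread_ms below the mean, half above, so the block
    # mean is exact and the spread is controlled.
    return [mean_ms - spread_ms] * (n // 2) + [mean_ms + spread_ms] * (n // 2)

for spread in (100, 10, 1, 0.1):
    white_good = block(600, spread)   # 600ms on average
    black_good = block(601, spread)   # exactly 1ms slower on average
    d = ((statistics.mean(black_good) - statistics.mean(white_good))
         / statistics.stdev(white_good + black_good))
    print(f"trial-to-trial spread = {spread:>5}ms: score ~ {d:.2f}")

# As the spread shrinks, the score climbs toward 2.0, well past the
# conventional "strong" cutoff, even though the underlying difference
# never exceeds a single millisecond.
```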

To be sure, there’s no perfect psychological instrument. They all have their flaws and shortcomings — sometimes maddening ones. But there may not be any instrument as popular and frequently used as the race IAT that is as riddled with uncertainty about what, exactly, it’s measuring, and with the sorts of methodological issues that in any other situation would cause an epidemic of arched eyebrows. “What I’ve been convinced of is it’s very difficult to break down the origins of these associations,” said Elizabeth Paluck, a prejudice and intergroup relations researcher at Princeton and a co-author on the “Noffians” study. “They can’t be all attributed to personal preference, they certainly come from cultural associations and conditioning.” As for the authors of the internal/external explanations paper, they note in it that “our analysis is perfectly compatible with the possibility that, perhaps for the majority of people, implicit negativity is likely to be prejudice-based.” But even if you accept that, it means for a substantial minority of people, the implicit negativity revealed by the IAT isn’t connected to prejudice — which is one reasonable way to interpret those underwhelming meta-analyses.

So how much of a given IAT score is an actual marker of implicit bias? One researcher, Ulrich Schimmack, just published a blog post in which he re-analyzed an influential 2001 IAT study to try to better understand this issue. He found that just 25 percent of the variance in IAT scores generated by a single version of the test is explained by actual implicit bias, and that for technical reasons this may well be a significant overestimate. But overall, there’s simply a dearth of information on this issue. At the moment, there’s no good, empirically backed reason to assume that a given IAT score reflects your actual level of implicit bias, as opposed to a noisy mishmash of other stuff (a mishmash which probably includes some unknown quantity of “real” implicit bias). That’s because the IAT’s creators didn’t bother fully investigating this question before releasing the test to the public and telling all of us, prematurely, that it really measures implicit bias, and does so accurately.

***

Does the general public understand the IAT’s very serious limitations? It seems extremely unlikely. At the moment, after all, there’s something of a Schrödinger’s test situation going on in which the IAT both does and doesn’t predict discriminatory behavior, according to Banaji and Greenwald. If you read the mass-market explanation of the test published in book form — as of this writing, Blindspot is No. 11 in Amazon’s behavioral-science section — it does: The IAT “predicts discriminatory behavior even among research participants who earnestly (and, we believe, honestly) espouse egalitarian beliefs,” and “has been shown, reliably and repeatedly” to do so. In fact, this is a “clearly… established” “empirical truth.” If you wade into a complicated meta-analytic back-and-forth inaccessible to most readers, on the other hand, the IAT doesn’t predict behavior: The psychometric problems endemic to these tests “render them problematic to use to classify persons as likely to engage in discrimination,” and “attempts to diagnostically use such measures for individuals risk undesirably high rates of erroneous classifications.” This is a fairly remarkable about-face given that just two years separated the book and the article. (“The two statements in the paragraph are not contradictory,” Greenwald said when I asked him and Banaji to comment on this discrepancy. “You have to distinguish statements diagnostic of individuals, which we’ve resisted for multiple reasons, and statements of generalizations about aggregate data based on research of multiple (sometimes many) subjects.” But both snippets clearly reference individual-level predictions.)

Either way, while proponents of the test have acknowledged it can’t predict individual behavior with a useful degree of accuracy, both they and some critics of the test have maintained that the IAT can still be useful for two purposes: as a means of estimating the level of implicit bias in society, and as an educational tool. Both these uses raise serious questions too, though.

In making the case for the race IAT’s use as a means of producing bird’s-eye-view estimates of implicit bias in society, Greenwald, Banaji, and Nosek argue in that same paper that its psychometric issues “diminish substantially as sample size increases. Therefore, limited reliability and small-to-moderate effect sizes are not problematic in diagnosing system-level discrimination, for which analyses often involve large samples.” And the race IAT is, in fact, commonly used to generate estimates of the level of implicit bias in society, or among various groups. There are examples everywhere. Slate, for example, reported in 2012 that an Associated Press IAT study “found that [Democrats and Republicans] are far closer in attitudes” than explicit measures of prejudice reveal, “with 55 percent of Democrats and 64 percent of Republicans having anti-black feelings.” Then there was a study of biracial adults conducted in 2015 by Pew Social Trends: “According to the [IAT], fully 42% of all white and black biracial adults had a pro-white bias, just short of the 48% of all whites that felt the same way and 7 percentage points higher than the share with a pro-black bias (35%).” And one figure from Project Implicit’s data seems to pop up everywhere: that, as one Washington Post review of Blindspot put it, “75 percent of [the IAT’s] takers, including some African Americans, have an implicit preference for white people over black people.”

Again, there’s that overheated language, with IAT scores described not as differences in reaction times, but as signifying “anti-black feelings” or “pro-white bias” or “implicit preference for white people over black people.” And that nicely sums up the problem with using the IAT in this manner. It’s broadly true that some of the imperfections in a given measurement tool get washed out if you use that tool over and over to generate more zoomed-out estimates. If my depression scale overestimates depressive symptoms in some patients and underestimates them in others, there still might be situations in which I can use a bunch of data generated from it to make certain broad statements about population-level rates of depression.
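
That washing-out logic is easy to see when the measurement error is unbiased. Here’s a quick sketch, with invented numbers, in the spirit of the depression-scale example: noisy individual scores, but an aggregate that lands close to the truth.

```python
# Toy illustration: unbiased measurement error washes out in aggregates,
# even when individual scores are unreliable. All numbers invented.
import numpy as np

rng = np.random.default_rng(1)
true_symptoms = rng.normal(50, 10, 100_000)              # actual levels
measured = true_symptoms + rng.normal(0, 15, 100_000)    # noisy but unbiased

print(round(true_symptoms.mean(), 1), round(measured.mean(), 1))  # both ~50.0
```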

But that’s assuming I have already demonstrated that the depression scale is valid — that it measures, albeit imperfectly, actual depressive symptoms. It’s harder to make this sort of zooming-out argument once you’ve acknowledged that a given tool is too noisy to make valid individual-level measurements at all. “If you’re not willing to say what the positive [IAT score] means at the individual level, you have no idea what it means at the aggregate level,” said Blanton. In other words, “If I’m willing to give 100 kids an IQ test, and not willing to say what an individual kid’s score means,” he said, “how can I then say 75 percent of them are geniuses, or are learning disabled?” So there is good reason to question that 75 percent figure, or any other figure of the prevalence of bias generated by the race IAT, given how foggy things get at the individual level.

What about the claim that the IAT is useful as an educational tool? As Paluck, the Princeton prejudice researcher, explained in an email, the IAT’s potency on this front could be a major reason the test has spread as far and wide as it has. “We’ve got the last couple generations growing up flooding into college courses thinking that they’re post-racial, or that they’re post-racist at the very least,” she said. “This is a powerful way to make the point that no, you’re not, and you still come out looking biased in this way. And you shouldn’t trust your instincts — you shouldn’t trust yourself, you shouldn’t trust your belief about yourself that you’re free of racism. And that’s valuable, and that’s why it’s also been wildly successful with instructors, because it does make a very powerful point in classrooms. Then you can extend that to things like diversity trainings.”

It is absolutely the case that in a country as segregated and racially troubled as the United States, efforts to better inform citizens (particularly members of majority groups) about the sometimes subtle nature of bigotry and racial disparities are important. Plenty of research, for example, has shown that when it comes to your probability of being called in for a job interview after submitting a résumé, having a black-sounding name as compared to a white-sounding one effectively penalizes you in much the same way having less education or fewer qualifications would. Surely, at least some of that effect is attributable to implicit bias, and the same goes for many of the other areas in which racial cues have been correlated with discrepant outcomes.

So there is nothing wrong with implicit-bias training that covers this sort of research. Nor is there anything wrong with IAT-based trainings which merely explain to people that they may well be carrying around certain associations in their head they are unaware of, and that researchers have uncovered patterns about who is more likely to demonstrate which response-time differences. In situations where one group holds historic or current-day power over the other, for example, members of the in-group do tend to score higher on the IAT than members of the out-group do. Some of these between-group differences appear to be pretty robust, and they deserve further study. These are all worthwhile subjects to discuss, as long as it is made clear to test-takers that their scores do not predict their behavior.

If, on the other hand, in the course of being “educated,” individual test-takers are being provided with confusing, misleading, or improperly hedged information about their own propensity to act in a racially biased manner, that is a big problem. Psychology has sturdy norms in place against exposing test-takers to misleading assessments, and it isn’t hard to see why: A depression or anxiety test that was wildly inaccurate yet popular would lead to a lot of false diagnoses, and could do serious harm.

Few would argue that a false IAT result would put someone at risk in the same way a missed diagnosis of major depression might. But the field of psychology still frowns upon the idea of people in positions of authority and respect misleading test-takers. That’s why the IAT’s proponents are so adamant that the IAT shouldn’t be seen as “diagnosing” anything. “We have always argued that the IAT should NOT be used as a diagnostic tool,” said Banaji in an email. “It is not, as I’ve said, a DNA test. The good news is, it has never been used that way, not with the developers of the test (us) speaking against any such use.” Nosek echoed that sentiment in an email as well. “Across the history of the website and our writing, we collectively have been opposed to interpretation of the IAT as a diagnosis of any kind, and of its application for selection purposes,” he wrote.

Here Banaji and Nosek seem to be focusing on one particular sense of the word diagnosis — the idea of the test being used to screen people for employment, educational, or other opportunities. But that’s not all “diagnosis” means. There’s a broader scientific sense of the word that matters here, and setting aside Orwellian scenarios in which people are denied jobs because of their IAT scores, critics of the test have expressed serious skepticism about the claim that the IAT shouldn’t be seen as diagnostic.

Take, for example, a paragraph from one of the best and most substantive critiques of the IAT that has yet been written, an article published by the researchers Klaus Fiedler, Claude Messner, and Matthias Bluemke in the European Review of Social Psychology in 2006:

Is the IAT actually used as a diagnostic inference tool, so that psychometric criteria must be applied? Or is it just an ordinary dependent measure for experimental research, like a “recognition test” or an informal “speed test”, as proponents of the IAT have recently asserted in response to psychometric critique? To be sure, in science it matters little what proponents and authors declare; what is relevant is what function the IAT is playing in scientific reality. In this regard, it would hardly be justified to negate the diagnostic role attributed to the IAT. The IAT is abundantly used, and was originally meant … as a tool for measuring individual differences.

Fiedler and his colleagues go on to lay out many examples of the IAT’s proponents treating it like a diagnostic tool, specifically their tendency to tout it as a means of measuring some hidden, otherwise unobservable attribute within test subjects. By this logic, it’s hard to see how the test isn’t diagnostic.

Part of this debate, of course, hinges on the question of whether individual test-takers view their IAT scores as “diagnostic” in the sense of providing important information about themselves which they view as coming from a credible source. The more they do, the stronger the ethical case against certain usages of the test. Nosek downplayed the impact the IAT has on this front, drawing an analogy to another kind of test: “[W]hen my daughter takes a math test in her fourth-grade class, the score she receives is not a diagnosis of her math ability or even her math achievement. Even so, it can be useful.”

Is it fair to compare the IAT to a math test, as just one piece of feedback test-takers will integrate into a broader context and not get too hung up on? If so, maybe there are no serious ethical issues here. But if you listen to Banaji and Greenwald, it is plainly the case that Nosek’s interpretation isn’t the norm — for years they have been expressing the opposite sentiment. In the University of Washington press release covering the IAT’s unveiling, for example, the author notes that “Banaji and Greenwald admitted being surprised and troubled by their own test results.” In 2005, Banaji would go further, telling the Washington Post she was “deeply embarrassed” at her test result. “I was humbled in a way that few experiences in my life have humbled me.” (The Post link appears to be dead but the quote shows up elsewhere.) In Blindspot, Greenwald described his first IAT session as a “moment of jarring self-insight … I can’t say if I was more personally distressed or scientifically elated to discover something inside my head that I had no previous knowledge of.”

They’re not the only ones to report having been deeply affected by their IAT results. Describing the emotional nature of one’s first IAT session became so common that in 2008, John Tierney mused on the Times website that “It’s something of a custom, when discussing the IAT, to disclose your own score on the test along with your unease.” In other instances, people from minority backgrounds are shocked to learn they are biased against their own people, and naturally respond to this news with discomfort. In Blink, for example, Gladwell describes his unease at finding out he has a moderate level of anti-black implicit bias, despite being biracial. And in 2015, KQED ran a story in which a Persian pharmacy resident said she was struck by her IAT results. “It was like, actually, you’re biased and you don’t like brown people and you don’t like Muslims,” she told reporter April Dembosky. “Which is interesting for me because that’s kind of the two things that I am.”

In light of all this, it’s hard to disagree with the conclusion of Fiedler and his colleagues that it is only “fair and appropriate to treat the IAT with the same scrutiny and scientific rigour as other diagnostic procedures.” If that’s true, then between Project Implicit and cutting-edge diversity trainings, the IAT has misled potentially millions of people. Over and over and over and over, the IAT, a test whose results don’t really mean anything for an individual test-taker, has induced strong emotional responses from people who are told that it is measuring something deep and important in them. This is exactly what the norms of psychology are supposed to protect test subjects against.

***

It would take thousands and thousands more words to fully lay out all the problems with the IAT and how it has been applied. For example, the test’s scoring convention assumes that a score of zero represents behavioral neutrality — that someone with a score at or near zero will treat members of the in-group and out-group the same. But Blanton and his colleagues found that in those studies in which the IAT does predict discriminatory behavior, there’s a “right bias”: behavioral neutrality actually corresponds to a somewhat positive score, which means a score of zero corresponds to behavior that favors the out-group over the in-group. This offers even more evidence that there is something wrong with the entire basic scoring scheme.
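The calibration problem has a simple geometry, which a short sketch can show (again with entirely fabricated numbers; nothing below comes from Blanton’s reanalysis or from any real IAT data set). If the regression line relating behavior to test scores crosses behavioral neutrality at a positive score, then a score of zero predicts behavior that favors the out-group:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Invented data. Positive behavior = favors the in-group; 0 = neutral treatment.
iat = rng.normal(0.35, 0.40, n)
# Build in a shifted zero point: neutral behavior occurs near a score of +0.5.
behavior = 0.6 * (iat - 0.5) + rng.normal(0.0, 0.5, n)

slope, intercept = np.polyfit(iat, behavior, 1)   # ordinary least squares line
print(f"Score at which predicted behavior is neutral: {-intercept / slope:.2f}")
print(f"Predicted behavior at a score of zero: {intercept:+.2f} (negative favors the out-group)")
```

In this toy data set, a test-taker scoring zero is predicted to favor the out-group, so labeling a zero score “neutral” misdescribes the very behavior the score is supposed to predict.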

Blanton, Tetlock, Mitchell, and others have also highlighted the fact that the IAT team adopted completely arbitrary guidelines regarding who is labeled by Project Implicit as having “slight,” “moderate,” or “strong” implicit preferences. These categories were never tethered to any real-world outcomes, and sometime around when the IAT’s architects changed the algorithm, they also changed the cutoffs, never fully publishing their reasoning. As a result of this switch, write Mitchell and Tetlock in a chapter on the IAT in the newly published book Psychological Science Under Scrutiny: Recent Challenges and Proposed Solutions, “the percentage of persons supposedly showing strong anti-black bias on the IAT dropped from 48% to 27%. This change in levels of implicit prejudice was not due to a sudden societal shift, nor due to the findings of any studies linking particular bands of IAT scores to particular behaviors. This change was due solely to the researchers’ change in definitions.”
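It is easy to see how much work the cutoffs are doing. In the sketch below (all values invented; neither cutoff is Project Implicit’s actual threshold), the same fixed population of scores yields dramatically different rates of “strong” bias depending solely on where the line is drawn:

```python
import numpy as np

rng = np.random.default_rng(2)
# One fixed population of hypothetical scores; nothing about the people changes.
scores = rng.normal(0.30, 0.40, 1_000_000)

# Two made-up labeling schemes for a "strong" implicit preference.
for name, cutoff in (("looser cutoff", 0.30), ("stricter cutoff", 0.55)):
    share = (scores >= cutoff).mean()
    print(f"{name} ({cutoff:+.2f}): {share:.0%} labeled 'strong'")
```

Move the cutoff and the headline prevalence moves with it, even though not a single score, and not a single person, has changed.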

Finally, there are hints which suggest that the race IAT could directly diminish the quality of certain intergroup interactions. In a 2012 study published in Psychological Science, for example, the psychologist Jacquie Vorauer had a bunch of white Canadians complete a work task with an aboriginal partner. Prior to doing so, some of the participants took an IAT pertaining to aboriginal people, some took a non-race IAT, and some were asked for their explicit feelings about the group. Aboriginals in the race-IAT group subsequently reported feeling less valued by their white partners as compared to aboriginals in all of the other groups. So while IAT proponents have suggested it could be used to improve intergroup relations, writes Vorauer, “if completing the IAT enhances caution and inhibition, reduces self-efficacy, or primes categorical thinking, the test may instead have negative effects.” This is just one finding, of course, and it comes from a contrived lab setting, but it suggests some troubling possibilities.

The IAT’s myriad problems pose profound challenges to its underlying theory, and to the practice of labeling people in certain ways based on their test results. So why has the test managed to mostly avoid critical scrutiny, except in academic papers members of the public generally don’t read?

For one thing, the test offers a lot to members of the public who are concerned about racism, whether they are white and concerned about their out-group biases, or nonwhite and concerned about the possibility that they have internalized bias against their own group. Taking the IAT is a way for them to feel like they are part of the solution. Now I get it — now I understand that my implicit bias is contributing to America’s race problem. This can explain the strange but common phenomenon of test-takers loudly broadcasting results which imply they are implicitly racist: It’s a way of signaling they’re serious about investigating their own complicity in a big, complicated system of oppression. There wouldn’t be anything wrong with that, of course, if the IAT were in fact providing test-takers useful information about their level of implicit bias.

The broader story told by the IAT is, at the moment, quite politically palatable and intuitively satisfying. Not only is implicit bias driving all sorts of racially unfair outcomes, that story tells us, but it’s something that we can detect and measure in ourselves, helping to raise our consciousness. “I think the reason behind adoption of implicit-bias training is simple: It is now the thing to do to demonstrate commitment to diversity and redressing inequality,” said Mitchell.

And the IAT story line is catnip not only to progressive members of the public but also to psychological researchers who want to contribute, in a rigorous-seeming, quantifiable way, to research about racism and how to dismantle it. Patrick Forscher, a postdoctoral researcher at the University of Wisconsin who has studied the IAT, explained that politics have bled into science a bit when it comes to the instrument: “The problem is that implicit measures, and the IAT in particular, became a critical part of a political narrative about why disparities between social groups exist in the United States,” he explained in an email. “Thus, claims about implicit measures became, to a certain extent, political claims, not just scientific claims.”

If it is politically palatable to embrace the IAT and the nationwide search for our inner bias, then to criticize the test is to be on the wrong side of the progressive conversation about race. That, at least, is what Mahzarin Banaji seemed to imply when I emailed her some questions about common methodological critiques of the IAT. In her responses, she repeatedly asserted that criticisms of the IAT come from a small group of reactionary researchers, and that questioning the IAT is not something normal, well-adjusted people do:

Of course it annoys people when a simple test that spits out two numbers produces this sort of attention, changes scientific practice, and appeals to ordinary people. Ordinary people who are not scared to know what may be in their minds. It scares people (fortunately a negligible minority) that learning about our minds may lead people to change their behavior so that their behavior may be more in line with their ideals and aspirations. The IAT scares people who say things like “look, the water fountains are desegregated, what’s your problem.”

[…]

By and large I operate on the view that I need to pay attention to REAL criticisms of the IAT. Criticisms that come from people who are experts – that is[,] people who understand the science’s assumptions about response latency measures. People who do original work using such methods. I’m sorry to say this but we are all so far along in our work that I at least only read criticisms from people who are experts. I don’t read commentaries from non-experts. There’s too much interesting stuff to do and too many amazing people doing it for me to justify worrying about a small group of aggrieved individuals who think that Black people have it easy in American society and that the IAT work might make their lives easier. The IAT as you know is not about any one group, but this small group of critics ignore everything other than race, and it may be a good idea to find out why that may be the case.

In a later email, Banaji again emphasized that there is something wrong, perhaps deeply so, with critics who focus so much on black/white implicit bias and the specifics of how it is measured. “[I]t is very important, if one wants to understand the role of implicit cognition in decisions about humans, to give up the pathological focus on race in the Black-White context that some people seem to cling to,” she wrote. “These are, of course, unique social categories which is why we ourselves pay attention to it. But the fetish with Black-White race relations in some folks is something that science won’t be able to answer because it seems not to be about the evidence. It will need to be dealt with by them in the presence of their psychotherapists or church leaders.” She concluded her email, “Someday when the history of this work is written, long after we are gone, the reaction from a half a dozen people to the mounting data will be even more interesting as a case study than the data about implicit cognition itself.”

Banaji is simply incorrect that concern over the IAT’s methodological shortcomings is confined to a small group of obsessed researchers, or that their critiques aren’t “real.” Yes, there is a core team of researchers who have spent the most time critiquing the test (just as there is a core team, of which Banaji is a member, of pro-IAT researchers who have spent the most time promoting it), but by now scores of papers have been published, some in top journals, which complicate the story Banaji and her colleagues have been telling for years about what the test measures and how well it performs at that task. Some of those papers have been published by researchers, like Betsy Levy Paluck of Princeton, who have dedicated their careers to understanding intergroup intolerance and how to remedy it. As for the fact that some researchers spend a lot of time critiquing and researching the properties of the black-white IAT, it isn’t a mystery why — again, that’s the most publicized and likely the most administered and influential flavor of the IAT.

But more important, it’s a rather big deal for the chair of Harvard’s psychology department to accuse researchers who are engaging in thoroughly commonplace methodological critiques of being animated by racism and possibly mental illness. Those are very serious charges. I brought this critique to Blanton, who is one of the most prolific and statistically sophisticated of the test’s critics (for what it’s worth, he is the co-author of an instructional book on methods in psychological research). “This topic is too important for bad science,” he said. “That is my philosophy on this. This isn’t the first time that someone has intimated that I am a political conservative, that I have a lax attitude toward bias, that I’m indifferent to it. They point out I’m a white male.” (The core group of IAT critics spans a sizable chunk of the political spectrum: Blanton described himself as liberal, while Tetlock is known to be somewhat conservative, or very conservative by the standards of social psychology.) “This isn’t the first time it’s happened,” continued Blanton. “My attitudes toward this is that I’m very comfortable with the fact that this is about the need for good science, and a construct as important as racism, we don’t apply the assessment methods you’d apply in the back of a beauty magazine. That’s what’s going on. That’s what offends me — it offends me as a scientist.”

***

It’s hard not to see Blanton’s point. Race is a really, really complicated subject. But there’s a risk that to accept the IAT story line uncritically is to ignore some of this complexity — or, in some cases, to adopt sterilized accounts of how racism works. On the homepage of the implicit-bias training outfit Fair & Impartial Policing, for example, the organization notes that its approach is “based on the science of bias, which tells us that biased policing is not, as some contend, due to widespread racism in policing.” One can quibble with the definition of widespread, but the implications are clear: Since the science of bias tells us that so many Americans harbor implicit bias, and that this form of bias is behaviorally meaningful, implicit bias must be a really important thing for police departments to focus on. Explicit bias, though? Not so much.

But this account leaves out some important details. When the federal government conducts a probe of an American police department scrutinized for brutal or unfair behavior, for example, investigators often uncover significant reserves of explicit bias. Take Ferguson, Missouri: The Justice Department reported that “Ferguson’s harmful court and police practices are due, at least in part, to intentional discrimination, as demonstrated by direct evidence of racial bias and stereotyping about African Americans by certain Ferguson police and municipal court officials.” It might be advantageous to various people to say implicit bias rather than explicit bias is the most important thing to focus on, but that doesn’t make it true — a point driven home, perhaps, by the fact that the United States just elected one of the more explicitly racist presidential candidates in recent history.

The fact that the IAT has come to so thoroughly dominate the social-psychological conversation about race may be tilting the scales in favor of certain explanations at the expense of others, not because they are better or more empirically defensible, but simply because they more neatly fit a hot and frequently hyped paradigm. “Focusing so narrowly on implicit bias risks ignoring the complexity of the problems, like racial disparities, that are argued to be caused by implicit bias,” said Forscher. “Any problem as tenacious and long-standing as racial disparities is unlikely to be caused by any one thing. Surely, then, it is worthwhile for psychologists interested in resolving problems like racial disparities to investigate other possible causes of disparities, such as intentional or structural processes.” It’s not that psychologists are ignoring these other causes, of course — it’s just that the IAT, by dint of its cultural and academic resonance, has generated a strong gravity well that sucks in money and researchers. If you study the IAT, you can easily produce heaps of quantitative data, you can help promote an interesting and provocative story line about race in America, and you can be a part of one of the most popular and widely publicized contemporary approaches to solving serious racial issues.

Forscher has had some firsthand exposure to the potential problems with psychology’s current IAT-sparked fixation on measuring implicit bias at the individual level. He’s the co–lead author, with Calvin Lai, the Harvard postdoc who is the head of research at Project Implicit, of an important work-in-progress meta-analysis (Brian Nosek is a co-author on it, too). The authors were interested in a bunch of studies that have evaluated newfangled interventions aimed at reducing biased behavior by reducing implicit bias. So Forscher and his colleagues examined hundreds of these studies — studies using both the IAT and other, less famous tools to measure implicit bias — for their meta-analysis. They didn’t find much. “Based on the evidence that is currently available,” Forscher said, “I’d say that we cannot claim that implicit bias is a useful target of intervention.” This is a valuable finding, but it isn’t a surprising one given the paucity of evidence correlating IAT scores (and other measures of implicit bias) with behavior in the first place.

The problem is that the hype over IAT research, and the eagerness to apply the test to real-world problems, has so outpaced the evidence that it has launched a lot of studies built on underwhelming foundations. “Implicit bias research has been driven by both the desire to understand truths about the human mind and the desire to solve social problems,” said Forscher. “These goals have not always been in conflict. Unfortunately, one of the ways they have is that the desire to do something, anything, to solve problems related to race has led some people to jump to conclusions about the causal role of implicit bias that they might have been more cautious about had their only focus been on establishing truth.”

To Forscher, implicit bias’s role in propagating racial inequality should be given a “fair trial in the court of scientific evidence,” not simply assumed. But what’s going on now isn’t a fair trial; instead, the overhyping of the IAT stacks the deck so much that sometimes it feels like implicit bias can explain everything. But plenty of researchers think that other factors play a bigger role in determining some of the most important societal outcomes. “I think unconscious racial prejudice is real and consequential,” said Robb Willer, a sociologist at Stanford University, “but my sense is that racial inequality in America is probably driven more by structural factors like concentrated poverty, the racial wealth gap, differential exposure to violence, the availability of early childhood education, and so on. Though it is also worth noting that past and present racial prejudice helped create these structural inequalities.” This is a fairly common sentiment among social scientists who study race and discrimination.

So it’s an open question, at least: The scientific truth is that we don’t know exactly how big a role implicit bias plays in reinforcing the racial hierarchy, relative to countless other factors. We do know that after almost 20 years and millions of dollars’ worth of IAT research, the test has a markedly unimpressive track record relative to the attention and acclaim it has garnered. Leading IAT researchers haven’t produced interventions that can reduce racism or blunt its impact. They haven’t told a clear, credible story of how implicit bias, as measured by the IAT, affects the real world. They have flip-flopped on important, baseline questions about what their test is or isn’t measuring. And because the IAT and the study of implicit bias have become so tightly coupled, the test’s weaknesses have caused collateral damage to public and academic understanding of the broader concept itself. As Mitchell and Tetlock argue in their book chapter, it is “difficult to find a psychological construct that is so popular yet so misunderstood and lacking in theoretical and practical payoff” as implicit bias. They make a strong case that this is in large part due to problems with the IAT.

Unless and until new research is published that can effectively address the countless issues with the implicit association test, it might be time for social psychologists interested in redressing racial inequality to reexamine their decision to devote so much time and energy to this one instrument. In the meantime, the field will continue to be hampered in its ability to provide meaningful answers to basic questions about how implicit bias impacts society, because answering those questions requires accurate tools. So, contra Banaji, scrutinizing the IAT and holding it to the same standards as any other psychological instrument isn’t a sign that someone doesn’t take racism seriously: It’s exactly the opposite.
