Recently, three economists (Oded Netzer and Alain Lemaire, both of Columbia, and Michal Herzenstein of the University of Delaware) looked for ways to predict whether a borrower would pay back a loan. The scholars used data from Prosper, a peer-to-peer lending site. Potential borrowers write a brief description of why they need a loan and why they are likely to make good on it, and potential lenders decide whether to provide them the money. Overall, about 13 percent of borrowers defaulted on their loans.
It turns out the language that potential borrowers use is a strong predictor of their probability of paying back. And it is an important indicator even if you control for other relevant information lenders were able to obtain about those potential borrowers, including credit ratings and income.
Listed below are ten phrases the researchers found that are commonly used when applying for a loan. Five of them positively correlate with paying back the loan. Five of them negatively correlate with paying back the loan. In other words, five tend to be used by people you can trust, five by people you cannot. See if you can guess which are which.
God, promise, debt-free, minimum payment, lower interest rate, will pay, graduate, thank you, after-tax, hospital.
You might think—or at least hope—that a polite, openly religious person who gives his word would be among the most likely to pay back a loan. But in fact this is not the case. This type of person, the data shows, is less likely than average to make good on their debt.
Here are the phrases used in loan applications by people most likely to pay them back: debt-free, lower interest rate, after-tax, minimum payment, graduate.
And here are the phrases used by those least likely to pay back their loans: God, promise, will pay, thank you, hospital.
What should we make of the words in the different categories? First, let’s consider the language that suggests someone is more likely to make their loan payments. Phrases such as “lower interest rate” or “after-tax” indicate a certain level of financial sophistication on the borrower’s part so it’s perhaps not surprising they correlate with someone more likely to pay their loan back. In addition, if he or she talks about positive achievements such as being a college “graduate” and being “debt-free,” he or she is also likely to pay their loans.
Now, let’s consider language that suggests someone is unlikely to pay their loans. Generally, if someone tells you he will pay you back, he will not pay you back. The more assertive the promise, the more likely he will break it. If someone writes “I promise I will pay back, so help me God,” he is among the least likely to pay you back. Appealing to your mercy—explaining that he needs the money because he has a relative in the “hospital”—also means he is unlikely to pay you back. In fact, mentioning any family member—a husband, wife, son, daughter, mother or father—is a sign someone will not be paying back. Another word that indicates default is “explain,” meaning if people are trying to explain why they are going to be able to pay back a loan, they likely won’t.
The authors did not have a theory for why thanking people is evidence of likely default.
In sum, according to these researchers, giving a detailed plan of how he can make his payments and mentioning commitments he has kept in the past are evidence someone will pay back a loan. Making promises and appealing to your mercy are clear signs someone will go into default. Regardless of the reasons, or of what it tells us about human nature that making promises is a sure sign someone will, in actuality, not do something, the scholars found the text was an extremely valuable piece of information in predicting default. Someone who mentioned God was 2.2 times more likely to default. This was among the single highest indicators that someone would not pay back.
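To make the "2.2 times" figure concrete, here is a minimal sketch of how such a ratio could be computed. The counts are invented for illustration and are not the study's data, the function name `relative_risk` is hypothetical, and the sketch assumes the comparison group is borrowers who do not mention God, which the text does not specify.

```python
def relative_risk(defaults_with, total_with, defaults_without, total_without):
    """Ratio of the default rate among borrowers who use a phrase
    to the default rate among borrowers who do not."""
    rate_with = defaults_with / total_with
    rate_without = defaults_without / total_without
    return rate_with / rate_without


# Hypothetical counts, chosen only so the ratio comes out at 2.2.
ratio = relative_risk(defaults_with=132, total_with=500,
                      defaults_without=1200, total_without=10000)
print(round(ratio, 2))  # prints 2.2
```

A ratio like this describes a group average. It says nothing certain about any one borrower who happens to use the phrase.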
But the authors also believe their study raises ethical questions. While this was just an academic study, some companies do report that they use online data in approving loans. Is this acceptable? Do we want to live in a world in which companies use the words we write to predict whether we will pay back a loan? It is, at a minimum, creepy—and, quite possibly, scary.
A consumer looking for a loan in the near future might not merely have to worry about her financial history but also her online activity. And she may be judged on factors that seem absurd—whether she uses the phrase “Thank you” or invokes “God,” for example. Further, what about a woman who legitimately needs to help her sister in a hospital and will most certainly pay back her loan afterwards? It seems awful to punish her because, on average, people claiming to need help for medical bills have often been proven to be lying. A world functioning this way starts to look awfully dystopian.
Big Data is exploding. It has helped us find the websites we want to see, the people we want to talk to, the jobs we want to apply for.
But the power of Big Data raises a host of ethical questions. In particular: Do corporations have the right to judge our fitness for their services based on abstract but statistically predictive criteria not directly related to those services?
One place where our digital data is already increasingly used to make decisions is in hiring practices. Start-ups such as TalentBin help companies make sense of social media when considering job candidates. That may not raise ethical questions if they’re looking for evidence of bad-mouthing previous employers or revealing previous employers’ secrets. But what if they find a seemingly harmless indicator that correlates with something they care about?
Researchers at Cambridge and Microsoft gave 58,000 U.S. Facebook users a variety of tests about their personality and intelligence. They found that Facebook likes are frequently correlated with IQ, extraversion, and conscientiousness. For example, people who like Mozart, thunderstorms, and curly fries on Facebook tend to have higher IQs. People who like Harley Davidson motorcycles, the country music group Lady Antebellum, or the page “I Love Being a Mom” tend to have lower IQs. Some of these correlations may be due to the curse of dimensionality. If you test enough things, some will randomly correlate. But some interests may legitimately correlate with IQ.
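The point that testing enough things will turn up some random correlations (often called the multiple comparisons problem) can be seen in a small simulation. This is a sketch on made-up data, not the researchers' method: the user count, page count, like probability, and cutoff are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a page everyone (or no one) likes carries no signal
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_users, n_pages = 200, 1000

# IQ scores and page likes are generated independently: no real link exists.
iq = [rng.gauss(100, 15) for _ in range(n_users)]
likes = [[rng.random() < 0.3 for _ in range(n_users)] for _ in range(n_pages)]

# |r| > 0.14 is roughly the 5 percent significance cutoff for 200 users.
threshold = 0.14
spurious = sum(1 for page in likes if abs(pearson(page, iq)) > threshold)
print(f"{spurious} of {n_pages} random pages look 'correlated' with IQ")
```

Even though every "like" here is a coin flip, a few dozen pages typically clear the cutoff, which is why a single surprising correlation deserves some skepticism.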
Nonetheless, it would seem unfair if a smart person who happens to like Harley Davidsons couldn’t get a job commensurate with his skills because he was, without realizing it, signaling low intelligence.
In fairness, this is not an entirely new problem. People have long been judged by factors not directly related to job performance—the firmness of their handshakes, the neatness of their dress. But a danger of the data revolution is that, as more of our life is quantified, these proxy judgments can get more esoteric yet more intrusive. Better prediction can lead to subtler and more nefarious discrimination.
Better data can also lead to another form of discrimination, what economists call price discrimination. Businesses are often trying to figure out what price they should charge for goods or services. Ideally they want to charge customers the maximum they are willing to pay.
Most businesses end up picking one price that everyone pays. But sometimes they are aware that the members of a certain group will, on average, pay more. This is why movie theaters charge more to middle-aged customers, who are at the height of their earning power, than to students or senior citizens, and why airlines often charge more to last-minute purchasers. They price discriminate.
Big Data may allow businesses to get substantially better at learning what customers are willing to pay, and thus at gouging certain groups of people. Optimal Decisions Group was a pioneer in using data science to predict how much consumers are willing to pay for insurance. How did they do it? They ran what’s called a doppelganger search: they found prior customers most similar to those currently looking to buy insurance and saw how high a premium they were willing to take on. A doppelganger search is great if it helps us cure someone’s disease by finding a small group of patients most similar to him. But if a doppelganger search helps a corporation extract every last penny from you? That’s not so cool.
Big casinos are using something like a doppelganger search to better understand their consumers and make sure more of your money goes into their coffers. Here’s how it works. Every gambler, casinos believe, has a “pain point.” This is the amount of losses that will sufficiently frighten her so that she leaves your casino for an extended period of time. Suppose, for example, that Helen’s “pain point” is $3,000. This means if she loses $3,000, you’ve lost a customer, perhaps for weeks or months. If Helen loses $2,999, she won’t be happy. Who, after all, likes to lose money? But she won’t be so demoralized that she won’t come back tomorrow night.
Imagine for a moment that you are managing a casino. And imagine that Helen has shown up to play the slot machines. What is the optimal outcome? Clearly, you want Helen to get as close as possible to her “pain point” without crossing it. You want her to lose $2,999, enough that you make big profits but not so much that she won’t come back to play again soon.
How can you do this? Well, there are ways to get Helen to stop playing once she has lost a certain amount. You can offer her free meals, for example. Make the offer enticing enough, and she will leave the slots for the food.
But there’s one big challenge with this approach. How do you know Helen’s “pain point”? The problem is, people have different “pain points.” For Helen, it’s $3,000. For John, it might be $2,000. For Ben, it might be $26,000. If you convince Helen to stop gambling when she has lost $2,000, you have left profits on the table. If you wait too long, until after she has lost $3,000, you have lost her for a while. Further, Helen might not want to tell you her pain point. She may not even know what it is herself.
The answer is to use data science. You learn everything you can about a number of your customers: their age, gender, zip code, and gambling behavior. And, from that gambling behavior (their winnings, losings, comings, and goings), you estimate their “pain point.” You gather all the information you know about Helen and find gamblers who are similar to her, her doppelgangers, more or less. However much pain they can withstand, Helen can probably withstand about the same. Indeed, this is what the casino Harrah’s does, utilizing a Big Data warehouse firm, Teradata, to assist them.
Scott Gnau, general manager of Teradata, explains, in the excellent book Super Crunchers, what casino managers do when they see a regular customer nearing their pain point: “They come out and say, ‘I see you’re having a rough day. I know you like our steakhouse. Here, I’d like you to take your wife to dinner on us right now.’” In other words, management is using sophisticated data analysis to try to extract as much money from customers, over the long term, as it can.
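The doppelganger idea described above can be sketched as a nearest-neighbor search. This is an illustration under stated assumptions, not Harrah's actual system, which the text does not describe in detail: the features (age, visits per month, average bet), the gamblers, their pain points, and the function names are all hypothetical.

```python
def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 so that
    no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def estimate_pain_point(new_gambler, past_gamblers, past_pain_points, k=3):
    """Average the observed pain points of the k most similar past gamblers."""
    scaled, means, stds = standardize(past_gamblers)
    target = [(v - m) / s for v, m, s in zip(new_gambler, means, stds)]
    distances = [
        (sum((a - b) ** 2 for a, b in zip(row, target)) ** 0.5, pain)
        for row, pain in zip(scaled, past_pain_points)
    ]
    nearest = sorted(distances)[:k]  # smallest distances first
    return sum(pain for _, pain in nearest) / k


# Hypothetical records: (age, visits per month, average bet in dollars).
past_gamblers = [
    (62, 8, 25), (58, 10, 30), (65, 7, 20),
    (29, 2, 5), (34, 3, 10), (45, 20, 200),
]
past_pain_points = [3000, 3200, 2800, 1900, 2100, 26000]

helen = (60, 9, 25)
estimate = estimate_pain_point(helen, past_gamblers, past_pain_points)
print(estimate)  # prints 3000.0
```

With these made-up records, Helen's three closest matches are the first three gamblers, so her estimated pain point is their average. The same search could just as easily serve the patient in the medical example. What changes is who benefits from the answer.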
We have a right to fear that better and better use of online data will give casinos, insurance companies, lenders, and other corporate entities too much power over us. On the other hand, Big Data has also been enabling consumers to score some blows against businesses that overcharge them or deliver shoddy products.
One important weapon is sites such as Yelp that publish reviews of restaurants and other services. A recent study by economist Michael Luca, of Harvard, has shown the extent to which businesses are at the mercy of Yelp reviews. Comparing those reviews to sales data in the state of Washington, he found that one fewer star on Yelp will make a restaurant’s revenues drop five to nine percent. Consumers are also aided in their struggles with business by comparison shopping sites like Kayak and Booking.com.
Data on the internet, in other words, can tell businesses which customers to avoid and which they can exploit. It can also tell customers the businesses they should avoid and who is trying to exploit them. Big Data to date has helped both sides in the struggle between consumers and corporations. We have to make sure it remains a fair fight.
From the book EVERYBODY LIES: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz. Copyright © 2017 by Seth Stephens-Davidowitz. Reprinted by permission of Dey Street Books, an imprint of HarperCollins Publishers.