Statistical Jargon to Use and to Avoid
Yay for Type I and Type II errors and for understanding inverse probability; boo to statistical significance
If you take a first-year statistics course, you will pick up some jargon. I love some of it, and I hate some of it.
Trade-offs expressed in terms of Type I and Type II errors
Economist Thomas Sowell is known for saying “There are no solutions, only trade-offs.” That should be known as Sowell’s Law.
When we face a set of binary decisions of a given sort, Sowell’s Law shows up as a trade-off between making two types of mistakes. In classical statistics, a Type I error means claiming that the evidence for a hypothesis is strong when it isn’t. And a Type II error means failing to recognize that the evidence for a hypothesis actually is strong.
But Type I and Type II errors can be used to describe many more situations. For instance, in a court case, the jury must find the defendant guilty or not guilty. One mistake would be to convict an innocent defendant. Call that a Type I error. The opposite mistake would be to fail to convict a guilty defendant. That is a Type II error. By setting a standard of “innocent unless proven guilty beyond a reasonable doubt,” our legal tradition is saying that we should try to minimize Type I errors, at the risk of committing Type II errors.
You can apply this jargon to the decision to get married. Maybe young people today want to be really sure before they get married. They want to avoid the Type I error of getting married and regretting it. But they will make more Type II errors, in which they postpone marriage and regret that.
In mortgage lending, a Type I error is approving a loan that subsequently defaults. A Type II error is failing to approve a loan that would have worked out.
In the early 2000s, politicians accused lenders of making too many Type II errors. They put pressure on lenders to approve more borrowers, including borrowers with poor credit histories. There was a bipartisan consensus, reinforced by industry lobbyists, that there was an “underserved” market for buying homes on credit. Under intense pressure, lenders loosened their standards.
Then the housing market tanked, defaults skyrocketed, and the politicians turned around and blamed the lenders for making Type I errors. Accusations of “predatory lending” flew. So after the crisis, pressured by the political scapegoating and the Dodd-Frank legislation that passed in response to the 2008 crisis, lenders adopted very strict standards.
In a situation with Type I and Type II errors, if you want to make fewer errors of one type, your decision criteria will cause you to make more errors of the other type. The closer you drive one type of error toward zero, the more errors of the other type you will suffer.
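Here is a minimal sketch of that trade-off, with made-up numbers. Think of the scores as credit scores and the threshold as the lending standard: raising the bar cuts Type I errors (approving loans that go bad) while inflating Type II errors (rejecting loans that would have worked out).

```python
# A sketch with invented numbers (not data from the essay): one decision
# threshold applied to overlapping scores, showing the Type I / Type II trade-off.
import random

random.seed(0)

# Hypothetical scores. "Bad" cases should be rejected, "good" cases accepted,
# but the two groups overlap, so no threshold separates them perfectly.
bad = [random.gauss(0.0, 1.0) for _ in range(10_000)]
good = [random.gauss(1.5, 1.0) for _ in range(10_000)]

for threshold in [-1.0, 0.0, 1.0, 2.0, 3.0]:
    type1 = sum(score >= threshold for score in bad)   # accepted a bad case
    type2 = sum(score < threshold for score in good)   # rejected a good case
    print(f"threshold={threshold:+.1f}  Type I={type1:5d}  Type II={type2:5d}")
```

As the threshold rises, the Type I count falls toward zero while the Type II count climbs; there is no setting that drives both to zero at once.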
If you want perfectly safe streets, then police will probably have to stop and frisk some innocent people. If you want the police to never harass an innocent person, then street crime will become more prevalent.
If you understand the trade-off between Type I and Type II errors, then you have a pretty decent grasp of Sowell’s Law.
Inverse Probability
Consider two statements:
(a) Men with serious kidney ailments often have microscopic levels of blood in their urine.
(b) Men with microscopic levels of blood in their urine often have serious kidney ailments.
The second statement is known in statistical jargon as an inverse probability statement. Someone who is well grounded in probability and statistics is very careful not to confuse (a) with (b), and especially not to presume that because (a) is true (b) must be true.
Unfortunately, a reputable doctor from a leading medical school once recommended that I undergo painful and expensive tests because he confused the two statements. In fact, microscopic blood in urine is a common symptom, and unless there are other indications, I do not believe that further testing is warranted.
Indeed, doctors are notorious for getting inverse probability wrong. In his book Super Crunchers, Ian Ayres observed this. So did the authors of this article.
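For readers who want the arithmetic, here is a sketch with invented numbers (not real medical data) showing, via Bayes’ rule, how (a) can be true while (b) is false. When the ailment is rare and the symptom is common, the probability of the symptom given the ailment is high while the probability of the ailment given the symptom is low.

```python
# Hypothetical probabilities, chosen only for illustration.
p_ailment = 0.001            # assume 0.1% of men have a serious kidney ailment
p_blood_given_ailment = 0.9  # statement (a): most of them show blood in urine
p_blood_given_healthy = 0.1  # the symptom is also common among healthy men

# Bayes' rule: P(ailment | blood) = P(blood | ailment) * P(ailment) / P(blood)
p_blood = (p_blood_given_ailment * p_ailment
           + p_blood_given_healthy * (1 - p_ailment))
p_ailment_given_blood = p_blood_given_ailment * p_ailment / p_blood

print(f"P(blood | ailment) = {p_blood_given_ailment:.3f}")   # statement (a): high
print(f"P(ailment | blood) = {p_ailment_given_blood:.3f}")   # statement (b): low
```

With these made-up numbers, nine in ten ailing men show the symptom, yet fewer than one in a hundred men with the symptom have the ailment.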
As another example, consider two statements about young men jumping over turnstiles at the subway station closest to me.
(a) The turnstile jumpers are almost all black.
(b) Almost all black young men jump over the turnstile.
My observation is that (a) is true and (b) is false. But you need to be able to grasp the notion of inverse probability in order to avoid confusing the two statements.
If you tell your audience that (a) is true, people will hear (b) and call you a racist. Don’t say I didn’t warn you. If medical school graduates cannot sort out inverse probability, then your audience cannot, either.
Statistical Significance
If you read about research in economics, medicine, or sociology, you are bound to encounter the phrase “statistical significance.” I hate that phrase, because it never means what you want it to mean.
When someone says that a cancer drug produced a statistically significant result, what you want it to mean is that it had a big effect in shrinking a tumor and that the method for testing and measuring that effect was reliable. But “statistically significant” does not mean that.
With a large enough sample, minuscule effects can satisfy the formula for statistical significance. A study showing that a cancer drug causes a tumor to shrink by less than 1 percent for the average patient can nonetheless be statistically significant. A result that is statistically significant is not necessarily medically meaningful, or economically meaningful, or whatever the case may be. The material significance of a result is best judged by using common sense, not by applying formulas.
Technically, statistical significance means that the sample size is sufficiently large that you can believe the result, as long as everything else about how the study was done is kosher. Spoiler alert: the study is never kosher. You should never trust a result just because it met the mathematical formula for statistical significance.
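To see why the formula alone proves so little, here is a sketch with simulated data: a hypothetical trial so large that a clinically trivial difference sails past the conventional p < 0.05 bar.

```python
# Simulated data (no real trial): a huge sample makes a tiny effect
# "statistically significant" by the standard two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000  # an enormous hypothetical trial

control = rng.normal(loc=100.0, scale=20.0, size=n)  # tumor size, untreated
treated = rng.normal(loc=99.5, scale=20.0, size=n)   # about 0.5% smaller on average

t_stat, p_value = stats.ttest_ind(treated, control)
shrinkage = (control.mean() - treated.mean()) / control.mean()
print(f"average shrinkage: {shrinkage:.3%}")
print(f"p-value: {p_value:.2e}")  # far below 0.05 despite a trivial effect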
You can trust a result if it is found by different investigators and arrived at in a variety of ways. It is popular nowadays to do a “meta-analysis,” which summarizes the results of many studies. The concept makes sense, but I think that the process for doing a meta-analysis should rely less on formulas and more on judgment. When Scott Alexander or Emily Oster looks into a question, they look at all of the relevant studies they can find. They decide which studies employed the most credible methods. They look for the consistency, or lack thereof, in the results of different studies. Then they give you their judgment. I highly respect that sort of approach.
Beware of anyone who just rings the bell of statistical significance and says “Come look at this!” Instead, pay attention to everything about the way that a study was conducted. And be sure to compare it to other studies on the same topic.