Quibble: why say Type I and Type II errors instead of false positives and false negatives? With the former I always forget which is which; the latter is intuitively descriptive.
The type nomenclature goes back nearly a century, to Egon Pearson. The language problem is that you want to describe two different kinds or sources of some noun meaning error; "falsity" is almost never used, and "mistake" is not a true synonym. You want them named 'errors' so you can use the more abstract noun in formulating and discussing concepts like "crossover error rate" or "propagation of error" or "margin of error".
If one said "positive error" and "negative error", that would be potentially confusing as to the sign of the number involved. One could say "false positive error" and "false negative error", which I think is superior to, and no more redundant than, "type I error" and "type II error".
"False positive" and "False negative" — with or without the word "error" — also facilitate discussing "true positive" and "true negative." Even though you can easily compute the "true..." rate give the "false..." rate (and vice versa), it's often context-specific which of these four terms is most useful to discuss.
Interesting, and sounds right, but then "false positive" is simply short for "false positive error".
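To make the earlier point about computing one rate from the other concrete, here is a minimal sketch with made-up counts; the variable names and numbers are illustrative only.

```python
# Minimal sketch of the rate bookkeeping (made-up counts): once the actual
# condition is fixed, TPR + FNR = 1 and FPR + TNR = 1, so each "true..." rate
# is recoverable from the matching "false..." rate.

tp, fn = 80, 20   # actual positives: caught vs. missed (misses are Type II errors)
tn, fp = 90, 10   # actual negatives: cleared vs. falsely flagged (Type I errors)

tpr = tp / (tp + fn)   # true positive rate (sensitivity)
fnr = fn / (tp + fn)   # false negative rate
tnr = tn / (tn + fp)   # true negative rate (specificity)
fpr = fp / (tn + fp)   # false positive rate

assert abs(tpr + fnr - 1.0) < 1e-12 and abs(fpr + tnr - 1.0) < 1e-12
print(f"TPR={tpr:.2f} FNR={fnr:.2f} TNR={tnr:.2f} FPR={fpr:.2f}")
```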
Mathematicians often introduce bland and unintuitive terminology like this. Corresponding jargon from areas where the concepts denoted by the terms find practical use tends to be better. In this case, the Soviet military jargon for a Type I error is _false alarm_, and for a Type II error, _miss_ (or _missed alarm_).
Arnold, you deserve some kind of award for writing this essay. Great stuff!
Software debugging has 3 error types. Type I (Aha!): you discover someone else's bug. Type II (Oops!): you discover your own bug. Type III (Ah shit!): someone else discovers your bug.
"In classical statistics, a Type I error means claiming that the evidence for a hypothesis is strong when it isn’t. And a Type II error means failing to recognize that the evidence for a hypothesis actually is strong." I like the post / stack / essay in general, but the bit quoted here is incorrect as stated. Consider the case where some parameter equals zero in the population, and the null hypothesis is that it equals zero. But, as happens five percent of random samples, our estimate is statistically different from zero. In that case, the null is true, we reject the null and thereby make a Type I error, but the evidence against the null is nonetheless strong. That's why we make the mistake!
Here’s an implementation of the trade-off between Type I and Type II errors for an application involving vehicle collisions.
https://www.cs.cmu.edu/~astein/pub/TRR-K01.pdf
Sowell's Law only holds when the true 'vector' of the decision matrix is known, so that you are simply shifting the risk between the mistakes. In any system where that vector isn't known, there are decisions to be made, potential 'solutions' or 'categorical errors', which can reduce both kinds of error, increase both, or increase one without reducing the other. Removing information content from the assessment of mortgage risk, or including useless information, can worsen all types of error, for example, without reducing any.
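A rough illustration of the distinction, using an entirely invented scoring setup rather than real mortgage data: shifting the decision threshold trades one error for the other, while degrading the information in the score tends to worsen the error rates across thresholds.

```python
# Invented example: a binary "bad outcome" and a risk score that carries real
# information about it. Shifting the flagging threshold trades false positives
# for false negatives; adding noise to the score (discarding information)
# leaves you worse off on one or both errors at every threshold.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
is_bad = rng.random(n) < 0.2                     # 20% of cases are "bad"
score = is_bad + rng.normal(0.0, 1.0, n)         # informative score
noisy_score = score + rng.normal(0.0, 2.0, n)    # same score with information destroyed

def rates(s, threshold):
    flagged = s > threshold
    false_pos = np.mean(flagged & ~is_bad)       # Type I: good case flagged
    false_neg = np.mean(~flagged & is_bad)       # Type II: bad case missed
    return round(float(false_pos), 3), round(float(false_neg), 3)

for thr in (0.0, 0.5, 1.0):
    print(f"thr={thr}: clean {rates(score, thr)}  noisy {rates(noisy_score, thr)}")
```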
At first glance, the way we trade off Type I and Type II errors seems to be somewhat related to our moral intuitions around commission vs omission, loss aversion and, most importantly, how saliently and clearly the failings can be attributed to our deliberate actions. We don't seem to care about what's most relevant - our expected impact on the world - but we care a whole lot about what people can legibly hold us morally culpable for. For example:
- As a society, we'd much rather avoid committing the crime of prosecuting an innocent person than fail to keep innocent people safe from criminals, because then, it's on the criminals.
- We'd much rather have too much regulation that avoids accidents which could be blamed on the regulator, but we're sanguine about the opportunity cost of those regulations and the countless lives that could otherwise be improved on the margin.
- I'd rather avoid marrying the wrong person and having to accept that I chose wrong. But it's much less legible whether I did everything I could to find a partner.
"When you tell your audience (a) is realistic, but people hear (b) and call you a racist, don’t say I didn’t warn you."
I don't think it matters whether they understand inverse probability. Even if they did I think they'd still say the comment was racist. You can't say anything negative about the oppressed no matter what.
In another way, the statement probably is a bit racist. Just like saying women earn 70% of what men do, the statement about turnstile jumpers doesn't account for known differences in income, education, etc. I guess one could also see a bit of irony in the assumption that citing turnstile jumping accuses the jumper while citing lower earnings doesn't accuse the earner. What type of error is that?
I don't understand why (b) is clearly the statement with "inverse probability".
Replacing "serious kidney ailments" with A, and
replacing "microscopic levels of blood in their urine" with B, we have these two statements:
(c) Men with A often have B.
(d) Men with B often have A.
Presented with (c) and (d), how does one know which statement is the "inverse" correlation?
I think you are using some medical knowledge to make the distinction.
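One way to see why (c) and (d) make different claims is to plug in base rates and apply Bayes' rule. The numbers below are invented purely for illustration; nothing here is real medical data.

```python
# Invented numbers, only to show that P(B | A) and P(A | B) can be far apart.
p_A = 0.01             # assumed prevalence of the serious ailment A
p_B_given_A = 0.90     # assumed: men with A usually show sign B
p_B_given_not_A = 0.10 # assumed: B also shows up in plenty of men without A

# Total probability of B, then Bayes' rule for the inverse direction.
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B

print(f"P(B|A) = {p_B_given_A:.2f}   # statement (c): men with A often have B")
print(f"P(A|B) = {p_A_given_B:.3f}  # statement (d): men with B rarely have A here")
```

With these made-up base rates, (c) holds while (d) does not: most men with B do not have A. That asymmetry is what makes hearing one statement when the speaker asserted the other a mistake of inverse probability.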
Type I and Type II Errors of the First Amendment
This post was so good that I decided to write a brief summary of it - painted somewhat heavily with my own signature.
https://open.substack.com/pub/scottgibb/p/type-i-and-type-ii-errors-of-the?r=nb3bl&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Some of these errors can be avoided (the ones that do not require yes/no answers) by feeding the distribution that the "significance test" is applied to directly into a cost-benefit analysis. A "statistically insignificant" result for a large effect might suggest the use of a new drug where a statistically significant estimate of a small effect would not.
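A sketch of that decision logic under hypothetical numbers: compare a bare p < .05 rule with a rule based on expected net benefit computed from the same estimate and standard error. The function, parameters, and values here are made up for illustration, and the expected benefit uses just the point estimate for simplicity.

```python
# Hypothetical numbers only: compare a p < .05 rule with a simple
# expected-net-benefit rule based on the same estimate and standard error.
from scipy import stats

def evaluate(label, effect_est, std_err, benefit_per_unit_effect, cost):
    z = effect_est / std_err
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))      # two-sided p-value
    significant = p_value < 0.05
    expected_net_benefit = benefit_per_unit_effect * effect_est - cost
    print(f"{label}: p={p_value:.3f}, significant={significant}, "
          f"expected net benefit={expected_net_benefit:+.2f}")

# Large estimated effect from a small, noisy trial: "insignificant" yet worth using.
evaluate("big effect, noisy trial", effect_est=5.0, std_err=3.0,
         benefit_per_unit_effect=1.0, cost=1.0)
# Tiny effect from a huge trial: highly "significant" yet not worth the cost.
evaluate("tiny effect, huge trial", effect_est=0.2, std_err=0.05,
         benefit_per_unit_effect=1.0, cost=1.0)
```

With these inputs the first (insignificant) case is worth adopting on expected value, while the second (highly significant) case is not.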
"Technically, statistical significance means that the sample size is sufficiently large that you can believe the result, as long as everything else about the how the study was done is kosher."
It also depends on the plausibility of the premise. And the reasonableness of reducing the premise and the data to numbers amenable to statistical significance.
The best explanation I've heard was in my econometrics class: "A statistical significance of p<.01 means that, if nothing is really going on, I'd see results like this only 1% of the time, and I don't think I'm living in a 1% world, so I accept the proposition."
Most social science experiments take p<.05 as the required level of significance. This means that, if the hypothesis is wrong, 5% of studies will nevertheless conclude that the hypothesis is correct. Since "cannot reject the null hypothesis" results are rarely interesting enough to be published, lots of studies will be published, even with findings that are, in a "real" sense, wrong. Which is probably why most studies won't replicate.
Now, consider some alternative scenarios: Someone believes that ghosts are real and can speak through electronic boxes. He sets up an experiment to test this hypothesis, and meticulously records results. He records sounds in a place believed to be haunted, and records sounds in a place believed not to be haunted as a control. He finds word-like sounds in the former, with p<.05. Does this credibly demonstrate that the haunted location has ghosts, who are communicating through the electronic box? Only if you consider, a priori, that it is plausible that ghosts can communicate through an electronic box. If you don't think that's plausible, you'd need a much lower p value to even think about reconsidering - say p<.000000001.
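A back-of-the-envelope sketch of both points, with every input an assumption rather than a measured value: if only "significant" results get published, the share of published findings that are false depends heavily on how plausible the tested hypotheses were to begin with.

```python
# Back-of-the-envelope sketch; every input below is an assumption.
alpha = 0.05   # significance threshold (Type I error rate)
power = 0.50   # assumed chance of detecting a real effect when there is one

def share_of_significant_findings_that_are_false(prior_true):
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return false_pos / (true_pos + false_pos)

# Social science: suppose 10% of tested hypotheses are actually true.
print(f"{share_of_significant_findings_that_are_false(0.10):.0%}")   # ~47% false
# Ghost boxes: suppose the prior is one in a million.
print(f"{share_of_significant_findings_that_are_false(1e-6):.4%}")   # ~99.999% false
```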
It is also important which way the null hypothesis is framed. It might be difficult to reject the hypothesis "is haunted."
Dear Mr. Kling,
Thank you for an excellent discussion.
Misunderstanding inverse probability in the way you explain is known in logic as “the fallacy of affirming the consequent”.
“miniscule”: minuscule (educated spelling).