Sunday, March 24, 2019

Legal Implications of the Statisticians' Manifestos on Statistical Significance

Manifestos in The American Statistician 1/ and Nature 2/ last week urge an end to "statistical significance." As these communications make clear, statisticians have long denounced presenting results that are barely significant (for example, p = 0.048) as radically different from ones that are almost significant (for example, p = 0.052). They have cautioned that a finding of "no significant difference" in a study that lacks power is not proof of no real difference; that a "statistically significant" difference can, for all practical purposes, be of no importance; that a p-value is not the probability that the null hypothesis is true; and so on (and on).
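
To make the first point concrete, here is a toy calculation of my own (the z-statistics of 1.98 and 1.94 are hypothetical, and the code is only a sketch, not anything drawn from the manifestos). Two nearly indistinguishable test statistics fall on opposite sides of the conventional line:

  from scipy.stats import norm        # SciPy's standard normal distribution

  for z in (1.98, 1.94):              # two hypothetical, nearly identical test statistics
      p = 2 * norm.sf(z)              # two-sided p-value
      label = "significant" if p < 0.05 else "not significant"
      print(f"z = {z:.2f}  ->  p = {p:.3f}  ({label})")

  # prints:  z = 1.98  ->  p = 0.048  (significant)
  #          z = 1.94  ->  p = 0.052  (not significant)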

Indeed, after reading such books as Statistics: Concepts and Controversies (1979) by David S. Moore, and The Significance Test Controversy: A Reader, edited by Denton Morrison and Ramon Henkel (1970), I once proposed that courts exclude testimony using phrases like “significant difference” and “95% confidence interval” as unduly confusing and potentially misleading under Federal Rule of Evidence 403 and the common law from which the rule is derived. 3/ That idea fell on barren soil.

Now a similar rule has been proposed for scientific discourse generally. The American Statistician’s editorial advises all scientists as follows: “‘statistically significant’—don’t say it and don’t use it.” It observes that
The ASA Statement on P-Values and Statistical Significance [Feb. 5, 2016] stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.

Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. Made broadly known by Fisher’s use of the phrase (1925), Edgeworth’s (1885) original intention for statistical significance was simply as a tool to indicate when a result warrants further scrutiny. But that idea has been irretrievably lost. Statistical significance was never meant to imply scientific importance, and the confusion of the two was decried soon after its widespread use (Boring 1919). Yet a full century later the confusion persists.

And so the tool has become the tyrant. The problem is not simply use of the word “significant,” although the statistical and ordinary language meanings of the word are indeed now hopelessly confused (Ghose 2013); the term should be avoided for that reason alone. The problem is a larger one, however: using bright-line rules for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making (ASA statement, Principle 3). A label of statistical significance adds nothing to what is already conveyed by the value of p; in fact, this dichotomization of p-values makes matters worse.
Also eschewing what it calls "dichotomania," the comment in Nature, which garnered "more than 800 signatories" when a pre-publication draft was circulated, reads
[I]n line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis. ... The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.
This is powerful stuff, but I do not expect testifying experts to change their ways or courts to stop prizing declarations of “statistically significant” findings. Neither do I expect the scientific establishment to change overnight. I doubt that phrases like "trending toward significance" or "highly significant" will disappear (or, more radically, that p-values will be abandoned). There is much to consider. Do we need words of some kind to demarcate studies whose estimates reasonably can be regarded as close to a true value from those whose results might well not be replicated because of randomness in the data? In what contexts should an adjective be used for this purpose? It is one thing to say that journal editors should not use an arbitrary line to reject articles under all circumstances, or even that scientists should simply state a number rather than use a less precise adjective, but what about more popular writing? If a term like "significant" is used for demarcation in any context, what is the tipping point? More fundamentally, why use p-values to grade the strength of evidence? Should something else play this role? The 43 articles in The American Statistician and the correspondence in Nature on the Amrhein et al. article indicate that no simple fix is imminent in the scientific community. 4/

Nonetheless, the recent manifestos should have some influence in the legal world. They should make lawyers and testifying experts hesitate to puff up evidence with modest p-values (in the vicinity of 0.05, for example) as "significant!", if only because The American Statistician's editorial, the earlier ASA Statement, and the Nature comment (along with the large number of statisticians who publicly endorsed it) all supply fodder for cross-examination (depending on the jurisdiction's hearsay rule for "learned treatises"). Relatedly, they could support motions to limit testimony containing such verbiage as misleading or unfairly prejudicial, as I once suggested.

In addition, they should reinforce the tendency of most courts to reject a mechanical rule for admissibility based on a false alarm probability of 0.05. After seeing the manifestos, a law school colleague wrote me that he had "never understood why an empirical study that was 1% less significant than what a social science journal would accept is therefore ‘junk’ that should not even be considered by a jury." However, the argument for a categorical rule is more subtle. The relevant standard for admissibility is not, "Is this study junk science?" Under Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), and Federal Rule of Evidence 702, it is whether a finding presented as scientific qualifies as "scientific knowledge" and, as such, possesses "evidentiary reliability." The argument that all statistical findings that fail the p < 0.05 test are not reliable enough to consider (along with other evidence) has always been dubious. At most, the mechanical rule would be that p ≥ 0.05 shows that the finding is insufficient for scientists -- and hence legal factfinders -- to conclude that the alternative hypothesis is true.

But even though Rule 702 does not lead to the categorical exclusion of findings for which p ≥ 0.05, such evidence is subject to the balancing test of Rule 403. One could argue for a bright-line rule to simplify that inquiry. Juries and judges often do not understand what a p-value is and might misconstrue a study with p exceeding 0.05 as good proof of a real association or effect (especially if the result is not characterized as "not significant"). For example, jurors might think that if p is "only" 0.052, then they should be almost 95% "confident" that there is a real association or effect. Such naive transposition is often incorrect. Of course, 0.05 (or any other number) is somewhat arbitrary, but it may not be a ridiculous dividing point if the benefits of a simple rule exceed the costs of the misclassifications it produces relative to purely ad hoc case-by-case balancing. 5/
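
A rough calculation of my own shows how far off that reading can be (the prior and power figures below are assumptions chosen for illustration, not numbers taken from any of the sources discussed here). If, say, one in ten hypotheses of the kind being tested reflects a real effect, and a study of typical size detects such an effect half the time, then even among results that just clear the 0.05 line, the probability of a real effect is only about one half:

  prior = 0.10   # assumed share of tested hypotheses that reflect real effects
  power = 0.50   # assumed chance that a real effect yields p < 0.05
  alpha = 0.05   # chance that a true null (no effect) yields p < 0.05

  # Bayes' rule: P(real effect | p < 0.05)
  ppv = (power * prior) / (power * prior + alpha * (1 - prior))
  print(f"P(real effect | p < 0.05) = {ppv:.2f}")   # about 0.53, not 0.95

And because a result with p = 0.052 carries nearly the same evidential weight as one with p = 0.048, a juror who reads the latter as near-certainty and the former as nothing at all errs twice over.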

My own preference is for a more flexible approach -- with judicial recognition that the probative value of studies for which p is near (and, a fortiori, much larger than) 0.05 is quite limited. The findings on either side of this dividing line are not necessarily "junk science," but in the vicinity of p = 0.05, they are not all that surprising even if there is no real association or effect. As such, it may not be worth the time it takes to educate the jury about the weak evidence. In ruling on motions to exclude on this Rule 403 basis, courts should consider not only the issue of randomness in the data (which is all the p-value addresses), but also the design and quality of the statistical study as well as the availability (or not) of other, more probative evidence on the same point.

NOTES
  1. Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar, Moving to a World Beyond “p < 0.05”, 73:sup1 Am. Statistician 1–19 (2019), DOI: 10.1080/00031305.2019.1583913.
  2. Valentin Amrhein, Sander Greenland, Blake McShane et al., Comment, Retire Statistical Significance, 567 Nature 305–307 (2019). This commentary is discussed further in Can P-values or Confidence Intervals Prove Non-association?, Forensic Sci., Stat. & L., Mar. 30, 2019.
  3. D.H. Kaye, Is Proof of Statistical Significance Relevant?, 61 Wash. L. Rev. 1333 (1986).
  4. John P. A. Ioannidis, Letter, Retiring Statistical Significance Would Give Bias a Free Pass, 567 Nature 461 (2019); Valen E. Johnson, Letter, Raise the Bar Rather than Retire Significance, 567 Nature 461 (2019); Julia M. Haaf, Alexander Ly & Eric-Jan Wagenmakers, Letter, Retire Significance, But Still Test Hypotheses, 567 Nature 461 (2019).
  5. Even with a categorical rule, courts would have discretion to exclude those studies with p-values that are small enough to pass through the filter. Consider a study that produces a p-value barely below 0.05. As with p = 0.052, the number may be transposed into a false sense of confidence (especially if it is praised as "statistically significant").
Last updated: Mar. 30, 2019, 8:55 AM
