Wednesday, July 5, 2017

Multiple Hypothesis Testing in Karlo v. Pittsburgh Glass Works

The following posting is adapted from a draft of an annual update to the legal treatise The New Wigmore on Evidence: Expert Evidence. I am not sure of the implications of the calculations in note 23 and the fact that the age-based groups are overlapping. Advice is welcome.

The Age Discrimination in Employment Act of 1967 (ADEA) 1/ covers individuals who are at least forty years old. The federal circuit courts are split as to whether a disparate-impact claim is viable when it is limited to a subgroup of employees such as those aged fifty and older. In Karlo v. Pittsburgh Glass Works, 2/ the Third Circuit held that statistical proof of disparate impact on such a subgroup can support a claim for recovery. The court countered the employer’s argument that “plaintiffs will be able to ‘gerrymander’ arbitrary age groups in order to manufacture a statistically significant effect” 3/ by promising that “the Federal Rules of Evidence and Daubert jurisprudence [are] a sufficient safeguard against the menace of unscientific methods and manipulative statistics.” 4/ In Daubert v. Merrell Dow Pharmaceuticals, the Supreme Court famously reminded trial judges applying the Federal Rules of Evidence that they are gatekeepers responsible for ensuring that scientific evidence presented at trials is based on sound science. By the end of the Karlo opinion, however, the court of appeals held that Senior District Judge Terrence F. McVerry had been too vigorous a gatekeeper when he found inadmissible a statistical analysis of reductions in force offered by laid-off older workers.

The basic problem was that plaintiffs claimed to have observed statistically significant disparities in various overlapping age groups without correcting for the fact that by performing a series of hypothesis tests, they had more than one opportunity to discover something "significant." By way of analogy, if you flip a coin five times and observe five heads, you might begin to suspect that the coin is not fair. The probability of five heads in a row with a fair coin is p = (1/2)^5 = 1/32 ≈ 0.03. We can say that the five heads in the sample are "statistically significant" proof (at the conventional 0.05 level) that the coin is unfair.

But suppose you get to repeat the experiment five times. Now the probability of at least one sample of 5 flips with 5 heads is about five times larger. It is 1 - (1 - 1/32)^5 = 0.146785, to be exact. This outcome is not so far out of line with what is expected of a fair coin. It would be seen about 15% of the time for a fair coin. This is weak evidence that the coin is unfair; certainly, it is not as compelling as the 3% p-value. So the extra testing, with the opportunity to select any one or more of the five samples as proof of unfairness, has reduced the weight of the statistical evidence of unfairness. The effect of the opportunity to search for significance is sometimes known as "selection bias" or, of late, "p-hacking."
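The arithmetic in the coin example can be reproduced in a few lines. This is only an illustrative sketch; the variable names are invented here, and it assumes the five samples are independent of one another.

```python
# Probability that a fair coin lands heads on all five flips of one sample.
p_single = 0.5 ** 5  # 1/32

# Probability that at least one of five independent samples is all heads:
# one minus the probability that none of the five samples is all heads.
n_samples = 5
p_at_least_one = 1 - (1 - p_single) ** n_samples

print(f"one sample of five flips: {p_single:.5f}")
print(f"at least one of five samples: {p_at_least_one:.6f}")
```

The second figure is close to, but slightly less than, five times the first, which is why the simple "multiply by the number of tests" rule is described as conservative.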

In Karlo, Dr. Michael Campion, a distinguished professor of management at Purdue University with degrees in industrial and organizational psychology, compared proportions of Pittsburgh Glass workers older than 40, 45, 50, 55, and 60 who were laid off to the proportion of younger workers who were laid off. He found that the disparities in three of the five categories were statistically significant at the 0.05 level. 5/ The disparity for the 40-and-older range, he said, fell “just short,” being “significant at the 13% level.” Dr. Campion maintained that “[t]hese results suggest that there is evidence of disparate impact.” 6/ He also misconstrued the 0.05 level as “a 95% probability that the difference in termination rates of the subgroups is [] due to chance alone.” 7/ The district court expressed doubt as to whether Dr. Campion was a qualified statistical expert 8/ and excluded the testimony under Daubert as inadequate “data snooping.” 9/

Apparently, Judge McVerry was more impressed with the report of Defendant’s expert, James Rosenberger — a statistics professor at Pennsylvania State University and a fellow of the American Statistical Association and the American Association for the Advancement of Science. The report advocated adjusting the significance level to account for the five groupings of over-40 workers. The Chief Judge of the Third Circuit, D. Brooks Smith (also an adjunct professor at Penn State), described the recommended correction as follows:
The Bonferroni procedure adjusts for that risk [of a false positive] by dividing the “critical” significance level by the number of comparisons tested. In this case, PGW's rebuttal expert, Dr. James L. Rosenberger, argues that the critical significance level should be p < 0.01, rather than the typical p < 0.05, because Dr. Campion tested five age groups (0.05 / 5 = 0.01). Once the Bonferroni adjustment is applied, Dr. Campion's results are not statistically significant. Thus, Dr. Rosenberger argues that Dr. Campion cannot reject the null hypothesis and report evidence of disparate impact. 10/
Another way to apply the Bonferroni correction is to change the p-value. That is, when M independent comparisons have been conducted, the Bonferroni correction is either to set “the critical significance level . . . at 0.05/M” (as Professor Rosenberger recommended) or “to inflate all the calculated P values by a factor of M before considering against the conventional critical P value (for example, 0.05).” 11/
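The two equivalent forms of the Bonferroni correction described above can be sketched as follows. The function names and the five nominal p-values are hypothetical, chosen only for illustration; they are not the figures from Dr. Campion's report.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance level: the overall level divided by the number of tests."""
    return alpha / m


def bonferroni_adjust(p_values):
    """Inflate each nominal p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


ALPHA = 0.05

# Hypothetical nominal p-values for five comparisons (illustrative only;
# these are not taken from the record in Karlo).
nominal = [0.13, 0.04, 0.03, 0.02, 0.001]

threshold = bonferroni_threshold(ALPHA, len(nominal))
adjusted = bonferroni_adjust(nominal)

# Rule 1: compare each nominal p-value with the lowered threshold.
by_threshold = [p <= threshold for p in nominal]
# Rule 2: compare each inflated p-value with the conventional level.
by_adjusted = [p <= ALPHA for p in adjusted]

for p, p_adj, significant in zip(nominal, adjusted, by_threshold):
    print(f"nominal {p:.3f} -> adjusted {p_adj:.3f}, significant: {significant}")
```

Both rules lead to the same decisions for these inputs: nominal values between 0.01 and 0.05 lose their significance, while a value as small as 0.001 survives the correction.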

The Court of Appeals was not so sure that this conservative adjustment was essential to the admissibility of the p-values or assertions of statistical significance. It held that the district court erred in excluding the subgroup analysis and granting summary judgment. It remanded “for further Daubert proceedings regarding plaintiffs' statistical evidence.” 12/ Further proceedings were said to be necessary partly because the district court had applied “an incorrectly rigorous standard for reliability.” 13/ The lower court had set “a higher bar than what Rule 702 demands” 14/ because “it applied a bright-line exclusionary rule” for all studies with multiple comparisons that have no Bonferroni correction. 15/

But the district court did not clearly articulate such a rule. It wrote that “Dr. Campion does not apply any of the generally accepted statistical procedures (i.e., the Bonferroni procedure) to correct his results for the likelihood of a false indication of significance.” 16/ The sentence is grammatically defective (and hence confusing). On the one hand, it refers to "generally accepted statistical procedures." On the other hand, the parenthetical phrase suggests that only one "procedure" exists. Had the district court written “e.g.” instead of “i.e.,” it would have been clear that it was not promulgating a dubious rule that only the Bonferroni adjustment to p-values or significance levels would satisfy Daubert. To borrow from Mark Twain, "the difference between the almost right word and the right word is really a large matter—'tis the difference between the lightning-bug and the lightning." 17/

Understanding the district court to be demanding a Bonferroni correction in all cases of multiple testing, the court of appeals essentially directed it to reconsider its exclusionary ruling in light of the fact that other procedures could be superior. Indeed, there are many adjustment methods in common use, of which Bonferroni’s is merely the simplest. 18/ However, plaintiffs’ expert apparently had no other method to offer, which makes it hard to see why the possibility of some alternative adjustment, suggested by neither expert in the case, made the district court's decision to exclude Dr. Campion's proposed testimony an abuse of discretion.

A rule insisting on a suitable response to the multiple-comparison problem does not seem “incorrectly rigorous.” To the contrary, statisticians usually agree that “the proper use of P values requires that they be ... appropriately adjusted for multiple testing when present.” 19/ It is widely understood that when multiple comparisons are made, reported p-values will exaggerate the significance of the test statistic. 20/ The court of appeals’ statement that “[i]n certain cases, failure to perform a statistical adjustment may simply diminish the weight of an expert's finding” 21/ is therefore slightly misleading. In virtually all cases, multiple comparisons degrade the meaning of a p-value. Unless the statistical tests are all perfectly correlated, multiple comparisons always make the true probability of the disparity (or a larger one) under the model of pure chance greater than the nominal value. 22/

Even so, whether the fact that an unadjusted p-value exaggerates the weight of evidence invariably makes unadjusted p-values or reports of significance inadmissible under Daubert is a more delicate question. If no reasonable adjustment can be devised for the type of analysis used and no better analysis can be done, then the nominal p-values might be presented along with a cautionary statement about selection bias. In addition, when the nominal p-value is extremely small, the adjusted value will remain small, and the degree of exaggeration will not be so formidable as to render the unadjusted p-value inadmissible. For instance, if the nominal p-value were 0.001, the fact that the corrected figure is 0.005 would not be a fatal flaw. The disparity would be highly statistically significant even with the correction. But that was not the situation in Karlo. In this case, statistical significance was not apparent. It was undisputed that as soon as one considered the number of tests performed, not a single subgroup difference was significant at the 0.05 level. 23/

Consequently, the rejection of the district court’s conclusion that the particular statistical analysis in the expert’s report was unsound seems harsh. It should be within the trial court’s discretion to prevent an expert from testifying to the statistical significance of disparities (or their p-values) unless the expert avoids multiple comparisons that would seriously degrade the claims of significance or modifies those claims to reflect the negative impact of the repeated tests on the strength of the statistical evidence. 24/ The logic of Daubert does not allow an expert to dismiss the problem of selection bias on the theory -- advanced by plaintiffs in Karlo -- that “adjusting the required significance level [is only] required [when the analyst performs] ‘a huge number of analyses of all possibilities to try to find something significant.’” 25/ The threat to the correct interpretation of a significance probability does not necessarily disappear when the number of comparisons is moderate rather than “huge.” Given the lack of highly significant results here (even nominally), it is not statistically acceptable to ignore the threat. 26/ Although the Third Circuit was correct to observe that not all statistical imperfections render studies invalid within the meaning of Daubert, the reasoning offered in support of the claim of significant disparities in Karlo was not statistically acceptable. 27/

Notes
1. 29 U.S.C. §§ 621–634.
2. 849 F.3d 61 (3d Cir. 2017).
3. Id. at 76.
4. Id.
5. He testified that he did not compute a z-score (a way to analyze the difference between two proportions when the sample sizes are large) for the 60-and-over group “because ‘[t]here are only 14 terminations, which means the statistical power to detect a significant effect is very low.’” Karlo, 849 F.3d at 82 n.15.
6. Karlo v. Pittsburgh Glass Works, LLC, 2015 WL 4232600, at *11, No. 2:10–cv–1283 (W.D. Pa. July 13, 2015), vacated, 849 F.3d 61 (3d Cir. 2017).
7. Id. at *11 n.13. "A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis." Naomi Altman & Martin Krzywinski, Points of Significance: Interpreting P values, 14 Nature Methods 213, 213 (2017).
8. Id. at *12.
9. Id. at *13.
10. 849 F.3d at 82 (notes omitted).
11. Pak C. Sham & Shaun M. Purcell, Statistical Power and Significance Testing in Large-scale Genetic Studies, 15 Nature Reviews Genetics 335 (2014) (Box 3).
12. Karlo, 849 F.3d at 80 (note omitted).
13. Id. at 82.
14. Id. at 83.
15. Id. (internal quotation marks and ellipsis deleted).
16. Karlo, 2015 WL 4232600, at *1.
17. George Bainton, The Art of Authorship 87–88 (1890).
18. Martin Krzywinski & Naomi Altman, Points of Significance: Comparing Samples — Part II, 11 Nature Methods 355, 355 (2014).
19. Naomi Altman & Martin Krzywinski, Points of Significance: Interpreting P values, 14 Nature Methods 213, 214 (2017).
20. Krzywinski & Altman, supra note 18.
21. Karlo, 849 F.3d at 83 (emphasis added).
22. Because each age group included some of the same older workers, the tests here were not completely independent. But neither were they completely dependent.
23. However, that three out of five groups exhibited significant associations between age and terminations is surprising under the null hypothesis that those variables are uncorrelated. If each test were independent, then the probability of a significant result in each group would be 0.05. The probability of one or more significant results in five tests would be 0.226; that of two or more would be 0.0226; of three or more, 0.00116.
24. Joseph Gastwirth, Case Comment: An Expert's Report Criticizing Plaintiff's Failure to Account for Multiple Comparisons Is Deemed Admissible in EEOC v. Autozone, 7 Law, Probability & Risk 61, 62 (2008).
25. Karlo, 849 F.3d at 82.
26. Dr. Campion also believed that “his method [was] analogous to ‘cross-validating the relationship between age and termination at different cut-offs,’ or ‘replication with different samples.’” Id. at 83. Although the court of appeals seemed to take these assertions at face value, cross-validation involves applying the same statistical model to different data sets (or distinct subsets of one larger data set). For instance, an equation that predicts law school grades as a function of such variables as undergraduate grades and LSAT test scores might be derived from one data set, then checked to ensure that it performs well in an independent data set. Findings in one large data set of statistically significant associations between particular genetic loci and a disease could be checked to see if the associations were present in an independent data set. No such validation or replication was performed in this case.
27. The Karlo opinion suggested that the state of statistical knowledge or practice might be different in social science than in the broader statistical community. The court pointed to a statement (in a footnote on regression coefficients) in a treatise on statistical evidence in discrimination cases that “the Bonferroni adjustment [is] ‘good statistical practice,’ but ‘not widely or consistently adopted’ in the behavioral and social sciences.” Id. (quoting Ramona L. Paetzold & Steve L. Willborn, The Statistics of Discrimination: Using Statistical Evidence in Discrimination Cases § 6:7, at 308 n.2 (2016 Update)). The treatise writers were referring to an unreported case in which the district court found itself unable to resolve the apparent conflict between the generally recognized problem of multiple comparisons and an EEOC expert’s insistence that labor economists do not make such corrections and courts do not require them. E.E.O.C. v. Autozone, Inc., No. 00-2923, 2006 WL 2524093, at *4 (W.D. Tenn. Aug. 29, 2006). In the face of these divergent perceptions, the district judge decided not to grant summary judgment just because of this problem. Id. (“[T]he Court does not have a sufficient basis to find that ... the non-utilization [of the Bonferroni adjustment] makes [the expert's] results unreliable.”). The notion that multiple comparisons generally can be ignored in labor economics or employment discrimination cases is false, Gastwirth, supra note 24, at 62 (“In fact, combination methods and other procedures that reduce the number of individual tests used to analyse data in equal employment cases are basic statistical procedures that have been used to analyse data in discrimination cases.”), and any tendency to overlook multiple comparisons in “behavioral and social science” more generally is statistically indefensible.
That said, the outcome on appeal in Karlo might be defended as a pragmatic response to the lower court's misunderstanding of the meaning of the ADEA. The court excluded the unadjusted findings of significance for several reasons. In addition to criticizing Professor Campion's refusal to make any adjustment for his series of hypothesis tests across age groups, Judge McVerry noted that "the subgrouping analysis would only be helpful to the factfinder if this Court held that Plaintiffs could maintain an over-fifty disparate impact claim." Karlo, 2015 WL 4232600, at *13 n.16. He sided with "the majority view amongst the circuits that have considered this issue ... that a disparate impact analysis must compare employees aged 40 and over with those 39 and younger ... ." Id. (Petruska v. Reckitt Benckiser, LLC, No. CIV.A. 14–03663 CCC, 2015 WL 1421908, at *6 (D.N.J. Mar. 26, 2015)). The Third Circuit decisively rejected this construction of the ADEA, pulling this rug out from under the district court. Because the court of appeals held that the district court erred in interpreting the ADEA, requiring the district court to re-examine the statistical showing under the ADEA, correctly understood, might seem appropriate.
Of course, ordinarily an evidentiary ruling that can be supported on several independent grounds will be upheld on appeal as long as at least one of the independent grounds is valid. Here, the ADEA argument was literally a footnote to the independent ground that the failure to adjust for multiple comparisons invalidated the expert's claim of significant disparities. Nevertheless, the independent-grounds rule normally applies after a trial. It avoids retrials when the trial judge would or could rule the same way on retrial. Because Karlo is a summary judgment case, there is less reason to sustain the evidentiary ruling. But even so, the court of appeals did not have to vacate the judgment. Instead, it could have followed the usual independent-grounds rule to affirm the summary judgment while noting that the district court could reconsider its Daubert ruling in light of the court of appeals' explanation of the proper reach of the ADEA and the range of statistically valid responses to the problem of multiple hypothesis tests. As a practical matter, however, there may be little difference between having counsel address the issue in the context of a motion to reconsider and a renewed motion for summary judgment.
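The figures in note 23 can be checked with a short binomial calculation. The sketch below is only an illustration: it assumes five independent tests, each with a 0.05 chance of a nominally significant result under the null hypothesis, and the helper name prob_at_least is invented for this example. As note 22 observes, the overlapping age groups mean that the independence assumption does not strictly hold in Karlo.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p (upper tail of the binomial)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


N_TESTS = 5   # number of age-group comparisons assumed in note 23
ALPHA = 0.05  # chance of a nominally significant result in each test

for k in (1, 2, 3):
    print(f"P(at least {k} significant of {N_TESTS}) = "
          f"{prob_at_least(k, N_TESTS, ALPHA):.5f}")
```

Under these assumptions the three tail probabilities round to 0.226, 0.0226, and 0.00116, matching the values stated in the note.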