Wednesday, March 29, 2017

After Moore v. Texas, Is a Single IQ Score Really Determinative?

Bobby J. Moore has been on death row for the last 37 years. On Monday, the Supreme Court ruled that the Texas Court of Criminal Appeals (the state’s highest court for criminal cases) erred in finding that Moore is not intellectually disabled. Justice Ginsburg wrote for the five-member majority. The Chief Justice wrote a strong dissent for the other three justices. Neither opinion (on my quick reading at least) comes to grips with an obvious statistical principle—that combining information reduces uncertainty.

Moore v. Texas is the third case to try to clarify the rule in Atkins v. Virginia, 536 U.S. 304 (2002). There, the Supreme Court held that the Eighth Amendment’s Cruel and Unusual Punishment Clause prevents a state from executing an intellectually disabled offender, but it left the states with latitude in defining the disability. In Moore, the Court held that the Texas tribunal applied a medically outdated—and (hence?) constitutionally impermissible—standard in rejecting Moore’s claim of disability. Most of the majority opinion concerns “adaptive functioning,” which must be substantially impaired for a diagnosis of intellectual disability to be made.

However, the Court in Hall v. Florida, 572 U.S. __ (2014), allowed a state to refuse to inquire into adaptive functioning if an offender’s true IQ score is above 70. Hall explicitly stated that the following statutory definition of intellectual disability was constitutionally acceptable:
“significantly subaverage general intellectual functioning existing concurrently with deficits in adaptive behavior and manifested during the period from conception to age 18,” where “significantly subaverage general intellectual functioning” is “performance that is two or more standard deviations from the mean score on a standardized intelligence test.”
Because IQ scores for the whole population are roughly normally distributed with a mean of approximately 100 and a standard deviation of about 15, a score two standard deviations below the mean is about 70 (= 100 - 2 x 15). Hall therefore allows the state to execute offenders whose "true scores" are above 70.

In deciding whether a true score is above 70, Hall demanded that the state attend to the error of measurement. As the Moore Court, quoting from Hall, explained, "'[f]or purposes of most IQ tests,' [the] imprecision in the testing instrument 'means that an individual’s score is best understood as a range of scores on either side of the recorded score . . . within which one may say an individual’s true IQ score lies.'" For a single test with a standard error of 2.5 IQ points, it follows (for normally distributed errors) that the measured score must be greater than 75 (= 70 + two standard errors), so that the whole range lies above 70, to avoid "an unacceptable risk that persons with intellectual disability will be executed." 1/
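
The single-score arithmetic can be sketched in a few lines of Python. This is only an illustration, not anything the Court or the clinical guidelines prescribe: the function names are hypothetical, the standard error of 2.5 points and the multiplier of two are the figures used in the text above, and the "at or below 70" test follows the language Moore uses for the lower end of the range.

```python
# Illustrative sketch only. The function names are hypothetical; the SEM of
# 2.5 points and the multiplier of 2 are the figures assumed in the text.

CUTOFF = 100 - 2 * 15  # two standard deviations below the population mean = 70


def score_interval(measured, sem=2.5, k=2.0):
    """Return (low, high): the measured score minus/plus k standard errors."""
    margin = k * sem
    return measured - margin, measured + margin


def interval_reaches_cutoff(measured, cutoff=CUTOFF, sem=2.5, k=2.0):
    """True if the lower end of the range falls at or below the cutoff,
    which is the condition under which the inquiry must continue."""
    low, _high = score_interval(measured, sem, k)
    return low <= cutoff


for score in (74, 75, 76, 78):
    low, high = score_interval(score)
    print(score, (low, high), interval_reaches_cutoff(score))
```

With these assumed figures, Moore's score of 74 yields the range of 69 to 79 that the Court describes, and a score of 75 still reaches the cutoff, while 76 and 78 do not.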

But what about multiple scores? In that common situation, the Hall Court seemed conflicted. Justice Kennedy opaquely opined that “[e]ven when a person has taken multiple tests, each separate score must be assessed using the SEM, and the analysis of multiple IQ scores jointly is a complicated endeavor.” Does this mean that no matter how many IQ tests have been administered and no matter how many of them lie above 70, a single score of 75 or less makes a conclusive case for “significantly subaverage general intellectual functioning”?

From a statistical perspective, a lowest-single-score rule seems very strange indeed. If I want to know whether I have a fever and I take ten measurements of my temperature (with ten thermometers), I would not say that I have a fever just because one thermometer gives a high reading. I would use an average, and the mean temperature would have greater precision (smaller standard error) than the single highest reading of the ten.
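
A minimal sketch of the thermometer point follows. Everything numerical in it is a made-up illustration: the ten readings, the 0.5-degree standard deviation for a single reading, and the 100.4 degree Fahrenheit fever threshold are assumptions chosen only to show the mechanics. The one statistical fact it relies on is that, for independent readings with a common standard deviation, the standard error of the mean of n readings is that standard deviation divided by the square root of n.

```python
import math
import statistics

# Hypothetical values for illustration only (not drawn from any case or study).
FEVER_F = 100.4          # assumed fever threshold, degrees Fahrenheit
SD_SINGLE_READING = 0.5  # assumed standard deviation of one thermometer reading

# Ten hypothetical readings; one thermometer reads high.
readings_f = [98.4, 98.7, 98.5, 98.6, 98.8, 98.5, 98.6, 98.7, 98.4, 100.6]


def standard_error_of_mean(sd_single, n):
    """Standard error of the mean of n independent readings with equal SD."""
    return sd_single / math.sqrt(n)


mean_reading = statistics.mean(readings_f)
highest = max(readings_f)
se_mean = standard_error_of_mean(SD_SINGLE_READING, len(readings_f))

print("highest single reading:", highest, "-> fever?", highest >= FEVER_F)
print("mean of ten readings:", round(mean_reading, 2), "-> fever?", mean_reading >= FEVER_F)
print("SE of one reading:", SD_SINGLE_READING, "| SE of the mean:", round(se_mean, 3))
```

Under these assumptions the highest reading alone crosses the threshold, but the mean does not, and the mean's standard error is less than a third of a single reading's.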

Justice Ginsburg’s opinion in Moore seems to fly in the face of this common-sense statistical point. The Texas court focused on two test scores — "a 78 in 1973 and 74 in 1989." It pointed to factors that might have biased the latter score toward the low end, leaving the higher one as entitled to more weight. Specifically, it wrote that there was expert testimony that Moore might not have been putting much effort into answering the questions in the lower-scoring test, which was given to him in prison, and that he "also took the WAIS–R under adverse circumstances; he was on death row and facing the prospect of execution, and he had exhibited withdrawn and depressive behavior." Ex Parte Moore, 470 S.W.3d 481, 519 (Tex. Ct. Crim. App. 2015). Thus, the court concluded,
These considerations might tend to place his actual IQ in a somewhat higher portion of that 69 to 79 range. ... Considering these factors together, we find no reason to doubt that applicant's [higher] WAIS–R score accurately and fairly represented his intellectual functioning as being above the intellectually disabled range.
The Supreme Court assumed that it was necessary to consider each test in isolation and without making a clinical adjustment to the statistically determined plus-or-minus-five-point margin of error. Justice Ginsburg called the statistical range of error "clinically established." She described and condemned the Texas court's evaluation of the clinical testimony as follows:
Based on the two scores, but not on the lower portion of their ranges, the court concluded that Moore’s scores ranked “above the intellectually disabled range” (i.e., above 70). ... But the presence of other sources of imprecision in administering the test to a particular individual, cannot narrow the test-specific standard-error range. [W]e require that courts continue the inquiry and consider other evidence of intellectual disability where an individual’s IQ score, adjusted for the test’s standard error, falls within the clinically established range for intellectual-functioning deficits.
Thus, she insisted that just because "Moore’s score of 74, adjusted for the standard error of measurement, yields a range of 69 to 79" so that "the lower end ... falls at or below 70, the [Court of Criminal Appeals] had to move on to consider Moore’s adaptive functioning." (Emphasis added.)

In sum, Moore seems to say that a clinician cannot tinker with the statistical margin of error (two standard errors as constitutionalized in Hall). The dissent vigorously disagreed with this rule and maintained that the Constitution permits states to make adjustments for individual circumstances that experts agree affect performance. A statistical argument for the dissent's position would be this: Computationally, the standard error reflects the variation in performance of a population of test-takers. This population-based figure is then applied to all individuals regardless of how strongly the sources of error apply to them. IQ tests administered in prison to inmates exhibiting signs of depression may not be part of that population. Those scores might have a larger or a smaller standard error, and they are generally lower than the true score for a person taking the test in normal circumstances. In other words, the clinician is not modifying the margin of error so much as adjusting the entire estimate upward.

This analysis does not necessarily render Moore's rule legally faulty. It might be undesirable to give clinicians this latitude to adjust scores. Under the majority's approach to the Eighth Amendment, the issue becomes whether the clinical guidelines for diagnosing disability allow individualized modifications of the statistical rule. The guidelines discussed in Hall are not completely clear. The dissent reads them as requiring an expert or a court to take the usual standard error seriously in interpreting an IQ score, but permitting reasoned and reasonable departures from it.

Even if Moore forbids individual adjustments to the statistical rule of plus-or-minus two standard errors (for the general population), why allow the confidence interval for a single test score to be dispositive when multiple test scores are present? The clinical guidelines do not mandate this rule, and it is not so obvious that Moore does. Texas apparently made no effort to combine the two scores into a single point estimate with a margin of error applicable to the combined statistic. Hall claimed that combining scores from different IQ test forms was "complicated," although the literature it cited gave a simple procedure for doing so. So neither Hall nor Moore can be said to firmly establish that an appropriately averaged score is impermissible. After all, neither case presented the Court with an interval estimate for the true IQ score derived from multiple scores by an accepted statistical procedure, and the many-thermometer example given above illustrates the statistical deficiency in a rule that looks to every measured IQ score in isolation.
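
To show what a combined estimate might look like, here is a deliberately naive sketch that averages the two scores the Texas court discussed (78 and 74). It is not the procedure from the literature cited in Hall, and it is not offered as an accepted clinical method. It assumes, purely for illustration, that both tests have the same standard error of 2.5 points, that their measurement errors are independent and unbiased, and that the true score did not change between 1973 and 1989. The function name is hypothetical.

```python
import math


def combined_interval(scores, sem=2.5, k=2.0):
    """Equal-weight average of the scores with a margin of k standard errors.

    Assumes independent, unbiased errors with the same SEM on every test,
    so the standard error of the average is sem / sqrt(number of scores).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("at least one score is required")
    mean = sum(scores) / n
    se_mean = sem / math.sqrt(n)
    margin = k * se_mean
    return mean - margin, mean, mean + margin


low, mean, high = combined_interval([78, 74])
print("combined estimate:", mean)
print("interval:", round(low, 2), "to", round(high, 2))
print("lower end at or below 70?", low <= 70)
```

Under these assumptions the average is 76, its standard error is about 1.77 rather than 2.5, and the interval runs from roughly 72.5 to 79.5, which stays above 70. If the errors are correlated, if the tests have different standard errors, or if one score is biased downward, the numbers would change, which is why a vetted procedure rather than this simple average would be needed in practice.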

The single-score-too-low rule bends over backward to avoid misclassifying a disabled offender as normal. The rule might be defended on exactly that ground. But that is not the logic of Moore, which only asks what clinical guidelines for interpreting IQ scores allow. Moreover, if the real objective is to make determinations of intellectual disability as fully informed as possible, it would seem more direct just to demand the inquiry into adaptive functioning along with IQ scores in all cases. On the other hand, if true IQ scores matter as a threshold to a richer inquiry into both intellectual and adaptive functioning, then statistically sound procedures for integrating all the IQ test results ought to be followed.

Further reading: David H. Kaye, Deadly Statistics: Quantifying an "Unacceptable Risk" in Capital Punishment, 15 Law, Probability & Risk __ (2017).

Note
  1. If the standard error were substantially less than 2.5, then the measured score would not have to be all five points above 70. The use of two standard errors also is on the high side; 1.96 standard errors provide 95% coverage. Justice Kennedy's opinion in Hall was not as clear as it should have been on these points, but this is the only interpretation consistent with the concept of confidence intervals and standard errors used in the opinion. In Brumfield v. Cain, 135 S.Ct. 2269 (2015), however, the Court wrote that after "[a]ccounting for this margin of error, Brumfield's reported IQ test result of 75 was squarely in the range of potential intellectual disability." Id. at 2278. The Court did not disclose the standard error of measurement for the test.