Friday, November 23, 2018

Muddling Through the Measurement of IQ

IQ scores are a critical component in the diagnosis of intellectual disability. That measurements of IQ are subject to various sources of measurement error is widely appreciated, but by and large, lawyers and psychologists have supplied rather imprecise -- and sometimes incorrect -- explanations of the statistics involved. A recent example is Intellectual Disability and the Death Penalty: Current Issues and Controversies, a book intended as "a valuable resource for mental health experts, attorneys, investigators, mitigation specialists, and other members of legal teams, as well as judges." 1/ The authors explain the "standard scores" that put the mean IQ score in the population at 100 as follows:
A person's standard score on a test is calculated by transforming the individual's obtained raw score on Test A (e.g., the sum of the number of correct responses on a test) using the population's known mean and standard deviation on Test A, which transforms the individual's test performance onto a common metric allowing us to compare his or her score to anyone else tested with Test A. Standard scores are possible only for tests where, if administered to the entire population, the distribution of all test scores on said test would be normally distributed ... . A percentile score is one form of a standard score that permits the interpretation of a person's performance in relation to a reference group. Although not a requirement, in the case of many psychological tests the scale for standard scores is set to have a mean or average score of 100 and a standard deviation of 15. Thus, a test performance that results in a standard score of 70 is said to be "significantly" below average or approximately two standard deviations below the population mean. A standard deviation is a unit of measure that indicates the distance from the average. During the standardization phase of the development of a standardized test, the test and its items are administered to a large and representative sample of the reference group of interest or population. This is generally referred to as the standardization sample or norming group. From this norming group, the test developers compute the population's mean score and standard deviation on the test. The mean score and standard deviation are essential to transforming subsequently obtained raw scores (i.e., the sum of the number of correct items) on said test to a standard scale score (e.g., intelligence quotient, or IQ). 2/
Percentiles and Standard Scores

Standard scores have some value in "compar[ing one individual's] score to anyone else tested with Test A." Unlike raw scores, they incorporate the variability of scores across test-takers into the reported score. They are perhaps more useful for comparing scores from different tests (or different forms of the same test, or tests administered to populations that are changing over time).

But whatever the motivation for a standardized reporting scale, it is strange to describe percentiles as standard scores. A standard score is just a particular linear transformation of a raw score that specifies "the number of standard deviations above (+) or below (-) the mean you are." 3/ As an example, suppose that the raw-score population mean for "Test A" is 60; that the population standard deviation is 12; and that a test taker has a raw score of 50. The standard score is 5/6 of a standard deviation below the mean: z = (50 - 60)/12 = -5/6 ≈ -0.83.
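
To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The numbers (mean 60, standard deviation 12, raw score 50) are the hypothetical "Test A" figures above, not values from any actual test.

```python
def standard_score(raw, mean, sd):
    """Number of standard deviations a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd

# Hypothetical "Test A" example from the text: mean 60, SD 12, raw score 50.
z = standard_score(50, mean=60, sd=12)
print(round(z, 2))  # -0.83, i.e., about 5/6 of a standard deviation below the mean
```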

To translate the raw score (or the corresponding z-score of -0.83) into a percentile, we need to know how the raw scores are distributed. For example, if raw scores were uniformly distributed from about 39 to 81 (which preserves the mean of 60 and the standard deviation of 12), then some 26% of them would be 50 or less. If instead the raw scores were normally distributed with the same mean and standard deviation, then about 20% of the population would have a raw score of 50 or less. Other distributions would produce other percentiles. Consequently, the percentile is not "one form of a standard score." At best, the percentile can be deduced from the standard score and other information.
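
The dependence of the percentile on the shape of the distribution is easy to check; the short sketch below (assuming SciPy is available) compares a uniform and a normal distribution, each with mean 60 and standard deviation 12.

```python
from scipy.stats import norm, uniform

mean, sd, raw = 60, 12, 50

# A uniform distribution with mean 60 and SD 12 runs from about 39 to 81
# (its width is sd * sqrt(12)).
width = sd * 12 ** 0.5
print(round(uniform(loc=mean - width / 2, scale=width).cdf(raw), 2))  # ~0.26: roughly the 26th percentile

# A normal distribution with the same mean and SD puts a raw score of 50
# at roughly the 20th percentile.
print(round(norm(mean, sd).cdf(raw), 2))  # ~0.20
```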

A Standardized Scale Does Not Require Normality

Why are "[s]tandard scores ... possible only for tests where, if administered to the entire population, the distribution of all test scores on said test would be normally distributed"? Standard scores can be constructed for any distribution of test scores with a defined mean and standard deviation. Normality may be convenient or common, but it is not essential to a standardized score scale.
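
To see that nothing in the arithmetic depends on normality, here is a small sketch that standardizes scores drawn from a deliberately skewed distribution; the exponential shape, the offset, and the sample size are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw scores from a markedly skewed, non-normal distribution.
raw_scores = rng.exponential(scale=10.0, size=100_000) + 40

# Standardization needs only a mean and a standard deviation -- not normality.
z = (raw_scores - raw_scores.mean()) / raw_scores.std()
iq_style = 100 + 15 * z  # rescaled to the familiar mean-100, SD-15 metric

print(round(iq_style.mean()), round(iq_style.std()))  # 100 15
# The rescaled scores retain the original skew; only the location and scale change.
```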

So What?

Not much turns on these corrections to the explanation in Intellectual Disability and the Death Penalty. IQ scores are more or less normally distributed, and the use of IQ scores of 70 and below (z ≤ -2) as the range in which an individual can be diagnosed as intellectually disabled limits the diagnosis to no more than roughly 2.3% of the general population.
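
The 2.3% figure is simply the area under the normal curve at or below two standard deviations under the mean, which can be checked directly (again assuming SciPy):

```python
from scipy.stats import norm

print(round(norm.cdf(-2), 3))           # 0.023 -> roughly 2.3% at or below z = -2
print(round(norm.cdf(70, 100, 15), 3))  # the same figure on the IQ scale (mean 100, SD 15)
```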

But why should "a standard score of 70 [be] said to be 'significantly' below average"? Why is an IQ score of 71 -- or even 80 -- not significantly below the mean of 100? There is no statistical reason to focus on 70 as a cutoff. In Hall v. Florida, 572 U.S. 701 (2014), a majority of the Supreme Court was content with categorically excluding from the zone of intellectual disability (for the purpose of deciding potential eligibility for capital punishment) all defendants with true IQ scores above 70. Yet no one could explain the basis for this fundamental choice. It is a convention currently in vogue among experts who want some such threshold. 4/

Quantifying Measurement Error

At the same time that it approved the z ≤ -2 range for true scores -- thereby limiting eligibility for the constitutional exemption from capital punishment on the ground of intellectual disability to a small fraction of the population -- the Court held that a slightly higher cutoff for observed scores was constitutionally necessary to ensure that random error in measuring IQ does not preclude too many defendants with true scores of 70 or less from consideration. Intellectual Disability and the Death Penalty explained this refinement as follows:
The Supreme Court of the United States in Hall v. Florida ruled that states must consider the test's standard error of measurement when interpreting obtained IQ scores in cases where the defendant is making an intellectual disability claim. ...
The standard error of measurement (SEM) is a direct measure of the test's reliability and is computed by administering the test to a large and representative sample of the population to be assessed on the test and computing the test's reliability coefficient, which can then be translated into an average error of measurement for the population ... . Generally, the SEM is computed and then used to create confidence intervals around the obtained standard scores (e.g., 95% certainty). A confidence interval of 95% represents a statistical certainty that, based on the knowledge of this test's reliability coefficient, there is a 95% chance that the person's true score falls within a confidence interval that is +/-2 times the test's SEM. Thus, a professional reporting on an assessed individual's "obtained" full-scale IQ score of 70 on IQ Test A and knowing that Test A has a SEM of 2.5 around its full-scale IQ score, he would report that there is a 95% certainty that the assessed person's "true" full-scale IQ score falls within the range of 65-75 (i.e., 2x2.5= +/-5 points). 5/
This passage is garbled in two ways. To begin with, the SEM is not "a direct measure of the test's reliability." It is a statistic derived from "the test's reliability coefficient." There are many ways to estimate reliability, and the logic behind the move from reliability to SEM is subtle. A better statistic for expressing the uncertainty about the true score that underlies an observed score would be the standard error of estimate (SEE). The SEM is an average across all scores; the SEE takes into account the fact that uncertainty about the true score grows as the observed score moves away from the population mean (IQ = 100). A description of the SEE can be found elsewhere. 6/
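
For readers who want to see how the pieces fit together, here is a minimal sketch using the standard classical-test-theory formulas (SEM = SD·√(1 − r), SEE = SD·√(r(1 − r)), and an estimated true score regressed toward the mean). These formulas are not quoted from the book, and the reliability coefficient of 0.97 is an assumed, illustrative value rather than a figure for any particular IQ test.

```python
import math

def sem(sd, r):
    """Standard error of measurement: sd * sqrt(1 - r)."""
    return sd * math.sqrt(1 - r)

def see(sd, r):
    """Standard error of estimation: sd * sqrt(r * (1 - r))."""
    return sd * math.sqrt(r * (1 - r))

def estimated_true_score(observed, mean, r):
    """Estimated true score, regressed toward the population mean."""
    return mean + r * (observed - mean)

sd, mean, r, obs = 15, 100, 0.97, 70  # r = 0.97 is an assumed reliability

print(round(sem(sd, r), 2))  # ~2.6
print(round(see(sd, r), 2))  # ~2.56
t_hat = estimated_true_score(obs, mean, r)
print(round(t_hat, 1))       # 70.9 -- pulled toward 100

# An interval of +/- 2 * SEE centered on the estimated true score (70.9) is not
# the same as +/- 2 * SEM centered on the observed score (70); the difference
# grows as the observed score moves away from the population mean.
```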

Second, the 95% in a 95% confidence interval is neither a "statistical certainty" nor "a 95% chance that the person's true score falls within [the computed] confidence interval." This interpretation of "confidence" is ubiquitous -- and widely known (to statisticians) to be wrong. The misinterpretation was apparent in the dissenting opinion written for four Justices by Justice Alito. It probably was implicit in the majority opinion penned by Justice Kennedy. Although we would expect (in the long run) 95% of all 95% confidence intervals to contain the true value, the probability that a particular interval covers the true score cannot be computed with the machinery of confidence intervals. 7/ Interval estimates that can be said to provide such probabilities require Bayes' theorem. Again, discussion and examples for IQ scores are available elsewhere. 8/
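
A quick simulation illustrates what the 95% does -- and does not -- mean. The true score of 68 and the SEM of 2.5 below are assumed, illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

true_iq, sem_val, n_tests = 68, 2.5, 100_000  # assumed, illustrative values

# Simulate repeated testing: each observed score is the true score plus
# normally distributed measurement error with standard deviation SEM.
observed = true_iq + rng.normal(0, sem_val, n_tests)

# 95% confidence intervals of +/- 1.96 * SEM around each observed score.
lower, upper = observed - 1.96 * sem_val, observed + 1.96 * sem_val
coverage = np.mean((lower <= true_iq) & (true_iq <= upper))
print(round(coverage, 3))  # ~0.95: a long-run frequency over many intervals,
                           # not the probability that any one interval is correct
```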

Clinical psychologists, lawyers, and judges are not statisticians. They do not have to compute means, standard deviations, standard errors, confidence intervals, or Bayesian credible regions. Nevertheless, to become more astute users of such statistics, they need a better understanding of the reasoning behind standard scores and expressions for measurement error.

NOTES
  1. Marc J. Tassé & John H. Blume, Intellectual Disability and the Death Penalty: Current Issues and Controversies vii (Praeger 2018).
  2. Id. at 87.
  3. Penn State University Eberly College of Science, STAT 100: Statistical Concepts and Reasoning § 5.2 (2018), https://onlinecourses.science.psu.edu/stat100/node/13/
  4. David H. Kaye, Deadly Statistics: Quantifying an "Unacceptable Risk" in Capital Punishment, 16 Law, Probability & Risk 7-34 (2017), http://ssrn.com/abstract=2788377.
  5. Tassé & Blume, supra note 1, at 90.
  6. Kaye, supra note 4.
  7. For an elaboration in legal settings, see David H. Kaye, Apples and Oranges: Confidence Coefficients and the Burden of Persuasion, 73 Cornell L. Rev. 54 (1987).
  8. Kaye, supra note 4.
