Thursday, June 12, 2014

More on the Mistakes in “Forensic Science Isn’t Science”

In "Flawed Journalism on Flawed Forensics in Slate Magazine" I referred to "other inaccuracies in the article" by Mark Joseph Stern entitled “Forensic Science Isn’t Science.” Here are some excerpts from the article with some thoughts.

Behind the myriad technical defects of modern forensics lie two extremely basic scientific problems. The first is a pretty clear case of cognitive bias: A startling number of forensics analysts are told by prosecutors what they think the result of any given test will be. This isn’t mere prosecutorial mischief; analysts often ask for as much information about the case as possible—including the identity of the suspect—claiming it helps them know what to look for. Even the most upright analyst is liable to be subconsciously swayed when she already has a conclusion in mind. Yet few forensics labs follow the typical blind experiment model to eliminate bias. Instead, they reenact a small-scale version of Inception, in which analysts are unconsciously convinced of their conclusion before their experiment even begins.

Few psychologists would propose that expectancy effects will cause errors in interpretation in every experiment. The role of cognitive bias in science and inference generally is quite complicated. There is no typical “blind experiment model.” The physicists who announced the discovery of the Higgs boson or those who believed they detected the marks of gravitational waves in the cosmic background radiation had no such model. In much of science, experimenters know what they are looking for; fortunately, some results are not ambiguous and not subject to subtle misinterpretation. There is the joke that if your experiment needs statistics, you should do a better experiment.

That said, when interpretations are more malleable—as is often the case in many forensic disciplines—various methods are available to minimize the chance of this source of error. One is “sequential unmasking” to protect forensic analysts from unconscious (or conscious) bias that could lead them to misconstrue their data when exposed to information that they do not need to know. There is rarely, if ever, an excuse for not using methods like these. But their absence does not make a solid latent fingerprint match or a matching pair of clean, single-source electropherograms, for example, into small-scale versions of Inception.

Without a government agency overseeing the field, forensic analysts had no incentive to subject their tests to stricter scrutiny. Groups such as the Innocence Project have continually put pressure on the Department of Justice—which almost certainly should have supervised crime labs from the start—to regulate forensics. But until recently, no agency has been willing to wade into the decentralized mess that hundreds of labs across the country had unintentionally created.

When and where was the start of forensic science? Ancient China? Renaissance Europe? Should the U.S. Department of Justice have been supervising the Los Angeles Police Department when it founded the first crime laboratory in the United States in 1923? The DOJ has had enough trouble with the FBI laboratory, whose blunders led to reports from the DOJ’s Office of the Inspector General and which has turned to the National Research Council for advice on more than one occasion. The 2009 report of a committee of the National Research Council had much better ideas for improving the practice of forensic science in the United States. Its recommendation of a new agency entirely outside of the Department of Justice for setting standards and funding research, however, gained little political traction. The current National Commission on Forensic Science is a distorted, toothless, and temporary version of the idea.

In 2009, a National Academy of Sciences committee embarked on a long-overdue quest to study typical forensics analyses with an appropriate level of scientific scrutiny—and the results were deeply chilling.

The committee did not undertake a “scientific” study. It engaged in a policy-oriented review of the state of forensic science without applying any particularly scientific methods. (This is not a criticism of the committee. NRC committees generally collect and review relevant literature and views rather than undertake scientific research of their own.) This committee's quest did not begin in 2009. That is when it ended. Congress voted to fund the study in 2005.

Aside from DNA analysis, not a single forensic practice held up to rigorous inspection. The committee condemned common methods of fingerprint and hair analysis, questioning their accuracy, consistent application, and general validity. Bite-mark analysis—frequently employed in rape and murder cases, including capital cases—was subject to special scorn; the committee questioned whether bite marks could ever be used to positively identify a perpetrator. Ballistics and handwriting analysis, the committee noted, are also based on tenuous and largely untested science.

The report is far more nuanced (some might say conflicted) than this. Here are some excerpts:
"The chemical foundations for the analysis of controlled substances are sound, and there exists an adequate understanding of the uncertainties and potential errors. SWGDRUG has established a fairly complete set of recommended practices." P. 135.

"Historically, friction ridge analysis has served as a valuable tool, both to identify the guilty and to exclude the innocent. Because of the amount of detail available in friction ridges, it seems plausible that a careful comparison of two impressions can accurately discern whether or not they had a common source. Although there is limited information about the accuracy and reliability of friction ridge analyses, claims that these analyses have zero error rates are not scientifically plausible." P. 142.

"Toolmark and firearms analysis suffers from the same limitations discussed above for impression evidence. Because not enough is known about the variabilities among individual tools and guns, we are not able to specify how many points of similarity are necessary for a given level of confidence in the result. Sufficient studies have not been done to understand the reliability and repeatability of the methods. The committee agrees that class characteristics are helpful in narrowing the pool of tools that may have left a distinctive mark. Individual patterns from manufacture or from wear might, in some cases, be distinctive enough to suggest one particular source, but additional studies should be performed to make the process of individualization more precise and repeatable." P. 154.

"Forensic hair examiners generally recognize that various physical characteristics of hairs can be identified and are sufficiently different among individuals that they can be useful in including, or excluding, certain persons from the pool of possible sources of the hair. The results of analyses from hair comparisons typically are accepted as class associations; that is, a conclusion of a 'match' means only that the hair could have come from any person whose hair exhibited—within some levels of measurement uncertainties—the same microscopic characteristics, but it cannot uniquely identify one person. However, this information might be sufficiently useful to 'narrow the pool' by excluding certain persons as sources of the hair." P. 160.

"The scientific basis for handwriting comparisons needs to be strengthened. Recent studies have increased our understanding of the individuality and consistency of handwriting and computer studies and suggest that there may be a scientific basis for handwriting comparison, at least in the absence of intentional obfuscation or forgery. Although there has been only limited research to quantify the reliability and replicability of the practices used by trained document examiners, the committee agrees that there may be some value in handwriting analysis."

"Analysis of inks and paper, being based on well-understood chemistry, presumably rests on a firmer scientific foundation. However, the committee did not receive input on these fairly specialized methods and cannot offer a definitive view regarding the soundness of these methods or of their execution in practice." Pp. 166-67

"As is the case with fiber evidence, analysis of paints and coatings is based on a solid foundation of chemistry to enable class identification." P. 170

"The scientific foundations exist to support the analysis of explosions, because such analysis is based primarily on well-established chemistry." P. 172

"Despite the inherent weaknesses involved in bite mark comparison, it is reasonable to assume that the process can sometimes reliably exclude suspects. Although the methods of collection of bite mark evidence are relatively noncontroversial, there is considerable dispute about the value and reliability of the collected data for interpretation." P. 176

"Scientific studies support some aspects of bloodstain pattern analysis. One can tell, for example, if the blood spattered quickly or slowly, but some experts extrapolate far beyond what can be supported." P. 178
Hardly a ringing endorsement of all police lab techniques, but neither is the report an outright rejection of all or even most techniques now in use.

The report amounted to a searing condemnation of the current practice of forensics and an ominous warning that death row may be filled with innocents.

According to an NRC press release issued in February, 2009, “[t]he report offers no judgment about past convictions or pending cases, and it offers no view as to whether the courts should reassess cases that already have been tried.” Such language may be a compromise among the disparate committee members. But to derive the conclusion that death row is “filled with innocents” even partly from the actual contents of the report, one would have to consider the deficiencies identified in the system, the extent to which these deficiencies generated the evidence used in capital cases, and the other evidence in those cases. Other research is far more helpful in evaluating the prevalence of false convictions.

Given the flimsy foundation upon which the field of forensics is based, you might wonder why judges still allow it into the courtroom.

As the 2009 NRC Committee explained, there is no single "field of forensics." Rather,
"Wide variability exists across forensic science disciplines with regard to techniques, methodologies, reliability, error rates, reporting, underlying research, general acceptability, and the educational background of its practitioners. Some of the forensic science disciplines are laboratory based (e.g., nuclear and mitochondrial DNA analysis, toxicology, and drug analysis); others are based on expert interpretation of observed patterns (e.g., fingerprints, writing samples, toolmarks, bite marks, and specimens such as fibers, hair, and fire debris). Some methods result in class evidence and some in the identification of a specific individual—with the associated uncertainties. The level of scientific development and evaluation varies substantially among the forensic science disciplines." P. 182.
The courts have been lax in responding to overblown testimony in some fields and to those techniques that lack proof of their fundamental precepts.

In 1993, the Supreme Court announced a new test, dubbed the "Daubert standard," to help federal judges determine what scientific evidence is reliable enough to be introduced at trial. The Daubert standard ... wound up frustrating judges and scientists alike. As one dissenter griped, the new test essentially turned judges into "amateur scientists," forced to sift through competing theories to determine what is truly scientific and what is not.

Blaming the persistence of the admissibility of the most dubious forensic disciplines on Daubert is strange. Daubert's standard did not spring into existence fully formed, like Athena from the brow of Zeus. A similar standard was in place in a number of jurisdictions. As The New Wigmore: A Treatise on Evidence shows, the Court borrowed from these cases. A smaller point to note is that there were not one, but two partial dissenters (who concurred in the unanimous judgment). Chief Justice Rehnquist and Justice Stevens objected to the majority’s proffering "general observations" about scientific validity, and they did not complain about the ones that Mr. Stern points to as an explanation for the persistence of questionable forensic "science."

Even more puzzlingly, the new standards called for judges to ask "whether [the technique] has attracted widespread acceptance within a relevant scientific community"—which, as a frustrated federal judge pointed out, required judges to play referee between "vigorous and sincere disagreements" about "the very cutting edge of scientific research, where fact meets theory and certainty dissolves into probability."

That’s Chief Judge Alex Kozinski of the Ninth Circuit Court of Appeals writing on remand in Daubert itself. Judge Kozinski could not possibly be objecting to the Supreme Court's opinion on the ground that "widespread acceptance within a relevant scientific community" is an impenetrable standard. Quite the opposite. He applied that very standard in his previous opinion in the case and was bemoaning what he called the "brave new world" that the Court ushered in as it vacated his opinion. A recent survey of judges found that 96% of the (disappointingly small) fraction responding deemed general scientific acceptance to be helpful -- more than any other commonly used factor in judging the validity of scientific evidence.

American jurors today expect a constant parade of forensic evidence during trials. They also refuse to believe that this evidence might ever be faulty. Lawyers call this the CSI effect, after the popular procedural that portrays forensics as the ultimate truth in crime investigation. [¶] “Once a jury hears something scientific, there’s a kind of mythical infallibility to it,” Peter Neufeld, a co-founder of the Innocence Project, told me. “That’s the association when a person in white lab coat takes the witness stand. By that point—once the jury’s heard it—it’s too late to convince them that maybe the science isn’t so infallible.”

Refusal to question scientific evidence is not what most lawyers call the CSI effect. In any event, jury research does not support the idea that jurors inevitably reject attacks on scientific testimony or that the testimony of the first witness in a figurative white coat is unshakeable.

If judges can’t be trusted to keep spurious forensic analysis out of the courtroom, and juries can’t be trusted to disregard it, then how are we going to keep the next Earl Washington off death row? One option would be to permit anybody convicted on the basis of biological evidence to subject that evidence to DNA analysis—which is, after all, the one form of forensics that scientists agree actually works. But in 2009, the Supreme Court ruled that convicts had no such constitutional right, even where they can show a reasonable probability that DNA analysis would prove their innocence. (The ruling was 5–4, with the usual suspects lining up against convicts’ rights.)

This option, which applies to a limited set of cases (and hence is no general solution), is not foreclosed by District Attorney for the Third Judicial District v. Osborne, 129 S.Ct. 2308 (2009). If there is a minimally plausible claim of actual innocence after conviction, let’s allow such testing by statute. Of course, it would be better to thoroughly test potential DNA evidence (when it is relevant) before trial, something that Osborne's trial counsel declined to request, fearing that it would only strengthen the prosecution’s case.

Until lab technicians follow some uniform guidelines and abandon the dubious techniques glamorized on shows like CSI, forensic science will barely qualify as a science at all. As a recent investigation by Chemical & Engineering News revealed, little progress has been made in the five years since the National Academy of Sciences condemned modern forensic techniques.

Again, as the NRC committee stressed, "forensic science" is not a single, uniform discipline. Since the report, funding has increased, some guidelines have been revised, and significant research has appeared in some fields. Still, the pace resembles that of global warming. It is coming, notwithstanding resistance described in earlier years on this blog.

As for the "investigation by Chemical & Engineering News," the latest I saw from that publication was an article in the May 12, 2014, issue with a map showing selected instances of examiner misconduct dating back to 1993 and indicating that only five states require laboratory accreditation. No effort was made to ascertain how many labs are still operating without accreditation. With no apparent literature review, the article simply asserted that
[I]n the years since [2009], little has been done to shore up the discipline’s scientific base or to make sure that its methods don’t result in wrongful convictions. Quality standards for forensic laboratories remain inconsistent. And funding to implement improvements is scarce. [¶] While politicians and government workers debate changes that could help, fraudsters like forensic chemist Annie Dookhan keep operating in the system. No reform could stop a criminal intent on doing wrong, but a better system might have shown warning signs sooner. And it likely would have prevented some of the larger, systemic problems at the Massachusetts forensics lab where Dookhan worked.
I must be missing the real investigation that the C&E News writers conducted.

Flawed Journalism on Flawed Forensics in Slate Magazine

Yesterday, Slate magazine published an article by Mark Joseph Stern announcing that “Forensic Science Isn’t Science.” 1/ The writer’s objective—to urge that forensic science be conducted rigorously and fairly—is laudable. But just as shabby science should not be tolerated in the courtroom or the police station, journalism that pays little heed to the facts should not be acceptable in serious publications.

I’ll give one example, chosen because it pervades the publication. The article begins with the claim that “[f]orensic analysis of semen introduced at trial had convinced the jury that [Earl] Washington [Jr.] ... had brutally raped and murdered a young woman in 1982.” It asks, “[h]ow could forensic evidence, widely seen as factual and unbiased, nearly send [this] innocent person to his death?” It ends with the plaintive thought that “[o]ur national experiment in untested forensics may soon be coming to a close. But it hasn’t ended in time to prevent a few more people like Earl Washington from being sacrificed on the altar of pseudoscience.”

The conviction and exoneration of Earl Washington have much to teach us about criminal justice. But it would be hard to find a worse example of “an innocent man being sacrificed on the altar of pseudoscience.”  There was no forensic evidence—scientific or pseudoscientific—introduced in the trial. Had there been, the outcome might have been different. This is the conclusion that follows from the description of the case in an important book, Convicting the Innocent, by Professor Brandon Garrett.

Garrett's research reveals that the police made every effort to keep science away when they built their case around a classic false confession from a “borderline mentally retarded farmhand” 2/ with convincing detail fed to him by police. One officer was found in a later civil rights action to have “fabricated the confession.” 3/ The alleged confession included the revelation (known to the police) of the killer’s blood-stained shirt with a torn-off patch left in the victim’s dresser drawer. Although forensic analysts had excluded five other suspects as possible sources of hairs found in the shirt pocket, “police instructed the state crime laboratory not to compare [Washington’s] hairs.” 4/

Even more telling—but untold—was the serological evidence in the case. According to Mr. Stern, it was “semen introduced at trial” that “convinced the jury.” But no semen was introduced at trial. No “semen analysis,” as Mr. Stern calls it, was offered into evidence. If only it had been!

“The semen-stained blanket from the victim’s bed was blood-typed, and that rudimentary technique had ruled out Washington.” 5/ The prosecutor would hardly want to introduce this evidence. (Indeed, it is hard to see how he ethically could go to trial without having proof that the blood-typing was incorrect.) As for the inexperienced defense counsel, 6/ “[t]he lawyer later said that while he saw the forensic reports, he ‘was not familiar with the significance of the analysis.’” 7/ Worse still, “the state concealed crucial evidence of innocence, including forensic evidence, from the defense.” 8/

In short, presenting the conviction and near-execution of Earl Washington, Jr., as the example of “a decades-long experiment in which undertrained lab workers jettison the scientific method in favor of speedy results that fit prosecutors’ hunches” disguises the real lessons of the case. The Washington case is an awful illustration of (1) evidence of a false confession that could have been prevented by proper interviewing techniques (including recording the confession); (2) willful blindness on the part of the police and the prosecution to the warning signs in the confession; (3) suppression of and failure to pursue contradictory scientific evidence; and (4) ignorance of the scientific evidence that gave the lie to the alleged confession.

Are there real examples of “flawed forensics” contributing mightily to false convictions? Of course. Do we know how many? Not really, but whatever the precise number may be, there are too many such cases. An article making this now well-known point easily could have started with a more apropos example.

Were this the only defect in the article, one might chalk it up to a combination of the expectancy effect and poor research. Perhaps the writer picked the Washington case without worrying too much about the actual facts because he already knew what to expect. (Dare I say that Mr. Stern was not writing on a blank Slate?) Unfortunately, however, there are other inaccuracies in the article. I comment on them in the next posting.

Notes
  1. Mark Joseph Stern, Forensic Science Isn’t Science: Why juries hear—and trust—so much biased, unreliable, inaccurate evidence, Slate, June 11, 2014.
  2. Brandon L. Garrett, Convicting the Innocent: Where Criminal Prosecutions Go Wrong 145 (2011).
  3. Id. at 30.
  4. Id. at 35.
  5. Id. at 147.
  6. Id. at 147-48.
  7. Rather than present a vigorous defense—“[t]he entire defense case lasted only 40 minutes,” id. at 146, Washington’s lawyer—who had never tried a capital case before— “simply asked for the mercy of the jury.” Id. at 154. He did not even point out “the glaring inconsistencies” between the compliant confession and some of the facts in the case—including the race of the white woman who was murdered in front of her two children. Id. at 147. When asked whether she was white or black, Washington chose “black.” Id.
  8. Id. at 148 (note omitted). The “forensic evidence” in question seems to be the following:
    An analyst working for the Virginia Bureau of Forensic Science had tested stains on a central piece of evidence, a blue blanket found on the murdered victim’s bed, and found Transferrin CD, a fairly uncommon plasma protein that is most found in African-Americans. The analyst even ran a second test to double-check the result. The next year, when Earl Washington, Jr., was arrested, they tested his blood and found he did not possess the unusual Transferrin CD. The state did not give the defense the report indicating Washington was excluded by that characteristic. Instead, the state gave the defense an “amended” report. Without having done any new tests, the altered report stated that the results of the Transferrin CD testing “were inconclusive.” The original lab report came to light decades later when Washington filed a civil rights lawsuit after his exoneration.
    Id. at 108 (notes omitted). Inasmuch as the “inconclusive” serum protein test would not have much significance for the defense, I assume that the “rudimentary” blood-typing results that excluded Washington, which the defense saw but overlooked, would have been even more damaging to the prosecution than this amended test for Transferrin CD.

Sunday, June 8, 2014

Kansas Court of Appeals Rejects Post-King Challenge to DNA Collection on Arrest

This year, the Kansas Court of Appeals upheld the state's DNA-on-arrest law against a Fourth Amendment challenge. The result is not surprising in light of the Supreme Court's opinion in Maryland v. King. Still, there are some differences between the Maryland statute and the Kansas one, making the state court's decision not even to publish its opinion a little questionable.

Excerpts from the opinion and two quick comments on them follow:

State v. Biery
No. 109,344, 318 P.3d 1020 (Table)
2014 WL 802100 (Kan. Ct. App. Feb. 28, 2014)

PER CURIAM.

In the early morning hours of May 12, 2012, Hutchinson police observed a white male out walking. The officers approached, without lights or sirens activated or weapons drawn, and asked for identification. Police learned the man was [Willie] Biery and there was an outstanding arrest warrant for his failure to appear for a probation violation hearing. Biery was arrested.
At the jail, Biery emptied his pockets and revealed a small plastic baggie containing methamphetamine. Biery was charged with possession of methamphetamine and booked into the jail for both violations. Because possession of methamphetamine is a felony and his DNA was not on file, Biery was asked to provide a DNA sample, via buccal mouth swab ... . Biery refused. ...

Biery was [found guilty of] refusing to give a DNA sample, in violation of K.S.A.2011 Supp. 21–2511(e)(2). ...

On appeal, Biery's sole issue is whether the statutory scheme for the collection, handling, and storage of DNA samples ... is a violation of the Fourth Amendment to the United States Constitution and § 15 of the Kansas Constitution Bill of Rights. ...
The recent United States Supreme Court decision in [Maryland v.] King, 133 S.Ct. 1958 [(2013)], addressed this issue ... As part of their standard procedure for a person arrested and charged with felony offenses, Maryland police took a DNA sample by buccal swab ...

In determining whether the warrantless search was reasonable, the United States Supreme Court held the DNA collection statute served a legitimate government interest by providing a safe and accurate way to process and identify the persons taken into custody, reducing risk to police and those in police custody, ensuring criminals are available to be tried, assessing the danger an individual might pose to the public before setting bond, and reducing the possibility of innocent persons being wrongfully held. ... The Court also noted DNA collection is a search incident to a lawful arrest which, even lacking individual suspicion, is virtually unchallenged in American jurisprudence. ...
Hmm, the Supreme Court did not uphold the DNA collection in Maryland because it fell within the "search incident to arrest" exception to the warrant requirement. That exception allows police to search a person and his immediate surroundings only to prevent the individual taken into custody from using a weapon or destroying evidence. King applied a balancing test to recognize what is effectively a new exception.

Because the search was minimally intrusive, served a legitimate government interest, was reasonable due to an arrestee's reduced expectation of privacy, and protected against unwarranted disclosures, the Court affirmed the constitutionality of the Maryland statute. ...

... Biery claims the Maryland statute at issue in King is significantly different than the one in place in Kansas ... . Thus, ... Biery argues the Maryland statutory scheme was deemed constitutional because it provided sufficient safeguards against the accidental disclosure or misuse of such samples. Biery claims the Kansas statute lacks such safeguards and, therefore, fails to pass constitutional muster. ...

... Because ... the Kansas Bureau of Investigation (KBI) [must] comply with national standards regarding the collection and maintenance of DNA records, the State argues K.S.A.2011 Supp. 21–2511 provides sufficient statutory safeguards to be considered constitutional under King. ...

The State is correct. ... In regards to the dissemination of DNA information, Kansas law only allows release of DNA records and samples to “authorized criminal justice agencies.” ... Finally, the overall process is governed by the KBI, which “shall promulgate rules and regulations” for the collection and maintenance of samples; expungement and destruction of samples; and procedures in compliance with national standards for DNA records. ...
The Kansas statute differs in several ways that the court does not mention. It applies to a broader class of crimes than the Maryland law. It does not defer the DNA profiling until after an arraignment. It does not require the destruction of samples and profiles if there is no conviction. Apparently, the Kansas court did not consider these differences significant, and it proceeds to present the weaker Kansas provision for expungement as an argument for the reasonableness of the Kansas law.

We pause to note the charges leading to Biery's felony arrest have since been dismissed following the suppression of the evidence against him. ... With the dismissal of his case, K.S.A.2011 Supp. 21–2511(j)(1)(B) provides ... for ... a procedure which allows the defendant to petition to expunge and destroy the DNA samples and profile record in the event of a dismissal of charges, expungement or acquittal at trial.

Had Biery provided a sample, he could now proceed to ask the KBI to expunge and destroy his DNA sample. ...

Biery was lawfully under arrest for a felony at the time the buccal swab was requested. K.S.A.2011 Supp. 21–2511(e) requires that any person subject to a valid felony arrest to submit a buccal swab. The statute does not violate the Fourth Amendment to the United States Constitution or § 15 of the Kansas Constitution Bill of Rights and is constitutional. ...

Friday, June 6, 2014

Fingerprinting Errors and a Scandal in St. Paul

Reports of crime laboratory scandals and fraud have become legion. It is scandalous to convict and imprison people on the basis of contrived, fabricated, or even incompetently generated laboratory evidence. But not all scandals are equal. Consider, for example, the report last year in the ABA Journal that
the St. Paul, Minn., police department’s crime lab suspended its drug analysis and fingerprint examination operations after two assistant public defenders raised serious concerns about the reliability of its testing practices. A subsequent review by two independent consultants identified major flaws in nearly every aspect of the lab’s operation, including dirty equipment, a lack of standard operating procedures, faulty testing techniques, illegible reports, and a woeful ignorance of basic scientific principles. 1/
The article proceeds to describe deplorable conditions in the drug testing lab, but all it says about latent print work is that “[t]he city has since hired a certified fingerprint examiner to run the lab, who has announced plans to resume its fingerprint examination and crime scene processing operations, and begin the procedure for seeking accreditation.”

Curious as to what the latent-print examiners had been doing, I turned to a local newspaper article entitled "St. Paul Crime Lab Errors Rampant." It reported that “[t]he police department hired two consultants to work on improving the lab after a ... Court hearing last year disclosed flawed drug-testing practices” and that “the lab recently resumed fingerprint work by certified analysts.” 2/

One consultant, “Schwarz Forensic Enterprises of Ankeny, Iowa, ... studied the crime lab's latent fingerprint comparison, processing and crime scene units” and found that “[p]ersonnel appeared to have attended seminars and training, but there wasn't formal competency testing or a program to assess ongoing proficiency.”

These untested personnel offered an opportunity to see how poorly monitored analysts performed. Would they succumb to the widely advertised cognitive biases that might cause latent print examiners to declare matches that do not exist? Would they declare matches more frequently than certified examiners? Apparently not:
"'Despite these deficiencies, no evidence of erroneous identifications by latent print examiners was found; but we did find numerous examples of cases wherein examiners had failed to claim latent prints as suitable for identification and/or to identify prints to suspects,'
the Schwarz report said." In other words, the incidence of false negatives and missed opportunities to make identifications or exclusions was high, but no false-positive errors were found. “A review of 246 fingerprint cases found the unit successfully identified prints only ‘in cases where the print detail is of extraordinarily high quality.’”

This outcome is consistent with more rigorous studies showing that when latent print examiners make mistaken comparisons, the errors are usually false exclusions—not false matches. 3/ This tendency reflects a different sort of bias—an unwillingness to declare a match unless the match seems quite clear.

Of course, 246 instances without false positives from worrisome fingerprint analysts do not prove that they never make false matches. If this group were making false identifications 1% of the time, for instance, the probability that no false positives would be seen in a run of 246 independent cases (each with the same 1% false-match probability) would be (1 – .01)^246 ≈ 8%.
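The arithmetic in the preceding paragraph can be checked with a few lines of Python. This is only a sketch under the same assumptions stated there (independent cases, a constant false-match probability). The rates other than 1% are hypothetical values added for comparison and do not come from the Schwarz report.

```python
def prob_no_false_positives(p_false_match: float, n_cases: int) -> float:
    """Probability of seeing zero false matches in n_cases independent cases,
    each with the same false-match probability."""
    return (1.0 - p_false_match) ** n_cases


N_CASES = 246  # number of fingerprint cases reviewed, per the text

# Only the 1% rate appears in the text; the others are hypothetical.
for p in (0.001, 0.005, 0.01, 0.02):
    prob = prob_no_false_positives(p, N_CASES)
    print(f"false-match rate {p:.3f}: P(no false positives) = {prob:.3f}")
```

The lower the assumed false-match rate, the less surprising a clean run of 246 cases becomes, which is why such a run cannot by itself rule out a small but nonzero rate.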

The absence of false positives also is consistent with an intriguing 2006 report by Itiel Dror and David Charlton. 4/ These investigators had six experienced, certified, and proficiency-tested analysts examine sets of prints from four cases in which, years ago, the examiners had found exclusions and another four cases in which they had made identifications. The subjects did not realize that they had seen these prints before. In some instances of previous exclusions, the examiners were told that a suspect had confessed. In none of these cases did the examiners depart from their earlier judgment of an exclusion.

On the other hand, in cases of previous identifications, when examiners were told that the suspect was in police custody at the time of the crime, two examiners switched from an identification to an exclusion, and one switched to “cannot decide.” Although these sample sizes are too small to justify strong and widely generalizable conclusions, it looks like it is easier for information that is not needed for the analysis to prompt an exclusion than an individualization.

Dror and Charlton interpret their results as supporting (among other things) the claim “that the threshold to make a decision of exclusion is lower than that to make a decision of individualization.” This higher threshold would make it more difficult to bias an examiner to make a false identification than to make a false exclusion.

Did any of the 246 St. Paul cases involve contextual bias of one kind or another? If so, it would be interesting to find out if these examiners resisted contextual suggestions favoring identifications or exclusions in those cases. Audits like these could be helpful not only in getting laboratories with problems back on track, as in St. Paul, but also as a source of information on the risks of different types of errors in various settings and circumstances.

Notes
  1. Mark Hansen, Crime Labs Under the Microscope after a String of Shoddy, Suspect and Fraudulent Results, ABAJ, Sept. 2013
  2. Mara H. Gottfried & Emily Gurnon, St. Paul Crime Lab Errors Rampant, Reviews Find, Pioneer Press, Feb. 14, 2013
  3. See, e.g., Fingerprinting Under the Microscope: Error Rates and Predictive Value, Forensic Science, Statistics, and the Law, April 30, 2012; Fingerprinting Error Rates Down Under, June 24, 2012, Forensic Science, Statistics, and the Law.
  4. Itiel E. Dror & David Charlton, Why Experts Make Errors, 56 J. Forensic Identification 600-16 (2006)

Wednesday, June 4, 2014

Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 3)

After explaining that Florida’s statutory cutoff of –2σx corresponds to an IQ score of 70 (because IQ tests are normed to have a mean of 100 and a standard deviation of 15), Justice Kennedy observes that:
Florida's rule disregards established medical practice in two interrelated ways. It takes an IQ score as final and conclusive evidence of a defendant's intellectual capacity, when experts in the field would consider other evidence. It also relies on a purportedly scientific measurement of the defendant's abilities, his IQ score, while refusing to recognize that the score is, on its own terms, imprecise.
Here, I show that these two limitations on IQ scores are less “interrelated” than Justice Kennedy suggests.

The first issue: validity

The first limitation involves what social scientists call “validity”—the extent to which something measures the real quantity of interest. For example, measuring the volume of a box by attending only to one dimension, such as height, is invalid because it ignores the two other determinative variables of width and depth.

The second limitation concerns the precision or “reliability” of the measurement, regardless of validity. Being able to measure the height of a box to the nearest millimeter time after time achieves precision and reliability, but it still lacks validity (with respect to the variable of volume). Moreover, measuring width or depth, even somewhat imprecisely, will add to validity but will do nothing to enhance the precision of the measurement of height.

Likewise, the “other evidence” to which the Court referred does not make IQ scores any more precise. Rather, it relates to what is known in the trade as “adaptive functioning.” The opinion defines adaptive functioning as “the inability to learn basic skills and adjust behavior to changing circumstances.” The Court disparages the “mandatory cutoff” of –2σx because this cut-score means that
sentencing courts cannot consider even substantial and weighty evidence of intellectual disability as measured and made manifest by the defendant's failure or inability to adapt to his social and cultural environment, including medical histories, behavioral records, school tests and reports, and testimony regarding past behavior and family circumstances. This is so even though the medical community accepts that all of this evidence can be probative of intellectual disability, including for individuals who have an IQ test score above 70.
If one were to follow this we-need-another-variable theory of “intellectual disability” to its logical limit, no IQ score could preclude the more comprehensive assay of all forms of “substantial and weighty evidence of intellectual disability.” An individual with above-average IQ scores also might “manifest [a] failure or inability to adapt to his social and cultural environment [as shown by] medical histories, behavioral records, school tests and reports, and testimony regarding past behavior and family circumstances.”

But surely the Court cannot claim that the execution of intellectually gifted but maladapted criminals is cruel and unusual while the execution of intellectually gifted and socially well adjusted criminals is not. To avoid such anomalies, the Court follows the contemporary (and prior) mental health practice of limiting “intellectual disability” to “concurrent deficits in intellectual and adaptive functioning” (emphasis added), which requires “significantly subaverage intellectual functioning” in addition to mere “deficits in adaptive functioning.” If it is clear that an individual is able to function intellectually within a broad but “normal” range, then a state need not entertain a claim of “intellectual disability” based solely on problems in adaptive functioning. Therefore, an IQ within the normal range should suffice to displace the offender from the potentially death-disqualified group.

So the possibility of weighty evidence of deficits in adaptive functioning, although relevant to clinicians, turns out to be no explanation for why Florida cannot draw the line at –2σx. If evidence of an adaptive-function deficit does not bar the state from executing criminals with IQs in the broad range of normalcy, why cannot the state define all IQs above 70 (–2σx) as lying within that range? That “experts in the field would consider other evidence” than IQ scores is not an answer. The answer has to be that (1) there is a range above which IQ, in and of itself, is a valid measure of the absence of “intellectual disability,” and (2) this range does not extend all the way down to 70. When these conditions hold, experts would not (or need not) consider “other evidence.”

Ironically, the Court’s opinion contradicts the second proposition. It clearly implies that the state could use a perfectly precise IQ measurement just above –2σx as conclusive evidence of the absence of intellectual disability. But if that is so, then the problem is not the failure to allow evidence of adaptive functioning. It is solely the existence of nonzero measurement error in IQ scores.

The second issue: precision (reliability)

Apparently (and dubiously) reserving the term “scientific” for precise measurements, Justice Kennedy stated that the “purportedly scientific measurement of the defendant's abilities, his IQ score, ... is, on its own terms, imprecise.” The problem is that although “there is evidence that Florida's Legislature intended to include the measurement error in the calculation ... the Florida Supreme Court ... has held that a person whose test score is above 70, including a score within the margin for measurement error, does not have an intellectual disability ... .”

In other words, a legislature that wants to preclude the more elaborate evaluations of all offenders with IQ scores above 70 could do so if only it had a way to measure IQs with perfect accuracy. Because of the “measurement error” of IQ tests, this legislature must adopt a higher cutoff. The Court, relying on the diagnostic literature, repeatedly refers to a cutoff of 75 as assuring an adequate safety margin.

The dissent had harsh words for the choice of 75, and I will get to those later, after examining where the figure of 75 comes from. At this point, no excursion into statistical theory is required to recognize that there is something weird about saying that IQ scores are problematic because they are an incomplete measure of “intellectual disability,” but then using them—and only them—within a band that accounts only for the error in measuring IQ. By definition, this band does not attend to the other factors that should be part of the full analysis. To put it another way, if the problem lies with using IQ alone, the solution lies in defining the range of IQ scores in which the other factors realistically could produce a different diagnosis. However, the error in IQ measurements has no clear connection to the range in which the failure to look beyond IQ makes a difference.

The majority’s response is essentially that if the mental health profession generally agrees that incompleteness is only a significant concern within the logically unrelated range of IQ-score error, then that is all that the Cruel and Unusual Punishment Clause demands. To which the dissent replies that abdicating the line drawing to the professionals makes no constitutional sense and “will also lead to serious practical problems.”

The dissent’s peculiar proof of “instability”

The first such problem is “instability.” According to Justice Alito:
This danger is dramatically illustrated by the most recent publication of the APA, on which the Court relies. This publication fundamentally alters the first prong of the longstanding, two-pronged definition of intellectual disability that was embraced by Atkins and has been adopted by most States. In this new publication, the APA discards “significantly subaverage intellectual functioning” as an element of the intellectual-disability test. Elevating the APA's current views to constitutional significance therefore throws into question the basic approach that Atkins approved and that most of the States have followed. 1/
The American Psychiatric Association’s latest version of its venerable Diagnostic and Statistical Manual of Mental Disorders—the DSM-5—“was published in May 2013 amid a storm of controversy and bitter criticism.” 2/ In general, critics maintain that “D.S.M.’s diagnostic categories lacked validity, that they were not ‘based on any objective measures,’ and that, ‘unlike our definitions of ischemic heart disease, lymphoma or AIDS,’ which are grounded in biology, they were nothing more than constructs put together by committees of experts.” 3/ Neither opinion even hints at such turmoil. The majority genuflects to clinical expertise and guidelines. The dissent raises no questions about validity and subjectivity, but objects to substituting “the standards of professional associations, which at best represent the views of a small professional elite” for “the standards of the American people.”

As for “instability,” the DSM-5 has brought a profusion of new or redefined disorders, but it does not radically change the definition of “intellectual disability” or dispense with the criterion of “significantly subaverage intellectual functioning.” It simply substitutes the word “intellectual ... deficit” for “significantly subaverage.” The diagnostic criteria have remained remarkably similar over the 19 years between the DSM-4 and the DSM-5.

The DSM-5 specifies that the first diagnostic criterion that “must be met” is “A. Deficits in intellectual functions ... confirmed by ... both clinical assessment and standardized intelligence testing.” If the intelligence testing does not demonstrate subaverage performance, it is hard to see how it could confirm the existence of a meaningful deficit. Moreover, the DSM-5 elaborates, making it plain that significantly subaverage IQ remains a sine qua non for the diagnosis:
The essential features ... are deficits in general mental abilities (Criterion A) and impairment in everyday adaptive functioning ... (Criterion B) [with o]nset is during the developmental period (Criterion C). The diagnosis of ... is based on both clinical assessment and standardized testing ... . Intellectual functioning is typically measured with ... tests of intelligence. Individuals with intellectual disability have scores of approximately two standard deviations or more below the population mean, including a margin for measurement error (generally +5 points). On tests with a standard deviation of 15 and a mean of 100, this involves a score of 65–75 (70 ± 5).
Compare this to the DSM-4 (or the DSM-4-TR cited by Justice Alito, which uses the same words):
The essential feature of Mental Retardation is significantly subaverage general intellectual functioning (Criterion A) that is accompanied by significant limitations in adaptive functioning ... (Criterion B) [with] onset ... before age 18 years (Criterion C). ... General intellectual functioning is defined by the intelligence quotient ... obtained by assessment with ... intelligence tests ... . Significantly subaverage intellectual functioning is defined as an IQ of about 70 or below (approximately 2 standard deviations below the mean). It should be noted that there is a measurement error of approximately 5 points in assessing IQ, although this may vary from instrument to instrument ... . Thus, it is possible to diagnose Mental Retardation in individuals with IQs between 70 and 75 who exhibit significant deficits in adaptive behavior. Conversely, Mental Retardation would not be diagnosed in an individual with an IQ lower than 70 if there are no significant deficits or impairments in adaptive functioning.
Thus, there are wording changes over the 19 years from 1994 to 2013, but Criterion A remains Criterion A, IQ tests remain critical to the diagnosis, and the range of test scores that lend themselves to the diagnosis is the same. The APA has changed the emphasis somewhat, and it has spelled out the constructs a little more (in words not quoted here). Nevertheless, to claim that the shift “dramatically illustrate[s a] fundamental[] alter[ation in] ... the longstanding ... definition of intellectual disability” seems, well, melodramatic.

State laws that rely on –2σx plus a margin of safety for measurement error are compatible with Atkins, Hall, DSM-4, and DSM-5. Of course, whether this is a logically or functionally appropriate manner of defining “intellectual disability” for purposes of capital punishment is open to debate. Resolving this debate requires a more detailed and accurate understanding of the concept of measurement error than the Hall opinions provide.

Footnotes
  1. The second problem is that “changes adopted by professional associations are sometimes rescinded.” This problem is just a form of instability. The third problem is hypothetical (thus far) as it relates to intellectual disability determinations: “what if professional organizations disagree? The Court provides no guidance for deciding which organizations' views should govern.” The fourth and final “practical problem” is actually conceptual—and quite important. “[D]efinitions of intellectual disability ... are promulgated for use in making a variety of decisions that are quite different from the decision whether the imposition of a death sentence in a particular case would serve a valid penological end. ... [I]n determining eligibility for social services, adaptive functioning may be much more important.”
  2. Nat’l Health Service Choices, Controversy over DSM-5: New Mental Health Guide, Aug. 15, 2013.
  3. Gary Greenberg, The Rats of N.I.M.H., New Yorker, May 16, 2013 (quoting Thomas Insel, the director of the National Institute of Mental Health). 

Other postings in this series
  • Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 1) (introduction)
  • Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 2) (on standard deviation)

Monday, June 2, 2014

Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 2)

Justice Kennedy described the Florida law that prompted the trial court to reject Hall’s claim of intellectual disability as follows:
Florida's statute defines intellectual disability for purposes of an Atkins proceeding as “significantly subaverage general intellectual functioning existing concurrently with deficits in adaptive behavior and manifested during the period from conception to age 18.” Fla. Stat. § 921.137(1) (2013). The statute further defines “significantly subaverage general intellectual functioning” as “performance that is two or more standard deviations from the mean score on a standardized intelligence test.” Ibid. The mean IQ test score is 100. The concept of standard deviation describes how scores are dispersed in a population. Standard deviation is distinct from standard error of measurement, a concept which describes the reliability of a test and is discussed further below. The standard deviation on an IQ test is approximately 15 points, and so two standard deviations is approximately 30 points. Thus a test taker who performs “two or more standard deviations from the mean” will score approximately 30 points below the mean on an IQ test, i.e., a score of approximately 70 points.
Because standard deviations are fundamental to the Florida law, to the Court’s conclusions about it, and to its dicta regarding the lowest mandatory IQ cut-off that a state can use, I am going to be persnickety in unpacking this paragraph.

Although the Court is only discussing the standard deviation of IQ scores, “standard deviation” (SD) has a much broader meaning. When one considers the general meaning of the term, it becomes clear that the SD of the scores does not quite describe “how scores are dispersed in a population.” It merely indicates how much they are dispersed in either a population or a sample.

For example, the trial court heard testimony about at least four IQ test scores for Hall—71, 72, 73, and 80. (It declined to consider a score of 69 on another test because the psychologist who administered and scored the test was dead and Hall’s counsel had violated an order “to provide the State with the [underlying] testing materials and raw data.” Indeed, Hall had taken as many as nine IQ tests over a 40-year period.)  The standard deviation of the scores considered by the trial court is the square root of the average squared deviation from the mean—namely,

SD = {[(71–74)^2 + (72–74)^2 + (73–74)^2 + (80–74)^2]/4}^(1/2) ≈ 3.54.

Tossing in the excluded score of 69 increases the SD to 3.74. The SD increases because the additional score is below the range of the other four, thus creating more variability in the sample (and lowering the mean from 74 to 73).
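For readers who want to reproduce these figures, here is a minimal Python sketch using the standard library. It applies the same definition given above (the square root of the average squared deviation from the mean, dividing by the number of scores), which is what `statistics.pstdev` computes. The variable names are mine, chosen for illustration.

```python
from statistics import mean, pstdev

considered = [71, 72, 73, 80]      # scores the trial court considered
with_excluded = considered + [69]  # adding the score the court declined to consider

for label, scores in (("considered", considered), ("with 69", with_excluded)):
    # pstdev divides by n (not n - 1), matching the formula in the text
    print(f"{label}: mean = {mean(scores)}, SD = {pstdev(scores):.2f}")
```

The output should show the mean dropping from 74 to 73 and the SD rising from about 3.5 to about 3.7 once the lower score is included.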

Of course, the Court’s number of 15 for the SD of IQ scores does not come from Hall’s scores. At this point, I use his scores only to elaborate on the Court’s observation that a standard deviation is a statistic that indicates how much the numbers in some set of numbers fluctuate around their mean. The standard deviation of 15 IQ points is an estimate of how much the scores of everyone in the general population—a large batch of numbers indeed—would vary if everyone took the test. The average score would be approximately 100, and there would be a lot of scatter around this mean. (In fact, the raw scores on the test are transformed in light of their mean and SD to force them to have a desired mean and SD near 100 and 15, respectively.) And, yes, 100 – (2×15) = 70, so Florida’s choice of 2 SDs to demarcate “significantly subaverage general intellectual functioning” translates into a score of 70 on a test with this mean and SD.
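The parenthetical about transforming raw scores can be illustrated with a simple linear rescaling. This is only a sketch of the general idea, assuming a plain z-score transformation; actual test publishers use more elaborate norming procedures. The raw scores below are hypothetical numbers invented for the example.

```python
from statistics import mean, pstdev


def to_iq_scale(raw_scores, target_mean=100.0, target_sd=15.0):
    """Linearly rescale raw scores so they have the target mean and SD."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]


raw = [12, 18, 25, 31, 40, 44]  # hypothetical raw test scores, illustration only
scaled = to_iq_scale(raw)
cutoff = 100 - 2 * 15           # two SDs below the mean, as in the Florida statute

print([round(x, 1) for x in scaled])
print(f"mean = {mean(scaled):.1f}, SD = {pstdev(scaled):.1f}, cutoff = {cutoff}")
```

Whatever the raw scale looks like, the rescaled scores have a mean of 100 and an SD of 15, so "two standard deviations below the mean" always lands at 70.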

But the fact that every batch of numbers has a SD does not tell us “how [these numbers] are dispersed.” The numbers could be highly concentrated around a single value, with outliers on the flanks. Their distribution could be flat, with an equal fraction of the numbers spread out everywhere. The distribution might show clustering at several locations, and so on.

IQ scores, however, are dispersed approximately according to a “normal” or “Gaussian” curve. This distribution is the bell-shaped one prominent in elementary statistics courses. There are other bell-shaped curves, and all kinds of other interesting and important families of curves, but IQ scores, like many physical variables (such as weight and height), tend to be normally distributed across the members of a population (and hence in representative samples of that population).

The exact shape of all such normal distributions can be determined from two numbers—the mean and the standard deviation. The mean states where the bell sits, and the standard deviation determines how steeply its sides flow down from the top. You can see for yourself by entering your favorite means and standard deviations into the demonstration program in the OnlineStatBook.

Using the variable X to denote IQ scores and the symbol σx to designate their standard deviation, the particular normal distribution used in the Court’s calculation is such that 2.28% of the scores lie below 70 (which, as the Court calculated it, corresponds to –2σx), and 4.75% fall below 75 (which, for the mean of 100 and standard deviation σx of 15, corresponds to –1.67σx). The latter IQ score, x = 75, is significant because Hall conceded (and the Court seemed to agree) that Florida could have chosen this score as its cut-off. For example, the Court expressed dissatisfaction that, in light of its calculations, the effect of Florida’s cut-off of –2σx was to preclude legally effective “professional[] diagnose[s of] intellectual disability [in a case like Hall’s, for which] the individual's IQ score is 75 or below.”
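These tail percentages can be checked with Python's standard library, assuming (as the text does) an exactly normal distribution with mean 100 and SD 15. This is a sketch rather than a statement about any particular IQ test. Note that the 4.75% figure corresponds to a z-score rounded to –1.67; the unrounded z for a score of 75 gives a slightly larger share, about 4.8%.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # assumed idealized IQ distribution

for score in (70, 75):
    z = (score - iq.mean) / iq.stdev
    print(f"score {score}: z = {z:.2f}, share below = {iq.cdf(score):.2%}")

# Share below a z-score rounded to -1.67, on the standard normal curve
print(f"share below z = -1.67: {NormalDist().cdf(-1.67):.2%}")
```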

The dissent insisted that states should have more discretion to set cut-off scores. Unless –1.67σx (or 75) corresponds to the level of impairment that justifies a categorical rule, the majority has no satisfying reason to select one cut-off over the other. Why is the Court’s choice of 1.67 standard deviations below the mean the highest that the Constitution permits? Why is Florida’s two-standard-deviation rule insufficient?

The Court’s answer leans heavily on the standard error of measurement — another technical term that appears in the paragraph quoted above: “Standard deviation is distinct from standard error of measurement, a concept which describes the reliability of a test and is discussed further below.” But the standard error of measurement is also a standard deviation, one that is estimated, almost magically, from test reliability statistics. Thus, a more precise sentence would have been: “The standard deviation of all test scores is distinct from another standard deviation known as the standard error of measurement, which depends on the reliability of the test. We discuss the standard error of measurement below.”
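To make the link between reliability and the standard error of measurement less magical, here is a small sketch. It assumes the textbook formula from classical test theory, SEM = SD × sqrt(1 – reliability), which is not spelled out in the opinion or in this posting. The reliability coefficients are hypothetical values chosen only for illustration.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


SD_IQ = 15.0  # SD of IQ scores, from the Court's description

# Hypothetical reliability coefficients, illustration only
for r in (0.90, 0.95, 0.98):
    sem = standard_error_of_measurement(SD_IQ, r)
    print(f"reliability {r:.2f}: SEM = {sem:.2f} IQ points")
```

Under this formula, the more reliable the test, the smaller the standard error of measurement; a perfectly reliable test would have none.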

Other postings in this series

  • Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 1), May 29, 2014 (introduction)
  • Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 2), June 2, 2014 (on standard deviation)
  • Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 3), June 4, 2014 (on validity and the stability of the APA's diagnostic criteria)