Friday, April 29, 2016

False Justice and Prosecutors' Fallacies

False Justice, by Jim and Nancy Petro, is an engaging, first-person tale of a former Ohio Attorney General's involvement in correcting false convictions as well as a summary and refutation of, as the book's subtitle puts it, "Eight Myths that Convict the Innocent." 1/ The book reveals the frustrations that lawyers in the innocence movement know all too well, and it wisely warns prosecutors, police, and the public of pernicious fallacies about criminals and the criminal justice system.

But the book perpetuates a fallacy of a different sort -- a statistical fallacy often called, in legal circles, "the prosecutor's fallacy." 2/ With some types of trace evidence (particularly DNA evidence), it is feasible to estimate the probability of a match between a suspect and the trace evidence given that the suspect is not the source of the trace at the crime scene. We can write this coincidental match probability as P(Match | ~Source).

The problem is that the judge or jury wants to know the probability that a suspect is not the source of the DNA given that the suspect's profile matches: P(~Source | Match). These two conditional probabilities are conceptually distinct. Sometimes they are numerically identical or very close to one another, but other times they are not even close. The statistical or logical fallacy consists of naively treating P(Match | ~Source) as if it were P(~Source | Match).
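
How far apart can the two conditional probabilities be? A toy calculation in Python may help (the numbers are my own, chosen purely for illustration):

    # A population of 1,000,001 people contains one true source, and the
    # coincidental-match probability is 1 in 10,000.
    population = 1_000_001
    p_match_given_not_source = 1 / 10_000      # P(Match | ~Source)

    # Expected number of matching non-sources in the population:
    innocent_matchers = (population - 1) * p_match_given_not_source   # 100

    # Chance that a matching person is not the source (one source + ~100 others):
    p_not_source_given_match = innocent_matchers / (innocent_matchers + 1)

    print(p_match_given_not_source)              # 0.0001
    print(round(p_not_source_given_match, 2))    # 0.99

A match probability of 1 in 10,000 coexists with a 99% chance that a matching individual, standing alone, is not the source.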

The first instance of this transposition occurs at page 47, when the Petros quote from a letter intended to persuade a county prosecutor of Clarence Elkins' innocence (and of the guilt of a different inmate in the same cellblock):
We had a very convincing match. In a letter [the Ohio Innocence Project] and Elkins's attorneys ... informed the Summit County prosecutor that newly conducted DNA testing "conclusively exonerates Elkins and implicates Earl Mann in the murder and rapes in which Elkins was convicted." The letter explained that the full profile of the DNA from the girl's panties and Mrs. Judy Johnson's vaginal swab were "consistent with Earl Mann's DNA for full 12-point match."
How convincing was this match? This "full 12-point match" did not involve a random match probability P(Match | ~Source) as small as those for typical STR matches. It came from Y-STR testing. Ordinary forensic STR testing uses loci scattered across different pairs of autosomal chromosomes. For those STRs, estimating the probability of a 12-locus match would involve multiplying some 24 smallish allele frequencies, giving rise to tiny match probabilities for any given profile. Not so for forensic Y-STR testing. Y-STRs all lie on the single Y chromosome and are inherited father to son, as one package. Multiplying the population frequencies for individual Y-STRs would not make sense. 3/ Instead of multiplying,
As in all Y-STR DNA analysis, the odds of finding a match are calculated on how many times that specific configuration of markers has been seen in a particular database. In this case, Earl Mann's DNA, in a database of 4,000 samples, matched the crime scene DNA. The letter explained, "Thus far, it ... is a unique Y-STR profile, and there is less than a 1 in 4,000 chance that it is not Earl Mann who left his DNA at the crime scene in the most highly probative areas."
Presumably, "less than ... 1 in 4,000" refers to the fraction 1/4001 that expresses how often the profile has been seen -- only once -- compared to how many Y-STR profiles have been recorded -- 4,000 previous profiles plus Mann's. 4/ A 1/4001 "chance that it is not Earl Mann" given that the trace DNA and Mann's share the Y-STR profile is P(~Source | Match). In contrast, 1/4001 is the probability of randomly picking Mann's profile from a population in which 1 in 4,001 profiles is just like Mann's. That is P(Match | ~Source). Elkins's lawyers have transposed. To build their case against Mann, they have committed the prosecutor's fallacy.
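
The contrast between the two ways of arriving at a match probability can be sketched in a few lines of Python (the autosomal allele frequencies are hypothetical; the Y-STR count reflects the letter's 4,000-sample database):

    from functools import reduce

    # Autosomal STRs: independence across loci permits the product rule.
    # Twelve heterozygous loci, each contributing 2*p*q, mean multiplying
    # some 24 smallish allele frequencies together.
    loci = [(0.1, 0.2)] * 12                    # hypothetical p, q per locus
    per_locus = [2 * p * q for p, q in loci]
    autosomal_rmp = reduce(lambda a, b: a * b, per_locus)
    print(autosomal_rmp)                        # ~1.7e-17

    # Y-STRs are inherited as one package, so the counting method applies:
    # how often has the haplotype been seen in a reference database?
    occurrences, database_size = 0, 4000        # never seen in 4,000 samples
    ystr_frequency = (occurrences + 1) / (database_size + 1)
    print(ystr_frequency)                       # 1/4001, about 0.00025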

Now, Earl Mann was almost certainly guilty -- but not just because he had a matching profile. According to the 2000 Census, Summit County was home to approximately 140,000 men between the ages of 20 and 59. At a rate of 1 man per 4001, we would expect to find 140,000/4001 ≈ 35 of them with matching DNA. Looking at just the Y-STR match, it no longer sounds as if the chance that Mann was not the source is only 1/4001.

In fact, one could argue the chance that Mann was not the source is P(~Source | Match) = 34/35! After all, there were some 35 men in the right age range and locale for whom one could say "the full profile of the DNA from the girl's panties and Mrs. Judy Johnson's vaginal swab were 'consistent with ... for full 12-point match.'" Mann is just one of them. As such, for him, P(Source | Match) = 1/35; hence, P(~Source | Match) = 34/35.  5/
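
The Bayesian calculation sketched in note 5 takes only a few lines (under the note's assumptions: a uniform prior across the roughly 140,000 men, a certain match for the true source, and a 1/4001 coincidental match for everyone else):

    from fractions import Fraction

    men = 140_000                       # Summit County men aged 20-59 (2000 Census)
    prior = Fraction(1, men)            # uniform prior for each man, Mann included
    p_coincidence = Fraction(1, 4001)   # P(Match | ~Source) for any non-source

    # Bayes' rule for P(Source | Match):
    p_match = prior * 1 + (1 - prior) * p_coincidence
    p_source_given_match = prior / p_match

    print(float(p_source_given_match))       # ~0.028, about 1/36
    print(float(1 - p_source_given_match))   # ~0.972, about 35/36

The exact posterior, 4001/144000 ≈ 1/36, is a shade below the rounded 1/35 in the text because the true source is certain to match; either way, it is nowhere near the 1/4001 suggested by the transposition.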

The passage quoted above is not the only instance of transposition in False Justice. It occurs just about every time the Petros quote a random match probability. Most of these probabilities are so small that the resulting likelihood ratio would swamp any reasonable prior probability, making the fallacy somewhat academic in those instances. Still, False Justice does not get the meaning of small match probabilities quite right.

Notes
  1. Jim Petro & Nancy Petro, 2015. False Justice: Eight Myths that Convict the Innocent. Routledge: New York, NY (rev. ed.).
  2. William C. Thompson & Edward L. Schumann, 1987. Interpretation of Statistical Evidence in Criminal Trials: The Prosecutor's Fallacy and the Defense Attorney's Fallacy. Law and Human Behavior, 11(3): 167-187 (introducing the phrase).
  3. See, e.g., David H. Kaye, 2010. The Double Helix and the Law of Evidence. Harvard Univ. Press: Cambridge, MA.
  4. Another way to estimate the Y-STR profile frequency is more commonly used, but that is tangential to the issue of transposition.
  5. A better way to arrive at P(~Source | Match) is to apply Bayes' rule. That formula yields 34/35 if one assumes that Mann and every other man in Summit County in the age range mentioned has the same prior probability of being the source of the trace DNA and that everyone else in the world has a source probability of zero. 

Saturday, April 16, 2016

The Department of Justice's Plan for a Forensic Science Discipline Review

On March 21, the Department of Justice announced to the National Commission on Forensic Science that it will be
expanding its review of forensic testimony by the FBI Laboratory beyond hair matching to widely used techniques such as fingerprint examinations and bullet-tracing. Officials also said that if the initial review finds systemic problems in a forensic discipline, expert testimony could be reviewed from laboratories beyond the FBI that do analysis for DOJ. 1/
The head of the Department's Office of Legal Policy welcomed input from the Commission on the following topics:
  • How to prioritize disciplines
  • Scope of time period
  • Sampling particular types of cases
  • Consideration of inaccuracies
  • Levels of review
  • Legal and/or forensic reviewers
  • External review processes
  • Ensuring community feedback on methodology
  • Duty/process to inform parties 2/
On April 8, the Department quietly posted a Notice of Public Comment Period on the Presentation of the Forensic Science Discipline Review Framework in the Federal Register. The public comment period ends on May 9.

According to an earlier statement of Deputy Attorney General Sally Yates, the review is intended to "advance the practice of forensic science by ensuring DOJ forensic examiners have testified as appropriate in legal proceedings." Obviously, the criteria for identifying what is and is not "appropriate" will be critical. For example, which of the following examples of testimony about glass fragments (or paraphrases of the testimony) would be deemed inappropriate?
  • "In my opinion the refractory indices of the two glasses are consistent and they could have common origin." Varner v. State, 420 So.2d 841 (Ala. Ct. Crim. App. 1982). 
  • "Test comparisons of the glass removed from the bullet and that found in the pane on the back door, through which the unaccounted-for bullet had passed, revealed that all of their physical properties matched, with no measurable discrepancies. Based upon F.B.I. statistical information, it was determined that only 3.8 out of 100 samples could have the same physical properties, based upon the refractive index test alone, which was performed." Johnson v. State, 521 So.2d 1006 (Ala. Ct. Crim. App. 1986).
  • "Bradley was able to opine, to a reasonable degree of scientific certainty, that the glass standard and the third fragment had a 'good probability of common origin.'" People v. Smith, 968 N.E.2d 1271 (Ill. App. Ct. 2012).
  • "Blair Schultz, an Illinois State Police forensic chemist, compared a piece of standard laminated glass from defendant's windshield to a piece of glass from Pranaitis' clothing. He found them to have the same refractive index, which means that the two pieces of glass could have originated from the same source. The likelihood of this match was one in five, meaning that one out of every five pieces of laminated glass would have the same refractive index." People v. Digirolamo, 688 N.E.2d 116 (Ill. 1997).
  • "[O]ne of the glass fragments found in appellant's car was of common origin with glass from the victim's broken garage window. The prosecutor asked her if she were to break one hundred windows at random in Allen County, what would be the percentage of matching specimens she would expect. Over appellant's objection, she answered that if one hundred windows were broken, six of the windows would have the properties she mentioned." Hicks v. State, 544 N.E.2d 500, 504 (Ind. 1989).
Hopefully, the Department will learn from the FBI/DOJ Microscopic Hair Comparison Analysis Review 3/ and
(1) make public the detailed criteria it employs,
(2) use a system with measured reliability for applying these criteria, and
(3) make the transcripts or other materials under review readily available to the public.
Notes
  1. Spencer S. Hsu, Justice Department Frames Expanded Review of FBI Forensic Testimony, Wash. Post, Mar. 21, 2016. 
  2. Office of Legal Policy, U.S. Department of Justice, Presentation of the Forensic Science Discipline Framework to the National Commission on Forensic Science, Mar. 21, 2016.
  3. See David H. Kaye, Ultracrepidarianism in Forensic Science: The Hair Evidence Debacle, Washington & Lee Law Review (online), Vol. 72, No. 2, pp. 227-254, September 2015.
Related Posting
"Stress Tests" by the Department of Justice and the FBI's "Approved Scientific Standards for Testimony and Reports", Feb. 25, 2016

Saturday, April 2, 2016

Sample Evidence: What’s Wrong with ASTM E2548-11 Standard Guide for Sampling Seized Drugs?

Samuel Johnson once observed that “You don't have to eat the whole ox to know that it is tough.” Or maybe he didn't say this, 1/ but the idea applies to many endeavors. One of them is the testing of seized drugs. The law needs to—and generally does—recognize the value of surveys and samples in drug and many other kinds of cases. 2/ If the quantity of seized drugs is large, it is impractical and typically unnecessary to test every bit of the materials. Clear guidance on how to select samples from the population of seized matter would be helpful to courts and laboratories alike.

To accomplish this goal, the Chemistry and Instrumental Analysis Subject Area Committee of the Organization of Scientific Area Committees for Forensic Science (OSAC) has recommended the addition of ASTM International’s Standard Guide for Sampling Seized Drugs for Qualitative and Quantitative Analysis (known as ASTM E2548-11) to the National Institute of Standards and Technology (NIST) Registry of Approved Standards. Unfortunately, this "Standard Guide" is vague in its guidance, incomplete and out of date in its references, and nonstandard in its nomenclature for sampling.

The Standard does not purport to prescribe "specific sampling strategies." 3/ Instead, it instructs “[t]he laboratory ... to develop its own strategies” and “recommend[s] that ... key points be addressed.” There are only two key points. 4/ One is that “[s]tatistically selected units shall be analyzed to meet Practice E2329 if statistical inferences are to be made about the whole population.” 5/ But ASTM E2329 merely describes the kinds of analytical tests that can or should be performed on samples. It reveals nothing about how to draw samples from a population. So far, ASTM E2548 offers no guidance about sampling.

The other “key point” is that “[s]ampling may be statistical or non-statistical.” The statement is tautological (A is either X or not-X), “statistical” is never defined, and an explanatory note intensifies the ambiguity. It states that “[f]or the purpose of this guide, the use of the term statistical is meant to include the notion of an approach that is probability-based.” 6/ Does “probability-based” mean probability sampling (the subject of ASTM E105-10)? At least the latter has a well-defined meaning in sampling theory. 7/ It means that every unit in the sampling frame has a known probability of being drawn.

But even if this is what the ASTM E2548-11 Standard Guide means by “probability-based,” the phrase is not congruent with "statistical." The note indicates that even sampling that is not “probability-based” still can be considered "statistical sampling." Later parts of the Standard allow inferences to populations to be made from "statistical" samples but not from "non-statistical" ones. Using undefined notions of "statistical" and "non-statistical" as the fundamental organizing principle departs from conventional statistical terminology and reasoning. The usual understanding of sampling differentiates between probability samples -- for which sampling error readily can be quantified -- and other forms of sampling (whether systematic or ad hoc) -- for which statistical analysis depends on the assumption that the sample is the equivalent of a probability sample.

Thus, the statistical literature on sampling commonly explains that
If the probability of selection for each unit is unknown, or cannot be calculated, the sample is called a non-probability sample. Non-probability samples are often less expensive, easier to run and don't require a frame. [¶] However, it is not possible to accurately evaluate the precision (i.e., closeness of estimates under repeated sampling of the same size) of estimates from non-probability samples since there is no control over the representativeness of the sample. 8/
In contrast, because the ASTM Standard does not focus on probability sampling as opposed to other "statistical sampling," the laboratory personnel (or the lawyer) reading the standard never learns that "it is dangerous to make inferences about the target population on the basis of a non-probability sample." 9/

Indeed, Figure 1 of ASTM E2548 introduces further confusion about "statistical sampling." In this figure, a statistical “sampling plan” is either “Hypergeometric,” “Bayesian,” or “Other probability-based.” But the sampling distribution of a statistic is not a “sampling plan” (although it could inform one). A sampling plan should specify the sample size (or a procedure for stopping the sampling if results on the sampled items up to that point make further testing unnecessary). For sampling from a finite population without replacement, the hypergeometric probability distribution applies to sample-size computations and estimates of sampling error. But how does that make the sampling plan hypergeometric? One type of “sampling plan” would be to draw a simple random sample of a size computed to have a good chance of producing a representative sample. Describing a plan for simple random sampling, stratified random sampling, or any other design as “hypergeometric,” “Bayesian,” or “other” is not helpful.
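
To see the role the hypergeometric distribution properly plays, consider the familiar "all samples positive" computation of the kind discussed in the UNODC guidelines cited in note 13 (a sketch in Python; the 95%/90% goal is an illustrative choice, not a requirement of the Standard):

    import math

    def min_sample_size(N, proportion=0.9, confidence=0.95):
        """Smallest n such that, if all n randomly drawn units test positive,
        one can assert with the stated confidence that at least `proportion`
        of the N seized units contain the drug (hypergeometric model)."""
        # Worst case consistent with the claim being false: the largest
        # number of positives that still falls short of the proportion.
        M = math.ceil(proportion * N) - 1
        for n in range(1, N + 1):
            # P(all n sampled units are positive | M positives among N)
            p_all_positive = math.comb(M, n) / math.comb(N, n)
            if p_all_positive <= 1 - confidence:
                return n
        return N

    print(min_sample_size(100))   # 23 of 100 bags suffice for a 95%/90% claim

The distribution fixes the sample size and quantifies the sampling error; the plan itself is simple random sampling.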

Similarly confusing is the figure’s trichotomy of “non-statistical” into the following “plans”: “Square root N,” “Management Directive,” and “Judicial Requirements.” Using the old √N + 1 rule of thumb for determining sample size may be sub-optimal, 10/ but it is “statistical” -- it uses a statistical computation to establish a sample size. So do any judicial or administrative demands to sample a fixed percentage of the population (an approach that a Standard should deprecate). No matter how one determines the sample size, if probability sampling has been conducted, statistical inferences and estimates have the same meaning.
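
Reusing min_sample_size from the sketch above, the rule of thumb and the hypergeometric computation can be compared directly:

    import math

    for N in (50, 100, 500, 1000):
        rule_of_thumb = math.isqrt(N) + 1            # the sqrt(N) + 1 rule
        print(N, rule_of_thumb, min_sample_size(N))

The point is not that one column is uniformly larger than the other, but that only the hypergeometric figure responds to the confidence level and the proportion to be demonstrated.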

Also puzzling are the assertions that “[a] population can consist of a single unit,” 11/ and that “numerous sampling plans ... are applicable to single and multiple unit populations.” 12/ If a population consists of “a single unit” (as the term is normally used), 13/ then a laboratory that tests this unit has conducted a census. The study design does not involve sampling, so there can be no sampling error.

When it comes to the issue of reporting quantities such as sampling error, the ASTM Standard is woefully inadequate. The entirety of the discussion is this:
7.1 Inferences based on use of a sampling plan and concomitant analysis shall be documented.

8.1 Sampling information shall be included in reports.
8.1.1 Statistically Selected Sample(s)—Reporting statistical inferences for a population is acceptable when testing is performed on the statistically selected units as stated in 6.1 above [that is, according to a standard that is on the NIST Registry with a disclaimer by NIST]. The language in the report must make it clear to the reader that the results are based on a sampling plan.
8.1.2 Non-Statistically Selected Sample(s)—The language in the report must make it clear to the reader that the results apply to only the tested units. For example, 2 of 100 bags were analyzed and found to contain Cocaine.
These remarks are internally problematic. For example, why would an analyst report the population size, the sample size, and the sample data for “non-statistical” samples but not for “statistical” ones?

More fundamentally, to be helpful to the forensic-science and legal communities, a standard has to consider how the results of the analyses should be presented in a report and in court. Should not the full sampling plan be stated — the mechanism for drawing samples (e.g., blinded, which the ASTM Standard calls “black box” sampling, or selecting from numbered samples by a table of random numbers, which it portrays as not “practical in all cases”); the sample size; and the kind of sampling (simple random, stratified, etc.)? It is not enough merely to state that “the results are based on a sampling plan.”
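
Stating the mechanism need not be burdensome. For units that can be numbered, a documented simple random draw takes one line (a sketch; a published table of random numbers serves the same documentary purpose):

    import random

    def simple_random_sample(population_size, n, seed):
        """Labels of n units drawn without replacement from units
        numbered 1 through population_size; recording the seed makes
        the draw reproducible for the report."""
        rng = random.Random(seed)
        return sorted(rng.sample(range(1, population_size + 1), n))

    print(simple_random_sample(100, 23, seed=2016))   # e.g., 23 of 100 bags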

When probability sampling has been employed, a sound foundation for inferences about population parameters will exist. But how should such inference be undertaken and presented? A Neyman-Pearson confidence interval? With what confidence coefficient? A frequentist test of a hypothesis? Explained how? A Bayesian conclusion such as “There is a probability of 90% that the weight of the cocaine in the shipment seized exceeds X”? The ASTM Standard seems to contemplate statements about “[t]he probability that a given percentage of the population contains the drug of interest or is positive for a given characteristic,” but it does not even mention what goes into computing a Bayesian credible interval or the like. 14/
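
As one illustration of what such a statement involves, the probability the Standard mentions can be computed from the same hypergeometric likelihood (a sketch assuming a uniform prior on the number of positive units -- one defensible prior among many):

    from math import comb

    def posterior_prob_at_least(N, n, threshold):
        """P(at least `threshold` of N units are positive), given that all
        n sampled units tested positive, under a uniform prior on the
        number of positives M in {0, ..., N}."""
        likelihood = [comb(M, n) / comb(N, n) for M in range(N + 1)]  # 0 when M < n
        return sum(likelihood[threshold:]) / sum(likelihood)

    # All 23 sampled bags (of 100) were positive; posterior probability
    # that at least 90 of the 100 contain the drug:
    print(round(posterior_prob_at_least(100, 23, 90), 2))

The answer turns on the prior, which is exactly the kind of modeling choice a reporting standard should require the analyst to disclose.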

The OSAC Newsletter proudly states that "[a] standard or guideline that is posted on either Registry demonstrates that the methods it contains have been assessed to be valid by forensic practitioners, academic researchers, measurement scientists, and statisticians through a consensus development process that allows participation and comment from all relevant stakeholders." The experience with ASTM Standards E2548 and E2329 suggests that even before a proposed standard can be approved by a Scientific Area Committee, the OSAC process should provide for a written review of statistical content by a group of statisticians. 15/

Disclosure and disclaimer: I am a member of the OSAC Legal Resource Committee. The information and views presented here do not represent those of, and are not necessarily shared by NIST, OSAC, any unit within these organizations, or any other organization or individuals.

Notes
  1. According to the anonymous webpage Apocrypha: The Samuel Johnson Sound Bite Page, the aphorism is "apocryphal because it's not found in his works, letters, or contemporary biographies about Samuel Johnson. But it is similar to something he once said about Mrs. Montague's book on Shakespeare: 'I have indeed, not read it all. But when I take up the end of a web, and find it packthread, I do not expect, by looking further, to find embroidery.'"
  2. See, e.g., David H. Kaye, David E. Bernstein & Jennifer L. Mnookin, The New Wigmore: A Treatise on Evidence: Expert Evidence (2d ed. 2011); Hans Zeisel & David H. Kaye, Prove It with Figures: Empirical Methods in Law and Litigation (1997).
  3. See ASTM E2548-11, § 4.1. 
  4. Id., § 4.2.
  5. § 4.2.2.
  6. § 4.2.1 (emphasis added).
  7. E.g., Statistics Canada, Probability Sampling, July 23, 2013:
    Probability sampling involves the selection of a sample from a population, based on the principle of randomization or chance. Probability sampling is more complex, more time-consuming and usually more costly than non-probability sampling. However, because units from the population are randomly selected and each unit's probability of inclusion can be calculated, reliable estimates can be produced along with estimates of the sampling error, and inferences can be made about the population.
  8. National Statistical Service (Australia), Basic Survey Design, http://www.nss.gov.au/nss/home.nsf/SurveyDesignDoc/B0D9A40C6B27487BCA2571AB002479FE?OpenDocument (emphasis in original).
  9. Id.
  10. See J. Muralimanohar & K. Jaianan, Determination of Effectiveness of the “Square Root of N Plus One” Rule in Lot Acceptance Sampling Using an Operating Characteristic Curve, Quality Assurance Journal, 14(1-2): 33-37, 2011.
  11. § 5.2.2.
  12. § 5.3.
  13. Laboratory and Scientific Section, United Nations Office on Drugs and Crime, Guidelines on Representative Drug Sampling 3 (2009).
  14. Cf. James M. Curran, An Introduction to Bayesian Credible Intervals for Sampling Error in DNA Profiles, Law, Probability and Risk, 4, 115-126, 2005, doi:10.1093/lpr/mgi009.
  15. Of course, no process is perfect, but early statistical review can make technical problems more apparent. Cf. Sam Kean, Whistleblower Lawsuit Puts Spotlight On FDA Technical Reviews, Science, Feb. 2, 2012.