Monday, August 29, 2011

The Expected Value Fallacy in State v. Wright

In State v. Wright, 253 P.3d 838 (Mont. 2011), a woman (identified by the Montana Supreme Court as Sierra) complained that her date raped her. No semen was recovered, but a penile swab from Timothy Wright, the man she accused, showed a mixture of DNA. Predictably, the major profile seemed to be the defendant's (presumably coming from his skin cells). The minor profile, however, matched Sierra's, confirming her accusation. The random-match probability was "1 in 467,700 Caucasians" and lower for other groups. Id. at 841. The direct examination of the state's DNA analyst included the following colloquy:
Q. When you're determining whether or not [Sierra's] DNA is on that penis, tell me what the language "cannot be excluded" means?
A. So that means that the 16 locations we looked at for a DNA profile was at every of those 16 locations.
Q. So whose DNA is on that penis, that penile swab that you examined at the Lab?
A. Well, it--Timothy Wright and [Sierra] can't be excluded as contributing to that profile.
Q. If you--if you're finding her DNA, how come your conclusion isn't that she's included in the profile? That confuses me.
A. At the Forensic Science Division we don't use the word "included." Instead we use "cannot be excluded." It basically means the same thing. It's just our terminology we use.
Id. The prosecutor pressed on, eliciting the statement that the woman's DNA was present in the penile swab:
Q. Can you explain--let's focus on the Caucasian statistic. Can you explain that statistic to the jury? What's it really mean?
A. So that means that in a population of 467,000 you would expect that one person in that population could be included in this mixture.
Q. All right. How many--what's the population of the state of Montana, do you know?
A. It's approximately a million, just under.
Q. So in this particular scenario we've got a mixture of two DNA's, right?
A. Yes.
Q. Statistically speaking, then, I'm just--I want to make sure I understand you, is there only--are there only two people in the state of Montana that can contribute those particular profiles?
A. Yes. Statistically looking at the state of--or the population of Montana two people in Montana would contribute to this mixture.
Q. Those being whom [sic] according to your test results?
A. According to the test results Timothy Wright and [Sierra].
Id. at 841-42.

The prosecution argued in closing that the woman was included as the contributor and that this meant her DNA was on the defendant. On appeal, the defense contended that the source attribution was a knowingly false statement by the prosecutor.

The Montana Supreme Court concluded that this contention was unfounded, but it also described the analyst's source attribution as "internally inconsistent" with the statements that the defendant "cannot be excluded as a contributor." However, there is no logical inconsistency in this testimony. A test that does not exclude someone includes that person. It may include other people when they are tested--or it may not. If the random-match probability is small enough, then the test would be expected to exclude all unrelated people, leaving the matching individual (or an identical twin or perhaps another close relative) as the only possible source of the DNA.

The actual problem in the case was not that the prosecutor knowingly misrepresented the testimony or deliberately elicited false testimony. It was that the analyst erred when she stated that "two people in Montana would contribute to this mixture . . . Timothy Wright and [Sierra]" solely on the basis of "the test results." The expert apparently reasoned that (1) Montana's population is about 1,000,000; (2) about 500,000 would be women; (3) the exact number of women in Montana with the minor profile is 1 (the random-match probability of roughly 1/500,000 times the female population of 500,000); hence, it is practically certain that no one but Sierra contributed the minor profile.

Technically, the quantity 1 is an expected value of a variable X that represents the number of women with the minor profile in a randomly generated population of 500,000 women. Expected values are all around us. If we flip a fair coin twice, the expected number of heads is (1/2) x 2 = 1. But we know quite well that the actual outcomes can vary about the expected value. For instance, the probability that two flips of the coin will produce 2 heads is 1/4. Over many repetitions of the two-flip experiment, this "unexpected value" is expected to occur about 25% of the time. Betting on exactly one head would produce many losses.
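
To see the distinction in miniature, here is a short Python sketch (my addition, not anything from the case record) that tabulates the exact distribution of heads in two flips of a fair coin:

```python
from math import comb

# Exact distribution of the number of heads, X, in two flips of a fair coin.
n, p = 2, 0.5
for x in range(n + 1):
    prob = comb(n, x) * p**x * (1 - p)**(n - x)
    print(f"P(X = {x}) = {prob:.2f}")
# P(X = 0) = 0.25, P(X = 1) = 0.50, P(X = 2) = 0.25
# The expected value n * p is 1, yet X = 1 occurs only half the time.
```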

How risky was the analyst's source attribution in Wright? The number of occurrences of the minor profile in the population is analogous to the number of heads that would occur when flipping a very heavily weighted coin 500,000 times. Imagine that we generate many populations of 500,000 profiles from a coin that has a probability of only 1/500,000 on each toss. Some populations will have 0 heads (minor profiles), some will have exactly 1, some will have 2, and so on. The number of heads, X, is approximately a Poisson random variable with mean λ = (1/500,000) x 500,000 = 1. Its probability distribution is f(x; λ) = λ^x e^(-λ)/x! = e^(-1)/x! ≈ 0.368/x!. For example, f(0; 1) = 0.368/0! = 0.368, and f(1; 1) = 0.368/1! = 0.368.
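
For readers who want to check the arithmetic, a minimal Python sketch of the Poisson probabilities used above:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability f(x; lam) = lam**x * exp(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)

lam = (1 / 500_000) * 500_000   # expected number of minor profiles = 1
for x in range(4):
    print(f"f({x}; 1) = {poisson_pmf(x, lam):.3f}")
# f(0; 1) = 0.368, f(1; 1) = 0.368, f(2; 1) = 0.184, f(3; 1) = 0.061
```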

What is the probability that the analyst is wrong in thinking that there is only 1 woman with the minor profile? Of the many populations we generated with the coin, we can ignore some 36.8% of them--the ones with 0 minor profiles. We can ignore them because the real population has at least one woman (Sierra) with the minor profile--and possibly more. This leaves 63.2% of the populations to consider.

The analyst will be wrong in asserting X = 1 when we have a population in which X is 2 or more. This situation occurs in every population for which X is not 0 or 1. There are 100% - 36.8% - 36.8% = 26.4% such populations, and all of them are within the 63.2% that apply to this case. Consequently, looking at the DNA evidence in isolation, the analyst errs in asserting that the minor profile from the swab is Sierra's in 26.4/63.2 = 41.8% of the possible populations.
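
The conditional-probability step can be verified in a few lines (again, a sketch of the computation described above, not anything offered in the case):

```python
from math import exp

p0 = exp(-1)   # probability of a population with no matching woman: 0.368
p1 = exp(-1)   # probability of exactly one matching woman: 0.368

# Condition on what we know: at least one woman (Sierra) has the profile.
p_error = (1 - p0 - p1) / (1 - p0)   # P(X >= 2 | X >= 1)
print(f"{p_error:.3f}")              # 0.418
```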

Yet, the jury was informed that the DNA test proved that Sierra's DNA was on the swab. The legal issue on appeal should have been whether this extravagant testimony, to which no objection was raised, was plain error or offended due process. Cf. McDaniel v. Brown, 130 S. Ct. 665 (2010) (not reaching the due process issue). But regardless of how those issues might be resolved, a well-trained DNA analyst should not have testified in this fashion. It is hardly news to the forensic science community that the expected number of DNA profiles in a population must be much less than 1 to strongly support an inference that the profile is unique within that population. See David J. Balding, Weight-of-Evidence for Forensic DNA Profiles 148 (2005) (describing the kind of reasoning employed in Wright as a "uniqueness fallacy"); see also Ian W. Evett & Bruce S. Weir, Interpreting DNA Evidence (1998).

Cross-posted from the Double Helix Law blog. An expanded version is published as David H. Kaye, The Expected Value Fallacy in State v. Wright, 51 Jurimetrics J. 1-8 (2011), https://ssrn.com/abstract=1921082.

Sunday, August 28, 2011

A Kiss is Just a Kiss: Lip Print Identification in the Criminal Law Bulletin

A periodical for lawyers, the Criminal Law Bulletin, recently published an article on “The Investigative and Evidential Uses of Cheiloscopy (Lip Prints).” [1] The author argues that “concerns about the reliability of lip prints evidence are unfounded” and “that lip prints evidence is admissible evidence.” The legal analysis rests on misconceptions about American and English law, but I won’t spell these out. Instead, I want to comment on the author’s approach to the validation of a technique for scientific identification.

According to the article, the following facts apparently demonstrate that “[c]heiloscopy has a scientific foundation”:

  • “Human lips are made up of wrinkles and grooves, just like fingerprints and footprints. Grooves are of several types and these groove types and wrinkles form the lip pattern which is believed to be unique to every individual.”
  • A 1970 article in the Journal of the Indian Dental Association reported that “none of the lip prints from the 280 Japanese individuals showed the same pattern.”
  • In a 2000 article in the same journal, “Vahanwala and Parekh studied the lip prints of 100 Indians (50 males and 50 females) and concluded that lip prints are unique to individuals” and that “It can therefore be conclusively said — Yes, lip prints are characteristic of an individual and behold a potential to be recognized as a mark of identification like the fingerprint!”
  • "Saraswathi et al. studied 100 individuals, made up of 50 males and 50 females, aged 18 - 30, and found that ‘no individual had single type of lip print in all the four compartments and no two or more individuals had similar type of lip print pattern.’"
  • “A study by Sharma et al [published, as was the previous study, in the Taiwanese Journal of Dental Sciences] also found that lip prints are unique to every individual.”
  • “Tsuchihashi studied 1364 Japanese individuals, 757 males and 607 females, aged 3-60 years and found no identical lip prints. . . . [T]he lips of the twins frequently showed patterns extremely similar to those of their parents [but nonetheless distinguishable].”

These studies do not address the relevant questions. The hypothesis that lip prints are unique is neither a necessary nor a sufficient condition for forensic utility. Before a method of identification can be considered valid, research should establish that multiple impressions from the same person are typically closer to one another than two impressions from different individuals [2] and that they are so much closer that an examiner can accurately classify pairs according to their source. Until these questions are asked and answered, the caution expressed in an article not mentioned in the Criminal Law Bulletin seems apt: “Although lip print identification has been utilized in the court of law in isolated cases, more research needs to be conducted in this field . . . .” [3]
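
To make the validation criterion concrete, here is a toy Python simulation; the feature vectors, noise level, and distance measure are all invented for illustration and have nothing to do with real lip-print data:

```python
import random

random.seed(1)

# Toy model, not a real cheiloscopy model: each person's "true" lip
# pattern is a 10-dimensional vector, and each impression of it adds noise.
def true_pattern():
    return [random.gauss(0, 1) for _ in range(10)]

def impression(pattern, noise=0.5):
    return [v + random.gauss(0, noise) for v in pattern]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

people = [true_pattern() for _ in range(200)]
within = [dist(impression(p), impression(p)) for p in people]
between = [dist(impression(people[i]), impression(people[i + 1]))
           for i in range(len(people) - 1)]

# The validation question: how often does a between-source pair look closer
# than a within-source pair? Large overlap means "uniqueness" proves little.
overlap = sum(b < w for b in between for w in within)
overlap /= len(between) * len(within)
print(f"P(between-source distance < within-source distance) ~ {overlap:.3f}")
```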

References

1. Norbert Ebisike, The Investigative and Evidential Uses of Cheiloscopy (Lip Prints), 47 Crim. Law Bull. No. 4, Art. 4 (Summer 2011)

2. David H. Kaye, Questioning a Courtroom Proof of the Uniqueness of Fingerprints, 71 Int’l Stat. Rev. 521 (2003), available at http://ssrn.com/abstract=944365

3. Shilpa Patel, Ish Paul, Madhu Sudan Astekar, Gayathri Ramesh, Sowmya G V, A Study of Lip Prints in Relation to Gender, Family and Blood Group, 1 J. Oral & Maxillofacial Pathology No. 1 (2010), abstract available at http://journalgateway.com/index.php/ijomp/article/view/V.1.I.1.Art1

Friday, August 26, 2011

The Transposition Fallacy in Matrixx Initiatives, Inc. v. Siracusano: Part II

The previous posting promised a simple example that would demonstrate the fallacy in claims such as this one:
For a p-value of .09, the odds of observing the AER [adverse event report] is 91 percent divided by 9 percent. Put differently, there are 10-to-1 odds that the adverse effect is “real” (or about a 1 in 10 chance that it is not).
Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents, Matrixx Initiatives, Inc. v. Siracusano, 131 S.Ct. 1309 (2011) (No. 09-1156).

Here is one such example. A bag contains 100 coins. One of them is a trick coin with tails on both sides; the other 99 are biased coins that have a 0.3 chance of coming up tails and a 0.7 chance of coming up heads. I pick one of these coins at random and flip it twice, obtaining two tails. On the basis of only this sample data (the two tails), you must decide which type of coin I picked. The p-value with respect to the “null hypothesis” (N) that the coin is a normal (albeit biased) heads-tails one is the probability of seeing two tails in the two tosses: p = 0.3 x 0.3 = 0.09. Should you reject the null hypothesis N and conclude that I flipped the unique tails-tails coin? Are the odds for this alternative hypothesis (A) 10:1, as the brief of the statistical experts asserts?

Of course not. Just consider repeating this game over and over. Ninety-nine percent of the time, you would expect me to pick a heads-tails coin. In 9% of those cases, you expect me to get tails-tails on the two tosses (9% x 99% = 8.91%). The other way to get tails-tails on the tosses is to pick the tails-tails coin. You expect this to happen about 1% of the time. Thus, the odds of the tails-tails coin given the data on the outcome of the tosses are 1% to 8.91% = 1:8.91, which is about 1:9. Despite the allegedly significant (in “practical, human, or economic” terms) p-value of 0.09, the alternative hypothesis remains improbable.
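
This reasoning is easy to check by simulation. Here is a minimal Python sketch (my addition, using the probabilities from the example) that plays the game a million times:

```python
import random

random.seed(0)

trick_tt = normal_tt = 0
for _ in range(1_000_000):
    trick = random.randrange(100) == 0          # 1 trick coin among 100
    p_tails = 1.0 if trick else 0.3             # trick coin always lands tails
    tt = random.random() < p_tails and random.random() < p_tails
    if tt:
        if trick:
            trick_tt += 1
        else:
            normal_tt += 1

# Odds of the trick coin given tails-tails: about 1 : 8.91.
print(trick_tt / normal_tt)   # roughly 0.112, i.e., 1/8.91
```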

A more formal derivation uses Bayes' rule for computing posterior odds. Let tt be the event that the coin I picked produced the two tails when tossed (the data), and let "|" stand for "given that" or "conditioned on." Then Bayes' rule reveals that

Odds(A|tt) = L x Odds(A),

where L is the "likelihood ratio" given by P(tt|A) / P(tt|N), and Odds(A) are the odds prior to flipping the coin. Because the two-tailed trick coin is certain to produce tails on both tosses, P(tt|A) = 1, so L = 1/0.09 = 100/9. Hence,

Odds(A|tt) = (100/9) Odds(A).

Because there is only 1 trick coin and 99 normal coins in the bag, the prior odds of A are Odds(A) = 1:99. Hence, the posterior odds are Odds(A|tt) = (100/9)(1/99) = 100/891 = 1:8.91. In other words, the odds for the alternative hypothesis are only about 1:9 -- practically the opposite of the 10:1 odds quoted in the statistics experts' brief.
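
The same calculation, written out in a few lines of Python mirroring the odds form of Bayes' rule used above:

```python
p_tt_given_A = 1.0          # the trick coin always produces tails-tails
p_tt_given_N = 0.3 * 0.3    # 0.09 for a normal (biased) coin

L = p_tt_given_A / p_tt_given_N     # likelihood ratio = 100/9
prior_odds = 1 / 99                 # 1 trick coin : 99 normal coins
posterior_odds = L * prior_odds     # 100/891
print(posterior_odds)               # 0.1122..., i.e., odds of about 1 : 8.91
```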

The lesson of this example is not that a statistic with a p-value of 0.09 always can be safely ignored. It is that the p-value, by itself, cannot be converted into a probability that the alternative hypothesis is true (“that the adverse effect is ‘real’”). Knowing that two tails arise only 9% of the time when a heads-tails coin is the cause does not imply that 9% is the probability that a heads-tails coin is the cause or that 91% is the probability that the tails-tails coin is the “real” cause. Statisticians have warned against this confusion of a p-value with a posterior probability time and again. The brief of "Amici Curiae Statistics Experts" thus brings to mind the old remark, "With friends like these, who needs enemies?" A more complete review of the brief is available at Nathan Schachtman's website (see Further Reading).

Further Reading

David H. Kaye et al., The New Wigmore, A Treatise on Evidence: Expert Evidence (2d ed. 2011).

Nathan A. Schachtman, The Matrixx Oversold, Apr. 4, 2011, http://schachtmanlaw.com/the-matrixx-oversold/

Friday, August 19, 2011

The Transposition Fallacy in Matrixx Initiatives, Inc. v. Siracusano: Part I

One might expect to hear phrases like “Not statistical significance there!” and “There is no way that anybody would tell you that these ten cases are statistically significant” hurled by a disgruntled professor at an underperforming statistics student. Yet, in January 2011, they came from the Supreme Court bench during the argument in Matrixx Initiatives, Inc. v. Siracusano.[1]

The issue before the Court was “[w]hether a plaintiff can state a claim under § 10(b) of the Securities Exchange Act and SEC Rule 10b-5 based on a pharmaceutical company's nondisclosure of adverse event reports even though the reports are not alleged to be statistically significant.” [2] In the case, the manufacturer of the Zicam nasal spray for colds issued reassuring press releases at a time when it was receiving case reports from physicians of loss of smell (anosmia) in Zicam users. The pharmaceutical company, Matrixx Initiatives, succeeded in getting a securities-fraud class action dismissed on the ground that the plaintiffs failed to plead “statistical significance.”

Because case reports are just a series of anecdotes, it is not immediately obvious how they could be statistically significant, but a determined statistician could compare the number of reports in the relevant time period to the number that would be expected under some model of the world in which Zicam is neither a cause nor a correlate of anosmia. If the observed number departed from the expected number by a large enough amount—one that would occur no more than about 5% of the time when the assumption of no association is true (along with all the other features of the model)—then the observed number would be statistically significant at the 0.05 level.
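
For concreteness, here is a sketch in Python of the kind of computation such a determined statistician might make; the expected and observed counts are hypothetical, invented purely for illustration:

```python
from math import exp, factorial

def poisson_sf(k, lam):
    """P(X >= k) for a Poisson random variable with mean lam."""
    return 1 - sum(lam**x * exp(-lam) / factorial(x) for x in range(k))

# Hypothetical numbers, for illustration only: suppose the background
# model predicts 4 anosmia reports in the period and 10 were observed.
expected, observed = 4, 10
print(f"one-sided p-value: {poisson_sf(observed, expected):.4f}")  # ~0.0081
```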

The Court rejected any rule that would require securities-fraud plaintiffs to engage in such statistical modeling or computation before filing a complaint. This result makes sense because a reasonable investor might want to know about case reports that do not cross the line for significance. Such anecdotal evidence could be an impetus for further research, FDA action, or product liability claims—any of which could affect the value of the stock. In rejecting a bright-line rule of p < 0.05, the Court made several peculiar statements about statistical significance and the design of studies, but these are not my subject for today. (An older posting, on March 25, has some comments on this issue.)

Instead, I want to look at a small part of an amicus brief from “statistics experts” filed on behalf of the plaintiffs. There is much in this brief, which really comes from two economists (or perhaps these eclectic scholars should be designated historians or philosophers of economics and statistics), with which I would agree (for whatever my agreement is worth). But I was shocked to find the following text in the “Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents”:
The 5 percent significance rule insists on 19 to 1 odds that the measured effect is real.26 There is, however, a practical need to keep wide latitude in the odds of uncovering a real effect, which would therefore eschew any bright-line standard of significance. Suppose that a p-value for a particular test comes in at 9 percent. Should this p-value be considered “insignificant” in practical, human, or economic terms? We respectfully answer, “No.” For a p-value of .09, the odds of observing the AER [adverse event report] is 91 percent divided by 9 percent. Put differently, there are 10-to-1 odds that the adverse effect is “real” (or about a 1 in 10 chance that it is not). Odds of 10-to-1 certainly deserve the attention of responsible parties if the effect in question is a terrible event. Sometimes odds as low as, say, 1.5-to-1 might be relevant.27 For example, in the case of the Space Shuttle Challenger disaster, the odds were thought to be extremely low that its O-rings would fail. Moreover, the Vioxx matter discussed above provides an additional example. There, the p-value in question was roughly 0.2,28 which equates to odds of 4 to 1 that the measured effect — that is, that Vioxx resulted in increased risk of heart-related adverse events — was real. The study in question rejected these odds as insignificant, a decision that was proven to be incorrect.

26. At a 5 percent p-value, the probability that the measured effect is “real” is 95 percent, whereas the probability that it is false is 5 percent. Therefore, 95 / 5 equals 19, meaning that the odds of finding a “real” effect are 19 to 1.

27. Odds of 1.5 to 1 correspond to a p-value of 0.4. That is, the odds of the measured effect being real would be 0.6 / 0.4, or 1.5 to 1.

28. Lisse et al., supra note 14, at 543-44.
Why is this explanation out of whack? The fundamental problem is that, within the framework of classical (Neyman-Pearson) hypothesis testing, hypotheses like “the adverse effect is real” or “a measured effect being real” do not have odds or probabilities attached to them. In Bayesian inference, statements like “the probability that the measured effect is ‘real’ is 95 percent, whereas the probability that it is false is 5 percent” are meaningful, but frequentist p-values play no role in that framework. Equating the p-value with the probability that a null hypothesis is true and regarding the complement of a p-value as the probability that the alternative hypothesis is true (that something is “real”) is known as the transposition fallacy. [3] That two “statistics experts” would rely on this crude reasoning to make an otherwise reasonable point is depressing.

The preceding paragraph is a little technical. Soon, I shall post a simple example that should make the point more concretely and with less jargon.

References

1. Transcript of Oral Argument, Matrixx Initiatives, Inc. v. Siracusano, 131 S.Ct. 1309 (2011) (No. 09-1156), 2011 WL 65028, at *12 & *16 (Kagan, J.).

2. Petition for Writ of Certiorari at i, Matrixx Initiatives, Inc. v. Siracusano, 131 S.Ct. 1309 (2011) (No. 09-1156), 2010 WL 1063936.

3. David H. Kaye, David E. Bernstein & Jennifer L. Mnookin, The New Wigmore: A Treatise on Evidence: Expert Evidence (2d ed. 2011).