Wednesday, May 31, 2017

A Few Statistical and Legal Ideas About the Weight of Evidence

The expression “weight of evidence” has become popular among theorists of forensic science, where it is used to indicate the extent to which findings support the claim that two similar traces originated from the same source as opposed to the claim that they originated from different sources. Speaking more broadly, the idea is that the degree of corroboration a body of evidence provides for a theory or hypothesis depends on the probability of the evidence given that hypothesis compared to the probability of the evidence given other hypotheses. This notion has a rich intellectual history in philosophy, law, and statistics. A recent book review* discusses ways to quantify this measure of corroboration and the motivations for them. Some excerpts follow:

The “likelihood ratio” is a concept that pervades statistics. 31/ As [Richard] Lempert argued, it can be used to define whether an item of evidence is [logically] relevant. For example, in the 1990s researchers developed a prostate cancer test based on the level of prostate-specific antigen (“PSA”). The test, they said, was far from definitive but still had diagnostic value. Should anyone have believed them? A straightforward method for validation is to run the test on subjects known to have the disease and on other subjects known to be disease-free. The PSA test was shown to give a positive result (to indicate that the cancer was present) about 70% of the time when the cancer was, in fact, present, and about 10% of the time when the cancer was not actually present. Thus, the test has diagnostic value. The doctor and patient can understand that positive results arise more often among patients with the disease than among those without it.

But why should we say that the greater probability of the evidence (a positive test result) among cancer patients than among cancer-free patients makes the test diagnostic of prostate cancer? There are three answers. One is that if we use it to sort patients into the two categories, we will (in the long run) do a better job than if we use some totally bogus procedure (such as flipping a coin). This is a “frequentist” interpretation of diagnostic value.
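The frequentist point can be checked with a little arithmetic. The sketch below is only an illustration: it takes the 70% and 10% figures quoted above, assumes that "doing a better job" means a higher long-run fraction of correctly sorted patients, and tries a few hypothetical prevalence values that are not part of the original discussion.

```python
# Long-run accuracy of sorting patients by test result versus by coin flip.
# The two error rates come from the text; using accuracy as the yardstick and
# the prevalence values tried below are assumptions made for illustration.
SENSITIVITY = 0.70          # P(positive | cancer)
FALSE_POSITIVE_RATE = 0.10  # P(positive | no cancer)


def expected_accuracy(p_pos_given_cancer, p_pos_given_healthy, prevalence):
    """Long-run fraction of patients sorted into the correct category."""
    correct_if_cancer = p_pos_given_cancer          # a positive is the right call
    correct_if_healthy = 1.0 - p_pos_given_healthy  # a negative is the right call
    return prevalence * correct_if_cancer + (1.0 - prevalence) * correct_if_healthy


for prevalence in (0.10, 0.25, 0.50, 0.90):  # hypothetical prevalences
    psa = expected_accuracy(SENSITIVITY, FALSE_POSITIVE_RATE, prevalence)
    coin = expected_accuracy(0.5, 0.5, prevalence)  # heads = "cancer"
    print(f"prevalence {prevalence:.2f}: PSA {psa:.2f}, coin {coin:.2f}")
```

Under these assumptions the coin is right half the time whatever the prevalence, while the test is right between 70% and 90% of the time.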

A second justification takes the notion of “support” for a hypothesis as fundamental. 37/ Results that are more probable under a hypothesis H1 about the true state of affairs are stronger evidence for H1 than for any alternative (H2) under which they are less probable. If the evidence were to occur with equal probability under both states, however, the evidence would lend equal support to both possibilities. In this example, such evidence would provide no basis for distinguishing between cancer-free and cancer-afflicted patients. It would have no diagnostic value, 38/ and the test should be kept off the market. The coin-flipping test is like this. A head is no more or less probable when the cancer is present than when it is absent.

A difference between the “frequentist,” long-run justification and the “likelihoodist,” support-based understanding is that the latter applies even when we do not perform or imagine a long series of tests. If it really is more probable to observe the data under one state of affairs than another, it would seem perverse to conclude that the data somehow support the latter over the former. The data are “more consistent” with the state of affairs that makes their appearance on a single occasion more probable (even without the possibility of replication).

The same thing is true of circumstantial evidence in law. Circumstantial evidence E that is just as probable when one party’s account is true as it is when that account is false has no value as proof that the account is true or false. It supports both states of nature equally and is logically irrelevant. To condense these observations into a formula, we can write:
E is irrelevant (to choosing between H1 and H2) if P(E|H1) = P(E|H2),
where P(E|H1) and P(E|H2) are the probabilities of the evidence conditional on (“given the truth of,” or just “given”) the hypotheses. The conditional probabilities (or quantities that are directly proportional to them) have a special name: likelihoods. So a mathematically equivalent statement is that
E is irrelevant if the likelihood ratio L = P(E|H1) / P(E|H2) = 1.
A fancier way to express it is that E is irrelevant if the logarithm of L is 0. Such evidence E has zero “weight” when placed on a metaphorical balance scale that aggregates the weight of the evidence in favor of one hypothesis or the other. 39/ In this [prostate cancer] case, the likelihood ratio for a positive test result is 70% ÷ 10% = 7, which is greater than 1. Thus, the test is relevant evidence in deciding whether the patient has cancer. ...
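The relevance test just stated can be written out as a short sketch. The function names are invented for this illustration, and the base-10 logarithm is an assumption (the text does not fix a base). Exact fractions are used so that the PSA ratio comes out as exactly 7.

```python
import math
from fractions import Fraction


def likelihood_ratio(p_e_given_h1, p_e_given_h2):
    """L = P(E|H1) / P(E|H2)."""
    return p_e_given_h1 / p_e_given_h2


def is_relevant(p_e_given_h1, p_e_given_h2):
    """E is irrelevant to choosing between H1 and H2 exactly when L = 1."""
    return likelihood_ratio(p_e_given_h1, p_e_given_h2) != 1


def weight(p_e_given_h1, p_e_given_h2):
    """Logarithm of L (base 10 assumed here); zero for irrelevant evidence."""
    return math.log10(likelihood_ratio(p_e_given_h1, p_e_given_h2))


# PSA test: positive 70% of the time with cancer, 10% of the time without.
psa = (Fraction(7, 10), Fraction(1, 10))
# Coin flip: a head is equally probable whether or not cancer is present.
coin = (Fraction(1, 2), Fraction(1, 2))

print(likelihood_ratio(*psa), is_relevant(*psa))    # 7 True
print(likelihood_ratio(*coin), is_relevant(*coin))  # 1 False
print(weight(*coin))                                # 0.0
```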

Nothing that I have said so far involves Bayes’s rule. “Likelihood” and “support” are the primitive concepts. Lempert argued for a likelihood ratio of 1 as the defining characteristic of relevance by relying on a third justification—the Bayesian model of learning. How does this work? Think of probability as a pile of poker chips. Being 100% certain that a particular hypothesis about the world is correct means that all of the chips sit on top of that hypothesis. Twenty-five percent certainty means that 25% of the chips sit on the same hypothesis, and the remaining 75% are allocated to the other hypotheses. 42/ To keep things as simple as possible, let’s assume there are only two hypotheses that could be true. To be concrete, let’s say that H1 asserts that the individual has cancer and that H2 asserts that he does not. Assume that doctors know that men with this patient’s symptoms have a 25% probability of having prostate cancer. We start with 25% of the chips on hypothesis 1 (H1: cancer) and 75% on the alternative (H2: some other cause of the symptoms). Learning that the PSA test is positive for cancer requires us to take some of the chips from H2 and put them on H1. Bayes’s rule dictates just how many chips we transfer. The exact amount generally depends on two things: the percentage of chips that were on H1 (the prior probability) and the likelihood ratio L in this simple situation. ... [T]he very simple structure of Bayes’s rule in this case [is]
Odds(H1) · L = Odds(H1|E).
The rule requires updating the “prior odds” (on the left-hand side) by multiplying by the Bayes factor (which also is the likelihood ratio L) to arrive at the “posterior odds” (on the right-hand side). ...
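To see the chips move, the odds form of Bayes’s rule can be applied to the numbers already given: a prior probability of 25% and L = 7. The helper names below are made up for this sketch, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def probability_to_odds(p):
    """Odds in favor of H1, given its probability."""
    return p / (1 - p)


def odds_to_probability(odds):
    """Probability of H1, given the odds in its favor."""
    return odds / (1 + odds)


def update_odds(prior_odds, likelihood_ratio):
    """Odds(H1) * L = Odds(H1|E)."""
    return prior_odds * likelihood_ratio


prior = Fraction(1, 4)                 # 25% of the chips start on H1
L = Fraction(7, 10) / Fraction(1, 10)  # likelihood ratio of a positive PSA test

prior_odds = probability_to_odds(prior)      # 1/3, i.e. odds of 1:3
posterior_odds = update_odds(prior_odds, L)  # 7/3, i.e. odds of 7:3
posterior = odds_to_probability(posterior_odds)

print(prior_odds, posterior_odds, posterior)  # 1/3 7/3 7/10
```

With these inputs, a positive result moves the share of chips on H1 from 25% to 70%.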

The crucial point is that multiplication by L = 1 never changes the prior odds. Evidence that is equally probable under each hypothesis produces no change in the allocation of the chips—no matter what the initial distribution. Prior odds of 1:3 become posterior odds of 1:3. Prior odds of 10,000:1 become posterior odds of 10,000:1. The evidence is never worth considering. Again, we can get fancy and place the odds and the likelihood ratio on a logarithmic scale. Then the posterior log odds are the prior log odds plus the weight of the evidence (WOE = log-L):
New LO = Prior LO + WOE. 44/
Evidence that has zero weight (L = 1, log-L = 0) leaves us where we started. Evidence E that does not change the odds (and, hence, the corresponding probability) is uninformative—it is irrelevant. Inversely, evidence that does change the probability is relevant—as [Federal] Rule [of Evidence] 401 states in near-identical terms. This, in a nutshell, is the Bayesian explanation of the rule as it applies to circumstantial evidence. It tracks the text of the rule better than the likelihoodist, support-based analysis, but both lead to the conclusion that relevance vel non turns on whether the likelihood ratio departs from 1. ...
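The additive form can be illustrated with the same PSA numbers. This is a sketch under two assumptions: logarithms are taken to base 10 (the text does not specify a base), and the function names are hypothetical.

```python
import math


def log_odds(probability):
    """Base-10 logarithm of the odds corresponding to a probability."""
    return math.log10(probability / (1.0 - probability))


def weight_of_evidence(likelihood_ratio):
    """WOE = log-L (base 10 assumed)."""
    return math.log10(likelihood_ratio)


def log_odds_to_probability(lo):
    """Invert log_odds: recover the probability from base-10 log odds."""
    odds = 10.0 ** lo
    return odds / (1.0 + odds)


prior_lo = log_odds(0.25)      # prior odds of 1:3
woe = weight_of_evidence(7.0)  # positive PSA test, L = 7
new_lo = prior_lo + woe        # New LO = Prior LO + WOE

print(f"{prior_lo:.3f} + {woe:.3f} = {new_lo:.3f}")  # -0.477 + 0.845 = 0.368
print(f"{log_odds_to_probability(new_lo):.3f}")      # 0.700

# Zero-weight evidence (L = 1) leaves the log odds exactly where they started.
assert prior_lo + weight_of_evidence(1.0) == prior_lo
```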

... The simple likelihood ratio is the basic measure that dominates the forensic science literature on evaluative conclusions. However, most writers in this area construe the likelihood ratio as the ratio of posterior odds to prior odds and base its use on that purely Bayesian interpretation. Greater clarity would come from using the related term “Bayes factor” when this is the motivation for the ratio. 51/ [Note 51: The choice of words is not merely a labeling issue. In simple situations, the Bayes factor and the likelihood ratio are numerically equivalent, but more generally, there are conceptual and operational differences. For instance, simple likelihood ratios can be used to gauge relative support within any pair of hypotheses, even when the pair is not exhaustive. But when there are many hypotheses, the Bayes factor is not so simple. See [Peter M. Lee, Bayesian Statistics 140 (4th ed. 2012)], at 141–42. It becomes the usual numerator divided by a weighted sum of the likelihoods for each hypothesis. The weights are the probabilities (conditional on the falsity of the hypothesis in the numerator). For an example, see Tacha Hicks et al., A Framework for Interpreting Evidence, in Forensic DNA Evidence Interpretation 37, 63 (John S. Buckleton et al. eds., 2d ed. 2016). Furthermore, there is disagreement over the use of a likelihood ratio for highly multidimensional data (such as fingerprint patterns and bullet striations) and whether and how to express uncertainty with respect to the likelihood ratio itself. Compare Franco Taroni et al., Dismissal of the Illusion of Uncertainty in the Assessment of a Likelihood Ratio, 15 Law, Probability & Risk 1, 2 (2016), with M.J. Sjerps et al., Uncertainty and LR: To Integrate or Not to Integrate, That’s the Question, 15 Law, Probability & Risk 23, 23–26 (2016). ... ] ...
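The difference described in note 51 can be made concrete with a toy calculation. Everything in it is hypothetical: the three hypotheses, their prior probabilities, and the likelihoods are invented to show the arithmetic, and the function is a sketch of the weighted-sum description above rather than a formula taken from the cited sources.

```python
from fractions import Fraction


def bayes_factor(likelihoods, priors, target):
    """P(E|target) divided by a weighted sum of P(E|Hi) over the other Hi.

    Each weight is P(Hi | target is false) = P(Hi) / P(not target).
    """
    others = [h for h in likelihoods if h != target]
    mass = sum(priors[h] for h in others)  # P(not target)
    denominator = sum(likelihoods[h] * priors[h] / mass for h in others)
    return likelihoods[target] / denominator


# Hypothetical numbers for three mutually exclusive, exhaustive hypotheses.
priors = {"H1": Fraction(1, 4), "H2": Fraction(1, 2), "H3": Fraction(1, 4)}
likelihoods = {"H1": Fraction(7, 10), "H2": Fraction(1, 10), "H3": Fraction(4, 10)}

B = bayes_factor(likelihoods, priors, "H1")
print(B)                                    # 7/2
print(likelihoods["H1"] / likelihoods["H2"])  # 7   (simple ratio, H1 vs. H2)
print(likelihoods["H1"] / likelihoods["H3"])  # 7/4 (simple ratio, H1 vs. H3)

# B still converts prior odds on H1 into posterior odds on H1.
prior_odds = priors["H1"] / (1 - priors["H1"])
joint = {h: priors[h] * likelihoods[h] for h in priors}
posterior = joint["H1"] / sum(joint.values())
assert prior_odds * B == posterior / (1 - posterior)
```

In this toy example the simple ratios do not depend on the priors, but B does: shifting prior probability between H2 and H3 changes the weights and therefore the Bayes factor.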

The obvious Bayesian measure of probative value is the Bayes factor (B). In the examples used here, B is equal to the likelihood ratio L, and therefore the statisticians’ “weight of evidence” is WOE = log-B = log-L. 58/ The value of L in these cases tells us just how much more the evidence supports one theory than another and hence—this is the Bayesian part—just how much we should adjust our belief (expressed as odds) for any starting point. For the PSA test for cancer, L = 7 is “the change in odds favoring disease.” A test with greater diagnostic value would have a larger likelihood ratio and induce a stronger shift toward that conclusion. ... [T]he likelihood-ratio measure (or variations on it), which keeps prior probabilities out of the picture, is more typically used to describe the value of test results as evidence of disease or other conditions in medicine and psychology. Using the same measure in law has significant advantages. ...

NOTES
* David H. Kaye, Digging into the Foundations of Evidence Law, 115 Mich. L. Rev. 915 (2017) (reviewing Michael J. Saks & Barbara A. Spellman, The Psychological Foundations of Evidence Law (2016)).

31. Vic Barnett, Comparative Statistical Inference 306 (3d ed. 1999) (“The principles of maximum likelihood and of likelihood ratio tests occupy a central place in statistical methodology.”); see, e.g., id. at 178–80 (describing likelihood ratio tests in frequentist hypothesis testing); N. Reid, Likelihood, in Statistics in the 21st Century 419 (Adrian E. Raftery et al. eds., 2002).

37. A “support function” can be required to have several appealing, formal properties, such as transitivity and additivity. E.g., A.W.F. Edwards, Likelihood 28–32 (Johns Hopkins Univ. Press, expanded ed. 1992) (1972). It also can be derived, in simple cases, from other, arguably more fundamental, principles. E.g., Barnett, supra note 31, at 310–11.

39. See generally I. J. Good, Weight of Evidence and the Bayesian Likelihood Ratio, in The Use of Statistics in Forensic Science 85 (C.G.G. Aitken & D.A. Stoney eds., 1991); I. J. Good, Weight of Evidence: A Brief Survey, in 2 Bayesian Statistics 249 (J.M. Bernardo et al. eds., 1985) (providing background information regarding the use of Bayesian statistics in evaluating weight of evidence). ...


42. If the individual were to keep some of the chips in reserve, the analogy between the fraction of them on a hypothesis and the kind of probability that pertains to random events such as games of chance would break down.

44. A deeper motivation for using logarithms may lie in information theory, but, if so, it is not important here. See Solomon Kullback, Information Theory and Statistics (1959).

58. The logarithm of B has been called “weight of evidence” since 1878. I.J. Good, A. M. Turing’s Statistical Work in World War II, 66 Biometrika 393, 393 (1979) ... . While working in the town of Banbury to decipher German codes, Alan Turing famously (in cryptanalysis and statistics, at least) coined the term “ban” to designate a power of 10 for this metaphorical weight. Good, supra, at 394. Thus, a B of 10 is 1 ban, 100 is 2 ban, and so on.
