Thursday, July 12, 2012

If the Shoe Fits, You Must Not Calculate It (Part III)

In R v T (see posts of July 9 and 10), the Court of Appeal for England and Wales was distressed that a footwear analyst used a database on the characteristics of shoes to verify his holistic impression that the forensic science evidence—the correspondence between the defendant’s shoes and the marks at a murder scene—constituted "a moderate degree of scientific evidence to support the view that the [shoes] had made the footwear marks." The court had this to say (in part) about the resort to the database:
Mr Ryder used the internal database of the FSS to examine the frequency of pattern. This recorded the number of shoes received by the FSS (in contradistinction to the number distributed within the United Kingdom ... ). The FSS database comprised approximately 0.00006 per cent of all shoes sold in a year. ...

It is evident from the way in which Mr Ryder identified the figures to be used in the formula for pattern and size that none has any degree of precision. The figure for pattern could never be accurately known. For example, there were only distribution figures for the United Kingdom of shoes distributed by Nike; these left out of account the Footlocker shoes and counterfeits. ...

More importantly, the purchase and use [of] footwear is also subject to numerous other factors such as fashion, counterfeiting, distribution, local availability and the length of time footwear is kept. A particular shoe might be very common in one area because a retailer has bought a large number or because the price is discounted or because of fashion or choice by a group of people in that area. There is no way in which the effect of these factors has presently been statistically measured; it would appear extremely difficult to do so, but it is an issue that can no doubt be explored for the future.

It is important to appreciate that the data on footwear distribution and use is quite unlike DNA. A person’s DNA does not change and a solid statistical base has been developed which enable accurate figures to be produced. Indeed as was accepted by Mr Ryder, the data for footwear sole patterns is a small proportion of what is in use and changes rapidly. [I]t would for these reasons be dangerous to use a straight statistical model.

Use of the FSS’s own database could not have produced reliable figures as it had only 8,122 shoes whereas some 42 million are sold every year. [T]he likelihood ratio calculated by using figures for the population as a whole is completely different from that calculated using the figures used by Mr Ryder based on the FSS database. There is also the further difficulty, even if it could be used for this purpose, that the data are the property of the FSS and are not routinely available to all examiners. It is only available in a particular case to an examiner appointed to consider the report of an FSS examiner.
The court’s uneasiness with the database involves two considerations—sample size and relevance. Each merits discussion.

Sample Size

The court’s dismissal of the sample as a mere "0.00006 per cent of all shoes sold in a year" stems from the intuition that accurate estimation of a population proportion requires a sample that constitutes a large fraction of the population of interest. This perception is common but statistically naïve. If the sample is random and the population is very large compared to the sample, the statistical uncertainty in the estimate depends on the absolute size of the sample, not the sample size relative to the population size.

To see this, imagine a huge container randomly packed with marbles of two colors (20% blue, 80% red). I mean a huge container: almost half of an entire football stadium is filled with marbles! They go halfway up the height of the bleachers. If we pick a large random sample (say, 10,000 marbles), it is likely to represent all the marbles in the stadium pretty well. It would be surprising if the proportion of blue marbles in the sample of 10,000 were very different from 20%.

Now we dump in truckloads more of well-mixed marbles in the same 20-80 proportion of colors until the stadium is overflowing with blue and red marbles. The population of marbles is now more than twice what it was before. Must we double the sample size to keep pace with the larger population?

Absolutely not. That the sample is now an even tinier fraction of the population than it was before is irrelevant. The same sample (size 10,000) will work as well for the packed stadium as it did for the partly full stadium. For large samples from very much larger populations, the precision of the sample estimate (of marble colors, shoe sizes, or what have you) essentially depends on the absolute size of the sample (actually, the square root of the sample size). The effect of population size is negligible. [1]
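The arithmetic behind this point can be sketched in a few lines of Python. The sketch assumes a simple random sample drawn without replacement and uses the textbook standard error of a sample proportion with the finite population correction. The function name and the marble counts (one billion, then three billion) are hypothetical and chosen only for illustration. The 8,122 and 42 million figures are the ones quoted from the judgment, and the proportion of 0.2 is an illustrative value.

```python
import math


def standard_error(p: float, n: int, population: int) -> float:
    """Standard error of a sample proportion under simple random sampling
    without replacement, including the finite population correction."""
    if population < 2 or not 0 < n <= population:
        raise ValueError("need population >= 2 and 0 < n <= population")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    base = math.sqrt(p * (1.0 - p) / n)                   # depends only on n
    fpc = math.sqrt((population - n) / (population - 1))  # finite population correction
    return base * fpc


# Hypothetical marble counts: the stadium before and after the extra truckloads.
half_full = standard_error(0.2, 10_000, 1_000_000_000)
overflowing = standard_error(0.2, 10_000, 3_000_000_000)

# Quadrupling the sample roughly halves the standard error (square-root rule).
bigger_sample = standard_error(0.2, 40_000, 3_000_000_000)

# Database size and annual sales as quoted from the judgment; p = 0.2 is illustrative.
fss = standard_error(0.2, 8_122, 42_000_000)

print(f"half-full stadium: {half_full:.6f}")
print(f"overflowing:       {overflowing:.6f}")
print(f"sample of 40,000:  {bigger_sample:.6f}")
print(f"8,122 of 42m:      {fss:.6f}")
```

Under these assumptions the standard error is about 0.004 (0.4 percentage points) for either marble population, and roughly 0.0044 for a random sample of 8,122 from 42 million. The calculation speaks only to sampling error. It says nothing about whether a database was drawn at random from the relevant population, which is the separate question of fit.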

Relevance (Fit)

Even if sample size is not the issue here, does the sample relate to the population of interest? This is a matter of relevance, or what the US Supreme Court called “fit” in the famous American case of Daubert v. Merrell Dow Pharmaceuticals. In many early DNA cases, courts thought they needed to know the frequencies of DNA profiles in a defendant's ethnic or racial group although that plainly was not the relevant population. In one New York case, for instance, the court worried that the defendant's profile might be common in the defendant's home town of Shushtar, Iran—even though the alleged sexual assault took place in a wealthy, suburban community in New York [2].

A possible disconnect between the population of shoes represented in the FSS database and the population of "innocent shoes" that, ideally, should be sampled means that the analyst should not overstate the value of the FSS data. Moreover, in some cases the available data might be too far afield to be particularly helpful, but usually having some systematically collected statistical information is better than having none, and a factfinder can appreciate the limitations in those background data.

Several articles on R v T make these points. Mike Redmayne, Paul Roberts, Colin Aitken, and Graham Jackson perspicaciously note that sample size is not the serious issue here. Rather,
The pertinent question is: which database is likely to provide the best comparators, relative to the task in hand? Shoe choice is influenced by social and cultural factors. Middle-aged university lecturers presumably buy different trainers to teenage schoolboys. And the shoes making the marks most often found at crime scenes are not the general public's most popular purchases. Consequently, the FSS database—comprising shoes owned by those who, like T, have been suspected of committing offences—may well be a more appropriate source of comparison data than national sales figures [3, pp. 354-55].
They add:
While the court was concerned that there “are, at present, insufficient data for a more certain and objective basis for expert opinion on footwear marks”, it is rarely helpful to talk about “objective” data in forensic contexts. Choice of data always involves a degree of judgement about whether a particular dataset is fit for purpose. At the same time, reliance on data is ubiquitous and inescapable. When the medical expert witness testifies that “I have never encountered such a case in forty years of clinical practice”, he is utilising data but he calls them “experience”, relying on memory rather than any formal database open to probabilistic calculations. It would be just as foolish to maintain that memory and experience are never superior to quantified probabilities in criminal litigation, as it would be to insist that memory and experience are always preferable to and should invariably displace empirical data and quantified probabilities in the courtroom.
There is a long-running debate in psychology over the relative merits of clinical versus statistical prediction, but who can deny that a bad statistical model can give less accurate results than insightful clinicians? Nonetheless, “objective” inferences based on publicly accessible data have something going for them even when they are merely comparable to gestalt judgments. They can avoid cognitive bias in highly subjective and complex decision-making. Statistical models for deciphering complicated DNA mixtures or for gauging similarities in fingerprints have this appeal.

Objective and Subjective Probabilities

Another incisive article on R v T, by Charles Berger, John Buckleton, Christophe Champod, Ian W. Evett, and Graham Jackson, pursues the objective-subjective dichotomy. Some of their remarks could be read as suggesting that objective probabilities are ultimately subjective:
[W]henever we are making an inference from a sample the data are always an incomplete representation of the full picture; furthermore, their relevance is a matter of judgement and the uncertainty that concerned the Court is an unavoidable feature of such inference. The probability that is quoted then will inevitably be a personal probability and the extent to which the data influence that probability will depend on expert judgement. This is not a process that can be governed strictly by mathematical reasoning but this does not make it any less “scientific”: scientists are called on to exercise personal judgement in all aspects of their several pursuits [4, p. 45 (emphasis added)].
That judgment is ubiquitous in scientific reasoning is a welcome antidote to idealized visions of science, but whether “the probability that is quoted then will inevitably be a personal probability” depends on what probability is quoted. Unlike the Bayesian, the frequentist does not quote a probability for the truth of an inference from the sample to a population. The frequentist, upon finding that the sample proportion is 0.2 (the figure in the FSS database), reports that 0.2 is a reasonable estimate of the proportion in the population from which the sample was drawn (assuming, of course, that the sample was drawn at random and is reasonably large). The frequentist relies strictly on the sample data to go from the sample data to the population parameter. Whether the FSS shoes were drawn at random and the nature of the population from which they were drawn are crucial to frequentists, but they are not matters that the frequentist expert can discuss in terms of probabilities.

A Bayesian, on the other hand, does not use just the sample proportion to estimate the population proportion. The Bayesian attaches prior probabilities to all possible values of the population proportion and modifies these in view of the sample data. If the Bayesian’s prior beliefs were concentrated at 0.5 (that is, the Bayesian strongly believed before looking at the FSS database that half the population of shoes had the pattern in question), then the Bayesian might report an estimated population proportion at a point closer to 0.5 than 0.2, perhaps 0.4.
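A minimal sketch of this kind of updating, assuming a conjugate Beta prior (a modeling choice made here for convenience, not one the post or the cited articles specify). The function name, the prior parameters, and the sample of 100 shoes with 20 showing the pattern are all hypothetical, chosen so that the numbers reproduce the 0.2 and 0.4 of the example.

```python
def posterior_mean(prior_a: float, prior_b: float, matches: int, n: int) -> float:
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior
    after observing `matches` shoes with the pattern among `n` sampled shoes."""
    if prior_a <= 0 or prior_b <= 0:
        raise ValueError("Beta prior parameters must be positive")
    if not 0 <= matches <= n:
        raise ValueError("need 0 <= matches <= n")
    return (prior_a + matches) / (prior_a + prior_b + n)


# Hypothetical sample: 20 of 100 shoes show the pattern.
matches, n = 20, 100
sample_proportion = matches / n                       # frequentist point estimate

# Beta(100, 100): prior beliefs concentrated near 0.5 (an assumption).
strong_prior = posterior_mean(100, 100, matches, n)

# Beta(1, 1): a flat prior over all proportions (an assumption).
flat_prior = posterior_mean(1, 1, matches, n)

# Same strong prior, ten times as much (hypothetical) data at the same proportion.
more_data = posterior_mean(100, 100, 200, 1_000)

print(sample_proportion, strong_prior, round(flat_prior, 3), more_data)
```

With the strong prior the posterior mean is 0.4 even though the sample proportion is 0.2. With the flat prior it stays close to 0.2, and with ten times as much data the same strong prior is pulled down to 0.25. How far the Bayesian estimate departs from the sample proportion therefore depends on how strong the prior is relative to the amount of data.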

This does not make the frequentist’s estimate into a personal probability, and it does not address the problems of model selection that affect the Bayesian and the frequentist alike. Both the frequentist and the Bayesian estimates (of 0.2 and 0.4, respectively) pertain to the population from which the FSS database was drawn, presumably at random.

Whether that population is the best one to consider is a further question. Both the Bayesian and the frequentist might agree that the locale in which the crime was committed is distinctive in ways that would make the population proportion there smaller than that of the population from which the FSS acquired its shoes. In that event, both could present their estimates as conservative (likely to favor the defendant). But the uncertainty has not converted the "objective" 0.2 figure into a personal probability. Frequentist probabilities are conditioned on a specific model. Whether a model is reasonable, frequentists would readily concede, is vital, but it is not a judgment to which they can or will assign a probability.

Berger et al. also observe that “[f]urthermore, the probabilities that the scientist is directed to address are always founded (even with DNA) on personal judgement. This is not a bad thing, it is an inescapable feature of science ... [4, p. 49].” That the scientist makes judgments is indeed inescapable. But does that make the probabilities or parameters that the scientist estimates personal and subjective? The objectivist would hold that it only means that (1) the scientist’s estimates of these quantities are based on assumptions; (2) the plausibility of the assumptions is a matter of judgment; and (3) it is a good thing to make these assumptions explicit so that other scientists and legal factfinders can consider whether they are sufficiently reasonable to make the objective probabilities helpful.

In short, Berger et al. are right—a court should not get carried away with the distinction between objective and subjective probabilities. I would merely add that neither should courts act as if there are no differences between them.

References

1. Hans Zeisel & David H. Kaye, Prove It with Figures: Empirical Methods in Law and Litigation (1997).

2. People v. Mohit, 153 Misc.2d 22, 579 N.Y.S.2d 990 (Westchester Co. Ct. 1992).

3. Mike Redmayne, Paul Roberts, Colin Aitken & Graham Jackson, Forensic Science Evidence in Question, Crim. L.R. 2011(5): 347–356.

4. Charles E.H. Berger, John Buckleton, Christophe Champod, Ian W. Evett & Graham Jackson, Evidence Evaluation: A Response to the Court of Appeal Judgment in R v T, Science and Justice 2011 51: 43–49.
