Tuesday, July 31, 2012

Supreme Court to Review DNA Swabbing on Arrest??

According to the SCOTUS blog,
Chief Justice John G. Roberts, Jr., calling tests of the DNA of individuals arrested by police 'a valuable tool for investigating unsolved crimes,' on Monday cleared the way for the state of Maryland to continue that practice until the Supreme Court can act on a challenge to its constitutionality. The Chief Justice’s four-page opinion is here. A Maryland state court ruling against the practice will remain on hold until the Justices take final action.
One should not read these words as stating that the stay is in effect until the Justices decide whether Maryland constitutionally can take DNA from mere arrestees. That would require two further actions by the Court—"granting cert" and extending the stay while the Court decides the case—both unusual events. The Court receives over 8,000 petitions per year asking it to issue writs of certiorari—orders for lower courts to send the record to the Supreme Court for its review. The court grants on the order of 100 of them. It takes only four votes to grant a petition. (It used to require five.) Justice Scalia once called wading through piles of petitions and supporting materials "the most ... onerous and ... uninteresting part of the job." [1]

Thus far, the Chief Justice has issued an order (on his authority as a Circuit Justice) temporarily blocking ("staying") the judgment of the Maryland Court of Appeals. The Court of Appeals judgment did not order the state to do anything (although its import hardly could be ignored). It reversed the decision of the state's intermediate appellate court (which had upheld the constitutionality of Maryland's DNA-on-arrest law) and remanded the case to that lower court for further proceedings. (I described some notable features of the original Maryland Court of Appeals opinion on April 26. [2])

The Chief Justice's order remains in effect only until the other Justices of the Supreme Court get around to voting on Maryland's petition for a writ of certiorari. At that point, one of three things will happen: (1) the Justices will grant the petition and continue the freeze on the Maryland judgment while the Court reviews the case; (2) the Justices will grant the petition but let the stay lapse while they hear the case; or (3) they will deny the petition and leave the judgment of Maryland's highest court undisturbed. [3]

Thus, the Court's "final action" might be merely to decide not to act on the merits of the challenge to the constitutionality of the Maryland law. Denying cert has no precedential value. But the Chief Justice's July 30 opinion predicts that the Court actually will review the case and issue an opinion that will uphold the constitutionality of the law. Because of the contentiousness of the constitutional question, the brief opinion is worth dissecting.

The Chief Justice begins with the observation that "there is a reasonable probability this Court will grant certiorari." He ought to know, but the reason he gives is not entirely convincing. He writes that:
Maryland’s decision conflicts with decisions of the U. S. Courts of Appeals for the Third and Ninth Circuits as well as the Virginia Supreme Court, which have upheld statutes similar to Maryland’s DNA Collection Act. ... The split implicates an important feature of day-to-day law enforcement practice in approximately half the States and the Federal Government. ... Indeed, the decision below has direct effects beyond Maryland: Because the DNA samples Maryland collects may otherwise be eligible for the FBI’s national DNA database, the decision renders the database less effective for other States and the Federal Government.
But this "split" is nothing like a split in the federal circuits on the constitutionality of the federal database law. That kind of split would throw a real monkey wrench into the operation of NDIS, the FBI's National DNA Index System. The split here only affects timing and a fraction of all DNA profiles. That is, for those individuals who are convicted anyway, not taking DNA on arrest in Maryland only delays the time at which their profiles go into the database. Once the offender profiles are entered, a weekly database trawl should link them to any profiles in the database of crime-scene samples. Of course, this delay is not without costs. For example, some arrestees will commit other crimes, up to and including murder, in the period between arrest and conviction.

With respect to arrestees who never are convicted of offenses that trigger inclusion in the database, the state loses the opportunity to trawl the crime-scene database for their DNA profiles. Some of these individuals might be connected to these unsolved crimes, but many will not be. Thus, the split does not shut down the database system. It does reduce its efficiency by an amount that is not clearly known. As the Chief Justice puts it, "the decision renders the database less effective."

Chief Justice Roberts also writes that "the decision below subjects Maryland to ongoing irreparable harm" because "[A]ny time a State is enjoined by a court from effectuating statutes enacted by representatives of its people, it suffers a form of irreparable injury." The latter quotation comes from the previous Chief Justice, who expressed this claim in New Motor Vehicle Bd. of Cal. v. Orrin W. Fox Co., 434 U. S. 1345, 1351 (1977) (REHNQUIST, J., in chambers). But the notion that every court order that blocks enforcement of a duly enacted law works an irreparable injury seems extravagant. Does the public suffer irreparable harm when someone on a Fort Lauderdale beach plays frisbee, flies a kite, attaches a hammock to a tree, or swims in long pants—all prohibited?

The more meaningful argument is that the Maryland ruling constitutes "an ongoing and concrete harm to Maryland’s law enforcement and public safety interests." The Chief Justice explains: "According to Maryland, from 2009—the year Maryland began collecting samples from arrestees—to 2011, 'matches from arrestee swabs [from Maryland] have resulted in 58 criminal prosecutions.'" But this statistic is wide of the mark. How many of these 58 prosecutions would the state have foregone had it been unable to enter the profiles at the point of arrest and instead had to wait until a conviction ensued?

In short, the Chief Justice is correct in stating that "in the absence of a stay, Maryland would be disabled from employing a valuable law enforcement tool for several months," but his opinion leaves unresolved the question of just how valuable it really is. This is a matter that surely will receive more attention if and when the full Court actually hears the case.

References

Wednesday, July 25, 2012

CODIS Loci Not Ready for Disease Prediction After All?

Last month, I noted the findings of a superior court in Vermont that "some of the CODIS loci have associations with identifiable serious medical conditions," making the scientific evidence "sufficient to overcome the previously held belief[s]" about the innocuous nature of the CODIS loci [1]. The judge based her conclusion in State v. Abernathy [2] that the CODIS loci now permit "probabilistic predictions of disease" on the unpublished views of biologist Greg Wray, who oversees the Center for Evolutionary Genomics and the DNA Sequencing Core Facility, within Duke University’s Institute for Genome Sciences and Policy.

A technical report accepted for publication in the Journal of Forensic Sciences seems to dispute these claims. Sara Katsanis, a staff researcher at the same Institute for Genome Sciences and Policy, and Jennifer Wagner, a research associate at the University of Pennsylvania’s Center for the Integration of Genetic Healthcare Technologies, searched the biomedical literature and genomic databases not only for associations with phenotypes in the current 13 loci used in offender databases, but also in ones that soon may be added to the system. They came up with “no evidence” that any particular CODIS single-locus genotypes “are indicative of phenotype.”

References

1. CODIS Loci Ready for Disease Prediction, Vermont Court Says, June 15, 2012.
2. State v. Abernathy, No. 3599-9-11 (Vt. Super. Ct. June 1, 2012).

Saturday, July 14, 2012

Going South with Shoeprint Testimony

On July 9, 10, and 12 ("If the Shoe Fits, You Must Not Calculate It"), I discussed the much-maligned opinion in R. v. T., [2010] EWCA Crim. 2439. Professor Mike Redmayne kindly called to my attention another Court of Appeal opinion on shoe print evidence: R. v. South, [2011] EWCA Crim. 754. It indicates that an expert who does not use the adjective "scientific" can present the "verbal equivalent" of a likelihood ratio when the precise value of the ratio is uncertain. Indeed, the witness can give a source probability as long as it is a personal judgment derived solely from individual experience rather than from systematically collected data on shoe prints. This situation is not the best of all possible worlds.

Students in Bournemouth found that their house had been burgled in the afternoon as two of them slept. “On the floor below the letterbox of the front door there were some envelopes which had footmarks on them. These envelopes were subsequently given to the police and they were forensically examined. The evidence concerning those footprints was adduced at the trial.” This was by no means the only evidence the police developed against Sergio South, who was known to them as a burglar, but it became fodder for his appeal.

The evidence from an FSS examiner (a Mr. Jones) resembled that of the FSS's Mr. Ryder in R. v. T. Both experts, quite reasonably, relied on size, pattern, and wear. Here, "this footprint was in agreement with the size, pattern, detailed alignment and degree of wear with the trainer of the appellant that had been seized from him upon arrest. The zigzag bar pattern and the curved tramline were similar, and the trainers, which were size 9, were consistent with the footprint which was of size 9 or 8 but not size 10."

As in R. v. T., the expert must have consulted the FSS likelihood ratio table of “verbal equivalents.” Defense counsel “submitted that ... Mr Jones had said that the evidence relating to the footprint was ‘moderately strong support’ for the proposition that the appellant's shoe had made the imprint on the envelopes.” The phrase “moderately strong” is reserved in the FSS table for likelihood ratios between 100 and 1000—the next rung up the ladder from the “moderate support” for the ratio of 100 in R. v. T.
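For readers who want the scale in one place, here is a minimal sketch in Python of the verbal-equivalent mapping as it is described in these posts. Only the two bands actually quoted here are filled in; the other FSS phrases mentioned in the testimony have no numerical ranges given in these posts, and the treatment of the boundary at exactly 100 is my assumption (R. v. T. treated a ratio of 100 as "moderate").

```python
def fss_verbal_equivalent(lr):
    """Illustrative mapping from a likelihood ratio to the FSS verbal scale,
    limited to the two bands quoted in these posts. Where the boundary at
    exactly 100 falls is an assumption."""
    if 10 <= lr <= 100:
        return "moderate support"
    if 100 < lr <= 1000:
        return "moderately strong support"
    return "outside the bands quoted in these posts"

print(fss_verbal_equivalent(100))   # "moderate support", the R. v. T. figure
print(fss_verbal_equivalent(500))   # the band Mr. Jones's phrase occupies
```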

So what distinguishes the cases? Surely not that South's feet were smaller or the likelihood ratio larger. Is it that Mr. Jones was asked on cross-examination where the verbal equivalent came from? Defense counsel advised the Court of Appeal that "Mr Jones had said that this expression reflected a statistical probability of the footprint having been made by the shoes of the appellant which was considerably more than a 50 per cent probability, because the linguistic phrases used, such as 'weak or limited support' or 'extremely strong support', were based on probability which was itself based on a logarithmic scale."

If this description of the cross-examination is correct, the expert's testimony was less defensible than that in R. v. T. It appears that Mr. Jones missed the point of the FSS's efforts to train analysts in estimating likelihood ratios. The raison d'etre for using likelihoods to arrive at a standardized expression for the strength of the evidence is to get away from testimony about source probabilities. Statements such as "a statistical probability of the footprint having been made by the shoes of the appellant ... was considerably more than a 50 per cent probability" are strictly verboten. The expert following the strength-of-evidence approach must confine himself to commenting on the degree to which the evidence supports the competing claims about the source of the impressions. It is the role of the jury, and not the business of the expert, to consider the probability of those claims. In addition, the expert (or the defense counsel) did not appreciate the fundamental difference between probabilities (of hypotheses about the origin of the marks) and likelihood ratios (which measure the support the evidence gives to those hypotheses). Whether this foggy cross-examination satisfied R. v. T.'s call for more transparency about the origin of an expert's description of the strength of the trace evidence is questionable.

Nevertheless, and even though Mr. Jones testified as "a scientist" who "had worked as in this area since 1982," the court concluded that his presentation "did not transgress in any way the guidelines set down by this court in R v T." The crucial fact for the court was that "Mr Jones' evidence was based on his experience." The South court described R. v. T. as stating “that if a footwear examiner expressed a view that went beyond saying that the footwear could or could not make the mark concerned, the report should make it clear that the view is subjective and based on experience of the examiner, so that words such as ‘scientific’ used in making evaluations should not in fact be used because they would, before a jury, give an impression of a degree of precision and objectivity which is not present given the current state of expertise.”

That Mr. Jones referred to "a statistical probability" did not seem to worry the court. Neither did the court perceive any problem with testimony "that he encountered the type of footwear seized from the appellant in only 2 per cent of cases that he dealt with as a forensic examiner of footwear and footprints" and "that burglars frequently used sports trainers." If "2 per cent" is a summary of 27 years of unrecorded personal experiences, it is hardly a rigorously ascertained "statistical probability," although it is a statistic and it yields a probability. And if Mr. Jones's understanding of the sartorial preferences of burglars informed his perception of "moderately strong" trace evidence yielding a posterior probability of "considerably more than 50 per cent," then he was exceeding the bounds of his expertise as a careful observer of similarities and differences and a keen analyst of the significance of those similarities and differences in impressions.

In sum, South indicates that the strictures of R. v. T. are easily avoided. But the courts and the forensic science profession can do better. They can implement a system of reporting and testifying that conveys opinions or information in terms of the strength of the evidence rather than the probability of source hypotheses. There is considerable support for this approach among forensic service providers in Europe. The English courts lag behind, as do both the forensic science profession and the courts in the United States.

Thursday, July 12, 2012

If the Shoe Fits, You Must Not Calculate It (Part III)

In R v T (see posts of July 9 and 10), the Court of Appeal for England and Wales was distressed that a footwear analyst used a database on the characteristics of shoes to verify his holistic impression that the forensic science evidence—the correspondence between the defendant’s shoes and the marks at a murder scene—constituted "a moderate degree of scientific evidence to support the view that the [shoes] had made the footwear marks." The court had this to say (in part) about the resort to the database:
Mr Ryder used the internal database of the FSS to examine the frequency of pattern. This recorded the number of shoes received by the FSS (in contradistinction to the number distributed within the United Kingdom ... ). The FSS database comprised approximately 0.00006 per cent of all shoes sold in a year. ...

It is evident from the way in which Mr Ryder identified the figures to be used in the formula for pattern and size that none has any degree of precision. The figure for pattern could never be accurately known. For example, there were only distribution figures for the United Kingdom of shoes distributed by Nike; these left out of account the Footlocker shoes and counterfeits. ...

More importantly, the purchase and use [of] footwear is also subject to numerous other factors such as fashion, counterfeiting, distribution, local availability and the length of time footwear is kept. A particular shoe might be very common in one area because a retailer has bought a large number or because the price is discounted or because of fashion or choice by a group of people in that area. There is no way in which the effect of these factors has presently been statistically measured; it would appear extremely difficult to do so, but it is an issue that can no doubt be explored for the future.

It is important to appreciate that the data on footwear distribution and use is quite unlike DNA. A person’s DNA does not change and a solid statistical base has been developed which enable accurate figures to be produced. Indeed as was accepted by Mr Ryder, the data for footwear sole patterns is a small proportion of what is in use and changes rapidly. [I]t would for these reasons be dangerous to use a straight statistical model.

Use of the FSS’s own database could not have produced reliable figures as it had only 8,122 shoes whereas some 42 million are sold every year. [T]he likelihood ratio calculated by using figures for the population as a whole is completely different from that calculated using the figures used by Mr Ryder based on the FSS database. There is also the further difficulty, even if it could be used for this purpose, that the data are the property of the FSS and are not routinely available to all examiners. It is only available in a particular case to an examiner appointed to consider the report of an FSS examiner.
The court’s uneasiness with the database involves two considerations—sample size and relevance. Each merits discussion.

Sample Size

The court’s dismissal of the sample as a mere "0.00006 per cent of all shoes sold in a year" stems from the intuition that accurate estimation of a population proportion requires a sample that constitutes a large fraction of the population of interest. This perception is common but statistically naïve. If the sample is random and the population is very large compared to the sample, the statistical uncertainty in the estimate depends on the absolute size of the sample, not the sample size relative to the population size.

To see this, imagine a huge container randomly packed with marbles of two colors (20% blue, 80% red). I mean a huge container—almost half of an entire football stadium is filled with marbles! They go halfway up the height of the bleachers. If we pick a large sample (say, 10,000 marbles), it is likely to represent all the marbles in the stadium pretty well. It would be surprising if the proportion of blue marbles in the sample of 10,000 were very different from 20%.

Now we dump in truckloads more of well-mixed marbles in the same 20-80 proportion of colors until the stadium is overflowing with blue and red marbles. The population of marbles is now three times what it was before. Must we triple the sample size to keep pace with the larger population?

Absolutely not. That the second sample is an even tinier fraction of the population than the first one is irrelevant. The same sample (size 10,000) will work as well for the packed stadium as it did for the partly full stadium. For large samples from very much larger populations, the precision of the sample estimate—of marble colors, shoe sizes, or what have you—essentially depends on the absolute size of the sample (actually, the square root of the sample size). The effect of population size is negligible. [1]
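A quick simulation makes the point concrete. The sketch below is a minimal illustration in Python (NumPy assumed available); the population sizes and the 20% blue figure simply restate the marble story and are not real data. Repeated samples of 10,000 drawn without replacement from a population three times larger scatter around 20% just as tightly (about 0.004 either way) as samples from the smaller population.

```python
import numpy as np

rng = np.random.default_rng(12345)

def sd_of_sample_proportion(population_size, sample_size, true_prop=0.20, trials=5000):
    """Simulate drawing `sample_size` marbles without replacement from a
    population that is 20% blue and return the standard deviation of the
    sample proportion of blue across repeated draws."""
    n_blue = int(true_prop * population_size)
    draws = rng.hypergeometric(ngood=n_blue,
                               nbad=population_size - n_blue,
                               nsample=sample_size,
                               size=trials)
    return (draws / sample_size).std()

# "Half-full" versus "packed" stadium: the same 10,000-marble sample gives
# essentially the same precision even though it is a three-times-smaller
# fraction of the second population.
print(sd_of_sample_proportion(population_size=10_000_000, sample_size=10_000))
print(sd_of_sample_proportion(population_size=30_000_000, sample_size=10_000))
```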

Relevance (Fit)

Even if sample size is not the issue here, does the sample relate to the population of interest? This is a matter of relevance, or what the US Supreme Court called “fit” in the famous American case of Daubert v. Merrell Dow Pharmaceuticals. In many early DNA cases, courts thought they needed to know the frequencies of DNA profiles in a defendant's ethnic or racial group although that plainly was not the relevant population. In one New York case, for instance, the court worried that the defendant's profile might be common in the defendant's home town of Shushtar, Iran—even though the alleged sexual assault took place in a wealthy, suburban community in New York [2].

A possible disconnect between the population of shoes represented in the FSS database and the population of "innocent shoes" that, ideally, should be sampled means that the analyst should not overstate the value of the FSS data. Moreover, in some cases the available data might be too far afield to be particularly helpful, but usually having some systematically collected statistical information is better than having none, and a factfinder can appreciate the limitations in those background data.

Several articles on R v T make these points. Mike Redmayne, Paul Roberts, Colin Aitken, and Graham Jackson perspicaciously note that sample size is not the serious issue here. Rather,
The pertinent question is: which database is likely to provide the best comparators, relative to the task in hand? Shoe choice is influenced by social and cultural factors. Middle-aged university lecturers presumably buy different trainers to teenage schoolboys. And the shoes making the marks most often found at crime scenes are not the general public's most popular purchases. Consequently, the FSS database—comprising shoes owned by those who, like T, have been suspected of committing offences—may well be a more appropriate source of comparison data than national sales figures [3, pp. 354-55].
They add:
While the court was concerned that there “are, at present, insufficient data for a more certain and objective basis for expert opinion on footwear marks”, it is rarely helpful to talk about “objective” data in forensic contexts. Choice of data always involves a degree of judgement about whether a particular dataset is fit for purpose. At the same time, reliance on data is ubiquitous and inescapable. When the medical expert witness testifies that “I have never encountered such a case in forty years of clinical practice”, he is utilising data but he calls them “experience”, relying on memory rather than any formal database open to probabilistic calculations. It would be just as foolish to maintain that memory and experience are never superior to quantified probabilities in criminal litigation, as it would be to insist that memory and experience are always preferable to and should invariably displace empirical data and quantified probabilities in the courtroom.
There is a long-running debate in psychology over the relative merits of clinical versus statistical prediction, but who can deny that a bad statistical model can give less accurate results than insightful clinicians? Nonetheless, “objective” inferences based on publicly accessible data have something going for them even when they are merely comparable to gestalt judgments. They can avoid cognitive bias in highly subjective and complex decision-making. Statistical models for deciphering complicated DNA mixtures or for gauging similarities in fingerprints have this appeal.

Objective and Subjective Probabilities

Another incisive article on R v T, by Charles Berger, John Buckleton, Christophe Champod, Ian W. Evett, and Graham Jackson, pursues the objective-subjective dichotomy. Some of their remarks could be read as suggesting that objective probabilities are ultimately subjective:
[W]henever we are making an inference from a sample the data are always an incomplete representation of the full picture; furthermore, their relevance is a matter of judgement and the uncertainty that concerned the Court is an unavoidable feature of such inference. The probability that is quoted then will inevitably be a personal probability and the extent to which the data influence that probability will depend on expert judgement. This is not a process that can be governed strictly by mathematical reasoning but this does not make it any less “scientific”: scientists are called on to exercise personal judgement in all aspects of their several pursuits [4, p. 45 (emphasis added)].
That judgment is ubiquitous in scientific reasoning is a welcome antidote to idealized visions of science, but whether “the probability that is quoted then will inevitably be a personal probability” depends on what probability is quoted. Unlike the Bayesian, the frequentist does not quote a probability for the truth of an inference from the sample to a population. The frequentist, upon finding that the sample proportion is 0.2 (the figure in the FSS database), reports that 0.2 is a reasonable estimate of the proportion in the population from which the sample was drawn (assuming, of course, that the sample was drawn at random and is reasonably large). The frequentist relies strictly on the sample data to go from the sample to the population parameter. Whether the FSS shoes were drawn at random and the nature of the population from which they were drawn are crucial to frequentists, but they are not matters that the frequentist expert can discuss in terms of probabilities.

A Bayesian, on the other hand, does not use just the sample proportion to estimate the population proportion. The Bayesian attaches probabilities to all possible prior beliefs about the population proportion and modifies these in view of the sample data. If the Bayesian’s prior beliefs were concentrated at 0.5 (that is, the Bayesian strongly believed before looking at the FSS database that half the population of shoes had the pattern in question), then the Bayesian might report an estimated population proportion at a point closer to 0.5 than 0.2—perhaps 0.4.
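The contrast can be put in a few lines of Python. The counts below are invented solely to reproduce the 0.2 versus roughly 0.4 contrast described above (the posts do not give the actual FSS counts), and the Beta prior is just one conventional way to encode a strong prior belief centered at 0.5.

```python
# Illustrative numbers only, chosen to reproduce the 0.2-vs-0.4 contrast in
# the text; they are not the actual FSS figures.
n, k = 1000, 200                        # sample size; shoes showing the pattern

# Frequentist point estimate: the sample proportion.
freq_estimate = k / n                   # 0.2

# Bayesian conjugate update: a Beta(a, b) prior concentrated at 0.5 combined
# with binomial data gives a Beta(a + k, b + n - k) posterior.
a, b = 1000, 1000                       # prior mean 0.5, very concentrated
posterior_mean = (a + k) / (a + b + n)  # (1000 + 200) / 3000 = 0.4

print(freq_estimate, posterior_mean)    # 0.2 versus 0.4
```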

This does not make the frequentist’s estimate into a personal probability, and it does not address the problems of model selection that affect the Bayesian and the frequentist alike. Both the frequentist and the Bayesian estimates (of 0.2 and 0.4, respectively) pertain to the population from which the FSS database was drawn, presumably at random.

Whether that population is the best one to consider is a further question. Both the Bayesian and the frequentist might agree that the locale in which the crime was committed is distinctive in ways that would make the population proportion smaller than that of the population from which the FSS acquired its shoes. In that event, both could present their estimates as conservative (likely to favor the defendant). But the uncertainty has not converted the "objective" 0.2 figure into a personal probability. Frequentist probabilities are conditioned on a specific model. Whether a model is reasonable, they would readily concede, is vital—but it is not a judgment to which they can or will assign a probability.

Berger et al. also observe that “[f]urthermore, the probabilities that the scientist is directed to address are always founded (even with DNA) on personal judgement. This is not a bad thing, it is an inescapable feature of science ... [4, p. 49].” That the scientist makes judgments is indeed inescapable. But does that make the probabilities or parameters that the scientist estimates personal and subjective? The objectivist would hold that it only means that (1) the scientist’s estimates of these quantities are based on assumptions; (2) the plausibility of the assumptions is a matter of judgment; and (3) it is a good thing to make these assumptions explicit so that other scientists and legal factfinders can consider whether they are sufficiently reasonable to make the objective probabilities helpful.

In short, Berger et al. are right—a court should not get carried away with the distinction between objective and subjective probabilities. I would merely add that neither should courts act as if there are no differences between them.

References

1. Hans Zeisel & David H. Kaye, Prove It with Figures: Empirical Methods in Law and Litigation (1997).

2. People v. Mohit, 153 Misc.2d 22, 579 N.Y.S.2d 990 (Westchester Co. Ct. 1992).

3. Mike Redmayne, Paul Roberts, Colin Aitken & Graham Jackson, Forensic Science Evidence in Questions, Crim. L.R. 2011(5) 347–356.

4. Charles E.H. Berger, John Buckleton, Christophe Champod, Ian W. Evett & Graham Jackson, Evidence Evaluation: A Response to the Court of Appeal Judgment in R v T, Science and Justice 2011 51: 43–49.

Wednesday, July 11, 2012

More on Statistical Reasoning and the Higgs Boson

A posting of July 6, "The Probability that the Higgs Boson Has Been Discovered," mentioned the transposition of a p-value in stories in the popular press about the discovery of what is likely to be the Higgs boson. Professor Dennis Lindley, a major figure in the development of Bayesian methods (and known to some readers of this blog as the author of a classic paper on using them to identify glass fragments), posed a few questions about the experiment via the list server of the International Society for Bayesian Analysis. One highly informed set of answers came from Louis Lyons (organiser of the PHYSTAT series of meetings and a member of the CMS Collaboration at CERN). The following is a slightly edited version of the comments of Lindley (DL) and Lyons (LL). The comments presuppose knowledge of the meaning of a p-value, a likelihood ratio, Bayes' rule, and the divide between frequentists and Bayesians. (The original text, as well as many other interesting messages, is at http://bayesian.org/forums/news/3648.)

DL:
Specifically, the news referred to a confidence interval with 5-sigma limits.

LL:
The test statistic we use for looking at p-values is basically the likelihood ratio for the two hypotheses (H_0 = Standard Model (S. M.) of Particle Physics, but no Higgs; H_1 = S.M with Higgs). A small p_0 (and a reasonable p_1) then implies that H_1 is a better description of the data than H_0. This of course does not prove that H_1 is correct, but maybe Nature corresponds to some H_2, which is more like H_1 than it is like H_0. Indeed in principle data will never prove a theory is true, but the more experimental tests it survives, the happier we are to use it -- e.g. Newtonian mechanics was fine for centuries till the arrival of Relativity.

In the case of the Higgs, it can decay to different sets of particles, and these rates are defined by the S.M.  We measure these ratios, but with large uncertainties with the present data. They are consistent with the S.M. predictions, but it could be much more convincing with more data. Hence the caution about saying we have discovered the Higgs of the S.M.

DL:
Five standard deviations, assuming normality, means a p-value of around 0.0000005. A number of questions spring to mind.

1.  Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?

LL:
This is an unfortunate tradition, that is used more readily by journal editors than by Particle Physicists. Reasons are
a) Historically we have had 3 and 4 sigma effects that have gone away

b) The 'Look Elsewhere Effect' (LEE). We are worried about the chance of a statistical fluctuation mimicking our observation, not only at the given mass of 125 GeV but anywhere in the spectrum. The quoted p-values are 'local' i.e. the chance of a fluctuation at the observed mass. Unfortunately the LEE correction factor is not very precisely defined, because of ambiguities about what is meant by 'elsewhere'

c) The possibility of some systematic effect (characterised by a nuisance parameter) being more important than allowed for in the analysis, or even overlooked - see the recent experiment at CERN which claimed that neutrinos travelled faster than the speed of light.

d) A subconscious use of Bayes Theorem to turn p-values into probabilities about the hypotheses.
All the above vary from experiment to experiment, so we realise that it is a bit unfair to use the same standard for discovery for all analyses. We prefer just to quote the p-values (or whatever).

DL:
2. Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis.  Are the particle physics community completely wedded to frequentist analysis?

LL:
No, we are not anti-Bayesian, and indeed our test statistic is a likelihood ratio. If you like, you can regard our p-values as an attempt to calibrate the meaning of a particular value of the likelihood ratio.

We actually recommend that for parameter determination at the LHC, it is useful to compare Bayesian and Frequentist methods. But for comparing hypotheses (e.g. an experimental distribution is fitted by H_0 = a smooth distribution; or by H_1 = a smooth distribution plus a localised peak), we are worried about what priors to use for the extra parameters that occur in the alternative hypothesis. We would welcome advice.

DL:
3. We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?

LL:
We are aware of this. But in fact, although the LHC has accumulated enormous amounts of data, the Higgs search is like looking for a needle in a haystack. The final samples of events that are used to look for the Higgs contain only tens to thousands of events.

These and related issues are discussed to some extent in my article "Open statistical issues in Particle Physics", Ann. Appl. Stat. Volume 2, Number 3 (2008), 887-915. It is supposed to be statistician-friendly.

Tuesday, July 10, 2012

If the Shoe Fits, You Must Not Calculate It (Part II)

The Court of Appeal in R. v. T. did not like Mr. Ryder’s truncated and standardized testimony (outlined in Part I). It complained about the supposedly small size of the FSS database, the recourse to a formula for combining information on different features, the mixture of objective and subjective probabilities in arriving at the undisclosed conditional probability of 1/100 and the unstated likelihood ratio of 100, the lack of any reference in his testimony and reports to these computations and the FSS likelihood table, and the use of the honorific “scientific” in front of “evidence” and “support.” In summarizing its reasons for judging the conviction "unsafe," the court emphasized the issue of transparency and completeness, writing that "the practice of using a Bayesian approach and likelihood ratios to formulate opinions placed before a jury without that process being disclosed and debated in court is contrary to principles of open justice."

Although Mr. Ryder insisted that he had merely used the figures to confirm what his "very extensive experience of footwear marks" already told him, the court saw the effort to reason more explicitly about the match as ammunition for cross-examination. Concluding that this cross-examination could have changed the outcome of the trial, the Court of Appeal quashed the conviction and ordered a retrial.

It suggested that on retrial, testimony not billed as scientific and based strictly on personal experience about the fact that the defendant’s Nike trainers “could have” been the source of the impressions would be acceptable. Beyond this, the court seemed willing to countenance "a more definite evaluative opinion" — as long as the "size or pattern" is "unusual" based on "years studying this kind of comparison." That kind of opinion would be fine, the court wrote, because "[i]t is a judgment based on his experience"  and "without any figures or mathematical formula."

There is much to criticize in this court’s reasoning. As many authors have noted, surely an expert whose intuitive or experiential impressions give rise to a judgment about the source hypothesis should be encouraged to consult the available statistical data and to consider their limitations to produce a fully informed judgment.

Less prominent in the writing on the case is the fact that, despite a description to the court, from Visiting Professor and Scientist and Scholar Allan Jamieson, of Mr. Ryder's "approach as 'the Bayesian approach' of using likelihood ratios," likelihoodism and Bayesianism are hardly the same. As explained in the previous posting (Part I), Mr. Ryder never spoke of prior or posterior odds or of source probabilities. He merely attached some words—"moderate support"—to his unstated estimate of the likelihood ratio. The court's condemnation of "the practice of using a Bayesian approach" therefore seems inapposite.

To be sure, "Bayesianism" can be used to motivate the likelihood ratio as a measure of probative value, but that does not make the simple presentation of a likelihood ratio “the Bayesian approach.” In fact, the likelihood school of statistical inference abjures the use of prior probability distributions, and it does not use Bayes' rule in coming to decisions about hypotheses. Likelihoodism maintains that the statistician should be concerned only with whether the evidence provides increased or decreased support for one hypothesis over another. A likelihoodist would find the presentation of the likelihood ratio itself, without any Bayesian baggage or interpretation, entirely appropriate.

Of course, this is not to say that either the likelihoodist or the Bayesian would agree that the particular likelihood ratio kept out of sight in R. v. T. should be admissible. Although the likelihoodist would be pleased that Mr. Ryder's LR of 100 and his description of it as "moderate ... support" were untainted by a subjective, prior probability, the objectivity or accuracy of the estimate and the adjective could be a source of legitimate concern.

But the concern is not with the use of likelihoods per se. Casting doubt on a particular estimate of an LR does not make it appropriate for the expert to speak of the source probability—quantitatively or qualitatively. Indeed, if the expert lacks the data and experience with which to estimate the likelihood ratio, as the court in R. v. T. suggested, how can the expert have anything useful to say about the source probability? The court's preference for expert opinions on source probabilities simply sweeps the problem under the proverbial rug.

References on R. v. T.

C.E.H. Berger et al., Evidence Evaluation: A Response to the Court of Appeal Judgment in R v T, 51 Sci. & Justice 43 (2011)

F. Hoar et al., Extending the Confusion about Bayes, 74 Modern L. Rev. 444 (2011)

David H. Kaye, Likelihoodism, Bayesianism, and a Pair of Shoes, 53 Jurimetrics J. (forthcoming Fall 2012)

G.S. Morrison, The Likelihood-ratio Framework and Forensic Evidence in Court: A Response to R v T, 16 Int'l J. Evid. & Proof 1 (2012)

Mike Redmayne et al., Forensic Science Evidence in Questions, 2011 Crim. L.R. 347

References on Likelihoodism

Jeffrey D. Blume, Likelihood Methods for Measuring Statistical Evidence, 21 Stat. Med. 2563 (2002)

Anthony W.F. Edwards, Likelihood (2d ed. 1992)

James Hawthorne, Inductive Logic, in Stanford Encyclopedia of Philosophy (Edward N. Zalta ed. 2012)

Richard M. Royall, Statistical Evidence: a Likelihood Paradigm (1997)

Monday, July 9, 2012

If the Shoe Fits, You Must Not Calculate It (Part I)

In R. v. T., [2010] EWCA Crim. 2439, the Court of Appeal of England and Wales wrote an opinion that dismayed, if not enraged, leading forensic scientists across the globe. The brouhaha began with testimony in a murder trial that there was "a moderate degree of scientific evidence to support the view that the [Nike trainers recovered from the appellant] had made the footwear marks."

This evidence came from "Mr Ryder of the Forensic Science Service (FSS)." Mr. Ryder compared four aspects of the footwear marks from a murder scene and a pair of Nike "trainers found in the appellant's house after his arrest," namely:
  • Pattern (p). The FSS maintained a database of the characteristics of the shoes it inspected. About 20% had the pattern of the soles of the Nikes—the same pattern seen in the shoeprints. The probability of the pattern in a pair of shoes not worn at the crime scene (-W) would be P(p | -W) = 1/5, where the vertical line stands for "given" or "conditional on."
  • Size (s). According to another database, 3% of shoes sold with that pattern were size 11 (UK). Given the uncertainty in the precise size of a shoe that might have left the marks and in the effects of wear, the examiner adjusted this last figure upward. He estimated that as many as 10% of shoes sold with that pattern would be in the right size range. Hence, P(s | p & -W) = 1/10.
  • Wear (w). He estimated (somehow) that about 50% of relevant shoes would show as much wear as was indicated by the impression and the shoes themselves. P(w | p & s & -W) = 1/2.
  • Damage (d). Finally, he felt that the marks indicative of damage to the shoes added almost nothing to the other information. P(d | p & s & w & -W) = 1.
It follows, by the chain rule for conditional probabilities, that if the marks did not come from the defendant's shoes, the probability that they would be comparable to the ones at the murder scene in all four respects would be P(p & s & w & d | -W) = (1/5)(1/10)(1/2)(1) = 1/100.
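In code, the examiner's arithmetic is nothing more than a product of the four estimates. This is a minimal sketch; the variable names are mine, and the sequential conditioning is the "formula" the court later criticized.

```python
# Mr. Ryder's four factors, each conditioned on the marks NOT having been made
# by the defendant's trainers (-W) and on the features already considered.
p_pattern = 1/5    # P(p | -W)
p_size    = 1/10   # P(s | p & -W)
p_wear    = 1/2    # P(w | p & s & -W)
p_damage  = 1.0    # P(d | p & s & w & -W): damage adds nothing

# Chain rule: the probability of seeing all four correspondences in shoes
# that did not make the marks.
p_evidence_given_not_W = p_pattern * p_size * p_wear * p_damage
print(p_evidence_given_not_W)   # 0.01, i.e. 1/100
```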

A frequentist statistician might say that the similarities between the impressions and the defendant’s shoes are good evidence that the shoes left the marks because a p-value of 0.01 is small.

A likelihoodist statistician would want to know more. It is not enough to believe that an outcome is improbable under the defense’s hypothesis that the defendant’s shoes did not leave the marks (-W). One also must consider the probability of the marks under the prosecution’s hypothesis that the defendant’s shoes left the marks (W). The "law of likelihood" postulates that when the probability of the evidence under one hypothesis exceeds that under the competing, simple hypothesis, it supports the former over the latter to a degree given by the ratio of the conditional probabilities. If the two probabilities in the "likelihood ratio" are equal, then the evidence is to be expected to the same extent under both hypotheses. It cannot help us discriminate between them. Thus, some law review article writers have called the likelihood ratio a "relevance ratio."

Here, the probability that the impressions would match the shoes if they had indeed come from the defendant's Nike trainers was almost 100%, so Mr. Ryder concluded that the evidence (E = p & s & w & d) was about 100 times more probable if the marks came from the defendant's shoes (W) than if they came from other shoes (-W). In symbols, the likelihood ratio (LR) for his conditional probabilities is


LR = P(E | W) / P(E | -W) = 1 / (1/100) = 100.

Mr. Ryder made this rough estimate "to confirm an opinion substantially based on his experience and so that it could be expressed in a standardised form." He wrote three reports and testified, but never once did he mention these numbers. Rather, he testified that "In my opinion there is a moderate degree of scientific support for the view that the [Nike trainers] made those marks."

He chose the word "moderate" from a table that the Forensic Science Service had selected for ranges of the likelihood ratio. The table, which he did not mention at trial or in his written reports, classified LRs from 10-100 as providing “moderate support.” The use of a standard table of "verbal equivalents" finds approval in reports of the European Association of Forensic Service Providers and a committee of the US National Research Council.

A Bayesian statistician would agree that a likelihood ratio of 100 supports the prosecution's theory substantially more than the defense's. But this statistician would not stop here. He would argue that the LR is a "Bayes factor." It multiplies the prior odds on W by 100. A juror willing to post prior odds of only 1 to 10 for the prosecution's hypothesis before hearing Mr. Ryder's evidence (and harboring no doubts about the veracity and accuracy of that evidence) now should be willing to revise the odds upward. Specifically, Bayes' rule gives posterior odds of LR x prior odds = 100 x 1/10 = 10 to 1. Whatever the value V of the prior odds, the posterior odds for this evidence are 100V.
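A short sketch ties the last two paragraphs together, using only the numbers given in the text; converting the 10-to-1 posterior odds into a probability of about 0.91 is my addition.

```python
# Likelihood ratio and a Bayesian reading of it, using the figures in the post.
p_evidence_given_W     = 1.0     # the correspondences are nearly certain if
                                 # the defendant's trainers made the marks
p_evidence_given_not_W = 1/100   # the product computed above

likelihood_ratio = p_evidence_given_W / p_evidence_given_not_W  # 100

# Bayes' rule in odds form: posterior odds = LR x prior odds.
prior_odds = 1/10                # the juror imagined in the post
posterior_odds = likelihood_ratio * prior_odds                  # 10 to 1

# Ten-to-one odds on W correspond to a probability of about 0.91.
posterior_probability = posterior_odds / (1 + posterior_odds)
print(likelihood_ratio, posterior_odds, round(posterior_probability, 2))
```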

Mr. Ryder stopped with the verbiage derived from the likelihoods and the FSS table. He did not give a Bayesian interpretation to the evidence -- something that the Court of Appeal had strongly disapproved of in earlier cases. Even so, the court in R. v. T. held that his testimony of "a moderate degree of scientific evidence to support the [prosecution's] view" rendered the conviction unsafe and therefore required a new trial. The next posting on the topic will explain why.

Saturday, July 7, 2012

Have DNA Databases Produced False Convictions?

About a year ago, I asked whether any false convictions have resulted from DNA database searches [1]. Of course, if there are any, they might be hard to find, but there is a known recent case of a false initial accusation. It came about because a laboratory contaminated a crime-scene sample with DNA from an individual whose DNA was on file from other cases.

In March 2012, a private firm in England re-used a "plastic tray[] as part of the robotic DNA extraction process" [2]. The tray, which should have been disposed of, apparently contained some DNA from Adam Scott, a young man from Exeter, in Devon [3]. This DNA contaminated the sample from the clothing of a woman who had been raped in a park in Manchester. Police charged Scott, who vehemently protested that he had never been to Manchester, with the rape. After detectives realized that Scott "was in prison 300 miles away, awaiting trial on other unrelated offences" at the time of the rape, the charges were dropped [3]. An audit and investigation of 26,000 other samples analyzed after the robotic system had been introduced uncovered no other instances of contamination. Steps intended to prevent a repetition of the error have been implemented [3].

Other errors in handling samples have been documented. In a 2001 Las Vegas case, police obtained DNA samples from two young suspects, Dwayne Jackson and his cousin, Howard Grissom. A technician put Jackson's sample in a vial marked as Grissom's, and vice versa. The falsely accused Jackson then pleaded guilty and was imprisoned for four years. The error came to light in 2010, after Grissom was convicted of robbing and stabbing a woman in Southern California. California officials took Grissom's DNA and entered the profile into the national database, leading to a match to the crime-scene DNA from the 2001 burglary for which Jackson had been falsely convicted [4].

Of course, this is not a case of a DNA database hit producing a conviction or even a false accusation. Quite the contrary, it is a case of a DNA database producing an exoneration that would not have occurred otherwise. But both cases vividly illustrate the need to implement quality control systems that reduce the chance of handling and other errors and to avoid over-reliance on cold hits.

Added July 9, 2012

Jeremy Gans's comment, posted this morning, is required reading.

References

1. David H. Kaye, Genetic Justice: Potential and Real, The Double Helix Law Blog, June 5, 2011.

2. BBC News, DNA Blunder: Man Accused of Rape After Human Error, Mar. 21, 2012.

3. Simon Israel, DNA Contamination Blamed on Human Error, Channel 4 News, May 9, 2012.

4. Lawrence Mower & Doug McMurdo, Las Vegas Police Reveal DNA Error Put Wrong Man in Prison, Las Vegas Rev.-J., July 8, 2011.

Cross-posted from The Double Helix Law Blog.

Friday, July 6, 2012

The Probability that the Higgs Boson Has Been Discovered

Surely everyone has heard of the probable discovery of the Higgs boson. But what does it have to do with forensic science or law? It is a reminder that the "prosecutor's fallacy" is not limited to prosecutors or courtrooms. Reports in the popular press by skilled physicists and science writers trying to explain this impressive discovery are replete with a messy form of the transposition fallacy. Here is an example from an otherwise excellent report by physicist Lawrence Krauss in Slate magazine:
One can in fact quantify the likelihood that the observations are mistaken and that the events are actually background noise mimicking a real signal. Each experiment quotes a likelihood of very close to “5 sigma,” meaning the likelihood that the events were produced by chance is less than one in 3.5 million. Yet in spite of this, the only claim that has been made so far is that the new particle is real and “Higgs-like.”
Likewise, Nature announced "just a 0.00006% probability that the result is due to chance." The New York Times reported that "the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million, 'five sigma,' which is the gold standard in physics for a discovery," attributing the statement to CERN's physicists.

How is this (mis)reporting related to the transposition fallacy? Well, sigma (σ) stands for standard deviation, and 5σ means 5 standard deviations from the value expected if the measurements were just noise. For a normal distribution, results this extreme or more extreme would be seen in pure noise a small fraction of the time. The tiny figures quoted above are estimates of that fraction. The fraction is the statistician's p-value, P(>5σ | noise), and it is on the order of 10⁻⁶. In plain English (and one bit of Greek), the probability of data of more than 5σ given that they are just noise is on the order of one in a million. So the observations would be very surprising if they were just noise.

But the probability that they actually are noise is an inverse probability, P(noise | data). That probability depends on the likelihoods P(5σ | noise) and P(5σ | signal) as well as on the prior probability, P(noise). The p-value itself does not generally "quantify the likelihood that the observations are mistaken and that the events are actually background noise mimicking a real signal." It does not specify the "probability that the result is due to chance." If one wants to quantify the probability that the data are a real signal rather than noise, then, for better or worse, one must turn to Bayes' rule.
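A toy calculation shows how different the two quantities can be. In this sketch (Python, with SciPy assumed available), the one-sided 5σ tail area stands in for P(data | noise), and the likelihood under the signal hypothesis and the prior probabilities are pure inventions for illustration, not CERN's numbers; the point is simply that the posterior moves with the prior, which the p-value alone cannot tell us.

```python
from scipy.stats import norm

# One-sided tail area for a 5-sigma excess under the noise-only hypothesis:
# the p-value reported as roughly "one in 3.5 million."
p_value = norm.sf(5)                       # about 2.9e-7

def posterior_prob_noise(p_data_given_noise, p_data_given_signal, prior_noise):
    """Bayes' rule for P(noise | data). Every input other than the p-value is
    an illustrative assumption."""
    numerator = p_data_given_noise * prior_noise
    denominator = numerator + p_data_given_signal * (1 - prior_noise)
    return numerator / denominator

# The same data, two different priors: the posterior probability of "just
# noise" is not the p-value and depends on how plausible a Higgs-like signal
# was thought to be beforehand.
print(p_value)
print(posterior_prob_noise(p_value, 0.5, prior_noise=0.5))        # ~5.7e-7
print(posterior_prob_noise(p_value, 0.5, prior_noise=0.999999))   # ~0.36
```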

References (for physicists)

- Giulio D’Agostini, Bayesian Reasoning in High Energy Physics, CERN Yellow Report 99-03, July 1999
- Giulio D'Agostini, Probability and Measurement Uncertainty in Physics: A Bayesian Primer (1995)

A couple of other blogs (and one newspaper) making the same point

- http://www.r-bloggers.com/the-higgs-boson-sigma-5-and-the-concept-of-p-values/
- http://understandinguncertainty.org/higgs-it-one-sided-or-two-sided
- http://randomastronomy.wordpress.com/2012/07/04/higgs-boson-discovery-and-how-to-not-interpret-p-values/
- http://blog.carlislerainey.com/2012/07/07/innumeracy-and-higgs-boson/
- http://understandinguncertainty.org/explaining-5-sigma-higgs-how-well-did-they-do#comment-1449

- http://online.wsj.com/article/SB10001424052702303962304577509213491189098.html

Postscript

Professor Dennis Lindley, a major figure in the development of Bayesian methods (and known to some readers of this blog as the author of a classic paper on using them to identify glass fragments), posed a few questions on the Higgs boson experiment via the list server of the International Society for Bayesian Analysis. One well-informed set of answers came from Louis Lyons (organiser of the PHYSTAT series of meetings and a member of the CMS Collaboration at CERN). I posted a slightly edited version on July 11 under the title "More on Statistical Reasoning and the Higgs Boson." The full text of these and various other interesting messages is at http://bayesian.org/forums/news/3648.