Saturday, March 30, 2019

Can P-values or Confidence Intervals Prove Non-association?

Last week, the scientific community heard two prominent calls to end the labeling of results of statistical studies as statistically "significant" or "not significant." One of the manifestos (Amrhein et al.) is particularly irate about negative claims -- conclusions of "no association" based on sample data that are not extreme enough to reject the null hypothesis. The authors "are frankly sick of seeing such nonsensical 'proofs of the null' and claims of non-association in presentations, research articles, reviews and instructional materials." The "pervasive problem," as they articulate it, is quoted in the box below. It is a recurring issue in toxic tort and other litigation.
PERVASIVE PROBLEM
Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions. 1/
It is hard to disagree with the first observation in the box. Logically, we cannot conclude that the null hypothesis is true "just because a P value is larger than a threshold." For one thing, if the study lacks power, a real association easily could go undetected. However, this is not a sufficient reason to abjure hypothesis tests. It is a reason to attend to power. If a study has ample power and is well designed, then it is not so obvious that it wastes research efforts or misinforms policy decisions to conclude that the failure to achieve statistical significance at a rather undemanding level demonstrates the lack of a true association.
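To make "attending to power" concrete, here is a minimal sketch of a power calculation for a comparison of two proportions. The baseline risk, relative risk, and sample size are hypothetical placeholders, not figures from any study discussed in this post:

```python
# A rough power calculation for a two-group comparison of proportions.
# All numbers are hypothetical placeholders, not figures from the studies
# discussed in this post.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_risk = 0.02                     # risk in the unexposed group (assumed)
relative_risk = 1.2                      # true effect we hope to detect (assumed)
effect_size = proportion_effectsize(baseline_risk * relative_risk, baseline_risk)
power = NormalIndPower().power(effect_size=effect_size,
                               nobs1=5000,            # subjects per group (assumed)
                               alpha=0.05,
                               alternative='two-sided')
print(f"Approximate power to detect RR = {relative_risk}: {power:.2f}")
# If power is low, a non-significant result says little about whether a
# true association exists.
```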

The second observation also is literally correct. Different outcomes of a null hypothesis significance test are not logically sufficient to establish a conflict between the studies. In 2013, a paper in the International Journal of Cardiology reported "that the use of selective COX-2 inhibitors [such as Vioxx] was not associated with atrial fibrillation risk" because the researchers did not find a statistically significant association. 2/ But a 2011 study in the British Medical Journal had found just such an association. 3/ It might seem that the second study is in conflict with the first, but as Amrhein et al. explain:
The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us ... . 4/
So the results of the second study do not undermine those of the first. The studies provide consistent point estimates. The second one merely has a larger standard error. As such, it does not merit as much weight, but it is fully consistent with the first result. 5/
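Amrhein et al.'s recomputed p-values ("our calculation") can be approximated from the reported risk ratios and 95% confidence intervals alone, using the usual normal approximation on the log scale. A sketch (the intervals are the ones quoted above; the authors' exact method may differ):

```python
# Back-calculating approximate p-values from a risk ratio and its 95% CI,
# using the normal approximation on the log scale. The figures are those
# quoted above; the method Amrhein et al. actually used may differ.
from math import log, sqrt, erf

def p_from_rr_ci(rr, ci_low, ci_high):
    se = (log(ci_high) - log(ci_low)) / (2 * 1.96)  # SE of log(RR) implied by the CI
    z = log(rr) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p-value

print(p_from_rr_ci(1.2, 0.97, 1.48))  # later study: roughly 0.09
print(p_from_rr_ci(1.2, 1.09, 1.33))  # earlier study: roughly 0.0003
```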

What happens, however, when the point estimates of the relative risk are different? Is "no association" a reasonable statement when the confidence interval is tightly centered around RR = 1? Suppose, for example, that the second study produced a 95% confidence interval of RR = 1.0 ± 0.2. From a maximum likelihood standpoint, "no association" is the best estimate. Would it still be ludicrous to describe the observed RR of 1 (no difference in the risk) as proof of "no association"?

Amrhein et al. do not answer this specific question, but they grudgingly accept the idea of proof of "no association" in the form of a showing that any true association is of no practical importance. As stated at the outset, they are "frankly sick of seeing such nonsensical 'proofs of the null' ... ." They argue that all values inside a confidence interval are "compatible" with the data (as are some outside of the interval), implying that you cannot single out any one value to the exclusion of the others. But they also propose that some values are more compatible than others. Is compatibility a likelihood? A probability? Something else? In the end, their answer to the question of whether the data can be said to prove a negative -- that there is no true association -- is this: "if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’." 6/
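Their suggested phrasing amounts to an equivalence-style judgment: say "no important effect" only when every value inside the interval lies within a margin the analyst has deemed practically unimportant. A minimal sketch, with an illustrative margin that Amrhein et al. do not themselves propose:

```python
# A sketch of the "no important effect" judgment: the entire confidence
# interval must lie within a pre-specified margin of practical unimportance.
# The margin below (RR between 0.9 and 1.1) is an illustrative assumption,
# not a value endorsed by Amrhein et al.
def no_important_effect(ci_low, ci_high, margin_low=0.9, margin_high=1.1):
    return margin_low <= ci_low and ci_high <= margin_high

print(no_important_effect(0.8, 1.2))    # False: the interval includes non-trivial risks
print(no_important_effect(0.95, 1.05))  # True: all compatible values deemed unimportant
```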

NOTES
  1. Valentin Amrhein, Sander Greenland, Blake McShane et al., Comment, Retire Statistical Significance, 567 Nature 305, 305-06 (2019).
  2. T.-F. Chao, C.-J. Liu, S.-J. Chen, K.-L. Wang, Y.-J. Lin, S.-L. Chang, et al., The Association Between the Use of Non-steroidal Anti-inflammatory Drugs and Atrial Fibrillation: a Nationwide Case–control Study, 168 Int’l J. Cardiology 312 (2013).
  3. M. Schmidt, C.F. Christiansen, F. Mehnert, K.J. Rothman, & H.T. Sørensen, Non-steroidal Anti-inflammatory Drug Use and Risk of Atrial Fibrillation or Flutter: Population Based Case-control Study, 343 Brit. Med. J. d3450 (2011).
  4. Amrhein, supra note 1, at 306.
  5. In fact, it lends strength to the conclusion that the true relative risk exceeds 1. Morten Schmidt & Kenneth J. Rothman, Mistaken Inference Caused by Reliance on and Misinterpretation of a Significance Test, 177(3) Int'l J. Cardiology 1089, 1090 (2014) (meta-analysis gives a CI of 1.1 to 1.3).
  6. Amrhein, supra note 1, at 307.

Sunday, March 24, 2019

Legal Implications of the Statisticians' Manifestos on Statistical Significance

Manifestos in The American Statistician 1/ and Nature 2/ last week urge an end to "statistical significance." As these communications make clear, statisticians have long denounced presenting results that are barely significant (for example, p = 0.048) as radically different from ones that are almost significant (for example, p = 0.052). They have cautioned that "no significant difference" with a study that lacks power is not proof of no real difference; that a "statistically significant difference" can, for all practical purposes, be of no importance; that a p-value is not the probability that the null hypothesis is true; and so on (and on).

Indeed, after reading such books as Statistics: Concepts and Controversies (1979) by David S. Moore, and The Significance Test Controversy: A Reader, edited by Denton Morrison and Ramon Henkel (1970), I once proposed that courts exclude testimony using phrases like “significant difference” and “95% confidence interval” as unduly confusing and potentially misleading under Federal Rule of Evidence 403 and the common law from which the rule is derived. 3/ That idea fell on barren soil.

Now a similar rule has been proposed for scientific discourse generally. The American Statistician’s editorial advises all scientists as follows: “‘statistically significant’—don’t say it and don’t use it.” It observes that
The ASA Statement on P-Values and Statistical Significance [Feb. 5, 2016] stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.

Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. Made broadly known by Fisher’s use of the phrase (1925), Edgeworth’s (1885) original intention for statistical significance was simply as a tool to indicate when a result warrants further scrutiny. But that idea has been irretrievably lost. Statistical significance was never meant to imply scientific importance, and the confusion of the two was decried soon after its widespread use (Boring 1919). Yet a full century later the confusion persists.

And so the tool has become the tyrant. The problem is not simply use of the word “significant,” although the statistical and ordinary language meanings of the word are indeed now hopelessly confused (Ghose 2013); the term should be avoided for that reason alone. The problem is a larger one, however: using bright-line rules for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making (ASA statement, Principle 3). A label of statistical significance adds nothing to what is already conveyed by the value of p; in fact, this dichotomization of p-values makes matters worse.
Also eschewing what it calls "dichotomania," the comment in Nature, which garnered "more than 800 signatories" when a pre-publication draft was circulated, reads
[I]n line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis. ... The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.
This is powerful stuff, but I do not expect testifying experts to change their ways or courts to stop prizing declarations of “statistically significant” findings. Neither do I expect the scientific establishment to change overnight. I doubt that phrases like "trending toward significance" or "highly significant" will disappear (or, more radically, that p-values will be abandoned). There are lots of things to consider. Do we need words of some kind to demarcate studies whose estimates reasonably can be considered to be close to a true value as opposed to those that could not be consistently replicated because of randomness in the data? In what contexts should an adjective be used for this purpose? It is one thing to say that journal editors should not use an arbitrary line to reject articles under all circumstances or even that scientists should just state a number rather than using a less precise adjective, but what about more popular writing? If a term like "significant" is used for demarcation in any context, what is the tipping point? More fundamentally, why use p-values to grade the strength of evidence? Should something else play this role? The 43 articles in The American Statistician and the correspondence in Nature on the Amrhein et al. article indicate that no simple fix is imminent in the scientific community. 4/

Nonetheless, the recent manifestos should have some influence in the legal world. They should cause lawyers or testifying experts to hesitate to puff up evidence with modest p-values (in the vicinity of 0.05, for example) as "significant!", if only because the American Statistician's editorial, the earlier ASA Statement, and the Nature comment (together with the large number of statisticians who publicly endorsed it) all supply fodder for cross-examination (depending on the jurisdiction's hearsay rule for "learned treatises"). Relatedly, they could support motions to limit testimony with such verbiage as misleading or unfairly prejudicial, as I once suggested.

In addition, they should reinforce the tendency of most courts to reject a mechanical rule for admissibility based on a false alarm probability of 0.05. After seeing the manifestos, a law school colleague wrote me that he had "never understood why an empirical study that was 1% less significant than what a social science journal would accept is therefore ‘junk’ that should not even be considered by a jury." However, the argument for a categorical rule is more subtle. The relevant standard for admissibility is not, "Is this study junk science?" Under Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), and Federal Rule of Evidence 702, it is whether a finding presented as scientific qualifies as "scientific knowledge" and, as such, possesses "evidentiary reliability." The argument that all statistical findings that fail the p < 0.05 test are not reliable enough to consider (along with other evidence) has always been dubious. At most, the mechanical rule would be that p ≥ 0.05 shows that the finding is insufficient for scientists -- and hence legal factfinders -- to conclude that the alternative hypothesis is true.

But even though Rule 702 does not lead to the categorical exclusion of findings for which p ≥ 0.05, such evidence is subject to the balancing test of Rule 403. One could argue for a bright-line rule to simplify that inquiry. Juries and judges often do not understand what a p-value is and might misconstrue a study with p exceeding 0.05 as good proof of a real association or effect (especially if the result is not characterized as "not significant"). For example, jurors might think that if p is "only" 0.052, then they should be almost 95% "confident" that there is a real association or effect. Such naive transposition is often incorrect. Of course, 0.05 (or any other number) is somewhat arbitrary, but it may not be a ridiculous dividing point if the benefits of a simple rule exceed the costs of the misclassifications it produces relative to purely ad hoc case-by-case balancing. 5/
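The gap between a p-value near 0.05 and "95% confidence" can be illustrated with a simple screening-style calculation: among studies that cross the 0.05 threshold, the share in which the null hypothesis is nonetheless true depends on the prior plausibility of a real effect and on power. A sketch with entirely hypothetical inputs:

```python
# Why a p-value near 0.05 is not "95% confidence" in a real effect.
# Among hypothetical studies that cross the 0.05 threshold, the share in
# which the null is actually true depends on the prior and the power.
# All inputs are illustrative assumptions.
prior_real_effect = 0.10   # assumed fraction of tested hypotheses that are true
power = 0.50               # assumed power when the effect is real
alpha = 0.05

p_sig_given_null = alpha * (1 - prior_real_effect)
p_sig_given_real = power * prior_real_effect
p_null_given_sig = p_sig_given_null / (p_sig_given_null + p_sig_given_real)

print(f"P(null true | p < 0.05) = {p_null_given_sig:.2f}")  # about 0.47, not 0.05
```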

My own preference is for a more flexible approach -- with judicial recognition that the probative value of studies for which p is near (and, a fortiori, much larger than) 0.05 is quite limited. The findings on either side of this dividing line are not necessarily "junk science," but in the vicinity of p = 0.05, they are not all that surprising even if there is no real association or effect. As such, it may not be worth the time it takes to educate the jury about the weak evidence. In ruling on motions to exclude on this Rule 403 basis, courts should consider not only the issue of randomness in the data (which is all the p-value addresses), but also the design and quality of the statistical study as well as the availability (or not) of other, more probative evidence on the same point.

NOTES
  1. Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar, Moving to a World Beyond “p < 0.05”, 73(S1) Am. Statistician 1–19 (2019), DOI: 10.1080/00031305.2019.1583913.
  2. Valentin Amrhein, Sander Greenland, Blake McShane et al., Comment, Retire Statistical Significance, 567 Nature 305-307 (2019). This commentary is discussed further in Can P-values or Confidence Intervals Prove Non-association?, Forensic Sci., Stat. & L., Mar. 30, 2019.
  3. D.H. Kaye, Is Proof of Statistical Significance Relevant?, 61 Wash. L. Rev. 1333 (1986).
  4. John P. A. Ioannidis, Letter, Retiring Statistical Significance Would Give Bias a Free Pass, 567 Nature 461 (2019); Valen E. Johnson, Raise the Bar Rather than Retire Significance, 567 Nature 461 (2019); Julia M. Haaf, Alexander Ly & Eric-Jan Wagenmakers, Letter, Retire Significance, But Still Test Hypotheses, 567 Nature 461 (2019).
  5. Even with a categorical rule, courts would have discretion to exclude those studies with p-values that are small enough to pass through the filter. Consider a study that produces a p-value barely below 0.05. As with p = 0.052, the number may be transposed into a false sense of confidence (especially if it is praised as "statistically significant").

Thursday, March 21, 2019

Admitting Palm Print Evidence in United States v. Cantoni

A second U.S. District Court judge in the Eastern District of New York has ruled that latent fingerprint analysts may testify to categorical source attributions despite concerns voiced in prominent reviews of the field. 1/ Given the research demonstrating the ability of latent fingerprint examiners to associate prints with their sources, that is no surprise. More problematically, the case suggests that defense experts should not be permitted to testify about error rates in a forensic method unless they "conclude that [the analyst’s] conclusions were erroneous or that the process used was fundamentally unreliable."

In United States v. Cantoni, No. 18-cr-562 (ENV), 2019 WL 1259630 (E.D.N.Y. Mar. 19, 2019), the government was prepared to call three examiners from the New York City Police Department Latent Print Section to testify that a palm print found on a bank robber's note demanding money is Cantoni’s. Cantoni moved before trial for an order excluding this source conclusion or, alternatively, directing the examiners to include cautions about the risk of a false identification. In what the court called "throwing incense in Caesar’s face," Cantoni apparently relied on a 2017 report from a committee established by the AAAS, 2/ a 2016 report of the President’s Council of Advisors on Science and Technology (PCAST), 3/ a 2012 report of a committee formed by the National Institute of Standards and Technology (NIST), 4/ and a 2006 report of the Inspector General of the Department of Justice. 5/ These did not get him very far.

Conclusions Admissible

In response to the demand for outright exclusion, district judge Eric N. Vitaliano pointed out that the PCAST report acknowledged the validity of latent print analysis. The studies of the ability of examiners to draw categorical conclusions that impressed PCAST involved fingerprints, but presumably those findings can be extrapolated to palm prints.

Cantoni tried to sidestep this evaluation by arguing that the NYPD examiners did not follow all the recommendations in the PCAST report. These were said to include testing examiners for minimal proficiency, disclosing the order in which the prints were examined and any extraneous facts that could have influenced the conclusions, documenting the comparison, and verifying that the latent print is comparable in quality to the prints used in validation studies.

The court noted that several of these matters, such as the fact that the examiners underwent proficiency testing, were not at issue in the case and that any remaining deficiencies did not make the process “so fundamentally unreliable as to preclude the testimony of the experts.” The court wrote that considering “Daubert’s liberal standard for the admission of expert testimony,” concerns about cognitive bias that underlay the PCAST procedures “are fodder for cross-examination rather than grounds to exclude the latent print evidence entirely.”

Limitations on the Testimony

Although the court did not exclude the examiners’ testimony in toto, it agreed to ensure that the testimony not include assertions “that their conclusion is certain, that latent print analysis has a zero error rate, or that their analysis could exclude all other persons who might have left the print.” The government had no problem with this constraint. It acknowledged that “[t]he language and claims that are of concern to defense counsel are disfavored in the latent print discipline.” One might think that analysts would not testify that way anymore, 6/ but some latent print examiners continue to offer conclusions in the traditional manner. 7/

Defendant’s remaining efforts to shape the testimony on direct examination were less successful. He wanted “the government [to] acknowledge, through the examiners or by stipulation, that studies have found the [false positive] error rate to be as high as 1 in 18 or 1 in 306.”

These numbers do not fairly or fully represent the findings of the validation studies from which they are drawn. These studies found smaller rates of false-positive errors. The quoted figures are inflated by taking the upper limit of a one-sided 95% confidence interval above the observed false-positive proportion. That a large error rate is not incompatible with a small study (one that lacks statistical power) cannot reasonably be interpreted as a finding that the error rate is large.
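The kind of calculation that yields figures like "1 in 306" is an upper confidence bound on the false-positive proportion, not an observed rate. Here is a sketch of the standard exact (Clopper-Pearson) upper bound, with placeholder counts rather than the actual counts from either study:

```python
# Upper limit of a one-sided 95% confidence interval for a false-positive
# proportion (exact Clopper-Pearson bound). The counts are placeholders,
# not the actual counts from the FBI/Noblis or Miami-Dade studies.
from scipy.stats import beta

def upper_95_bound(false_positives, comparisons):
    if false_positives == comparisons:
        return 1.0
    return beta.ppf(0.95, false_positives + 1, comparisons - false_positives)

fp, n = 1, 500  # hypothetical: 1 false positive in 500 nonmated comparisons
print(f"Observed rate: 1 in {n // fp}; 95% upper bound: 1 in {1 / upper_95_bound(fp, n):.0f}")
# The upper bound is what a small study cannot rule out, not what it found.
```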

Perhaps the government pointed out this misuse of statistics, but the opinion describes a different rejoinder: “The government suggests that the studies Cantoni cites are inapposite because they involved the Federal Bureau of Investigation and the Miami-Dade Police Department.”

That response is disappointing. The most compelling study is the FBI-Noblis test of the accuracy of latent fingerprint examiners generally. It consisted of an experiment in which the researchers — but not the analysts — knew which pairs of prints were from the same finger and which were from fingers of different individuals. The test subjects were hardly limited to FBI examiners. To the contrary,
In order to get a broad cross-section of the latent print examiner community, participation was open to practicing latent print examiners from across the fingerprint community. A total of 169 latent print examiners participated; most were volunteers, while the others were encouraged or required to participate by their employers. Participants were diverse with respect to organization, training history, and other factors. 8/
The government may have meant to argue that whereas the results of studies can be used to demonstrate that error rates are small among representative examiners, such an average figure is not a precise measure of the probability of error in a specific case. That much is true, but it does not mean that the error rates in the experiments are irrelevant or useless to a proper understanding of the risk of error in New York City.

The court did not reach such nuances. Judge Vitaliano wrote that “[c]ross-examination is the appropriate means to elicit weaknesses in direct testimony” and that “Cantoni may explore the error rates generated by the studies on cross-examination.” Yet, the court refused to let the defense inform the jury of error rates and other limitations on latent-print examinations through an expert of its choice — the social scientist, Simon Cole, who has researched the history of fingerprinting and written extensively about it.

Keeping Cole Away

The court provided two reasons for excluding Cole’s testimony. First, “this is a matter that may be explored on cross-examination and does not require an expert to offer an opinion.” However, the opportunity to cross-examine one expert normally does not preclude a party from calling its own expert. In a toxic tort case in which a manufacturer denies that its product is toxic, for example, the manufacturer ordinarily would be able not only to cross-examine plaintiff’s physicians, toxicologists or epidemiologists, but also to present its own comparable experts.

The difference, one might argue, is that “Dr. Cole’s opinions appear to be directed at NYPD’s methods for latent print analysis in general rather than specific issues in this case. ... [H]e does not thereby conclude that NYPD’s conclusions were erroneous or that the process used was fundamentally unreliable.” However, no rule of law requires an expert to give a final opinion. Simply providing background information — such as the accuracy of a medical diagnostic test — may assist the jury in a tort case in which plaintiff’s testifying physician relied on the diagnostic test in forming his or her opinion.

Second, the court suggests that the rule against hearsay makes the educational expert’s testimony inadmissible. Judge Vitaliano wrote that “to the extent that Dr. Cole merely plans to convey the contents of studies he did not conduct, he is poised to act as a conduit for hearsay, which is a prohibited role for an expert.” In the toxic tort case, however, the defendant’s epidemiologist could report the findings of other researchers. Federal Rule of Evidence 703 often permits experts to discuss inadmissible hearsay as the basis for their own opinion. In our toxic tort case, the defense need not call the authors of the studies as witnesses to lay a foundation for the epidemiologist to describe all the studies on point to explain why the substance has not been scientifically established to be toxic.

Again, one might argue that there is a difference. Epidemiologists study the types and quality of proof of toxicity; as a result of their specialized training in methodology, they can give an expert opinion on the state of the science. To inform that opinion, the epidemiologist can — indeed, must — review the published work of other researchers. But is Dr. Cole, as a sociologist of science, qualified to give the opinion that “there is now consensus in the scientific and governmental community that categorical conclusions of identification — such as the one made in this case — are scientifically indefensible”? The opinion in Cantoni does not squarely confront this Rule 702 question.

Whatever the answer to the question of qualifications, 9/ the hearsay rule should not preclude a statistician or other methodological expert from giving an opinion that well-designed research suggests that latent fingerprint examiners who have been the subject of experiments reach incorrect source conclusions under certain conditions or at certain rates. Analogous expert testimony about the accuracy of eyewitness identifications is now plainly admissible (in the court's discretion) in most jurisdictions. Cross-examination (this time by the government) can explore the extent to which these findings are applicable to the case at bar. If the known error rates are clearly inapposite, they should be excluded under Rule 403, but the hearsay objection seems misplaced.

NOTES
  1. The previous cases are noted in Ignoring PCAST’s Explication of Rule 702(d): The Opinions on Fingerprint Evidence in Pitts and Lundi, Forensic Sci., Stat. & L., July 16, 2018, and More on Pitts and Lundi: Why Bother with Opposing Experts?, Forensic Sci., Stat. & L., July 17, 2018.
  2. William Thompson, John Black, Anil Jain & Joseph Kadane, Forensic Science Assessments: A Quality and Gap Analysis, Latent Fingerprint Examination (2017).
  3. Executive Office of the President, President’s Council of Advisors on Sci. & Tech., Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods (2016).
  4. NIST Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice Through a Systems Approach (David H. Kaye, ed. 2012).
  5. U.S. Dep't of Justice, Office of the Inspector General, A Review of the FBI’s Handling of the Brandon Mayfield Case (2006).
  6. Cf. Another US District Court Finds Firearms-mark Testimony Admissible in the Post-PCAST World, Forensic Sci., Stat. & L., Mar. 15, 2019.
  7. Nicole Westman, Bad Police Fingerprint Work Undermines Chicago Property Crime Cases, Chi. Reporter, Mar. 21, 2019.
  8. Bradford T. Ulery et al., Accuracy and Reliability of Forensic Latent Fingerprint Decisions, 108 Proc. Nat’l Acad. Sci. 7733, 7734 (2011). For more discussion of the characteristics of the volunteers, see Part II of Fingerprinting Under the Microscope: Examiners Studied and Methods, Forensic Sci., Stat. & L., Apr. 27, 2011.
  9. For discussion of the qualifications of non-forensic scientists to opine on the state of forensic science methods, see David H. Kaye et al., The New Wigmore on Evidence: Expert Evidence, Ch. 2 (2d ed. 2011) (2019 cumulative update).

Wednesday, March 20, 2019

Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 3)

Yesterday, I presented some of the testimony behind the one-in-650-billion probability quoted in ProPublica's reporting on FBI testimony in United States v. McKreith. The figure came from the uniform probability model as applied to the placement of lines on plaid shirts. The selection of that probability model was motivated by observations of the manufacturing process.

Physical scientists are trained to make rough approximations of quantities — both large and small — and they should not be pilloried for trying to use a simple probability model to get a sense of how rare an event might be. But experts should not testify to rough calculations of extremely persuasive probabilities without any error bounds and without trying to verify the assumptions of the modeling effort by collecting and analyzing a reasonable amount of data.

The prosecution's enthusiasm for the largely theoretical probability model in McKreith extended not merely to the shirt mentioned in the ProPublica article, but also to stripes on a handbag. Dr. Vorder Bruegge's testimony on direct examination was relatively mild. He testified, without computing any probabilities, that a photograph and a handbag were "indistinguishable" with respect to "class" and "individual" features that the jurors could see for themselves:

Q. I believe we were at the Mary Kay bag, Government Exhibit 14. [W]ere you able to identify this exhibit as something that was presented to you for ... photographic comparison purposes?
A. Yes, this is the bag that was submitted to me at the FBI laboratory for comparison in this case.
Q. [W]hat observations were you able to make in terms of the class characteristics of the bag?
A. Basically, this is a handbag that has ... two dark straps. It's got a pocket on [one] side ... . It's made primarily of ... multiple panels — two side panels, two end panels, a bottom and a secondary panel that is overlapping this side panel. It's a striped bag that has these bright snaps on the end. ... [B]asically, it's ... dark silver and black stripes on the bag. ... The manufacturer of the bag is indicated by the name Mary Kay on the side.
Q. [C]an you identify ... what distinguishing features there are about the stripes?
A. Well, the stripes are evenly sized stripes. Each one is about a quarter of an inch wide, and they are alternating black and silver stripes.
Q. At the spacing, the same consistent —
A. The spacing is consistent throughout, across the bag, yes.
Q. And you indicated that it has hand straps?
A. Yes. ...
Q. [S]howing you what is marked as Government’s Exhibit 7-EE, regarding the bank robbery at SouthTrust, is this a chart that you prepared for comparison analysis? ...
A. Yes. Government exhibit V7B-EE is a chart that I prepared. ...
Q. ... [W]ere you able to identify, from looking at the bag itself, any individual characteristics, identifying characteristics, which would make this bag unique from all the other Mary Kay bags that may have come off the assembly line at sometime, using the same fabric, being the same sized bag?
A. [T]here are some small white markings on the bag ... . It's not clear to me whether it would be marker or some kind of staining on the bag that occurs at various places around the bag. [O]n the back here [are] little white marks that could be used to differentiate this bag from all of the bags.
      As far as the manufacturing characteristics, we've got another repeating pattern, much like the one in the shirt, in which we've got dark bright stripes. [I]f we look at this end panel, for example, you'll see that the very top stripe on this end of the bag is a ... totally black stripe. And if we look at this end ... where the top is, ... there's actually a little silver there. Likewise, if you look at the edges ... [at] the very top of this [side panel] is about half of one of those silver lines. On the other side, it's not quite a half of one of those silver lines. [Also, it’s] basically silver on the top of the sides — silver on the top of this side panel, end panel, but black [with] maybe a little bit of silver there, [but] when it's folded over, it's black on the top. Also, if you look at where the end panel meets the side panel, you've basically got ... the silver effectively lining up with the black, going from the end panel to the side panel. At the other end, it's a slight offset, slightly different, where it's kind of half and half. It doesn't match up exactly.
Q. ... Are the results of the randomly identifying features in terms of the way the bag was manufactured when these parts were sewn together?
A. Now, I have never been to a bag manufacturing plant, but assuming that the same sewing practices were used —
MR. HOWES: Judge, I'm going to object.
COURT: Sustained.
BY MR. STEFIN:
Q. [S]o you don't know the manufacturing process with respect to that particular bag?
A. That's correct.
Q. Okay. Were you able to do a comparison in any regard to determine whether or not there were points of identification which are similar to the government's Exhibit 14 with the bag depicted in the robbery photos from the SouthTrust bank robbery?
A. Yes I was. ...
Q. And what ... did your comparison yield?
A. Basically, I found a similarity in class characteristics between the bag, the Mary Kay bag — Government’s Exhibit 14 — and the bag carried by the robber in the SouthTrust bank robbery as depicted on the left hand side of Government’s Exhibit VB7-EE. ...
Q. Are you able to offer an opinion as to whether Government Exhibit 14 is indistinguishable from the bag that's depicted in the bank surveillance photographs of the SouthTrust Bank robbery?
A. Yes.
Q. And what is your opinion?
A. This Government’s Exhibit 14 is indistinguishable from the bag in Government Exhibit 7-EE.

The court sustained the objection to testimony about "randomly identifying features in terms of the way the bag was manufactured when these parts were sewn together" because Dr. Vorder Bruegge forthrightly acknowledged that he was assuming certain facts not in evidence (and outside his expertise as an image analyst). Evidently, the court wanted more than a mere assumption "that the same sewing practices were used." But the next day, on re-direct examination, the prosecutor had Dr. Vorder Bruegge present the same probability model for the placement of the bag's stripes that he had used for the shirt:

Q. ... And with respect to the Mary Kay bag, Government Exhibit 14, didn't you identify individual characteristics of that bag which makes it different than other Mary Kay bags that may have come off the same assembly line?
A. Yes, I did.
Q. And in fact, how many different characteristics were you able to identify looking at that exhibit, in comparison with the bank surveillance photographs of a bag being carried by the robber?
A. There were four specific characteristics that I noted.
Q. And would you remind us of what those four individual characteristics were?
A. The first one was the alignment of the black and silver stripes from the back side of the bag with the end of the bag, the fact that the silver lines on the inside line up with the black lines on the back side. The second characteristic was the location of the snaps at the top on a silver line. The third characteristic was the very small silver line at the top of the back piece. And the last characteristic was the silver line at the very top of the back piece.
Q. And did you come up with any ... odds or probabilities that these items would appear exactly as they are on that bag in a random fashion?
A. Yes, I did. ...
Q. [H]ow were you able to arrive at a probability as far as the individual characteristic that would exist?
A. Basically I'm dealing with a black or white situation. In this case, black or silver. Either you're going to get the black line in one place or you're going to get the silver line in that place. I'm not breaking down by 50% of the black line or 50% of the silver line. I'm just saying it's either a black line or a silver line, which is a 50/50. You got like one chance in two of a specific feature being black or silver. In particular, these silver snaps on the end can either be on a silver line or a black line. They're on a silver line. That eliminates all of the other bags that would have the snaps on a black line.
      Likewise at the top, there is either a silver line at the top or there's a black line at the top. One chance in two, 50/50. So with this, the snaps and the top of the side of the back, it’s one in four. With the addition of the back of the bag silver at the top, it's 1 in 8 — 2 times 2 times 2. And then with the sides here having silver aligning with black, the silver’s either going to align with black, or the silver’s going to align with silver. That's another one in two chance. So 2 times 2 times 2 is 1 in 16.
Q. 2 times 2 times 2 times 2?
A. Yes, correct. 2 to the 4th power.
Q. 2 to the 4th power. So it is possible then to eliminate 15 out of 16 silver bags that would be coming the manufacturing process from whatever company made those bags?
A. That would be the hypothesis, correct.
Q. And did you, in fact, find those four same characteristics in the photographs depicting the robber carrying the same bag?
A. Yes. Yes, I did.

This computation of 1/16 for the probability of a four-feature match sounds suspiciously like an application of the principle of insufficient reason discussed yesterday. Every feature is "black or white," present or absent. Not knowing which is more probable, we can presume that each state (present or absent) is equally probable. Four independent features then create 16 equally likely states of nature.
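The arithmetic in the testimony is simply the product rule applied to independent, equally likely binary features. A minimal sketch of that calculation; the independence and the 50/50 assignment are the very assumptions the text questions:

```python
# The 1-in-16 figure is the product rule applied to four binary features,
# each assumed to be independent and equally likely to take either state --
# the very assumptions questioned in the text.
n_features = 4
p_per_feature = 0.5            # "black or silver," treated as 50/50
p_match = p_per_feature ** n_features
print(p_match, "=", f"1 in {int(1 / p_match)}")  # 0.0625 = 1 in 16
```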

Over a century ago, the New York Court of Appeals soundly rejected this Laplacean reasoning. In People v. Risley, 108 N.E. 200 (N.Y. 1915), a mathematician "was permitted to testify that, by the application of the law of mathematical probabilities, the chance of such defects [in letters typed on an allegedly altered affidavit] being produced by another typewriting machine was so small as to be practically a negative quantity." The mathematics professor "defined the law of probabilities as 'a proper fraction expressing the ratio of the number of ways an event may happen, divided by the total number of ways in which it can happen.'" The court wrote that the extended multiplication of one-half for each peculiarity "was not based upon actual observed data, but was simply speculative ... ."

If the choice of 1/2 for the probability of each binary feature on the handbag was based on nothing more than the fact that there are two possibilities -- present or absent -- then it too is "simply speculative." If it was based on the more plausible assumption that the bag is sewn together in a way that would be expected to produce a uniform distribution, then the objection -- that the expert had not even visited a Mary Kay plant to learn how the bags were made -- applies. But McKreith's lawyer did not renew the objection. Neither did he argue that the applicability of the model was not verified by data on a sample of Mary Kay bags. If anything, the 1/16 figure for the four-feature handbag match is more speculative than the one in 650 billion probability of the eight-seam shirt match.

Risley and McKreith are not the only cases in which experts have multiplied a lot of small fractions together to get a smaller number. In the 1968 California case of People v. Collins, a prosecutor had a local mathematics professor testify to the product (one in 12 million) of a series of probabilities for characteristics supplied by eyewitnesses who described an interracial couple that drove a yellow automobile away from the scene of a robbery. The probabilities were data-free estimates that the prosecutor fed to the mathematician. The California Supreme Court reversed the conviction, famously remarking that "[m]athematics, a veritable sorcerer in our computerized society, while assisting the trier of fact in the search for truth, must not cast a spell over him," and sparking much distrust in the legal community of probability calculations.

However, the probability model in McKreith, for handbags as well as shirts, is more plausible than the one in Collins. For one thing, the assumption of uncorrelated features is more plausible, and there is at least a modicum of knowledge of the process generating the features. Nevertheless, as I observed with regard to another case in which appellate lawyers for a defendant put an explicitly probabilistic assessment of evidence into their brief, "[t]he attempt to use probability theory ... was heroic. Like many acts of heroism, it also was hasty. Although there were some measurements and estimates of quantities bearing on guilt or innocence, the empirical data were so sketchy that the computations inevitably were more creative than convincing." 1/

NOTES
  1. D. H. Kaye, Book Review, Statistics for Lawyers and Law for Statistics, 89 Mich. L. Rev. 1520, 1543 (1991).
POSTINGS IN THIS SERIES
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 1), Mar. 3, 2019
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 2), Mar. 19, 2019.
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 3), Mar. 20, 2019

Tuesday, March 19, 2019

Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 2)

As noted earlier this month, one of the more striking parts of the ProPublica articles on the FBI laboratory's image analysis unit 1/ is its discovery of testimony, given more than sixteen years ago, about the probability of finding certain features in photos of a shirt worn by a bank robber caught on film in a string of robberies. The government accused Wilbert McKreith of eight robberies and of possession of firearms after having previously been convicted of a felony. 2/

According to ProPublica, a ruling that Dr. Richard Vorder Bruegge’s “testimony met the Daubert standard” “enshrined the FBI unit’s techniques and testimony as reliable scientific evidence.” This characterization seems to overstate the importance of a single, unpublished pretrial ruling that followed a short, spur-of-the-moment hearing. 2/ Still, the testimony was central to McKreith’s conviction on the bank robbery counts. Dr. Vorder Bruegge’s conclusions fortified the testimony from ordinary witnesses that, among other things, McKreith and the bank robber wore two wrist watches at the same time; wore or had ski masks in south Florida; wore plaid shirts; drove a maroon, burgundy or red-colored car; and had the same general appearance.

Dr. Vorder Bruegge’s signal contribution came from comparing video images from bank cameras to items seized from the defendant’s home and to McKreith himself. Some of this testimony was not definitive. For example, Vorder Bruegge testified that enlargements of the photographs from several bank robberies lacked sufficient resolution for him to say “with a scientific certainty” that they showed wristwatches. Instead, he stated that they were “consistent with these features being two watches on the left wrist of the bank robber.” Likewise, he testified that a bank photograph that showed a profile of the robber’s face lacked the resolution to reveal allegedly “individual identifying characteristics [such as moles, scars, chipped teeth, ear patterns or other facial minutiae],” but “the overall characteristics of the profile, which include the shape of the nose, mouth, and chin” displayed “similarities” “consistent with” McKreith’s being the bank robber.

Dr. Vorder Bruegge’s analysis of the pattern of a plaid Van Heusen shirt in photos from seven of the robberies produced more conclusive evidence. A condensed version of the testimony on direct examination on December 16, 2002, is in the shaded boxes that follow. Discussion is interspersed between the boxes. ProPublica posted the full transcript of the testimony from Dec. 16 and 17.

"Individual Identifying Characteristics"

MR. [Roger] STEFIN [Assistant US Attorney]: Our next witness would be Richard Vorder Bruegge.
THE COURT: Okay, members of the jury, [o]ur next witness will be an expert witness. ....
[Defense counsel objected that the government first needed “to lay a sufficient predicate” and “request[ed] a Daubert ... hearing. The court held an impromptu hearing, with no written briefing, at which Dr. Vorder Bruegge affirmed that “the techniques [are] well recognized ... in forensics [a]nd in the scientific community at large.” The court overruled the defendant’s objection.]
Q. [by MR. STEFIN] [H]ave you had the opportunity to view the eight by ten photographs that have been introduced in evidence in this particular case with respect to the bank robbery images in each of the eight bank robberies?
A. Yes, I have. ...
Q. ... Let’s talk about the shirt now. What is it about the manufacturing process of shirts that might enable you to identify features of that shirt that could be then compared with image analyses?
A. Well, any, any comparison analysis involves a comparison of first of all, the class characteristics. ... In a shirt like this, the class characteristics ... include such things as, does it have a pattern, which this shirt does — it has a plaid pattern or a checked pattern, if you will. ... It also is a button-down shirt — it's not a pull-over, and it's a long-sleeved shirt. ...
Once one has ... found ... class characteristics that match, one can move on to look at the individual identifying characteristics. ... [I]f I were trying to differentiate one person from another and reach a positive identification, then I would need individual identifying characteristics, such as moles, scars, freckle patterns, chipped teeth. These, these features enable you to really differentiate people down to saying this person is unique from all other people. [B]ecause this shirt is a patterned shirt, there are individual identifying characteristics ... based upon ... the way ... the pieces on the shirt are cut out and then sewn together.

The class-individual distinction is entrenched in forensic science, and there is a grain of truth in it. But it assumes what is to be proved -- that every "individual characteristic" (or some unspecified combination of them) makes an individual distinguishable from every other individual. "Class characteristics" are known to be generic, whereas "individual" ones might not be. Logically, both work the same way -- the presence or absence of a characteristic changes the probability of a common source for the specimens or images being compared. A less tendentious pair of terms would be "generic" and "randomly acquired or incorporated," keeping in mind that random characteristics are not necessarily specific to an individual.

"One in-35 Through Random Processes"

Q. So what can you conclude with respect to the pattern itself on this shirt?
A. [B]ecause we don't see the patterns matching across the seams, we can use that as a way to individualize this shirt relative to the other shirts that would be manufactured at the same time. ...
Q. [W]hat do you call this when you find an individual feature?
A. Individual identify characteristic. Basically, each seam ... can be considered on its own as an entire set of individual identifying characteristics. It's almost one individual identifying characteristic. ... [I]f we look at where the right yoke meets the right front panel, there's this ... thick dark that I've been talking about this comes down from the neck, and it terminates here about 2/3 of the way across this panel ... . Likewise, if we look at the left sleeve, I point out that ... very thin dark line here, which is coming up the right sleeve. It almost exactly meets the point where the yoke joins the front panel. ... Likewise, ... look at the collar itself. You'll see that the collar has a very dark line [that] just cuts the corner of the bottom hole on this side. That is paralleled on the other side because it's one piece. So that's another individuating characteristic for this shirt. ...
Q. All right. And did you take precise measurements of these types of lining up of the different — the thick lines or the thin lines in order to further identify, you know, the individual characteristics of this particular shirt?
A. I measured the width of these features on this shirt so that I could figure out [the] relationship ... between places meeting on one side with those meeting on the other side.
Q. And I think you said that the pattern repeats basically every three-and-one-half inches?
A. Yes. Basically, ... if we just repeat going from this dark line to the next dark line, it's three-and-a-half inches. ...
Q. The thick line or the thin line?
A. The thin line. Each ... of these features repeats every three-and-a-half inches. It's just that the thin line is the smallest feature that we can see very easily.
Q. So you were using the thin line as a point of reference in measuring out the three-and-one-half inches?
A. Exactly. Recall that I said there's a curved surfaces here where the sleeve meets the other seams. That would complicate any type of repeat analysis because since it's not a straight line, that's going perpendicular to that feature, so it's actually going to be longer.
So basically, if we did have a single seam that we were trying to match this pattern to run across it, the chances of that very thin dark line matching up, matching up or at least touching the black line on the other side, would be one in 35 through random processes.
Q. Could you explain that, explain that a little bit, how you came up to the one in 35? ...
A. ... Let's use, let’s use the yoke and the sleeve. ...
Q. OK ... And the yoke, again, is this back panel ... ?
A. Right. I'm using the yoke in the back because they're almost aligned. ... You can see here that on the yoke the dark line is ... slightly below the seam here. And you have to go about half an inch down before you hit the black line on the other side.
Q. Okay. In other words, you can measure the distance between the black line on the yoke to the black line on the sleeve, and you measured, say ... a half an inch?
A. Yeah.
Q. All right. And so just taking that one point of reference, what would be the odds that two shirts coming out the same manufacturing plant being manufactured in this, in this method would end up having the same alignment of yoke and the sleeve, whereby you could have this ... half an inch offset between them just for this one point of identification?
A. Just for that one point, it would be a 1 in 35 chance.
Q. And how did you come up with the number that it's a 1 in 35 of probability that these two items would line up the same in more than one shirt?
A. Because the feature itself that I’m looking to align is 1/35th of the overall repeat length. And to see ... that one feature come up randomly happens to be one in 35 times.
Q. So one in every 35 shirts manufactured by the company using this patterned cloth, the exact same cloth, you could say that one in 35 probability-wise would come up with the exact same alignment between the yoke and the left sleeve?
A. That is correct.

Several things are going on here to give rise to a uniform probability distribution. First, the plaid design of the shirt comes from a repeated 3.5"-wide distinctive pattern that contains lines of at least two different thicknesses. Second, these lines can be offset from one another across the seams of the shirt. Third, Dr. Vorder Bruegge measures how large the offset is in 1/10th" strips on the 8x10" photographs. Finally, the exact offset -- and hence the strip in which a corresponding line on the other side of a seam falls -- is determined at random. Thus, the number (call it X) of 1/10th" strips that separate the starting points of the plaid blocks on the two different cuts of fabric that are sewn together along a seam can take on the values x = 0, 1, 2, ..., 34, with probability f(x) = 1/35 for every x. The following sketch of a repeating block, showing only one line in the pattern, may clarify what x stands for:
1/10" strip ---------| |
1/10" strip          | |
1/10" strip          | |---------} x=2 (2/10" offset)
1/10" strip          | |
...
1/10" strip ---------| |
1/10" strip          | |
1/10" strip          | |---------
1/10" strip          | |
...                   s
1/10" strip ---------|e|
1/10" strip          |a|
1/10" strip          |m|--------
The idea of assigning an equal probability to elementary events dates back to Bernoulli (1713) and Laplace (1814). The underlying principle of insufficient reason, indifference, or symmetry holds that if there is no reason to believe that any event in a set of mutually exclusive and exhaustive events is more likely to occur than any other, then one should assume that all the events are equally probable. This principle offers one way to motivate or understand the axioms of probability theory. Although it has fallen out of favor, it is not without defenders. 4/

But Vorder Bruegge did not base his probability model on a metaphysical or philosophical principle. He gave an empirical justification. During the short Daubert inquiry, he testified that he visited "manufacturing plants, factories where articles of clothing [are] made" and that "in this particular instance, I visited ... cutting plants in Alabama where the patterned material is cut out, as well as manufacturing plants where the cut out pieces are sewn together so that I can see for myself how the process takes place." Furthermore, "I've also been to another plant, Guess plant in Southern California, where shirts were also manufactured, and found that they use the same manufacturing process at the Guess Factory as they do at the Arrow shirt factories in Alabama and ... Georgia." He learned that "[m]anufacturers do not make an effort, in general, to make these features align, because to do so would be prohibitively expensive" and that "there are also places where it is not possible ... to make them align because of the curvature such as along the arms and the sleeves."

In essence, he proposed that busy workers stitched the pre-cut pieces of fabric together without regard to how well their patterns aligned with one another. He elaborated:

Q. All right. Could you ... explain the manufacturing process ... as to how shirts of this nature would be made in a factory?
A. [Y]ou start off with a huge bolt of cloth that can be hundreds and hundreds of yards long. The cloth is then laid down at a cutting plant on a table that [is] maybe about a hundred yards long. Now, these huge bolts will be rolled out on the table and then rolled back onto itself until the entire roll is done. Then another roll is attached, ... and it continues until you have on the order of 500 plies of material. Now, because of the way that the material is rolled back in, this pattern is not going to line up from one ply to the next. ...
Once the fabric is laid out and you've got 500 plies on top, the manufacturers lay out, basically, tracing paper that has cutting patterns. Just like if you were sewing at home and making your own clothes, you would have a pattern to cut out. Only this is a hundred yards long [and] has been designed by engineers whose only job is to figure out how best to place each piece of this shirt in its closest proximity so they are wasting as little of that fabric as possible. ...
[T]hey slap it down and they will actually have people get on there with jigsaws and cut them out by hand. Or in some of the plants they will have computers that can manage the cutting out. Once all of those pieces are cut out, they [are] transferred over to the sewers ... [Y]ou've got people who do nothing all day but sew on sleeves onto shoulders or yokes onto the back.
Q. ... So there will be, like, thousands of pieces for the collar and thousands ... for the sleeves and so forth?
A. ... If you have 500 plies thick, you have 500 pieces in one particular area. ... [O]n this shirt we've got two pieces on the collar. If you look closely at this shirt, you'll see there's actually a piece here it appears on the outside and a piece on the inside. There's another piece that goes around the collar; so that's three pieces. ... [All together, there are] 16 pieces of this same pattern cloth on this shirt [that must be sewn together].
Q. What significance does that have in terms of determining whether or not, you know, identifying the uniqueness of a particular shirt as opposed to any of the other shirts that are manufactured by the plant using the same cloth coloring?
A. Well, as I mentioned before, every one of these pieces is cut out. The plies line up on every row. If, in the manufacturing process, they happened to take a piece from the top layer and try to stitch it to a piece from the next layer down, you're going to get a lot of randomness in this process. Because there is this offset, you're not going to get this pattern lining up across the seam in any consistent way from one shirt to the next. ... [I]n fact, you can't get a curved seam like this to have an alignment across it, because it isn't geometrically possible.
You look at the pockets on this shirt and you'll see that they line up. [T]he fact that one of these pockets lines up makes this shirt twice as expensive to manufacture as it would be if this pocket were on the bias. [T]hey have to make sure that this pocket is cut out and exactly the same orientation as the front panel. Furthermore, they have to actually have a human being take the time to physically line this pocket up when they're sewing ... .
Q. Is there any effort to line up any of the other pieces of cloth? In other words, lining up the lines from the back to the sleeves, or the yolk to the sleeves, or the yolk to the front panels —
A. No. No. And you can see that for yourself just by comparing the way the left sleeve doesn't align with the yoke in the same way that the right sleeve aligns with the yoke. ...

The uniform-probability model of the placement of the lines is not as "preposterous" and "outrageous" as the ProPublica article suggests. But no one can see for themselves that there is no "effort to line up any of the other pieces of cloth" from the mere fact that the plaid patterns are displaced where a "sleeve aligns with the yoke." That is like saying that a marksman is shooting entirely at random because he missed the bullseye. There is a random component, but if the marksman is skilled, bullets are more likely to arrive near the center of the target than to be dispersed equally across it.

Our theory about the marksman firing at random would be more credible if we saw that he was wearing a blindfold and firing rapidly. Likewise, the theory in McKreith is that the workers won't "take the time to physically line [the plaid patterns] up when they are sewing." It could be a good theory, but how has it been validated? Might not some workers try to get the unit patterns to line up just a little bit as they join the pre-cut pieces of fabric? If that happened, smaller cross-seam offsets would be more probable than larger ones. We could test the theory empirically by inspecting shirts. If the uniform probability model is correct, we would expect to find roughly the same number of shirts at each possible offset of x tenths of an inch (x = 0, 1, 2, and so on) at a given seam in a large sample of Van Heusen shirts with the same design. The FBI apparently had no such data to support the probability model.
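To make the suggestion concrete, here is a minimal sketch, in Python, of how such an inspection could be analyzed with a standard chi-square goodness-of-fit test. Everything in it is hypothetical: the counts are simulated, the 35 equally spaced offset categories are my assumption based on the 1-in-35 figure, and nothing of the sort was done in McKreith.

import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

# Hypothetical survey: the cross-seam offset at one seam, measured in
# tenths of an inch (0 through 34), for 700 shirts of the same design.
# The "data" are simulated from a distribution that mildly favors small
# offsets, mimicking workers who line the plaid up "just a little bit."
weights = np.exp(-np.arange(35) / 20.0)
offsets = rng.choice(35, size=700, p=weights / weights.sum())

observed = np.bincount(offsets, minlength=35)
expected = np.full(35, 700 / 35)   # uniform model: every offset equally likely

stat, p = chisquare(observed, expected)
print(f"chi-square = {stat:.1f}, p = {p:.4f}")
# A small p-value would count against the uniform model; a large one would
# leave the model standing, although it would not prove it.

Of course, only measurements on real shirts, not a simulation, could tell us anything about the actual manufacturing process.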

To be sure, the probability model was empirically motivated — Dr. Vorder Bruegge informed himself about the manufacturing process by visiting several manufacturing plants. His sense of the process might be entirely correct. When interviewed, I was not told what model he had adopted or what the basis for it was. But even if I had been asked to study the testimony before reacting, I still might have said that even a plausible model could be "terribly flawed" (or at least not well validated).

"650 Billion, Give or Take a Few Billion"

A characteristic that results from the manufacturing process and that occurs with a probability of 1/35 does not "individualize" an item in the only-one-in-the-universe sense in which criminalists use the term. It is a generic feature, and Dr. Vorder Bruegge did not claim otherwise. Arriving at the ultimate opinion took a few more steps. The first was to assume that the offset at each seam is statistically independent of the offset at every other seam (and of every combination of them).

Q. Now, would the same randomness apply to all the other features in pieces of cloth that go into the shirt?
A. Yes, they would.
Q. So if you were able to, for example, reach a measurement as far as the left sleeve is concerned, with the yoke, would the line up — would the line up be the same or would it be, again, random with respect to the right sleeve?
A. There's going to be a 1 in 35 chance that it's going to be the same on the right sleeve as it is on the left sleeve.
Q. All right. So if you were able to find, for example, from the photographs, two points of identification, whereby the photograph matches the shirt in one area, maybe the yoke to the sleeve on the left side, and then you're able to find a second point of comparison or identification, say on the right sleeve and the yoke, what would be the odds of two shirts randomly being manufactured coming from the factory that would match this particular shirt?
A. It would be ... 1 in 35. But to simplify things and to be conservative, I prefer to use one in 30. By saying one in 30, that's — each giving it a better chance of being the same, but it makes the math easier. Thirty times 30 is 900. So one in 900 chance that you're going to find another shirt that has the left sleeve aligned to the yoke the same way and the right sleeve aligned to the yoke in the same way.
Q. All right. Now let's say you ... have a good enough picture that you can make three points identification. ...
A. Well, ... if we had sleeve to yoke, yoke to back, and yoke to sleeve, then that’s 30 times 30 times 30, which is one in 27,000.
Q. So if you were able to make three items of — points of identification the odds would be one in 27,000 that there would be two shirts randomly made from the factory that would have those three points of identification exactly the same?
A. Correct. ...
Q. And were you able to actually make those types of identification with respect to the photographs depicted in the bank robbery surveillance photos with this shirt ... .?
A. Yes. I was.
Q. And what was the highest number of points of identification that you were able to match up with respect to any particular bank robbery photo — in the surveillance photographs with respect to this shirt?
A. Eight. ...
Q. So 30 to the eighth power [would] be the odds in which two shirts would be randomly manufactured by the company [with] all those eight points of identification lining up exactly the same?
A. That's correct.
Q. And does your pocket calculator actually print out all the numbers that would come out if you were to insert 30 to the 8th power?
A. No. It came out to be 6.5 x 10 to the 11th, which is basically 650 billion, give or take a few billion.

Some of ProPublica's criticism of this part of the testimony misses the mark. The one example of the "[m]any problems in the examiner’s testimony [that] went unnoticed, or were simply unknown, during trial" is that "Vorder Bruegge undercut the precision of his calculations when he admitted having rounded down the shirt measurements used in his calculations because 'it makes the math easier.'" But how could anyone not notice that he used the figure of 1/30 instead of 1/35? And why is that reduction in "the precision of the calculations" a problem for the defendant? It means that the joint probability under the model is even smaller than the one in 650 billion implied by the pocket calculator's output.

When ProPublica contacted me and quoted the 1/650,000,000,000 figure, my reaction was "How could you get that number?" Not knowing anything about the case, I assumed it was the product of frequency estimates for different kinds of characteristics, such as shirt size, color, style, pattern, imperfections, discolorations, and so on. I doubted that such frequencies were known with sufficient accuracy to justify giving a single astronomically large denominator to a jury. 5/

Apparently, "Karen Kafadar, chair of the statistics department at the University of Virginia," had a similar reaction and was among the "seven statisticians and independent forensic scientists" who told ProPublica that "[t]he statistics were also preposterous" because "[t]he features Vorder Bruegge matched might be common in plaid shirts, making them of little value for identifying the garments." Indeed, Dr. Kafadar inveighed that the 1-in-650-billion claim "makes about as much sense as the statement two plus two equals five."

However, while the proposition that "two plus two equals five" seems to violate mathematical logic, it is not illogical to argue for a uniform probability distribution of offsets and to multiply probabilities as Dr. Vorder Bruegge did. If the 1/10" measurements are all correct, if the offsets are uniformly distributed over such strips, and if the independence assumption for different seams holds, then the probability of an equal offset at every one of n corresponding seams is (1/35)^n, just as he testified. The assumptions are part of a perfectly logical argument, and they are not inherently "preposterous." But they have not been the subject of any systematic study (that I know of).
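For readers who want to check the arithmetic, a few lines of Python (mine, not the FBI's) reproduce the multiplication under these assumptions:

# Joint probabilities under the independence assumption, using the rounded
# 1-in-30 figure from the testimony and the unrounded 1-in-35 figure.
for n in (2, 3, 8):
    print(f"{n} seams: 1 in {30**n:,} (using 1/30); 1 in {35**n:,} (using 1/35)")

# Eight seams at 1/30 each gives 1 in 656,100,000,000, i.e., the
# "650 billion, give or take a few billion." With the unrounded 1/35,
# the probability is even smaller: about 1 in 2.25 trillion.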

The Probability of Uniqueness

With all that said, the testimony had an unusual virtue compared with the typical "individualization" thinking of its day. Dr. Vorder Bruegge followed Lord Kelvin's dictum that "when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be." Dr. Vorder Bruegge was explicit and numerical about the grounds for concluding that only one shirt was involved in the pictures:

Q. [D]id you, during your research, contact the ... Van Heusen Company or the factory to determine ... approximately how many shirts using this cloth and this design were made by them?
A. They do not keep records of every single shirt that they ever make, but they have people who recognize patterns and they also know what their typical runs are. [A]t most, they said they would have made about no more than 18,000 of these shirts. ...
Q. So seven points of identification with respect to the bank robbery surveillance photos and the shirt with the Commerce Union bank robbery?
A. Yes.
Q. Let's go to the next robbery. ...
A. That would be seven, I believe.
Q. So the odds of this being random with two shirts, 30 to the seventh power?
A. If we use 30.
Q. Being conservative?
A. Yes. ...
Q. That's seven points of identification with respect to the Bank of America shirt?
A. Yes.
Q. [F]rom the ... Bank United robbery? ... Five positive points of comparison?
A. Right.
Q. Okay. South Trust ... Well it's either 30 to the 7th power or 30 to the 8th power or 30 to the 6th power?
A. That's it for South Trust. ...
Q. So that would be five points of identification for the Union Bank photographs?
A. Yes.
Q. [I]s there a point that you were able to conclude that all the photographs in all the different bank robberies are depicting the same shirt?
A. Well, for me in this case, it only took three points. Because with one to the 30th chance each time ... having three points ... I've got 30 times 30 times 30, which is 27,000, which is half again as many as 18,000. [S]o for my opinion it's enough to go over to say that three of those are enough. The fact that I've got four, five, six, seven and eight, just makes me all the more certain.
Q. And what is your opinion with respect to this comparison analysis?
A. They're all the same shirt.
Q. All right. And all the shirts. In the bank robbery surveillance photographs, do you have an opinion as to whether or not they are the same shirt as Exhibit 11, which is the questioned shirt?
A. Government Exhibit 11, in my opinion, is the shirt worn by the bank robber in each of these seven bank robberies. ...

Right or wrong, the testimony is transparent about the threshold for concluding that a picture has the defendant's shirt in it. For 18,000 shirts, Dr. Vorder Bruegge asserted, it takes only three seams to conclude that there is but one shirt in existence. But is this a reasonable criterion for an inference of uniqueness?

It might seem that way. One might be tempted to reason as follows: The random-match probability for two seams is 1/900. For 18,000 shirts manufactured as described in the uniform probability model, we would expect to find about 18,000 shirts x 1/900 matches per shirt = 20 shirts with two matching seams. That is way too many to claim individuality. But for 3 matching seams, the expected number is 18,000 x 1/27,000 = 2/3. In other words, we would expect to find lots of two-seam matches, but not even one three-seam match. So it does not look like a second shirt is very likely for three or more matching seams.

But there is a problem here that went unrecognized both in the case and in the article about it. I have called it the expected value fallacy. 6/ It consists of thinking that, as long as the expected number of items in a population is less than 1, the probability that there really is less than one (that is, none) must be very high. Life, or at least mathematics, is not this simple. Even when the expected number is a little less than 1, as it is here, the probability of at least one additional matching item is appreciable. In this case, it is about 3/10, as shown in Box 1.
BOX 1. THE PROBABILITY OF ANOTHER MATCHING SHIRT
The number y of items with a given characteristic that has a constant, small probability p of appearing in each item in a large population of size n is a Poisson random variable with the parameter λ = np. The probability of any y is f(y; λ) = λ^y e^(−λ) / y!. For the three-seam match, λ = 2/3. The conditional probability that at least one more item in the population has the characteristic, given the known fact that one such item is present, is [1 – f(0; λ) – f(1; λ)] / [1 – f(0; λ)] = 0.30.

In short, even ignoring any modeling and measurement uncertainty in McKreith, there is a 30% probability that at least one other Van Heusen shirt matching at three seams has been manufactured. This large a risk of an erroneous conclusion of individuality would not be acceptable under a 2000 FBI policy that uses similar reasoning in explaining when an analyst may testify that a DNA profile comes from a named individual.
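The figure in Box 1 is easy to reproduce. The short Python sketch below does so, and it also computes the analogous figure for a five-seam threshold, which comes up in the next paragraph. It uses the rounded 1-in-30 per-seam figure, matching the 1/900 and 1/27,000 numbers above.

from math import exp

def prob_another_match(n_shirts, per_seam, k):
    """Probability that at least one *other* shirt in the run matches at k
    seams, given that one matching shirt is known to exist, under the
    uniform-offset and independence assumptions (Poisson approximation)."""
    lam = n_shirts * per_seam ** k     # expected number of matching shirts
    p0 = exp(-lam)                     # Poisson probability of no match
    p1 = lam * exp(-lam)               # Poisson probability of exactly one
    return (1 - p0 - p1) / (1 - p0)    # P(two or more | at least one)

print(prob_another_match(18_000, 1/30, 3))   # about 0.30, as in Box 1
print(prob_another_match(18_000, 1/30, 5))   # about 0.0004, or 0.04%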

Thus, it appears that Dr. Vorder Bruegge chose a relatively undemanding quantitative threshold for source attribution. We can say this only because he was unusually clear about the probabilistic basis for his conclusion that only one shirt — the defendant's — appeared in all the pictures. It also should be noted that the final opinion would have been the same had he selected a threshold as high as five seams, for which the conditional probability of another matching Van Heusen shirt of the same type would be only 0.0004, or 0.04%. Still, the prosecution introduced the three-seam testimony to make the matches at five and more seams appear more impressive than they were. Even if this abuse of statistical reasoning did not affect the outcome, it was unfortunate.

NOTES
  1. Ryan Gabrielson, The FBI Says Its Photo Analysis Is Scientific Evidence. Scientists Disagree, Propublica, Jan. 17, 2019; Ryan Gabrielson, FBI Scientist’s Statements Linked Defendants to Crimes, Even When His Lab Results Didn’t, Propublica, Feb. 22, 2019.
  2. United States v. McKreith, 140 Fed.Appx. 112, 2005 WL 1600471 (11th Cir. 2005) (per curiam).
  3. Hans-Werner Sinn, A Rehabilitation of the Principle of Insufficient Reason, 94 Q. J. Econ. 493 (1980); Jon Williamson, Justifying the Principle of Indifference, 8 European J. Phil. Sci. 559 (2018).
  4. An oral ruling in the midst of a trial rarely creates or enshrines a rule of law. The U.S. Court of Appeals for the Eleventh Circuit affirmed the conviction, but it did not consider its opinion important enough to release for publication, and McKreith did not raise the Daubert claim on appeal. Although the January article states that McKreith “exhausted his appeals, most of which attempted to dispute the FBI Lab findings,” the Westlaw database’s history of the case displays only one direct appeal and one petition for postconviction relief for ineffective assistance of counsel. In neither of these attacks did McKreith challenge the scientific basis of the FBI lab’s work, and no court seems to have cited the case to support admitting similar testimony.
  5. The article states that "The statisticians who reviewed Vorder Bruegge’s materials for ProPublica said the examiner’s calculations cannot be correct. Vorder Bruegge’s statistic — 1 in 650 billion — is simply too astronomical to be true, said Kaye, the Penn State professor. There isn’t a database documenting features on plaid-shirt seams like there is for human DNA, making it impossible to determine the likelihood a different shirt would appear to match the robber’s shirt." I was not one of "[t]he statisticians who reviewed Vorder Bruegge’s materials for ProPublica," but I was (and am) of the view that estimating such small numbers on the basis of strong modeling assumptions alone is fraught with danger. The same thing can be said about DNA evidence. Decades ago, England's leading forensic statistician, Ian Evett, rhetorically asked me why American experts give such small probabilities in DNA matches instead of stopping at a number like one in a million.
  6. David H. Kaye, The Expected Value Fallacy in State v. Wright, 51 Jurimetrics J. 1 (2011).
POSTINGS IN THIS SERIES
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 1), Mar. 3, 2019
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 2), Mar. 19, 2019.
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 3), Mar. 20, 2019

Friday, March 15, 2019

Another US District Court Finds Firearms-mark Testimony Admissible in the Post-PCAST World

A new opinion on firearms-toolmark identification from the U.S. District Court for the Southern District of New York continues the clear trend of admitting such identifications notwithstanding the misgivings of three scientific study groups. The opinion by Judge Paul G. Gardephe is more careful than most, but it has some loose ends.

In United States v. Johnson, (S5) 16 Cr. 281 (PGG), 2019 WL 1130258 (S.D.N.Y. Mar. 11, 2019), the Government brought firearms-related charges against Latique Johnson, the alleged leader of the New York and Pennsylvania street gang “Blood Hound Brims.” One indictment alleged that he shot at rival gang members in a restaurant in the Bronx. New York City Police Department Detective Jonathan Fox concluded that “toolmark identification” analysis showed that bullets collected from the restaurant were fired from an AK 47 semi-automatic assault rifle that an undercover officer purchased from a former gang member a year-and-a-half later (and that was rumored to be the one Johnson used in the restaurant).

Johnson moved before trial to prevent Detective Fox from testifying, or at least to bar an opinion that the ammunition was fired from the AK 47. During the trial, the court conducted a hearing. Detective Fox testified that the premise for “what we [microscopists] do” is “that the tools that are used to manufacture ... firearms leave marks on the inside of the firearms that are unique to that particular tool.”

The court did not question the premise that certain marks are "individual characteristics" that must be "unique" to a particular tool when viewed under a microscope. Why insist on empirical proof of that theoretical proposition? Surely, the marks made by guns on ammunition are unique at the molecular level. But that intuition proves too much. It means that every shell or cartridge case fired from the same gun will be unique on that scale.

So the claim of uniqueness has to pertain to features as they actually are measured or perceived, and that claim requires empirical verification. Nevertheless, the belief in universal uniqueness is not critical to the work of firearms examiners. They could be incredibly accurate in associating expended ammunition with specific guns even if the high level of similarity between ammunition components from test fires and components recovered from a crime scene is not quite unique with respect to every gun that has ever existed.

In Johnson, the court found that Detective Fox followed the Association of Firearm and Tool Mark Examiners’ “Theory of Identification as it Relates to Tool Marks.” This theory of how to interpret marks requires “sufficient agreement” for a positive association. “Sufficient agreement” comes from the examiner’s sense that the level of agreement is greater than that for different guns and is within the range for ammunition from the same gun. It “means that the agreement of individual characteristics is of a quantity and quality that the likelihood another tool could have made the mark is so remote as to be considered a practical impossibility.” 1/

In challenging the AFTE theory and its application, Johnson relied “primarily on the 2008 National Research Council Report, Ballistic Imaging ... ;  the 2009 National Research Council Report, Strengthening Forensic Science in the United States: A Path Forward ... ; and the 2016 report of the President’s Council of Advisors on Science and Technology, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods (2016) ... .” The court recognized that these reports “challenge, to varying degrees, the scientific basis for toolmark identification analysis.”

After summarizing the statements in the three reports on the limited scientific knowledge of the accuracy of firearms-mark identification 2/ and reviewing the federal case law, 3/ the court “conclude[d] that toolmark identification analysis — at least as performed by Detective Fox — is sufficiently reliable to be presented to the jury.” In reaching this result, the court considered the canonical “Daubert factors.”

"Testability"

First, it found sufficient testing of the ability of examiners to reach the correct results in experiments and proficiency tests. According to the PCAST report, the published experiments were not properly designed and hence could not establish validity. As for proficiency tests, the court later recognized that they may not be representative of casework and they were not blind. So although the AFTE theory surely is testable, the extent to which it has been adequately tested remains open to debate despite the court's conclusion.

Oddly, the court also proposed that “photographic documentation and independent review” in case work tests the theory or practice. Taking pictures is useful, but it does not validate the process, and even blind verification only tests consistency between two examiners. Such reproducibility is not the same as accuracy, and validation requires comparing the examiners’ judgments to the true state of affairs — what forensic scientists sometimes call “ground truth.”

Peer Review and Publication

Second, the court determined that there was adequate “peer review and publication” — “mostly” in a journal of the Association of Firearm and Tool Mark Examiners. This part of the opinion is perfunctory. The court accepted at face value the assurance in a 2003 AFTE Journal article that the journal uses “experts within the scientific community” for “technical review.”

Academic and other critics have complained that the publication is basically a trade journal without peer review by qualified scientists. For example, one forensic scientist wrote that
The AFTE Journal ... is not freely available and requires written testimony [sic] from existing AFTE members. It has extremely limited dissemination beyond the members of AFTE — it is found only in 18 libraries of the 72,000 libraries listed in World CAT, the largest catalog of library materials — and completely lacks integration with any of the voluminous networks for the production and exchange of scientific research information. The journal engages in peer review that is neither blind nor draws on an extensive network of researchers. This is not an attitude in keeping with the openness that is a part of any true scientific research culture. 4/
That was 2014. Today, the journal is in only nine more libraries. Only sixteen university libraries make it available.

Courts evaluating the extent of publication and peer review may wish to consider the views of the National Commission on Forensic Science. This Commission, formed by the Department of Justice and the National Institute of Standards and Technology, issued guidance on what constitutes “scientific literature” for supporting forensic science and practice. The Commission’s criteria include publication “in a journal that utilizes rigorous peer review with independent external reviewers to validate the accuracy in its publications and their overall consistency with scientific norms of practice.”

AFTE emphasizes that it has long used a formal peer review system. A position paper published in its journal in 2015 commented as follows:
While it is unclear what the NCFS considers “rigorous” and “independent”, all submissions to the AFTE Journal undergo a thorough two-stage review process conducted by subject matter experts who are members of the AFTE Editorial Committee. Occasionally, a reviewer may need to enlist the knowledge of an expert outside of the field to complete a review; however, it would be extremely difficult to use only external reviewers because most qualified potential reviewers also tend to be AFTE members. Research manuscripts submitted to the AFTE Journal are independently reviewed by at least two reviewers from the Editorial Committee ... .
Controlling Standards

Third, the court expressed more hesitation in finding “controlling standards.” It wrote that
[B]oth courts and the scientific community have voiced serious concerns about the “sufficient agreement” standard, characterizing it as “tautological,” “wholly subjective,” “circular,” “leav[ing] much to be desired,” and “not scientific.” The Court shares some of these concerns.
Nonetheless, Judge Gardephe deemed “photographic documentation and verification requirements” and “extensive AFTE training and proficiency testing” to be controlling standards.

This logic is hard to follow. Record-keeping, verification, training, and proficiency testing all are critical components of a quality assurance system. But they are not standards that control how a subjective judgment is made. Detective Fox was clear about the degree of subjectivity. He "stated that ... he employs a holistic approach incorporating his 'training as a whole' and his experience 'based on all the cartridge casings and ballistics that [he] ha[s] identified and compared.'" Still, the court managed to extract "certain principles that ground his conclusions":
For example, the CMS standard — six consecutive matching striations or two groups of three matching striations — represents a “bottom standard” or a floor for declaring a match. Detective Fox will not declare that “sufficient agreement” exists unless microscopic examination reveals a toolmark impression with one area containing six consecutive matching individual characteristics, or two areas with three consecutive matching individual characteristics. ... Detective Fox’s analysis does not end at that point, however. Instead, Detective Fox goes on to examine every impression on the ballistics evidence. “All these lines should match,” as well, and if they do not, Detective Fox will not find “sufficient agreement.”
“These criteria,” the court concluded, “provide standards for Detective Fox’s findings as to ‘sufficient agreement.’”

"Rate of Error"

Fourth, the court regarded proficiency test data as suggestive of a modest error rate and added that “even accepting the PCAST Report’s assertion that the error rate could be as high as 1 in 46, or close to 2.2%, such an error rate is not impermissibly high.” The court did not maintain that the one unpublished and unreplicated study of cartridge shell casings that PCAST used to generate this figure (for a Ruger pistol) provided a precise estimate of a false-positive error rate. Somewhat limply, it concluded “that the absence of a definite error rate for toolmark identification does not require that such evidence be precluded.” The judge did not discuss PCAST's recommendation that the upper-bound error rate be presented to the jury.

General Acceptance

Finally, the court spent little energy on the issue of general acceptance. Assuming that the relevant scientific community was limited to “forensic scientists, and firearms examiners in particular,” it perceived “no dispute.”

* * *

In denying the motion to exclude or limit the testimony, the court ironically saw the fact that the problems with firearms identifications are apparent as favoring — or at least not inhibiting — admission. It described “the weaknesses in the methodology of toolmark identification analysis” as “readily apparent,” “as discussed at length in the scientific literature,” and as “not particularly complicated or difficult to grasp.” Thus, they “are likely to be understood by jurors if addressed on cross-examination.”

At the same time, Judge Gardephe did not approve of all types of firearms-marks testimony. He recognized that overclaiming can occur; however, he saw no risk of it happening in this case:
Having heard Detective Fox’s testimony at the Daubert hearing, it is clear that he does not intend to assert — and the Government does not intend to elicit — any particular degree of certainty as to his opinions regarding the ballistics match. ... Indeed, Detective Fox’s repeated concession at the Daubert hearing that his conclusions are “based on [his] subjective opinion” stands in stark contrast to the “tendency of [other] ballistics experts ... to make assertions that their matches are certain beyond all doubt.” ... Detective Fox also testified that he “would never” state his conclusion that ballistics evidence matches to a particular firearm “to the exclusion of all other firearms ... in a court proceeding[,] ... because I haven’t looked at all other firearms.”
Although the last concession is refreshing, it creates a puzzle -- Are "individual characteristics" truly individual? If the “individual characteristics” obviously matched — as apparently they did 5/ — why shouldn't the examiner testify that they exclude every other gun? Isn’t it “practically impossible,” to use the AFTE phrase, that another AK 47 fired the bullets?

NOTES
  1. The current version of the official theory is at https://afte.org/about-us/what-is-afte/afte-theory-of-identification.
  2. The opinion also quoted the trivially true statements in the reports that in describing the scientific status of the pattern-identification methods, the groups were not themselves taking a stand on the legal question of their admissibility.
  3. The district court wrote that
    In assessing reliability, ‘the district court must focus on the principles and methodology employed by the expert, without regard to the conclusions the expert has reached or the district court’s belief as to the correctness of those conclusions.’ Amorgianos v. Nat’l R.R. Passenger Corp., 303 F.3d 256, 266 (2d Cir. 2002).
    In General Electric Co. v. Joiner, 522 U.S. 136 (1997), however, the Supreme Court wrote that “conclusions and methodology are not entirely distinct from one another,” and Rule 702(d) specifies that “the expert has reliably applied the principles and methods to the facts of the case.”
  4. David Klatzow, Justice Denied: Role of Forensic Science in the Miscarriage of Justice (2014).
  5. The court was impressed that “[t]he ‘matching’ ... is stark, even to an untrained observer.”
FURTHER READING