Sunday, July 21, 2019

Confidence Intervals -- If Only It Were That Simple

Confidence Interval: Statistics such as means (or averages) and medians are often calculated from data from a portion—or sample—of a population rather than from data for an entire population. Statistics based on sample data are called “sample statistics,” whereas those based on an entire population are called “population parameters.” A confidence interval is the range of values of a sample statistic that is likely to contain a population parameter, and that likeliness is expressed with a specific probability. For example, if a study of a sample of 1,500 Americans finds their average weight to be 150 pounds with a 95 percent confidence interval of plus/minus 25 pounds, this means that there is a 95 percent probability that the average weight of the entire American population is between 125 and 175 pounds. --Wm. Nöel & Judy Wang, Is Cannabis a Gateway Drug? Key Findings and Literature Review: A Report Prepared by the Federal Research Division, Library of Congress, Under an Interagency Agreement with the Office of the Director, National Institute of Justice, Office of Justice Programs, U.S. Department of Justice, Nov. 2018, at 3.

[T]here is a 5 percent chance the true value [of a 95% one-sided confidence interval] exceeds the bound. --President’s Council of Advisors on Science and Technology, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, Sept. 2016, at 153.

[T]he confidence level does not give the probability that the unknown parameter lies within the confidence interval. ... According to the frequentist theory of statistics, probability statements cannot be made about population characteristics: Probability statements apply to the behavior of samples. That is why the different term ‘confidence’ is used. --David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 211, 247 (Federal Judicial Center & National Research Council Committee on the Development of the Third Edition of the Reference Manual on Scientific Evidence eds., 3d ed. 2011).

Warning! ... [T]he fact that a confidence interval is not a probability statement about [an unknown value] is confusing. --Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference 92-93 (2004) (emphasis in original).
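
The point of these warnings can be illustrated with a short simulation. The sketch below is my own illustration, not drawn from any of the quoted sources, and the population values are invented: a 95% confidence procedure covers the true mean in roughly 95% of repeated samples, yet any single realized interval either contains the true value or it does not.

import numpy as np

rng = np.random.default_rng(1)
true_mean, sd, n, trials = 150.0, 30.0, 1500, 10_000   # hypothetical weight data

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sd, size=n)
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)    # approximate 95% CI
    covered += (sample.mean() - half_width <= true_mean <= sample.mean() + half_width)

print(f"Fraction of intervals covering the true mean: {covered / trials:.3f}")
# Typically close to 0.95 -- a statement about the long-run behavior of the
# procedure, not a 95% probability that the parameter lies in any one interval.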

Wednesday, July 17, 2019

No Tension Between Rule 704 and Best Principles for Interpreting Forensic-science Test Results

At a webinar on probabilistic genotyping organized by the FBI, the Department of Justice’s Senior Advisor on Forensic Science, Ted Hunt, summarized the rules of evidence that are most pertinent to scientific and expert testimony. In the course of a masterful survey, he suggested that Federal Rule of Evidence 704 somehow conflicts with the evidence-centric approach to evaluating laboratory results recommended by a subcommittee of the National Commission on Forensic Science, by the American Statistical Association, and by European forensic-science service providers. 1/ In this approach, the expert stops short of opining on whether the defendant is the source of the trace. Instead, the expert merely reports that the data are L times more probable when one source hypothesis is true than when some alternative source hypothesis is true. (Or, the expert gives some qualitative expression such as "strong support" when this likelihood ratio is large.)
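
To make the reporting format concrete, here is a minimal numerical sketch. The probabilities and the verbal scale below are purely hypothetical illustrations of my own, not figures from the webinar or from any of the cited recommendations.

p_same = 0.08        # assumed P(observed features | trace came from the suspected source)
p_diff = 0.0002      # assumed P(observed features | trace came from an alternative source)

likelihood_ratio = p_same / p_diff   # here, 400
print(f"The findings are about {likelihood_ratio:.0f} times more probable "
      "if the suspected source left the trace than if an alternative source did.")

# Some laboratories also map the number onto a verbal scale; the cutoffs below
# are illustrative only, not any organization's official scale.
if likelihood_ratio >= 10_000:
    phrase = "very strong support"
elif likelihood_ratio >= 100:
    phrase = "strong support"
elif likelihood_ratio >= 10:
    phrase = "moderate support"
else:
    phrase = "limited support"
print(f"Qualitative expression: {phrase} for the same-source hypothesis.")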

Whatever the merits of these proposals, Rule 704 does not stand in the way of implementing the recommended approach to reporting and testifying. First, the identity of the source of a trace is not necessarily an ultimate issue. To use the example of latent-print identification given in the webinar, the traditional opinion that a named individual is the source of a print is not an opinion on an ultimate issue. Courts have long allowed examiners to testify that the print lifted from a gun comes from a specific finger. But this conclusion is not an opinion on whether the murder defendant is the one who pulled the trigger. The examiner’s source attribution bears on the ultimate issue of causing the death of a human being, but the examiner who reports that the prints were the defendant's is not opining that the defendant not only touched the gun (or had prints planted on it) but also pulled the trigger. Indeed, the latent print examiner would have no scientific basis for such an opinion on an element of the crime of murder.

Furthermore, even when an expert does want to express an opinion on an ultimate issue, Rule 704 does not counsel in favor of admitting it into evidence. Rule 704(a) consists of a single sentence: “An opinion is not objectionable just because it embraces an ultimate issue.” The sole function of these words is to repeal an outmoded, common-law rule categorically excluding these opinions. The advisory committee that drafted this repealing rule explained that “to allay any doubt on the subject, the so-called ‘ultimate issue’ rule is specifically abolished by the instant rule.” The committee expressed no positive preference for such opinions over evidence-centric expert testimony. It emphasized that Rules 701, 702, and 403 protect against unsuitable opinions on ultimate issues. Modern courts continue to exclude ultimate-opinion testimony when it is not sufficiently helpful to jurors. For example, conclusions of law remain highly objectionable.

Consequently, any suggestion that Rule 704 is an affirmative reason to admit one kind of testimony over another is misguided. “The effect of Rule 704 is merely to remove the proscription against opinions on ‘ultimate issues' and to shift the focus to whether the testimony is ‘otherwise admissible.’” 2/ If conclusion-centric testimony is admissible, then so is the evidence-centric evaluation that lies behind it--with or without the conclusion.

In sum, there is no tension between Rule 704(a) and the recommendation to follow the evidence-centric approach. Repealing a speed limit on a road does not imply that drivers should put the pedal to the floor.

NOTES
  1. This is the impression I received. The recording of the webinar should be available at the website of the Forensic Technology Center of Excellence in a week or two.
  2. Torres v. County of Oakland, 758 F.2d 147, 150 (6th Cir. 1985).
UPDATED: 18 July 2019 6:22 AM

LATER POSTING ON THE SUBJECT:
https://for-sci-law.blogspot.com/2019/11/more-on-rule-704-and-source-attribution.html

Saturday, July 6, 2019

Distorting Daubert and Parting Ways with PCAST in Romero-Lobato

United States v. Romero-Lobato 1/ is another opinion applying the criteria for admissibility of scientific evidence articulated in Daubert v. Merrell Dow Pharmaceuticals 2/ to uphold the admission of a firearms examiner's conclusion that the microscopic marks on recovered bullets prove that they came from a particular gun. To do so, the U.S. District Court for the District of Nevada rejects the conclusions of the President's Council of Advisors on Science and Technology (PCAST) on validating a scientific procedure.

This is not to say that the result in the case is wrong. There is a principled argument for admitting suitably confined testimony about matching bullet or ammunition marks. But the opinion from U.S. District Court Judge Larry R. Hicks does not contain such an argument. The court does not reach the difficult question of how far a toolmark expert may go in forging a link between ammunition and a particular gun. It did not have to. In what seems to be a poorly developed challenge to firearms-toolmark expertise, the defense sought to exclude all testimony about such an association.

This posting describes the facts of the case, the court's description of the law on the admissibility of source attributions by firearms-toolmark examiners, and its review of the practice under the criteria for admitting scientific evidence set forth by the Supreme Court in Daubert.

I. THE FIREARMS EVIDENCE IN THE CASE

A grand jury indicted Eric Romero-Lobato for seven felonies. On March 4, 2018, he allegedly tried to rob the Aguitas Bar and Grill and discharged a firearm (a Taurus PT111 G2) into the ceiling. On May 14, he allegedly stole a woman's car at gunpoint while she was cleaning it at a carwash. Later that night, he crashed the car in a high-speed chase. On the front passenger's seat was a Taurus PT111 G2 handgun.

Steven Johnson, a supervising criminalist in the Forensic Science Division of the Washoe County Sheriff's Office, 3/ was prepared to testify that the handgun had fired a round into the ceiling of the bar. Romero-Lobato moved "to preclude the testimony." The district court held a pretrial hearing at which Johnson testified to his background, training, and experience. He explained that he matched the bullet to the gun using the "AFTE method" advocated by the Association of Firearm and Tool Mark Examiners.

Defendant's challenge rested "on the critical NAS and PCAST Reports as evidence that 'firearms analysis' is not scientifically valid and fails to meet the requisite threshold for admission under Daubert and Federal Rule of Evidence 702." Apparently, the only expert at the hearing was the Sheriff's Office criminalist. Judge Hicks denied the motion to exclude Johnson's expert opinion testimony and issued a relatively detailed opinion.

II. SETTING THE STAGE FOR A DAUBERT ANALYSIS

Skipping over the early judicial resistance to "this is the gun" testimony, 4/ the court noted that despite misgivings about such testimony on the part of several federal district courts, only one reported case has barred all source opinion testimony, 5/ and that the trend among the more critical courts is to search for ways to admit the conclusion with qualifications on its certainty.

Judge Hicks did not pursue the possibility of admitting but constraining the testimony, apparently because the defendant did not ask for that. Instead, the court reasoned that to overcome the inertia of the current caselaw, a defendant must have extraordinarily strong evidence (although it also recognized that the burden is on the government to prove scientific validity under Daubert). The judge wrote:
[T]he defense has not cited to a single case where a federal court has completely prohibited firearms identification testimony on the basis that it fails the Daubert reliability analysis. The lack of such authority indicates to the Court that defendant's request to exclude Johnson's testimony wholesale is unprecedented, and when such a request is made, a defendant must make a remarkable argument supported by remarkable evidence. Defendant has not done so here.
Defendant's less-than-remarkable evidence was primarily two consensus reports of scientific and other experts who reviewed the literature on firearms-mark comparisons. 6/ Both are remarkable. The first document was the highly publicized National Academy of Sciences committee report on improving forensic science. The committee expressed concerns about the largely subjective comparison process and the absence of studies to adequately measure the uncertainty in the evaluations. The court deemed these concerns to be answered by a single research report submitted to the Department of Justice, which funded the study:
The NAS Report, released in 2009, concluded that “[s]ufficient studies have not been done to understand the reliability and repeatability” of firearm and toolmark examination methods. ... The Report's main issue with the AFTE method was that it did not provide a specific protocol for determining a match between a shell casing or bullet and a specific firearm. ... Instead, examiners were to rely on their training and experience to determine if there was a “sufficient agreement” (i.e. match) between the mark patterns on the casing or bullet and the firearm's barrel. ... During the Daubert hearing, Johnson testified about his field's response to the NAS Report, pointing to a 2013 study from Miami-Dade County (“Miami-Dade Study”). The Miami-Dade Study was conducted in direct response to the NAS Report and was designed as a blind study to test the potential error rate for matching fired bullets to specific guns. It examined ten consecutively manufactured barrels from the same manufacturer (Glock) and bullets fired from them to determine if firearm examiners (165 in total) could accurately match the bullets to the barrel. 150 blind test examination kits were sent to forensics laboratories across the United States. The Miami-Dade Study found a potential error rate of less than 1.2% and an error rate by the participants of approximately 0.007%. The Study concluded that “a trained firearm and tool mark examiner with two years of training, regardless of experience, will correctly identify same gun evidence.”
A more complete (and accurate) reading of the Miami-Dade Police Department's study shows that it was not designed to measure error rates as they are defined in the NAS report and that the "error rate" was much closer to 1%. That's still small, and, with truly independent verification of an examiner's conclusions, the error rate should be smaller than that for examiners whose findings are not duplicated. Nonetheless, as an earlier posting shows, the data are not as easily interpreted and applied to case work as the report from the crime laboratory suggests. The research study, which has yet to appear in any scientific journal, has severe limitations.

The second report, released late in 2016 by the President's Council of Advisors on Science and Technology (PCAST), flatly maintained that microscopic firearms-marks comparisons had not been scientifically validated. Essentially dismissing the Miami-Dade Police and earlier research as not properly designed to measure the ability of examiners to infer whether the same gun fired test bullets and ones recovered from a crime scene, PCAST reasoned that (1) AFTE-type identification had yet to be shown to be "reliable" within the meaning of Rule 702 (as PCAST interpreted the rule); and (2) if courts disagreed with PCAST's legal analysis of the rule's requirements, they should at least require examiners associating ammunition with a particular firearm to give an upper bound, as ascertained from controlled experiments, on false-positive associations. (These matters are discussed in previous postings.)
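
For readers unfamiliar with the second recommendation, the sketch below shows one standard way to compute an upper confidence bound on a false-positive rate from a controlled study. The counts are hypothetical, and the Clopper-Pearson bound is only one common choice; the PCAST report should be consulted for the exact computation it envisions.

from scipy.stats import beta

false_positives = 22     # hypothetical count of erroneous "identifications"
comparisons = 2180       # hypothetical number of different-source comparisons examined

observed_rate = false_positives / comparisons
# One-sided 95% Clopper-Pearson upper confidence bound on the false-positive probability.
upper_bound_95 = beta.ppf(0.95, false_positives + 1, comparisons - false_positives)

print(f"Observed false-positive rate:  {observed_rate:.2%}")
print(f"95% upper confidence bound:    {upper_bound_95:.2%}")
# Under the recommendation, an examiner would report the upper bound (the rate
# "could be as high as" this value) rather than only the observed study rate.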

The court did not address the second conclusion and gave little or no weight to the first one. It wrote that the 2016 report
concluded that there was only one study done that “was appropriately designed to test foundational validity and estimate reliability,” the Ames Laboratory Study (“Ames Study”). The Ames Study ... reported a false-positive rate of 1.52%. ... The PCAST Report did not reach a conclusion as to whether the AFTE method was reliable or not because there was only one study available that met its criteria.
All true. PCAST certainly did not write that there is a large body of high-quality research that proves toolmark examiners cannot associate expended ammunition with specific guns. PCAST's position is that a single study is not a body of evidence that establishes a scientific theory--replication is crucial. If the court believed that there is such a body of literature, it should have explained the basis for its disagreement with the Council's assessment of the literature. If it agreed with PCAST that the research base is thin, then it should have explained why forensic scientists should be able to testify--as scientists--that they know which gun fired which bullet. This opinion does neither. (I'll come to the court's discussion of Daubert below.)

Instead, the court repeats the old news that
the PCAST Report was criticized by a number of entities, including the DOJ, FBI, ATF, and AFTE. Some of their issues with the Report were its lack of transparency and consistency in determining which studies met its strict criteria and which did not and its failure to consult with any experts in the firearm and tool mark examination field.
Again, all true. And all so superficial. That prosecutors and criminal investigators did not like the presidential science advisors' criticism of their evidence is no surprise. But exactly what was unclear about PCAST's criteria for replicated, controlled, experimental proof? In fact, the DOJ later criticized PCAST for being too clear--for having a "nine-part" "litmus test" rather than more obscure "trade-offs" with which to judge what research is acceptable. 7/

And what was the inconsistency in PCAST's assessment of firearms-marks comparisons? Judge Hicks maintained that
The PCAST Report refused to consider any study that did not meet its strict criteria; to be considered, a study must be a “black box” study, meaning that it must be completely blind for the participants. The committee behind the report rejected studies that it did not consider to be blind, such as where the examiners knew that a bullet or spent casing matched one of the barrels included with the test kit. This is in contrast to studies where it is not possible for an examiner to correctly match a bullet to a barrel through process of elimination.
This explanation enucleates no inconsistency. The complaint seems to be that PCAST's criteria for validating a predominantly subjective feature-comparison procedure are too demanding or restrictive, not that these criteria were applied inconsistently. Indeed, no inconsistency in applying the "litmus test" for an acceptable research design to firearms-mark examinations is apparent.

Moreover, the court's definition of "a 'black box' study" is wrong. All that PCAST meant by "black box" is that the researchers are not trying to unpack the process that examiners use and inspect its components. Instead, they say to the examiner, "Go ahead, do your thing. Just tell us your answer, and we'll see if you are right." The term is used by software engineers who test complex programs to verify that the outputs are what they should be for the inputs. The Turing test for the proposition that "machines can think" is a kind of black box test.
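
A toy software example, entirely my own illustration with made-up data, conveys the sense in which PCAST used the term: the testers compare outputs to known ground truth without inspecting how the answers are produced.

def classify_pair(marks_a, marks_b):
    # Stand-in for the examiner or program under test; treated as a black box.
    shared = len(set(marks_a) & set(marks_b))
    return "identification" if shared >= 3 else "exclusion"

# Each test case: two sets of (made-up) mark features plus the known ground truth.
test_cases = [
    ({"a", "b", "c", "d"}, {"a", "b", "c"}, "identification"),
    ({"a", "b", "c"}, {"x", "y", "z"}, "exclusion"),
    ({"a", "b", "x"}, {"a", "b", "y"}, "exclusion"),
]

errors = sum(classify_pair(m1, m2) != truth for m1, m2, truth in test_cases)
print(f"{errors} errors in {len(test_cases)} black-box trials")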

Nonetheless, this correction is academic. The court is right about the fact that PCAST gave no credence to "closed tests" like those in which an examiner sorts bullets into pairs knowing in advance that every bullet has a mate. Such black-box experiments are not worthless. They show a nonzero level of skill, but they are easier than "open tests" in which an examiner is presented with a single pair of bullets to decide whether they have a common source, then another pair, and another, and so on. In Romero-Lobato, the examiner had one bullet from the ceiling to compare to a test bullet he fired from one suspect gun. There is no "trade-off" that would make the closed-test design appropriate for establishing the examiner's skill at the task he performed.

All that remains of the court's initial efforts to avoid the PCAST report is the tired complaint about a "failure to consult with any experts in the firearm and tool mark examination field." But what consultation does the judge think was missing? The scientists and technologists who constitute the Council asked the forensic science community for statements and literature to support their practices. It shared a draft of its report with the Department of Justice before finalizing it. After releasing the report, it asked for more responses and issued an addendum. Forensic-services providers may complain that the Council did not use the correct criteria, that its members were closed-minded or biased, or that the repeated opportunities to affect the outcome were insufficient or even a sham. But a court needs more than a throw-away sentence about "failure to consult" to justify treating the PCAST report as suspect.

III. DISTORTING DAUBERT

Having cited a single, partly probative police laboratory study as if it were a satisfactory response to the National Academy's concerns and having colored the President's Council report as controversial without addressing the limited substance of the prosecutors' and investigators' complaints, the court offered a "Daubert analysis." It marched through the five indicia that the Supreme Court enumerated as factors that courts might consider in assessing scientific validity and reliability.

A. It Has Been Tested

The Romero-Lobato opinion made much of the fact that "[t]he AFTE methodology has been repeatedly tested" 8/ through "numerous journals [sic] articles and studies exploring the AFTE method" 9/ and via Johnson's perfect record on proficiency tests as proved by his (hearsay and character evidence) testimony. Einstein once expressed impatience "with scientists who take a board of wood, look for its thinnest part and drill a great number of holes where drilling is easy." 10/ Going through the drill of proficiency testing does not prove much if the tests are simple and unrealistic. A score of trivial or poorly designed experiments should not engender great confidence. The relevant question under Daubert is not simply "how many tests so far?" It is how many challenging tests have been passed. The opinion makes no effort to answer that question. It evinces no awareness of the "10 percent error rate in ballistic evidence" noted in the NAS Report that prompted corrective action in the Detroit Police crime laboratory.

Instead of responding to PCAST's criticisms of the design of the AFTE Journal studies, the court wrote that "[a]lthough both the NAS and PCAST Reports were critical of the AFTE method because of its inherent subjectivity, their criticisms do not affect whether the technique they criticize has been repeatedly tested. The fact that numerous studies have been conducted testing the validity and accuracy of the AFTE method weighs in favor of admitting Johnson's testimony."

But surely the question under Daubert is not whether there have been "numerous studies." It is what these studies have shown about the accuracy of trained examiners to match a single unknown bullet with control bullets from a single gun. The court may have been correct in concluding that the testing prong of Daubert favors admissibility here, but its opinion fails to demonstrate that "[t]here is little doubt that the AFTE method of identifying firearms satisfies this Daubert element."

B. Publication and Peer Review

Daubert recognizes that, to facilitate the dissemination, criticism, and modification of theories, modern science relies on publication in refereed journals that members of the scientific community read. Romero-Lobato deems this factor to favor admission for two reasons. First, the AFTE Journal, in which virtually all the studies dismissed by PCAST appear, uses referees. That it is not generally regarded as a significant scientific journal -- it is not available through most academic libraries, for example -- went unnoticed.

Second, the court contended that "of course, the NAS and PCAST Reports themselves constitute peer review despite the unfavorable view the two reports have of the AFTE method. The peer review and publication factor therefore weighs in favor of admissibility." The idea that the rejection in consensus reports of a series of studies as truly validating a theory "weighs in favor of admissibility" is difficult to fathom. Some readers might find it preposterous.

C. Error Rates

Just as the court was content to rely on the absolute number of studies as establishing that the AFTE method has been adequately tested, it took the error rates reported in the questioned studies at face value. Finding the numbers to be "very low," and implying (without explanation) that PCAST's criteria are too "strict," it concluded that Daubert's "error rate" factor, too, "weighs in favor of admissibility."

A more plausible conclusion is that a large body of studies that fail to measure the error rates (false positive and negative associations) appropriately but do not indicate very high error rates is no more than weakly favorable to admission. (For further discussion, see the previous postings on the court's discussion of the Miami Dade and Ames Laboratory technical reports.)

D. Controlling Standards

The court cited no controlling standards for the judgment of "'sufficient agreement' between the 'unique surface contours' of two toolmarks." After reciting the AFTE's definition of "sufficient agreement," Judge Hicks decided that "matching two tool marks essentially comes down to the examiner's subjective judgment based on his training, experience, and knowledge of firearms. This factor weighs against admissibility."

However, the opinion adds that "the consecutive matching striae ('CMS') method," which Johnson used after finding "sufficient agreement," is "an objective standard under Daubert." It is "objective" because an examiner cannot conclude that there is a match unless he "observes two or more sets of three or more consecutive matching markings on a bullet or shell casing." The opinion did not consider the possibility that this numerical rule does little to confine discretion if no standard guides the decision of whether a marking matches. Instead, the opinion debated whether the CMS method should be considered objective and confused that question with how widely the method is used.

The relevant inquiry is not whether a method is subjective or objective. For a predominantly subjective method, the question is whether standards for making subjective judgments will produce more accurate and more reliable (repeatable and reproducible) decisions and how much more accurate and reliable they will be.

E. General Acceptance

Finally, the court found "widespread acceptance in the scientific community." But the basis for this conclusion was flimsy. It consisted of statements from other courts like "the AFTE method ... is 'widely accepted among examiners as reliable'" and "[t]his Daubert factor is designed to prohibit techniques that have 'only minimal support' within the relevant community." Apparently, the court regarded the relevant community as confined to examiners. Judge Hicks wrote that
it is unclear if the PCAST Report would even constitute criticism from the “relevant community” because the committee behind the report did not include any members of the forensic ballistics community ... . The acceptance factor therefore weighs in favor of admitting Johnson's testimony.
If courts insulate forensic-science service providers from the critical scrutiny of outside scientists, how can they legitimately use the general-acceptance criterion to help ascertain whether examiners are presenting "scientific knowledge" à la Daubert?

NOTES
  1. No. 3:18-cr-00049-LRH-CBC, 2019 WL 2150938 (D. Nev. May 16, 2019).
  2. 509 U.S. 579 (1993).
  3. For a discussion of a case involving inaccurate testimony from the same laboratory that caught the attention of the Supreme Court, see David H. Kaye, The Interpretation of DNA Evidence: A Case Study in Probabilities, National Academies of Sciences, Engineering, and Medicine, Science Policy Decision-making Educational Modules, 2016, available at https://ssrn.com/abstract=2810744; McDaniel v. Brown: Prosecutorial and Expert Misstatements of Probabilities Do Not Justify Postconviction Relief — At Least Not Here and Not Now, Forensic Sci., Stat. & L., July 7, 2014, http://for-sci-law.blogspot.com/2014/07/mcdaniel-v-brown-prosecutorial-and.html.
  4. See David H. Kaye, Firearm-Mark Evidence: Looking Back and Looking Ahead, 68 Case W. Res. L. Rev. 723, 724-25 (2018), available at https://papers.ssrn.com/abstract_id=3117674. The court relied on the article's explication of more modern case law.
  5. The U.S. District Court for the District of Colorado excluded toolmark conclusions in the prosecutions for the bombing of the federal office building in Oklahoma City. The toolmarks there came from a screwdriver. David H. Kaye et al., The New Wigmore, A Treatise on Evidence: Expert Evidence 686-87 (2d ed. 2011).
  6. The court was aware of an earlier report from a third national panel of experts raising doubts about the AFTE method, but it did not cite or discuss that report's remarks. Although the 2008 National Academies report on the feasibility of establishing a ballistic imaging database only considered the forensic toolmark analysis of firearms in passing, it gave the practice no compliments. Kaye, supra note 4, at 729-32.
  7. Ted Robert Hunt, Scientific Validity and Error Rates: A Short Response to the PCAST Report, 86 Fordham L. Rev. Online Art. 14 (2017), https://ir.lawnet.fordham.edu/cgi/viewcontent.cgi?article=1013&context=flro.
  8. Quoting United States v. Ashburn, 88 F.Supp.3d 239, 245 (E.D.N.Y. 2015).
  9. Citing United States v. Otero, 849 F.Supp.2d 425, 432–33 (D.N.J. 2012), for "numerous journals [sic] articles and studies exploring the AFTE method."
  10. Philipp Frank, Einstein's Philosophy of Science, Reviews of Modern Physics (1949).
MODIFIED: 7 July 2019 9:10 EST