Friday, February 3, 2017

Connecticut Trial Court Deems PCAST Report on Footwear Mark Evidence Inapplicable and Unpersuasive

In an unpublished (but rather elaborate) opinion, a trial court in Connecticut found no merit in a motion “to preclude admission of footwear comparison evidence relative to footwear found on Wolfe Road in Warren, Connecticut and footprints found at the residence where the victim was killed.” State v. Patel, No. LLICR130143598S (Conn. Super. Ct., Dec. 28, 2016). The court did not describe the case or the footwear evidence, but its opinion responded to the claim of defendant Hiral Patel that “the scientific community has rejected the validity of the footwear comparison proposed by the state.” Judge John A. Danaher III was unimpressed by Patel's reliance on
a September 2016 report by the President's Council of Advisors on Science and Technology [stating] that ‘there are no appropriate empirical studies to support the foundational validity of footwear analysis to associate shoeprints with particular shoes based on specific identifying marks (sometimes called randomly 'randomly [sic] acquired characteristics'). Such conclusions are unsupported by any meaningful evidence or estimates of their accuracy and thus are not scientifically valid.’
The court reasoned that the state had no need to prove that the “expert testimony ... albeit scientific in nature” was based on a scientifically validated procedure because the physical comparison was “neither scientifically obscure nor instilled with 'aura of mystic infallibility' ... which merely places a jury ... in in [sic] a position to weigh the probative value of the testimony without abandoning common sense and sacrificing independent judgment to the expert's assertions.” Patel (quoting Maher v. Quest Diagnostics, Inc., 269 Conn. 154, 170-71 n.22, 847 A.2d 978 (2004)).

But the Superior Court did not stop here. Judge Danaher wrote that the President’s Council (PCAST) lacked relevant scientific expertise, and their skepticism did not alter the fact that courts previously had approved of “the ACE-V method under Daubert for footwear and fingerprint impressions.” He declared that "[t]here is no basis on which this court can conclude, as the defendant would have it, that the PCAST report constitutes 'the scientific community.'" These words might mean that the relevant scientific community disagrees with the Council that footwear-mark comparisons purporting to associate a particular shoe with a questioned impression lack adequate scientific validation. Other scientists might disagree either because they do not demand the same type or level of validation, or because they find the existing research satisfies PCAST's more demanding standards. The former is more plausible than the latter, but it is not clear which possibility the court accepted as true.

To reject the PCAST Report's negative finding, Judge Danaher relied exclusively on the testimony of “Lisa Ragaza, MSFS, CFWE, a ‘forensic science examiner 1’ ... who holds a B.S. degree from Tufts University and an M.S. degree from the University of New Haven.” What did the forensic-science examiner say to support the conclusion that PCAST erred in its determination that no adequate body of scientific research supports the accuracy of examiner judgments? To begin with,
Ms. Ragaza testified that, in her opinion, footwear comparison analysis is generally accepted in the relevant scientific community. She testified that such evidence has been admitted in 48 or 49 of the 50 states in the United States, in many European countries, and also in India and China. In fact, she testified, such analyses have been admitted in United States courts since the 1930s, although she is also aware that one such analysis was carried out in Scotland as early as 1786.
It seems odd to have forensic examiners instruct the court in the law. That the courts in these jurisdictions (not all of which even require a showing of scientific validity) admit the testimony of footwear analysts that a given shoe is the source of a mark says little about the extent to which these judgments have been subjected to scientific testing. As a committee of the National Academy of Sciences reported in 2009, “Daubert has done little to improve the use of forensic science evidence in criminal cases.” NRC Committee on Strengthening Forensic Science in the United States, Strengthening Forensic Science in the United States: A Path Forward 106 (2009). Instead, “courts often ‘affirm admissibility citing earlier decisions rather than facts established at a hearing.’” Id. at 107.

Second, 
Ms. Ragazza testified that there are numerous treatises and journals, published in different parts of the world, on the topic of footwear comparison analysis. She testified that there have been studies relative to the statistical likelihood of randomly acquired characteristics appearing in various footwear.
But the existence of “treatises and journals” — including what the NAS Committee called “trade journals,” id. at 150 — does not begin to contradict PCAST’s conclusion about the dearth of studies of the accuracy of examiner judgments. PCAST commented (pp. 116-17) on one of the “studies relative to the statistical likelihood”:
a mathematical model by Stone that claims that the chance is 1 in 16,000 that two shoes would share one identifying characteristics and 1 in 683 billion that they would share three characteristics. Such claims for “identification” based on footwear analysis are breathtaking—but lack scientific foundation. ... The model by Stone is entirely theoretical: it makes many unsupported assumptions (about the frequency and statistical independence of marks) that it does not test in any way.
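The role of the independence assumption is easy to see with a toy calculation. The sketch below (Python) is purely illustrative: the per-mark probability is borrowed from the quoted 1-in-16,000 figure, and the multiplication assumes that marks occur independently, which is precisely the assumption PCAST says was never tested. It does not reproduce Stone's published 1-in-683-billion figure, which evidently rests on additional modeling details; it only shows how quickly untested assumptions generate "breathtaking" numbers.

    # Toy illustration only (not Stone's actual model): multiply an assumed
    # per-mark frequency under an assumed independence of randomly acquired marks.
    per_mark_probability = 1 / 16_000      # assumed frequency of one identifying mark

    for k in (1, 2, 3):
        combined = per_mark_probability ** k   # product rule, if marks were independent
        print(f"{k} mark(s): about 1 in {1 / combined:,.0f}")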
Third,
Ms. Ragazza testified that her work is subject to peer review, including having a second trained examiner carry out a blind review of each analysis that she does. In response to the defendant's question as to whether such reviews have ever resulted in the second reviewer concluding that Ms. Ragazza had carried out an erroneous analysis, she responded that there were no such instances. Most of her work is not done in preparation for litigation. It is frequently done for investigative purposes and may be used to inculpate, but also exculpate, an individual. She indicated that the forensic laboratory carries out its analyses for both prosecutors and defense counsel.
Verification of an examiner’s conclusion by another examiner is a good thing, but it does almost nothing to establish the validity of the examination process. Making sure that two readers of tea leaves agree in their predictions does not validate tea reading (although it could offer data on measurement reliability, which is necessary for validity).

Fourth,
Ms. Ragazza explained how footwear comparison analysis is carried out, using a protocol known as ACE-V, and employing magnifiers and/or microscopes.
Plainly, this misses the point. If tea reading were expanded to include magnifiers and microscopes, that would not make it more valid. (Actually, I believe that footwear-mark comparisons based on “randomly acquired characteristics” are a lot better than tea reading, but I still am searching for the scientific studies that let us know how much better.)

Fifth,
Ms. Ragazza does not agree with the PCAST report because, in her view, that report did not take into account all of the available research on the issue of footwear comparison evidence.
Maybe there is something to this complaint, but what validity studies does the PCAST report overlook? The Supporting Documentation for Department of Justice Proposed Uniform Language for Testimony and Reports for the Forensic Footwear and Tire Impression Discipline (2016) begins “The origin of the principles used in the forensic analysis of footwear and tire impression evidence dates back to when man began hunting animals.” But the issue the PCAST Report addresses is not whether a primitive hunter can distinguish between the tracks of an elephant and a tiger. It is the accuracy with which modern forensic fact hunters can identify the specific origin of a shoeprint or a tire tread impression. If Ms. Ragazza provided the court with studies of this particular issue that would produce a different conclusion about the extent of the validation research reported on in both the NRC and PCAST reports, the court did not see fit to list them in the opinion.

A footnote to the claim that "an examiner can identify a specific item of footwear/tire as the source of the footwear/tire impression" can be found in the Justice Department document mentioned above. This note (#12) lists the following publications:
  1. Cassidy, M.J. Footwear Identification. Canadian Government Publishing Centre: Ottawa, Canada, 1980, pp. 98-108; 
  2. Adair, T., et al. (2007). The Mount Bierstadt Study: An Experiment in Unique Damage Formation in Footwear. Journal of Forensic Identification 57(2): 199-205; 
  3. Banks, R., et al. Evaluation of the Random Nature of Acquired Marks on Footwear Outsoles. Research presented at Impression & Pattern Evidence Symposium, August 4, 2010, Clearwater, FL;
  4. Stone, R. (2006). Footwear Examinations: Mathematical Probabilities of Theoretical Individual Characteristics. Journal of Forensic Identification 56(4): 577-599;
  5. Wilson, H. (2012). Comparison of the Individual Characteristics in the Outsoles of Thirty-Nine Pairs of Adidas Supernova Classic Shoes. Journal of Forensic Identification 62(3): 194-203.
I wish I could say that I have read these books and papers. At the moment, I can only surmise their contents from the titles and places of publication, but I would be surprised if any of them contains an empirical study of the accuracy of footwear-mark examiners’ source attributions. (If my guess is wrong, I hope to hear about it.)

Finally,
She testified that, to her knowledge, the PCAST members did not include among their membership any forensic footwear examiners.
It's true. The President's Council of Advisors on Science and Technology does not include footwear examiners. But would we say that only tea-leaf readers are able to judge whether there have been scientific studies of the validity of tea-leaf reading? That only polygraphers are capable of determining whether the polygraph is a valid lie detector? That only pathologists can ascertain whether an established histological test for cancer is accurate?

PCAST's conclusion was that no direct experiments currently establish the sensitivity and specificity of footwear-mark identification. In the absence of a single counter-example from the opinion, that conclusion seems sound. But the legal problem is whether to accept the PCAST report's premise that this information is essential to admissibility of footwear evidence under the standard for scientific expert testimony codified in Federal Rule of Evidence 702. Is it true, as a matter of law (or science), that only a large number of so-called black box studies with large samples can demonstrate the scientific validity of subjective identification methods or that the absence of precisely known error probabilities as derived from these experiments dictates exclusion? I fear that the PCAST report is too limited in its criteria for establishing the requisite scientific validity for forensic identification techniques, for there are other ways to test examiner performance and to estimate error rates.  But however one comes out on such details, the need for courts to demand substantial empirical as well as theoretical studies that demonstrate the validity and quantify the risks of errors in using these methods remains paramount.

Although Patel is merely one unpublished pretrial ruling with no precedential value, the case indicates that defense counsel cannot just cite the conclusions of the PCAST report and expect judges to exclude familiar types of evidence. They need to convince courts that "the reliability requirements" for scientific evidence include empirical proof that a technique actually works as advertised. Then the parties can focus on whether PCAST's assessments of the literature omit or give too little weight to studies that would warrant different conclusions. Broadbrush references to "treatises and journals" and a history of judicial acceptance should not be enough to counter PCAST's findings of important gaps in the research base of a forensic identification method.

Wednesday, January 25, 2017

Statistics for Making Sense of Forensic Genetics

The European Forensic Genetics Network of Excellence (EUROFORGEN-NoE) is a group of “16 partners from 9 countries including leading groups in European forensic genetic research.” In 2016, it approached Sense About Science — “an independent charity that challenges misrepresentation of science and evidence in public life” — to prepare and disseminate a guide to DNA evidence. Within the year, the guide, entitled Making Sense of Forensic Genetics, emerged. The 40-page document has a long list of “contributors,” who, presumably, are its authors. According to EUROFORGEN-NoE, it is “designed to introduce professional and public audiences to the use of DNA in criminal investigations; to understand what DNA can and can’t tell us about a crime, and what the current and future uses of DNA analysis in the criminal justice system might be.”

By and large, it accomplishes this goal, offering well informed comments and cautions for the general public. Some of the remarks about probabilities and statistics, however, are not as well developed as they could be. The points worth noting have more to do with clarity of expression than with any outright errors.

Statistics do not arise in a vacuum. Proper interpretation requires some understanding of how they came to be produced. Thus, Making Sense correctly observes that:
DNA evidence has a number of limitations: it might be undetectable, overlooked, or found in such minute traces as to make interpretation difficult. Its analysis is subject to error and bias. Additionally, DNA profiles can be misinterpreted, and their importance exaggerated, as illustrated by the wrongful arrest of a British man, ... . Even if DNA is detected at a crime scene, this doesn’t establish guilt. Accordingly, DNA needs to be viewed within a framework of other evidence, rather than as a standalone answer to solving crimes.
With respect to the narrow question of whether two DNA samples originate from the same individual, Making Sense asks, “So what is the chance that your DNA will match that of someone else?” An ambiguity lurks in this question. Does it refer to the probability of a matching profile somewhere in the population, or to the probability of a matching profile in a single, randomly selected individual? Apparently, the authors have the latter question in mind, for Making Sense explains that
It depends on how many locations in the DNA (loci) you look at. If a forensic scientist looked at just one locus, the probability of this matching the same marker in another individual would be relatively high (between 1 in 20 and 1 in 100). ... Since European police forces today typically analyse STRs at 16 or more loci, the probability that two full DNA profiles match by chance is miniscule — in the region of 1 in 10 with 16 zeros after it (or 1 in 100 million billion). ... Although in the UK court, the statistics are always capped at 1 in a billion.
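The arithmetic behind such figures is the so-called product rule: multiply the estimated frequency of the matching genotype at each locus, on the assumption that the loci are independent. A minimal sketch (Python; the per-locus values are hypothetical placeholders rather than real allele-frequency data, and real casework adds corrections for relatedness and population substructure):

    # Hedged sketch of the product rule behind a random match probability (RMP).
    # The per-locus figures are hypothetical placeholders in the 1-in-20 to
    # 1-in-100 range mentioned in the guide.
    per_locus = [1/20, 1/30, 1/60, 1/100] * 4      # 16 loci

    rmp = 1.0
    for p in per_locus:
        rmp *= p        # assumes independence across loci (linkage equilibrium)

    print(f"Combined RMP: about 1 in {1 / rmp:.1e}")

    # A UK-style cap reports nothing more extreme than 1 in a billion:
    capped = max(rmp, 1e-9)
    print(f"Reported under a 1-in-a-billion cap: 1 in {1 / capped:,.0f}")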
The 1-in-a-billion cap is not seen in the United States, where laboratories toss about estimates in the quintillionths, septillionths, and so on (and on). (Could this be an instance of “America First”?) The naive reader might be forgiven for thinking that when the probability of the same match to a randomly selected individual is far less than 1 in a billion, an analyst could conclude that the recovered DNA is either from the defendant or a close relative. But Making Sense rejects this thought, insisting that “DNA doesn’t give a simple ‘yes’ or ‘no’ answer.”

The explanation for its position is muddled. First, the report repeats that “with information available for all 16 markers, ... the risk of DNA retrieved from a crime scene matching someone unrelated to the true source is extremely low (less than 1 in a billion, and often many orders of magnitude lower than this).” So why is this not good enough for a “yes or no answer”? The hesitation, as expressed, is that
However, many of the DNA profiles retrieved from crime scenes aren’t full DNA profiles because they’re missing some genetic markers or there is a mixture of DNA from two or more people. So was it the suspect who left their DNA at the crime scene? The DNA evidence won’t give a ‘yes’ or ‘no’ answer: it can only ever be expressed in terms of probability.
But the conclusion that “it can only ever be expressed in terms of probability” is a non sequitur. The only thing that follows from the fact that not all crime-scene DNA samples lead to 16-locus profiles is that matches to the samples with less complete profiles are less convincing than matches to the samples with more complete profiles.

Of course, there is a sense in which all DNA evidence only gives rise to probabilities, and never to categorical conclusions. All empirical evidence only gives probable conclusions rather than certainties. Furthermore, it has been argued that forensic scientists should eschew source attributions because their expertise is limited to evaluating likelihoods — the probability of the match given that the sample came from a named individual and the probability given that it came from a different individual (or individuals). But that is not what Making Sense seems to be saying when it declares yes-or-no answers impossible. The limits on all empirical knowledge and the role of an expert witness do not produce any line between 16-locus matches and less-than-16-locus matches.

Making Sense also points out that
[T]he match probability ... must not be confused (but often is) with how likely the person is to be innocent of the crime. For example, if a DNA profile from the crime scene matches the suspect’s DNA and the probability of such a match is 1 in 100 million if the DNA came from someone else, this does not mean that the chance of the suspect being innocent is 1 in 100 million. This serious misinterpretation is known as the prosecutor’s fallacy.
Conceptually, this transposition is a “serious misinterpretation,” but whether the correct inverse probability (one that is based on a prior probability and a Bayes factor on the order of 100 million) gives a markedly different value is far from obvious. See David H. Kaye, The Interpretation of DNA Evidence: A Case Study in Probabilities, in Making Science-based Policy Decisions: Resources for the Education of Professional School Students, Nat'l Academies of Science, Engineering, and Medicine Committee on Preparing the Next Generation of Policy Makers for Science-Based Decisions ed., Washington, DC, 2016.
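A back-of-the-envelope Bayesian calculation shows why. The sketch below (Python; the prior probabilities are hypothetical) treats the 1-in-100-million figure as a likelihood ratio and computes the posterior probability of a common source for several priors:

    # Hedged sketch: posterior probability of the source hypothesis, given a
    # likelihood ratio of 100 million and several hypothetical prior probabilities.
    likelihood_ratio = 1e8      # "1 in 100 million if the DNA came from someone else"

    for prior in (0.5, 0.01, 1e-6):         # hypothetical prior probabilities
        prior_odds = prior / (1 - prior)
        posterior_odds = prior_odds * likelihood_ratio
        posterior = posterior_odds / (1 + posterior_odds)
        print(f"prior = {prior:g}  ->  posterior probability of same source = {posterior:.8f}")

Even with a prior as low as one in a million, the posterior probability of a common source exceeds 0.99. The transposition is fallacious as a matter of logic, but whether it misleads as to the bottom line depends on the prior.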

A reasonable approach is to have analysts present the two pertinent conditional probabilities mentioned above (the “likelihoods”) to explain how strongly the profiles support one hypothesis over the other. Making Sense refers to this approach in some detail, but it suggests that it is needed only “in more complex cases, such as mixtures of two or more individuals, or when there might be contamination by DNA in the environment.” Compared to the alternative ways to explain the implications of DNA and other trace evidence, however, the approach is more widely applicable.

Monday, January 9, 2017

If You Are Going To Do a “DNA Dragnet,” Cast the Net Widely

Police in Rockingham County, North Carolina, took a circuitous path to identify the killer of a couple who were shot to death in their home in Reidsville, NC. They utilized a “DNA dragnet,” kinship analysis, ancestry analysis, and DNA phenotyping to conclude that the killer was the brother-in-law of the daughter of the slain couple. Had the initial DNA collection been slightly more complete, that effort alone would have sufficed.

The evidence that led to the man ultimately convicted of the double homicide was a few drops of the killer's blood:
Parabon Nanolabs, The French Homicides, Jan. 4, 2017 [hereinafter Parabon]

In the early hours of 4 Feb 2012, Troy and LaDonna French were gunned down in their home in Reidsville, NC. The couple awoke to screams from their 19-year old daughter, Whitley, who had detected the presence of a male intruder in her second floor room. As they rushed from their downstairs bedroom to aid their daughter, the intruder attempted to quiet the girl with threats at knifepoint. Failing this, he released Whitley and raced down the stairs. After swapping his knife for the handgun in his pocket, he opened fire on the couple as they approached the stairwell. During his escape, the perpetrator left a few drops of his blood on the handrail, apparently the result of mishandling his knife. ...
Seth Augenstein, Parabon’s DNA Phenotyping Had Crucial Role in North Carolina Double-Murder Arrest, Conviction, Forensic Mag., Jan. 5, 2017 [hereinafter Augenstein]

A couple were gunned down by an intruder in their North Carolina home in the early hours of Feb. 4, 2012. The teenaged daughter had seen the hooded gunman, when he had briefly held a knife to her throat, but she could apparently not describe him to cops. The attacker left several drops of blood on a handrail as he fled, apparently self-inflicted from his blade.
At a press conference, Sheriff Sam Page announced that "You can run, but you can’t hide from your DNA." Danielle Battaglia, Blood on the Stairs, News & Record Greensboro.com, Apr. 14, 2016 [hereinafter Battaglia]. But efforts to follow the DNA seemed to lead nowhere.
Parabon
Running short of leads, investigators began collecting DNA samples from anyone thought to have been in or around the French home. "We swabbed a lot of people," says Captain Tammi Howell of the RCSO. "Early on, if there was a remote chance someone could have been connected to the crime, we asked for a swab." In the first 12 months following the crime, over 50 subjects consented to provide a DNA sample. None of the samples matched the perpetrator.
Augenstein
"We swabbed a lot of people," said Capt. Tammi Howell, of the Rockingham County Sheriff’s Office, who led the investigation. "Early on, if there was a remote chance someone could have been connected to the crime, we asked for a swab." Those swabs produced no hits.
In particular, this screening of possible sources in the county eliminated "Whitley, her brother, and her boyfriend at the time, John Alvarez." Parabon. But police did not include Alvarez's father or his three brothers in their dragnet search, and when "[a]nalysts uploaded profiles of the blood drops and the skin fragments along with a sample from Whitley French into a database of known samples maintained by the FBI, [t]hey found no match." Battaglia. (According to Forensic Magazine, "the killer was not in any of the public databases," but law enforcement DNA databases are not public.)

There is some confusion in the accounts of what happened next.
Parabon
The first break in the case came when familial DNA testing, performed at the University of North Texas, revealed the possibility that the perpetrator might be related to John Alvarez, Whitley's boyfriend. Because traditional DNA testing is limited in its ability to detect all but the closest relationships (e.g., parent-child), this report alone did not provide actionable information. Subsequently, scientists at the University of North Texas performed Y-chromosome STR analysis, which tests whether two male DNA samples share a common paternal lineage. This analysis, however, showed that the perpetrator did not share a Y-STR lineage with John Alvarez, seemingly eliminating John's father and brother as possible suspects.
Augenstein
Further analysis then indicated that the daughter’s boyfriend, John Alvarez (who had given a swab), could be related to the killer. But it was only a possible relationship, since the STR did not definitively say whether the killer and the boyfriend shared ancestry.
The partial DNA matching led to a Y-STR analysis. The short-tandem repeat on the Y chromosome shows paternal links between fathers, sons and brothers, and has produced huge breakthroughs in cases like the Los Angeles serial killer Lonnie Franklin, Jr., infamously dubbed the “Grim Sleeper.” But in the Sleeper and other cases used “familial searching,” or “FS,” a painstaking and somewhat controversial process of combing large state and national databases like CODIS to find partial DNA matches eventually leading to a suspect. FS was not used in the Rockingham County case, where they had a limited pool of suspects.
Battaglia
Investigators then decided to send the DNA samples out of state for what the warrant called “familial DNA testing,” a type of analysis that allows scientists to match DNA samples to a parent, child or sibling. According to warrants, the samples were sent to the Center for Human Identification at the University of North Texas in Denton. But they do not appear to have gone to that lab. And Rockingham County District Attorney Craig Blitzer said that although a lab did the familial DNA test, it was not North Texas. He declined to say where it was done.
The term "familial searching" has no well-established scientific meaning. As explained in David H. Kaye, The Genealogy Detectives: A Constitutional Analysis of “Familial Searching”, 51 Am. Crim. L. Rev. 109 (2013), kinship testing of possible parents, children, and siblings can be done with the usual autosomal STR loci used for criminal forensic investigation. When this technique is applied to a database (local, state, or national), it sometimes reveals that the crime-scene DNA matches no one in the database but is a near miss to someone -- a near miss in such a way as to suggest the possibility that the source of the crime-scene sample is a brother, son, or parent of the nearly-matching individual represented in the database. In other words, "familial searching" is the process of trawling a database for possible matches to people outside of the database -- "outer-directed trawling," for short.

The Rockingham case evidently involved a conventional but fruitless database search ("inner-directed trawling") followed by testing -- in Texas or somewhere else -- to ascertain whether it was plausible that a close relative of the boyfriend was the source of the blood. Based on the autosomal STRs, this seemed to be the case. However, the laboratory threw a monkey wrench into the investigation when it reported that Y-STRs in the boyfriend's DNA did not match the blood DNA. Because Y-STRs are inherited (usually unchanged) from father to son, this additional finding seemed to exclude the untested father and brothers of the boyfriend.

But the social and familial understanding of a family tree does not always correspond to a biological family tree. It is not unheard of for genetic tests for parentage to reveal unexpected cases of illegitimate children. A man and child who believe that they are father and son may be mistaken. Genetic genealogists like to call the phenomenon of misattributed paternity a Non-Paternity Event, or NPE.

Thinking that the male members of the immediate Alvarez family had to be innocent, police were stymied. They turned to Parabon Nanolabs in Reston, Virginia.
Parabon
[For $3,500, the lab,] starting with 30 ng of DNA, ... genotype[d] over 850,000 SNPs from the sample, with an overall call rate of 98.9% [and advised the police that the blood probably came from a man with] fair or very fair skin, brown or hazel eyes, dark hair, and little evidence of freckling, ... a wide facial structure and non-protruding nose and chin, and ... admixed ancestry, a roughly 50-50 combination of European and Latino ancestry consistent with that observed in individuals with one European and one Latino parent. ... "The Snapshot ancestry analysis and phenotype predictions suggested we should not eliminate José as a suspect, despite the Y-STR results," said Detective Marshall. "The likeness of the Snapshot composite with his driver's license photograph is quite striking."

Augenstein
From approximately 30 nanograms of DNA, the software genotyped approximately 850,000 single-nucleotide polymorphisms, or SNPs, at a call rate of 98.9 percent. In this case, the blood showed the killer to be someone with mixed ancestry – apparently someone with one European and one Latino parent. ... "The Snapshot ancestry analysis and phenotype predictions suggested we should not eliminate Jose (Jr.) as a suspect, despite the Y-STR results," said Det. Marcus Marshall, the lead investigator on the case. "The likeness of the Snapshot composite with his driver’s license photograph is quite striking."
At this time, Parabon proudly juxtaposes the "Snapshot Composite Profile and a photo of José Alvarez, Jr., taken at the time of his arrest" on its website (and shown below). One of the more intriguing (genetically associated?) similarities is the five o'clock shadow.
Snapshot™ Composite Profile for Case #3999837068, Rockingham County, NC Sheriff's Office

It also would be interesting to know how "confidence" in skin color and other phenotypes is computed. In any event, with this report, police finally obtained DNA samples by consent from the father, José Alvarez Sr., José Alvarez Jr., and Elaine Alvarez, the mother. Analysis indicated misattributed paternity -- and a conventional STR match of the DNA in the bloodstains. As a result,
Parabon
José Alvarez Jr. was arrested on 25 Aug 2015 on two counts of capital murder. He later pled guilty to both murders and on 8 Jul 2016 was sentenced to two consecutive life sentences without the possibility of parole.
Augenstein
Jose Alvarez, Jr., was ... arrested in August 2015 and charged with two counts of capital murder. He later pleaded guilty to killing the Frenches, and was sentenced to two consecutive life sentences without the possibility of parole in July 2016.
A final note on the twists and turns in the case is that John Alvarez's wedding to Whitley French "had been planned for months. Jose Alvarez Jr. served as a groomsman for his brother even as detectives were planning to arrest him on charges that he murdered his new sister-in-law’s parents." Battaglia.

Related posting

"We Can Predict Your Face" and Put It on a Billboard, Forensic Sci., Stat. & L., Nov. 28, 2016

Sunday, January 8, 2017

Reflections on Glass Standards: Statistical Tests and Legal Hypotheses

Statistica Applicata (Italian Journal of Applied Statistics) recently published several issues (volume 27, nos. 2 & 3) devoted to statistics in forensic science and law. They include an invited article I prepared in 2016 on the statistical logic of declaring pieces of glass "indistinguishable" in their physical properties. 1/ The article contains some of the views expressed in postings on this blog (e.g., Broken Glass: What Do the Data Show?). However, the issue is much broader than glass evidence. The article notes the potential for confusion in reporting that any kind of trace-evidence samples match (or cannot be distinguished) without also describing data on the frequency of such matches in a relevant population. I am informed that NIST's Organization of Scientific Area Committees for Forensic Science (OSAC) is preparing guidelines or standards for explaining the probative value of results obtained from ASTM-approved test methods.
Abstract

The past 50 years have seen an abundance of statistical thinking on interpreting measurements of chemical and physical properties of glass fragments that might be associated with crime scenes. Yet, the most prominent standards for evaluating the degree of association between specimens of glass recovered from suspects and crime scenes have not benefitted from much of this work. Being confined to a binary match/no-match framework, they do not acknowledge the possibility of expressing the degree to which the data support competing hypotheses. And even within the limited match/no-match framework, they focus on the single step of deciding whether samples can be distinguished from one another and say little about the second stage of the matching paradigm–characterizing the probative value of a match. This article urges the extension of forensic-science standards to at least offer guidance for criminalists on the second stage of frequentist thinking. Toward that end, it clarifies some possible sources of confusion over statistical terminology such as “Type I” and “Type II” error in this area, and it argues that the legal requirement of proof beyond a reasonable doubt does not inform the significance level for tests of whether pairs of glass fragments have identical chemical or physical properties.
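To make the two stages concrete, here is a minimal sketch (Python, using scipy). It is not the ASTM procedure; the measurements, the significance level, and the population frequency are all hypothetical. Stage one asks whether the recovered and control fragments can be distinguished; stage two, which the article argues the standards neglect, asks how common indistinguishable glass is in a relevant population.

    # Illustrative two-stage match/no-match reasoning for glass evidence.
    # NOT the ASTM protocol; the data and numbers are hypothetical.
    from scipy import stats

    # Stage 1: can the recovered and control glass be distinguished?
    recovered = [1.51905, 1.51912, 1.51908]            # hypothetical refractive indices
    control   = [1.51907, 1.51910, 1.51906, 1.51909]

    t_stat, p_value = stats.ttest_ind(recovered, control, equal_var=False)   # Welch t-test
    alpha = 0.05                                       # significance level (Type I error rate)
    indistinguishable = p_value > alpha
    print(f"p = {p_value:.2f}; 'indistinguishable' at alpha = {alpha}: {indistinguishable}")

    # Stage 2: the probative value of a "match" depends on how often glass with
    # these properties turns up in a relevant population (hypothetical figure).
    population_frequency = 0.03
    if indistinguishable:
        print(f"Hypothetically, about {population_frequency:.0%} of surveyed glass "
              "would also be indistinguishable; that frequency, not the match "
              "itself, conveys the probative value.")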
Note
  1. The article is David H. Kaye, Reflections on Glass Standards: Statistical Tests and Legal Hypotheses, 27 Statistica Applicata -- Italian J. Applied Stat. 173 (2015). Despite the publication date assigned to the issue, the article, as stated above, was not written until 2016.

Tuesday, December 27, 2016

NCFS Draft Views on "Statistical Statements in Forensic Testimony"


The period for comments on the second public draft of a proposed National Commission on Forensic Science (NCFS) views document on Statistical Statements in Forensic Testimony opened today and will close on January 25, 2017. Comments can be submitted at regulations.gov, Docket No. DOJ-LA-2016-0025.

The full document can be downloaded from https://www.regulations.gov/document?D=DOJ-LA-2016-0025-0001. It defines "statistical statements" broadly, to encompass "quantitative or qualitative statements [that] indicate the accuracy of measurements or observations and the significance of these findings." "These statistical statements," it explains, "may describe measurement accuracy (or conversely, measurement uncertainty), weight of evidence (the extent to which measurements or observations support particular conclusions), or the probability or certainty of the conclusions themselves."

The draft summarizes the views as follows (footnote omitted):
1. Forensic experts, both in their reports and in testimony, should present and describe the features of the questioned and known samples (the data), and similarities and differences in those features as well as the process used to arrive at determining them. The presentation should include statements of the limitations and uncertainties in the measurements or observations.

2. No one form of statistical calculation or statement is most appropriate to all forensic evidence comparisons or other inference tasks. Thus, the expert needs to be able to support, as part of a report and in testimony, the choice used in the specific analysis carried out and the assumptions on which it was based. When the statistical calculation relies on a specific database, the report should make clear which one and its relevance for the case at hand.

3. The expert should report the limitations and uncertainty associated with measurements and the inferences that could be drawn from them. This report might take the form of an interval for an estimated value, or of separate statements regarding errors and uncertainties associated with the analysis of the evidence. If the expert has no information on sources of error in measurements and inferences, the expert must state this fact.

4. Forensic science experts should not state that a specific individual or object is the source of the forensic science evidence and should make it clear that, even in circumstances involving extremely strong statistical evidence, it is possible that other individuals or objects could possess or have left a similar set of observed features. Forensic science experts should confine their evaluative statements to the support that the findings provide for the claim linked to the forensic evidence.

5. To explain the value of the data in addressing claims as to the source of a questioned sample, forensic examiners may:
A. Refer to relative frequencies of individual features in a sample of individuals or objects in a relevant population (as sampled and then represented in a reference database). The examiner should note the uncertainties in these frequencies as estimates of the frequencies of particular features in the population.

B. Present estimates of the relative frequency of an observed combination of features in a relevant population based on a probabilistic model that is well grounded in theory and data. The model may relate the probability of the combination to the probabilities of individual features.

C. Present probabilities (or ratios of probabilities) of the observed features under different claims as to the origin of the questioned sample. The examiner should note the uncertainties in any such values.

D. When the statistical statement is derived from an automated computer-based system for making classifications, present not only the classification but also the operating characteristics of the system (the sensitivity and specificity of the system as established in relevant experiments using data from a relevant population). If the expert has no information or limited information about such operating characteristics, the expert must state this fact.
6. Not all forensic subdisciplines currently can support a probabilistic or statistical statement. There may still be value to the factfinder in learning whatever comparisons the expert in those subdisciplines has carried out. But the absence of models and empirical evidence needs to be expressed both in testimony and written reports.
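The "operating characteristics" mentioned in paragraph 5.D are straightforward to compute once relevant validation data exist. A minimal sketch (Python; the counts are hypothetical), for a classification system scored against known ground truth:

    # Hedged sketch: sensitivity and specificity from hypothetical validation counts.
    true_positives  = 480   # same-source pairs correctly reported as same source
    false_negatives = 20    # same-source pairs missed
    true_negatives  = 990   # different-source pairs correctly reported
    false_positives = 10    # different-source pairs wrongly reported as same source

    sensitivity = true_positives / (true_positives + false_negatives)
    specificity = true_negatives / (true_negatives + false_positives)
    print(f"Sensitivity: {sensitivity:.1%}   Specificity: {specificity:.1%}")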
The document will be discussed at the January 2017 NCFS meeting. A final version should be up for a vote at the (final?) Commission meeting, on April 10-11, 2017.

Thursday, December 22, 2016

Realistically Testing Forensic Laboratory Performance in Houston

The Houston Forensic Science Center announced on November 17, 2016, that
HFSC Begins Blind Testing in DNA, Latent Prints, National First
This innovation -- said to be unique among forensic laboratories and to exceed the demands of accreditation -- does not refer to blind testing of samples from crime scenes. It is generally recognized that analysts should be blinded to information that they do not need to reach conclusions about the similarities and differences in crime-scene samples and samples from suspects or other persons of interest. One would hope that many laboratories already employ this strategy for managing unwanted sources of possible cognitive bias.

Perhaps confusingly, the Houston lab's announcement refers to "'blindly' test[ing] its analysts and systems, assisting with the elimination of bias while also helping to catch issues that might exist in the processes." More clearly stated, "[u]nder HFSC’s blind testing program analysts in five sections do not know whether they are performing real casework or simply taking a test. The test materials are introduced into the workflow and arrive at the laboratory in the same manner as all other evidence and casework."

A month earlier, the National Commission on Forensic Science unanimously recommended, as a research strategy, "introducing known-source samples into the routine flow of casework in a blinded manner, so that examiners do not know their performance is being studied." Of course, whether the purpose is research or instead what the Houston lab calls a "blind quality control program," the Commission noted that "highly challenging samples will be particularly valuable for helping examiners improve their skills." It is often said that existing proficiency testing programs not only fail to blind examiners to the fact that they are being tested, but also are only designed to test minimum levels of performance.

The Commission bent over backward to imply that the outcomes of the studies it proposed would not necessarily be admissible in litigation. It wrote that
To avoid unfairly impugning examiners and laboratories who participate in research on laboratory performance, judges should consider carefully whether to admit evidence regarding the occurrence or rate of error in research studies. If such evidence is admitted, it should only be under narrow circumstances and with careful explanation of the limitations of such data for establishing the probability of error in a given case.
The Commission's concern was that applying statistics from work with unusually difficult cases to more typical casework might overstate the probability of error in the less difficult cases. At the same time, its statement of views included a footnote implying that the defense should have access to the outcomes of performance tests:
[T]he results of performance testing may fall within the government’s disclosure obligations under Brady v Maryland, 373 U.S. 83 (1963). But the right of defendants to examine such evidence does not entail a right to present it in the courtroom in a misleading manner. The Commission is urging that courts give careful consideration to when and how the results of performance testing are admitted in evidence, not that courts deny defendants access to evidence that they have a constitutional right to review.
Using traditional proficiency test results and the newer performance tests in which examiners are blinded to the fact that they are being tested in a given case (which is a better way to test proficiency) to impeach a laboratory's reported results raises interesting questions of relevance under Federal Rules of Evidence 403 and 404. See, e.g., Edward J. Imwinkelried & David H. Kaye, DNA Typing: Emerging or Neglected Issues, 76 Wash. L. Rev. 413 (2001).

Sunday, December 11, 2016

PCAST’s Sampling Errors (Part II: Getting More Technical)

The report to the President on Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods from the President’s Council of Advisors on Science and Technology emphasizes the need for research to assess the accuracy of the conclusions of criminalists who compare the features of identification evidence — things like fingerprints, toolmarks, hairs, DNA, and bitemarks. To this extent, it treads on firm (and well-worn) ground.

It also stresses the need to inform legal factfinders — the judges and jurors who try cases with the assistance of such evidence — of the false-positive error rates discovered in comprehensive and well designed studies with large sample sizes. This too is a laudable objective (although the sensitivity, which is the complement of the false-negative probability, also affects the probative value of the evidence and therefore should be incorporated into a presentation that indicates probative value).

When sample sizes are small, different studies could well generate very different false-positive rates if only because accuracy varies, both within and across examiners. Testing different samples of examiners at different times therefore will show different levels of performance. A common response to this sampling variability is a confidence interval (CI). A CI is intended to demarcate a range of possible values that might plausibly include the value that an ideal study of all examiners at all times would find.

The report advocates the use of an upper limit of a one-sided 95% CI for the false-positive rates instead of, or perhaps in addition to, the observed rates themselves. Some of the report’s statements about CIs and what to report are collected in an appendix to this posting. They have led to claims of a very large false-positive error rate for latent fingerprint identification. (See On a “Ridiculous” Estimate of an “Error Rate for Fingerprint Comparisons,” Dec. 10, 2016.)

A previous posting on "PCAST’s Sampling Errors" identified problems with the specifics of PCAST’s understanding of the meaning of a CI, with the technique it used to compute CIs for the performance studies it reviewed, and with the idea of substituting a worst-case scenario (the upper part of a CI) for the full range of the interval. Informing the factfinder of plausible variation both above and below the point estimate is fairer to all concerned, and that interval should be computed with techniques that will not create distracting arguments about the statistical acumen of the expert witness.

This posting elaborates on these concerns. Forensic scientists and criminalists who are asked to testify to error probabilities (as they should) need to be aware of the nuances lest they be accused of omitting important information or using inferior statistical methods. I provide a few tables that display PCAST’s calculations of the upper limits of one-sided 95% CIs; two-sided CIs computed with the same method for ascertaining the width of the intervals; and CIs with methods that PCAST listed but did not use.

The discussion explains how to use the statistical tool that PCAST recommended. It also shows that there is potential for confusion and manipulation in the reporting of confidence intervals. This does not mean that such intervals should not be used — quite the contrary, they can give a sense of the fuzziness that sampling error creates for point estimates. Contrary to the recommendation of the PCAST report, however, they should not be presented as if they give the probability that an error probability is as high as a particular value.

At best, being faithful to the logic behind confidence intervals, one can report that if the error probability were that large, then the probability of the observed error rate or a smaller one would have some designated value. To present the upper limit of a 95% CI as if it states that the probability is “at least 5%” that the false-positive error probability could be “as high as” the upper end of that CI — the phrasing used in the report (p. 153) — would be to succumb to the dreaded “transposition fallacy” that textbooks on statistics abjure. (See also David H. Kaye, The Interpretation of DNA Evidence: A Case Study in Probabilities, National Academies of Science, Engineering and Medicine, Science Policy Decision-making Educational Modules, June 16, 2016.)

I. PCAST’s Hypothetical Cases

The PCAST Report gives examples of CIs for two hypothetical validation studies. In one, there are 25 tests of examiners’ ability to use features to classify pairs according to their source. This hypothetical experiment establishes that the examiners made x = 0 false positive classifications in n = 25 trials. The report states that “if an empirical study found no false positives in 25 individual tests, there is still a reasonable chance (at least 5 percent) that the true error rate might be as high as roughly 1 in 9.” (P. 153.)

I think I know what PCAST is trying to say, but this manner of expressing it is puzzling. Let the true error rate be some unknown, fixed value θ. The sentence in the report might amount to an assertion that the probability that θ is roughly 1 in 9 is at least 5%. In symbols,
Pr(θ ≈ 1/9 | x = 0, n = 25) ≥ 0.05. 
Or does it mean that the probability that θ is roughly 1/9 or less is 5% or more, that is,
Pr(θ ≤ 1/9 | x = 0, n = 25) ≥ 0.05? 
Or does it state that the probability that θ is roughly 1/9 or more is 5% or more, that is,
Pr(θ ≥ 1/9 | x = 0, n = 25) ≥ 0.05?

None of these interpretations can be correct. Theta is a parameter rather than a random variable. From the frequentist perspective of CIs, it does not have probabilities attached to it. What one can say is that, if an enormous number of identical experiments were conducted, each would generate some point estimate for the false positive probability θ in the population. Some of these estimates x/n would be a little higher than θ; some would be a little lower (if θ > 0); some would be a lot higher; some a lot lower (if θ ≫ 0); some would be spot on. If we constructed a 95% CI around each estimate, about 95% of them would cover the unknown θ, and about 5% would miss it. (I am not sure where the "at least 5%" in the PCAST report comes from.) Likewise, if we constructed a 90% CI for each experiment, about 90% would cover the unknown θ, and about 10% would miss it.
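A simulation makes the coverage interpretation concrete. The sketch below (Python, using numpy and statsmodels; the true θ of 0.02 is an arbitrary illustrative choice) repeats the 25-trial experiment many times and checks how often a two-sided 95% Clopper-Pearson interval covers θ:

    # Coverage check for two-sided 95% Clopper-Pearson intervals (method="beta").
    import numpy as np
    from statsmodels.stats.proportion import proportion_confint

    rng = np.random.default_rng(1)
    theta, n, n_experiments = 0.02, 25, 100_000        # illustrative true error probability

    x = rng.binomial(n, theta, size=n_experiments)     # false positives in each simulated study
    lo, hi = proportion_confint(x, n, alpha=0.05, method="beta")
    coverage = np.mean((lo <= theta) & (theta <= hi))
    print(f"Empirical coverage: {coverage:.3f}")       # at least 0.95

Because the Clopper-Pearson method is conservative, the empirical coverage for a small n like 25 typically exceeds the nominal 95%.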

In assessing a point estimate, the width of the interval is what matters. At a given confidence level, wider intervals signal less precision in the estimate. The 90% CI would be narrower than the 95% one — it would have the appearance of greater precision, but that greater precision is illusory.  It just means that we can narrow the interval by being less “confident” about the claim that it captures the true value θ. PCAST uses 95% confidence because 95% is a conventional number in many (but not all) fields of scientific inquiry -- a fact that greatly impressed Justice Breyer in oral argument in Hall v. Florida.

In sum, 95% two-sided CIs are useful for manifesting the fuzziness that surrounds any point estimate because of sampling error. But they do not answer the question of what probability we should attach to the claim that “the true error rate might be as high as roughly 1 in 9.” Even with a properly computed CI that goes from 0 to 1/9 (which we can write as [0, 1/9]), the most we can say is that the process used to generate such CIs will err about 5% of the time. (It is misleading to claim that it will err “at least 5%” of the time.) We have a single CI, and we can plausibly propose that it is one of those that does not err. But we cannot say that 95% or fewer of the intervals [0, 1/9] will capture θ. Hence, we cannot say that “there is ... at least [a] 5 percent [chance] that the true error rate [θ] might be as high as roughly 1 in 9.”

To reconstruct how PCAST computed “roughly 1 in 9,” we can use EpiTools, the web-based calculator that the report recommended. This tool does not have an explicit option for one-sided intervals, but a 95% one-sided CI leaves 5% of the probability mass in a single tail. Hence, the upper limit is equal to that for a two-sided 90% CI. This is so because a 90% CI places one-half of the 10% of the mass that it does not cover in each tail. Figure 1 shows the logic.

Figure 1. Graphical Representation of PCAST's
Upper Limit of a One-sided 95% CI (not drawn to scale)
 95% CI (two-sided)
       **
      *****
     ********
    ************
   ****************
*************************
--[--------------------]------------> x/n
Lower 2.5% region       Upper 2.5% region

 90% CI (two-sided)
       **
      *****
     ********
    ************
  *****************
*************************
---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

PCAST: Report the interval from
the observed x/n to the start of the upper 5% region? 
Or just report where the upper 5% region starts?

---[---------------]--------------> x/n
Lower 5% region     Upper 5% region

Inputting n = 25, x = 0, and confidence = 90% yields the results in Table 1.

Table 1. Checking and Supplementing PCAST’s CIs
in Its First Hypothetical Case (x/n = 0/25)

Method            Upper One-sided 95% CI Limit   Upper Two-sided 95% CI Limit
Clopper-Pearson   0.1129 = 1/9                   0.1372 = 1/7
Wilson            0.0977 = 1/10                  0.1332 = 1/8
Jeffreys          0.0732 = 1/14                  0.0947 = 1/11
Agresti-Coull     0.1162 = 1/9                   0.1576 = 1/6
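For readers who prefer a scriptable check on EpiTools, the same bounds can be computed with statsmodels, which implements all four methods ("beta" is its label for Clopper-Pearson). A minimal sketch: setting alpha to 0.10 yields the two-sided 90% interval whose upper end is PCAST's one-sided 95% bound, while alpha of 0.05 yields the ordinary two-sided 95% interval. The results should agree with Table 1 to within rounding.

    # Reproduces the upper bounds in Table 1 (x = 0 false positives in n = 25 trials).
    from statsmodels.stats.proportion import proportion_confint

    x, n = 0, 25
    methods = {"Clopper-Pearson": "beta", "Wilson": "wilson",
               "Jeffreys": "jeffreys", "Agresti-Coull": "agresti_coull"}

    for name, method in methods.items():
        _, upper_one_sided = proportion_confint(x, n, alpha=0.10, method=method)  # one-sided 95%
        _, upper_two_sided = proportion_confint(x, n, alpha=0.05, method=method)  # two-sided 95%
        print(f"{name:16s} one-sided 95%: {upper_one_sided:.4f}   two-sided 95%: {upper_two_sided:.4f}")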

The Wilson and Jeffreys methods, which are recommended in the statistics literature cited by PCAST, give smaller upper bounds on the false-positive error rate than the Clopper-Pearson method that PCAST uses. The recommended numbers are 1 in 10 and 1 in 14 compared to PCAST’s 1 in 9.

On the other hand, the upper limits of the two-sided 95% CIs for the Wilson and Jeffreys methods are 1 in 8 and 1 in 11. In this example (and in general), using the one-sided interval has the opposite effect from what PCAST may have intended. PCAST wants to keep juries from hearing estimates of the false-positive error probability that fall below θ, at the cost of giving them estimates that are above θ. But as Figure 1 illustrates, using the upper limit of a 90% two-sided interval to arrive at a one-sided 95% CI lowers the upper limit compared to the two-sided 95% CI. PCAST’s one-sided intervals do less, not more, to protect defendants from estimates of false-positive error probabilities that are too low.

Of course, I am being picky. What is the difference if the expert reports that the false-positive probability could be 1 in 14 instead of PCAST’s upper limit of 1 in 9? The jury will receive a similar message — with perfect scores on only 25 relevant tests of examiners, one cannot plausibly claim that examiners rarely err. Another way to make this point is to compute the probability of the study’s finding so few false positives (x/n = 0/25) when the probability of an error on each independent test is 1 in 9 (or any other number that one likes). If 1/9 is the binomial probability of a false-positive error, then the chance of x = 0 errors in 25 tests of (unbeknownst to the examiners) different-source specimens is (1 – 1/9)^25 ≈ 0.053.

At this point, you may say, wait a minute, this is essentially the 5% figure given by PCAST. Indeed, it is, but it has a very different meaning. It indicates how often one would see studies with a zero false-positive error rate when the unknown, true rate is actually 1/9. In other words, it is the p-value for the data from the experiment when the hypothesis θ = 1/9 is true. (This is a one-tailed p-value. The expected number of false positives is 25(1/9) ≈ 2.8. The probability of 6 or more false positives is 5.3%. So the two-tailed p-value is 10.5%. Only about 1 time in ten would studies with 25 trials produce outcomes that diverged at least as much as this one did from what is statistically expected if the probability of error in each trial with different-source specimens really is 1/9.) Thus, the suggestion that the true probability is 1/9 is not grossly inconsistent with the data. But the data are not highly probable under the hypothesis that the false-positive error probability is 1/9 (or any larger number).
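The binomial arithmetic in the last two paragraphs is easy to verify (a quick sketch with scipy, taking θ = 1/9 as in the text):

    # Checks the binomial calculations for the 0-errors-in-25-trials hypothetical.
    from scipy.stats import binom

    theta, n = 1/9, 25
    print(f"P(X = 0)  = {binom.pmf(0, n, theta):.3f}")   # about 0.053 (one-tailed p-value)
    print(f"E[X]      = {n * theta:.1f}")                # about 2.8 expected false positives
    print(f"P(X >= 6) = {binom.sf(5, n, theta):.3f}")    # about 0.053 (the other tail)
    two_tailed = binom.pmf(0, n, theta) + binom.sf(5, n, theta)
    print(f"Two-tailed p-value = {two_tailed:.3f}")      # about 0.105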

PCAST also gives some numbers for another hypothetical case. The report states that
[I]f a study finds zero false positives in 100 tries, the four methods mentioned give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context "the false positive rate might be as high as."
The output of EpiTools is included in Table 2.

Table 2. Supplementing PCAST’s CIs
in Its Second Hypothetical Case (x/n = 0/100)

Method            Upper One-sided 95% CI Limit   Upper Two-sided 95% CI Limit
Clopper-Pearson   0.0295 = 1/34                  0.0362 = 1/28
Wilson            0.0263 = 1/38                  0.0370 = 1/27
Jeffreys          0.0190 = 1/53                  0.0247 = 1/40
Agresti-Coull     0.0317 = 1/32                  0.0444 = 1/23

Again, to the detriment of defendants, the recommended one-sided bounds are smaller than those of a more appropriate two-sided interval. For example, PCAST’s 90% Clopper-Pearson interval is [0, 3%], while the 95% interval is [0, 4%]. A devotee of the Jeffreys method (or a less scrupulous expert witness seeking to minimize the reported false-positive error risk) could report the smallest interval of [0, 2%].

II. Two Real Studies with Larger Sample Sizes

The two hypotheticals are extreme cases. Sample sizes are small, and the outcomes are extreme — no false-positive errors at all. Let’s look at examples from the report that involve larger samples and higher observed error rates.

The report notes that in Langenburg, Champod, and Genessay (2012),
For the non-mated pairs, there were 17 false positive matches among 711 conclusive examinations by the experts. The false positive rate was 2.4 percent (upper 95 percent confidence bound of 3.5 percent). The estimated error rate corresponds to 1 error in 42 cases, with an upper bound corresponding to 1 error in 28 cases.
P. 93 (notes omitted). Invoking EpiTools, one finds that

Table 3. Supplementing PCAST’s CIs
for Langenburg et al. (2012) (x/n = 17/711)

Method            Upper One-sided 95% CI Limit   Lower Two-sided 95% CI Limit   Upper Two-sided 95% CI Limit
Clopper-Pearson   0.0356 = 1/28                  0.0140 = 1/71                  0.0380 = 1/26
Wilson            0.0353 = 1/28                  0.0150 = 1/67                  0.0380 = 1/26
Jeffreys          0.0348 = 1/29                  0.0145 = 1/69                  0.0371 = 1/27
Agresti-Coull     0.0355 = 1/28                  0.0147 = 1/68                  0.0382 = 1/26

As one would expect for a larger sample, all the usual methods for producing a binomial CI generate similar results. The PCAST one-sided interval is [2.4%, 3.6%] compared to a two-sided interval of [1.4%, 3.8%].
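The same statsmodels call used for Table 1 reproduces these figures; for example, the Clopper-Pearson row (a brief sketch):

    # Clopper-Pearson bounds for 17 false positives in 711 conclusive examinations.
    from statsmodels.stats.proportion import proportion_confint

    _, upper_one_sided = proportion_confint(17, 711, alpha=0.10, method="beta")
    lower, upper = proportion_confint(17, 711, alpha=0.05, method="beta")
    print(f"One-sided 95% upper bound: {upper_one_sided:.4f}")      # cf. Table 3
    print(f"Two-sided 95% interval:   [{lower:.4f}, {upper:.4f}]")  # cf. Table 3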

For a final example, we can consider PCAST's approach to sampling error for the FBI-Noblis study of latent fingerprint identification. The report describes two related studies of latent fingerprints, both with large sample sizes and small error rates, but I will only provide additional calculations for the 2011 one. The report describes it as follows:
The authors assembled a collection of 744 latent-known pairs, consisting of 520 mated pairs and 224 non-mated pairs. To attempt to ensure that the non-mated pairs were representative of the type of matches that might arise when police identify a suspect by searching fingerprint databases, the known prints were selected by searching the latent prints against the 58 million fingerprints in the AFIS database and selecting one of the closest matching hits. Each of 169 fingerprint examiners was shown 100 pairs and asked to classify them as an identification, an exclusion, or inconclusive. The study reported 6 false positive identifications among 3628 nonmated pairs that examiners judged to have “value for identification.” The false positive rate was thus 0.17 percent (upper 95 percent confidence bound of 0.33 percent). The estimated rate corresponds to 1 error in 604 cases, with the upper bound indicating that the rate could be as high as 1 error in 306 cases.
The experiment is described more fully in Fingerprinting Under the Microscope: A Controlled Experiment on the Accuracy and Reliability of Latent Print Examinations (Part I), Apr. 26, 2011, et seq. More detail on the one statistic selected for discussion in the PCAST report is in Table 4.

Table 4. Supplementing PCAST’s CIs
for Ulery et al. (2011) (x/n = 6/3628)

Method            Upper One-sided 95% CI Limit   Lower Two-sided 95% CI Limit   Upper Two-sided 95% CI Limit
Clopper-Pearson   0.0033 = 1/303                 0.0006 = 1/1667                0.0036 = 1/278
Wilson            0.0032 = 1/313                 0.0008 = 1/1250                0.0036 = 1/278
Jeffreys          0.0031 = 1/323                 0.0007 = 1/1423                0.0034 = 1/294
Agresti-Coull     0.0033 = 1/303                 0.0007 = 1/1423                0.0037 = 1/270

The interval that PCAST seems to recommend is [0.17%, 0.33%]. The two-sided interval is [0.06%, 0.36%]. Whereas PCAST would have the expert testify that the error rate is "as high as 1 error in 306 cases," or perhaps between "1 error in 604 cases [and] 1 error in 306 cases," the two-sided 95% CI extends from 1 in 1667 on up to 1 in 278.
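The corresponding calculation for the 2011 data, expressed in the "1 in N" form used in the report (a brief sketch; "beta" again denotes Clopper-Pearson):

    # Clopper-Pearson bounds for 6 false positives in 3628 non-mated comparisons.
    from statsmodels.stats.proportion import proportion_confint

    x, n = 6, 3628
    point = x / n
    lower, upper = proportion_confint(x, n, alpha=0.05, method="beta")   # two-sided 95%
    _, upper_one_sided = proportion_confint(x, n, alpha=0.10, method="beta")

    print(f"Point estimate:       1 in {1 / point:.0f}")
    print(f"One-sided 95% upper:  1 in {1 / upper_one_sided:.0f}")                     # cf. Table 4
    print(f"Two-sided 95%:        from 1 in {1 / lower:.0f} to 1 in {1 / upper:.0f}")  # cf. Table 4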

Although PCAST's focus on the high end of estimated error rates strikes me as tendentious, a possible argument for it is psychologically and legally oriented. If jurors were to anchor on the first number they hear, such as 1/1667, they would underestimate the false-positive error probability. By excluding lower limits from the presentation, we avoid that possibility. Of course, we also increase the risk that jurors will overestimate the effect of sampling error. A better solution to this perceived problem might be to present the interval from highest to lowest rather than ignoring the low end entirely.

APPENDIX

PCAST is less than pellucid on whether a judge or jury should learn of (1) the point estimate together with an upper limit or (2) only the upper limit. The Council seems to have rejected the more conventional approach of giving both end points of a two-sided CI. The following assertions suggest that the Council favored the second, and more extreme, approach:
  • [T]o inform jurors that [the] only two properly designed studies of the accuracy of latent fingerprint analysis ... found false positive rates that could be as high as 1 in 306 in one study and 1 in 18 in the other study ... would appropriately inform jurors that errors occur at detectable frequencies, allowing them to weigh the probative value of the evidence.” (P. 96.)
  • The report states that "the other (Miami-Dade 2014 study) yielded a considerably higher false positive rate of 1 in 18." (P. 96.) Yet, 1 in 18 was not the false-positive rate observed in the study, but, at best, the upper end of a one-sided 95% CI above this rate.
Other parts of the report point in the other direction or are simply ambiguous. Examples follow:
  • If firearms analysis is allowed in court, the scientific criteria for validity as applied should be understood to require clearly reporting the error rates seen in appropriately designed black-box studies (estimated at 1 in 66, with a 95 percent confidence limit of 1 in 46, in the one such study to date). (P. 112)
  • Studies designed to estimate a method’s false positive rate and sensitivity are necessarily conducted using only a finite number of samples. As a consequence, they cannot provide “exact” values for these quantities (and should not claim to do so), but only “confidence intervals,” whose bounds reflect, respectively, the range of values that are reasonably compatible with the results. When reporting a false positive rate to a jury, it is scientifically important to state the “upper 95 percent one-sided confidence bound” to reflect the fact that the actual false positive rate could reasonably be as high as this value. 116/ (P. 53)
  • The upper confidence bound properly incorporates the precision of the estimate based on the sample size. For example, if a study found no errors in 100 tests, it would be misleading to tell a jury that the error rate was 0 percent. In fact, if the tests are independent, the upper 95 percent confidence bound for the true error rate is 3.0 percent. Accordingly a jury should be told that the error rate could be as high as 3.0 percent (that is, 1 in 33). The true error rate could be higher, but with rather small probability (less than 5 percent). If the study were much smaller, the upper 95 percent confidence limit would be higher. For a study that found no errors in 10 tests, the upper 95 percent confidence bound is 26 percent—that is, the actual false positive rate could be roughly 1 in 4 (see Appendix A). (P. 53 n. 116.)
  • In summarizing these studies, we apply the guidelines described earlier in this report (see Chapter 4 and Appendix A). First, while we note (1) both the estimated false positive rates and (2) the upper 95 percent confidence bound on the false positive rate, we focus on the latter as, from a scientific perspective, the appropriate rate to report to a jury—because the primary concern should be about underestimating the false positive rate and the true rate could reasonably be as high as this value. 262/ (P. 92.)
  • Since empirical measurements are based on a limited number of samples, SEN and FPR cannot be measured exactly, but only estimated. Because of the finite sample sizes, the maximum likelihood estimates thus do not tell the whole story. Rather, it is necessary and appropriate to quote confidence bounds within which SEN, and FPR, are highly likely to lie. (P. 152.)
  • For example, if a study finds zero false positives in 100 tries, the [Clopper-Pearson/Exact Binomial method, the Wilson Score interval, the Agresti-Coull (adjusted Wald) interval, and the Jeffreys interval] give, respectively, the values 0.030, 0.026, 0.032, and 0.019 for the upper 95 percent confidence bound. From a scientific standpoint, any of these might appropriately be reported to a jury in the context “the false positive rate might be as high as.” (In this report, we used the Clopper-Pearson/Exact Binomial method.) (P. 153.)