Friday, July 1, 2022

Proposed Amendment to Federal Rule of Evidence 702 Clears More Hurdles

The following report appeared in the OSAC newsletter OSAC In Brief, June 2022, at 4-6 with the title "Proposed Amendment to Federal Rule of Evidence 702 Clears More Hurdles." It updates a report in the June 2022 issue (posted earlier today on this blog). Both reports are meant to be boringly factual. More opinionated remarks may appear later.

After five years of discussion, a proposed amendment to Federal Rule of Evidence 702 on testimony by expert witnesses has progressed to the Judicial Conference of the United States—the policy-making arm of the federal judiciary. If the Judicial Conference accepts the unanimous recommendations of both its Advisory Committee on Evidence Rules, which drafted the amendment, and its standing Committee on Rules of Practice and Procedure, which endorsed it this month, the amendment will be delivered to the Supreme Court for transmittal to Congress. Then, unless Congress intervenes, it will become effective by the end of next year.

But what effect would it have? According to the Advisory Committee chair, U.S. District Court Judge Patrick Schiltz, the amendment does not alter the meaning of the rule in the slightest. “It simply makes it clearer, makes it easier for people to understand, so that fewer mistakes will be made” (as reported June 7, in Bloomberg Law). Box 1 shows the proposed changes, which differ slightly from those discussed in the OSAC In Brief article of July 2021.

BOX 1. Proposed Changes to Federal Rule of Evidence 702
A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if the proponent demonstrates to the court that it is more likely than not that:
(a) the expert's scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;
(b) the testimony is based on sufficient facts or data;
(c) the testimony is the product of reliable principles and methods; and
(d) the expert has reliably applied expert's opinion reflects a reliable application of the principles and methods to the facts of the case.

On the face of it, the amendment does little, if anything, to alter the substance of the existing rule. It adds the words “if the proponent demonstrates to the court that it is more likely than not” in front of the criteria for admitting expert testimony, but the Supreme Court had already noted that in exercising a longstanding “gatekeeping” role, the district court needs to determine whether the conditions for admitting expert testimony are “established by a preponderance of proof.” (Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579, 592 n.10 (1993) (citing Fed. Evid. 104(a); as a result of public comments, the Advisory Committee substituted “more likely than not” for the “preponderance of evidence” to describe the proponent’s burden of persuasion on the issue of admissibility).

The other wording change concerns the well entrenched reliability-as-applied requirement (“the expert has reliably applied” in part (d)). The amendment uses an alternative phrase—“the expert's opinion reflects a reliable application.” Although one could argue that the specific reference to “opinion” limits the requirement to personal opinions, that is not the intent. An explanatory note that will accompany the revised rule (if and when it is adopted) makes it plain that it still must appear that the expert has applied a valid and reliable method proficiently and appropriately in making any and all findings and inferences. The only purpose of the change is “to emphasize that each expert opinion must stay within the bounds of what can be concluded from a reliable application” of a reliable method to the facts of the case. And, this Advisory Committee Note (ACN) adds that this directive is “is especially pertinent to the testimony of forensic experts,” for which “the judge should (where possible) receive an estimate of the known or potential rate of error of the methodology employed, based (where appropriate) on studies that reflect how often the method produces accurate results” rather than “assertions of absolute or one hundred percent certainty—or to a reasonable degree of scientific certainty ... .”

During the six-month comment period that ended in February, the draft received well over 500 comments. The Reporter to the Advisory Committee found the public reaction “somewhat surprising, because the proposed amendment essentially seeks only to clarify the application of Rule 702 as it was amended in 2000—and that amendment received [only] 179 comments.” Lawyers from the plaintiffs’ side of the civil bar opposed the latest amendment, while defendants’ lawyers supported it.

There were relatively few comments about the implications of the additional words and the accompanying note for the areas of forensic science covered by OSAC. These too were (predictably) divided. The National District Attorneys Association (NDAA) objected to the ACN’s singling out forensic-science testimony as a problem and saw the amendments as “a solution in search of a problem.” But the New York City Bar Association expressed “particular concern [with] criminal prosecutions” and “the scientific validity of many types of ‘feature-comparison’ methods of identification, such as those involving fingerprints, footwear and hair.” The New York State Crime Laboratory Advisory Committee (NYSCLAC) objected to “changes limiting forensic science testimony” but then maintained that its laboratories already complied with the guidance in the ACN. The Union of Concerned Scientists questioned parts of the NDAA and NYSCLAC statements and insisted that “forensic evidence should be required to present courts with estimates of error rates relevant to their methodologies.” The Innocence Project and other organizations and individuals submitted a joint statement praising the changes and pressing for more. They wanted the text of the rule to contain a requirement that testimony is not only “the product of reliable principles and methods” (the current wording), but also to specify that it “includes the limitations and uncertainty of those principles and methods.”

The conflicting comments regarding forensic science produced no modifications. If the amendment is adopted, it will implement, to some extent, the 2016 recommendation of the President’s Council of Advisors on Science and Technology that “the Judicial Conference of the United States ... should prepare ... an Advisory Committee note, providing guidance to Federal judges concerning the admissibility under Rule 702 of expert testimony based on forensic feature-comparison methods.”

Author’s disclaimer: This report presents the views of the author. Their publication in In Brief is not an endorsement by NIST or OSAC, and they are not intended to represent the views of any OSAC unit. No estimate of the known or potential rate of error is available.

Proposed Changes to Federal Rule of Evidence 702

The following report appeared in the OSAC newsletter OSAC In Brief, July 2021, at 3-7 with the uninspired title "Proposed Changes to Federal Rule of Evidence 702." It was followed by an update in the June 2022 issue (about to be reproduced on this blog). Both are meant to be boringly factual. More opinionated remarks may appear later.

On April 30, the federal Advisory Committee on Evidence Rules unanimously proposed two changes to the wording of Federal Rule of Evidence 702. The rule, which many states have adopted in one form or another, provides for testimony by expert witnesses. The changes do not alter the meaning of the rule, but they can be seen as a course-correction signal telling courts to be more vigorous in ensuring that “forensic expert testimony is valid, reliable, and not overstated in court.”

The quoted words come from a report of the Advisory Committee. Facilitating such testimony also is part of OSAC’s raison d’ĂȘtre. This article for In Brief therefore describes the proposed amendment, a little bit of its history, the steps required for it to be enacted into law, and its significance for OSAC’s work.

The Proposer: An Advisory Committee to the Standing Committee of the Judicial Conference

The Judicial Conference of the United States is the policymaking organ of the judicial branch of the federal government. Composed of the Chief Justice of the U.S. Supreme Court, the chief judges of the 13 federal judicial circuits, and select federal district judges, it also is required by statute “to carry on a continuous study of the operation and effect of the general rules of practice and procedure" that apply in the federal courts (and, with some variations, in many state court systems as well). The Conference relies on a “Committee on Rules of Practice and Procedure, commonly referred to as the ‘Standing Committee.’" The Standing Committee, in turn, relies on advisory committees on appellate, bankruptcy, civil, criminal, and evidence rules. These advisory committees are comprised of “federal judges, practicing lawyers, law professors, state chief justices, and representatives of the Department of Justice.” (Quotations are from the Administrative Office of the U.S. Courts.) The Advisory Committee on Evidence Rules (which we can abbreviate as ACER) is one of these committees.

The Proposed Text: Two Wording Changes

Rule 702 went into effect in federal courts in 1975. It was one sentence long. The Supreme Court famously interpreted it in Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), a somewhat ambivalent and abstract opinion. The Court expounded further in cases in 1997 and 1999. The rule was rewritten to incorporate the teachings in these cases in 2000, leading to the version with the longer sentence in the right-hand side of Box 1.

BOX 1. FEDERAL RULE OF EVIDENCE 702 THEN AND NOW
The Rule in 1975 The Rule in 2021
If scientific, technical, or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training, or education, may testify thereto in the form of an opinion or otherwise. A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if:
(a) the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;
(b) the testimony is based on sufficient facts or data;
(c) the testimony is the product of reliable principles and methods; and
(d) the expert has reliably applied the principles and methods to the facts of the case.

The proposed amendment makes two seemingly minor changes, shown in Box 2:

BOX 2. THE ADVISORY COMMITTEE’S PROPOSED AMENDMENT TO RULE 702

A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if the proponent has demonstrated by a preponderance of the evidence that:
(a) the expert’s scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;
(b) the testimony is based on sufficient facts or data;
(c) the testimony is the product of reliable principles and methods; and
(d) the expert has reliably applied expert’s opinion reflects a reliable application of the principles and methods to the facts of the case.

Reading these words, one might well ask what is going on. The first change seems to state the obvious (to lawyers, anyway). A footnote in Daubert already indicates that in the “preliminary assessment of whether the reasoning or methodology” possesses “evidentiary reliability,” the trial court must be satisfied by “a preponderance of proof” because that is the threshold for all “[p]reliminary questions concerning the qualification of a person to be a witness, the existence of a privilege, or the admissibility of evidence.” It may not hurt to state this standard in the text of the rule (although including it after the opening clause about qualifications awkwardly fails to modify the qualifications part of the rule). But why bother?

Similarly, the change to Part (d) is potentially confusing because it limits the “reliable application” prong of the rule to expert “opinion” even though, as the Advisory Committee that drafted the original rule noted, it is “logically unfounded” to “assume[] that experts testify only in the form of opinions.” Instead, “[t]he rule … recognizes that an expert on the stand may give a dissertation or exposition of scientific or other principles relevant to the case, leaving the trier of fact to apply them to the facts.” But aside from the probably unintended limitation of the as-applied prong to opinions, why bother? What is the difference between testimony when an expert has “reasonably applied the principles and methods” and testimony that “reflects a reasonable application of the principles and methods”?

The answers lie in ACER’s official note prepared to accompany the rule, the minutes of its meetings, and its periodic reports to the Standing Committee on its progress in revising the rule.

The Purpose of the New Text

For OSAC, the most salient parts of the note of the Advisory Committee are in Boxes 3 and 4. As to the first change, regarding “preponderance,” ACER believed that

BOX 3. Part of ACER’s Proposed Note Explaining Its First Proposed Change

[M]any courts have held that the critical questions of the sufficiency of an expert’s basis, and the application of the expert’s methodology, are questions of weight and not admissibility. These rulings are an incorrect application of Rules 702 and 104(a). … The Committee concluded that emphasizing the preponderance standard in Rule 702 specifically was made necessary by the courts that have failed to apply correctly the reliability requirements of that rule. … [Explicitly incorporating the standard] means that once the court has found the admissibility requirement to be met by a preponderance of the evidence, any attack by the opponent will go only to the weight of the evidence.

A major push for this change came from individuals and organizations concerned with civil litigation in which, they believed, courts have admitted expert opinions that a drug or chemical is harmful without adequately verifying that there is a body of scientific literature sufficient to let a reasonable expert conclude that the substance can cause the kind of harm claimed to have occurred under the conditions of the case. However, it also will remind judges in criminal cases that they must have proof that the scientific literature is sufficient to support the findings of forensic-science experts.

As Box 4 shows, the second part of the “amendment is especially pertinent to the testimony of forensic [science] experts in both criminal and civil cases”:

BOX 4. Part of ACER’s Proposed Note Explaining Its Second Proposed Change

Rule 702(d) has also been amended to emphasize that a trial judge must exercise gatekeeping authority with respect to the opinion ultimately expressed by a testifying expert. … The amendment is especially pertinent to the testimony of forensic experts in both criminal and civil cases. Forensic experts should avoid assertions of absolute or one hundred percent certainty—or to a reasonable degree of scientific certainty—if the methodology is subjective and thus potentially subject to error. In deciding whether to admit forensic expert testimony, the judge should (where possible) receive an estimate of the known or potential rate of error of the methodology employed, based (where appropriate) on studies that reflect how often the method produces accurate results. Expert opinion testimony regarding the weight of feature comparison evidence (i.e., evidence that a set of features corresponds between two examined items) must be limited to those inferences that can reasonably be drawn from a reliable application of the principles and methods. This amendment does not, however, bar testimony that comports with substantive law requiring opinions to a particular degree of certainty. … [N]othing in the amendment requires the court to nitpick an expert’s opinion in order to reach a perfect expression of what the basis and methodology can support. The … standard does not require perfection. On the other hand, it does not permit the expert to make extravagant claims that are unsupported by the expert’s basis and methodology.

It is the ACER note, much more than the revisions to the text of the rule, that has implications for forensic-science evidence. As the note indicates, the committee was especially concerned with forensic-science testimony. Its briefing materials included summaries of federal cases from across the spectrum of forensic sciences that raised the issue of “overstatement.” Furthermore, the idea of a new Advisory Committee Note came from the 2016 report of the President’s Council of Advisors on Science and Technology. PCAST called on “the Judicial Conference [to] prepare, with advice from the scientific community, a best practices manual and an Advisory Committee note, providing guidance to Federal judges concerning the admissibility under Rule 702 of expert testimony based on forensic feature-comparison methods.”

Apparently, PCAST did not realize that ACER is not empowered to write new notes to old rules. At a symposium convened by ACER in 2017, PCAST co-chair and newly appointed Presidential science advisor, Eric Lander, advised the committee as follows: “If an advisory note is a possibility, I’d favor it. If it’s not, change a comma in the rule and then write a new advisory note. Change one word, any word and write an advisory note.” Advisory Comm. on Evid. Rules Symposium on Forensic Expert Testimony, Daubert, and Rule 702, 86 Ford. L. Rev. 1463, 1523 (2018). This change-a-word artifice is more or less what is happening.

What Is Next in the Rulemaking Process?

The proposed amendment is just that—proposed. To become law, the ACER amendment and accompanying note must be approved by the Standing Committee after a six-month period for public comment and testimony (after which ACER reviews and can revise the proposed amendment and seek more comment). The Standing Committee then reviews the final drafts. It can revise and return the draft to ACER, or it can submit the amendment and note to the full Judicial Conference for its review. If the Judicial Conference approves, the drafts go to the Supreme Court, which normally transmits them to Congress with no substantive review. Congress then can adopt, reject, modify, or defer the rule change, but if Congress is silent for seven months, the amendment becomes effective at the end of the year.

Plainly, the proposal, which was four years in the making, still has a long way to go, but the very fact that ACER deliberated at length and expressed concern about forensic-science testimony, overstatement, and error probabilities could have more immediate impact in litigation.

Implications for OSAC

To help satisfy the proof requirements of Rule 702 (both as it stands and as it might be amended), subcommittees drafting standards for making findings and for reporting or testifying should specifically cite the scientific literature that supports each part of the standard. Valid estimates of potential error rates (or related statistics on the accuracy of results), or procedures to arrive at these estimates, should be part of such standards. Scientific and Technical Review Panels (STRPs) already are instructed to look for this content or for an explanation in the standard of why methods for ascertaining and expressing uncertainty in measurements, observations, or inferences are not present in the standards they review.

The repeated references to “overstatement” in ACER’s deliberations and materials should reinforce the desire of OSAC units to address the admittedly difficult problem of prescribing standards for testimony—and to use phrases in all standards that involve results that will satisfy the insistence on “those inferences that can reasonably be drawn from a reliable application of the principles and methods.” Cases on firearms-toolmark identifications (called “ballistics” cases in the ACER materials) suggest that judicial efforts are unlikely to produce the best solution. The Department of Justice has attempted to confront this issue with its Uniform Language for Testimony and Reports standards (ULTRs). It argued to ACER that these ULTRs help solve the problem of overclaiming, but one response was that because there are no such standards in laboratories generally, a new Advisory Committee Note is necessary. OSAC units still can help fill this gap if they act quickly.

Disclaimer: This report presents the views of the author. Their publication in In Brief is not an endorsement by NIST or OSAC, and they are not intended to represent the views of any OSAC unit. The error rate associated with them is not known.

Saturday, June 11, 2022

State v. Ghigliotti, Computer-assisted Bullet Matching, and the ASB Standards

In State v. Ghigliotti, 232 A.3d 468, 471 (N.J. App. Div. 2020), a firearms examiner concluded that a particular gun did not fire the bullet (or, more precisely, a bullet jacket) removed from the body of a man found shot to death by the side of a road in Union County, New Jersey. That was 2005, and the case went nowhere.

Ten years later, a detective prevailed on a second firearms examiner to see what he thought of the toolmark evidence. After considerable effort, this examiner reported that the microscopic comparisons with many test bullets from the gun in question were inconclusive.

However, at a training seminar in New Orleans he learned of two tools developed and marketed by Ultra Electronics Forensic Technology, the creator of the Integrated Ballistics Identification System (IBIS) that "can find the 'needle in the haystack', suggesting possible matches between pairs of spent bullets and cartridge cases, at speeds well beyond human capacity. The Bullettrax system “digitally captures the surface of a bullet in 2D and 3D, providing a topographic model of the marks around its circumference.” As “[t]he world’s most advanced bullet acquisition station” it uses “intelligent surface tracking that automatically adapts to deformations of damaged and fragmented bullets.”

The complementary Matchpoint is an “analysis station” with “[p]owerful visualization tools [that] go beyond conventional comparison microscopes to ease the recognition of high-confidence matches. Indeed, Matchpoint increases identification success rates while reducing efforts required for ultimate confirmations.” It features multiple side-by-side view of images from the Bullettrax data and score analysis. The court explained that “the Matchpoint software ... included tools for flattening and manipulating the images, adjusting the brightness, zooming in, and ‘different overlays of ... color scaling.’”

But the examiner did not make the comparisons based on the digitally generated and enhanced images, and he did not rely on any similarity-score analysis. Rather, he “looked at the images side-by-side on a computer screen using Matchpoint [only] ‘to try and target areas of interest to determine ... if (he) was going to go back and continue with further [conventional] microscopic comparisons or not.’” He found four such areas of agreement. Conducting a new microscopic analysis of these and other areas a few weeks later, he “‘came to an opinion of an identification or a positive identification’ ... grounded in his ‘training and experience and education as a practitioner in firearms identification’ and his handling of over 2300 cases.” 232 A.3d at 478–49.

The trial court “determined that a Frye hearing was necessary to demonstrate the reliability of the computer images of the bullets produced by BULLETTRAX before the expert would be permitted to testify at trial.” Id. at 471, The state filed an interlocutory appeal, arguing that the positive identification did not depend on Ultra’s products. The Appellate Division affirmed, holding that the hearing should proceed.

I do not know where the case stands, but its facts provide the basis for a thought experiment. At about the same time as the Ghigliotto court affirmed the order for a hearing, the American Academy of Forensic Sciences Standards Board (ASB) published a package of standards on toolmark comparisons. Created in 2015, ASB describes itself as “an ANSI [American National Standards Institute]-accredited Standards Developing Organization with the purpose of providing accessible, high quality science-based consensus forensic standards.” Academy Standards Board, Who We Are, 2022. Two of its standards concern three-dimensional (3D) data and inferences in toolmark comparisons, while the third is specific to software for comparing 2D or 3D data.

We can put the third to the side, for it is limited to software that "seeks to assess both the level of geometric similarity (similarity of toolmarks) and the degree of certainty that the observed similarity results from a common origin." ANSI/ASB Standard 062, Standard for Topography Comparison Software for Toolmark Analysis § 3.1 (2021). The data collection and visualization software here does neither, and the scoring feature of Matchpoint was not used.

ANSI/ASB Standard 061, Firearms and Toolmarks 3D Measurement Systems and Measurement Quality Control (2021), is more apposite although it is only intended “to ensure the instrument’s accuracy, to conduct instrument calibration, and to estimate measurement uncertainty for each axis (X, Y, and Z).” It promises “procedures for validation of 3D system hardware” but not software. It “does not apply to legacy 2D type systems,” leaving one to wonder whether there are any standards for validating them.

Even for "3D system hardware," the procedure for “developmental validity” (§ 4.1) is nonexistent. There are no criteria in this standard for recognizing when a measurement system is valid and no steps that a researcher must follow to study validity. Instead, the section on “Developmental Validation (Mandatory)” states that an “organization with appropriate knowledge and/or [sic] expertise” shall complete “a developmental validation”; that this validation “typically” (but not necessarily) consists of library research (“identifying and citing previously published scientific literature”); and that “ample”—but entirely uncited— literature exists “to establish the underlying imaging technology” for seven enumerated technologies. In full, the three sentences on “developmental validation” are

As per ANSI/ASB Standard 063, Implementation of 3D Technologies in Forensic Firearm and Toolmark Comparison Laboratories, a developmental validation shall be completed by at least one organization with appropriate knowledge and/or expertise. The developmental validation of imaging hardware typically consists of identifying and citing previously published scientific literature establishing the underlying imaging technology. The methods defined above of coherence scanning interferometry, confocal microscopy, confocal chromatic microscopy, focus variation microscopy, phase-shifting interferometric microscopy, photometric stereo, and structured light projection all have ample published scientific literature which can be cited to establish an underlying imaging technology.

Perhaps the section is merely there to point the reader to the different standard, ASB 063, on implementation of 3D technologies. \1/ But that standard seems to conceive of “developmental validation” as a process that occurs in a forensic laboratory or other organization by a predefined process with a “technical reviewer” to sign off on the resulting document that becomes the object of further review through “[p]eer-reviewed publication (or other means of dissemination to the scientific community, such as a peer-reviewed presentation at a scientific meeting).” § 4.1.3.4. The data and the statistics needed to assess measurement validity are left to the readers' imaginations (or statistical acumen). \2/

ASB 061 devotes more attention to what it calls “deployment validation” on the part of every laboratory that chooses to use a 3D measuring instrument. This part of the standard describes some procedures for checking whether X, Y, and Z “scales” that should reveal whether measurements of the coordinates of points on the surface of the material are close to what they should be. For example, § 4.2.5.4.1 specifies that

Using calibrated geometric standards (e.g., sine wave, pitch, step heights), measurements shall be conducted to check the X and Y lateral scales as well as the vertical Z scale. Ten measurements shall be performed consecutively ... . The measurement uncertainty of the repeatability measurements shall overlap with the certified value and uncertainty of the geometric standard used.

The phrasing is confusing (to me, at least). I assume that a “geometric standard” is the equivalent of a ruler of known length (a “certified value” of, say, 1 ± 0.01 microns). But what does the edict that “[t]he measurement uncertainty of the repeatability measurements shall overlap with the certified value and uncertainty of the geometric standard used” mean operationally?

The best answer I can think of is that the standard contemplates comparing two intervals. One is the scale value (along, say, the X-axis). Imagine that the “geometric standard” that is taken to be the truth is certified as having a length of 1± 0.01 microns. Let’s call this the “certified standard interval.”

Now the laboratory makes ten measurements for its “deployment validation” to produce what we can call a “sample interval” from the ten measurements. The ASB standard does not contain any directions on how this is to be done. One approach would be to compute a confidence interval on the assumption that the sample measurements are normally distributed. Suppose the observed sample mean for them is 0.80, and the standard error computed from the ten sample measurements is s = 0.10 microns. The confidence interval is then 0.80 ± k(0.10), where is k is some constant. If the confidence interval includes any part of the certified interval, this part of the deployment-validation requirement is met.

What values of k would be suitable for the instrument to be regarded as “deploymentally valid”? The standard is devoid of any insight into this critical value and its relationship to confidence. It does not explain what the interval-overlap requirement is supposed to accomplish, but if the confidence interval is part of it, it is an ad hoc form of hypothesis testing with an unstated significance level.

Is the question of whether the hypothesis that there is no difference between the standard reference value of 1 and the true sample mean can be rejected at some preset significance level all that important here? Should not the question be how much the disparities between the sample of ten measured values and the geometric-standard value would affect the efficacy of the measurements? An observed sample mean that is 20% too low does not lead to the rejection of the hypothesis that the instrument’s measurements are, in the long run, exactly correct, but with only ten measurements in the sample, that may tell us more about the lack of the statistical power of the test than about the ability of the instrumentation to measure what it seeks to measure with suitable accuracy for the applications to which it is put.

In sum, the standard’s section on “Developmental Validation (Mandatory)” mandates nothing that is not trivially obvious—the court already knows that it should look for support for the 3D scanning and image-manipulation methods in the scientific literature, and the standard does not reveal what the substance of this validation should be. “Deployment Validation (Mandatory)” is supposed to ensure that the laboratory is properly prepared to use a previously validated system for casework. It is of little use in a hearing on the general acceptance of the scanning system and the theories behind it. (One could argue that scientists would accept a system that a laboratory has been rigorously pretested and shown to be perform accurately, even with no other validation, but it is not clear that the standard describes an appropriate, rigorous pretesting procedure.)

Moreover, the standard explicitly excludes software from its reach, making it inapplicable to the Matchpoint image-manipulation tools the helped the examiner in Ghigliotti zero in on the regions that altered his opinion. The companion standard on software does not fill this gap, for it deals only with software that produces similarity scores or random-match probabilities. Finally, ASB 063's substantive requirements for "deployment validation" prior to laboratory implementation might well prohibit an examiner from going to the developer of hardware and software not yet adopted by his or her employer for help with locating features for further visual analysis, as occurred in Ghigliotti. But that is not responsive to the legal question of whether the developer's system is generally accepted as valid in the scientific community.

NOTES
  1. ANSI/ASB 063 is even more devoid of references. The entire bibliography consists of a webpage entitled “control chart.” There, attorneys, courts, or experts seeking to use the standard will discover that a “control chart is a graph used to study how a process changes over time.” That is great for quality control of instrumentation, but it is irrelevant to validation.
  2. Under § 4.1.2.4, "The plan for developmental validation study shall include the following:
    "a) the limitations of the procedure;
    "b) the conditions under which reliable results can be obtained;
    "c) critical aspects of the procedure that shall be controlled and monitored;
    "d) the ability of the resulting procedure to meet the needs of the given application."

Last updated: 12 June 2022

Tuesday, May 24, 2022

The New York Court of Appeals Returns to Probabilistic Genotyping Software (Part III—Six Empirical Studies)

New York’s Court of Appeals returned to the contentious issue of “probabilistic genotyping software” (PGS) in People v. Wakefield, 2022 N.Y. Slip Op. 02771, 2022 WL 1217463 (N.Y. Apr. 26, 2022). As previously discussed, in People v. Williams, 147 N.E.3d 1131 (N.Y. 2020), a slim majority of the court had reasoned that the output of a computer program should not have been admitted without a full evidentiary hearing on the program's general acceptance within the scientific community.

In Wakefield, the Court of Appeals faced a different question for a more complex computer program. This time, the question was whether, after holding such a hearing, the trial court erred in finding that the more sophisticated program was generally accepted as a scientifically valid and reliable means of estimating “likelihood ratios” for DNA mixtures like the ones recovered in the case. The program, known as TrueAllele, is marketed by Cybergenetics, “a Pittsburgh-based bioinformation company [whose] computers translate DNA data into useful information.”

As discussed separately, the Wakefield court held that, in the circumstances of the case, the output of TrueAllele was admissible to associate the defendant with a murder. It emphasized “multiple validation studies ... demonstrat[ing] TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.” 2022 WL 1217463 at *7. But the court did not describe the level of accuracy attained in any of the validation studies. That is surely something lawyers would want to know about, so I decided to read the “peer-reviewed publications in scientific journals” (id.) to which the court must have been referring.

The state introduced 31 exhibits at the evidentiary hearing in 2015. Nine were journal publications of some kind. Six of those described data collected to establish (or indirectly suggesting) that TrueAllele was accurate. Only three of them relied on “known DNA samples” as opposed to samples from casework. The synopses that follow do not describe all the parts of them, let alone all the findings from them. I merely pick out the parts that I found most interesting and most pertinent to the question of accuracy or error (two sides of the same coin).

The 2009 Cybergenetics Known-samples Study

The first study is M.W. Perlin & A. Sinelnikov, An Information Gap in DNA Evidence Interpretation. 4 PLoS ONE  e8327. This experiment used 40 laboratory-constructed two-contributor mixture samples (from two pairs of unrelated individuals) with varying mixture proportions and total DNA amounts (0.125 ng to 1 ng) to show that TrueAllele was much better at classifying a sample as containing a contributor’s DNA than was the cumulative probability of inclusion method (CPI) that employed peak-height thresholds for binary determinations of the presence of alleles. TrueAllele’s likelihood ratios (LRs) supported the hypothesis of inclusion in nearly every instance (LR>1).

However, the data could not reveal whether the level of positive support (log-LR) was accurate. Does a computed LR of 1,000,000 “really” indicate evidence that is five orders of magnitude more probative than a computed LR of 10? The “empirical evidence” from the study cannot answer this question. The best we can do is to verify that the computed LR increases as the quantity of DNA does. The uncertainty inherent in the PCR process is smaller for larger starting quantities, and this should be reflected in the magnitude of the LR.

The 2011 Cybergenetics–New York State Police Casework Study

The second study also used two-contributor mixtures, but these came from casework in which the alleles, as ascertained by conventional methods, did not exclude the defendant as a possible contributor. In Mark W. Perlin et al., Validating TrueAllele  DNA Mixture Interpretation, 56 J. Forensic Sci. 1430 (2011), researchers from Cybergenetics and the New York State Police laboratory selected “16 two-person mixture samples” that met certain criteria “from 40 adjudicated cases and one proficiency test conducted in” the New York laboratory. TrueAllele generated larger LRs than those from the manual analyses. That TrueAllele did not produce LRs < 1 (indicative of exclusions) for any defendant included by conventional  analysis is evidence of a low false-exclusion probability. The computed LRs are greater than 1 when they should be. But this empirical evidence does not directly address the question of whether the magnitude of the LRs themselves are as close or as far from 1 as they should be if they are to be understood as a Bayes' factor.

The 2013 CybergeneticsNew York State Police Casework Study

The third study is more extensive. In Mark W. Perlin et al., New York State TrueAllele ® Casework Validation Study, 58 J. Forensic Sci. 1458 (2013), Cybergenetics worked with the New York laboratory to reanalyze DNA mixtures with up to three contributors  from 39 adjudicated cases and two proficiency tests. “Whenever there was a human result, the computer’s genotype was concordant,” and TrueAllele “produced a match statistic on 81 mixture items ... , while human review reported a statistic on [only] 25 of these items.” 

This time Cybergenetics also tried to answer the question of how often TrueAllele produces false “matches” (LR>1) when it compares a known noncontributor’s sample to a mixed sample. It accomplished this by simulating false pairs of samples for TrueAllele to process. As the authors explained,

We compared each of the 87 matched mixture evidence genotypes with the (<87) reference genotypes from the other 40 cases. Each of these 7298 comparisons should generate a mismatch between the unrelated genotypes from different cases and hence a negative log(LR) value. A genotype inference method having good specificity should exhibit mismatch information values [log-LRs] that are negative in the same way that true matches are positive.

Id. at 1461. Thus, they derived two empirical distributions for likelihood ratios—one for the nonexcluded defendants in the cases (who we would expect to be actual sources)—and one for the unrelated individuals (who we would expect to be non-sources). The empirical distributions were well separated, and the log(LR) was always less than zero for the presumed non-sources. 

So TrueAllele seems to work well as a classifier (for distinguishing true-source pairs from false-source pairs) in these small-scale studies. But again, the question of whether the magnitudes of its LRs are highly accurate remains. With astronomically large LRs, it is hard to know the answer. Cf. David H. Kaye, Theona M. Vyvial & Dennis L. Young, Validating the Probability of Paternity, 31 Transfusion 823 (1991). \1/

The 2013 UCF–Cybergenetics Known-samples Study

The fourth study is J. Ballantyne, E.K. Hanson & M.W. Perlin, DNA Mixture Genotyping by Probabilistic Computer Interpretation of Binomially-sampled Laser Captured Cell Populations: Combining Quantitative Data for Greater Identification Information, 53 Sci. & Justice 103–114 (2013). It is not a validation study, but researchers from the University of Central Florida and Cybergenetics made two different two-person mixtures with equal quantities of DNA from each person. In such 50:50 mixtures, peak heights are expected to be similar, making it harder to fit the pattern of alleles into the pairs (single-locus genotypes) from each contributor than if there had been a major and minor contributor. So the team created ten small (20 cell) subsamples of each of the two mixed DNA samples by selecting cells at random. They analyzed these subsamples separately. They used TrueAllele to estimate the relative contributions (“mixture weights”) in the 20-cell samples, and found that when TrueAllele combined data from multiple subsamples, it assigned a 99% probability to the two contributor genotypes. The point of the study was to demonstrate the possibility of subdividing even small balanced samples to take advantage of peak height differences arising from imbalances in the even smaller subsamples.

The 2013 Cybergenetics–Virginia Department of Forensic Services Casework Study

The fifth study is more on point. In Mark W. Perlin et al., TrueAllele Casework on Virginia DNA Mixture Evidence: Computer and Manual Interpretation in 72 Reported Criminal Cases, 9 PLOS ONE e92837 (2014), researchers from Cybergenetics and the Virginia Department of Forensic Services compared TrueAllele with manual analysis on 111 selected casework samples. The set of criminal case mixtures paired with a nonexcluded defendant’s profile should produce large LRs. For ten pairs, TrueAllele failed to return “a reproducible positive match statistic.” Among the 101 remaining, presumably same-source pairs, the smallest LR was 18. Since the LR must be less than 1 to be deemed indicative of a noncontributor, in no instance did TrueAllele generate a falsely exonerating result.

But what about falsely incriminating LRs? This time, the researchers did not reassign the defendant’s profiles to other cases to produce false pairs. Rather, they generated 10,000 random STR genotypes (from population databases of alleles in Virginia) to simulate the STR profiles of non-sources of the mixtures from the criminal cases. They paired each of these non-source profiles with 101 genotypes that emerged from the unknown mixtures and calculated LR values. There were fewer than 1 in 20,000 LRs suggesting an association (LR > 1) among these mixture/non-source pairs; less than 1in 1,000,000 for LR > 1,000; and no false positives at all for LR > 6,054. In other words, TrueAllele produced an empirical distribution for false pairs that consisted almost entirely of LRs < 1 and that never had very large LRs. Again, it seems to be an excellent classifier.

The 2015 Cybergenetics–Kern Regional Crime Laboratory Known-samples Study

Finally, in M.W. Perlin et al., TrueAllele Genotype Identification on DNA Mixtures Containing up to Five Unknown Contributors, 60 J. Forensic Sci. 857 (2015), researchers from Cybergenetics and the Kern Regional Crime Laboratory in California obtained DNA samples from five known individuals. They constructed ten two-person mixtures by randomly selecting two of the five contributors and mixing their DNA in proportions picked at random. The researchers constructed ten 3-, 4-, and 5-person mixtures in the same manner. From each of these 4 × 10 mixtures, they created a 1 nanogram and a 200 picogram sample for STR analysis. TrueAlelle computed an LR for each of the genotypes that went into each analyzed sample (the alternative hypothesis being a random genotype).

Defining an exclusion as a LR < 1, TrueAllele rarely excluded true contributors to the 1 ng 2- or 3-contributor mixtures (no exclusions in 20 comparisons and 1 in 30, respectively), but with 4 and 5 contributors involved, the false-exclusion rates were 9/40 and 9/50, respectively. The false exclusions came from the more extreme mixtures. As long as at least 10% of the nanogram mixtures came from the lesser contributor, there were no false exclusions. The false-exclusion rates for the 200 pg samples were larger: 2/20, 4/30, 13/40, and 19/50. For these low-template mixtures, a greater proportion of the lesser contributor’s DNA (25%) had to be present to avoid false exclusions.

To assess false inclusions, 10,000 genotypes were randomly generated from each of three ethnic population allele databases. These noncontributor profiles were compared with the eight mixtures. For ethnic group and DNA mixture sample, the LRs fell well below LR=1, meaning that there were few false inclusions. For the high DNA levels (1 ng), the proportion of comparisons with misleading LRs (LR > 1 for the simulated noncontributors) were 0/600,000, 25/900,000, 186/1,200,000, and 1,301/1,500,000 for the 2-, 3-, 4-, and 5-person mixtures, respectively. The worst case (the most misleadingly high LR) occurred for the five-person mixture, where one LR was 1,592. For the low-template DNA mixtures, the corresponding false-inclusion proportions were 2/600,000, 53/900,000, 177/1,200,000, and 145/1,500,000. The worst outcome was an LR of 101 for a four-person mixture.

Apparently using “reliable” in its legal or nonstatistical sense (as in Daubert and Federal Rule of Evidence 702), the researchers concluded that “[t]his in-depth experimental study and statistical analysis establish the reliability of TrueAllele for the interpretation of DNA mixture evidence over a broad range of forensic casework conditions.” \2/ My sense of the studies as of the time of the hearing in Wakefield is that they show that within certain ranges (with regard to the quantity of DNA, the number of contributors, and the fractions from the multiple contributors), TrueAlelle’s likelihood ratios discriminate quite well between samples paired with true contributors and the same samples paired with unrelated noncontributors. \3/ Moreover, the program’s output behaves qualitatively as it should, generally producing smaller likelihood ratios for electrophoretic data that are more complex or more deviled by stochastic effects on peak heights and locations.

NOTES

  1. In this early study, we compared the empirical LR distribution for parentage using presumably true and false mother-child-father trios derived from a set of civil paternity cases to the “paternity index,” a likelihood ratio computed with software applying simple genetic principles to the inheritance of HLA types. We found that the theoretical PI diverged from the empirical LR for PI > 80 or so.
  2. Cf. David W. Bauer, Nasir Butt, Jennifer M. Hornyak & Mark W. Perlin, Validating TrueAllele Interpretation of DNA Mixtures Containing up to Ten Unknown Contributors, 65 J. Forensic Sci. 380, 380 (2020), doi: 10.1111/1556-4029.14204 (abstract concluding that “[t]he study found that TrueAllele is a reliable method for analyzing DNA mixtures containing up to ten unknown contributors
  3. One might argue that the number of mixed samples collectively studied is too small. PCAST indicated that “there is relatively little published evidence” because “[i]n human molecular genetics, an experimental validation of an important diagnostic method would typically involve hundreds of distinct samples.” President's Council of Advisors on Sci. & Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods 81 (2016) 81 (notes omitted), https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [https://perma.cc/R76Y-7VU]. The number of distinct samples (mixtures from different contributors) combining all the studies listed here seems closer to 100.

The New York Court of Appeals Returns to Probabilistic Genotyping Software (Part II—General Acceptance)

The New York Court of Appeals returned to the contentious issue of “probabilistic genotyping software” (PGS) in People v. Wakefield, 2022 N.Y. Slip Op. 02771, 2022 WL 1217463 (N.Y. Apr. 26, 2022). As previously discussed, in People v. Williams, 147 N.E.3d 1131 (N.Y. 2020), a slim majority of the court held that the output of a computer program should not have been admitted without a full evidentiary hearing on its general acceptance within the scientific community. The majority opinion described a confluence of considerations:

  1. The program had only been tested in the laboratory that developed it (“an invitation to bias,” id. at 1141);
  2. The only evidentiary hearing ever conducted on the program had only shown “internal validation” and formal approval by a subcommittee of a state forensic science commission that was a “narrow class of reviewers, some of whom were employed by the very agency that developed the technology,” id. at 1142;
  3. Given “the ‘black box’ nature of that program,” the developer's “secretive approach ... was inconsistent with quality assurance standards” id.; and
  4. Submissions for hearings in other cases “suggested that the accuracy calculations of that program may be flawed,” id.

But which of these four factors were dispositive? Was it the combination of all four, or something in between, that rendered the evidence inadmissible? If the developer were to change its “secretive approach” so as to allow defense experts to study the program’s source code, would that, plus the “internal validation,” be enough to establish general scientific acceptance? Would it be sufficient for the state to refute the suggestions of flawed “accuracy calculations of the program” through testimony from its experts? Just what did the court mean when it summarized its analysis with the statement that “[i]n short, the [PGS] should be supported by those with no professional interest in its acceptance. Frye demands an objective, unbiased review”?

The opinion did not reveal how the majority might answer these questions. Of course, in holding that a hearing was necessary, the Williams majority implied that some information outside of the normal scientific literature could fill the gap created by the absence of replicated developmental validation studies from external (“objective, unbiased”) researchers. But what might that information be?

The court’s encounter with PGS last month did not answer this open question, for the court in Wakefield found that there were replicated studies from the developer of a more sophisticated computer program and other researchers. In addition, it pointed to other evaluations or uses of the program. The totality of the evidence, it reasoned, was stronger than the developer-only record in Williams and demonstrated the requisite general acceptance. But the opinion provoked one member of the court to complain of a "jarring turnabout" from "the same view unsuccessfully advocated by a minority in Williams two years ago."

This posting describes the case, the DNA evidence, and aspects of the discussions of general acceptance that struck me as interesting or puzzling.

The Crime, the Samples, and Some Misunderstood Probabilities of Exclusion

In 2010, John Wakefield strangled the occupant of an apartment with a guitar amplifier cord and made off with various items. The New York State Police laboratory analyzed samples from four areas: the front part of the collar of the victim's shirt, the rear part of the collar; the victim's forearm; and the amplifier cord. The laboratory concluded that the DNA on the collar was “consistent with at least two donors, one of which was the victim, and defendant could not be excluded as the other contributor”; that the DNA from the forearm “was consistent with DNA from the victim, as the major contributor, mixed with at least two additional donors” and the DNA on the cord was “a mixture of at least two donors, from which the victim could not be excluded as a possible contributor.” 2022 WL 1217463, at *1.

At this point, the court’s description of the State Police laboratory’s work becomes had to follow. The court wrote that:

[T]he analyst did not call any alleles based on peaks on the electropherogram below [the pre-established stochastic] threshold. As a result, there was insufficient data to allow the Lab to calculate probabilities for the unknown contributors to the DNA mixtures found on the amplifier cord and the front of the shirt collar.

No alleles at all? It takes only one allele to compute a probability of exclusion, although with such a limited profile, the exclusion probability might be close to zero, meaning that the data are uninformative. In any event, for the other two samples, “[t]he Lab was able to call ... 4 ... STR loci” that enabled “the analyst, using the combined probability of inclusion method, [to opine that] the probability an unrelated individual contributed DNA to the outside rear shirt collar was 1 in 1,088" and “that the probability an unrelated individual contributed DNA ... was 1 in 422" for “the profile obtained from the victim's forearm.”

Or so the court said. As explained in Box 1, these numbers are not “the probability an unrelated individual contributed DNA.” They are estimates of the probability that a randomly selected, unrelated individual could not be excluded as a possible source. Given a large number of unrelated individuals in the region, there easily could be more than a hundred people with STR profiles compatible with the mixtures.

BOX 1: TRANSPOSITION

The probability of inclusion is not the probability that an included individual is the contributor. It is the probability of not excluding an individual as a possible contributor. That probability is not necessarily equal to the probability that an included individual actually contributed to the sample from which he or she could not be excluded. If C stands for contributor and I for included, the probability of inclusion for any randomly selected individual can be written P(I given C). The source probability for the individual is different. It is P(C given I). Equating the two is known as the transposition fallacy (or the “prosecutor’s fallacy,” though it could be called the “judges” fallacy as well).

We do not need any symbols to see that the two conditional probabilities are not necessarily equal. The population of Schnectady county, where the crime occurred, was about 155,000 in 2010. Let’s round down to 150,000. That ought to remove all of Wakefield’s relatives. Excluding all but 1 in 1,088 individuals would leave 138 people as possible perpetrators. Of course, some would be far more plausible suspects than others, but based on the DNA evidence alone, how can the court claim that “the probability an unrelated individual contributed DNA to the outside rear shirt collar was 1 in 1,088”? That probability cannot be determined from the DNA evidence alone. It can be computed only if we are willing to assign a “prior probability” of being the murderer to each of the unrelated individuals in Schnectady (or anywhere else).

Suppose we assume that, ab initio, everyone in the county has an equal probability of being a source of the DNA on the collar. At that point, Wakefield’s probability is quite small. It is 1/150,000. Since the DNA testing would have excluded all but some 138 people, and because Wakefield is one of them, the probability attached to him is larger. Now the probability is 1/138. But that still leaves the vast bulk of the probability with the 137 unrelated individuals. Instead of transposing, we should say that “the probability an unrelated individual contributed DNA to the outside rear shirt collar was 137 out of in 138” rather than the court’s “1 in 1,088.” Of course, our assumption of equal probabilities for every unrelated individual is unrealistic, but that does not impeach the broader point that the mathematics does not make the probability of an unrelated individual the number that the court supplied.

Cybergenetics to the Rescue

To secure a better and more complete analysis, “the electronic data from the DNA testing of the four samples at issue was then sent to Cybergenetics [for] calculating a likelihood ratio—using all of the information generated on the electropherogram, including peaks that fall below a laboratory's stochastic threshold.” Cybergenetics is a private company whose “flagship TrueAllele® technology resolves complex forensic evidence, providing accurate and objective DNA match statistics.” TrueAllele's calculations of the likelihood ratios, using the hypothesis that the four samples contained DNA from an unrelated black individual as the alternative to the hypothesis that Wakefield’s DNA was present were 5.88 billion for the cord, 170 quintillion for the outside rear shirt collar, 303 billion for the outside front shirt collar, and 56.1 million for the forearm.

Wakefield moved to exclude these findings, The Schnectady County Supreme Court held a pretrial evidentiary hearing “over numerous days.” People v. Wakefield, 47 Misc.3d 850, 851, 9 N.Y.S.3d 540 (2015). (New York calls its trial courts supreme courts.) Finding “that Cybergenetics TrueAllele Casework is not novel but instead is ‘generally accepted’ under the Frye standard,” \1/ Justice Michael V. Coccoma (New York calls its trial judges justices) denied the motion. 47 Misc.3d at 859. A jury convicted Wakefield of first degree murder and robbery. The Appellate Division affirmed, and seven years after the trial, so did the Court of Appeals (New York calls its most supreme court the Court of Appeals).

Changes in New York’s Highest Court

Back in Williams, the Court of Appeals judges had split 4-3 on whether New York City's home-grown PGS had attained general acceptance. The three judges led by Chief Judge Janet M. DiFiore objected to the majority’s negative comments about PGS and propounded a narrower rationale for requiring a Frye hearing. But even if one could have confidently applied the majority reasoning in Williams to the scientific status of TrueAllele in Wakefield, the exercise in legal logic might have been futile. In the two short years since Williams, the composition of the court had changed. One concurring judge died, and the majority-opinion bloc lost half its members, including the opinion’s author, to retirements. The reconstituted court gave Chief Judge DiFiore the opportunity to write a more laudatory opinion for a new and larger majority.

Only one judge stood apart from this new majority. Having been in the majority in Williams, Judge Jenny Rivera now found herself in the Chief Judge’s situation in Williams, composing a dissenting opinion with respect to the reasoning on general acceptance but concurring in the result. Drawing on Williams, Judge Rivera maintained that “the court erred in admitting the TrueAllele results but the error ... was harmless” in view of the other evidence of guilt.

The Court’s Understanding of TrueAllele

The opinions are vague about the inner workings of TrueAllele. The majority opinion suggests that what is distinctive about PGS is that it cranks out a likelihood ratio. \2/ But “likelihood ratio,” for present purposes, simply denotes the probability of data given one hypothesis divided by the probability of the same data given a (simple) alternative hypothesis. It has nothing to do with the probabilistic part of TrueAllelle. Indeed, TrueAllele only computes a likelihood ratio after the probability analysis is completed. It does this by dividing (i) the final posterior odds that favor one source hypothesis as compared to another by (ii) the initial prior odds. This division gives a “Bayes' factor” that states how much the data have changed the odds.

Let me try saying this another way. In effect, TrueAllele starts with prior odds based solely on the frequencies of various DNA alleles (and hence genotypes) in some population, performs successive approximations to converge on a better estimate of the odds, and divides the adjusted odds by the prior odds to yield what Cybergenetics calls “the match statistic.” If all goes well, this quotient (call it a likelihood ratio, a Bayes' factor, a match statistic, or whatever you want) reveals how powerful the DNA evidence is (which is not necessarily the same as the odds that any hypothesis is true). At least, that is what I think goes on. The court contents itself with warm and fuzzy statements such as “a probability model to assess the values of a genotype objectively” “based on mathematical computations from all the data in the electropherograms.” and “separates the genotypes using the mathematical probability principle of the Markov chain Monte Carlo (MCMC) search to calculate the probability for what the different genotypes could be.” (This last clause may not be so warm and fuzzy; it begins to unpack what I simplistically called successive approximations.)

The Timing for General Acceptance

Wakefield is a backwards-looking case. The main question before the Court of Appeals was whether, in 2015, TrueAllele reasonably could have been deemed to have been generally accepted in the scientific community. That is what New York law requires. \3/ The Chief Judge’s analysis of the general acceptance of TrueAllele starts with the observation that “[t]he well-known Frye test applied to the admissibility of novel scientific evidence (Frye v. United States, 293 F. 1013 [D.C. Cir.1923]) is 'whether the accepted techniques, when properly performed, generate results accepted as reliable within the scientific community generally' (People v. Wesley, 83 N.Y.2d 417, 422, 611 N.Y.S.2d 97, 633 N.E.2d 451 [1994]).”

Wesley is an interesting case to cite here. One would not know from the citation or the analysis in Wakefield that in Wesley there was no opinion for a majority of the seven judges on the court. There was one opinion for three judges and another opinion for two judges concurring only in the result. The remaining two judges did not participate. The concurring opinion was written by the late Chief Judge Judith S. Kaye, the longest-serving chief judge in New York history.

Chief Judge Kaye’s concurrence is memorable for its skepticism about finding general acceptance on the basis of studies from the developer of a method. Current Chief Judge Janet DiFiore briefly summarized that discussion (as did the majority in Williams). A more complete exposition is in Box 2. Chief Judge DiFiore then suggests that the Wesley concurrence was satisfied because “[n]otwithstanding these concerns, Chief Judge Kaye ultimately agreed that, at the time the appeal was decided, "RFLP-based forensic analysis [was] generally accepted as reliable" and those testing procedures were accepted as the standard methodology used in the scientific community until the advent of the PCR STR method used today.”

This presentation places an odd spin on the Wesley concurrence. The sole basis for the concurrence was that “it can fairly be said that use of DNA evidence was harmless beyond a reasonable doubt” because the DNA evidence “added nothing to the People's case.” 83 N.Y.2d at 444–45. The observations that five years after the hearing in Wesley, it had become clear that “in principle” RFLP-VNTR testing was “fundamentally sound” and was generally accepted were clearly dicta. Chief Judge Kaye was not suggesting that because a method had become generally accepted later, its earlier admission was vindicated. The dicta on later general acceptance was intended to inform trial courts that while they were at liberty to admit RFLP-VNTR evidence without pretrial hearings on general acceptance, they still needed to probe “the adequacy of the methods used to acquire and analyze samples ... case by case.” Id. at 445.

In contrast to Wesley, which emphasized the state of the science “at the time of the Frye hearing in 1988,” 83 N.Y.2d at 425 (plurality opinion), and whether “in 1988, ... there was consensus,” id. at 439 (concurring opinion), Chief Judge DiFiore’s opinion is less precise on when general acceptance came into existence:

BOX 2. PEOPLE v. WESLEY
83 N.Y.2d 417, 439–41, 611 N.Y.S.2d 97, 633 N.E.2d 451 (N.Y. 1994) (Chief Judge Kaye, concurring) (citations and footnote omitted)

The inquiry into forensic analysis of DNA in this case also demonstrates the "pitfalls of self-validation by a small group" Before bringing novel evidence to court, proponents of new techniques must subject their methods to the scrutiny of fellow scientists, unimpeded by commercial concerns.

A Frye court should be particularly cautious when — as here — "the supporting research is conducted by someone with a professional or commercial interest in the technique" DNA forensic analysis was developed in commercial laboratories under conditions of secrecy, preventing emergence of independent views. No independent academic or governmental laboratories were publishing studies concerning forensic use of DNA profiling. The Federal Bureau of Investigation did not consider use of the technique until 1989. Because no other facilities were apparently conducting research in the field, the commercial laboratory's unchallenged endorsement of the reliability of its own techniques was accepted by the hearing court as sufficient to represent acceptance of the technique by scientists generally. The sole forensic witness at the hearing in this case was Dr. Michael Baird, Director of Forensics at Lifecodes laboratory, where the samples were to be analyzed. While he assured the court of the reliability of the forensic application of DNA, virtually the sole publications on forensic use of DNA were his own or those of Dr. Jeffreys, the founder of Cellmark, one of Lifecodes' competitors. Nor had the forensic procedure been subjected to thorough peer review. ***

The opinions of two scientists, both with commercial interests in the work under consideration and both the primary developers and proponents of the technique, were insufficient to establish "general acceptance" in the scientific field. The People's effort to gain a consensus by having their own witnesses "peer review" the relevant studies in time to return to court with supporting testimony was hardly an appropriate substitute for the thoughtful exchange of ideas in an unbiased scientific community envisioned by Frye. Our colleagues' characterization of a dearth of publications on this novel technique as the equivalent of unanimous endorsement of its reliability ignores the plain reality that this technique was not yet being discussed and tested in the scientific community.

"Although the continuous probabilistic approach was not used in the majority of forensic crime laboratories at the time of the hearing, the methodology has been generally accepted in the relevant scientific community based on the empirical evidence of its validity, as demonstrated by multiple validation studies, including collaborative studies, peer-reviewed publications in scientific journals and its use in other jurisdictions. The empirical studies demonstrated TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples."

Presumably, and notwithstanding citations to materials appearing after 2015, \4/ she meant to write that the methodology had been generally accepted in 2015 because the indications listed were present then. (Whether the decisive time for general acceptance should be that of the trial rather than the appeal is not completely obvious. If a technique becomes generally accepted later, why should the defendant be entitled to a new trial in which the evidence that should have been excluded has become admissible anyway? The defendant's interest in the time-of-trial rule is the interest in not being convicted with the help of scientifically sound evidence (as per the general-acceptance standard based on the best current knowledge). A counter-argument is that a large pool of potential defense experts to question the application of the general accepted method in the particular case did not exist at the time of trial because the evidence was too novel.)

Quantifying the Accuracy of PGS

Turning to the question of the state of acceptance as of 2015, the majority opinion maintains that

]T]he methodology has been generally accepted in the relevant scientific community based on the empirical evidence of its validity, as demonstrated by multiple validation studies, including collaborative studies, peer-reviewed publications in scientific journals and its use in other jurisdictions. The empirical studies demonstrated TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.

Both the fact that the software was written to implement uncontroversial mathematical ideas and the published empirical evidence are important. If the software were designed to implement a mathematically invalid procedure, the game would be over before it began. But techniques such as Bayes’ rule and sampling methods for getting a representative picture of the posterior distribution only work when they are developed appropriately for a particular application. Acknowledging that these tools have been used to solve problems in many fields of science is a bit like saying that the mathematics of probability theory is undisputed. The validity of the mathematical ideas are a necessary but hardly a sufficient condition for a finding that software intended to apply the ideas functions as intended. Using a particular mathematical formula or method to describe or predict real-world phenomena is an endeavor that is subject to and in need of empirical confirmation. Because PGS models the variability in the empirical data that emerge from chemical reactions and electronic detectors, “empirical evidence ... of its accuracy” is indispensable to establishing its accuracy.

Unfortunately, Wakefield is short on details from the “multiple validation studies” and “peer-reviewed publications.” What do the studies and publications reveal about the accuracy of output such as “5.88 billion times more probable” and “170 quintillion times more probable”? The Supreme Court opinion is devoid of any quantitative statement of how well the deconvoluted individual profiles and their Bayes’ factors reported by TrueAllele correspond to the presence or absence of those profiles in samples constructed with or otherwise known to contain DNA from given individuals. So is the Appellate Division opinion. So too with the Court of Appeals’ opinions. The court is persuaded that “[t]he empirical studies demonstrated TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.” But how well did True Allele perform in the “many published and peer reviewed” validity studies?

A separate posting summarizes parts of the six studies circa 2015 that are both published and peer reviewed. The numbers in these studies suggest that within certain ranges (with regard to the quantity of DNA, the number of contributors, and the fractions from the multiple contributors), TrueAllele’s likelihood ratios discriminate quite well between samples paired with true contributors and the same samples paired with noncontributors. For example, in one experiment, LR was never greater than 1 for 600,000 simulations of false contributors to 10 two-person mixtures containing 1 nanogram of DNA—no observed false positives! Conversely, LR was never less than 1 for every true contributor to the same ten mixtures—no observed false negatives in 20 comparisons. Moreover, the program’s output behaves qualitatively as it should, generally producing smaller likelihood ratios for electrophoretic data that are more complex or more deviled by stochastic effects on peak heights and locations.

Such results suggest that TrueAllele’s LRs are in the ballpark. Yet, it is hard to gauge the size of the ballpark. Is a computed LR of 5.88 billion truly a relative probability of 5.88 billion? Could the ratio be a lot less or a lot more? The validity studies do not give quantitative answers to these questions about “accuracy.” \5/

The Developer’s Involvement

On appeal, Wakefield had to convince the court that the unchallenged studies and other indicia of general acceptance were too weak to permit a finding of general acceptance. To do so, he pointed to “the dearth of independent validation as a result of Dr. Perlin's involvement in the large majority of studies produced at the hearing.” (Indeed, Dr. Perlin is the lead author of every one of the five published validity studies and a co-author of a sixth published study that also helps show validity.)

The majority acknowledged “legitimate concern” but decided that it was overcome “by the import of the empirical evidence of reliability demonstrated here and the acceptance of the methodology by the relevant scientific community.” However, the discussion of “the import of the empirical evidence” seems somewhat garbled.

1

First, the court notes that “the FBI Quality Assurance Standards requires ‘a developmental validation for a particular technology’ be published.” That the FBI might be satisfied with a single publication from the developer of a method does not speak to what the broader scientific community regards as essential to the validation. Along with the QAS, the court cites "NIST, DNA Mixture Interpretation: A NIST Scientific Foundation Review, at 64 (June 2021 Draft report)." The page merely reports that the NIST staff were able to examine “[p]ublicly available data on DNA mixture interpretation performance ... from five sources [including] published PGS studies” and that “conducting mixture studies may be viewed as a necessity to meet published guidelines or QAS requirements ... .” That scientists and other NIST personnel who choose to review a technology will read the scientific reports of the developers of the technology does not tell us much about defendant’s claim that Cybergenetics’ involvement in the published validation studies gravely diminishes “the import of the empirical evidence.”

2

Second, the Court of Appeals maintained that “the interest of the developer was addressed at the Frye hearing in this case.” As the court described the hearing, the response to this concern was that “[a]lthough Dr. Perlin was involved in and coauthored most of the validation studies, his interest in TrueAllele was disclosed as required by the journals who published the studies and the empirical evidence of the reliability of TrueAllele was not disputed.”

These responses seem rather flaccid. Some of the articles contain conflict-of-interest statements; most do not. \7/ But the presence or absence of obvious disclaimers does not come to grips with the complaint. Defendant’s argument is not that there are hidden funding sources or financial relationships. It is that interests in the outcomes of the studies somehow may affect the results. The claim is not that validation data were fabricated or that the data analysis was faulty. As with the movement for replication and “open science,” it is a response to more subtle threats.

3

Third, the opinion asserts that “the scientific method” is “entirely consistent with” proof of validity coming from the inventors, discoverers, or commercializers (citing President's Council of Advisors on Sci. and Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, at 46 (2016)). Again, however, the argument is not that only disinterested parties do and should participate in scientific dialog. It is that "[w]hile it is completely appropriate for method developers to evaluate their own methods, establishing scientific validity also requires scientific evaluation by other scientific groups that did not develop the method.” Id. at 80 (https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [https://perma.cc/R76Y-7VU]).

4

That precept leads to the court’s last and most telling response to the “legitimate concern” over “the dearth of independent validation.” The Chief Judge finally wrote that “there were [not only] developer [but also] independent validation studies and laboratory internal validation studies, many published and peer-reviewed.”

But is this a fair characterization of the scientific literature as of 2015? From what I can tell, no more than five or six studies appear in peer-reviewed journals, and none are completely “independent validation studies.” The NIST report cited in Wakefield lists but a single “internal validation” study, from Virginia in 2013, apparently released in response to a Freedom of Information Act request, Although the NIST reviewers limited themselves to laboratory studies or data posted to the Internet, they concluded that “[c]urrently, there is not enough publicly available data to enable an external and independent assessment of the degree of reliability of DNA mixture interpretation practices, including the use of probabilistic genotyping software (PGS) systems.” 

Of course, this “Key Takeaway #4.3” is merely part of a draft report and is not a judgment as to what conclusions on validity should be reached on the basis of the published studies and the internal ones. Nevertheless, the court overlooks this prominent “takeaway” (and others). Instead, the Chief Judge asserts that “[t]he technology was approved for use by NIST”—even though NIST is not a regulatory agency that approves technologies—and that “NIST's use of the TrueAllele system for its standard reference materials likewise demonstrates confidence within the relevant community that the system generates accurate results.”

~~~

This is not to say that the scientific literature was patently insufficient to support the court’s assessment of the general scientific acceptance of TrueAllele for interpreting the DNA data in the case. But it does raise the question of whether the court’s assertions about the large number of “independent validity studies” and internal ones that have been “published and peer-reviewed” are exaggerated.

Source Code and General Acceptance

The defense also contended that the state’s testimony and exhibits from “the Frye hearing [were] insufficient because, absent disclosure of the TrueAllele source code for examination by the scientific community, its ‘proprietary black box technology’ cannot be generally accepted as a matter of law.” This argument bears two possible interpretations. On the one hand, it could be a claim that scientists demand open-source programs—those with every line of code deposited somewhere for everyone to see—before they will consider a program suitable for data analysis or other purposes. We can call this position the open-source theory.

On the other hand, the claim might be “that disclosure of the TrueAllele source code [to the defense, perhaps with an order to protect against more widespread dissemination of trade secrets] was required to properly conduct the Frye hearing” and that without at least that much discovery of the code, scientists would not regard TrueAllele as valid. We can call this position the discovery-based theory. It implies that, in establishing general scientific acceptance in a Frye hearing, pretrial discovery of secret code is an adequate substitute for exposing the code to the possible scrutiny of the entire scientific community. \8/

The Wakefield opinions are not entirely clear on about which theory they embrace or reject. Judge Rivera’s concurrence may have endorsed both theories. In addition to accentuating “the need to provide defendant with access to the source code,” she decried the absence of “objective, expert third-party access,” writing that

The court's decision was an abuse of discretion as a matter of law because it relied on validation studies by interested parties and evaluations founded on incomplete information about TrueAllele's computer-based methodology. Without defense counsel and objective, expert third-party access to and evaluation of the underlying algorithms and source code, the court could not conclude that TrueAllele's brand of probabilistic genotyping was generally accepted within the forensic science community.

The “evaluations founded on incomplete information” were from a standards developing organization, a state forensic science commission, and NIST. They were incomplete because, according to Judge Rivera, “without the source code, the agencies could not adequately evaluate the use of TrueAllele for this type of DNA mixture analysis ... .”

Focusing on the discovery-based theory, the rest of the court determined that “[d]isclosure ... was not needed in order to establish at the Frye hearing the acceptance of the methodology by the relevant scientific community. The Chief Judge gave two, somewhat confusingly stated, reasons. The first was that Wakefield sought the source code under a rule for discovery that did not apply and then “made no further attempt to demonstrate a particularized need for the source code by motion to the court.” But it is not clear how the failure “to demonstrate a particularized need” overcomes (or even responds to) the argument that the scientific community will not accept software as validly implementing algorithms unless the source code is either open source or given only to the defendant.

The Chief Judge continued:

Moreover, defendant's arguments as to why the source code had to be disclosed pay no heed to the empirical evidence in the validation studies of the reliability of the instrument or to the general acceptance of the methodology in the scientific community—the issue for the Frye hearing—and are directed more toward the foundational concern of whether the source code performed accurately and as intended (see Wesley, 83 N.Y.2d at 429, 611 N.Y.S.2d 97, 633 N.E.2d 451).

The meaning of the sentence may not be immediately apparent. The defense argument is that giving a defendant (or perhaps the scientific community generally) access to source code is a prerequisite to general acceptance of the proposition that the software correctly implements theoretically sound algorithms. If this broad proposition is false dogma, the court should simply say so. It should announce that source code need not be disclosed because there is an alternative, reasonably effective means for establishing that the software performs as it should. The first part of the first sentence starts out that way, but the sentence then states that “whether the source code performed accurately and as intended” is not a matter of general acceptance at all. It is only “foundational” in the sense identified by Chief Judge Kaye in Wesley, who, as we saw (Box 2), wrote that even though RFLP-VNTR testing was generally accepted, the complete “foundation” for admitting DNA evidence entails proof that the generally accepted procedure was performed properly in the case at bar.

But regarding the argument about source code as falling outside of the Frye inquiry misapprehends the defense argument. Neither the open-source nor the discovery-based theories pertain to the execution of valid software. They question the premise that validity can be generally accepted without disclosure of the program’s source code. Yet, the majority elaborates on its non-Frye "foundational" classification for the source-code argument as follows:

To the extent the testimony at the hearing reflected that the TrueAllele Casework System may generate less reliable results when analyzing more complex mixtures (see also President's Council of Advisors on Sci. and Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, at 80 [2016] https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [published after the Frye hearing was held]), defendant did not refine his challenge to address the general acceptance of TrueAllele on such complex mixtures or how that hypothesis would have been applicable to the particular facts of this case. As a result, it is unclear that any such objection would have been relevant to defendant's case, where the samples consisted largely of simple (two-contributor) mixtures with the victim as a known contributor (see also NIST, DNA Mixture Interpretation: A NIST Scientific Foundation Review, at 3 [June 2021 Draft report] https://nvlpubs.nist.gov/nistpubs/ir/2021/NIST.IR.8351-draft.pdf).

These citations to the PCAST and NIST reports actually undercut any suggestion that source-code secrecy does not implicate Frye. The NIST draft repeatedly states that

Forensic scientists interpret DNA mixtures with the assistance of statistical models and expert judgment. Interpretation becomes more complicated when contributors to the mixture alleles. Complications can also arise when random variations, also known as stochastic effects, make it more difficult to confidently interpret the resulting DNA profile.

Not all DNA mixtures present these types of challenges. We agree with the President’s Council of Advisors on Science and Technology (PCAST) that “DNA analysis of single-source samples or simple mixtures of two individuals, such as from many rape kits, is an objective method that has been established to be foundationally valid” (PCAST 2016).

NIST, DNA Mixture Interpretation: A NIST Scientific Foundation Review, at 2-3 & 11-12 (June 2021 draft) (citations omitted). To demand that “defendant ... refine his challenge to address the general acceptance of TrueAllele on ... complex mixtures or ... the particular facts of this case” is to hold that TrueAllele is generally accepted for use with “single-source samples or simple mixtures of two individuals”—even though the source code is hidden. But if science does not demand the disclosure of source code for general acceptance inside the "single" or simple "zone," then why would it demand disclosure for general acceptance outside that zone?

The court's remarks make more sense as a response to Wakefield’s different discovery argument about the need for source code for trial purposes. This argument does not claim that disclosure of source is essential to general acceptance to exist. It looks to the trial rather than the pretrial Frye hearing. The thought may be that if the accuracy of the program for the “simple” cases is assured, then the need for discovery of the code to prepare for trial testimony is less compelling. The court appears to be responding that because “the samples consisted largely of simple (two-contributor) mixtures with the victim as a known contributor,” there was little need for discovery of the code in this case.

Although this rejoinder departs from the topic of what Wakefield teaches us about general acceptance, I would note that it is difficult to reconcile this characterization of the case with Chief Judge DeFiore’s own description of the samples. The court mentioned four samples. Its initial description of them indicates that the New York laboratory deemed the sample on the amplifying cord to be “at least” a three-person mixture and stated that “because of the complexity of the mixture,” the laboratory could not even compare “results generated from the amplifier cord ... to defendant's DNA profile.” 2022 WL 1217463, at *1. Because of the “stochastic threshold,” the laboratory could discern peaks at only 4 out of 15 loci for “the outside rear shirt collar” and “for the profile obtained from the victim's forearm.” Id. Presumably, the “insufficient data” on “the unknown contributors to the DNA mixtures found on the amplifier cord and the front of the shirt collar” is what led the state to call Cybergenetics for help. These samples are not instances of what PCAST called “DNA analysis of single-source samples or simple mixtures of two individuals, such as from many rape kits” or what the NIST group called “two-person mixtures involving significant quantities of DNA.” They are “more complicated” situations that arise “when contributors to the mixture share common alleles [and] when random variations, also known as stochastic effects” are present.

In sum, the deeper one looks into the Wakefield opinions, the more there is to wonder about. But whatever quirks and quiddities reside in the writing, the nearly unanimous opinion of the Court of Appeals signals that a trial court can choose to dispense with the general-acceptance inquiry for at least one PGS program—TrueAllele—for nonchallenging single samples or two-person mixtures and for samples of somewhat greater complexity as well.

NOTES

  1. This formulation conflates the issue of novelty with the issue of general acceptance, which can change over time. See Williams, 35 N.Y.3d at 43, 147 N.E.3d at 1143.
  2. The description begins with the remark that “The likelihood ratio in its modern form was developed by Alan Turing during World War II as a code-breaking method.” That is a possibly defective bit of intellectual history, inasmuch as Turing did not develop the likelihood ratio. To decipher messages, Turing relied on a logarithmic scale for the Bayes’ factor in two ways—as indicating the strength of evidence, and as a tool for sequential analysis. Sir Harold Jeffreys had done the former in his 1939 Theory of Probability book. The sequential analysis problem is not clearly connected to PGS. It arises when the sample size is not fixed in advance and the data are evaluated continuously as they are collected. PGS processes all the data at once.
  3. As the court wrote in People v. Williams, 35 N.Y.3d 24, 147 N.E.3d 1131, 1139–40, 124 N.Y.S.3d 593 (N.Y. 2020), “[r]eview of a Frye determination must be based on the state of scientific knowledge and opinion at the time of the ruling (see Cornell, 22 N.Y.3d at 784-785, 986 N.Y.S.2d 389, 9 N.E.3d 884 (‘a Frye ruling on lack of general causation hinges on the scientific literature in the record before the trial court in the particular case’”).
  4. E.g., 2022 WL 1217463 at *7 n.10 (“TrueAllele is not an outlier in the use of the continuous probabilistic genotyping method. Other types of probabilistic genotyping software, such as STRMix, have likewise been found to be generally accepted (see e.g. United States v. Gissantaner, 990 F.3d 457, 467 (6th Cir.2021)).”
  5. Cf. David H. Kaye, Theona M. Vyvial & Dennis L. Young, Validating the Probability of Paternity, 31 Transfusion 823 (1991) (comparing the empirical LR distribution for parentage using presumably true and false mother-child-father trios derived from a set of civil paternity cases to the “paternity index” (PI), a likelihood ratio computed with software applying simple genetic principles to the inheritance of HLA types, and reporting that the theoretical PI diverged from the empirical LR for PI > 80 or so).
  6. At trial, “Gary Skuse, Ph.D., a professor of biological sciences at the Rochester Institute of Technology, testified at trial as a defense witness [and] agreed ... that defendant's DNA was present in the mixtures found on the shirt collar and amplifier cord and that it was ‘most likely’ present on the victim's forearm.”
  7. The articles in the Journal of Forensic Sciences and Science and Justice have no such statements. The “Competing Interests” paragraph in a PloS One article advises that “I have read the journal’s policy and have the following conflicts. Mark Perlin is a shareholder, officer and employee of Cybergenetics in Pittsburgh, PA, a company that develops genetic technology for computer interpretation of DNA evidence. Cybergenetics manufactures the patented TrueAllele Casework system, and provides expert testimony about DNA case results. Kiersten Dormer and Jennifer Hornyak are current or former employees of Cybergenetics. Lisa Schiermeier-Wood and Dr. Susan Greenspoon are current employees of the Virginia Department of Forensic Science, a government laboratory that provides expert DNA testimony in criminal cases and is adopting the TrueAllele Casework system. This does not alter our adherence to all the PLOS ONE policies on sharing data and materials.”
  8. The defense advanced another different discovery theory in arguing that it could not adequately cross-examine and confront Dr. Perlin at trial unless it could access the source code. The court rejected this theory too.