Tuesday, May 24, 2022

The New York Court of Appeals Returns to Probabilistic Genotyping Software (Part III—Six Empirical Studies)

New York’s Court of Appeals returned to the contentious issue of “probabilistic genotyping software” (PGS) in People v. Wakefield, 2022 N.Y. Slip Op. 02771, 2022 WL 1217463 (N.Y. Apr. 26, 2022). As previously discussed, in People v. Williams, 147 N.E.3d 1131 (N.Y. 2020), a slim majority of the court had reasoned that the output of a computer program should not have been admitted without a full evidentiary hearing on the program's general acceptance within the scientific community.

In Wakefield, the Court of Appeals faced a different question for a more complex computer program. This time, the question was whether, after holding such a hearing, the trial court erred in finding that the more sophisticated program was generally accepted as a scientifically valid and reliable means of estimating “likelihood ratios” for DNA mixtures like the ones recovered in the case. The program, known as TrueAllele, is marketed by Cybergenetics, “a Pittsburgh-based bioinformation company [whose] computers translate DNA data into useful information.”

As discussed separately, the Wakefield court held that, in the circumstances of the case, the output of TrueAllele was admissible to associate the defendant with a murder. It emphasized “multiple validation studies ... demonstrat[ing] TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.” 2022 WL 1217463 at *7. But the court did not describe the level of accuracy attained in any of the validation studies. That is surely something lawyers would want to know about, so I decided to read the “peer-reviewed publications in scientific journals” (id.) to which the court must have been referring.

The state introduced 31 exhibits at the evidentiary hearing in 2015. Nine were journal publications of some kind. Six of those described data collected to establish (or indirectly suggesting) that TrueAllele was accurate. Only three of them relied on “known DNA samples” as opposed to samples from casework. The synopses that follow do not describe every part of each study, let alone all of their findings. I merely pick out the parts that I found most interesting and most pertinent to the question of accuracy or error (two sides of the same coin).

The 2009 Cybergenetics Known-samples Study

The first study is M.W. Perlin & A. Sinelnikov, An Information Gap in DNA Evidence Interpretation, 4 PLoS ONE e8327 (2009). This experiment used 40 laboratory-constructed two-contributor mixture samples (from two pairs of unrelated individuals) with varying mixture proportions and total DNA amounts (0.125 ng to 1 ng). It showed that TrueAllele was much better at classifying a sample as containing a contributor’s DNA than was the cumulative probability of inclusion (CPI) method, which employed peak-height thresholds for binary determinations of the presence of alleles. TrueAllele’s likelihood ratios (LRs) supported the hypothesis of inclusion in nearly every instance (LR > 1).

However, the data could not reveal whether the level of positive support (log-LR) was accurate. Does a computed LR of 1,000,000 “really” indicate evidence that is five orders of magnitude more probative than a computed LR of 10? The “empirical evidence” from the study cannot answer this question. The best we can do is to verify that the computed LR increases as the quantity of DNA does. The uncertainty inherent in the PCR process is smaller for larger starting quantities, and this should be reflected in the magnitude of the LR.
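
The “orders of magnitude” phrasing can be made precise with the standard definitions (nothing here is specific to TrueAllele):

$$\mathrm{LR} \;=\; \frac{P(\text{data} \mid \text{contributor})}{P(\text{data} \mid \text{noncontributor})}, \qquad \text{weight of evidence} \;=\; \log_{10} \mathrm{LR}$$

On this scale, a computed LR of 1,000,000 asserts a weight of evidence of 6, five units (orders of magnitude) beyond the 1 asserted by an LR of 10. The study’s design can confirm the sign of the weight of evidence, but not whether the asserted distance of five units is correct.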

The 2011 Cybergenetics–New York State Police Casework Study

The second study also used two-contributor mixtures, but these came from casework in which the alleles, as ascertained by conventional methods, did not exclude the defendant as a possible contributor. In Mark W. Perlin et al., Validating TrueAllele® DNA Mixture Interpretation, 56 J. Forensic Sci. 1430 (2011), researchers from Cybergenetics and the New York State Police laboratory selected “16 two-person mixture samples” that met certain criteria “from 40 adjudicated cases and one proficiency test conducted in” the New York laboratory. TrueAllele generated larger LRs than those from the manual analyses. That TrueAllele did not produce LRs < 1 (indicative of exclusions) for any defendant included by conventional analysis is evidence of a low false-exclusion probability. The computed LRs are greater than 1 when they should be. But this empirical evidence does not directly address the question of whether the magnitudes of the LRs are as close to or as far from 1 as they should be if they are to be understood as Bayes' factors.

The 2013 Cybergenetics–New York State Police Casework Study

The third study is more extensive. In Mark W. Perlin et al., New York State TrueAllele® Casework Validation Study, 58 J. Forensic Sci. 1458 (2013), Cybergenetics worked with the New York laboratory to reanalyze DNA mixtures with up to three contributors from 39 adjudicated cases and two proficiency tests. “Whenever there was a human result, the computer’s genotype was concordant,” and TrueAllele “produced a match statistic on 81 mixture items ... , while human review reported a statistic on [only] 25 of these items.”

This time Cybergenetics also tried to answer the question of how often TrueAllele produces false “matches” (LR>1) when it compares a known noncontributor’s sample to a mixed sample. It accomplished this by simulating false pairs of samples for TrueAllele to process. As the authors explained,

We compared each of the 87 matched mixture evidence genotypes with the (<87) reference genotypes from the other 40 cases. Each of these 7298 comparisons should generate a mismatch between the unrelated genotypes from different cases and hence a negative log(LR) value. A genotype inference method having good specificity should exhibit mismatch information values [log-LRs] that are negative in the same way that true matches are positive.

Id. at 1461. Thus, they derived two empirical distributions for likelihood ratios: one for the nonexcluded defendants in the cases (whom we would expect to be actual sources) and one for the unrelated individuals (whom we would expect to be non-sources). The empirical distributions were well separated, and the log(LR) was always less than zero for the presumed non-sources.
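
To make the two-distribution design concrete, here is a minimal simulation sketch. The normal shapes and the parameters are invented for illustration only; they are not the study’s data, and simulate_log_lr is my own helper, not anything from Cybergenetics.

```python
import random

random.seed(1)

def simulate_log_lr(mu, sigma, n):
    """Draw n illustrative log10(LR) values from a normal distribution."""
    return [random.gauss(mu, sigma) for _ in range(n)]

# 87 true (same-source) pairs vs. 7,298 false (cross-case) pairs,
# mimicking the comparison counts reported in the 2013 study.
true_pairs = simulate_log_lr(8.0, 3.0, 87)
false_pairs = simulate_log_lr(-12.0, 3.0, 7298)

# Score each pair as a "match" when log10(LR) > 0, i.e., LR > 1.
false_negatives = sum(x <= 0 for x in true_pairs)
false_positives = sum(x > 0 for x in false_pairs)
print(f"false-negative rate: {false_negatives / len(true_pairs):.4f}")
print(f"false-positive rate: {false_positives / len(false_pairs):.4f}")
```

When the two distributions are as well separated as the study reported, the error rates at the LR = 1 threshold come out at or near zero, which is the sense in which the method works well as a classifier.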

So TrueAllele seems to work well as a classifier (for distinguishing true-source pairs from false-source pairs) in these small-scale studies. But again, the question of whether the magnitudes of its LRs are highly accurate remains. With astronomically large LRs, it is hard to know the answer. Cf. David H. Kaye, Theona M. Vyvial & Dennis L. Young, Validating the Probability of Paternity, 31 Transfusion 823 (1991). \1/

The 2013 UCF–Cybergenetics Known-samples Study

The fourth study is J. Ballantyne, E.K. Hanson & M.W. Perlin, DNA Mixture Genotyping by Probabilistic Computer Interpretation of Binomially-sampled Laser Captured Cell Populations: Combining Quantitative Data for Greater Identification Information, 53 Sci. & Justice 103 (2013). It is not a validation study, but researchers from the University of Central Florida and Cybergenetics made two different two-person mixtures with equal quantities of DNA from each person. In such 50:50 mixtures, peak heights are expected to be similar, making it harder to fit the pattern of alleles into the pairs (single-locus genotypes) from each contributor than if there had been a major and a minor contributor. So the team created ten small (20-cell) subsamples of each of the two mixed DNA samples by selecting cells at random and analyzed these subsamples separately. They used TrueAllele to estimate the relative contributions (“mixture weights”) in the 20-cell samples and found that when TrueAllele combined data from multiple subsamples, it assigned a 99% probability to the two contributors’ genotypes. The point of the study was to demonstrate the possibility of subdividing even small balanced samples to take advantage of peak-height differences arising from imbalances in the even smaller subsamples.
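
The rationale for subsampling can be checked with elementary probability. If each of the 20 captured cells comes from contributor A with probability 1/2 (a 50:50 mixture), the subsample’s mixture weight is Binomial(20, 0.5)/20, with a standard deviation of about 0.11, so many subsamples are usefully unbalanced. A quick sketch (the helper name is mine, for illustration):

```python
import random

random.seed(2)

def subsample_weight(n_cells=20, p=0.5):
    """Fraction of cells from contributor A in one random subsample."""
    return sum(random.random() < p for _ in range(n_cells)) / n_cells

weights = sorted(subsample_weight() for _ in range(10))
print(weights)  # values like 0.35 or 0.65 are common, producing the
                # peak-height imbalance that the method exploits
```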

The 2014 Cybergenetics–Virginia Department of Forensic Services Casework Study

The fifth study is more on point. In Mark W. Perlin et al., TrueAllele Casework on Virginia DNA Mixture Evidence: Computer and Manual Interpretation in 72 Reported Criminal Cases, 9 PLOS ONE e92837 (2014), researchers from Cybergenetics and the Virginia Department of Forensic Services compared TrueAllele with manual analysis on 111 selected casework samples. The set of criminal case mixtures paired with a nonexcluded defendant’s profile should produce large LRs. For ten pairs, TrueAllele failed to return “a reproducible positive match statistic.” Among the 101 remaining, presumably same-source pairs, the smallest LR was 18. Since the LR must be less than 1 to be deemed indicative of a noncontributor, in no instance did TrueAllele generate a falsely exonerating result.

But what about falsely incriminating LRs? This time, the researchers did not reassign the defendants’ profiles to other cases to produce false pairs. Rather, they generated 10,000 random STR genotypes (from population databases of alleles in Virginia) to simulate the STR profiles of non-sources of the mixtures from the criminal cases. They paired each of these non-source profiles with the 101 genotypes that emerged from the unknown mixtures and calculated LR values. Fewer than 1 in 20,000 of the LRs suggested an association (LR > 1) among these mixture/non-source pairs; fewer than 1 in 1,000,000 exceeded 1,000; and none exceeded 6,054. In other words, TrueAllele produced an empirical distribution for false pairs that consisted almost entirely of LRs < 1 and that never contained very large LRs. Again, it seems to be an excellent classifier.
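
One modest calibration check is available even without knowing the “true” LRs. For any correctly computed likelihood ratio, the probability that a noncontributor comparison yields an LR of at least t is at most 1/t (a standard inequality, not specific to TrueAllele):

$$\Pr(\mathrm{LR} \ge t \mid \text{noncontributor}) \;\le\; \frac{1}{t}$$

The reported rates are consistent with this bound (fewer than 1 in 1,000,000 comparisons exceeded t = 1,000, well under the ceiling of 1/1,000), but satisfying the inequality is a necessary rather than a sufficient condition for the magnitudes to be accurate.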

The 2015 Cybergenetics–Kern Regional Crime Laboratory Known-samples Study

Finally, in M.W. Perlin et al., TrueAllele Genotype Identification on DNA Mixtures Containing up to Five Unknown Contributors, 60 J. Forensic Sci. 857 (2015), researchers from Cybergenetics and the Kern Regional Crime Laboratory in California obtained DNA samples from five known individuals. They constructed ten two-person mixtures by randomly selecting two of the five contributors and mixing their DNA in proportions picked at random. The researchers constructed ten 3-, 4-, and 5-person mixtures in the same manner. From each of these 4 × 10 mixtures, they created a 1 nanogram and a 200 picogram sample for STR analysis. TrueAllele computed an LR for each of the genotypes that went into each analyzed sample (the alternative hypothesis being a random genotype).

Defining an exclusion as an LR < 1, TrueAllele rarely excluded true contributors to the 1 ng 2- or 3-contributor mixtures (no exclusions in 20 comparisons and 1 in 30, respectively), but with 4 and 5 contributors involved, the false-exclusion rates were 9/40 and 9/50, respectively. The false exclusions came from the more extreme mixtures. As long as at least 10% of a nanogram mixture came from the lesser contributor, there were no false exclusions. The false-exclusion rates for the 200 pg samples were larger: 2/20, 4/30, 13/40, and 19/50. For these low-template mixtures, a greater proportion of the lesser contributor’s DNA (25%) had to be present to avoid false exclusions.

To assess false inclusions, 10,000 genotypes were randomly generated from each of three ethnic population allele databases. These noncontributor profiles were compared with the genotypes inferred from the mixtures. For each ethnic group and DNA mixture sample, nearly all of the LRs fell well below 1; false inclusions were rare. For the high DNA levels (1 ng), the proportions of comparisons with misleading LRs (LR > 1 for the simulated noncontributors) were 0/600,000, 25/900,000, 186/1,200,000, and 1,301/1,500,000 for the 2-, 3-, 4-, and 5-person mixtures, respectively. The worst case (the most misleadingly high LR) occurred for a five-person mixture, where one LR was 1,592. For the low-template DNA mixtures, the corresponding false-inclusion proportions were 2/600,000, 53/900,000, 177/1,200,000, and 145/1,500,000. The worst outcome was an LR of 101 for a four-person mixture.
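
Restated as rates, the reported counts look like this. (The arithmetic below simply reproduces the numbers above; my reading of the denominators is that they reflect 3 databases × 10,000 simulated genotypes compared against the contributor genotypes inferred from each set of mixtures, but the opinion does not spell out the bookkeeping.)

```python
# (misleading LRs > 1, total comparisons), keyed by number of contributors
high_template = {2: (0, 600_000), 3: (25, 900_000),
                 4: (186, 1_200_000), 5: (1_301, 1_500_000)}
low_template  = {2: (2, 600_000), 3: (53, 900_000),
                 4: (177, 1_200_000), 5: (145, 1_500_000)}

for label, table in (("1 ng", high_template), ("200 pg", low_template)):
    for k, (hits, n) in sorted(table.items()):
        print(f"{label:>6}, {k}-person mixtures: {hits:>5}/{n:,} = {hits / n:.1e}")
```

Even in the worst cell (1 ng, five contributors), the false-inclusion rate stays below 1 in 1,000.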

Apparently using “reliable” in its legal or nonstatistical sense (as in Daubert and Federal Rule of Evidence 702), the researchers concluded that “[t]his in-depth experimental study and statistical analysis establish the reliability of TrueAllele for the interpretation of DNA mixture evidence over a broad range of forensic casework conditions.” \2/ My sense of the studies as of the time of the hearing in Wakefield is that they show that within certain ranges (with regard to the quantity of DNA, the number of contributors, and the fractions from the multiple contributors), TrueAllele’s likelihood ratios discriminate quite well between samples paired with true contributors and the same samples paired with unrelated noncontributors. \3/ Moreover, the program’s output behaves qualitatively as it should, generally producing smaller likelihood ratios for electrophoretic data that are more complex or more bedeviled by stochastic effects on peak heights and locations.

NOTES

  1. In this early study, we compared the empirical LR distribution for parentage using presumably true and false mother-child-father trios derived from a set of civil paternity cases to the “paternity index,” a likelihood ratio computed with software applying simple genetic principles to the inheritance of HLA types. We found that the theoretical PI diverged from the empirical LR for PI > 80 or so.
  2. Cf. David W. Bauer, Nasir Butt, Jennifer M. Hornyak & Mark W. Perlin, Validating TrueAllele Interpretation of DNA Mixtures Containing up to Ten Unknown Contributors, 65 J. Forensic Sci. 380, 380 (2020), doi: 10.1111/1556-4029.14204 (abstract concluding that “[t]he study found that TrueAllele is a reliable method for analyzing DNA mixtures containing up to ten unknown contributors”).
  3. One might argue that the number of mixed samples collectively studied is too small. PCAST indicated that “there is relatively little published evidence” because “[i]n human molecular genetics, an experimental validation of an important diagnostic method would typically involve hundreds of distinct samples.” President's Council of Advisors on Sci. & Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods 81 (2016) (notes omitted), https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [https://perma.cc/R76Y-7VU]. The number of distinct samples (mixtures from different contributors) combining all the studies listed here seems closer to 100.

The New York Court of Appeals Returns to Probabilistic Genotyping Software (Part II—General Acceptance)

The New York Court of Appeals returned to the contentious issue of “probabilistic genotyping software” (PGS) in People v. Wakefield, 2022 N.Y. Slip Op. 02771, 2022 WL 1217463 (N.Y. Apr. 26, 2022). As previously discussed, in People v. Williams, 147 N.E.3d 1131 (N.Y. 2020), a slim majority of the court held that the output of a computer program should not have been admitted without a full evidentiary hearing on its general acceptance within the scientific community. The majority opinion described a confluence of considerations:
  1. The program had only been tested in the laboratory that developed it (“an invitation to bias,” id. at 1141);
  2. The only evidentiary hearing ever conducted on the program had only shown “internal validation” and formal approval by a subcommittee of a state forensic science commission that was a “narrow class of reviewers, some of whom were employed by the very agency that developed the technology,” id. at 1142;
  3. Given “the ‘black box’ nature of that program,” the developer's “secretive approach ... was inconsistent with quality assurance standards,” id.; and
  4. Submissions for hearings in other cases “suggested that the accuracy calculations of that program may be flawed,” id.

But which of these four factors were dispositive? Was it the combination of all four, or something in between, that rendered the evidence inadmissible? If the developer were to change its “secretive approach” so as to allow defense experts to study the program’s source code, would that, plus the “internal validation,” be enough to establish general scientific acceptance? Would it be sufficient for the state to refute the suggestions of flawed “accuracy calculations of the program” through testimony from its experts? Just what did the court mean when it summarized its analysis with the statement that “[i]n short, the [PGS] should be supported by those with no professional interest in its acceptance. Frye demands an objective, unbiased review”?

The opinion did not reveal how the majority might answer these questions. Of course, in holding that a hearing was necessary, the Williams majority implied that some information outside of the normal scientific literature could fill the gap created by the absence of replicated developmental validation studies from external (“objective, unbiased”) researchers. But what might that information be?

The court’s encounter with PGS last month did not answer this open question, for the court in Wakefield found that there were replicated studies from the developer of a more sophisticated computer program and other researchers. In addition, it pointed to other evaluations or uses of the program. The totality of the evidence, it reasoned, was stronger than the developer-only record in Williams and demonstrated the requisite general acceptance. But the opinion provoked one member of the court to complain of a "jarring turnabout" from "the same view unsuccessfully advocated by a minority in Williams two years ago."

This posting describes the case, the DNA evidence, and aspects of the discussions of general acceptance that struck me as interesting or puzzling.

The Crime, the Samples, and Some Misunderstood Probabilities of Exclusion

In 2010, John Wakefield strangled the occupant of an apartment with a guitar amplifier cord and made off with various items. The New York State Police laboratory analyzed samples from four areas: the front part of the collar of the victim's shirt; the rear part of the collar; the victim's forearm; and the amplifier cord. The laboratory concluded that the DNA on the collar was “consistent with at least two donors, one of which was the victim, and defendant could not be excluded as the other contributor”; that the DNA from the forearm “was consistent with DNA from the victim, as the major contributor, mixed with at least two additional donors”; and that the DNA on the cord was “a mixture of at least two donors, from which the victim could not be excluded as a possible contributor.” 2022 WL 1217463, at *1.

At this point, the court’s description of the State Police laboratory’s work becomes hard to follow. The court wrote:

[T]he analyst did not call any alleles based on peaks on the electropherogram below [the pre-established stochastic] threshold. As a result, there was insufficient data to allow the Lab to calculate probabilities for the unknown contributors to the DNA mixtures found on the amplifier cord and the front of the shirt collar.

No alleles at all? It takes only one allele to compute a probability of exclusion, although with such a limited profile, the exclusion probability might be close to zero, meaning that the data are uninformative. In any event, for the other two samples, “[t]he Lab was able to call ... 4 ... STR loci” that enabled “the analyst, using the combined probability of inclusion method, [to opine that] the probability an unrelated individual contributed DNA to the outside rear shirt collar was 1 in 1,088” and “that the probability an unrelated individual contributed DNA ... was 1 in 422” for “the profile obtained from the victim's forearm.”
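
For reference, the combined probability of inclusion is conventionally computed locus by locus: a random, unrelated person is “included” at a locus only if both of his or her alleles appear among those observed in the mixture. In the textbook formulation (assuming Hardy-Weinberg proportions; the New York laboratory’s exact protocol may differ),

$$\mathrm{CPI} \;=\; \prod_{\ell=1}^{L} \Bigg( \sum_{i \in A_\ell} p_i \Bigg)^{2},$$

where $A_\ell$ is the set of alleles observed at locus $\ell$ and $p_i$ is the population frequency of allele $i$. Multiplying across the four callable loci is what yields figures like 1 in 1,088 and 1 in 422.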

Or so the court said. As explained in Box 1, these numbers are not “the probability an unrelated individual contributed DNA.” They are estimates of the probability that a randomly selected, unrelated individual could not be excluded as a possible source. Given a large number of unrelated individuals in the region, there easily could be more than a hundred people with STR profiles compatible with the mixtures.

BOX 1: TRANSPOSITION

The probability of inclusion is not the probability that an included individual is the contributor. It is the probability of not excluding an individual as a possible contributor. That probability is not necessarily equal to the probability that an included individual actually contributed to the sample from which he or she could not be excluded. If C stands for contributor and I for included, the probability of inclusion for any randomly selected individual can be written P(I given C). The source probability for the individual is different. It is P(C given I). Equating the two is known as the transposition fallacy (or the “prosecutor’s fallacy,” though it could be called the “judges” fallacy as well).

We do not need any symbols to see that the two conditional probabilities are not necessarily equal. The population of Schenectady County, where the crime occurred, was about 155,000 in 2010. Let’s round down to 150,000. That ought to remove all of Wakefield’s relatives. Excluding all but 1 in 1,088 individuals would leave 138 people as possible perpetrators. Of course, some would be far more plausible suspects than others, but based on the DNA evidence alone, how can the court claim that “the probability an unrelated individual contributed DNA to the outside rear shirt collar was 1 in 1,088”? That probability cannot be determined from the DNA evidence alone. It can be computed only if we are willing to assign a “prior probability” of being the murderer to each of the unrelated individuals in Schenectady (or anywhere else).

Suppose we assume that, ab initio, everyone in the county has an equal probability of being a source of the DNA on the collar. At that point, Wakefield’s probability is quite small. It is 1/150,000. Since the DNA testing would have excluded all but some 138 people, and because Wakefield is one of them, the probability attached to him is larger. Now the probability is 1/138. But that still leaves the vast bulk of the probability with the 137 unrelated individuals. Instead of transposing, we should say that “the probability an unrelated individual contributed DNA to the outside rear shirt collar” was 137 out of 138 rather than the court’s “1 in 1,088.” Of course, our assumption of equal probabilities for every unrelated individual is unrealistic, but that does not impeach the broader point: the mathematics does not make the probability that an unrelated individual was the source equal to the number that the court supplied.
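
Under the equal-prior assumption, the arithmetic compresses into a single formula. With N = 150,000 people, each having prior probability 1/N, and an inclusion probability of p = 1/1,088 for each non-source,

$$P(C \mid I) \;=\; \frac{1/N}{1/N + \frac{N-1}{N}\,p} \;=\; \frac{1}{1 + (N-1)\,p} \;\approx\; \frac{1}{139},$$

essentially the 1-in-138 figure reached by rounding above, and nowhere near the certainty that transposing “1 in 1,088” insinuates.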

Cybergenetics to the Rescue

To secure a better and more complete analysis, “the electronic data from the DNA testing of the four samples at issue was then sent to Cybergenetics [for] calculating a likelihood ratio—using all of the information generated on the electropherogram, including peaks that fall below a laboratory's stochastic threshold.” Cybergenetics is a private company whose “flagship TrueAllele® technology resolves complex forensic evidence, providing accurate and objective DNA match statistics.” TrueAllele's likelihood ratios, computed with the hypothesis that the four samples contained DNA from an unrelated black individual as the alternative to the hypothesis that Wakefield’s DNA was present, were 5.88 billion for the cord, 170 quintillion for the outside rear shirt collar, 303 billion for the outside front shirt collar, and 56.1 million for the forearm.

Wakefield moved to exclude these findings. The Schenectady County Supreme Court held a pretrial evidentiary hearing “over numerous days.” People v. Wakefield, 47 Misc.3d 850, 851, 9 N.Y.S.3d 540 (2015). (New York calls its trial courts supreme courts.) Finding “that Cybergenetics TrueAllele Casework is not novel but instead is ‘generally accepted’ under the Frye standard,” \1/ Justice Michael V. Coccoma (New York calls its trial judges justices) denied the motion. 47 Misc.3d at 859. A jury convicted Wakefield of first degree murder and robbery. The Appellate Division affirmed, and seven years after the trial, so did the Court of Appeals (New York calls its most supreme court the Court of Appeals).

Changes in New York’s Highest Court

Back in Williams, the Court of Appeals judges had split 4-3 on whether New York City's home-grown PGS had attained general acceptance. The three judges led by Chief Judge Janet M. DiFiore* objected to the majority’s negative comments about PGS and propounded a narrower rationale for requiring a Frye hearing. But even if one could have confidently applied the majority reasoning in Williams to the scientific status of TrueAllele in Wakefield, the exercise in legal logic might have been futile. In the two short years since Williams, the composition of the court had changed. One concurring judge died, and the majority-opinion bloc lost half its members, including the opinion’s author, to retirements. The reconstituted court gave Chief Judge DiFiore the opportunity to write a more laudatory opinion for a new and larger majority.

Only one judge stood apart from this new majority. Having been in the majority in Williams, Judge Jenny Rivera now found herself in the Chief Judge’s situation in Williams, composing a dissenting opinion with respect to the reasoning on general acceptance but concurring in the result. Drawing on Williams, Judge Rivera maintained that “the court erred in admitting the TrueAllele results but the error ... was harmless” in view of the other evidence of guilt.

The Court’s Understanding of TrueAllele

The opinions are vague about the inner workings of TrueAllele. The majority opinion suggests that what is distinctive about PGS is that it cranks out a likelihood ratio. \2/ But “likelihood ratio,” for present purposes, simply denotes the probability of the data given one hypothesis divided by the probability of the same data given a (simple) alternative hypothesis. It has nothing to do with the probabilistic part of TrueAllele. Indeed, TrueAllele only computes a likelihood ratio after the probability analysis is completed. It does this by dividing (i) the final posterior odds that favor one source hypothesis as compared to another by (ii) the initial prior odds. This division gives a “Bayes' factor” that states how much the data have changed the odds.

Let me try saying this another way. In effect, TrueAllele starts with prior odds based solely on the frequencies of various DNA alleles (and hence genotypes) in some population, performs successive approximations to converge on a better estimate of the odds, and divides the adjusted odds by the prior odds to yield what Cybergenetics calls “the match statistic.” If all goes well, this quotient (call it a likelihood ratio, a Bayes' factor, a match statistic, or whatever you want) reveals how powerful the DNA evidence is (which is not necessarily the same as the odds that any hypothesis is true). At least, that is what I think goes on. The court contents itself with warm and fuzzy statements such as “a probability model to assess the values of a genotype objectively,” “based on mathematical computations from all the data in the electropherograms,” and “separates the genotypes using the mathematical probability principle of the Markov chain Monte Carlo (MCMC) search to calculate the probability for what the different genotypes could be.” (This last clause may not be so warm and fuzzy; it begins to unpack what I simplistically called successive approximations.)
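
In symbols, the division just described is the odds form of Bayes’ rule, rearranged:

$$\underbrace{\frac{P(H_1 \mid \text{data})}{P(H_2 \mid \text{data})}}_{\text{posterior odds}} \;=\; \mathrm{BF} \times \underbrace{\frac{P(H_1)}{P(H_2)}}_{\text{prior odds}}, \qquad \text{so} \qquad \mathrm{BF} \;=\; \frac{\text{posterior odds}}{\text{prior odds}}.$$

When each hypothesis is simple, the Bayes’ factor reduces to the likelihood ratio $P(\text{data} \mid H_1)/P(\text{data} \mid H_2)$; when a hypothesis is composite (an unknown contributor with any of many genotypes), the factor averages likelihoods over the possibilities, which is one reason the labels are not perfectly interchangeable.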

The Timing for General Acceptance

Wakefield is a backwards-looking case. The main question before the Court of Appeals was whether, in 2015, TrueAllele reasonably could have been deemed to have been generally accepted in the scientific community. That is what New York law requires. \3/ The Chief Judge’s analysis of the general acceptance of TrueAllele starts with the observation that “[t]he well-known Frye test applied to the admissibility of novel scientific evidence (Frye v. United States, 293 F. 1013 [D.C. Cir.1923]) is 'whether the accepted techniques, when properly performed, generate results accepted as reliable within the scientific community generally' (People v. Wesley, 83 N.Y.2d 417, 422, 611 N.Y.S.2d 97, 633 N.E.2d 451 [1994]).”

Wesley is an interesting case to cite here. One would not know from the citation or the analysis in Wakefield that in Wesley there was no opinion for a majority of the seven judges on the court. There was one opinion for three judges and another opinion for two judges concurring only in the result. The remaining two judges did not participate. The concurring opinion was written by the late Chief Judge Judith S. Kaye, the longest-serving chief judge in New York history.

Chief Judge Kaye’s concurrence is memorable for its skepticism about finding general acceptance on the basis of studies from the developer of a method. Current Chief Judge Janet DiFiore briefly summarized that discussion (as did the majority in Williams). A more complete exposition is in Box 2. Chief Judge DiFiore then suggests that the Wesley concurrence was satisfied because “[n]otwithstanding these concerns, Chief Judge Kaye ultimately agreed that, at the time the appeal was decided, "RFLP-based forensic analysis [was] generally accepted as reliable" and those testing procedures were accepted as the standard methodology used in the scientific community until the advent of the PCR STR method used today.”

This presentation places an odd spin on the Wesley concurrence. The sole basis for the concurrence was that “it can fairly be said that use of DNA evidence was harmless beyond a reasonable doubt” because the DNA evidence “added nothing to the People's case.” 83 N.Y.2d at 444–45. The observations that five years after the hearing in Wesley, it had become clear that “in principle” RFLP-VNTR testing was “fundamentally sound” and was generally accepted were clearly dicta. Chief Judge Kaye was not suggesting that because a method had become generally accepted later, its earlier admission was vindicated. The dicta on later general acceptance were intended to inform trial courts that while they were at liberty to admit RFLP-VNTR evidence without pretrial hearings on general acceptance, they still needed to probe “the adequacy of the methods used to acquire and analyze samples ... case by case.” Id. at 445.

In contrast to Wesley, which emphasized the state of the science “at the time of the Frye hearing in 1988,” 83 N.Y.2d at 425 (plurality opinion), and whether “in 1988, ... there was consensus,” id. at 439 (concurring opinion), Chief Judge DiFiore’s opinion is less precise on when general acceptance came into existence:

BOX 2. PEOPLE v. WESLEY
83 N.Y.2d 417, 439–41, 611 N.Y.S.2d 97, 633 N.E.2d 451 (N.Y. 1994) (Chief Judge Kaye, concurring) (citations and footnote omitted)

The inquiry into forensic analysis of DNA in this case also demonstrates the "pitfalls of self-validation by a small group." Before bringing novel evidence to court, proponents of new techniques must subject their methods to the scrutiny of fellow scientists, unimpeded by commercial concerns.

A Frye court should be particularly cautious when — as here — "the supporting research is conducted by someone with a professional or commercial interest in the technique." DNA forensic analysis was developed in commercial laboratories under conditions of secrecy, preventing emergence of independent views. No independent academic or governmental laboratories were publishing studies concerning forensic use of DNA profiling. The Federal Bureau of Investigation did not consider use of the technique until 1989. Because no other facilities were apparently conducting research in the field, the commercial laboratory's unchallenged endorsement of the reliability of its own techniques was accepted by the hearing court as sufficient to represent acceptance of the technique by scientists generally. The sole forensic witness at the hearing in this case was Dr. Michael Baird, Director of Forensics at Lifecodes laboratory, where the samples were to be analyzed. While he assured the court of the reliability of the forensic application of DNA, virtually the sole publications on forensic use of DNA were his own or those of Dr. Jeffreys, the founder of Cellmark, one of Lifecodes' competitors. Nor had the forensic procedure been subjected to thorough peer review. ***

The opinions of two scientists, both with commercial interests in the work under consideration and both the primary developers and proponents of the technique, were insufficient to establish "general acceptance" in the scientific field. The People's effort to gain a consensus by having their own witnesses "peer review" the relevant studies in time to return to court with supporting testimony was hardly an appropriate substitute for the thoughtful exchange of ideas in an unbiased scientific community envisioned by Frye. Our colleagues' characterization of a dearth of publications on this novel technique as the equivalent of unanimous endorsement of its reliability ignores the plain reality that this technique was not yet being discussed and tested in the scientific community.

"Although the continuous probabilistic approach was not used in the majority of forensic crime laboratories at the time of the hearing, the methodology has been generally accepted in the relevant scientific community based on the empirical evidence of its validity, as demonstrated by multiple validation studies, including collaborative studies, peer-reviewed publications in scientific journals and its use in other jurisdictions. The empirical studies demonstrated TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples."

Presumably, and notwithstanding citations to materials appearing after 2015, \4/ she meant to write that the methodology had been generally accepted in 2015 because the indications listed were present then. (Whether the decisive time for general acceptance should be that of the trial rather than the appeal is not completely obvious. If a technique becomes generally accepted later, why should the defendant be entitled to a new trial in which the evidence that should have been excluded has become admissible anyway? The defendant's interest in the time-of-trial rule is the interest in not being convicted with the help of scientifically sound evidence (as per the general-acceptance standard based on the best current knowledge). A counter-argument is that a large pool of potential defense experts to question the application of the general accepted method in the particular case did not exist at the time of trial because the evidence was too novel.)

Quantifying the Accuracy of PGS

Turning to the question of the state of acceptance as of 2015, the majority opinion maintains that

[T]he methodology has been generally accepted in the relevant scientific community based on the empirical evidence of its validity, as demonstrated by multiple validation studies, including collaborative studies, peer-reviewed publications in scientific journals and its use in other jurisdictions. The empirical studies demonstrated TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.

Both the fact that the software was written to implement uncontroversial mathematical ideas and the published empirical evidence are important. If the software were designed to implement a mathematically invalid procedure, the game would be over before it began. But techniques such as Bayes’ rule and sampling methods for getting a representative picture of the posterior distribution only work when they are developed appropriately for a particular application. Acknowledging that these tools have been used to solve problems in many fields of science is a bit like saying that the mathematics of probability theory is undisputed. The validity of the mathematical ideas is a necessary but hardly a sufficient condition for a finding that software intended to apply the ideas functions as intended. Using a particular mathematical formula or method to describe or predict real-world phenomena is an endeavor that is subject to and in need of empirical confirmation. Because PGS models the variability in the empirical data that emerge from chemical reactions and electronic detectors, “empirical evidence ... of its accuracy” is indispensable to establishing its accuracy.

Unfortunately, Wakefield is short on details from the “multiple validation studies” and “peer-reviewed publications.” What do the studies and publications reveal about the accuracy of output such as “5.88 billion times more probable” and “170 quintillion times more probable”? The Supreme Court opinion is devoid of any quantitative statement of how well the deconvoluted individual profiles and their Bayes’ factors reported by TrueAllele correspond to the presence or absence of those profiles in samples constructed with or otherwise known to contain DNA from given individuals. So is the Appellate Division opinion. So too with the Court of Appeals’ opinions. The court is persuaded that “[t]he empirical studies demonstrated TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.” But how well did TrueAllele perform in the “many published and peer reviewed” validity studies?

A separate posting summarizes parts of the six studies circa 2015 that are both published and peer reviewed. The numbers in these studies suggest that within certain ranges (with regard to the quantity of DNA, the number of contributors, and the fractions from the multiple contributors), TrueAllele’s likelihood ratios discriminate quite well between samples paired with true contributors and the same samples paired with noncontributors. For example, in one experiment, LR was never greater than 1 for 600,000 simulations of false contributors to 10 two-person mixtures containing 1 nanogram of DNA—no observed false positives! Conversely, LR was never less than 1 for any true contributor to the same ten mixtures—no observed false negatives in 20 comparisons. Moreover, the program’s output behaves qualitatively as it should, generally producing smaller likelihood ratios for electrophoretic data that are more complex or more bedeviled by stochastic effects on peak heights and locations.

Such results suggest that TrueAllele’s LRs are in the ballpark. Yet, it is hard to gauge the size of the ballpark. Is a computed LR of 5.88 billion truly a probability ratio of 5.88 billion? Could the ratio be a lot less or a lot more? The validity studies do not give quantitative answers to these questions about “accuracy.” \5/

The Developer’s Involvement

On appeal, Wakefield had to convince the court that the unchallenged studies and other indicia of general acceptance were too weak to permit a finding of general acceptance. To do so, he pointed to “the dearth of independent validation as a result of Dr. Perlin's involvement in the large majority of studies produced at the hearing.” (Indeed, Dr. Perlin is the lead author of every one of the five published validity studies and a co-author of a sixth published study that also helps show validity.)

The majority acknowledged “legitimate concern” but decided that it was overcome “by the import of the empirical evidence of reliability demonstrated here and the acceptance of the methodology by the relevant scientific community.” However, the discussion of “the import of the empirical evidence” seems somewhat garbled.

1

First, the court notes that “the FBI Quality Assurance Standards requires ‘a developmental validation for a particular technology’ be published.” That the FBI might be satisfied with a single publication from the developer of a method does not speak to what the broader scientific community regards as essential to the validation. Along with the QAS, the court cites "NIST, DNA Mixture Interpretation: A NIST Scientific Foundation Review, at 64 (June 2021 Draft report)." The page merely reports that the NIST staff were able to examine “[p]ublicly available data on DNA mixture interpretation performance ... from five sources [including] published PGS studies” and that “conducting mixture studies may be viewed as a necessity to meet published guidelines or QAS requirements ... .” That scientists and other NIST personnel who choose to review a technology will read the scientific reports of the developers of the technology does not tell us much about defendant’s claim that Cybergenetics’ involvement in the published validation studies gravely diminishes “the import of the empirical evidence.”

2

Second, the Court of Appeals maintained that “the interest of the developer was addressed at the Frye hearing in this case.” As the court described the hearing, the response to this concern was that “[a]lthough Dr. Perlin was involved in and coauthored most of the validation studies, his interest in TrueAllele was disclosed as required by the journals who published the studies and the empirical evidence of the reliability of TrueAllele was not disputed.”

These responses seem rather flaccid. Some of the articles contain conflict-of-interest statements; most do not. \7/ But the presence or absence of obvious disclaimers does not come to grips with the complaint. Defendant’s argument is not that there are hidden funding sources or financial relationships. It is that interests in the outcomes of the studies somehow may affect the results. The claim is not that validation data were fabricated or that the data analysis was faulty. As with the movement for replication and “open science,” it is a response to more subtle threats.

3

Third, the opinion asserts that “the scientific method” is “entirely consistent with” proof of validity coming from the inventors, discoverers, or commercializers (citing President's Council of Advisors on Sci. and Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, at 46 (2016)). Again, however, the argument is not that only disinterested parties do and should participate in scientific dialog. It is that "[w]hile it is completely appropriate for method developers to evaluate their own methods, establishing scientific validity also requires scientific evaluation by other scientific groups that did not develop the method.” Id. at 80 (https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [https://perma.cc/R76Y-7VU]).

4

That precept leads to the court’s last and most telling response to the “legitimate concern” over “the dearth of independent validation.” The Chief Judge finally wrote that “there were [not only] developer [but also] independent validation studies and laboratory internal validation studies, many published and peer-reviewed.”

But is this a fair characterization of the scientific literature as of 2015? From what I can tell, no more than five or six studies appear in peer-reviewed journals, and none are completely “independent validation studies.” The NIST report cited in Wakefield lists but a single “internal validation” study, from Virginia in 2013, apparently released in response to a Freedom of Information Act request. Although the NIST reviewers limited themselves to laboratory studies or data posted to the Internet, they concluded that “[c]urrently, there is not enough publicly available data to enable an external and independent assessment of the degree of reliability of DNA mixture interpretation practices, including the use of probabilistic genotyping software (PGS) systems.”

Of course, this “Key Takeaway #4.3” is merely part of a draft report and is not a judgment as to what conclusions on validity should be reached on the basis of the published studies and the internal ones. Nevertheless, the court overlooks this prominent “takeaway” (and others). Instead, the Chief Judge asserts that “[t]he technology was approved for use by NIST”—even though NIST is not a regulatory agency that approves technologies—and that “NIST's use of the TrueAllele system for its standard reference materials likewise demonstrates confidence within the relevant community that the system generates accurate results.”

~~~

This is not to say that the scientific literature was patently insufficient to support the court’s assessment of the general scientific acceptance of TrueAllele for interpreting the DNA data in the case. But it does raise the question of whether the court’s assertions about the large number of “independent validation studies” and internal ones that have been “published and peer-reviewed” are exaggerated.

Source Code and General Acceptance

The defense also contended that the state’s testimony and exhibits from “the Frye hearing [were] insufficient because, absent disclosure of the TrueAllele source code for examination by the scientific community, its ‘proprietary black box technology’ cannot be generally accepted as a matter of law.” This argument bears two possible interpretations. On the one hand, it could be a claim that scientists demand open-source programs—those with every line of code deposited somewhere for everyone to see—before they will consider a program suitable for data analysis or other purposes. We can call this position the open-source theory.

On the other hand, the claim might be “that disclosure of the TrueAllele source code [to the defense, perhaps with an order to protect against more widespread dissemination of trade secrets] was required to properly conduct the Frye hearing” and that without at least that much discovery of the code, scientists would not regard TrueAllele as valid. We can call this position the discovery-based theory. It implies that, in establishing general scientific acceptance in a Frye hearing, pretrial discovery of secret code is an adequate substitute for exposing the code to the possible scrutiny of the entire scientific community. \8/

The Wakefield opinions are not entirely clear about which theory they embrace or reject. Judge Rivera’s concurrence may have endorsed both theories. In addition to accentuating “the need to provide defendant with access to the source code,” she decried the absence of “objective, expert third-party access,” writing that

The court's decision was an abuse of discretion as a matter of law because it relied on validation studies by interested parties and evaluations founded on incomplete information about TrueAllele's computer-based methodology. Without defense counsel and objective, expert third-party access to and evaluation of the underlying algorithms and source code, the court could not conclude that TrueAllele's brand of probabilistic genotyping was generally accepted within the forensic science community.

The “evaluations founded on incomplete information” were from a standards developing organization, a state forensic science commission, and NIST. They were incomplete because, according to Judge Rivera, “without the source code, the agencies could not adequately evaluate the use of TrueAllele for this type of DNA mixture analysis ... .”

Focusing on the discovery-based theory, the rest of the court determined that “[d]isclosure ... was not needed in order to establish at the Frye hearing the acceptance of the methodology by the relevant scientific community.” The Chief Judge gave two, somewhat confusingly stated, reasons. The first was that Wakefield sought the source code under a rule for discovery that did not apply and then “made no further attempt to demonstrate a particularized need for the source code by motion to the court.” But it is not clear how the failure “to demonstrate a particularized need” overcomes (or even responds to) the argument that the scientific community will not accept software as validly implementing algorithms unless the source code is either open source or given only to the defendant.

The Chief Judge continued:

Moreover, defendant's arguments as to why the source code had to be disclosed pay no heed to the empirical evidence in the validation studies of the reliability of the instrument or to the general acceptance of the methodology in the scientific community—the issue for the Frye hearing—and are directed more toward the foundational concern of whether the source code performed accurately and as intended (see Wesley, 83 N.Y.2d at 429, 611 N.Y.S.2d 97, 633 N.E.2d 451).

The meaning of the sentence may not be immediately apparent. The defense argument is that giving a defendant (or perhaps the scientific community generally) access to source code is a prerequisite to general acceptance of the proposition that the software correctly implements theoretically sound algorithms. If this broad proposition is false dogma, the court should simply say so. It should announce that source code need not be disclosed because there is an alternative, reasonably effective means for establishing that the software performs as it should. The first part of the sentence starts out that way, but the sentence then states that “whether the source code performed accurately and as intended” is not a matter of general acceptance at all. It is only “foundational” in the sense identified by Chief Judge Kaye in Wesley, who, as we saw (Box 2), wrote that even though RFLP-VNTR testing was generally accepted, the complete “foundation” for admitting DNA evidence entails proof that the generally accepted procedure was performed properly in the case at bar.

But regarding the argument about source code as falling outside of the Frye inquiry misapprehends the defense argument. Neither the open-source theory nor the discovery-based theory pertains to the execution of valid software. Both question the premise that validity can be generally accepted without disclosure of the program’s source code. Yet, the majority elaborates on its non-Frye "foundational" classification for the source-code argument as follows:

To the extent the testimony at the hearing reflected that the TrueAllele Casework System may generate less reliable results when analyzing more complex mixtures (see also President's Council of Advisors on Sci. and Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, at 80 [2016] https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [published after the Frye hearing was held]), defendant did not refine his challenge to address the general acceptance of TrueAllele on such complex mixtures or how that hypothesis would have been applicable to the particular facts of this case. As a result, it is unclear that any such objection would have been relevant to defendant's case, where the samples consisted largely of simple (two-contributor) mixtures with the victim as a known contributor (see also NIST, DNA Mixture Interpretation: A NIST Scientific Foundation Review, at 3 [June 2021 Draft report] https://nvlpubs.nist.gov/nistpubs/ir/2021/NIST.IR.8351-draft.pdf).

These citations to the PCAST and NIST reports actually undercut any suggestion that source-code secrecy does not implicate Frye. The NIST draft repeatedly states that

Forensic scientists interpret DNA mixtures with the assistance of statistical models and expert judgment. Interpretation becomes more complicated when contributors to the mixture share common alleles. Complications can also arise when random variations, also known as stochastic effects, make it more difficult to confidently interpret the resulting DNA profile.

Not all DNA mixtures present these types of challenges. We agree with the President’s Council of Advisors on Science and Technology (PCAST) that “DNA analysis of single-source samples or simple mixtures of two individuals, such as from many rape kits, is an objective method that has been established to be foundationally valid” (PCAST 2016).

NIST, DNA Mixture Interpretation: A NIST Scientific Foundation Review, at 2-3 & 11-12 (June 2021 draft) (citations omitted). To demand that “defendant ... refine his challenge to address the general acceptance of TrueAllele on ... complex mixtures or ... the particular facts of this case” is to hold that TrueAllele is generally accepted for use with “single-source samples or simple mixtures of two individuals”—even though the source code is hidden. But if science does not demand the disclosure of source code for general acceptance inside the single-source or simple-mixture zone, then why would it demand disclosure for general acceptance outside that zone?

The court's remarks make more sense as a response to Wakefield’s different discovery argument about the need for source code for trial purposes. This argument does not claim that disclosure of source code is essential for general acceptance to exist. It looks to the trial rather than the pretrial Frye hearing. The thought may be that if the accuracy of the program for the “simple” cases is assured, then the need for discovery of the code to prepare for trial testimony is less compelling. The court appears to be responding that because “the samples consisted largely of simple (two-contributor) mixtures with the victim as a known contributor,” there was little need for discovery of the code in this case.

Although this rejoinder departs from the topic of what Wakefield teaches us about general acceptance, I would note that it is difficult to reconcile this characterization of the case with Chief Judge DiFiore’s own description of the samples. The court mentioned four samples. Its initial description of them indicates that the New York laboratory deemed the sample on the amplifier cord to be “at least” a three-person mixture and stated that “because of the complexity of the mixture,” the laboratory could not even compare “results generated from the amplifier cord ... to defendant's DNA profile.” 2022 WL 1217463, at *1. Because of the “stochastic threshold,” the laboratory could discern peaks at only 4 out of 15 loci for “the outside rear shirt collar” and “for the profile obtained from the victim's forearm.” Id. Presumably, the “insufficient data” on “the unknown contributors to the DNA mixtures found on the amplifier cord and the front of the shirt collar” is what led the state to call Cybergenetics for help. These samples are not instances of what PCAST called “DNA analysis of single-source samples or simple mixtures of two individuals, such as from many rape kits” or what the NIST group called “two-person mixtures involving significant quantities of DNA.” They are the “more complicated” situations that arise “when contributors to the mixture share common alleles [and] when random variations, also known as stochastic effects” are present.

In sum, the deeper one looks into the Wakefield opinions, the more there is to wonder about. But whatever quirks and quiddities reside in the writing, the nearly unanimous opinion of the Court of Appeals signals that a trial court can choose to dispense with the general-acceptance inquiry for at least one PGS program—TrueAllele—for unchallenging single-source samples or two-person mixtures and for samples of somewhat greater complexity as well.

NOTES

* UPDATE: On July 12, 2022, Chief Judge DiFiore announced that she will resign on August 31. See, e.g., Jimmy Vielkind & Corinne Ramey, New York’s Top Judge Resigns Amid Misconduct Proceeding: Attorney for Court of Appeals Judge Janet DiFiore Said Her Resignation Wasn’t Related to a Claim that She Improperly Attempted to Influence a Disciplinary Hearing, Wall St. J., July 12, 2022 8:31 am ET, https://www.wsj.com/articles/new-yorks-top-judge-resigns-amid-misconduct-proceeding-11657629111.
  1. This formulation conflates the issue of novelty with the issue of general acceptance, which can change over time. See Williams, 35 N.Y.3d at 43, 147 N.E.3d at 1143.
  2. The description begins with the remark that “The likelihood ratio in its modern form was developed by Alan Turing during World War II as a code-breaking method.” That is a possibly defective bit of intellectual history, inasmuch as Turing did not develop the likelihood ratio. To decipher messages, Turing relied on a logarithmic scale for the Bayes’ factor in two ways—as indicating the strength of evidence, and as a tool for sequential analysis. Sir Harold Jeffreys had done the former in his 1939 Theory of Probability book. The sequential analysis problem is not clearly connected to PGS. It arises when the sample size is not fixed in advance and the data are evaluated continuously as they are collected. PGS processes all the data at once.
  3. As the court wrote in People v. Williams, 35 N.Y.3d 24, 147 N.E.3d 1131, 1139–40, 124 N.Y.S.3d 593 (N.Y. 2020), “[r]eview of a Frye determination must be based on the state of scientific knowledge and opinion at the time of the ruling (see Cornell, 22 N.Y.3d at 784–785, 986 N.Y.S.2d 389, 9 N.E.3d 884 [‘a Frye ruling on lack of general causation hinges on the scientific literature in the record before the trial court in the particular case’]).”
  4. E.g., 2022 WL 1217463 at *7 n.10 (“TrueAllele is not an outlier in the use of the continuous probabilistic genotyping method. Other types of probabilistic genotyping software, such as STRMix, have likewise been found to be generally accepted (see e.g. United States v. Gissantaner, 990 F.3d 457, 467 (6th Cir. 2021))”).
  5. Cf. David H. Kaye, Theona M. Vyvial & Dennis L. Young, Validating the Probability of Paternity, 31 Transfusion 823 (1991) (comparing the empirical LR distribution for parentage using presumably true and false mother-child-father trios derived from a set of civil paternity cases to the “paternity index” (PI), a likelihood ratio computed with software applying simple genetic principles to the inheritance of HLA types, and reporting that the theoretical PI diverged from the empirical LR for PI > 80 or so). A toy sketch of this kind of calibration check appears after these notes.
  6. “Gary Skuse, Ph.D., a professor of biological sciences at the Rochester Institute of Technology, testified at trial as a defense witness [and] agreed ... that defendant's DNA was present in the mixtures found on the shirt collar and amplifier cord and that it was ‘most likely’ present on the victim's forearm.”
  7. The articles in the Journal of Forensic Sciences and Science and Justice have no such statements. The “Competing Interests” paragraph in a PLoS ONE article states: “I have read the journal’s policy and have the following conflicts. Mark Perlin is a shareholder, officer and employee of Cybergenetics in Pittsburgh, PA, a company that develops genetic technology for computer interpretation of DNA evidence. Cybergenetics manufactures the patented TrueAllele Casework system, and provides expert testimony about DNA case results. Kiersten Dormer and Jennifer Hornyak are current or former employees of Cybergenetics. Lisa Schiermeier-Wood and Dr. Susan Greenspoon are current employees of the Virginia Department of Forensic Science, a government laboratory that provides expert DNA testimony in criminal cases and is adopting the TrueAllele Casework system. This does not alter our adherence to all the PLOS ONE policies on sharing data and materials.”
  8. The defense advanced a different discovery theory in arguing that it could not adequately cross-examine and confront Dr. Perlin at trial unless it could access the source code. The court rejected this theory too.
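A toy illustration for note 5, with invented numbers and function names (nothing like this appears in the hearing exhibits): one way to check calibration is to bin computed log10(LR) values from validation samples with known ground truth. Within each bin, the rate at which true-source samples land there, divided by the rate for false-source samples, is an “empirical LR” that should roughly track the computed values in the bin.

    import numpy as np

    # Hypothetical calibration check in the spirit of note 5 (not from any
    # exhibit in the case). lr_true holds computed LRs for validation samples
    # known to contain the contributor's DNA; lr_false holds computed LRs for
    # samples known not to. bin_edges are on the log10 scale.
    def calibration_table(lr_true, lr_false, bin_edges):
        log_true, log_false = np.log10(lr_true), np.log10(lr_false)
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            p1 = np.mean((log_true >= lo) & (log_true < hi))    # rate under H1
            p2 = np.mean((log_false >= lo) & (log_false < hi))  # rate under H2
            empirical = p1 / p2 if p2 > 0 else float("inf")
            print(f"computed log10(LR) in [{lo}, {hi}): empirical LR ~ {empirical:.1f}")

A well-calibrated method that reports LRs near 1,000 should see roughly 1,000 times as many true-source as false-source validation samples in that bin; the 1991 study found that this kind of agreement broke down for theoretical PIs above about 80.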

Saturday, May 7, 2022

The New York Court of Appeals Returns to Probabilistic Genotyping Software (Part I—Williams)

For the past ten years or so, motions to exclude testimony about “probabilistic genotyping” results have been commonly lodged and almost always denied. With rare exceptions, appellate courts have held these rulings to be proper (or at least within the trial judge’s discretion). Then came People v. Williams, 35 N.Y.3d 24, 147 N.E.3d 1131, 124 N.Y.S.3d 593 (N.Y. 2020).

In this murder case, New York’s highest court held that the output of one form of probabilistic genotyping software (PGS) was being admitted prematurely, before the scientific community had an adequate chance to evaluate it. But that was two years ago. Two weeks ago, in People v. Wakefield, 2022 N.Y. Slip Op. 02771, 2022 WL 1217463 (N.Y. Apr. 26, 2022), the court returned to the issue of PGS evidence. As in Williams, PGS produced "likelihood ratios" associating the defendant with a murder weapon, but this time the court held that the PGS in question had achieved the general scientific acceptance required to admit scientific evidence in New York.

This posting discusses Williams. A separate posting will consider how Wakefield differs from Williams, and why.

Cadman Williams was accused of a fatal shooting in the Bronx in 2008. The DNA in the case came from a gun hidden in Williams’s former girlfriend’s apartment. At trial, an expert from the New York City Office of the Chief Medical Examiner (OCME) testified “that it was millions of times more likely that the DNA mixture found on the gun contained contributions from defendant and one unknown, unrelated person, rather than from two unknown, unrelated people.” 35 N.Y.3d at 31. At least, that is how the court understood the testimony. It is not, however, an accurate statement of a likelihood ratio involving the identity of the individual (or individuals) whose DNA is in the sample. Such a likelihood ratio involves only the probability of the DNA data conditional on source hypotheses, not the other way around. (With large enough likelihood ratios, however, the distinction can be academic.) A further issue is the choice of hypotheses. Why compare the defendant to a random person rather than to, say, the former girlfriend?
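In symbols (using E for the DNA data and H1 and H2 for the two source hypotheses, notation introduced here for clarity), the likelihood ratio and the posterior odds are

\[ LR = \frac{P(E \mid H_1)}{P(E \mid H_2)}, \qquad \frac{P(H_1 \mid E)}{P(H_2 \mid E)} = LR \times \frac{P(H_1)}{P(H_2)}. \]

Testimony that the defendant-plus-unknown hypothesis is “millions of times more likely” describes the posterior odds on the left of the second equation. Those equal the LR only if the prior odds P(H1)/P(H2) are taken to be 1.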

Nonetheless, the Williams opinion did not need to consider the proper interpretation of the likelihood ratio for a pair of hypotheses or the selection of those hypotheses. The appeal concerned only the general scientific acceptance of the method that the OCME had devised for generating likelihood ratios for minute quantities of DNA. The Court of Appeals held that the trial court erred “in admitting expert testimony with respect to LCN and FST results in the absence of a Frye hearing.” 35 N.Y.3d at 42. LCN stands for “low copy number” and refers to the fact that the OCME laboratory tweaked the usual method for producing data on the personally identifying features of DNA fragments so as to obtain results from these minute quantities. FST stands for “Forensic Statistical Tool,” a computer program developed within the OCME to calculate likelihood ratios. Frye refers to Frye v. United States, 293 F. 1013 (D.C. Cir. 1923), a famous case from the District of Columbia that announced the rule that a scientific method had to achieve acceptance within the scientific community before its results could be received as evidence in court.

Although the Court of Appeals referred to one New York trial court opinion finding general acceptance of the OCME method as “questionable,” the Williams court did not hold that the computer output from the FST was necessarily inadmissible. The precise legal error lay in the trial court’s allowing the testimony without first conducting an evidentiary hearing on whether OCME’s methods were generally accepted within the scientific community. The majority opinion, written by Judge Eugene M. Fahey, distinguished between the LCN and FST parts of the OCME method and determined that neither could be said to be generally accepted based on the information presented to the trial judge (namely, prior opinions, mostly without evidentiary hearings; scientific publications; internal studies from the OCME laboratory; and a review conducted by a subcommittee of New York’s forensic science commission).

“With respect to the FST issue,” the prosecution had “maintained that such evidence should be admitted without a Frye hearing because ‘numerous articles published in peer-reviewed scientific journals’ supported the point that ‘the analytical software employs well-established principles such as Bayesian statistics and likelihood ratios which are used in many areas of science including forensics, medicine and social sciences.’” 35 N.Y.3d at 35 (note omitted). The prosecution added that “given both the thorough review of the FST by DNA Subcommittee of the New York State Forensic Science Committee [sic] and the exhaustive validation of that tool by OCME, the relevant scientific community had accepted the FST as reliable.” 35 N.Y.3d at 31.

The Court of Appeals was unmoved. It wrote that:

If the analysis was as simple as determining whether FST is comprised of existing mathematical formulas that are individually accepted as generally reliable within the relevant scientific community, then FST evidence probably would be admissible even in the absence of a Frye hearing. [¶] The point remains, however, that FST is a proprietary program exclusively developed and controlled by OCME. The sole developer and the sole user are the same. That is not “an appropriate substitute for the thoughtful exchange of ideas ... envisioned by Frye” (Wesley, 83 N.Y.2d at 441, 611 N.Y.S.2d 97, 633 N.E.2d 451 [Kaye, Ch. J., concurring] ). It is an invitation to bias. .... [That the] tool has ... been vetted and approved by “the distinguished scientists making up the DNA Subcommittee of the New York State Forensic Science Committee” is certainly relevant [but] that insular endorsement is no substitute for the scrutiny of the relevant scientific community. [¶] Indeed, here, defendant was hamstrung in demonstrating the existence of conflicting scientific opinions in order to show the need for Frye review of the FST based on the “black box” nature of that program, but his papers adequately showed that OCME's secretive approach to the FST was inconsistent with quality assurance standards within the relevant scientific community. Those papers also showed that facts adduced in challenges to the FST made in Frye applications in other proceedings suggested that the accuracy calculations of that program may be flawed. .... In short, the FST should be supported by those with no professional interest in its acceptance. Frye demands an objective, unbiased review.

147 N.E.3d at 1141–42.

This language was too much for three of the seven judges. Their concurring opinion, written by Chief Judge Janet M. DiFiore, balked at the “pejorative view of the ... OCME's ... LCN DNA typing technique and its ... probabilistic genotyping software program ... .” 147 N.E.3d at 1147. To be sure, the concurrence agreed that “the issues ... in this 2014 motion [were] ripe for a Frye hearing,” but only because the internal studies did not appear to encompass the small quantity of DNA that was analyzed in this case. According to the concurrence,

The LCN DNA profiles drive the FST analysis, and FST results are only as reliable as the predicate assumptions integrated into the FST software program. The People did not meet their burden of establishing the validity of the empirical data used to fuel the calculations performed by this statistical model, including the manner of accounting for the occurrence of the stochastic effect and allelic dropout in a multiple contributor sample of less than 25 picograms, in a manner sufficient to bypass a Frye hearing. Fundamentally, the combined use of that statistical tool with DNA typing on samples that fell beneath validated thresholds may have impacted the reliability of the results, raising a valid challenge to the admissibility of that evidence in a criminal prosecution.

35 N.Y.3d at 52–53. In other words, the concurring judges seemed to buy the state’s broad argument that there was no need for a Frye hearing to find general acceptance of the OCME system in many cases. But not in this case, where the laboratory pushed beyond what it had validated to its own satisfaction. These judges evinced no concern with the limited outside testing of the FST and expressed no doubts about the conclusions of general acceptance reached in the clear preponderance of trial court rulings on motions to exclude FST likelihood ratios. And the entire court agreed that admission in the case was harmless error because the other evidence against Williams was overwhelming. 
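To see why the quantity of DNA mattered so much to the concurrence, consider a deliberately oversimplified model (emphatically not the FST’s or TrueAllele’s actual model): one contributor, one locus, no drop-in, no peak heights, and each of the two allele copies dropping out independently with an assumed probability d. Even in this toy setting, the reported LR depends directly on d, so a dropout rate that has not been validated at the relevant template amounts feeds straight into the number presented to the jury.

    from itertools import product

    # Toy model only -- not the FST or TrueAllele. Each of the two allele
    # copies of the true genotype drops out independently with probability d;
    # there is no drop-in, and peak heights are ignored.
    def prob_obs_given_genotype(observed, genotype, d):
        """P(observed allele set | true genotype) under independent dropout."""
        prob = 0.0
        for survives in product([False, True], repeat=2):
            surviving = {a for a, s in zip(genotype, survives) if s}
            if surviving == set(observed):
                prob += ((1 - d) if survives[0] else d) * ((1 - d) if survives[1] else d)
        return prob

    def toy_lr(observed, suspect_genotype, allele_freqs, d):
        """LR for H1 (the suspect is the source) versus H2 (an unrelated
        unknown person is), with Hardy-Weinberg genotype probabilities
        for the unknown."""
        numerator = prob_obs_given_genotype(observed, suspect_genotype, d)
        alleles = sorted(allele_freqs)
        denominator = 0.0
        for i, a in enumerate(alleles):
            for b in alleles[i:]:
                g = allele_freqs[a] ** 2 if a == b else 2 * allele_freqs[a] * allele_freqs[b]
                denominator += g * prob_obs_given_genotype(observed, (a, b), d)
        return numerator / denominator

    # With only allele "a" observed (the heterozygous a,b suspect's second
    # allele apparently dropped out), the toy LR moves from about 0.8
    # (favoring the defense) at d = 0.01 to about 4.8 (favoring the
    # prosecution) at d = 0.5:
    freqs = {"a": 0.1, "b": 0.1, "c": 0.8}
    for d in (0.01, 0.5):
        print(d, toy_lr({"a"}, ("a", "b"), freqs, d))

The direction of the swing is the point: under a generous dropout assumption, a missing allele costs the prosecution’s hypothesis little, while under a stingy one it counts heavily against it.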

In sum, Williams was a case with broad reasoning (by the majority) on a narrow topic—the OCME’s home-grown Forensic Statistical Tool as applied to data from an especially challenging DNA sample (as stressed by the concurrence). What about other FST likelihood ratios with data from larger DNA samples? Other brands of PGS?

One thing was clear. Williams would not be the last word on probabilistic genotyping. Another case, involving an older and more established computer program known as TrueAllele, was on its way to the Court of Appeals. Stay tuned for thoughts on this case.