New York’s Court of Appeals returned to the contentious issue of “probabilistic genotyping software” (PGS) in People v. Wakefield, 2022 N.Y. Slip Op. 02771, 2022 WL 1217463 (N.Y. Apr. 26, 2022). As previously discussed, in People v. Williams, 147 N.E.3d 1131 (N.Y. 2020), a slim majority of the court had reasoned that the output of a computer program should not have been admitted without a full evidentiary hearing on the program's general acceptance within the scientific community.
In Wakefield, the Court of Appeals faced a different question about a more complex computer program: whether, after holding such a hearing, the trial court erred in finding that the program was generally accepted as a scientifically valid and reliable means of estimating “likelihood ratios” for DNA mixtures like the ones recovered in the case. The program, known as TrueAllele, is marketed by Cybergenetics, “a Pittsburgh-based bioinformation company [whose] computers translate DNA data into useful information.”
As discussed separately, the Wakefield court held that, in the circumstances of the case, the output of TrueAllele was admissible to associate the defendant with a murder. It emphasized “multiple validation studies ... demonstrat[ing] TrueAllele's reliability, by deriving reproducible and accurate results from the interpretation of known DNA samples.” 2022 WL 1217463 at *7. But the court did not describe the level of accuracy attained in any of the validation studies. That is surely something lawyers would want to know about, so I decided to read the “peer-reviewed publications in scientific journals” (id.) to which the court must have been referring.
The state introduced 31 exhibits at the evidentiary hearing in 2015. Nine were journal publications of some kind. Six of those described data collected to establish (or at least indirectly suggest) that TrueAllele was accurate. Only three of them relied on “known DNA samples” as opposed to samples from casework. The synopses that follow do not describe every part of these studies, let alone all of their findings. I merely pick out the parts that I found most interesting and most pertinent to the question of accuracy or error (two sides of the same coin).
The 2009 Cybergenetics Known-samples Study
The first study is M.W. Perlin & A. Sinelnikov, An Information Gap in DNA Evidence Interpretation, 4 PLoS ONE e8327 (2009). This experiment used 40 laboratory-constructed two-contributor mixture samples (from two pairs of unrelated individuals) with varying mixture proportions and total DNA amounts (0.125 ng to 1 ng). It showed that TrueAllele was much better at classifying a sample as containing a contributor’s DNA than the cumulative probability of inclusion (CPI) method, which employs peak-height thresholds for binary determinations of the presence of alleles. TrueAllele’s likelihood ratios (LRs) supported the hypothesis of inclusion (LR > 1) in nearly every instance.
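For readers who have not encountered CPI, here is a minimal sketch of the computation, with invented allele frequencies (the inclusion formula is the standard one; the numbers are not from the study):

```python
# Minimal sketch of the CPI computation (allele frequencies are invented).
# At each locus, every allele whose peak clears the threshold is "included";
# the chance that a random person's two alleles both fall in that set is the
# squared sum of the included alleles' frequencies. Multiply across loci.
included_allele_freqs = [
    [0.10, 0.25, 0.25, 0.20],  # locus 1: four alleles above threshold
    [0.20, 0.30, 0.25],        # locus 2: three alleles above threshold
]

cpi = 1.0
for freqs in included_allele_freqs:
    cpi *= sum(freqs) ** 2

print(f"CPI = {cpi:.4f}")  # probability of "including" a random noncontributor
```

Because CPI discards the quantitative peak-height information, it typically wastes much of the identification information that a probabilistic model can extract, which is the “information gap” of the title.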
However, the data could not reveal whether the level of positive support (log-LR) was accurate. Does a computed LR of 1,000,000 “really” indicate evidence that is five orders of magnitude more probative than a computed LR of 10? The “empirical evidence” from the study cannot answer this question. The best we can do is to verify that the computed LR increases as the quantity of DNA does. The uncertainty inherent in the PCR process is smaller for larger starting quantities, and this should be reflected in the magnitude of the LR.
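One way the calibration literature approaches this question is through a standard property of true likelihood ratios: under the non-source hypothesis, the probability of observing an LR of at least t can be no more than 1/t. A toy simulation of a well-calibrated system, assuming a simple Gaussian score model that has nothing to do with TrueAllele’s internals, shows the bound in action:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (not TrueAllele): same-source comparisons yield scores ~ N(2, 1),
# different-source comparisons yield scores ~ N(0, 1). The true LR for a
# score s is the ratio of the two densities.
def lr(s):
    return norm.pdf(s, loc=2.0) / norm.pdf(s, loc=0.0)

# Draw scores for a million different-source (non-contributor) comparisons.
lrs = lr(rng.normal(0.0, 1.0, size=1_000_000))

# For a well-calibrated LR, P(LR >= t | non-source) <= 1/t.
for t in (10, 100, 1000):
    print(f"t={t}: observed {np.mean(lrs >= t):.2e}, bound {1/t:.0e}")
```

In principle, a validation study could compare the observed exceedance rates among known non-contributors to this bound; persistent violations would suggest that the reported LRs overstate the evidence.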
The 2011 Cybergenetics–New York State Police Casework Study
The second study also used two-contributor mixtures, but these came from casework in which the alleles, as ascertained by conventional methods, did not exclude the defendant as a possible contributor. In Mark W. Perlin et al., Validating TrueAllele DNA Mixture Interpretation, 56 J. Forensic Sci. 1430 (2011), researchers from Cybergenetics and the New York State Police laboratory selected “16 two-person mixture samples” that met certain criteria “from 40 adjudicated cases and one proficiency test conducted in” the New York laboratory. TrueAllele generated larger LRs than those from the manual analyses. That TrueAllele did not produce LRs < 1 (indicative of exclusions) for any defendant included by conventional analysis is evidence of a low false-exclusion probability. The computed LRs are greater than 1 when they should be. But this empirical evidence does not directly address whether the magnitudes of the LRs are as close to or as far from 1 as they should be if they are to be understood as Bayes factors.
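To see why the magnitude matters, recall the arithmetic role of a Bayes factor: it multiplies whatever the prior odds are to yield posterior odds. A two-line illustration with hypothetical numbers:

```python
# Hypothetical numbers, purely to illustrate the Bayes-factor role of an LR.
prior_odds = 1 / 1000      # suppose 1000-to-1 against the defendant contributing
lr = 1_000_000             # the computed likelihood ratio
posterior_odds = prior_odds * lr
print(posterior_odds)      # 1000-to-1 *for* contribution, but only if the
                           # LR's magnitude is trustworthy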
The 2013 Cybergenetics–New York State Police Casework Study
The third study is more extensive. In Mark W. Perlin et al., New York State TrueAllele® Casework Validation Study, 58 J. Forensic Sci. 1458 (2013), Cybergenetics worked with the New York laboratory to reanalyze DNA mixtures with up to three contributors from 39 adjudicated cases and two proficiency tests. “Whenever there was a human result, the computer’s genotype was concordant,” and TrueAllele “produced a match statistic on 81 mixture items ... , while human review reported a statistic on [only] 25 of these items.”
This time Cybergenetics also tried to answer the question of how often TrueAllele produces false “matches” (LR > 1) when it compares a known noncontributor’s sample to a mixed sample. It did so by constructing false pairs of samples for TrueAllele to process. As the authors explained,
We compared each of the 87 matched mixture evidence genotypes with the (<87) reference genotypes from the other 40 cases. Each of these 7298 comparisons should generate a mismatch between the unrelated genotypes from different cases and hence a negative log(LR) value. A genotype inference method having good specificity should exhibit mismatch information values [log-LRs] that are negative in the same way that true matches are positive.
Id. at 1461. Thus, they derived two empirical distributions for likelihood ratios: one for the nonexcluded defendants in the cases (whom we would expect to be actual sources) and one for the unrelated individuals (whom we would expect to be non-sources). The empirical distributions were well separated, and the log(LR) was always less than zero for the presumed non-sources.
So TrueAllele seems to work well as a classifier (for distinguishing true-source pairs from false-source pairs) in these small-scale studies. But again, the question of whether the magnitudes of its LRs are accurate remains open. With astronomically large LRs, it is hard to know the answer. Cf. David H. Kaye, Theona M. Vyvial & Dennis L. Young, Validating the Probability of Paternity, 31 Transfusion 823 (1991). \1/
The 2013 UCF–Cybergenetics Known-samples Study
The fourth study is J. Ballantyne, E.K. Hanson & M.W. Perlin, DNA Mixture Genotyping by Probabilistic Computer Interpretation of Binomially-sampled Laser Captured Cell Populations: Combining Quantitative Data for Greater Identification Information, 53 Sci. & Justice 103 (2013). It is not a validation study, but researchers from the University of Central Florida and Cybergenetics made two different two-person mixtures with equal quantities of DNA from each person. In such 50:50 mixtures, peak heights are expected to be similar, making it harder to fit the pattern of alleles into the pairs (single-locus genotypes) from each contributor than if there had been a major and a minor contributor. So the team created ten small (20-cell) subsamples of each of the two mixed DNA samples by selecting cells at random and analyzed these subsamples separately. They used TrueAllele to estimate the relative contributions (“mixture weights”) in the 20-cell subsamples and found that when TrueAllele combined data from multiple subsamples, it assigned a 99% probability to the two contributors’ genotypes. The point of the study was to demonstrate the possibility of subdividing even small, balanced samples to take advantage of the peak-height differences arising from imbalances in the even smaller subsamples.
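The “binomial sampling” in the title is just the luck of the draw in picking a handful of cells: even from a perfectly balanced mixture, a 20-cell subsample will usually be unbalanced. A minimal simulation of that idea (the parameters follow the study’s design; the code is only a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw ten 20-cell subsamples from a 50:50 two-person mixture. The number of
# cells from contributor A in each subsample is Binomial(20, 0.5), so the
# subsamples' mixture weights scatter around, but rarely hit, 0.5.
cells_from_A = rng.binomial(n=20, p=0.5, size=10)
print(cells_from_A / 20)   # e.g. [0.45 0.6 0.35 ...]: usable imbalances
```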
The 2013 Cybergenetics–Virginia Department of Forensic Services Casework Study
The fifth study is more on point. In Mark W. Perlin et al., TrueAllele Casework on Virginia DNA Mixture Evidence: Computer and Manual Interpretation in 72 Reported Criminal Cases, 9 PLoS ONE e92837 (2014), researchers from Cybergenetics and the Virginia Department of Forensic Services compared TrueAllele with manual analysis on 111 selected casework samples. Each criminal-case mixture paired with a nonexcluded defendant’s profile should produce a large LR. For ten pairs, TrueAllele failed to return “a reproducible positive match statistic.” Among the 101 remaining, presumably same-source pairs, the smallest LR was 18. Since the LR must be less than 1 to be deemed indicative of a noncontributor, in no instance did TrueAllele generate a falsely exonerating result.
But what about falsely incriminating LRs? This time, the researchers did not reassign the defendants’ profiles to other cases to produce false pairs. Rather, they generated 10,000 random STR genotypes (from population databases of alleles in Virginia) to simulate the STR profiles of non-sources of the mixtures from the criminal cases. They paired each of these non-source profiles with the 101 genotypes that emerged from the unknown mixtures and calculated LR values. Fewer than 1 in 20,000 of these mixture/non-source comparisons produced an LR suggesting an association (LR > 1); fewer than 1 in 1,000,000 produced an LR > 1,000; and there were no false positives at all for LR > 6,054. In other words, TrueAllele produced an empirical distribution for false pairs that consisted almost entirely of LRs < 1 and that never had very large LRs. Again, it seems to be an excellent classifier.
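A sketch of how such noncontributor profiles can be drawn from an allele-frequency database may be useful. The frequencies below are invented, not Virginia’s, and Hardy-Weinberg sampling is assumed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented allele-frequency tables standing in for a population database.
freqs = {
    "D3S1358": {14: 0.10, 15: 0.25, 16: 0.25, 17: 0.20, 18: 0.20},
    "vWA":     {16: 0.20, 17: 0.30, 18: 0.25, 19: 0.25},
    "FGA":     {21: 0.15, 22: 0.20, 23: 0.25, 24: 0.25, 25: 0.15},
}

def random_genotype(freqs, rng):
    """Simulate an unrelated person: two independent alleles per locus
    (Hardy-Weinberg equilibrium assumed)."""
    return {
        locus: tuple(sorted(rng.choice(list(table), size=2, p=list(table.values()))))
        for locus, table in freqs.items()
    }

# 10,000 simulated noncontributors, mirroring the study's design.
noncontributors = [random_genotype(freqs, rng) for _ in range(10_000)]
print(noncontributors[0])
```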
The 2015 Cybergenetics–Kern Regional Crime Laboratory Known-samples Study
Finally, in M.W. Perlin et al., TrueAllele Genotype Identification on DNA Mixtures Containing up to Five Unknown Contributors, 60 J. Forensic Sci. 857 (2015), researchers from Cybergenetics and the Kern Regional Crime Laboratory in California obtained DNA samples from five known individuals. They constructed ten two-person mixtures by randomly selecting two of the five contributors and mixing their DNA in proportions picked at random. They constructed ten 3-, 4-, and 5-person mixtures in the same manner. From each of these 4 × 10 mixtures, they created a 1-nanogram and a 200-picogram sample for STR analysis. TrueAllele computed an LR for each of the genotypes that went into each analyzed sample (the alternative hypothesis being a random genotype).
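As I read the paper’s description, the construction amounts to something like the following sketch (the random choices and the 5%-95% range for the proportions are my assumptions):

```python
import random

random.seed(0)
people = ["P1", "P2", "P3", "P4", "P5"]   # the five known contributors

# Ten two-person mixtures: pick 2 of the 5 people and a random split.
mixtures = []
for _ in range(10):
    pair = random.sample(people, 2)
    w = round(random.uniform(0.05, 0.95), 2)   # proportion from the first person
    mixtures.append((pair, (w, round(1 - w, 2))))
print(mixtures[0])
```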
Defining an exclusion as an LR < 1, TrueAllele rarely excluded true contributors to the 1 ng 2- or 3-contributor mixtures (no exclusions in 20 comparisons and 1 in 30, respectively), but with 4 and 5 contributors involved, the false-exclusion rates were 9/40 and 9/50, respectively. The false exclusions came from the more extreme mixtures: as long as the lesser contributor supplied at least 10% of a nanogram mixture’s DNA, there were no false exclusions. The false-exclusion rates for the 200 pg samples were larger: 2/20, 4/30, 13/40, and 19/50. For these low-template mixtures, a greater proportion of the lesser contributor’s DNA (25%) had to be present to avoid false exclusions.
To assess false inclusions, 10,000 genotypes were randomly generated from each of three ethnic population allele databases. These noncontributor profiles were compared with the genotypes inferred from the mixture samples. For each ethnic group and DNA mixture sample, the LRs fell almost entirely below LR = 1, meaning that there were few false inclusions. For the high DNA levels (1 ng), the proportions of comparisons with misleading LRs (LR > 1 for the simulated noncontributors) were 0/600,000, 25/900,000, 186/1,200,000, and 1,301/1,500,000 for the 2-, 3-, 4-, and 5-person mixtures, respectively. The worst case (the most misleadingly high LR) occurred for a five-person mixture, where one LR was 1,592. For the low-template DNA mixtures, the corresponding false-inclusion proportions were 2/600,000, 53/900,000, 177/1,200,000, and 145/1,500,000. The worst outcome was an LR of 101 for a four-person mixture.
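The denominators in these proportions can be reconstructed with back-of-the-envelope arithmetic (my accounting, not the paper’s): each mixture yields one inferred genotype per contributor, and each inferred genotype is compared with 10,000 simulated profiles from each of the three databases:

```python
# Back-of-the-envelope check of the reported comparison counts (my reading).
profiles = 3 * 10_000                       # 10,000 profiles x 3 databases
for n_contributors in (2, 3, 4, 5):
    genotypes = n_contributors * 10         # 10 mixtures per mixture size
    print(n_contributors, "contributors:", genotypes * profiles, "comparisons")
# -> 600000, 900000, 1200000, 1500000, matching the reported denominators
```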
Apparently using “reliable” in its legal or nonstatistical sense (as in Daubert and Federal Rule of Evidence 702), the researchers concluded that “[t]his in-depth experimental study and statistical analysis establish the reliability of TrueAllele for the interpretation of DNA mixture evidence over a broad range of forensic casework conditions.” \2/ My sense of the studies as of the time of the hearing in Wakefield is that they show that, within certain ranges (of the quantity of DNA, the number of contributors, and the fractions from the multiple contributors), TrueAllele’s likelihood ratios discriminate quite well between samples paired with true contributors and the same samples paired with unrelated noncontributors. \3/ Moreover, the program’s output behaves qualitatively as it should, generally producing smaller likelihood ratios for electrophoretic data that are more complex or more bedeviled by stochastic effects on peak heights and locations.
NOTES
- In this early study, we compared the “paternity index” (a likelihood ratio computed with software applying simple genetic principles to the inheritance of HLA types) to an empirical LR distribution for parentage derived from presumably true and false mother-child-father trios in a set of civil paternity cases. We found that the theoretical PI diverged from the empirical LR for PI > 80 or so.
- Cf. David W. Bauer, Nasir Butt, Jennifer M. Hornyak & Mark W. Perlin, Validating TrueAllele Interpretation of DNA Mixtures Containing up to Ten Unknown Contributors, 65 J. Forensic Sci. 380, 380 (2020), doi: 10.1111/1556-4029.14204 (abstract concluding that “[t]he study found that TrueAllele is a reliable method for analyzing DNA mixtures containing up to ten unknown contributors”).
- One might argue that the number of mixed samples collectively studied is too small. PCAST indicated that “there is relatively little published evidence” because “[i]n human molecular genetics, an experimental validation of an important diagnostic method would typically involve hundreds of distinct samples.” President's Council of Advisors on Sci. & Tech., Exec. Office of the President, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods 81 (2016) (notes omitted), https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf [https://perma.cc/R76Y-7VU]. The number of distinct samples (mixtures from different contributors) combining all the studies listed here seems closer to 100.
I received a comment noting that the issue of the accuracy of computed likelihood ratios is addressed in the literature on calibration and citing the following references:
* Morrison G.S., Enzinger E., Hughes V., Jessen M., Meuwly D., Neumann C., Planting S., Thompson W.C., van der Vloed D., Ypma R.J.F., Zhang C., Anonymous A., Anonymous B. (2021). Consensus on validation of forensic voice comparison. Science & Justice, 61, 299–309. https://doi.org/10.1016/j.scijus.2021.02.002;
* Morrison G.S. (2021). In the context of forensic casework, are there meaningful metrics of the degree of calibration? Forensic Science International: Synergy, 3, article 100157. https://doi.org/10.1016/j.fsisyn.2021.100157;
* a 2021 symposium on calibration at https://calibration-and-validation.forensic-data-science.net/;
* Weber P., Enzinger E., Labrador B., Lozano-Díez A., Ramos D., González-Rodríguez J., Morrison G.S. (2022). Validation of the alpha version of the E3 Forensic Speech Science System (E3FS3) core software tools. Forensic Science International: Synergy, 4, article 100223. https://doi.org/10.1016/j.fsisyn.2022.100223 (using calibration); and
* Basu N., Bolton-King R.S., Morrison G.S. (2022). Forensic comparison of fired cartridge cases: Feature-extraction methods for feature-based calculation of likelihood ratios. Forensic Science International: Synergy, 5, article 100272. https://doi.org/10.1016/j.fsisyn.2022.100272 (preprint at https://firearms.forensic-data-science.net/) (using calibration).