Thursday, January 24, 2019

NIST News on NGS

Back in July, before the President intransigently shut down much of the government, including the National Institute of Standards and Technology, NIST reported on a new study of population frequencies of DNA sequence variations in the regions whose lengths are used to produce identifying DNA profiles. The regions, or loci, are known as short tandem repeats, or STRs. Forensic-science laboratories now use a technology based on capillary electrophoresis that separates DNA fragments by length, but newer methods used in "Next Generation Sequencing" (NGS) effectively "read" the order of all the base-pairs in these loci.

This more detailed information could be useful in cases of incomplete profiles (occurring because the quantity of DNA is so small that some alleles do not show up in electropherograms) or in deciphering complex mixtures. In these situations, it would be helpful to have estimates of how often the sequence variations occur in relevant populations. Such estimates could be used to compute probabilities of matches to samples from individuals other than the matching suspect. And that is what the NIST researchers have provided in an article on "U.S. Population Sequence Data for 27 Autosomal STR Loci." 1/ In the words of the authors:
"This NIST 1036 data set is expected to support the implementation of STR sequencing forensic casework by providing high-confidence sequence-based allele frequencies for the same sample set which are already the basis for population statistics in many U.S. forensic laboratories."
The news report NIST posted to its website 2/ used less technical language. It explained that
"NIST measured those STR gene frequencies years ago using a library of DNA samples from 1,036 individuals. To calculate gene frequencies for NGS-based profiles, Gettings and her co-authors cracked open the freezer that contained the original samples, which were anonymized and donated by people who consented to their DNA being used for research. The scientists generated NGS-based profiles for them by sequencing 27 markers—the core set of 20 included in most DNA profiles in the U.S. plus seven others. They then calculated the frequencies for the various genetic sequences found at each marker.

"It might be surprising that scientists can estimate gene frequencies from such a small library of samples. However, the NIST team was measuring frequencies not for the full profiles, but for the individual markers. Since they sequenced 27 markers, with each marker occurring twice per sample, the number of markers tested wasn’t 1,036, but more than 55,000."
Although this explanation seems easier to follow, it raises several questions. First, what is an "STR gene frequency"? STRs are not genes. The news account correctly states that "[t]o generate a DNA profile, forensic labs analyze sections of DNA, called genetic markers, where the genetic code repeats itself, like a word typed over and over again. Those sections are called short tandem repeats, or STRs, and the number of repeats at each marker varies from person to person." Some of this "microsatellite DNA," as it also is called, does occur within genes, but only in regions called "introns." Introns are transcribed into RNA but then cut out before the RNA is translated into proteins. Consequently, STRs are not part of the genetic code for proteins. This fact makes them relatively uninformative in inferring physical or mental traits, in diagnosing diseases, and in ascertaining health risks. Thus, speaking of an "STR gene frequency" reflects a certain confusion about what STRs and genes actually are.

Second, how can the NIST data set from 1,036 individuals be tantamount to a sample of size 55,000? The number of samples from members of each racial or ethnic population group in which allele frequencies are estimated is much less than 1,036. The published paper gives allele frequencies for "four U.S. populations: African American (AfAm, N = 342), Asian (N = 97), Caucasian (Cauc, N = 361), and Hispanic (Hisp, N = 236)." Furthermore, because an allele is a DNA variant at a given locus, the allele frequencies pertain to the variations at each locus. The total number of loci has no bearing on the precision of the estimate at any locus. Sequences on chromosome 1 are not compared to sequences on chromosome 5, for example. For the 97 Asians, there are at most a total of 97 individuals x 2 alleles per individual per locus = 194 alleles per locus -- not 55,000. The most common repeat sequence at the locus on chromosome 5, for instance, was ATCT. If my reading of the NIST study is correct, repeats of this sequence of four base-pairs (such as ATCTATCTACTCACTCACTC) occurred 150 times in the Asian sample, for a reported percentage of 150/194 = 77.32%. If we treat the NIST Asian sample of 194 such alleles as a simple random sample, then the standard error is [(.7732)(1 - .7732) / 194]^(1/2) = 3%. The 95% confidence interval for the actual value in the Asian population that was sampled is therefore 71% to 83%. The fact that there were 26 other loci does not reduce the sampling error for this allele-frequency estimate. The 55,000 figure has nothing to do with the accuracy of each frequency estimate. 3/
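
For readers who want to check the arithmetic in the paragraph above, here is a minimal Python sketch. It assumes, as the paragraph does, that the 194 alleles behave like a simple random sample. It also assumes the usual normal-approximation (Wald) interval with z = 1.96, which the post does not name but which reproduces its 71% to 83% figure. The function names are illustrative only and do not come from the NIST paper.

```python
import math

def alleles_per_locus(individuals, copies_per_individual=2):
    """Number of alleles observed at ONE locus in a sample of individuals."""
    return individuals * copies_per_individual

def wald_interval(count, n, z=1.96):
    """Proportion, standard error, and normal-approximation interval.

    Assumes a simple random sample of n alleles; z = 1.96 gives a
    nominal 95% interval.
    """
    p = count / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se

# Asian sample: 97 individuals, so 194 alleles at the chromosome 5 locus
n_alleles = alleles_per_locus(97)
p, se, lo, hi = wald_interval(150, n_alleles)

print(n_alleles)                  # 194
print(f"{p:.2%}")                 # 77.32%
print(f"{se:.1%}")                # 3.0%
print(f"{lo:.1%} to {hi:.1%}")    # 71.4% to 83.2%

# The news report's "more than 55,000" pools all individuals and all loci:
pooled = 1036 * 27 * 2
print(pooled)                     # 55944
```

The pooled count never enters the interval calculation. Only the 194 alleles observed at the locus in question determine the standard error.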

Indeed, the opening paragraph of the news report is problematic. It states that "DNA is often considered the most reliable form of forensic evidence" because "[w]hen [DNA experts] compare the DNA left at a crime scene with the DNA of a suspect, experts generate statistics that describe how closely those DNA samples match." But the population sample statistics are not used to describe how closely DNA samples match. Either two STR profiles match or they do not. At least, this is the usual testimony, which includes "random match probabilities" only when the samples are declared to match fully.

Does NIST need more quality control for its website?

NOTES
  1. K. B. Gettings, L. A. Borsuk, C. R. Steffen, K. M. Kiesler, P. M. Vallone, U.S. Population Sequence Data for 27 Autosomal STR Loci, Forensic Science International: Genetics (published online 19 July 2018), DOI: 10.1016/j.fsigen.2018.07.013
  2. NIST, NIST Builds Statistical Foundation for Next-Generation Forensic DNA Profiling, July 23, 2018, https://www.nist.gov/news-events/news/2018/07/nist-builds-statistical-foundation-next-generation-forensic-dna-profiling
  3. Of course, the most salient sampling uncertainty is for the random-match probability for the full profile. Random sampling errors across loci should tend to cancel out -- some of the single-locus estimates that are combined according to a population-genetics model will be on the high side, some on the low side. That effect, however, is different from having a larger sample of the population of alleles.