Sunday, January 26, 2014

Hundreds of Errors in DNA Databases: What Do They Mean?

The other day, the New York Times reported that "[t]he Federal Bureau of Investigation, in a review of a national DNA database, has identified nearly 170 profiles that probably contain errors" and that "New York State authorities have turned up mistakes in DNA profiles in New York’s database."

Apparently, nearly all the errors involve a recorded profile that differs from the true profile at one (and only one) allele. These "mistakes were discovered in July, when the F.B.I., using improved software, broadened the search parameters used to detect matches. The change, one F.B.I. scientist said, was like upgrading or refining 'a spell-check.' In 166 instances, the new search found DNA profiles in the database that were almost identical but conflicted at a single point."

This discovery raises several questions. How prevalent are these errors? What caused them? And, what investigative or prosecutorial errors could they cause?

I. Prevalence and Causes

The article observes that "[t]he errors identified so far implicate only a tiny fraction of the total DNA profiles in the national database, which holds nearly 13 million profiles, more than 12 million from convicts and suspects, and an additional 527,000 from crime scenes." Thus, "Alice R. Isenberg, the chief of the biometric analysis section of the F.B.I. Laboratory, said that ... 'We were pleasantly surprised it was only 166. ... These are incredibly small numbers for the size of the database.'" She believes "most of the 166 cases probably resulted from interpretation errors by DNA analysts or typographical errors introduced when a lab worker uploaded the series of numbers denoting a person’s DNA profile."

II. Consequences: Risks of False Hits and Misses

It is true that 166 is a small fraction--about 0.0013%--of all the profiles on record. But it would be interesting to know whether these discrepancies are concentrated in the known offender or arrestee profiles or in the ones from crime scenes (the "forensic index").

A. Errors in the Offender-Arrestee Indices

Suppose first that the profile of an individual in an offender or arrestee index departs from the true profile of that individual by a single allele. If that individual committed an offense for which a crime-scene profile is recovered, he would not be noticed in a database trawl (unless the search program flagged near misses like this). This would be a false negative error.
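To make the near-miss idea concrete, here is a minimal sketch of a trawl that can be run strictly (exact matches only) or with a tolerance of one discordant allele. Everything in it is an assumption for illustration: the function names, the representation of a profile as a tuple of allele pairs, and the three-locus toy data are hypothetical and are not taken from the FBI's software.

```python
from collections import Counter

def allele_mismatches(profile_a, profile_b):
    """Count alleles in profile_a that have no counterpart in profile_b, locus by locus."""
    total = 0
    for locus_a, locus_b in zip(profile_a, profile_b):
        # Multiset intersection: how many alleles the two genotypes share at this locus.
        shared = sum((Counter(locus_a) & Counter(locus_b)).values())
        total += len(locus_a) - shared
    return total

def trawl(query, database, max_mismatches=0):
    """Return (profile_id, mismatch_count) for every entry within the tolerance."""
    hits = []
    for profile_id, profile in database.items():
        n = allele_mismatches(query, profile)
        if n <= max_mismatches:
            hits.append((profile_id, n))
    return hits

# Hypothetical toy data: three loci instead of thirteen.
database = {
    # Suppose this offender's true genotype at the second locus is (8, 9),
    # but it was recorded as (8, 10).
    "offender-001": ((12, 14), (8, 10), (15, 15)),
    "offender-002": ((11, 13), (7, 9), (16, 18)),
}
crime_scene = ((12, 14), (8, 9), (15, 15))

strict_hits = trawl(crime_scene, database)                     # exact matches only
relaxed_hits = trawl(crime_scene, database, max_mismatches=1)  # also flags one-off profiles
```

In this toy example the strict search returns nothing, which is the false negative described above, while the relaxed search flags offender-001 as a one-allele near miss for a human to review.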

How probable is this false negative error? There have been fewer than 230,000 hits between profiles in the forensic index and the more than 12,000,000 profiles in the offender and arrestee indices. This means that an individual has about a 1.9% chance of being linked to a crime through the database. Assuming that all 166 misrecorded or mistyped profiles are in the offender-arrestee indices, it follows that the probability that one or more of these errors would yield a false negative is about 166 x 0.0013% x 1.9% = 0.000041 = 0.0041%. Even if the 166 errors were the tip of the proverbial iceberg, 90% of which lies below the visible surface, the probability of a false negative resulting from the inaccurate profiles is only about 0.041%.
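For readers who want to vary the inputs, the back-of-the-envelope arithmetic in the preceding paragraph can be written out as a short sketch. It simply mirrors the product used in the text; the function name and the `hidden_multiplier` parameter (10 for the "iceberg" scenario in which only a tenth of the errors have been found) are my own labels, and the inputs are the rounded figures quoted from the article.

```python
def false_negative_estimate(flagged, total_profiles, hits, searchable_profiles,
                            hidden_multiplier=1.0):
    """Reproduce the rough product: flagged x error fraction x hit rate.

    hidden_multiplier scales the result for undetected errors, following
    the text's treatment of the iceberg scenario (a simple linear scaling).
    """
    error_fraction = flagged / total_profiles   # about 0.0013% of all profiles
    hit_rate = hits / searchable_profiles       # about 1.9% of offender/arrestee profiles
    return hidden_multiplier * flagged * error_fraction * hit_rate

# Rounded figures quoted in the text (upper bounds, so the result is rough).
visible_only = false_negative_estimate(166, 13_000_000, 230_000, 12_000_000)
with_iceberg = false_negative_estimate(166, 13_000_000, 230_000, 12_000_000,
                                       hidden_multiplier=10)

print(f"visible errors only: {visible_only:.2e}")
print(f"iceberg scenario:    {with_iceberg:.2e}")
```

With these inputs the first figure comes out at roughly four in 100,000 and the second at ten times that, in line with the estimates above.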

If the inaccurate profile pertains to an offender or an arrestee, as we have assumed so far, then the probability of a false positive -- a hit to an individual represented in the database who is not the source of the crime-scene sample -- typically is even smaller. A false positive could occur if someone else in the world has a true profile that (1) is not in the offender-arrestee index and (2) perfectly matches the inaccurate profile in the index. Because all DNA profiles consisting of a substantial number of loci are rare, the probability of a false positive must be small. 1/
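The rough figures behind this claim (given in note 1) can be reproduced with a few lines of code. This is only a sketch under the simplifying assumptions stated in that note: a uniform 10% allele frequency, 13 independent loci, and a 30% homozygosity rate for the sibling case. The function names are hypothetical.

```python
def unrelated_match_bound(allele_freq=0.10, loci=13):
    """Upper bound used in note 1: each locus matches with probability at most 2 * allele_freq."""
    return (2 * allele_freq) ** loci

def sibling_match_estimate(p=0.30, loci=13):
    """Full-sibling approximation quoted in note 1: (1/4 + p/2 + 2p^2) per locus."""
    return (0.25 + p / 2 + 2 * p ** 2) ** loci

unrelated = unrelated_match_bound()   # unrelated individuals, 13 loci
sibling = sibling_match_estimate()    # full siblings, 13 loci

print(f"unrelated: {unrelated:.2e}")
print(f"sibling:   {sibling:.2e}")
```

The unrelated-person bound is on the order of one in a billion, and the sibling figure is a bit under one in a thousand, so even the less favorable case leaves the chance of a coincidental full match small.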

B. Errors in the Forensic Index

Of course, not all the inaccurate profiles came from previous offenders or arrestees. Some came from the crime-scene samples -- producing erroneous entries in the forensic index. These errors "had the effect of obscuring clues, blinding investigators to connections among crime scenes and known offenders" in three cases in New York. When forensic-index errors are present, the risk of a false negative error is much larger. Even if the source of the crime-scene sample is represented in the offender-arrestee indices (as seems to occur for some 2% of these database inhabitants), a database trawl for a match to the erroneous forensic index profile will miss this person.

Again, however, the chance of a false positive -- a hit between the false crime-scene profile and an offender or arrestee profile -- remains low for full DNA profiles. The probability of another person having the same false profile is still quite small. 2/

III. Caveats

The probabilities noted here are the result of back-of-the-envelope calculations (see note 1). Although I would think that more precise analyses would not give dramatically different results, I am not suggesting that the sources of errors reported by the FBI should be ignored or minimized. The incidence of errors in generating and recording data can be reduced, in part by automated systems. 3/ In addition, the estimates I have provided do not pertain to errors resulting from contamination of a crime-scene sample with an innocent suspect's sample or to profiles that are less complete than thirteen loci.

  1. If each STR allele is present in 10% of the relevant population (the actual values for each allele will vary from case to case) and if 13 loci are in the profile (the current norm), then the probability of a full match to the source of a crime-scene sample is less than (2 x 1/10)^13 = 0.00000000082 (if the source of the crime-scene sample and the database inhabitant are unrelated members of a population in Hardy-Weinberg equilibrium and there is no linkage disequilibrium). Even if the correctly profiled crime-scene sample comes from a brother, the probability of a 13-locus match is only about (1/4 + p/2 + 2p^2)^13, where p is the average chance that the two alleles will match (the proportion of homozygotes in the population). (National Research Council Committee on DNA Technology in Forensic Science 1992, p. 167). If the homozygosity rate is, say, 30% (which is higher than that reported in Budowle et al. (1999, p. 1278 (tbl. 1))), the probability of a full sibling possessing a one-off profile would only be about 0.00084.
  2. See supra note 1.
  3. Additional recording errors might be detected by massive, all-pairs trawls of the indices in CODIS. These would flag suspiciously similar profiles recorded as coming from what are thought to be different sources.

