Sunday, January 26, 2014

Hundreds of Errors in DNA Databases: What Do They Mean?

The other day, the New York Times reported that "[t]he Federal Bureau of Investigation, in a review of a national DNA database, has identified nearly 170 profiles that probably contain errors" and that "New York State authorities have turned up mistakes in DNA profiles in New York’s database."

Apparently, nearly all the errors involve a recorded profile that differs from the true profile at one (and only one) allele. These "mistakes were discovered in July, when the F.B.I., using improved software, broadened the search parameters used to detect matches. The change, one F.B.I. scientist said, was like upgrading or refining 'a spell-check.' In 166 instances, the new search found DNA profiles in the database that were almost identical but conflicted at a single point."

This discovery raises several questions. How prevalent are these errors? What caused them? And, what investigative or prosecutorial errors could they cause?

I. Prevalence and Causes

The article observes that "[t]he errors identified so far implicate only a tiny fraction of the total DNA profiles in the national database, which holds nearly 13 million profiles, more than 12 million from convicts and suspects, and an additional 527,000 from crime scenes." Thus, "Alice R. Isenberg, the chief of the biometric analysis section of the F.B.I. Laboratory, said that ... 'We were pleasantly surprised it was only 166. ... These are incredibly small numbers for the size of the database.'" She believes "most of the 166 cases probably resulted from interpretation errors by DNA analysts or typographical errors introduced when a lab worker uploaded the series of numbers denoting a person’s DNA profile."

II. Consequences: Risks of False Hits and Misses

It is true that 166 is a small fraction--about 0.0013%--of all the profiles on record. But it would be interesting to know whether these discrepancies are concentrated in the known offender or arrestee profiles or in the ones from crime scenes (the "forensic index").

A. Errors in the Offender-Arrestee Indices

Suppose first that the profile of an individual in an offender or arrestee index departs from the true profile of that individual by a single allele. If that individual committed an offense for which a crime-scene profile is recovered, he would not be noticed in a database trawl (unless the search program flagged near misses like this). This would be a false negative error.

How probable is this false negative error? There have been fewer than 230,000 hits (see http://www.fbi.gov/about-us/lab/biometric-analysis/codis/ndis-statistics) between profiles in the forensic index and the more than 12,000,000 profiles in the offender and arrestee indices. This means that an individual has about a 1.9% chance of being linked to a crime through the database. Assuming that all 166 misrecorded or mistyped profiles are in the offender-arrestee indices, it follows that the probability that one or more of these errors would yield a false negative is 166 x 0.0013% x 1.9% = 0.000042 = 0.0042%. Even if the 166 errors were the tip of the proverbial iceberg, 90% of which lies below the visible surface, the probability of a false negative resulting from the inaccurate profiles would be only 0.042%.
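
For readers who want to check the numbers, here is a minimal sketch in Python that simply reproduces the back-of-the-envelope arithmetic above; the database counts and the figure of 166 errors are the ones quoted in this post, and the way the factors are combined tracks the text rather than any official analysis.

    # Figures quoted above (rounded); not official FBI statistics.
    total_profiles = 12_900_000     # profiles of all kinds in the national database
    offender_profiles = 12_000_000  # offender and arrestee profiles
    hits = 230_000                  # approximate upper bound on offender hits to date
    errors = 166                    # erroneous profiles identified so far

    error_fraction = errors / total_profiles   # ~0.0013% of all profiles
    hit_rate = hits / offender_profiles        # ~1.9% chance of being linked through the database

    # The post's combined estimate of a false negative arising from the errors,
    # with and without a tenfold "iceberg" allowance for undiscovered errors.
    p_false_negative = errors * error_fraction * hit_rate
    print(f"{p_false_negative:.4%}")        # about 0.004%
    print(f"{10 * p_false_negative:.4%}")   # about 0.04%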

If the inaccurate profile pertains to an offender or an arrestee, as we have assumed so far, then the probability of a false positive -- a hit to an individual represented in the database who is not the source of the crime-scene sample -- typically is even smaller. A false positive could occur if someone else in the world has a true profile that (1) is not in the offender-arrestee index and (2) perfectly matches the inaccurate profile in the index. Because all DNA profiles consisting of a substantial number of loci are rare, the probability of a false positive must be small. 1/
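
To put a rough number on that intuition, the arithmetic spelled out in note 1 below can be reproduced in a few lines; the 10% allele frequency and 30% homozygosity rate are the illustrative assumptions used in that note, not estimates for any actual case.

    # Illustrative random-match arithmetic from note 1; the 10% allele frequency
    # and 30% homozygosity are assumptions for the sketch, not case-specific data.
    n_loci = 13          # the then-standard number of core STR loci
    p_allele = 0.10      # assumed frequency of each allele in the profile

    # Unrelated person: the per-locus match probability is bounded above by 2p,
    # so a full 13-locus match is rarer than (2p)**13.
    p_unrelated_bound = (2 * p_allele) ** n_loci
    print(f"unrelated: < {p_unrelated_bound:.2e}")     # about 8.2e-10

    # Full sibling: per-locus formula quoted in note 1, with p = homozygosity.
    homozygosity = 0.30
    p_sibling = (0.25 + homozygosity / 2 + 2 * homozygosity ** 2) ** n_loci
    print(f"sibling:   ~ {p_sibling:.2e}")             # about 8.4e-04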

B. Errors in the Forensic Index

Of course, not all the inaccurate profiles came from previous offenders or arrestees. Some came from crime-scene samples -- producing erroneous entries in the forensic index. In three cases in New York, these errors "had the effect of obscuring clues, blinding investigators to connections among crime scenes and known offenders." When forensic-index errors are present, the risk of a false negative is much larger. Even if the source of the crime-scene sample is represented in the offender-arrestee indices (as appears to be true for some 2% of these database inhabitants), a database trawl for a match to the erroneous forensic-index profile will miss this person.

Again, however, the chance of a false positive -- a hit between the false crime-scene profile and an offender or arrestee profile -- remains low for full DNA profiles. The probability that some offender or arrestee in the database has a true profile identical to the erroneous crime-scene profile is still quite small. 2/

III. Caveats


The probabilities noted here are the result of back-of-the-envelope calculations (see note 1). Although I would think that more precise analyses would not give dramatically different results, I am not suggesting that the sources of error reported by the FBI should be ignored or minimized. The incidence of errors in generating and recording data can be reduced, in part by automated systems. 3/ In addition, the estimates I have provided do not pertain to errors resulting from contamination of a crime-scene sample with an innocent suspect's sample or to profiles that are less complete than thirteen loci.

Notes
  1. If each STR allele is present in 10% of the relevant population (the actual values for each allele will vary from case to case) and if 13 loci are in the profile (the current norm), then the probability of a full match to the source of a crime-scene sample is less than (2 x 1/10)^13 = 0.00000000082 (if the source of the crime-scene sample and the database inhabitant are unrelated members of a population in Hardy-Weinberg equilibrium and there is no linkage disequilibrium). Even if the correctly profiled crime-scene sample comes from a brother, the probability of a 13-locus match is only about (1/4 + p/2 + 2p^2)^13, where p is the average chance that the two alleles will match (the proportion of homozygotes in the population). (National Research Council Committee on DNA Technology in Forensic Science 1992, p. 167). If the homozygosity rate is, say, 30% (which is higher than that reported in Budowle et al. (1999, p. 1278 tbl. 1)), the probability of a full sibling possessing the one-off profile would be only about 0.00084.
  2. See supra note 1.
  3. Additional recording errors might be detected by massive, all-pairs trawls of the indices in CODIS. These would flag suspiciously similar profiles recorded as coming from what are thought to be different sources.
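
As a rough illustration of what such an all-pairs trawl would involve, the following sketch flags pairs of recorded profiles that conflict at no more than a single allele; the profile encoding and the one-allele threshold are assumptions made for the illustration, not features of the CODIS software.

    # Toy version of an all-pairs "near miss" trawl: flag pairs of recorded profiles
    # that agree everywhere except at no more than one allele. The encoding
    # (locus -> allele pair) and the one-allele threshold are illustrative only.
    from itertools import combinations

    def allele_differences(profile_a, profile_b):
        """Count allele disagreements between two profiles over the loci of profile_a."""
        diff = 0
        for locus, alleles_a in profile_a.items():
            a = sorted(alleles_a)
            b = sorted(profile_b.get(locus, ()))
            diff += sum(1 for x, y in zip(a, b) if x != y) + abs(len(a) - len(b))
        return diff

    def near_misses(profiles, max_diff=1):
        """Return pairs of profile IDs that disagree at no more than max_diff alleles."""
        return [(id_a, id_b)
                for (id_a, prof_a), (id_b, prof_b) in combinations(profiles.items(), 2)
                if allele_differences(prof_a, prof_b) <= max_diff]

    # Hypothetical three-locus profiles: A and B conflict at a single allele of D18S51.
    profiles = {
        "A": {"D3S1358": (15, 16), "vWA": (17, 18), "D18S51": (12, 14)},
        "B": {"D3S1358": (15, 16), "vWA": (17, 18), "D18S51": (12, 15)},
        "C": {"D3S1358": (14, 17), "vWA": (16, 19), "D18S51": (10, 13)},
    }
    print(near_misses(profiles))   # [('A', 'B')]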

References

Saturday, January 18, 2014

The Signal, the Noise, and the Errors

Published in September 2012, The Signal and the Noise by Nate Silver soon reached The New York Times best-seller list for nonfiction. Amazon.com named it the best nonfiction book of 2012, and it won the 2013 Phi Beta Kappa Award in science. Not bad for a book that presents Bayes' rule as a prescription for better thinking in life and as a model for consensus formation in science. The subtitle, for those who have not read it, is "Why So Many Predictions Fail--But Some Don't," and the explanations include poor data, cognitive biases, and statistical models not grounded in an understanding of the phenomena being modeled.

The book is both thoughtful and entertaining, covering many fields. I learned something about meteorology (your local TV weather forecaster probably is biased toward forecasting bad weather--stick to the National Weather Service forecasts), earthquake predictions, climatology, poker, human and computer chess-playing, equity markets, sports betting, political polling and poor prognostication by pundits, and more. Silver does not pretend to be an expert in all these fields, but he is perceptive and interviewed a lot of interesting people.

Indeed, although Wikipedia describes Silver as "an American statistician and writer who analyzes baseball (see Sabermetrics) and elections (see Psephology)," he does not present himself as an expert in statistics, and statisticians seem conflicted about whether to include him in their ranks (see AmStat News). He seems to be pretty much self-educated in the subject, and he advocates "getting your hands dirty with the data set" rather than "spending too much time doing reading and so forth." Frick (2013).

Perhaps that emphasis, combined with the objective of writing an entertaining book for the general public, has something to do with the rather sweeping--and sometimes sloppy--arguments for Bayesian over frequentist methods. Although Silver gives a few engaging and precise examples of Bayes' rule in operation (playing poker or deciding whether your domestic partner is cheating on you, for instance), he is quick to characterize a variety of informal, intuitive modes of combining many different kinds of data as tantamount to following Bayes' rule. Marcus & Davis (2013) identify one telling example--a very successful sports bettor who recognizes the importance of data that the bookies overlook, misjudge, or do not acquire. (Pp. 232-61). What makes this gambler a Bayesian? Silver thinks it is the fact that "[s]uccessful gamblers ... think of the future as speckles of probability, flickering upward and downward like a stock market ticker to every new jolt of information." (P. 237). That's fine, but why presume that the flickers follow Bayes' rule as opposed to some other procedure for updating beliefs? And why castigate frequentist statisticians, as Silver seems to, as "think[ing] of the future in terms of no-lose bets, unimpeachable theories, and infinitely precise measurements"? Ibid. Surely, that is not the world in which statisticians live.
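
For readers who want the mechanics rather than the metaphor, what would actually make the gambler a Bayesian is the multiplication of prior odds by a likelihood ratio for each new jolt of information. Here is a minimal sketch of that updating rule; the prior and the likelihood ratios are invented for illustration.

    # Minimal Bayesian updating: posterior odds = prior odds x likelihood ratio,
    # applied once per item of evidence. The numbers are made up for illustration.
    def update_odds(prior_odds, likelihood_ratios):
        """Multiply the prior odds by each likelihood ratio in turn."""
        odds = prior_odds
        for lr in likelihood_ratios:
            odds *= lr
        return odds

    def odds_to_probability(odds):
        return odds / (1 + odds)

    prior_odds = 0.05 / 0.95          # a 5% prior probability for the hypothesis
    lrs = [4.0, 0.5, 10.0]            # hypothetical strengths of three items of evidence
    posterior_odds = update_odds(prior_odds, lrs)
    print(f"posterior probability: {odds_to_probability(posterior_odds):.2f}")   # about 0.51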

Changing probability judgments does not make someone a Bayesian

In proselytizing for Bayes' theorem and in urging readers to "think probabilistically" (p. 448), Silver also writes that
When you first start to make these probability estimates, they may be quite poor. But there are two pieces of favorable news. First, these estimates are just a starting point: Bayes's theorem will have you revise and improve them as you encounter new information. Second, there is evidence that this is something we can learn to improve. The military, for instance, has sometimes trained soldiers in these techniques,5 with reasonably good results.6 There is also evidence that doctors think about medical diagnoses in a Bayesian manner.7 [¶] It is probably better to follow the lead of our doctors and our soldiers than our television pundits.
It is hard to argue with the concluding sentence, but where is the evidence that many soldiers and doctors are intuitive (or trained) Bayesians? The report cited (n.5) for the proposition that "[t]he military ... has sometimes trained soldiers in the [Bayesian] techniques" says nothing of the kind.* Similarly, the article that is supposed to show that the alleged training in Bayes' rule produces "reasonably good results" is quite wide of the mark. It is a 35-year-old report for the Army on "Training for Calibration" about research that made no effort to train soldiers to use Bayes' rule.**

How about doctors? The source here is an article in the British Medical Journal that asserts that "[c]linicians apply bayesian reasoning in framing and revising differential diagnoses." Gill et al. (2005). But these authors--I won't call them researchers because they did no real research--rely only on their impressions and post hoc explanations for diagnoses that are not expressed probabilistically. As one distinguished physician tartly observed, "[c]linicians certainly do change their minds about the probability of a diagnosis being true as new evidence emerges to improve the odds of being correct, but the similarity to the formal Bayesian procedure is more apparent than real and it is not very likely, in fact, that most clinicians would consider themselves bayesians." Waldron (2008, pp. 2-3 n.2).

The transposition fallacy

Consistent with this tendency to conflate expressing judgments probabilistically with using Bayes' rule to arrive at the assessments, Silver presents probabilities that have nothing to do with Bayes' rule as if they are properly computed posterior probabilities. In particular, he naively transposes conditional probabilities to misrepresent p-values as degrees of belief.

At page 185, he writes that
A once-famous “leading indicator” of economic performance, for instance, was the winner of the Super Bowl. From Super Bowl I in 1967 through Super Bowl XXXI in 1997, the stock market gained an average of 14 percent for the rest of the year when a team from the National Football League (NFL) won the game. But it fell by almost 10 percent when a team from the original American Football League (AFL) won instead. [¶] Through 1997, this indicator had correctly “predicted” the direction of the stock market in twenty-eight of thirty-one years. A standard test of statistical significance, if taken literally, would have implied that there was only about a 1-in-4,700,000 possibility that the relationship had emerged from chance alone.
This is a cute example of the mistake of interpreting a p-value, acquired after a search for significance, as if there had been no such search. As Silver submits, "[c]onsider how creative you might be when you have a stack of economic variables as thick as a phone book." Ibid.

But is the ridiculously small p-value (which he obtained by regressing the S&P 500 index on the conference affiliation of the Super Bowl winner) really the probability "that the relationship had emerged from chance alone"? No, it is the probability that such a remarkable association would be seen if the Super Bowl outcome and the S&P 500 index were entirely uncorrelated (and no one had searched for a data set that happened to show a seemingly shocking correlation with the S&P 500 index). Silver may be a Bayesian at heart, but he did not compute the probability of the null hypothesis given the data, and it is problematic to tell the reader that a "standard test of statistical significance" (or, more precisely, a p-value) gives the (Bayesian) probability that the null hypothesis is true.
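
Silver's 1-in-4,700,000 figure comes from his regression, but the logic of a p-value can be seen in an even simpler (and so not numerically identical) calculation: under the null hypothesis that the indicator is pure chance, treat each of the thirty-one years as a fair coin flip and ask how often twenty-eight or more correct calls would occur. The sketch below does only that; the point is that the result is a probability of data this extreme given the null hypothesis, not a probability of the null hypothesis given the data.

    # A deliberately simple stand-in for Silver's regression: under the null
    # hypothesis of no relationship, each year is a fair coin flip, and the
    # p-value is the chance of 28 or more "correct" years out of 31.
    from math import comb

    n, k = 31, 28
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    print(f"one-sided p-value: {p_value:.2e}")   # about 2.3e-06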

Of course, with such an extreme p-value, the misinterpretation might not make any practical difference, but the same misinterpretation is evident in Silver's characterization of a significance test at the 0.05 level. He writes that "[b]ecause 95 percent confidence in a statistical test is Fisher’s traditional dividing line between 'significant' and 'insignificant,' researchers are much more likely to report findings that statistical tests classify as 95.1 percent certain than those they classify as 94.9 percent certain—a practice that seems more superstitious than scientific." (P. 256 n. †). Sure, any rigid dividing line (a procedure that Fisher did not really use) is arbitrary, but rejecting a hypothesis in a classical statistical test at the 0.05 level does not imply a 95% certainty that this rejection is correct.
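
A small calculation shows why. Under Bayes' rule, the chance that a "significant" result is nonetheless a false alarm depends on the prior probability of the null hypothesis and on the power of the test, neither of which appears in the p-value. The prior (a 10% chance that the null is false) and the power (80%) in the sketch below are assumptions chosen only for illustration.

    # How often is a "significant" result a false alarm? That depends on the prior
    # and the power, not just on alpha = 0.05. The prior probability that the null
    # is false (10%) and the power (80%) are assumptions for illustration only.
    alpha = 0.05              # significance level: P(reject | null true)
    power = 0.80              # P(reject | null false)
    prior_alternative = 0.10  # assumed prior probability that the null is false

    p_reject = power * prior_alternative + alpha * (1 - prior_alternative)
    p_null_given_reject = alpha * (1 - prior_alternative) / p_reject
    print(f"P(null true | rejected) = {p_null_given_reject:.2f}")   # about 0.36, not 0.05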

In transposing conditional probabilities in violation of both Bayesian and frequentist precepts, Silver is in good and plentiful company. As the pages on this blog reveal, theoretical physicists, epidemiologists, judges, lawyers, forensic scientists, journalists, and many other people make this mistake. E.g., The Probability that the Higgs Boson Has Been Discovered, July 6, 2012. Despite its general excellence in describing data-driven thinking, The Signal and the Noise would have benefited from a little more error-correcting code.

Notes

* Rather, Gunzelmann & Gluck (2004) discusses training in unspecified "mission-relevant skills." An "expert model is able to compare their actions against the optimal actions in the task situation" and "identify [trainee errors] and provide specific feedback about why the action was incorrect, what the correct action was, and what the students should do to correct their mistake." The expert model--and not the soldiers--"uses Bayes’ theorem to assess mastery learning based upon the history of success and failure on particular units of skill within the task." Ibid. According to the authors, this "Bayesian knowledge tracing approach" is inadequate because it "does not account for forgetting, and thus cannot provide predictions about skill retention." Ibid.

** Lichtenstein & Fischhoff's (1978) objective was "to help analysts to more accurately use numerical probabilities to indicate their degree of confidence in their decisions." They did not study military analysts, but instead recruited 12 individuals from their personal contacts. They had these trainees assess the probabilities of statements in the areas of geography, history, literature, science, and music. They measured how well calibrated their subjects were. (A well calibrated individual gives correct answers to x% of the questions for which he or she assesses the probability of the given answer to be x%.) The proportion of the subjects whose calibration improved after feedback was 72%. Ibid.
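
For concreteness, calibration can be scored by grouping a person's probability judgments by stated confidence and comparing each stated level with the proportion of answers in that group that turned out to be correct. The judgments in the sketch below are invented to show the bookkeeping; they are not data from the study.

    # Scoring calibration: group judgments by stated confidence and compare the
    # stated level with the observed proportion correct. The (confidence, correct)
    # pairs are invented for illustration, not data from the 1978 report.
    from collections import defaultdict

    judgments = [(0.6, True), (0.6, False), (0.6, True),
                 (0.8, True), (0.8, True), (0.8, False), (0.8, True),
                 (0.9, True), (0.9, True)]

    groups = defaultdict(list)
    for confidence, correct in judgments:
        groups[confidence].append(correct)

    for confidence in sorted(groups):
        outcomes = groups[confidence]
        observed = sum(outcomes) / len(outcomes)
        print(f"stated {confidence:.0%}  observed {observed:.0%}  (n={len(outcomes)})")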

References

Walter Frick, Nate Silver on Finding a Mentor, Teaching Yourself Statistics, and Not Settling in Your Career, Harvard Business Review Blog Network, Sept. 24, 2013, http://blogs.hbr.org/2013/09/nate-silver-on-finding-a-mentor-teaching-yourself-statistics-and-not-settling-in-your-career/.

Christopher J. Gill, Lora Sabin & Christopher H. Schmid, Why Clinicians Are Natural Bayesians, 330 Brit. Med. J. 1080–83 (2005), available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC557240/.

Glenn F. Gunzelmann & Kevin A. Gluck, Knowledge Tracing for Complex Training Applications: Beyond Bayesian Mastery Estimates, in Proceedings of the Thirteenth Conference on Behavior Representation in Modeling and Simulation 383-84 (2004), available at http://act-r.psy.cmu.edu/wordpress/wp-content/uploads/2012/12/710gunzelmann_gluck-2004.pdf.

Sarah Lichtenstein & Baruch Fischhoff, Training for Calibration, Army Research Institute Technical Report TR-78-A32, Nov. 1978, available at http://www.dtic.mil/dtic/tr/fulltext/u2/a069703.pdf.

Gary Marcus & Ernest Davis, What Nate Silver Gets Wrong, New Yorker, Jan. 25, 2013, http://www.newyorker.com/online/blogs/books/2013/01/what-nate-silver-gets-wrong.html.

Tony Waldron, Palaeopathology (2008), excerpt available at http://assets.cambridge.org/97805216/78551/excerpt/9780521678551_excerpt.pdf.

Thursday, January 16, 2014

Tres Mal Errors with DNA Evidence

A story in the Denver Post (Gurman 2014) begins with the disturbing news that
A malfunction in a DNA processing machine led to the scrambling of samples from 11 Denver police burglary cases, officials acknowledged Friday. It took more than two years for the department to discover the errors. As a result of the mix-up, prosecutors are dismissing burglary cases against four people, three of whom had already pleaded guilty.
What happened?

In 2011, "[a] machine 'froze' while running a tray of 19 DNA samples." An analyst "replaced [the samples] in the wrong order" after asking the manufacturer of the robot how to proceed. More than two years later, "the machine froze for a second time." An analyst called again and "became concerned because the directions seemed different the second time. Further review over the next month revealed" the 2011 error. Ibid.

What of It?

The police department chief of staff observed that "[n]one of the DNA was compromised; it was merely associated with the wrong case when we were done." Ibid. In other words, the crime-scene DNA profiles were mislabeled. Such errors could have helped criminals avoid detection. For instance, a burglar in case A falsely associated with case B might have had a strong alibi defense for case B. Alternatively, such labeling errors could have caused individuals to be convicted of the wrong crime -- perhaps a man guilty of a burglary could have been found guilty of a murder (or vice versa).

In this incident, however, a police spokeswoman said that "[a]ll four people had confessed to at least one burglary, but the DNA error meant they were charged with the wrong ones." Ibid. Prosecutors dismissed the charges against the four, and they will not be tried for the other burglaries because the statute of limitations has expired.

Another "tray mal" case

Misuse of automated machinery for DNA analysis also produced an error--this one involving an entirely innocent man--in "what is described as the most advanced automated DNA testing system in the UK at LGC forensics labs in Teddington." (Israel 2012). The machinery extracts DNA from wells in a plastic tray. Police arrested Andrew Scott, 20, after a street fight and sent a saliva sample to LGC for profiling. (Doyle 2012). Instead of throwing away the tray after the run with Scott's saliva sample, however, a worker reused it in an unrelated rape case. The tray contained enough of Scott's leftover DNA to show his profile in the later rape sample. The laboratory should have been aware of a problem, for "[t]he batch containing the rape sample showed DNA present in the negative control (a blank sample put through to test for contamination)." (Rennison 2012).

As a result of the error, Scott was charged with "a violent attack on a woman in Manchester – carried out when he was hundreds of miles away in Plymouth." (Doyle 2012). After he spent months in prison, the charge was dismissed. "Phone records showed he was 300 miles away on the south coast when the rape took place." Scott described the experience as a "living nightmare": "They kept me in a segregation wing which was full of rapists and paedophiles. I suffered lots of verbal abuse and other inmates spitting at us and shouting 'paedos.'" Ibid.

References
Acknowledgments
  •  Thanks to Bill Thompson for alerting me to the Denver case.
Copyr. (c) DH Kaye 2014