Friday, July 26, 2013

Good Point, Bad Math: DNA Database Statistics Misunderstood (Again)

In a New York Times op-ed article, High-Tech, High-Risk Forensics, Hastings Law School Professor Osagie K. Obasogie mentions a false arrest in the investigation of the murder near San Jose of millionaire investor Raveesh Kumra. A database trawl seemed to implicate Lukis Anderson, a homeless man who then “spent more than five months in jail with a possible death sentence hanging over his head” before prosecutors received records showing that Mr. Anderson was in a hospital for alcohol intoxication on the night of the murder. Prosecutors have suggested that paramedics who treated Anderson for intoxication at a liquor store in San Jose a few hours before the murder inadvertently transferred his DNA to Mr. Kumra’s fingernails.

This is by no means the first instance of secondary transfer of DNA to a crime scene (if that is the correct explanation), and forensic scientists have been conducting studies to see how often such transfer occurs. The case shows that police and prosecutors, as well as defense counsel and jurors, should not be overawed by matches from DNA database trawls. This is the good point that Professor Obasogie makes.

The rest of the op-ed article perpetuates mathematical fallacies about DNA databases. First, Professor Obasogie questions “the frequent claim that it is highly unlikely, if not impossible, for two DNA profiles to match by coincidence.” Nothing is impossible, but not many people share the same DNA identification profiles. The op-ed tries to deny this with the observation that “[a] 2005 audit of Arizona’s DNA database showed that, out of some 65,000 profiles, nearly 150 pairs matched at a level typically considered high enough to identify and prosecute suspects. Yet these profiles were clearly from different people.”

The 150 or so matches were, in fact, mismatches. That is, they were partial matches that actually excluded every “matching” pair. Only if an analyst improperly ignored the nonmatching parts of the profiles or if these did not appear in a crime-scene sample could they be reported to match.

Moreover, even if we treat all 150 partial matches as tantamount to false full matches in casework, a proper analysis must account for artificially pairing every profile with every other profile and for the many ways to find some kind of partial match. The scientific and legal literature clearly shows that many partial matches are to be expected under these conditions. The 65,000-some samples then in the database gave rise to over a trillion possible partial matches. (The same combinatorial explosion explains the well-known “birthday paradox” in probability theory.) A mere 150 partial matches out of a trillion opportunities to make such matches represents a quasi-false match rate on the order of 100 per trillion (0.0000000001). This number is not quite zero, but neither is it “high risk.”
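The combinatorial arithmetic can be sketched in a few lines of Python. The 13 loci and the 9-locus partial-match threshold are assumptions about the audit’s matching criteria, used here only to illustrate how the opportunities multiply:

```python
from math import comb

profiles = 65_000   # approximate size of the Arizona database
loci = 13           # CODIS core loci at the time (assumed)
min_shared = 9      # partial-match threshold (assumed)

# Every profile in the database is effectively compared with every other one.
pairs = comb(profiles, 2)  # about 2.1 billion pairwise comparisons

# Number of distinct subsets of loci that would count as a partial match.
ways = sum(comb(loci, k) for k in range(min_shared, loci + 1))  # 1,093

opportunities = pairs * ways  # roughly 2.3 trillion chances for a partial match
rate = 150 / opportunities

print(f"{pairs:,} pairs x {ways:,} ways = {opportunities:.2e} opportunities")
print(f"quasi-false match rate = {rate:.1e}")
```

Under these assumptions, 150 observed partial matches amount to a rate on the order of one in ten billion, which is consistent with the “not quite zero, but not high risk” characterization above.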

Second, Professor Obasogie notes that “There are also problems with the way DNA evidence is interpreted and presented to juries.” True enough. A common problem is the confusion of a random match probability with a source probability, and the op-ed makes this very mistake when it claims that jurors in the San Francisco prosecution of John Puckett five years ago were “told that there was only a one-in-1.1 million chance that this DNA match was pure coincidence.”

According to Mr. Puckett’s brief on appeal, the DNA analyst testified not to this source probability, but only that “the profile would occur at random among unrelated individuals in about 1 in 1.1 million U.S. Caucasians ... .” (The op-ed also mistakenly asserts that the defendant, who died before his appeal was heard, “is now serving a life sentence.”)

How prosecutors should present statistics in the cases in which the match came about from a database trawl is an important question, but it would have been misleading to tell jurors that the “chance that this DNA match [to John Puckett] was pure coincidence” was “one in three,” as Puckett wanted to do. Putting the possibility of laboratory error to the side (as the op-ed does), the probability of a database trawl suggesting that John Puckett was the killer of 22-year-old Diana Sylvester is 1 if Puckett was in fact the killer. It is about 1 in 1.1 million if he was not.
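The two conditional probabilities just stated can be combined into a likelihood ratio, which measures how strongly the match supports the source hypothesis without asserting any source probability. This is only an illustrative sketch, not testimony from the case:

```python
# Likelihood-ratio sketch of the two probabilities discussed above.
rmp = 1 / 1_100_000       # random match probability (1 in 1.1 million)

p_match_if_source = 1.0   # a trawl will match Puckett if he is the source
p_match_if_not = rmp      # it matches him only by coincidence otherwise

likelihood_ratio = p_match_if_source / p_match_if_not
print(f"LR = {likelihood_ratio:,.0f}")
```

The match thus favors the source hypothesis by a factor of about 1.1 million. Notice that this ratio does not depend on the size of the database that was trawled.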

A figure like 1 in 3 therefore is not suitable for presenting to a jury as a measure of how revealing a database match is, but perhaps it is useful for another purpose. Puckett produced the probability by multiplying the tiny random-match probability of 1 in 1.1 million by the size of the database. This multiplication yields an estimate of how often trawling a database populated entirely by people innocent of every crime for which the database is used would produce matches to anyone. The larger the number, the greater the risk that “innocent databases”—those that fail to contain profiles of the individuals leaving their DNA at crime scenes—will lead to false accusations.
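Puckett’s multiplication is easy to reproduce. The database size below is an assumed figure chosen to be consistent with the 1-in-3 result, not a number taken from the record:

```python
rmp = 1 / 1_100_000       # random match probability (1 in 1.1 million)
database_size = 338_000   # assumed number of profiles trawled

# Expected number of purely coincidental matches if NO ONE in the
# database is the true source (an "innocent database").
expected_matches = rmp * database_size
print(f"expected coincidental matches = {expected_matches:.2f}")  # about 0.31
```

An expected value of roughly 0.31 is the “about 1 in 3” figure: it describes how often innocent-database trawls of this size would cough up some match to someone, not the probability that a particular matching person is innocent.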

Of course, we know that this is not a meaningful estimate of the false positive rate of the real databases, for most matches are corroborated with other evidence. If all databases were innocent, and the 1-in-3 number applied, we would be seeing lots of matches to people too young or too old to have committed the crime being investigated, in prison at the time, or having other solid alibis like Mr. Anderson’s.

Thus, it is improbable that strictly coincidental database matches are common—particularly in the run-of-the-mill cases in which the crime-scene DNA is not a mixture or too small or too degraded to be fully typed. Nevertheless, no one can say exactly how often DNA database searches turn up plausible suspects who are, in fact, innocent. This takes me back to Professor Obasogie’s one good point—let’s not get carried away with DNA database matches. There can be innocent explanations for reported matches. But let’s not become overly skeptical of the value of DNA databases to generate investigative leads—and let’s not use bad math to prop up our conjectures.



  1. Professor Kaye,

    You have misquoted and misrepresented my oped. Readers of this blog are welcomed to read the piece at the link below.

    1. The link that is in the main posting (twice) might be more convenient for readers.

      Charges of misquotation and misrepresentation are not to be taken (or made) lightly. I apologize in advance for any such mistakes. All quotations came from cutting and pasting from the op-ed, so I assume that the words appeared there. Needless to say, I read the op-ed about "high risk" more than once to extract its meaning to the best of my ability. Therefore, I am unable to find misquotations or misrepresentations on my own, but I would welcome an opportunity to make any and all corrections that are warranted.