Tuesday, October 4, 2011

An Odd Set of Odds in Kinship Matching with DNA Databases

The 22d International Symposium on the Future of Human Identification began yesterday with a set of workshops. One was on "familial searching." The phrase refers to trawling the profiles in a DNA database for certain types of partial matches to a DNA profile from a crime-scene sample.

Partial matches that are useful in generating investigative leads to family members arise much more often when a particular kind of relative (say, a full sibling) is the source of the crime-scene sample than when an individual who is not closely related to the database inhabitant is the source. The ratio of the probability of the partial match under the former condition (a given genetic relationship) to its probability under the latter (unrelated individuals) is a likelihood ratio (LR). The LR (or, technically, its logarithm) for siblingship expresses the weight of the evidence in favor of the hypothesis that the source is a full sibling as opposed to an unrelated individual.

After explaining this idea, the first speaker presented the following formula:
"Odds" = LRautosomal x LRY-STR x 1/N         (1)
She attributed this formula to the California state DNA laboratory that does familial searching in that state. In this equation, N is the size of the database, LRautosomal is the likelihood ratio for the partial match at a set of autosomal STR loci, and LRY-STR is the likelihood ratio for the matching Y-STR haplotype.

She described this as a Bayesian computation that could lead to statements in court such as "there is a 98% probability" that the person whose DNA was found at the crime scene is a brother of Joe Smith, a convicted offender whose DNA profile is in a DNA database.
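
As a rough illustration of the arithmetic behind equation (1), the following Python sketch multiplies the two likelihood ratios by 1/N to get the "odds" and then converts them into a probability. The function names and the numerical inputs are hypothetical, chosen only so that the result comes out at 98%; they are not figures from the California laboratory or from the workshop.

```python
def equation_1_odds(lr_autosomal, lr_ystr, n):
    """Odds as given by equation (1): LRautosomal x LRY-STR x 1/N."""
    return lr_autosomal * lr_ystr / n


def odds_to_probability(odds):
    """Convert odds of o to 1 into a probability, o / (1 + o)."""
    return odds / (1.0 + odds)


# Hypothetical inputs: two LRs of 7,000 each and a database of one million profiles.
odds = equation_1_odds(7000.0, 7000.0, 1_000_000)
probability = odds_to_probability(odds)
print(f"odds = {odds:.0f} to 1, probability = {probability:.2f}")
# prints: odds = 49 to 1, probability = 0.98
```

Note that the 98% figure depends entirely on treating 1/N as the prior odds.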

There are three interesting things to note about these suggestions. To begin with, it is not clear why such a statement would be introduced in a trial. By the time the suspect has become a defendant, a new sample of his DNA should have been tested to establish a full match to the crime-scene sample. At that point, why would the judge or jury care whether the defendant is related to a database inhabitant? The relevance of the DNA evidence lies in the full match to the crime-scene sample, and the jury need not consider whether the defendant is a relative of someone not involved in the alleged crime. (One might ask whether the trawl through the database somehow degrades the probative value of the full match, but, if anything, it increases it. [1])

The issue could arise, however, if police were to seek a court order or search warrant to collect a DNA sample from the suspect. At that point, they would need to describe the significance of the partial match to the convicted offender.

This possibility brings us to the second noteworthy point about equation (1). The "odds" (or the corresponding probability) are not the way to present the weight of the partial match. Consider the prior probability of a match in a small database, say, of size N=2. Prior to considering the partial match, why would one think that the probability of a database inhabitant being the sibling of the criminal who resides outside the database is 1/N = 1/2? It is quite improbable that a database of two people includes a relative of every criminal who leaves DNA at a crime scene. The a priori probability for a small database must be closer to 0 than 1/N.

That the prior probability is less than 1/N is a general result. The only exception occurs when it is absolutely certain that a sibling of the perpetrator is in the database. On that assumption, prior odds of 1 to N-1 are not unreasonable. But that assumption is entirely artificial, and to advise a magistrate that the posterior odds have the value computed according to (1) would be to overstate the implications of the partial match.
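
One way to see this is to write the prior in terms of the chance, here called p_sibling_in_db, that the database contains a sibling of the perpetrator at all. The sketch below is a simplified model and not part of the speaker's presentation. It assumes that at most one sibling is in the database and that, before the DNA evidence is considered, each of the N inhabitants is equally likely to be that sibling. The names and the 5% figure are hypothetical.

```python
def prior_probability(n, p_sibling_in_db):
    """Prior probability that one particular inhabitant is the sibling.

    Assumes at most one sibling in the database and no reason, before the
    DNA comparison, to favor any of the n inhabitants over another.
    """
    return p_sibling_in_db / n


def prior_odds(n, p_sibling_in_db):
    """Prior odds corresponding to prior_probability (requires probability < 1)."""
    p = prior_probability(n, p_sibling_in_db)
    return p / (1.0 - p)


n = 2
# Sibling certainly in the database: odds of 1 to N-1, which is 1.0 for N = 2.
print(prior_odds(n, 1.0))
# Hypothetical 5% chance that a sibling is in the database: much smaller odds.
print(prior_odds(n, 0.05))
```

Under these assumptions the prior probability is p_sibling_in_db/N, which equals 1/N only when p_sibling_in_db is 1.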

The third thing to note about dividing by N is that it accomplishes nothing in producing a viable list of partially matching profiles in a DNA database trawl. The straightforward approach is to produce a short list of candidates in the database whose first-degree relatives might be the source of the crime-scene sample. The minimum value of LRautosomal x LRY-STR should be large enough to keep the two conditional error probabilities (including a candidate when there is no relationship, and not including a candidate when there is a relationship) small. This threshold value does not depend on N. (A later speaker made this observation.)
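
A minimal sketch of this point, using made-up candidate labels and LR values: a shortlist is formed by keeping every profile whose combined LR meets a fixed threshold, and dividing every LR by the same N does not change how the candidates rank against one another. The threshold of 10,000 is an arbitrary placeholder, not a recommended value.

```python
def combined_lr(lr_autosomal, lr_ystr):
    """Product of the autosomal and Y-STR likelihood ratios."""
    return lr_autosomal * lr_ystr


def shortlist(candidates, threshold):
    """Return (label, combined LR) pairs at or above threshold, largest first."""
    scored = [(label, combined_lr(a, y)) for label, a, y in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical database hits: (label, LRautosomal, LRY-STR).
candidates = [("A", 500.0, 40.0), ("B", 20.0, 1.0), ("C", 3000.0, 40.0)]

# Shortlist by combined LR alone; N plays no role.
print([label for label, _ in shortlist(candidates, threshold=10_000.0)])

# Ranking by the "odds" of equation (1) for two very different database sizes.
for n in (1_000, 1_000_000):
    by_odds = sorted(
        candidates,
        key=lambda c: combined_lr(c[1], c[2]) / n,
        reverse=True,
    )
    print(n, [c[0] for c in by_odds])
```

The ordering of the candidates is the same for both values of N, because dividing by N rescales every score by the same factor.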

Equation (1), it seems, is useless. Instead, the magistrate should be told the value of the LR and how often such large LRs would occur when a crime-scene sample comes from a relative versus how often they would occur when it comes from an unrelated person.
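
The two frequencies mentioned here could be estimated from simulated LRs, for example by computing LRs for many simulated sibling pairs and for many simulated unrelated pairs. The sketch below assumes such simulated values are already available. The short lists in it are placeholders for illustration, not data, and the function name is hypothetical.

```python
def exceedance_rate(simulated_lrs, observed_lr):
    """Fraction of simulated LRs that are at least as large as observed_lr."""
    if not simulated_lrs:
        raise ValueError("need at least one simulated LR")
    hits = sum(1 for lr in simulated_lrs if lr >= observed_lr)
    return hits / len(simulated_lrs)


# Placeholder values standing in for large simulation runs.
lrs_siblings = [50.0, 8_000.0, 25_000.0, 400_000.0]
lrs_unrelated = [0.001, 0.2, 3.0, 12_000.0]

observed = 10_000.0
print(exceedance_rate(lrs_siblings, observed))   # prints 0.5
print(exceedance_rate(lrs_unrelated, observed))  # prints 0.25
```

Neither rate depends on the size of the database.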


1. David H. Kaye, 2009, Rounding Up the Usual Suspects: A Legal and Logical Analysis of DNA Database Trawls, North Carolina Law Review, 87(2), 425-503.
