Sunday, September 23, 2018

Is the NorCal Rapist Arrest Another Success Story for Forensic Genomic Genealogy?

On the morning of September 20, police arrested Roy Charles Waller, 58, of Benicia, California, for 10 rapes over a 15-year period starting in 1991. A Washington Post article described the crucial lead in the case as "what's called a familial DNA search." According to the Post,
    The arrest was reminiscent of that of Joseph James DeAngelo, who was arrested in April on the suspicion that he was the so-called “Golden State Killer,” who was wanted for raping dozens of women and killing at least 12 people in a bloody swath of crime that spanned decades in the state.
    Like DeAngelo, Waller was arrested after police searched the genealogy site GEDmatch for leads. In DeAngelo’s case, officials did what’s called a familial DNA search of GEDmatch, in which they sought to find someone who was closely genetically related to and worked backward to find a suspect. Familial DNA searching, particularly as it relates to government-run DNA databases, has come into wider use around the country, but it raises complicated questions about whether it means that the privacy rights of people are forfeited, in effect, by the decisions made by their relatives.
    It was not immediately clear if police did a familial search in Waller’s case or found his profile in GEDmatch.
In forensic genetics, "familial searching" normally refers to trawling a law-enforcement database of DNA profiles after failing to find a match between a profile in the database and a profile from a crime-scene sample at a mere 20 or so STRs (short tandem repeats of a short string of base pairs) in the genome of billions of base pairs. But a near miss to a profile in the database might be the result of kinship -- the database inhabitant does not match, but a very close relative outside the database does. A more precise description of such "familial searching" is outer-directed kinship trawling of law-enforcement databases (Kaye 2013).

The Post's suggestion that the NorCal Rapist may have been identified through "familial DNA searching" and the statement that "[f]amilial DNA searching, particularly as it relates to government-run DNA databases ... raises complicated questions about whether it means that the privacy rights of people are forfeited, in effect, by the decisions made by their relatives" are badly confused. The inhabitants, so to speak, of the law-enforcement databases have not elected to be there. Their relatives, near or distant, have not chosen to enroll in the database. DNA profiles are in these databases because of a conviction or, sometimes, an arrest for certain crimes. The profiles are not useful for much of anything other than personal identification and kinship testing for close relatives (lineal and siblings).

The "genealogy site GEDmatch" is very different. It is populated by individuals who have elected to make their DNA searchable for kinship by other people who are looking for possible biological relatives. These trawls do not use the limited STR profiles developed for both inner- and outer-directed trawls of offender databases. They use the much more extensive set of SNPs (single-nucleotide polymorphisms) -- hundreds of thousands of them -- that genetic testing companies such as 23andMe provide to consumers who are curious about their traits and ancestry. Those data can be used to infer more attenuated kinship. Knowing that long, shared stretches of DNA in the crime-scene sample and the sample in a private (that is, nongovernmental) genealogy database such as GEDmatch could reflect a common ancestor several generations back enables genealogists using public information about families to walk up the family tree to a putative common ancestor and then down again to living relatives.

It is clear enough that California officials used the latter kind of kinship trawling in the publicly accessible genealogy database GEDmatch to find Mr. Waller. At a half-hour-long press conference, Sacramento County District Attorney Anne Marie Schubert refused to give direct answers to questions such as "Did a relative upload something?" and whether investigators had to go through a lot of family members to get to him. However, she did state that "[t]he link was made through genetic genealogy through the use of GEDmatch." Eschewing details, she repeated "It's genetic genealogy, that's what I'll say." If Mr. Waller had put his own genome-wide scan on GEDmatch, why would Ms. Schubert add "a lot of kudos to the folks in our office ... that I call the experts in tree building"? The district attorney's desire not to say anything about specific family members reflects a commendable concern about unnecessarily divulging information about the family, but she could have been more transparent about the investigative procedure without revealing any particularly sensitive information on the family or any material that would interfere with the prosecution.


Friday, September 21, 2018

Forensic Genomic Genealogy and Comparative Justice

Thirty years ago, the introduction of law-enforcement DNA databases for locating the sources of DNA samples recovered from crime scenes and victims was greeted with unbridled enthusiasm from some quarters and deep distrust from others. So too, reactions to the spate of recent arrests in cold cases made possible by forays into DNA databases created for genealogy research have ranged from visions of a golden era for police investigations to glimpses into a dark and dystopian future.

Particularly in the law-enforcement database context, one concern has been the disproportionate impact of confining DNA databases to profiles of individuals who have been arrested for or convicted of crimes. For a variety of reasons, racial minorities tend to be overrepresented in law-enforcement databases (as compared to their percentage of the general population). 1/ But it seems most unlikely that nonwhites are similarly concentrated in the private genealogy databases that have resulted from the growth of recreational genetics. It may be, as Peter Neufeld observed when interviewed about the Golden State Killer arrest, that "[t]here is a whole generation that says, ‘I don’t really care about privacy,’" 2/ but it seems odd to speak of minorities as being "disproportionately affected [by] the unintended consequences of this genetic data" in the newly exploited databases. 3/

If anything, to quote Professor Erin Murphy, "the racial composition of recreational DNA sites -- which heavily skew white -- may end up complementing and balancing that of government databases, which disproportionately contain profiles from persons of color." 4/ That is not much of an argument for widespread forensic genealogy (and Professor Murphy did not rely on it for that purpose). Given how labor intensive forensic genomic genealogy is for genealogists and police, it seems unlikely that the technique for developing investigative leads to distant relatives will be used often enough to produce or correct massive disparities in who is subject to arrest or conviction.

Still, if police routinely were able to obtain complete results on crime-scene DNA with the DNA chips used in genome-wide association studies and recreational genetics, they could easily check whether any of the DNA records on those databases are immediate matches (or indicative of close relatives who might be tracked down without too much effort). The database used in the Golden State Killer case, GEDmatch, has data from a million or so curious individuals in it. That is considerably less than the 16 or 17 million profiles in the FBI's national DNA database (NDIS), but it is far from insignificant.

  1. David H. Kaye & Michael Smith, DNA Identification Databases: Legality, Legitimacy, and the Case for Population-Wide Coverage, 2003 Wisc. L. Rev. 41.
  2. Gina Kolata & Heather Murphy, The Golden State Killer Is Tracked Through a Thicket of DNA, and Experts Shudder, N.Y. Times, Apr. 27, 2018 (quoting Peter Neufeld).
  3. Id.
  4. Erin Murphy, Law and Policy Oversight of Familial Searches in Recreational Genealogy Databases, Forensic Sci. Int’l (2018) (in press).

Monday, September 17, 2018

P-values in "the home for big data in biology"

A p-value indicates the significance of the difference in frequency of the allele tested between cases and controls i.e. the probability that the allele is likely to be associated with the trait. 1/
Such is the explanation of "p-value" provided by "the EMBL-EBI Training Programme [which] is celebrating 10 amazing years of providing onsite, offsite and online training in bioinformatics" for "scientists at all levels." 2/ The explanation appears in the answer to the question "What are genome wide association studies (GWAS)?" 3/ It comes from people in the know -- "The home for big data in biology" 4/ at "Europe's flagship laboratory for the life sciences." 5/

Above the description is a Manhattan plot of the p-values for the differences between the frequencies of the single nucleotide alleles comprising a genome-wide SNP array in samples of "cases" (individuals with a trait) and "controls" (individuals without the trait). Sadly, the p-values in the plot cannot be equated to "the probability that the allele is likely to be associated with the trait."

I. Transposition

To begin with, a p-value is the probability of a difference at least as large as the one observed in the sample of "cases" and the sample of "controls" -- when the probability that a random person has the allele is the same for both cases and controls in the entire population. For a single allele, the expected value of the difference in the sample proportions is zero, but the observed value will vary from one pair of samples to the next. Because most pairs will have small differences, a big difference in a single study is evidence against the hypothesis that there is no difference at the population level. Differences that rarely would occur by chance alone are indicative of a true association. They are not usually false positives. At least, that is the theory behind the p-value.

For example, a p-value of 1/100,000 is a rare occurrence when the two samples come from a population in which the probability of the trait is the same with or without the allele. Consequently, the difference that corresponds to this p-value is thought to be strong evidence that the allele really is more (or less) common among cases than controls.
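
To make the definition concrete, here is a minimal sketch of a two-sided p-value for a difference in allele frequency between cases and controls. The counts are invented for illustration, and the normal-approximation test for two proportions is an assumption chosen for simplicity; it is not necessarily the test used to produce the plot discussed above.

```python
import math

def two_proportion_p_value(carriers_cases, n_cases, carriers_controls, n_controls):
    """Two-sided p-value under the hypothesis that cases and controls
    have the same probability of carrying the allele."""
    p_cases = carriers_cases / n_cases
    p_controls = carriers_controls / n_controls
    # Pooled estimate of the common allele probability under the null hypothesis.
    pooled = (carriers_cases + carriers_controls) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    z = (p_cases - p_controls) / se
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts, not taken from any study.
p = two_proportion_p_value(330, 1000, 250, 1000)
no_difference = two_proportion_p_value(100, 1000, 100, 1000)
print(f"p-value for the hypothetical samples: {p:.2e}")
print(f"p-value when the sample proportions are equal: {no_difference}")
```

The number returned is the probability of a difference at least this large when there is no difference in the population. Nothing in the calculation refers to the probability that the hypothesis itself is true.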

But even if the reasoning that "we would not expect it if H is true, therefore H is likely to be false" is correct, the p-value of 1/100,000 is not "the probability that the allele is likely to be associated with the trait." It is the probability of that sort of a discrepancy if the allele has absolutely no association with the trait. In contrast, the probability of "no association" is not even defined in the statistical framework that gives us p-values.

Another way to say it: The p-value is a statement about the evidence (the observed difference) given the hypothesis of no association. It does not represent the probability of the hypothesis of zero association given an observed association. Equating the probability associated with the evidence with the probability associated with the hypothesis is so common that it has a name -- the transposition fallacy.
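
A rough numerical sketch shows how far apart the two probabilities can be. It steps outside the framework that produces p-values by assuming a prior probability of a true association and a statistical power; both numbers are made up for illustration, and the calculation conditions on the event "p-value at or below the cutoff" rather than on one exact p-value.

```python
def prob_null_given_significant(alpha, prior_assoc, power):
    """P(no association | p-value <= alpha) by Bayes' rule.

    alpha       -- P(p-value <= alpha | no association)
    prior_assoc -- assumed prior probability of a true association
    power       -- assumed P(p-value <= alpha | true association)
    """
    false_pos = alpha * (1 - prior_assoc)
    true_pos = power * prior_assoc
    return false_pos / (false_pos + true_pos)

alpha = 1e-5
# Illustrative assumptions: 1 SNP in 10,000 is truly associated, power is 50%.
posterior_null = prob_null_given_significant(alpha, prior_assoc=1e-4, power=0.5)
print(f"P(evidence | no association)  = {alpha}")
print(f"P(no association | evidence) ~= {posterior_null:.3f}")
```

Under these assumptions the probability of the hypothesis given the evidence is about 0.17, thousands of times larger than the 1/100,000 attached to the evidence given the hypothesis. Different assumed priors would give different answers, which is the point: the p-value alone does not determine the transposed probability.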

II. Multiple Comparisons

A second defect of the EMBL-EBI's definition is that a p-value of, say, 1/100,000 is not a good measure of surprise for GWAS data. Suppose there are 500,000 SNPs in the array and none of their alleles has any true association with the trait. If all apparent associations are independent, the expected number with a p-value of 1/100,000 or less is 500,000 × (1/100,000) = 5. Because of the many opportunities for individually impressive differences to appear, it is no surprise that some alleles have this small p-value. The p-value would have to be much smaller than 1/100,000 for the apparent association to be as surprising as the reported p-value would suggest. P-values that would produce a reasonable false discovery rate could be very small indeed.
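
The arithmetic can be sketched in a few lines. The Bonferroni cutoff at the end is only one conventional correction, and the 0.05 family-wise error rate is an assumed figure; neither comes from EMBL-EBI.

```python
n_snps = 500_000          # SNPs in the array, none truly associated
p_cutoff = 1 / 100_000    # per-SNP p-value cutoff

# Expected number of SNPs reaching the cutoff by chance alone.
expected_false_hits = n_snps * p_cutoff

# Probability that at least one SNP reaches the cutoff, assuming independence.
prob_at_least_one = 1 - (1 - p_cutoff) ** n_snps

# Assumption: a Bonferroni correction with a 0.05 family-wise error rate.
bonferroni_cutoff = 0.05 / n_snps

print(f"expected chance hits: {expected_false_hits:.1f}")
print(f"P(at least one chance hit): {prob_at_least_one:.3f}")
print(f"Bonferroni per-SNP cutoff: {bonferroni_cutoff:.0e}")
```

With independent tests, a chance hit at the 1/100,000 level is close to certain, and the corrected per-SNP cutoff is a hundred times smaller.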

An oversimplified analogy 6/ is this: Flipping a fair coin 18 times and getting 18 heads or tails has a probability on the order of 1/100,000. 7/ Flipping 500,000 coins 18 times each and finding that some of these experiments yielded 18 heads or tails is not strong evidence against the proposition that all the coins are fair. It is just what we would expect to see if the coins are all fair.
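
The coin-flip figures can be checked with exact fractions. This is only a verification sketch, with the coins assumed to be fair and independent.

```python
from fractions import Fraction

flips = 18
n_coins = 500_000

# All heads or all tails: two outcomes out of 2**18 equally likely sequences.
p_all_same = 2 * Fraction(1, 2) ** flips

# Expected number of fair coins that show 18 identical results.
expected_runs = n_coins * p_all_same

# Probability that at least one fair coin does so.
prob_at_least_one = 1 - (1 - float(p_all_same)) ** n_coins

print(f"P(18 heads or 18 tails) = {p_all_same}")
print(f"expected such coins among {n_coins:,}: {float(expected_runs):.2f}")
print(f"P(at least one): {prob_at_least_one:.3f}")
```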

Of course, the bioinformaticists at EMBL-EBI are acutely aware of the effect of multiple comparisons. Their catalog of findings only includes "variant-trait associations ... if they have a p-value <1.0 × 10⁻⁵ in the overall (initial GWAS + replication) population." 8/ But why insist on so small a number if this p-value is "the probability that the allele is likely to be associated with the trait"? For multiple comparisons, the p-value of 1/100,000 is not the measure of surprise that it is supposed to be.

  1. EMBL-EBI, What Are Genome Wide Association Studies (GWAS)?, last visited Sept. 16, 2018.
  2. EMBL-EBI, EMBL-EBI Training, last visited Sept. 16, 2018.
  3. EMBL-EBI, supra note 1.
  4. EMBL-EBI, last visited Sept. 16, 2018.
  5. Id. 
  6. It is oversimplified because not all associations are independent. Nearby SNPs tend to be inherited together, but methods that take account of dependencies and enable researchers to pick out the associations that are real have been studied. 
  7. The more exact probability is 1/131,072.
  8. EMBL-EBI, Where Does the Data Come From?, last visited Sept. 17, 2018. The phrase "the overall (initial GWAS + replication) population" is puzzling. It sounds like the data from an exploratory study are combined with those from the replication study to give a p-value for a larger sample (not a population). If so, the p-values for each study could be more than 1/100,000.

Thursday, August 9, 2018

Consumer Genetic Testing Companies' New Policies for Law Enforcement Requests

Direct-to-consumer genetic testing companies now have "best practices" for privacy that they have pledged to follow. The impetus may have been the arrest in “the Golden State Killer case [made] by comparing DNA from crime scenes with genetic data that the suspect’s relatives had submitted to the testing company GEDmatch” and the agreement by 23andMe to “share user data, with permission, with GlaxoSmithKline after the pharmaceutical giant invested US$300 million.” 1/

The best practices document from the Future of Privacy Forum does not address the constitutional limits on subpoenas or court orders seeking genetic information from the companies. With respect to law enforcement, it merely states that "Genetic Data may be disclosed to law enforcement entities without Consumer consent when required by valid legal process" and that "[w]hen possible, companies will attempt to notify Consumers on the occurrence of personal information releases to law enforcement requests." 2/

23andMe maintains that it does "not provide information to law enforcement unless ... required to comply with a valid subpoena or court order." Another webpage states that it "will not provide information to law enforcement or regulatory authorities unless required by law to comply with a valid court order, subpoena, or search warrant for genetic or Personal Information (visit our Transparency Report)." (Emphasis deleted). But how does 23andMe decide which instruments are valid? The Transparency Report promises that
[W]e use all practical legal and administrative resources to resist such requests. In the event we are required by law to make a disclosure, we will notify you in advance, unless doing so would violate the law or a court order. To learn more about how 23andMe handles law enforcement requests for user information, please see our Guide for Law Enforcement
Does using "all practical legal ... resources" mean that the company will try to quash all subpoenas for samples or genetic data as unreasonable or oppressive (the standard under, for example, Federal Rule of Criminal Procedure 17(c)(2))? That seems doubtful. The "Guide for Law Enforcement" states that
23andMe requires valid legal process in order to consider producing information about our users. 23andMe will only review inquiries as defined in 18 USC § 2703(c)(2) related to to [sic] a valid trial, grand jury or administrative subpoena, warrant, or order. .... 23andMe will only consider inquiries from a government agency with proper jurisdiction. ... 23andMe will assess whether or not it is required by law to comply with the request, based on whether 23andMe is subject to personal jurisdiction in the requesting entity, the validity of the method of service, the relevance of the requested data, the specificity of the request, and other factors.
This explanation does not indicate that the company will make Fourth Amendment arguments on the consumer's behalf against compliance with formally "valid legal process." Nor could it under the long-established doctrine that limits the invocation of Fourth Amendment rights to the party whose rights are at stake. Although the company can complain that a subpoena is vague or overbroad or requires it to do too much work to produce the information, it cannot refuse to comply on the ground that a warrantless search without probable cause violates some Fourth Amendment right of the consumer. 3/

What the company will do (usually) is notify the customer who sent in the DNA sample so that he or she can contest the subpoena:
If 23andMe is required by law to comply with a valid court order, subpoena, or search warrant for genetic or personal information, we will notify the affected individual(s) through the contact information they have provided to us before we disclose this information to law enforcement, unless doing so would violate the law or a court order. We will give them a reasonable period of time to move to quash the subpoena before we answer it.

If law enforcement officials prevent this disclosure by submitting a Delayed Notice Order (DNO) pursuant to 18 U.S.C. § 2705(b) or equivalent state statute that is signed by a judge, we will delay notifying the user until the order expires. 23andMe retains sole discretion to not notify the user if doing so would create a risk of death or serious physical injury to an identifiable individual or group of individuals, and if we are legally permitted to do so. Under these circumstances, we will notify users of the law enforcement request once the emergency situation expires.
In other words, even if the companies do not want to work hand in glove with law enforcement, consumers who order tests cannot expect the new procedures to raise the individuals' constitutional objections to law enforcement demands for genetic samples or data (to the extent that there are any).

  1. Genetic Privacy, 560 Nature 146-147 (2018), doi: 10.1038/d41586-018-05888-2
  2. Future of Privacy Forum, Privacy Best Practices for Consumer Genetic Testing Services, July 2018, at 8. Accompanying footnotes discuss statutory protections for data used in federally funded studies or held by certain medical providers.
  3. Unlike search warrants, subpoenas do not normally require probable cause, and ordinarily, they are not "searches" within the meaning of the Fourth Amendment. In Carpenter v. United States, No. 16–402, 2018 WL 3073916 (U.S. June 22, 2018), however, a sharply divided Court carved out an exception for "the rare case where the suspect has a legitimate privacy interest in records held by a third party." Whether a subpoena for a DNA sample or SNP array data produced and held by companies like 23andMe falls into this category is an intriguing question. Bank records and records of numbers dialed from a home telephone do not, but "a detailed log of a person's movements over several years [or even over a six-day period]" does. One Justice in Carpenter suggested that a search warrant based on probable cause would be required for a DNA sample held by a direct-to-consumer testing company. But even if that is a second "rare case," the opinions in Carpenter did not discuss the standing of the company to raise the personal right of its customers.

Friday, July 27, 2018

The ACLU’s In-Your-Face Test of Facial Recognition Software

The ACLU has reported that Amazon’s facial recognition “software incorrectly matched 28 members of Congress, identifying them as other people who have been arrested for a crime.” [1] This figure is calculated to impress the very legislators the ACLU is asking to “enact a moratorium on law enforcement use of face recognition.” All these false matches, the organization announced, create “28 more causes for concern.” Inasmuch as there are 535 members of Congress (Senators plus Representatives), the false-match rate is 5%.

Or is it? The ACLU’s webpage states that
To conduct our test, we used the exact same [sic] facial recognition system that Amazon offers to the public, which anyone could use to scan for matches between images of faces. And running the entire test cost us $12.33 — less than a large pizza.

Using Rekognition, we built a face database and search tool using 25,000 publicly available arrest photos. Then we searched that database against public photos of every current member of the House and Senate. We used the default match settings that Amazon sets for Rekognition.
So there were 535 × 25,000 = 13,375,000 comparisons. With that denominator, the false-match rate is about 2 per million (0.0002%).
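
The two denominators can be compared directly. This sketch only restates the arithmetic, using the figures reported by the ACLU.

```python
false_matches = 28
members = 535        # Senators plus Representatives
mugshots = 25_000    # arrest photos in the ACLU's database

# Rate per member of Congress searched.
rate_per_member = false_matches / members

# Rate per pairwise comparison of a member's photo with a mugshot.
comparisons = members * mugshots
rate_per_comparison = false_matches / comparisons

print(f"per member:     {rate_per_member:.1%}")
print(f"comparisons:    {comparisons:,}")
print(f"per comparison: {rate_per_comparison * 1e6:.2f} per million")
```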

But none of these figures—28, 5%, or 0.0002%—means very much, since the ACLU’s “test” used a low level of similarity to make its matches. The default setting for the classifier is 80%. Police agencies do not use this weak a threshold [2, 3]. Using a low figure like 80% ensures that there will be more false matches among so many comparisons. Amazon recommends that police who use its system raise the threshold to 95%. The ACLU apparently neglected to adjust the level (even though it would have cost less than a large pizza). Or, worse, it tried the system at the higher level and chose not to report an outcome that probably would have had fewer "causes for concern." Either way, public discourse would benefit from more complete testing or reporting.

It also is unfortunate that Amazon and journalists [2, 3] call the threshold for matches a “confidence threshold.” The percentage is not a measure of how confident one can be in the result. It is not the probability of a true match given a classified match. It is not a probability at all. It is a similarity score on a scale of 0 to 1. A similarity score of 0.95, or 95%, does not even mean that the paired images are 95% similar in an intuitively obvious sense.

The software does give a “confidence value,” which sounds like a probability, but the Amazon documentation I have skimmed suggests that this quantity relates to some kind of “confidence” in the conclusion that a face (as opposed to anything else) is within the rectangle of pixels (the “bounding box”). The Developer Guide states that [4]
For each face match, the response provides a bounding box of the face, facial landmarks, pose details (pitch, roll, and yaw), quality (brightness and sharpness), and confidence value (indicating the level of confidence that the bounding box contains a face). The response also provides a similarity score, which indicates how closely the faces match.
and [5]
For each face match that was found, the response includes similarity and face metadata, as shown in the following example response [sic]:
    "FaceMatches": [
        {
            "Similarity": 100.0,
            "Face": {
                "BoundingBox": {
                    "Width": 0.6154,
                    "Top": 0.2442,
                    "Left": 0.1765,
                    "Height": 0.4692
                },
                "FaceId": "84de1c86-5059-53f2-a432-34ebb704615d",
                "Confidence": 99.9997,
                "ImageId": "d38ebf91-1a11-58fc-ba42-f978b3f32f60"
            }
        },
        {
            "Similarity": 84.6859,
            "Face": {
                "BoundingBox": {
                    "Width": 0.2044,
                    "Top": 0.2254,
                    "Left": 0.4622,
                    "Height": 0.3119
                },
                "FaceId": "6fc892c7-5739-50da-a0d7-80cc92c0ba54",
                "Confidence": 99.9981,
                "ImageId": "5d913eaf-cf7f-5e09-8c8f-cb1bdea8e6aa"
            }
        }
    ]
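
The two entries in this excerpt happen to illustrate the effect of the threshold. The sketch below copies their similarity and confidence values into a small JSON string (the outer braces and the helper function are my own additions, not part of the Amazon documentation) and filters the matches after the fact at 80% and at 95%.

```python
import json

# Values copied from the excerpt above; bounding boxes and image IDs omitted.
response_text = """
{
  "FaceMatches": [
    {"Similarity": 100.0,
     "Face": {"FaceId": "84de1c86-5059-53f2-a432-34ebb704615d",
              "Confidence": 99.9997}},
    {"Similarity": 84.6859,
     "Face": {"FaceId": "6fc892c7-5739-50da-a0d7-80cc92c0ba54",
              "Confidence": 99.9981}}
  ]
}
"""

def matches_at_threshold(response, threshold):
    """Hypothetical helper: keep matches whose similarity meets the threshold."""
    return [m for m in response["FaceMatches"] if m["Similarity"] >= threshold]

response = json.loads(response_text)
at_80 = matches_at_threshold(response, 80.0)
at_95 = matches_at_threshold(response, 95.0)

print(f"matches at 80%: {len(at_80)}")
print(f"matches at 95%: {len(at_95)}")
for match in response["FaceMatches"]:
    print(match["Similarity"], match["Face"]["Confidence"])
```

Both entries carry a "Confidence" above 99.99 even though one has a similarity score below 85, which is consistent with reading the confidence value as a statement about the bounding box rather than about the match.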
From a statistical standpoint, the ACLU’s finding is no surprise. Researchers encounter the false discovery problem with big data sets every day. If you make enough comparisons with a highly accurate system, a small fraction will be false alarms. Police are well advised to use facial recognition software in the same manner as automated fingerprint identification systems—not as simple, single-source classifiers, but rather as a screening tool to generate a list of potential sources. And, they can have more confidence in classified matches from comparisons in a small database of images of, say, dangerous fugitives than in a reported hit to one of thousands upon thousands of mug shots.
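
A back-of-the-envelope sketch shows why database size matters. It assumes, purely for illustration, that every comparison has the same independent chance of a false match, and it borrows the per-comparison rate implied by the ACLU's figures at the 80% setting; real systems will not behave this neatly.

```python
# Per-comparison false-match rate implied by the ACLU's reported figures.
false_matches = 28
comparisons = 535 * 25_000
rate = false_matches / comparisons

def prob_any_false_match(rate, database_size):
    """P(at least one false match) when one probe image is compared with
    database_size images, assuming independent, equally likely errors."""
    return 1 - (1 - rate) ** database_size

small = prob_any_false_match(rate, 100)      # e.g., a short list of fugitives
large = prob_any_false_match(rate, 25_000)   # the size of the ACLU's database

print(f"database of 100 images:    {small:.4%}")
print(f"database of 25,000 images: {large:.2%}")
```

On these assumptions, a single probe has about a 5% chance of a false match somewhere in 25,000 mug shots and a far smaller chance in a list of 100.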

These observations do not negate the privacy concerns with applying facial recognition software to public surveillance systems. Moreover, I have not discussed the ACLU’s statistics on differences in false-positive rates by race. There are important issues of privacy and equality at stake. In addressing these issues, however, a greater degree of statistical sophistication would be in order.

  1. Jacob Snow, Amazon’s Face Recognition Falsely Matched 28 Members of Congress with Mugshots, July 26, 2018, 8:00 AM.
  2. Natasha Singer, Amazon’s Facial Recognition Wrongly Identifies 28 Lawmakers, A.C.L.U. Says, N.Y. Times, July 26, 2018.
  3. Ryan Suppe, Amazon's Facial Recognition Tool Misidentified 28 Members of Congress in ACLU Test, USA Today, July 26, 2018.
  4. Amazon Rekognition Developer Guide: CompareFaces.
  5. Amazon Rekognition Developer Guide: SearchFaces Operation Response.