Sunday, September 23, 2018

Is the NorCal Rapist Arrest Another Success Story for Forensic Genomic Genealogy?

On the morning of September 20, police arrested Roy Charles Waller, 58, of Benicia, California, for 10 rapes over a 15-year period starting in 1991. A Washington Post article described the crucial lead in the case as "what's called a familial DNA search." According to the Post,
    The arrest was reminiscent of that of Joseph James DeAngelo, who was arrested in April on the suspicion that he was the so-called “Golden State Killer,” who was wanted for raping dozens of women and killing at least 12 people in a bloody swath of crime that spanned decades in the state.
    Like DeAngelo, Waller was arrested after police searched the genealogy site GEDmatch for leads. In DeAngelo’s case, officials did what’s called a familial DNA search of GEDmatch, in which they sought to find someone who was closely genetically related to and worked backward to find a suspect. Familial DNA searching, particularly as it relates to government-run DNA databases, has come into wider use around the country, but it raises complicated questions about whether it means that the privacy rights of people are forfeited, in effect, by the decisions made by their relatives.
    It was not immediately clear if police did a familial search in Waller’s case or found his profile in GEDmatch.
In forensic genetics, "familial searching" normally refers to trawling a law-enforcement database of DNA profiles after failing to find a match between a profile in the database and a profile from a crime-scene sample at a mere 20 or so STRs (short tandem repeats of a short string of base pairs) in the genome of billions of base pairs. But a near miss to a profile in the database might be the result of kinship -- the database inhabitant does not match, but a very close relative outside the database does. A more precise description of such "familial searching" is outer-directed kinship trawling of law-enforcement databases (Kaye 2013).

The Post's suggestion that the NorCal Rapist may have been identified through "familial DNA searching" and its statement that "[f]amilial DNA searching, particularly as it relates to government-run DNA databases ... raises complicated questions about whether it means that the privacy rights of people are forfeited, in effect, by the decisions made by their relatives" are badly confused. The inhabitants, so to speak, of the law-enforcement databases have not elected to be there. Neither have their relatives, near or distant, chosen to enroll. DNA profiles are in these databases because of a conviction or, sometimes, an arrest for certain crimes, and the profiles are not useful for much of anything other than personal identification and kinship testing for close relatives (lineal relatives and siblings).

The "genealogy site GEDmatch" is very different. It is populated by individuals who have elected to make their DNA searchable for kinship by other people who are looking for possible biological relatives. These trawls do not use the limited STR profiles developed for both inner- and outer-directed trawls of offender databases. They use the much more extensive set of SNPs (single-nucleotide polymorphisms) -- hundreds of  thousands of them -- that genetic testing companies such as 23-and-me provide to consumers who are curious about their traits and ancestry. Those data can be used to infer more attenuated kinship. Knowing that long, shared stretches of DNA in the crime-scene sample and the sample in a private (that is, nongovernmental) genealogy database such as GEDmatch could reflect a common ancestor several generations back enables genealogists using public information about families to walk up the family tree to a putative common ancestor and then down again to living relatives.

It is clear enough that California officials used the latter kind of kinship trawling in the publicly accessible genealogy database GEDmatch to find Mr. Waller. At a half-hour-long press conference, Sacramento County District Attorney Anne Marie Schubert refused to give direct answers to questions such as "Did a relative upload something?" and "Did you have to go through a lot of family members to get to him?" However, she did state that "[t]he link was made through genetic genealogy through the use of GEDmatch." Eschewing details, she repeated, "It's genetic genealogy, that's what I'll say." If Mr. Waller had put his own genome-wide scan on GEDmatch, why would Ms. Schubert add "a lot of kudos to the folks in our office ... that I call the experts in tree building"? The district attorney's desire not to say anything about specific family members reflects a commendable concern about unnecessarily divulging information about the family, but she could have been more transparent about the investigative procedure without revealing any particularly sensitive information about the family or any material that would interfere with the prosecution.

REFERENCES

David H. Kaye, The Genealogy Detectives: A Constitutional Analysis of "Familial Searching", 50 Am. Crim. L. Rev. 109 (2013)

Friday, September 21, 2018

Forensic Genomic Genealogy and Comparative Justice

Thirty years ago, the introduction of law-enforcement DNA databases for locating the sources of DNA samples recovered from crime scenes and victims was greeted with unbridled enthusiasm from some quarters and deep distrust from others. So too, reactions to the spate of recent arrests in cold cases made possible by forays into DNA databases created for genealogy research have ranged from visions of a golden era for police investigations to glimpses into a dark and dystopian future.

Particularly in the law-enforcement database context, one concern has been the disproportionate impact of confining DNA databases to profiles of individuals who have been arrested for or convicted of crimes. For a variety of reasons, racial minorities tend to be overrepresented in law-enforcement databases (as compared to their percentage of the general population). 1/ But it seems most unlikely that nonwhites are similarly concentrated in the private genealogy databases that have resulted from the growth of recreational genetics. It may be, as Peter Neufeld observed when interviewed about the Golden State Killer arrest, that "[t]here is a whole generation that says, ‘I don’t really care about privacy,’" 2/ but it seems odd to speak of minorities as being "disproportionately affected [by] the unintended consequences of this genetic data" in the newly exploited databases. 3/

If anything, to quote Professor Erin Murphy, "the racial composition of recreational DNA sites -- which heavily skew white -- may end up complementing and balancing that of government databases, which disproportionately contain profiles from persons of color." 4/ That is not much of an argument for widespread forensic genealogy (and Professor Murphy did not rely on it for that purpose). Given how labor intensive forensic genomic genealogy is for genealogists and police, it seems unlikely that the technique for developing investigative leads to distant relatives will be used often enough to produce or correct massive disparities in who is subject to arrest or conviction.

Still, if police routinely were able to obtain complete results on crime-scene DNA with the DNA chips used in genome-wide association studies and recreational genetics, they could easily check whether any of the DNA records in those databases are immediate matches (or indicative of close relatives who might be tracked down without too much effort). The database used in the Golden State Killer case, GEDmatch, has data from a million or so curious individuals in it. That is considerably fewer than the 16 or 17 million profiles in the FBI's national DNA database (NDIS), but it is far from insignificant.

NOTES
  1. David H. Kaye & Michael Smith, DNA Identification Databases: Legality, Legitimacy, and the Case for Population-Wide Coverage, 2003 Wisc. L. Rev. 41.
  2. Gina Kolata & Heather Murphy, The Golden State Killer Is Tracked Through a Thicket of DNA, and Experts Shudder, N.Y. Times, Apr. 27, 2018 (quoting Peter Neufeld).
  3. Id.
  4. Erin Murphy, Law and Policy Oversight of Familial Searches in Recreational Genealogy Databases, Forensic Sci. Int’l (2018) (in press).

Monday, September 17, 2018

P-values in "the home for big data in biology"

A p-value indicates the significance of the difference in frequency of the allele tested between cases and controls i.e. the probability that the allele is likely to be associated with the trait. 1/
Such is the explanation of "p-value" provided by "the EMBL-EBI Training Programme [which] is celebrating 10 amazing years of providing onsite, offsite and online training in bioinformatics" for "scientists at all levels." 2/ The explanation appears in the answer to the question "What are genome wide association studies (GWAS)?" 3/ It comes from people in the know -- "The home for big data in biology" 4/ at "Europe's flagship laboratory for the life sciences." 5/

Above the description is a Manhattan plot of the p-values for the differences between the frequencies of the single-nucleotide alleles comprising a genome-wide SNP array in samples of "cases" (individuals with the trait) and "controls" (individuals without it). Sadly, the p-values in the plot cannot be equated to "the probability that the allele is likely to be associated with the trait."

I. Transposition

To begin with, a p-value is the probability of a difference at least as large as the one observed in the sample of "cases" and the sample of "controls" -- when the probability that a random person has the allele is the same for both cases and controls in the entire population. For a single allele, the expected value of the difference in the sample proportions is zero, but the observed value will vary from one pair of samples to the next. Because most pairs will have small differences, a big difference in a single study is evidence against the hypothesis that there is no difference at the population level. Differences that rarely would occur by chance alone are indicative of a true association. They are not usually false positives. At least, that is the theory behind the p-value.
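
For readers who want to see the calculation, here is a minimal sketch in Python (using scipy) with invented allele counts, not data from any actual GWAS. The p-value is computed under the hypothesis that the allele frequency is identical in the case and control populations:

    # Invented counts for one SNP: rows are cases and controls, columns are
    # chromosomes carrying the allele and chromosomes not carrying it.
    from scipy.stats import chi2_contingency

    cases    = [530, 1470]   # 2,000 case chromosomes, 26.5% with the allele
    controls = [450, 1550]   # 2,000 control chromosomes, 22.5% with the allele

    chi2, p_value, dof, expected = chi2_contingency([cases, controls])
    print(p_value)  # roughly 0.004: a difference this large would arise by chance
                    # in fewer than 1 in 200 case-control samples of this size
                    # if the allele had no association with the trait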

For example, a difference large enough to have a p-value of 1/100,000 is a rare occurrence when the two samples come from a population in which the probability of the trait is the same with or without the allele. Consequently, the difference that corresponds to this p-value is thought to be strong evidence that the allele really is more (or less) common among cases than controls.

But even if the reasoning that "we would not expect it if H is true, therefore H is likely to be false" is correct, the p-value of 1/100,000 is not "the probability that the allele is likely to be associated with the trait." It is the probability of that sort of a discrepancy if the allele has absolutely no association with the trait. In contrast, the probability of "no association" is not even defined in the statistical framework that gives us p-values.

Another way to say it: The p-value is a statement about the evidence (the observed difference) given the hypothesis of no association. It does not represent the probability of the hypothesis of zero association given an observed association. Equating the probability associated with the evidence with the probability associated with the hypothesis is so common that it has a name -- the transposition fallacy.
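
A back-of-the-envelope calculation (with assumed numbers, not anything EMBL-EBI reports) shows how far apart the two probabilities can be. The probability of "no association" given a small p-value also depends on the prior plausibility of an association and on the study's power, neither of which the p-value incorporates:

    # All numbers below are assumptions chosen only to illustrate the transposition fallacy.
    alpha = 1e-5            # chance of a p-value this small when there is no association
    power = 0.5             # assumed chance of a p-value this small when the association is real
    prior_true = 1 / 10_000 # assumed prior: 1 SNP in 10,000 on the array is truly associated

    # Joint probabilities of "p-value this small" occurring together with each hypothesis:
    joint_with_no_association = alpha * (1 - prior_true)
    joint_with_association = power * prior_true

    posterior_no_association = joint_with_no_association / (
        joint_with_no_association + joint_with_association)
    print(posterior_no_association)  # about 0.17 -- nowhere near the p-value of 0.00001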

II. Multiple Comparisons

A second defect of the EMBL-EBI's definition is that a p-value of, say, 1/100,000 is not a good measure of surprise for GWAS data. Suppose there are 500,000 SNPs in the array and none of their alleles has any true association with the trait. If all apparent associations are independent, the expected number of SNPs with a p-value that small is 500,000 × (1/100,000) = 5. Because of the many opportunities for individually impressive differences to appear, it is no surprise that some alleles attain such a small p-value. The p-value would have to be much smaller than 1/100,000 for the apparent association to be as surprising as the reported p-value would suggest. The p-value threshold needed to keep the false discovery rate reasonable could be very small indeed.

An oversimplified analogy 6/ is this: Flipping a fair coin 18 times and getting 18 heads or tails has a probability on the order of 1/100,000. 7/ Flipping 500,000 coins 18 times each and finding that some of these experiments yielded 18 heads or tails is not strong evidence against the proposition that all the coins are fair. It is just what we would expect to see if the coins are all fair.
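
The analogy is easy to check by simulation. A few lines of Python (again, purely illustrative) flip 500,000 fair coins 18 times each and count the "all heads or all tails" outcomes; the expected count is 500,000/131,072, a bit under 4:

    # Simulate 500,000 fair coins flipped 18 times each and count the experiments that
    # come up all heads or all tails -- "significant" at roughly the 1/100,000 level
    # even though every coin is fair.
    import random

    random.seed(1)
    extreme = 0
    for _ in range(500_000):
        heads = sum(random.random() < 0.5 for _ in range(18))
        if heads == 0 or heads == 18:
            extreme += 1
    print(extreme)  # typically a handful (expected value about 3.8)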

Of course, the bioinformaticists at EMBL-EBI are acutely aware of the effect of multiple comparisons. Their catalog of findings only includes "variant-trait associations ... if they have a p-value < 1.0 × 10⁻⁵ in the overall (initial GWAS + replication) population." 8/ But why insist on so small a number if this p-value is "the probability that the allele is likely to be associated with the trait"? For multiple comparisons, the p-value of 1/100,000 is not the measure of surprise that it is supposed to be.

NOTES
  1. EMBL-EBI, What Are Genome Wide Association Studies (GWAS)?, https://www.ebi.ac.uk/training/online/course/gwas-catalog-exploring-snp-trait-associations/why-do-we-need-gwas-catalog/what-are-genom, last visited Sept. 16, 2018.
  2. EMBL-EBI, EMBL-EBI Training, https://www.ebi.ac.uk/training/online/course/gwas-catalog-exploring-snp-trait-associations/why-do-we-need-gwas-catalog/what-are-genome, last visited Sept. 16, 2018.
  3. EMBL-EBI, supra note 1.
  4. EMBL-EBI, https://www.ebi.ac.uk/, last visited Sept. 16, 2018.
  5. Id. 
  6. It is oversimplified because not all associations are independent. Nearby SNPs tend to be inherited together, but methods that take account of dependencies and enable researchers to pick out the associations that are real have been studied. 
  7. The more exact probability is 1/131,072.
  8. EMBL-EBI, Where Does the Data Come From?, https://www.ebi.ac.uk/training/online/course/gwas-catalog-exploring-snp-trait-associations/what-gwas-catalog/where-does-data-come, last visited Sept. 17, 2018. The phrase "the overall (initial GWAS + replication) population" is puzzling. It sounds like the data from an exploratory study are combined with those from the replication study to give a p-value for a larger sample (not a population). If so, the p-values for each study could be more than 1/100,000.

Thursday, August 9, 2018

Consumer Genetic Testing Companies' New Policies for Law Enforcement Requests

Direct-to-consumer genetic testing companies now have "best practices" for privacy that they have pledged to follow. The impetus may have been the arrest in “the Golden State Killer case [made] by comparing DNA from crime scenes with genetic data that the suspect’s relatives had submitted to the testing company GEDmatch” and the agreement by 23andMe to “share user data, with permission, with GlaxoSmithKline after the pharmaceutical giant invested US$300 million.” 1/

The best practices document from the Future of Privacy Forum does not address the constitutional limits on subpoenas or court orders seeking genetic information from the companies. With respect to law enforcement, it merely states that "Genetic Data may be disclosed to law enforcement entities without Consumer consent when required by valid legal process" and that "[w]hen possible, companies will attempt to notify Consumers on the occurrence of personal information releases to law enforcement requests." 2/

23andMe maintains that it does "not provide information to law enforcement unless ... required to comply with a valid subpoena or court order." Another webpage states that it "will not provide information to law enforcement or regulatory authorities unless required by law to comply with a valid court order, subpoena, or search warrant for genetic or Personal Information (visit our Transparency Report)." (Emphasis deleted). But how does 23andMe decide which instruments are valid? The Transparency Report promises that
[W]e use all practical legal and administrative resources to resist such requests. In the event we are required by law to make a disclosure, we will notify you in advance, unless doing so would violate the law or a court order. To learn more about how 23andMe handles law enforcement requests for user information, please see our Guide for Law Enforcement
Does using "all practical legal ... resources" mean that the company will try to quash all subpoenas for samples or genetic data as unreasonable or oppressive (the standard under, for example, Federal Rule of Criminal Procedure 17(c)(2))? That seems doubtful. The "Guide for Law Enforcement" states that
23andMe requires valid legal process in order to consider producing information about our users. 23andMe will only review inquiries as defined in 18 USC § 2703(c)(2) related to to [sic] a valid trial, grand jury or administrative subpoena, warrant, or order. .... 23andMe will only consider inquiries from a government agency with proper jurisdiction. ... 23andMe will assess whether or not it is required by law to comply with the request, based on whether 23andMe is subject to personal jurisdiction in the requesting entity, the validity of the method of service, the relevance of the requested data, the specificity of the request, and other factors.
This explanation does not indicate that the company will make Fourth Amendment arguments on the consumer's behalf against compliance with formally "valid legal process." Nor could it under the long-established doctrine that limits the invocation of Fourth Amendment rights to the party whose rights are at stake. Although the company can complain that a subpoena is vague or overbroad or requires it to do too much work to produce the information, it cannot refuse to comply on the ground that a warrantless search without probable cause violates some Fourth Amendment right of the consumer. 3/

What the company will do (usually) is notify the customer who sent in the DNA sample so that he or she can contest the subpoena:
If 23andMe is required by law to comply with a valid court order, subpoena, or search warrant for genetic or personal information, we will notify the affected individual(s) through the contact information they have provided to us before we disclose this information to law enforcement, unless doing so would violate the law or a court order. We will give them a reasonable period of time to move to quash the subpoena before we answer it.

If law enforcement officials prevent this disclosure by submitting a Delayed Notice Order (DNO) pursuant to 18 U.S.C. § 2705(b) or equivalent state statute that is signed by a judge, we will delay notifying the user until the order expires. 23andMe retains sole discretion to not notify the user if doing so would create a risk of death or serious physical injury to an identifiable individual or group of individuals, and if we are legally permitted to do so. Under these circumstances, we will notify users of the law enforcement request once the emergency situation expires.
In other words, even if the companies do not want to work hand in glove with law enforcement, consumers who order tests cannot expect the companies, under the new procedures, to raise the consumers' own constitutional objections (to the extent that there are any) to law enforcement demands for genetic samples or data.

NOTES
  1. Genetic Privacy, 560 Nature 146-147 (2018), doi: 10.1038/d41586-018-05888-2
  2. Future of Privacy Forum, Privacy Best Practices for Consumer Genetic Testing Services, July 2018, at 8. Accompanying footnotes discuss statutory protections for data used in federally funded studies or held by certain medical providers.
  3. Unlike search warrants, subpoenas do not normally require probable cause, and ordinarily, they are not "searches" within the meaning of the Fourth Amendment. In Carpenter v. United States, No. 16–402, 2018 WL 3073916 (U.S. June 22, 2018), however, a sharply divided Court carved out an exception for "the rare case where the suspect has a legitimate privacy interest in records held by a third party." Whether a subpoena for a DNA sample or SNP array data produced and held by companies like 23andMe falls into this category is an intriguing question. Bank records and records of numbers dialed from a home telephone do not, but "a detailed log of a person's movements over several years [or even over a six-day period]" do. One Justice in Carpenter suggested that a search warrant based on probable cause would be required for a DNA sample held by a direct-to-consumer testing company. But even if that is a second "rare case," the opinions in Carpenter did not discuss the standing of the company to raise the personal right of its customers.

Friday, July 27, 2018

The ACLU’s In-Your-Face Test of Facial Recognition Software

The ACLU has reported that Amazon’s facial recognition “software incorrectly matched 28 members of Congress, identifying them as other people who have been arrested for a crime.” [1] This figure is calculated to impress the very legislators the ACLU is asking to “enact a moratorium on law enforcement use of face recognition.” All these false matches, the organization announced, create “28 more causes for concern.” Inasmuch as there are 535 members of Congress (Senators plus Representatives), the false-match rate is 5%.

Or is it? The ACLU’s webpage states that
To conduct our test, we used the exact same [sic] facial recognition system that Amazon offers to the public, which anyone could use to scan for matches between images of faces. And running the entire test cost us $12.33 — less than a large pizza.

Using Rekognition, we built a face database and search tool using 25,000 publicly available arrest photos. Then we searched that database against public photos of every current member of the House and Senate. We used the default match settings that Amazon sets for Rekognition.
So there were 535 × 25,000 = 13,375,000 comparisons. With that denominator, the false-match rate is about 2 per million (0.0002%).

But none of these figures—28, 5%, or 0.0002%—means very much, since the ACLU’s “test” used a low level of similarity to make its matches. The default setting for the classifier is 80%. Police agencies do not use this weak a threshold [2, 3]. Using a figure as low as 80% ensures that there will be more false matches among so many comparisons. Amazon recommends that police who use its system raise the threshold to 95%. The ACLU apparently neglected to adjust the level (even though doing so would have cost less than a large pizza). Or, worse, it tried the system at the higher level and chose not to report an outcome that probably would have had fewer "causes for concern." Either way, public discourse would benefit from more complete testing or reporting.
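
For what it is worth, raising the threshold is a one-parameter change. The sketch below is not the ACLU's actual code; the collection name and image file are hypothetical. It uses boto3, the AWS SDK for Python, to search a face collection with the match threshold set to 95 rather than the default 80:

    # A minimal sketch of a Rekognition face search against a pre-built collection.
    # "mugshots" and "legislator.jpg" are hypothetical; the point is FaceMatchThreshold.
    import boto3

    client = boto3.client("rekognition")

    with open("legislator.jpg", "rb") as f:
        probe = f.read()

    response = client.search_faces_by_image(
        CollectionId="mugshots",     # hypothetical collection of arrest photos
        Image={"Bytes": probe},
        FaceMatchThreshold=95,       # Amazon's recommendation for police use, not the default 80
        MaxFaces=5,
    )

    for match in response["FaceMatches"]:
        # "Similarity" is a similarity score on a 0-100 scale, not a probability of a true match.
        print(match["Face"]["FaceId"], match["Similarity"])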

It also is unfortunate that Amazon and journalists [2, 3] call the threshold for matches a “confidence threshold.” The percentage is not a measure of how confident one can be in the result. It is not the probability of a true match given a classified match. It is not a probability at all. It is a similarity score, expressed as a percentage. A similarity score of 0.95, or 95%, does not even mean that the paired images are 95% similar in an intuitively obvious sense.

The software does give a “confidence value,” which sounds like a probability, but the Amazon documentation I have skimmed suggests that this quantity relates to some kind of “confidence” in the conclusion that a face (as opposed to anything else) is within the rectangle of pixels (the “bounding box”). The Developer Guide states that [4]
For each face match, the response provides a bounding box of the face, facial landmarks, pose details (pitch, role, and yaw), quality (brightness and sharpness), and confidence value (indicating the level of confidence that the bounding box contains a face). The response also provides a similarity score, which indicates how closely the faces match.
and [5]
For each face match that was found, the response includes similarity and face metadata, as shown in the following example response [sic]:
{
   ...
    "FaceMatches": [
        {
            "Similarity": 100.0,
            "Face": {
                "BoundingBox": {
                    "Width": 0.6154,
                    "Top": 0.2442,
                    "Left": 0.1765,
                    "Height": 0.4692
                },
                "FaceId": "84de1c86-5059-53f2-a432-34ebb704615d",
                "Confidence": 99.9997,
                "ImageId": "d38ebf91-1a11-58fc-ba42-f978b3f32f60"
            }
        },
        {
            "Similarity": 84.6859,
            "Face": {
                "BoundingBox": {
                    "Width": 0.2044,
                    "Top": 0.2254,
                    "Left": 0.4622,
                    "Height": 0.3119
                },
                "FaceId": "6fc892c7-5739-50da-a0d7-80cc92c0ba54",
                "Confidence": 99.9981,
                "ImageId": "5d913eaf-cf7f-5e09-8c8f-cb1bdea8e6aa"
            }
        }
    ]
}
From a statistical standpoint, the ACLU’s finding is no surprise. Researchers encounter the false discovery problem with big data sets every day. If you make enough comparisons with a highly accurate system, a small fraction will be false alarms. Police are well advised to use facial recognition software in the same manner as automated fingerprint identification systems—not as a simple, single-source classifier, but rather as a screening tool to generate a list of potential sources. And they can have more confidence in classified matches from comparisons in a small database of images of, say, dangerous fugitives than in a reported hit to one of thousands upon thousands of mug shots.
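
Some rough arithmetic, with assumed rates rather than Amazon's specifications, makes the point. Multiply a tiny per-comparison false-match rate by millions of comparisons and the false alarms add up; shrink the database to a short watch list that probably contains the person of interest, and a reported match becomes far more trustworthy:

    # Assumed rates for illustration only. A per-comparison false-match rate of about
    # 2 in a million is what 28 false matches in 13,375,000 comparisons would imply.
    def expected_false_matches(comparisons, false_match_rate):
        return comparisons * false_match_rate

    def chance_reported_match_is_true(true_sources_in_db, db_size, hit_rate, false_match_rate):
        true_hits = true_sources_in_db * hit_rate
        false_hits = (db_size - true_sources_in_db) * false_match_rate
        return true_hits / (true_hits + false_hits)

    # 535 probe photos against 25,000 mug shots, with no member of Congress in the database:
    print(expected_false_matches(535 * 25_000, 2e-6))          # about 27 false matches
    # One probe against a 100-person fugitive watch list that does contain the person:
    print(chance_reported_match_is_true(1, 100, 0.95, 2e-6))   # about 0.9998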

These observations do not negate the privacy concerns with applying facial recognition software to public surveillance systems. Moreover, I have not discussed the ACLU’s statistics on differences in false-positive rates by race. There are important issues of privacy and equality at stake. In addressing these issues, however, a greater degree of statistical sophistication would be in order.

REFERENCES
  1. Jacob Snow, Amazon’s Face Recognition Falsely Matched 28 Members of Congress with Mugshots, July 26, 2018, 8:00 AM, https://www.aclu.org/blog/privacy-technology/surveillance-technologies/amazons-face-recognition-falsely-matched-28
  2. Natasha Singer, Amazon’s Facial Recognition Wrongly Identifies 28 Lawmakers, A.C.L.U. Says, N.Y. Times, July 26, 2018,  https://www.nytimes.com/2018/07/26/technology/amazon-aclu-facial-recognition-congress.html
  3. Ryan Suppe. Amazon's Facial Recognition Tool Misidentified 28 Members of Congress in ACLU Test, USA Today, July 26, 2018, https://www.usatoday.com/story/tech/2018/07/26/amazon-rekognition-misidentified-28-members-congress-aclu-test/843169002/
  4. Amazon Rekognition Developer Guide: CompareFaces, https://docs.aws.amazon.com/rekognition/latest/dg/API_CompareFaces.html
  5. Amazon Rekognition Developer Guide: SearchFaces Operation Response, https://docs.aws.amazon.com/rekognition/latest/dg/search-face-with-id-procedure.html

Friday, July 20, 2018

Handwriting Evidence in Almeciga and Pitts: Ships Passing in the Night?

Almeciga: A Signature Case

Erica Almeciga sued the Center for Investigative Reporting (CIR) for releasing a video on Rosalio Reta, a former member of the Los Zetas Drug Cartel, in which she was interviewed about Reta, her “romantic partner at the time.” Almeciga v. Center for Investigative Reporting, Inc., 185 F.Supp.3d 401 (S.D.N.Y. 2016). Her complaint was that the producers breached a promise to conceal her identity, causing her to “develop[] paranoia” and to be “treated for depression and Post Traumatic Stress Disorder.” Id. at 409. In response, CIR “produced a standard release form ... authorizing [it] to use [her] ‘name, likeness, image, voice, biography, interview, performance and/or photographs or films taken of [her] ... in connection with the Project.’” The release, she said, was fabricated—she never saw or signed it—and she obtained an expert report from “a reputed handwriting expert, Wendy Carlson,” id. at 413, concluding that, “[b]ased on [her] scientific examination,” the signature on the Release was a forgery. Id. at 414. To conduct that examination, Carlson compared the signature on the release to “numerous purported ‘known’ signatures” given to her by Almeciga’s lawyer. Id. at 414.

The case found its way to the United States District Court for the Southern District of New York. Judge Jed S. Rakoff dismissed the complaint because “New York's Statute of Frauds [requires that] if a contract is not capable of complete performance within one year, it must be in writing to be enforceable.” Id. at 409. The alleged promise to keep Almeciga's identity concealed was oral, not written.

The court also imposed sanctions on Almeciga for “fabricat[ing] the critical allegations in her Amended Complaint.” Id. at 408. Of course, if the handwriting expert’s analysis was correct, Almeciga’s claim that the release was forged was true, and there would have been no “fraud upon the Court.” Id. at 413. Therefore, Judge Rakoff held “a ‘Daubert’ hearing on the admissibility of Carlson's testimony” in conjunction with the evidentiary hearing on CIR's motion for sanctions. Id. at 414. His conclusion was uncompromising:
[T]he Court grants defendant's motion to exclude Carlson's “expert” testimony, finding that handwriting analysis in general is unlikely to meet the admissibility requirements of Federal Rule of Evidence 702 and that, in any event, Ms. Carlson's testimony does not meet those standards.
Id. at 407–08. As this sentence indicates, there are two facets to the Almeciga opinion: (1) “that handwriting analysis in general”—meaning “the ‘ACE–V’ methodology ... , an acronym for ‘Analyze, Compare, Evaluate, and Verify’” (id. at 418)—“bears none of the indicia of science and suggests, at best, a form of subjective expertise” (id. at 419); and (2) that the particulars of how the expert examined the signatures not only “flunk[ed] Daubert” (id. at 493), but also fell short of the potentially less stringent requirements for nonscientific expertise.

Although one would not expect the defects in the particular case to be at issue in all or even most cases, one would expect the court’s Daubert holding to be a wake-up call. As Judge Rakoff noted, “even if handwriting expertise were always admitted in the past (which it was not), it was not until Daubert that the scientific validity of such expertise was subject to any serious scrutiny.” Id. at 418.

Pitts: “Inapposite and Unpersuasive”

Lee Andrew Pitts allegedly “entered a branch of Chase Bank ... and handed [the manager at a teller window] a withdrawal slip that had written on it: ‘HAND OVER ALL 100, 50, 20 I HAVE A GUN I WILL SHOOT.’” United States v. Pitts, 16-CR-550 (DLI), 2018 WL 1116550 (E.D.N.Y. Feb. 26, 2018). After the manager repeatedly said that she had no money, the would-be robber “fled on foot ... leaving behind the withdrawal slip” with latent fingerprints. A trawl of a fingerprint database — the court does not say which one or how it was conducted — led New York police to arrest Pitts.

At Pitts’s impending trial on charges of entering the bank with the intent to rob it, the government planned to elicit testimony from “Criminalist Patricia Zippo, who is a handwriting examiner and concluded that Defendant ‘probably may have’ written the demand note found at the crime scene.” Pitts moved “to preclude the government from introducing expert opinion testimony as to ... handwriting analysis.” He “relie[d] principally on the [Almeciga] decision” from the other side of the East River.

Chief Judge Dora L. Irizarry dismissed Almeciga as “inapposite and unpersuasive” because of “significant factual differences from the instant case.” Let’s look at each of these differences.
  • First, the plaintiff in Almeciga tasked the analyst with determining whether plaintiff’s signature on a contractual release was a forgery. ... Forgery analysis is markedly more difficult than comparing typical signatures and has considerably higher error rates than simpler comparisons. Id. at 422 (citation omitted) (“[W]hile forensic document examiners might have some arguable expertise in distinguishing an authentic signature from a close forgery, they do not appear to have much, if any, facility for associating an author’s natural handwriting with his or her disguised handwriting.”).
It is true that the task in Pitts was not to compare signatures. It was to investigate the similarity between two written sentences as they appear on the robbery note and ... what? Exemplars the defendant was forced to write (and that, like the exemplars in Almeciga, might have been disguised versions of normal handwriting)? Or did Zippo receive previously existing exemplars of defendant’s handwriting? What do scientific studies of performance on this sort of handwriting-comparison task show? The Pitts opinion does not even hazard a guess, and it blithely ignores the broad conclusion in Almeciga that
[as to] the third Daubert factor, “[t]here is little known about the error rates of forensic document examiners.” While a handful of studies have been conducted, the results have been mixed and “cannot be said to have ‘established’ the validity of the field to any meaningful degree.” (Citations omitted.)
  • Second, the expert performed her initial analysis without any independent knowledge of whether the “known” handwriting samples used for comparison belonged to the plaintiff.
This refers to the fact that in Almeciga, the expert received the exemplars from the lawyer—she did not collect them herself. Her conclusion therefore was conditioned on the assumption that the exemplars really were representative of Almeciga's true signatures. But the need to make this assumption does not pertain to the validity of the ACE-V part of the examination. This difference therefore has no bearing on Almeciga’s conclusion that handwriting determinations have not been scientifically validated.
  • Third, the expert conflictingly claimed that her analysis was based on her “experience” as a handwriting analyst, but then claimed in her expert report that her conclusions were based on her “scientific examination” of the handwriting samples.
Certainly, Judge Rakoff was not impressed with the witness, but the conclusion that Judge Rakoff drew from the juxtaposed statements was only that given these and other statements about the high degree of subjectivity in handwriting comparisons, “[i]t therefore behooves the Court to examine more specifically whether the ACE–V method of handwriting analysis, as described by Carlson, meets the common indicia of admissible scientific expertise as set forth in Daubert.” Judge Irizarry evidently was not disposed to conduct a similar inquiry.
  • Fourth, the court noted several instances of bias introduced by plaintiff’s counsel. For example, counsel initiated the retention by providing a conclusion that “[t]he questioned document was a Release that Defendant CIR forged.” (Citations omitted.)
Was the witness in Pitts insulated from expectation bias? The opinion does not describe any precautions taken to avoid potentially biasing information. What did Patricia Zippo know when she received the handwritten note? Was she given equivalent sets of exemplars from several writers, without being led to expect that only one of them was the writer? That seems doubtful.
  • Fifth, the expert contradicted herself in numerous respects, including by stating that her conclusions were verified when they were not, and claiming both that the signature on the questioned document was “‘made to resemble’ plaintiff’s” and also that the signatures were “‘very different.’”
Like many of the other differences, this one does not bear on Judge Rakoff’s conclusion that the “amorphous, subjective approach” of ACE-V “flunks Daubert.” Almeciga simply used Carlson's contradictory statements and the other as-applied factors to reject the argument that even if the handwriting examination was inadmissible as scientific evidence, it might be admissible as expertise that “is not scientific in nature.”

The Significant Difference

In sum, the Pitts opinion does not grapple with the Daubert issue of scientific validity. Instead of surveying the scientific literature to ascertain whether handwriting examiners’ claims of expertise have been validated (which boils down to studies of how accurate examiners are at the kind of comparisons performed in the case), the court reasons that the process must be accurate because handwriting examiners’ opinions are commonly admitted in court and “wholesale exclusion of handwriting analysis ... is not the majority view in this Circuit.”

Both the Almeciga and Pitts courts were “free to consider how well handwriting analysis fares under Daubert and whether ... testimony is admissible, either as “science” or otherwise.” Almeciga, 185 F.Supp.3d at 418. The most significant difference between the two opinions is that one judge took a hard look at what is actually known about handwriting expertise (or at least tried to), while the other did not.

Tuesday, July 17, 2018

More on Pitts and Lundi: Why Bother with Opposing Experts?

In the post-PCAST cases of United States v. Pitts 1/ and United States v. Lundi 2/, the government persuaded the court to keep a scholar of the development and culture of fingerprinting from testifying for the defense. The proposed witness was Simon Cole, Professor of Criminology, Law and Society in the School of Social Ecology at the University of California, Irvine. Pitts “contend[ed] that Dr. Cole’s testimony [was] necessary ‘contrary evidence’ that calls into question the reliability of fingerprint analysis.” Lundi wanted Cole to testify about “forensic print analysis, in particular in the areas of accuracy and validation,” including "best practices." The federal district court would not allow it.

Qualifications of an Expert Witness

In Pitts, the government first denied that Cole had the qualifications to say anything useful about fingerprinting. It maintained (in the court's words) that “Dr. Cole (1) is ‘not a trained fingerprint examiner’; (2) ‘has not published peer-reviewed scientific articles on the topic of latent fingerprint evidence’; and (3) ‘has not conducted any validation research in the field.’” The court neither accepted nor clearly rejected this argument, for it decided to keep Cole away from the jury on a different ground.

Exclusion for lack of qualifications would have done violence to Federal Rule of Evidence 702. First, Cole was not going to offer an opinion, whether as a criminalist (which he is not) or as an interdisciplinary scholar (which he is), on whether the examiner in the case had accurately perceived, compared, and evaluated the images. More likely, he would have opined on the extent to which scientific studies have shown that fingerprint examiners can distinguish between same-source and different-source prints. Training and experience in conducting fingerprint identifications is largely irrelevant to that task.

Second, Rule 702 does not require someone to publish peer-reviewed articles on a topic to be qualified to give an opinion as to the state of the scientific literature. Epidemiologists and toxicologists, for example, can opine about the toxicity of a compound without first publishing their own research on the compound's toxicity. Finally, there is no support in logic or law for the notion that someone has to conduct his or her own validity study to have helpful information on the studies that others have done and what these studies prove.

The Panacea of Cross-examination

The government’s other argument was “that Dr. Cole’s testimony will not assist the trier of fact.” But this argument was garbled:
Specifically, the government points out that Dr. Cole’s only disclosed opinion is that the government’s expert’s testimony “exaggerates the probative value of the evidence because such testimony improperly purports to eliminate the probability that someone else might be the source of the latent print.” “Professor Cole fails to provide any analysis of why latent fingerprint evidence [in general] is so unreliable that it should not be submitted to the jury or, if such evidence can be reliable in some circumstances, what precisely the NYPD examiners did incorrectly in this case.” Dr. Cole is not expected to testify that the identification made by the government’s expert in this case is unreliable or that the examiners made a misidentification. Therefore, the government argues Dr. Cole’s opinion goes to the weight of the government’s evidence, not its admissibility. (Citations and internal quotation marks omitted.)
Chief Judge Dora L. Irizarry had already ruled that a source attribution made with an acknowledgment of at least a theoretical possibility that the match could be a false positive was admissible. If expert evidence is admitted, the opposing party is normally permitted to counter it with expert testimony that it deserves little weight. To argue that rebuttal evidence about weight is inadmissible simply because the evidence it rebuts goes to weight makes no sense. Once the evidence is admitted, its weight is the only game in town.

The real issue is what validity or possibility-of-error testimony would add to the jury's knowledge. In this regard, Judge Irizarry wrote that
The Court is not convinced that Dr. Cole’s testimony would be helpful to the trier of fact. The only opinion Defendant seeks to introduce is that fingerprint examiners “exaggerate” their results to the exclusion of others. However, the government has indicated that its experts will not testify to absolutely certain identification nor that the identification was to the exclusion of all others. Thus, Defendant seeks [to] admit Dr. Cole’s testimony for the sole purpose of rebutting testimony the government does not seek to elicit. Accordingly, Dr. Cole’s testimony will not assist the trier of fact to understand the evidence or determine a fact in issue. (Citations omitted.)
At first blush, this seems reasonable. If the only thing Cole was prepared to say was that fingerprinting does not permit “absolutely certain identification,” and if the fingerprint examiners will have said this anyway, why have him repeat it?

But surely Cole (or another witness— say, a statistician) could have testified to something more than that. An expert with statistical knowledge could inform the jury that although there is very little direct evidence on how frequently fingerprinting experts err in making source attributions in real casework, experiments have tested their accuracy, and the researchers detected errors at various rates. This information could “assist the trier of fact to understand the evidence or determine a fact in issue.” So why keep Cole from giving this “science framework” testimony?

The court’s answer boils down to this:
Moreover, the substance of Dr. Cole’s opinion largely appears in the reports and attachments cited in Defendant’s motion to suppress .... For example, Dr. Cole’s article More Than Zero contains a lengthy discussion about error rates in fingerprint analysis and the rhetoric in conveying those error rates ... , and the PCAST Report notes that jurors assume that error rates are much lower than studies reveal them to be (PCAST Report at 9-10 (noting that error rates can be as high as one in eighteen)). Defendant identifies no additional information or expertise that Dr. Cole’s testimony provides beyond what is in these articles and does not explain why cross-examination of the government’s experts using these reports would be insufficient. 3/
Now, I think the 1 in 18 figure is mildly ridiculous, 4/ but there is no general rule that because published findings could be introduced via cross-examination, a party cannot call on an opposing expert to present or summarize the findings. First, the expert being cross-examined might not concede that the findings are from authoritative sources. This occurred repeatedly when a number of prosecution DNA experts flatly refused to acknowledge the 1992 NAS report on forensic DNA technology as authoritative. That created a hearsay problem for defendants. After all, the authors of the report were not testifying and hence were not subject to cross-examination. The rule against hearsay applies to such statements because the jury would have to evaluate the truthfulness of the statements without hearing from the individuals who wrote them.

Therefore, counsel could not quote or paraphrase the report’s statements over a hearsay objection unless the report fell under some exception to the rule against hearsay. The obvious exception—for “learned treatises”—does not apply unless the report first is “established as a reliable authority by the testimony or admission of the witness or by other expert testimony or by judicial notice.” 5/

In Pitts, however, it appears that the government’s experts were willing to concede that the NAS and PCAST reports were authoritative (even though a common complaint from vocal parts of the forensic-science community about both reports was that they were not credible because they lacked representation from enough practicing forensic scientists). Moreover, a court might well have to admit the PCAST report under the hearsay exception for government reports.

Nonetheless, a second problem with treating cross-examination as the equivalent of testimony from an opposing expert is that it is not equivalent. By way of comparison, would judges in a products liability case against the manufacturer of an alleged teratogen reason that the defense cannot call an expert to present and summarize the results of studies that address the strength of the association between exposure and birth defects, but rather can only ask the plaintiff’s experts about the studies?

In criminal cases, even if the defense expert eschews opinions on whether the defendant is the source of the latent prints as beyond his (or anyone’s?) expertise, the jury might consider this expert to be more credible and more knowledgeable about the underlying scientific literature than the latent print examiners. Examiners understandably can have great confidence in their careful judgments and in the foundations of the important work that they do. It would not be surprising for their message on cross-examination (or re-direct examination) to be: yes, errors are possible and they have occurred in artificial experiments and a few extreme cases, but, really, the process is highly valid and reliable. An outside observer may have a less sanguine perspective to offer even when discussing the same underlying literature.

Cross-examination is all well and good, but cross-examination of experts is delicate, difficult, and dangerous. Confining the defense to posing questions about specific studies in lieu of its own expert testimony about these studies is not normal. Court instructions about error probabilities (analogous to instructions about the factors that degrade eyewitness identifications) might be a device to avoid unduly time-consuming defense witnesses, but those do not yet exist. The opportunity to cross-examine the other party's witnesses rarely warrants depriving a party of the right to present testimony from its experts.

NOTES
  1. United States v. Pitts, 16-CR-550 (DLI), 2018 WL 1169139 (E.D.N.Y. Mar. 2, 2018).
  2. United States v. Lundi, 17-CR-388 (DLI), 2018 WL 3369665 (E.D.N.Y. July 10, 2018).
  3. The opinion in Lundi is similar:
    The government seeks to preclude Defendant’s proposed expert, Dr. Cole, from testifying, and points to this Court’s decision in Pitts, ... . The government argues that, as was the case in Pitts, Dr. Cole’s anticipated testimony would serve to rebut testimony from the government’s experts that the government does not expect to elicit. ... The government argues further that Dr. Cole’s additional proposed testimony, which would address the reliability of fingerprint examinations and the “best practices” to be followed when conducting such examinations, is not distinguishable from the information contained in the reports Defendant attached to his motion, and with which he can cross examine the government’s experts. ...

    Defendant claims that Dr. Cole’s testimony is necessary in this case because the reports could not be introduced through the government’s experts. .... However, the government has given every indication that its experts would recognize these reports, such that Defendant can use them on cross-examination. See Opp’n at 18 (“[t]o the extent the defendant wants to cross examine the [fingerprint] examiners on the basis of the empirical studies in which the error rates cited in the defendant’s motion were found, the defendant is free to do so....”). The Court finds that Dr. Cole’s testimony would not assist the trier of fact. See Pitts, 2018 WL 1169139, at *3. Accordingly, the testimony is precluded.
  4. See David H. Kaye, On a “Ridiculous” Estimate of an “Error Rate for Fingerprint Comparisons”, Forensic Sci., Stat. & L., Dec. 10, 2016, http://for-sci-law.blogspot.com/2016/12/on-ridiculous-estimate-of-error-rate.html.
  5. Federal Rule of Evidence 803(18); see generally David H. Kaye, David E. Bernstein, & Jennifer L. Mnookin, The New Wigmore: A Treatise on Evidence: Expert Evidence § 5.4 (2d ed. 2011).

Monday, July 16, 2018

Ignoring PCAST’s Explication of Rule 702(d): The Opinions on Fingerprint Evidence in Pitts and Lundi

With the release of an opinion in February and another in July 2018, the District Court for the Eastern District of New York became at least the second federal district court to find that the 2016 report of the President’s Council of Advisors on Science and Technology (PCAST) [1] did not militate in favor of excluding testimony that a defendant is the source of a latent fingerprint. Chief Judge Dora L. Irizarry wrote both opinions.

United States v. Pitts [2]

The first ruling came in United States v. Pitts. The government alleged that Lee Andrew Pitts “entered a branch of Chase Bank ... and handed [the manager at a teller window] a withdrawal slip that had written on it: ‘HAND OVER ALL 100, 50, 20 I HAVE A GUN I WILL SHOOT.’” After the manager repeatedly said that she had no money, the would-be robber “fled on foot ... leaving behind the withdrawal slip” with latent fingerprints. A trawl of a fingerprint database — the court does not say which one or how it was conducted — led New York police to arrest Pitts two weeks later.

Facing trial on charges of entering the bank with the intent to rob it, Pitts moved “to preclude the government from introducing expert opinion testimony as to latent fingerprint and handwriting analysis.” The opinion does not specify the exact nature of the expert's fingerprint testimony. Presumably, it would have been an opinion that Pitts is the source of the print on the withdrawal slip. The court merely noted that the government "claims that its fingerprint experts do not intend to testify that fingerprint analysis has a zero or near zero error rate."

Judge Irizarry made short work of Pitts’s contention that such testimony would contravene Federal Rule of Evidence 702 and Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993). Pitts relied “chiefly on the findings of the PCAST Report, the [2009] NAS Report [3], and several out-of-circuit court decisions that question the reliability of latent fingerprint analysis.” The judge was “not persuaded.” She acknowledged that “[t]he PCAST and NAS Reports [indicate that] error rates are much higher than jurors anticipate” and that the NAS Report stated that “[w]e have reviewed available scientific evidence of the validity of the ACE-V method and found none.” But she was “dismayed that Defendant’s opening brief failed to address an addendum to the PCAST Report.” According to the court,
[The 2017 Addendum] “applaud[ed] the work of the friction-ridge discipline” for steps it had taken to confirm the validity and reliability of its methods. ... The PCAST Addendum further concluded that “there was clear empirical evidence” that “latent fingerprint analysis [...] method[ology] met the threshold requirements of ‘scientific validity’ and ‘reliability’ under the Federal Rules of Evidence.”
Actually, the Addendum [4] adds little to the 2016 report. It responds to criticisms from the forensic-science establishment. The assessment of the scientific showing for the admissibility of latent fingerprint identification under Rule 702 is unchanged. The original report stated that “latent fingerprint analysis is a foundationally valid subjective methodology—albeit with a false positive rate that is substantial and is likely to be higher than expected by many jurors based on longstanding claims about the infallibility of fingerprint analysis.” It added that “[i]n reporting results of latent-fingerprint examination, it is important to state the false-positive rates based on properly designed validation studies.” The Addendum does not retreat from or modify these conclusions in any way.

Both the Report and the Addendum reinforce the conclusion that, despite the lack of detailed, objective standards for evaluating the degree of similarity between pairs of prints, experiments have shown that analysts can reach the conclusion that two prints have a common source with good accuracy. But the Report also lists five conditions that bear on whether a particular analyst has reached the correct conclusion in a given case. It coins the neoteric phrase “validity as applied” for the showing that a procedure has been properly applied in the case at bar:
Scientific validity as applied, then, requires that an expert: (1) has undergone relevant proficiency testing to test his or her accuracy and reports the results of the proficiency testing; (2) discloses whether he or she documented the features in the latent print in writing before comparing it to the known print; (3) provides a written analysis explaining the selection and comparison of the features; (4) discloses whether, when performing the examination, he or she was aware of any other facts of the case that might influence the conclusion; and (5) verifies that the latent print in the case at hand is similar in quality to the range of latent prints considered in the foundational studies.
The opinion does not discuss whether the court accepts or rejects this five-part test for admitting the proposed testimony. It jumps to the unedifying conclusion that defendant’s “critiques [do not] go to the admissibility of fingerprint analysis, rather than its weight.”

United States v. Lundi [5]

Chief Judge Irizarry returned to the question of the admissibility of source attributions from latent prints in United States v. Lundi. In the middle of the afternoon of February 20, 2017, three men entered a check-cashing and hair salon business on Flatbush Avenue in Brooklyn. They forced an employee in a locked glass booth to let them in by pointing a gun at the head of a customer. They made off with approximately $13,000, but one of them had put his hands on top of the glass booth. Police ran an image of the latent prints from the booth through a New York City automated fingerprint identification system (AFIS) database. They decided that those prints came from Steve Lundi. Federal charges followed.

In advance of trial, Lundi moved to exclude the identification. He avoided the Pitts pitfall of arguing that there was no adequate scientific basis for expert latent print source attributions (although the more recent report of an American Association for the Advancement of Science (AAAS) working group would have lent some credence to such a claim [6]). Instead, Lundi “challeng[ed] the application of that [validated] science to the specific examinations conducted in the instant case.” It is impossible to tell from the opinion whether the court was made aware of the PCAST five-part test for admissibility under Rule 702(d). Again, citing the unpublished opinion of a federal court in Illinois, Judge Irizarry apparently leapt over this part of the Report to the conclusion that
This Court is not persuaded that Defendant’s challenges go to the admissibility of the government’s fingerprint evidence, rather than to the weight accorded to it. Moreover, as this Court noted in Pitts, fingerprint analysis has long been admitted at trial without a Daubert hearing. ... The Court sees no reason to preclude such evidence here. Accordingly, Defendant’s motion to preclude fingerprint evidence is denied.
Again, it is impossible to tell from the court's cursory and conclusory analysis whether the theory is that an uncontroverted assurance that an expert undertook an “analysis,” a “comparison,” and an “evaluation” and that another expert did a “verification” ipso facto satisfies Rule 702(d).  The judge noted that “the government points to concrete indicators of how the ACE-V method actually was followed by Detective Skelly,” but it would be hard to find a modern fingerprint identification in which there were no indications that the examiner (1) analyzed the latent print (decided that it was of adequate quality to continue), (2) picked out features to compare and compared them, and then (3) evaluated what was seen. If this is all it takes to satisfy the Rule 702(d) requirement that “the expert has reliably applied the principles and methods to the facts of the case,” then the normal burden on the advocate of expert evidence to show that it meets all the rule’s requirements has evaporated into thin air.

Yet, this could be all that the court required. It suggested that all expert evidence is admissible as long as it is reliable in some general sense, writing that “our adversary system provides the necessary tools for challenging reliable, albeit debatable, expert testimony” and “[v]igorous cross-examination, presentation of contrary evidence, and careful instruction on the burden of proof are the traditional and appropriate means of attacking shaky but admissible evidence” (citing Daubert, 509 U.S. at 596).

The suggestion assumes what is to be proved—that the evidence, “shaky” or unshakeable, is “admissible.” The PCAST Report tried to give meaning to the case-specific reliability prong of Rule 702 (which simply codifies post-Daubert jurisprudence) by spelling out, for highly subjective procedures like ACE-V, what is necessary to demonstrate the legally reliable application in a specific case. Perhaps the “concrete indicators” showed that PCAST’s conditions were satisfied. Perhaps they did not go that far. Perhaps the PCAST conditions are too demanding. Perhaps they are too flaccid. Judge Irizarry does not tell us what she thinks.

After Lundi and Pitts, courts should strive to fill the gap in the analysis of the application of a highly subjective procedure. They should reveal what they think of PCAST’s effort to clarify (or, more candidly, to prescribe) what is required for long-standing methods in forensic science to be admissible under Rule 702(d).

REFERENCES
  1. Executive Office of the President, President’s Council of Advisors on Science and Technology, Report to the President: Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods, Sept. 2016.
  2. United States v. Pitts, 16-CR-550 (DLI), 2018 WL 1116550 (E.D.N.Y. Feb. 26, 2018).
  3. Comm. on Identifying the Needs of the Forensic Sci. Cmty., Nat'l Research Council, Strengthening Forensic Science in the United States: A Path Forward (2009).
  4. PCAST, An Addendum to the PCAST Report on Forensic Science in Criminal Courts, Jan. 6, 2017.
  5. United States v. Lundi, 17-CR-388 (DLI), 2018 WL 3369665 (E.D.N.Y. July 10, 2018).
  6. William Thompson, John Black, Anil Jain & Joseph Kadane, Forensic Science Assessments: A Quality and Gap Analysis, Latent Fingerprint Examination (2017).

Thursday, July 5, 2018

A Strange Report of "Forensic Epigenetics ... in CODIS"

The “Featured Story” in today’s Forensic Magazine is “Forensic Epigenetics: How Do You Sort Out Age, Smoking in CODIS?” The obvious answer is that you don't and you can't. CODIS records contain no epigenetic data.

What Is Epigenetics?

As a Nature educational webpage explains, “[e]pigenetics involves genetic control by factors other than an individual's DNA sequence. Epigenetic changes can switch genes on or off and determine which proteins are transcribed.” 1/ One chemical mechanism for accomplishing this is DNA methylation, "a chemical process that adds a methyl group to DNA." 2/ More precisely, "methylation of DNA (not to be confused with histone methylation) is a common epigenetic signaling tool that cells use to lock genes in the 'off' position." 3/ This methylation is involved in cell differentiation and hence the formation and maintenance of different tissue types. 4/  "Given the many processes in which methylation plays a part, it is perhaps not surprising that researchers have also linked errors in methylation to a variety of devastating consequences, including several human diseases.” 5/

In forensic genetics, "DNA methylation profiling [has been proposed] for tissue determination, age prediction, and differentiation between monozygotic twins." 6/ Because this "profiling" can uncover health-related and other information as well, discussion of regulating its use by police has begun. 7/

What Is CODIS?

CODIS is “the acronym for the Combined DNA Index System and is the generic term used to describe the FBI’s program of support for criminal justice DNA databases as well as the software used to run these databases.” 8/ The DNA data, which come from twenty locations (loci) on various chromosomes, reveal nothing about methylation patterns. The information from these loci relates solely to the set of underlying DNA sequences. These particular sequences are not transcribed, are essentially identical in all tissues and all identical twins, and do not change as a person ages (except for occasional mutations).

What Is “Sort[ing] Out Age, Smoking in CODIS?”

I don't know. Forensic epigenetics or epigenomics involves neither CODIS databases, CODIS loci, nor CODIS software. Does Forensic Magazine's “Senior Science Writer” think that the databases will be expanded to include epigenetic data? That is not what the article asserts. The only attempt to bridge the two is a concluding sentence that reads, "But some studies, like a Stanford exploration last spring, show that even 13 loci can carry more information than originally believed."

That is not much of a connection, and the statement itself is a trifle misleading. Thirteen is the number of STR loci in CODIS profiles before the expansion to twenty in 2017. The description of the “Stanford exploration” referenced in the article 9/ does not show that the original understanding of the information contained in those core CODIS loci was faulty. Rather, it talks about the growth of the size of the databases and research showing that CODIS profiles “could possibly” be linked to records in medical research databases by “authorized or unauthorized analysts equipped with two datasets, one with SNP genotypes and another CODIS genotypes.” 10/

This possibility does not come as a complete surprise. CODIS profiles are meant to be individual identifiers (or nearly so). If there are genomic databases that sufficiently overlap these regions, then a CODIS profile can be used to locate the record pertaining to the same individual in those databases. The extent to which this possibility is cause for concern is worth considering, 11/ but it has nothing to do with the privacy implications of epigenetic data.
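To make the linkage point concrete, here is a minimal sketch (in Python, with invented identifiers, loci, and genotypes; it is not the matching method used in the study described above) of how an identification-only STR profile could serve as a key for locating the corresponding record in another database that includes the same loci:

    # Minimal sketch: an STR profile used as a lookup key across two
    # hypothetical datasets. All identifiers and genotypes are invented.
    def profile_key(genotypes):
        # Sort by locus name so the same profile always yields the same key.
        return tuple(sorted(genotypes.items()))

    # An identification-only profile, in the style of a law-enforcement record.
    query_profile = {"D3S1358": "15,16", "vWA": "17,18", "FGA": "21,24"}

    # A hypothetical research dataset that happens to include the same loci
    # alongside other, more sensitive information.
    research_records = [
        {"id": "subject 042", "notes": "redacted",
         "str": {"D3S1358": "15,16", "vWA": "17,18", "FGA": "21,24"}},
        {"id": "subject 043", "notes": "redacted",
         "str": {"D3S1358": "14,15", "vWA": "16,19", "FGA": "22,23"}},
    ]

    # Index the research records by STR profile; an exact match links the records.
    index = {profile_key(r["str"]): r["id"] for r in research_records}
    print(index.get(profile_key(query_profile)))   # prints "subject 042"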

NOTES
  1. Simmons, D. (2008) Epigenetic influence and disease. Nature Education 1(1):6
  2. Id.
  3. Theresa Phillips (2008) The role of methylation in gene expression. Nature Education 1(1):116.
  4. See, e.g., Karyn L. Sheaffer, Rinho Kim, Reina Aoki, et al. (2014) DNA methylation is required for the control of stem cell differentiation in the small intestine. Genes & Development, http://genesdev.cshlp.org/content/28/6/652.abstract; Bo Zhang, Yan Zhou, Nan Lin, et al. (2013) Functional DNA methylation differences between tissues, cell types, and across individuals discovered using the M&M algorithm. Genome Research, https://genome.cshlp.org/content/early/2013/06/26/gr.156539.113.abstract
  5. Phillips, supra note 3.
  6. Athina Vidaki & Manfred Kayser (2017) From forensic epigenetics to forensic epigenomics: broadening DNA investigative intelligence, Genome Biol. 18: 238, doi:  10.1186/s13059-017-1373-1
  7. Mahsa Shabani, Pascal Borry, Inge Smeers, & Bram Bekaert (2018) Forensic Epigenetic Age Estimation and Beyond: Ethical and Legal Considerations. Trends in Genet 34(7): 489–491
  8. FBI, Frequently Asked Questions on CODIS and NDIS, https://www.fbi.gov/services/laboratory/biometric-analysis/codis/codis-and-ndis-fact-sheet
  9. Seth Augenstein, CODIS Has More ID Information than Believed, Scientists Find,” Forensic Mag., May 15, 2017, https://www.forensicmag.com/news/2017/05/codis-has-more-id-information-believed-scientists-find
  10. Id.
  11. Cf. David H. Kaye, The Genealogy Detectives: A Constitutional Analysis of “Familial Searching,” 51 Am. Crim. L. Rev. 109, 137 n. 170 (2013) (“However, there is at least one rather roundabout way in which the identification profiles could reveal substantial medical information. In the future, when the full genomes of individuals are recorded in clinical databases of medical records, a police agency possessing the profile and having surreptitious access to the database could locate the entry for the individual’s genome and any associated medical records without anyone’s knowledge. Although the STRs would be useful only for identification, that use could be the key to locating information in patient records. Furthermore, the patient’s records and full genome could lead police to the stored genomes and records of relatives. Although I cannot think of many scenarios in which police would be motivated to engage in this computer hacking and medical snooping, there may be some.”).

Saturday, June 23, 2018

Trawling Genealogy Databases and the Fourth Amendment: Part I

Law-enforcement use of a DNA database created for genealogy enthusiasts helped discover the man believed to be the Golden State Killer. It also provoked an immediate outpouring of media reports of concerns about "genetic privacy." Now, essays from groups of bioethicists and lawyers have appeared in both the Annals of Internal Medicine [1] and Science [2]. Neither article gives a convincing and complete analysis of the legal issues—hardly surprising given the word limits for such policy forum essays—but both are useful as starting points for discussion.

I. The Misplaced “Abandonment” Theory

The Annals article, Is It Ethical to Use Genealogy Data to Solve Crimes? [1], assures us that the law provides “clarity.” The authors find this clarity in “the abandonment doctrine.” The following paragraph comprises their entire legal analysis:
The legal questions raised by genealogy searches are measurably simpler than the ethical concerns. In terms of the U.S. Constitution, a genealogy search triggered by DNA collected from a crime scene probably would not count as a “search” under the Fourth Amendment (4). Even assuming it would, the applicable legal theory—the “abandonment doctrine”—holds that a person has no “reasonable expectation of privacy” in abandoned materials. Courts have allowed law enforcement to test DNA “abandoned” in a range of settings (such as hair clippings and discarded cigarette butts). At genealogy Web sites, users voluntarily upload (that is, abandon) familial data into commercial databases. Whether they are aware that their data are subject to police collection is, legally, irrelevant (5). Notwithstanding the clarity of the law, it is questionable whether it is good social policy to consider the uploading of genealogic data the same as abandoning DNA in a public space.
These remarks confuse two very different questions. The first is whether a data-gathering method is a search within the meaning of the Fourth Amendment. The Amendment protects against “unreasonable searches and seizures” of “persons, houses, papers, and effects,” in large part by requiring police to acquire judicial warrants based on probable cause before undertaking a search or seizure. (Reliance on a properly issued warrant makes the search reasonable.)

But not all information collection is a search or seizure that triggers the Fourth Amendment demand for reasonableness. For example, a police officer who merely watches a shady character—or anyone else—walk down the street has not searched or seized anyone. If the officer snaps a photo of the person and compares it to photos of wanted criminals, there is still no search or seizure, for there has been no interference with the individual’s body, movements, or property. And if no search has occurred, there is no need to ask whether the officer’s decision to study the individual was reasonable in light of the facts known to the officer. The notion that what a person knowingly exposes to the general public cannot be the subject of a “search” is sometimes called the “public exposure doctrine.” It pertains to the threshold question of whether a search has occurred.

The “abandonment doctrine” also applies to this threshold question. It applies to property that a person has discarded or left behind. If the police see an individual throw away a syringe, they may collect it and then analyze it for the presence of heroin without obtaining a warrant—because they have not performed a search that affects any legitimate interest. By intentionally relinquishing the syringe, the individual has given up any property interest. He or she might not want it to become known that the syringe has traces of heroin in it, but if heroin possession is a criminal act, then the individual can hardly claim that the interest in keeping this fact secret is legitimate and hence protected by the Fourth Amendment. So the abandonment doctrine is another route to a conclusion that the police have conducted no search or seizure within the meaning of the Amendment.

The lower courts have almost always applied the abandonment doctrine to DNA molecules shed or deposited in both legal and illegal activities. But it is odd to maintain, as Is It Ethical? does, that abandonment makes a search reasonable because “a person has no ‘reasonable expectation of privacy’ in abandoned materials.” That mistakes the question of whether a search is justified for the question of whether the police conduct is a search. The “reasonable expectation” standard, introduced in Katz v. United States, 389 U.S. 347 (1967), is merely a way to show that police have engaged in a search; it is not a way to show that a search is reasonable.

This distinction may sound finicky. Functionally, what is the difference between (1) defining everything as a search, but then asking whether the investigation is reasonable because there is no reasonable expectation of privacy in the items searched, and (2) asking whether there is a search because there is no reasonable expectation of privacy in the first place? Nevertheless, no court that is faithful to the Supreme Court's many Fourth Amendment opinions would adopt the position contemplated in Is It Ethical?. No such court would write that even though profiling DNA from a crime scene is a search, it does not require a warrant (or an exception to the generally applicable warrant requirement) to be constitutionally reasonable.

More importantly, the opinions allowing the police to profile shed DNA and compare the identifying profile with a suspect's DNA (or with all the profiles in a law enforcement DNA database), all without a warrant, do not demonstrate that police can also trawl a database of DNA sequences to see who might be related to whom. Further analysis is required to determine the constitutionality of familial, or outer-directed, searching by the state in both law-enforcement [3] and private (i.e., non-governmental) databases such as GEDmatch and the more restrictive commercial ones.

With respect to the private databases, the state's argument lies not so much in abandonment as in public exposure. The very reason the individual puts DNA data on the database is to enable curious members of the public to inspect it. As such, the Fourth Amendment issue is whether police (without a warrant) can do what anyone else can—namely, trawl the database for a partial match indicative of a genetic relationship to the suspect whose DNA is associated with a crime. In many contexts—overflying private property to get a look at what is there, for example—the Supreme Court has reasoned that what is open to the public generally is open to the police as well. Indeed, the Court has even held that entrusting or conveying information to private parties defeats the claim of a reasonable expectation of privacy and hence the claim of a search that requires probable cause and a warrant. Exposure to a small slice of the public—even a banker or a telephone company—is enough to let the police in without a showing of probable cause. (Disclosure of information to one’s lawyer may be protected by the attorney-client privilege but not the Fourth Amendment.)

In the past several years, however, some Justices have evinced discomfort with this “third-party doctrine.” Just yesterday, the Court held in Carpenter v. United States, No. 16–402, 2018 WL 3073916 (U.S. June 22, 2018), that certain data generated by a cell-phone service provider—the third party—is not outside the protective umbrella of the Fourth Amendment just because it has been given to or generated by a third party. The data in the case amounted to extended tracking of the past whereabouts of a person’s cellphone via the electronic tracks, so to speak, left at cell towers. That information, the majority reasoned, was so sensitive as to make its possession by the cellular phone service providers insufficient to defeat the claim of a reasonable expectation of privacy. A warrant was required.

The Science article correctly frames the pivotal Fourth Amendment issue as the scope of the third-party doctrine, but it leaves much unsaid. I will turn to the implications of this evolving doctrine for trawls of genealogy databases in a later installment.

REFERENCES

1. Benjamin E. Berkman, Wynter K. Miller & Christine Grady, Is It Ethical to Use Genealogy Data to Solve Crimes?, Annals Internal Med., May 29, 2018, DOI: 10.7326/M18-1348.
2. Natalie Ram, Christi J. Guerrini & Amy L. McGuire, Genealogy Databases and the Future of Criminal Investigation, 360 Science 1078-1079 (2018) DOI: 10.1126/science.aau1083
3. David H. Kaye, The Genealogy Detectives: A Constitutional Analysis of “Familial Searching”, 51 Am. Crim. L. Rev. 109 (2013), https://ssrn.com/abstract=2043091

Tuesday, June 5, 2018

DNA Evidence and the Warrant Affidavit in the Golden State Killer Case

Last Friday, Sacramento Superior Court Judge Michael Sweet “ordered arrest and search warrant information in the East Area Rapist/Golden State Killer case unsealed after weeks of arguments between attorneys over how the release would impact the trial of suspect Joseph James DeAngelo.” 1/ The 171 heavily redacted pages of documents did not discuss the kinship trawl of the publicly accessible genealogy database that occupied much of the news about the case. However, they did refer to later DNA tests of surreptitiously procured samples of DeAngelo’s DNA:
     [I]nvestigators didn't have a sample of DeAngelo's DNA, so Sacramento sheriff's detectives began following him as he moved about town, finally watching April 18 as DeAngelo parked his car in a public parking lot at a Hobby Lobby store in Roseville, according to an arrest warrant affidavit unsealed Friday.
     "A swab was collected from the door handle while DeAngelo was inside the store," according to the affidavit from sheriff's Detective Sgt. Ken Clark. "This car door swab was submitted to the Sacramento DA crime lab for DNA testing." ... The swab contained DNA from three different people, and 47 percent of the DNA came from one person, the affidavit said.
     That DNA was compared to murders in Orange and Ventura counties where DNA had been collected and saved from decades before, and it came back with results that elated investigators. "The likelihood ratio for the three-person mixture can be expressed as at least 10 billion times more likely to obtain the DNA results if the contributor was the same as the Orange County/Ventura County (redacted) profile and two unknown and unrelated individuals than if three unknown and unrelated individuals were the contributors," Clark wrote in his affidavit seeking an arrest warrant for DeAngelo. ...
     Sacramento County District Attorney Anne Marie Schubert has said previously that even with the possible match she asked for a better sample, so investigators went hunting again, this time focusing on DeAngelo's trash on April 23.
     "The trash can was put out on the street in front of his house the night before," Clark wrote. "DeAngelo is the only male ever seen at the residence during the surveillance of his home which has occurred over the last three days."
     Detectives gathered "multiple samples" from the trash can and sent them to the crime lab on Broadway for analysis. "Only one item, a piece of tissue (item #234-#8), provided interpretable DNA results," Clark wrote."The likelihood ratio for this sample can be expressed as at least 47.5 Septillion times more likely to obtain the DNA results if the contributor was the same as the Orange County/Ventura County (redacted) profile than if an unknown and unrelated individual is the contributor." 2/
Compare this statement of the likelihood ratio to the misstated version from a “senior science writer” for Forensic Magazine:
The warrants now show that: ... [t]he additional surreptitious sample was from DeAngelo’s trash can set out at the curb. Only one piece of tissue provided interpretable DNA results, but those translated to a likelihood that DeAngelo was 47.5 septillion times more likely to be the Golden State Killer than an unknown and unrelated individual. 3/
To see the error, click on the label “transposition” in this blog. Of course, one can ask what’s the big deal when the likelihood ratio is in the septillions. But that question translates into an argument about harmless error. Sometimes the errors associated with sloppy phrasing won’t have immediate repercussions, but a magazine written for forensic practitioners ought not propagate sloppy thinking. In any case, things are looking up when detectives take care to avoid transposing their conditional probabilities.
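For readers who want to see the arithmetic, here is a minimal sketch (in Python, with arbitrary prior odds chosen purely for illustration) of the difference between the likelihood ratio the detective reported and the transposed statement the magazine printed:

    # Minimal sketch of why a likelihood ratio is not the same as the odds
    # that the suspect is the source. The prior odds below are arbitrary.
    likelihood_ratio = 47.5e24   # P(DNA results | same source) / P(DNA results | different source)

    # Bayes' rule in odds form: posterior odds = likelihood ratio x prior odds.
    for prior_odds in (1.0, 1e-6, 1e-12):
        posterior_odds = likelihood_ratio * prior_odds
        print(f"prior odds {prior_odds:g} -> posterior odds {posterior_odds:.3g}")

    # The reported figure describes the probability of the DNA results under
    # competing source hypotheses. The transposed version ("47.5 septillion
    # times more likely to be the Golden State Killer") describes the
    # hypotheses given the results, which requires prior odds -- and even then
    # it concerns the source hypothesis, not guilt.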

NOTES
  1. Sam Stanton & Darrell Smith, Read the Warrant Documents in the East Area Rapist Case, Sacramento Bee, June 1, 2018, http://www.sacbee.com/news/local/article212377094.html
  2. Sam Stanton & Darrell Smith, How Detectives Collected DNA Samples from the East Area Rapist Suspect, Sacramento Bee, June 1, 2018, http://www.sacbee.com/latest-news/article212334279.html
  3. Seth Augenstein, Golden State Killer Warrants Show Evolution of Killer — But Not Genealogy, Forensic Mag., June 4, 2018, https://www.forensicmag.com/news/2018/06/golden-state-killer-warrants-show-evolution-killer-not-genealogy

Wednesday, May 30, 2018

Fusing Humans and Machines to Recognize Faces

A new article on the accuracy of facial recognition by humans and machines represents “the most comprehensive examination to date of face identification performance across groups of humans with variable levels of training, experience, talent, and motivation.” 1/ It concludes that the optimal performance comes from a “fusion” of man (or woman) and machine. But the meaning of “accuracy” and “fusion” are not necessarily what one might think.

Researchers from NIST, The University of Texas at Dallas, the University of Maryland, and the University of New South Wales displayed “highly challenging” pairs of face images to individuals with and without training in matching images, and to “deep convolutional neural networks” (DCNNs) that trained themselves to classify images as being from the same source or from different sources.

The Experiment
Twenty pairs of pictures (12 same-source and 8 different-source pairs) were presented to the following groups:
  • 57 forensic facial examiners (“professionals trained to identify faces in images and videos [for use in court] using a set of tools and procedures that vary across forensic laboratories”);
  • 30 forensic facial reviewers (“trained to perform faster and less rigorous identifications [for] generating leads in criminal cases”);
  • 13 super-recognizers (“untrained people with strong skills in face recognition”);
  • 31 undergraduate students; and
  • 4 DCNNs (“deep convolutional neural networks” developed between 2015 and 2017).
Students took the test in a single session, while the facial examiners, reviewers, super-recognizers, and fingerprint examiners (a comparison group of forensic examiners not listed above) had three months to complete the test. They all expressed degrees of confidence that each pair showed the same person as opposed to two different people (+3 meant that “the observations strongly support that it is the same person”; –3 meant that “the observations strongly support that it is not the same person”). The computer programs generated “similarity scores” that were transformed to the same seven-point scale.

Comparisons of the Groups
To compare the performance of the groups, the researchers relied on a statistic known as AUC (or, more precisely, AUROC, for “Area Under the Receiver Operating Characteristic” curve). AUROC combines two more familiar statistics—the true-positive (TP) proportion and the false-positive (FP) proportion—into one number. In doing so, it pays no heed to the fact that a false positive may be more costly than a false negative. A simple way to think about the number is this: The AUROC of a classifier is the probability that the classifier will assign a higher score to a randomly chosen same-source pair of images than to a randomly chosen different-source pair. That is,

AUROC = P(score|same > score|different)

Because making up scores at random would be expected to be correct in this sense about half the time, an AUROC of 0.5 means that, overall, the classifier’s scores are useless for distinguishing between same-source and different-source pairs. AUROCs greater than 0.5 indicate better overall classifications, but the value for the area generally does not translate into the more familiar (and more easily comprehended) measures of accuracy such as the sensitivity (the true-positive probability) and specificity (the true-negative probability) of a classification test. See Box 1. Basically, the larger the AUROC, the better the scores are--in some overall sense--in discriminating between same-source and different-source pairs.
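Here is a minimal sketch (in Python, with made-up confidence ratings) of the ranking interpretation: the empirical AUROC is just the fraction of all (same-source, different-source) comparisons in which the same-source pair receives the higher score, with ties counted as one half:

    # Minimal sketch: empirical AUROC as a probability of correct ranking.
    # The ratings below are invented for illustration.
    same_source_scores      = [3, 2, 2, 1, 0, 3, -1, 2]    # ratings for same-source pairs
    different_source_scores = [-3, -2, 0, -1, 1, -3]       # ratings for different-source pairs

    def empirical_auroc(same, different):
        # Count a "win" when the same-source score is higher; half credit for ties.
        wins = sum(1.0 if s > d else 0.5 if s == d else 0.0
                   for s in same for d in different)
        return wins / (len(same) * len(different))

    print(round(empirical_auroc(same_source_scores, different_source_scores), 3))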

Now that we have some idea of what the AUROC signifies (corrections are welcome—I do not purport to be an expert on signal detection theory), let’s see how the different groups of classifiers did. The median performance of each group was
A2017b:░░░░░░░░░░ 0.96
facial examiners:░░░░░░░░░ 0.93
facial reviewers:░░░░░░░░░ 0.87
A2017a:░░░░░░░░░ 0.85
super-recognizers:░░░░░░░░ 0.83
A2016:░░░░░░░░ 0.76
fingerprint examiners:░░░░░░░░ 0.76
A2015:░░░░░░░ 0.68
students:░░░░░░░ 0.68
Again, these are medians. Roughly half the classifiers in each group had higher AUROCs, and half had lower ones. (The automated systems A2015, A2016, A2017a, and A2017b had only one ROC, and hence only one AUROC.) “Individual accuracy varied widely in all [human] groups. All face specialist groups (facial examiners, reviewers, and super-recognizers) had at least one participant with an AUC below the median of the students. At the top of the distribution, all but the student group had at least one participant with no errors.”

Using the distribution of student AUROCs (fitted to a normal distribution), the authors reported the fraction of participants in each group who scored above the student 95th percentile as follows:
facial examiners:░░░░░░░░░░░ 53%
super-recognizers:░░░░░░░░░ 46%
facial reviewers:░░░░░░░ 36%
fingerprint examiners:░░░ 17%
The best computerized system, A2017b, had a higher AUROC than 73% of the face specialists. To put it another way, “35% of examiners, 13% of reviewers, and 23% of superrecognizers were more accurate than A2017b,” which “was equivalent to a student at the 98th percentile.”
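A minimal sketch (in Python, with invented AUROC values rather than the study's data) of the percentile calculation described above (fit a normal distribution to the student AUROCs, find its 95th percentile, and count how many members of another group exceed it) might look like this:

    # Minimal sketch: the "above the student 95th percentile" calculation.
    # All AUROC values below are invented.
    from statistics import NormalDist

    student_aurocs  = [0.58, 0.62, 0.65, 0.68, 0.70, 0.72, 0.75, 0.80]
    examiner_aurocs = [0.70, 0.82, 0.88, 0.93, 0.96, 1.00]

    fitted = NormalDist.from_samples(student_aurocs)   # normal fit to the student scores
    cutoff = fitted.inv_cdf(0.95)                      # the student 95th percentile

    fraction_above = sum(a > cutoff for a in examiner_aurocs) / len(examiner_aurocs)
    print(f"cutoff = {cutoff:.3f}; {fraction_above:.0%} of this group scored above it")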

But none of the preceding reveals how often the classifications based on the scores would be right or wrong. Buried in an appendix to the article (and reproduced below in Box 2) are estimates of “the error rates associated with judgments of +3 and −3 [obtained by computing] the fraction of high-confidence same-person (+3) ratings made to different identity face pairs” and estimates of “the probability of same identity pairs being assigned a −3.” The table indicates that facial examiners who were very confident usually were correct: they gave the highest-confidence same-source rating (+3) to different-source pairs (false positives) less than 1% of the time and the highest-confidence different-source rating (−3) to same-source pairs (false negatives) less than 2% of the time. Students made these errors a little more than 7% and 14% of the time, respectively.

Fusion
The article promises to “show the benefits of a collaborative effort that combines the judgments of humans and machines.” It describes the method for ascertaining whether “a collaborative effort” improves performance as follows:
We examined the effectiveness of combining examiners, reviewers, and superrecognizers with algorithms. Human judgments were fused with each of the four algorithms as follows. For each face image pair, an algorithm returned a similarity score that is an estimate of how likely it is that the images show the same person. Because the similarity score scales differ across algorithms, we rescaled the scores to the range of human ratings (SI Appendix, SI Text). For each face pair, the human rating and scaled algorithm score were averaged, and the AUC was computed for each participant–algorithm fusion.
Unless I am missing something, there was no collaboration between human and machine. Each did their own thing. A number midway between the separate similarity scores on each pair produced a larger area under the ROC than either set of separate scores. To the extent that “Fusing Humans and Machines” conjures images of cyborgs, it seems a bit much. The more modest point is that a very simple combination of scores of a human and a machine classifier works better (with respect to AUROC as a measure of success) than either one alone.
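In code, the averaging step might look something like the following sketch (in Python, with invented ratings and scores; the rescaling to the range of human ratings is done here with a simple min-max mapping, which is an assumption, since the paper's exact rescaling is described in its SI Appendix):

    # Minimal sketch of score-level "fusion": rescale an algorithm's similarity
    # scores to the -3 to +3 rating scale and average them, pair by pair, with
    # one participant's ratings. All numbers are invented.
    human_ratings = [3, 2, -1, -3, 1, 0, -2, 3]              # one participant, 8 face pairs
    algo_scores   = [0.91, 0.74, 0.42, 0.05, 0.66, 0.55, 0.20, 0.88]

    # Assumed rescaling: map the algorithm's observed range onto [-3, +3].
    lo, hi = min(algo_scores), max(algo_scores)
    rescaled = [-3 + 6 * (s - lo) / (hi - lo) for s in algo_scores]

    fused = [(h + a) / 2 for h, a in zip(human_ratings, rescaled)]
    print([round(f, 2) for f in fused])
    # An AUROC is then computed from the fused scores for each
    # participant-algorithm pairing, just as for the human ratings alone.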

BOX 1. THE ROC CURVE AND ITS AREA

Suppose that we were to take a score of +1 or more as sufficient to classify a pair of images as originating from the same source. Some of these classifications would be incorrect (contributing to the false-positive (FP) proportion for this decision threshold), and some would be correct (contributing to the true-positive (TP) proportion). Of course, the threshold for the classification could be set at other scores. The ROC curve is simply a plot of the points (TPP[score], FPP[score]) for the person or machine scoring the pairs of images for the many possible decision thresholds.

For example, if the threshold score for a positive classification were set higher than all the reported scores, there would be no declared positives. Both the false-positive and the true-positive proportions would be zero. At the other extreme, if the threshold score were placed at the bottom of the scale, all the classifications would be positive. Hence, every same-source pair would be classified positively, as would every different-source pair. Both the TPP and the FPP would be 1. A so-called random classifier, in which the scores have no correlation to the actual source of images, would be expected to produce a straight line connecting these points (0,0) and (1,1). A more useful classifier would have a curve with mostly higher points, as shown in the sketch below.

      TPP (sensitivity)
     1 |           *   o
       |       *
       |           o
       |   
       +   *   o
       |                   o Random (worthless) classifier
       |   o               * Better classifier (AUC > 0.5)
       |                         
       o---+------------ FPP (1 – specificity)
                       1
An AUROC of, say, 0.75, does not mean that 75% of the classifications (using a particular score as the threshold for declaring a positive association) are correct. Neither does it mean that 75% is the sensitivity or specificity when using a given score as a decision threshold. Nor does it mean that 25% is the false-positive or the false-negative proportion. Instead, how many classifications are correct at a given score threshold depends on: (1) the sensitivity at that score threshold, (2) the specificity at that score threshold, and (3) the proportion of same-source and different-source pairs in the sample or population of pairs.

Look at the better classifier in the graph (the one whose operating characteristics are indicated by the asterisks). Consider the score implicit in the asterisk above the little tick-mark on the horizontal axis and across from the mark on the vertical axis. The FPP there is 0.2, so the specificity is 0.8. The sensitivity is the height of the better ROC curve at that implicit score threshold. The height of that asterisk is 0.5. The better classifier with that threshold makes correct associations only half the time when confronted with same-source pairs and 80% of the time when presented with different-source pairs. When shown 20 pairs, 12 of which are from the same face, as in the experiment discussed here, the better classifier is expected to make 50% × 12 = 6 correct positive classifications and 80% × 8 = 6.4 correct negative classifications. The overall expected percentage of correct classifications is therefore 12.4/20 = 62% rather than 75%.
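The same arithmetic, as a minimal sketch in Python (using the sensitivity and specificity read off the sketch above and the experiment's mix of 12 same-source and 8 different-source pairs):

    # Minimal sketch: expected proportion of correct classifications at one
    # fixed score threshold, given sensitivity, specificity, and the mix of pairs.
    def expected_accuracy(sensitivity, specificity, n_same, n_different):
        correct = sensitivity * n_same + specificity * n_different
        return correct / (n_same + n_different)

    print(expected_accuracy(0.5, 0.8, 12, 8))   # 0.62, not 0.75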

The moral of the arithmetic: The area under the ROC is not so readily related to the accuracy of the classifier for particular similarity scores. (It is more helpful in describing how well the classifier generally ranks a same-source pair relative to a different-source pair.) 2/


BOX 2. "[T]he estimate qˆ for the error rate and the upper and lower limits of the 95% confidence interval." (From Table S2)
Group                     Estimate    95% CI

Type of Error: False Positive (+3 on different faces)
Facial Examiners          0.9%        0.002 to 0.022
Facial Reviewers          1.2%        0.003 to 0.036
Super-recognizers         1.0%        0.0002 to 0.052
Fingerprint Examiners     3.8%        0.022 to 0.061
Students                  7.3%        0.044 to 0.112

Type of Error: False Negative (-3 on same faces)
Facial Examiners          1.8%        0.009 to 0.030
Facial Reviewers          1.4%        0.005 to 0.032
Super-recognizers         5.1%        0.022 to 0.099
Fingerprint Examiners     3.3%        0.021 to 0.050
Students                  14.5%       0.111 to 0.185
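For readers curious how an interval of this kind can be computed, here is a minimal sketch using a Clopper-Pearson exact binomial interval; the counts are invented, and the paper may have used a different estimator for qˆ and its interval:

    # Minimal sketch: a Clopper-Pearson exact 95% confidence interval for an
    # error proportion. The counts are invented; the paper's estimator may differ.
    from scipy.stats import beta

    def clopper_pearson(k, n, alpha=0.05):
        lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lower, upper

    k, n = 4, 450   # e.g., 4 high-confidence errors in 450 ratings (invented)
    low, high = clopper_pearson(k, n)
    print(f"estimate {k / n:.1%}, 95% CI {low:.3f} to {high:.3f}")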



UPDATES
June 9, 2018: Corrections and additions made in response to comments from Hari Iyer.
NOTES
  1. P.J. Phillips, A.N. Yates, Y. Hu, C.A. Hahn, E. Noyes, K. Jackson, J.G. Cavazos, G. Jeckeln, R. Ranjan, S. Sankaranarayanan, J.-C. Chen, C.D. Castillo, R. Chellappa, D. White and A.J. O’Toole. Face Recognition Accuracy of Forensic Examiners, Superrecognizers, and Algorithms. Proceedings of the National Academy of Sciences, Published online May 28, 2018. DOI: 10.1073/pnas.1721355115
  2. As Hari Iyer put it in response to the explanation in Box 1, "given a randomly chosen observation x1 belonging to class 1, and a randomly chosen observation x0 belonging to class 0, the (empirical) AUC is the estimated probability that the evaluated classification algorithm will assign a higher score to x1 than to x0." For a proof, see Alexej Gossman, Probabilistic Interpretation of AUC, Jan. 25, 2018, http://www.alexejgossmann.com/auc/. A geometric proof can be found in Matthew Drury, The Probabilistic Interpretation of AUC, in Scatterplot Smoothers, Jun 21, 2017, http://madrury.github.io/jekyll/update/statistics/2017/06/21/auc-proof.html.