Sunday, March 3, 2019

Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 1)

The FBI’s crime laboratory has had its ups and downs. Mistakes or misconduct by hair analysts, latent fingerprint examiners, explosives experts, and DNA analysts have been the subject of many a headline. 1/ In contrast, the more mundane efforts of photographic experts have not prompted extensive reports of any problems, let alone scandals. These experts testify about such matters as whether a surveillance photograph matches a photograph of a given person, 2/ the height of individuals as inferred from photographic images, 3/ and whether images in child pornography cases are those “of real, non-virtual people.” 4/

These halcyon days came to an end this year when Propublica published a pair of detailed articles accusing the FBI's Forensic Audio/Video and Image Analysis Unit of relying on “flawed methods” to generate biased findings that did not deserve to be called “scientific evidence.” 5/ In an interview for the first story, the reporter told me that he had discovered studies showing that analysts were unable to consistently pick out the same features from pictures and that, in one case, an FBI expert had testified that the probability of certain detailed features appearing on pictures of two different shirts would be 1 in 650 billion. He wanted to know what a statistician would say about these matters. I responded that these findings were definitely problematic, and my immediate reactions appeared in the first article:
If examiners cannot mark the same features each time they use a technique, “then you can’t rely on the result, I think that’s what any statistician would say,” said David Kaye, a Penn State University law professor and expert on DNA analysis. “It’s not a reliable measure.”
Vorder Bruegge’s statistic — 1 in 650 billion — is simply too astronomical to be true, said Kaye, the Penn State professor. There isn’t a database documenting features on plaid-shirt seams like there is for human DNA, making it impossible to determine the likelihood a different shirt would appear to match the robber’s shirt. ... “It may be an honest belief,” Kaye said, “though terribly flawed.”
At the time, I had not seen the studies and testimony in question. Having begun to read through them, I am ready to share some more informed (but still tentative) impressions.

I. Reliability

A reliable measuring instrument gives approximately the same results when used repeatedly on the same material under the same conditions. A radar gun that registers the same speed (within, say, 1 mile per hour) for all cars moving at one speed gives reliable measurements at that speed. The measurements may not be correct, but they are consistent, which is all that reliability connotes in statistics.
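
To make the distinction concrete, here is a toy illustration (in Python, with numbers I have invented). A radar gun that reads a few miles per hour high on every pass can be perfectly reliable even though it is not accurate, while a gun whose readings scatter widely is unreliable even if its average reading happens to be close to the truth.

```python
import statistics

# Hypothetical repeated readings (mph) for a car actually traveling 60 mph.
# The numbers are invented purely to illustrate the distinction.
true_speed = 60.0
gun_a = [62.1, 62.0, 61.9, 62.1, 62.0]   # biased high, but consistent
gun_b = [57.0, 63.5, 60.2, 65.1, 55.9]   # centered near the truth, but erratic

for name, readings in [("A", gun_a), ("B", gun_b)]:
    bias = statistics.mean(readings) - true_speed   # accuracy (validity)
    spread = statistics.stdev(readings)             # reliability (consistency)
    print(f"Gun {name}: bias = {bias:+.2f} mph, spread (sd) = {spread:.2f} mph")

# Gun A is reliable but inaccurate; Gun B is roughly accurate on average but
# unreliable. Reliability, by itself, says nothing about correctness.
```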

In pattern-matching tasks performed by human observers, the situation is slightly more complicated. If the patterns are chock full of features that are highly specific to their source, different examiners, or even the same examiner repeating a comparison at a later time, might not rely on the same features in reaching a subjective judgment that the sources of two patterns are the same or different. For example, a fingerprint examiner might use minutiae from one part of a print on one occasion and minutiae from another region on other occasions. That kind of variation would not necessarily make the binary judgments (of same-source and different-source pairs) unreliable or invalid.

The articles cited in the Propublica exposé are not really directed at such binary judgments. The more substantial article cited as proof of unreliability proposes a more standardized and quantitative approach -- a "count-based method" -- to comparing images of hands in photographs, 6/ but it does not dismiss the current, trust-the-examiner's-global-judgment approach as invalid. Rather it states that
While there is a lack of standardization in how features are documented, [this] does not invalidate the use of such features in a photographic comparison. Future study is warranted to examine how successful examiners are when tested with comparisons leading to a conclusion in regard to individualization.
In other words, even if one or two studies indicate that image examiners are not consistent in picking out features in photographs of hands for later comparisons, it does not follow that examiners are unable to reliably and accurately classify pairs of images with respect to hands they depict. The problem -- and it is a grave problem -- is that we lack the studies to know when examiners can do what they claim (at known levels of accuracy) and when they cannot.

A reader of the Propublica articles might conclude that research has demonstrated that the FBI procedures are bunk or junk science -- the article speaks of their having been "debunked." The more accurate conclusion is that they are not science (although digital technology is part of them). Some of the underlying ideas are plausible, but their implementation has yet to be validated (or invalidated) scientifically. 7/ Looking at the write-ups of the two studies, I cannot say that they move the ball greatly, and neither of them claims to. 8/

II. Probability and Individualization

The Propublica articles singled out a leading forensic scientist — Richard Vorder Bruegge, who also chairs the Digital-Multimedia Committee of the Organization of Scientific Area Committees for Forensic Sciences (OSAC) — as a source and savior of unscientific testimony about individuals and clothing captured on film. A centerpiece of the first article is his testimony in a sixteen-year-old case about the probability of finding certain features in photos of a shirt worn by a bank robber caught on film in eight robberies.

Having spent much of my career puzzling over probability computations presented in court, I wanted to read the actual transcript. For readers who share my curiosity about how probability evidence reaches juries but who may not have extra hours on their schedules, I have greatly condensed the testimony and will present some of it together with annotations in a later posting. 9/
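
As for how a number like 1 in 650 billion can be produced at all, the standard route to such astronomical figures is to assign a frequency to each observed feature and multiply the frequencies together on the assumption that the features occur independently. The sketch below (in Python, with per-feature frequencies I have made up solely for illustration -- they are not figures from the testimony) shows how quickly the product collapses toward zero. The result is only as trustworthy as the assumed frequencies and the independence assumption, and for features on plaid-shirt seams there is no database to supply either.

```python
from math import prod

# Invented inputs: suppose an examiner asserts that each of ten observed
# features appears on 1 shirt in 15. These numbers do NOT come from the
# testimony or from any database; they exist only to show the arithmetic.
assumed_feature_frequencies = [1 / 15] * 10

# Product rule: if the features were truly independent, the chance that a
# randomly chosen shirt would display all ten of them is the product.
p_all_features = prod(assumed_feature_frequencies)

print(f"about 1 in {1 / p_all_features:,.0f}")
# -> about 1 in 576,650,390,625 -- an astronomical figure generated entirely
#    by the assumed frequencies and the independence assumption, not by data.
```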

NOTES
  1. E.g., Mark Hansen, Crime Labs Under the Microscope After a String of Shoddy, Suspect and Fraudulent Results, ABAJ, Sept. 2013; John Kelly & Phillip Wearne, Tainting Evidence: Inside The Scandals at the FBI Crime Lab (1998); Office of the Inspector General, The FBI DNA Laboratory: A Review of Protocol and Practice Vulnerabilities, May 2004.
  2. United States v. Martin, No. Crim. 98–178, 2000 WL 233217 (E.D. Pa. Feb. 25, 2000), aff’d, 46 Fed.Appx. 119 (3d Cir. 2002) (on the basis of many similarities, Dr. Richard Vorder Bruegge came “very close to making a positive identification”).
  3. United States v. Kyler, 2015 WL 13450483, Case No.: 4:09cr30/RH/GRJ, Case No.: 4:12cv438/RH/GRJ (N.D. Fla. Feb. 26, 2015), aff’d, 429 Fed.Appx. 828 (11th Cir. 2011) (“Dr. Vorder Bruegge testified that he used reverse projection photogrammetry to analyze video images [from three robberies to] determine[] that the ... true height of the individual in the three videos was slightly above that range [of 5'10.5" and 5'11"].”)
  4. United States v. Rodriguez-Pacheco, 475 F.3d 434 (1st Cir. 2007).
  5. Ryan Gabrielson, The FBI Says Its Photo Analysis Is Scientific Evidence. Scientists Disagree, Propublica, Jan. 17, 2019; Ryan Gabrielson, FBI Scientist’s Statements Linked Defendants to Crimes, Even When His Lab Results Didn’t, Propublica, Feb. 22, 2019.
  6. Christina A. Malone, Michael J. Salyards & Meredith Hein, Inter-/Intra-observer Reliability of Hand Assessment Using Skin Detail: A Count-based Method, 60 J. Forensic Sci. 1605 (2015). In many perceptual tasks, the clearer the stimuli are, the more consistent the responses will be. Thus, this paper also suggests that some hand images produce more repeatable and reproducible observations of features than others. If image examiners are reasonably reliable under some conditions, it is not appropriate to dismiss all comparisons as unreliable. But rescuing some of them requires a metric for distinguishing the cases likely to lead to reliable measurements of individual features from the cases that are likely to generate unreliable measurements.
  7. To say, as the article quoted above does, that future study is "warranted" if "individualization" is to be introduced as scientific evidence, is an understatement. Scientific evidence requires scientific validation, which translates into serious, blind testing in this context. Of course, there is no compelling reason for "individualization." Examiners can testify to their sense of the probabilities of the observed features under alternative hypotheses about the origin of those features without assessing the probabilities of the hypotheses themselves.
  8. The other study cited in the article is unpublished. It was the subject of a talk at last year's annual meeting of the American Academy of Forensic Sciences. Derek A. Boyd, Aislynn MacKenzie, Briana M. Turner-Gilmore, Richard Vorder Bruegge et al., Observer Agreement in the Identification and Quantification of Dorsal Hand Traits From Digital Images (abstract). The abstract acknowledges the limited proof of the validity of comparisons of "dorsal hand traits in the identification of perpetrators and victims of these criminal activities through photographic comparison." In fact, the FBI researchers state that "[t]he qualitative nature of this method prevents it from meeting Daubert standards, as there are no known error rates associated with rates of identification of these traits from visual media." Their study had six examiners count "scars, moles, freckles, and knuckle skin-creases" on images of the back of one hand and checked the counts for reliability. The results were mixed:
    Calculated coefficients of variance indicated high levels of data dispersion among scars (cv=1.206), moles (cv=1.546), and freckles (cv=1.270). Coefficients of variance calculated for counts of knuckle skin-creases on each digit suggested comparatively lower levels of dispersion (cv1=0.419; cv2=0.404; cv3=0.450; cv4=0.530; cv5=0.354). Tests of intra-observer error indicated a statistically significant difference in mean counts between first and second observations of freckles (t=-2.43, df=11, p=0.034) and knuckle skin-creases on the second digit (t=-2.80, df=11, p=0.017), but not for any other traits observed.
    Testing the hypothesis that there is no difference at all in mean counts is an odd way to discuss reliability, but it sounds like the researchers were trying to say that only a few features were not reliably measured. At least, they wrote that "[t]his exploratory study found that most traits exhibited statistically minimal intra- and inter-observer disagreement" (emphasis added). But then they wrote the sentence that Propublica highlighted: "There is considerable intra- and inter-individual variation in the specific observations made by participants, which calls into question the reliability of dorsal hand traits as suitable points of interest for photographic hand comparison." Whatever Boyd et al. meant by "specific observations" and "minimal ... disagreement," they somehow concluded that "[t]hese findings are consistent with recent studies that show support for qualitative methods of identification and have implications for current efforts to develop quantitative methods based off the traits investigated here." (A short sketch of the kinds of calculations behind these coefficients and t-tests appears after these notes.)
  9. After the report was published, Propublica reporter Ryan Gabrielson kindly called my attention to the transcript of the direct examination on the fourth day of the trial and the cross-examination on the fifth day, which Propublica has made available as (low-resolution) PDF files.
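
For readers who want to see what lies behind statistics like the coefficients of variation (the abstract's "coefficients of variance") and paired t-tests quoted in note 8, here is a minimal sketch in Python (using the standard library and SciPy, with counts I have invented rather than the study's actual data) of the kinds of calculations that produce figures such as cv = 1.206 or t = -2.43 (df = 11, p = 0.034).

```python
import statistics
from scipy import stats

# Invented data: suppose 12 observers each counted freckles on the same hand
# image on two occasions. These counts are NOT the data from Boyd et al.
first_pass  = [4, 7, 2, 9, 5, 3, 8, 6, 4, 7, 5, 6]
second_pass = [5, 8, 3, 9, 6, 4, 9, 7, 5, 8, 6, 7]

# Coefficient of variation: standard deviation divided by the mean. A value
# above 1 (like the reported 1.206 for scars) means the counts are spread
# out more widely than their average.
cv = statistics.stdev(first_pass) / statistics.mean(first_pass)
print(f"coefficient of variation = {cv:.3f}")

# Paired t-test of the hypothesis that mean counts do not differ between the
# first and second passes (df = number of pairs - 1, which is 11 here).
t, p = stats.ttest_rel(first_pass, second_pass)
print(f"t = {t:.2f}, df = {len(first_pass) - 1}, p = {p:.3f}")
```
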
POSTINGS IN THIS SERIES
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 1), Mar. 3, 2019
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 2), Mar. 19, 2019
  • Propublica's Picture of Photographic Analysis at the FBI Laboratory (pt. 3), Mar. 20, 2019
