Sunday, April 8, 2018

On the Difficulty of Latent Fingerprint Examinations

This morning the Office of Justice Programs of the Department of Justice circulated an email mentioning a “New Article: Defining the Difficulty of Fingerprint Comparisons.” The article, written a couple of weeks ago, is from the DOJ’s National Institute of Justice (NIJ). [1] It summarizes an NIJ-funded study that was completed two years ago. [2] Four years ago, the researchers published findings on their attempt to measure difficulty in the online science journal PLOS One. [3]

The “New Article” explains that
[T]he researchers asked how capable fingerprint examiners are at assessing the difficulty of the comparisons they make. Can they evaluate the difficulty of a comparison? A related question is whether latent print examiners can tell when a print pair is difficult in an objective sense; that is, whether it would broadly be considered more or less difficult by the community of examiners.
The first of these two questions asks whether examiners’ subjective assessments of difficulty are generally accurate. To answer this question, one needs an independent, objective criterion for difficulty. If the examiners’ subjective assessments of difficulty line up with the objective measure, then we can say that examiners are capable of assessing difficulty.

Notice that agreement among examiners on what is difficult and what is easy would not transform subjective assessments of difficulty into “objective” ones, any more than the fact that a particular supermodel would “broadly be considered” beautiful would make her beautiful “in an objective sense.” It would simply mean that there is inter-subjective agreement within a culture. One should not mistake inter-examiner reliability for objectivity.

In psychometrics, a simple measure of the difficulty of a question on a test is the proportion of test-takers who answer correctly. [4] Of course, “difficulty” could have other meanings. It might be that test-takers would think that one item is more difficult than another even though, after struggling with it, they did just as well as they had on an item that they (reliably) rated as much easier. A criterion for difficulty in this sense might be the length of time a test-taker devotes to the question. But the correct-answer criterion is appropriate in the fingerprint study because the research is directed at finding a method of identifying those subjective conclusions that are most likely to be correct (or incorrect).
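
For concreteness, here is a minimal Python sketch of that proportion-correct criterion. It is only an illustration; the item names and scored responses are invented, not data from any of the studies discussed here.

```python
# Classical-test-theory difficulty index: the proportion of test-takers
# who answer an item correctly (a lower proportion means a harder item).
# The scores below are hypothetical.

responses = {
    "item_A": [1, 1, 1, 0, 1, 1, 1, 1],  # 1 = correct, 0 = incorrect
    "item_B": [0, 1, 0, 0, 1, 0, 0, 1],
}

for item, scores in responses.items():
    p = sum(scores) / len(scores)
    print(f"{item}: proportion correct = {p:.2f}")
```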

NIJ’s new article also mentions the hotly disputed issue of whether error probabilities, as estimated by the performance of examiners making a specific set of comparisons, should be applied to a single examiner in a given case. One would think the answer is that as long as the conditions of the experiments are informative of what can happen in practice, group means are the appropriate starting point—recognizing that they are subject to adjustment by the factfinder for individualized determinations about the acuity of the examiner and the difficulty of the task at hand. However, prosecutors have argued that the general statistics are irrelevant to weighing the case-specific conclusions from their experts. The NIJ article states that
The researchers noted that being aware that some fingerprint comparisons are highly accurate whereas others may be prone to error, “demonstrates that error rates are indeed a function of comparison difficulty.” “Because error rates can be tied to comparison difficulty,” they said, “it is misleading to generalize when talking about an overall error rate for the field.”
But the assertion that "it is misleading to generalize when talking about an overall error rate for the field" cannot be found in the researchers’ 59-page report. When I searched for the string "generalize," no such sentence appeared. When I searched for "misleading," I found the following paragraph (p. 51):
The mere fact that some fingerprint comparisons are highly accurate whereas others are prone to error has a wide range of implications. First, it demonstrates that error rates are indeed a function of comparison difficulty (as well as other factors), and it is therefore very limited (and can even be misleading) to talk about an overall “error rate” for the field as a whole. In this study, more than half the prints were evaluated with perfect accuracy by examiners, while one print was misclassified by 91 percent of those examiners evaluating it. Numerous others were also misclassified by multiple examiners. This experiment provides strong evidence that prints do vary in difficulty and that these variations also affect the likelihood of error.
As always, failing to condition on relevant variables whose values are unknown "can be misleading" when making an estimate or prediction. But this fact about statistics does not make an overall mean irrelevant. Knowing that there is a high overall rate of malaria in a country is at least somewhat useful in deciding whether to take precautions against malaria when visiting that country, even though a more finely grained analysis of the specific locales within the country could be more valuable. That said, when a difficulty-adjusted estimate of a probability of error becomes available, requiring it to be presented to the triers of fact instead of the group mean would be a sound approach to the relevance objection.
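
A toy calculation makes the point. The Python sketch below (with invented counts, not figures from the study) shows how a single overall error rate can sit far from the error rates for easy and difficult comparisons considered separately:

```python
# Invented counts illustrating how an overall error rate can mask
# large differences across difficulty strata.

strata = {
    # difficulty: (errors, total comparisons) -- hypothetical numbers
    "easy": (2, 1000),
    "hard": (30, 200),
}

errors = sum(e for e, _ in strata.values())
trials = sum(n for _, n in strata.values())

print(f"overall error rate: {errors / trials:.1%}")  # about 2.7%
for name, (e, n) in strata.items():
    print(f"{name} comparisons: {e / n:.1%}")         # 0.2% vs. 15.0%
```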

The experiments described in the report to NIJ are fascinating in many respects. In the long run, the ideas and findings could lead to better estimates of accuracy (error rates) for use in court. More immediately, one can ask how the error rates seen in these experiments compare to earlier findings (reviewed in the report and on this blog). But it is hard to make meaningful comparisons. In the first of the three experiments in the NIJ-funded research, 56 examiners were recruited from participants in the 2011 IAI Educational Conference. These examiners (a few of whom were not latent-print examiners) made forced, time-limited judgments about the association (positive or negative) of many pairs of prints. The following classification table can be inferred from the text of the report:

                     Truly Positive   Truly Negative
Positive Reported         985               37
Negative Reported         163             1107
Total                    1148             1144

The observed sensitivity, P(say + | is +), across the examiners and comparisons was 985/1148 = 85.8%, and the observed specificity, P(say – | is –), was 1107/1144 = 96.8%. The corresponding conditional error proportions are 14.2% for false negatives and 3.2% for false positives. These error rates are higher than those in other research, but in those experiments, the examiners could declare a comparison to be inconclusive and did not have to make a finding within a fixed time. These constraints were modified in a subsequent experiment in the NIJ-funded study, but the report does not provide a sufficient description to produce a complete table.
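
As an arithmetic check (not code from the report), the following Python snippet recomputes these rates from the inferred table:

```python
# Recompute the rates quoted above from the inferred 2x2 classification table.
true_pos, false_pos = 985, 37      # "positive reported" row
false_neg, true_neg = 163, 1107    # "negative reported" row

mated = true_pos + false_neg       # 1,148 truly positive comparisons
nonmated = false_pos + true_neg    # 1,144 truly negative comparisons

print(f"sensitivity  P(say + | is +): {true_pos / mated:.1%}")     # 85.8%
print(f"specificity  P(say - | is -): {true_neg / nonmated:.1%}")  # 96.8%
print(f"false-negative proportion:    {false_neg / mated:.1%}")    # 14.2%
print(f"false-positive proportion:    {false_pos / nonmated:.1%}") # 3.2%
```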

References
1. National Institute of Justice, “Defining the Difficulty of Fingerprint Comparisons,” March 22, 2018, NIJ.gov: https://nij.gov/topics/forensics/evidence/impression/Pages/defining-difficulty-of-fingerprint-comparisons.aspx

2. Jennifer Mnookin, Philip J. Kellman, Itiel Dror, Gennady Erlikhman, Patrick Garrigan, Tandra Ghose, Everett Mettler, & Dave Charlton, Error Rates for Latent Fingerprinting as a Function of Visual Complexity and Cognitive Difficulty, May 2016, https://www.ncjrs.gov/pdffiles1/nij/grants/249890.pdf

3. Philip J. Kellman, Jennifer L. Mnookin, Gennady Erlikhman, Patrick Garrigan, Tandra Ghose, Everett Mettler, David Charlton, & Itiel E. Dror, Forensic Comparison and Matching of Fingerprints: Using Quantitative Image Measures for Estimating Error Rates through Understanding and Predicting Difficulty, PLOS One, May 2, 2014, https://doi.org/10.1371/journal.pone.0094617

4. Frederic M. Lord, The Relation of the Reliability of Multiple-Choice Tests to the Distribution of Item Difficulties, 17 Psychometrika 181 (1952).