Sunday, November 8, 2015

Can Forensic Pattern Matching Be Validated?

An article in the latest issue of the International Statistical Review 1/ raises (once again) fundamental questions for forensic scientists: How can one establish the validity of human judgment in a pattern recognition task such as deciding whether two samples of fingerprints or handwriting emanate from the same source? How can one estimate error probabilities for these judgments?

The message I get reading between the lines is that convincing validation is barely possible and that the subjective assessments that are today’s norm will have to be replaced by objective measurements and statistical decision rules. This conclusion may not sit well with practicing criminalists who are committed to the current mode of skill-based assessments. At the same time, the particular statistical perspective of the article (null hypothesis testing) stands in opposition to a movement in the academic segment of the forensic science world that importunes criminalists to get away from categorical judgments, whether those judgments are subjective or objective. Nevertheless, the author of Statistical Issues in Assessing Forensic Evidence (Statistical Assessments for short), Karen Kafadar, is a leading figure in forensic statistics, 2/ and her perspective is traditional among statisticians. Thus, an examination of a few parts of the article seems appropriate.

The article focuses on “forensic evidence that involves patterns (latent fingerprints, firearms and toolmarks, handwriting, tire treads and microscopic hair),” often comparing it to DNA evidence (much as the NRC Committee on Identifying the Needs of the Forensic Science Community did in 2009). Professor Kafadar emphasizes that in the pattern-matching fields, analysts do not make quantitative measurements of a pre-specified number of well-defined, discrete features like the short tandem repeat (STR) alleles that now dominate forensic DNA testing. Instead, the analyses “depend to a large extent on the examiner whose past experience enables some qualitative assessment of the distinctiveness of the features.” In other words, human perception and judgment establish how similar the two feature sets are and how discriminating those feature sets are. Such “pattern evidence is ... subjective and in need of quantitative validation.”

I. Validating Expert Judgments

How, then, can one quantitatively validate the subjective process? The article proceeds to “define measures used in quantifying error probabilities and how they can be used for pattern evidence.”

A. Validity

The first measure is

Validity (accuracy): Given a sample piece of evidence on which a measurement is made, is the measurement accurate? That is, if the measurement is ‘angle of bifurcation’ or ‘number of matching features’, does that measurement yield the correct answer? For example, if a bifurcation appears on an image with an angle of 30°, does the measurement technology render a result of ‘30’ [degrees], at least on average if several measurements are made? As another example, if a hair diameter is 153 μm, will the measurement, or average of several measurements, indicate ‘153’?

This is only a rough definition. Suppose a measuring instrument always gives a value of 30.001 when the angle is actually 30. Are the measurements “valid”? Neither individually nor on average is the instrument entirely accurate. But the measurements always are close, so maybe they do qualify as valid. There are degrees of validity, and a common measure of validity in this example would be the root mean squared error, where an error is a difference between the true angle and the measurement of it.
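
To make that statistic concrete, here is a minimal sketch, in Python, of how the root mean squared error of repeated angle measurements might be computed. The measurement values are invented for illustration.

    import math

    def rmse(true_value, measurements):
        """Root mean squared error of repeated measurements of a known true value."""
        squared_errors = [(m - true_value) ** 2 for m in measurements]
        return math.sqrt(sum(squared_errors) / len(squared_errors))

    # A hypothetical instrument that always reads 30.001 when the true angle is 30 degrees:
    # slightly biased, but the error is always tiny.
    print(rmse(30.0, [30.001, 30.001, 30.001, 30.001]))   # 0.001

    # A hypothetical instrument that is correct on average but scattered around the truth.
    print(rmse(30.0, [28.0, 32.0, 29.0, 31.0]))           # about 1.58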

But fingerprint analysts do not measure alignments in degrees. Their comparisons are more like that of a person asked to hold two objects, one in each hand, and say which one is heavier (or whether the weights are practically the same). Experiments can validate the ability of test subjects to discriminate the masses under various conditions. If the subjects rarely err, their qualitative, comparative, subjective judgments could be considered valid.

Of course, there is no specific point at which accuracy suddenly merits the accolade of “valid,” and it can take more than one statistic to measure the degree of validity. For example, two forensic scientists, Max Houck and Jay Siegel, interpret a study of the outcomes of the microscopic comparisons and mitochondrial DNA testing as establishing a 91% “accuracy,” 3/ where “accuracy” is the overall “proportion of true results.” 4/ Yet, the microscopic hair analysts associated a questioned hair with a known sample in 1/5 of the cases in which DNA testing excluded any such association (and 1/3 in cases in which there was a DNA exclusion and a definitive result from the microscopy). 5/ In dealing with binary classifications, “validity (accuracy)” may require attention to more than being correct “on average.”
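
A minimal sketch shows how a single overall accuracy figure can coexist with a high rate of false associations. The confusion-matrix counts below are invented so that the arithmetic is easy to follow; they are not the figures from the hair study.

    # Invented counts for a binary association test (not data from the hair study).
    true_positives = 80    # analyst reports an association; DNA confirms it
    false_negatives = 20   # analyst reports no association; DNA shows one
    false_positives = 20   # analyst reports an association; DNA excludes it
    true_negatives = 80    # analyst reports no association; DNA excludes one

    total = true_positives + false_negatives + false_positives + true_negatives

    # The overall "proportion of true results."
    accuracy = (true_positives + true_negatives) / total

    # The proportion of different-source (DNA-excluded) cases nonetheless reported as associations.
    false_positive_rate = false_positives / (false_positives + true_negatives)

    print(f"accuracy = {accuracy:.0%}")                        # 80%
    print(f"false positive rate = {false_positive_rate:.0%}")  # 20%, a 1-in-5 error rate

A respectable-looking overall accuracy is thus fully compatible with a troubling false-positive rate among the cases that matter most.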

Moreover, whether the validity statistic derived from one experiment applies more widely is almost always open to debate. Even if one group of “weight analysts” always successfully discriminated between the heavier and the lighter weights, the question of generalizability or “external validity,” as social scientists often call it, would remain. A rigorous, double-blind study might show that the analysts did superbly with one set of weights under particular, controlled conditions. This study would possess high internal validity. But it might not tell us much about the performance of different analysts under different conditions; its external validity might be weak. Indeed, it has been said that “[i]t is axiomatic in social science research that there is an inverse relationship between internal and external validity.” 6/

Plainly, the quick definition of “validity” in Statistical Assessments does not exhaust the subject. (Nor was it intended to.) Things get even more complicated when we think of validity as relating to the purpose of the measurement. The data from a polygraph instrument may be valid measurements of physiological characteristics but not valid measures of conscious deception. The usual idea of validity is that the instrument (human or machine) accurately measures what it is supposed to measure. This aspect of “validity” is closely related to the requirement of “fit” announced in the Supreme Court's majority opinion in Daubert v. Merrell Dow Pharmaceuticals, Inc. 7/

B. Consistency

The article indicates that it takes more than “validity” to validate a measurement or inference process. The second requirement is

Consistency (reliability): Given the same sample, how consistent (or variable) are the results? If the measurement is repeated under different conditions (e.g. different fingers, different examiners, different analysis times, different measurement systems and different levels of quality in evidence), is the measurement the same? ... Under what conditions are the measurements most variable? That is, do measurements vary most with different levels of latent print quality? Or with different fingers of the same person? Or with different times of day for the same examiner? Or with different automated fingerprint identification systems (AFIS)? Or with different examiners? If measurements are found to be most consistent when the latent print quality is high and when AFIS system type A is used, but results vary greatly among examiners when the latent print quality is low or when other AFIS systems are used, then one would be in a good position to recommend the restriction of this particular type of forensic evidence under only those conditions when consistency can be assured. ... Notice that a measurement can be highly consistent around the wrong answer (consistent but inaccurate). ...

The critical definitional point here is that “reliability” concerns consistency, but there is room for argument over whether the measuring process has to be consistent under all conditions to be considered “reliable.” If one automated system is consistent in a given domain, it is reliable in that domain. If one skilled examiner reaches consistent results, her reliability is high even if inter-examiner reliability is low. In these examples, the notion of “reliability” overlaps or blurs into the idea of external validity. Likewise, all our weight analysts might be very reliable when comparing 10-pound weights to 20-pound ones but quite unreliable in distinguishing between 15-pound and 16-pound ones. This would not prove that subjective judgments are ipso facto unreliable, only that reliability is lower for more difficult tasks than for easy ones.
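
To put numbers on that distinction, here is a minimal sketch of the kind of computation a reliability study might report, using invented repeated measurements of a single bifurcation angle; it shows how examiners can each be internally consistent while disagreeing with one another.

    import statistics

    # Invented repeated measurements (in degrees) of the same bifurcation angle.
    measurements_by_examiner = {
        "examiner A": [30.1, 29.9, 30.0, 30.2],
        "examiner B": [31.5, 31.7, 31.4, 31.6],
        "examiner C": [28.8, 29.0, 28.9, 29.1],
    }

    # Within-examiner spread: how consistent each examiner is with herself.
    for examiner, values in measurements_by_examiner.items():
        print(f"{examiner}: mean = {statistics.mean(values):.2f}, sd = {statistics.stdev(values):.2f}")

    # Between-examiner spread: how consistent the examiners are with one another.
    examiner_means = [statistics.mean(v) for v in measurements_by_examiner.values()]
    print(f"spread of the examiner means: sd = {statistics.stdev(examiner_means):.2f}")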

These ruminations on terminology do not undercut the important message in Statistical Assessments that research that teases out the conditions under which reliability and validity are degraded is vital to avoiding unnecessary errors: “[M]any observational studies are needed to confirm the performance of latent print analysis under a wide array of scenarios, examiners and laboratories.”

C. Well-determined Error Probabilities

The final component of validation described in Statistical Assessments is "well-determined error probabilities." When it comes to the classification task (differentiating same-source from different-source specimens), the error probabilities indicate whether the classifications are valid. A highly specific test has relatively few false positives — when confronted with different-source specimens, examiners conclude that they do not match. A highly sensitive test has relatively few false negatives — when confronted with same-source specimens, examiners conclude that they do match.

Tests that are both sensitive and specific also can be described as generating results that have a high “likelihood ratio.” If a perceived positive association is much more probable when the specimens truly are associated, and a negative association (an exclusion) is much more probable when they are not, then the likelihood ratio LR has a large numerator (close to the maximum probability of 1) and a small denominator (close to the minimum of 0):
LR = Pr(test + | association) / Pr(test + | no association)
      = sensitivity / [1 – Pr(test – | no association)]
      = sensitivity / (1 – specificity)
      = large (almost 1) / small (a little more than 0)
      = very large
But a high likelihood ratio does not guarantee a high probability of a true association. It signals high “probative value,” to use the legal phrase, because it justifies a substantial change in the probability that the suspected source is the real source compared to that probability without the evidence. For example, if the odds of an association without knowledge of the evidence are 1 to 10,000 and the examiner’s perception is 1,000 times more probable if the specimens are from the same source than if they are not (LR = 1000), then, by Bayes’ rule, the odds given the evidence rise to 1000 × 1:10,000 = 1:10. Odds of only 1 to 10 cannot justify a conviction or even a conclusion that the two specimens probably are associated, but the examiner’s evidence has contributed greatly to the case. With other evidence, guilt may be established; without the forensic-science evidence, the totality of the evidence may fall far short. Therefore, if the well-determined error probabilities are low (and, hence, the likelihood ratio is high), it would be a mistake to dismiss the examiner’s assessment as lacking in value.
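
A minimal sketch of these two calculations, using assumed sensitivity and specificity values (chosen only for illustration) and the 1-to-10,000 prior odds from the example above:

    from fractions import Fraction

    # Assumed performance figures, for illustration only.
    sensitivity = 0.99    # Pr(examiner reports a match | same source)
    specificity = 0.999   # Pr(examiner reports an exclusion | different source)

    # Likelihood ratio for a reported match: how much more probable the report is
    # when the specimens share a source than when they do not.
    lr = sensitivity / (1 - specificity)
    print(f"LR = {lr:.0f}")              # 990

    # Bayes' rule in odds form, with the figures from the text:
    # prior odds of 1 to 10,000 and an LR of 1,000.
    prior_odds = Fraction(1, 10_000)
    posterior_odds = 1_000 * prior_odds
    print(posterior_odds)                # 1/10, i.e., odds of 1 to 10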

Yet, the standard terminology of positive and negative “predictive value” used in Statistical Assessments suggests that much more than this is required for the evidence to have “value.” For example, the article states that

In the courtroom, one does not have the ‘true’ answer; one has only the results of the forensic analysis. The question for the jury to decide is as follows: Given the results of the analysis, what is the probability that the condition is present or absent? For fingerprint analysis, one might phrase this question as follows:

PPV = P{same source | analysis claims ‘same source’}.

If PPV is high, and if the test result indicates ‘same source’, then we have some reasonable confidence that the two prints really did come from the same person. But if PPV is low, then, despite the test result (‘same source’), there may be an unacceptably high chance that in fact the prints did not come from the same person—that is, we have made a serious ‘type I error’ in claiming a ‘match’ when in fact the prints came from different persons.
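
The PPV just defined can be computed by Bayes’ rule from sensitivity, specificity, and the prior probability that the compared prints share a source. Here is a minimal sketch, with all three inputs assumed for illustration (they are not figures from the article):

    def positive_predictive_value(sensitivity, specificity, prior):
        """P(same source | analysis claims 'same source'), by Bayes' rule."""
        true_positive = sensitivity * prior
        false_positive = (1 - specificity) * (1 - prior)
        return true_positive / (true_positive + false_positive)

    # The same highly sensitive, highly specific test...
    sens, spec = 0.99, 0.999

    # ...yields very different PPVs depending on the prior probability of a true association.
    for prior in (0.5, 0.01, 0.0001):
        print(f"prior = {prior:g}: PPV = {positive_predictive_value(sens, spec, prior):.3f}")

With a prior on the order of the 1-in-10,000 odds used earlier, the PPV is only about 0.09, which is just another way of expressing posterior odds of 1 to 10.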

Yes, the fingerprint analyst who asserts that the defendant is certainly the source when the PPV is low is likely to have falsely rejected the hypothesis that the defendant is not the source. But why must fingerprint examiners make these categorical judgments? Their job is to supply probative evidence to the judge or jury so as to permit the factfinder to reach the best conclusion based on the totality of the evidence in the case. 8/ If experiments have shown that examiners like the one in question, operating under comparable conditions with comparable prints, almost always report that the prints come from the same source when they do (high sensitivity) and that they do not come from the same source when they do not (high specificity), then there is no error in reporting that the prints in question are substantially more likely to have various features in common if they came from the same finger than if they came from fingers from two different individuals. 9/ This is a correct statement about the weight of the evidence rather than the probability of the hypothesis.

Indeed, one can imagine expanding the range of evaluative conclusions that fingerprint examiners might give. Instead of thinking “it’s either an identification or an exclusion” (for simplicity, I am ignoring judgments of “insufficient” and “inconclusive”), the examiner might be trained to offer judgments on a scale for the likelihood ratio, as European forensic science institutes have proposed. 10/ A large number of clear and unusual corresponding features in the latent print and the exemplar should generate a large subjective probability for the numerator of LR and a small probability for the denominator. A smaller number of such features should generate a smaller subjective ratio.
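
Purely for illustration, such a scale could be implemented as a mapping from the examiner’s subjective likelihood ratio to an ordered set of verbal conclusions. The bands and labels below are hypothetical placeholders, not the scale actually adopted in the ENFSI guideline.

    # Hypothetical verbal-equivalence bands for a likelihood ratio favoring the
    # same-source proposition. These cut-offs and labels are placeholders only.
    VERBAL_SCALE = [
        (1, "no support"),
        (10, "weak support"),
        (100, "moderate support"),
        (1_000, "moderately strong support"),
        (10_000, "strong support"),
        (1_000_000, "very strong support"),
    ]

    def verbal_equivalent(lr):
        for upper_bound, label in VERBAL_SCALE:
            if lr <= upper_bound:
                return label
        return "extremely strong support"

    print(verbal_equivalent(500))      # moderately strong support
    print(verbal_equivalent(50_000))   # very strong support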

Although this mode of reporting on the evidentiary weight of the features is more nuanced and supplies more information to the factfinder, it would increase the difficulty of validating the judgments. How could one be confident that the moderate-likelihood-ratio judgments correspond to less powerful evidence than the high-likelihood-ratio ones?

II. Validating an Objective Statistical Rule

Statistical Assessments does not seriously consider the possibility of moving from categorical decisions on source attribution to a weight-of-evidence system. Instead, it presents a schematic for validating source attributions in which quantitative measurement replaces subjective impressions of the nature and implications of the degree of similarity in the feature sets. The proposal is to devise an empirically grounded null hypothesis test for objective measurements. Development of the test would proceed as follows (using 95% as an example; a simplified code sketch of the procedure appears after the list):

    (1) Identify a metric (or set of metrics) that describes the essential features of the data. For example, these metrics might consist of the numbers of certain types of features (minutiae) or the differences between the two prints in the (i) average distances between the features (e.g. between ridges or bifurcations), (ii) eccentricities of identified loops or (iii) other characteristics on the prints that could be measured.
    (2) Determine a range on the metric(s) that is ‘likely to occur’ (has a 95% chance of occurring) if ‘nothing interesting is happening’ (i.e. the two prints do not arise from the same source). For example, one could calculate these metrics on 10,000 randomly selected latent prints known to have come from different sources.
    (3) Identify ‘extreme range’ = range of the metric(s) outside of the ‘95%’ range. For example, one can calculate ranges in which 95% of the 10,000 values of each metric lie.
    (4) Conduct the experiment and calculate the metric(s). For example, from the ‘best match’ that is identified, one can calculate the relevant metrics.
    (5) If the metric falls in the ‘expected’ range, then data are deemed consistent with the hypothesis that ‘nothing interesting is happening’. If the metric falls in the ‘extreme’ range, the data are not consistent with this hypothesis and indicate instead an alternative hypothesis.
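
A simplified sketch of steps (2) through (5), using a simulated similarity metric; the metric and its distribution are invented for illustration and are not a real fingerprint measurement.

    import random

    random.seed(1)

    # Step (2): build a reference distribution of a hypothetical similarity metric
    # from a large sample of pairs known to come from different sources.
    different_source_scores = sorted(random.gauss(20, 5) for _ in range(10_000))

    # Step (3): the 'extreme range' is everything above the 95th percentile of that
    # different-source distribution (a one-sided test at the 5% level).
    cutoff = different_source_scores[int(0.95 * len(different_source_scores))]

    # Steps (4) and (5): compute the metric for the questioned pair and classify it.
    def evaluate(case_score):
        if case_score > cutoff:
            return "in the extreme range: inconsistent with the different-source hypothesis"
        return "in the expected range: consistent with the different-source hypothesis"

    print(f"95% cutoff on the hypothetical metric: {cutoff:.1f}")
    print("score 22 is", evaluate(22))
    print("score 45 is", evaluate(45))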

This type of approach keeps the risk of a false rejection of the null hypothesis (that the suspect is not the source) to no more than 5% (ignoring the complications arising from the fact that not one but many variables are being considered separately), but it is subject to well-known criticisms. First, why 5%? Why not 1%? Or 9.3%?

Second, whatever the level of the test, does it make sense to report an association when the measurements barely make it into the “extreme range” but not when they are barely shy of it?

Third, what is the risk of a false acceptance, that is, a false exclusion of a truly matching print? To estimate that error probability, a different distribution would need to be considered: the distribution of the measured values of the features when sampling from the same finger. The 2009 NRC Report refers to this issue. In a somewhat garbled passage on the postulated uniqueness of fingerprints, 11/ it observes that
Uniqueness and persistence are necessary conditions for friction ridge identification to be feasible, but [u]niqueness does not guarantee that prints from two different people are always sufficiently different that they cannot be confused, or that two impressions made by the same finger will also be sufficiently similar to be discerned as coming from the same source. The impression left by a given finger will differ every time, because of inevitable variations in pressure, which change the degree of contact between each part of the ridge structure and the impression medium. None of these variabilities — of features across a population of fingers or of repeated impressions left by the same finger — has been characterized, quantified, or compared. 12/
To rest assured that both error probabilities of the statistical test of the quantified feature set are comfortably low, the same-finger experiment also needs to be conducted. Of course, Dr. Kafadar might readily concede the need to consider the alternative hypothesis (that prints originated from the same finger) and say that replicate measurements from the same fingers should be part of the experimental validation of the more objective latent-print examination process (or that in ordinary casework, examiners should make replicate measurements for each suspect). 13/

Still, the question remains: What if a similarity score on the crime-scene latent print and the ten-print exemplar falls outside the 95% range of variability of prints from different individuals and outside the 95% range for replicate latent prints from the same individual? Which hypothesis is left standing — the null (different source) or the alternative (same source)? One could say that the fingerprint evidence is inconclusive in this case, but would it be better to report a likelihood ratio in all cases rather than worrying about the tail-end probabilities in any of them? (This LR would not depend on whether the different-source hypothesis is rejected. It would increase more smoothly with an increasing similarity score.)
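
To make the last point concrete: if validation data supplied similarity-score distributions for known same-source pairs as well as known different-source pairs, a likelihood ratio could be reported for any observed score as the ratio of the two densities at that score. The sketch below uses normal approximations to two invented distributions; real score distributions would have to be estimated empirically and need not be normal.

    import math

    def normal_pdf(x, mean, sd):
        """Density of a normal distribution at x."""
        return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

    # Invented score distributions: same-source comparisons tend to produce higher
    # similarity scores than different-source comparisons.
    SAME_SOURCE = {"mean": 60, "sd": 10}
    DIFFERENT_SOURCE = {"mean": 20, "sd": 5}

    def score_based_lr(score):
        """How much more probable the observed score is under the same-source
        hypothesis than under the different-source hypothesis."""
        return normal_pdf(score, **SAME_SOURCE) / normal_pdf(score, **DIFFERENT_SOURCE)

    # The likelihood ratio rises smoothly with the score instead of jumping at a cutoff.
    for score in (25, 35, 45, 55):
        print(f"score {score}: LR = {score_based_lr(score):.3g}")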

III. Human Expertise and Statistical Criteria

A major appeal of objectively ascertained similarity scores and a fixed cut-off is that the system supplies consistent results with quantified error probabilities and reliability. But would the more objective process be any more accurate than subjective, human judgment in forensic pattern recognition tasks? The objective measures that might emerge are likely to be more limited than the many features that the human pattern matchers might evaluate. And, it can be argued that the statistical evaluation of them may not be as sensitive to unusual circumstances or subtleties as individual “clinical” examination would be.

Thus, the preference in Statistical Assessments for objectively ascertained similarity scores and a fixed cut-off is reminiscent of the arguments for “actuarial” or “statistical” rather than “clinical” assessments in psychology and medicine. 14/ The work in those fields of expertise raises serious doubt about claims of superior decisionmaking from expert human examiners. Nonetheless, more direct data on this issue can be gathered. Along with the research Dr. Kafadar proposes, studies of whether the statistical system outperforms the classical, clinical one in the forensic science fields should be undertaken. The burden of proof should be on the advocates of purely clinical judgments. Unless the less transparent and less easily validated human judgments clearly outperform the algorithmic approaches, they should give way to more objective measurements and interpretations.

Notes
1. Karen Kafadar, Statistical Issues in Assessing Forensic Evidence, 83 Int’l Stat. Rev. 111–34 (2015).

2. Dr. Kafadar is Commonwealth Professor and chair of the statistics department at the University of Virginia, a member of the Forensic Science Standards Board of the Organization of Scientific Area Committees of the National Institute of Standards and Technology, and a leading participant in the newly established “Forensic Science Center of Excellence focused on pattern and digital evidence” — “a partnership that includes Carnegie Mellon University (Pittsburgh, Penn.), the University of Virginia (Charlottesville, Va.) and the University of California, Irvine (Irvine, Calif.) [that] will focus on improving the statistical foundation for fingerprint, firearm, toolmark, dental and other pattern evidence analyses, and for computer, video, audio and other digital evidence analyses.” New NIST Center of Excellence  to Improve Statistical Analysis of Forensic Evidence, NIST Tech Beat, May 26, 2015.

As Dr. Kafadar has observed, “[s]tatistics plays multiple roles in moving forensic science forward, in characterizing forensic analyses and their underlying bases, designing experiments and analyzing relevant data that can lead to reduced error rates and increased accuracy, and communicating the results in the courtroom.” U.Va. Partners in New Effort to Improve Statistical Analysis of Forensic Evidence, UVAToday, June 2, 2015.

3. Max M. Houck & Jay A. Siegel, Fundamentals of Forensic Science 310 (2015).

4. Id.

5. David H. Kaye, Ultracrepidarianism in Forensic Science: The Hair Evidence Debacle, 72 Wash. & Lee L. Rev. Online 227 (2015).

6. E.g., Allan Steckler & Kenneth R. McLeroy, The Importance of External Validity, 98 Am. J. Public Health 9 (2008).

7. See David H. Kaye et al., The New Wigmore: A Treatise on Evidence: Expert Evidence (2d ed. 2011).

8. As Sir Ronald Fisher reminded his fellow statisticians, “We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions.” Ronald A. Fisher, Statistical Methods and Scientific Induction, 17 J. Roy. Statist. Soc. B 69 (1955).

9. Yet, Statistical Assessments insists that by virtue of Bayes’ rule, “low prevalence, high sensitivity and high specificity are needed for high PPV and NPV ... [there is a] need for sensible restriction of the suspect population.” This terminology is confusing. Low prevalence (guilt is rare) comes with a large suspect population rather than a restricted one. It cuts against a high PPV. Conversely, if “low prevalence” means a small suspect population (innocence is relatively rare), then it is harder to have a high NPV.

10. ENFSI Guideline for Evaluative Reporting in Forensic Science, June 9, 2015.

11. The assertion in the quoted passage that “[u]niqueness and persistence are necessary conditions for friction ridge identification to be feasible” ignores the value of a probabilistic identification. A declared match can be immensely probative even if a print is not unique in the population. If a particular print occurred twice in the world’s population, a match to the suspect still would be powerful evidence of identification. DNA evidence is like that: the possibility of a genetically identical twin somewhere has not greatly undermined the feasibility of DNA identifications. The correspondence in the feature set still makes the source probability higher than it was prior to learning of the DNA match. The matching alleles need not make the probability equal to 1 to constitute a useful identification.

12. NRC Committee on Identifying the Needs of the Forensic Science Community, Strengthening Forensic Science in the United States: A Path Forward 144 (2009)(footnote omitted).

13. Statistical Assessments states that fingerprint analysts currently compare “latent prints found at a [crime scene] with those from a database of ‘latent exemplars’ taken under controlled conditions.” Does this mean that latent print examiners create an ad hoc databank in each case of a suspect’s latent prints to gain a sense of the variability of those prints? I had always thought that examiners merely compare a given latent print to exemplars of full prints from suspects (what used to be called “rolled prints”). In the same vein, giving DNA profiling as an example, Statistical Assessments asserts that “[a]nalysis of the evidence generally proceeds by comparing it with specimens in a database.” However, even if CODIS database trawls have become routine, the existence and use of a large database has little to do with the validity of the side-by-side comparisons that typify fingerprint, bullet, handwriting, and hair analyses.

14. See, e.g., R.M. Dawes et al., Clinical Versus Actuarial Judgment, 243 Science 1668 (1989) (“Research comparing these two approaches shows the actuarial method to be superior.”); William M. Grove & Paul E. Meehl, Comparative Efficiency of Informal (Subjective, Impressionistic) and Formal (Mechanical, Algorithmic) Prediction Procedures: The Clinical–Statistical Controversy, 2 Psych., Pub. Pol’y & L. 293 (1996) (“Empirical comparisons of the accuracy of the two methods (136 studies over a wide range of predictands) show that the mechanical method is almost invariably equal to or superior to the clinical method.”); Konstantinos V. Katsikopoulos et al., From Meehl to Fast and Frugal Heuristics (and Back): New Insights into How to Bridge the Clinical–Actuarial Divide, 18 Theory & Psych. 443 (2008); Steven Schwartz & Timothy Griffin, Medical Thinking: The Psychology of Medical Judgment and Decision Making (2012).
Acknowledgement

Thanks are due to Barry Scheck for calling the article discussed here to my attention.
