Saturday, December 29, 2018

Results of a Proficiency Test of Hair Examiners

Existing proficiency tests of forensic examiners who offer opinions on the origins of trace evidence are not designed to estimate the conditional probabilities for false negatives (exclusions) and false positives (inclusions). 1/ They rarely replicate the conditions of casework; indeed, proficiency exercises may be significantly easier than casework, and the examiners usually know they are being tested.

This is not to question the importance and value of basic proficiency testing. Even simple tests can inform managers of the capabilities of examiners and identify some analysts who could benefit from additional training. But it remains tempting to think that high scores on proficiency tests mean that examiners rarely make mistakes in practice. 2/ Conversely, it has been argued that poor scores on proficiency tests are danger signs for admitting or relying on expert opinions in court. 3/

To the extent that current proficiency test data are pertinent to accuracy in casework, a recent test by Forensic Testing Services may be of interest. 4/ FTS administered the test to hair examiners at an undisclosed number of laboratories; 45 of the 52 examiners completed it. The test asked whether examiners could tell that a small set of "questioned" hairs, which FTS had sampled from one individual, did not come from either of two other individuals, as judged by comparison to reference samples that FTS prepared from those individuals. The questioned sample, designated Item #3 (and said for test purposes to have been found clutched in the hand of a victim), was a set of five hairs from the scalp of a 56-year-old white woman. Item #1 was a set of ten hairs from the scalp of a deceased 87-year-old white male. Item #2 was ten hairs from the scalp of a deceased 65-year-old white male.

The examiners were asked to determine whether item #3 -- the five "questioned" hairs -- is "consistent in microscopic characteristics to the known hair sample sets" #1 and #2. They also considered macroscopic characteristics such as color. A "Statement Regarding Test Design" explained that
Different hairs from the same body region of a person exhibit variation in microscopical characteristics and features. It is difficult to prepare a microscopical examination of hair proficiency test due to this natural variation. Our approach to this test is to provide several questioned hairs (known to be from the same individual) to compare as a group to a sample of known hair. Although the test is not realistic from the standpoint that most analysts would characterize each hair individually, this approach is designed to ensure consistency between distributed tests. [5/]

We realize that most examiners would prefer larger sample sets of known hairs. The use of smaller known sample sets was also intended to ensure consistency [among] distributed tests. [6/]
In essence, the examiners responded C (consistent), X (excluded), or I (inconclusive) as shown in the following table:


     #3 from #1?   #3 from #2?
C         6             0        ← false inclusions
X        37            41        ← true exclusions
I         2             4        ← no conclusion

Putting aside the inconclusives (which should not have any impact in court even though they are an important aspect of an examiner's proficiency at acquiring and evaluating information), the comparisons of #3 to #1 produced 6 / (6+37) = 14% false inclusions, and the comparisons of #3 to #2 produced no false inclusions. Pooling these decisions, the examiners made (6+0) / (6+37 + 0+41) = 6/84 = 7.1% false inclusions.
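The arithmetic above can be checked with a short sketch. The counts come from the table; the function name is merely illustrative:

```python
# Sketch: recomputing the reported false-inclusion rates from the
# proficiency-test counts, with inconclusive responses set aside.
def false_inclusion_rate(false_incl, true_excl):
    """Proportion of definite conclusions that were false inclusions."""
    return false_incl / (false_incl + true_excl)

rate_1 = false_inclusion_rate(6, 37)            # Item #3 vs. Item #1
rate_2 = false_inclusion_rate(0, 41)            # Item #3 vs. Item #2
pooled = false_inclusion_rate(6 + 0, 37 + 41)   # both comparisons pooled

print(f"#3 vs #1: {rate_1:.1%}")   # 14.0%
print(f"#3 vs #2: {rate_2:.1%}")   # 0.0%
print(f"pooled:   {pooled:.1%}")   # 7.1%
```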

Because the examiners had no opportunity to compare the five questioned hairs to a representative sample of the woman's hairs, the proficiency test yields no true inclusions. Consequently, it is not possible to estimate the probative value of the hair examiners' conclusions of consistency. To see this, suppose that a reference set of #3 hairs had been provided and that, disappointingly, the examiners found consistency only 7.1% of the time in this situation. Then the examiners would be declaring consistency as often with different sources as with the same source! 7/ Findings of C would not help a judge or jury distinguish between sources and (these particular) nonsources of the questioned hairs.

Of course, it seems likely that examiners would achieve a higher proportion of true inclusions than a mere 7.1%. If the proportion were, say, 71%, judgments of C would offer ten times as much support to the hypothesis that the inclusion is correct as to the hypothesis that a nonsource is included. The highest probative value (point estimate) compatible with the reported data occurs when the examiners are perfect at responding to true sources with a judgment of C. In that case, the ratio would be 100% / 7.1% = 14.
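This reasoning can be sketched numerically. Only the 7.1% pooled false-inclusion proportion comes from the test; the sensitivity values below are hypothetical:

```python
# Sketch: likelihood ratios for a finding of C (consistent) under
# hypothetical sensitivities. The denominator is the observed pooled
# false-inclusion proportion; the sensitivities are assumed for illustration.
p_c_given_nonsource = 6 / 84  # observed Pr(C | nonsource)

def likelihood_ratio(sensitivity):
    """LR = Pr(C | true source) / Pr(C | nonsource)."""
    return sensitivity / p_c_given_nonsource

for sens in (6 / 84, 0.71, 1.0):  # hypothetical Pr(C | true source)
    print(f"sensitivity {sens:.1%}: LR ≈ {likelihood_ratio(sens):.0f}")
```

The three printed ratios are the figures of roughly 1, 10, and 14 discussed in the text.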

But figures like 1, 10, and 14 are speculative. This proficiency test provides no data on sensitivity (to use the technical term for Pr(C | #3)), which is an essential component of probative value. 8/ Proficiency test manufacturers might want to consider adding a measure of sensitivity to their tests.

NOTES
  1. See, e.g., Jonathan J. Koehler, Proficiency Tests to Estimate Error Rates in the Forensic Sciences, 12 Law, Probability & Risk 89 (2013).
  2. E.g., United States v. Crisp, 324 F.3d 261, 270 (4th Cir. 2003) (handwriting expert "had passed numerous proficiency tests, consistently receiving perfect scores"); United States v. Otero, 849 F. Supp. 2d 425, 434 (D.N.J. 2012) (because "proficiency testing is indicative of a low error rate ... for false identifications made by trained examiners ... this Daubert factor also weighs in favor of admitting the challenged expert testimony").
  3. E.g., Edward J. Imwinkelried, The Constitutionality of Introducing Evaluative Laboratory Reports Against Criminal Defendants, 30 Hastings L.J. 621, 636 (1979); Randolph N. Jonakait, Forensic Science: The Need for Regulation, 4 Harv. J.L. & Tech. 109 (1991).
  4. Forensic Testing Services, 2018 Hair Comparison Proficiency Test FTS‐18‐HAIR1 Summary Report.
  5. Because the examiners were told, in effect, that the questioned hairs all had the same source for a reason having nothing to do with their physical features, a report on the five questioned hairs individually or on the internal consistency of that set would not have tested their skill at distinguishing hairs on the basis of those features.
  6. According to R.A. Wickenheiser & D.G. Hepworth, Further Evaluation of Probabilities in Human Scalp Hair Comparisons, 35 J. Forensic Sci. 1323, 1329 (1990), "[m]acroscopic selection of 5-13 mutually dissimilar hairs was frequently unrepresentative of the microscopic range of features present in the known samples." The FTS summary does not state how the two sets of ten reference hairs were selected. Failing to capture the full range of variation in the head hairs of an individual would increase the chance of an exclusion. For this study, that could decrease the proportion of false inclusions and inflate the proportion of true inclusions.
  7. The likelihood ratio in the sample would be Pr(C | #3) / Pr(C | #1 or #2) = 7.1 / 7.1 = 1.
  8. See, e.g., David H. Kaye, Review-essay, Digging into the Foundations of Evidence Law, 116 Mich. L. Rev. 915 (2017).
