Tuesday, December 24, 2013

Breathalyzers and Beyond: The Unintuitive Meanings of "Measurement Error" and "True Values" in the 2009 NRC Report on Forensic Science

Five years ago, the National Research Council released its eagerly awaited and repeatedly postponed report on "Strengthening Forensic Science in the United States: A Path Forward." One theme of the report was that forensic experts must present their findings with due recognition of Rumsfeldian "known unknowns." For example, the report repeatedly referred to "the importance of ... a measurement with an interval that has a high probability of containing the true value" (NRC Committee 2009, p. 121), and it referred to "error rates" for categorical determinations (ibid., pp. 117-22). 

Earlier this year, UC-Davis law professor and evidence guru Edward Imwinkelried and I submitted a letter urging the Washington Supreme Court to review a case raising the issue of whether the state courts should admit point estimates of blood or breath alcohol concentration without an accompanying quantitative estimate of the uncertainty in each estimate. (The court denied review.) Since the NRC report uses breath-alcohol measurements to explain the meaning of its call for interval estimates, one would think that the report would have a good illustration of a suitable interval. But that is not what I found. The report's illustration reads as follows:
As with all other scientific investigations, laboratory analyses conducted by forensic scientists are subject to measurement error. Such error reflects the intrinsic strengths and limitations of the particular scientific technique. For example, methods for measuring the level of blood alcohol in an individual or methods for measuring the heroin content of a sample can do so only within a confidence interval of possible values. In addition to the inherent limitations of the measurement technique, a range of other factors may also be present and can affect the accuracy of laboratory analyses. Such factors may include deficiencies in the reference materials used in the analysis, equipment errors, environmental conditions that lie outside the range within which the method was validated, sample mix-ups and contamination, transcriptional errors, and more.

Consider, for example, a case in which an instrument (e.g., a breathalyzer such as Intoxilyzer) is used to measure the blood-alcohol level of an individual three times, and the three measurements are 0.08 percent, 0.09 percent, and 0.10 percent. The variability in the three measurements may arise from the internal components of the instrument, the different times and ways in which the measurements were taken, or a variety of other factors. These measured results need to be reported, along with a confidence interval that has a high probability of containing the true blood-alcohol level (e.g., the mean plus or minus two standard deviations). For this illustration, the average is 0.09 percent and the standard deviation is 0.01 percent; therefore, a two-standard-deviation confidence interval (0.07 percent, 0.11 percent) has a high probability of containing the person’s true blood-alcohol level. (Statistical models dictate the methods for generating such intervals in other circumstances so that they have a high probability of containing the true result.)
(Ibid., pp. 116-17.)
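
The arithmetic in the quoted illustration can be reproduced in a few lines. This is only a sketch: it assumes the committee used the sample standard deviation (n - 1 in the denominator), which the report does not say, and the function name `two_sd_interval` is made up for this post.

```python
from statistics import mean, stdev

def two_sd_interval(readings):
    """Return (mean, sample sd, lower, upper) for the interval mean +/- 2 sd."""
    m = mean(readings)
    s = stdev(readings)  # assumption: sample sd with an n - 1 denominator
    return m, s, m - 2 * s, m + 2 * s

# The three readings given in the report's example.
readings = [0.08, 0.09, 0.10]
m, s, lower, upper = two_sd_interval(readings)
print(f"mean = {m:.2f}, sd = {s:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Rounded to two decimals, this gives the report's figures: a mean of 0.09, a standard deviation of 0.01, and an interval of 0.07 to 0.11.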

What is troublesome about this explanation? Let me count the ways.

1. "Measurement error" does not refer to all errors of measurement

"[D]eficiencies in the reference materials used in the analysis, equipment errors, environmental conditions that lie outside the range within which the method was validated, sample mix-ups and contamination, transcriptional errors, and more" all "can affect the accuracy of laboratory analyses." Nevertheless, they do not count as "measurement error" because they are "factors other than the inherent limitations of the measurement technique." Not being "intrinsic [to] the particular scientific technique," they fall outside the committee's definition of "measurement error."

That narrow definition calls to mind the claims of some fingerprint analysts that the ACE-V method has a "methodological" error rate of zero because the only possibility for error arises when a human being does not apply the method perfectly. The difference, however, is that one can measure the errors when the breathalyzer has no deficient reference materials, no extreme environmental conditions, no sample mix-ups and contamination, no transcriptional errors, and so on. The fingerprint analyst, in contrast, is the measuring instrument, and it is impossible to distinguish between instrument measurement error and human error in that context.

There is nothing illogical in quantifying some but not all measurement errors when some are more readily and validly quantifiable than others. Machines might not be tested periodically to ensure that they are operating as they are supposed to (e.g., DiFilipo 2011; Sovern 2012), but whether one can usefully build that possibility into the computation of the uncertainty of a measurement that might be suitable for courtroom testimony is not clear. Yet, using the seemingly all-encompassing phrase "measurement error" in a narrow, technical sense -- to denote only the noise inherent in the apparatus when operated under certain conditions -- is potentially misleading.

2. "True values" are not true blood-alcohol levels.

Because the committee's example of "measurement error" quantifies only "intrinsic" error, its statement that "a two-standard-deviation confidence interval (0.07 percent, 0.11 percent) has a high probability of containing the person’s true blood-alcohol level" also is easily misunderstood. The confidence interval (CI) for "true values" does not pertain to the actual blood-alcohol level. That level can differ from the point estimate of 0.09 for other reasons, making the real uncertainty greater than ± 0.02.

In addition, a breathalyzer measures alcohol in the breath, not in the bloodstream. The concentrations are related, but the precise functional relationship varies across individuals (e.g., Martinez & Martinez 2002). This is another source of uncertainty not reflected in the committee's CI for blood-alcohol concentration (BAC), although the committee could have sidestepped this issue by referring to breath-alcohol concentration (BrAC).

3. The standard error of the breathalyzer would be determined differently.

The NRC committee imagines using a breathalyzer to make three measurements of the same breath sample. The parenthetical, concluding sentence about "statistical models" for "other circumstances" suggests that the committee realized that this approach is not one that anyone would use to estimate the noise in the apparatus. The breathalyzer should be tested on many samples with known concentrations to ensure that it is not biased and to quantify the extent of the random variations about those known values. Manufacturers perform such tests (e.g., Coyle et al. 2010).
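
A minimal sketch of that kind of validation check follows. Everything in it is hypothetical: the reference concentrations, the number of repeats, and the simulated readings (drawn with no bias and a standard deviation of 0.01) are stand-ins, not data from any manufacturer or study. The point is only the form of the computation: compare readings with known values, then summarize the errors by their mean (bias) and standard deviation (random variation).

```python
import random
from statistics import mean, stdev

def calibration_summary(known, measured):
    """Estimate bias (mean error) and scatter (sd of errors) about known values."""
    errors = [m - k for k, m in zip(known, measured)]
    return mean(errors), stdev(errors)

# Hypothetical validation run: 5 reference concentrations, 40 readings each.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
known = [0.02, 0.04, 0.08, 0.10, 0.15] * 40
# Simulated readings: unbiased, with a measurement sd of 0.01 (an assumption).
measured = [rng.gauss(k, 0.01) for k in known]

bias, sd = calibration_summary(known, measured)
print(f"estimated bias = {bias:+.4f}, estimated sd = {sd:.4f}")
```

With 200 readings rather than three, the estimated standard deviation should land close to the 0.01 built into the simulation, and a bias estimate far from zero would flag a miscalibrated instrument.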

4. A CI of ±2 standard errors might not have "a high probability of containing the person’s true blood-alcohol level"

Let's put aside all the concerns raised so far. Suppose that the errors in the machine's measurement always are normally distributed about the true value in a breath sample; that the applicable standard deviation for this distribution is 0.01; and that the single measured value is 0.09. Is it now true that the interval 0.09 ± 0.02 "has a high probability of containing the person’s true blood-alcohol level"?

Maybe. Two standard errors give an interval with a confidence coefficient of approximately 95%. That is to say that this one interval comes from a procedure that generates intervals that cover the true value about 95% of the time. It is tempting to say that the probability that the interval in question covers the true value therefore is 95%.
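
The "procedure" sense of the 95% figure can be checked by simulation. The sketch below assumes normally distributed errors with a known standard deviation of 0.01 around a fixed true value; the function name and the default values are illustrative only.

```python
import random

def coverage(true_value=0.09, sd=0.01, k=2.0, trials=100_000, seed=0):
    """Fraction of intervals (reading +/- k*sd) that contain the true value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        reading = rng.gauss(true_value, sd)  # one noisy measurement
        if reading - k * sd <= true_value <= reading + k * sd:
            hits += 1
    return hits / trials

print(f"coverage of +/- 2 sd intervals: {coverage():.3f}")
```

The simulated coverage should be close to the exact normal-theory value of about 95.4% for ±2 standard deviations. Note what is held fixed here: the true value. The coverage describes the long-run behavior of the procedure, not the probability for one interval once something is known about where true values tend to lie.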

But let's think about how the sample came to be tested. The arresting officer picks someone out of a population of motorists. The motorists have varying BrACs, and the officer has some level of skill in spotting the ones who might well be inebriated. Suppose that the drivers the officer stops and tests have BrACs that are normally distributed with mean 0.04 and standard deviation 0.01. The officer's breathalyzer is functioning according to the manufacturer's specifications, and the standard deviation in its measurements is 0.01, as in the NRC report. Having obtained a measurement of 0.09 on the one driver's breath sample, what is a high probability interval for true BrAC in this one breath sample? Is it 0.07 to 0.11?

It turns out the probability that the true BrAC falls within the NRC's interval is only 24% (applying equations 2.9 and 2.10 in Gelman et al. 2004). If the officer stopped drivers whose mean BrAC was greater than 0.04 or whose BrACs were more variable, the probability for the NRC's interval being correct would be greater. If, for example, the standard deviation in this group were 0.02 instead of 0.01 (and the mean were still 0.04), then the probability for the NRC's interval would be 87%.
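
These figures can be reproduced with the standard formulas for combining a normal prior with a single normal measurement of known variance, which are presumably what the cited equations express. The sketch takes the measured value to be 0.09, the reading that yields the 24% figure; the function names are invented for this illustration.

```python
from statistics import NormalDist

def posterior(prior_mean, prior_sd, reading, meas_sd):
    """Normal posterior for the true value given one normal measurement."""
    prior_prec = 1 / prior_sd ** 2   # precision = 1 / variance
    meas_prec = 1 / meas_sd ** 2
    post_var = 1 / (prior_prec + meas_prec)
    # Posterior mean is the precision-weighted average of prior mean and reading.
    post_mean = post_var * (prior_prec * prior_mean + meas_prec * reading)
    return NormalDist(post_mean, post_var ** 0.5)

def prob_in_interval(dist, lower, upper):
    """Probability that the true value lies between lower and upper."""
    return dist.cdf(upper) - dist.cdf(lower)

for prior_sd in (0.01, 0.02):
    post = posterior(prior_mean=0.04, prior_sd=prior_sd, reading=0.09, meas_sd=0.01)
    p = prob_in_interval(post, 0.07, 0.11)
    print(f"prior sd {prior_sd}: posterior mean {post.mean:.3f}, "
          f"P(0.07 < BrAC < 0.11) = {p:.0%}")
```

With a prior standard deviation of 0.01, the posterior is centered at 0.065, below the lower end of the NRC's interval, which is why the probability is so low (about 24%). With a prior standard deviation of 0.02, the posterior mean moves up to 0.08 and the probability rises to about 87%.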

Of course, we do not know much about the distribution of BrAC in the group that the officer stops. As indicated above, this distribution would depend on the drinking habits of drivers in the town and the officer's skill in pulling over drunken drivers. The choice of a normal distribution with the parameters mentioned above is not likely to be realistic. But whatever the distribution may be, it, along with the single measured value, bears on the true value of the tested driver's BrAC. This fact makes it tricky to quantify the probability that the NRC's CI includes the driver's BrAC.

* * *

The NRC Report was certainly correct to call on forensic scientists to develop better measures of the uncertainty in their findings and to apply them in their reports and testimony. But figuring out what these measures should be and how to use them is a formidable challenge. Meeting this challenge will be a lot harder than the simple example of a confidence interval in the report might suggest.

