Friday, September 3, 2021

Does Qualitative Measurement Uncertainty Exist?

I have heard it said that forensic-science standards for interpreting the results of chemical or other tests need not discuss uncertainty in measurements of qualitative properties. For instance, ASTM International appropriately requires standards for test methods to include a section reporting on precision and bias as manifested in interlaboratory tests. Yet, it applies this requirement exclusively to quantitative measurements. Its 2021 style manual is unequivocal:

When a test method specifies that a test result is a nonnumerical report of success or failure or other categorization or classification based on criteria specified in the procedure, use a statement on precision and bias such as the following: “Precision and Bias—No information is presented about either the precision or bias of Test Method X0000 for measuring (insert here the name of the property) since the test result is nonquantitative” (ASTM 2020, § A21.5.4, pp. A3-A14).

Qualitative measurements are observation-statements such as the ink is blue, the friction ridge skin pattern includes loops, the bloodstain displays a cessation pattern, the blood group is type A, the glass fragments fit together perfectly, or the material contains cocaine. Likewise, the statements could be comparative: the recording of an unknown bell ringing sounds like it has a higher pitch than the ringing of a known bell; the hairs are microscopically indistinguishable; or the striations on the recovered bullet and the test bullet line up when viewed in the comparison microscope.

“Precision” is defined as “the closeness of agreement between test results obtained under prescribed conditions” (ibid. § A21.2.1, at A12). “A statement on precision allows potential users of the test method to assess in general terms its usefulness in proposed applications” and is mandatory (ibid. § A21.2, at A12). So how can it be that statements of precision and bias are not allowed for qualitative as opposed to quantitative findings? In both situations, the system that generates the findings could be noisy or skewed in its outcomes.

The only answer I have heard is that measurements cannot be qualitative because the word "measurement" is reserved for determining the magnitude of quantities such as length or mass. The values of these quantitative variables are basically isomorphic to the nonnegative real numbers. Counts, such as the number of alpha particles emitted in a given interval of time by radium atoms, also qualify as measurements because there is a quantitative, additive structure to them. The values of the variable are basically isomorphic to the natural numbers. Properties that only have names are described by nominal variables. Although numbers can be assigned (1 for a match and 0 for a nonmatch, for example), these numbers are no more a measurement than a social security number is. In short, the argument is that because “measurements” do not include qualitative judgments, classifications, decisions, identifications, or whatever one might call them, no statement of measurement uncertainty or error is possible, let alone required.

This argument is incredibly weak. To begin with, the definition of “measurement” is highly contested. As one guide from NIST explains, a “much wider” conception of measurement than the one “contemplated in the current version of the International vocabulary of metrology (VIM)” has been developed in the metrology literature, and the measurand “may be ... qualitative (for example, the provenance of a glass fragment determined in a forensic investigation)” (Possolo 2015). Broader conceptions of measurement have been the subject of many decades of writing in psychology and psychometrics (see, e.g., Humphry 2017; Mitchell 1990). Philosophers have been struggling to describe the scope and meaning of "measurement" at least since Aristotle (see, e.g., Tal 2015).

Second, even if one agrees with the definition in one NIST publication that “[m]easurement is [confined to] an experimental process that produces a value that can reasonably be attributed to a quantitative property of a phenomenon, body, or substance” (NIST 2019), some qualitative observations fit this definition. The color of a strip of litmus paper, for instance, can be understood as a value “that can reasonably be attributed to a quantitative property.” It is simply a crude measurement of pH.

Finally, the argument that there can be no measurement error for qualitative properties because those properties are not really “measured” is a semantic ploy that misses the point. The observations or estimates of nonquantitative properties as well as the individual measurements of quantitative properties are all subject to possible random and systematic error, and statements expressing the range of probable error for all measurements, observations, estimates, and classifications are essential. The need for these statements cannot be avoided for qualitative properties or judgments by the fiat of the VIM or some other dictionary. Even if “measurement” must be read in one particular, narrow, technical sense, “evaluation uncertainty” or “examination uncertainty” still must be reckoned with (Mari et al. 2020).

In sum, there is no excuse for ASTM and other organizations promulgating standards for forensic-science test methods to exempt any reported findings from required statements of uncertainty. Many statistics can be used to indicate how reliable (repeatable and reproducible) and valid (accurate) the test results may be (ibid.; Ellison & Gregory 1998; Pendrill & Petersson 2016). The qualitative-quantitative distinction affects the choice of the statistical method or expression but not the need to have one.
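
As a rough illustration of what such statistics might look like for a yes/no test result, here is a minimal Python sketch. The counts and the function names are hypothetical, invented only for this example; they do not come from any of the cited studies, and sensitivity, specificity, and Cohen's kappa are just three of the many statistics one could choose.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Validity of a yes/no test result judged against known ground truth."""
    sensitivity = tp / (tp + fn)  # share of true "yes" items reported as "yes"
    specificity = tn / (tn + fp)  # share of true "no" items reported as "no"
    return sensitivity, specificity


def cohens_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Chance-corrected agreement between two labs classifying the same items."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_yes_b_no) / n  # lab A's overall "yes" rate
    b_yes = (both_yes + a_no_b_yes) / n  # lab B's overall "yes" rate
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical validation study: 100 samples containing cocaine, 100 without.
sens, spec = sensitivity_specificity(tp=95, fn=5, tn=98, fp=2)

# Hypothetical interlaboratory study: two labs test the same 200 samples.
kappa = cohens_kappa(both_yes=90, a_yes_b_no=7, a_no_b_yes=5, both_no=98)

print(f"sensitivity={sens:.2f} specificity={spec:.2f} kappa={kappa:.2f}")
```

The first pair of numbers speaks to validity (how often the reported category is the true one); kappa speaks to reproducibility (how often two laboratories report the same category beyond what chance agreement would produce).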

REFERENCES

  • ASTM Int’l, Form and Style for ASTM Standards (2020), https://www.astm.org/FormStyle_for_ASTM_STDS.html.
  • Stephen L. R. Ellison & Soumi Gregory, Perspective: Quantifying Uncertainty in Qualitative Analysis, Analyst 123, 1155-1161 (1998), https://doi.org/10.1039/A707970B
  • Stephen M. Humphry, Psychological Measurement: Theory, Paradoxes, and Prototypes, 27(3) Theory & Psychology 407–418 (2017)
  • L. Mari, C. Narduzzi, S. Trapmann, Foundations of Uncertainty in Evaluation of Nominal properties, 152 Measurement 107397 (2020), DOI:10.1016/j.measurement.2019.107397
  • Joel Mitchell, An Introduction to the Logic of Psychological Measurement (1990)
  • NIST, Statistical Engineering Division, Measurement Uncertainty, updated Nov. 15, 2019, https://www.nist.gov/itl/sed/topic-areas/measurement-uncertainty
  • Leslie Pendrill & Niclas Petersson, Metrology of human-based and other qualitative measurements, 27(9) Measurement Sci. Technol. 094003 (2016)
  • A. Possolo, Simple Guide for Evaluating and Expressing the Uncertainty of NIST
    Measurement Results (NIST Technical Note 1900), 2015, doi: 10.6028/NIST.TN.1900
  • Eran Tal, Measurement in Science, in Stanford Encyclopedia of Philosophy (Edward N. Zalta ed. 2015), https://plato.stanford.edu/archives/fall2017/entries/measurement-science/

APPENDIX: ADDITIONAL PUBLICATIONS ON "QUALITATIVE MEASUREMENT"

  1. Mary J. Allen & Wendy M. Yen, Introduction to Measurement Theory 2 (1979) ("In measurement, numbers are assigned systematically and can be of various forms. For example, labeling people with red hair "1" and people with brown hair "2" is a measurement. Since numbers are assigned to individuals in a systematic way and differences between scores represent differences in the property being measured (hair color).")
  2. Peter-Th. Wilrich, The determination of precision of qualitative measurement methods by interlaboratory experiments, Accreditation and quality assurance, 15: 439-444 (2010)
  3. Boris L. Milman, Identification of chemical compounds, Trends in Analytical Chemistry, 24:6, 2005 ("identification itself is considered as measurement on a qualitative scale")
  4. NIST Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice Through a Systems Approach, Gaithersburg: National Institute of Standards and Technology, David H. Kaye ed., 2012 (defining "measurement" broadly, to encompass categorical variables, including the examiner's judgment about the source of a print).
  5. Donald Richards, Simultaneous Quantitative and Qualitative Measurements in Drug-Metabolism Investigations, Pharmaceutical Technology 2013
  6. Kadri Orro, Olga Smirnova, Jelena Arshavskaja, Kristiina Salk, Anne Meikas, Susan Pihelgas, Reet Rumvolt, Külli Kingo, Aram Kazarjan, Toomas Neuman & Pieter Spee, Development of TAP, a non-invasive test for qualitative and quantitative measurements of biomarkers from the skin surface, Biomarker Research 2: 20 (2014)
  7. J M Conly & K Stein, Quantitative and qualitative measurements of K vitamins in human intestinal contents, Am J Gastroenterol. 1992 Mar;87(3):311-316
  8. Wenjia Meng, Qian Zheng, Gang Pan, Qualitative Measurements of Policy Discrepancy for Return-Based Deep Q-Network, IEEE Transactions on Neural Networks and Learning Systems 2020
  9. Rudolf M. Verdaasdonk, Jovanie Razafindrakoto, Philip Green, Real time large scale air flow imaging for qualitative measurements in view of infection control in the OR (Conference Presentation) Proceedings Volume 10870, Design and Quality for Biomedical Technologies XII; 1087002 (2019) https://doi.org/10.1117/12.2511185
  10. Rashis, Bernard, Witte, William G. & Hopko, Russell N., Qualitative Measurements of the Effective Heats of Ablation of Several Materials in Supersonic Air Jets at Stagnation Temperatures Up to 11,000 Degrees F, National Advisory Committee for Aeronautics, July 7, 1958
  11. Lawrence F Cunningham and Clifford E Young, Quantitative and Qualitative Approaches, Journal of Public Transportation 1(4) (1997) ("The study also contrasts the results of quantitative and qualitative measurements and methodologies for assessing transportation service quality")
  12. JM Conly, K Stein, Quantitative and qualitative measurements of K vitamins in human intestinal contents, American Journal of Gastroenterology, 1992
  13. P Sinha, Workshop on Biologically Motivated Computer Vision, 2002 - Springer ("Our emphasis on the use of qualitative measurements renders the representations stable in the presence of sensor noise and significant changes in object appearance. We develop our ideas in the context of the task of face-detection under varying illumination")
  14. D Michalski, S Liebig, E Thomae & A Hinz, Pain in Patients with Multiple Sclerosis: a Complex Assessment Including Quantitative and Qualitative Measurements, 40 J. Pain 219–225 (2011), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3160835/
  15. Cécilia Merlen, Marie Verriele, Sabine Crunaire, Vincent Ricard, Pascal Kaluzny, Nadine Locoge, Quantitative or Only Qualitative Measurements of Sulfur Compounds in Ambient Air at Ppb Level? Uncertainties Assessment for Active Sampling with Tenax TA®, 132 Microchemical J. 143-153 (2017)
  16. Tomomichi Suzuki, Jun Ichi Takeshita, Mayu Ogawa, Xiao-Nan Lu, Yoshikazu Ojima, Analysis of Measurement Precision Experiment with Categorical Variables, 13th International Workshop on Intelligent Statistical Quality Control 2019, Hong Kong ("Evaluating performance of a measurement method is essential in metrology. Concepts of repeatability and reproducibility are introduced in ISO5725-1 (1994) including how to run and analyse experiments (usually collaborative studies) to obtain these precision measures. ISO5725-2 (1994) describe precision evaluation in quantitative measurements but not in qualitative measurements. Some methods have been proposed for qualitative measurements cases such as Wilrich (2010), de Mast & van Wieringen (2010), Bashkansky, Gadrich & Kuselman (2012). Item response theory (Muraki, 1992) is another methodology that can be used to analyse qualitative data.").

Monday, June 14, 2021

Tibbs, Shipp, and Harris on "Meaningful" Peer Review of Studies on Firearms-toolmark Matching

The Supreme Court's celebrated (but ambiguous) opinion in Daubert v. Merrell Dow Pharmaceuticals \1/ was a direct response to a seemingly simple rule--results that are not published in the peer-reviewed scientific literature are inadmissible to prove that a scientific theory or method is generally accepted in the scientific community. The Court unanimously rejected this strict rule--and more broadly, the very requirement of general acceptance--in favor of a multifaceted examination guided by four or five criteria that have come to be known as "the Daubert factors."

But "peer review and publication" lives on--not as a formal requirement, but as one of these factors. Thus, courts routinely ask whether the peer-reviewed scientific literature supports the reasoning or data that an expert is prepared to present at trial. All too often, however, the examination of the literature is cursory or superficial. The temptation, especially for overburdened judges not skilled in sorting through biomedical and other journals, is to check that there are articles on point, and if the theory has been discussed (critically or otherwise) in the literature, to write that the "peer review and publication" factor supports admission of the testimony.

One area in which this dynamic is apparent is traditional testimony of firearms examiners matching marks from guns to bullets or shell casings. \2/ Defendants have strenuously objected that the traditional association of particular guns with ammunition components is an inscrutable judgment call that does not pass muster under Daubert. Perhaps the most meticulous analysis of this issue comes from an unpublished opinion of Judge Todd Edelman in United States v. Tibbs. \3/ Judge Edelman's discussion of peer review and publication is unusually thorough and may have been penned as an antidote to the strategy in which the government gives the court a laundry list of articles that have discussed the procedure and the court checks off the "peer review and publication" box.

Being an opinion for a trial court (the District of Columbia Superior Court), Tibbs is not binding precedent for that court or any other, but it has not gone unnoticed. Two federal district courts recently reached opposing conclusions about Judge Edelman's analysis of one large segment of the literature cited in support of admitting match determinations--namely, the extensive research reported in the AFTE Journal. ("AFTE" stands for the Association of Firearm and Tool Mark Examiners. The organization was formed in 1969 in "recognition of the need for the interchange of information, methods, development of standards, and the furtherance of research, [by] a group of skilled and ethical firearm and/or toolmark examiners" who "stand prepared to give voice to this otherwise mute evidence." \4/)

Tibbs' Analysis of the AFTE Journal

Because of the AFTE Journal's orientation and editorial process, Tibbs did not give "the sheer number of studies conducted and published" there much weight. \5/ Judge Edelman made essentially four points about the journal:

  • Contrary to the testimony of the government’s experts, post-publication comments or later articles are not normally considered to be “peer review”;
  • The AFTE pre-publication peer review process is “open,” meaning that “both the author and reviewer know the other's identity and may contact each other during the review process”;
  • The reviewers who form the editorial board are all “members of AFTE” who may well “be trained and experienced in the field of firearms and toolmark examination, but do not necessarily have any ... training in research design and methodology” and who “have a vested, career-based interest in publishing studies that validate their own field and methodologies”; and
  • “AFTE does not make this publication generally available to the public or to ... reviewers and commentators outside of the organization's membership [and] unlike other scientific journals, the AFTE Journal ... cannot even be obtained in university libraries.” \6/

The court contrasted these aspects of the journal’s peer review to a "double-blind" process and observed that the AFTE “open” process was “highly unusual for the publication of empirical scientific research.” \7/ The full opinion, which develops these ideas more completely, can be found online.

Shipp

Senior Judge Nicholas Garaufis of the Eastern District of New York was impressed with "this thorough opinion." \8/ His opinion in United States v. Shipp referred to the "several pages analyzing the AFTE Journal's peer review process [that] highlight[] several reasons for assigning less weight to articles published in the AFTE Journal than in other publications" and added that

The court shares these concerns about the AFTE Journal's peer review process. In particular, the court is concerned that the reviewers, who are all members of the AFTE, have a vested, career-based interest in publishing studies that validate their own field and methodologies. Also concerning is the possibility that the reviewers may be trained and experienced in the field of firearms and toolmark identification, but [may] not necessarily have any specialized or even relevant training in research design and methodology. \9/

Harris

In contrast, Judge Rudolph Contreras of the U.S. District Court for the District of Columbia, writing in United States v. Harris, \10/ had nothing complimentary to say about Tibbs. This court defended the AFTE Journal research articles said to demonstrate the validity of firearms-toolmark identification with two rejoinders to Tibbs. First, Judge Contreras maintained that “there is far from consensus in the scientific community that double-blind peer review is the only meaningful kind of peer review.” \11/ This is true enough, but the issue raised by the criticism of “open” review is not whether double-blind review is better than single-blind review (in which the author does not know the identity of the referees) or some other system. It is whether “open” review conducted exclusively by AFTE members is the kind of peer review envisioned as a strong indicator of scientific soundness in Daubert. The factors enumerated in Tibbs make that a serious question.

Second, Judge Contreras observed that the Journal of Forensic Sciences, which uses double-blind review, republished one AFTE study. This solitary event, the Harris opinion suggests, is a “compelling” rebuttal of “the allegation by Judge Edelman in Tibbs that the AFTE Journal does not provide 'meaningful' review." \12/ But Judge Edelman never proposed that every article in the AFTE journal was without scientific merit. Rather, his point was far less extreme. It was merely that courts should not “accept at face value the assertions regarding the adequacy of the journal's peer review process.” \13/ That one article—or even dozens—published in the AFTE Journal could have been published in other journals reveals very little about the level and quality of AFTE review. After all, even a completely fraudulent review process that accepted articles for publication by flipping a coin would result in the publication of some excellent articles—but not because the review process was meaningful or trustworthy. In addition, one might ask whether the very fact that an article had to be republished in a more widely read journal fortifies the fourth point in Tibbs, that the journal’s circulation is too restricted to make its publications part of the mainstream scientific literature. The discussion of peer review and publication in Harris ignores this concern.
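
The coin-flip point can be checked with a few lines of arithmetic. The sketch below is only an illustration under invented assumptions (100 submissions, 20 of them excellent, and a hypothetical function name); it is not a model of any actual journal.

```python
from fractions import Fraction


def share_excellent_among_published(n_excellent, n_other, accept_prob):
    """Expected share of excellent papers among those accepted when every
    submission is accepted with the same probability, whatever its quality."""
    accepted_excellent = n_excellent * accept_prob
    accepted_other = n_other * accept_prob
    return accepted_excellent / (accepted_excellent + accepted_other)


# Hypothetical pool: 20 excellent submissions and 80 others.
submitted_share = Fraction(20, 20 + 80)

# "Review" by coin flip: each paper is accepted with probability 1/2.
published_share = share_excellent_among_published(20, 80, Fraction(1, 2))

print(f"excellent share submitted: {submitted_share}")
print(f"excellent share published: {published_share}")
```

Under these assumptions the published share of excellent papers simply equals the submitted share. The journal prints some excellent work, but the review process has contributed nothing to that outcome, which is why the existence of good articles says little about the quality of the review.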

Beyond the AFTE Journal

The significant concerns exposed in Tibbs do not prove that the peer-reviewed scientific literature, taken as a whole, undermines firearms identification as commonly practiced. They simply mean that the list of publications over the years in the AFTE Journal may not be entitled to great weight in evaluating whether the scientific literature supports the claim of firearms and toolmark examiners to be able to supply generally accurate and reliable "opinions relative to evidence which otherwise stands mute before the bar of justice." \14/

Fortunately, newer peer-reviewed studies exist, and not all the older research appears in the AFTE Journal. \15/ Thus, the Harris court asserted that

[E]ven if the Court were to discount the numerous peer-reviewed studies published in the AFTE Journal, Mr. Weller's affidavit also cites to forty-seven other scientific studies in the field of firearm and toolmark identification that have been published in eleven other peer-reviewed scientific journals. This alone would fulfill the required publication and peer review requirement. \16/

The last sentence could be misunderstood. As a statement that the 47 studies could be the basis of a scientifically informed judgment about the validity of firearms-toolmark matching, the conclusion is correct. As a statement that checking the "peer review and publication" box on the basis of a large number of studies published in the right places "alone" is a reason to admit the challenged testimony, it would be more problematic. The "required ... requirement" (to the extent Daubert imposes one) is for a substantial body of peer-reviewed papers that form a solid foundation for a scientific assessment of a method. Unless this research literature is actually supportive of the method, however, satisfying "the required publication and peer review requirement" is not a reason to admit the evidence.

Do the 47 studies (old and new) in widely accessible, quality journals all show that examiners' opinions derived from comparing toolmarks are consistently correct and stable for the kinds of comparisons made in practice? If so, then it is high time to stop the arguments over scientific validity. If not, if the 47 studies are of varying quality, scope, and relevance to ascertaining how repeatable, reproducible, and accurate the opinions rendered by firearms-toolmark examiners are, then there is room for further analysis of whether and how these experts can provide valuable information for the legal factfinders.

NOTES

  1. 509 U.S. 579 (1993).
  2. "[T]he process that most firearms examiners use when analyzing evidence" is described in graphic detail in "[t]he Firearms Process Map, which captures the ‘as-is’ state of firearms examination, provides details about the procedures, methods and decision points most frequently encountered in firearms examination." NIST, OSAC's Firearms & Toolmarks Subcommittee Develops Firearms Process Map, Jan. 19, 2021, https://www.nist.gov/news-events/news/2021/01/osacs-firearms-toolmarks-subcommittee-develops-firearms-process-map.
  3. No. 2016 CF1 19431, 2019 D.C. Super. LEXIS 9, 2019 WL 4359486 (D.C. Super. Ct., Sept. 5, 2019).
  4. AFTE Bylaws, Preamble, https://afte.org/about-us/bylaws
  5. 2019 D.C. Super. LEXIS 9, at *35. For a decade or so, both legal academics and forensic scientists had pointed to the AFTE Journal as an example of a practitioner-oriented outlet for publications that did not follow the peer review and publication practices of other scientific journals. See, e.g., David H. Kaye, Firearm-Mark Evidence: Looking Back and Looking Ahead, 68 Case W. Res. L. Rev. 723 (2018); Jennifer L. Mnookin et al., The Need for a Research Culture in the Forensic Sciences, 58 UCLA L. Rev. 725 (2011).
  6. 2019 D.C. Super. LEXIS 9, at *32-*33.
  7. Id. at *33.
  8. United States v. Shipp, 422 F.Supp.3d 762, 776 (E.D.N.Y. 2019).
  9. Id. (citations and internal quotation marks omitted). Nevertheless, the court found "sufficient peer review." It wrote that "even assigning limited weight to the substantial fraction of the literature that is published in the AFTE Journal, this factor still weighs in favor of admissibility. Daubert found the existence of peer-reviewed literature important because “submission to the scrutiny of the scientific community ... increases the likelihood that substantive flaws in the methodology will be detected.” Daubert, 509 U.S. at 593. Despite AFTE Journal’s open peer-review process, the AFTE Theory has still been subjected to significant scrutiny. ... Therefore, the court finds that the AFTE Theory has been sufficiently subjected to 'peer review and publication' [outside of the AFTE Journal].” Daubert, 509 U.S. at 594."
  10. 502 F.Supp.3d 28 (D.D.C. 2020).
  11. Id. at 40.
  12. Id.
  13. Tibbs, 2019 D.C. Super. LEXIS 9, at *29.
  14. AFTE Bylaws, Preamble, https://afte.org/about-us/bylaws.
  15. AFTE has sought to remedy at least one complained-of feature of its peer review process. In 2020, it instituted the double-blind peer review that the Harris court found unnecessary. AFTE Peer Review Process – January 2020, https://afte.org/afte-journal/afte-journal-peer-review-process. Whether the qualifications and backgrounds of the journal's referees have been changed is not apparent from the AFTE website.
  16. Harris, 502 F.Supp.3d at 40.

Monday, April 19, 2021

What is Accuracy?

The Organization of Scientific Area Committees for Forensic Science (OSAC) has an online "lexicon" that collects definitions of terms as they appear in published standards. 1/ These may or may not be the same as definitions in textbooks or other authoritative sources. 2/ They may or may not be accurate. (Yet, the drafters of OSAC standards sometimes point to the existence of a definition in the compendium as if it were a conclusive reason to perpetuate it. 3/)

Speaking of "accurate," the word "accuracy" has five overlapping definitions in OSAC's lexicon:

  • Closeness of agreement between a measured quantitiy [sic] value and a true quantity vlaue [sic] of a measurement.
  • The degree of agreement between a test result or measurement and the accepted reference value.
  • Closeness of agreement between a test result or measurement result and the true value. 1) In practice, the accepted reference value is substituted for the true value. 2) The term “accuracy,” when applied to a set of test or measurement results, involves a combination of random components and a common systematic error or bias component. 3) Accuracy refers to a combination of trueness and precision. [ISO 3534-2:2006].
  • The closeness of agreement between a test result and the accepted reference value. 1) In practice, the accepted reference value is substituted for the true value. 2) The term "accuracy," when applied to a set of test or measurement results, involves a combination of random components and a common systematic error or bias component. 3) Accuracy refers to a combination of trueness and precision.
  • Degree of conformity of a measure to a standard or true value.

Some of the definitions in the "lexicon" are designated "preferred terms." 4/ None of the definitions of "accuracy" is marked preferred.

The main difficulty with the forensic scientists' set of definitions is that "accuracy" can refer to single measurements or estimates or to a process for making measurements or estimates. The longer definitions are confusing because they do not make it plain that "a combination of trueness and precision" applies to the accuracy of the process (or a large set of measurements from the process) and not so much to the accuracy of particular measurements.

"Precision" refers to the dispersion of repeated measurements under the same conditions. A precise estimate comes from a process that generates measurements that are typically tightly clustered around some value -- without regard to whether that value is the true one. A set of precise measurements -- ones that come from a process that tends to generate similar measurements when repeated -- may be far from the true value. Such measurements (and the system that generates them) are statistically biased; these measurements have a systematic error component.

Conversely, an imprecise estimate -- one coming from a system that tends to produce widely divergent measurements -- may be essentially identical to the true value. Most other estimates from the same system would tend to stray farther from the true value, but to say that an estimate that is spot on is not accurate sounds odd. The estimate may be unreliable (in the statistical sense of coming from a process that is highly variable), but it is practically 100% accurate (in this case). Even a generally inaccurate system may produce some accurate results.
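
A small numerical sketch may make the two situations concrete. The measurements below are made up for illustration (as are the variable and function names); the only point is to show bias and spread being computed separately for a precise-but-biased system and an imprecise-but-unbiased one.

```python
from statistics import mean, stdev

TRUE_VALUE = 10.0  # hypothetical true value of the quantity being measured

# Hypothetical repeated measurements from two measuring systems.
precise_but_biased = [12.0, 12.1, 11.9, 12.0, 12.1, 11.9]
imprecise_unbiased = [6.0, 14.0, 10.0, 8.5, 11.5, 10.0]


def summarize(measurements, true_value):
    """Return (bias, spread): the mean error and the sample standard deviation."""
    bias = mean(measurements) - true_value
    spread = stdev(measurements)
    return bias, spread


for name, data in [("precise but biased", precise_but_biased),
                   ("imprecise but unbiased", imprecise_unbiased)]:
    bias, spread = summarize(data, TRUE_VALUE)
    smallest_error = min(abs(x - TRUE_VALUE) for x in data)
    print(f"{name}: bias={bias:+.2f}, spread={spread:.2f}, "
          f"smallest single error={smallest_error:.2f}")
```

In this made-up data the first system never comes closer than about 1.9 units to the truth even though its readings agree closely with one another, while the second system includes readings that hit the true value exactly despite its large spread.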

The epistemological problem is that we should not rely on an unreliable system to ascertain the true value. For extremely imprecise point estimates, accuracy (in the sense of the absence of error and correspondence to the truth) becomes a matter of luck. It is unwise to act as if a particular measurement (or a small number of them) from an unreliable system adds much to our knowledge.

But the fact that the individual estimates provide little information is not well expressed by describing a result that is (luckily) correct as lacking accuracy. The investment analyst who said that a bitcoin will increase in value by 50% tomorrow is accurate if bitcoin's price did spike by approximately 50%. Nevertheless, this accurate prediction probably was unwarranted. Unless the analyst had a remarkable history of consistently predicting the ups and downs of bitcoin and an articulable and plausible basis for making the predictions, giving much credence to the prediction before the fact would have been unjustified.

Let's apply these elementary ideas to some forensic measurements. Suppose that analysts in a laboratory use an appropriate instrument to measure the refractive index of glass fragments. Most analysts are extremely proficient. Their measurements are both reliable (repeatability is high) and generally close to the true values. A smaller number of analysts are less proficient. Indeed, they are downright sloppy. They are not biased -- they err in both directions -- but the values they come up with are highly variable. An analyst from the proficient group obtains the value x for a particular fragment, and so does an analyst in the sloppy group.

Should we say that x is an accurate value when it comes from one of the former analysts and inaccurate when it comes from one of the latter? Some of the definitions from the standards suggest (or could be read as giving) one answer, whereas others suggest the opposite. It is far more straightforward to say that x is accurate (if it is close to the truth) in both cases.

To be sure, precision is a component of accuracy in the long run -- the imprecise analysts will tend to have lower accuracy (and higher error) rates. Their reports do not provide a sound basis for action. They are neither trustworthy nor statistically reliable. But it invites confusion to characterize every such report -- even ones that provide perfectly or approximately true values -- as inaccurate. When speaking of particular measurements, we simply need to distinguish between those that are wrong because they are far from the truth -- inaccurate -- and those that are accurate -- close to the truth either by good fortune or because of true knowledge. Systems that use luck to get the right answers are systematically inaccurate; properly functioning systems grounded on true knowledge are systematically accurate.
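
The long-run difference between the two groups of analysts can be sketched with a simple simulation. Everything numerical here is an assumption made for illustration: the true refractive index, the tolerance for calling a single report "accurate," the normal error model, and the two standard deviations are all hypothetical.

```python
import random

TRUE_RI = 1.5180    # hypothetical true refractive index of the fragment
TOLERANCE = 0.0002  # hypothetical cutoff for calling one report "accurate"


def simulate_reports(sd, n, rng):
    """Unbiased analyst: reports scatter around the truth with spread sd."""
    return [rng.gauss(TRUE_RI, sd) for _ in range(n)]


def share_accurate(reports):
    """Fraction of individual reports that fall within TOLERANCE of the truth."""
    hits = sum(abs(r - TRUE_RI) <= TOLERANCE for r in reports)
    return hits / len(reports)


rng = random.Random(0)  # fixed seed so the run is repeatable
proficient = simulate_reports(sd=0.0001, n=10_000, rng=rng)
sloppy = simulate_reports(sd=0.0010, n=10_000, rng=rng)

print(f"proficient analysts: {share_accurate(proficient):.1%} of reports accurate")
print(f"sloppy analysts:     {share_accurate(sloppy):.1%} of reports accurate")
```

Under these assumptions neither group is biased, yet the sloppy group's reports are accurate far less often. Its accurate reports are still accurate; what differs is the rate at which the system produces them.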

NOTES

  1. "The OSAC Forensic Lexicon should be the primary resource for terminology and used when drafting and editing forensic science standards and other OSAC work products. It is continually updated with the latest work from OSAC units, as well as terms from newly published documentary standards and standards elevated to the OSAC Registry." OSAC Registry, https://lexicon.forensicosac.org/ (undated).
  2. Cf. id. ("The terms and definitions in the OSAC Lexicon come from the published literature, including documentary standards, specialized dictionaries, Scientific Working Group (SWG) documents, books, journal articles, and technical reports. When a suitable definition can’t be located in any of these sources, an OSAC unit generates new or modifies existing definitions. Gradually terms are evaluated and harmonized by the OSAC to a single term. This process results in an OSAC Preferred Term."). 
  3. E.g.,  Comment Adjudication, OSAC 2021-N-0001, Wildlife Forensics Method-Collection of Known DNA Samples from Domestic Mammals, Feb. 11, 2021, at cells L25 & L27 (OSAC Proposed Standard added to the Registry Apr. 6, 2021) (link to Excel spreadsheet at https://www.nist.gov/osac/public-documents).
  4. Id. They should be called "preferred definitions" for terms, and terms that are not supposed to be used in standards anymore should be called "disparaged terms," but I digress.
Last modified: 4/26/21 08:43 ET

Thursday, March 18, 2021

Europe's Fear of the AstraZeneca Vaccine: Post Hoc Ergo Propter Hoc?

When is the fact that Event B follows Event A proof that A causes B? The temporal sequence is a prerequisite for causation, but, in and of itself, the time sequence proves very little. Vaccines are a case in point. If many people receive a vaccine, some percentage of them will experience the slings and arrows of outrageous fortune. Some of them will suffer trauma from, say, automobile accidents. But non-vaccinated people also are caught up in auto accidents—perhaps even more often than their more cautious counterparts who chose to be vaccinated. Thus, the mere number of reports of the sequence of events A and B is not proof of any association between A and B, let alone a causal connection.

Furthermore, even if the incidence of accidents turned out to be greater among the vaccinated part of the population, the question of what might explain this association would need to be answered. Suppose, for instance, that the vaccine confers protection against Lyme disease, which is caused by a bacterium transmitted by certain tick bites. The vaccinated group might include disproportionately many hunters. Possibly, many of them may drive late at night, when they are tired, on dark, narrow, and winding roads, through woods teeming with deer and elk—factors that expose the hunters to a greater risk of an accident. The driving conditions are, in this hypothetical situation, a confounding variable.

The Loss of LYMErix

The link between Lyme-disease vaccine and automobile accidents is totally fictitious, but the history of the LYMErix vaccine provides an oft-told cautionary tale about risk perception and evaluation. In the early 1990s, SmithKlineBeecham developed LYMErix to stop infections by the spirochetal bacterium Borrelia burgdorferi that causes Lyme disease. The vaccine targeted a protein (called outer-surface protein A, or OspA) of the bacterium. In a clinical trial, individuals given three doses of LYMErix showed a 76% reduction in Lyme disease in the following year, with no significant side-effects. The U.S. Food and Drug Administration approved LYMErix on December 21, 1998.

But within a year, there were reports of adverse reactions after vaccination. The media carried stories of "vaccine victims," and the Philadelphia law firm of Sheller, Ludwig & Bailey filed a class action against SmithKlineBeecham. Other lawyers did the same.

By 2001, with over 1.4 million doses distributed in the United States, there were 59 reports of arthritis possibly associated with vaccination. The incidence was essentially the same as that in unvaccinated individuals. In addition, the data did not show a temporal spike in arthritis diagnoses after the second and third vaccine dose (an outcome that would be expected for the immune-mediated phenomenon that plaintiffs and others maintained was a side effect of the vaccine). The FDA found no suggestion of harm from the Lyme vaccine. Nevertheless, it soon reconvened its advisory panel for a raucous meeting with the LYMErix manufacturer, independent experts, practicing physicians, the "vaccine victims," and their lawyers.

Although the panel made no changes to the product's labelling or indications, vaccine sales plummeted. On February 26, 2002, GlaxoSmithKline (the successor to SKB) withdrew LYMErix from the market. On July 9, 2003, GSK settled the class actions by agreeing to pay over one million dollars to plaintiffs' lawyers for their fees and expenses, and nothing to the "vaccine victims."

Flash forward to 2021.

The British-Swedish pharmaceutical company AstraZeneca produced a vaccine developed by researchers at Oxford University that uses a harmless virus, modified to contain the DNA code for a protein on the surface of the virus (SARS-CoV-2) that causes COVID-19. When this altered (recombinant) virus (called ChAdOx1) infects human cells, the cells read the inserted sequence and make copies of the SARS-CoV-2 protein. 1/ The copies induce an immune response to SARS-CoV-2. Clinical trials in various countries suggested an overall efficacy of about 70%, 90%, and 62% (with substantial margins of error) depending on the doses.

As countries rushed to vaccinate their populations, reports of blood clots following vaccinations received intense publicity. Although the company and international regulators say there is no evidence that the shot is to blame, much of Europe (Denmark, Germany, France, Italy, Spain, Ireland, the Netherlands, Norway, Sweden, Latvia, and Iceland) as well as Congo and Bulgaria suspended use of the vaccine.

The World Health Organization and the European Medicines Agency (EMA), however, continued to recommend the vaccine's use. In a statement issued today, the EMA emphasized that "the vaccine is not associated with an increase in the overall risk of blood clots (thromboembolic events) in those who receive it." A blogger for Science Translational Medicine remarked that

AstraZeneca has said that they’re aware of 15 events of deep vein thrombosis and 22 events of pulmonary embolisms, but that’s in 17 million people who have had at least one shot—and they say that is indeed “much lower than would be expected to occur naturally in a general population of this size”. It also appears to be similar to what’s been seen with the other coronavirus vaccines, which rather than meaning “they’re all bad” looks like they’re all showing the same baseline signal of such events across a broad population, without adding to it.

Is a lower rate of clots for vaccinated individuals real? An explanation (other than statistical error in comparing low probability events) comes from the chair of the EMA safety committee. Dr. Sabine Straus told the Wall Street Journal that "since blood clots are associated with Covid-19, by inoculating people against the disease, the vaccine 'likely reduces the risk of thrombotic incidents overall.'”

In an interview for Stat, University of Pennsylvania biostatistician Susan Ellenberg explained that

Vaccines protect against one thing: the infection or the infection plus disease. They don’t protect you against everything else that might possibly happen to you. There’s no reason to think that somehow there’s a magical period of time like, you know, four days or a week or two weeks after you get vaccinated, when none of those other horrible things are going to happen to you.

In response to the EMA's reassurances, France, Italy, Spain and Portugal reinstated vaccinations. Officials in other countries frame their actions to halt vaccinations as "precautionary". 2/ But the vaccine itself is a precaution, prompting University of Cambridge statistician David Spiegelhalter to counter that “the cautionary approach would be to carry on vaccinating. Casting doubt—lasting doubt—on the safety of the vaccines is not a precautionary position.”

The Next Day

Researchers studying the vaccine recipients diagnosed with cerebral venous sinus thrombosis (CVST) are convinced that it results from an autoimmune response. "Pål André Holme, a professor of hematology ... who headed an investigation into the Norwegian cases, said his team had identified an antibody created by the vaccine that was triggering the adverse reaction." His conclusion that "nothing but the vaccine can explain why these individuals had this immune response" is shared by German researchers independently looking at "13 cases of CVST detected among around 1.6 million people who received the AstraZeneca vaccine." British regulators continued to characterize the allegedly causal link to the vaccine as unproven. So did the EMA, which noted that "a causal link ... is possible and deserves further analysis.”

Two Weeks Later

From Gretchen Vogel & Kai Kupferschmidt, Side Effect Worry Grows for AstraZeneca Vaccine, Science, 372:14-15, Apr. 2, 2021, https://science.sciencemag.org/content/372/6537/14, DOI: 10.1126/science.372.6537.14:

... This week, Canada and Germany joined Iceland, Sweden, Finland, and France in recommending against the vaccine's use in younger people, who seem to be at higher risk for the clotting problem and are less likely to develop severe COVID-19. .... The highly unusual combination of symptoms—widespread blood clots and a low platelet count, sometimes associated with bleeding—has so far been reported from at least seven countries. ... Estimates of the incidence range from one in 25,000 people given the AstraZeneca vaccine in Norway to at least one in 87,000 in Germany. .... The United Kingdom remains a puzzle. Despite administering more than 11 million AstraZeneca doses, it has so far reported only a handful of suspicious clotting cases. But the U.K. did not limit the vaccine to younger groups, so the average age of recipients there may be older. ...
Researchers in Germany have proposed that some component of the vaccine triggers a rare immune reaction like one occasionally seen with the blood thinner heparin, in which antibodies trigger platelets to form dangerous clots throughout the body. This week the team posted case descriptions of what they call vaccine-induced prothrombotic immune thrombocytopenia (VIPIT) on the preprint server Research Square. ...
Even if VIPIT isn't the whole story, multiple other researchers told Science they are now convinced the vaccine somehow causes the rare set of symptoms. If true, that could be a serious blow to a vaccine that is central to the World Health Organization's push to immunize the world. ...
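
The counts and estimates quoted in this post are stated on different scales (events per number of recipients, "one in N" people). As a rough aid to comparison, the following sketch converts them to a common rate per 100,000 recipients. The helper name `per_100k` is hypothetical, and the figures describe different outcomes (all deep vein thromboses and pulmonary embolisms versus the much rarer CVST and clot-plus-low-platelet cases), so they are not directly comparable as risks.

```python
def per_100k(cases: float, population: float) -> float:
    """Convert a count of cases in a population to a rate per 100,000."""
    return cases / population * 100_000


# Figures as quoted in the post; labels are descriptive only.
figures = {
    "AstraZeneca count, DVT + PE (15 + 22 in 17 million)": per_100k(15 + 22, 17_000_000),
    "German CVST cases (13 in about 1.6 million)": per_100k(13, 1_600_000),
    "Norway estimate (1 in 25,000)": per_100k(1, 25_000),
    # The source says "at least" one in 87,000, so treat this as a lower bound.
    "Germany estimate (1 in 87,000)": per_100k(1, 87_000),
}

for label, rate in figures.items():
    print(f"{label}: {rate:.2f} per 100,000")
```

Putting the numbers side by side only shows orders of magnitude. Whether any of them exceeds the background rate for the same outcome in a comparable population is a separate question that these figures alone cannot answer.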

NOTES

  1. LYMErix also was a recombinant vaccine.
  2. Germany's healthcare ministry argued that for Germany, at least, the expected number of cases of cerebral vein thrombosis was only 1.4, making the 7 cases that occurred valid grounds to pause mass vaccinations. However, if there were many subcategories of clotting-related events considered and if this is the only one to display an apparently elevated level, the pattern would not be surprising. Moreover, the EMA looked at adverse events in more than a single country when making comparisons.
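
The reasoning in note 2 can be sketched numerically. The code below assumes, purely for illustration, that the number of cases follows a Poisson distribution with the ministry's expected value of 1.4 and that the subcategories of clotting events behave independently. Neither assumption comes from the ministry or the EMA, and the function names and the numbers of categories tried are hypothetical.

```python
import math


def poisson_tail(observed: int, expected: float) -> float:
    """P(X >= observed) when X ~ Poisson(expected)."""
    cdf_below = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(observed)
    )
    return 1.0 - cdf_below


def chance_any_category(p_single: float, n_categories: int) -> float:
    """Chance that at least one of n independent categories shows such an excess."""
    return 1.0 - (1.0 - p_single) ** n_categories


# Chance of 7 or more cases in one pre-specified category when 1.4 are expected.
p_single = poisson_tail(observed=7, expected=1.4)
print(f"single category: {p_single:.5f}")

# Hypothetical numbers of subcategories that might have been examined.
for n in (1, 10, 50):
    print(f"{n:>3} categories: {chance_any_category(p_single, n):.4f}")
```

The chance of seeing at least one apparent excess grows with the number of categories (and countries) examined, which is the point of the caveat in the note. How much it grows depends on how many comparisons were actually made, a number the sources do not report.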

REFERENCES

  • European Medicines Agency, COVID-19 Vaccine AstraZeneca: Benefits Still Outweigh the Risks Despite Possible Link to Rare Blood Clots with Low Blood Platelets, Mar. 18, 2021, https://www.ema.europa.eu/en/news/covid-19-vaccine-astrazeneca-benefits-still-outweigh-risks-despite-possible-link-rare-blood-clots
  • Matthew Herper, The Curious Case of AstraZeneca’s Covid-19 Vaccine, Stat, Mar. 15, 2021, https://www.statnews.com/2021/03/15/the-curious-case-of-astrazenecas-covid-19-vaccine/
  • Ivan F. N. Hung & Gregory A. Poland, Single-dose Oxford–AstraZeneca COVID-19 Vaccine Followed by a 12-week Booster, Lancet, Mar. 6, 2021, 397(10277): P854-855, https://doi.org/10.1016/S0140-6736(21)00528-6
  • Derek Lowe, What is Going on with the AstraZeneca/Oxford Vaccine?, Science Translational Medicine, Mar. 16, 2021, https://blogs.sciencemag.org/pipeline/archives/2021/03/16/what-is-going-on-with-the-astrazeneca-oxford-vaccin
  • Smriti Mallapaty & Ewen Callaway, What Scientists Do and Don’t Know about the Oxford–AstraZeneca COVID Vaccine, Nature News Explainer, Mar. 24, 2021, https://www.nature.com/articles/d41586-021-00785-7
  • Daniel Michaels, AstraZeneca’s Covid-19 Vaccine Cleared by EU After Blood-Clot Concerns, Wall St. J., Mar. 18, 2021, https://www.wsj.com/articles/astrazenecas-covid-19-vaccine-is-cleared-by-europe-after-blood-clot-concerns-11616083845?mod=hp_lead_pos1
  • L. E. Nigrovic & K. M. Thompson, The Lyme Vaccine: A Cautionary Tale, Epidemiology & Infection, 135(1):1–8 (2007), doi: 10.1017/S0950268806007096, PMCID: PMC2870557
  • Bojan Pancevski, Scientists Say They Found Cause of Rare Blood Clotting Linked to AstraZeneca Vaccine, Wall St. J., Mar. 19, 2021 4:02 pm ET, https://www.wsj.com/articles/scientists-say-they-found-cause-of-blood-clotting-linked-to-astrazeneca-vaccine-11616169108

Saturday, March 6, 2021

"The Judgment of an Experienced Examiner"

The quotation from a textbook on forensic science in the left-hand panel invites the question of what the author was thinking. That an examiner's judgment is more important in comparisons of "class" features than "individual" ones? That the features that are present are less important than the examiner's judgment of them? Neither interpretation makes much sense. The examiner's judgment does not dictate anything about the features. It is the other way around. In applying a valid method, a proficient examiner generally will make correct judgments of significance as dictated by the features that are present.

As with most class evidence, the significance of a fiber comparison is dictated by the circumstances of the case, by the location, number, and nature of the fibers examined, and, most important, by the judgment of an experienced examiner.
Richard Saferstein, Criminalistics: An Introduction to Forensic Science 272 (12th ed. 2018, Pearson Education Inc.) (emphasis added)
Over the years, scientific and legal scholars have called for the implementation of algorithms (e.g., statistical methods) in forensic science to provide an empirical foundation to experts’ subjective conclusions. ... Reactions have ranged from passive skepticism to outright opposition, often in favor of traditional experience and expertise as a sufficient basis for conclusions. In this paper, we explore why practitioners are generally in opposition to algorithmic interventions and how their concerns might be overcome. We accomplish this by considering issues concerning human-algorithm interactions in both real world domains and laboratory studies as well as issues concerning the litigation of algorithms in the American legal system. [W]e propose a strategy for approaching the implementation of algorithms ... .
Henry Swofford & Christophe Champod, Implementation of Algorithms in Pattern & Impression Evidence: A Responsible and Practical Roadmap, 3 Forensic Sci. Int'l: Synergy 100142 (2021) (abstract)

Another example of sloppy phrasing about expert judgment is the boilerplate disclaimers or admonitions in ASTM standards for forensic-science methods. For example, the 2019 Standard Guide for Forensic Analysis of Fibers by Infrared Spectroscopy (E2224−19) insists (in italics no less) that

This standard cannot replace knowledge, skills, or abilities acquired through education, training, and experience and is to be used in conjunction with professional judgment by individuals with such discipline-specific knowledge, skills, and abilities.

Does the first independent clause mean that fiber analysts are free to depart from the standard on the basis of their general "knowledge, skills, or abilities acquired through education, training, and experience"? Ever since Congress funded the Organization of Scientific Area Committees for Forensic Science (OSAC) to write new and better standards, lawyers in the organization have objected to the ASTM wording (without doubting that expert methods should be applied by responsible experts).

Now rumor has it that ASTM will be changing its stock sentence to the less ambiguous observation that

This standard is intended for use by competent forensic science practitioners with the requisite formal education, discipline-specific training (see Practice E2917), and demonstrated proficiency to perform forensic casework.

That's innocuous. Indeed, you might think it goes without saying.