Saturday, December 25, 2021

The FBI's Misinformation Campaign on Firearms-toolmark Testimony

On Tuesday (21 December 2021), the Texas Forensic Science Commission issued a Statement Regarding 'Alternate Firearms Opinion Terminology'. It is a forceful correction to misinformation from the FBI Laboratory's Assistant General Counsel, Jim Agar II. \1/ The email that attracted the Commission's critical attention tells forensic analysts what they are supposed to say in opposition to motions to limit their testimony about firearms-toolmark comparisons. As previous postings show, there has been no shortage of defense motions seeking to forbid eliciting opinions that ammunition components associated with a crime came from a particular gun.

The FBI advice to firearms examiners is entitled "Dealing with Alternate Firearms Opinion Terminology" (hereinafter Dealing). It begins by dismissing the best efforts of federal and state judges to respond to weaknesses in traditional "This is the gun!" testimony as "wholesale attempts to rewrite the firearm expert's testimony by a layman with no experience in forensic science." \2/ The fact that eminent scientists and respected jurists have questioned source-attribution testimony in general and in this field in particular does not seem to matter. According to Dealing, the limitations are "not supported by either science or the law." Despite the government's annoyance with lay judges' rulings, however, courts have a duty to review the scientific and scholarly literature to decide whether strong claims of source attributions are sufficiently warranted. \3/

Dealing continues, more reasonably, with the strategic recommendation that "firearms examiners and prosecutors should address the terminology issue head-on during their direct examination at the admissibility hearing. Preempt this issue early. Don't wait for the judge or the defense counsel to bring it up." But the tactics for bringing it up are over the top. Dealing imagines the following colloquy:

Prosecutor: Can you testify truthfully that your opinion is that the cartridge cases and/or bullets in this case
   • "Could or may have been fired by this gun?"
   • "Are consistent with having been fired by this gun?"
   • "Are more likely than not having been fired by this gun?"
   • "Cannot be excluded as having been fired by this gun?"
Examiner: No, I cannot testify truthfully to any of those statements or just the class characteristics alone.
Prosecutor: Why not?
Examiner: For three reasons: First, there are no empirical studies or science to backup any of those statements or terminology. Second, those statements are not endorsed nor approved by my laboratory, any nationally recognized forensic science organization, law enforcement, or the Department of Justice. Third, those statements are false as they do not reflect my true opinion of identification. Such statements would mislead the jury about my opinion in this case. It would also constitute a substantive and material change to my opinion from one of Identification to Inconclusive. This would constitute perjury on my part for I would not be telling the jury the whole truth.

The "three reasons" border on the absurd (if they do not cross the border). First, the empirical studies that prosecutors cite to support the ability of firearms experts to match ammunition components to specific guns also support the bulleted statements. This is because the alternatives are lesser included statements, so to speak. If a categorical source attribution is correct, then a weaker included statement such as "cannot be excluded" also is true. If "empirical studies or science" do not adequately support these weaker statements, then, a fortiori, they do not support the much stronger claims that Dealing advocates.

Second, that law enforcement organizations and crime laboratories do not approve of the policy of replacing traditional "This is the gun!" testimony with a less telling alternative proves nothing about whether the bulleted statements are true or false. It merely means that a laboratory is unwilling to change its standard operating procedure and that "law enforcement" opposes losing the opinions that prosecutors love their experts to provide. No self-respecting expert can say that the desire of "law enforcement" and crime laboratories for the strongest possible testimony makes less compelling testimony "untruthful."

Finally, that any lawyer -- let alone one representing the FBI -- would ask a forensic examiner to tell a judge that it would be perjurious to testify in the bulleted ways is shocking. A federal perjury prosecution would be laughed out of court. Under federal law, statements that are known to be incomplete, or, worse, fully intended to distract or mislead, do not constitute perjury if they are literally true. The leading case is Bronston v. United States. \4/ There, the defendant testified as follows:

Q. Do you have any bank accounts in Swiss banks, Mr. Bronston?
A. No, sir.
Q. Have you ever?
A. The company had an account there for about six months, in Zurich.
Q. Have you any nominees who have bank accounts in Swiss banks?
A. No, sir.
Q. Have you ever?
A. No, sir.

In reality, the witness had previously maintained and had made deposits to and withdrawals from a personal bank account in Geneva, Switzerland. Clearly, his answers were calculated to avoid revealing this fact. However, the Supreme Court unanimously reversed a conviction for perjury, concluding that the federal statute did not criminalize lying by omission and misdirection.

To be sure, some state statutes define the crime to encompass wilful omissions, but the core idea remains that perjury occurs when the witness intends to give the questioner false information or a false impression so as to obstruct the ascertainment of the truth. \5/ An expert witness who testifies sincerely to true statements such as "the defendant's gun cannot be excluded as the one that fired the recovered bullet" or "measurements of the bullet and the pistol showed them both to be 9 mm, so the bullet could have been fired from the gun," is not intending to lead anyone to a false conclusion. That the FBI would like firearms examiners to give more incriminating opinions does not make the lesser included testimony false or misleading. A prosecutor who truly is worried that "[t]estimony about class characteristics alone may falsely imply an examiner was unable to reach a conclusion of identification" can ask the court to instruct the jurors that the rules of evidence no longer allow an expert witness to testify that a bullet came from a particular gun and that they may not draw any inference from the absence of such inadmissible testimony. Instead, they are to use only the testimony that the expert gave in coming to a conclusion about which gun fired the recovered bullet.

After maintaining that "laymen" (courts) are asking toolmark examiners to commit perjury, Dealing gives another specious argument to persuade toolmark experts to stick to their guns (sorry about that) and refuse to "agree to testify to the terms of 'Could or may have fired,' or 'Consistent with,' 'More likely than not,' or 'Cannot be excluded.'" FBI counsel believes that examiners who testify this way when they feel that a traditional source attribution is justified "are ratifying these bogus statements and adopting this as their testimony, giving the judge a pass on the difficult decision to admit or exclude their testimony. They are also acquiescing to the judge's faulty terminology."

This is nonsense. The law has a spectrum of options ranging from excluding every bit of information a firearms expert might provide (which is unjustified given what is known about the performance of these experts) to unfettered admission of "This is the gun!" testimony (which is traditional). The only "fault" in the intermediate testimony is that it is not as strong as a prosecutor might want it to be. It is conservative in the sense of understating probative value (as FBI counsel understands the science), but testifying conservatively at trial when that is what a court requires does not "ratify" anything about the court's ruling. It simply presents a permissible opinion. DNA experts who testified to "ceiling" probabilities of random matches because that was the best the prosecution could get some courts to accept circa 1995 were not perceived as "ratifying these bogus statements." \6/

Dealing disagrees. FBI counsel insists that "acquiescing" in court rulings is "fatal" to an examiner's career as a witness:

This is fatal. Why? Once you testify to these bogus terms, you are wedded to them for life. At subsequent trials, defense counsel will pull out the verbatim transcript of the examiner's previous testimony where they used these court-induced terms. On cross examination, they will confront the examiner with their previous testimony and contrast their opinion of "Identification" with those in previous cases, then claim the expert is merely making this stuff up. The examiner no longer has any credibility in the jury's eyes.

This fear of cross-examination is fanciful. If the expert testifies at the admissibility stage (as Dealing contemplates) that "This is the gun!" testimony is scientifically justified, then that is what the expert is on record as stating. Later, more circumscribed testimony pursuant to court order is not an inconsistent statement useful for impeachment. Any competent expert witness will have no trouble explaining that "in the earlier case, I reached the conclusion of 'identification' (just read my case notes), and I used other terminology only because the prosecutor asking the question (or the judge) said I had to use the lesser included language because of a legal rule rather than a scientific principle."

In contrast, the witness who follows FBI counsel's advice will lose all credibility. The truth is that the lesser included testimony, while less powerful, is no less truthful than "This is the gun!" testimony. It is somewhat like choosing a wider confidence interval to increase the coverage probability; the statement becomes less precise, but it is more likely to be true. Talk of perjury and being asked to lie suggests either that (1) the witness does not understand a statement such as "the recovered bullet could have come from/is consistent with coming from/is not excluded as coming from/is more likely to have come from the firearm in question," or that (2) the witness has chosen to lobby for the prosecution rather than to educate the judge impartially.
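The confidence-interval analogy can be made concrete. The following minimal Python sketch (the true mean, spread, sample size, and number of trials are all hypothetical choices of mine, not anything in the record) shows that the wider 99% interval is less precise than a 90% interval computed from the same data but covers the true value more often:

```python
# Minimal sketch with hypothetical numbers: the wider (99%) interval is less
# precise than the 90% interval but is "true" (covers the real mean) more often.
import numpy as np

rng = np.random.default_rng(42)
true_mu, sigma, n, trials = 10.0, 2.0, 25, 10_000
z90, z99 = 1.645, 2.576                      # normal critical values

covered_90 = covered_99 = 0
se = sigma / np.sqrt(n)
for _ in range(trials):
    mean = rng.normal(true_mu, sigma, n).mean()
    covered_90 += (mean - z90 * se <= true_mu <= mean + z90 * se)
    covered_99 += (mean - z99 * se <= true_mu <= mean + z99 * se)

print(f"90% interval: half-width {z90 * se:.2f}, coverage {covered_90 / trials:.3f}")
print(f"99% interval: half-width {z99 * se:.2f}, coverage {covered_99 / trials:.3f}")
```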

NOTES

  1. Mr. Agar is a decorated, retired Colonel with "31 years of successful experience leading complex legal organizations as a general counsel, attorney, leader, mentor and trainer of FBI legal offices and senior-level Army staffs" and "hands-on experience in advising senior FBI and Army leaders in all legal matters." His work as Assistant General Counsel for the "FBI Forensic Laboratory" began in October 2016. On LinkedIn, from which these quotations are taken, he summarizes his current position as
    Legal advisor to the largest and best forensic laboratory in the world with a staff of over 700 scientists and a budget of $110 million. Responsible for training and qualifying the FBI’s forensic examiners to testify in any and all courts nationwide and internationally, consisting of over 120 examiners in 37 different disciplines. Coordinate all discovery for the Laboratory. Provide ethics advice to Laboratory personnel.
  2. Discussion of this line of cases can be found in David H. Kaye et al., Wigmore on Evidence: Expert Evidence (3d ed. 2021).
  3. The track record of the courts in translating this literature and the growing research on firearms-toolmark comparisons into appropriate constraints on proposed expert testimony is not perfect. Indeed, most of the judicial palliatives for perceived expert overclaiming (such as the supposed limitation of "a reasonable degree of ballistic certainty" and the alternatives listed in Dealing) are far from optimal. Id. (and other postings in this blog). But these failures hardly mean that, as "laymen," judges are disqualified from trying to improve the presentation of expert knowledge by excluding certain forms of testimony.
  4. Bronston v. United States, 409 U.S. 352 (1973).
  5. See Ira P. Robbins, Perjury by Omission, 97 Wash. U. L. Rev. 265 (2019).
  6. See, e.g., David H. Kaye, The Double Helix and the Law of Evidence (2010).

Friday, September 3, 2021

Does Qualitative Measurement Uncertainty Exist?

I have heard it said that forensic-science standards for interpreting the results of chemical or other tests need not discuss uncertainty in measurements of qualitative properties. For instance, ASTM International appropriately requires standards for test methods to include a section reporting on precision and bias as manifested in interlaboratory tests. Yet, it applies this requirement exclusively to quantitative measurements. Its 2021 style manual is unequivocal:

When a test method specifies that a test result is a nonnumerical report of success or failure or other categorization or classification based on criteria specified in the procedure, use a statement on precision and bias such as the following: “Precision and Bias—No information is presented about either the precision or bias of Test Method X0000 for measuring (insert here the name of the property) since the test result is nonquantitative" (ASTM 2020, § A21.5.4, pp. A3-A14).

Qualitative measurements are observation-statements such as the ink is blue, the friction ridge skin pattern includes loops, the bloodstain displays a cessation pattern, the blood group is type A, the glass fragments fit together perfectly, or the material contains cocaine. Likewise, the statements could be comparative: the recording of an unknown bell ringing sounds like it has a higher pitch than the ringing of a known bell; the hairs are microscopically indistinguishable; or the striations on the recovered bullet and the test bullet line up when viewed in the comparison microscope.

“Precision” is defined as “the closeness of agreement between test results obtained under prescribed conditions” (ibid. § A21.2.1, at A12). “A statement on precision allows potential users of the test method to assess in general terms its usefulness in proposed applications” and is mandatory (ibid. § A21.2, at A12). So how can it be that statements of precision and bias are not allowed for qualitative as opposed to quantitative findings? In both situations, the system that generates the findings could be noisy or skewed in its outcomes.

The only answer I have heard is that measurements cannot be qualitative because the word "measurement" is reserved for determining the magnitude of quantities such as length or mass. The values of these quantitative variables are basically isomorphic to the nonnegative real numbers. Counts, such as the number of alpha particles emitted in a given interval of time by radium atoms, also qualify as measurements because there is a quantitative, additive structure to them. The values of the variable are basically isomorphic to the natural numbers. Properties that only have names are described by nominal variables. Although numbers can be assigned (1 for a match and 0 for a nonmatch, for example), these numbers are no more a measurement than a social security number is. In short, the argument is that because "measurements" do not include qualitative judgments, classifications, decisions, identifications, or whatever one might call them, no statement of measurement uncertainty or error is possible, let alone required.

This argument is incredibly weak. To begin with, the definition of "measurement" is highly contested. As one guide from NIST explains, a "much wider" conception of measurement than the one "contemplated in the current version of the International vocabulary of metrology (VIM)" has been developed in the metrology literature, and the measurand "may be ... qualitative (for example, the provenance of a glass fragment determined in a forensic investigation" (Possolo 2015). Broader conceptions of measurement have been the subject of many decades of writing in psychology and psychometrics (see, e.g., Humphry 2017; Michell 1990). Philosophers have been struggling to describe the scope and meaning of "measurement" at least since Aristotle (see, e.g., Tal 2015).

Second, even if one agrees with the definition in one NIST publication that “[m]easurement is [confined to] an experimental process that produces a value that can reasonably be attributed to a quantitative property of a phenomenon, body, or substance” (NIST 2019), some qualitative observations fit this definition. The color of a strip of litmus paper, for instance, can be understood as a value “that can reasonably be attributed to a quantitative property.” It is simply a crude measurement of pH.

Finally, the argument that there can be no measurement error for qualitative properties because those properties are not really “measured” is a semantic ploy that misses the point. The observations or estimates of nonquantitative properties as well as the individual measurements of quantitative properties are all subject to possible random and systematic error, and statements expressing the range of probable error for all measurements, observations, estimates, and classifications are essential. The need for these statements cannot be avoided for qualitative properties or judgments by the fiat of the VIM or some other dictionary. Even if “measurement” must be read in one particular, narrow, technical sense, “evaluation uncertainty” or “examination uncertainty” still must be reckoned with (Mari et al. 2020).

In sum, there is no excuse for ASTM and other organizations promulgating standards for forensic-science test methods to exempt any reported findings from required statements of uncertainty. Many statistics can be used to indicate how reliable (repeatable and reproducible) and valid (accurate) the test results may be (ibid.; Ellison & Gregory 1998; Pendrill & Petersson 2016). The qualitative-quantitative distinction affects the choice of the statistical method or expression but not the need to have one.
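As a hedged illustration of what such a statement of uncertainty might look like for a binary (match/no-match) examination, the Python sketch below (using scipy for the beta quantiles) reports error-rate estimates with exact (Clopper-Pearson) binomial intervals. The counts are invented for illustration, and this interval is only one of the many statistical expressions the cited references discuss:

```python
# Minimal sketch with hypothetical counts from an imagined interlaboratory study:
# expressing the uncertainty of a qualitative (match / no-match) test as
# estimated error rates with exact (Clopper-Pearson) binomial intervals.
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact binomial confidence interval for a proportion k/n."""
    alpha = 1 - conf
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# hypothetical results: 7 false positives in 900 known-nonmatching pairs,
# 12 false negatives in 600 known-matching pairs
for label, k, n in [("false-positive rate", 7, 900),
                    ("false-negative rate", 12, 600)]:
    lo, hi = clopper_pearson(k, n)
    print(f"{label}: {k}/{n} = {k/n:.3f}  (95% CI {lo:.3f} to {hi:.3f})")
```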

REFERENCES

  • ASTM Int’l, Form and Style for ASTM Standards (2020), https://www.astm.org/FormStyle_for_ASTM_STDS.html.
  • Stephen L. R. Ellison & Soumi Gregory, Perspective: Quantifying Uncertainty in Qualitative Analysis, Analyst 123, 1155-1161 (1998), https://doi.org/10.1039/A707970B
  • Stephen M. Humphry, Psychological Measurement: Theory, Paradoxes, and Prototypes, 27(3) Theory & Psychology 407–418 (2017)
  • L. Mari, C. Narduzzi, S. Trapmann, Foundations of Uncertainty in Evaluation of Nominal properties, 152 Measurement 107397 (2020), DOI:10.1016/j.measurement.2019.107397
  • Joel Michell, An Introduction to the Logic of Psychological Measurement (1990)
  • NIST, Statistical Engineering Division, Measurement Uncertainty, updated Nov. 15, 2019, https://www.nist.gov/itl/sed/topic-areas/measurement-uncertainty
  • Leslie Pendrill & Niclas Petersson, Metrology of human-based and other qualitative measurements, 27(9) Measurement Sci. Technol. 094003 (2016)
  • A. Possolo, Simple Guide for Evaluating and Expressing the Uncertainty of NIST Measurement Results (NIST Technical Note 1900), 2015, doi: 10.6028/NIST.TN.1900
  • Eran Tal, Measurement in Science, in Stanford Encyclopedia of Philosophy (Edward N. Zalta ed. 2015), https://plato.stanford.edu/archives/fall2017/entries/measurement-science/

APPENDIX: ADDITIONAL PUBLICATIONS ON "QUALITATIVE MEASUREMENT"

  1. Mary J. Allen & Wendy M. Yen, Introduction to Measurement Theory 2 (1979) ("In measurement, numbers are assigned systematically and can be of various forms. For example, labeling people with red hair "1" and people with brown hair "2" is a measurement. Since numbers are assigned to individuals in a systematic way and differences between scores represent differences in the property being measured (hair color).")
  2. Peter-Th. Wilrich, The determination of precision of qualitative measurement methods by interlaboratory experiments, Accreditation and quality assurance, 15: 439-444 (2010)
  3. Boris L. Milman, Identification of chemical compounds, Trends in Analytical Chemistry, 24:6, 2005 ("identification itself is considered as measurement on a qualitative scale")
  4. NIST Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice Through a Systems Approach, Gaithersburg: National Institute of Standards and Technology, David H. Kaye ed., 2012 (defining "measurement" broadly, to encompass categorical variables, including the examiner's judgment about the source of a print).
  5. Lim, Yong Kwan, Kweon, Oh Joo, Lee, Mi-Kyung and Kim, Hye Ryoun. Assessing the measurement uncertainty of qualitative analysis in the clinical laboratory. Journal of Laboratory Medicine, vol. 44, no. 1, 2020, pp. 3-10. https://doi.org/10.1515/labmed-2019-0155 ("Measurement uncertainty is a parameter that is associated with the dispersion of measurements. Assessment of the measurement uncertainty is recommended in qualitative analyses in clinical laboratories; however, the measurement uncertainty of qualitative tests has been neglected despite the introduction of many adequate methods.")
  6. Donald Richards, Simultaneous Quantitative and Qualitative Measurements in Drug-Metabolism Investigations, Pharmaceutical Technology 2013
  7. Kadri Orro, Olga Smirnova, Jelena Arshavskaja, Kristiina Salk, Anne Meikas, Susan Pihelgas, Reet Rumvolt, Külli Kingo, Aram Kazarjan, Toomas Neuman & Pieter Spee, Development of TAP, a non-invasive test for qualitative and quantitative measurements of biomarkers from the skin surface, Biomarker Research 2: 20 (2014)
  8. J M Conly & K Stein, Quantitative and qualitative measurements of K vitamins in human intestinal contents, Am J Gastroenterol. 1992 Mar;87(3):311-316
  9. Wenjia Meng, Qian Zheng, Gang Pan, Qualitative Measurements of Policy Discrepancy for Return-Based Deep Q-Network, IEEE Transactions on Neural Networks and Learning Systems 2020
  10. Rudolf M. Verdaasdonk, Jovanie Razafindrakoto, Philip Green, Real time large scale air flow imaging for qualitative measurements in view of infection control in the OR (Conference Presentation) Proceedings Volume 10870, Design and Quality for Biomedical Technologies XII; 1087002 (2019) https://doi.org/10.1117/12.2511185
  11. Rashis, Bernard, Witte, William G. & Hopko, Russell N., Qualitative Measurements of the Effective Heats of Ablation of Several Materials in Supersonic Air Jets at Stagnation Temperatures Up to 11,000 Degrees F, National Advisory Committee for Aeronautics, July 7, 1958
  12. Lawrence F Cunningham and Clifford E Young, Quantitative and Qualitative Approaches, Journal of Public Transportation 1(4) (1997) ("The study also contrasts the results of quantitative and qualitative measurements and methodologies for assessing transportation service quality")
  13. P Sinha, Workshop on Biologically Motivated Computer Vision, 2002 - Springer ("Our emphasis on the use of qualitative measurements renders the representations stable in the presence of sensor noise and significant changes in object appearance. We develop our ideas in the context of the task of face-detection under varying illumination")
  14. D Michalski, S Liebig, E Thomae & A Hinz, Pain in Patients with Multiple Sclerosis: a Complex Assessment Including Quantitative and Qualitative Measurements, 40 J. Pain 219–225 (2011), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3160835/
  15. Cécilia Merlen, Marie Verriele, Sabine Crunaire, Vincent Ricard, Pascal Kaluzny, Nadine Locoge, Quantitative or Only Qualitative Measurements of Sulfur Compounds in Ambient Air at Ppb Level? Uncertainties Assessment for Active Sampling with Tenax TA®, 132 Microchemical J. 143-153 (2017)
  16. Tomomichi Suzuki, Jun Ichi Takeshita, Mayu Ogawa, Xiao-Nan Lu, Yoshikazu Ojima, Analysis of Measurement Precision Experiment with Categorical Variables, 13th International Workshop on Intelligent Statistical Quality Control 2019, Hong Kong ("Evaluating performance of a measurement method is essential in metrology. Concepts of repeatability and reproducibility are introduced in ISO5725-1 (1994) including how to run and analyse experiments (usually collaborative studies) to obtain these precision measures. ISO5725-2 (1994) describe precision evaluation in quantitative measurements but not in qualitative measurements. Some methods have been proposed for qualitative measurements cases such as Wilrich (2010), de Mast & van Wieringen (2010), Bashkansky, Gadrich & Kuselman (2012). Item response theory (Muraki, 1992) is another methodology that can be used to analyse qualitative data.").

Monday, June 14, 2021

Tibbs, Shipp, and Harris on "Meaningful" Peer Review of Studies on Firearms-toolmark Matching

The Supreme Court's celebrated (but ambiguous) opinion in Daubert v. Merrell Dow Pharmaceuticals, \1/ was a direct response to a seemingly simple rule--results that are not published in the peer-reviewed scientific literature are inadmissible to prove that a scientific theory or method is generally accepted in the scientific community. The Court unanimously rejected this strict rule--and more broadly, the very requirement of general acceptance--in favor of a multifaceted examination guided by four or five criteria that have come to be known as "the Daubert factors."

But "peer review and publication" lives on--not as a formal requirement, but as one of these factors. Thus, courts routinely ask whether the peer-reviewed scientific literature supports the reasoning or data that an expert is prepared to present at trial. All too often, however, the examination of the literature is cursory or superficial. The temptation, especially for overburdened judges not skilled in sorting through biomedical and other journals, is to check that there are articles on point, and if  the theory has been discussed (critically or otherwise) in the literature, to write that the "peer review and publication" factor supports admission of the testimony.

One area in which this dynamic is apparent is traditional testimony of firearms examiners matching marks from guns to bullets or shell casings. \2/ Defendants have strenuously objected that the traditional association of particular guns with ammunition components is an inscrutable judgment call that does not pass muster under Daubert. Perhaps the most meticulous analysis of this issue comes from an unpublished opinion of Judge Todd Edelman in United States v. Tibbs. \3/ Judge Edelman's discussion of peer review and publication is unusually thorough and may have been penned as an antidote to the strategy in which the government gives the court a laundry list of articles that have discussed the procedure and the court checks off the "peer review and publication" box.

Being an opinion for a trial court (the District of Columbia Superior Court), Tibbs is not binding precedent for that court or any other, but it has not gone unnoticed. Two federal district courts recently reached opposite conclusions about Judge Edelman's analysis of one large segment of the literature cited in support of admitting match determinations--namely, the extensive research reported in the AFTE Journal. ("AFTE" stands for the Association of Firearm and Tool Mark Examiners. The organization was formed in 1969 in "recognition of the need for the interchange of information, methods, development of standards, and the furtherance of research, [by] a group of skilled and ethical firearm and/or toolmark examiners" who "stand prepared to give voice to this otherwise mute evidence." \4/)

Tibbs' Analysis of the AFTE Journal

Because of the AFTE Journal's orientation and editorial process, Tibbs did not give "the sheer number of studies conducted and published" there much weight. \5/ Judge Edelman made essentially four points about the journal:

  • Contrary to the testimony of the government’s experts, post-publication comments or later articles are not normally considered to be “peer review”;
  • The AFTE pre-publication peer review process is “open,” meaning that “both the author and reviewer know the other's identity and may contact each other during the review process”;
  • The reviewers who form the editorial board are all “members of AFTE” who may well “be trained and experienced in the field of firearms and toolmark examination, but do not necessarily have any ... training in research design and methodology” and who “have a vested, career-based interest in publishing studies that validate their own field and methodologies”; and
  • “AFTE does not make this publication generally available to the public or to ... reviewers and commentators outside of the organization's membership [and] unlike other scientific journals, the AFTE Journal ... cannot even be obtained in university libraries.” \6/

The court contrasted these aspects of the journal’s peer review to a "double-blind" process and observed that the AFTE “open” process was “highly unusual for the publication of empirical scientific research.” \7/ The full opinion, which develops these ideas more completely, can be found online.

Shipp

Senior Judge Nicholas Garaufis of the Eastern District of New York was impressed with "this thorough opinion." \8/ His opinion in United States v. Shipp referred to the "several pages analyzing the AFTE Journal's peer review process [that] highlight[] several reasons for assigning less weight to articles published in the AFTE Journal than in other publications" and added that

The court shares these concerns about the AFTE Journal's peer review process. In particular, the court is concerned that the reviewers, who are all members of the AFTE, have a vested, career-based interest in publishing studies that validate their own field and methodologies. Also concerning is the possibility that the reviewers may be trained and experienced in the field of firearms and toolmark identification, but [may] not necessarily have any specialized or even relevant training in research design and methodology. \9/

Harris

In contrast, Judge Rudolph Contreras of the U.S. District Court for the District of Columbia, writing in United States v. Harris, \10/ had nothing complimentary to say about Tibbs. This court defended the AFTE Journal research articles said to demonstrate the validity of firearms-toolmark identification with two rejoinders to Tibbs. First, Judge Contreras maintained that “there is far from consensus in the scientific community that double-blind peer review is the only meaningful kind of peer review.” \11/ This is true enough, but the issue raised by the criticism of “open” review is not whether double-blind review is better than single-blind review (in which the author does not know the identity of the referees) or some other system. It is whether “open” review conducted exclusively by AFTE members is the kind of peer review envisioned as a strong indicator of scientific soundness in Daubert. The factors enumerated in Tibbs make that a serious question.

Second, Judge Contreras observed that the Journal of Forensic Sciences, which uses double-blind review, republished one AFTE study. This solitary event, the Harris opinion suggests, is a “compelling” rebuttal of “the allegation by Judge Edelman in Tibbs that the AFTE Journal does not provide 'meaningful' review." \12/ But Judge Edelman never proposed that every article in the AFTE journal was without scientific merit. Rather, his point was far less extreme. It was merely that courts should not “accept at face value the assertions regarding the adequacy of the journal's peer review process.” \13/ That one article—or even dozens—published in the AFTE Journal could have been published in other journals reveals very little about the level and quality of AFTE review. After all, even a completely fraudulent review process that accepted articles for publication by flipping a coin would result in the publication of some excellent articles—but not because the review process was meaningful or trustworthy. In addition, one might ask whether the very fact that an article had to be republished in a more widely read journal fortifies the fourth point in Tibbs, that the journal’s circulation is too restricted to make its publications part of the mainstream scientific literature. The discussion of peer review and publication in Harris ignores this concern.

Beyond the AFTE Journal

The significant concerns exposed in Tibbs do not prove that the peer-reviewed scientific literature, taken as a whole, undermines firearms identification as commonly practiced. They simply mean that the list of publications over the years in the AFTE Journal may not be entitled to great weight in evaluating whether the scientific literature supports the claim of firearms and toolmark examiners to be able to supply generally accurate and reliable "opinions relative to evidence which otherwise stands mute before the bar of justice." \14/

Fortunately, newer peer-reviewed studies exist, and not all the older research appears in the AFTE Journal. \15/ Thus, the Harris court asserted that

[E]ven if the Court were to discount the numerous peer-reviewed studies published in the AFTE Journal, Mr. Weller's affidavit also cites to forty-seven other scientific studies in the field of firearm and toolmark identification that have been published in eleven other peer-reviewed scientific journals. This alone would fulfill the required publication and peer review requirement. \16/

The last sentence could be misunderstood. As a statement that the 47 studies could be the basis of a scientifically informed judgment about the validity of firearms-toolmark matching, the conclusion is correct. As a statement that checking the "peer review and publication" box on the basis of a large number of studies published in the right places "alone" is a reason to admit the challenged testimony, it would be more problematic. The "required ... requirement" (to the extent Daubert imposes one) is for a substantial body of peer-reviewed papers that form a solid foundation for a scientific assessment of a method. Unless this research literature is actually supportive of the method, however, satisfying "the required publication and peer review requirement" is not a reason to admit the evidence.

Do the 47 studies (old and new) in widely accessible, quality journals all show that examiners' opinions derived from comparing toolmarks are consistently correct and stable for the kinds of comparisons made in practice? If so, then it is high time to stop the arguments over scientific validity. If not, if the 47 studies are of varying quality, scope, and relevance to ascertaining how repeatable, reproducible, and accurate the opinions rendered by firearms-toolmark examiners are, then there is room for further analysis of whether and how these experts can provide valuable information for the legal factfinders.

NOTES

  1. 509 U.S. 579 (1993).
  2. "[T]he process that most firearms examiners use when analyzing evidence" is described in graphic detail in "[t]he Firearms Process Map, which captures the ‘as-is’ state of firearms examination, provides details about the procedures, methods and decision points most frequently encountered in firearms examination." NIST, OSAC's Firearms & Toolmarks Subcommittee Develops Firearms Process Map, Jan. 19, 2021, https://www.nist.gov/news-events/news/2021/01/osacs-firearms-toolmarks-subcommittee-develops-firearms-process-map.
  3. No. 2016 CF1 19431, 2019 D.C. Super. LEXIS 9, 2019 WL 4359486 (D.C. Super. Ct., Sept. 5, 2019).
  4. AFTE Bylaws, Preamble, https://afte.org/about-us/bylaws
  5. 2019 D.C. Super. LEXIS 9, at *35. For a decade or so, both legal academics and forensic scientists had pointed to the AFTE Journal as an example of a practitioner-oriented outlet for publications that did not follow the peer review and publication practices of other scientific journals. See, e.g., David H. Kaye, Firearm-Mark Evidence: Looking Back and Looking Ahead, 68 Case W. Res. L. Rev. 723 (2018); Jennifer L. Mnookin et al., The Need for a Research Culture in the Forensic Sciences, 58 UCLA L. Rev. 725 (2011).
  6. 2019 D.C. Super. LEXIS 9, at *32-*33.
  7. Id. at *33.
  8. United States v. Shipp, 422 F.Supp.3d 762, 776 (E.D.N.Y. 2019).
  9. Id. (citations and internal quotation marks omitted). Nevertheless, the court found "sufficient peer review." It wrote that "even assigning limited weight to the substantial fraction of the literature that is published in the AFTE Journal, this factor still weighs in favor of admissibility. Daubert found the existence of peer-reviewed literature important because “submission to the scrutiny of the scientific community ... increases the likelihood that substantive flaws in the methodology will be detected.” Daubert, 509 U.S. at 593. Despite AFTE Journal’s open peer-review process, the AFTE Theory has still been subjected to significant scrutiny. ... Therefore, the court finds that the AFTE Theory has been sufficiently subjected to 'peer review and publication' [outside of the AFTE Journal].” Daubert, 509 U.S. at 594."
  10. 502 F.Supp.3d 28 (D.D.C. 2020).
  11. Id. at 40.
  12. Id.
  13. Tibbs, 2019 D.C. Super. LEXIS 9, at *29.
  14. AFTE Bylaws, Preamble, https://afte.org/about-us/bylaws.
  15. AFTE has sought to remedy at least one complained-of feature of its peer review process. In 2020, it instituted the double-blind peer review that the Harris court found unnecessary. AFTE Peer Review Process – January 2020, https://afte.org/afte-journal/afte-journal-peer-review-process. Whether the qualifications and backgrounds of the journal's referees have been changed is not apparent from the AFTE website.
  16. Harris, 502 F.Supp.3d at 40.

Monday, April 19, 2021

What is Accuracy?

The Organization of Scientific Area Committees for Forensic Science (OSAC) has an online "lexicon" that collects definitions of terms as they appear in published standards. 1/ These may or may not be the same as definitions in textbooks or other authoritative sources. 2/ They may or may not be accurate. (Yet, the drafters of OSAC standards sometimes point to the existence of a definition in the compendium as if it were a conclusive reason to perpetuate it. 3/)

Speaking of "accurate," the word "accuracy" has five overlapping definitions in OSAC's lexicon:

  • Closeness of agreement between a measured quantitiy [sic] value and a true quantity vlaue [sic] of a measurement.
  • The degree of agreement between a test result or measurement and the accepted reference value.
  • Closeness of agreement between a test result or measurement result and the true value. 1) In practice, the accepted reference value is substituted for the true value. 2) The term “accuracy,” when applied to a set of test or measurement results, involves a combination of random components and a common systematic error or bias component. 3) Accuracy refers to a combination of trueness and precision. [ISO 3534-2:2006].
  • The closeness of agreement between a test result and the accepted reference value. 1) In practice, the accepted reference value is substituted for the true value. 2) The term "accuracy," when applied to a set of test or measurement results, involves a combination of random components and a common systematic error or bias component. 3) Accuracy refers to a combination of trueness and precision.
  • Degree of conformity of a measure to a standard or true value.

Some of the definitions in the "lexicon" are designated "preferred terms." 4/ None of the five definitions of "accuracy" is marked as preferred.

The main difficulty with the forensic scientists' set of definitions is that "accuracy" can refer to single measurements or estimates or to a process for making measurements or estimates. The longer definitions are confusing because they do not make it plain that "a combination of trueness and precision" applies to the accuracy of the process (or a large set of measurements from the process) and not so much to the accuracy of particular measurements.

"Precision" refers to the dispersion of repeated measurements under the same conditions. A precise estimate comes from a process that generates measurements that are typically tightly clustered around some value -- without regard to whether that value is the true one. A set of precise measurements -- ones that come from a process that tends to generate similar measurements when repeated -- may be far from the true value. Such measurements.(and the system that generates them) is statistically biased; these measurements have a systematic error component.

Conversely, an imprecise estimate -- one coming from a system that tends to produce widely divergent measurements -- may be essentially identical to the true value. Most other estimates from the same system would tend to stray farther from the true value, but to say that an estimate that is spot on is not accurate sounds odd. The estimate may be unreliable (in the statistical sense of coming from a process that is highly variable), but it is practically 100% accurate (in this case). Even a generally inaccurate system may produce some accurate results.

The epistemological problem is that we should not rely on an unreliable system to ascertain the true value. For extremely imprecise point estimates, accuracy (in the sense of the absence of error and correspondence to the truth) becomes a matter of luck. It is unwise to act as if a particular measurement (or a small number of them) from an unreliable system adds much to our knowledge.

But the fact that the individual estimates provide little information is not well expressed by describing a result that is (luckily) correct as lacking accuracy. The investment analyst who said that bitcoin would increase in value by 50% the next day was accurate if bitcoin's price did spike by approximately 50%. Nevertheless, this accurate prediction probably was unwarranted. Unless the analyst had a remarkable history of consistently predicting the ups and downs of bitcoin and an articulable and plausible basis for making the predictions, giving much credence to the prediction before the fact would have been unjustified.

Let's apply these elementary ideas to some forensic measurements. Suppose that analysts in a laboratory use an appropriate instrument to measure the refractive index of glass fragments. Most analysts are extremely proficient. Their measurements are both reliable (repeatability is high) and generally close to the true values. A smaller number of analysts are less proficient. Indeed, they are downright sloppy. They are not biased -- they err in both directions -- but the values they come up with are highly variable. An analyst from the proficient group obtains the value x for a particular fragment, and so does an analyst in the sloppy group.

Should we say that x is an accurate value when it comes from one of the former analysts and inaccurate when it comes from one of the latter? Some of the definitions from the standards suggest (or could be read as giving) one answer, whereas others suggest the opposite. It is far more straightforward to say that x is accurate (if it is close to the truth) in both cases.
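A small simulation (the refractive index, the spreads, and the group sizes below are all hypothetical, not real proficiency data) makes the distinction concrete: both groups of analysts are centered on the true value, but the sloppy group's readings are far more dispersed, even though any single reading from that group may happen to land next to the truth:

```python
# Minimal sketch (hypothetical numbers) of the refractive-index example:
# both groups are unbiased, but the "sloppy" group's measurements are far
# more dispersed.  Any single measurement from either group may still land
# close to the true value.
import numpy as np

rng = np.random.default_rng(0)
true_ri = 1.5180                                 # assumed true refractive index
proficient = rng.normal(true_ri, 0.0002, 1000)   # small spread
sloppy     = rng.normal(true_ri, 0.0050, 1000)   # large spread, no bias

for name, x in [("proficient", proficient), ("sloppy", sloppy)]:
    print(f"{name:10s}  mean error {abs(x.mean() - true_ri):.5f}   "
          f"spread (SD) {x.std(ddof=1):.5f}   "
          f"share within 0.001 of truth {np.mean(np.abs(x - true_ri) < 0.001):.2f}")
```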

To be sure, precision is a component of accuracy in the long run -- the imprecise analysts will tend to have lower accuracy (and higher error) rates. Their reports do not provide a sound basis for action. They are neither trustworthy nor statistically reliable. But it invites confusion to characterize every such report -- even ones that provide perfectly or approximately true values -- as inaccurate. When speaking of particular measurements, we simply need to distinguish between those that are wrong because they are far from the truth -- inaccurate -- and those that are accurate -- close to the truth either by good fortune or because of true knowledge. Systems that use luck to get the right answers are systematically inaccurate; properly functioning systems grounded on true knowledge are systematically accurate.

NOTES

  1. "The OSAC Forensic Lexicon should be the primary resource for terminology and used when drafting and editing forensic science standards and other OSAC work products. It is continually updated with the latest work from OSAC units, as well as terms from newly published documentary standards and standards elevated to the OSAC Registry." OSAC Registry, https://lexicon.forensicosac.org/ (undated).
  2. Cf. id. ("The terms and definitions in the OSAC Lexicon come from the published literature, including documentary standards, specialized dictionaries, Scientific Working Group (SWG) documents, books, journal articles, and technical reports. When a suitable definition can’t be located in any of these sources, an OSAC unit generates new or modifies existing definitions. Gradually terms are evaluated and harmonized by the OSAC to a single term. This process results in an OSAC Preferred Term."). 
  3. E.g.,  Comment Adjudication, OSAC 2021-N-0001, Wildlife Forensics Method-Collection of Known DNA Samples from Domestic Mammals, Feb. 11, 2021, at cells L25 & L27 (OSAC Proposed Standard added to the Registry Apr. 6, 2021) (link to Excel spreadsheet at https://www.nist.gov/osac/public-documents).
  4. Id. They should be called "preferred definitions" for terms, and terms that are not supposed to be used in standards anymore should be called "deprecated terms," but I digress.

Thursday, March 18, 2021

Europe's Fear of the AstraZeneca Vaccine: Post Hoc Ergo Propter Hoc?

When is the fact that Event B follows Event A proof that A causes B? The temporal sequence is a prerequisite for causation, but, in and of itself, the time sequence proves very little. Vaccines are a case in point. If many people receive a vaccine, some percentage of them will experience the slings and arrows of outrageous fortune. Some of them will suffer trauma from, say, automobile accidents. But non-vaccinated people also are caught up in auto accidents—perhaps even more often than their more cautious counterparts who chose to be vaccinated. Thus, the mere number of reports of the sequence of events A and B is not proof of any association between A and B, let alone a causal connection.

Furthermore, even if the incidence of accidents turned out to be greater among the vaccinated part of the population, the question of what might explain this association would need to be answered. Suppose, for instance, that the vaccine confers protection against Lyme disease, which is caused by a bacterium transmitted by certain tick bites. The vaccinated group might include disproportionately many hunters. Possibly, many of them may drive late at night, when they are tired, on dark, narrow, and winding roads, through woods teeming with deer and elk—factors that expose the hunters to a greater risk of an accident. The driving conditions are, in this hypothetical situation, a confounding variable.
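A toy simulation (every number in it is invented for illustration) shows how such a confounder can manufacture an association between vaccination and accidents even though, by construction, the vaccine has no effect on accident risk. Stratifying on the confounder makes the apparent effect vanish:

```python
# Minimal sketch (entirely hypothetical numbers) of the confounding story:
# the vaccine has no effect on accident risk, but because hunters are both
# more likely to be vaccinated and more likely to drive in risky conditions,
# a crude comparison shows more accidents among the vaccinated.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
hunter = rng.random(n) < 0.10                    # 10% of the population hunts
vaccinated = rng.random(n) < np.where(hunter, 0.80, 0.30)   # hunters vaccinate more
p_accident = np.where(hunter, 0.020, 0.005)      # hunting -> risky night driving
accident = rng.random(n) < p_accident            # note: no vaccine term at all

print(f"crude accident rate, vaccinated:   {accident[vaccinated].mean():.4f}")
print(f"crude accident rate, unvaccinated: {accident[~vaccinated].mean():.4f}")
# stratifying on the confounder removes the spurious association
for h in (True, False):
    grp = hunter == h
    print(f"hunters={h}:  vaccinated {accident[grp & vaccinated].mean():.4f}  "
          f"unvaccinated {accident[grp & ~vaccinated].mean():.4f}")
```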

The Loss of LYMErix

The link between Lyme-disease vaccine and automobile accidents is totally fictitious, but the history of the LYMErix vaccine provides an oft-told cautionary tale about risk perception and evaluation. In the early 1990s, SmithKline Beecham developed LYMErix to stop infections by the spirochetal bacterium Borrelia burgdorferi that causes Lyme disease. The vaccine targeted a protein (called outer-surface protein A, or OspA) of the bacterium. In a clinical trial, individuals given three doses of LYMErix showed a 76% reduction in Lyme disease in the following year, with no significant side-effects. The U.S. Food and Drug Administration approved LYMErix on December 21, 1998.

But within a year, there were reports of adverse reactions after vaccination. The media carried stories of "vaccine victims," and the Philadelphia law firm of Sheller, Ludwig & Bailey filed a class action against SmithKline Beecham. Other lawyers did the same.

By 2001, with over 1.4 million doses distributed in the United States, there were 59 reports of arthritis possibly associated with vaccination. The incidence was essentially the same as that in unvaccinated individuals. In addition, the data did not show a temporal spike in arthritis diagnoses after the second and third vaccine dose (an outcome that would be expected for the immune-mediated phenomenon that plaintiffs and others maintained was a side effect of the vaccine). The FDA found no suggestion of harm from the Lyme vaccine. Nevertheless, it soon reconvened its advisory panel for a raucous meeting with the LYMErix manufacturer, independent experts, practicing physicians, the "vaccine victims," and their lawyers.

Although the panel made no changes to the product's labelling or indications, vaccine sales plummeted. On February 26, 2002, GlaxoSmithKline (the successor to SKB) withdrew LYMErix from the market. On July 9, 2003, GSK settled the class actions by agreeing to pay over one million dollars to plaintiffs' lawyers for their fees and expenses—and nothing to the "vaccine victims."

Flash forward to 2021.

The British-Swedish pharmaceutical company AstraZeneca produced a vaccine, developed by researchers at Oxford University, that uses a harmless virus modified to contain the DNA code for a protein on the surface of the virus (SARS-CoV-2) that causes COVID-19. When this altered (recombinant) virus (called ChAdOx1) infects human cells, the cells read the inserted sequence and make copies of the SARS-CoV-2 protein. 1/ The copies induce an immune response to SARS-CoV-2. Clinical trials in various countries suggested an overall efficacy of about 70%, with estimates of 90% and 62% (with substantial margins of error) depending on the dosing regimen.

As countries rushed to vaccinate their populations, reports of blood clots following vaccinations received intense publicity. Although the company and international regulators say there is no evidence that the shot is to blame, much of Europe (Denmark, Germany, France, Italy, Spain, Ireland, the Netherlands, Norway, Sweden, Latvia, and Iceland) as well as Congo and Bulgaria suspended use of the vaccine.

The World Health Organization and the European Medicines Agency (EMA), however, continued to recommend the vaccine's use. In a statement issued today, the EMA emphasized that "the vaccine is not associated with an increase in the overall risk of blood clots (thromboembolic events) in those who receive it." A blogger for Science Translational Medicine remarked that

AstraZeneca has said that they’re aware of 15 events of deep vein thrombosis and 22 events pulmonary embolisms, but that’s in 17 million people who have had at least one shot—and they say that is indeed “much lower than would be expected to occur naturally in a general population of this size“. It also appears to be similar to what’s been seen with the other coronavirus vaccines, which rather than meaning “they’re all bad” looks like they’re all showing the same baseline signal of such events across a broad population, without adding to it.

Is the lower rate of clots for vaccinated individuals real? An explanation (other than statistical error in comparing low-probability events) comes from the chair of the EMA safety committee. Dr. Sabine Straus told the Wall Street Journal that "since blood clots are associated with Covid-19, by inoculating people against the disease, the vaccine 'likely reduces the risk of thrombotic incidents overall.'”
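A back-of-the-envelope calculation suggests why the reported numbers fall so far below any plausible background. The baseline incidence and follow-up window in the sketch below are assumptions chosen only for illustration; the 15 + 22 reported events come from the passage quoted above:

```python
# Minimal back-of-the-envelope sketch (assumed baseline rate, for illustration
# only): how many venous thromboembolic events would be expected "naturally"
# among 17 million vaccine recipients over a six-week reporting window, if the
# background incidence were roughly 1 per 1,000 person-years?
recipients = 17_000_000
assumed_annual_incidence = 1 / 1_000        # assumption, not a measured figure
window_years = 6 / 52                       # ~six weeks of follow-up (assumed)

expected = recipients * assumed_annual_incidence * window_years
reported = 15 + 22                          # DVT + pulmonary embolism reports
print(f"expected background events: about {expected:,.0f}")
print(f"reported events:            {reported}")
```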

In an interview for Stat, University of Pennsylvania biostatistician Susan Ellenberg explained that

Vaccines protect against one thing: the infection or the infection plus disease. They don’t protect you against everything else that might possibly happen to you. There’s no reason to think that somehow there’s a magical period of time like, you know, four days or a week or two weeks after you get vaccinated, when none of those other horrible things are going to happen to you.

In response to the EMA's reassurances, France, Italy, Spain and Portugal reinstated vaccinations. Officials in other countries frame their actions to halt vaccinations as "precautionary". 2/ But the vaccine itself is a precaution, prompting University of Cambridge statistician David Spiegelhalter to counter that “the cautionary approach would be to carry on vaccinating. Casting doubt—lasting doubt—on the safety of the vaccines is not a precautionary position.”

The Next Day

Researchers studying the vaccine recipients diagnosed with cerebral venous sinus thrombosis (CVST) are convinced that it results from an autoimmune response. "Pål André Holme, a professor of hematology ... who headed an investigation into the Norwegian cases, said his team had identified an antibody created by the vaccine that was triggering the adverse reaction." His conclusion that "nothing but the vaccine can explain why these individuals had this immune response" is shared by German researchers independently looking at "13 cases of CVST detected among around 1.6 million people who received the AstraZeneca vaccine." British regulators continued to characterize the allegedly causal link to the vaccine as unproven. So did the EMA, which noted that "a causal link ... is possible and deserves further analysis.”

Two Weeks Later

From Gretchen Vogel & Kai Kupferschmidt, Side Effect Worry Grows for AstraZeneca Vaccine, Science, 372:14-15, Apr. 2, 2021, https://science.sciencemag.org/content/372/6537/14, DOI: 10.1126/science.372.6537.14:

... This week, Canada and Germany joined Iceland, Sweden, Finland, and France in recommending against the vaccine's use in younger people, who seem to be at higher risk for the clotting problem and are less likely to develop severe COVID-19. .... The highly unusual combination of symptoms—widespread blood clots and a low platelet count, sometimes associated with bleeding—has so far been reported from at least seven countries. ... Estimates of the incidence range from one in 25,000 people given the AstraZeneca vaccine in Norway to at least one in 87,000 in Germany. .... The United Kingdom remains a puzzle. Despite administering more than 11 million AstraZeneca doses, it has so far reported only a handful of suspicious clotting cases. But the U.K. did not limit the vaccine to younger groups, so the average age of recipients there may be older. ...
Researchers in Germany have proposed that some component of the vaccine triggers a rare immune reaction like one occasionally seen with the blood thinner heparin, in which antibodies trigger platelets to form dangerous clots throughout the body. This week the team posted case descriptions of what they call vaccine-induced prothrombotic immune thrombocytopenia (VIPIT) on the preprint server Research Square. ...
Even if VIPIT isn't the whole story, multiple other researchers told Science they are now convinced the vaccine somehow causes the rare set of symptoms. If true, that could be a serious blow to a vaccine that is central to the World Health Organization's push to immunize the world. ...

NOTES

  1. LYMErix also was a recombinant vaccine.
  2. Germany's healthcare ministry argued that for Germany, at least, the expected number of cases of cerebral vein thrombosis was only 1.4, making the 7 cases that occurred valid grounds to pause mass vaccinations. However, if there were many subcategories of clotting-related events considered and if this is the only one to display an apparently elevated level, the pattern would not be surprising. Moreover, the EMA looked at adverse events in more than a single country when making comparisons.

REFERENCES

  • European Medicines Agency, COVID-19 Vaccine AstraZeneca: Benefits Still Outweigh the Risks Despite Possible Link to Rare Blood Clots with Low Blood Platelets, Mar. 18, 2021, https://www.ema.europa.eu/en/news/covid-19-vaccine-astrazeneca-benefits-still-outweigh-risks-despite-possible-link-rare-blood-clots
  • Matthew Herper, The Curious Case of AstraZeneca’s Covid-19 Vaccine, Stat, Mar. 15, 2021, https://www.statnews.com/2021/03/15/the-curious-case-of-astrazenecas-covid-19-vaccine/
  • Ivan F. N. Hung & Gregory A. Poland, Single-dose Oxford–AstraZeneca COVID-19 Vaccine Followed by a 12-week Booster, Lancet, Mar. 6, 2021, 397(10277): P854-855, https://doi.org/10.1016/S0140-6736(21)00528-6
  • Derek Lowe, What is Going on with the AstraZeneca/Oxford Vaccine?, Science Translational Medicine, Mar. 16, 2021, https://blogs.sciencemag.org/pipeline/archives/2021/03/16/what-is-going-on-with-the-astrazeneca-oxford-vaccin
  • Smriti Mallapaty & Ewen Callaway. What Scientists Do and Don’t Know about the Oxford–AstraZeneca COVID Vaccine. Nature News Explainer, Mar. 24, 2021, https://www.nature.com/articles/d41586-021-00785-7
  • Daniel Michaels, AstraZeneca’s Covid-19 Vaccine Cleared by EU After Blood-Clot Concerns, Wall St. J., Mar. 18. 2021, https://www.wsj.com/articles/astrazenecas-covid-19-vaccine-is-cleared-by-europe-after-blood-clot-concerns-11616083845?mod=hp_lead_pos1
  • L. E. Nigrovic & K. M. Thompson, The Lyme Vaccine: A Cautionary Tale, Epidemiology & Infection, 135(1):1–8 (2007), doi: 10.1017/S0950268806007096, PMCID: PMC2870557
  • Bojan Pancevski, Scientists Say They Found Cause of Rare Blood Clotting Linked to AstraZeneca Vaccine, Wall St. J., Mar. 19, 2021 4:02 pm ET, https://www.wsj.com/articles/scientists-say-they-found-cause-of-blood-clotting-linked-to-astrazeneca-vaccine-11616169108

Saturday, March 6, 2021

"The Judgment of an Experienced Examiner"

The quotation from a textbook on forensic science, reproduced below, invites the question of what the author was thinking. That an examiner's judgment is more important in comparisons of "class" features than "individual" ones? That the features that are present are less important than the examiner's judgment of them? Neither interpretation makes much sense. The examiner's judgment does not dictate anything about the features. It is the other way around. In applying a valid method, a proficient examiner generally will make correct judgments of significance as dictated by the features that are present.

As with most class evidence, the significance of a fiber comparison is dictated by the circumstances of the case, by the location, number, and nature of the fibers examined, and, most important, by the judgment of an experienced examiner.
Richard Saferstein, Criminalistics: An Introduction to Forensic Science 272 (12th ed. 2018, Pearson Education Inc.) (emphasis added)
Over the years, scientific and legal scholars have called for the implementation of algorithms (e.g., statistical methods) in forensic science to provide an empirical foundation to experts’ subjective conclusions. ... Reactions have ranged from passive skepticism to outright opposition, often in favor of traditional experience and expertise as a sufficient basis for conclusions. In this paper, we explore why practitioners are generally in opposition to algorithmic interventions and how their concerns might be overcome. We accomplish this by considering issues concerning human-algorithm interactions in both real world domains and laboratory studies as well as issues concerning the litigation of algorithms in the American legal system. [W]e propose a strategy for approaching the implementation of algorithms ... .
Henry Swofford & Christophe Champod, Implementation of Algorithms in Pattern & Impression Evidence: A Responsible and Practical Roadmap, 3 Forensic Sci. Int'l: Synergy 100142 (2021) (abstract)

Another example of sloppy phrasing about expert judgment appears in the boilerplate disclaimers or admonitions in ASTM standards for forensic-science methods. For example, the 2019 Standard Guide for Forensic Analysis of Fibers by Infrared Spectroscopy (E2224−19) insists (in italics no less) that

This standard cannot replace knowledge, skills, or abilities acquired through education, training, and experience and is to be used in conjunction with professional judgment by individuals with such discipline-specific knowledge, skills, and abilities.

Does the first independent clause mean that fiber analysts are free to depart from the standard on the basis of their general "knowledge, skills, or abilities acquired through education, training, and experience"? Ever since Congress funded the Organization of Scientific Area Committees for Forensic Science (OSAC) to write new and better standards, lawyers in the organization have objected to the ASTM wording (without doubting that expert methods should be applied by responsible experts).

Now rumor has it that ASTM will be changing its stock sentence to the less ambiguous observation that

This standard is intended for use by competent forensic science practitioners with the requisite formal education, discipline-specific training (see Practice E2917), and demonstrated proficiency to perform forensic casework.

That's innocuous. Indeed, you might think it goes without saying.

Wednesday, January 20, 2021

P-values versus Statistical Significance in the Selective Enforcement Case of United States v. Lopez

For several reasons, United States v. Lopez, 415 F.Supp.3d 422 (S.D.N.Y. 2019), is another stimulating opinion from U.S. District Court Judge Jed S. Rakoff. It sets forth a new rule (or a new refinement of a rule) for handling discovery requests when a criminal defendant argues that he or she is the victim of unconstitutional selective investigation. In addition, the opinion reproduces the declaration of the defendants' statistical expert that the judge found "compelling." Too often, the work of statistical consultants is not readily available for public inspection.

In Lopez, Judge Rakoff granted limited discovery into a claim of selective enforcement stemming from one type of DEA investigation -- "reverse stings." In these cases, law enforcement agents use informants to identify individuals who might want to steal drugs:

An undercover agent or informant then poses as a drug courier and offers the target an opportunity to steal drugs that do not actually exist. Targets in turn help plan and recruit other individuals to participate in a robbery of the fictitious drugs. Just before the targets are about to carry out their plan, they are arrested for conspiracy to commit the robbery and associated crimes.

Id. at 424. Defendant Johansi Lopez and other targets "who are all men of color, allege that ... the DEA limits such operations in the Southern District of New York to persons of color ... in violation of the Fifth Amendment's Equal Protection Clause." Id. at 425.

Seeking to sharpen the usual "broad discretion" approach to discovery, the court adopted the following standard for ordering discovery from the investigating agency:

where a defendant who is a member of a protected group can show that that group has been singled out for reverse sting operations to a statistically significant extent in comparison with other groups, this is sufficient to warrant further inquiry and discovery.

Id. at 427. The court was persuaded that the "of color" group was singled out on the basis of a "combination of raw data and statistical analysis." Id. This conclusion is entirely reasonable, but was the "singled out ... to a statistically significant extent" rule meant to make classical hypothesis testing at a fixed significance level the way to decide whether to grant discovery, or would an inquiry into p-values as measures of the strength of the statistical evidence against the hypothesis of "not singled out" be more appropriate? The discussion that follows suggests that little is gained by invoking the language and apparatus of hypothesis testing.

I. The "Singled Out" Standard

The "singled out" standard was offered as a departure from the one adopted by the Supreme Court in United States v. Armstrong, 517 U.S. 456 (1996). In Armstrong, Chief Justice Rehnquist wrote for the Court that to obtain discovery for a selective prosecution defense, "the required threshold [is] a credible showing of different treatment of similarly situated persons." Id. at 470 (emphasis added). The Court deemed insufficient a "study [that] failed to identify individuals who were not black and could have been prosecuted for the offenses for which respondents were charged, but were not so prosecuted." Id. (emphasis added).

In contrast, Judge Rakoff focused on "the racially disparate impact of the DEA's reverse sting operations," id. (emphasis added), and looked to other groups in the general population in ascertaining the magnitude and significance of the disparity. He maintained that "as now recognized by at least three federal circuits, selective enforcement claims should be open to discovery on a lesser showing than the very strict one required by Armstrong." Lopez, 415 F.Supp.3d at 425. In finding that this "lesser showing" had been made, he also considered three more restricted populations (of arrested individuals) that might better approximate groups in which all people are similarly plausible targets for sting operations.

Applying the "singled out ... to a statistically significant extent " standard requires not just the selection of an appropriate population in which to make comparisons among protected and unprotected classes but also a judicial determination of statistical significance. The opinion was not clear about the significance level required, but the court was impressed with an expert declaration that, unsurprisingly, applied the 0.05 level commonly used in many academic fields and most litigation involving statistical evidence.

Unfortunately, transforming a statistical convention into a rule of law does not necessarily achieve the ease of administration and degree of guidance that one would hope for. A p < 0.05 rule still leaves considerable room for discretion in deciding whether p < 0.05, invites the ubiquitous search for significance, and, if it is to be applied sensitively, requires an understanding of classical hypothesis testing that most judges and many experts lack. Lopez itself shows some of the difficulties.

II. "Raw Data" Does Not Speak for Itself

The "raw data" (as presented in the body of the opinion) was that "not a single one of the 179 individuals targeted in DEA reverse sting operations in SDNY in the past ten years was white, and that all but two were African-American or Hispanic," whereas

  • "New York and Bronx Counties ... are 20.5% African-American, 39.7% Hispanic, and 29.5% White"; and
  • The breakdown of "NYPD arrests" is
    • felony drug: "42.7% African-American, 40.8% Hispanic, and 12.7% White";
    • firearms: "65.1% African-American, 24.3% Hispanic, 9.7% White"; and
    • robbery: "60.6% African-American, 31.1% Hispanic, 5.1% White."

Id. In other words, the "raw data" are summarized by the statistics p = 177/179 = 98.88% African-American and Hispanic (AH) targets among all the DEA targets and π = 60.2%, 83.5%, 89.4%, or 91.7%, where π is the proportion of AH individuals in each of the four surrogate populations.

Presumably, Judge Rakoff uses "statistically significant" to denote any difference p − π that would arise less than one time out of 20 when repeatedly picking 179 targets without regard to race (or a variable correlated with race) in the (unknown) population from which the DEA picked its "reverse sting" targets in the Southern District. (The 0.05 significance level is conventionally used in academic writing in many fields, and, as discussed below, it was the level used by the defendants' expert in the case; however, in recent years, it has been said to be too weak to produce scientific findings that are likely to be reproducible.)
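
To see what this means in operational terms, here is a minimal simulation sketch in R (using, purely for illustration, the most conservative surrogate proportion of 0.917): pick 179 targets at random a large number of times and count how often the number of AH targets strays at least as far from its expected value as the observed 177 does.

    set.seed(1)
    n <- 179; pi0 <- 0.917                    # robbery-arrest benchmark, for illustration only
    expected <- n * pi0                       # about 164 AH targets expected
    sims <- rbinom(100000, n, pi0)            # 100,000 hypothetical race-blind selections of 179 targets
    mean(abs(sims - expected) >= abs(177 - expected))  # share of runs at least as far from expectation; well below 1/20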

Notice that all the variability in the difference between p and π arises from p. The population proportion π is fixed (for each surrogate population and, of course, in the unknown population from which targets are drawn). It also is worth asking why the data and population are limited to the Southern District. Does the DEA have a different method for picking sting targets in, say, the Eastern District? If not, and if the issue is discriminatory intent, might the "significant" difference in the Southern District be an outlier from a process that does not look to the race of the individuals who are targeted?

In itself, "raw data" tell us nothing about statistical significance. "Significance" is just a word that characterizes a set of values for a statistic. It takes some sort of analysis to determine the zone in which "significance" exists. The evaluation may be intuitive and impressionistic, or it may be formal and quantitative. Intuitively, it seems that p = 177/179 is far enough from what would be expected from a race-independent system that it should be called "significant." But how do we know this intuition is correct? This is where probability theory comes into play.

III. One Simple Statistical Analysis

Even if the probability of targeting an African-American or a Hispanic in each investigation were the largest of the surrogate population proportions (0.917), the expected number of AH targets is still only 164, and the probability of picking 177 or more (an excess of 13 or more) is 0.0000271. The probability of picking 151 or fewer AH targets (a deficit of 13 or more) is 0.000881. Hence, the probability of the observed departure from the expected value (or an even greater departure) is 0.000918, or about 1/1,100. \1/
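
The same calculation can be written directly in terms of the binomial distribution. A minimal R sketch, again using the 0.917 benchmark (the tail probabilities in the text are as reported above; the code simply shows where they come from):

    n <- 179; pi0 <- 0.917
    upper <- pbinom(176, n, pi0, lower.tail = FALSE)  # P(X >= 177): an excess of 13 or more
    lower <- pbinom(151, n, pi0)                      # P(X <= 151): a deficit of 13 or more
    upper + lower                                     # two-sided probability of so large a departure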

This two-sided p-value of 1/1,100 is less than 1/20, so we can reject the idea that the DEA selected targets without regard to AH status or some variable that is correlated with that status. That the p-value is smaller than the 1/20 cutoff makes the excess count statistically significant evidence against the hypothesis of a fixed selection probability of 0.917 in each case. The data suggest that the selection probability is at least somewhat larger than that.
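
How much larger? A rough sense comes from an exact (Clopper-Pearson) confidence interval for the selection probability, which is consistent with the declaration's later observation that the benchmark population would have to be at least 96.0% Latino or Black before the null hypothesis could not be rejected:

    # Exact 95% confidence interval for the unknown selection probability Theta,
    # given 177 AH targets out of 179; the lower end is roughly 0.96.
    binom.test(177, 179)$conf.int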

Or do they? The binomial probability model that gives rise to these numbers could itself be wrong. What if there are fewer stings than targets? If an informant proposes multiple targets for one sting (think several members of the same gang, for example), won't knowing that the first one is AH increase the probability that the second one is AH?

IV. An "Exact" Analysis

The court did not discuss the formal analysis of statistical significance in the case. However, it relied on such an analysis. After describing the "raw data," Judge Rakoff added:

Furthermore, defendants have provided compelling expert analysis demonstrating that these numbers are statistically significant. According to a rigorous analysis conducted by Dr. Crystal S. Yang, a Harvard law and economics professor, it is highly unlikely, to the point of statistical significance, that the racially disparate impact of the DEA's reverse sting operations is simply random.

Id. at 427. The phrase "highly unlikely to the point of statistical significance" sounds simple, but the significance level of α = 0.05 is not all that stringent, and neither 0.05 nor p-values for the observed proportion p represent the probability that targets of the sting operations are "simply random" as far as racial status goes. Instead, α is the probability of an observed disparity that triggers the conclusion "not simply random" assuming that the model with the parameter π is correct. If we use the court's rule for rejecting random variation as an explanation for the disparity between p and π, then the chance of ordering discovery when the disparity is nothing more than statistical noise is just one in 20. But the probability that the disparity is merely noise when that disparity is large enough to reject that explanation need not be one in 20. If I flip a fair coin five times and observe five heads, I can reject the hypothesis that the coin is fair (i.e., the string of heads is merely noise) at the α = 0.05 level. The probability of the extreme statistic p = 5/5 is (1/2)^5 = 1/32 < 1/20 if the coin is fair. But if I tell you that I blindly picked the coin from a box of 1,000 coins in which 999 were fair and one was biased, would you think that 1/20 is the probability that the coin I picked was biased? The probability for that hypothesis would be much closer to 1/1,000. (How much closer is left as an exercise to the reader who wishes to consult Bayes' theorem.)
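
For readers who want the exercise sketched out, here is a short R calculation. The text does not say how biased the odd coin is, so assume, purely for illustration, that it always lands heads:

    prior_biased <- 1/1000                    # one biased coin among the 1,000 in the box
    lik_biased   <- 1                         # P(5 heads | always-heads coin), an assumed bias
    lik_fair     <- (1/2)^5                   # P(5 heads | fair coin) = 1/32
    posterior <- (lik_biased * prior_biased) /
                 (lik_biased * prior_biased + lik_fair * (1 - prior_biased))
    posterior                                 # about 0.03, not 1/20; a more modest bias pushes it toward the 1/1,000 base rate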

Many opinions gloss over this distinction, transposing the arguments in the conditional probability that is the significance level to report things such as "[s]ocial scientists consider a finding of two standard deviations significant, meaning there is about 1 chance in 20 that the explanation for a deviation could be random." Waisome v. Port Authority, 948 F.2d 1370, 1376 (2d Cir. 1991). The significance level of 1/20 relates to the probability of data given the hypothesis, not the probability of the hypothesis given the data.

The expert report in this case did not equate the improbability of a range of data to the improbability of a hypothesis about the process that generated the data. The report is reproduced below. Dr. Yang used not just the four populations listed in the body of the opinion, but a total of eight surrogate populations as "benchmarks" to conclude that "it is extremely unlikely that random sampling from any of the hypothetical populations could yield a sample of 179 targeted individuals where 177 or more individuals are Latino or Black." Id. at 431. That is a fair statement. If we postulate simple random sampling in infinite populations with the specified values for the proportion π who are AH, the sample data for which p = 177/179 lie in a region outside of the central 95% of all possible samples. Standing alone, that critical region is not "extremely unlikely," but the largest p-value reported for all the populations in the declaration is 1/10,000.
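
Concretely, for the most conservative benchmark (π = 0.917), the central 95% region can be read off from the binomial quantiles; the observed count of 177 lies above it:

    # Middle 95% of AH counts that simple random selection of 179 targets would
    # produce when the selection probability is 0.917; 177 falls above this range.
    qbinom(c(0.025, 0.975), size = 179, prob = 0.917)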

The declaration is refreshingly free from exaggeration, but its explanation of the statistical analysis is slightly garbled. Dr. Yang swore that

Using the exact hypothesis test, I test whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion p [π in the notation I have used]. Under this test, the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion. The alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test. Each exact hypothesis test produces a corresponding p-value, which is the probability of observing a proportion as extreme or more extreme than the observed proportion assuming that the null hypothesis is true. A small p-value implies that the observed proportion is highly unlikely under the null hypothesis, favoring the rejection of the null hypothesis.

Id. at 430. The bottom line is fine, but the exposition confuses the notion of a p-value as a measure of the strength of evidence with the idea of a significance level as protection against false rejection of null hypotheses. A statistical distribution (or a non-parametric procedure) produces a p-value that can be compared to a pre-established value such as α = 0.05 to reach a test result of thumbs up (significant) or down (not significant). The hypothesis test does not "produce a ... p-value." At most, it uses a p-value to reach a conclusion. Once α is fixed, the p-value determines the test result.

Moreover, the description of the hypotheses that are being tested makes little sense. Hypotheses are statements about unknown parameter values. They are not statements about the data or about the data combined with the parameter. What does it mean to say that "the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion" or that "[t]he alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test"? These are not hypotheses about the true value of an unknown parameter of the binomial distribution used in the exact test. The observed proportion (p in my notation) is known to be different from the hypothetical population proportion (π in my notation). There is no uncertainty about that. The hypotheses to be tested have to be statements about what produced the observed "statistical difference" p − π rather than statements of whether there is a "statistical difference."

Thus, the hypothesis that was tested was not "whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion." We already know that these two known quantities are not equal. The hypothesis that Dr. Yang tested was whether (within the context of the model) the (unknown) selection probability is the known value π for a given surrogate population. If we call this unknown binomial probability Θ, she was testing claims about the true value of Θ. The null hypothesis is that the unknown Θ equals a known population proportion π; the alternative hypothesis is that Θ is not precisely equal to π. The tables in the expert declaration present strange expressions for these hypotheses, such as π = 0.917 rather than Θ = 0.917.

Nonetheless, these complaints about wording and notation might be dismissed as pedantic. The p-values for properly stated null hypotheses do signal a clear thumbs down for the null hypothesis Θ = 0.917 at the prescribed α = 0.05 level. \3/

Of more interest than these somewhat arcane points is the response to the question of dependent outcomes in selecting multiple targets for a particular investigation. Dr. Yang reasoned as follows:

[S]uppose that we took the most conservative approach and assumed that there is perfect homophily (i.e. a perfect correlation of 1) such that if one individual targeted in the operation is Latino or Black, all other individuals in that same operation are also Latino or Black. Under this conservative assumption, we can then treat the observed sample as if there were only 46 independent draws (rather than 179 draws). To observe a racial composition of 98.9% Latino or Black would thus require that at least 45 out of 46 draws resulted in a Latino or Black individual being targeted.

Id. at 431 (notes omitted). Rerunning the exact computation for p = 45/46 changes the p-value to 0.18, or roughly 1 in 5. Id. at 438 (Table 2, row h). This much larger p-value fails to meet the α = 0.05 rule that the court set.
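
Both exact tests for this benchmark take one line each in R; the first is the computation discussed in note 1 below, and the second is the homophily-adjusted version:

    binom.test(177, 179, p = 0.917)  # 179 independent draws; two-sided p-value of about 0.00006
    binom.test(45, 46, p = 0.917)    # perfect homophily, 46 draws; two-sided p-value of about 0.18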

The expert declaration reports this result along with those from seven other hypothesis tests (one for each surrogate population). Five of the tests produce "significance," and three do not. Does this mixed picture establish that p < 0.05? It seems to me that the usual procedures for assessing multiple comparisons do not apply here because the data are identical for all eight tests. The value of π in the surrogate population is being varied, and π is not a random variable. The use of multiple populations is a way to examine the robustness of the results of the statistical test. For p = 177/179, which is the only figure the court mentioned, the finding of significance is stable. For p = 45/46, however, it becomes necessary to ask which surrogate populations best approximate the unknown population of potential targets for the reverse stings. Numerical analysis cannot answer that question. \2/ As Macbeth soliloquized, "we still have judgment here."

NOTES

  1. An exact binomial computation performed in R (using binom.test(177,179,0.917)) gives a two-sided p-value of 0.0000578, which is about 1/17,300. Table 1, row h of the expert declaration reports a value of 0.0001, or 1/10,000.
  2. Dr. Yang intimates that discovery might shed light on how much of a correction is needed. Id. at 431 ("It is impossible to know the true degree of homophily or correlation among individuals targeted in reverse-sting operations, particularly when the DEA's selection criteria is unknown."); id. at 432 ("Again, because I have no knowledge of the DEA's selection criteria of potential targets, it is impossible to know which of the hypothesized populations captures the relevant pool of similarly situated individuals. A more definitive statistical analysis may be possible if the government provides the requested selection criteria.").
  3. As the declaration points out, the p-values are small enough to yield "significance" at even smaller levels, but it would not be acceptable to set up a test at the α = 0.05 level and then interpret it as demonstrating significance at a different level. Changing α after analyzing the data vitiates the interpretation of α as the probability of a false alarm, which can be written as Pr("significant difference" | Θ = π).

APPENDIX: DECLARATION OF PROFESSOR CRYSTAL S. YANG

Pursuant to 28 U.S.C. § 1746, I, CRYSTAL S. YANG, J.D., Ph.D., declare under penalty of perjury that the following is true and correct:

1. I am a Professor of Law at Harvard Law School and Faculty Research Fellow at the National Bureau of Economic Research. My primary research areas are in criminal law and criminal procedure, with a focus on testing for and quantifying racial disparities in the criminal justice system. Before joining the Harvard Law School faculty, I was an Olin Fellow and Instructor in Law at The University of Chicago Law School. I am admitted to the New York State Bar and previously worked as a Special Assistant United States Attorney in the U.S. Attorney's Office for the District of Massachusetts.

2. I received a B.A. in economics summa cum laude from Harvard University in 2008, an A.M. in statistics from Harvard University in 2008, a J.D. magna cum laude from Harvard Law School in 2013, and a Ph.D. in economics from Harvard University in 2013. My undergraduate and graduate training involved substantial coursework in quantitative methods.

3. I have published in peer-reviewed journals such as the American Economic Review, The Quarterly Journal of Economics, the American Economic Journal: Economic Policy, and have work represented in many other peer-reviewed journals and outlets.

4. I make this Declaration in support of a motion being submitted by the defendants to compel the government to provide discovery in this case.

Statistical Analyses Pertinent to Motion

5. I was retained by the Federal Defenders of New York to provide various statistical analyses relevant to the defendants' motion in this case. Specifically, I was asked to evaluate whether the observed racial composition of targets in reverse-sting operations in the Southern District of New York could be due to random chance.

6. To undertake this statistical analysis, I first had to obtain the racial composition of targeted individuals in DEA reverse-sting stash house cases brought in the Southern District of New York for the ten-year period beginning on August 5, 2009 and ending on August 5, 2019. Based on the materials in Lamar and Garcia-Pena, as well as additional searches conducted by the Federal Defenders of New York, I understand that there have been 46 fake stash house reverse-sting operations conducted by the DEA during this time period. These 46 operations targeted 179 individuals of whom zero are White, two are Asian, and 177 are Latino or Black – the “sample.” \1/ Given these counts, this means that of the targeted individuals, 98.9% are Latino or Black (and 100% are non-White). Thus, the relevant question at hand is whether the observed racial composition of the sample could be due to random chance alone if the DEA sampled from a population of similarly situated individuals.

7. Second, I had to define what the underlying population of similarly situated individuals is. In other words, what is the possible pool of all similarly situated individuals who could have been targeted by the DEA in a reverse-sting operation? Because the DEA's criteria for being a target in these reverse-sting cases is unknown, my statistical analysis will assume a variety of hypothetical benchmark populations. If the government provides its selection criteria for being a target in these reverse-sting cases, a more definitive statistical analysis may be possible. Based on materials from Garcia-Pena and Lamar, I have identified eight hypothetical benchmark populations. Below, I present the hypothesized populations and the racial composition (% Latino or Black) in each population in order of least conservative (i.e. smallest share of Latino or Black) to most conservative (i.e. highest share of Latino or Black):

a. 2016 American Community Survey 5-year estimates on counties in the SDNY (from Garcia-Pena): 48.1% Latino or Black
b. 2016 American Community Survey 5-year estimates on Bronx and New York Counties (from Garcia-Pena): 60.2% Latino or Black
c. New York Police Department (NYPD) data from January 1 – December 31, 2017, on felony drug arrests in New York City (from Garcia-Pena): 83.5% Latino or Black
d. Estimates by Prof. Kohler-Hausmann on men aged 16-49 living in New York City who have prior New York State (NYS) violent felony convictions (from Lamar): 87.1% Latino or Black
e. Estimates by Prof. Kohler-Hausmann on men aged 16-49 living in New York City who have prior NYS felony convictions (from Lamar): 87.5% Latino or Black
f. NYPD data from January 1 – December 31, 2017 on firearms seizures arrests in New York City (from Garcia-Pena): 89.4% Latino or Black
g. Reverse-sting operation defendants in the Northern District of Illinois (from Garcia-Pena): 87.7-90.7% Latino or Black \2/
h. NYPD data from January 1 – December 31, 2017 on robbery arrests in New York City (from Garcia-Pena): 91.7% Latino or Black

8. For each of these eight hypothesized populations, I then conduct an exact hypothesis test for binomial random variables. This is the standard statistical test used for calculating the exact probability of observing x “successes” out of n “draws” when the underlying probability of success is p and the underlying probability of failure is 1-p. Here, each defendant represents an independent draw and a success occurs when the defendant is Latino or Black. Using the exact hypothesis test, I test whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion p. Under this test, the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion. The alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test. \3/ Each exact hypothesis test produces a corresponding p-value, which is the probability of observing a proportion as extreme or more extreme than the observed proportion assuming that the null hypothesis is true. A small p-value implies that the observed proportion is highly unlikely under the null hypothesis, favoring the rejection of the null hypothesis.

9. The following Table 1 presents each of the eight hypothesized population proportions, the null hypothesis under each population, the alternative hypothesis under each population, and the corresponding p-value using the observed proportion of Latinos or Blacks in the sample assuming 179 independent draws:

Table 1 (x = 177, n = 179)
Hypothesized Population Proportion | Null Hypothesis (H0) | Alternative Hypothesis (Ha) | p-value
a. 48.1% Latino or Black | H0: p = 0.481 | Ha: p ≠ 0.481 | 0.0000
b. 60.2% Latino or Black | H0: p = 0.602 | Ha: p ≠ 0.602 | 0.0000
c. 83.5% Latino or Black | H0: p = 0.835 | Ha: p ≠ 0.835 | 0.0000
d. 87.1% Latino or Black | H0: p = 0.871 | Ha: p ≠ 0.871 | 0.0000
e. 87.5% Latino or Black | H0: p = 0.875 | Ha: p ≠ 0.875 | 0.0000
f. 89.4% Latino or Black | H0: p = 0.894 | Ha: p ≠ 0.894 | 0.0000
g. 90.7% Latino or Black | H0: p = 0.907 | Ha: p ≠ 0.907 | 0.0000
h. 91.7% Latino or Black | H0: p = 0.917 | Ha: p ≠ 0.917 | 0.0001

10. The above statistical calculations in Table 1 show that regardless of which of the eight hypothesized population proportions is chosen, one could reject the null hypothesis at conventional levels of statistical significance. For example, one could reject the null hypothesis at the standard 5% significance level which requires that the p-value be less than 0.05. All eight p-values are substantially smaller than 0.05 and would lead to a rejection of the null hypothesis even using more conservative 1%, 0.5%, or 0.1% significance levels. In other words, it is extremely unlikely that random sampling from any of the hypothetical populations could yield a sample of 179 targeted individuals where 177 or more individuals are Latino or Black.

11. Alternatively, one may be interested in the reverse question of what the underlying population proportion would have to be such that the observed proportion could be due to random chance alone assuming there are 179 independent draws. Using the standard 5% significance level, I have calculated that the hypothesized population would have to be composed of at least 96.0% Latinos or Blacks in order for one to not be able to reject the null hypothesis. In other words, unless the pool of similarly situated individuals is comprised of at least 96.0% Latinos or Blacks, it is highly unlikely that one could get a sample of 179 targeted individuals where 177 or more individuals are Latino or Black.

12. One potential question with the statistical analyses in Table 1 is whether the assumption that each of the 179 targeted individuals is an independent draw is reasonable. For example, what if the race/ethnicity of individuals in each reverse-sting operation is correlated, such that if one individual targeted in an operation is Latino or Black, the other individuals are also more likely to be Latino or Black? This correlation within operations could result if there is homophily, or “the principle that a contact between similar people occurs at a higher rate than among dissimilar people.” Miller McPherson et al., Birds of a Feather: Homophily in Social Networks, 27 Ann. Rev. Soc. 315, at 416 (2001). It is impossible to know the true degree of homophily or correlation among individuals targeted in reverse-sting operations, particularly when the DEA's selection criteria is unknown. But suppose that we took the most conservative approach and assumed that there is perfect homophily (i.e. a perfect correlation of 1) such that if one individual targeted in the operation is Latino or Black, all other individuals in that same operation are also Latino or Black. Under this conservative assumption, we can then treat the observed sample as if there were only 46 independent draws (rather than 179 draws). \4/ To observe a racial composition of 98.9% Latino or Black would thus require that at least 45 out of 46 draws resulted in a Latino or Black individual being targeted. \5/

13. For each of the eight hypothesized benchmark populations, I then test whether the observed proportion of Latinos or Blacks observed in this alternative sample (x = 45, n = 46) is equal to the hypothesized proportion from an underlying population assuming that there are only 46 independent draws. The following Table 2 presents each of the eight hypothesized population proportions, the null hypothesis under each population, the alternative hypothesis under each population, and the corresponding p-value using the observed proportion of Latinos or Blacks in the sample assuming 46 independent draws:

Table 2 (x = 45, n = 46)
Hypothesized Population Proportion | Null Hypothesis (H0) | Alternative Hypothesis (Ha) | p-value
a. 48.1% Latino or Black | H0: p = 0.481 | Ha: p ≠ 0.481 | 0.0000
b. 60.2% Latino or Black | H0: p = 0.602 | Ha: p ≠ 0.602 | 0.0000
c. 83.5% Latino or Black | H0: p = 0.835 | Ha: p ≠ 0.835 | 0.0045
d. 87.1% Latino or Black | H0: p = 0.871 | Ha: p ≠ 0.871 | 0.0255
e. 87.5% Latino or Black | H0: p = 0.875 | Ha: p ≠ 0.875 | 0.0257
f. 89.4% Latino or Black | H0: p = 0.894 | Ha: p ≠ 0.894 | 0.0871
g. 90.7% Latino or Black | H0: p = 0.907 | Ha: p ≠ 0.907 | 0.1240
h. 91.7% Latino or Black | H0: p = 0.917 | Ha: p ≠ 0.917 | 0.1795

14. Under this conservative assumption of perfect homophily, the above statistical calculations in Table 2 show that under the first five hypothesized population proportions (a-e), one could reject the null hypothesis at the standard 5% significance level. In other words, even if the hypothesized population proportion of Latinos or Blacks is as high as 87.5%, it is highly unlikely that random sampling could yield a sample of 46 individuals where 45 or more individuals are Latino or Black. One, however, cannot reject the null hypothesis for the next three hypothesized population proportions (f-h). Again, because I have no knowledge of the DEA's selection criteria of potential targets, it is impossible to know which of the hypothesized populations captures the relevant pool of similarly situated individuals. A more definitive statistical analysis may be possible if the government provides the requested selection criteria.

15. As before, I also ask the reverse question of what the underlying population proportion would have to be such that the observed proportion could be due to random chance alone assuming that there are only 46 independent draws. Using the standard 5% significance level, I have calculated that the hypothesized population would have to be composed of at least 88.5% Latinos or Blacks in order for one to not be able to reject the null hypothesis. In other words, unless the pool of similarly situated individuals is comprised of at least 88.5% Latinos or Blacks, it is highly unlikely that one could get a sample of 46 targeted individuals where 45 or more individuals are Latino or Black.

Dated: Cambridge, Massachusetts
September 13, 2019
/s/ Crystal S. Yang
Crystal S. Yang

Footnotes

   1. In consultation with the Federal Defenders of New York, this sample is obtained by taking the 33 cases and 144 defendants identified in Garcia-Pena or Lamar, excluding two cases and five defendants that are either not DEA cases or reverse-sting cases, and including an additional 15 cases and 40 defendants that were not covered by the time frames included in the Lamar or Garcia-Pena analysis.
   2. I choose 90.7% (the upper end of the range) as the relevant proportion given that it yields the most conservative estimates.
   3. This two-sided test takes the most conservative approach (in contrast to a one-sided test) because it allows for the possibility of both an over-representation and under-representation of Latinos or Blacks relative to the hypothesized population proportion.
   4. I make the simplifying assumption that each of the 46 operations targeted the average number of codefendants, 3.89 = 179/46.
   5. Technically, 45.494 draws would need to be of Latino or Black individuals but I conservatively round down to the nearest integer.

 UPDATED: 1/24/2021 3:13 ET