Tuesday, November 24, 2020

Wikimedia v. NSA: It's Classified!

The National Security Agency (NSA) engages in systematic, warrantless "upstream" surveillance of Internet communications that travel in and out of the United States along a "backbone" of fiber optic cables. The ACLU and other organizations maintain that Upstream surveillance is manifestly unconstitutional. Yet, the government has stymied one legal challenge after another on the ground that plaintiffs lacked standing because they could not prove that the surveillance entails intercepting, copying, and reviewing any of their communications. Of course, the reason plaintiffs have no direct evidence is that the government has denominated the surveillance program a state secret, classified its details, and resisted even in camera hearings in ordinary courts.

In Wikimedia Foundation v. National Security Agency, 857 F.3d 193 (4th Cir. 2017), however, the Court of Appeals held that the Wikimedia Foundation, which operates Wikipedia, made "allegations sufficient to survive a facial challenge to standing." Id. at 193. The court concluded that Wikimedia's allegations were plausible enough to defeat a motion to dismiss the complaint because

Wikimedia alleges three key facts that are entitled to the presumption of truth. First, “[g]iven the relatively small number of international chokepoints,” the volume of Wikimedia's communications, and the geographical diversity of the people with whom it communicates, Wikimedia's “communications almost certainly traverse every international backbone link connecting the United States with the rest of the world.”

Second, “in order for the NSA to reliably obtain communications to, from, or about its targets in the way it has described, the government,” for technical reasons that Wikimedia goes into at length, “must be copying and reviewing all the international text-based communications that travel across a given link” upon which it has installed surveillance equipment. Because details about the collection process remain classified, Wikimedia can't precisely describe the technical means that the NSA employs. Instead, it spells out the technical rules of how the Internet works and concludes that, given that the NSA is conducting Upstream surveillance on a backbone link, the rules require that the NSA do so in a certain way. ...

Third, per the PCLOB [Privacy and Civil Liberties Oversight Board] Report and a purported NSA slide, “the NSA has confirmed that it conducts Upstream surveillance at more than one point along the [I]nternet backbone.” Together, these allegations are sufficient to make plausible the conclusion that the NSA is intercepting, copying, and reviewing at least some of Wikimedia's communications. To put it simply, Wikimedia has plausibly alleged that its communications travel all of the roads that a communication can take, and that the NSA seizes all of the communications along at least one of those roads. 

Id. at 210-11 (citations omitted).

The Fourth Circuit therefore vacated an order dismissing Wikimedia's complaint issued by Senior District Judge Thomas Selby Ellis III, the self-described "impatient" jurist who achieved notoriety and collected ethics complaints (that were rejected last year) for his management of the trial of former Trump campaign manager Paul Manafort.

On remand, the government moved for summary judgment. Wikimedia Foundation v. National Security Agency/Central Security Service, 427 F.Supp.3d 582 (D. Md. 2019). Once more, the government argued that Wikimedia lacked standing to complain that the Upstream surveillance violated its Fourth Amendment rights. It suggested that the "plausible" inference that the NSA must be "intercepting, copying, and reviewing at least some of Wikimedia's communications” was not so plausible after all. To support this conclusion, it submitted a declaration of Henning Schulzrinne, a Professor of Computer Science and Electrical Engineering at Columbia University. Dr. Schulzrinne described how companies carrying Internet traffic might filter transmissions before copying them by “mirroring” with “routers” or “switches” that could perform “blacklisting” or “whitelisting” if the NSA chose to give the companies information on its targets with which to create “access control lists.”

But Dr. Schulzrinne supplied no information and formed no opinion on whether it was at all likely that the NSA used the mirroring methods that he envisioned, and Wikimedia produced a series of expert reports from Scott Bradner, who had served as Harvard University’s Technology Security Officer and taught at that university. Bradner contended that the NSA could hardly be expected to give away the classified information on its targets and concluded that it is all but certain that the agency intercepted and opened at least one of Wikimedia's trillions of Internet communications.

The district court refused to conduct an evidentiary hearing on the factual issue. Instead, it disregarded the expert's opinion as inadmissible scientific evidence under Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993), because no one without access to classified information could "know what the NSA prioritizes in the Upstream surveillance program ... and therefore Mr. Bradner has no knowledge or information about it." Wikimedia, 427 F. Supp. 3d at 604–05 (footnotes omitted).

This reasoning resembles that from Judge Ellis's first opinion in this long-running case. In Wikimedia Found. v. NSA, 143 F. Supp. 3d 344, 356 (D. Md. 2015), the judge characterized Wikimedia’s allegations as mere “suppositions and speculation, with no basis in fact, about how the NSA” operates and maintained that it was impossible for Wikimedia to prove its allegations “because the scope and scale of Upstream surveillance remain classified . . . .” Id. Rather than allow full consideration of the strength of the evidence that makes Wikimedia’s claim plausible, as the Fourth Circuit characterized it, the district court restated its position that “Mr. Bradner has no [direct] knowledge or information” because that information is classified. Wikimedia, 427 F. Supp. 3d at 604–605.

In a pending appeal to the Fourth Circuit, Edward Imwinkelried, Michael Risinger, Rebecca Wexler, and I prepared a brief as amici curiae in support of Wikimedia. The brief expresses surprise at “the district court’s highly abbreviated analysis of Rule 702 and Daubert, as well as the court’s consequent decision to rule inadmissible opinions of the type that Wikimedia’s expert offered in this case.” It describes the applicable standard for excluding expert testimony. It then argues that the expert’s method of reasoning was sound and that its factual bases regarding the nature of Internet communications and surveillance technology, together with public information on the goals and needs of the NSA program, were sufficient to justify the receipt of the proposed testimony.

Monday, September 28, 2020

Terminology Department: Significance

Inns of Court College of Advocacy, Guidance on the Preparation, Admission and Examination of Expert Evidence § 5.2 (3d ed. 2020)
Statisticians, for example, use what appear to be everyday words in specific technical senses. 'Significance' is an example. In everyday language it carries associations of importance, something with considerable meaning. In statistics it is a measure of the likelihood that a relationship between two or more variables is caused by something other than random chance.
Welcome to the ICCA

The Inns of Court College of Advocacy ... is the educational arm of the Council of the Inns of Court. The ICCA strives for ‘Academic and Professional Excellence for the Bar’. Led by the Dean, the ICCA has a team of highly experienced legal academics, educators and instructional designers. It also draws on the expertise of the profession across the Inns, Circuits, Specialist Bar Associations and the Judiciary to design and deliver bespoke training for student barristers and practitioners at all levels of seniority, both nationally, pan-profession and on an international scale.

How good is the barristers' definition of statistical significance? In statistics, an apparent association between variables is said to be significant when it lies outside the range that one would expect to see in some large fraction of repeated, identically conducted studies in which the variables are in fact uncorrelated. Sir Ronald Fisher articulated the idea as follows:

[I]t is convenient to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.’ This level ... we may call the 5 per cent. point .... If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach that level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. [1]

For Fisher, a "significant" result would occur by sheer coincidence no "more than once in twenty trials" (on average). 

Is such statistical significance the same as the barristers' "likelihood" that an observed "relationship ... is caused by something other than random chance"? One might object to the appearance of the term "likelihood" in the definition because it too is a technical term with a specialized meaning in statistics, but that is not the main problem. The vernacular likelihood that X is the cause of extreme data (where X is anything other than random chance) is not a "level of significance" such as 5%, 2%, or 1%. These levels are conditional error probabilities: If the variables are uncorrelated and we use a given level to call the observed results significant, then, in the (very) long run, we will label coincidental results as significant no more than that level specifies. For example, if we always use a 0.01 level, we will call coincidences "significant" no more than 1% of the time (in the limit).

The probability (the vernacular likelihood) "that a relationship between two or more variables is caused by something other than random chance" is quite different. Everything else being equal, significant results are more likely to signal a true relationship than are nonsignificant results, but the significance level itself refers to the probability of data that are uncommon when there is no true relationship, and not to the probability that the apparent relationship is real. In symbols, Pr(relationship | extreme data) is not Pr(extreme data | relationship). Naively swapping the terms in the expressions for the conditional probabilities is known as the transposition fallacy. In criminal cases involving statistical evidence, it often is called the "prosecutor's fallacy." Perhaps "barristers' fallacy" can be added to the list.
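To see the difference in numbers, consider a minimal simulation, sketched below in Python. It is my own illustration, not anything from the ICCA guidance or Fisher, and the 10% prevalence of real relationships, the effect size, and the sample sizes are arbitrary assumptions. The fraction of tests on uncorrelated variables that come out "significant" hovers near the 5% level, while the fraction of "significant" results that are mere coincidences can be far larger.

    # Simulation sketch: significance level vs. the probability that a
    # "significant" relationship is real. All settings are assumptions
    # chosen only for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_studies = 10_000    # hypothetical repeated studies
    prior_real = 0.10     # assume only 10% of tested relationships are real
    effect = 0.5          # assumed mean difference when a relationship exists
    n = 50                # observations per group in each study
    alpha = 0.05          # Fisher's 5 per cent point

    is_real = rng.random(n_studies) < prior_real
    p_values = np.empty(n_studies)
    for i in range(n_studies):
        x = rng.normal(0.0, 1.0, n)
        y = rng.normal(effect if is_real[i] else 0.0, 1.0, n)
        p_values[i] = stats.ttest_ind(x, y).pvalue

    significant = p_values < alpha
    # The error probability that the significance level controls:
    # Pr(significant | no relationship)
    print(round(significant[~is_real].mean(), 3))    # close to 0.05
    # What the ICCA wording actually describes:
    # Pr(no relationship | significant)
    print(round((~is_real[significant]).mean(), 3))  # much larger than 0.05 here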

REFERENCE

  1. Ronald Fisher, The Arrangement of Field Experiments, 33 J. Ministry Agric. Gt. Brit. 503-515, 504 (1926).

ACKNOWLEDGMENT: Thanks to Geoff Morrison for alerting me to the ICCA definition.

Wednesday, August 26, 2020

Terminology Department: Defining Bias for Nonstatisticians

The Organization of Scientific Area Committees for Forensic Science (OSAC) is trying to develop definitions of common technical terms that can be used across most forensic-science subject areas. "Bias" is one of these ubiquitous terms, but its statistical meaning does not conform to the usual dictionary definitions, such as  "an inclination of temperament or outlook, especially: a personal and sometimes unreasoned judgment" \1/ or "the action of supporting or opposing a particular person or thing in an unfair way, because of allowing personal opinions to influence your judgment." \2/ 

I thought the following definition might be useful for forensic-science practitioners:

A systematic tendency for estimates or measurements to be above or below their true values. A study is said to be biased if its design is such that it systematically favors certain outcomes. An estimator of a population parameter is biased when the average value of the estimates (from an infinite number of samples) would not equal the value of the parameter. Bias arises from systematic as opposed to random error in the collection of units to be measured, the measurement of the units, or the process for estimating quantities based on the measurements.

It ties together some of the simplest definitions I have seen in textbooks and reference works on statistics -- namely:

Yadolah Dodge, The Concise Encyclopedia of Statistics 41 (2008): From a statistical point of view, the bias is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. Therefore, the bias is a measure of the systematic error of an estimator. If we calculate the mean of a large number of unbiased estimations, we will find the correct value. The bias indicates the distance of the estimator from the true value of the parameter. Comment: This is the definition for mathematical statistics.
B. S. Everitt & A. Skrondal, The Cambridge Dictionary of Statistics 45 (4th ed. 2010) (citing Altman, D.G. (1991) Practical Statistics for Medical Research, Chapman and Hall, London): In general terms, deviation of results or inferences from the truth, or processes leading to such deviation. More specifically, the extent to which the statistical method used in a study does not estimate the quantity thought to be estimated, or does not test the hypothesis to be tested. In estimation usually measured by the difference between the expected value of an estimator and the true value of the parameter. An estimator for which E(θ̂) = θ is said to be unbiased. See also ascertainment bias, recall bias, selection bias and biased estimator. Comment: The general definition (first sentence) fails to differentiate between random and systematic deviations. The “more specific” definition in the next sentence is limited to the definition in mathematical statistics.
David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 283 (Federal Judicial Center & Nat’l Research Council eds., 3d ed. 2011): Also called systematic error. A systematic tendency for an estimate to be too high or too low. An estimate is unbiased if the bias is zero. (Bias does not mean prejudice, partiality, or discriminatory intent.) See nonsampling error. Comment: This one is intended to convey the essential idea to judges.
David H. Kaye, Frequentist Methods for Statistical Inference, in Handbook of Forensic Statistics (D. Banks, K. Kafadar, D. Kaye & M. Tackett eds. 2020): [A]n unbiased estimator t of [a parameter] θ will give estimates whose errors eventually should average out to zero. Error is simply the difference between the estimate and the true value. For an unbiased estimator, the expected value of the errors is E(t − θ) = 0. Comment: Yet another version of the definition of an unbiased estimator of a population or model parameter.
JCGM, International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM) (3d ed. 2012): measurement bias, bias -- estimate of a systematic measurement error Comment: The VIM misdefines bias as an estimate of bias.
David S. Moore & George P. McCabe, Introduction to the Practice of Statistics 232 (2d ed. 1993): Bias. The design of a study is biased if it systematically favors certain outcomes. In a causal study, bias can result from confounding. Or can it?
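To put the estimator sense of bias in the definitions above into numbers, here is a small simulation sketch (my own, not drawn from any of the cited texts): the sample variance computed with a divisor of n is systematically too low, while the n − 1 version is unbiased, as the averages over many samples show.

    # Sketch of estimator bias: compare the average of each estimator over
    # many samples with the true parameter value (all settings illustrative).
    import numpy as np

    rng = np.random.default_rng(1)
    true_var = 4.0                     # true population variance
    n, n_samples = 5, 200_000          # small samples, many repetitions

    samples = rng.normal(0.0, np.sqrt(true_var), size=(n_samples, n))
    var_n = samples.var(axis=1, ddof=0)    # divides by n
    var_n1 = samples.var(axis=1, ddof=1)   # divides by n - 1

    # Bias = E(estimator) - true value, approximated by averaging over samples
    print(round(var_n.mean() - true_var, 2))    # about -true_var/n = -0.8 (biased)
    print(round(var_n1.mean() - true_var, 2))   # about 0 (unbiased)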

NOTES

  1. Merriam-Webster Dictionary (online).
  2. Cambridge Dictionary (online).

Saturday, August 22, 2020

Phrases for the Value or Weight of Evidence

A few statisticians asked me (independently) about the usage of the terms evidential value, evidentiary value, and probative value. For years, I thought the phrases all meant the same thing, but that is not true in some fields.

Evidential Value

Black’s Law Dictionary (which tends to have aged definitions) has this definition of evidential value: “Value of records given as or in support of evidence, based on the certainty of the records origins. The value here is not in the record content. This certainty is essential for authentic and adequate evidence of an entity’s actions, functioning, policies, and/or structure.”

Under this definition, "evidential value" pertains to a document's value merely as the container of information. The definition distinguishes between the provenance and authenticity of a document -- where did it come from and has it been altered? -- and the content of the document -- what statements or information does it contain? Likewise, archivists distinguish between "evidential value" and "informational value." The former, according to the Society of American Archivists, "relates to the process of creation rather than the content (informational value) of the records."

Evidentiary Value

Lawyers use the phrases "evidentiary value" and "probative value" (or "probative force") as synonyms. For example, a 1932 note in the University of Pennsylvania Law Review on "Evidentiary Value of Finger-Prints" predicted that "the time is not far distant when courts must scrutinize and properly evaluate the probative force to be given to evidence that finger-prints found on the scene correspond with those of the accused." \1/

Forensic scientists use "evidentiary value" to denote the utility of examining objects for information on whether the objects have a common origin. A 2009 report of a committee of the National Academies complained that there was no standard threshold for deciding when bitemarks have "reached a threshold of evidentiary value." \2/ More generally, the phrase can denote the value of any expert analysis of effects as proof of the possible cause of those effects. \3/

Evidential Value

Unlike archivists, forensic scientists use the phrase “evidential value” interchangeably with "evidentiary value." It appears routinely in the titles of articles and books such as "Evidential Value of Multivariate Physicochemical Data," \4/ "Enhancing the Evidential Value of Fingermarks Through Successful DNA Typing," \5/ and "Establishing the Evidential Value of a Small Quantity of Material Found at a Crime Scene." \6/

Probative Value

Lawyers use "probative value" to denote the degree to which an item of evidence proves the proposition it is offered to prove. Credible evidence that a defendant threatened to kill the deceased, whose death was caused by a poison, is probative of whether the death was accidental and whether defendant was the killer. With circumstantial evidence like this, various probability-based formulations have been proposed to express probative value quantitatively. \7/ One of the simplest is the likelihood ratio or Bayes factor (BF) favored by most forensic statisticians. \8/ Its logarithm has qualities that argue for using log(BF) to express the "weight" of an item of evidence. \9/

The rules of evidence require judges to exclude evidence when unfair prejudice, distraction, and undue consumption of time in presenting the evidence substantially outweigh the probative value of the evidence. \10/ In theory, judges do not exclude evidence just because they do not believe that the witness is telling the truth. The jury will take credibility into account in deciding the case. However, in ensuring that there is sufficient probative value to bother with the evidence, judges can hardly avoid being influenced by the trustworthiness of the source of the information. Moreover, the importance of the fact addressed by the proposed testimony and the availability of alternative, less prejudicial proof also can influence the decision to exclude evidence that is probative of a material fact. \11/

NOTES

  1. Note, Evidentiary Value of Finger-Prints, 80 U. Penn. L. Rev. 887 (1932).
  2. Comm. on Identifying the Needs of the Forensic Sci. Cmty. Nat'l Research Council, Strengthening Forensic Science in the United States: A Path Forward 176 (2009).
  3. Nicholas Dempsey & Soren Blau, Evaluating the Evidentiary Value of the Analysis of Skeletal Trauma in Forensic Research: A Review of Research and Practice, 307 Forensic Sci. Int'l (2020), https://doi.org/10.1016/j.forsciint.2020.110140. Still another usage of the term occurs in epistemology. See P. Gärdenfors, B. Hansson, N-E. Sahlin, Evidentiary Value: Philosophical, Judicial and Psychological Aspects of a Theory (1983); Dennis V. Lindley, Review, 35(3) Brit. J. Phil. Sci. 293-296 (1984) (criticizing this theory).
  4. Grzegorz Zadora, Agnieszka Martyna, Daniel Ramos & Colin Aitken, Statistical Analysis in Forensic Science: Evidential Value of Multivariate Physicochemical Data (2014).
  5. Zuhaib Subhani, Barbara Daniel & Nunzianda Frascione, DNA Profiles from Fingerprint Lifts—Enhancing the Evidential Value of Fingermarks Through Successful DNA Typing, 64(1) J. Forensic Sci. 201–06 (2019), https://doi.org/10.1111/1556-4029.13830.
  6. I.W. Evett, Establishing the Evidential Value of a Small Quantity of Material Found at a Crime Scene, 33(2) J. Forensic Sci. Soc’y 83-86 (1993).
  7. 1 McCormick on Evidence § 185 (R. Mosteller ed., 8th ed. 2020); David H. Kaye, Review-essay, Digging into the Foundations of Evidence Law, 116 Mich. L. Rev. 915-34 (2017), http://ssrn.com/abstract=2903618.
  8. See Anuradha Akmeemana, Peter Weis, Ruthmara Corzo, Daniel Ramos, Peter Zoon, Tatiana Trejos, Troy Ernst, Chip Pollock, Ela Bakowska, Cedric Neumann & Jose Almirall, Interpretation of Chemical Data from Glass Analysis for Forensic Purposes, J. Chemometrics (2020), DOI:10.1002/cem.3267.
  9. I. J. Good, Weight of Evidence: A Brief Survey, in 2 Bayesian Statistics 249–270 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley & A. F. M. Smith eds., 1985); Irving John Good, Weight of Evidence and the Bayesian Likelihood Ratio, in The Use of Statistics in Forensic Science 85–106 (C. G. G. Aitken & David A. Stoney eds., 1991).
  10. Fed. R. Evid. 403; Unif. R. Evid. 403; 1 McCormick, supra note 7, § 185.
  11. 1 McCormick, supra note 7, § 185.

Tuesday, August 18, 2020

"Quite High" Accuracy for Firearms-mark Comparisons

Court challenges to the validity of forensic identification of the gun that fired a bullet based on toolmark comparisons have increased since the President's Council of Advisors on Science and Technology (PCAST) issued a report in late 2016 stressing the limitations in the scientific research on the subject. A study from the Netherlands preprinted in 2019 adds to the research literature. The abstract reads (in part):

Forensic firearm examiners compare the features in cartridge cases to provide a judgment addressing the question about their source: do they originate from one and the same or from two different firearms? In this article, the validity and reliability of these judgments is studied and compared to the outcomes of a computer-based method. The ... true positive rates (sensitivity) and the true negative rates (specificity) of firearm examiners are quite high. ... The examiners are overconfident, giving judgments of evidential strength that are too high. The judgments of the examiners and the outcomes of the computer-based method are only moderately correlated. We suggest to implement performance feedback to reduce overconfidence, to improve the calibration of degree of support judgments, and to study the possibility of combining the judgments of examiners and the outcomes of computer-based methods to increase the overall validity.

Erwin J.A.T. Mattijssen, Cilia L.M. Witteman, Charles E.H. Berger, Nicolaas W. Brand & Reinoud D. Stoel, Validity and Reliability of Forensic Firearm Examiners, 307 Forensic Sci. Int'l 110112 (2020).

Despite the characterization of examiner sensitivity and specificity as "quite high," the observed specificity was only 0.89, which corresponds to a false-positive rate of 11%—much higher than the <2% estimate quoted in recent judicial opinions. But the false-positive proportions from different experiments are not as discordant as they might appear to be when naively juxtaposed. To appreciate the sensitivity and specificity reported in this experiment, we need to understand the way that the validity test was constructed.

Design of the Study

The researchers fired two cartridges from each of two hundred 9 mm Luger Glock pistols seized in the Netherlands. These 400 test firings gave rise to true (same-source) and false (different-source) pairings of two-dimensional comparison images of the striation patterns on cartridge cases. Specifically, from the 400 cartridge cases the researchers made "measurements of the striations of the firing pin aperture shear marks" and prepared "digital images [of magnifications of] the striation patterns using oblique lighting, optimized to show as many of the striations as possible while avoiding overexposure." (They also produced three-dimensional data, but I won't discuss those here.)

They invited forensic firearm examiners from Europe, North America, South America, Asia, and Oceania by e-mail to examine the images. Of the recipients, 112 participated, but only 77 completed the online questionnaire, which presented 60 pairs of striation-pattern images aligned side by side. (The 400 images gave rise to (400 × 399)/2 = 79,800 distinct pairs of images, of which 200 were same-source pairs. The researchers could hardly ask the volunteers to study all 79,800 pairs, so they used a computer program for matching such patterns to select 60 pairs that seemed to cover "the full range of comparison difficulty" but that overrepresented "difficult" pairs, an important choice discussed below. Of the 60, 38 were same-source pairs, and 22 were different-source pairs.)
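As a quick check on the arithmetic in that parenthetical, here is a short sketch (my own, not the researchers' code) of the pairings that 400 cartridge cases, two per pistol, can generate:

    # Arithmetic check: 400 cartridge cases, two fired from each of 200 pistols.
    from math import comb

    cases, pistols, per_pistol = 400, 200, 2
    total_pairs = comb(cases, 2)                  # 400 x 399 / 2 = 79,800 distinct pairs
    same_source = pistols * comb(per_pistol, 2)   # one same-source pair per pistol = 200
    different_source = total_pairs - same_source  # 79,600
    print(total_pairs, same_source, different_source)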

The examiners first evaluated the degree of similarity on a five-point scale. Then they were shown the 60 pairs again and asked (1) whether the comparison provided support for the proposition that the striations in the cartridge cases were the result of firing the cartridge cases with one (same-source) or with two (different-source) Glock pistols; (2) for their sense of the degree of support for this conclusion; and (3) whether they would have reported the comparison as inconclusive in casework.

The degree of support was reported or placed on a six-point scale of "weak support" (likelihood ratio L = 2 to 10), "moderate support" (L = 10 to 100), "moderately strong support" (L = 100 to 1,000), "strong support" (L = 1,000 to 10,000), "very strong support" (L = 10,000 to 1,000,000), and "extremely strong support" (L > 1,000,000). The computerized system mentioned above also generated numerical likelihood ratios. (The proximity of the ratio to 1 was taken as a measure of difficulty.)

A Few of the Findings

For the 60 two-dimensional comparisons, the computer program and the human examiners performed as follows:

Table 1. Results for (computer | examiners), excluding pairs that the examiners deemed inconclusive.

                  SS pair           DS pair
SS outcome        (36 | 2365)       (10 | 95)
DS outcome        (2 | 74)          (12 | 784)

Validity:
sens = (.95 | .97)    FNP = (.05 | .03)
spec = (.55 | .89)    FPP = (.45 | .11)

Abbreviations: SS = same source; DS = different source;
sens = sensitivity; spec = specificity;
FNP = false negative proportion; FPP = false positive proportion

Table 1 combines two of the tables in the article. The entry "36 | 2365," for example, denotes that the computer program correctly classified as same-source 36 of the 38 same-source pairs (95%), while the 77 examiners correctly classified 2,365 of the 2,439 same-source comparisons (97%) that they did not consider inconclusive. The computer program did not have the option to avoid a conclusion (or rather a likelihood ratio) in borderline cases. When the examiners' conclusions on the pairs they would have called inconclusive in practice were added in, the sensitivity and specificity dropped to 0.93 and 0.81, respectively.
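For readers who want to verify the rates, the following sketch (my own restatement of the counts in Table 1, not the authors' code) recomputes the sensitivities, specificities, and error proportions:

    # Recompute the rates in Table 1 from the tabulated counts.
    def rates(ss_called_ss, ss_called_ds, ds_called_ss, ds_called_ds):
        sens = ss_called_ss / (ss_called_ss + ss_called_ds)   # true positive rate
        spec = ds_called_ds / (ds_called_ss + ds_called_ds)   # true negative rate
        return sens, 1 - sens, spec, 1 - spec                 # sens, FNP, spec, FPP

    print("computer:  sens=%.2f FNP=%.2f spec=%.2f FPP=%.2f" % rates(36, 2, 10, 12))
    print("examiners: sens=%.2f FNP=%.2f spec=%.2f FPP=%.2f" % rates(2365, 74, 95, 784))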

Making Sense of These Results

The reason there are more comparisons for the examiners is that there were 77 of them and only one computer program. The 77 examiners made 77 × 60 = 4,620 comparisons, while the computer program made only 1 × 60 comparisons on the 60 pairs. Those pairs, as I noted earlier, were not a representative sample. On all 79,800 possible pairings of the test-fired cartridge cases, the tireless computer program's sensitivity and specificity were both 0.99. If we can assume that the human examiners would have done at least as well as the computer program that they outperformed (on the 60 more or less "difficult" pairs), their performance on all possible pairs would have been excellent.

An "Error Rate" for Court?

The experiment gives a number for the false-positive "error rate": 11% of the examiners' conclusive judgments on the 22 different-source pairs were misclassifications. If we conceive of the examiners' judgments as a random sample from some hypothetical population of identically conducted experiments, then the true false-positive error probability could be somewhat higher (as emphasized in the PCAST report) or lower. How should such numbers be used in admissibility rulings under Daubert v. Merrell Dow Pharmaceuticals, Inc.? At trial, to give the jury a sense of the chance of a false-positive error (as PCAST also urged)?
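One rough way to see how much higher or lower is sketched below. It is my own back-of-the-envelope calculation, not anything from the paper or the PCAST report, and it treats the 879 conclusive different-source judgments as independent trials, which they are not (77 examiners judged the same 22 pairs), so it understates the real uncertainty. The point is only that 11% is an estimate, not a fixed quantity.

    # Rough sampling-uncertainty sketch for the observed false-positive proportion.
    # Assumes independent trials, which overstates the precision.
    import math

    errors, trials = 95, 879     # conclusive false positives / conclusive DS judgments
    p_hat = errors / trials      # observed false-positive proportion (about 0.11)
    se = math.sqrt(p_hat * (1 - p_hat) / trials)
    low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
    print(round(p_hat, 3), round(low, 3), round(high, 3))   # roughly 0.108 (0.088, 0.129)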

For admissibility, Daubert referred (indirectly) to false-positive proportions in particular studies of "voice prints," although the more apt statistic would be a likelihood ratio for a positive classification. For the Netherlands study, that would be L+ = Pr(+|SS) / Pr(+|DS) ≈ 0.97/0.11 = 8.8. In words, it is almost nine times more probable that an examiner (like the ones in the study) will report a match when confronted with a randomly selected same-source pair of images than a randomly selected different-source pair (from the set of 60 constructed for the experiment). That validates the examiners' general ability to distinguish between same- and different-source pairs at a modest level of accuracy in that sample.
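The arithmetic behind that figure, restated as a short sketch of my own using the rounded rates quoted above:

    # Likelihood ratio for a positive classification, from the rounded rates.
    sens = 0.97   # Pr(match reported | same-source pair)
    fpp = 0.11    # Pr(match reported | different-source pair)
    print(round(sens / fpp, 1))   # L+ = 8.8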

But to what extent can or should this figure (or just the false-positive probability estimate of 0.11) be presented to a factfinder as a measure of the probative value of a declared match? In this regard, arguments arise over presenting an average figure in a specific case (although that seems like a common enough practice in statistics) and the realism of the experiment. The researchers warn that

For several reasons it is not possible to directly relate the true positive and true negative rates, and the false positive and false negative rates of the examiners in this study to actual casework. One of these reasons is that the 60 comparisons we used were selected to over-represent ‘difficult’ comparisons. In addition, the use of the online questionnaire did not enable the examiners to manually compare the features of the cartridge cases as they would normally do in casework. They could not include in their considerations the features of other firearm components, and their results and conclusions were not peer reviewed. Enabling examiners to follow their standard operating procedures could result in better performance.

There Is More

Other facets of the paper also make it recommended reading. Data on the reliability of conclusions (both within and across examiners) are presented, and an analysis of the calibration of the examiners' judgments of how strongly the images supported their source attributions led the authors to remark that

When we look at the actual proportion of misleading choices, the examiners judged lower relative frequencies of occurrence (and thus more extreme LRs) than expected if their judgments would have been well-calibrated. This can be seen as overconfidence, where examiners provide unwarranted support for either same-source or different-source propositions, resulting in LRs that are too high or too low, respectively. ... Simply warning examiners about overconfidence or asking them to explain their judgments does not necessarily decrease overconfidence of judgments.