Saturday, March 6, 2021

"The Judgment of an Experienced Examiner"

The quotation below from a textbook on forensic science invites the question of what the author was thinking. That an examiner's judgment is important in comparisons of "class" features in trace evidence but not so much in comparisons of "individual" features? That can't be right. That the features that are present are less important than the examiner's judgment of them? That makes no sense either. The examiner's judgment does not dictate anything about the features. It is the other way around. In applying a valid method, a proficient examiner will make accurate judgments of significance as dictated by the features that are present.

As with most class evidence, the significance of a fiber comparison is dictated by the circumstances of the case, by the location, number, and nature of the fibers examined, and, most important, by the judgment of an experienced examiner.
Richard Saferstein, Criminalistics: An Introduction to Forensic Science 272 (12th ed., Pearson Education Inc. 2018) (emphasis added)
Over the years, scientific and legal scholars have called for the implementation of algorithms (e.g., statistical methods) in forensic science to provide an empirical foundation to experts’ subjective conclusions. ... Reactions have ranged from passive skepticism to outright opposition, often in favor of traditional experience and expertise as a sufficient basis for conclusions. In this paper, we explore why practitioners are generally in opposition to algorithmic interventions and how their concerns might be overcome. We accomplish this by considering issues concerning human-algorithm interactions in both real world domains and laboratory studies as well as issues concerning the litigation of algorithms in the American legal system. [W]e propose a strategy for approaching the implementation of algorithms ... .
Henry Swofford & Christophe Champod, Implementation of Algorithms in Pattern & Impression Evidence: A Responsible and Practical Roadmap, 3 Forensic Sci. Int'l: Synergy 100142 (2021) (abstract)

Another example of sloppy phrasing about expert judgment is the boilerplate disclaimer, or admonition, that appears in ASTM standards for forensic-science methods. For example, the 2019 Standard Guide for Forensic Analysis of Fibers by Infrared Spectroscopy (E2224−19) insists (in italics no less) that

This standard cannot replace knowledge, skills, or abilities acquired through education, training, and experience and is to be used in conjunction with professional judgment by individuals with such discipline-specific knowledge, skills, and abilities.

Does the first independent clause mean that fiber analysts are free to depart from the standard on the basis of their general "knowledge, skills, or abilities acquired through education, training, and experience"? Ever since Congress funded the Organization of Scientific Area Committees for Forensic Science (OSAC) to write new and better standards, lawyers in the organization seeking to foreclose any possibility of misunderstanding have objected to the ASTM wording (without doubting that expert methods should be applied by responsible experts).

Now rumor has it that ASTM will be changing its stock sentence to the less ambiguous observation that

This standard is intended for use by competent forensic science practitioners with the requisite formal education, discipline-specific training (see Practice E2917), and demonstrated proficiency to perform forensic casework.

That's innocuous. Indeed, you might think it goes without saying.

Wednesday, January 20, 2021

P-values versus Statistical Significance in the Selective Enforcement Case of United States v. Lopez

For several reasons, United States v. Lopez, 415 F.Supp.3d 422 (S.D.N.Y. 2019), is another stimulating opinion from U.S. District Court Judge Jed S. Rakoff. It sets forth a new rule (or a new refinement of a rule) for handling discovery requests when a criminal defendant argues that he or she is the victim of unconstitutional selective investigation. In addition, the opinion reproduces the declaration of the defendants' statistical expert that the judge found "compelling." Too often, the work of statistical consultants is not readily available for public inspection.

In Lopez, Judge Rakoff granted limited discovery into a claim of selective enforcement stemming from one type of DEA investigation -- "reverse stings." In these cases, law enforcement agents use informants to identify individuals who might want to steal drugs:

An undercover agent or informant then poses as a drug courier and offers the target an opportunity to steal drugs that do not actually exist. Targets in turn help plan and recruit other individuals to participate in a robbery of the fictitious drugs. Just before the targets are about to carry out their plan, they are arrested for conspiracy to commit the robbery and associated crimes.

Id. at 424. Defendant Johansi Lopez and other targets "who are all men of color, allege that ... the DEA limits such operations in the Southern District of New York to persons of color ... in violation of the Fifth Amendment's Equal Protection Clause." Id. at 425.

Seeking to sharpen the usual "broad discretion" approach to discovery, the court adopted the following standard for ordering discovery from the investigating agency:

where a defendant who is a member of a protected group can show that that group has been singled out for reverse sting operations to a statistically significant extent in comparison with other groups, this is sufficient to warrant further inquiry and discovery.

Id. at 427. The court was persuaded that the "of color" group was singled out on the basis of a "combination of raw data and statistical analysis." Id. This conclusion is entirely reasonable, but was the "singled out ... to a statistically significant extent" rule meant to make classical hypothesis testing at a fixed significance level the way to decide whether to grant discovery, or would an inquiry into p-values as measures of the strength of the statistical evidence against the hypothesis of "not singled out" be more appropriate? The discussion that follows suggests that little is gained by invoking the language and apparatus of hypothesis testing.

I. The "Singled Out" Standard

The "singled out" standard was offered as a departure from the one adopted by the Supreme Court in United States v. Armstrong, 517 U.S. 456 (1996). In Armstrong, Chief Justice Rehnquist wrote for the Court that to obtain discovery for a selective prosecution defense, "the required threshold [is] a credible showing of different treatment of similarly situated persons." Id. at 470 (emphasis added). The Court deemed insufficient a "study [that] failed to identify individuals who were not black and could have been prosecuted for the offenses for which respondents were charged, but were not so prosecuted." Id. (emphasis added).

In contrast, Judge Rakoff focused on "the racially disparate impact of the DEA's reverse sting operations," id. (emphasis added), and looked to other groups in the general population in ascertaining the magnitude and significance of the disparity. He maintained that "as now recognized by at least three federal circuits, selective enforcement claims should be open to discovery on a lesser showing than the very strict one required by Armstrong." Lopez, 415 F.Supp.3d at 425. In finding that this "lesser showing" had been made, he also considered three more restricted populations (of arrested individuals) that might better approximate groups in which all people are similarly plausible targets for sting operations.

Applying the "singled out ... to a statistically significant extent " standard requires not just the selection of an appropriate population in which to make comparisons among protected and unprotected classes but also a judicial determination of statistical significance. The opinion was not clear about the significance level required, but the court was impressed with an expert declaration that, unsurprisingly, applied the 0.05 level commonly used in many academic fields and most litigation involving statistical evidence.

Unfortunately, transforming a statistical convention into a rule of law does not necessarily achieve the ease of administration and degree of guidance that one would hope for. A p < 0.05 rule still leaves open considerable room for discretion in deciding whether p < 0.05, invites the ubiquitous search for significance, and requires an understanding of classical hypothesis testing that most judges and many experts lack if it is to be applied sensitively. Lopez itself shows some of the difficulties.

II. "Raw Data" Does Not Speak for Itself

The "raw data" (as presented in the body of the opinion) was that "not a single one of the 179 individuals targeted in DEA reverse sting operations in SDNY in the past ten years was white, and that all but two were African-American or Hispanic," whereas

  • "New York and Bronx Counties ... are 20.5% African-American, 39.7% Hispanic, and 29.5% White"; and
  • The breakdown of "NYPD arrests" is
    • felony drug: "42.7% African-American, 40.8% Hispanic, and 12.7% White";
    • firearms: "65.1% African-American, 24.3% Hispanic, 9.7% White"; and
    • robbery: "60.6% African-American, 31.1% Hispanic, 5.1% White."

Id. In other words, the "raw data" are summarized by the statistics p = 177/179 = 98.88% African-American and Hispanic (AH) targets among all the DEA targets and π = 60.2%, 83.5%, 89.4%, or 91.7%, where π is the AH proportion in each of the four surrogate populations.

Presumably, Judge Rakoff uses "statistically significant" to denote any difference p − π that would arise less than one time out of 20 when repeatedly picking 179 targets without regard to race (or a variable correlated with race) in the (unknown) population from which the DEA picked its "reverse sting" targets in the Southern District. (The 0.05 significance level is conventionally used in academic writing in many fields, and, as discussed below, it was the level used by the defendants' expert in the case; however, in recent years, it has been said to be too weak to produce scientific findings that are likely to be reproducible.)
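
To see what this means in practice, here is a small simulation sketch in R (0.917 is the largest of the surrogate proportions listed above; the number of simulated draws and the seed are arbitrary choices of mine):

    # Simulate repeatedly picking 179 targets "without regard to race" from a population
    # that is 91.7% AH, and count how often the AH total reaches 177 or more.
    set.seed(1)                                    # arbitrary seed, for reproducibility
    sims <- rbinom(1e6, size = 179, prob = 0.917)  # one million simulated sets of 179 targets
    mean(sims >= 177)                              # relative frequency of 177 or more AH targets

The relative frequency should be on the order of the exact upper-tail probability computed in Part III below.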

Notice that all the variability in the difference between p and π arises from p. The population proportion π is fixed (for each surrogate population and, of course, in the unknown population from which targets are drawn). It also is worth asking why the data and population are limited to the Southern District. Does the DEA have a different method for picking sting targets in, say, the Eastern District? If not, and if the issue is discriminatory intent, might the "significant" difference in the Southern District be an outlier from a process that does not look to the race of the individuals who are targeted?

In itself, "raw data" tell us nothing about statistical significance. "Significance" is just a word that characterizes a set of values for a statistic. It takes some sort of analysis to determine the zone in which "significance" exists. The evaluation may be intuitive and impressionistic, or it may be formal and quantitative. Intuitively, it seems that p = 177/179 is far enough from what would be expected from a race-independent system that it should be called "significant." But how do we know this intuition is correct? This is where probability theory comes into play.

III. One Simple Statistical Analysis

Even if the probability of targeting an African-American or a Hispanic in each investigation were the largest of the surrogate population proportions (0.917), the expected number of AH targets is still only 164, and the probability of picking 177 or more (an excess of 13 or more) is 0.0000271. The probability of picking 151 or fewer AH targets (a deficit of 13 or more) is 0.000881. Hence, the probability of the observed departure from the expected value (or an even greater departure) is 0.000918, or about 1/1,100. \1/
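
These tail probabilities can be checked in R with the cumulative binomial function pbinom (a minimal sketch):

    # Upper tail: probability of 177 or more AH targets out of 179 when each pick is AH with probability 0.917
    1 - pbinom(176, size = 179, prob = 0.917)   # about 0.0000271
    # Lower tail: probability of 151 or fewer AH targets (an equally large deficit from the expected 164)
    pbinom(151, size = 179, prob = 0.917)       # about 0.000881
    # The figure in the text is the sum of these two tail probabilities (about 0.000918).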

This two-sided p-value of 1/1,100 is less than 1/20, so we can reject the idea that the DEA selected targets without regard to AH status or some variable that is correlated with that status. That the p-value is smaller than the 1/20 cutoff makes the excess count statistically significant evidence against the hypothesis of a fixed selection probability of 0.917 in each case. The data suggest that the selection probability is at least somewhat larger than that.

Or do they? The binomial probability model that gives rise to these numbers could itself be wrong. What if there are fewer stings than targets? If an informant proposes multiple targets for one sting (think several members of the same gang, for example), won't knowing that the first one is AH increase the probability that the second one is AH?

IV. An "Exact" Analysis

The court did not discuss the formal analysis of statistical significance in the case. However, it relied on such an analysis. After describing the "raw data," Judge Rakoff added:

Furthermore, defendants have provided compelling expert analysis demonstrating that these numbers are statistically significant. According to a rigorous analysis conducted by Dr. Crystal S. Yang, a Harvard law and economics professor, it is highly unlikely, to the point of statistical significance, that the racially disparate impact of the DEA's reverse sting operations is simply random.

Id. at 427. The phrase "highly unlikely to the point of statistical significance" sounds simple, but the significance level of α = 0.05 is not all that stringent, and neither 0.05 nor p-values for the observed proportion p represent the probability that targets of the sting operations are "simply random" as far as racial status goes. Instead, α is the probability of an observed disparity that triggers the conclusion "not simply random" assuming that the model with the parameter π is correct. If we use the court's rule for rejecting random variation as an explanation for the disparity between p and π, then the chance of ordering discovery when the disparity is nothing more than statistical noise is at most one in 20. But the probability that the disparity is merely noise when that disparity is large enough to reject that explanation need not be one in 20. If I flip a fair coin five times and observe five heads, I can reject the hypothesis that the coin is fair (i.e., the string of heads is merely noise) at the α = 0.05 level. The probability of the extreme statistic p = 5/5 is (1/2)^5 = 1/32 < 1/20 if the coin is fair. But if I tell you that I blindly picked the coin from a box of 1,000 coins in which 999 were fair and one was biased, would you think that 1/20 is the probability that the coin I picked was biased? The probability for that hypothesis would be much closer to 1/1,000. (How much closer is left as an exercise to the reader who wishes to consult Bayes' theorem.)
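
For readers who want a worked version of the coin example, the Bayes'-theorem calculation can be sketched in a few lines of R. The text does not say how biased the one biased coin is, so I assume, purely for illustration, that it lands heads 60% of the time; the exact posterior depends on that assumption.

    # Posterior probability that the drawn coin is the biased one, given five heads in five flips
    prior_biased <- 1/1000           # 1 of the 1,000 coins in the box is biased
    lik_fair     <- 0.5^5            # Pr(5 heads | fair coin) = 1/32
    lik_biased   <- 0.6^5            # Pr(5 heads | biased coin), under the assumed 60% bias
    posterior <- (prior_biased * lik_biased) /
                 (prior_biased * lik_biased + (1 - prior_biased) * lik_fair)
    posterior                        # roughly 0.0025 -- far from 1/20, and much nearer the 1/1,000 prior

A more extreme assumed bias would raise the posterior somewhat, but however the bias is specified, the posterior odds are the product of the prior odds and the likelihood ratio; they are not given by the significance level.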

Many opinions gloss over this distinction, transposing the arguments in the conditional probability that is the significance level to report things such as "[s]ocial scientists consider a finding of two standard deviations significant, meaning there is about 1 chance in 20 that the explanation for a deviation could be random." Waisome v. Port Authority, 948 F.2d 1370, 1376 (2d Cir. 1991). The significance level of 1/20 relates to the probability of data given the hypothesis, not the probability of the hypothesis given the data.

The expert report in this case did not equate the improbability of a range of data to the improbability of a hypothesis about the process that generated the data. The report is reproduced below. Dr. Yang used not just the four populations listed in the body of the opinion, but a total of eight surrogate populations as "benchmarks" to conclude that "it is extremely unlikely that random sampling from any of the hypothetical populations could yield a sample of 179 targeted individuals where 177 or more individuals are Latino or Black." Id. at 431. That is a fair statement. If we postulate simple random sampling in infinite populations with the specified values for the proportion π who are AH, the sample data for which p = 177/179 lie in a region outside of the central 95% of all possible samples. Standing alone, that critical region is not "extremely unlikely," but the largest p-value reported for all the populations in the declaration is 1/10,000.

The declaration is refreshingly free from exaggeration, but its explanation of the statistical analysis is slightly garbled. Dr. Yang swore that

Using the exact hypothesis test, I test whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion p [π in the notation I have used]. Under this test, the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion. The alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test. Each exact hypothesis test produces a corresponding p-value, which is the probability of observing a proportion as extreme or more extreme than the observed proportion assuming that the null hypothesis is true. A small p-value implies that the observed proportion is highly unlikely under the null hypothesis, favoring the rejection of the null hypothesis.

Id. at 430. The bottom line is fine, but the exposition confuses the notion of a p-value as a measure of the strength of evidence with the idea of a significance level as protection against false rejection of null hypotheses. A statistical distribution (or a non-parametric procedure) produces a p-value that can be compared to a pre-established value such as α = 0.05 to reach a test result of thumbs up (significant) or down (not significant). The hypothesis test does not "produce a ... p-value." At most, it uses a p-value to reach a conclusion. Once α is fixed, the p-value produces the test result.

Moreover, the description of the hypotheses that are being tested makes little sense. Hypotheses are statements about unknown parameter values. They are not statements about the data or about the data combined with the parameter. What does it mean to say that "the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion" or that "[t]he alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test"? These are not hypotheses about the true value of an unknown parameter of the binomial distribution used in the exact test. The observed proportion (p in my notation) is known to be different from the hypothetical population proportion (π in my notation). There is no uncertainty about that. The hypotheses to be tested have to be statements about what produced the observed "statistical difference" p − π rather than statements of whether there is a "statistical difference."

Thus, the hypothesis that was tested was not "whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion." We already know that these two known quantities are not equal. The hypothesis that Dr. Yang tested was whether (within the context of the model) the (unknown) selection probability is the known value π for a given surrogate population. If we call this unknown binomial probability Θ, she was testing claims about the true value of Θ. The null hypothesis is that the unknown Θ equals a known population proportion π; the alternative hypothesis is that Θ is not precisely equal to π. The tables in the expert declaration present strange expressions for these hypotheses, such as π = 0.917 rather than Θ = 0.917.

Nonetheless, these complaints about wording and notation might be dismissed as pedantic. The p-values for properly stated null hypotheses do signal a clear thumbs down for the null hypothesis Θ = 0.917 at the prescribed α = 0.05 level. \3/

Of more interest than these somewhat arcane points is the response to the question of dependent outcomes in selecting multiple targets for a particular investigation. Dr. Yang reasoned as follows:

[S]uppose that we took the most conservative approach and assumed that there is perfect homophily (i.e. a perfect correlation of 1) such that if one individual targeted in the operation is Latino or Black, all other individuals in that same operation are also Latino or Black. Under this conservative assumption, we can then treat the observed sample as if there were only 46 independent draws (rather than 179 draws). To observe a racial composition of 98.9% Latino or Black would thus require that at least 45 out of 46 draws resulted in a Latino or Black individual being targeted.

Id. at 431 (notes omitted). Rerunning the exact computation for p = 45/46 changes the p-value to 0.18, or roughly 1 in 5. Id. at 438 (Table 2, row h). This much larger p-value fails to meet the α = 0.05 rule that the court set.
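
The declaration's exact tests can be reproduced (to rounding) with R's binom.test, as in this sketch:

    # Two-sided exact binomial tests against the eight benchmark proportions in the declaration
    pis <- c(0.481, 0.602, 0.835, 0.871, 0.875, 0.894, 0.907, 0.917)
    # Table 1: 177 AH targets out of 179 assumed-independent draws
    sapply(pis, function(p0) binom.test(177, 179, p = p0)$p.value)
    # Table 2: perfect homophily, leaving only 46 independent draws, 45 of them AH
    sapply(pis, function(p0) binom.test(45, 46, p = p0)$p.value)

The last entry in each set corresponds to row h of the respective table and should agree, to rounding, with the figures discussed in the text and in note 1.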

The expert declaration reports this result along with those from seven other hypothesis tests (one for each surrogate population). Five of the tests produce "significance," and three do not. Does this mixed picture establish that p < 0.05? It seems to me that the usual procedures for assessing multiple comparisons do not apply here because the data are identical for all eight tests. The value of π in the surrogate population is being varied, and π is not a random variable. The use of multiple populations is a way to examine the robustness of the results of the statistical test. For p = 177/179, which is the only figure the court mentioned, the finding of significance is stable. For p = 45/46, however, it becomes necessary to ask which surrogate populations best approximate the unknown population of potential targets for the reverse stings. Numerical analysis cannot answer that question. \2/ As Macbeth soliloquized, "we still have judgment here."

NOTES

  1. An exact binomial computation performed in R (using binom.test(177,179,0.917)) gives a two-sided p-value of 0.0000578, which is about 1/17,300. Table 1, row h of the expert declaration reports a value of 0.0001, or 1/10,000. These figures are smaller than the 1/1,100 in the text because binom.test computes the two-sided p-value by summing the probabilities of all outcomes no more probable than the one observed, whereas the computation in the text sums the probabilities of departures from the expected count that are at least as large in either direction.
  2. Dr. Yang intimates that discovery might shed light on how much of a correction is needed. Id. at 431 ("It is impossible to know the true degree of homophily or correlation among individuals targeted in reverse-sting operations, particularly when the DEA's selection criteria is unknown."); id. at 432 ("Again, because I have no knowledge of the DEA's selection criteria of potential targets, it is impossible to know which of the hypothesized populations captures the relevant pool of similarly situated individuals. A more definitive statistical analysis may be possible if the government provides the requested selection criteria.").
  3. As the declaration points out, the p-values are small enough to yield "significance" at even smaller levels, but it would not be acceptable to set up a test at the α = 0.05 level and then interpret it as demonstrating significance at a different level. Changing α after analyzing the data vitiates the interpretation of α as the probability of a false alarm, which can be written as Pr("significant difference" | Θ = π).

APPENDIX: DECLARATION OF PROFESSOR CRYSTAL S. YANG

Pursuant to 28 U.S.C. § 1746, I, CRYSTAL S. YANG, J.D., Ph.D., declare under penalty of perjury that the following is true and correct:

1. I am a Professor of Law at Harvard Law School and Faculty Research Fellow at the National Bureau of Economic Research. My primary research areas are in criminal law and criminal procedure, with a focus on testing for and quantifying racial disparities in the criminal justice system. Before joining the Harvard Law School faculty, I was an Olin Fellow and Instructor in Law at The University of Chicago Law School. I am admitted to the New York State Bar and previously worked as a Special Assistant United States Attorney in the U.S. Attorney's Office for the District of Massachusetts.

2. I received a B.A. in economics summa cum laude from Harvard University in 2008, an A.M. in statistics from Harvard University in 2008, a J.D. magna cum laude from Harvard Law School in 2013, and a Ph.D. in economics from Harvard University in 2013. My undergraduate and graduate training involved substantial coursework in quantitative methods.

3. I have published in peer-reviewed journals such as the American Economic Review, The Quarterly Journal of Economics, the American Economic Journal: Economic Policy, and have work represented in many other peer-reviewed journals and outlets.

4. I make this Declaration in support of a motion being submitted by the defendants to compel the government to provide discovery in this case.

Statistical Analyses Pertinent to Motion

5. I was retained by the Federal Defenders of New York to provide various statistical analyses relevant to the defendants' motion in this case. Specifically, I was asked to evaluate whether the observed racial composition of targets in reverse-sting operations in the Southern District of New York could be due to random chance.

6. To undertake this statistical analysis, I first had to obtain the racial composition of targeted individuals in DEA reverse-sting stash house cases brought in the Southern District of New York for the ten-year period beginning on August 5, 2009 and ending on August 5, 2019. Based on the materials in Lamar and Garcia-Pena, as well as additional searches conducted by the Federal Defenders of New York, I understand that there have been 46 fake stash house reverse-sting operations conducted by the DEA during this time period. These 46 operations targeted 179 individuals of whom zero are White, two are Asian, and 177 are Latino or Black – the “sample.” \1/ Given these counts, this means that of the targeted individuals, 98.9% are Latino or Black (and 100% are non-White). Thus, the relevant question at hand is whether the observed racial composition of the sample could be due to random chance alone if the DEA sampled from a population of similarly situated individuals.

7. Second, I had to define what the underlying population of similarly situated individuals is. In other words, what is the possible pool of all similarly situated individuals who could have been targeted by the DEA in a reverse-sting operation? Because the DEA's criteria for being a target in these reverse-sting cases is unknown, my statistical analysis will assume a variety of hypothetical benchmark populations. If the government provides its selection criteria for being a target in these reverse-sting cases, a more definitive statistical analysis may be possible. Based on materials from Garcia-Pena and Lamar, I have identified eight hypothetical benchmark populations. Below, I present the hypothesized populations and the racial composition (% Latino or Black) in each population in order of least conservative (i.e. smallest share of Latino or Black) to most conservative (i.e. highest share of Latino or Black):

a. 2016 American Community Survey 5-year estimates on counties in the SDNY (from Garcia-Pena): 48.1% Latino or Black
b. 2016 American Community Survey 5-year estimates on Bronx and New York Counties (from Garcia-Pena): 60.2% Latino or Black
c. New York Police Department (NYPD) data from January 1 – December 31, 2017, on felony drug arrests in New York City (from Garcia-Pena): 83.5% Latino or Black
d. Estimates by Prof. Kohler-Hausmann on men aged 16-49 living in New York City who have prior New York State (NYS) violent felony convictions (from Lamar): 87.1% Latino or Black
e. Estimates by Prof. Kohler-Hausmann on men aged 16-49 living in New York City who have prior NYS felony convictions (from Lamar): 87.5% Latino or Black
f. NYPD data from January 1 – December 31, 2017 on firearms seizures arrests in New York City (from Garcia-Pena): 89.4% Latino or Black
g. Reverse-sting operation defendants in the Northern District of Illinois (from Garcia-Pena): 87.7-90.7% Latino or Black \2/
h. NYPD data from January 1 – December 31, 2017 on robbery arrests in New York City (from Garcia-Pena): 91.7% Latino or Black

8. For each of these eight hypothesized populations, I then conduct an exact hypothesis test for binomial random variables. This is the standard statistical test used for calculating the exact probability of observing x “successes” out of n “draws” when the underlying probability of success is p and the underlying probability of failure is 1-p. Here, each defendant represents an independent draw and a success occurs when the defendant is Latino or Black. Using the exact hypothesis test, I test whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion p. Under this test, the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion. The alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test. \3/ Each exact hypothesis test produces a corresponding p-value, which is the probability of observing a proportion as extreme or more extreme than the observed proportion assuming that the null hypothesis is true. A small p-value implies that the observed proportion is highly unlikely under the null hypothesis, favoring the rejection of the null hypothesis.

9. The following Table 1 presents each of the eight hypothesized population proportions, the null hypothesis under each population, the alternative hypothesis under each population, and the corresponding p-value using the observed proportion of Latinos or Blacks in the sample assuming 179 independent draws:

Table 1 (x = 177, n = 179)
Hypothesized Population Proportion / Null Hypothesis (H0) / Alternative Hypothesis (Ha) / p-value
a. 48.1% Latino or Black H0: p = 0.481 Ha: p ≠ 0.481 0.0000
b. 60.2% Latino or Black H0: p = 0.602 Ha: p ≠ 0.602 0.0000
c. 83.5% Latino or Black H0: p = 0.835 Ha: p ≠ 0.835 0.0000
d. 87.1% Latino or Black H0: p = 0.871 Ha: p ≠ 0.871 0.0000
e. 87.5% Latino or Black H0: p = 0.875 Ha: p ≠ 0.875 0.0000
f. 89.4% Latino or Black H0: p = 0.894 Ha: p ≠ 0.894 0.0000
g. 90.7% Latino or Black H0: p = 0.907 Ha: p ≠ 0.907 0.0000
h. 91.7% Latino or Black H0: p = 0.917 Ha: p ≠ 0.917 0.0001

10. The above statistical calculations in Table 1 show that regardless of which of the eight hypothesized population proportions is chosen, one could reject the null hypothesis at conventional levels of statistical significance. For example, one could reject the null hypothesis at the standard 5% significance level which requires that the p-value be less than 0.05. All eight p-values are substantially smaller than 0.05 and would lead to a rejection of the null hypothesis even using more conservative 1%, 0.5%, or 0.1% significance levels. In other words, it is extremely unlikely that random sampling from any of the hypothetical populations could yield a sample of 179 targeted individuals where 177 or more individuals are Latino or Black.

11. Alternatively, one may be interested in the reverse question of what the underlying population proportion would have to be such that the observed proportion could be due to random chance alone assuming there are 179 independent draws. Using the standard 5% significance level, I have calculated that the hypothesized population would have to be composed of at least 96.0% Latinos or Blacks in order for one to not be able to reject the null hypothesis. In other words, unless the pool of similarly situated individuals is comprised of at least 96.0% Latinos or Blacks, it is highly unlikely that one could get a sample of 179 targeted individuals where 177 or more individuals are Latino or Black.

12. One potential question with the statistical analyses in Table 1 is whether the assumption that each of the 179 targeted individuals is an independent draw is reasonable. For example, what if the race/ethnicity of individuals in each reverse-sting operation is correlated, such that if one individual targeted in an operation is Latino or Black, the other individuals are also more likely to be Latino or Black? This correlation within operations could result if there is homophily, or “the principle that a contact between similar people occurs at a higher rate than among dissimilar people.” Miller McPherson et al., Birds of a Feather: Homophily in Social Networks, 27 Ann. Rev. Soc. 315, at 416 (2001). It is impossible to know the true degree of homophily or correlation among individuals targeted in reverse-sting operations, particularly when the DEA's selection criteria is unknown. But suppose that we took the most conservative approach and assumed that there is perfect homophily (i.e. a perfect correlation of 1) such that if one individual targeted in the operation is Latino or Black, all other individuals in that same operation are also Latino or Black. Under this conservative assumption, we can then treat the observed sample as if there were only 46 independent draws (rather than 179 draws). \4/ To observe a racial composition of 98.9% Latino or Black would thus require that at least 45 out of 46 draws resulted in a Latino or Black individual being targeted. \5/

13. For each of the eight hypothesized benchmark populations, I then test whether the observed proportion of Latinos or Blacks observed in this alternative sample (x = 45, n = 46) is equal to the hypothesized proportion from an underlying population assuming that there are only 46 independent draws. The following Table 2 presents each of the eight hypothesized population proportions, the null hypothesis under each population, the alternative hypothesis under each population, and the corresponding p-value using the observed proportion of Latinos or Blacks in the sample assuming 46 independent draws:

Table 2 (x = 45, n = 46)
Hypothesized Population Proportion / Null Hypothesis (H0) / Alternative Hypothesis (Ha) / p-value
a. 48.1% Latino or Black H0: p = 0.481 Ha: p ≠ 0.481 0.0000
b. 60.2% Latino or Black H0: p = 0.602 Ha: p ≠ 0.602 0.0000
c. 83.5% Latino or Black H0: p = 0.835 Ha: p ≠ 0.835 0.0045
d. 87.1% Latino or Black H0: p = 0.871 Ha: p ≠ 0.871 0.0255
e. 87.5% Latino or Black H0: p = 0.875 Ha: p ≠ 0.875 0.0257
f. 89.4% Latino or Black H0: p = 0.894 Ha: p ≠ 0.894 0.0871
g. 90.7% Latino or Black H0: p = 0.907 Ha: p ≠ 0.907 0.1240
h. 91.7% Latino or Black H0: p = 0.917 Ha: p ≠ 0.917 0.1795

14. Under this conservative assumption of perfect homophily, the above statistical calculations in Table 2 show that under the first five hypothesized population proportions (a-e), one could reject the null hypothesis at the standard 5% significance level. In other words, even if the hypothesized population proportion of Latinos or Blacks is as high as 87.5%, it is highly unlikely that random sampling could yield a sample of 46 individuals where 45 or more individuals are Latino or Black. One, however, cannot reject the null hypothesis for the next three hypothesized population proportions (f-h). Again, because I have no knowledge of the DEA's selection criteria of potential targets, it is impossible to know which of the hypothesized populations captures the relevant pool of similarly situated individuals. A more definitive statistical analysis may be possible if the government provides the requested selection criteria.

15. As before, I also ask the reverse question of what the underlying population proportion would have to be such that the observed proportion could be due to random chance alone assuming that there are only 46 independent draws. Using the standard 5% significance level, I have calculated that the hypothesized population would have to be composed of at least 88.5% Latinos or Blacks in order for one to not be able to reject the null hypothesis. In other words, unless the pool of similarly situated individuals is comprised of at least 88.5% Latinos or Blacks, it is highly unlikely that one could get a sample of 46 targeted individuals where 45 or more individuals are Latino or Black.

Dated: Cambridge, Massachusetts
September 13, 2019
/s/ Crystal S. Yang
Crystal S. Yang

Footnotes

   1. In consultation with the Federal Defenders of New York, this sample is obtained by taking the 33 cases and 144 defendants identified in Garcia-Pena or Lamar, excluding two cases and five defendants that are either not DEA cases or reverse-sting cases, and including an additional 15 cases and 40 defendants that were not covered by the time frames included in the Lamar or Garcia-Pena analysis.
   2. I choose 90.7% (the upper end of the range) as the relevant proportion given that it yields the most conservative estimates.
   3. This two-sided test takes the most conservative approach (in contrast to a one-sided test) because it allows for the possibility of both an over-representation and under-representation of Latinos or Blacks relative to the hypothesized population proportion.
   4. I make the simplifying assumption that each of the 46 operations targeted the average number of codefendants, 3.89 = 179/46.
   5. Technically, 45.494 draws would need to be of Latino or Black individuals but I conservatively round down to the nearest integer.

 UPDATED: 1/24/2021 3:13 ET

Friday, November 27, 2020

Mysteries of the Department of Justice's ULTR for Firearm-toolmark Pattern Examinations

The Department of Justice's "Uniform Language for Testimony and Reports (ULTR) for the Forensic Firearms/Toolmarks Discipline – Pattern Examination" offers a ready response to motions to limit overclaiming or, to use the pedantic term, ultracrepidarianism, in expert testimony. Citing the DoJ policy, several federal district courts have indicated that they expect the government's expert witnesses to follow this directive (or something like it). \1/

Parts of the current version (with changes from the original) are reproduced in Box 1. \2/ This posting poses three questions about this guidance. Although the ULTR is a step in the right direction, it has a ways to go in articulating a clear and optimal policy.

Box 1. The ULTR

DEPARTMENT OF JUSTICE
UNIFORM LANGUAGE FOR TESTIMONY AND REPORTS
FOR THE FORENSIC FIREARMS/TOOLMARKS DISCIPLINE –
PATTERN MATCH EXAMINATION
...
III. Conclusions Regarding Forensic Pattern Examination of Firearms/Toolmarks Evidence for a Pattern Match

The An examiner may offer provide any of the following conclusions:
1. Source identification (i.e., identified)
2. Source exclusion (i.e., excluded)
3. Inconclusive
Source identification
‘Source identification’ is an examiner’s conclusion that two toolmarks originated from the same source. This conclusion is an examiner’s decision opinion that all observed class characteristics are in agreement and the quality and quantity of corresponding individual characteristics is such that the examiner would not expect to find that same combination of individual characteristics repeated in another source and has found insufficient disagreement of individual characteristics to conclude they originated from different sources.

The basis for a ‘source identification’ conclusion is an examiner’s decision opinion that the observed class characteristics and corresponding individual characteristics provide extremely strong support for the proposition that the two toolmarks came originated from the same source and extremely weak support for the proposition that the two toolmarks came originated from different sources.

A ‘source identification’ is the statement of an examiner’s opinion (an inductive inference2) that the probability that the two toolmarks were made by different sources is so small that it is negligible. A ‘source identification’ is not based upon a statistically-derived or verified measurement or an actual comparison to all firearms or toolmarks in the world.

Source exclusion
‘Source exclusion’ is an examiner’s conclusion that two toolmarks did not originate from the same source.

The basis for a ‘source exclusion’ conclusion is an examiner’s decision opinion that the observed class and/or individual characteristics provide extremely strong support for the proposition that the two toolmarks came from different sources and extremely weak or no support for the proposition that the two toolmarks came from the same source two toolmarks can be differentiated by their class characteristics and/or individual characteristics.

Inconclusive
‘Inconclusive’ is an examiner’s conclusion that all observed class characteristics are in agreement but there is insufficient quality and/or quantity of corresponding individual characteristics such that the examiner is unable to identify or exclude the two toolmarks as having originated from the same source.

The basis for an ‘inconclusive’ conclusion is an examiner’s decision opinion that there is an insufficient quality and/or quantity of individual characteristics to identify or exclude. Reasons for an ‘inconclusive’ conclusion include the presence of microscopic similarity that is insufficient to form the conclusion of ‘source identification;’ a lack of any observed microscopic similarity; or microscopic dissimilarity that is insufficient to form the conclusion of ‘source exclusion.’

IV. Qualifications and Limitations of Forensic Firearms/Toolmarks Discipline Examinations
A conclusion provided during testimony or in a report is ultimately an examiner’s decision and is not based on a statistically-derived or verified measurement or comparison to all other firearms or toolmarks. Therefore, an An examiner shall not assert that two toolmarks originated from the same source to the exclusion of all other sources. This may wrongly imply that a ‘source identification’ conclusion is based upon a statistically-derived or verified measurement or an actual comparison to all other toolmarks in the world, rather than an examiner’s expert opinion.
○ assert that a ‘source identification’ or a ‘source exclusion’ conclusion is based on the ‘uniqueness’3 of an item of evidence.

○ use the terms ‘individualize’ or ‘individualization’ when describing a source conclusion.

○ assert that two toolmarks originated from the same source to the exclusion of all other sources.
• An examiner shall not assert that examinations conducted in the forensic firearms/toolmarks discipline are infallible or have a zero error rate.

• An examiner shall not provide a conclusion that includes a statistic or numerical degree of probability except when based on relevant and appropriate data.

• An examiner shall not cite the number of examinations conducted in the forensic firearms/toolmarks discipline performed in his or her career as a direct measure for the accuracy of a proffered conclusion provided. An examiner may cite the number of examinations conducted in the forensic firearms/toolmarks discipline performed in his or her career for the purpose of establishing, defending, or describing his or her qualifications or experience.

• An examiner shall not assert that two toolmarks originated from the same source with absolute or 100% certainty, or use the expressions ‘reasonable degree of scientific certainty,’ ‘reasonable scientific certainty,’ or similar assertions of reasonable certainty in either reports or testimony unless required to do so by a judge or applicable law.34


2 Inductive reasoning (inferential reasoning):
A mode or process of thinking that is part of the scientific method and complements deductive reasoning and logic. Inductive reasoning starts with a large body of evidence or data obtained by experiment or observation and extrapolates it to new situations. By the process of induction or inference, predictions about new situations are inferred or induced from the existing body of knowledge. In other words, an inference is a generalization, but one that is made in a logical and scientifically defensible manner. Oxford Dictionary of Forensic Science 130 (Oxford Univ. Press 2012).
3 As used in this document, the term ‘uniqueness’ means having the quality of being the only one of its kind.’ Oxford English Dictionary 804 (Oxford Univ. Press 2012).
34 See Memorandum from the Attorney General to Heads of Department Components (Sept. 9. 2016), https://www.justice.gov/opa/file/891366/download.

1

Are the two or three conclusions -- identification, exclusion, and inconclusive -- the only ways in which examiners are allowed to report their results?

In much of the world, examiners are discouraged from reporting only two categorical conclusions--included vs. excluded (with the additional option of declaring the data too limited to permit such a classification)--and are urged instead to articulate how strongly the data support one classification over the other. Instead of pigeonholing, they might say, for example, that the data strongly support the same-source classification (because those data are far more probable for ammunition fired from the same gun than for ammunition discharged from different guns).

The ULTR studiously avoids mentioning this mode of reporting. It states that examiners "may provide any of the following ... ." It does not state whether they also may choose not to -- and instead report only the degree of support for the same-source (or the different-source) hypothesis. Does the maxim of expressio unius est exclusio alterius apply? Department of Justice personnel are well aware of this widely favored alternative. They have attended meetings of statisticians at which straw polls overwhelmingly endorsed it over the Department's permitted conclusions. Yet, the ULTR does not list statements of support (essentially, likelihood ratios) as permissible. But neither are they found in the list of thou-shalt-nots in Part IV. \3/ Is the idea that if the examiners have a conclusion to offer, they must state it as one of the two or three categorical ones -- and that they may give a qualitative likelihood ratio if they want to?

2

Is the stated logic of a "source identification" internally coherent and intellectually defensible?

The ULTR explains that "[t]he basis for a 'source identification' is

an examiner’s opinion that the observed class characteristics and corresponding individual characteristics provide extremely strong support for the proposition that the two toolmarks originated from the same source and extremely weak support for the proposition that the two toolmarks originated from different sources.

Translated into likelihood language, the DoJ's "basis for a source identification" is the belief that the likelihood ratio is very large -- the numerator of L is close to one, and the denominator is close to zero (see Box 2).

On this understanding, a "source identification" is a statement about the strength of the evidence rather than a conclusion (in the sense of a decision about the source hypothesis). However, the next paragraph of the ULTR states that "[a] ‘source identification’ is the statement of an examiner’s opinion (an inductive inference2) that the probability that the two toolmarks were made by different sources is so small that it is negligible."

Box 2. A Technical Definition of Support
The questioned toolmarks and the known ones have some degree of observed similarity X with respect to relevant characteristics. Let Lik(S) be the examiner's judgment of the likelihood of the same-source hypothesis S. This likelihood is proportional to Prob(X | S), the probability of the observed degree of similarity X given the hypothesis (S). For simplicity, we may as well let the constant be 1. Let Lik(D) be the examiner's judgment of the likelihood of the different-source hypothesis (D). This likelihood is Prob(X | D). The support for S is the logarithm of the likelihood ratio L = Lik(S) / Lik(D) = Prob(X | S) / Prob(X | D). \4/
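
A toy numerical illustration (the probabilities are invented solely for this example) may help fix ideas:

    # Hypothetical likelihood judgments for an observed degree of similarity X
    prob_X_given_S <- 0.9        # examiner's judgment of Prob(X | same source)
    prob_X_given_D <- 0.0001     # examiner's judgment of Prob(X | different sources)
    L <- prob_X_given_S / prob_X_given_D   # likelihood ratio = 9,000
    log10(L)                               # support for S (a base-10 logarithm is used here)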

In this way, the ULTR jumps from a likelihood to a posterior probability. To assert that "the probability that the two toolmarks were made by different sources ... is negligible" is to say that Prob(D|X) is close to 0, and hence that Prob(S|X) is nearly 1. However, the likelihood ratio L = Lik(S) / Lik(D) is only one factor that affects Prob(D|X). Bayes' theorem establishes that

Odds(D|X) = Odds(D) / L.

Consequently, a very large L (great support for S) shrinks the odds on D (equivalently, it raises the odds in favor of S), but whether we end up with a "negligible" probability for D depends on the odds on D without considering the strength of the toolmark evidence. Because the expertise of toolmark analysts only extends to evaluating the toolmark evidence, it seems that the ULTR is asking them to step outside their legitimate sphere of expertise by assessing, either explicitly or implicitly, the strength of the particular non-scientific evidence in the case.

There is a way to circumvent this objection. To defend a "source identification" as a judgment that Prob(D|X) is negligible, the examiner could contend that the likelihood ratio L is not just very large, as the ULTR's first definition required, but that it is so large that it swamps every probability that a judge or juror reasonably might entertain in any possible case before learning about the toolmarks. A nearly infinite L would permit an analyst to dismiss the posterior odds on D as negligible without attempting to estimate the odds on the basis of other evidence in the particular case (see Box 3).

Box 3. How large must L be to swamp all plausible prior odds?

Suppose that the smallest prior same-source probability in any conceivable case were p = 1/1,000,000. The prior odds on the different-source hypothesis would be approximately 1/p = 1,000,000. According to Bayes' rule, the posterior odds on D then would be about (1/p)/L = 1,000,000/L.

How large would the support L for S have to be to make D a "negligible" possibility? If "negligible" means a probability below, say, 1/100,000, then L would have to be large enough that 1,000,000 / L < 1/100,000 (approximately); that is, L > 10^11. Are examiners able to tell reliably whether the toolmarks are such that L > 100 billion?

One can use different numbers, of course, but whether the swamping defense of the ULTR really works to justify actual testimony as to "source identification" defined according to the ULTR is none too clear.
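
For readers who do want to try different numbers, the arithmetic of Box 3 reduces to a one-line function (a sketch; the arguments shown are the illustrative values from the box):

    # Likelihood ratio needed to "swamp" the prior: the posterior odds on D, roughly prior_odds_D / L,
    # must fall below the chosen "negligible" probability.
    required_L <- function(prior_odds_D, negligible_prob) prior_odds_D / negligible_prob
    required_L(prior_odds_D = 1e6, negligible_prob = 1e-5)   # 1e11, i.e., L must exceed 100 billion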

The ULTR seems slightly embarrassed by the characterization of a "source identification" as an "opinion" on the small size of a probability. Parenthetically, it calls the opinion "an inductive inference," which sounds more impressive. But the footnote that is supposed to explain the more elegant phrase only muddies the waters. It reads as follows:

Inductive reasoning (inferential reasoning): A mode or process of thinking that is part of the scientific method and complements deductive reasoning and logic. Inductive reasoning starts with a large body of evidence or data obtained by experiment or observation and extrapolates it to new situations. By the process of induction or inference, predictions about new situations are inferred or induced from the existing body of knowledge. In other words, an inference is a generalization, but one that is made in a logical and scientifically defensible manner. Oxford Dictionary of Forensic Science 130 (Oxford Univ. Press 2012) [sic]. \5/

The flaws in this definition are many. First, "inferential reasoning" is not equivalent to "inductive reasoning." Inference is reaching a conclusion from stated premises. The argument from the premises to the conclusion can be deductive or inductive. Deductive arguments are valid when the conclusion is true given that the premises are true. Inductive arguments are sound when the conclusion is sufficiently probable given that the premises are true. In other words, deduction produces logical certainty, whereas induction can yield no more than probable truth. Second, inductive reasoning can be based on a small body of evidence as well as on a large one. Third, an induction -- that is, the conclusion of an inductive argument -- need not be particularly scientific or "scientifically defensible." Fourth, an inductive conclusion is not necessarily "a generalization." An inductive argument, no less than a deductive one, can go from the general to the specific -- as is the case for an inference that two toolmarks were made by the same source. Presenting an experience-based opinion as the product of "the scientific method" by the fiat of a flawed definition of "inductive reasoning" is puffery.

3

If the examiner has correctly discerned matching "individual characteristics" (as the ULTR calls them), why cannot the examiner "assert that a ‘source identification’ ... is based on ... ‘uniqueness’" or that there has been an "individualization"?

The ULTR states that a "source identification" is based on an examination of "class characteristics" and "individual characteristics." Presumably, "individual characteristics" are ones that differ in every source and thus permit "individualization." The dictionary on which the ULTR relies defines "individualization" as "assigning a unique source for a given piece of physical evidence" (which it distinguishes from "identification"). But the ULTR enjoins an examiner from using "the terms ‘individualize’ or ‘individualization’ when describing a source conclusion," from asserting "that a ‘source identification’ or a ‘source exclusion’ conclusion is based on the ‘uniqueness’ of an item of evidence," and from stating "that two toolmarks originated from the same source to the exclusion of all other sources."

The stated reason to avoid these terms is that a source attribution "is not based on a statistically-derived or verified measurement or comparison to all other firearms or toolmarks." But who would think that an examiner who "assert[s] that two '[t]oolmarks originated from the same source to the exclusion of all other sources'" is announcing "an actual comparison to all other toolmarks in the world"? The examiner apparently is allowed to report a plethora of matching "individual characteristics" and to opine (or "inductively infer") that there is virtually no chance that the marks came from a different source. Allowing such testimony cuts the heart out of the rules against asserting "uniqueness" and claiming "individualization."

NOTES

  1. E.g., United States v. Hunt, 464 F.Supp.3d 1252 (W.D. Okla. 2020) (discussed on this blog Aug. 10, 2020).
  2. The original version was adopted on 7/24/2018. It was revised on 6/8/2020.
  3. Are numerical versions of subjective likelihood ratios prohibited by the injunction in Part IV that "[a]n examiner shall not provide a conclusion that includes a statistic or numerical degree of probability except when based on relevant and appropriate data"? Technically, a likelihood ratio is not a "degree of probability" or (arguably) a statistic, but it seems doubtful that the drafters of the ULTR chose their terminology with the niceties of statistical terminology in mind.
  4. A.W.F. Edwards, Likelihood 31 (rev. ed. 1992) (citing H. Jeffreys, Further Significance Tests, 32 Proc. Cambridge Phil. Soc'y 416 (1936)).
  5. The correct name of the dictionary is A Dictionary of Forensic Science, and its author is Suzanne Bell. The quotation in the ULTR omits the following part of the definition of "inductive inference": "A forensic example is fingerprints. Every person's fingerprints are unique, but this is an inference based on existing knowledge since the only way to prove it would be to take and study the fingerprints of every human being ever born."

Tuesday, November 24, 2020

Wikimedia v. NSA: It's Classified!

The National Security Agency (NSA) engages in systematic, warrantless "upstream" surveillance of Internet communications that travel in and out of the United States along a "backbone" of fiber optic cables. The ACLU and other organizations maintain that Upstream surveillance is manifestly unconstitutional. Whether or not that is correct, the government has stymied one Fourth Amendment challenge after another on the ground that plaintiffs lack standing because they cannot prove that the surveillance entails intercepting, copying, and reviewing any of their communications. Of course, the reason plaintiffs have no direct evidence is that the government won't admit or deny it. Instead, the government has asserted that the surveillance program is a privileged state secret, classified its details, and resisted even in camera hearings in ordinary courts.

In Wikimedia Foundation v. National Security Agency, 857 F.3d 193 (4th Cir. 2017), however, the Court of Appeals for the Fourth Circuit held that the Wikimedia Foundation, which operates Wikipedia, made "allegations sufficient to survive a facial challenge to standing." Id. at 193. The court concluded that Wikimedia's allegations were plausible enough to defeat a motion to dismiss the complaint because

Wikimedia alleges three key facts that are entitled to the presumption of truth. First, “[g]iven the relatively small number of international chokepoints,” the volume of Wikimedia's communications, and the geographical diversity of the people with whom it communicates, Wikimedia's “communications almost certainly traverse every international backbone link connecting the United States with the rest of the world.”

Second, “in order for the NSA to reliably obtain communications to, from, or about its targets in the way it has described, the government,” for technical reasons that Wikimedia goes into at length, “must be copying and reviewing all the international text-based communications that travel across a given link” upon which it has installed surveillance equipment. Because details about the collection process remain classified, Wikimedia can't precisely describe the technical means that the NSA employs. Instead, it spells out the technical rules of how the Internet works and concludes that, given that the NSA is conducting Upstream surveillance on a backbone link, the rules require that the NSA do so in a certain way. ...

Third, per the PCLOB [Privacy and Civil Liberties Oversight Board] Report and a purported NSA slide, “the NSA has confirmed that it conducts Upstream surveillance at more than one point along the [I]nternet backbone.” Together, these allegations are sufficient to make plausible the conclusion that the NSA is intercepting, copying, and reviewing at least some of Wikimedia's communications. To put it simply, Wikimedia has plausibly alleged that its communications travel all of the roads that a communication can take, and that the NSA seizes all of the communications along at least one of those roads. 

Id. at 210-11 (citations omitted).

The Fourth Circuit therefore vacated the order dismissing Wikimedia's complaint that had been issued by Senior District Judge Thomas Selby Ellis III, the self-described "impatient" jurist who later achieved notoriety, and collected ethics complaints (rejected last year), for his management of the trial of former Trump campaign manager Paul Manafort.

On remand, the government moved for summary judgment. Wikimedia Foundation v. National Security Agency/Central Security Service, 427 F.Supp.3d 582 (D. Md. 2019). Once more, the government argued that Wikimedia lacked standing to complain that the Upstream surveillance violated its Fourth Amendment rights. It suggested that the "plausible" inference that the NSA must be "intercepting, copying, and reviewing at least some of Wikimedia's communications” recognized by the Fourth Circuit was not so plausible after all. To support this conclusion, it submitted a declaration of Henning Schulzrinne, a Professor of Computer Science and Electrical Engineering at Columbia University. Dr. Schulzrinne described how companies carrying Internet traffic might filter transmissions before copying them by “mirroring” with “routers” or “switches” that could perform “blacklisting” or “whitelisting” if the NSA chose to give the companies information on its targets with which to create “access control lists.”
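
To visualize the kind of filtering Dr. Schulzrinne described, the following sketch shows, in highly simplified Python form, how an access control list might be used to whitelist or blacklist mirrored traffic. The selectors, addresses, and function names are hypothetical; nothing here purports to describe the NSA's or any carrier's actual equipment or configuration.

    # Highly simplified, hypothetical illustration of ACL-style filtering of
    # mirrored traffic. Real routers and switches do this in hardware; the
    # addresses and selectors below are invented.
    TARGET_SELECTORS = {"203.0.113.7", "198.51.100.42"}   # a hypothetical "access control list"

    def whitelist_filter(packets):
        # Copy only packets to or from listed selectors ("whitelisting").
        return [p for p in packets if p["src"] in TARGET_SELECTORS or p["dst"] in TARGET_SELECTORS]

    def blacklist_filter(packets):
        # Copy everything except packets to or from listed selectors ("blacklisting").
        return [p for p in packets if p["src"] not in TARGET_SELECTORS and p["dst"] not in TARGET_SELECTORS]

    mirrored = [{"src": "192.0.2.10", "dst": "203.0.113.7"},
                {"src": "192.0.2.11", "dst": "192.0.2.12"}]
    print(whitelist_filter(mirrored))   # only the first packet
    print(blacklist_filter(mirrored))   # only the second packet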

But Dr. Schulzrinne supplied no information and formed no opinion on whether it was at all likely that the NSA used the mirroring methods that he envisioned. Wikimedia, for its part, produced a series of expert reports from Scott Bradner, who had served as Harvard University’s Technology Security Officer and taught at that university. Bradner contended that the NSA could hardly be expected to give away the information on its targets and concluded that it was all but certain that the agency had intercepted and opened at least one of Wikimedia's trillions of Internet communications.
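
Bradner's reports are not reproduced here, but the arithmetic that can make an "all but certain" conclusion of this kind plausible is easy to illustrate. In the hypothetical sketch below, even if each communication independently had only a one-in-a-million chance of traversing a monitored link and being copied, the chance that none of a trillion communications was intercepted would be vanishingly small. The numbers are invented and are not drawn from the expert reports.

    # Illustrative arithmetic only -- not Bradner's actual computation. If each of
    # N independent communications has even a tiny probability p of being copied,
    # the probability that at least one is copied approaches certainty as N grows.
    import math

    p = 1e-6                               # invented per-communication probability
    N = 1_000_000_000_000                  # "trillions" of communications
    p_none = math.exp(N * math.log1p(-p))  # Pr(no communication intercepted)
    print(1 - p_none)                      # effectively 1.0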

The district court refused to conduct an evidentiary hearing on the factual issue. Instead, it disregarded the expert's opinion as inadmissible scientific evidence under Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993), because no one without access to classified information could "know what the NSA prioritizes in the Upstream surveillance program ... and therefore Mr. Bradner has no knowledge or information about it." Wikimedia, 427 F. Supp. 3d at 604–05 (footnotes omitted).

This reasoning resembles that from Judge Ellis's first opinion in this long-running case. In Wikimedia Found. v. NSA, 143 F. Supp. 3d 344, 356 (D. Md. 2015), the judge characterized Wikimedia’s allegations as mere “suppositions and speculation, with no basis in fact, about how the NSA” operates and maintained that it was impossible for Wikimedia to prove its allegations “because the scope and scale of Upstream surveillance remain classified . . . .” Id. Rather than allow full consideration of the strength of the evidence that makes Wikimedia’s claim plausible, the district court restated its position that “Mr. Bradner has no [direct] knowledge or information” because that information is classified. Wikimedia, 427 F. Supp. 3d at 604–605.

In a pending appeal to the Fourth Circuit, Edward Imwinkelried, Michael Risinger, Rebecca Wexler, and I prepared a brief as amici curiae in support of Wikimedia. The brief expresses surprise at “the district court’s highly abbreviated analysis of Rule 702 and Daubert, as well as the court’s consequent decision to rule inadmissible opinions of the type that Wikimedia’s expert offered in this case.” It describes the applicable standard for excluding expert testimony. It then argues that the expert’s method of reasoning was sound and that its factual bases regarding the nature of Internet communications and surveillance technology, together with public information on the goals and needs of the NSA program, were sufficient to justify the receipt of the proposed testimony.

Monday, September 28, 2020

Terminology Department: Significance

Inns of Court College of Advocacy, Guidance on the Preparation, Admission and Examination of Expert Evidence § 5.2 (3d ed. 2020)
Statisticians, for example, use what appear to be everyday words in specific technical senses. 'Significance' is an example. In everyday language it carries associations of importance, something with considerable meaning. In statistics it is a measure of the likelihood that a relationship between two or more variables is caused by something other than random chance.
Welcome to the ICCA

The Inns of Court College of Advocacy ... is the educational arm of the Council of the Inns of Court. The ICCA strives for ‘Academic and Professional Excellence for the Bar’. Led by the Dean, the ICCA has a team of highly experienced legal academics, educators and instructional designers. It also draws on the expertise of the profession across the Inns, Circuits, Specialist Bar Associations and the Judiciary to design and deliver bespoke training for student barristers and practitioners at all levels of seniority, both nationally, pan-profession and on an international scale.

How good is the barristers' definition of statistical significance? In statistics, an apparent association between variables is said to be significant when it lies outside the range that one would expect to see in some large fraction of repeated, identically conducted studies in which the variables are in fact uncorrelated. Sir Ronald Fisher articulated the idea as follows:

[I]t is convenient to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.’ This level ... we may call the 5 per cent. point .... If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach that level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. [1]

For Fisher, a "significant" result would occur by sheer coincidence no "more than once in twenty trials" (on average). 

Is such statistical significance the same as the barristers' "likelihood" that an observed "relationship ... is caused by something other than random chance"? One might object to the appearance of the term "likelihood" in the definition because it too is a technical term with a specialized meaning in statistics, but that is not the main problem. The vernacular likelihood that X is the cause of extreme data (where X is anything other than random chance) is not a "level of significance" such as 5%, 2%, or 1%. These levels are conditional error probabilities: If the variables are uncorrelated and we use a given level to call the observed results significant, then, in the (very) long run, we will label coincidental results as significant no more often than that level specifies. For example, if we always use a 0.01 level, we will call coincidences "significant" no more than 1% of the time (in the limit).
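
A small simulation makes this long-run interpretation concrete. The sketch below (a minimal Python illustration, not part of the ICCA guidance) repeatedly samples two uncorrelated variables and tests the observed correlation at the 0.01 level; in the long run, only about 1% of the purely coincidental associations are labeled "significant."

    # Minimal simulation: with truly uncorrelated variables, a 0.01 significance
    # level flags roughly 1% of studies as "significant" in the long run.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=1)
    n_studies, n_per_study, alpha = 100_000, 50, 0.01

    false_alarms = 0
    for _ in range(n_studies):
        x = rng.normal(size=n_per_study)
        y = rng.normal(size=n_per_study)   # independent of x by construction
        _, p_value = stats.pearsonr(x, y)
        if p_value < alpha:
            false_alarms += 1

    print(false_alarms / n_studies)        # close to 0.01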

The probability (the vernacular likelihood) "that a relationship between two or more variables is caused by something other than random chance" is quite different. [2, p.53] Everything else being equal, significant results are more likely to signal a true relationship than are nonsignificant results, but the significance level itself refers to the probability of data that are uncommon when there is no true relationship, not to the probability that the apparent relationship is real. In symbols, Pr(relationship | extreme data) is not Pr(extreme data | relationship). Naively swapping the terms in the expressions for the conditional probabilities is known as the transposition fallacy. In criminal cases involving statistical evidence, it often is called the "prosecutor's fallacy." Perhaps "barristers' fallacy" can be added to the list.
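
The gap between the two conditional probabilities is easy to demonstrate by simulation. In the hypothetical Python sketch below, only 10% of the studied relationships are real; the significance level is 5%, yet Pr(relationship | significant) comes out well below 95%. The base rate and effect size are invented for illustration.

    # Hypothetical illustration of the transposition fallacy:
    # Pr(significant | no relationship) = 0.05 is not Pr(relationship | significant).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=2)
    n_studies, n_per_study, alpha = 50_000, 50, 0.05
    base_rate = 0.10       # invented: only 10% of studied relationships are real
    effect = 0.5           # invented slope when a relationship really exists

    real = np.zeros(n_studies, dtype=bool)
    significant = np.zeros(n_studies, dtype=bool)
    for i in range(n_studies):
        real[i] = rng.random() < base_rate
        x = rng.normal(size=n_per_study)
        y = effect * x * real[i] + rng.normal(size=n_per_study)
        _, p_value = stats.pearsonr(x, y)
        significant[i] = p_value < alpha

    print(significant[~real].mean())   # ~0.05: the significance level
    print(real[significant].mean())    # Pr(relationship | significant): well below 0.95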

REFERENCES

  1. Ronald Fisher, The Arrangement of Field Experiments, 33 J. Ministry Agric. Gr. Brit. 503, 504 (1926).
  2. David H. Kaye, Frequentist Methods for Statistical Inference, in Handbook of Forensic Statistics 39-72 (David Banks et al. eds. 2021).

ACKNOWLEDGMENT: Thanks to Geoff Morrison for alerting me to the ICCA definition.