Wednesday, January 20, 2021

P-values versus Statistical Significance in the Selective Enforcement Case of United States v. Lopez

For several reasons, United States v. Lopez, 415 F.Supp.3d 422 (S.D.N.Y. 2019), is another stimulating opinion from U.S. District Court Judge Jed S. Rakoff. It sets forth a new rule (or a new refinement of a rule) for handling discovery requests when a criminal defendant argues that he or she is the victim of unconstitutional selective investigation. In addition, the opinion reproduces the declaration of the defendants' statistical expert that the judge found "compelling." Too often, the work of statistical consultants is not readily available for public inspection.

In Lopez, Judge Rakoff granted limited discovery into a claim of selective enforcement stemming from one type of DEA investigation -- "reverse stings." In these cases, law enforcement agents use informants to identify individuals who might want to steal drugs:

An undercover agent or informant then poses as a drug courier and offers the target an opportunity to steal drugs that do not actually exist. Targets in turn help plan and recruit other individuals to participate in a robbery of the fictitious drugs. Just before the targets are about to carry out their plan, they are arrested for conspiracy to commit the robbery and associated crimes.

Id. at 424. Defendant Johansi Lopez and other targets "who are all men of color, allege that ... the DEA limits such operations in the Southern District of New York to persons of color ... in violation of the Fifth Amendment's Equal Protection Clause." Id. at 425.

Seeking to sharpen the usual "broad discretion" approach to discovery, the court adopted the following standard for ordering discovery from the investigating agency:

where a defendant who is a member of a protected group can show that that group has been singled out for reverse sting operations to a statistically significant extent in comparison with other groups, this is sufficient to warrant further inquiry and discovery.

Id. at 427. The court was persuaded that the "of color" group was singled out on the basis of a "combination of raw data and statistical analysis." Id. This conclusion is entirely reasonable, but was the "singled out ... to a statistically significant extent" rule meant to make classical hypothesis testing at a fixed significance level the way to decide whether to grant discovery, or would an inquiry into p-values as measures of the strength of the statistical evidence against the hypothesis of "not singled out" be more appropriate? The discussion that follows suggests that little is gained by invoking the language and apparatus of hypothesis testing.

I. The "Singled Out" Standard

The "singled out" standard was offered as a departure from the one adopted by the Supreme Court in United States v. Armstrong, 517 U.S. 456 (1996). In Armstrong, Chief Justice Rehnquist wrote for the Court that to obtain discovery for a selective prosecution defense, "the required threshold [is] a credible showing of different treatment of similarly situated persons." Id. at 470 (emphasis added). The Court deemed insufficient a "study [that] failed to identify individuals who were not black and could have been prosecuted for the offenses for which respondents were charged, but were not so prosecuted." Id. (emphasis added).

In contrast, Judge Rakoff focused on "the racially disparate impact of the DEA's reverse sting operations," Lopez, 415 F.Supp.3d at 427 (emphasis added), and looked to other groups in the general population in ascertaining the magnitude and significance of the disparity. He maintained that "as now recognized by at least three federal circuits, selective enforcement claims should be open to discovery on a lesser showing than the very strict one required by Armstrong." Id. at 425. In finding that this "lesser showing" had been made, he also considered three more restricted populations (of arrested individuals) that might better approximate groups in which all people are similarly plausible targets for sting operations.

Applying the "singled out ... to a statistically significant extent " standard requires not just the selection of an appropriate population in which to make comparisons among protected and unprotected classes but also a judicial determination of statistical significance. The opinion was not clear about the significance level required, but the court was impressed with an expert declaration that, unsurprisingly, applied the 0.05 level commonly used in many academic fields and most litigation involving statistical evidence.

Unfortunately, transforming a statistical convention into a rule of law does not necessarily achieve the ease of administration and degree of guidance that one would hope for. A p < 0.05 rule still leaves open considerable room for discretion in deciding whether p < 0.05, invites the ubiquitous search for significance, and requires an understanding of classical hypothesis testing that most judges and many experts lack if it is to be applied sensitively. Lopez itself shows some of the difficulties.

II. "Raw Data" Does Not Speak for Itself

The "raw data" (as presented in the body of the opinion) was that "not a single one of the 179 individuals targeted in DEA reverse sting operations in SDNY in the past ten years was white, and that all but two were African-American or Hispanic," whereas

  • "New York and Bronx Counties ... are 20.5% African-American, 39.7% Hispanic, and 29.5% White"; and
  • The breakdown of "NYPD arrests" is
    • felony drug: "42.7% African-American, 40.8% Hispanic, and 12.7% White";
    • firearms: "65.1% African-American, 24.3% Hispanic, 9.7% White"; and
    • robbery: "60.6% African-American, 31.1% Hispanic, 5.1% White."

Id. In other words, the "raw data" are summarized by the statistics p = 177/179 = 98.88% African-American and Hispanic (AH) targets among all the DEA-targets and π = 60.2%, 83.5%, 89.4%, or 91.7%, where π is the proportion of AH targets in the four surrogate populations.

Presumably, Judge Rakoff uses "statistically significant" to denote any difference p − π that would arise less than one time out of 20 when repeatedly picking 179 targets without regard to race (or a variable correlated with race) in the (unknown) population from which the DEA picked its "reverse sting" targets in the Southern District. (The 0.05 significance level is conventionally used in academic writing in many fields, and, as discussed below, it was the level used by the defendants' expert in the case; however, in recent years, it has been said to be too weak to produce scientific findings that are likely to be reproducible.)

Notice that all the variability in the difference between p and π arises from p. The population proportion π is fixed (for each surrogate population and, of course, in the unknown population from which targets are drawn). It also is worth asking why the data and population are limited to the Southern District. Does the DEA have a different method for picking sting targets in, say, the Eastern District? If not, and if the issue is discriminatory intent, might the "significant" difference in the Southern District be an outlier from a process that does not look to the race of the individuals who are targeted?

In itself, "raw data" tell us nothing about statistical significance. "Significance" is just a word that characterizes a set of values for a statistic. It takes some sort of analysis to determine the zone in which "significance" exists. The evaluation may be intuitive and impressionistic, or it may be formal and quantitative. Intuitively, it seems that p = 177/179 is far enough from what would be expected from a race-independent system that it should be called "significant." But how do we know this intuition is correct? This is where probability theory comes into play.

III. One Simple Statistical Analysis

Even if the probability of targeting an African-American or a Hispanic in each investigation were the largest of the surrogate population proportions (0.917), the expected number of AH targets is still only 164, and the probability of picking 177 or more (an excess of 13 or more) is 0.0000271. The probability of picking 151 or fewer AH targets (a deficit of 13 or more) is 0.000881. Hence, the probability of the observed departure from the expected value (or an even greater departure) is 0.000908, or about 1/1,100. \1/
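These tail probabilities can be reproduced with R's built-in binomial functions. A minimal sketch (values in the comments are rounded):

    n  <- 179; p0 <- 0.917                            # number of targets; largest surrogate proportion
    n * p0                                            # expected AH count: about 164
    upper <- pbinom(176, n, p0, lower.tail = FALSE)   # P(X >= 177): about 0.000027
    lower <- pbinom(151, n, p0)                       # P(X <= 151): about 0.00088
    upper + lower                                     # about 0.00091, or roughly 1/1,100
    binom.test(177, n, p0)$p.value                    # about 0.000058; a different two-sided convention (see note 1)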

This two-sided p-value of 1/1,100 is less than 1/20, so we can reject the idea that the DEA selected targets without regard to AH status or some variable that is correlated with that status. That the p-value is smaller than the 1/20 cutoff makes the excess count statistically significant evidence against the hypothesis of a fixed selection probability of 0.917 in each case. The data suggest that the selection probability is at least somewhat larger than that.

Or do they? The binomial probability model that gives rise to these numbers could itself be wrong. What if there are fewer stings than targets? If an informant proposes multiple targets for one sting (think several members of the same gang, for example), won't knowing that the first one is AH increase the probability that the second one is AH?

IV. An "Exact" Analysis

The court did not discuss the formal analysis of statistical significance in the case. However, it relied on such an analysis. After describing the "raw data," Judge Rakoff added:

Furthermore, defendants have provided compelling expert analysis demonstrating that these numbers are statistically significant. According to a rigorous analysis conducted by Dr. Crystal S. Yang, a Harvard law and economics professor, it is highly unlikely, to the point of statistical significance, that the racially disparate impact of the DEA's reverse sting operations is simply random.

Id. at 427. The phrase "highly unlikely, to the point of statistical significance" sounds simple, but the significance level of α = 0.05 is not all that stringent, and neither 0.05 nor p-values for the observed proportion p represent the probability that targets of the sting operations are "simply random" as far as racial status goes. Instead, α is the probability of an observed disparity that triggers the conclusion "not simply random" assuming that the model with the parameter π is correct. If we use the court's rule for rejecting random variation as an explanation for the disparity between p and π, then the chance of ordering discovery when the disparity is nothing more than statistical noise is just one in 20. But the probability that the disparity is merely noise when that disparity is large enough to reject that explanation need not be one in 20. If I flip a fair coin five times and observe five heads, I can reject the hypothesis that the coin is fair (i.e., the string of heads is merely noise) at the α = 0.05 level. The probability of the extreme statistic p = 5/5 is (1/2)⁵ = 1/32 < 1/20 if the coin is fair. But if I tell you that I blindly picked the coin from a box of 1,000 coins in which 999 were fair and one was biased, would you think that 1/20 is the probability that the coin I picked was biased? The probability for that hypothesis would be much closer to 1/1,000. (How much closer is left as an exercise to the reader who wishes to consult Bayes' theorem.)
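For readers who want to work the coin example through Bayes' theorem, here is a minimal sketch in R, assuming (purely for illustration, since nothing turns on the exact figure) that the one biased coin lands heads 60% of the time:

    prior_biased <- 1/1000              # one biased coin in a box of 1,000
    p_heads_if_biased <- 0.6            # assumed bias (hypothetical)
    lik_fair   <- (1/2)^5               # probability of five heads with a fair coin (1/32)
    lik_biased <- p_heads_if_biased^5   # probability of five heads with the biased coin
    prior_biased * lik_biased /
      (prior_biased * lik_biased + (1 - prior_biased) * lik_fair)   # posterior: about 0.0025

Whatever bias is assumed, the posterior probability that the coin is biased never reaches 1/20; even a two-headed coin would yield a posterior of only about 1/32 with this prior.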

Many opinions gloss over this distinction, transposing the arguments in the conditional probability that is the significance level to report things such as "[s]ocial scientists consider a finding of two standard deviations significant, meaning there is about 1 chance in 20 that the explanation for a deviation could be random." Waisome v. Port Authority, 948 F.2d 1370, 1376 (2d Cir. 1991). The significance level of 1/20 relates to the probability of data given the hypothesis, not the probability of the hypothesis given the data.

The expert report in this case did not equate the improbability of a range of data to the improbability of a hypothesis about the process that generated the data. The report is reproduced below. Dr. Yang used not just the four populations listed in the body of the opinion, but a total of eight surrogate populations as "benchmarks" to conclude that "it is extremely unlikely that random sampling from any of the hypothetical populations could yield a sample of 179 targeted individuals where 177 or more individuals are Latino or Black." Id. at 431. That is a fair statement. If we postulate simple random sampling from infinite populations with the specified values for the proportion π who are AH, the sample data for which p = 177/179 lie in a region outside of the central 95% of all possible samples. Standing alone, falling in that 5% critical region is not "extremely unlikely," but the largest p-value reported for any of the eight populations (treating the 179 targets as independent draws) is 1/10,000.

The declaration is refreshingly free from exaggeration, but its explanation of the statistical analysis is slightly garbled. Dr. Yang swore that

Using the exact hypothesis test, I test whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion p [π in the notation I have used]. Under this test, the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion. The alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test. Each exact hypothesis test produces a corresponding p-value, which is the probability of observing a proportion as extreme or more extreme than the observed proportion assuming that the null hypothesis is true. A small p-value implies that the observed proportion is highly unlikely under the null hypothesis, favoring the rejection of the null hypothesis.

Id. at 430. The bottom line is fine, but the exposition confuses the notion of a p-value as a measure of the strength of evidence with the idea of a significance level as protection against false rejection of null hypotheses. A statistical distribution (or a non-parametric procedure) produces a p-value that can be compared to a pre-established value such as α = 0.05 to reach a test result of thumbs up (significant) or down (not significant). The hypothesis test does not "produce a ... p-value." At most, it uses a p-value to reach a conclusion. Once α is fixed, the p-value produces the test result.

Moreover, the description of the hypotheses that are being tested makes little sense. Hypotheses are statements about unknown parameter values. They are not statements about the data or about the data combined with the parameter. What does it mean to say that "the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion" or that "[t]he alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test"? These are not hypotheses about the true value of an unknown parameter of the binomial distribution used in the exact test. The observed proportion (p in my notation) is known to be different from the hypothetical population proportion (π in my notation). There is no uncertainty about that. The hypotheses to be tested have to be statements about what produced the observed "statistical difference" p − π rather than statements of whether there is a "statistical difference."

Thus, the hypothesis that was tested was not "whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion." We already know that these two known quantities are not equal. The hypothesis that Dr. Yang tested was whether (within the context of the model) the (unknown) selection probability is the known value π for a given surrogate population. If we call this unknown binomial probability Θ, she was testing claims about the true value of Θ. The null hypothesis is that the unknown Θ equals a known population proportion π; the alternative hypothesis is that Θ is not precisely equal to π. The tables in the expert declaration present strange expressions for these hypotheses, such as π = 0.917 rather than Θ = 0.917.

Nonetheless, these complaints about wording and notation might be dismissed as pedantic. The p-values for properly stated null hypotheses do signal a clear thumbs down for the null hypothesis Θ = 0.917 at the prescribed α = 0.05 level. \3/

Of more interest than these somewhat arcane points is the response to the question of dependent outcomes in selecting multiple targets for a particular investigation. Dr. Yang reasoned as follows:

[S]uppose that we took the most conservative approach and assumed that there is perfect homophily (i.e. a perfect correlation of 1) such that if one individual targeted in the operation is Latino or Black, all other individuals in that same operation are also Latino or Black. Under this conservative assumption, we can then treat the observed sample as if there were only 46 independent draws (rather than 179 draws). To observe a racial composition of 98.9% Latino or Black would thus require that at least 45 out of 46 draws resulted in a Latino or Black individual being targeted.

Id. at 431 (notes omitted). Rerunning the exact computation for p = 45/46 changes the p-value to 0.18, or roughly 1 in 5. Id. at 438 (Table 2, row h). This much larger p-value fails to meet the α = 0.05 rule that the court set.
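The figure can be reproduced with R's binom.test (the function used in note 1):

    binom.test(45, 46, 0.917)$p.value   # about 0.18 (Table 2, row h)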

The expert declaration reports this result along with those from seven other hypothesis tests (one for each surrogate population). Five of the tests produce "significance," and three do not. Does this mixed picture establish that p < 0.05? It seems to me that the usual procedures for assessing multiple comparisons do not apply here because the data are identical for all eight tests. The value of π in the surrogate population is being varied, and π is not a random variable. The use of multiple populations is a way to examine the robustness of the results of the statistical test. For p = 177/179, which is the only figure the court mentioned, the finding of significance is stable. For p = 45/46, however, it becomes necessary to ask which surrogate populations best approximate the unknown population of potential targets for the reverse stings. Numerical analysis cannot answer that question. \2/ As Macbeth soliloquized, "we still have judgment here."
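For completeness, the entire robustness check takes only a few lines in R; under the same exact-test convention, it should reproduce the p-values in Tables 1 and 2 of the declaration (up to rounding):

    benchmarks <- c(0.481, 0.602, 0.835, 0.871, 0.875, 0.894, 0.907, 0.917)   # populations a-h
    sapply(benchmarks, function(q) binom.test(177, 179, q)$p.value)           # 179 independent draws (Table 1)
    sapply(benchmarks, function(q) binom.test(45, 46, q)$p.value)             # 46 independent draws (Table 2)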

NOTES

  1. An exact binomial computation performed in R (using binom.test(177,179,0.917)) gives a two-sided p-value of 0.0000578, which is about 1/17,300. It is smaller than the 1/1,100 figure in the text because binom.test uses a different two-sided convention: it sums the probabilities of every possible count that is no more probable than the observed count of 177, rather than the probabilities of departures of 13 or more in either direction from the expected count of 164. Table 1, row h of the expert declaration reports a value of 0.0001, or 1/10,000.
  2. Dr. Yang intimates that discovery might shed light on how much of a correction is needed. Id. at 431 ("It is impossible to know the true degree of homophily or correlation among individuals targeted in reverse-sting operations, particularly when the DEA's selection criteria is unknown."); id. at 432 ("Again, because I have no knowledge of the DEA's selection criteria of potential targets, it is impossible to know which of the hypothesized populations captures the relevant pool of similarly situated individuals. A more definitive statistical analysis may be possible if the government provides the requested selection criteria.").
  3. As the declaration points out, the p-values are small enough to yield "significance" at even smaller levels, but it would not be acceptable to set up a test at the α = 0.05 level and then interpret it as demonstrating significance at a different level. Changing α after analyzing the data vitiates the interpretation of α as the probability of a false alarm, which can be written as Pr("significant difference" | Θ = π).

APPENDIX: DECLARATION OF PROFESSOR CRYSTAL S. YANG

Pursuant to 28 U.S.C. § 1746, I, CRYSTAL S. YANG, J.D., Ph.D., declare under penalty of perjury that the following is true and correct:

1. I am a Professor of Law at Harvard Law School and Faculty Research Fellow at the National Bureau of Economic Research. My primary research areas are in criminal law and criminal procedure, with a focus on testing for and quantifying racial disparities in the criminal justice system. Before joining the Harvard Law School faculty, I was an Olin Fellow and Instructor in Law at The University of Chicago Law School. I am admitted to the New York State Bar and previously worked as a Special Assistant United States Attorney in the U.S. Attorney's Office for the District of Massachusetts.

2. I received a B.A. in economics summa cum laude from Harvard University in 2008, an A.M. in statistics from Harvard University in 2008, a J.D. magna cum laude from Harvard Law School in 2013, and a Ph.D. in economics from Harvard University in 2013. My undergraduate and graduate training involved substantial coursework in quantitative methods.

3. I have published in peer-reviewed journals such as the American Economic Review, The Quarterly Journal of Economics, the American Economic Journal: Economic Policy, and have work represented in many other peer-reviewed journals and outlets.

4. I make this Declaration in support of a motion being submitted by the defendants to compel the government to provide discovery in this case.

Statistical Analyses Pertinent to Motion

5. I was retained by the Federal Defenders of New York to provide various statistical analyses relevant to the defendants' motion in this case. Specifically, I was asked to evaluate whether the observed racial composition of targets in reverse-sting operations in the Southern District of New York could be due to random chance.

6. To undertake this statistical analysis, I first had to obtain the racial composition of targeted individuals in DEA reverse-sting stash house cases brought in the Southern District of New York for the ten-year period beginning on August 5, 2009 and ending on August 5, 2019. Based on the materials in Lamar and Garcia-Pena, as well as additional searches conducted by the Federal Defenders of New York, I understand that there have been 46 fake stash house reverse-sting operations conducted by the DEA during this time period. These 46 operations targeted 179 individuals of whom zero are White, two are Asian, and 177 are Latino or Black – the “sample.” \1/ Given these counts, this means that of the targeted individuals, 98.9% are Latino or Black (and 100% are non-White). Thus, the relevant question at hand is whether the observed racial composition of the sample could be due to random chance alone if the DEA sampled from a population of similarly situated individuals.

7. Second, I had to define what the underlying population of similarly situated individuals is. In other words, what is the possible pool of all similarly situated individuals who could have been targeted by the DEA in a reverse-sting operation? Because the DEA's criteria for being a target in these reverse-sting cases is unknown, my statistical analysis will assume a variety of hypothetical benchmark populations. If the government provides its selection criteria for being a target in these reverse-sting cases, a more definitive statistical analysis may be possible. Based on materials from Garcia-Pena and Lamar, I have identified eight hypothetical benchmark populations. Below, I present the hypothesized populations and the racial composition (% Latino or Black) in each population in order of least conservative (i.e. smallest share of Latino or Black) to most conservative (i.e. highest share of Latino or Black):

a. 2016 American Community Survey 5-year estimates on counties in the SDNY (from Garcia-Pena): 48.1% Latino or Black
b. 2016 American Community Survey 5-year estimates on Bronx and New York Counties (from Garcia-Pena): 60.2% Latino or Black
c. New York Police Department (NYPD) data from January 1 – December 31, 2017, on felony drug arrests in New York City (from Garcia-Pena): 83.5% Latino or Black
d. Estimates by Prof. Kohler-Hausmann on men aged 16-49 living in New York City who have prior New York State (NYS) violent felony convictions (from Lamar): 87.1% Latino or Black
e. Estimates by Prof. Kohler-Hausmann on men aged 16-49 living in New York City who have prior NYS felony convictions (from Lamar): 87.5% Latino or Black
f. NYPD data from January 1 – December 31, 2017 on firearms seizures arrests in New York City (from Garcia-Pena): 89.4% Latino or Black
g. Reverse-sting operation defendants in the Northern District of Illinois (from Garcia-Pena): 87.7-90.7% Latino or Black \2/
h. NYPD data from January 1 – December 31, 2017 on robbery arrests in New York City (from Garcia-Pena): 91.7% Latino or Black

8. For each of these eight hypothesized populations, I then conduct an exact hypothesis test for binomial random variables. This is the standard statistical test used for calculating the exact probability of observing x “successes” out of n “draws” when the underlying probability of success is p and the underlying probability of failure is 1-p. Here, each defendant represents an independent draw and a success occurs when the defendant is Latino or Black. Using the exact hypothesis test, I test whether the observed proportion of Latinos or Blacks observed in the sample (x = 177, n = 179) is equal to the hypothesized population probability/proportion p. Under this test, the null hypothesis is that the observed proportion is not statistically different from the hypothesized population proportion. The alternative hypothesis is that the observed proportion is statistically different from the hypothesized population proportion, a two-sided test. \3/ Each exact hypothesis test produces a corresponding p-value, which is the probability of observing a proportion as extreme or more extreme than the observed proportion assuming that the null hypothesis is true. A small p-value implies that the observed proportion is highly unlikely under the null hypothesis, favoring the rejection of the null hypothesis.

9. The following Table 1 presents each of the eight hypothesized population proportions, the null hypothesis under each population, the alternative hypothesis under each population, and the corresponding p-value using the observed proportion of Latinos or Blacks in the sample assuming 179 independent draws:

Table 1 (x = 177, n = 179)
Hypothesized Population Proportion | Null Hypothesis (H0) | Alternative Hypothesis (Ha) | p-value
a. 48.1% Latino or Black | H0: p = 0.481 | Ha: p ≠ 0.481 | 0.0000
b. 60.2% Latino or Black | H0: p = 0.602 | Ha: p ≠ 0.602 | 0.0000
c. 83.5% Latino or Black | H0: p = 0.835 | Ha: p ≠ 0.835 | 0.0000
d. 87.1% Latino or Black | H0: p = 0.871 | Ha: p ≠ 0.871 | 0.0000
e. 87.5% Latino or Black | H0: p = 0.875 | Ha: p ≠ 0.875 | 0.0000
f. 89.4% Latino or Black | H0: p = 0.894 | Ha: p ≠ 0.894 | 0.0000
g. 90.7% Latino or Black | H0: p = 0.907 | Ha: p ≠ 0.907 | 0.0000
h. 91.7% Latino or Black | H0: p = 0.917 | Ha: p ≠ 0.917 | 0.0001

10. The above statistical calculations in Table 1 show that regardless of which of the eight hypothesized population proportions is chosen, one could reject the null hypothesis at conventional levels of statistical significance. For example, one could reject the null hypothesis at the standard 5% significance level which requires that the p-value be less than 0.05. All eight p-values are substantially smaller than 0.05 and would lead to a rejection of the null hypothesis even using more conservative 1%, 0.5%, or 0.1% significance levels. In other words, it is extremely unlikely that random sampling from any of the hypothetical populations could yield a sample of 179 targeted individuals where 177 or more individuals are Latino or Black.

11. Alternatively, one may be interested in the reverse question of what the underlying population proportion would have to be such that the observed proportion could be due to random chance alone assuming there are 179 independent draws. Using the standard 5% significance level, I have calculated that the hypothesized population would have to be composed of at least 96.0% Latinos or Blacks in order for one to not be able to reject the null hypothesis. In other words, unless the pool of similarly situated individuals is comprised of at least 96.0% Latinos or Blacks, it is highly unlikely that one could get a sample of 179 targeted individuals where 177 or more individuals are Latino or Black.

12. One potential question with the statistical analyses in Table 1 is whether the assumption that each of the 179 targeted individuals is an independent draw is reasonable. For example, what if the race/ethnicity of individuals in each reverse-sting operation is correlated, such that if one individual targeted in an operation is Latino or Black, the other individuals are also more likely to be Latino or Black? This correlation within operations could result if there is homophily, or “the principle that a contact between similar people occurs at a higher rate than among dissimilar people.” Miller McPherson et al., Birds of a Feather: Homophily in Social Networks, 27 Ann. Rev. Soc. 315, at 416 (2001). It is impossible to know the true degree of homophily or correlation among individuals targeted in reverse-sting operations, particularly when the DEA's selection criteria is unknown. But suppose that we took the most conservative approach and assumed that there is perfect homophily (i.e. a perfect correlation of 1) such that if one individual targeted in the operation is Latino or Black, all other individuals in that same operation are also Latino or Black. Under this conservative assumption, we can then treat the observed sample as if there were only 46 independent draws (rather than 179 draws). \4/ To observe a racial composition of 98.9% Latino or Black would thus require that at least 45 out of 46 draws resulted in a Latino or Black individual being targeted. \5/

13. For each of the eight hypothesized benchmark populations, I then test whether the observed proportion of Latinos or Blacks observed in this alternative sample (x = 45, n = 46) is equal to the hypothesized proportion from an underlying population assuming that there are only 46 independent draws. The following Table 2 presents each of the eight hypothesized population proportions, the null hypothesis under each population, the alternative hypothesis under each population, and the corresponding p-value using the observed proportion of Latinos or Blacks in the sample assuming 46 independent draws:

Table 2 (x = 45, n = 46)
Hypothesized Population Proportion | Null Hypothesis (H0) | Alternative Hypothesis (Ha) | p-value
a. 48.1% Latino or Black | H0: p = 0.481 | Ha: p ≠ 0.481 | 0.0000
b. 60.2% Latino or Black | H0: p = 0.602 | Ha: p ≠ 0.602 | 0.0000
c. 83.5% Latino or Black | H0: p = 0.835 | Ha: p ≠ 0.835 | 0.0045
d. 87.1% Latino or Black | H0: p = 0.871 | Ha: p ≠ 0.871 | 0.0255
e. 87.5% Latino or Black | H0: p = 0.875 | Ha: p ≠ 0.875 | 0.0257
f. 89.4% Latino or Black | H0: p = 0.894 | Ha: p ≠ 0.894 | 0.0871
g. 90.7% Latino or Black | H0: p = 0.907 | Ha: p ≠ 0.907 | 0.1240
h. 91.7% Latino or Black | H0: p = 0.917 | Ha: p ≠ 0.917 | 0.1795

14. Under this conservative assumption of perfect homophily, the above statistical calculations in Table 2 show that under the first five hypothesized population proportions (a-e), one could reject the null hypothesis at the standard 5% significance level. In other words, even if the hypothesized population proportion of Latinos or Blacks is as high as 87.5%, it is highly unlikely that random sampling could yield a sample of 46 individuals where 45 or more individuals are Latino or Black. One, however, cannot reject the null hypothesis for the next three hypothesized population proportions (f-h). Again, because I have no knowledge of the DEA's selection criteria of potential targets, it is impossible to know which of the hypothesized populations captures the relevant pool of similarly situated individuals. A more definitive statistical analysis may be possible if the government provides the requested selection criteria.

15. As before, I also ask the reverse question of what the underlying population proportion would have to be such that the observed proportion could be due to random chance alone assuming that there are only 46 independent draws. Using the standard 5% significance level, I have calculated that the hypothesized population would have to be composed of at least 88.5% Latinos or Blacks in order for one to not be able to reject the null hypothesis. In other words, unless the pool of similarly situated individuals is comprised of at least 88.5% Latinos or Blacks, it is highly unlikely that one could get a sample of 46 targeted individuals where 45 or more individuals are Latino or Black.

Dated: Cambridge, Massachusetts
September 13, 2019
/s/ Crystal S. Yang
Crystal S. Yang

Footnotes

   1. In consultation with the Federal Defenders of New York, this sample is obtained by taking the 33 cases and 144 defendants identified in Garcia-Pena or Lamar, excluding two cases and five defendants that are either not DEA cases or reverse-sting cases, and including an additional 15 cases and 40 defendants that were not covered by the time frames included in the Lamar or Garcia-Pena analysis.
   2. I choose 90.7% (the upper end of the range) as the relevant proportion given that it yields the most conservative estimates.
   3. This two-sided test takes the most conservative approach (in contrast to a one-sided test) because it allows for the possibility of both an over-representation and under-representation of Latinos or Blacks relative to the hypothesized population proportion.
   4. I make the simplifying assumption that each of the 46 operations targeted the average number of codefendants, 3.89= 179/46.
   5. Technically, 45.494 draws would need to be of Latino or Black individuals but I conservatively round down to the nearest integer.

 UPDATED: 1/24/2021 3:13 ET