Saturday, January 18, 2014

The Signal, the Noise, and the Errors

Published in September 2012, The Signal and the Noise by Nate Silver soon reached The New York Times best-seller list for nonfiction. Amazon.com named it the best nonfiction book of 2012, and it won the 2013 Phi Beta Kappa Award in science. Not bad for a book that presents Bayes' rule as a prescription for better thinking in life and as a model for consensus formation in science. The subtitle, for those who have not read it, is "Why So Many Predictions Fail--But Some Don't," and the explanations include poor data, cognitive biases, and statistical models not grounded in an understanding of the phenomena being modeled.

The book is both thoughtful and entertaining, covering many fields. I learned something about meteorology (your local TV weather forecaster probably is biased toward forecasting bad weather--stick to the National Weather Service forecasts), earthquake predictions, climatology, poker, human and computer chess-playing, equity markets, sports betting, political polling and poor prognostication by pundits, and more. Silver does not pretend to be an expert in all these fields, but he is perceptive and interviewed a lot of interesting people.

Indeed, although Wikipedia describes Silver as "an American statistician and writer who analyzes baseball (see Sabermetrics) and elections (see Psephology)," he does not present himself as an expert in statistics, and statisticians seem conflicted on whether to include him in their ranks (see AmStat News). He seems to be pretty much self-educated in the subject, and he advocates "getting your hands dirty with the data set" rather than "spending too much time doing reading and so forth." Frick (2013).

Perhaps that emphasis, combined with the objective of writing an entertaining book for the general public, has something to do with the rather sweeping--and sometimes sloppy--arguments for Bayesian over frequentist methods. Although Silver gives a few engaging and precise examples of Bayes' rule in operation (playing poker or deciding whether your domestic partner is cheating on you, for instance), he is quick to characterize a variety of informal, intuitive modes of combining many different kinds of data as tantamount to following Bayes' rule. Marcus & Davis (2013) identify one telling example--a very successful sports bettor who recognizes the importance of data that the bookies overlook, misjudge, or do not acquire. (Pp. 232-61). What makes this gambler a Bayesian? Silver thinks it is the fact that "[s]uccessful gamblers ... think of the future as speckles of probability, flickering upward and downward like a stock market ticker to every new jolt of information." (P. 237). That's fine, but why presume that the flickers follow Bayes' rule as opposed to some other procedure for updating beliefs? And why castigate frequentist statisticians, as Silver seems to, as "think[ing] of the future in terms of no-lose bets, unimpeachable theories, and infinitely precise measurements"? Ibid. Surely, that is not the world in which statisticians live.
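To see why the question matters, here is a minimal sketch (in Python, with made-up numbers, not anything from the book) contrasting a genuinely Bayesian "flicker" with an equally plausible rule of thumb. Both respond to a jolt of information; only one follows Bayes' rule.

```python
# Illustrative only: updating a bettor's P(win) after news that the
# starting quarterback is injured. The likelihood ratio is invented.

def bayes_update(prior, likelihood_ratio):
    """Update P(win) via Bayes' rule, where likelihood_ratio is
    P(news | win) / P(news | lose)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

prior = 0.55                    # bettor's prior P(home team wins)
lr = 0.5                        # injury news twice as likely if the team will lose
print(bayes_update(prior, lr))  # ~0.38: the Bayesian flicker

# An equally plausible non-Bayesian flicker: just knock five points off.
print(prior - 0.05)             # 0.50 -- responsive to the news, but it
                                # follows no likelihood calculation at all
```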

Changing probability judgments does not make someone a Bayesian

In proselytizing for Bayes' theorem and in urging readers to "think probabilistically" (p. 448), Silver also writes that
When you first start to make these probability estimates, they may be quite poor. But there are two pieces of favorable news. First, these estimates are just a starting point: Bayes's theorem will have you revise and improve them as you encounter new information. Second, there is evidence that this is something we can learn to improve. The military, for instance, has sometimes trained soldiers in these techniques,5 with reasonably good results.6 There is also evidence that doctors think about medical diagnoses in a Bayesian manner.7 [¶] It is probably better to follow the lead of our doctors and our soldiers than our television pundits.
It is hard to argue with the concluding sentence, but where is the evidence that many soldiers and doctors are intuitive (or trained) Bayesians? The report cited (n.5) for the proposition that "[t]he military ... has sometimes trained soldiers in the [Bayesian] techniques" says nothing of the kind.* Similarly, the article that is supposed to show that the alleged training in Bayes' rule produces "reasonably good results" is quite wide of the mark. It is a 35-year-old report for the Army on "Training for Calibration" about research that made no effort to train soldiers to use Bayes' rule.**

How about doctors? The source here is an article in the British Medical Journal that asserts that "[c]linicians apply bayesian reasoning in framing and revising differential diagnoses." Gill et al. (2005). But these authors--I won't call them researchers because they did no real research--rely only on their impressions and post hoc explanations for diagnoses that are not expressed probabilistically. As one distinguished physician tartly observed, "[c]linicians certainly do change their minds about the probability of a diagnosis being true as new evidence emerges to improve the odds of being correct, but the similarity to the formal Bayesian procedure is more apparent than real and it is not very likely, in fact, that most clinicians would consider themselves bayesians." Waldron (2008, pp. 2-3 n.2).
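For comparison, formal Bayesian updating in diagnosis is mechanical once the inputs are specified. Here is a minimal sketch with illustrative numbers (not drawn from Gill et al.): the posterior probability of disease after a positive test follows directly from the prevalence, sensitivity, and specificity.

```python
# Illustrative numbers only.
prevalence = 0.02      # prior P(disease)
sensitivity = 0.90     # P(positive test | disease)
specificity = 0.95     # P(negative test | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 3))   # ~0.269: even a good positive test leaves the
                             # diagnosis far from certain at low prevalence
```

Whether clinicians in fact carry out anything like this calculation, rather than merely revising their impressions, is exactly the point in dispute.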

The transposition fallacy

Consistent with this tendency to conflate expressing judgments probabilistically with using Bayes' rule to arrive at the assessments, Silver presents probabilities that have nothing to do with Bayes' rule as if they are properly computed posterior probabilities. In particular, he naively transposes conditional probabilities to misrepresent p-values as degrees of belief.

At page 185, he writes that
A once-famous “leading indicator” of economic performance, for instance, was the winner of the Super Bowl. From Super Bowl I in 1967 through Super Bowl XXXI in 1997, the stock market gained an average of 14 percent for the rest of the year when a team from the National Football League (NFL) won the game. But it fell by almost 10 percent when a team from the original American Football League (AFL) won instead. [¶] Through 1997, this indicator had correctly “predicted” the direction of the stock market in twenty-eight of thirty-one years. A standard test of statistical significance, if taken literally, would have implied that there was only about a 1-in-4,700,000 possibility that the relationship had emerged from chance alone.
This is a cute example of the mistake of interpreting a p-value, acquired after a search for significance, as if there had been no such search. As Silver submits, "[c]onsider how creative you might be when you have a stack of economic variables as thick as a phone book." Ibid.
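A quick simulation, with made-up data rather than Silver's, shows how such a search manufactures impressive-looking "predictors" out of pure noise.

```python
# Illustrative only: generate many random binary "indicators" that have
# nothing to do with the market, then report the most impressive one.
import random
random.seed(1)

YEARS = 31            # Super Bowls I through XXXI
INDICATORS = 5000     # a "phone book" of candidate variables

market_up = [random.random() < 0.5 for _ in range(YEARS)]

best_hits = 0
for _ in range(INDICATORS):
    indicator = [random.random() < 0.5 for _ in range(YEARS)]
    hits = sum(i == m for i, m in zip(indicator, market_up))
    hits = max(hits, YEARS - hits)   # either direction counts as a "prediction"
    best_hits = max(best_hits, hits)

print(best_hits, "of", YEARS)   # typically about 25 or 26 correct calls
                                # from sheer noise, once you search enough
```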

But is the ridiculously small p-value (that he obtained by regressing the S&P 500 index on the conference affiliation of the Super Bowl winner) really the probability "that the relationship had emerged from chance alone"? No, it is the probability that such a remarkable association would be seen if the Super Bowl outcome and the S&P 500 index were entirely uncorrelated (and no one had gone searching for a data set that showed a seemingly shocking correlation with the index). Silver may be a Bayesian at heart, but he did not compute the probability of the null hypothesis given the data, and it is problematic to tell the reader that a "standard test of statistical significance" (or more precisely, a p-value) gives the (Bayesian) probability that the null hypothesis is true.

Of course, with such an extreme p-value, the misinterpretation might not make any practical difference, but the same misinterpretation is evident in Silver's characterization of a significance test at the 0.05 level. He writes that "[b]ecause 95 percent confidence in a statistical test is Fisher’s traditional dividing line between 'significant' and 'insignificant,' researchers are much more likely to report findings that statistical tests classify as 95.1 percent certain than those they classify as 94.9 percent certain—a practice that seems more superstitious than scientific." (P. 256 n. †). Sure, any rigid dividing line (a procedure that Fisher did not really use) is arbitrary, but rejecting a hypothesis in a classical statistical test at the 0.05 level does not imply a 95% certainty that this rejection is correct.
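A back-of-the-envelope Bayesian calculation, with purely illustrative numbers, shows how far apart the two quantities can be.

```python
# Illustrative only: rejecting H0 at the 0.05 level does not make a finding
# "95 percent certain." Suppose 1 in 10 hypotheses tested is genuinely true,
# the test's power is 0.8, and alpha is 0.05.
prior_true = 0.10    # P(alternative true) before testing
power = 0.80         # P(reject H0 | alternative true)
alpha = 0.05         # P(reject H0 | H0 true)

p_reject = power * prior_true + alpha * (1 - prior_true)
p_true_given_reject = power * prior_true / p_reject
print(round(p_true_given_reject, 2))   # ~0.64, not 0.95
```

How close the posterior probability comes to 95 percent depends entirely on the prior and the power, which is precisely the information a p-value does not carry.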

In transposing conditional probabilities in violation of both Bayesian and frequentist precepts, Silver is in good and plentiful company. As the pages on this blog reveal, theoretical physicists, epidemiologists, judges, lawyers, forensic scientists, journalists, and many other people make this mistake. E.g., The Probability that the Higgs Boson Has Been Discovered, July 6, 2012. Despite its general excellence in describing data-driven thinking, The Signal and the Noise would have benefited from a little more error-correcting code.

Notes

* Rather, Gunzelmann & Gluck (2004) discusses training in unspecified "mission-relevant skills." An "expert model is able to compare their actions against the optimal actions in the task situation" and "identify [trainee errors] and provide specific feedback about why the action was incorrect, what the correct action was, and what the students should do to correct their mistake." The expert model--and not the soldiers--"uses Bayes’ theorem to assess mastery learning based upon the history of success and failure on particular units of skill within the task." Ibid. According to the authors, this "Bayesian knowledge tracing approach" is inadequate because it "does not account for forgetting, and thus cannot provide predictions about skill retention." Ibid.
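Gunzelmann & Gluck do not publish their equations, but a standard Bayesian knowledge-tracing update, sketched here with illustrative parameters, shows what the expert model (not the soldier) is doing: applying Bayes' rule to each success or failure to estimate the probability that a skill has been mastered.

```python
# Illustrative parameters only; not the authors' values.
P_MASTERY = 0.3    # prior P(skill mastered)
SLIP = 0.1         # P(error | mastered)
GUESS = 0.2        # P(correct | not mastered)
LEARN = 0.15       # P(mastering the skill on a practice attempt)

def update(p_mastery, correct):
    # Bayes' rule on the observed success or failure
    if correct:
        num = p_mastery * (1 - SLIP)
        den = num + (1 - p_mastery) * GUESS
    else:
        num = p_mastery * SLIP
        den = num + (1 - p_mastery) * (1 - GUESS)
    posterior = num / den
    # Learning transition; note there is no forgetting term, which is
    # the limitation the authors themselves point out.
    return posterior + (1 - posterior) * LEARN

p = P_MASTERY
for outcome in [True, False, True, True]:   # a history of successes and failures
    p = update(p, outcome)
print(round(p, 2))   # estimated probability of mastery after this history
```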

** Lichtenstein & Fischhoff's (1978) objective was "to help analysts to more accurately use numerical probabilities to indicate their degree of confidence in their decisions." They did not study military analysts, but instead recruited 12 individuals from their personal contacts. They had these trainees assess the probabilities of statements in the areas of geography, history, literature, science, and music. They measured how well calibrated their subjects were. (A well calibrated individual gives correct answers to x% of the questions for which he or she assesses the probability of the given answer to be x%.) The proportion of the subjects whose calibration improved after feedback was 72%. Ibid.
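For concreteness, the calibration measure just described can be computed along the following lines (a sketch with invented judgments, not Lichtenstein & Fischhoff's data): group the assessor's stated probabilities into bins and compare each stated probability with the observed proportion correct.

```python
from collections import defaultdict

# (stated probability that the chosen answer is correct, was it correct?)
judgments = [(0.6, True), (0.6, False), (0.8, True), (0.8, True),
             (0.8, False), (0.9, True), (0.9, True), (1.0, True)]

bins = defaultdict(list)
for stated, correct in judgments:
    bins[stated].append(correct)

for stated in sorted(bins):
    outcomes = bins[stated]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {stated:.0%}: actually correct {hit_rate:.0%} (n={len(outcomes)})")
# A well-calibrated assessor's stated and actual percentages would match.
```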

References

Walter Frick, Nate Silver on Finding a Mentor, Teaching Yourself Statistics, and Not Settling in Your Career, Harvard Business Review Blog Network, Sept. 24, 2013, http://blogs.hbr.org/2013/09/nate-silver-on-finding-a-mentor-teaching-yourself-statistics-and-not-settling-in-your-career/.

Christopher J. Gill, Lora Sabin & Christopher H. Schmid, Why Clinicians Are Natural Bayesians, 330 Brit. Med. J. 1080–83 (2005), available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC557240/.

Glenn F. Gunzelmann & Kevin A. Gluck, Knowledge Tracing for Complex Training Applications: Beyond Bayesian Mastery Estimates, in Proceedings of the Thirteenth Conference on Behavior Representation in Modeling and Simulation 383-84 (2004), available at http://act-r.psy.cmu.edu/wordpress/wp-content/uploads/2012/12/710gunzelmann_gluck-2004.pdf.

Sarah Lichtenstein & Baruch Fischhoff, Training for Calibration, Army Research Institute Technical Report TR-78-A32, Nov. 1978, available at http://www.dtic.mil/dtic/tr/fulltext/u2/a069703.pdf.

Gary Marcus & Ernest Davis, What Nate Silver Gets Wrong, New Yorker, Jan. 25, 2013, http://www.newyorker.com/online/blogs/books/2013/01/what-nate-silver-gets-wrong.html.

Tony Waldron, Palaeopathology (2008), excerpt available at http://assets.cambridge.org/97805216/78551/excerpt/9780521678551_excerpt.pdf.
