Thursday, October 6, 2016

Does log-LR separate Hillary Clinton from Donald Trump?

Many forensic statisticians regard likelihood ratios (or their logarithms) as providing an ideal measure of the probative value of comparisons of patterns and other test results. Moving into another context, on medium.com, "Data Scientist" Maixent Chenebaux describes differences between the acceptance speeches of Hillary Clinton and Donald Trump at their parties' nominating conventions:
we would like to find which are the candidates’ favorite words. To get the “Trumpian” and the “Clintonian” vocabularies, we have to find the words that occur the most in one candidate’s talk and, at the same time, the least in the opponent’s. For example, the word “really” is found 15 times in Trump’s speech but only once in Clinton’s. One way to determine this is to calculate the odds ratio for each word. The odds ratio (here named OR) was, for each word, computed using the following formula:

OR(word_i) = log { p_trump(word_i) / p_clinton(word_i) }

The first term of the ratio is the probability of a word being in Trump’s vocabulary, and the other one is the probability of the same word being in Clinton’s. The log function allows us to efficiently sort each word in one category or the other: when the probabilities are equal, the log function is null. In any other cases, it is either negative (a word is Clintonian) or positive (a word is Trumpian).
There is a simpler way to say this. If a word is used proportionally more often in Trump's speech than in Clinton's, it is Trumpian; if it occurs proportionally more in Clinton's speech, it is Clintonian.
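
The comparison can be sketched in a few lines of Python. This is a minimal sketch, not Chenebaux's code: the tokenizer, the add-one smoothing (used so that a word one speaker never said does not produce log 0), and the function names `tokenize` and `log_ratios` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def log_ratios(text_a, text_b, smoothing=1.0):
    """Return {word: log(p_a(word) / p_b(word))} over the combined vocabulary.

    Positive values mean the word is proportionally more frequent in text_a,
    negative values in text_b, and zero means equal relative frequency.
    The additive smoothing is an assumption; the article does not say how
    words with a zero count were handled.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    return {
        word: math.log((counts_a[word] + smoothing) / total_a)
        - math.log((counts_b[word] + smoothing) / total_b)
        for word in vocab
    }


# Toy strings, invented for illustration (not excerpts from either speech).
ratios = log_ratios("really really great nice", "together together working nice")
for word, value in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{word:10s} {value:+.2f}")
```

Sorting by the log ratio puts the words most characteristic of the second text at the top and those most characteristic of the first text at the bottom; a word used equally often in both, like "nice" in the toy strings, lands at zero.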

The "odds ratio" in the box is not really an odds ratio. It is the logarithm of a Bayes factor (the ratio of posterior to prior odds) or the log of the likelihood ratio. But that seems like extra baggage. We are not trying to deduce the author of the speech from the frequencies of the distinctive words. We are only trying to pick out the words that are distinctive. Their usefulness in classifying additional speech samples by author would require further research. If we had a transcript of a speech that we knew was either Clinton's or Trump's--and had to guess the speaker without knowing anything about politics and semantics--then the discriminators obtained from this study might be useful. We could claim scientific validity in using them to discriminate between the two possible authors if these terms had been shown to be sensitive and specific (and, hence, to have high likelihood ratios) when cross-validated on other speeches.
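
To make the link between sensitivity, specificity, and likelihood ratios concrete, consider a binary indicator such as "the transcript uses the word at least once." The likelihood ratio for a positive finding is sensitivity / (1 - specificity). The sketch below uses hypothetical counts, not results from any real validation study, and the function name `likelihood_ratios` is an assumption.

```python
def likelihood_ratios(true_pos, false_neg, false_pos, true_neg):
    """Compute sensitivity, specificity, and both likelihood ratios
    from the four cells of a 2x2 validation table.

    Raises ZeroDivisionError if specificity is exactly 0 or 1, or if a
    row of the table is empty; a real analysis would need to handle that.
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return sensitivity, specificity, lr_positive, lr_negative


# Hypothetical held-out set: 20 Trump speeches and 20 Clinton speeches.
# The indicator fires in 18 of the Trump speeches and 4 of the Clinton ones.
sens, spec, lr_pos, lr_neg = likelihood_ratios(
    true_pos=18, false_neg=2, false_pos=4, true_neg=16
)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"LR+={lr_pos:.2f} LR-={lr_neg:.3f}")
```

With these made-up counts the indicator would be about 4.5 times more probable in a Trump speech than in a Clinton speech, which is the sense in which a sensitive and specific term has a high likelihood ratio.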

In any event, the words that emerged as discriminators for the nomination speeches and that were used "almost exclusively" by one candidate (p_{other candidate}(word) ≈ 0?) were

Clinton     Trump
America     great
working     got
hard        China
together    really
Donald      Mexico
millions    nice
campaign    problem(s)
enough      Iran
wants       disaster

One proposed interpretation is that Trump uses shorter words, but that is not obvious from this list. Another is:
Fine observers will note that “Trump” does not appear in the Clintonian wordset above, the reason being that Trump himself made numerous mentions of his last name in his speech (10 times), bringing the ratio way down. As a point of comparison, Clinton’s name is only used twice: once in Hillary’s speech (about her husband Bill Clinton), and once in Trump’s. Moreover, the Clintonian word “Wants” that shows up in the list is mostly used to criticize her opponent (“He wants to divide us […]”, “He wants us to fear the future and fear each other.”). It clearly shows that Clinton talked about Trump, and Trump talked about … himself!
A word whose LR was about 1 is "thanks." It does not work well as a classifier. But here semantics, not statistics, indicates a difference between the two candidates:
They both use “thank(s)” numerous times, but in a different manner: while Clinton specifically thanked a group of people or an individual [e.g. "I want to thank Bernie Sanders"] Trump’s “thanks” were mostly employed when the crowd was applauding him [e.g., "That's really nice, thanks"].
Returning to statistics alone,
Most of Trump’s sentences are short: more than 21% of Trump’s speech is made of sentences that contain 5 or 6 words. Clinton’s sentence lengths are more evenly distributed, 12-word sentences being the most frequent. ... Obama, during his first nomination speech, employed an average of 25.7 words per sentence, which is almost equal to Clinton and Trump combined. Obama also repeated himself 24% less than Clinton, and 42% less than Trump.

That's it for now! I would not want to repeat myself.

Reference

Maixent Chenebaux, Semantics — What Does Data Science Reveal about Clinton and Trump?, Oct. 5, 2016, https://medium.com/reputation-squad/semantics-what-does-data-science-reveal-about-clinton-nd-trump-afdf427e833b#.ctkkoyu2o
