Monday, June 24, 2013

“Disturbing” and “Ridiculous” Expertise in State v. Zimmerman

Spectrographic speaker identification has had a troubled history. The pattern is familiar: early, enthusiastic claims; skepticism from a few eminent “outside” scientists; review in many court hearings; mixed rulings; a skeptical NRC report; more rulings.

The latest ruling, in the closely watched Florida murder case, State v. Zimmerman, extends beyond spectrographic analysis. The court's opinion suggests that the prosecution was not overly scrupulous about its scientific evidence on whose screams are present in a recorded 911 call made from inside a house just before neighborhood watcher George Zimmerman shot teenage Trayvon Martin in the chest.

Newspapers had hired two experts to examine this recording along with one of a non-emergency call that Zimmerman had made to the police. The expert selected by the Washington Post, Alan R. Reich, concluded that Martin was yelling, "I'm begging you," and "stop." The expert retained by the Orlando Sentinel, Tom Owen, ruled out Zimmerman as the screamer, in part, after using voice-recognition software. Prosecutors wanted them to testify at the trial. The defense objected that their methodology was incapable of producing these results.

Judge Debra Nelson held a hearing on the motion to exclude the evidence. The prosecution had Reich and Owen testify. It produced no independent experts to vouch for their methods, which was not a good strategy. The defense produced four experts who found the identifications outside the range of what existing science can accomplish. Given this array of opinions, Judge Nelson excluded the proposed testimony, leaving the prosecution free to identify the screamer and to decipher his words with ordinary methods (such as testimony of Trayvon’s father that he recognizes the voice as his son’s).

The court reached this result under Florida’s version of the general-scientific-acceptance standard announced in the 1923 case of Frye v. United States, 293 F. 1013 (D.C. Cir. 1923). One renowned law professor recently insisted that “Under Frye's general acceptance test [a]ll that is necessary is the ability to count. ... Judges do not need to know very much about science; they simply need to be able to count. So judges ask, ‘You epidemiologists, is this method generally accepted? Raise your hand. One, two, three, four. Those who think it's not generally accepted? One, two, three.’” (Faigman 2013). But the general acceptance standard is not this simple. Although some judges may try to apply Frye this mechanically, and although advocates occasionally have produced opinion polls as if they were proof of general acceptance, the test is much richer than this. As the Florida Supreme Court explained in Ramirez v. State, 810 So.2d 836, 844 (Fla. 2001) (footnotes omitted):
When applying the Frye test, a court is not required to accept a "nose count" of experts in the field. Rather, the court may peruse disparate sources—e.g., expert testimony, scientific and legal publications, and judicial opinions—and decide for itself whether the theory in issue has been "sufficiently tested and accepted by the relevant scientific community." In gauging acceptance, the court must look to properties that traditionally inhere in scientific acceptance for the type of methodology or procedure under review—i.e., "indicia" or "hallmarks" of acceptability.
Under this standard, numbers are not decisive. Courts must verify that there has been sufficient testing in the scientific community. Thus, in Zimmerman, the state had to show that scientists generally believe there is a method that permits the kinds of results the newspaper-prosecution experts achieved. It did not come close to this showing.

I. Scientific Methods of Speaker Recognition

After preliminary quotations of generalities from Florida Supreme Court opinions, the trial court observed that
There are currently three employed methods used in forensic speaker identification. The first and most widely used and accepted was referred to during the hearing as critical listening, aural perception, or auditory phonetic analysis. This is the process whereby a trained expert carefully listens to a sample in an effort to detect unique qualities, such as vocal pitch, speech rhythms, and accents that can be heard by the unassisted human ear. The second is spectral or acoustic-phonetic analysis which uses computer software to measure fundamental frequency and energy in the spoken words, the results of which are represented in a graphic format. Finally, there is biometric or Gaussian Mixture Model analysis. This method also uses computer software to compare thousands of variables in spoken words to determine whether they were uttered by the same person.
At this point, the careful reader should be nervous. To begin with, “critical listening” by a “trained expert” to hear “unique qualities”? What studies establish that the qualities are unique and that the trained listeners can detect them and compare them as between two samples of spoken words? The point here is not that “aural perception” is necessarily subjective. People can perform many tasks that require subjective judgment quite well, and training can augment their skills. But if this is what the prosecution's experts are doing, what does science show about the performance of “critical listeners” under similar circumstances?

Second, “spectral analysis”—obtaining waveforms in the frequency domain—can be done reliably and validly, but in general its use to ascertain the identity of particular speakers remains controversial. Here, a substantial body of research exists, but interspeaker versus intraspeaker variability remains an issue. The court’s order does not address this general concern, but (understandably) focuses on the problems with the 911 recording.

Finally, what “variables” in spoken words establish the identity of a speaker? There is no shortage of papers on Gaussian Mixture Models in speaker recognition by machine-learning systems, and the basic principles are well accepted. But how well does the classifying software work with snippets of background noise like that in the 911 tape?
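To make the worry about short samples concrete, here is a deliberately crude sketch of likelihood-ratio scoring with Gaussian mixtures. Everything in it is an assumption chosen for illustration: the two "speaker models" are invented one-dimensional mixtures, each "frame" is a single random number rather than a real acoustic feature vector, and the procedure is not the algorithm used by Easy Voice or by any system discussed at the hearing. It shows only the general statistical point that when two models overlap, a decision based on a handful of frames goes wrong far more often than one based on many.

```python
import math
import random

# Toy one-dimensional "speaker models": each is a Gaussian mixture given as
# (weight, mean, standard deviation) components. All numbers are invented.
SPEAKER_A = [(0.5, -1.0, 1.0), (0.5, 1.5, 1.2)]
SPEAKER_B = [(0.5, -0.5, 1.0), (0.5, 2.0, 1.2)]


def log_likelihood(x, gmm):
    """Log density of one feature value under a Gaussian mixture."""
    density = sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for w, mu, sd in gmm
    )
    return math.log(density)


def draw(gmm, rng):
    """Draw one feature value: pick a component by weight, then sample it."""
    weights = [w for w, _, _ in gmm]
    _, mu, sd = rng.choices(gmm, weights=weights)[0]
    return rng.gauss(mu, sd)


def error_rate(n_frames, trials=1000, seed=0):
    """Fraction of trials in which a sample that truly comes from speaker A
    scores at least as well under speaker B's model."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        frames = [draw(SPEAKER_A, rng) for _ in range(n_frames)]
        # Summed log-likelihood ratio: positive values favor speaker A.
        llr = sum(
            log_likelihood(x, SPEAKER_A) - log_likelihood(x, SPEAKER_B)
            for x in frames
        )
        if llr <= 0:
            errors += 1
    return errors / trials


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, "frames -> estimated error rate", error_rate(n))
```

Running the script prints an estimated error rate for samples of 5, 50, and 500 frames. The toy flatters the software, because it assumes that the test sample is drawn from exactly the same distribution as the model, and that is precisely the assumption that fails when screams are compared with ordinary speech.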

II. Two Experts’ Theories

One expert was Thomas Owen. He is the founder of Owen Forensic Services. (The president is Jennifer Owen, his daughter, and his CV implies that he formerly ran the business as Owl Investigative Services.) The firm “maintains a state of the art facility for the purposes of digital audio enhancement, digital video enhancement, digital audio and video authenticity analysis, voice identification and media/data recovery.” Mr. Owen has no (academic) scientific training and no publications in any journal of science or engineering (unless one counts a 1983 article on “Reproduction of Acoustically Recorded Cylinders and Disks” in the Journal of the Audio Engineering Society, the society being “an international organization that unites audio engineers, creative artists, scientists and students worldwide by promoting advances in audio and disseminating new knowledge and research.”).

The opinion states that Mr. Owen performed a “software-reliant analysis” with “Easy Voice, a software program he markets and in which he has a small financial interest.” Whatever Easy Voice does, it could not do it with the “seven seconds of screams from the 911 call. The seven second sample was rejected [as too short] by the Easy Voice software program. To correct this problem, he ran the seven second sample twice ... .” The Easy Voice website indicates that the program uses some spectrographic data (namely, formants), pitch, and a Gaussian Mixture Model. Using the program with the loopy sample and changing the pitch of a comparison sample, Owen concluded that the screams were not Zimmerman’s and therefore must have been Martin’s.

Furthermore, Owen "cleaned up" the audio of Zimmerman's non-emergency call and “[u]sing audio editing software, [determined] that the unintelligible word ... was ‘punks.’”

The Washington Post described its expert, Alan Reich, as “a former University of Washington professor with a doctorate in speech science” and a long history of consulting work. He employed “commonly-used digital enhancement and transcription software” along with “the aural perception and acoustic-phonetic analysis methods.” Those are the court’s words. The Post put it more plainly: “To familiarize himself with Zimmerman’s voice, Reich also listened many times to a recorded call that Zimmerman placed to police minutes earlier ... .” and, “[w]here many people have heard only vague yells on the [911] recording, Reich said that he has found language.”

Not only that, but Reich figured out that the screams were Martin’s. His reasoning, as summarized by the court, was that "the screams ended upon the gunshot being fired, leading to an inference that the person screaming had been shot; and the frequency of the screams indicated that the speaker's vocal tract had not completely developed, leading to a conclusion that the person had not reached adulthood."

III. Disturbing and Ridiculous

The defense assembled four internationally recognized experts. The FBI’s Dr. Hirotaka Nakasone “testified that the processes of aural perception and spectral analysis are commonly used in the field of speaker identification and generally accepted within the field” but that there were “less than three seconds of useable audio” and that “screams are not suitable for comparison with one's normal speaking voice.” He found it “disturbing” that a scientist [would] claim “to make a conclusion about the identity of the person(s) screaming in the 911 call given the current state of scientific technology ... .”

Dr. Peter French, a linguist with a significant publication record and the leader of “J P French Associates ... the United Kingdom's longest established independent forensic laboratory specialising in the analysis of speech, audio and language,” was “the most compelling” witness. He dismissed “the recorded screams in the 911 call [as] unsuitable for any type of forensic analysis.” “[I]f he had received these recordings from law enforcement at the outset of the case, he would have rejected the assignment as it would have been fruitless to undertake the task.”

He agreed with Dr. Nakasone that “there is no basis to compare spoken words to screaming [because] under the type of stress present in this case [screaming] changes the voice in an unpredictable manner and cannot be replicated in laboratory conditions. Moreover, ... [a] forensic expert cannot hear the variables used with aural comparison in screams, including the pronunciation of certain phonemes, accents, speech rate, and pitch variations.”

He rejected the theory that it is possible to “tell the age of a speaker based upon the sound of his voice.” As the court remarked, “Dr. French's opinions cannot be reconciled with those of Dr. Reich.”

Like Dr. French, George Doddington thought the state’s voice identifications were “ridiculous.” Dr. Doddington “works with NIST, the National Institute of Standards and Technology,” where he “formulated a test designed to determine software performance in voice comparison studies. The results ... indicate that software error rates climb substantially as the recorded sample size is reduced from ninety seconds to ten.”

The final blow came from Jim Wayman, an electrical engineer and the director of the National Biometric Test Center at San Jose State University. He “testified that there was less than one second worth of data (50 or 60 milliseconds) available in each of the screams in the 911 call” and that there was no “software accepted in the scientific community that would produce reliable comparison results ... .” He found Dr. Reich's methodology “confusing” and “baffling.”

IV. The Outcome and a Note on Prosecutorial Ethics

In addition to excluding testimony from the state’s experts because it was, quite clearly, not the product of generally accepted methods as applied to the case at hand, the court excluded Dr. Reich’s extraordinary perceptions of specific words in the enhanced recording as lacking in probative value (the court spoke of “listener bias”) and likely to “mislead the jury.” (The court also thought it would “confuse issues,” but it is hard to see what issues his uncanny hearing confuses.)

I have told this story at some length because I think it raises a broader question. Is it ethical for prosecutors to offer testimony like this if they are not prepared to defend it? Florida’s star homicide prosecutors must have known that they were out of bounds on this play. In presenting its expert’s conclusions, the Post’s reporters added that James J. Ryan, “the retired head of the FBI forensic audio, video and image analysis unit” had told them that “[t]he science doesn’t help with a recording like this.” That should have clued them in to the possibility that the newspaper experts were overreaching.

By itself, a disagreement among equally well credentialed experts would not have been a reason to forgo reasonable testimony. But the two sets of experts were hardly evenly matched in their scientific backgrounds and accomplishments. Why did the state fail to recruit a single disinterested expert to confirm the work for the newspapers? Did the prosecutors not bother to check whether any other reputable scientists could defend their proposed evidence? If so, the competence of the state’s ministers of justice is open to question. On the other hand, if these prosecutors did make the effort and could not produce any credible expert witness, how could they think that they were serving justice? The only purpose for presenting evidence that is clearly scientifically unfounded as if it were good science is to trick the jury into reaching a desired result (whether it be right or wrong). That is not a step to be taken lightly or without reprimand.

