Do you love the sound of your computer’s voice?
Speech synthesis has a long and chequered history, going back at least to the Enlightenment, when Wolfgang von Kempelen built a contraption that could make simple vowel sounds from a combination of bellows, a vibrating reed and a leather tube – all mimicking the human sound-producing apparatus. The field has come a long way since then, with prominent examples of artificial speech in Stephen Hawking’s famous mechanical voice and, more recently, the voice of Apple’s Siri and our many satellite navigation units. Closer to home, a group of researchers in Edinburgh has pioneered novel methods for expressive speech synthesis, such as in the products of the University of Edinburgh spin-out company CereProc. This work was covered in an article in The Telegraph, under the provocative title, “How do you teach a computer to speak like Scarlett Johansson?”.
The Science behind Speech Synthesis
As the article nicely explains, the simplest way to synthesise speech that mimics the peculiarities of a particular person’s voice is to collect many examples of that person speaking, making sure they cover the full range of expressions one would later expect the computer to use. Those hours of recordings are then finely sliced into small constituent fragments (in technical jargon, diphones), filed away under suitable categorical labels, and pieced back together as required to read out a new sentence expressively. Much of the science lies in the statistical method that achieves this piecing-together process. For instance, one could naively join fragments based on how they have been used in the past, and this works well if one has access to so much data that most aspects of style are already captured somewhere in the database. Extrapolating from limited data, on the other hand, usually requires some kind of parametric model, whose parameters are calibrated to the limited data available and whose structure is carefully chosen to reflect our overall understanding of how speech is produced.
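The piecing-together step described above is often framed as a search problem: choose, for each required diphone, one recorded candidate so that the total cost is smallest, where the cost combines how well each unit fits the target and how smoothly adjacent units join. Here is a minimal sketch of that idea using dynamic programming; the labels, features and cost functions are all invented toy stand-ins, not CereProc’s actual method.

```python
# Toy sketch of unit selection: pick one candidate unit per target
# diphone, minimising target cost (fit to the target) plus join cost
# (smoothness between consecutive units). All data here is invented.

def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search over candidate units.

    targets    -- list of diphone labels to synthesise
    candidates -- dict: label -> list of recorded units (here, plain
                  numbers standing in for acoustic features)
    """
    # prev maps candidate index -> (cumulative cost, backpointer)
    prev = {j: (target_cost(targets[0], u), None)
            for j, u in enumerate(candidates[targets[0]])}
    history = [prev]
    for i in range(1, len(targets)):
        cur = {}
        for j, u in enumerate(candidates[targets[i]]):
            # cheapest way to arrive at this unit from any previous unit
            best_k, best_cost = min(
                ((k, c + join_cost(candidates[targets[i - 1]][k], u))
                 for k, (c, _) in prev.items()),
                key=lambda kv: kv[1])
            cur[j] = (target_cost(targets[i], u) + best_cost, best_k)
        history.append(cur)
        prev = cur
    # trace back the cheapest path through the candidates
    j, (cost, _) = min(prev.items(), key=lambda kv: kv[1][0])
    path = []
    for step in reversed(history):
        path.append(j)
        j = step[j][1]
    path.reverse()
    return [candidates[t][p] for t, p in zip(targets, path)], cost

# Toy usage: each "unit" is a single number standing in for pitch,
# and the join cost simply prefers smooth pitch transitions.
candidates = {"h-e": [1.0, 2.0], "e-l": [1.5, 3.0], "l-o": [1.4, 2.9]}
units, cost = select_units(
    ["h-e", "e-l", "l-o"], candidates,
    target_cost=lambda label, u: 0.0,    # ignore target fit here
    join_cost=lambda a, b: abs(a - b))   # penalise pitch jumps
```

In a real system the units carry rich acoustic features and the cost functions are learned from data, but the overall shape – a cheapest-path search through a lattice of recorded fragments – is the same.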
The Business of Speech Synthesis
Why do we have products that produce speech? One might justify this based on the fundamental nature of speech: it is one of those uniquely human things that calls out to the scientist to be understood and replicated. However, that is not the only, or even the main, reason why many research groups and companies invest in this enterprise. One socially relevant use of the technology is to restore speech to people who have lost it for medical reasons. A very famous example is that of the film critic Roger Ebert, who lost his voice after an operation and found it again, thanks to an accidental discovery on the internet, through CereProc. Less famous examples include the many computer programs that read out text in surprisingly human-like intonations and accents, innocuously packaged as a “William” or a “Heather”, but more than capable of eliciting an emotional response from the listener – often resulting in requests for the phone number of the speaker. It isn’t all about mimicry, though. One potentially major application is the possibility that a computer could take the copious content generated across the internet, ranging from Twitter feeds to blog posts like this one, and piece it into easy-to-listen-to snippets in the middle of a radio programme or an audiobook, freeing the user from the need for a more involved keyboard-and-mouse interface.
Social and Ethical Issues
Anything that mimics or restores a human capacity seems to come with associated social and ethical dilemmas. At the simplest level, one question that arises from this technology is the extent to which speech is an ‘essential’ capacity in life. Pragmatically put, if I were to lose my ability to speak, should the NHS pay for me to have it restored through a technology such as this? How much would the NHS be willing to pay? Currently, it appears that quality-of-life guidelines make such questions particularly relevant for patients who may lose their speech due to a variety of medical complications.
More generally, what should we make of the possibility that the machine could mimic that last bastion of human expression – emotion and inflection in the spoken word? If our computers were given the capacity to pull at our heartstrings through a carefully chosen turn of phrase and tone, to evoke a memory of loved ones or happy times, what could one do with this power? What should one be allowed to do?
How did the article portray the field?
This article isn’t so much a piece of news as a status update on a technology, and a spotlight on one of the pioneering companies working in this area. In that spirit, the news article is quite well researched and captures the essence of the field, while also giving a fair glimpse into the many facets of such a complex issue – technical, commercial, social and ethical.
The article isn’t without the occasional dramatisation – attributing a chaotic-genius look to the CereProc founder and situating the company “On the sixth floor of a run-down university building in Edinburgh” – but it is all for a good cause, one hopes!