What does it take for artificially synthesized speech to be suitable for gaming?

10 comments, last by swiftcoder 6 years, 1 month ago

[To the moderator: I selected this forum as TTS technology falls within AI; I found no better alternative. I hope it is acceptable.]

In this post I would like to share opinions about Text-to-Speech (TTS) technology in the context of gaming. I am not a gaming expert; rather, my expertise is in signal and speech processing. I work in the IBM Research division.

I hope this post will trigger a discussion and an exchange of opinions. My purpose here is to discuss “quality”. I plan to discuss technology trends in follow-up posts.

The potential benefit of “good quality” TTS technology for game developers is clear. Yet TTS is still widely perceived as delivering “insufficient quality”. What is “quality” anyway, in our context?

1.      The basic quality of modern TTS is good, in general. State-of-the-art machine learning algorithms enable good prediction of the prosody (“intonation”, duration, loudness, emphasis and more) from the text, and the synthesized speech achieves good scores in subjective quality tests. It is no longer the “robotic” sound it used to be. It sounds natural and “clean”. For applications such as announcements or commercial question answering, modern TTS provides a good alternative.

2.      Natural speech, however, needs correct emphasis on different words across the sentences. Due to the ambiguity of natural language, algorithms (and humans as well) cannot always determine the “correct” emphasis from the text of isolated sentences or utterances without full knowledge of the entire context. This can limit the quality achievable by modern TTS technology.

3.      When we consider gaming applications, additional needs arise. For example, using a formal style for generating the speech for a scene of action would sound weird – where are the emotions? The emotional content in human speech is essential for conveying messages. This is certainly an important aspect of “quality”.

4.      Yet another aspect that relates to “quality”, at least in the broader sense, is the variety of voices. Most modern TTS technologies are based on pre-recorded human voices (recordings of voice talents uttering a large collection of sentences). As recording and processing the speech are expensive and time-consuming, the variety of voices in typical TTS products is limited: often a few, and less commonly several tens of different voices per “major” language. Moreover, gaming often requires non-human voices, such as “cartoonish” ones, to best support the different characters. To summarize, repeatedly using the same voices across games and characters amounts to a less than optimal experience, or in other words, to lower “quality”.

I hope this provides some initial insights from the perspective of a speech technology researcher. I hope to get feedback from the gaming experts. Am I right? What have I missed? What would allow game developers to start benefiting from TTS technology?

All the best, Aharon.

Aharon Satt

I'm interested in item 4. We would have to build a parametric model of the speaker so we can then tweak the parameters to generate many new speakers.

One way to make it happen would be to train something like WaveNet on multiple speakers, conditioning on a small vector that encodes information about the speaker. Part of that vector could describe gender and age, part of the vector could be a learned embedding of the accent (so we can label the speaker as having "British English accent", or "Southern American accent", etc.) and the rest of the vector would be a learned embedding of the speaker. We could then generate new speakers by specifying gender, age and accent, and then generating some random numbers for the specific speaker.

While we are at it, the same conditioning mechanism could be used to indicate some rough emotion of the utterance, addressing item 3 to some extent.
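To make the conditioning idea a bit more concrete, here is a rough sketch in Python of how such a vector might be assembled. Everything in it is an assumption for illustration: the accent table, the emotion list, the vector sizes and the function name `make_speaker_vector` are made up, and in a real system the accent and speaker parts would be learned during training rather than hand-written or drawn from a plain Gaussian.

```python
import random

# Hypothetical values for illustration only. In a trained model the accent
# embeddings would be learned, not hand-written like this.
ACCENT_EMBEDDINGS = {
    "british_english": [0.9, -0.2, 0.1],
    "southern_american": [-0.4, 0.7, 0.3],
}
GENDER_CODES = {"female": [1.0, 0.0], "male": [0.0, 1.0]}
EMOTIONS = ("neutral", "frightened", "happy")
SPEAKER_DIM = 8  # assumed size of the per-speaker embedding


def make_speaker_vector(gender, age, accent, emotion="neutral", seed=None):
    """Concatenate labeled attributes with a random speaker embedding."""
    rng = random.Random(seed)
    gender_part = list(GENDER_CODES[gender])
    # Age is clamped to 0..100 and scaled to 0..1.
    age_part = [min(max(age, 0), 100) / 100.0]
    accent_part = list(ACCENT_EMBEDDINGS[accent])
    # A new "speaker" is just a fresh random draw for this part of the vector.
    speaker_part = [rng.gauss(0.0, 1.0) for _ in range(SPEAKER_DIM)]
    # One-hot slot for a rough emotion label (item 3).
    emotion_part = [1.0 if e == emotion else 0.0 for e in EMOTIONS]
    return gender_part + age_part + accent_part + speaker_part + emotion_part


# Two different random speakers sharing gender, age and accent.
guard_a = make_speaker_vector("male", 45, "british_english", seed=1)
guard_b = make_speaker_vector("male", 45, "british_english", seed=2)
```

The vector would then be fed to the synthesis network as its conditioning input at every time step; that part is not shown here.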

Does this sound feasible?

Hi Alvaro, thanks for the fast response!

Your description, IMO, describes a good approach and sounds feasible to me. It has the potential to generate high-quality voices with fine-grained attributes like gender, age, accent, etc.

In comparison, we have worked on an approach that also makes use of a parametric model of the speaker, but uses "simple" heuristic controls that enable generation of a large variety of voices, including "non-human" ones, with lower dependency on pre-recorded voices to establish the baseline (the embedding). Here is a link to the scientific publication: http://www.isca-speech.org/archive/Interspeech_2017/pdfs/1202.PDF

Two different approaches, each with its own merits.

Aharon Satt

2 hours ago, Aharon Satt said:

What would allow game developers to start benefiting from the TTS technology?

* A system that translates to other languages

* A virtual philosopher that has something to say

:) 

But realistically - personally I have a high interest. I look for something like a library about once a year but have never found any. (I guess the industry has only a little interest: if we need a writer to write the text anyway, we can hire a voice actor as well.)

Some ideas:

Some language that describes intonations, feelings and stuff, e.g. "<frightened 5> Help! The monster! <whispering 2> Hide under the desk!"

Eventually a musician could also add this data in real time by playing along with a keyboard, controlling pitch, mood, etc. (I remember guitarist Steve Vai imitating spoken speech on guitar, and it really works well - he can express those things very accurately, although you cannot understand the words themselves.)

We need laughter and screams, not just speech.

Generating data usable for automatic lipsync and facial animation would be a nice bonus.

We might be happy with something that acts as a placeholder during production, and will be replaced by voice actors when everything is done. For now at least...
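To illustrate the tag idea from the list above, here is a small sketch in Python of how a line like "<frightened 5> Help! ..." could be split into styled segments before being handed to a synthesizer. The tag syntax is taken from the example; the `Segment` type, the `parse_tagged` function and the "neutral" default are assumptions, not any existing library's format.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    style: str      # e.g. "frightened", "whispering"
    intensity: int  # the number after the style name
    text: str       # the words to be spoken in that style


# Matches tags of the form <style intensity>, e.g. <frightened 5>.
TAG = re.compile(r"<(\w+)\s+(\d+)>")


def parse_tagged(line: str, default_style: str = "neutral",
                 default_intensity: int = 0) -> List[Segment]:
    """Split a tagged line into segments; a tag applies until the next tag."""
    segments = []
    style, intensity = default_style, default_intensity
    pos = 0
    for match in TAG.finditer(line):
        text = line[pos:match.start()].strip()
        if text:
            segments.append(Segment(style, intensity, text))
        style, intensity = match.group(1), int(match.group(2))
        pos = match.end()
    tail = line[pos:].strip()
    if tail:
        segments.append(Segment(style, intensity, tail))
    return segments


for seg in parse_tagged(
        "<frightened 5> Help! The monster! <whispering 2> Hide under the desk!"):
    print(seg.style, seg.intensity, seg.text)
```

Each segment could then be synthesized with its own style setting, or the same data could drive facial animation.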

Some games have used TTS for quite some time.  

It isn't fancy, but consider how Animal Crossing has done that for more than a decade. The text doesn't sound like professional voice acting, but when you've played for a while you can understand every utterance.

There are many games where TTS is thematically appropriate. Games set in computerized worlds would be a great natural fit. It is a less natural fit, but many text-based games have also used it for years, particularly with vision-impaired players.

So I think there is an item 0 on your list, the context of the synthesized speech.  

Thank you Frob! In addition, please consider that TTS technology has gone through significant improvement recently, based on new machine learning / deep learning technology. So yes, it isn't fancy, but it does a much better job than it did until recently. If you like, you can listen to a research prototype as an example, and there are additional examples: https://ivva-tts.sl.haifa.il.ibm.com. I am curious to find out what your opinion of this is. Thank you, Aharon

Aharon Satt

Hello JoeJ, 

thank you for your feedback. Specifically,

Some language that describes intonations, feelings and stuff, e.g. "<frightened 5> Help! The monster! <whispering 2> Hide under the desk!" - we are in fact working on a technology that does exactly this. I will find the proper context to describe it in another post.

We need laughter and screams, not just speech - yes, absolutely; the (smaller) challenge here is to adapt some basic laughter sample to the specific voice you are using to synthesize the speech.

many thanks, Aharon

Aharon Satt

On 1/18/2018 at 9:12 AM, Aharon Satt said:

I am curious to find out what is your opinion of this.

My personal opinion is complete indifference about TTS. As I wrote above, it depends entirely on the game.

Some games with enormous budgets and high-end designs require voice acting that truly does need professional voice actors, speech coaches, and high-end recording studios.

Other games can use -- and already use -- rudimentary audio tables for TTS that fits their game design wonderfully.

Naturally the course of software is that it offers more options, higher audio fidelity, and more choices for encoding and use, but that is not an issue of the tech being suitable for gaming in general.

To me the actual issue is the cost of a tech solution versus the cost and the (potentially enormous) benefits of the customized voice acting.  If you've got the budget to create a game worth tens of thousands of lines of voice acting, or even hundreds of thousands of lines of voice acting, then you've got a budget to use a wide range of voice synthesis systems. At that point you can study what is out there, the costs involved, the benefits of the actual actors, and decide based on the details of the game itself.

I already gave the example of Animal Crossing that has been using very simple TTS for nearly two decades with great appeal.

For another example, when I worked on Tiger Woods golf there was an enormous corpus of audio from the announcers, carefully edited so it could be pieced together based on any situation. There could be comments about the distances, the lie of the ball, the quality of the shot, the wind and other environmental factors, and much more. The benefit of having specific well-known individuals far outweighed the costs. Even if we could have used a high end TTS system that made realistic tones and inflections it would have been a terrible mix.
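For readers unfamiliar with this kind of stitched commentary, here is a very rough sketch in Python of the general idea: pick pre-recorded clips according to the situation and play them in sequence. The clip names, categories and the `build_commentary` function are invented for illustration and are not how the actual game was implemented.

```python
# Hypothetical clip table: (category, condition) -> pre-recorded audio file.
CLIPS = {
    ("distance", "short"): "dist_short_01.wav",
    ("distance", "long"): "dist_long_01.wav",
    ("lie", "fairway"): "lie_fairway_01.wav",
    ("lie", "rough"): "lie_rough_01.wav",
    ("wind", "calm"): "wind_calm_01.wav",
    ("wind", "strong"): "wind_strong_01.wav",
}

# Order in which the announcer comments on each aspect of the shot.
ORDER = ("distance", "lie", "wind")


def build_commentary(situation):
    """Return the list of clips to play back-to-back for this situation."""
    playlist = []
    for category in ORDER:
        clip = CLIPS.get((category, situation.get(category)))
        if clip is not None:  # skip aspects with no matching recording
            playlist.append(clip)
    return playlist


print(build_commentary({"distance": "long", "lie": "rough", "wind": "strong"}))
```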

On The Sims series we had a wide range of audio.  We had Simlish clips put together by real humans speaking nonsense syllables, but these could have been strung together by anybody and may have been a good fit for some TTS systems today. However, since the middle of the original Sims 1 series they have brought in big-name musicians to record their own music in Simlish.  Bringing in Katy Perry, Lily Allen, Depeche Mode, Annaca, and other singers is not a choice about the quality of audio technology, but about the singers themselves. Bringing in the big-name singer is far more valuable to the game than any TTS system would be.

In the Littlest PetShop series we used a system similar to what Animal Crossing used. Players commented on how they loved it; it gave the game a great style that a Simlish-style speech engine could not have provided.

These depend entirely on the game, not the quality of the TTS system.

Some games have more than enough with yesterday's technology. Some games can use today's technology.  And some games would not accept the TTS technology no matter how advanced it became, because the voice actors themselves are necessary.  It is all about the games, not the tech.

To my mind, where TTS really has the potential to shine is in large, procedurally generated game worlds. Conversational AI is at the point where it is at least somewhat feasible to have arbitrary conversations with NPCs. For a game with the scale/ambition of something like Skyrim or No Man's Sky, but with free-form conversation, voice acting is liable to be prohibitively expensive (both in monetary and storage terms).

This isn't a new idea - Douglas Adams shipped a game with free-form conversations in the 90s, but ended up relying on massive amounts of pre-recorded audio for the responses. We're ever closer to lifting that limitation.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Hello,

We would like to offer our beta-level service for free for about half a year from now, until Aug 10, 2018. We hope this makes it attractive enough for game developers to try the tool and use it. Our purpose is to gain feedback to improve the tool. We encourage you to read the help to get information about the new capabilities and explore them.

Any audio content generated through this tool (https://ivva-tts.sl.haifa.il.ibm.com/) until Aug 10, 2018, can be used for free, forever, including for commercial purposes.

I will post more details about the technology behind the scenes over the next weeks.

Aharon

Aharon Satt
