the456

Generating "Simlish"/gibberish from sound samples

First off, apologies if this is in the wrong forum. The Sound/Music forum FAQ seemed to point me here.

I am working on a space sim as a hobby project. I can't do decent voice acting, but I would like something more than just plain text. My idea is to display the text of what is being said while "Simlish"/gibberish plays in the background.

For instance, let's say you were doing this for Luke Skywalker. You could sample his voice in various emotional states, such as when he is talking excitedly in the various SW movies. You would then process it (somehow) in such a way that the individual words are "smeared away" but the voice still has the "essence" of Luke Skywalker (Mark Hamill).

I have no idea whether this is possible or even what to search for. Does anyone have any pointers?

Thanks,
Michael

I suppose you could identify common phonemes and use them as transition points, similar to video texture or motion graphs. But honestly, this would be so damned distracting compared to just not having any voices at all.

One of the characteristics that makes a voice unique (the "essence" that you describe) is its frequency spectrum: the base (fundamental) frequencies and overtones of that particular person. Perhaps you could analyse a person's voice sufficiently to determine a characteristic spectrum and then generate "random" phonemes, or just amplitude-modulated tones, using that spectrum.
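
A rough sketch of the amplitude-modulated tone idea, assuming you have already measured a fundamental frequency and the relative strength of each overtone for the voice. The name `synth_burble`, its parameters and the raised-cosine envelope are illustrative choices only, and a fixed stack of harmonics is much simpler than a real voice:

```python
import math

def synth_burble(f0, harmonic_amps, duration=0.5, sample_rate=16000,
                 syllable_rate=4.0):
    """Sum harmonics of f0 weighted by harmonic_amps, then shape the
    result with a slow envelope so it pulses like syllables.

    harmonic_amps[0] weights f0 itself, harmonic_amps[1] weights 2*f0, etc.
    Returns a list of float samples in the range [-1, 1].
    """
    n = int(duration * sample_rate)
    total = sum(abs(a) for a in harmonic_amps) or 1.0  # avoid divide-by-zero
    samples = []
    for i in range(n):
        t = i / sample_rate
        tone = sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t)
                   for k, a in enumerate(harmonic_amps))
        # Envelope swings between 0 and 1, syllable_rate times per second.
        envelope = 0.5 * (1.0 - math.cos(2 * math.pi * syllable_rate * t))
        samples.append(envelope * tone / total)
    return samples
```

Something like `synth_burble(120.0, [1.0, 0.5, 0.25])` would give half a second of a low, pulsing tone. Varying `f0` and `syllable_rate` over time would be the obvious next step towards something speech-like.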

Interesting problem. I was always interested in speech synthesis, but it wasn't my area of research, so take this with a large pinch of salt. It might at least give you a starting point for your searching.

If I had to do this, I would take my corpus of speech (that's the audio that we are trying to emulate) and first convert it to a formant-based representation. This includes the fundamental frequency, voiced/unvoiced data, amplitude and the vocal tract filter (f1, f2, f3...).

Next I'd search across the corpus for periods where the voice parameters are stationary for a period of time, which would indicate pauses and vowel sounds (and fricatives if unvoiced). Run these through a classifier to break them into a limited symbol set (including a number of different pauses based on the length). Call these V1 to Vn.
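
As a sketch of the search for stationary periods, assume the analysis step has produced one tuple of parameters per frame (say f1 and f2 in Hz) at a fixed frame rate. The function name, the per-frame threshold `max_delta` and the minimum length `min_frames` are all made up for illustration, and this only compares neighbouring frames, so a slow drift would still count as stationary:

```python
def stationary_runs(frames, max_delta, min_frames):
    """Find stretches where the parameters barely change.

    frames: list of equal-length tuples of numbers, one per analysis frame.
    Returns a list of (start, end) index pairs, end exclusive, where every
    parameter changes by at most max_delta between neighbouring frames and
    the stretch is at least min_frames long.
    """
    runs = []
    start = 0
    for i in range(1, len(frames) + 1):
        # A run ends at the end of the data or at a jump in any parameter.
        ended = (i == len(frames)) or any(
            abs(a - b) > max_delta
            for a, b in zip(frames[i], frames[i - 1]))
        if ended:
            if i - start >= min_frames:
                runs.append((start, i))
            start = i
    return runs
```

Each returned run would then be handed to whatever classifier you use to assign it a symbol.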

Overlaying these onto the original corpus would give vowel symbols separated by non-vowel sounds. Then separate out the non-vowel sounds based on the vowel symbols that came before and after them (i.e. collect together all the non-vowel sounds that occur between V1 and V2) and classify them the same way as the vowel sounds. Call these Vi-C1-Vj to Vi-Cn-Vj.

The vowel symbols can be analysed to give the probability of one vowel symbol given the previous one (e.g. if we have 'V1' then it is followed by 'V4' 45% of the time, 'V3' 25% of the time, 'V9' 15% of the time and 'V7' 15% of the time). Start with a particular vowel symbol (for instance a long pause), then pick a random next vowel symbol based on the probability of that vowel following the current one. Repeat as needed (perhaps stopping based on length and on a set of final vowel symbols) to generate a list of vowel symbols. This is called a Markov chain. You can improve it by using more than one previous symbol, so that you work out the probability of each vowel based on the two or more previous vowels, but how far you can take that depends on how big your corpus is.
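
A minimal sketch of that Markov chain, assuming the corpus has already been reduced to a list of vowel symbol strings. The function names are illustrative. The `order` argument is the number of previous symbols used as context, so `order=1` is the plain chain and `order=2` conditions on the two previous vowels:

```python
import random
from collections import Counter, defaultdict

def build_model(symbols, order=1):
    """Count which symbol follows each run of `order` symbols."""
    model = defaultdict(Counter)
    for i in range(len(symbols) - order):
        context = tuple(symbols[i:i + order])
        model[context][symbols[i + order]] += 1
    return model

def generate(model, seed, length, rng):
    """Extend `seed` up to `length` symbols by weighted random choice.

    `seed` must contain exactly as many symbols as the model's order
    (at least one). Stops early if the current context never had a
    follower in the corpus.
    """
    out = list(seed)
    order = len(seed)
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out
```

Usage would look something like `generate(build_model(corpus), ["V1"], 12, random.Random())`. Raw counts work directly as weights, so there is no need to normalise them into percentages first.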

Then pick random non-vowel symbols to fit between the vowel symbols based on their probability of occurrence (e.g. if V1-C1-V2 occurs 50% of the time that V2 followed V1, then the probability of choosing it should be 50%). This gives you a full sentence of utterances, which you need to turn back into an audio stream by concatenating sound fragments. Each fragment contains half a prefix vowel sound, a non-vowel sound and half a suffix vowel sound (i.e. half of Vi, Vi - Cj - Vk, half of Vk, with the start and end at a zero crossing of the waveform).
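
For the concatenation step, one simple way to get the start and end onto zero crossings is sketched below. It assumes each fragment is a list of float samples, and it is my own simplification to use only rising crossings (negative to non-negative) so that the waveform is heading the same way on both sides of every join. The function names are illustrative:

```python
def rising_crossings(samples):
    """Indices i where the waveform goes from negative to non-negative."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] < 0.0 <= samples[i]]

def trim_to_crossings(samples):
    """Keep the part from the first rising crossing up to the last one."""
    idx = rising_crossings(samples)
    if len(idx) < 2:
        return list(samples)  # too few crossings to trim; leave untouched
    return list(samples[idx[0]:idx[-1]])

def concatenate(fragments):
    """Trim every fragment to its crossings and join them end to end."""
    out = []
    for fragment in fragments:
        out.extend(trim_to_crossings(fragment))
    return out
```

Each trimmed fragment starts on a non-negative sample and ends on a negative one, so the joins avoid the large jumps that cause clicks. A short crossfade would probably still help where the vowel halves do not match well.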

This process can be done in one pass by picking the next vowel symbol, then immediately picking the non-vowel symbol before picking the next vowel. It could also be mediated by the text by, for example, forcing pauses at intervals based on the length of the words.
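
The one-pass version might look like the sketch below. It assumes the labelled corpus is a flat list that strictly alternates vowel and non-vowel symbols (V, C, V, C, V, ...), and all the function names are illustrative. It does not attempt the text-driven pauses:

```python
import random
from collections import Counter, defaultdict

def train(units):
    """Count vowel-to-vowel transitions and the non-vowels between them."""
    vowel_next = defaultdict(Counter)  # Vi -> counts of the following Vk
    between = defaultdict(Counter)     # (Vi, Vk) -> counts of C in between
    for i in range(0, len(units) - 2, 2):
        vi, c, vk = units[i], units[i + 1], units[i + 2]
        vowel_next[vi][vk] += 1
        between[(vi, vk)][c] += 1
    return vowel_next, between

def pick(counter, rng):
    """Weighted random choice from a Counter of observed counts."""
    items, weights = zip(*counter.items())
    return rng.choices(items, weights=weights)[0]

def babble(vowel_next, between, start, n_vowels, rng):
    """Generate V, C, V, C, ... picking each vowel and then its non-vowel."""
    out = [start]
    for _ in range(n_vowels - 1):
        cur = out[-1]
        followers = vowel_next.get(cur)
        if not followers:
            break  # this vowel only ever ended the corpus
        nxt = pick(followers, rng)
        out.append(pick(between[(cur, nxt)], rng))
        out.append(nxt)
    return out
```

Because the non-vowel is drawn only from those actually seen between the chosen pair of vowels, every V-C-V triple in the output is one that has a matching sound fragment in the corpus.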

I'd probably create these fragments by computing the average vocal parameters for each vowel sound, finding each instance of Vi, Vi - Cj - Vk, Vk in the corpus, chopping off the first half of Vi and the last half of Vk, and fixing the ends to the average vowel parameters. The instances would be averaged to make a representative fragment (or possibly two) based on the vocal parameters, and then fed back through a speech synthesis model to generate the audio. The vocal parameters might need low-pass filtering to muddy the sound up a bit, otherwise it might be a bit too good, but without listening to the results I can't tell.
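
For the low-pass filtering of the vocal parameters, the simplest thing to try is probably a one-pole smoother run over each parameter track (for example the f1 values frame by frame). This is only a sketch, and the name `smooth_track` and the `alpha` parameter are illustrative. Smaller `alpha` means heavier smoothing, so it would have to be tuned by ear:

```python
def smooth_track(values, alpha):
    """One-pole low-pass filter over a list of numbers.

    Implements y[n] = y[n-1] + alpha * (x[n] - y[n-1]) with 0 < alpha <= 1.
    alpha = 1 returns the input unchanged.
    """
    if not values:
        return []
    out = [values[0]]
    for x in values[1:]:
        out.append(out[-1] + alpha * (x - out[-1]))
    return out
```

For example, a sudden step from 0 to 10 with `alpha=0.5` comes out as 0, 5, 7.5, so fast formant movements get blurred while slow ones pass through mostly intact.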

Depending on how intelligible the output audio is, you might need to generate a list of prohibited sound sequences to stop your random burbles from accidentally saying things that you don't want them to say, such as obscenities.
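
One way to apply such a list is plain rejection: generate a candidate, check it against the prohibited sequences, and try again if it matches. A sketch, assuming sentences and prohibited sequences are both sequences of symbol strings, with illustrative function names:

```python
def contains_banned(symbols, banned):
    """True if any banned sequence appears as a contiguous run in symbols."""
    for seq in banned:
        n = len(seq)
        if n == 0:
            continue  # an empty entry would match everything; ignore it
        for i in range(len(symbols) - n + 1):
            if tuple(symbols[i:i + n]) == tuple(seq):
                return True
    return False

def generate_clean(generate, banned, max_tries=100):
    """Call generate() until the result avoids every banned sequence.

    generate is any zero-argument function returning a list of symbols.
    Returns None if no clean candidate turns up within max_tries.
    """
    for _ in range(max_tries):
        candidate = generate()
        if not contains_banned(candidate, banned):
            return candidate
    return None
```

The hard part is building the list itself, which would mean listening to the output and noting which symbol runs sound like real words.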

It's possible that this is a bit excessive for the application you want, but hopefully this will give you some ideas.

Thanks everyone for the replies. I like the idea of looking at the frequency spectrum. Perhaps doing a Fourier composition...(takes me back to my signals classes in college!).

Thanks again,

Michael
