Generating "Simlish"/gibberish from sound samples

Started by
4 comments, last by Zahlman 14 years, 1 month ago
First off, apologies if this is in the wrong forum. The Sound/Music forum FAQ seemed to point me here. I am working on a space sim as a hobby project. I can't do decent voice acting, but I would like something more than just plain text. What I was thinking was that I would display the text of what is being said while "simlish"/gibberish plays in the background.

For instance, let's say you were doing this for Luke Skywalker. You could sample his voice in various emotional states - for example, when he is talking excitedly in the various SW movies. You would then process it (somehow) in such a way that the individual words are "smeared away" but the voice still has the "essence" of Luke Skywalker (Mark Hamill). I have no idea whether this is possible or even what to search for. Does anyone have any pointers?

Thanks,
Michael
I suppose you could identify common phonemes and use them as transition points, similar to video texture or motion graphs. But honestly, this would be so damned distracting compared to just not having any voices at all.
One of the characteristics that makes voices unique (the "essence" that you describe) is the frequency spectrum for a particular person - base frequencies and overtones. Perhaps you could analyse a person's voice sufficiently to determine a characteristic spectrum and generate "random" phonemes or just amplitude modulated tones using that spectrum.
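A minimal sketch of that second idea in Python, assuming you have already picked out a characteristic set of overtones by hand (the partials below are made-up numbers, not measured from any real voice):

```python
# Sketch: synthesize a gibberish "syllable" from a characteristic spectrum.
# The partials list is a hypothetical example, not data from a real speaker.
import numpy as np

SAMPLE_RATE = 22050

def burble(partials, duration=0.3, f0=120.0):
    """Generate one syllable from (harmonic, relative_amplitude) pairs."""
    t = np.linspace(0.0, duration, int(SAMPLE_RATE * duration), endpoint=False)
    tone = np.zeros_like(t)
    for harmonic, amp in partials:
        tone += amp * np.sin(2.0 * np.pi * f0 * harmonic * t)
    # Half-sine amplitude envelope so the syllable fades in and out
    # instead of clicking at the edges.
    envelope = np.sin(np.pi * t / duration)
    tone *= envelope
    return tone / np.max(np.abs(tone))  # normalize to [-1, 1]

# Hypothetical spectrum: strong fundamental, weaker overtones.
syllable = burble([(1, 1.0), (2, 0.5), (3, 0.3), (4, 0.15)])
```

Stringing several of these together with randomly varied f0 and duration already gives a crude burble; it won't sound like a specific person until the partials come from analysing their recordings.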

Please don't PM me with questions. Post them in the forums for everyone's benefit, and I can embarrass myself publicly.

You don't forget how to play when you grow old; you grow old when you forget how to play.

Interesting problem. I was always interested in speech synthesis, but it wasn't my area of research, so this should be taken with a large pinch of salt - still, it might give you a starting point to search from.

If I had to do this, I would take my corpus of speech (that's the audio that we are trying to emulate) and first convert it to a formant-based representation. This includes the fundamental frequency, voiced/unvoiced data, amplitude and the vocal tract filter (f1, f2, f3...).

Next I'd search across the corpus for periods where the voice parameters are stationary for a stretch of time, which would indicate pauses and vowel sounds (and fricatives if unvoiced). Run these through a classifier to break them into a limited symbol set (including a number of different pauses based on their length). Call these V1 - Vn.
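A rough sketch of the stationary-stretch search, assuming the corpus has already been reduced to one feature vector per frame (e.g. [f0, f1, f2, amplitude]); the threshold and minimum length are placeholder values you would tune by ear:

```python
# Sketch: find runs of frames where the voice parameters barely change.
# `frames` is assumed to be a 2-D array, one feature vector per row.
import numpy as np

def stationary_runs(frames, threshold=0.1, min_len=5):
    """Return (start, end) index pairs of stretches with small frame-to-frame change."""
    diffs = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    runs, start = [], None
    for i, d in enumerate(diffs):
        if d < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(diffs) - start >= min_len:
        runs.append((start, len(diffs)))
    return runs
```

Each returned run is a candidate pause, vowel or fricative to feed to the classifier.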

Overlaying these onto the original corpus would give vowel symbols separated by non-vowel sounds. Then separate out the non-vowel sounds based on the vowel symbols that came before and after them (i.e. collect together all the non-vowel sounds that occur between V1 and V2) and classify them the same way as the vowel sounds. Call these Vi-C1-Vj to Vi-Cn-Vj.

The vowel symbols can be analysed to give the probability of one vowel symbol given the previous one (i.e. if we have 'V1' then it is followed by 'V4' 45% of the time, 'V3' 25% of the time, 'V9' 15% of the time and 'V7' 15% of the time). Start with a particular vowel symbol (for instance a long pause), then pick a random next vowel symbol based on the probability of that vowel following the current one. Repeat as needed (perhaps based on length and a final vowel symbol) to generate a list of vowel symbols. This is called a Markov chain. You can improve it by using more than one previous symbol, so you work out the probability of each vowel based on the two or more previous vowels, but that depends on how big your corpus is.
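The Markov-chain step above can be sketched in a few lines of Python, assuming the corpus has already been reduced to a sequence of vowel symbols (the toy corpus and the 'P' pause symbol here are made up for illustration):

```python
# Sketch: first-order Markov chain over vowel symbols.
import random
from collections import defaultdict

def build_transitions(symbol_sequence):
    """Count how often each symbol follows each other symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(symbol_sequence, symbol_sequence[1:]):
        counts[prev][nxt] += 1
    return counts

def next_symbol(counts, current, rng=random):
    """Pick the next symbol weighted by how often it followed `current`."""
    followers = counts[current]
    symbols = list(followers)
    weights = [followers[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]

# Toy corpus: 'P' stands for a pause symbol.
corpus = ["P", "V1", "V4", "V1", "V3", "P", "V1", "V4", "V2", "P"]
counts = build_transitions(corpus)

chain = ["P"]          # start from a pause
while len(chain) < 8:
    chain.append(next_symbol(counts, chain[-1]))
```

Moving to a second-order chain just means keying the counts on the previous two symbols instead of one.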

Then pick random non-vowel symbols to fit between the vowel symbols based on their probability of occurrence (i.e. if V1-C1-V2 occurs 50% of the time that V2 followed V1, then the probability of choosing it should be 50%). This gives you a full sentence of utterances, which you need to turn back into an audio stream by concatenating sound fragments containing half a prefix vowel sound, a non-vowel sound and half a suffix vowel sound (i.e. half of Vi, Vi - Cj - Vk, half of Vk, with the start and end at a zero crossing of the waveform).
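The connector-picking step can be sketched the same way, assuming you have counted how often each non-vowel symbol occurred between a given vowel pair (the counts below are hypothetical, not from a real corpus):

```python
# Sketch: fill in non-vowel connectors between an already-generated vowel chain.
import random

# Hypothetical counts: connector_counts[(Vi, Vk)][Cj] = occurrences in corpus.
connector_counts = {
    ("V1", "V2"): {"C1": 5, "C2": 5},
    ("V2", "V3"): {"C1": 2, "C3": 8},
}

def pick_connector(vowel_a, vowel_b, rng=random):
    """Pick a connector weighted by how often it appeared between this vowel pair."""
    followers = connector_counts[(vowel_a, vowel_b)]
    symbols = list(followers)
    weights = [followers[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]

# Interleave connectors into a vowel chain: V1 C? V2 C? V3
vowels = ["V1", "V2", "V3"]
utterance = []
for a, b in zip(vowels, vowels[1:]):
    utterance += [a, pick_connector(a, b)]
utterance.append(vowels[-1])
```

Each (Vi, Cj, Vk) triple in the resulting list then names one audio fragment to splice in.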

This process can be done in one pass by picking the next vowel symbol and then immediately picking the non-vowel symbol before picking the vowel after it. It could also be mediated by the text by, for example, forcing pauses at intervals based on the length of the words.

I'd probably create these fragments by computing the average vocal parameters for each vowel sound, finding each instance of Vi, Vi - Cj - Vk, Vk in the corpus, chopping off the first half of Vi and the last half of Vk, and fixing the ends to the average vowel parameters. The instances would be averaged to make a representative fragment (or possibly two) based on the vocal parameters, which would then be fed back through a speech synthesis model to generate the audio. The vocal parameters might need low-pass filtering to muddy the sound up a bit, otherwise it might be a bit too good, but without listening to the results I can't tell.

Depending on how intelligible the output audio is, you might need to generate a list of prohibited sound sequences to stop your random burbles from accidentally generating things that you don't want them to say, such as obscenities.
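A trivial sketch of such a filter, assuming generated utterances are lists of symbol names (the banned entry is a placeholder, standing in for whatever sequences turn out to sound rude):

```python
# Sketch: reject generated symbol sequences containing banned subsequences.
def is_allowed(symbols, banned):
    """Return False if the joined symbol sequence contains any banned pattern."""
    joined = "-".join(symbols)
    return not any(bad in joined for bad in banned)

# Hypothetical banned sequence; in practice you'd build this list by listening.
BANNED = ["V1-C3-V2"]
```

You would regenerate (or re-pick the offending connector) whenever `is_allowed` returns False.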

It's possible that this is a bit excessive for the application you want but hopefully, this will give you some ideas.
Thanks everyone for the replies. I like the idea of looking at the frequency spectrum. Perhaps doing a Fourier decomposition... (takes me back to my signals classes in college!).

Thanks again,

Michael
Useful Googling terms may include "formant frequency" and "granular synthesis".

