Archived

This topic is now archived and is closed to further replies.

Thread

FFT DSP Signal processing?


Hello, who can help me with C/C++ source for an FFT, DSP, or other filter functions for analyzing analog audio waveforms? Thanks

quote:
Original post by Thread
Hello, who can help me with C/C++ source for an FFT, DSP, or other filter functions for analyzing analog audio waveforms?


Can you be more specific about what you need to do? There are too many signal processing and time-series techniques to list here.

-Predictor
http://will.dwinnell.com



Hi Predictor,
To be more specific about what I want to do:
I am working on an application that analyzes the analog wave shape of speech (not music), for example the waveform of "Hello".
The samples are unsigned bytes: 128 is silence, 0 is the maximum negative excursion and 255 the maximum positive excursion, so the midpoint of the wave sits at 128.
To sum the energy I fold the low and high halves around that midpoint and add them: a low sample of 45 contributes 128 - 45 = 83 and a high sample of 235 contributes 235 - 128 = 107 (you cannot simply add 45 and 235, which would exceed the 8-bit maximum of 255), so the energy for those two samples is 107 + 83 = 190. This value has to open the mouth (lips). Now I am working on the frequency of the wave: how many waves there are per second.
I want the energy of the analog waveform.
This energy must drive the mouth of a 2D/3D character on screen, and in the future a servo on a real 3D character.
How many samples of the analog sound do I need to translate the wave (its frequency) into a mouth opening?
How should I filter this sound[n] data? FFT, some other DSP technique, or something else?
Now I take 50 Hz samples: 22050 Hz / 50 Hz = 441 samples.
Here is an example of my energy code.
Maybe a Fast Fourier Transform of this data[441] would be better?

//--------------------------------------------------------------
// Name: ShowEnergy()
// Desc: Display the energy level.
//--------------------------------------------------------------
void ShowEnergy(HWND hDlg, DWORD nIndex)
{
    if(Eng < Freq)                       // Freq = 22050 samples / 50 Hz = 441
    {
        nWav = TempBuffer[nIndex];       // soundbuffer[n]

        if(nWav >= 128)                  // high-amplitude half of the wave
        {
            nWav = nWav - 128;           // e.g. 245 - 128 = 117
        }
        else
        {
            nWav = 128 - nWav;           // low-amplitude half, folded upward
        }

        // only count samples above the threshold (0..128 after folding)
        if(nWav > (int)dwThresHold)
            tWav += nWav;

        Eng++;
    }
    else
    {
        Eng = 1;
        if(tWav > Freq)
        {
            tWav = tWav / Freq;          // average the energy over the 441 samples
            SendMessage(GetDlgItem(hDlg, IDC_PROGRESS1), PBM_SETPOS, tWav, 0);
            //Sleep(100);
            SetDlgItemInt(hDlg, IDC_ENERGY, tWav, FALSE);
        }
        else
        {
            SendMessage(GetDlgItem(hDlg, IDC_PROGRESS1), PBM_SETPOS, 0, 0);
            SetDlgItemInt(hDlg, IDC_ENERGY, 0, FALSE);
        }
    }
} // end ShowEnergy

regards Thread,
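As an aside on the question above ("maybe a Fast Fourier Transform of this data[441] would be better?"): below is a minimal sketch, not from the original post, of what a spectrum of one 441-sample block could look like. For clarity it uses a naive O(N*N) DFT rather than a real FFT library, and it assumes the same unsigned 8-bit samples centred on 128 as the ShowEnergy() code; the function and constant names are made up for illustration.

#include <math.h>

#define WINDOW  441          /* 22050 Hz / 50 Hz, as in the post above */
#define SILENCE 128.0        /* unsigned 8-bit midpoint                */

/* samples: one block of 441 unsigned 8-bit values
   magnitude: WINDOW/2 output bins; bin k corresponds to k * 50 Hz     */
void ComputeSpectrum(const unsigned char samples[WINDOW],
                     double magnitude[WINDOW / 2])
{
    int k, n;
    for(k = 0; k < WINDOW / 2; k++)
    {
        double re = 0.0, im = 0.0;
        for(n = 0; n < WINDOW; n++)
        {
            double x     = (double)samples[n] - SILENCE;  /* centre on 0 */
            double angle = 2.0 * 3.14159265358979 * k * n / WINDOW;
            re += x * cos(angle);
            im -= x * sin(angle);
        }
        magnitude[k] = sqrt(re * re + im * im) / WINDOW;
    }
}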

I'm moving this thread to the Maths & Physics forum... while there is an aspect of AI in the problem, the specific question is far more directed at M&P, and thus more help is likely to come from that forum.

Timkin

First: I've never dealt with this kind of problem, nor even thought about it before, but since you haven't got any reply yet, I'll try to help you by giving a few remarks:

a) Generally the energy of a wave is the integral over the squared amplitude. The squared amplitude thus is the energy density.

b) I don't see any connection between the energy of a sound wave and the mouth form. In fact you can emit quite a range of sounds without even opening your mouth. The energy density should mainly depend on how much you exhale.

c) The same goes for frequency. I'd even say the main frequency (now talking about the FT) is generated in the vocal cords.

d) To make things worse: I'd even say the vocal cords don't emit a single sinusoid wave but also subfrequencies. But if you're lucky you can assume that those are strongly suppressed (amplitude-wise).

e) The same as for the vocal cords goes for the tongue. It also plays an important part in generating sound.

f) To make things even worse: I wouldn't count on the mouth form being dependent only on the sound you actually emit; it might also depend on the sound you emitted before. You have two "parameters" that both move with a finite "speed": mouth form and ground tone. Two different parameterizations might still end up producing a similar sound. Remember that speaking is a process that takes even the human brain quite some time to learn, even though it's optimized for that. A linguist might be able to answer that question.


If you really want to analytically derive a mouth form for a given sound, I'd try (just a guess of mine; it will probably not work) the following (a rough code sketch follows the list):
1) Get the FT of the sound.
2) Determine the main frequency.
3) Extract the relative positions of the most relevant subfrequencies.
4) Hope you find a rule for how to form the mouth depending on the relative position(s) (the higher the subfrequencies, the more the mouth is opened, I'd guess).
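Here is a rough sketch (editor's addition, just to make those steps concrete) of how steps 1)-4) might look in C, assuming a magnitude spectrum such as the one from the DFT sketch earlier in the thread; the linear frequency-to-mouth mapping and the 100-1000 Hz range are pure placeholders for the "rule" in step 4):

#define BIN_HZ 50.0   /* frequency resolution of a 441-sample block at 22050 Hz */

int MouthFromSpectrum(const double magnitude[], int bins, int maxMouthPos)
{
    int    k, mainBin = 1;               /* skip the DC bin                  */
    double mainMag = magnitude[1];
    double mainHz, t;

    for(k = 2; k < bins; k++)            /* 2) find the main frequency       */
    {
        if(magnitude[k] > mainMag)
        {
            mainMag = magnitude[k];
            mainBin = k;
        }
    }
    mainHz = mainBin * BIN_HZ;

    /* 4) placeholder rule: open the mouth more for a higher dominant
       frequency, clamped to an assumed 100..1000 Hz speech range            */
    t = (mainHz - 100.0) / (1000.0 - 100.0);
    if(t < 0.0) t = 0.0;
    if(t > 1.0) t = 1.0;
    return (int)(t * maxMouthPos);
}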

My assumption on how other programs deal with this:
They have a set of different mouth forms with a (list of) sound(s) related to each, derived empirically. Then they compare the actual sound (perhaps by FFT?) with the list and choose the mouth form whose associated sound fits best.
That's more straightforward and probably gets better results.

Summary:
Your post first sounded very weird to me, and the use of unsigned bytes really made things look unnecessarily unreadable. But I think I started to understand what you were trying to ask, so my answer would be:

Try FTs and try to find the best-fitting mouth form from a list which you generate by doing several vowels, hums and hisses while watching your mouth in a mirror.

Thanks, Atheist.
Maybe I'll go with your suggestion: compare the actual sound with sounds in memory, map them together with some threshold(n),
and then figure out the mouth opening from that.
Thread

quote:
Original post by Thread
Now I take 50 Hz samples: 22050 Hz / 50 Hz = 441 samples.



A 50 Hz sampling rate is nowhere near enough. To accurately represent a voice, you need at least an 8 kHz sampling rate (well, you *need* a bandwidth of 3.4 kHz, so technically you could get away with using 6.8 kHz, but you add a little extra to make sure). It won't be perfect, but it is enough to distinguish voice (8 kHz is what the UK telecommunications network uses (the landline one - the GSM networks use something less than this, hence why it isn't quite the same quality)). Look up Nyquist's sampling theorem, and aliasing.
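As a small worked example of that Nyquist point (editor's addition; the numbers are purely illustrative): a tone above half the sampling rate does not disappear, it shows up at a lower, wrong frequency.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double fs    = 8000.0;   /* telephone-quality sampling rate           */
    double f     = 5000.0;   /* tone above the Nyquist limit fs/2 = 4 kHz */
    double alias = fabs(f - fs * floor(f / fs + 0.5));

    printf("A %.0f Hz tone sampled at %.0f Hz appears at %.0f Hz\n",
           f, fs, alias);    /* prints 3000 Hz                            */
    return 0;
}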

If you want to do a sort of voice recognition thing, then I suggest you do an FFT and compare it to a base FFT (look for similarities in the distance between harmonics, and so forth). Firstly, I'd try to get it to recognise a simple single-frequency signal, and then try to move to something like voice. If you want to model a mouth by streaming an arbitrary waveform through a system of some sort, then it gets FAR more complicated. Moving the mouth up and down is a pretty simple process (i.e. Half-Life), you can just use the power of the signal at any particular point (you can average a bit if you want), and set the mouth position accordingly (so, more power -> more open mouth). To do this you'll need to look up power density spectra of transient signals, and other related things. I can recommend a few books on the matter if you want. However, if you want to actually model the mouth realistically (i.e. have it morph and change shape, rather than just open and close), I really don't know how you'd do this. You'd need some sort of predefined table of mouth actions for certain types of sound, do some sort of preprocessing to see how much of each sound is in the signal, and perhaps merge these weights against each of the animations to provide a net output. Dunno really what I'm talking about there, but it seems an application for some fuzzy variables.
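A minimal sketch (editor's addition, not Half-Life's actual method) of the "more power -> more open mouth" idea: average the power over a short block and map it linearly onto a mouth position. The block length, the full-scale constant and the function name are assumptions.

int MouthFromPower(const unsigned char *samples, int count, int maxMouthPos)
{
    int    n;
    double x, sum = 0.0, power, t;

    for(n = 0; n < count; n++)
    {
        x    = (double)samples[n] - 128.0;   /* centre on the silence level */
        sum += x * x;                        /* instantaneous power         */
    }
    power = sum / count;                     /* average power of the block  */

    t = power / 16384.0;                     /* assumed full scale: 128^2   */
    if(t > 1.0) t = 1.0;
    return (int)(t * maxMouthPos);           /* more power -> more open     */
}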

You have to remember that you're unique, just like everybody else.

Hi python_regious,
About what you say about the power:
quote:
you can just use the power of the signal at any particular point (you can average a bit if you want), and set the mouth position accordingly (so, more power -> more open mouth).

Yes, this is just what I want.
Voice recognition is not necessary.
I am building a character (puppet) like Kermit (c) J. Henson,
the green frog from Sesamstraat (the Dutch Sesame Street).
Only it is not driven by hands but by servo actuators, and it will
interact with captured webcam input...
regards Thread. Cogito ergo sum, I think

Ah cool, that idea would tie in nicely with the input of a digital control system then. Well, it wouldn't be much harder to implement it with an analogue design either.

You have to remember that you're unique, just like everybody else.

quote:
Original post by python_regious
Moving the mouth up and down is a pretty simple process (i.e. Half-Life), you can just use the power of the signal at any particular point (you can average a bit if you want), and set the mouth position accordingly (so, more power -> more open mouth).


I still don't think this is accurate, but if you say HL used it, I guess it will be at least an OK solution (well, you didn't actually say HL is looking at the power of the signal, but I interpret your comment this way).
Anyway: you can't just take the power of the signal at a single point (you HAVE TO integrate), because for, say, a sinusoid signal of 1 kHz, either (if your program runs as fast as the sampling rate) the mouth will be set to open/close 2000 times a second, or (if the program runs slower) it will be set to rather random values between 0 and max.
In fact, for determining the average power of a signal a low sampling rate might be sufficient. Assume that with a sampling rate of 1 kHz the values you get (I still assume the values are the actual pressures of the wave; if it's the averaged power over the time intervals you are checking, then just take say 20 Hz and pick the current point) over a range of 100 points are rather randomly distributed, with a probability for each value that corresponds to the waveform. If you add up the power over those 100 points you'll get a relative standard error of 10%, which should be OK (practical implementation will show), and a time delay < 0.1 s (dunno if that's OK).

EDIT: one thing that made me wonder:
quote:

To do this you'll need to look up power density spectra of transient signals, and other related things.


Isn't the square of the signal the power? And what's a power density, then? If not, all I wrote about power and energy density (= power) might be completely wrong.

One more idea that's come to my mind (I hope what I'm writing actually helps you and isn't absolutely useless/confusing) is to take the FT of the signal, identify the dominant frequency (highest peak in the FT) and set the mouth according to this. The idea is that for higher tones you open your mouth more than for lower tones (I remember the "A, E, I, O, U, and your mouth is closed" rhyme from my first year of school, although it doesn't rhyme in English, plus I'm not completely certain those vowels really have a decreasing lead frequency).

btw.: If you've got the chance to ask a linguist (e.g. if you're a university student) I'd take the chance, knock on his door and ask him whether he knows a connection between mouth form and the tone spoken. Maybe he won't be able to tell you "the dependency on the physical signal is like this", but you might get some new ideas, and in my experience many people at university are glad to tell others about their studies.

[edited by - Atheist on March 19, 2004 4:30:53 PM]

To everybody, thanks for the many replies.
This weekend I am going to work out the answers.
Also the FFT link from Magmai Kai Holmlor.
Maybe the weekend is too short?
Regards and thanks, Thread

quote:
Original post by Atheist
I still don't think this is accurate, but if you say HL used it, I guess it will be at least an OK solution (well, you didn't actually say HL is looking at the power of the signal, but I interpret your comment this way).


I have no idea how Half-Life did it. It was simply a suggestion of how they might have done it. It's what I'd do anyway.

quote:

Isn't the square of the signal the power? And what's a power density, then? If not, all I wrote about power and energy density (= power) might be completely wrong.



Ok.. Some math, and stuff.

The average normalised power of a periodic signal is 1/T times the integral of |f(t)|² between the limits t and t + T, where f(t) is the function of the signal in the time domain and T is the time period you're averaging over. Now, this is great for periodic signals, but they never exist, nor can they ever practically exist. Plus, they're pretty pointless because they carry no noise, and hence no information. Also note that periodic signals have discrete/line amplitude spectra. Because of this, periodic signals have a discrete normalised power spectrum (take the trigonometric coefficients of the Fourier series of the signal, divide by sqrt(2), and then square the resulting amplitude spectrum).

Now, transient signals are those which are localised in time, i.e. they don't occur all the time - so time-limited signals are included here. Because of this, the average power of the signal is 0 (taken over all time), hence why I said take the instantaneous power (or a little averaged). The total energy of the signal, however, is finite. Hence they are sometimes called energy signals. Though, you may be able to use the energy of the signal rather than the power... Now, since you can't perform a Fourier series analysis on a transient signal, you have to do a Fourier transform. If you remember correctly, the Fourier transform produces a continuous frequency spectrum, which is then sampled at certain points (this is where aliasing and spectral leakage come in). Now, if you square the continuous amplitude spectrum, you actually get the energy density spectrum. From this you can pull out the energy of the signal at various frequencies. From that I suppose you can make the mouth do different things...

Hmmm... Interesting.
If you use the energy of the signal, that might work. Though... I don't know how much you'd want to take into account when you're calculating the EDS, and I also don't know how useful the energy would be when considering the motion of the mouth. I mean, you could do it for the entire signal, but that'd be pointless. If you have a sample rate of something like 8 kHz for your voice signal, then you could probably get away with something like a 512-sample set for every FFT you perform... You'd then get a "localised" EDS, I suppose. However, make sure you choose a decent window for your FFT, otherwise you'll get shite loads of spectral leakage.
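A sketch of that 512-sample, windowed analysis (editor's addition): apply a Hann window to limit spectral leakage, transform, and square the magnitudes to get a discrete energy spectrum. The fft() routine is assumed to be provided by whatever FFT library is used; only the windowing and squaring steps are shown.

#include <math.h>

#define N 512   /* suggested block size for an 8 kHz voice signal */

/* assumed external FFT: in[N] time samples -> re[N], im[N] spectrum */
void fft(const double in[N], double re[N], double im[N]);

void EnergySpectrum(const double block[N], double energy[N / 2])
{
    double windowed[N], re[N], im[N];
    int    n, k;

    for(n = 0; n < N; n++)     /* Hann window against spectral leakage */
        windowed[n] = block[n] * 0.5 * (1.0 - cos(2.0 * 3.14159265358979 * n / (N - 1)));

    fft(windowed, re, im);

    for(k = 0; k < N / 2; k++) /* squared magnitude ~ energy per frequency bin */
        energy[k] = re[k] * re[k] + im[k] * im[k];
}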

Hmmm... It's been a while since I've discussed this, mind (I had to go back over some of my notes to jog the memory), so don't take everything I say as totally true.

quote:

if it's the averaged power over the time intervals you are checking, then just take say 20 Hz and pick the current point).



Yes, I agree. You don't want to use the instantaneous power at such a high sample rate. I did mean to average over a sufficient time period (your 20 Hz seems reasonable). I suppose you could take the RMS as well if you felt the need.

quote:

btw.: If you've got the chance to ask a linguist (e.g. if you're a university student) I'd take the chance, knock on his door and ask him whether he knows a connection between mouth form and the tone spoken. Maybe he won't be able to tell you "the dependency on the physical signal is like this", but you might get some new ideas, and in my experience many people at university are glad to tell others about their studies.



That seems a pretty crucial point. If the frequency of the tone has nothing to do with the mouth shape, then all this is pretty pointless.

You have to remember that you're unique, just like everybody else.

Hi, I am working on an application which is supposed to recognise vowels from a wave file. I recorded the vowel "e".
After processing it with a Hamming window and an FFT, I have to find the main frequencies of this wave (the formants of "e").
How can I find these formants from the FFT result for the wave???

It would be very helpful.

Thanks

My idea would be to find the main frequency of the signal. You'd call this the frequency of the tone. Of course it doesn't tell you much about the actual vowel spoken, since you can speak a vowel higher or lower (frequency-wise). Then find the frequencies of the next-highest peaks. Their positions relative to the main peak might be typical for certain vowels (just like an instrument has typical sub-frequencies).

I cannot tell you which relative positions belong to which vowel. You had best record them all, Fourier-transform them and look at the pictures of the FT. Maybe you'll be able to identify a pattern.
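One possible way to code that suggestion (editor's sketch, with the same caveat that it is only a guess): find the largest peak and then the next few local maxima, and record their positions relative to the main peak. The 10% amplitude cut-off is an assumption.

/* mag[]: FFT magnitudes, bins entries; ratios[]: output, relative positions */
int FindPeakRatios(const double mag[], int bins, double ratios[], int maxRatios)
{
    int k, found = 0, mainBin = 1;

    for(k = 2; k < bins - 1; k++)            /* main frequency = highest peak */
        if(mag[k] > mag[mainBin])
            mainBin = k;

    for(k = 2; k < bins - 1 && found < maxRatios; k++)
    {
        int isPeak = (mag[k] > mag[k - 1]) && (mag[k] > mag[k + 1]);
        /* keep sub-peaks above an assumed 10% of the main peak */
        if(isPeak && k != mainBin && mag[k] > 0.1 * mag[mainBin])
            ratios[found++] = (double)k / (double)mainBin;
    }
    return found;                            /* number of sub-peaks recorded */
}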

There were many links about "voice recognition fourier" on Google. I didn't bother to really check them, but one that looked quite interesting and nicely written was:
http://www.gvsu.edu/math/wavelets/student_work/Hoekstra/voice_recognition.htm

About the problem with short time intervals and FTs: it might be possible that you can use relatively long time intervals for recognizing vowels, if either
a) vowels are usually pronounced longer, or
b) in common language they don't have any relevant low frequencies.

For checking a) you could record a word and then cut out parts of the signal until only a vowel is left (and you can still comprehend it). Check the length of the signal. For b) you'll have to look at the FTs of your vowel signals.

Well, the thing about "look for the relative positions of the sub-frequencies" is only a guess of mine, but since you can speak the same vowel louder or softer (height of the signal) and with different pitches (position of the main peak), there's not much data left to build your vowel recognition on.

btw.: It would be interesting to hear what Thread achieved, because most of what python_regious and I said here were pure assumptions, and I'd be very interested in experimental results.

[edited by - Atheist on April 15, 2004 6:20:58 PM]

Hello, I have to say I don't need voice recognition.
What I am building is a 2D on-screen character and a 3D character in a real-world environment.
The character has to move his mouth in sync with the (speech) wav sound.
At the moment I can say it is beginning to work.
The computer also has to drive the animatronics around the character, so I cannot use all the CPU power for speech. When this works I will go on with the webcam vision input (more power and more power); maybe Intel will follow me and this will not be a problem in the future>>>??(;>), or else euros (dollars) are the trouble. Voice recognition needs a lot of CPU power when it is running in a real-time interactive application.
My project (sorry, for now only in Dutch):
http://home.hetnet.nl/~creasoft/index.html
and the project:
http://home.hetnet.nl/~creasoft/Synchrone.html
Regards Thread. (;>) cogito ergo sum
The code: MoveMouth
The code: MoveMouth
//-----------------------------------------------------------------------------
// Name: MoveMouth()
// Desc: Open the mouth.
//-----------------------------------------------------------------------------
bool MoveMouth()
{
    int iPos = 0;
    tWav = 0;

    if(IsSoundPlay(0))                        // is a sound playing?
    {
        // move the mouth
        dwPlayCursor = GetPlayCursor(0);      // cursor in the sound buffer

        if(dwPlayCursor < (dwDSBufferSize+Sample))
        {
            // loop bound reconstructed: sum over one block of Sample samples
            for(Eng = 0; Eng < Sample; Eng++)
            {
                fvalue = ((float)GetSoundData(dwPlayCursor+Eng));

                if(fvalue >= fSilence)        // static float fSilence = 127.5f;
                {
                    fvalue = fvalue - fSilence;
                }
                else
                {
                    fvalue = fSilence - fvalue;   // low half, folded upward
                }
                // only count samples above the threshold
                if(fvalue > (float)dwThresHold)
                {
                    tWav += fvalue;
                }
            } // end for
        } // end if
        else iPos = 0;

        if((int)tWav > Sample)
        {
            tWav = (int)tWav/Sample;          // average the energy over the block
            iPos = (tWav/9);
        }
        else
        {
            iPos = 0;
        }

        if((iPos < 15) && (iPos > -1))
        {
            //MouthAnimate(iMouthPos);
            // the graphics or the servo output
            g_MouthSprite.SetSourceRect(rHead[iPos]);

            if(bAnalyse)                      // save data for analysis
            {
                AnimData[dwMouthPos].dwCursor = GetPlayCursor(0);
                AnimData[dwMouthPos].bPos = (BYTE)iPos;
                dwMouthPos++;
            }
        }
        else
        {
            g_MouthSprite.SetSourceRect(rHead[0]);
        }
    } // end if sound is playing

    iMouthPosOld = iMouthPos;

    return TRUE;
} // end MoveMouth

Hi, well, I read your suggestion. It is nice, but I think I didn't make myself clear. I don't want to recognize a vowel from within a word; I want to recognize just a vowel spoken on its own, not as part of a word. So I have to find the main frequencies (formants) which represent the vowel. I think it's easier than what you thought. Do you have an idea?

Thanks again

'Lo all.

Firstly, I've just come back off a diving holiday, so my brain is nowhere near in gear for a technical discussion, but I'll try...

If you're just wanting to find the spectral distribution of a signal (in your case, the vowels), then just perform an FFT. You'll then have the spectrum of the signal (its frequency distribution). To recognise it, I suppose you could do some sort of comparison with a base frequency distribution that you know to be true. You couldn't do a direct comparison of course; perhaps some sort of fuzzy algorithm could be used for that.

You have to remember that you're unique, just like everybody else.

Hi,
well, we checked it against the base frequencies of the vowel.
You were right; we didn't find the main frequencies as given in the book. Can you give me some idea of what a fuzzy algorithm is
(do you mean something in fuzzy logic???)?
Do you know of a specific fuzzy algorithm which is used in speech recognition??
Thanks for your help. We are starting to get an idea of what a complicated issue we are dealing with.

OK, you produce an FFT; from this you grab the fundamental frequency and a sufficient number of harmonics. With this, you can compare it to another FFT result (one that was previously calculated to be true, so you record yourself saying "a" and take the FFT of that). Since every time you say the vowel it will be different, a direct equals operation will not work, as there will be discrepancies in what is found. So you want to have a threshold, and use fuzzy logic to decide what is the most likely thing it could be. What I'd look at is the spacing and number of harmonics found. I can't see the fundamental frequency being too important (say, for instance, you got a girl to say something: it'd be a higher pitch, but the spacing of the harmonics, and the number of them, will probably still be the same). Of course, something like this would break instantly with a different accent. But hey, I don't write speech recognition stuff, I just did/am doing DSP in my degree.
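A sketch of that comparison (editor's addition): given the harmonic frequencies found in a live FFT and in a stored reference FFT, accept a match when the spacings agree within a tolerance. The 10% tolerance is an assumption, not a tested value.

#include <math.h>

/* live[], ref[]: harmonic frequencies in Hz, sorted ascending */
int SpacingsMatch(const double live[], int nLive, const double ref[], int nRef)
{
    int i;

    if(nLive < 2 || nLive != nRef)
        return 0;                             /* different number of harmonics */

    for(i = 1; i < nLive; i++)
    {
        double dLive = live[i] - live[i - 1]; /* spacing between harmonics */
        double dRef  = ref[i]  - ref[i - 1];
        if(fabs(dLive - dRef) > 0.10 * dRef)  /* assumed 10% tolerance     */
            return 0;
    }
    return 1;                                 /* all spacings close enough */
}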

You have to remember that you're unique, just like everybody else.

[edited by - python_regious on April 18, 2004 6:39:01 PM]

Hi,
well, it's not so simple to grab the fundamental frequency of the wave with all the noise. I don't know which part is noise and which is the fundamental frequency....
Help!!!
Thanks

Guest Anonymous Poster
Or use Hidden Markov Models to solve the problem
