crshinjin

Deep and high voices


Hi! I'd like to write a program that can record and analyze the (future) player's voice to tell how deep or high it is. I suppose one has to use the Fourier transform to get information on the high- and low-frequency components of a sample. So far my Java program records the sound, transforms the sample, then displays the magnitude values in an oscilloscope-like manner. What I expected was that the high-frequency part would change for high voices and the low-frequency part would change for deep voices. Instead, this is what I get:

In silence (more exactly, with the usual room noise): [spectrum plot]

Speaking in a high voice: [spectrum plot]

Speaking in a deep voice: [spectrum plot]

So, is this really what I am supposed to get, or did I botch the algorithm somewhere? Or is the Fourier transform simply not what I need for this?

Thanks,
shinjin

As for the technical details: the sound is recorded as 8-bit / 22050 Hz mono, and the Fourier transform runs ten times a second on a sample of 2205 elements. Not all components are calculated and plotted, only 300.
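For the curious, the frame-by-frame analysis described above can be sketched as a direct DFT. All names here are hypothetical, and a real program would use an FFT library instead, since the direct form is O(n²):

```java
// Direct DFT magnitude computation for one analysis frame.
// Hypothetical sketch; a real program would use an FFT for speed.
public class DftMagnitude {
    // Returns the magnitudes of the first `bins` DFT coefficients of `frame`.
    public static double[] magnitudes(double[] frame, int bins) {
        int n = frame.length;
        double[] magn = new double[bins];
        for (int k = 0; k < bins; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            magn[k] = Math.sqrt(re * re + im * im);
        }
        return magn;
    }
}
```

With a 2205-sample frame at 22050 Hz, each bin is 10 Hz wide, so the first 300 bins would cover 0 to 3 kHz.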

Hi.
Your graphs seem to plot signed numbers, which doesn't make sense since you say that you're plotting magnitudes, which are strictly positive.
I suppose you plot "absolute value of fi / (square root of number of samples)", so that the same algo can be used for the inverse transform.

If the plot range is only positive though, it seems to me that you're plotting all the coefficients (a total of "number of samples / 2") instead of the first 300 ones, as you claim.

The human voice, of both males and females, is usually no more than 2 kHz (at least this is what I've read).
Interestingly, your graphs only exhibit major differences within no more than 20% of their length, which is the same proportion as 2 kHz w.r.t. 11 kHz (the Nyquist frequency, i.e. the maximum frequency you can pick up and reproduce, equal to your sampling frequency / 2).



I guess there's no better way than cross-checking the results we get with our FFTs.
Post a sample list of 2, 4, 8, 16, 32... complex values (a power of 2 anyway), and I'll let you know what I get with my implementation
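One easy known-answer cross-check along these lines: the 4-point DFT of {1, 2, 3, 4} is {10, -2+2i, -2, -2-2i}, which you can verify by hand. A minimal direct DFT (hypothetical names) to compare an implementation against:

```java
// Tiny known-answer cross-check for a DFT implementation.
// The 4-point DFT of {1, 2, 3, 4} is {10, -2+2i, -2, -2-2i}.
public class DftCheck {
    // Returns [k][0] = real part, [k][1] = imaginary part of coefficient k.
    public static double[][] dft(double[] x) {
        int n = x.length;
        double[][] out = new double[n][2];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                out[k][0] += x[t] * Math.cos(angle);
                out[k][1] -= x[t] * Math.sin(angle);
            }
        }
        return out;
    }
}
```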

You can find a lot of literature by googling pitch detection. Often pitch detection algorithms exploit the fact that the frequencies of the harmonics are k*f0, where k is some integer and f0 is the fundamental frequency or the pitch. The fundamental frequency itself might be missing from the signal and humans still correctly perceive the pitch as the fundamental frequency. For specific algorithms, see for example cnx.org article on pitch detection algorithms.

I'm not exactly sure how one would characterize the deepness of the voice. Perhaps one measure could be the strength of the lower-end harmonics compared to the higher-end harmonics. That is, once you have found f0 you could measure the strength of k*f0 for k=1..n and compare it to k*f0 for k=n+1..m, where n and m are some suitable integers.
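A sketch of that comparison, assuming a magnitude spectrum and the bin index of f0 are already available (all names, and the choice of n and m, are hypothetical):

```java
public class HarmonicBalance {
    // Ratio of low-harmonic energy (k = 1..n) to high-harmonic energy
    // (k = n+1..m), sampled from a magnitude spectrum at multiples of
    // the fundamental's bin index. All names are hypothetical.
    public static double lowToHighRatio(double[] magn, int f0Bin, int n, int m) {
        double low = 0.0, high = 0.0;
        for (int k = 1; k <= m; k++) {
            int bin = k * f0Bin;
            if (bin >= magn.length) break;
            if (k <= n) low += magn[bin];
            else high += magn[bin];
        }
        return high > 0.0 ? low / high : Double.POSITIVE_INFINITY;
    }
}
```

A larger ratio would then suggest a "deeper" sound, since more of the energy sits in the low harmonics.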

I forgot to mention, this is my first signal processing attempt, so I make stupid mistakes in the process.

To someusername:
To compute the magnitude I use the following equation from this tutorial


magn[binF] = 20. * Math.log10(2 * Math.sqrt(sinPart[binF] * sinPart[binF]
                + cosPart[binF] * cosPart[binF]) / sTrfLength);

One important thing you pointed out is that I should only use the first half of the transform length, not the whole. This explains the strange symmetry of the plot.

BTW, the 300 points were selected uniformly from the whole transform length.

By fixing this error and plotting only the first half of the magnitude values, the plot is much more meaningful.

For deep voice:
[spectrum plot]

For high voice:
[spectrum plot]

Edit: the plots are upside-down, so higher magnitudes are closer to the bottom of the screen.

To Winograd:
I have never heard about pitch detection, thanks for mentioning it. I'll definitely check it out.

[Edited by - crshinjin on August 20, 2006 8:33:20 AM]

The formula you quoted for calculating the amplitude seems to compute sound pressure rather than the actual amplitudes indicated by the Fourier coefficients... (you can look up "SPL decibel" or "sound pressure level", as opposed to "sound intensity level")

To me it looks like it should be "((sinPart^2 + cosPart^2) / (number of samples)) / (2x10^-5)" inside the logarithm, which is the squared amplitude of the coefficient (the pressure) divided by the standard reference pressure level.
I could be missing something though, so it's probably me who's mistaken.
After all, you saw it in a tutorial; it's highly unlikely that it's wrong.


As for SuperNerd's suggestion, it's a really good idea actually if you want something not too complicated...
A zero crossing is a position in the sample stream where the sample values change sign (from positive to negative, or vice versa). The number of zero crossings in a given time interval is a measure of how fast the waveform oscillates, and thus of how high the frequency of the sound is...
This is a very simplistic model that will not work for complicated waveforms, but it may prove sufficient when dealing with voice only... I think it can lock onto the fundamental frequency of simple sounds quite easily...
Good idea, indeed...

Quote:
Original post by crshinjin
To SuperNerd:
Would you explain this in a bit more detail? As I've said, I'm quite new to signal processing.


One cycle per second is one hertz. Imagine a graph of amplitude over time: every cycle crosses the zero line twice. So loop through the samples, add up the number of times the signal crosses zero, and divide by two. That should give the average frequency of the sound. The only problem is that other sounds or noise can easily throw off the calculation.
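A minimal sketch of this counting scheme (hypothetical names):

```java
// Zero-crossing frequency estimate: count sign changes, divide by two
// (each cycle crosses zero twice), then divide by the duration in seconds.
public class ZeroCrossings {
    public static double estimateFrequency(double[] samples, double sampleRate) {
        int crossings = 0;
        for (int i = 1; i < samples.length; i++) {
            if ((samples[i - 1] < 0) != (samples[i] < 0)) crossings++;
        }
        double seconds = samples.length / sampleRate;
        return (crossings / 2.0) / seconds;
    }
}
```

For a clean 440 Hz sine sampled at 22050 Hz this lands within a hertz or two of 440; noisy real-world input will be less tidy.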

Thanks! Using the zero crossings really did the trick for me.

One curious thing, though, which was the same with the DFT: when I tested the program I had to try really hard to scream at a noticeably higher frequency; when I simply made sounds that felt higher to me, there was almost no difference compared to my "low" sounds.

Also, are there any ways to improve the accuracy of the zero-crossing calculation, for example against noise?

Could be that the input sound wave oscillates faster than the sampling rate can keep up with:

cross   cross   cross   cross
sample          sample

^ you just lost two crossings, since it only samples at 1/3rd the speed of the oscillation.
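This is aliasing: any component above the Nyquist frequency (half the sampling rate) folds back and shows up as a lower frequency after sampling. A small sketch (hypothetical names) demonstrating it with zero crossings: a 15 kHz sine sampled at 22050 Hz is counted as its alias at 22050 - 15000 = 7050 Hz:

```java
// Aliasing demo: a tone above the Nyquist frequency (sampleRate / 2)
// is indistinguishable, after sampling, from its folded-back alias.
public class AliasDemo {
    // Zero-crossing frequency estimate of a sampled sine of frequency `freq`.
    public static double estimate(double freq, double sampleRate, int numSamples) {
        int crossings = 0;
        for (int i = 1; i < numSamples; i++) {
            double prev = Math.sin(2.0 * Math.PI * freq * (i - 1) / sampleRate);
            double cur = Math.sin(2.0 * Math.PI * freq * i / sampleRate);
            if ((prev < 0) != (cur < 0)) crossings++;
        }
        return (crossings / 2.0) / (numSamples / sampleRate);
    }
}
```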

What exactly are you trying to do, distinguish between high and low pitch voices or detect the exact pitch at which one speaks?

I doubt that zero crossings can give you anywhere near an accurate estimate of the actual pitch. Then again, if you are just differentiating between low- and high-pitched voices, there might be other simple but more robust methods for achieving this.

The purpose would be differentiating the sounds made by the same person.
I planned to start each time with a calibration, to be able to adapt to the person's vocal capabilities. Being able to tell how "high" or "low" the sound (probably a scream) was would be nice, but it's not a must.

The program should be kept as light as possible, so it can run in real time while leaving time for other things like visualization.

So far the zero-crossing calculation looks good with average equipment and background noise.

Quote:
Original post by cr_shinjin
Also, are there any ways to improve the accuracy of the zero crossings calculations, like against noise?


The only ways I can think of to improve accuracy are making sure that the sound going into the microphone is loud and clear, and, if the microphone has auto gain, turning it off. Noise has the biggest effect on the calculation when there is silence; because noise samples are usually very close to zero, you could treat a small band of amplitudes around zero as if it were zero. I hope what I'm trying to say is clear.
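One common refinement along these lines is exactly that dead zone: only count a crossing once the signal has moved clearly out of a small band around zero, so low-level noise wiggling near zero is ignored. A minimal sketch (hypothetical names; the threshold would be tuned to the noise floor):

```java
public class RobustZeroCrossings {
    // Counts crossings with a dead zone: a crossing is only registered
    // when the signal exceeds +threshold after having been below
    // -threshold (or vice versa), which ignores low-level noise.
    public static int count(double[] samples, double threshold) {
        int crossings = 0;
        int state = 0; // -1 below the band, +1 above it, 0 not yet known
        for (double s : samples) {
            if (s > threshold) {
                if (state == -1) crossings++;
                state = 1;
            } else if (s < -threshold) {
                if (state == 1) crossings++;
                state = -1;
            }
        }
        return crossings;
    }
}
```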

