Deep and high voices

Hi! I'd like to write a program that can record and analyze the (future) player's voice for how deep or high it is. I suppose one has to use a Fourier transform to get information on the high- and low-frequency components of a sample. So far my Java program records the sound, transforms the sample, then displays the magnitude values in an oscilloscope (?) manner.

What I would expect is that the high-frequency part would change for high voices and the low-frequency part would change for deep voices. Instead, this is what I get:

In silence (more exactly, with the usual room noise):
[plot: spectrum in silence]

Speaking in a high voice:
[plot: spectrum, high voice]

Speaking in a deep voice:
[plot: spectrum, deep voice]

So, is this really what I am supposed to get, or did I botch the algorithm somewhere? Or is a Fourier transform simply not what I need for this?

Thanks,
shinjin

As for the technical details: the sound is recorded as 8-bit / 22050 Hz mono, and the Fourier transform runs ten times a second on a window of 2205 samples. Not all of the components are calculated and plotted, only 300.
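In case it helps, the per-window analysis is conceptually like this minimal, un-optimized sketch (the names here are made up for the post, they are not the exact code from my program):

// Sketch only (not my actual program): magnitude of the first "bins"
// DFT coefficients of one real-valued window, via a naive DFT.
public class WindowSpectrum {
    public static double[] magnitudes(double[] window, int bins) {
        int n = window.length;
        double[] magn = new double[bins];
        for (int bin = 0; bin < bins; bin++) {
            double re = 0.0, im = 0.0;
            for (int i = 0; i < n; i++) {
                double angle = 2.0 * Math.PI * bin * i / n;
                re += window[i] * Math.cos(angle);
                im -= window[i] * Math.sin(angle);
            }
            magn[bin] = Math.sqrt(re * re + im * im) / n; // plain linear magnitude
        }
        return magn;
    }
}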
Hi.
Your graphs seem to plot signed numbers, which doesn't make sense since you say that you're plotting magnitudes, which are strictly positive.
I suppose you plot "absolute value of fi / (square root of number of samples)", so that the same algo can be used for the inverse transform.

If the plot range is only positive though, it seems to me that you're plotting all the coefficients (a total of "number of samples / 2") instead of the first 300 ones, as you claim.

The human voice, for both males and females, usually contains little above 2 kHz (at least this is what I've read).
Interestingly, your graphs only show a major difference within no more than about 20% of their length, which is the same proportion that 2 kHz is of 11 kHz (the Nyquist frequency, i.e. the maximum frequency you can pick up and reproduce, which is your sampling frequency / 2).
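To put numbers on that (a throwaway check of my own, not anything from your program): the centre frequency of bin k is k * sampleRate / windowLength, so with the figures you gave, each bin is 10 Hz wide and 2 kHz lands around bin 200 out of roughly 1100 usable bins.

// Quick sanity check: which bin corresponds to a given frequency?
public class BinCheck {
    public static void main(String[] args) {
        double sampleRate = 22050.0; // Hz, as stated in the first post
        int windowLength = 2205;     // samples per transform
        double binWidth = sampleRate / windowLength; // 10 Hz per bin
        int usableBins = windowLength / 2;           // up to Nyquist (11025 Hz)
        System.out.println("Bin width: " + binWidth + " Hz");
        System.out.println("Usable bins: " + usableBins);
        System.out.println("Bin for 2 kHz: " + (int) (2000.0 / binWidth)); // ~200
    }
}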



I guess there's no better way than cross-checking the results we get with our FFTs.
Post a sample list of 2, 4, 8, 16, 32... complex values (a power of two, anyway), and I'll let you know what I get with my implementation.
You can find a lot of literature by googling pitch detection. Often pitch detection algorithms exploit the fact that the frequencies of the harmonics are k*f0, where k is some integer and f0 is the fundamental frequency or the pitch. The fundamental frequency itself might be missing from the signal and humans still correctly perceive the pitch as the fundamental frequency. For specific algorithms, see for example cnx.org article on pitch detection algorithms.

I'm not exactly sure how one would characterize the deepness of a voice. Perhaps one measure could be the strength of the lower-end harmonics compared to the higher-end harmonics. That is, once you have found f0 you could measure the strength of k*f0 for k = 1..n and compare it to k*f0 for k = n+1..m, where n and m are some suitable integers.
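For illustration only, here's a rough sketch of the measure I mean, assuming you already have a linear magnitude spectrum and an estimate of f0 (all the names below are made up):

// Rough illustration: compare energy at the lower harmonics of f0
// against energy at the higher harmonics. "spectrum" holds linear
// magnitudes and binWidth is sampleRate / windowLength in Hz.
public class HarmonicBalance {
    public static double lowToHighRatio(double[] spectrum, double binWidth,
                                        double f0, int n, int m) {
        double low = 0.0, high = 0.0;
        for (int k = 1; k <= m; k++) {
            int bin = (int) Math.round(k * f0 / binWidth);
            if (bin >= spectrum.length) break;
            if (k <= n) low += spectrum[bin];
            else high += spectrum[bin];
        }
        return high > 0.0 ? low / high : Double.POSITIVE_INFINITY;
    }
}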
I forgot to mention, this is my first signal processing attempt, so I make stupid mistakes in the process.

To someusername:
To compute the magnitude I use the following equation from this tutorial

// magnitude in dB: 20 * log10 of the (scaled) amplitude of bin binF
magn[binF] = 20. * Math.log10(2*Math.sqrt(sinPart[binF]*sinPart[binF] + cosPart[binF]*cosPart[binF])/sTrfLength);


One important thing you pointed out is that I should only use the first half of the transform, not the whole thing. This explains the strange symmetry of the plot.

BTW, the 300 points were selected uniformly from the whole transform length.

By fixing this error and plotting only the first half of the magnitude values, the plot is much more meaningful.

For deep voice:
[plot: spectrum for the deep voice]

For high voice:
[plot: spectrum for the high voice]

Edit: the plots are upside-down, so higher magnitudes are closer to the bottom of the screen.

To Winograd:
I have never heard about pitch detection, thanks for mentioning it. I'll definitely check it out.

[Edited by - crshinjin on August 20, 2006 8:33:20 AM]
Well, the simple way of doing it would be to count the zero crossings, but the FFT approach would probably be more reliable.
To SuperNerd:
Would you explain this in a bit more detail? As I've said, I'm quite new to signal processing.
The formula you quoted for the calculation of amplitude seems to calculate pressure and not the actual amplitudes indicated by the Fourier coefficients... (you can look up "SPL decibel" or "sound pressure level", as opposed to "sound intensity level")

To me it looks like it should be "((sinPart^2 + cosPart^2) / (number of samples)) / (2x10^-5)" inside the logarithm (which is the squared amplitude of the coefficient, i.e. the pressure, divided by the standard reference pressure level).
I could be missing something though, so it's probably me who's mistaken.
After all, you saw it in a tutorial; it's highly unlikely that it's wrong.
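Just to show the distinction I mean (take it with a grain of salt), plotting the plain linear amplitude per bin, with no dB/SPL scaling at all, would look roughly like this, reusing the sinPart/cosPart arrays from your snippet:

// Illustration only: plain linear amplitude per bin, no dB / SPL scaling.
// sinPart and cosPart are the real/imaginary parts of the transform,
// as in the line you quoted.
public static double[] linearAmplitudes(double[] sinPart, double[] cosPart) {
    int sTrfLength = sinPart.length;
    double[] ampl = new double[sTrfLength / 2];
    for (int binF = 0; binF < ampl.length; binF++) {
        ampl[binF] = 2.0 * Math.sqrt(sinPart[binF] * sinPart[binF]
                                   + cosPart[binF] * cosPart[binF]) / sTrfLength;
    }
    return ampl;
}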


As for SuperNerd's suggestion, it's a really good idea actually if you want something not too complicated...
A zero crossing is a position in the sample stream where the sample values change sign (from positive they become negative, or vice versa). The number of zero crossings in a given time interval is a measure of how fast the waveform oscillates, and thus a measure of how high the frequency of the sound is...
This is a very simplistic model that will not work for complicated waveforms, but it may prove sufficient when dealing with voice only... I think it can pick out the fundamental frequency of simple sounds quite easily...
Good idea, indeed...
Quote:Original post by crshinjin
To SuperNerd:
Would you explain this a bit more in details. As I've said, I'm quite new to signal processing.


One cycle per second is one hertz. If you imagine a graph of amplitude over time, every cycle crosses the zero point twice. So loop through the samples, add up the number of times the graph crosses zero, divide by two, and then divide by the length of the window in seconds. That should give the average frequency of the sound. The only problem with this is that other sounds or noise can easily throw off the calculation.
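Something like this rough sketch (the names are made up, and it assumes the samples are signed values centred on zero):

// Rough sketch: estimate the average frequency from zero crossings.
// "samples" is one analysis window of signed sample values, "sampleRate" is in Hz.
public class ZeroCrossingPitch {
    public static double estimateFrequency(double[] samples, double sampleRate) {
        int crossings = 0;
        for (int i = 1; i < samples.length; i++) {
            if ((samples[i - 1] >= 0.0) != (samples[i] >= 0.0)) {
                crossings++;
            }
        }
        double seconds = samples.length / sampleRate;
        // Two crossings per cycle, so crossings / 2 cycles happened in this window.
        return (crossings / 2.0) / seconds;
    }
}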
Thanks! Using the zero crossings really did the trick for me.

The curious thing, though (it was the same with the DFT): when I tested the program I had to try really hard to scream at a noticeably higher frequency; when I simply made sounds that felt higher, there was almost no difference compared to my "low" sounds.

Also, are there any ways to improve the accuracy of the zero-crossing calculation, for example against noise?
It could be that the input sound wave oscillates faster than the sampling rate:

cross  cross  cross  cross
sample                sample

^ you just lost two crossings, since it only samples at 1/3rd the speed of the oscillation.

