ProfL

voice/word recognition

This topic is 1151 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


there is no dedicated audio tech forum, so I hope this is the right place to post my question.

for fun (and curiosity) I'd like to implement simple single-word recognition of spoken recordings. as this is for fun, I can set up an artificial environment, e.g. 5 words only, no noise, clear voice, etc.

so far I've gathered that I could run a Fourier transform and compare the frequencies via histograms.

well, it's not easy to find tutorials or sample source to learn from. usually everybody recommends libraries, but those are either black boxes or require deep knowledge just to set them up.

any guidance, links, high level descriptions, samples, keywords for Google are highly appreciated!

Are you looking for info on voice recognition overall, or on how to implement the Fourier transform/histogram?

I think what you want is called the discrete Fourier transform. Wiki has the formula (you just have to know what the sigma symbol means, and not be afraid of complex numbers - I think C++ even has a built-in complex number class, std::complex, if you don't want to translate the math to work with real numbers only).

(https://en.wikipedia.org/wiki/Discrete_Fourier_transform)

It's of the form X(k) = *iterate over data and sum some stuff*, where X(k) gives the amplitude/phase at frequency k. Then you do that for all the frequencies to form your histogram and somehow compare it, or whatever it is you wanted to do.
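
As a rough illustration of the sum described above, a naive DFT might look like the following Python sketch. The names `dft` and `magnitudes` and the 32-sample test tone are made up for this example. It follows the standard definition from the Wikipedia page linked above and runs in O(N^2), so it is meant for understanding rather than for real audio lengths.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a list of real samples."""
    n = len(samples)
    spectrum = []
    for k in range(n):
        total = 0j
        # "iterate over data and sum some stuff": one complex term per sample
        for t, x in enumerate(samples):
            total += x * cmath.exp(-2j * cmath.pi * k * t / n)
        spectrum.append(total)
    return spectrum

def magnitudes(spectrum):
    """Amplitude per frequency bin; for real input only the first half is unique."""
    return [abs(c) for c in spectrum[: len(spectrum) // 2 + 1]]

# Hypothetical test signal: a pure tone completing 3 cycles over 32 samples.
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
mags = magnitudes(dft(signal))
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(peak_bin)  # prints 3
```

The list returned by `magnitudes` is the "histogram" in the sense used here: one amplitude per frequency bin.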

The computationally efficient way to transform to the frequency domain is to use the fast Fourier transform (https://en.wikipedia.org/wiki/Fast_Fourier_transform). I assume this is especially important for processing audio data.

Maybe you can find more results searching for "fast fourier transform".
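
For reference, one textbook form of the fast Fourier transform is the recursive radix-2 Cooley-Tukey scheme. The sketch below is a minimal version of that idea, assuming the input length is a power of two; the function name `fft` is just a label for this example, and real libraries use more elaborate, faster variants.

```python
import cmath

def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    # Transform the even-indexed and odd-indexed samples separately...
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    # ...then combine the two half-size results with the "twiddle" factors.
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

It produces the same bins as the plain DFT sum, but in O(N log N) instead of O(N^2) operations.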

I know how FT, DFT and FFT work, but I'm not quite sure how to apply it. should I break the recorded audio stream into some chunks to get a frequency over time histogram/graph? or should I transform the whole wave sequence?
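
The chunking option mentioned here is usually done with short, overlapping, windowed frames, each of which is then transformed on its own to give a frequency-over-time picture (a spectrogram). The sketch below shows only the framing step. The name `frames`, the frame length of 256 samples, the hop of 128, and the choice of a Hann window are assumptions for illustration, not values taken from this thread.

```python
import math

def frames(samples, frame_len=256, hop=128):
    """Split samples into overlapping frames, each multiplied by a Hann window.

    Assumes frame_len >= 2. Trailing samples that do not fill a whole frame
    are dropped.
    """
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
        for i in range(frame_len)
    ]
    out = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        out.append([s * w for s, w in zip(chunk, window)])
    return out
```

Running a Fourier transform on each returned frame gives one spectrum per time step; the window tapers the frame edges so that the cut points do not show up as spurious frequencies.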

once that is done, should I compare it to an average of previously recorded words? just the MSE distance? or counting peaks? how would that work when someone speaks slightly faster or slower?
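
One technique commonly used for the faster/slower problem in isolated-word matching is dynamic time warping (DTW). Nobody in this thread proposes it, so treat it as one possible direction rather than the answer. The sketch below assumes each recording has already been turned into a sequence of per-frame feature vectors; `dtw_distance`, `squared_error`, `classify`, and the `templates` dictionary are hypothetical names for this example.

```python
def dtw_distance(a, b, dist):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            # a frame may repeat (stretch), be skipped over (compress), or match 1:1
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def squared_error(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def classify(features, templates):
    """Return the label of the stored template with the smallest DTW cost."""
    return min(
        templates,
        key=lambda label: dtw_distance(features, templates[label], squared_error),
    )
```

With one stored template per word, `classify` picks the word whose template aligns most cheaply with the new recording, even when the new recording is stretched or compressed in time.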

I know it's not a trivial topic, but I'm trying to find a place to start.

The good news is that you don't need to implement very much.

There are many good free libraries that can handle it, and if you're doing it on Windows, the OS already provides a built-in speech recognition engine.

If you're on the .NET Framework, you can build a simple recognizer with about 10 lines of code. Their 10 lines of code in a form app will recognize three different words (red, green, blue) and show a dialog box when it recognizes a word.

Small grammars like this are extremely simple: basically you list a bunch of variants for yes ('yes', 'yup', 'sure', 'ok', ...), variants for no ('no', 'nope', 'cancel'), and whatever else you need. Since you only want five words, it isn't like you're trying to recognize flowing speech with several hundred thousand potential words. The systems are fun and easy at limited vocabularies.
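
To show the shape of such a small grammar, here is a plain-Python sketch of the idea. It is not the .NET speech API and does no audio processing; `GRAMMAR` and `interpret` are hypothetical names, and the input is assumed to be a word that some recognizer has already produced as text.

```python
# Each meaning maps to the set of spoken variants that should trigger it.
GRAMMAR = {
    "yes": {"yes", "yup", "sure", "ok"},
    "no": {"no", "nope", "cancel"},
}

def interpret(recognized_word):
    """Map a recognized word to its meaning, or None if it is not in the grammar."""
    word = recognized_word.strip().lower()
    for meaning, variants in GRAMMAR.items():
        if word in variants:
            return meaning
    return None

print(interpret("Yup"))    # prints yes
print(interpret("maybe"))  # prints None
```

The point is only that a limited vocabulary reduces the problem to picking one of a handful of known entries.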

"The good news is that you don't need to implement very much."

but I enjoy learning new things. my goals are:
1. I want to understand how it works
2. I want to implement it myself


@Álvaro's reply was top notch, thanks!

more tips/hints/replies are welcome.

In that case, consider using the links to the open source project list as a reference. Most projects document their algorithms and the papers they reference, and they also include source code so you can dig in to see what they actually did.

It is a subject with a lot of PhD work, published papers, and technical research behind it. It will be a lot of work, but if that's the type of thing you enjoy, have fun with that. :-)
