voice/word recognition

General and Gameplay Programming

Started by ProfL August 26, 2015 02:55 PM

5 comments, last by frob 8 years, 7 months ago

ProfL

717

Author

August 26, 2015 02:55 PM

There is no dedicated audio tech forum, so I hope this is the right place to post my question.

For fun (and out of curiosity) I'd like to implement simple single-word recognition of spoken recordings. Since this is for fun, I can set up an artificial environment, e.g. only 5 words, no noise, a clear voice, etc.

So far I've gathered that I could run a Fourier transform and compare the frequencies via histograms.

It's not easy to find tutorials or sample source to learn from, though. Usually everybody recommends libraries, but those are either black boxes or require deep knowledge just to set up.

Any guidance, links, high-level descriptions, samples, or keywords for Google are highly appreciated!

Waterlimon

4,401

August 26, 2015 03:14 PM

Are you looking for info on voice recognition overall, or on how to implement the Fourier transform/histogram?

I think what you want is called the discrete Fourier transform. Wikipedia has the formula (you just have to know what the sigma symbol means, and not be afraid of complex numbers - I think C++ even has a built-in complex number class if you don't want to translate the math to work with real numbers only).

(https://en.wikipedia.org/wiki/Discrete_Fourier_transform)

It's of the form x(k) = *iterate over the data and sum some stuff*, where x(k) gives the amplitude/phase at frequency k. Then you do that for all the frequencies to form your histogram and compare it, or whatever it is you wanted to do.

The computationally efficient way to transform to the frequency domain is to use the fast Fourier transform (https://en.wikipedia.org/wiki/Fast_Fourier_transform). I assume this is especially important for processing audio data.

Maybe you can find more results by searching for "fast Fourier transform".
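As a rough illustration of the "iterate over the data and sum some stuff" description, here is a minimal sketch of the naive O(N^2) DFT in Python. The function names `dft` and `magnitudes` and the 32-sample test tone are made up for this example; a real implementation would use an FFT library instead.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a list of samples."""
    n = len(samples)
    result = []
    for k in range(n):
        # X(k) = sum over t of x(t) * exp(-2*pi*i*k*t / N)
        total = 0j
        for t, x in enumerate(samples):
            total += x * cmath.exp(-2j * cmath.pi * k * t / n)
        result.append(total)
    return result

def magnitudes(samples):
    """Amplitude per frequency bin; for real input only bins 0..N/2 are unique."""
    spectrum = dft(samples)
    return [abs(c) for c in spectrum[: len(samples) // 2 + 1]]

# A pure tone completing 3 cycles over 32 samples should peak in bin 3.
tone = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
mags = magnitudes(tone)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
```

The list `mags` is the kind of "histogram" discussed above: one amplitude per frequency bin, with `peak_bin` pointing at the dominant frequency.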

o3o

ProfL

717

Author

August 26, 2015 03:27 PM

I know how FT, DFT and FFT work, but I'm not quite sure how to apply them. Should I break the recorded audio stream into chunks to get a frequency-over-time histogram/graph? Or should I transform the whole wave sequence?

Once that is done, should I compare it to an average of previously recorded words? Just the MSE distance? Or counting peaks? How would that work when someone speaks slightly faster or slower?

I know it's not a trivial topic, but I'm trying to find a place to start.

alvaro

21,604

August 26, 2015 04:36 PM

The "traditional" (state of the art 5-10 years ago) method consists of dividing the signal into little windows of time (20 ms, say), taking the windows, say, 5 at a time and computing some features of the signal during that time. These features are related to the FFT, but they are not the raw FFT output. I think the most common features are Mel-frequency cepstral coefficients.
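The windowing step can be sketched roughly as follows. This is only an illustration under assumptions: the helper names are hypothetical, "5 at a time" is interpreted as concatenating 5 consecutive frames, and the per-frame features (log energy and zero-crossing rate) are simple stand-ins, not real Mel-frequency cepstral coefficients.

```python
import math

def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut the signal into non-overlapping windows of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]

def frame_features(frame):
    """Stand-in features (NOT MFCCs): log energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [math.log(energy + 1e-10), crossings / len(frame)]

def stack_context(features, width=5):
    """Concatenate `width` consecutive per-frame vectors into one feature vector."""
    stacked = []
    for i in range(len(features) - width + 1):
        vec = []
        for f in features[i:i + width]:
            vec.extend(f)
        stacked.append(vec)
    return stacked

# One second of a synthetic 440 Hz tone at an assumed 8 kHz sample rate.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
frames = split_into_frames(signal, sample_rate)
features = [frame_features(f) for f in frames]
vectors = stack_context(features)
```

Each entry of `vectors` is one feature vector per time step, which is the shape of input the next stage expects. Real systems typically use overlapping windows and a window function as well.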

Once you have converted the signal into a sequence of feature vectors, you feed these to a hidden Markov model, which can then identify what phoneme is being said at each time window. Actually, the output is a probability distribution over the phonemes for each time window. You can then use a language model and something like the Viterbi algorithm to decide what the most likely interpretation of the sound is.
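To make the decoding step concrete, here is a minimal sketch of the Viterbi algorithm on a toy HMM. Everything about the model is invented for illustration: the two "phoneme" states, the quantized observations "hiss" and "tone", and all probabilities. A real recognizer would work on continuous feature vectors and usually in log-probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its probability."""
    # best[t][s]: probability of the most likely path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy model: two phoneme-like states, two observation symbols.
states = ["s", "ee"]
start_p = {"s": 0.6, "ee": 0.4}
trans_p = {"s": {"s": 0.7, "ee": 0.3}, "ee": {"s": 0.1, "ee": 0.9}}
emit_p = {"s": {"hiss": 0.9, "tone": 0.1}, "ee": {"hiss": 0.2, "tone": 0.8}}

path, prob = viterbi(["hiss", "hiss", "tone", "tone"], states, start_p, trans_p, emit_p)
```

With only a handful of words, one option is to build one small model per word and pick the word whose model explains the observations best.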

The parameters of the HMM can be tuned automatically if you have enough data, using some version of the EM algorithm.

Some modern approaches use neural networks to define the features that are fed into the HMM, and they can be tuned to improve the performance of the whole system. That might be the current state of the art. There are also people working on getting a recurrent neural network to do the whole thing, but I don't think that performs as well as the HMMs (yet?).

Restricting yourself to just a few words will make the last stages easier. But it's still a ton of work.

You are probably better off looking into existing speech recognition libraries. Unfortunately I don't have experience with any of them, so I can't give you any recommendations.

frob

46,221

August 26, 2015 08:07 PM

The good news is that you don't need to implement very much.

There are many good free libraries that can handle it, and if you're doing it on Windows the OS provides a speech recognition engine already built in.

If you're using the .NET Framework, you can build a simple recognizer with about 10 lines of code. Those 10 lines in a form app will recognize three different words (red, green, blue) and show a dialog box when a word is recognized.

Small grammars like this are extremely simple: you basically list a bunch of variants for yes ('yes', 'yup', 'sure', 'ok', ...), variants for no ('no', 'nope', 'cancel'), and whatever else you need. Since you only want five words, it isn't like you're trying to recognize flowing speech with several hundred thousand potential words. The systems are fun and easy at limited vocabularies.
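The idea of a small grammar can be sketched as plain data, independent of any particular speech engine. This is not the .NET API; `GRAMMAR`, `LOOKUP` and `interpret` are hypothetical names, and the recognized word is assumed to arrive as text from whatever recognizer is in use.

```python
# Each meaning lists the spoken variants that should map to it.
GRAMMAR = {
    "yes": ["yes", "yup", "sure", "ok"],
    "no": ["no", "nope", "cancel"],
}

# Invert the grammar so each variant points at its meaning.
LOOKUP = {
    variant: meaning
    for meaning, variants in GRAMMAR.items()
    for variant in variants
}

def interpret(recognized_word):
    """Map a recognized word to its meaning, or None if it is not in the grammar."""
    return LOOKUP.get(recognized_word.strip().lower())

answer = interpret("Yup")
```

Here `answer` is "yes", and a word outside the grammar yields None, which is what keeps a limited vocabulary manageable.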

ProfL

717

Author

August 26, 2015 08:41 PM

frob said: "The good news is that you don't need to implement very much."

But I have fun learning new things. My goals are:
1. I want to understand how it works
2. I want to implement it myself

@Álvaro's reply was top notch, thanks!

More tips/hints/replies are welcome.

frob

46,221

August 26, 2015 09:23 PM

In that case, consider using the links to the open source project list as a reference. Most projects document their algorithms and the papers they reference, and they also include source code so you can dig in to see what they actually did.

It is a subject with a lot of PhD work, published papers, and technical research behind it. It will be a lot of work, but if that's the type of thing you enjoy, have fun with that. :-)

This topic is closed to new replies.
