voice/word recognition

5 comments, last by frob 8 years, 7 months ago
There is no dedicated audio tech forum, so I hope this is the right place to post my question.

For fun (and curiosity) I'd like to implement simple single-word recognition for spoken recordings. As this is for fun, I can set up an artificial environment, e.g. 5 words only, no noise, clear voice, etc.

So far I've gathered that I could run a Fourier transform and compare the frequencies via histograms.

Well, it's not easy to find tutorials or sample source to learn from. Usually everybody recommends libraries, but those are either black boxes or require deep knowledge just to set them up.

Any guidance, links, high-level descriptions, samples, or keywords for Google are highly appreciated!

Are you looking for info on voice recognition overall, or on how to implement the Fourier transform/histogram?

I think what you want is called the discrete Fourier transform. Wikipedia has the formula (you just have to know what the sigma symbol means and not be afraid of complex numbers; I think C++ even has a built-in complex number class, std::complex, if you don't want to translate the math to work with real numbers only).

(https://en.wikipedia.org/wiki/Discrete_Fourier_transform)

It's of the form X(k) = *iterate over the data and sum some terms*, where X(k) gives the amplitude and phase at frequency k. You do that for every frequency to form your histogram, and then compare it against whatever reference you have in mind.
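As a rough sketch of that sum in Python (standard library only, and nothing like an optimized implementation), each output bin k is just the input multiplied by a complex exponential and added up. The names `dft`, `signal` and `magnitudes` are made up for this example.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a list of samples."""
    n = len(samples)
    spectrum = []
    for k in range(n):
        total = 0j
        # Sum x[t] * e^(-2*pi*i*k*t/N) over all samples.
        for t, x in enumerate(samples):
            total += x * cmath.exp(-2j * cmath.pi * k * t / n)
        spectrum.append(total)
    return spectrum


# Eight samples of a cosine that completes exactly one cycle.
signal = [math.cos(2 * math.pi * t / 8) for t in range(8)]

# abs() of each complex bin is the amplitude; cmath.phase() would give the phase.
magnitudes = [abs(c) for c in dft(signal)]
```

For a real-valued input the upper half of the bins mirrors the lower half, so only the first N/2 + 1 magnitudes are needed for a histogram.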

The computationally efficient way to transform to the frequency domain is to use the fast Fourier transform (https://en.wikipedia.org/wiki/Fast_Fourier_transform). I assume this is especially important for processing audio data.
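To give an idea of what the fast version looks like, here is a minimal sketch of the recursive radix-2 Cooley-Tukey variant in Python. It assumes the input length is a power of two and favors readability over speed; real libraries use iterative, heavily tuned versions. The function name `fft` is just for illustration.

```python
import cmath


def fft(samples):
    """Recursive radix-2 FFT. The input length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]

    # Transform the even-indexed and odd-indexed samples separately.
    even = fft(samples[0::2])
    odd = fft(samples[1::2])

    # Combine the two half-size transforms with the "twiddle" factors.
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

It produces the same bins as the direct sum, but in O(N log N) steps instead of O(N^2).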

Maybe you can find more results searching for "fast fourier transform".


I know how FT, DFT and FFT work, but I'm not quite sure how to apply them. Should I break the recorded audio stream into chunks to get a frequency-over-time histogram/graph? Or should I transform the whole wave sequence?

Once that is done, should I compare it to an average of previously recorded words? Just the MSE distance? Or counting peaks? How would that work when someone speaks slightly faster or slower?

I know it's not a trivial topic, but I'm trying to find a place to start.

The "traditional" (state of the art 5-10 years ago) method consists of dividing the signal into little windows of time (20 ms, say), taking the windows, say, 5 at a time, and computing some features of the signal during that time. These features are related to the FFT, but they are not the raw FFT output. I think the most common features are Mel-frequency cepstral coefficients.
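The windowing step can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the 10 ms hop between windows, the Hamming taper, the sliding groups of 5, and the helper names `frame_signal` and `group_frames`. Computing the actual features (MFCCs, for instance) from each frame is not shown.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping, Hamming-tapered frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    # The taper reduces edge artifacts when a frame is later transformed.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def group_frames(frames, size=5):
    """Sliding groups of `size` consecutive frames (one reading of '5 at a time')."""
    return [frames[i:i + size] for i in range(len(frames) - size + 1)]


# One second of silence at 16 kHz, purely as placeholder input.
frames = frame_signal([0.0] * 16000, 16000)
groups = group_frames(frames)
```

Each frame (or group of frames) would then be turned into one feature vector, giving a sequence of vectors over time.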

Once you have converted the signal into a sequence of feature vectors, you feed these to a hidden Markov model, which can then identify what phoneme is being said at each time window. Actually, the output is a probability distribution over the phonemes for each time window. You can then use a language model and something like the Viterbi algorithm to decide what the most likely interpretation of the sound is.
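To make the Viterbi step a bit more concrete, here is a minimal sketch on a toy HMM in Python. The two "phoneme" states, the two discrete observation symbols and all the probabilities are invented for this example; a real recognizer would use continuous feature vectors, many more states and a language model on top. It also assumes no probability is exactly zero, since it works with logarithms.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Walk backwards from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy model: "s" tends to sound noisy, "a" tends to sound tonal.
states = ["s", "a"]
start_p = {"s": 0.6, "a": 0.4}
trans_p = {"s": {"s": 0.7, "a": 0.3}, "a": {"s": 0.2, "a": 0.8}}
emit_p = {"s": {"noisy": 0.9, "tonal": 0.1}, "a": {"noisy": 0.2, "tonal": 0.8}}

path = viterbi(["noisy", "noisy", "tonal", "tonal"], states, start_p, trans_p, emit_p)
```

Working in log-probabilities avoids numeric underflow when the observation sequence gets long.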

The parameters of the HMM can be tuned automatically if you have enough data, using some version of the EM algorithm.

Some modern approaches use neural networks to define the features that are fed into the HMM, and they can be tuned to improve the performance of the whole system. That might be the current state of the art. There are also people working on getting a recurrent neural network to do the whole thing, but I don't think that performs as well as the HMMs (yet?).

Restricting yourself to just a few words will make the last stages easier. But it's still a ton of work.

You are probably better off looking into existing speech recognition libraries. Unfortunately I don't have experience with any of them, so I can't give you any recommendations.

The good news is that you don't need to implement very much.

There are many good free libraries that can handle it, and if you're doing it on Windows the OS provides a speech recognition engine already built in.

If you're on the .NET Framework, you can build a simple recognizer with about 10 lines of code. Their 10 lines of code in a forms app will recognize three different words (red, green, blue) and show a dialog box when it recognizes a word.

Small grammars like this are extremely simple: you basically list a bunch of variants for yes ('yes', 'yup', 'sure', 'ok', ...), variants for no ('no', 'nope', 'cancel'), and whatever else you need. Since you only want five words, it isn't like you're trying to recognize flowing speech with several hundred thousand potential words. These systems are fun and easy at limited vocabularies.
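The idea of such a grammar, independent of any particular speech engine, is just a mapping from spoken variants to the command you care about. A minimal Python sketch of that mapping follows; this is not the .NET API, and the names `GRAMMAR`, `build_lookup` and `interpret` are made up for illustration.

```python
# Each command lists the spoken variants that should trigger it.
GRAMMAR = {
    "yes": ["yes", "yup", "sure", "ok"],
    "no": ["no", "nope", "cancel"],
}


def build_lookup(grammar):
    """Invert the grammar into a variant -> command table."""
    return {variant: command
            for command, variants in grammar.items()
            for variant in variants}


def interpret(recognized_word, lookup):
    """Map a recognized word to its command, or None if it is not in the grammar."""
    return lookup.get(recognized_word.strip().lower())


lookup = build_lookup(GRAMMAR)
command = interpret("Yup", lookup)
```

With only a handful of entries, the recognizer has very few candidates to choose between, which is what makes small vocabularies so forgiving.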

"The good news is that you don't need to implement very much."

But I enjoy learning new things. My goals are:
1. I want to understand how it works
2. I want to implement it myself


@Álvaro's reply was top notch, thanks!

More tips/hints/replies are welcome.

In that case, consider using the links to the open source project list as a reference. Most projects document their algorithms and the papers they reference, and they also include source code so you can dig in to see what they actually did.

It is a subject with a lot of PhD work, published papers, and technical research behind it. It will be a lot of work, but if that's the type of thing you enjoy, have fun with that. :-)
