Audio Comparison based on model

4 comments, last by AndyArmstrong 13 years ago
Hey guys.

I want to solve the following problem in Java, as it is the language I am most experienced in and my preferred choice.

I want to be able to build a model of a sound - such as a dog barking - based upon, say, 100 sound samples of different dogs barking. Once I have this model, I want to be able to record a clip from a microphone and process it against the model to determine the probability that the recorded sample matches closely enough, i.e. to determine whether the recorded sound was a dog.

I had the following in mind:

Get the Fourier transforms of the 100 bark samples.

Get the average FT of the 100 - this is now the model.

Record the sound clip and generate its Fourier transform.

Subtract the sound clip's FT from the model's FT to see how they compare?
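
In code, I imagine something roughly like this - just a naive sketch of the pipeline above, where the O(n^2) DFT would be replaced by a proper FFT library (e.g. JTransforms) and the class/method names are my own invention:

```java
import java.util.List;

/** Naive sketch of the proposed pipeline: average magnitude spectra
 *  as a "model", then compare a new clip against it by Euclidean
 *  distance. The O(n^2) DFT is for illustration only. */
public class SpectrumModel {

    /** Magnitude spectrum of a real-valued signal (naive DFT). */
    static double[] magnitudeSpectrum(double[] x) {
        int n = x.length;
        double[] mag = new double[n / 2];   // positive frequencies only
        for (int k = 0; k < mag.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    /** Element-wise average of equally sized spectra: the "model". */
    static double[] averageSpectrum(List<double[]> spectra) {
        double[] avg = new double[spectra.get(0).length];
        for (double[] s : spectra)
            for (int i = 0; i < avg.length; i++)
                avg[i] += s[i] / spectra.size();
        return avg;
    }

    /** Euclidean distance between a clip's spectrum and the model. */
    static double distance(double[] model, double[] clip) {
        double sum = 0;
        for (int i = 0; i < model.length; i++) {
            double d = model[i] - clip[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}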

I am not hugely experienced with audio, so if anyone can tell me whether this is the correct approach, what FFT library to use, and what the process is for building an average FT from 100 samples, that would be great!

Thanks
Sound recognition like this is hard. Very hard.

All the Fourier transform gives you is the frequency spectrum of the sound - but that alone is not enough to recognize or classify sounds. It's not even close to enough. This is clear just from the fact that different barks can have different pitches, and yet still be clearly recognizable as barks - because of this, no approach based purely on comparing frequency spectra can possibly work.
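
You can see the problem with a toy example. Reusing the SpectrumModel sketch from your post above (with made-up parameters), two tones that are identical except for pitch end up with almost no spectral overlap, so a spectrum-distance model treats them as completely different sounds:

```java
/** Illustrates the pitch problem: the "same" sound at two pitches
 *  produces spectra with little overlap, so spectral distance is large. */
public class PitchDemo {
    public static void main(String[] args) {
        int n = 2048, rate = 8000;
        double[] a = tone(220, rate, n);   // a sound at 220 Hz
        double[] b = tone(330, rate, n);   // same timbre, higher pitch
        double[] sa = SpectrumModel.magnitudeSpectrum(a);
        double[] sb = SpectrumModel.magnitudeSpectrum(b);
        // Large distance despite the sounds being the "same kind"
        System.out.println(SpectrumModel.distance(sa, sb));
    }

    static double[] tone(double freqHz, int sampleRate, int n) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++)
            x[t] = Math.sin(2 * Math.PI * freqHz * t / sampleRate);
        return x;
    }
}
```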

This is because many kinds of sounds, including speech, animal sounds, and many musical instruments, are defined at least as much by their formant structure as by their frequency spectrum. (Another major component is the envelope - the way various aspects of the sound change over time.) The formant structure is determined by the structure of the resonating body, and acts somewhat like a filter on the sound. In the case of dog barks, the resonating body is the dog's upper respiratory tract and oral cavity. Dogs vary wildly in size, so it is indeed possible that even the formant structure varies depending upon the size of the dog.

What this means is that after extracting formant data from the frequency spectrum, you then need to convert it into a frequency-invariant form in order to compare different sounds meaningfully.
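
One standard route to such a frequency-invariant form - not specific to this thread, this is the cepstral idea from speech processing, sketched here with placeholder parameters - is to take a cosine transform of the log magnitude spectrum. The low-order coefficients describe the spectral envelope (the formants) roughly independently of pitch, while pitch information ends up in the higher-order coefficients:

```java
/** Sketch of the cepstral idea: low-order coefficients of the
 *  transformed log spectrum capture the envelope (formants),
 *  largely independent of pitch. Not a production implementation. */
public class Cepstrum {

    static double[] realCepstrum(double[] magnitude, int numCoeffs) {
        int n = magnitude.length;
        double[] logMag = new double[n];
        for (int i = 0; i < n; i++)
            logMag[i] = Math.log(magnitude[i] + 1e-12);  // avoid log(0)
        // A cosine transform stands in for the inverse DFT here,
        // since the log spectrum of a real signal is real and even.
        double[] cep = new double[numCoeffs];
        for (int q = 0; q < numCoeffs; q++) {
            double sum = 0;
            for (int k = 0; k < n; k++)
                sum += logMag[k] * Math.cos(Math.PI * q * (k + 0.5) / n);
            cep[q] = sum / n;
        }
        return cep;
    }
}
```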

Of course, this is assuming that dog barks even have a fixed formant structure, which is still an open question - the current research is inconclusive. If they don't, and it is in fact quite likely that they don't, the problem becomes even harder, if not currently insoluble.

In other words, with what we know right now, no one can even tell you whether this is currently possible, much less how to actually go about it; but it certainly won't be accomplished without a solid grounding in acoustics and signal processing.
Hi Anthony - this doesn't leave me with much hope. How about if I ignored pitch - is there not a way to apply the same form of sound independent of pitch, so that the Fourier average could be used to generate a shape of sound, irrespective of specific pitches? On top of this, it doesn't need to work with things like dogs - it could be a door slamming, or glass breaking, wind, a tractor, motorway traffic, a car zooming past, all those kinds of things. Shazam inspires me to believe this is all possible - especially given the kind of conditions it works under!
You can pretty much ignore the Fourier transform for any kind of recognition, apart from extremely trivial sounds. The problem with recognition, or even the simpler matching task, is mostly about reducing the size of the source data and extracting the most fundamental parts of the signal. The Fourier transform is a one-to-one mapping, meaning the data is not reduced at all, and furthermore the fundamental structure is not sufficiently extracted (the frequency domain is typically more useful, but it is nowhere near enough).

Formant analysis is, as Anthony said, about analyzing resonating bodies, and as such reveals the tonal components of a signal. This is very good for speech, but I seriously doubt barks contain enough tonal components to be useful. My advice is to forget about formant analysis as well.

You need to find a way to extract the fundamental features of a signal: the sequence of tonal components for speech, for example, while envelope shape across different frequency ranges could perhaps work for impulse-type sounds.
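
As a rough illustration of that envelope idea (reusing the naive SpectrumModel sketch from earlier in the thread; the frame size, hop, and band split are arbitrary placeholders): frame the signal, take a spectrum per frame, and track the energy of a few coarse frequency bands over time. The resulting per-band envelope curves are a small, comparable feature set for impulse-like sounds:

```java
/** Sketch: per-band energy envelopes over time. */
public class BandEnvelope {

    static double[][] bandEnvelopes(double[] signal, int frameSize,
                                    int hop, int numBands) {
        int numFrames = (signal.length - frameSize) / hop + 1;
        double[][] env = new double[numBands][numFrames];
        for (int f = 0; f < numFrames; f++) {
            double[] frame = new double[frameSize];
            System.arraycopy(signal, f * hop, frame, 0, frameSize);
            double[] mag = SpectrumModel.magnitudeSpectrum(frame);
            int binsPerBand = mag.length / numBands;
            for (int b = 0; b < numBands; b++) {
                double energy = 0;
                for (int k = b * binsPerBand; k < (b + 1) * binsPerBand; k++)
                    energy += mag[k] * mag[k];
                env[b][f] = Math.sqrt(energy);  // RMS-like band envelope
            }
        }
        return env;
    }
}
```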

Shazam inspires me to believe this is all possible - especially given the kind of conditions it works under!

You could take a read of How Shazam Works.
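
The core trick described there, very roughly, is to reduce each clip to compact hashes of spectral peak pairs, which survive noise much better than whole spectra. A toy sketch of that idea - the names, bit layout, and parameters here are illustrative, not Shazam's actual scheme:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy peak-pair fingerprinting: find one dominant spectral peak per
 *  frame, then hash pairs of nearby peaks (anchor bin, target bin,
 *  time offset) into compact fingerprints. */
public class FingerprintSketch {

    static List<Long> fingerprints(double[][] spectrogram, int fanOut) {
        // One dominant peak bin per frame
        int[] peaks = new int[spectrogram.length];
        for (int f = 0; f < spectrogram.length; f++) {
            int best = 0;
            for (int k = 1; k < spectrogram[f].length; k++)
                if (spectrogram[f][k] > spectrogram[f][best]) best = k;
            peaks[f] = best;
        }
        // Hash each peak together with the next few peaks after it
        List<Long> hashes = new ArrayList<>();
        for (int f = 0; f < peaks.length; f++)
            for (int j = 1; j <= fanOut && f + j < peaks.length; j++) {
                long h = ((long) peaks[f] << 32)
                       | ((long) peaks[f + j] << 16)
                       | j;
                hashes.add(h);
            }
        return hashes;
    }
}
```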
Somebody has suggested:

http://www.cp.jku.at/people/schedl/Research/Development/CoMIRVA/webpage/CoMIRVA.html

The audio processing section looks like it compares two audio sources to determine how similar they are.

