Detecting gunshots in audio with AI?

Started by sirSolarius
19 comments, last by Timkin 18 years, 3 months ago
Quote:Original post by sirSolarius
Anyhow, I'm just completely confused by how to set this thing up and run it. I'm also confused about something: if a gunshot is played over let's say 50ms, and I have 5 sets of sample data taken at 10ms intervals, how do I feed it to the network such that it realizes it's all part of one sound? Do I just make a giant vector with all 50ms of data?

This is really confusing me =(.


I've used FANN and recommend it. Its API maps cleanly onto the terminology you'll see in tutorials, and it includes sample programs for training simple networks. If you have MATLAB, you may want to look into some MATLAB neural network tools or scripts. Start by trying to get the network to train on simple sets such as AND/OR, then XOR (which requires at least one hidden layer).
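
If it helps, here is a rough from-scratch sketch of that XOR exercise in Python/numpy rather than FANN (the layer sizes, learning rate and epoch count are arbitrary, and a 2-2-1 net can occasionally stall in a local minimum, so a different seed may be needed):

# Minimal 2-2-1 network trained on XOR with plain backpropagation (illustration only).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(2, 2))   # input -> hidden weights
b1 = np.zeros((1, 2))
W2 = rng.normal(scale=1.0, size=(2, 1))   # hidden -> output weights
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(20000):
    h = sigmoid(X @ W1 + b1)              # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (y - out) * out * (1 - out)   # output error, pushed back through the sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)    # error attributed to the hidden layer
    W2 += lr * h.T @ d_out                # weight updates proportional to the error
    b2 += lr * d_out.sum(axis=0, keepdims=True)
    W1 += lr * X.T @ d_h
    b1 += lr * d_h.sum(axis=0, keepdims=True)

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))  # should approach [0, 1, 1, 0]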

Yes, you can feed all of the sound to it at once, but I doubt it will learn much. You generally want to reduce the number of inputs so that they contain as little data as possible but still hold the key for discriminating it, so techniques such as Principal Components Analysis are popular nowadays for this. Another option is for you to only pass in as inputs important features of the data, for instance statistical data, FFT results, and maybe duration data.
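
To give a flavour of that reduction step, here is a hedged numpy sketch of PCA applied to raw audio windows (the window length and component count are invented numbers, not recommendations):

# Sketch: project each raw audio window onto its first few principal components.
import numpy as np

def pca_reduce(windows, n_components=8):
    """windows: (num_windows, samples_per_window) array of raw frames."""
    mean = windows.mean(axis=0)
    centered = windows - mean
    # right singular vectors = directions of greatest variance in the data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

# pretend data: 200 windows of 441 samples (10 ms at 44.1 kHz)
fake_windows = np.random.randn(200, 441)
features, mean, components = pca_reduce(fake_windows)
print(features.shape)   # (200, 8): 8 numbers per window instead of 441
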
h20, member of WFG 0 A.D.
Ok, I'm looking at HMMs and they definitely seem easier to understand than NNs.

So what I'm thinking is that I can take like 10 sets of FFT data (I'll add more variables later like sound levels, etc.) and make states out of them. Then I'll just grab 10 sets of FFT data at a time from the audio clip, run it through the HMM and see if I end in the final state. Is that correct so far?

I haven't really found any references about generating an HMM from data, however. If I show my HMM a shot with a certain 512 values, and then show it another shot with 512 similar but different values, how does it decide to move onto the next state?
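
The part I think I follow is how you would score a window once a model already exists, something like this toy Python sketch (all the numbers are made up, and it assumes each FFT frame has already been quantised to a discrete symbol):

# Forward algorithm for a discrete-observation HMM: log P(observation sequence | model).
import numpy as np

def log_likelihood(obs, start_p, trans_p, emit_p):
    """obs: symbol indices; start_p: (N,), trans_p: (N, N), emit_p: (N, M)."""
    alpha = start_p * emit_p[:, obs[0]]
    log_l = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans_p) * emit_p[:, o]   # move one step through the model
        log_l += np.log(alpha.sum())
        alpha /= alpha.sum()                       # rescale to avoid underflow
    return log_l

# toy 2-state model over 4 quantised FFT symbols
start = np.array([0.8, 0.2])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
print(log_likelihood([0, 2, 3, 1], start, trans, emit))
# Presumably you'd train one model on gunshot windows, one on everything else,
# and pick whichever gives the new window the higher likelihood.
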
There's a book called "AI Techniques for Game Programming" that covers neural networks from the beginning and makes them surprisingly easy to learn.

I agree that while a NN is a difficult algorithm to understand, once you have it written it will be the easiest way to get a very accurate "gunshot recognition" algorithm. And the above-mentioned book can get you up and running with a basic neural network in a matter of hours (if you're comfortable with C++ as a whole, anyway).

Check out my new game Smash and Dash at:

http://www.smashanddashgame.com/

Quote:Original post by JBourrie
There's a book called "AI Techniques for Game Programming" that covers neural networks from the beginning and makes them surprisingly easy to learn.

I agree that while a NN is a difficult algorithm to understand, once you have it written it will be the easiest way to get a very accurate "gunshot recognition" algorithm. And the above-mentioned book can get you up and running with a basic neural network in a matter of hours (if you're comfortable with C++ as a whole, anyway).


Ok, I guess I'm just still trying to figure out how to take like 30ms of sound data and give it to a NN. Do I just make a giant vector of all the data and pass it in, and then read 30ms (or whatever) at a time and feed it into the network?
Hi again. You could take all that data and try it if you wanted to, but that would make for a gigantic neural network. What you want to do instead is extract features from that 30ms window of data and feed those to the neural net. The neural net should then output (for example) a 0 or a 1. Let me try to show you how you would do that. This is going to be sloppy because I'm in a rush to do other things.

Just pretend that this data vector is your sound data:

x = [0.1, 0.5, 0.2, 0.5]

In reality 30ms would be a lot more data points, but we are pretending.

We are extracting features from x:

Suppose our first feature is the estimated mean of x: then f1 = (0.1 + 0.5 + 0.2 + 0.5)/4 = 0.325.

Suppose a second feature is the magnitude at frequency bin k of the FFT of x. The FFT might yield something like (again, just pretending):
fft of x = [0.4+0.2i, 0.4+0.1i, 0.3+0.5i, 0.1+0.9i]
You then take the magnitude of the complex value at bin k; for a value a+bi the magnitude is sqrt(a^2 + b^2), so for example |0.3+0.5i| = sqrt(0.09 + 0.25) ≈ 0.583. Call that number f2.

You can get more features, but the ones that characterize the gunshot would be best. Suppose we wanted to try just these 2 features. Then our feature vector would be:

fv = [f1, f2]
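
In rough Python (just a sketch; the FFT bin index is an arbitrary choice), pulling those two features out of one window might look like this:

# Sketch: extract the two toy features above from one window of samples.
import numpy as np

def extract_features(window, k=1):
    f1 = np.mean(window)             # feature 1: estimated mean of the window
    spectrum = np.fft.rfft(window)   # FFT of the window
    f2 = np.abs(spectrum[k])         # feature 2: magnitude at frequency bin k
    return np.array([f1, f2])

x = np.array([0.1, 0.5, 0.2, 0.5])
print(extract_features(x))           # [0.325, 0.1] for this toy vector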

So that is how we extract features from a window of data. Now suppose we had a lot of example windows of data. Some representing gunshots and some representing non-gunshots. We should extract features from each window.

Window1 --> [f11, f12] is a gunshot
Window2 --> [f21, f22] is not a gunshot
Window3 --> [f31, f32] is not a gunshot
...
WindowN --> [fN1, fN2] is a gunshot

This is our training set for a NN. We want the NN to learn the mapping from feature vectors to whether or not the window is a gunshot.

Suppose we get these values, for example, and we let an output of 1 from our NN mean a gunshot and 0 mean not a gunshot.

Window1 --> [0.1, 0.5] outputs 1
Window2 --> [0.3, 0.1] outputs 0
Window3 --> [0.9, -0.3] outputs 0
...
WindowN --> [0.1, -343.2] outputs 1

Now that we have this data, we train our neural network using a learning algorithm such as backpropagation or a genetic algorithm. The algorithm presents the current window's feature vector at the input of the NN and sees what it outputs. If it doesn't give the correct output, the error is fed back to adjust the weights of the neural network. For example:

We input window 1 features and the NN gives us 1. Ok that is good. error = expected value - the output of the neural network = 1 - 1 = 0.

Suppose we input window 2's features next and it gives us 1. This gives error = 0 - 1 = -1. The algorithm will feed back this error and adjust the weights a little (assuming we're using backprop) to make the error smaller next time.

You do this for all the windows and look at the overall error. If it is still too large, you run through all the same windows again, and again, and again, until the error is small enough. At that point you have trained the neural net.

Now you use the neural network by:
1. getting a new window of audio data
2. extract new features
3. feed these features to the NN and see what the output is 0 or 1.
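
If it helps, here is a quick-and-dirty Python sketch of that whole train-then-use loop with a single sigmoid "neuron" standing in for the full network (all the windows and features below are invented, so treat it as an illustration of the cycle, not a working detector):

# Quick-and-dirty sketch of the whole loop with a single sigmoid "neuron".
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(window):
    # toy features: mean of the window, magnitude of one FFT bin
    return np.array([np.mean(window), np.abs(np.fft.rfft(window)[1])])

# pretend training windows: decaying noise bursts as "gunshots", quiet noise as "not"
rng = np.random.default_rng(1)
shots      = [rng.normal(0.0, 1.0, 64) * np.exp(-np.arange(64) / 8.0) for _ in range(50)]
background = [rng.normal(0.0, 0.2, 64) for _ in range(50)]
X = np.array([extract_features(win) for win in shots + background])
y = np.array([1.0] * 50 + [0.0] * 50)

# training: present each feature vector, feed the error back into the weights
w, b, lr = np.zeros(X.shape[1]), 0.0, 0.1
for epoch in range(500):
    for features, target in zip(X, y):
        out = sigmoid(w @ features + b)
        err = target - out            # expected value minus network output
        w += lr * err * features      # adjust weights to shrink the error next time
        b += lr * err

# use: new window -> extract features -> 0 or 1 from the trained "network"
new_window = rng.normal(0.0, 1.0, 64) * np.exp(-np.arange(64) / 8.0)
print(int(sigmoid(w @ extract_features(new_window) + b) > 0.5))   # 1 means "gunshot"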

Ta da. I know this is sloppy, but I hope you get the idea. I have to go treadmilling.
Quote:Original post by NickGeorgia
As for Markov models, I would say that the main advantages of using a NN,

1. usually NN's require less data (pdf's need lots and lots of data)

That would only be true if you took a very naive approach to HMMs or modelling pdfs. Sparse data sets can be handled in many contexts (ANNs, HMMs, Decision Trees, etc) and in all of these (ANNs included), the accuracy of the classification is dependent on the information contained within the data. If you have less data, all of these techniques suffer. I've never found it to be true that ANNs require 'less' data than other methods.

Quote:2. NN can easily be expanded (adding nodes, etc.)

If you add a node to a network, you change the input-output mapping, just as if you add a node to a probabilistic network. In both cases, you need to re-condition the network/model on the available data.

Quote:3. I'm sure there's more...

The usual benefit of ANNs over probabilistic techniques is that you don't need to specify a domain model (typically in the form of prior and conditional distributions/densities)... however this lack of specification comes at a cost: you need to ensure that your training set is representative of your operational set. This is why ANNs have had so much widespread appeal (easy to implement when you don't know much about the domain) and have failed to live up to their promise as a general classification/learning tool (because it's nearly impossible to obtain representative data... and when you do, certain architectures are nearly impossible to train).

Any of the varied classification techniques should work (to varying degrees of success) on this problem. Any technique though will require pre-processing and post-classification analysis (for vetting of false-positives).

Personally, as a first attempt, I'd consider classification based on the Minimum Embedding Dimension of the 1-D audio signal. This will avoid the need for spectral analysis and speed up processing for real-time work. There is plenty of literature online regarding methods for obtaining a Minimum Embedding Dimension. If this wasn't sufficient I'd then look at other features: power/variance and its time derivatives, time-frequency analysis (short-time Fourier transforms/wavelets), etc. If you're going to use a classification method (ANN or any other) then you should definitely perform a transformation into feature space and classify in that space, rather than just on the raw data. Why? Because it will make training of your classifier far easier and give you more reliable answers.

Cheers,

Timkin

[Edited by - Timkin on January 16, 2006 5:33:46 PM]
Quote:If you add a node to a network, you change the input-output mapping, just as if you add a node to a probabilistic network. In both cases, you need to re-condition the network/model on the available data.


I was speaking in terms of using the backpropagation algorithm here and the ease of expanding it with similar nodes.

Quote:That would only be true if you took a very naive approach to HMMs or modelling pdfs. Sparse data sets can be handled in many contexts (ANNs, HMMs, Decision Trees, etc) and in all of these (ANNs included), the accuracy of the classification is dependent on the information contained within the data. If you have less data, all of these techniques suffer. I've never found it to be true that ANNs require 'less' data than other methods.


From my experience, to be fairly confident in a pdf/HMM estimate, you need quite a bit of data. I am assuming this would be the case for gunshots. On the other hand, neural networks have an ability to generalize (interpolate); HMMs/pdfs do not have this capability, at least in my experience. If you have some links on how to generate accurate pdfs from sparse data, I'd be interested. Modeling with sparse data is something I'm looking into very closely lately.

Quote:This is why ANNs have had so much widespread appeal (easy to implement when you don't know much about the domain) and have failed to live up to their promise as a general classification/learning tool (because its nearly impossible to obtain representative data... and when you do, certain architectures are nearly impossible to train).


This I might actually agree with to some extent. A neural network is a general classifying tool in the sense that it can learn an input/output mapping. Representative data can be difficult to find, and that is a problem for every classifier, not just NNs. It's all about the data. However, in the case of gunshots a NN might work since the data is available. I haven't come across a NN that was impossible to train, though, unless you have very bad data. To get representative data, there are techniques such as Taguchi matrices. The key is to develop a good plan for gathering representative data, and usually that is the last thing on everyone's mind. It should be the first.

I like the idea of using minimum embedding dimension as a feature. I used it when I was working with chaotic systems. It's been a while, but is it fast to compute?

One more thing I want to reiterate. A neural network is a perfectly good classifier. However, just like any modeling tool, it will not perform miracles on bad data. And just like any modeling tool, if you try to go outside the bounds of the model's purpose, it will fail.

I forgot to say that you seem to be a perfect moderator for this forum, Timkin. Glad you are around. ++'s. I'm trying to help, but sometimes I'm a little sloppy. I'm counting on you to clean up my messes if you don't mind.

[Edited by - NickGeorgia on January 17, 2006 5:54:05 AM]
Quote:Original post by NickGeorgia
If you have some links on how to generate accurate pdfs from sparse data, I'd be interested. Modeling with sparse data is something I'm looking into very closely lately.


Mmm.. 'accurate' and 'sparse data' in the one sentence! ;) Relying on sparse data in probability models has the same problems when relying on sparse data in any other model... the generalisation of the model is poor, even between data points. Particle filtering techniques have been shown to be reasonable on sparse data, but obviously, as with other 'basis function' techniques in classification, the assumptions you make in choosing a basis reflect in the accuracy of the model and its generalisation ability.

Quote:I like the idea of using minimum embedding dimension as a feature. I used it when I was working with chaotic systems. It's been a while, but is it fast to compute?

That depends on the technique you use. In my EEG analysis research I used a False Nearest Neighbours algorithm. The slowest part was the construction of a k-d tree for each candidate embedding dimension. I didn't bother with acceleration techniques as I was doing offline analysis, but it was reasonably fast. You might be able to speed it up enough to run it in real time for gunshot analysis... but that's more a hardware issue than one of software, IMO. ;)
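
(If you want to play with the idea, here's a very rough Python sketch of the false nearest neighbours test; the delay, threshold, dimension range and test signal are arbitrary placeholders, and the published versions, e.g. Kennel et al., use more careful criteria:)

# Rough false-nearest-neighbours sketch for estimating a minimum embedding dimension.
import numpy as np
from scipy.spatial import cKDTree

def delay_embed(x, dim, tau):
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def false_neighbour_fraction(x, dim, tau=1, threshold=10.0):
    emb = delay_embed(x, dim, tau)          # points in dim dimensions
    nxt = delay_embed(x, dim + 1, tau)      # the same points in dim+1 dimensions
    emb = emb[:len(nxt)]                    # align lengths
    dist, idx = cKDTree(emb).query(emb, k=2)        # k=1 is each point itself
    d_low, nn = dist[:, 1], idx[:, 1]
    d_high = np.linalg.norm(nxt - nxt[nn], axis=1)  # same pairs, one dimension higher
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = d_high / d_low
    return np.mean(ratio > threshold)       # fraction of neighbours that "fall apart"

signal = np.sin(np.linspace(0, 60, 3000)) + 0.05 * np.random.randn(3000)
for d in range(1, 6):
    print(d, false_neighbour_fraction(signal, d))   # pick the first d where this drops
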

Quote:I forgot to say that you seem to be a perfect moderator for this forum, Timkin.


Hehe.. I've actually been the surrogate moderator for this forum for many years... Ferretman (Steve) hasn't been around much, particularly since Geta's (Eric Dybsand) untimely death. I suspect work and life have kept him too busy for quite some time now. I too had a quiet time for about 18 months (mostly work induced) and didn't contribute much... but I've been spending more time online again these past 6 months. Maybe I should finally say something to Dave about taking over formally. ;)
Thanks for answering my questions. I didn't know you worked on EEGs before. I knew some people who did and they were very busy developing different features to extract from the data. I always wondered if the EEG did contain the information for detecting seizures since the sampling rate seemed rather slow (100 Hz). But like I said, I was just an outsider looking in. I liked the research though.

Quote:Hehe.. I've actually been the surrogate moderator for this forum for many years... Ferretman (Steve) hasn't been around much, particularly since Geta's (Eric Dybsand) untimely death. I suspect work and life have kept him too busy for quite some time now. I too had a quiet time for about 18 months (mostly work induced) and didn't contribute much... but I've been spending more time online again these past 6 months. Maybe I should finally say something to Dave about taking over formally. ;)


I think you should. You have a very comprehensive knowledge of techniques and plenty of experience, and your answers are very thoughtful.

I was sad to hear about Eric Dybsand. He sounded like a very nice and respected individual who was active in the AI community (from reading the comments on the memorial website). I would have liked to have known him.
Hi,

Like Yvanhoe, I strongly, strongly suggest using Support Vector Machines (SVM) for classification/detection problems.
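
If you want something concrete to start from, here is a minimal Python sketch with scikit-learn's SVC standing in for whichever SVM library you end up using (libsvm has the same basic train/predict workflow); the feature vectors are placeholders:

# Minimal SVM classifier sketch; X holds one feature vector per audio window.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)),   # pretend gunshot features
               rng.normal(0.0, 1.0, (50, 2))])  # pretend background features
y = np.array([1] * 50 + [0] * 50)

clf = SVC(kernel="rbf", C=1.0)   # RBF kernel is a common default choice
clf.fit(X, y)
print(clf.predict([[2.1, 1.8]])) # 1 for a gunshot-like feature vector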
