
Detecting gunshots in audio with AI?


Recommended Posts

This isn't related to game programming, but I don't know of any other real AI forums to post in (if you have any suggestions, that would be awesome). Anyhow, I want to build a program that listens to audio and detects if any guns are firing. I want to use a training set to "teach" the program what is and isn't a shot using data from the frequencies and levels of the audio. What data structures would be appropriate for doing this? What should I start looking into? Are there any easy introductions into this type of artificial intelligence?

Here is one way. Basically you sample a window of audio data. You extract features of the signal that help distinguish a gunshot from other noises. Features can be as simple as taking the mean or variance of the data in a window, or more complicated, like taking an FFT and finding peak frequencies. Then train a neural network with example features from example windows of audio data (some with gunshots, some without). Once the neural network is sufficiently trained, put it online with the feature extractor and window grabber. Anyway, there are other ways...
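In Python with NumPy, that window-to-features step might look like the sketch below (the 50 ms window length and 44.1 kHz sample rate are placeholder assumptions):

```python
import numpy as np

def extract_features(window, sample_rate=44100):
    """Simple features for one window of audio samples:
    mean, variance, and the peak frequency of the FFT."""
    spectrum = np.abs(np.fft.rfft(window))              # magnitude spectrum
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    peak_freq = freqs[np.argmax(spectrum)]              # strongest frequency
    return np.array([np.mean(window), np.var(window), peak_freq])

# e.g. a 50 ms window at 44.1 kHz is 2205 samples
window = np.random.randn(2205)
fv = extract_features(window)   # feature vector to hand to the classifier
```

The feature vector `fv`, not the raw samples, is what gets fed to whatever classifier you train.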

Another thing you should do is normalize the audio data once you grab a window. You might try z-score normalization.
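Assuming the intent here is z-score normalization (zero mean, unit variance per window), a minimal NumPy sketch:

```python
import numpy as np

def zscore(window):
    """Normalize a window to zero mean and unit variance so that
    loud and quiet recordings become comparable."""
    std = np.std(window)
    if std == 0:                          # a silent/constant window
        return window - np.mean(window)
    return (window - np.mean(window)) / std
```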

Ok, thanks so much for the reply. Now, I'm really new at AI but not at CS and math, so I need to start reading about these topics. A few questions:

Quote:
Original post by NickGeorgia
Here is one way. Basically you sample a window of audio data. You extract features of the signal that help distinguish a gunshot from other noises. Features can be as simple as taking the mean or variance of the data in a window, or more complicated, like taking an FFT and finding peak frequencies.


Ok, that I can definitely do. I have an audio library on hand that will give FFT data, levels, etc.

Quote:

Then train a neural network with example features from example windows of audio data (some with gunshots, some without). Once the neural network is sufficiently trained, put it online with the feature extractor and window grabber. Anyway, there are other ways...


Ok, I know nothing about neural networks. What are some good resources to start learning about them? What are advantages of this vs. hidden Markov models?

Then, how do I create and train this network?

Quote:

Another thing you should do is normalize the audio data once you grab a window. You might try z-score normalization.


Sorry, what does this mean? =)

I'm really new at this but I really want to learn. Thanks again for everything!

Quote:

Quote:

Then train a neural network with example features from example windows of audio data (some with gunshots, some without). Once the neural network is sufficiently trained, put it online with the feature extractor and window grabber. Anyway, there are other ways...


Ok, I know nothing about neural networks. What are some good resources to start learning about them? What are advantages of this vs. hidden Markov models?

Then, how do I create and train this network?


You can go to my journal. There are a few tidbits here and there about neural networks and a small tutorial way in the back. I think there are some links to other websites. If not, ask me again or try a Google search. You don't have to use neural networks. You could use a fuzzy logic expert system, decision trees, etc., or even simple thresholding if that works sufficiently. As for Markov models, I would say the main advantages of using an NN are:

1. usually NNs require less data (pdfs need lots and lots of data)
2. an NN can easily be expanded (adding nodes, etc.)
3. I'm sure there's more...

Quote:

Quote:

Another thing you should do is normalize the audio data once you grab a window. You might try z-score normalization.


Sorry, what does this mean? =)

I'm really new at this but I really want to learn. Thanks again for everything!


You want to normalize the data because volumes (etc.) might not all be uniform. This is a pre-processing step. You may also have to apply filtering if the noise levels are especially high.

Edit: Sorry if I'm not detailed right now. I'm a little shot after working all day.

I don't think that neural networks would be the best solution in a case like this ... Frankly, you just need to find some features of a "gunshot" sound which are different than the features of any other sound. Since a gunshot is pretty well-defined, find some sound samples of gunshots and open them in your favorite sound wave editor. Look at their patterns and qualitatively write down some patterns you notice. You may have to apply filters within your sound program to see apparent patterns, or you may even have to run some diagnostics like FFTs which your program can provide. Here, it's really just guess and test.

Now, once you've figured out the general defining characteristic of a "gunshot" (or perhaps there are several), first write a program that applies all the filters you used to see the difference visually. Next, use your verbal description to write code which normalizes a segment of the unknown wave and tests whether it matches that description.

In my opinion, for a beginner to AI, especially in somewhat vague situations like this, neural networks may be a bit confusing to implement and may not achieve great results. Any method you use may make some incorrect classifications, but using the above method you should be able to achieve an excellent hit rate. Also, I should note that since a gunshot will likely be drastically different from the "normal" sounds which enter your microphone, the above technique will statistically work out well. If the sounds were harder for a human to pick out, you'd probably have to use some more advanced algorithms.
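As a rough sketch of that hand-tuned approach (the peak and decay thresholds below are invented placeholders, exactly the "guess and test" values you'd tune against real recordings):

```python
import numpy as np

def detect_by_threshold(window, peak_thresh=0.5, decay_ratio=0.25):
    """Flag a window as a possible gunshot: a strong peak followed by
    a rapid decay. Both thresholds are made-up placeholders that would
    have to be tuned against real gunshot recordings."""
    w = np.abs(window)
    peak_idx = int(np.argmax(w))
    peak = w[peak_idx]
    if peak < peak_thresh:
        return False                        # never gets loud enough
    tail = window[peak_idx + len(window) // 4 :]
    if len(tail) == 0:
        return False                        # peak too late to judge the decay here
    tail_rms = np.sqrt(np.mean(tail ** 2))
    return tail_rms < decay_ratio * peak    # impulsive if the tail dies out
```

A window whose peak lands at the very end gets deferred to the next (overlapping) window rather than judged on a missing tail.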

I agree, you can do this numerous ways. Finding "the filters" you mention can be difficult, though, especially if you cannot find good features. Using a classifier such as a neural network can ease the burden a bit (if you're familiar with it), since it tries to do the classification for you. Granted, it may not do a good job, but it's better than racking your brain over a large amount of feature data, IMHO. There are of course other ways; if you can find some good distinguishing characteristics, use them, and you may of course choose not to use a neural network.

Another technique one could use is a clustering method, for example, fuzzy c-means clustering.
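For reference, fuzzy c-means is compact enough to sketch directly in NumPy (the cluster count `c`, fuzziness exponent `m`, and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means. Returns cluster centers and the membership
    matrix U (n_samples x c), where U[i, j] is the degree to which
    sample i belongs to cluster j (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)           # random initial memberships
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)             # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U
```

The soft memberships are what make it attractive here: instead of a hard gunshot/not-gunshot label you get a degree of certainty per window.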

I just mentioned the neural network since he/she wanted an AI technique, and I thought neural networks might be useful especially in the case where good features are elusive. It's all about the features. Anyway, have fun.

Quote:
Original post by mnansgar
I don't think that neural networks would be the best solution in a case like this ... Frankly, you just need to find some features of a "gunshot" sound which are different than the features of any other sound. Since a gunshot is pretty well-defined, find some sound samples of gunshots and open them in your favorite sound wave editor. Look at their patterns and qualitatively write down some patterns you notice. You may have to apply filters within your sound program to see apparent patterns, or you may even have to run some diagnostics like FFTs which your program can provide. Here, it's really just guess and test.

Now, once you've figured out the general defining characteristic of a "gunshot" (or perhaps there are several), first write a program that applies all the filters you used to see the difference visually. Next, use your verbal description to write code which normalizes a segment of the unknown wave and tests whether it matches that description.

In my opinion, for a beginner to AI, especially in somewhat vague situations like this, neural networks may be a bit confusing to implement and may not achieve great results. Any method you use may make some incorrect classifications, but using the above method you should be able to achieve an excellent hit rate. Also, I should note that since a gunshot will likely be drastically different from the "normal" sounds which enter your microphone, the above technique will statistically work out well. If the sounds were harder for a human to pick out, you'd probably have to use some more advanced algorithms.


I tried this, and I got a fairly accurate algorithm out of it. The problem is that I need a really accurate algorithm. There were just too many false positives and negatives, even when I looked at different frequencies and characteristics of the shot. For example, that low shuffling noise when someone runs with a microphone would trigger it (in like a 3 second clip it might trigger once, for example), and it's nearly impossible to get all the characteristics down IMHO.

I would like to experiment with the neural nets, and I will start reading about them. If there is anything I'm overlooking with respect to hardcoding the spectrum and level values that will work well, I can check that out too.

I figure if a neural network can separate and process speech, it can do a fairly distinct type of sound pretty well too.

Trying to reduce false positives and negatives can be a really difficult problem. One way is to use a technique to optimize the features you use (feature selection, etc.). Another is to handle uncertainty in some fashion, such as using probabilities, possibilities (fuzzy), or evidence (Dempster-Shafer). That way you can say it's a gunshot with a certain probability (possibility, degree of certainty, etc.). Try fuzzy c-means clustering for this method; it's not that difficult. Another is to attempt to suppress the interfering signals through pre-processing (filtering, etc.). I'm just throwing some ideas your way... lots more, and that's what makes it fun.

And once again, remember: if you have crummy features, you can only do so much. So you should try to handle the uncertainty in some fashion if this is the case. Think what the people who try to detect seizures from EEGs before they happen have to go through. Egads! Put my nickname down in the patent will ya? hehe J/K

Edit: also look into Particle Filters (it's all the rage)

[Edited by - NickGeorgia on January 11, 2006 11:35:08 PM]

Quote:
Original post by NickGeorgia
Trying to reduce false positives and negatives can be a really difficult problem. One way is to use a technique to optimize the features you use (feature selection, etc.). Another is to handle uncertainty in some fashion, such as using probabilities, possibilities (fuzzy), or evidence (Dempster-Shafer). That way you can say it's a gunshot with a certain probability (possibility, degree of certainty, etc.). Try fuzzy c-means clustering for this method; it's not that difficult. Another is to attempt to suppress the interfering signals through pre-processing (filtering, etc.). I'm just throwing some ideas your way... lots more, and that's what makes it fun.

And once again, remember: if you have crummy features, you can only do so much. So you should try to handle the uncertainty in some fashion if this is the case. Think what the people who try to detect seizures from EEGs before they happen have to go through. Egads! Put my nickname down in the patent will ya? hehe J/K

Edit: also look into Particle Filters (it's all the rage)


Ok, I'm wondering if I can just pick up a neural network library, throw training data at it and let it do its magic?

I found a library for .NET at http://www.cdrnet.net/projects/neuro/. Is there a better one to use that you know of?

Anyhow, I'm just completely confused by how to set this thing up and run it. I'm also confused about something: if a gunshot is played over let's say 50ms, and I have 5 sets of sample data taken at 10ms intervals, how do I feed it to the network such that it realizes it's all part of one sound? Do I just make a giant vector with all 50ms of data?

This is really confusing me =(.

It looks like the AI research community now uses support vector machines (try SVM on Google or Wikipedia) in place of neural networks. I am not a specialist in NNs, but from what I know, SVMs take a more mathematical and pragmatic approach to the problem. An SVM is an algorithm that approximates a multidimensional continuous function from examples (which may be partly erroneous). It looks a bit harder to use, but if you are not afraid of math, you should be able to better control the learning process.

One word on Markov models: their goal is to recognize a sequence of inputs, whereas an NN or SVM takes all its inputs without regard to their order. Markov models are heavily used in speech recognition.

Is it better to consider a gunshot as a sequence of samples or as a single event? It is shorter than a word, but maybe it can be learnt as something like
(saturation - fast decay - one or more echoes), which would justify a Markov approach. Or maybe it is more like:
(a window of 1 second, with a peak in the 180-220 Hz range and a very high mean level 0.05 seconds after t0)

You decide, you are the specialist :-)

Quote:
Original post by sirSolarius
Anyhow, I'm just completely confused by how to set this thing up and run it. I'm also confused about something: if a gunshot is played over let's say 50ms, and I have 5 sets of sample data taken at 10ms intervals, how do I feed it to the network such that it realizes it's all part of one sound? Do I just make a giant vector with all 50ms of data?

This is really confusing me =(.


I've used FANN and recommend it. It follows tutorial terminology very well and includes sample programs for training simple networks. If you have MATLAB, you may want to look into some MATLAB neural network tools or scripts. Start by trying to get the network to train on simple sets such as AND/OR then XOR (which requires at least one hidden layer).

Yes, you can feed all of the sound to it at once, but I doubt it will learn much. You generally want to reduce the number of inputs so that they contain as little data as possible while still holding the key for discriminating the sound; techniques such as Principal Components Analysis are popular for this nowadays. Another option is to pass in only important features of the data as inputs, for instance statistical measures, FFT results, and maybe duration.
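That PCA reduction step can be sketched with a plain covariance eigendecomposition (the 512-bin input and 8 output components below are placeholder sizes; real library implementations add numerical safeguards):

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X (n_samples x n_features) onto the top-k
    principal components of the centered data."""
    Xc = X - X.mean(axis=0)                     # center each feature
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]           # eigh returns ascending order
    return Xc @ eigvecs[:, order[:k]]

# e.g. shrink 512 FFT magnitudes per window down to a handful of inputs
features = np.random.randn(100, 512)
reduced = pca_reduce(features, 8)
```

The projection keeps the directions of greatest variance, so the network sees 8 inputs per window instead of 512.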

Ok, I'm looking at HMMs and they definitely seem easier to understand than NNs.

So what I'm thinking is that I can take like 10 sets of FFT data (I'll add more variables later like sound levels, etc.) and make states out of them. Then I'll just grab 10 sets of FFT data at a time from the audio clip, run it through the HMM and see if I end in the final state. Is that correct so far?

I haven't really found any references about generating an HMM from data, however. If I show my HMM a shot with a certain 512 values, and then show it another shot with 512 similar but different values, how does it decide to move on to the next state?
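For what it's worth, you don't watch the states directly: you score a whole observation sequence with the forward algorithm (which sums over all state paths) and compare likelihoods, or threshold them. Fitting the transition and emission tables themselves is done with Baum-Welch, which HMM libraries provide. A sketch with made-up parameters and pre-quantized loudness symbols:

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm.
    pi: initial state probabilities (S,)
    A:  state transition matrix (S, S)
    B:  emission probabilities (S, num_symbols)"""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_lik = np.log(c)
    alpha /= c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()              # rescale each step to avoid underflow
        log_lik += np.log(c)
        alpha /= c
    return log_lik

# Toy 2-state model (all numbers invented): state 0 ~ background,
# state 1 ~ shot; observations are loudness quantized into symbols 0/1/2.
pi = np.array([0.9, 0.1])
A = np.array([[0.95, 0.05],
              [0.60, 0.40]])
B = np.array([[0.7, 0.2, 0.1],      # background: mostly quiet symbols
              [0.1, 0.2, 0.7]])     # shot: mostly loud symbols
score = forward_log_likelihood([0, 2, 2, 1, 0], pi, A, B)
```

In practice you'd train one model on gunshot sequences and one on background sequences, then classify a new window sequence by whichever model gives it the higher log-likelihood.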

There's a book called "AI Techniques for Game Programming" that goes over neural networks from the beginning and makes them surprisingly easy to learn.

I agree that while an NN is a difficult algorithm to understand, once you have it written it will be the easiest way to get a very accurate "gunshot recognition" algorithm. And the above-mentioned book can get you up and running with a basic neural network in a matter of hours (if you're comfortable with C++ as a whole, anyway).

Quote:
Original post by JBourrie
There's a book called "AI Techniques for Game Programming" that goes over neural networks from the beginning and makes them surprisingly easy to learn.

I agree that while an NN is a difficult algorithm to understand, once you have it written it will be the easiest way to get a very accurate "gunshot recognition" algorithm. And the above-mentioned book can get you up and running with a basic neural network in a matter of hours (if you're comfortable with C++ as a whole, anyway).


Ok, I guess I'm just still trying to figure out how to take like 30ms of sound data and give it to a NN. Do I just make a giant vector of all the data and pass it in, and then read 30ms (or whatever) at a time and feed it into the network?

Hi again, you could take all that data if you wanted and try that, but that would be a gigantic neural network. What you want to do is take features from that 30ms window of data and then feed them to the neural net. The neural net should then output (for example) a 0 or a 1. Let me try to tell you how you would do that. This is gonna be sloppy 'cause I am in a rush to do other things.

Just pretend that this data vector is your sound data:

x = [0.1, 0.5, 0.2, 0.5]

now 30ms might be a lot more data points but we are pretending.

We are extracting features from x:

Suppose our first feature is the estimated mean of x: then f1 = (0.1+0.5+0.2+0.5)/4 = 0.325

Suppose a second feature is the magnitude at frequency k of the FFT of x. The FFT might yield something like (just pretending):
fft of x = [0.4+0.2i, 0.4+0.1i, 0.3+0.5i, 0.1+0.9i]
Then you would take the magnitude of bin k of this FFT:
f2 = magnitude at frequency k of the fft of x (after calculations)

You can get more features, but the ones that characterize the gunshot would be best. Suppose we wanted to try just these 2 features. Then our feature vector would be:

fv = [f1, f2]

So that is how we extract features from a window of data. Now suppose we had a lot of example windows of data. Some representing gunshots and some representing non-gunshots. We should extract features from each window.

Window1 --> [f11, f12] is a gunshot
Window2 --> [f21, f22] is not a gunshot
Window3 --> [f31, f32] is not a gunshot
...
WindowN --> [fN1, fN2] is a gunshot

This is our training set for an NN. We want the NN to learn the mapping from feature vectors to whether or not each window is a gunshot.

Suppose we get these values, for example. We will take a 1 at the output of our NN to mean a gunshot, and a 0 to mean it is not.

Window1 --> [0.1, 0.5] outputs 1
Window2 --> [0.3, 0.1] outputs 0
Window3 --> [0.9, -0.3] outputs 0
...
WindowN --> [0.1, -343.2] output 1

Now that we have this data, we train our neural network using a learning algorithm such as backpropagation or a genetic algorithm. The algorithm presents the current window's feature vector at the input of the NN and sees what it outputs. If it doesn't give the correct output, the error is fed back to adjust the weights of the neural network. For example:

We input window 1 features and the NN gives us 1. Ok that is good. error = expected value - the output of the neural network = 1 - 1 = 0.

Suppose we input window 2 features next and it gives us 1. This will give error = 0 - 1 = -1. The algorithm will feed back this error and adjust the weights a little (assuming using backprop) to make the error smaller next time.

You do this for all windows and see what the overall error is. If it is still too large, you do the training over all the same windows again, and again, and again... until the error is small enough. After that you have trained the neural net.

Now you use the neural network by:
1. getting a new window of audio data
2. extract new features
3. feed these features to the NN and see what the output is 0 or 1.

Ta da. I know this is sloppy, but I hope you get the idea. I have to go treadmilling.
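The walkthrough above, condensed into a runnable toy script (the feature values, network size, and learning rate are all invented; training is plain batch backpropagation on a sigmoid network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set in the spirit of the walkthrough: each row is a
# feature vector [f1, f2] (values invented), y = 1 means "gunshot".
X = np.array([[0.9, 0.8], [0.1, 0.2], [0.8, 0.9], [0.2, 0.1]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 units, small random initial weights.
W1 = rng.normal(0.0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 0.5, (4, 1)); b2 = np.zeros(1)

lr = 1.0
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)              # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)   # feed the error back (backprop)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

# After training: threshold the output at 0.5 to get gunshot / not.
predict = lambda F: (sigmoid(sigmoid(F @ W1 + b1) @ W2 + b2) > 0.5).astype(float)
```

A real feature extractor would replace the made-up `X`, but the training loop is the same "present, compare, feed back the error" cycle described above.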

Quote:
Original post by NickGeorgia
As for Markov models, I would say that the main advantages of using a NN,

1. usually NN's require less data (pdf's need lots and lots of data)

That would only be true if you took a very naive approach to HMMs or modelling pdfs. Sparse data sets can be handled in many contexts (ANNs, HMMs, Decision Trees, etc) and in all of these (ANNs included), the accuracy of the classification is dependent on the information contained within the data. If you have less data, all of these techniques suffer. I've never found it to be true that ANNs require 'less' data than other methods.

Quote:
2. NN can easily be expanded (adding nodes, etc.)

If you add a node to a network, you change the input-output mapping, just as if you add a node to a probabilistic network. In both cases, you need to re-condition the network/model on the available data.

Quote:
3. I'm sure there's more...

The usual benefit of ANNs over probabilistic techniques is that you don't need to specify a domain model (typically in the form of prior and conditional distributions/densities)... however, this lack of specification comes at a cost: you need to ensure that your training set is representative of your operational set. This is why ANNs have had so much widespread appeal (easy to implement when you don't know much about the domain) and have failed to live up to their promise as a general classification/learning tool (because it's nearly impossible to obtain representative data... and when you do, certain architectures are nearly impossible to train).

Any of the varied classification techniques should work (to varying degrees of success) on this problem. Any technique though will require pre-processing and post-classification analysis (for vetting of false-positives).

Personally, as a first attempt, I'd consider classification based on the Minimum Embedding Dimension of the 1-D audio signal. This will avoid the need for spectral analysis and speed up processing for real-time work. There is plenty of literature online regarding methods for obtaining a Minimum Embedding Dimension. If this wasn't sufficient I'd then look at other features: power/variance and its time derivatives, time-frequency analysis (short term fourier transforms/wavelets), etc. If you're going to use a classification method (ANN or any other) then you should definitely perform a transformation into feature space and classify in that space, rather than just on the raw data. Why? Because it will make training of your classifier far easier and give you more reliable answers.
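A rough sketch of the false-nearest-neighbours idea behind choosing a minimum embedding dimension (the unit delay and the tolerance `r_tol` are conventional placeholder choices; see the literature mentioned above for proper criteria):

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Stack delayed copies: row t is [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def false_neighbour_fraction(x, dim, tau=1, r_tol=10.0):
    """Fraction of nearest neighbours in dimension `dim` that fly apart
    when the embedding grows to dim+1 (Kennel-style criterion)."""
    emb1 = delay_embed(x, dim + 1, tau)
    emb = delay_embed(x, dim, tau)[: len(emb1)]
    false = 0
    for i in range(len(emb)):
        d = np.linalg.norm(emb - emb[i], axis=1)
        d[i] = np.inf                    # exclude the point itself
        j = int(np.argmin(d))
        # does the new coordinate separate the neighbours?
        if abs(emb1[i, -1] - emb1[j, -1]) > r_tol * d[j]:
            false += 1
    return false / len(emb)

# The minimum embedding dimension is the smallest dim where the
# false-neighbour fraction drops near zero.
```

This brute-force version is O(n^2) per dimension; the k-d tree mentioned above is how you make the neighbour search fast enough for long signals.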

Cheers,

Timkin

[Edited by - Timkin on January 16, 2006 5:33:46 PM]

Quote:
If you add a node to a network, you change the input-output mapping, just as if you add a node to a probabilistic network. In both cases, you need to re-condition the network/model on the available data.


I was speaking in terms of using the backpropagation algorithm here and the ease of its expansion with similar nodes.

Quote:
That would only be true if you took a very naive approach to HMMs or modelling pdfs. Sparse data sets can be handled in many contexts (ANNs, HMMs, Decision Trees, etc) and in all of these (ANNs included), the accuracy of the classification is dependent on the information contained within the data. If you have less data, all of these techniques suffer. I've never found it to be true that ANNs require 'less' data than other methods.


From my experience, to be fairly confident in a pdf/HMM estimate, you need quite a bit of data. I am assuming this would be the case for gunshots. On the other hand, neural networks have an ability to generalize (interpolate). HMMs/pdfs do not have this capability, at least in my experience. If you have some links on how to generate accurate pdfs from sparse data, I'd be interested. Modeling with sparse data is something I'm looking into very closely lately.

Quote:
This is why ANNs have had so much widespread appeal (easy to implement when you don't know much about the domain) and have failed to live up to their promise as a general classification/learning tool (because it's nearly impossible to obtain representative data... and when you do, certain architectures are nearly impossible to train).


This I might actually agree with to some extent. A neural network is a general classifying tool in the sense that it can learn an input/output mapping. Representative data can be difficult to find, and thus is a problem for every classifier, not just NNs. It's all about the data. However, in the case of gunshots an NN might work, since the data is available. I haven't come across an NN that was impossible to train, though, unless you have very bad data. To get representative data, there are techniques such as Taguchi matrices. The key is to develop a good plan to gather representative data, and usually that is the last thing on everyone's mind. It should be the first.

I like the idea of using minimum embedding dimension as a feature. I used it when I was working with chaotic systems. It's been a while, but is it fast to compute?

One more thing I want to reiterate: a neural network is a perfectly good classifier. However, just like any modeling tool, it will not perform miracles on bad data. And just like any modeling tool, if you try to go outside the bounds of the model's purpose, it will fail.

I forgot to say that you seem to be a perfect moderator for this forum, Timkin. Glad you are around. ++'s. I'm trying to help, but sometimes I'm a little sloppy. I'm counting on you to clean up my messes if you don't mind.

[Edited by - NickGeorgia on January 17, 2006 5:54:05 AM]

Quote:
Original post by NickGeorgia
If you have some links on how to generate accurate pdfs from sparse data, I'd be interested. Modeling with sparse data is something I'm looking into very closely lately.


Mmm.. 'accurate' and 'sparse data' in the one sentence! ;) Relying on sparse data in probability models has the same problems as relying on sparse data in any other model: the generalisation of the model is poor, even between data points. Particle filtering techniques have been shown to be reasonable on sparse data, but obviously, as with other 'basis function' techniques in classification, the assumptions you make in choosing a basis are reflected in the accuracy of the model and its generalisation ability.

Quote:
I like the idea of using minimum embedding dimension as a feature. I used it when I was working with chaotic systems. It's been a while, but is it fast to compute?

That depends on the technique you use. In my EEG analysis research I used a False Nearest Neighbours algorithm. The slowest part was the construction of a k-d tree for each candidate embedding dimension. I didn't bother with acceleration techniques as I was doing offline analysis, but it was reasonably fast. You might be able to speed it up enough to run in real time for gunshot analysis... but that's mostly a hardware issue rather than one of software, IMO. ;)

Quote:
I forgot to say that you seem to be a perfect moderator for this forum, Timkin.


Hehe.. I've actually been the surrogate moderator for this forum for many years... Ferretman (Steve) hasn't been around much, particularly since Geta's (Eric Dybsand) untimely death. I suspect work and life have kept him too busy for quite some time now. I too had a quiet time for about 18 months (mostly work induced) and didn't contribute much... but I've been spending more time online again these past 6 months. Maybe I should finally say something to Dave about taking over formally. ;)

Thanks for answering my questions. I didn't know you worked on EEGs before. I knew some people who did, and they were very busy developing different features to extract from the data. I always wondered whether the EEG contained the information needed to detect seizures, since the sampling rate seemed rather slow (100 Hz). But like I said, I was just an outsider looking in. I liked the research though.

Quote:
Hehe.. I've actually been the surrogate moderator for this forum for many years... Ferretman (Steve) hasn't been around much, particularly since Geta's (Eric Dybsand) untimely death. I suspect work and life have kept him too busy for quite some time now. I too had a quiet time for about 18 months (mostly work induced) and didn't contribute much... but I've been spending more time online again these past 6 months. Maybe I should finally say something to Dave about taking over formally. ;)


I think you should. You seem to have a very comprehensive knowledge of techniques and experience and your answers are very thoughtful.

I was sad to hear about Eric Dybsand. He sounded like a very nice and respected individual, active in the AI community (from reading the comments on the memorial website). I would have liked to have known him.

OFF-TOPIC... (*Timkin smacks his own wrist*)

Quote:
Original post by NickGeorgia
Thanks for answering my questions. I didn't know you worked on EEGs before.

Yep, I worked on seizure prediction for a couple of years. I found that basically all of the techniques based on time series analysis were doomed to fail, because while they could identify a pre-seizure state, they had no way of rejecting false positives. Pre-epileptic events are too easily misclassified, even when using non-linear analysis techniques. I believe that a true seizure predictor would have to be aware of the patient's physiological state, not just their electroencephalographic state... and we require far more localised information than standard scalp electrodes can supply. I found my best results came from embedded electrode arrays... but that doesn't help a patient who wants to wear a monitor out to dinner! ;)

Quote:
You seem to have a very comprehensive knowledge of techniques and experience and your answers are very thoughtful.

Thanks... I'm a research academic in AI with a diverse background (maths, physics, computing, philosophy). Hopefully that knowledge (and those many decades at school 8( ) can help others, even just a little.

Quote:
I was sad to hear about Eric Dybsand. He sounded like a very nice and respected individual, active in the AI community (from reading the comments on the memorial website). I would have liked to have known him.


My interactions with him were limited to these fora, but in the few years that I knew him I always found him to be thoughtful, intelligent, and extremely giving of his time and energy. He is sorely missed from our community. It's a shame that Steve doesn't call around any more; he too was a respected and important member of the indie community and he is missed. If you do happen to read this, Steve, let us know what's happening with you!

Cheers,

Timkin


