kurtosis algorithm help needed

issch    479
Hi, does anybody know where I might find some source code for the kurtosis algorithm? (Preferably in C, Python or Java, but any language I can decipher, pseudocode included, would be fine too.) I wrote my own, but have no way of testing whether it is correct because I have no sample inputs for which I know the kurtosis. If anyone has some, please give me a shout! I currently have this:
from math import sqrt

def mean(values):
	return sum(values) / len(values)

def variance(values, m):
	result = 0
	tmp = 0
	for v in values:
		tmp = v - m
		result += (tmp * tmp)
	return result / len(values)

def standardDeviation (var): # var=variance
	return sqrt(var)

def kurtosis (values, m, sd): # m = mean, sd = standard deviation
	result = 0
	for v in values:
		result += pow((v - m), 4)
	return (result / ((len(values) - 1) * pow(sd, 4))) - 3

# Test it:
a = [0, 0.01, 0.05, 0.25, 0.4, 0.25, 0.05, 0.01, 0]
m = mean(a)
v = variance(a, m)
sd = standardDeviation(v)
k = kurtosis(a, m, sd)
print "mean =", m
print "variance =", v
print "standard deviation =", sd
print "kurtosis =", k
# Outputs:
#mean = 0.11333333333333334
#variance = 0.019399999999999997
#standard deviation = 0.13928388277184117
#kurtosis = -0.32932785985049584
#
# I have no idea if this is correct...


But I'm not really sure if I got that correct. If it is correct, please let me know :-) I used the formula found on this page to write that. Any comments or help are greatly appreciated. Thanks!

Vorpy    869
It looks right to me.

You can also use
result = sum(pow((v - m), 4) for v in values)
instead of using a for loop, though the for loop might be easier to read in this case... hmm.

You can also use the ** operator instead of pow.
result = sum((v - m)**4 for v in values)

issch    479
Thanks, Vorpy, for the quick reply :-)

One more semi-related question: are there any other formulas (besides kurtosis and skewness) which could potentially help me pick out "out of place" peaks in a distribution? For example, if the distribution is generally peaked, then a peak is of little interest to me, but a peak in an otherwise flat distribution (or flat subset of a distribution) is of interest.
I'm trying to use as many different methods of detecting these points of interest as I can.

For anyone who is interested, I am looking at various ways of detecting audio events whilst ignoring noise. I am also interested in detecting whether events appear to recur periodically or are one-off.

Again, any comments are greatly appreciated.

Winograd    380
Quote:
Original post by issch
I wrote my own, but have no way of testing whether it is correct because I have no sample inputs for which I know the kurtosis. If anyone has some, please give me a shout!


I'm sure you can easily generate data sets for uniform (with rand()) and normal (with Box-Muller) distributions and many others.

Compare the results to these:
Kurtosis for the common distributions
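
Something like this would do as a quick check (just a sketch; the sample size and the exact Box-Muller form are arbitrary choices on my part, and the kurtosis formula is the same (n - 1) variant as in the first post):

from math import sqrt, log, cos, pi
import random

def excess_kurtosis(values):
	n = len(values)
	m = sum(values) / n
	sd = sqrt(sum((v - m) ** 2 for v in values) / n)
	# Same formula as the first post: fourth central moment over sd^4, minus 3.
	return sum((v - m) ** 4 for v in values) / ((n - 1) * sd ** 4) - 3

# Uniform on [0, 1): excess kurtosis should come out near -1.2 (i.e. -6/5).
uniform = [random.random() for i in range(100000)]

# Standard normal via Box-Muller: excess kurtosis should come out near 0.
# (1.0 - random.random() keeps the argument to log() strictly positive.)
normal = [sqrt(-2.0 * log(1.0 - random.random())) * cos(2.0 * pi * random.random())
          for i in range(100000)]

print "uniform:", excess_kurtosis(uniform)  # expect roughly -1.2
print "normal: ", excess_kurtosis(normal)   # expect roughly 0.0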

Zahlman    1682
Code cleanup:

- You have a mean() function. Use it!
- Think more about what work is done by which functions. Calculating a variance is logically part of the process of calculating a standard deviation. Therefore, instead of passing a variance value to the standard deviation function, pass the data set, and call variance() *from* standardDeviation(). Similarly in other places. One function builds upon another; so don't expect the user to invoke everything. When you send a letter, you don't call the postman to tell him to pick it up from the mailbox, right?
- As noted, the list comprehension stuff, and the exponentiation stuff.
- There's a common bit of work that's done in variance() and kurtosis(): taking a sum-of-powers (only the power varies) -of-difference-from-mean, over a set of values. We can extract a common function for this, to reduce redundancy in the code.


from math import sqrt

def mean(values):
	return sum(values) / len(values)

def exponentiatedDifferences(values, power):
	# This needs a better name... :/
	m = mean(values)
	return [(v - m) ** power for v in values]

def variance(values):
	return mean(exponentiatedDifferences(values, 2))

def standardDeviation(values):
	return sqrt(variance(values))

def kurtosis(values):
	return (sum(exponentiatedDifferences(values, 4)) /
	        ((len(values) - 1) * (standardDeviation(values) ** 4))) - 3

# Test it:
a = [0, 0.01, 0.05, 0.25, 0.4, 0.25, 0.05, 0.01, 0]
# Notice, now, how all of the calculations are independent, and we don't need
# any temporary variables.
print "mean =", mean(a)
print "variance =", variance(a)
print "standard deviation =", standardDeviation(a)
print "kurtosis =", kurtosis(a)



issch    479
Zahlman, that is indeed cleaner. Thank you.

My reasoning behind having separate functions and passing the values as parameters was that in my main program I actually use the mean and variance for various other things, not just the standard deviation and kurtosis, so I figured: why recompute them if I can compute them once and pass the values in as parameters?


Would I be correct in thinking this:

def skewness(values):
	return (sum(exponentiatedDifferences(values, 3)) /
	        ((len(values) - 1) * (standardDeviation(values) ** 3)))

Zahlman    1682
Quote:
Original post by issch
My reasoning behind having separate functions and passing the values as parameters was that in my main program I actually use the mean and variance for various other things, not just the standard deviation and kurtosis, so I figured: why recompute them if I can compute them once and pass the values in as parameters?


That's a reason for having separate functions. You can still call the functions separately in order to display the results, but that doesn't mean you have to let the caller track all the intermediate values.

Another way would be to make a class that calculates all the intermediate results and stores them as it goes. That lets you avoid re-calculating the mean, for example, while still making it accessible.


from math import sqrt

class statistics:
	def __init__(self, values):
		self.values = values

	# We do some Python reflection stuff in order to cache results.
	# The datasets will have to get pretty big before this is a net win ;)
	# But at least it shows an interesting technique.
	def __getattr__(self, name):
		self.__dict__[name] = eval('self._' + name)()
		return self.__dict__[name]

	def _size(self):
		return len(self.values)

	def _mean(self):
		return self.calcmean(self.values)

	def calcmean(self, v):
		return sum(v) / len(v)

	def exponentiatedDifferences(self, power):
		# This needs a better name... :/
		return [(v - self.mean) ** power for v in self.values]

	def _variance(self):
		# Here, we use the generic mean function.
		return self.calcmean(self.exponentiatedDifferences(2))

	def _standardDeviation(self):
		return sqrt(self.variance)

	def _kurtosis(self):
		return (sum(self.exponentiatedDifferences(4)) /
		        ((self.size - 1) * (self.standardDeviation ** 4))) - 3

Usage:
	
>>> x = statistics([0, 0.01, 0.05, 0.25, 0.4, 0.25, 0.05, 0.01, 0])
>>> x.kurtosis
-0.32932785985049584
>>> x.mean
0.11333333333333334
>>> x.variance
0.019399999999999997
>>> x.standardDeviation
0.13928388277184117
>>> x.size
9


And each calculation only happens once because of the caching mechanism. ;) We don't even need a temporary 'm' in exponentiatedDifferences() any more.

Quote:
Would I be correct in thinking this:
*** Source Snippet Removed ***


Looks right (I don't remember the formulas), but it seems that now we have another function to extract, since this looks awfully similar to the kurtosis calculation :)


# I really wish I had good names for these concepts ;)
# I think you might be able to figure some out, though, if you browse around
# MathWorld a bit...
def nthOrderWeightedExponentiatedDifferenceAverage(values, order):
	return (sum(exponentiatedDifferences(values, order)) /
	        ((len(values) - 1) * (standardDeviation(values) ** order)))

def skewness(values):
	return nthOrderWeightedExponentiatedDifferenceAverage(values, 3)

def kurtosis(values):
	return nthOrderWeightedExponentiatedDifferenceAverage(values, 4) - 3

Vorpy    869
I think it's better to avoid using eval:


def __getattr__(self, name):
	self.__dict__[name] = self.__dict__['_' + name]()
	return self.__dict__[name]


There's a way to do this without introspection by using a descriptor. You can make a function decorator that returns a descriptor that runs the function when accessed and sets the corresponding attribute in the instance to the value, and then returns the value. I found this page that basically shows how it is done, except the code on it was apparently written before the decorator syntax was added.
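
Roughly, something like this (my own sketch of the idea rather than the code on that page; it needs Python 2.4+ for the @ syntax and a new-style class, i.e. one that inherits from object, for the descriptor protocol to apply):

from math import sqrt

class cached(object):
	# Non-data descriptor: the first access computes the value and stores it
	# in the instance's __dict__, which shadows the descriptor from then on.
	def __init__(self, func):
		self.func = func
		self.name = func.__name__

	def __get__(self, obj, objtype=None):
		if obj is None:
			return self
		value = self.func(obj)
		setattr(obj, self.name, value)
		return value

class statistics(object):
	def __init__(self, values):
		self.values = values

	@cached
	def mean(self):
		return sum(self.values) / len(self.values)

	@cached
	def variance(self):
		return sum((v - self.mean) ** 2 for v in self.values) / len(self.values)

	@cached
	def standardDeviation(self):
		return sqrt(self.variance)

Accessing x.variance the first time computes (and caches) the mean as well; after that both are plain instance attributes and the descriptor never runs again.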

issch    479
Thanks a lot, I really appreciate it!
I wasn't expecting to learn more Python when I started this topic, thanks for that. :-P

Zahlman    1682
Quote:
Original post by Vorpy
I think it's better to avoid using eval:


def __getattr__(self, name):
	self.__dict__[name] = self.__dict__['_' + name]()
	return self.__dict__[name]



I thought so too, but the __dict__ doesn't seem to contain methods (even though dir() includes their names in the resulting list). At least in Python 2.3.

Quote:
There's a way to do this without introspection by using a descriptor. You can make a function decorator that returns a descriptor that runs the function when accessed and sets the corresponding attribute in the instance to the value, and then returns the value. I found this page that basically shows how it is done, except the code on it was apparently written before the decorator syntax was added.


I really should upgrade :(

Vorpy    869
Oh, oops! You're right: instances don't contain methods; their class does.

def __getattr__(self, name):
	self.__dict__[name] = statistics.__dict__['_' + name](self)
	return self.__dict__[name]
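
For what it's worth, a quick interactive check of the distinction (using the statistics class from above):

>>> x = statistics([0, 0.01, 0.05, 0.25, 0.4, 0.25, 0.05, 0.01, 0])
>>> '_mean' in x.__dict__
False
>>> '_mean' in statistics.__dict__
True
>>> statistics.__dict__['_mean'](x)   # plain function in the class dict; pass self explicitly
0.11333333333333334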

issch    479
Ooh, that's very cool indeed.
Heh, I'm glad I asked the original questions, not just for the answers I got to them, but for the nice Python tricks I learnt along the way.
Thanks guys! I'd rate you both up, but apparently I've already hit the max ;-)
