kurtosis algorithm help needed

Started by
11 comments, last by guywithknife 17 years, 1 month ago
Hi, Does anybody know where I might find some source code for the kurtosis algorithm? (preferably in C, Python or Java, but any language that I can decipher, pseudocode included, would be fine too). I wrote my own, but have no way of testing is it correct because I have no sample inputs for which I know the kurtosis. If anyone does, please give me a shout! I currently have this:

def mean(values):
	return sum(values) / len(values)

def variance(values, m):
	result = 0
	tmp = 0
	for v in values:
		tmp = v - m
		result += (tmp * tmp)
	return result / len(values)

def standardDeviation (var): # var=variance
	return sqrt(var)

def kurtosis (values, m, sd): # m = mean, sd = standard deviation
	result = 0
	for v in values:
		result += pow((v - m), 4)
	return (result / ((len(values) - 1) * pow(sd, 4))) - 3

# Test it:
a = [0, 0.01, 0.05, 0.25, 0.4, 0.25, 0.05, 0.01, 0]
m = mean(a)
v = variance(a, m)
sd = standardDeviation(v)
k = kurtosis(a, m, sd)
print "mean =", m
print "variance =", v
print "standard deviation =", sd
print "kurtosis =", k
# Outputs:
#mean = 0.11333333333333334
#variance = 0.019399999999999997
#standard deviation = 0.13928388277184117
#kurtosis = -0.32932785985049584
#
# I have no idea if this is correct...


But I'm not really sure if I got that correct. If it is correct, please let me know :-) I used the formula found on this page to write that. Any comments or help are greatly appreciated. Thanks!
Advertisement
It looks right to me.

You can also use
result = sum(pow((v - m), 4) for v in values)
instead of using a for loop, but the for loop might be easier to read in this case..hmm..

You can also use the ** operator instead of pow.
result = sum((v - m)**4) for v in values)
Thanks, Vorpy, for the quick reply :-)

One more semi-related question: Are there any other formulas (besides kurtosis and skewness) which could potentially help me pick "out of place" peaks from a distribution (eg, if it is generally peaked, then a peak is of little interest to me, however, peaks in a flat distribution (or subset of a distribution) are of interest to me)?
I'm trying to use as many varying methods of detecting these points of interest as I can.

For anyone who is interested, I am looking at various ways of detecting audio events whilst ignoring noise. I am also interested in detecting if events appear to be recurring periodically or if they are once off.

Again, any comments are greatly appreciated.
Quote:Original post by issch
I wrote my own, but have no way of testing is it correct because I have no sample inputs for which I know the kurtosis. If anyone does, please give me a shout!


I'm sure you can easily generate data sets for uniform (with rand()) and normal (with Box-Muller) distributions and many others.

Compare the results to these:
Kurtosis for the common distributions
Code cleanup:

- You have a mean() function. Use it!
- Think more about what work is done by which functions. Calculating a variance is logically part of the process of calculating a standard deviation. Therefore, instead of passing a variance value to the standard deviation function, pass the data set, and call variance() *from* standardDeviation(). Similarly in other places. One function builds upon another; so don't expect the user to invoke everything. When you send a letter, you don't call the postman to tell him to pick it up from the mailbox, right?
- As noted, the list comprehension stuff, and the exponentiation stuff.
- There's a common bit of work that's done in variance() and kurtosis(): taking a sum-of-powers (only the power varies) -of-difference-from-mean, over a set of values. We can extract a common function for this, to reduce redundancy in the code.

from math import sqrtdef mean(values):	return sum(values) / len(values)def exponentiatedDifferences(values, power):	# This needs a better name... :/	m = mean(values)	return [(v - m) ** power for v in values]def variance(values):	return mean(exponentiatedDifferences(values, 2))def standardDeviation(values):	return sqrt(variance(values))def kurtosis(values):	return (sum(exponentiatedDifferences(values, 4)) / 	        ((len(values) - 1) * (standardDeviation(values) ** 4))) - 3# Test it:a = [0, 0.01, 0.05, 0.25, 0.4, 0.25, 0.05, 0.01, 0]# Notice, now, how all of the calculations are independent, and we don't need# any temporary variables.print "mean =", mean(a)print "variance =", variance(a)print "standard deviation =", standardDeviation(a)print "kurtosis =", kurtosis(a)


Zahlman, that is indeed cleaner. Thank you.

My reasoning behind having separate functions and passing them as parameters was that in my main program I actually use the mean and variance for various other things, not just the standard deviation and kurtosis and I figured why recompute them if I can compute them once and pass the values as parameters.


Would I be correct in thinking this:
def skewness(values):	return (sum(exponentiatedDifferences(values, 3)) / 	        ((len(values) - 1) * (standardDeviation(values) ** 3)))
Quote:Original post by issch
My reasoning behind having separate functions and passing them as parameters was that in my main program I actually use the mean and variance for various other things, not just the standard deviation and kurtosis and I figured why recompute them if I can compute them once and pass the values as parameters.


That's a reason for having separate functions. You can still call the functions separately in order to display the results, but that doesn't mean you have to let the caller track all the intermediate values.

Another way would be to make a class that calculates all the intermediate results and stores them as it goes. That lets you avoid re-calculating the mean, for example, while still making it accessible.

from math import sqrtclass statistics:	def __init__(self, values):		self.values = values	# We do some Python reflection stuff in order to cache results.	# The datasets will have to get pretty big before this is a net win ;)	# But at least it shows an interesting technique.	def __getattr__(self, name):		self.__dict__[name] = eval('self._' + name)()		return self.__dict__[name]	def _size(self):		return len(self.values)	def _mean(self):		return self.calcmean(self.values)	def calcmean(self, v):		return sum(v) / len(v)	def exponentiatedDifferences(self, power):		# This needs a better name... :/		return [(v - self.mean) ** power for v in self.values]	def _variance(self):		# Here, we use the generic mean function.		return self.calcmean(self.exponentiatedDifferences(2))	def _standardDeviation(self):		return sqrt(self.variance)	def _kurtosis(self):		return (sum(self.exponentiatedDifferences(4)) / 		        ((self.size - 1) * (self.standardDeviation ** 4))) - 3


Usage:
	>>> x = statistics([0, 0.01, 0.05, 0.25, 0.4, 0.25, 0.05, 0.01, 0])>>> x.kurtosis-0.32932785985049584>>> x.mean0.11333333333333334>>> x.variance0.019399999999999997>>> x.standardDeviation0.13928388277184117>>> x.size9


And each calculation only happens once because of the caching mechanism. ;) We don't even need a temporary 'm' in exponentiatedDifferences() any more.

Quote:Would I be correct in thinking this:
*** Source Snippet Removed ***


Looks right (I don't remember the formulas), but it seems that now we have another function to extract, since this looks awfully similar to the kurtosis calculation :)

# I really wish I had good names for these concepts ;)# I think you might be able to figure some out, though, if you browse around# MathWorld a bit...def nthOrderWeightedExponentiatedDifferenceAverage(values, order):	return (sum(exponentiatedDifferences(values, order)) / 	        ((len(values) - 1) * (standardDeviation(values) ** order)))def skewness(values):	return nthOrderWeightedExponentiatedDifferenceAverage(values, 3)def kurtosis(values):	return nthOrderWeightedExponentiatedDifferenceAverage(values, 4) - 3
I think it's better to avoid using eval:

	def __getattr__(self, name):		self.__dict__[name] = self.__dict__['_' + name]()		return self.__dict__[name]


There's a way to do this without introspection by using a descriptor. You can make a function decorator that returns a descriptor that runs the function when accessed and sets the corresponding attribute in the instance to the value, and then returns the value. I found this page that basically shows how it is done, except the code on it was apparently written before the decorator syntax was added.
Thanks a lot, I really appreciate it!
I wasn't expecting to learn more Python when I started this topic, thanks for that. :-P
Quote:Original post by Vorpy
I think it's better to avoid using eval:

	def __getattr__(self, name):		self.__dict__[name] = self.__dict__['_' + name]()		return self.__dict__[name]



I thought so too, but the __dict__ doesn't seem to contain methods (even though dir() includes their names in the resulting list). At least in Python 2.3.

Quote:There's a way to do this without introspection by using a descriptor. You can make a function decorator that returns a descriptor that runs the function when accessed and sets the corresponding attribute in the instance to the value, and then returns the value. I found this page that basically shows how it is done, except the code on it was apparently written before the decorator syntax was added.


I really should upgrade :(

This topic is closed to new replies.

Advertisement