Function to measure 'stratification' in data

Posted by AnAss



I am looking for a function that returns a value that is proportional to the overall evenness of distribution of a set of numbers.

For example:
0 1 2 3 4 5 6 7 8 9 10
would return a low value, because the values are evenly distributed.

This:
0 0 1 1 1 2 6 7 7 7 8
would return a high value, because the values are clumped together.

My basic knowledge of statistics like standard deviation, etc. does not have a tool for this type of analysis.

What about transforming the sequence by taking the difference between adjacent values (assuming they're ordered)?

The first sequence would become:
0 1 1 1 1 1 1 1 1 1 1

The second:
0 0 1 0 0 1 4 1 0 0 1

That big 4 there tells you there are 2 clusters.

There might be a fancier way of doing this, but differencing doesn't seem too bad.
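
In Python, the differencing idea might look something like this (a rough sketch; the gap threshold is an arbitrary illustrative choice, not something specified above):

def count_clusters(values, gap_threshold=2.0):
    # sort, take the gap between each pair of adjacent values,
    # and treat every unusually large gap as a cluster boundary
    values = sorted(values)
    gaps = [b - a for a, b in zip(values, values[1:])]
    return 1 + sum(1 for g in gaps if g > gap_threshold)

print(count_clusters([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 1
print(count_clusters([0, 0, 1, 1, 1, 2, 6, 7, 7, 7, 8]))   # 2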

[Edited by - jdindia on January 23, 2009 3:26:32 AM]

Quote:
Original post by AnAss
I am looking for a function that returns a value that is proportional to the overall evenness of distribution of a set of numbers.

For example:
0 1 2 3 4 5 6 7 8 9 10
Would return a low value, because the values are evenly distributed

This
0 0 1 1 1 2 6 7 7 7 8
Would return a high value, because the values are clumped together

My basic knowledge of statistics like standard deviation, etc. does not have a tool for this type of analysis.


The problem is that you haven't really defined the problem. You want a number that says "how uniform" the data is. But uniformity implies compliance with some a priori model, and you haven't specified the model.

Using a linear model, you can simply take the squared error from a linear regression. That would work for the above data certainly.

But what if the data follows a sinusoid? Do you want that to give you a high or low value? What if it follows a quadratic curve? Or do you just want to look for overall noisiness or smoothness? Statistics provides SO MANY different ways to analyze this type of problem that you need to specify more clearly how you want the function to behave to get an answer.
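
For instance, the linear-regression route is only a few lines with numpy (a sketch; it treats the position in the sequence as the independent variable):

import numpy as np

def linear_fit_error(values):
    # fit a line to (index, value) pairs and return the sum of squared residuals
    values = np.asarray(values, dtype=float)
    x = np.arange(len(values))
    slope, intercept = np.polyfit(x, values, 1)
    return float(np.sum((values - (slope * x + intercept)) ** 2))

print(linear_fit_error([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # ~0: perfectly linear
print(linear_fit_error([0, 0, 1, 1, 1, 2, 6, 7, 7, 7, 8]))   # much larger: clumped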

Hmmm, I think what you are looking for is to calculate the discrepancy of a sequence of numbers:
http://en.wikipedia.org/wiki/Low-discrepancy_sequence

So, let's say you have a sequence of integers in the range [0, 10]. Draw a line segment with ends, say, A and B and plot those numbers on this segment (one end is 0 and the other end is 10). You can calculate the discrepancy of a sub-segment of AB, let's name it CD, like this:

discrepancy_of_CD = (number_of_points_within_CD / total_number_of_points) / (CD / AB)
So, for a uniform sequence of numbers, you want the discrepancy of small sub-intervals within the range of your numbers to be near 1; that is, the fraction of points lying within CD should be the same as the fraction of AB that CD covers.

To calculate the discrepancy of the whole sequence, I guess that you need to calculate the discrepancy of many different intervals and take some kind of weighted average of the results.

I hope this helps; I am not sure if it is what you are looking for, though.
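
A rough sketch of the calculation for a single sub-interval (the example intervals here are arbitrary choices of mine):

def interval_discrepancy(points, lo, hi, c, d):
    # ratio of the fraction of points inside [c, d] to the
    # fraction of [lo, hi] that [c, d] covers; near 1 means uniform
    inside = sum(1 for p in points if c <= p <= d)
    return (inside / len(points)) / ((d - c) / (hi - lo))

pts = [0, 0, 1, 1, 1, 2, 6, 7, 7, 7, 8]
print(interval_discrepancy(pts, 0, 10, 0, 5))  # (6/11) / 0.5, about 1.09
print(interval_discrepancy(pts, 0, 10, 6, 8))  # (5/11) / 0.2, about 2.27: over-represented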

[Edited by - D_Tr on January 23, 2009 4:55:30 PM]

If you look at the frequency of each number, the first sequence is:
1 1 1 1 1 1 1 1 1 1 1

The second is
2 3 1 0 0 0 1 3 1 0 0

If the numbers were all picked uniformly, you would expect the frequencies to roughly follow a Poisson distribution. The Poisson distribution has the property that the variance and the average are the same value. You can compute a measure of clustering by dividing the sampled variance by the average. If the result is less than 1, the numbers are distributed more evenly than you would expect. If the result is larger than 1, the numbers are clustered more than you would expect.

1 1 1 1 1 1 1 1 1 1 1 (variance=0, mean=1, clustered_score=0)
2 3 1 0 0 0 1 3 1 0 0 (variance=1.4, mean=1, clustered_score=1.4)

I believe that number satisfies your requirements, and it's very easy to compute.
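
A sketch of that computation with numpy, binning the integer values 0-10 into one bin each (it reproduces the 0 and 1.4 scores above, using the sample variance):

import numpy as np

def clustering_score(values, n_bins):
    # histogram the values, then divide the sample variance of the
    # bin counts by their mean (a Poisson distribution gives a ratio of 1)
    counts, _ = np.histogram(values, bins=n_bins)
    return counts.var(ddof=1) / counts.mean()

print(clustering_score([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 11))  # 0.0
print(clustering_score([0, 0, 1, 1, 1, 2, 6, 7, 7, 7, 8], 11))   # 1.4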

jdindia/alvaro: I should have mentioned that my values are floating point, not integers. It seems to me that both of these methods require integers, or at least some fuzzy epsilon value to decide when two numbers are equal.

yahastu: I'm looking for cases where the data has many values near the same value. I now realize my original description was over-determined: I'm looking for a function that gives a high value for the second case, where the data is close to a few single values, and a low value for everything else. Or vice versa, obviously.

The most mathematically defensible method I can think of would be to approximate the histogram with a sum of gaussians, and find the minimum number of gaussians needed to represent the data with a given minimum PoE. Few gaussians = clumped data.
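
One way to sketch that with scikit-learn's GaussianMixture, substituting the BIC score for the minimum-PoE criterion (that substitution is mine, not part of the suggestion above):

import numpy as np
from sklearn.mixture import GaussianMixture

def min_gaussians(values, max_components=5):
    # fit mixtures with 1..max_components gaussians and return the
    # component count with the best (lowest) BIC score
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    bics = []
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bics.append(gmm.bic(X))
    return 1 + int(np.argmin(bics))

# few gaussians relative to the spread of the data = clumped
print(min_gaussians([0, 0, 1, 1, 1, 2, 6, 7, 7, 7, 8]))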

Sneftel, I like that idea a lot from a math point of view.

But I should also mention: I have to do this at least 30 times per second, and it should not hog all my CPU. My data sets will have 2 million points in them. (2048 * 1024)

All right, all right. Scan a window over the sorted numbers to find the largest window where all numbers inside the window are within a particular range. Disqualify that range and repeat. Once the maximum window size goes below a particular number, the number of windows you've found becomes your clumpiness.

EDIT: actually, on second thought this is really just a stupid way of approximating jdindia's solution. Just find the difference between each consecutive two numbers in the sorted set, and sum the squared differences. The result for your first set is 10, and the result for your second set is 20.
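
The simplified version is tiny (a sketch; it reproduces the 10 and 20 figures):

def clumpiness(values):
    # sum of squared gaps between consecutive sorted values;
    # the large gaps at cluster boundaries dominate the total
    values = sorted(values)
    return sum((b - a) ** 2 for a, b in zip(values, values[1:]))

print(clumpiness([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 10
print(clumpiness([0, 0, 1, 1, 1, 2, 6, 7, 7, 7, 8]))   # 20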

Quote:
Original post by AnAss
yahastu: I'm looking for cases where the data has many values near the same value. I now realize my original description was over-determined: I'm looking for a function that gives a high value for the second case, where the data is close to a few single values, and a low value for everything else. Or vice versa, obviously.


In that case this is a clustering problem, because you want to find smaller groups that represent the data, and then estimate how well it fits to those groups. I have recently done a literature survey on data clustering and I can tell you, there are literally thousands of different methods proposed because clustering is in general such a difficult and ill-defined problem.

In order to recommend a specific algorithm best for your needs, I would need to know a lot more about your specific problem-- ie, how the data was generated, what you want to use this for, what your efficiency requirements are, etc.

The methods specified by D_Tr and alvaro will not accomplish what you want. Sneftel's suggestion is more in line with what you need. What he is describing is called a "Gaussian mixture model" (GMM), and the best way to construct it is by using the Expectation-Maximization (EM) algorithm. I have coded this algorithm before and it does take a bit of math.

Usually GMMs are initialized by doing a Monte Carlo search with k-means clustering for each possible number of clusters, picking the number of clusters based on something like the F-statistic, and then converging the result with EM. This can be a computationally intensive process.

EDIT: I wouldn't recommend a GMM for a performance-critical application like this. You may want to consider using the modes of a kernel density estimate instead. This is mathematically the purest form of clustering because it is the only non-parametric way to find clusters (ie, it does not make assumptions about the distribution, such as the data falling into hyperspherical clusters or Gaussian distributions, which are rarely true in practice).

The kernel density estimate is very robust, and it automatically determines the number of clusters given a bandwidth parameter -- but there are ways to determine a good bandwidth automatically. It is also very efficient to compute.

The algorithm you should look at for this is called Mean Shift.
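
A minimal mean-shift sketch with scikit-learn (the bandwidth here is a hand-picked illustrative value; estimate_bandwidth can choose one automatically):

import numpy as np
from sklearn.cluster import MeanShift

# mean shift climbs to the modes of a kernel density estimate and
# assigns each point to the mode it converges to
X = np.array([0, 0, 1, 1, 1, 2, 6, 7, 7, 7, 8], dtype=float).reshape(-1, 1)
ms = MeanShift(bandwidth=2.0).fit(X)
print(len(ms.cluster_centers_))  # number of modes found (the two clumps)
print(ms.labels_)                # cluster assignment for each point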

