• Advertisement
Sign in to follow this  

Anyone know the probability density function for this?

This topic is 1656 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

*Not really gamedev related, but question was asked on retro games site (world of spectrum, for ZX Spectrum nostalgia)*

 

Let's say I have a piece of string of length 1. I make N cuts in the string (cut positions are made from a uniform(0, 1) distribution), what is the probability density function for the length of the N+1 pieces? (Note: I make all the cuts at once, I don't make a cut, discard a piece, make another cut in what is left, etc.)

 

It's pretty obvious the mean is going to be 1/N and from a simulation done (in ZX Basic, lol) looks to me like the limiting pdf is going to be the exponential distribution with lambda = 1/N.

 

The constraint is that if x0, x1, ..., xn are the lengths of the pieces then x0 + x1 + ... + xn = 1

 

Over to you guys! I'm sure there must be a name for this distribution but a quick look at the 144 continuous pdf's on wikipedia hasn't shown up anything.

Share this post


Link to post
Share on other sites
Advertisement

That looks pretty good, I'll look more into it later! I'll link to that post on the other forum thread.

 

Of course I meant to say the mean would be 1/(N+1) for N cuts ;) I was thinking N pieces when I said 1/N.

 

Here's the thread on WoS forums (warning, contains ZX Basic!) http://www.worldofspectrum.org/forums/showthread.php?t=44605

 

EDIT: Is that going to converge to an exponential distribution as I predicted?

 

EDIT2: Bonus question:

 

Iif the string isn't length 1 anymore, but length L (an integer), and the cuts are made at integral points only, what is the distribution then? What if we disallow 0 length pieces of string (i.e. all cuts at integers are distinct)?

 

(in other words, what is the discrete pdf?)

Edited by Paradigm Shifter

Share this post


Link to post
Share on other sites

Order statistics then huh? I don't think we did those in my prob & stats courses in uni (only did 2 years of that, I was pure maths only in my final year). And it was a long time ago! Nice one!

 

EDIT: And LOL @ "This is the cumulative distribution function of the length of a piece of string"

Edited by Paradigm Shifter

Share this post


Link to post
Share on other sites

What Bacterius said. You could have found it on Google by typing `minimum of uniform random variables'.

 

Thanks for the keywords Alvaro! I didn't identify the problem as a minimum of random variables, but it makes a lot more sense that way.

 

 

 


EDIT: Is that going to converge to an exponential distribution as I predicted?

 

Not sure what you mean by exponential distribution, the PDF is always going to be a polynomial in the length, and exponential in the number of cuts, but the distribution does get sweked towards zero rather quickly with every additional cut.

 

 

 


EDIT2: Bonus question:
 
Iif the string isn't length 1 anymore, but length L (an integer), and the cuts are made at integral points only, what is the distribution then?

 

You should still be able to use the same reasoning above, using discrete sums instead of integrals (though it might require a few modifications). I would hope the PDF would look the same, only as a discrete approximation of the continuous version. Think about it: if you add enough points and rescale the PDF down by the number of points, you are essentially converging towards the continuous PDF by the definition of the integral!

 

 

 


What if we disallow 0 length pieces of string (i.e. all cuts at integers are distinct)?

 

That is.. more complicated, I think, since then the cuts are no longer independent sad.png and when that happens in statistics things tend to become hairy very quickly, though in the limit (with sufficiently large number of integers) you will still converge towards the original PDF, as the probability of cutting in the same location twice tends to zero. After all, you could say that the continuous version of the problem in your original post is really just taking the limit of this process, where the probability of cutting in the same spot is zero (the probability that two uniformly selected reals in any given non-empty interval are equal is, indeed, zero).

Share this post


Link to post
Share on other sites

Yeah, although sums/discrete pdf's tend to be hairier than integrals in my experience. If 0 length pieces of string are disallowed, it's going to be worse, although then it is a case of picking subsets from {1, 2, ..., L-1} (points where you cut the string) and analysing the pdf for string lengths resulting from that, and straight away it's switched to a 2 variable pdf (length L and number of cuts, N).

 

EDIT: I see that there is no section in wikipedia for the order statistic for a discrete uniform distribution either.

Edited by Paradigm Shifter

Share this post


Link to post
Share on other sites

I am sure the probability distribution converges to an exponential for large numbers of cuts.

 

My informal reasoning is that, since the cuts happen as independent random events, as the number of cuts grows the situation resembles a Poisson process.

 

Let's try it another way. We can change our probability distribution so its average is 1, by multiplying it by n+1. The resulting density function is n/(n+1)*(1-x/(n+1))^(n-1), whose limit is exp(-x). Undoing the multiplication by n+1, we get that for large values of n the distribution is well approximated by (n+1)*exp(-(n+1)*x), which is an exponential distribution with lambda = n+1.

 

EDIT: Typo in a formula.

Edited by Álvaro

Share this post


Link to post
Share on other sites

Yep, my intuition was that it was a Poisson process as well, sort of like modelling the distribution of the number of arrivals in a time period of 1 with mean arrival time 1/(n+1).

Share this post


Link to post
Share on other sites

I'd like to thanks Paradigm Switcher, Bacterius and Alvaro for taking the time (and having the interest) to contribute to answering the question I posed on the WoSpectrum forum (that PS reposed here). I can certainly see that BASIC (particularly on a ZX emulator!) is going to be a very tedious way of making further progress!

 

In light of this, I have a question to Bacterius (or any of you): What software did you use to write/run the program/model you presented in post #2?

 

I'm looking for a simple way to get started in the statistics/programming/modelling field. I have access to R and MATLAB on Mac platform. While I have learnt to create basic plots and do some very simple data manipulations, I haven't progressed to writing programs within them - are they even appropriate? I am trained in Molecular biology but have only ancient Maths and BASIC programming skills that are 20 years old and largely forgotten.

Share this post


Link to post
Share on other sites

Hey there 48K ;) (EDIT: And welcome to gamedev, it's a great site for getting your maths/programming questions answered in a friendly way. You may want to change your email preferences though so you don't get spammed with "X posted a thread about..." emails).

 

I'm sure MATLAB will allow you to do statistical simulations via programming but haven't got any experience of using it.

 

On WoS I recommended Python, that's a LOT easier than line-number era BASIC and has some really nice features for doing maths/stats stuff (like built in vectors, lists, dictionaries, support for bignums [arbitrarily large numbers], etc.).

 

Java is another option but that is going to be more daunting to start off with if you're not used to compile/link style programs and is frankly a beast.

 

Python homepage:

 

http://www.python.org/

 

EDIT: You should probably go into more detail about what you are trying to achieve as well, something to do with simulating splitting DNA or RNA strands I believe?

Edited by Paradigm Shifter

Share this post


Link to post
Share on other sites

Thanks PS. I've actually just spent a bit of time Googling for tutorials/examples for R, and I've now modified them and written my first "program". A pair of nested recursive loops. 

 

 

x <- 1:10
z <- NULL
for(j in seq(along=x)) { 
for(i in seq(along=x)) { 
        z <- c(z, x * x[j])   
    } 
}
x
z
plot (z,z)
 
Quite pleased actually by how simple it was to tweak a previous program (containing a single loop).
So, I may actually be more on track than I first thought!
The syntax is very different than what I am used to with BASIC, but the concepts seem to be retained - I guess it is all programming after all.
I may take a look at Python too - thanks for the heads up! - I need to reply over on WoS too.
 
Re: the question: Yes - I'm interested in modelling distributions of breaks/cuts in a linear element (in my case, chromosomes) - varying parameters, like frequency, and how the presence of one break/cut might affect the position of additional cuts.
 
 

Share this post


Link to post
Share on other sites

When posting code, use the <> icon in the toolbar above the posting box, or [ code ] [ /code ] (without the spaces) to retain the formatting, like so:

 

EDIT: However, that seems to eat any text after the closing [ /code ] tag, oh deary me gamedev ;)

x <- 1:10
z <- NULL
for(j in seq(along=x)) { 
    for(i in seq(along=x)) { 
        z <- c(z, x[i] * x[j])   
    } 
}
x
z
plot (z,z)
Edited by Paradigm Shifter

Share this post


Link to post
Share on other sites

Thanks for that. I'm stuck already though. I can get data into R using the read.table function. But I cannot seem to do the necessary arithmetic on the data in the resulting vector. R seems to treat the data held in the imported vector differently than in a vector I assign directly.

 

For example, in the program listed above, if I read a txt file to populate x, then run. instead of multiplying each vale of x by every other value, iteratively, and putting the result in z, it instead, only multiplies each element of x by itself, resulting in a z vector the same length as as x, rather than x * x in size.

 

I'm not sure I am being clear...basically given a contents of x of 1:3 (but populated by a text import) the resulting z vector is just: 1, 4, 9

 

When it should be: 1,2,3, 2,4,6, 3,6,9

 

Are you familiar with R?

Share this post


Link to post
Share on other sites

Nope, I don't know my R's from my elbow I'm afraid. Maybe ask in the "General Programming" forum, or someone who is familiar with it may see this thread?

 

I see it is a statistical language though so there is a chance someone reading this thread may be familiar with it.

 

I googled R language and saw it is also known as GNU S, make your mind up ;)

Share this post


Link to post
Share on other sites

I don't know much about R, but it is used at work and I should probably learn it. If you can post the exact commands that give you the undesired 1, 4, 9 result, I'll take a look.

Share this post


Link to post
Share on other sites


In light of this, I have a question to Bacterius (or any of you): What software did you use to write/run the program/model you presented in post #2?

 

I just whipped up a C program providing me with the PDF data in list form, and plotted it with Mathematica. It would've probably been faster to do it in MATLAB or R, or even Mathematica.. if you know it in advance, I didn't tongue.png (though I should probably learn)

Share this post


Link to post
Share on other sites
I normally use Perl for this type of quick thing:
 
for (1..5000000) {
  $x = min(min(rand(),rand()),rand());
  $count[int($x*500)]++;
}

for $i (0..499) {
  print "$i ".(0+$count[$i])."\n"
}

sub min {
  $x=shift;
  $y=shift;
  return $x>$y?$y:$x;
}
Edited by Álvaro

Share this post


Link to post
Share on other sites

I normally use Perl for this type of quick thing:
 

for (1..5000000) {
  $x = min(min(rand(),rand()),rand());
  $count[int($x*500)]++;
}

for $i (0..499) {
  print "$i ".(0+$count[$i])."\n"
}

sub min {
  $x=shift;
  $y=shift;
  return $x>$y?$y:$x;
}

 

I actually modelled the whole "cut a piece of string" aspect instead of going directly to a minimum of three uniform random variables because I didn't know they were equivalent, so the code was a bit longer. But yeah, it was pretty much the same otherwise - binning each sample and then printing out how many samples fell into each bin.

Share this post


Link to post
Share on other sites


I actually modelled the whole "cut a piece of string" aspect instead of going directly to a minimum of three uniform random variables because I didn't know they were equivalent, so the code was a bit longer.

 

Here's a quick way to see the equivalence: Instead of making N cuts in the string, start with a circle and make N+1 cuts in it. Wherever you make the first cut in the circle doesn't make any difference, and after you have made it you can take coordinates where the cut happened at 0 and the length of the circle is 1. So the two procedures will result in the same probability distribution for lengths, but thinking of the circle makes it much easier to see that the lengths of all the pieces come from the same distribution. Well, at least that's what convinced me.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement