Right, I've been reading about some of these tests in a PDF on one of the official cryptography standards (not sure whether it's the only one, I just know that it's one of them). I'm really starting to see the benefit of analyzing the 1's and 0's instead of using weird birthday tests or pi tests or such things. I still kind of wanted to try my hand at it though lol. ;]
The math really is above me; I never learned what ∫ or Σ mean, and I'm really not used to reading it. The worst part is that when people tell me what formula to use, they hardly ever explain how they derived it; they just expect me to take it on faith lol. And when they do explain, it's all in technical mathy terms that I have no way of understanding at my current level. I'm getting it, though. It's rather slow progress, but I'm getting it.
I don't intend to give up until I've learned more about randomness. It's kind of a newfound passion of mine.
As always, thanks guys. I really appreciate it.
The US National Institute of Standards and Technology (NIST) has a suite of tests that are also popular: http://csrc.nist.gov.../rng/index.html
Some brief descriptions of the tests are at: http://csrc.nist.gov...tats_tests.html
The one test that seems closer than most to the compression-based test that alvaro suggested is the "Approximate Entropy Test". The basic idea is to take your bitstream and break it into m-bit chunks, where m is 1, 2, etc., and see if there is an imbalance in how frequently each chunk occurs. For instance, take the bitstream 01010101. If you break it into m = 1 bit chunks, you'll find that the chunk 0 and the chunk 1 both occur with the same frequency, so there is no imbalance. Awesome. Next, break it into m = 2 bit chunks. You'll notice right away that the chunk 01 occurs four times, but not once does the chunk 00, 10, or 11 occur. Not awesome. That's a big imbalance, and obviously the bitstream is nowhere near random. Of course, your bitstream would contain more than just 8 bits, and your m-bit chunks would get larger than just m = 2.
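If you want to get your hands dirty, here's a minimal C++ sketch of just the chunk-counting part (this is my own toy illustration of the idea, not the actual NIST test, which uses overlapping chunks and a proper statistic):

```cpp
#include <iostream>
#include <map>
#include <string>

int main()
{
    // The toy bitstream from the example above.
    const std::string bits = "01010101";

    // Tally non-overlapping m-bit chunks for m = 1 and m = 2.
    for (size_t m = 1; m <= 2; ++m)
    {
        std::map<std::string, size_t> counts;

        for (size_t i = 0; i + m <= bits.size(); i += m)
            ++counts[bits.substr(i, m)];

        std::cout << "m = " << m << ":" << std::endl;

        for (std::map<std::string, size_t>::const_iterator ci = counts.begin(); ci != counts.end(); ++ci)
            std::cout << "  chunk " << ci->first << " occurs " << ci->second << " time(s)" << std::endl;
    }

    return 0;
}
```

Run it and you'll see the m = 1 counts come out balanced (four 0's, four 1's), while the m = 2 counts are completely lopsided (four 01's and nothing else), just like the paragraph above describes.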
Essentially, a greater entropy means a greater balance in the frequency of symbols and a greater incompressibility. Get your toes wet with the Σ (summation) symbol by learning how to calculate the entropy of a bunch of symbols: http://en.wikipedia....eory)#Rationale
Here's the equation for Shannon entropy in natural units. Don't think too hard about it until after you read the next few paragraphs:

H = -\sum_{i=1}^{n} p_i \ln(p_i)

where \ln is the natural logarithm (ie. log() in C++).
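If it helps to see it in code, that equation translates almost directly into C++. Here's a minimal sketch (entropy_nats() is just a name I made up):

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// H = -sum(p_i * ln(p_i)), in natural units (nats).
// Assumes each probability is > 0 and that they sum to 1.
double entropy_nats(const std::vector<double> &p)
{
    double H = 0;

    for (size_t i = 0; i < p.size(); ++i)
        H -= p[i] * std::log(p[i]); // std::log() is the natural logarithm

    return H;
}

int main()
{
    // A fair coin: two symbols, each with probability 1/2.
    std::vector<double> p(2, 0.5);

    std::cout << entropy_nats(p) << std::endl; // prints ~0.693147, which is ln(2)

    return 0;
}
```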
It's pretty straightforward. Say you have a group of 6 symbols altogether. They could be binary digits, decimal digits, characters, words, images, paragraphs, etc., but we'll use characters: 'a', 'b', 'c', 'd', 'e', 'a'.
In the end you only have n = 5 distinct symbols; the letter 'a' is repeated once. In terms of probability, the letter 'a' occurs 2/6ths of the time, 'b' occurs 1/6th of the time, 'c' occurs 1/6th, 'd' occurs 1/6th, and 'e' occurs 1/6th of the time. Just so you know, the probabilities for all of these 5 distinct symbols should add up to 1:

\sum_{i=1}^{5} p_i = 1
In expanded form, this is:

2/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1
Now that you have a basic idea of what Σ (summation) means, let's calculate the entropy. In expanded form, it's:

H = -\left( 2/6 \ln(2/6) + 1/6 \ln(1/6) + 1/6 \ln(1/6) + 1/6 \ln(1/6) + 1/6 \ln(1/6) \right) \approx 1.56071
You can divide this number 1.56071 by \ln(2) \approx 0.693147 to get the entropy in binary units: 2.25. This tells you that these 5 distinct symbols (from a set of 6 symbols total) would each require roughly 2.25 bits of information to encode, on average. It's worth noting that the total number of data (6) and the distinct number of data (5) are whole numbers, but the average information content per datum (1.56, or 2.25 in binary units) is not. Yes indeed, data are not at all the same thing as information, and the popular concept of "a piece/indivisible unit of information" is not generally valid. I just find these confusions to be so very often perpetuated by math/comp sci/physics experts, and it annoys me a little bit.
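If you'd like to verify those numbers yourself, here's a self-contained sketch that builds the probabilities from the six-symbol group above and does the conversion to binary units (the variable names are my own):

```cpp
#include <cmath>
#include <iostream>
#include <map>
#include <string>

int main()
{
    // The six-symbol group from above: 'a' occurs twice, the rest once each.
    const std::string symbols = "abcdea";

    // Count how often each distinct symbol occurs.
    std::map<char, size_t> counts;

    for (size_t i = 0; i < symbols.size(); ++i)
        ++counts[symbols[i]];

    // H = -sum(p_i * ln(p_i)), where p_i = count_i / total.
    double H = 0;

    for (std::map<char, size_t>::const_iterator ci = counts.begin(); ci != counts.end(); ++ci)
    {
        const double p_i = static_cast<double>(ci->second) / symbols.size();
        H -= p_i * std::log(p_i);
    }

    std::cout << "natural entropy: " << H << std::endl;                 // ~1.56071
    std::cout << "binary entropy:  " << H / std::log(2.0) << std::endl; // ~2.25163

    return 0;
}
```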
Try the group of 8 symbols '0', '1', '2', '3', '4', '5', '6', '7'. Note that there are 8 symbols and all are distinct. You should get a natural entropy of \ln(8) \approx 2.07944. Divide that by \ln(2) to get the binary entropy of 3. This basically says that you need 3 bits of information to encode 8 distinct (and equally frequent) symbols. If you understand that an n-bit integer can represent 2^n symbols, then this should not be a shock to you (ie. a 16-bit integer can represent 2^16 = 65536 distinct values, which corresponds to a binary entropy of 16). Yes, the binary information content per symbol can be a whole number, but this is only a special case. Just remember that data are not the same thing as information.
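As an aside, the reason the answer comes out so cleanly here is that when all n symbols are equally frequent (p_i = 1/n for every i), the summation collapses:

H = -\sum_{i=1}^{n} \frac{1}{n} \ln\left(\frac{1}{n}\right) = \ln(n), \qquad \frac{H}{\ln(2)} = \log_2(n)

so for n = 8 you get \ln(8) \approx 2.07944 nats, or \log_2(8) = 3 bits, exactly as above.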
Now try the group of 8 symbols '0', '0', '0', '0', '0', '0', '0', '0'. The natural entropy is simply -1 \cdot \ln(1) = 0. No need for distinctness, no need for information. Yes, the binary information content per symbol can be a whole number, but again, this is only a special case. Just remember that data are not the same thing as information.
Essentially entropy is information, and it is a measurement taken of data. It's definitely not the same thing as data. I mean, the length of your leg is obviously not the very same thing as your leg (one's a measurement, one's bone and flesh).
Also, can you see how the group '0', '0', '0', '0', '0', '0', '0', '0' would be greatly compressible compared to the group '0', '1', '2', '3', '4', '5', '6', '7'? If not, you should look up run-length encoding (a type of compression) and then dictionary-based compression. Remember: very high entropy data is largely incompressible. In the context of that m-bit chunk test, we'd also cut the set of 8 symbols into subsets that are larger than one digit each (ie. not just subsets of single digits like '0', '1', '2', '3', '4', '5', '6', '7', but also subsets like '01', '23', '45', '67', etc.), so there would be multiple measures of entropy. I hope that isn't too confusing. If it is too confusing, then just ignore this paragraph altogether.
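To make the run-length encoding point concrete, here's a bare-bones sketch of the idea (a toy encoder I threw together for illustration, not a real codec):

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Bare-bones run-length encoding: each run of a repeated character
// becomes the run length followed by the character, e.g. "00000000" -> "80".
std::string rle(const std::string &s)
{
    std::ostringstream out;

    for (size_t i = 0; i < s.size();)
    {
        size_t run = 1;

        while (i + run < s.size() && s[i + run] == s[i])
            ++run;

        out << run << s[i];
        i += run;
    }

    return out.str();
}

int main()
{
    std::cout << rle("00000000") << std::endl; // "80": 8 symbols shrink to 2
    std::cout << rle("01234567") << std::endl; // "1011121314151617": 8 symbols balloon to 16

    return 0;
}
```

Notice that the zero-entropy group shrinks, while the maximum-entropy group actually gets bigger. That's the incompressibility of high-entropy data in action.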
Have you not done any calculus at all? This is where you will see ∫ (indefinite integral / antiderivative / opposite of local slope; definite integral / area / volume / hypervolume / anything else that can be added up using infinitesimally small chunks) pop up a lot. You owe it to yourself to get informed. Upgrading your knowledge from algebra to algebra+calculus is kind of like upgrading from arithmetic to arithmetic+algebra. It's a very powerful upgrade, and there are a bajillion ways to apply that knowledge.