• entries
32
5
• views
8643

# Efficiently recording large finite sets of numbers

1286 views

This is another idea I had. Basically imagine if you need to keep a list of billions of IP address (32 bits each) visitors to your site or worse IPv6 (256 bits each?) And if you had each IPv4 address you would get a total of 4+ GB. What if I told you you can reduce that to 400 MB?

The amount of entries encodable by using combinations (of size 2b-digit numbers) rather than permutation (using b-digit numbers) is:

sum ( (2^(2b))! / (k!)((2^(2b)-k)!) ) k from 0 to 2^(2b) = 2^(2^(2b))

vs

2^(2^b)

And the amount of space used by them is (combo vs permu, worst case scenario all the IP addresses):

I calculated this to be around 400 MB using I think (this might be completely wrong as I don't remember exactly all of this and it does seem dumb to use combinations rather than permutations):

k * 2b

Where k is given by:

sum ( (2^(2b))! / (k!)((2^(2b)-k)!) ) k from 0 to 2^(2b) = 2^32.     And the k should thus be limited (ie number of numbers used in the combination from the absolute possible maximum of 2^32 or even 2^(32*2)).

Vs

32 * 2^32 or if using bit field then only 2^32

I graphed them and noticed that at early levels (ie less than 32 bits) there was more encodable data by eg using 2b-digit wide number combinations (and finding the correct way to encode and quickly retrieve the visited IP addresses). The method with permutations if allocating data up front would be to keep 2^32 bits in total with one bit for each possible IP address.

This is better than an upfront big permutation bits integers cost.

An x fixed amount of 2b-wide or 3b-wide numbers instead of b-wide numbers might be much cheaper and have same amount values or more perhaps cheaper (less memory) than up-front cost of permutations of a single (2^b)-wide number.

How this would exactly be stored is the main issue.

keeping a linked list is basically a hidden cost
but if you store it as one integer or an array of variable size that helps, or an up-front fixed-size integer with repeats meaning unused and all -1 meaning nothing added, which rather defeats the point because the big-permu-int is more efficient as it can store more for the same num of bits as "unused" repeats etc have meaning as unique
values, but there might be a way to make the up-front permu int variable size to save space that has an equal distribution of possibilities for any same number of visitors and only grows in bit size requirements as there are more visitors.
The max size for a big permu int is 2^32 bits or 0.5 GB for all 32-bit IP's visitors worst case. About a few million visitors is 1/1000th of 4+ billion possible IP addresses (bits in the big permu int) or 1 MB.

The best idea though is to keep this variable sized array for all the numbers for the combo, use a separate integer to count how much bits or digits there are, and find some way to use permutation to encode the combination numbers to get rid of ordering redundancy.

The basic idea is, just looking at the number of permutations vs combinations, of b-bit numbers, there are more combinations than permutations.

As an example using a 2-k 2-bit IP system with combinations we store eg:
0 = 0
1 = 1
2 = 2
3 = 3
0,1 = 4
0,2 = 5
0,3 = 6
1,2 = 7
1,3 = 8
2,3 = 9
etc

Ie we stored more than 2^2 = 8 entries because we also used the fact that not having a number is also a value or combination.

Also has anybody already noticed that the Fermat numbers without +1 are equal to the sum of number of combinations of 0 to 2^b numbers out of 2^b? Ie it is equal to 2^(2^b), which is equal to the more full combinatorics equation for that, if you graph those together.
And compare it to "(2^b)^b" or the number of permutations of b b-bit numbers (ie a single b*b-bit-wide int).

With IPv6 it's going to be even harder to keep track of lots of IP addresses, or even checking through them for the presence of one.

There are no comments to display.

## Create an account

Register a new account