Random data set generation

I'm looking for an efficient method of generating a large set of unique 64-bit integers (on the order of 10,000 values). I use a PRNG (a Mersenne Twister implementation, to achieve a long period) to create input to an AES core with a given key. The AES output is used to populate the 64-bit data set. The actual generation of the data is fast enough; the problem (slowness) comes with guaranteeing uniqueness among all 10,000 64-bit numbers. Any ideas?

EDIT: Oops, meant to put this in the General Programming area.
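For reference, a minimal sketch of the pipeline described above, assuming a modern C++ standard library: std::mt19937_64 stands in for the Mersenne Twister, and encryptBlock is a hypothetical placeholder for the AES core, since that part is application-specific. Note that nothing here enforces uniqueness yet.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical placeholder for the AES step (application-specific);
// it just returns its input here so the sketch is self-contained.
uint64_t encryptBlock(uint64_t x) { return x; }

std::vector<uint64_t> generateDataSet(std::size_t count)
{
    std::mt19937_64 prng(std::random_device{}());  // long-period PRNG stand-in
    std::vector<uint64_t> values;
    values.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        values.push_back(encryptBlock(prng()));
    return values;  // note: uniqueness is NOT guaranteed here
}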
Generate a bunch of input numbers (more than you need).
Sort the final list.
Remove duplicates (they'll all be right next to one another).
Then, if you need "randomness", perform a shuffle on the final data set.
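A minimal sketch of those four steps, assuming a vector rather than a list (a shuffle needs random-access iterators) and using the PRNG directly as a stand-in for the MT+AES generator:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

std::vector<uint64_t> uniqueRandomSet(std::size_t n)
{
    std::mt19937_64 prng(std::random_device{}());

    // 1. Generate ~10% more values than needed.
    std::vector<uint64_t> values(n + n / 10);
    for (uint64_t& v : values)
        v = prng();  // stand-in for the MT+AES generator

    // 2. Sort, so duplicates become adjacent.
    std::sort(values.begin(), values.end());

    // 3. Remove adjacent duplicates.
    values.erase(std::unique(values.begin(), values.end()), values.end());

    // 4. Shuffle, then keep the first n. Shuffling before trimming keeps
    //    the kept subset unbiased (a sorted prefix would be the n smallest).
    std::shuffle(values.begin(), values.end(), prng);
    if (values.size() > n)
        values.resize(n);
    // else: duplicates ate into the margin -- generate more and repeat.

    return values;
}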
Perhaps you could use an std::set and keep generating/inserting values until its size is what you need?

std::set<unsigned long long> values;
const unsigned int N = 10000;
for(unsigned int i = 0; i < 2*N; ++i) {
   values.insert(getRandomValue());
   if(values.size() == N) {
      break;
   }
}

I've used a for-loop instead of a while-loop to prevent infinite loops, so you can decide what you want to do if the loop ends and there aren't enough values.
@Zipster: That's the path I started down with
std::set<unsigned long long> values;
while (values.size() < numValues)
{
    values.insert(getRandom64Num());
}

.. until I realized std::set is sorted and std::random_shuffle can't work on a std::set. I don't think copying the contents out of the set into another container in a random fashion is an efficient/good idea.

I'm going to try what KulSeran said: generate a std::list<> that's 10% larger than what I need, then call std::list::sort(), std::list::unique(), and std::random_shuffle(), and handle the special case where the list after unique() ends up smaller than what I need.
Quote:Original post by Zipster
*** Source Snippet Removed ***
You can save a little work by inserting the first 10,000 without checking the size each iteration, then running the loop only to top the set up to the required size.

Because you are filling up such a small amount of the possible space (2^13 numbers out of 2^64 space, or roughly 0.00000000000005%) there is a very low probability of duplicates if your PRNG is configured properly.

Quote:Original post by frob
Because you are filling up such a small amount of the possible space (2^13 numbers out of 2^64 space, or roughly 0.00000000000005%) there is a very low probability of duplicates if your PRNG is configured properly.


Very small, yes, but even a single occurrence is not acceptable.

My std::list sort()/unique() approach actually ended up performing rather poorly, even before any randomization. The best method I have so far: insert randomly generated numbers into a std::set<> until set.size() equals the data-set size I need, std::copy() from the std::set into a (pre-allocated) std::vector, then run std::random_shuffle on the vector.
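A minimal sketch of that pipeline, assuming a modern standard library (std::shuffle in place of the older std::random_shuffle) and using the PRNG directly as a stand-in for the MT+AES generator:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>
#include <vector>

std::vector<uint64_t> uniqueShuffledSet(std::size_t numValues)
{
    std::mt19937_64 prng(std::random_device{}());

    // std::set rejects duplicate inserts, so the loop runs until
    // exactly numValues distinct values have been collected.
    std::set<uint64_t> unique;
    while (unique.size() < numValues)
        unique.insert(prng());

    // Copy into a pre-allocated vector, then shuffle away the set's ordering.
    std::vector<uint64_t> result(numValues);
    std::copy(unique.begin(), unique.end(), result.begin());
    std::shuffle(result.begin(), result.end(), prng);
    return result;
}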

EDIT: edits made
Hi,

You can use a maximal length linear feedback shift register (LFSR). It can generate 2^64-1 unique values before repeating itself, and it is very fast. You might also want to look up "gold sequence".
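For illustration, a minimal Galois-form sketch. The tap mask below corresponds to the frequently cited maximal-length polynomial x^64 + x^63 + x^61 + x^60 + 1, but it's worth checking against a published taps table before relying on it:

#include <cstdint>

// 64-bit Galois LFSR. With a maximal-length tap mask and any nonzero seed,
// the state walks through all 2^64 - 1 nonzero values before repeating,
// so every output within one period is unique by construction.
struct Lfsr64
{
    uint64_t state;

    explicit Lfsr64(uint64_t seed) : state(seed ? seed : 1) {}  // state must never be zero

    uint64_t next()
    {
        uint64_t lsb = state & 1;
        state >>= 1;
        if (lsb)
            state ^= 0xD800000000000000ULL;  // taps 64, 63, 61, 60
        return state;
    }
};

One caveat: consecutive LFSR states are strongly correlated, so if statistical quality matters, the output would still want to be run through the AES core from the original post.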
Quote:Original post by Mantear
The best method I have so far: insert randomly generated numbers into a std::set<> until set.size() equals the data-set size I need, std::copy() from the std::set into a (pre-allocated) std::vector, then run std::random_shuffle on the vector.


That seems like a reasonable plan. You probably don't need to do this very often, do you?
If memory isn't an issue, you can generate the numbers from n = i to j in order into an array and random_shuffle that.
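A sketch of that idea, assuming the values only need to come from a small interval [i, j) rather than the full 64-bit space:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Fill an array with i, i+1, ..., j-1, shuffle it, keep the first n.
// Uniqueness is guaranteed by construction; only viable when j - i
// is small enough for the whole range to fit in memory (requires n <= j - i).
std::vector<uint64_t> sampleRange(uint64_t i, uint64_t j, std::size_t n)
{
    std::vector<uint64_t> values(j - i);
    std::iota(values.begin(), values.end(), i);

    std::mt19937_64 prng(std::random_device{}());
    std::shuffle(values.begin(), values.end(), prng);

    values.resize(n);  // the first n of a shuffled range form an unbiased sample
    return values;
}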
You could just generate 10,000 random 64-bit numbers and not check for collisions. The probability of a collision is less than .000000000003, which means you'll never see one. (That figure is the birthday bound: n(n-1)/2 pairs over 2^64 possibilities, i.e. 49,995,000 / 1.8x10^19, which is about 2.7x10^-12.)

