Don't generate random numbers on the CPU. It's a complete waste of time, even if you update them in feedback mode. The idea is as follows: assign each run of your shader/kernel/whatever a unique integer (say, 32-bit or 64-bit) that you increment each frame, and assign each pixel a unique integer as well (perhaps x * 65536 + y, where x and y are the pixel coordinates, or something similar), computed directly inside the pixel shader in a register. Together these give you a unique value C per pixel per run, and you can run a pseudorandom function (a good choice is a slimmed-down block cipher) on the ordered pairs (C, 0), (C, 1), ..., (C, n) to generate as many random numbers as you want. Advantages? Essentially zero memory bandwidth, since the information you need to compute the starting integer is already available in your shader. The generator is therefore fully compute bound, which is a good thing, since a GPU path tracer is almost certainly memory bound. Disadvantages? None, of course. It's probably also easier to implement than what you're doing now, once you wrap your mind around it.
In pseudocode:
// CPU-side
uint counter = 0;
while (true)
{
    // do work..
    render(counter);   // upload counter as cpu_counter, then draw
    ++counter;
}

// GPU-side
uint cpu_counter; // e.g. in cbuffer

struct prng_state
{
    uint a;
    uint b;
    uint c;
};

uint PRF(prng_state state)
{
    // implement your favorite pseudorandom function here
    // (it could be as simple as a bunch of xors and stuff,
    // but you can use e.g. a simplified block cipher);
    // it must return a uint that depends on all of a, b and c
}

uint rand(inout prng_state state)
{
    uint retval = PRF(state);
    ++state.c; // counter
    return retval;
}

void ps()
{
    // derive unique counter for this pixel, e.g.:
    uint uid = 65536 * uint(screen_pos.x) + uint(screen_pos.y);

    // init pseudorandom number generator for this pixel
    // ==> each pixel gets its own unique state (IMPORTANT)
    //   a = cpu_counter : unique per frame
    //   b = uid         : unique per pixel
    //   c = 0           : counter, gives multiple random numbers per pixel
    prng_state state = { cpu_counter, uid, 0 };

    uint one_random_number = rand(state);
    float one_random_float = rand(state) / float(UINT_MAX);

    /* etc.. can write wrappers for floats/integers/etc. */
}
(The implementation above buys you, for instance, 2^32 pseudorandom numbers per pixel per frame, at resolutions up to 65536x65536 and for 2^32 draw calls. If you need more, you can simply add more elements to the prng_state struct or switch to 64-bit integers; if you want to micro-optimize, you can start packing bits.)
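To make the PRF placeholder concrete, here is a minimal host-side C++ sketch of the same scheme that you can test on the CPU before porting it to a shader. It is only one possible choice, not the definitive one: I am assuming a reduced-round XTEA as the "slimmed down block cipher", with the per-frame counter as part of the key and (uid, counter) as the block. The round count (8), the key-padding constants, and the names rand_u32 and rand_float are illustrative assumptions, and I have not measured the statistical quality at this round count.

```cpp
#include <cstdint>

// Mirrors the prng_state struct from the pseudocode above.
struct prng_state {
    std::uint32_t a; // per-frame counter (cpu_counter)
    std::uint32_t b; // unique per-pixel id (uid)
    std::uint32_t c; // per-pixel stream counter
};

// Reduced-round XTEA used as the pseudorandom function.
// Block = (b, c); key = (a, plus three arbitrary constants).
// ASSUMPTION: 8 rounds and the padding constants are illustrative choices.
inline std::uint32_t PRF(const prng_state& s) {
    const std::uint32_t key[4] = {s.a, 0xA341316Cu, 0xC8013EA4u, 0xAD90777Du};
    const std::uint32_t delta = 0x9E3779B9u; // standard TEA/XTEA constant
    std::uint32_t v0 = s.b, v1 = s.c, sum = 0;
    for (int round = 0; round < 8; ++round) {
        v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
        sum += delta;
        v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
    }
    return v0;
}

// Returns the next 32-bit value and advances the stream counter.
inline std::uint32_t rand_u32(prng_state& s) {
    const std::uint32_t value = PRF(s);
    ++s.c;
    return value;
}

// Float in [0, 1): keep the top 24 bits so the result is exactly
// representable and can never round up to 1.0.
inline float rand_float(prng_state& s) {
    return static_cast<float>(rand_u32(s) >> 8) * (1.0f / 16777216.0f);
}
```

Usage is the same as in ps() above: build a prng_state from {cpu_counter, uid, 0} once per pixel, then call rand_u32 or rand_float as often as needed. Note the float wrapper uses a shift and multiply rather than dividing by UINT_MAX, because that division can produce exactly 1.0 after rounding.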
In short, you can obtain an essentially unlimited, high-quality stream of different pseudorandom numbers for every pixel using a dozen bytes of global memory and two to four GPU registers, depending on the size of the internal state of your chosen pseudorandom function. Also, if you write it properly, it generalizes not only to pixels but to any independent work units: you just choose the right mapping function to assign unique integers to work units. In fact, once you understand that the whole framework boils down to supplying a unique ID to every pixel plus a counter variable to get multiple random numbers per pixel, it becomes easy to modify it to suit your needs.
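As a rough illustration of what such a mapping function might look like, here are two sketches in C++: the packed pixel ID used above, and a linearized ID for a 3D grid of work units (e.g. compute threads). The function names and the grid parameters are hypothetical; the only real requirement is that no two work units in the same run get the same integer.

```cpp
#include <cassert>
#include <cstdint>

// Pixel ID as in the pseudocode: x in the high 16 bits, y in the low 16.
// Unique as long as both coordinates are below 65536.
inline std::uint32_t pixel_uid(std::uint32_t x, std::uint32_t y) {
    assert(x < 65536u && y < 65536u);
    return (x << 16) | y;
}

// Linearized ID for a work unit at (x, y, z) in a grid of
// size_x * size_y * size_z units. Unique as long as the total
// number of units fits in 32 bits.
inline std::uint32_t grid_uid(std::uint32_t x, std::uint32_t y, std::uint32_t z,
                              std::uint32_t size_x, std::uint32_t size_y) {
    return (z * size_y + y) * size_x + x;
}
```

Either value can go straight into the b slot of prng_state; the rest of the generator does not change.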
Now you might ask why you can't just use the free variable state.c as the actual PRNG state, to implement e.g. a linear congruential generator or a multiply-with-carry. You could do that if you wanted to. The advantage of the counter method is that, assuming a good-quality pseudorandom function, you can quantify exactly how many pseudorandom numbers you can obtain per pixel shader invocation (one for every value the counter can take). In other words, it can easily be made reliable at no particular additional cost, so it is my preferred option.
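The difference between the two approaches can be sketched in a few lines of C++. The LCG below uses the well-known Numerical Recipes constants; the counter-based side uses a reduced-round XTEA as a stand-in PRF, which is my assumption for illustration (8 rounds and the key constants are arbitrary picks), not a recommendation. The names lcg_next, prf and counter_stream are hypothetical.

```cpp
#include <cstdint>

// Stateful alternative: each output is the next state, so the stream
// length depends on the generator's period for the chosen seed.
inline std::uint32_t lcg_next(std::uint32_t& state) {
    state = 1664525u * state + 1013904223u;
    return state;
}

// Counter-based: the n-th value is a pure function of (frame, uid, n).
// ASSUMPTION: reduced-round XTEA as a stand-in pseudorandom function.
inline std::uint32_t prf(std::uint32_t frame, std::uint32_t uid, std::uint32_t n) {
    const std::uint32_t key[4] = {frame, 0xA341316Cu, 0xC8013EA4u, 0xAD90777Du};
    std::uint32_t v0 = uid, v1 = n, sum = 0;
    for (int round = 0; round < 8; ++round) {
        v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
        sum += 0x9E3779B9u;
        v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
    }
    return v0;
}

// The counter takes each 32-bit value exactly once, so every
// (frame, uid) pair gets exactly 2^32 distinct PRF inputs.
struct counter_stream {
    std::uint32_t frame, uid, n;
    std::uint32_t next() { return prf(frame, uid, n++); }
};
```

With the counter method, the sixth draw of a stream is simply prf(frame, uid, 5), with no dependence on earlier draws, which is what makes the number of available values easy to reason about.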
In any case, the lesson to draw here is to not waste time and memory doing this on the CPU. Just pass unique state to your pixel shader and implement a PRNG there!
PS: I too am in the (slow due to real life) process of implementing a brand new GPU ray tracer. I may write a few articles or journal entries on it at a later indeterminate date, and if I do I will be certain to spend some time explaining in detail how I dealt with generating pseudorandom numbers.