HOLY COW! Why do people here always question why something is the way it is? Can't you just accept that THIS IS NOT POSSIBLE IN MY PARTICULAR CASE !???
I get your frustration, but at the same time, people are not questioning the specifics of your situation just to be difficult. It's possible that a better solution could be arrived at through an entirely different process, possibly one you didn't even consider to be plausible or know to exist. Maybe not, but why needlessly limit yourself and the quality and/or quantity of potential answers you could get?
That said, you indicated that your C++ -> HLSL port was running slowly. Is it otherwise running correctly? If so, this might be a question of optimization rather than of finding a new algorithm, and posting the HLSL you have might get you more concrete answers on how to speed it up.
Beyond that, I don't have a direct answer for you. I'm sorry! What you're asking for is difficult, because a shader can't easily store state between pixels or between frames without some kind of help from the CPU. The best I can do right now is give you a couple of links that are tangential to your problem but might help you come up with a workable idea.
Nathan Reed talks about PRNG on the GPU, and how a hashing function can be helpful there.
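To make the hashing idea concrete, here is a minimal sketch in C++ (it should port to HLSL almost verbatim, since HLSL's `uint` is also 32 bits). It uses the well-known Wang integer hash. I'm assuming that is the kind of hash the article has in mind, and the names `random_unit`, `width`, and `frame` are mine, purely for illustration. The point is that each pixel derives its random value from its own coordinates, so no state has to be shared between pixels.

```cpp
#include <cstdint>

// Wang hash: a cheap 32-bit integer hash. Every step is invertible,
// so distinct inputs always produce distinct outputs.
inline uint32_t wang_hash(uint32_t seed) {
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

// Hypothetical helper: a stateless per-pixel random float in [0, 1).
// `frame` is assumed to be a counter supplied by the CPU (for example
// through a constant buffer) so the pattern changes every frame.
inline float random_unit(uint32_t x, uint32_t y, uint32_t width, uint32_t frame) {
    uint32_t h = wang_hash(wang_hash(y * width + x) + frame);
    // Keep the top 24 bits so the value is exactly representable as a float.
    return static_cast<float>(h >> 8) * (1.0f / 16777216.0f);
}
```

The only per-frame input is a single integer, which is about the smallest amount of CPU help you can get away with.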
Alan Wolfe talks about creating a random shuffle operator.
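As a rough illustration of what a stateless shuffle can look like, here is a sketch in C++ built on a small Feistel network with cycle walking. This is one common way to do it, and I'm assuming it is close in spirit to what the article describes rather than a copy of it. The names `shuffled_index`, `count`, and `key` are hypothetical. Given an index and a key, it returns that index's position in a pseudo-random permutation of `[0, count)` without ever storing the permutation.

```cpp
#include <cassert>
#include <cstdint>

// Cheap 32-bit integer hash (Wang hash), used here as the round function.
inline uint32_t wang_hash(uint32_t seed) {
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

// Feistel network over 2 * halfBits bits. It is a bijection on
// [0, 2^(2 * halfBits)) no matter what the round function is.
inline uint32_t feistel(uint32_t value, uint32_t halfBits, uint32_t key) {
    const uint32_t mask = (1u << halfBits) - 1u;
    uint32_t left = (value >> halfBits) & mask;
    uint32_t right = value & mask;
    for (uint32_t round = 0; round < 4; ++round) {
        uint32_t f = wang_hash(right ^ wang_hash(key + round)) & mask;
        uint32_t newRight = left ^ f;
        left = right;
        right = newRight;
    }
    return (left << halfBits) | right;
}

// Maps index in [0, count) to a unique shuffled index in [0, count).
inline uint32_t shuffled_index(uint32_t index, uint32_t count, uint32_t key) {
    assert(count > 0 && index < count);
    // Pick the smallest even bit width whose range covers count.
    uint32_t halfBits = 1;
    while (halfBits < 16 && (1ull << (2 * halfBits)) < count) ++halfBits;
    // Cycle walking: re-apply the permutation until we land inside [0, count).
    uint32_t result = feistel(index, halfBits, key);
    while (result >= count) result = feistel(result, halfBits, key);
    return result;
}
```

Calling `shuffled_index(i, count, key)` for every `i` in `[0, count)` visits each output exactly once, and changing `key` gives a different ordering. The cycle-walking loop has no fixed iteration count, which may or may not be acceptable in a shader.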
Like I said, neither of those links is going to give you exactly what you want. Each provides part of a solution, but each also has limitations that might make it unworkable given the constraints you mention. Hopefully they can spark an idea for you, and at the very least I think they're both interesting reads.
Good luck!