And do you need to use doubles? Computing your v expression with doubles is slower than with floats (the double division in particular will be really slow), and converting the result back to float also takes time.
You don't really need to worry about doubles versus floats here. As long as your memory bandwidth doesn't run out, doubles are almost always as fast as floats on a 64-bit processor, since both are usually computed on the same ALU (historically the x87 unit with its 80-bit registers). Granted, with 1 million 3D points you have 3 million doubles, and that's about 24 megabytes, and there goes your bandwidth. If all the data could fit in L3 cache (or L4, if you have it), then you probably wouldn't see much of a difference in performance without using some fancy stuff like SSE (which can in fact process twice as many floats as doubles per cycle). But in this case, even with SSE you wouldn't get much of an improvement, as your bandwidth is already holding you back.
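The back-of-the-envelope arithmetic behind that 24 MB figure is easy to check; here is a minimal sketch (the point counts are from the thread, the cache comparison is illustrative since exact cache sizes vary by CPU):

```python
# Memory footprint of 1 million 3D points, doubles vs floats.
num_points = 1_000_000
components = 3  # x, y, z per point

bytes_double = num_points * components * 8  # 64-bit doubles
bytes_float = num_points * components * 4   # 32-bit floats

print(f"doubles: {bytes_double / 2**20:.1f} MiB")  # → doubles: 22.9 MiB (~24 MB)
print(f"floats:  {bytes_float / 2**20:.1f} MiB")   # → floats:  11.4 MiB
```

Either way the working set is far larger than a typical L3 cache, which is why the format of the data matters less than how often it crosses the memory bus.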
Maybe you could sort on GPU and keep the sorted array on GPU and use it as an indirect parameter to your rendering, perhaps as an index buffer. That way, you wouldn't have to stall at all.
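GPU sorts are typically built from data-parallel networks such as bitonic sort, where every pass is a grid of independent compare-and-swaps. As a hedged sketch (this is a plain CPU version; porting it means turning each `(k, j)` pass into one compute-shader dispatch with one thread per index `i`, which is an assumption about your setup, not code from the thread):

```python
# CPU sketch of a bitonic sort network, the kind of sort that maps
# directly onto a compute shader: each (k, j) pass would be one
# dispatch, with every thread doing one compare-and-swap.
# Requires a power-of-two element count (pad with +inf keys otherwise).
def bitonic_sort(keys):
    n = len(keys)
    assert n & (n - 1) == 0, "bitonic sort needs a power-of-two size"
    a = list(keys)
    k = 2
    while k <= n:                        # stage size
        j = k // 2
        while j > 0:                     # pass within the stage
            for i in range(n):           # on the GPU: one thread per i
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Sorting depth keys this way and writing the resulting order into an index buffer would let the draw call consume it directly, without the data ever leaving VRAM.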
This is probably the best bet for getting a million particles sorted in a reasonable amount of time. Just as a comparison, a game I'm making with a couple of friends has a particle system that keeps the particle data in two textures: one for reading and one for writing, swapped after rendering. Granted, it's only 2D and there's no need for sorting, but since all the data is kept in VRAM the whole time it's blazing fast, and we can have 5 million particles without any noticeable decrease in performance (less than 1 ms difference in frame time between 100k and 5 million particles on an NVIDIA GTX 660). And since the OP said in the first post that the shader did the job in less than 1 ms too, the best solution would be to figure out a way to avoid transferring all the data between the CPU and GPU.
We actually did ours with three different shaders: one that "rendered" new particles into the particle data texture, one that updated the data between frames, and one that rendered the particles to a framebuffer to be displayed on screen. So the only data that needed to be sent anywhere was a list of the particle emitters active during a frame, passed to the first shader; the rest was just a few draw calls. I believe this approach is called ping-ponging. I'm not too familiar with how it works, since my two friends do most of the rendering code in our project, but I hope this gives you the motivation to try something similar yourself.
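The ping-ponging idea is simpler than it sounds. Here's a minimal CPU analogue, assuming nothing about the actual project's shaders: the update step always reads from one buffer and writes to the other, then the two swap roles for the next frame (on the GPU the "buffers" would be the two data textures; the names and toy update rule below are made up for illustration):

```python
# Minimal CPU analogue of texture ping-ponging: read from one buffer,
# write to the other, swap after each frame. Never read and write the
# same buffer in one step, which is the whole point of the technique.
def step(read_buf, write_buf):
    # toy "update shader": advance each particle by its velocity
    for i, (pos, vel) in enumerate(read_buf):
        write_buf[i] = (pos + vel, vel)

particles = [(0.0, 1.0), (5.0, -0.5)]  # (position, velocity) pairs
front = list(particles)
back = [None] * len(particles)

for _ in range(3):                     # three simulated frames
    step(front, back)
    front, back = back, front          # swap read/write roles

print(front)  # → [(3.0, 1.0), (3.5, -0.5)]
```

The swap is what avoids the read-write hazard: a shader cannot safely sample the same texture it is rendering into, so two buffers trading places each frame is the standard workaround.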
However, if your particles interact with the rest of the simulation, then it gets complicated, and I have no idea how to make that happen.