When fillrate is not a concern, I get around 15 million particles per second, which works out to around 50,000 particles at 300 fps. A vertex shader aligns the vertices in eye space (i.e. billboarding). Don't forget that each particle is a quad, so it actually transforms 4 times as many vertices: that's 60 million vertices a second.
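To make the eye-space billboarding concrete, here is a rough CPU-side sketch in Python of what such a vertex shader might compute per vertex. It is not the actual shader: the per-vertex `corner` attribute, the `half_size` parameter and the row-major `view` matrix layout are all assumptions made for the illustration.

```python
# Sketch only: a CPU reference for eye-space billboarding.
# Assumption: each of the 4 vertices of a quad stores the particle center
# plus a corner offset; this is a hypothetical layout, not the post's.

CORNERS = [(-1, -1), (1, -1), (1, 1), (-1, 1)]

def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def billboard_vertex(view, center, corner, half_size):
    """Return the eye-space position of one corner of a particle quad."""
    # Transform the particle center (not the corner) into eye space.
    ex, ey, ez, ew = mat_vec(view, (center[0], center[1], center[2], 1.0))
    # Offset in the eye-space XY plane, so the quad always faces the camera.
    return (ex + corner[0] * half_size, ey + corner[1] * half_size, ez, ew)

def billboard_quad(view, center, half_size):
    """Return the 4 eye-space corners of one particle."""
    return [billboard_vertex(view, center, c, half_size) for c in CORNERS]
```

Because the offset is applied after the view transform, all four corners share the same eye-space depth whatever the camera orientation, and the CPU never has to touch the vertex buffer.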
There were two tricks to make it go that fast:
1. Eliminate the CPU sort - the vertex and index buffers are 100% static. Specifically, in theory there are exactly 45 index buffer orderings needed to sort the particles perfectly from back to front, for a given camera position. So I'm basically using one VB and 45 IBs, and all the work needed at run time is to bind the VB and select the one IB out of the 45 that corresponds to the viewing angle.
2. Eliminate the CPU billboarding calculations - with the work moved into the vertex shader, this was not really hard.
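The first trick can be sketched roughly as follows, in Python. This is only an illustration of the general idea of presorted index buffers. The six axis-aligned `REFERENCE_DIRS` are a stand-in chosen for the example and are not the 45 orderings mentioned above, and the function names are hypothetical.

```python
# Sketch only: presorted index orders selected by view direction.
# Assumption: REFERENCE_DIRS is a hypothetical stand-in for however the
# real set of orderings is derived.

REFERENCE_DIRS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def build_orders(positions, dirs=REFERENCE_DIRS):
    """Load time: one back-to-front particle order per reference direction."""
    # A larger dot(position, view_dir) means farther along the view
    # direction, so sort descending to draw the farthest particles first.
    return [
        sorted(range(len(positions)), key=lambda i: -dot(positions[i], d))
        for d in dirs
    ]

def select_order(view_dir, dirs=REFERENCE_DIRS):
    """Every frame: index of the reference direction closest to view_dir."""
    return max(range(len(dirs)), key=lambda k: dot(dirs[k], view_dir))

def quad_indices(order):
    """Expand a particle order into triangle-list indices (2 triangles per quad)."""
    out = []
    for p in order:
        b = 4 * p  # first of the 4 vertices belonging to particle p
        out += [b, b + 1, b + 2, b, b + 2, b + 3]
    return out
```

With something like this, `build_orders` and `quad_indices` run once to fill the static IBs, and the only per-frame cost is the `select_order` lookup plus binding the chosen buffer.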
The next (and hopefully last) step is to implement a priority queue, to get rid of the slowdowns when moving fast.