To answer the original question: just sort the indices, and use a radix sort. If your indices are 32-bit, you can use a three-pass radix sort with 11-bit buckets that will nicely fit into the CPU's L1 cache.
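As a rough illustration of the three-pass idea, here is a minimal sketch assuming unsigned 32-bit keys and 32-bit indices. The function name `radix_sort_indices` is made up for this example. Each pass uses a histogram of 2048 counters (8 KiB with 4-byte counters), and the result is the list of indices ordered by ascending key.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: returns indices into `keys`, ordered by ascending key.
// Stable LSD radix sort: 3 passes over 11-bit digits (bits 0-10, 11-21, 22-31).
std::vector<std::uint32_t> radix_sort_indices(const std::vector<std::uint32_t>& keys) {
    constexpr unsigned kBits = 11;
    constexpr std::size_t kBuckets = std::size_t{1} << kBits;  // 2048 buckets
    constexpr std::uint32_t kMask = kBuckets - 1;

    const std::size_t n = keys.size();
    assert(n <= std::numeric_limits<std::uint32_t>::max());  // indices are 32-bit

    std::vector<std::uint32_t> src(n), dst(n);
    for (std::size_t i = 0; i < n; ++i) src[i] = static_cast<std::uint32_t>(i);

    for (unsigned pass = 0; pass < 3; ++pass) {
        const unsigned shift = pass * kBits;

        // Count how many keys fall into each bucket for this digit.
        std::array<std::uint32_t, kBuckets> offset{};
        for (std::size_t i = 0; i < n; ++i) ++offset[(keys[i] >> shift) & kMask];

        // Exclusive prefix sum turns counts into start offsets.
        std::uint32_t sum = 0;
        for (std::uint32_t& o : offset) {
            const std::uint32_t count = o;
            o = sum;
            sum += count;
        }

        // Scatter indices in their current order, which keeps the pass stable.
        for (std::size_t i = 0; i < n; ++i) {
            const std::uint32_t idx = src[i];
            dst[offset[(keys[idx] >> shift) & kMask]++] = idx;
        }
        src.swap(dst);
    }
    return src;
}
```

Only the indices are moved here; the keys are read through the index on every pass. Moving key/index pairs together instead trades more memory traffic for sequential key reads, and which one wins would have to be measured.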
This does indeed look like a nice application for radix sort. There are so few places where it really fits.
I can't say whether it's worth the trouble. I still always use std::sort, which surprisingly turns out to be good enough for anything practical, and I've never had enough of a justification to actually use radix sort. But from playing with it a decade or two ago, I remember that radix sort can easily be 3-4 times faster than e.g. quicksort/introsort.
So if the sort is indeed your bottleneck (that is, you don't meet your frame time, and profiling points to the sort), that would probably be a viable option.
Slight nitpick, though: one doesn't, of course, sort the indices, which would be somewhat pointless. You most certainly didn't mean to say that, but it kind of sounded like it.
One sorts the keys (moving indices). Or well, one sorts indices by key, or whatever one would call it.
Which means that, sadly, 3 passes most likely won't be enough (or you need bigger buckets), since you will almost certainly have at least 48- or 64-bit keys to deal with (unless you have very few passes and render states, and are very unkind to your depth info).
Not using a stable radix sort to save temp memory may be viable (you can instead concatenate the indices as described above if needed, even if this means an extra pass; the storage size difference alone likely outweighs the extra pass because of fitting in L1).
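To make the "sort the keys, moving indices" point and the pass count concrete, here is a minimal sketch assuming 64-bit keys and the same 11-bit digits. The `KeyIndex` struct and `radix_sort_pairs` are names invented for this example. With 11 bits per pass, a 64-bit key needs ceil(64 / 11) = 6 passes. This is the plain stable variant with a temporary buffer, not the memory-saving unstable one.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: a 64-bit sort key plus the index of the item it refers to.
struct KeyIndex {
    std::uint64_t key;
    std::uint32_t index;
};

// Stable LSD radix sort on the key; the index just travels along with it.
void radix_sort_pairs(std::vector<KeyIndex>& items) {
    constexpr unsigned kBits = 11;
    constexpr unsigned kPasses = (64 + kBits - 1) / kBits;  // 6 passes for 64 bits
    constexpr std::size_t kBuckets = std::size_t{1} << kBits;
    constexpr std::uint64_t kMask = kBuckets - 1;

    std::vector<KeyIndex> tmp(items.size());  // the temp memory a stable sort costs

    for (unsigned pass = 0; pass < kPasses; ++pass) {
        const unsigned shift = pass * kBits;  // 0, 11, ..., 55

        // Histogram of the current digit.
        std::array<std::size_t, kBuckets> offset{};
        for (const KeyIndex& it : items) ++offset[(it.key >> shift) & kMask];

        // Exclusive prefix sum: counts become start offsets.
        std::size_t sum = 0;
        for (std::size_t& o : offset) {
            const std::size_t count = o;
            o = sum;
            sum += count;
        }

        // Scatter whole pairs in their current order to keep the pass stable.
        for (const KeyIndex& it : items) tmp[offset[(it.key >> shift) & kMask]++] = it;
        items.swap(tmp);
    }
}
```

If the keys are known to use only 48 bits, `kPasses` could be dropped to 5 under the same scheme; that is an assumption about the key layout, not something the sort can detect.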
Thanks for pointing out my mistake. Yes, I meant to say that one should sort the keys. What you get back are the indices, which are then used to access your data.
Fully generic rendering-related data might not fit into a 32-bit key, that's true. But there are certain occasions where 32 bits are more than enough (or even 16 might suffice), e.g. for sorting particles.
As a quick note on performance, sorting 50k particles back-to-front using an ordinary std::sort on 16-bit keys takes 3.2 ms on my machine, whereas an optimized radix sort needs 0.25 ms. If all you need to sort are keys that somehow index other data, it's almost always better to just sort those and index the data on access, because this causes far fewer memory accesses during the sort.
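For the particle case, a minimal sketch might look like the following. It assumes each particle already has a 16-bit quantized depth key where a larger key means farther from the camera; that convention and the name `back_to_front_order` are assumptions made for this example. This is not the optimized implementation measured above, just two 8-bit passes that leave the particle data untouched.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: depthKeys[i] is the 16-bit depth key of particle i
// (assumed: larger key = farther away). Returns particle indices ordered
// back-to-front, i.e. farthest particle first.
std::vector<std::uint32_t> back_to_front_order(const std::vector<std::uint16_t>& depthKeys) {
    const std::size_t n = depthKeys.size();
    assert(n <= std::numeric_limits<std::uint32_t>::max());  // indices are 32-bit

    std::vector<std::uint32_t> src(n), dst(n);
    for (std::size_t i = 0; i < n; ++i) src[i] = static_cast<std::uint32_t>(i);

    for (unsigned pass = 0; pass < 2; ++pass) {
        const unsigned shift = pass * 8;

        // Inverting the key turns the ascending radix sort into a descending one.
        auto digit = [&](std::uint32_t idx) -> std::uint32_t {
            const std::uint32_t inverted = 0xFFFFu - depthKeys[idx];
            return (inverted >> shift) & 0xFFu;
        };

        // Histogram of the current 8-bit digit (256 counters).
        std::array<std::uint32_t, 256> offset{};
        for (std::size_t i = 0; i < n; ++i) ++offset[digit(src[i])];

        // Exclusive prefix sum: counts become start offsets.
        std::uint32_t sum = 0;
        for (std::uint32_t& o : offset) {
            const std::uint32_t count = o;
            o = sum;
            sum += count;
        }

        // Scatter only the indices; the particles themselves never move.
        for (std::size_t i = 0; i < n; ++i) {
            const std::uint32_t idx = src[i];
            dst[offset[digit(idx)]++] = idx;
        }
        src.swap(dst);
    }
    return src;
}
```

Rendering would then walk the returned order and read `particles[i]` for each index `i`, so the (typically much larger) particle structs are only touched once, on access.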