<off-topic>
(lots of aggressive stuff)
Look, if you weren't yelling "you idiots" every time someone says a word, and would instead read what people have to say instead of switching into aggressive mode, you might cause a lot less noise.
You claim that std::sort is excessively slow when it demonstrably isn't, based on an allegation (inability to inline comparisons) that isn't true at all. Several people tell you that you're wrong, and you turn around every time saying that it doesn't inline function pointers.
Someone (Ryan_001) mentioned you could even use Bubblesort for small data sets such as 64k items or less. While I don't fully agree with that (it's still a piss poor algorithm for a complete sort, so I whould still not use it when something better is readily available), I mentioned that Bubblesort has its merits as a 1-pass-per-frame sort if sorting depth is all you do.
This is a well-known and proven idiom. The technique has constant per-frame overhead, amortizing the work over several frames.
You didn't read or understand what I was saying at all because everybody except yourself is stupid (at least that's the message delivered by your answers). You stopped reading after "Bubblesort" and went into discussing completely unrelated details about a full Bubblesort, of which some, such as the cache misses, are outright wrong (and... irrelevant, because you're discussing a totally different thing).
Besides, when sorting indices (or pointers) with the intent of iterating the corresponding random-address objects in pseudo-random order subsequently, complaining about too many cache misses is a bit of a queer move anyway.
Similar story with this stable sort thing. It's not a problem in the first place. My example of what you need stable sorts for may or may not have been the 100% best fit (though I believe it is not wrong), but that doesn't make a difference. Sure, you like to play on sophistries, but to what avail. Whether or not std::sort is stable simply isn't an issue in this case.
Artefacts due to using a non-stable sort can only ever occur on transparent geometry. Further, it can only happen on different transparent objects which have exactly the same properties (as far as the renderqueue can tell) and occupy the same space, but still you treat them as "different" objects.
This is something that shouldn't happen in the first place, unless your model is "broken". Mind you, we're not talking about objects being somewhat close to each other or touching, or objects having similar properties.
But if it does happen (or can happen) for some reason there are at least two obvious solutions to it (well, three... stable sort would be the third). Either, you simply don't split something that to all appearances isn't two separate objects into two separable identical objects.
Or, you can include the object's index in the sort key as a preemptive measure, which automatically makes every sort stable, since objects that occupy the same space and have all the same properties are still binary different. No identical keys exist that could be ordered differently.
Either way, we're already 3 posts down discussing absolutely nothing except ego. Which really doesn't bring the thread forward.
(And it probably annoys everyone else as much as it annoys me, so I'm not going to continue that discussion.)
</off-topic>
To answer the original question: just sort the indices, and use a radix sort. If your indices are 32-bit, you can use a three-pass radix sort with 11-bit buckets that will nicely fit into the CPU's L1 cache.
This looks like indeed a nice idea for applying radix sort. There's so few places where it really fits.
I couldn't tell whether it's worth the trouble (still always using std::sort, which for anything practical surprisingly turns out just good enough every time, I've never had enough of a justification for actually using radix sort in my entire life), but from playing with it a decade or two ago, I remember that radix sort can easily be 3-4 times faster than e.g. quicksort/introsort.
So if the sort is indeed bottlenecking you out (that is, you don't meet your frame time, and profiling points to the sort), that would probably be a viable option.
Slight nitpick, though: One doesn't of course sort the indices, which would be somewhat pointless. You most certainly didn't mean to say that, but it kind of sounded like it.
One sorts the keys (moving indices). Or well, one sorts indices by key, or whatever one would call it.
Which means that most likely, 3 passes won't be enough, sadly (or you need bigger buckets), since you will almost certainly have at least 48 or 64-bit numbers to deal with (except if you have very few passes and render states, and are very unkind to your depth info).
Not using a stable radix sort to save temp memory may be viable (can instead concat the indices like described above if needed, even if this means an extra pass... the storage size difference alone likely outweights the extra pass because of fitting L1).