That's an interesting question. What do you have at your disposal? Compute shaders plus atomics could probably do the job - they are fast enough that people use them for order-independent transparency (OIT). Building the acceleration structure seems to be the most important part.
You could do something like a matrix transpose in Local Data Share... if the API you're using exposes it (I'm not sure DirectCompute does).
Assuming you use DirectCompute or some Shader Model 5 profile, you could use a memory barrier to gather information from the work-items in a workgroup, reduce their per-item lists, and then spill the result out to a global buffer.
This global buffer would still contain duplicates, but it would likely be much smaller.