Hmm... so I did a quick test of implementing a recursive mergesort knockoff in a compute shader, performing a separate dispatch call for each split.
Unfortunately this seems to be very inefficient (on average it seems my implementation sorts one million integers in slightly over half a second).
The following is a simple, dirty HLSL program for doing the sorting:
cbuffer SortCountData : register(b0) {
uint ArraySize;
uint ElemCount;
};
struct sElemData {
int id;
};
StructuredBuffer<sElemData> In : register(t0);
RWStructuredBuffer<sElemData> Out : register(u0);
[numthreads(64, 1, 1)]
void MergeSort(uint3 threadId : SV_DispatchThreadId) {
int leftOffset = threadId.x * ElemCount * 2;
int rightOffset = leftOffset + ElemCount;
int leftSize = ElemCount;
int rightSize = (rightOffset + ElemCount >= ArraySize) ? ArraySize - rightOffset : ElemCount;
int subSize = ElemCount * 2;
int leftId = 0;
int rightId = 0;
if((uint)leftOffset >= ArraySize)
return;
for(int n = 0; n < subSize; n++) {
if(leftId >= leftSize) {
// Add all remaining elements from the (sorted) right list
while(n < subSize)
Out[leftOffset + n++].id = In[rightOffset + rightId++].id;
return;
} else if(rightId >= rightSize) {
// Add all remaining elements from the (sorted) left list
while(n < subSize)
Out[leftOffset + n++].id = In[leftOffset + leftId++].id;
return;
}
if(In[leftOffset + leftId].id <= In[rightOffset + rightId].id)
Out[leftOffset + n].id = In[leftOffset + leftId++].id;
else
Out[leftOffset + n].id = In[rightOffset + rightId++].id;
}
}
I'm swapping the In and Out buffers between dispatches so that the shader always works with merging two individually sorted sub-lists.
The number of thread groups for each dispatch is determined as ceil((totalBufferElementCount / (subBufferElementCount * 2)) / 2.0f) and subBufferElementCount = pow(2, pass) where pass goes from zero to the rounded-up log2() of the total buffer element count.
I tried removing the first passes by doing an initial simple O(n^2) sort on the items into 64-element sub buffers so that the compute shader wouldn't have to start with single element buffers, but that didn't seem to increase the efficiency in any noticible way, which indicates that the majority of the slowdown would come from the last passes where finally a single thread will have to go through the entire buffer. However I can think of no other, more parallellized way of sorting an entire list; it cannot be entirely done in separate passes (threads)?
I suppose there might be other algorithms that lend themselves better to this type of use, however I haven't been able to find any adequate descriptions of things like bitonic and radix sorts which are mentioned in various papers but never really defined.
Since this must doubtlessly be a rather common problem to solve, I was wondering if anybody might point out something obvious I've overlooked, a better way to parallelize mergesort (or some other type of sorting) or perhaps provide a (informative, not "buy the whole paper with ambiguous content by clicking here" ) source on the afforementioned networked sorting algorithms?