- process at most 64 triangles in one Dispatch
- store data describing triangles (256 bytes per triangle) in a structured buffer called trianglesBuffer
- store 64 uints per tile of threads, holding indices into trianglesBuffer, in a buffer called indicesToTrianglesBuffer
The screen is divided into tiles of 16x16 pixels each. Each pack of 64 uints in indicesToTrianglesBuffer contains indices into trianglesBuffer, so that each tile (DirectCompute thread group) knows which triangles overlap it and therefore need to be rendered. A value of "0" in indicesToTrianglesBuffer means that no more triangles need to be processed (fewer than 64 triangles may overlap a particular tile). The part of my DirectCompute code that does the organization looks like this:
...
RWStructuredBuffer<TriangleToRasterize> trianglesBuffer: register(u2);
RWByteAddressBuffer indicesToTrianglesBuffer: register(u3);
[numthreads(16, 16, 1)]
void main(uint3 dtID: SV_DispatchThreadID, uint3 gID: SV_GroupID)
{
    int tileIndex = screenWidth_tiles*gID.y + gID.x;
    int2 pixelXY = int2(dtID.x, dtID.y);
    [allow_uav_condition]
    for (int i = 0; i < 64; i++)
    {
        int triangleIndex = indicesToTrianglesBuffer.Load(4*(64*tileIndex + i));
        if (triangleIndex == 0)
            break;
        TriangleToRasterize t = trianglesBuffer[triangleIndex - 1];
        // rasterize
So first of all we need to find out which tile the thread group is associated with. Then we simply iterate over all 64 "prospective" slots and read the indices of the triangles that affect the tile. If a value of 0 is read from indicesToTrianglesBuffer, it means that no more triangles affect the tile, so we can break out of the loop. Note that we access "triangleIndex - 1" in trianglesBuffer. This is because "0" in triangleIndex means "no triangle", so the stored indices are 1-based and we have to subtract 1 to address trianglesBuffer correctly.
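The addressing described above can be modeled on the CPU with a minimal C++ sketch. It is not part of the shader: the names `kMaxTrianglesPerTile`, `slotByteOffset` and `trianglesForTile` are hypothetical, and the index buffer is assumed to be a flat array of 32-bit uints laid out as 64 consecutive slots per tile.

```cpp
#include <cstdint>
#include <vector>

// Assumption: 64 index slots per tile, as in the post.
constexpr std::uint32_t kMaxTrianglesPerTile = 64;

// Byte offset of a slot, mirroring the argument passed to Load() in the
// shader: 4 * (64 * tileIndex + i).
constexpr std::uint32_t slotByteOffset(std::uint32_t tileIndex, std::uint32_t slot) {
    return 4u * (kMaxTrianglesPerTile * tileIndex + slot);
}

// Walks one tile's slot list and returns zero-based indices into the
// triangle array. A stored value of 0 terminates the list; any other
// stored value is the triangle index plus one.
std::vector<std::uint32_t> trianglesForTile(const std::vector<std::uint32_t>& indices,
                                            std::uint32_t tileIndex) {
    std::vector<std::uint32_t> result;
    for (std::uint32_t i = 0; i < kMaxTrianglesPerTile; ++i) {
        const std::uint32_t stored = indices[kMaxTrianglesPerTile * tileIndex + i];
        if (stored == 0)
            break;                      // no more triangles for this tile
        result.push_back(stored - 1);   // undo the 1-based encoding
    }
    return result;
}
```

Reserving 0 as the terminator is what forces the "- 1" on every lookup; the sketch keeps that convention rather than using a separate per-tile count.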
The algorithm works nicely and as expected. At this point I thought I could make use of shared memory to speed up access to trianglesBuffer. So I rewrote the code to look like this:
RWStructuredBuffer<TriangleToRasterize> trianglesBuffer: register(u2);
RWByteAddressBuffer indicesToTrianglesBuffer: register(u3);
groupshared TriangleToRasterize trianglesBuffer_shared[64];
groupshared uint indicesToTrianglesBuffer_shared[64];
[numthreads(16, 16, 1)]
void main(uint gi: SV_GroupIndex, uint3 dtID: SV_DispatchThreadID, uint3 gID: SV_GroupID)
{
    int tileIndex = screenWidth_tiles*gID.y + gID.x;
    int2 pixelXY = int2(dtID.x, dtID.y);
    if (gi < 64)
    {
        uint triangleIndex = indicesToTrianglesBuffer.Load(4*(64*tileIndex + gi));
        if (triangleIndex > 0)
            trianglesBuffer_shared[gi] = trianglesBuffer[triangleIndex - 1];
        indicesToTrianglesBuffer_shared[gi] = triangleIndex;
    }
    GroupMemoryBarrierWithGroupSync();
    [loop]
    for (int i = 0; i < 64; i++)
    {
        if (indicesToTrianglesBuffer_shared[i] == 0)
            break;
        TriangleToRasterize t = trianglesBuffer_shared[i];
        // rasterize
This is basically the same thing, except that we copy the XXX buffers to XXX_shared buffers. Now, since each of the first 64 threads reads only one triangle and stores it in shared memory, and since *all* threads then access the same address in shared memory (no bank conflicts due to broadcasting), I expected a significant performance boost. What I in fact got is a 25-30% decrease in performance.
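For reference, the groupshared footprint of this version can be worked out from the numbers already given (64 slots, 256 bytes per triangle, 16x16 threads). The C++ sketch below only does that arithmetic; the constant names are hypothetical, the 32 KB figure is the cs_5_0 per-group limit on groupshared memory, and whether this footprint actually limits how many groups the hardware keeps in flight is an assumption that would need profiling to confirm.

```cpp
#include <cstddef>
#include <cstdint>

// Values taken from the post.
constexpr std::size_t kTriangleBytes   = 256;      // size of TriangleToRasterize
constexpr std::size_t kSlotsPerTile    = 64;       // entries in each shared array
constexpr std::size_t kThreadsPerGroup = 16 * 16;  // numthreads(16, 16, 1)

// Shared memory declared per thread group.
constexpr std::size_t kSharedTriangleBytes = kSlotsPerTile * kTriangleBytes;           // 16384
constexpr std::size_t kSharedIndexBytes    = kSlotsPerTile * sizeof(std::uint32_t);    // 256
constexpr std::size_t kSharedTotalBytes    = kSharedTriangleBytes + kSharedIndexBytes; // 16640

// cs_5_0 allows at most 32 KB of groupshared memory per thread group.
constexpr std::size_t kCs50GroupSharedLimit = 32 * 1024;

// How many such groups would fit in one 32 KB window. Treating this as a
// proxy for occupancy is an assumption, not a measured fact.
constexpr std::size_t kGroupsPerLimit = kCs50GroupSharedLimit / kSharedTotalBytes;

// Threads that skip the copy and wait at the barrier (gi >= 64).
constexpr std::size_t kIdleThreadsDuringCopy = kThreadsPerGroup - kSlotsPerTile;

static_assert(kSharedTotalBytes <= kCs50GroupSharedLimit,
              "declaration fits within the cs_5_0 limit");
```

So each group declares a little over 16 KB of shared memory, and during the copy 64 threads each move a full 256-byte struct while the other 192 wait at GroupMemoryBarrierWithGroupSync().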
Can anyone explain why this is happening?






