DirectCompute - why is shared memory slower than global?



#1 maxest   Members   -  Reputation: 294


Posted 15 April 2012 - 09:59 AM

I'm writing a software rasterizer. I decided to speed up pixel processing using DirectCompute. My data organization scheme is more or less this:

- process at most 64 triangles in one Dispatch
- store the data describing the triangles (256 bytes per triangle; see the hypothetical layout sketch below) in a structured buffer called trianglesBuffer
- store 64 uints per tile of threads, containing indices into trianglesBuffer; call this buffer indicesToTrianglesBuffer
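
For reference, TriangleToRasterize itself is not shown in this post; a hypothetical layout that matches the stated 256-byte element size (16 x float4) could look like this:

// Hypothetical layout only -- the actual fields are not shown in this post.
// 16 x float4 = 16 x 16 bytes = 256 bytes, matching the element size stated above.
struct TriangleToRasterize
{
    float4 v0, v1, v2;        // vertex positions
    float4 edgeEquations[3];  // precomputed edge functions for rasterization
    float4 attributes[10];    // interpolants: colors, UVs, depth gradients, etc.
};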

The screen is divided into tiles, each having 16x16 pixels. Each pack of 64 uints in indicesToTrianglesBuffer contains indices into trianglesBuffer, so that each tile (a DirectCompute thread group) knows which triangles overlap it and thus need to be rendered. A value of 0 in indicesToTrianglesBuffer means that no more triangles need to be processed (fewer than 64 triangles may overlap a particular tile). The part of my DirectCompute code that does this organization looks like this:
...

RWStructuredBuffer<TriangleToRasterize> trianglesBuffer: register(u2);
RWByteAddressBuffer indicesToTrianglesBuffer: register(u3);

[numthreads(16, 16, 1)]
void main(uint3 dtID: SV_DispatchThreadID, uint3 gID: SV_GroupID)
{
    int tileIndex = screenWidth_tiles*gID.y + gID.x;
    int2 pixelXY = int2(dtID.x, dtID.y);

    [allow_uav_condition]
    for (int i = 0; i < 64; i++)
    {
        int triangleIndex = indicesToTrianglesBuffer.Load(4*(64*tileIndex + i));

        if (triangleIndex == 0)
            break;

        TriangleToRasterize t = trianglesBuffer[triangleIndex - 1];

        // rasterize
        // ...
    }
}

So first of all we need to find out which tile the thread group is associated with. Then we simply iterate over all 64 "prospective" triangles and fetch the indices of the triangles that affect the tile. If a value of 0 is read from indicesToTrianglesBuffer, it means no more triangles affect the tile, so we can break out of the loop. Note that we access "triangleIndex - 1" in trianglesBuffer. This is because a triangleIndex of 0 means "no triangle", so we have to subtract 1 to get the proper address into trianglesBuffer.
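
For illustration, the binning side that fills indicesToTrianglesBuffer has to write indices with this +1 bias; a hypothetical sketch of that write (AppendTriangleToTile is not from my actual code):

// Hypothetical helper, shown only to illustrate the index encoding.
// Stores triangleIndex + 1 so that 0 can serve as the "no more triangles" terminator.
void AppendTriangleToTile(int tileIndex, int slot, uint triangleIndex)
{
    indicesToTrianglesBuffer.Store(4*(64*tileIndex + slot), triangleIndex + 1);
}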

The algorithm works nicely and as expected. At this point I thought I could make use of shared memory to speed up access to trianglesBuffer. So I reworked the code to look like this:
RWStructuredBuffer<TriangleToRasterize> trianglesBuffer: register(u2);
RWByteAddressBuffer indicesToTrianglesBuffer: register(u3);

groupshared TriangleToRasterize trianglesBuffer_shared[64];
groupshared uint indicesToTrianglesBuffer_shared[64];

[numthreads(16, 16, 1)]
void main(uint gi: SV_GroupIndex, uint3 dtID: SV_DispatchThreadID, uint3 gID: SV_GroupID)
{
    int tileIndex = screenWidth_tiles*gID.y + gID.x;
    int2 pixelXY = int2(dtID.x, dtID.y);

    if (gi < 64)
    {
        uint triangleIndex = indicesToTrianglesBuffer.Load(4*(64*tileIndex + gi));

        if (triangleIndex > 0)
            trianglesBuffer_shared[gi] = trianglesBuffer[triangleIndex - 1];

        indicesToTrianglesBuffer_shared[gi] = triangleIndex;
    }

    GroupMemoryBarrierWithGroupSync();

    [loop]
    for (int i = 0; i < 64; i++)
    {
        if (indicesToTrianglesBuffer_shared[i] == 0)
            break;

        TriangleToRasterize t = trianglesBuffer_shared[i];

        // rasterize
        // ...
    }
}

This is basically the same thing, except that we first copy the buffers into their _shared counterparts. Now, since each thread reads only one triangle and stores it in shared memory, and since *all* threads then access the same address in shared memory at once (no bank conflicts, thanks to broadcasting), I expected a significant performance boost. What I in fact got was a 25-30% decrease.

Can anyone explain why this is happening?


#2 MJP   Moderators   -  Reputation: 11765


Posted 15 April 2012 - 08:12 PM

It does seem weird to get such a slowdown. The one thing that jumps out at me is that you're using a lot of shared memory (over 16K, based on your statement that each TriangleToRasterize element is 256 bytes), which can have a pretty negative effect on occupancy and performance. However, it's hard to tell without some profiler data.
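
For instance, you could try staging the triangles in smaller batches so that only a fraction of that shared memory is live at any time. Just a sketch of the idea (untested, and the batch size of 8 is a guess):

// Sketch: stage 8 triangles at a time, so the group only needs
// 8 * 256 + 8 * 4 bytes (~2 KB) of groupshared instead of over 16 KB.
groupshared TriangleToRasterize trianglesBatch_shared[8];
groupshared uint indicesBatch_shared[8];

[numthreads(16, 16, 1)]
void main(uint gi: SV_GroupIndex, uint3 dtID: SV_DispatchThreadID, uint3 gID: SV_GroupID)
{
    int tileIndex = screenWidth_tiles*gID.y + gID.x;
    int2 pixelXY = int2(dtID.x, dtID.y);
    bool done = false;

    for (int batch = 0; batch < 64 && !done; batch += 8)
    {
        // the first 8 threads fetch the next batch into shared memory
        if (gi < 8)
        {
            uint triangleIndex = indicesToTrianglesBuffer.Load(4*(64*tileIndex + batch + gi));
            indicesBatch_shared[gi] = triangleIndex;
            if (triangleIndex > 0)
                trianglesBatch_shared[gi] = trianglesBuffer[triangleIndex - 1];
        }
        GroupMemoryBarrierWithGroupSync();

        for (int i = 0; i < 8; i++)
        {
            if (indicesBatch_shared[i] == 0)
            {
                done = true;   // group-uniform, since all threads read the same shared value
                break;
            }
            TriangleToRasterize t = trianglesBatch_shared[i];
            // rasterize
        }
        GroupMemoryBarrierWithGroupSync();   // before the next batch overwrites the staging arrays
    }
}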

#3 maxest   Members   -  Reputation: 294


Posted 16 April 2012 - 05:25 AM

I've just stumbled upon this link http://stackoverflow.com/questions/9196134/cuda-is-coalesced-global-memory-access-faster-than-shared-memory-also-does-al and it seems that the more shared memory is used, the fewer blocks can be scheduled. Since I have 256 bytes per triangle and 64 triangles, that's (as you mentioned, MJP) 16K of shared memory per block. And since the whole multiprocessor has only 32K, no more than 1 or 2 blocks can be scheduled on it at a time.
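
Spelling out the arithmetic:

64 triangles * 256 bytes/triangle = 16,384 bytes = 16K of groupshared memory per thread group
32K of shared memory per multiprocessor / 16K per group = at most 2 thread groups resident at a time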



