Qbz

Adding transparency/sorting to particle system killed performance


Hello, I am trying to implement the particle system discussed in AMD's 2014 GDC talk (http://www.gdcvault.com/play/1020002/Advanced-Visual-Effects-with-DirectX). I've run into issues adding transparency/sorting: I lose a ton of performance, and it introduces a weird perf bug I'll explain later. I'm pretty confident that my performance issues lie in my implementation of a bitonic sort.
 
c++
 

  void ParticleSystem::Sort(ID3D11DeviceContext* context, Matrix view)
  {
    //Fills the dispatch indirect buffer for presort
    _PreparePresortCS->Use(context);
    _PreparePresortCS->UnbindResources(context);

    //I wasn't sure how to do a bitonic sort if the set is not a power 
    //of 2, so this appends dummy particle indices with an arbitrary huge 
    //view distance so they're guaranteed to be at the end and not drawn
    _PresortCS->SetConstantBuffer(context, view, 0u);
    _PresortCS->UseIndirect(context, _SortCountBuffer);
    _PresortCS->UnbindResources(context);

    //Calls Copy append count on the sort buffer
    u32 sortCount = ReadSortCount(context) * kPresortThreads;

    //Sort
    _SortCS->UpdateResources(context);
    context->CSSetShader(_SortCS->_pShader.pCompute, nullptr, 0u);
    for (u32 setSize = 2; setSize <= sortCount; setSize *= 2)
    {
      bitonicData data;
      data.setSize = setSize;
      for (u32 compareDist = setSize / 2; compareDist > 0; compareDist /= 2)
      {
        data.compareDist = compareDist;
        data.twoCompareDist = compareDist * 2;
        //I think this might be the problem: using the NVIDIA Nsight profiler, 
        //it seems like a lot of my time is spent in Map. My set const buffer
        //function uses Map, memcpy, Unmap
        _SortCS->SetConstantBuffer(context, data, 0u);
        context->DispatchIndirect(_SortCountBuffer, 0u);
      }
    }
    _SortCS->UnbindResources(context);
  }
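
To get a feel for how many constant-buffer updates and dispatches the nested loops above issue, here is a minimal CPU-only sketch that enumerates the passes. `BitonicPass` and `BuildBitonicPasses` are hypothetical names that are not part of the original code, and the sketch assumes `sortCount` is a power of two.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical struct mirroring the layout of the bitonicData cbuffer.
struct BitonicPass
{
  std::uint32_t setSize;
  std::uint32_t compareDist;
  std::uint32_t twoCompareDist;
  float padding;
};

// Enumerates every (setSize, compareDist) pair the nested loops visit.
// Assumption: sortCount is a power of two.
std::vector<BitonicPass> BuildBitonicPasses(std::uint32_t sortCount)
{
  std::vector<BitonicPass> passes;
  // 64-bit counter so that setSize *= 2 cannot wrap for large counts.
  for (std::uint64_t setSize = 2; setSize <= sortCount; setSize *= 2)
  {
    std::uint32_t size = static_cast<std::uint32_t>(setSize);
    for (std::uint32_t compareDist = size / 2; compareDist > 0; compareDist /= 2)
    {
      passes.push_back({ size, compareDist, compareDist * 2, 0.0f });
    }
  }
  return passes;
}
```

For a list of 2^k elements this yields k(k+1)/2 passes, so 2^20 elements means 210 constant-buffer updates and 210 dispatches per sort.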

And here are the shaders. Although the presort stuff might not be great, I'm pretty confident that the problem is in the sort: my performance is fine if I presort but don't sort.
 
hlsl
 

#include "ParticleHeader.hlsl"

cbuffer bitonicData
{
  uint setSize;
  uint compareDist;
  uint twoCompareDist;
  float padding;
};

RWStructuredBuffer<SortData> SortList;

void Swap(uint index, uint compareIndex)
{
  SortData temp = SortList[index];
  SortList[index] = SortList[compareIndex];
  SortList[compareIndex] = temp;
}

[numthreads(1, kSortThreads, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
  uint threadIndex = dispatchThreadID.x * kSortThreads + dispatchThreadID.y;
  uint index = twoCompareDist * (threadIndex / compareDist) + threadIndex % compareDist;
  uint compareIndex = index + compareDist;

  uint descending = (index / setSize) % 2;
  if (descending)
  {
    //if this is less than other, not descending
    if (SortList[index].zDistance < SortList[compareIndex].zDistance)
    {
      Swap(index, compareIndex);
    }
  }
  else
  {
    //if this is greater than other, not ascending
    if (SortList[index].zDistance > SortList[compareIndex].zDistance)
    {
      Swap(index, compareIndex);
    }
  }
} 
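
One property of the shader above that can be checked on the CPU is that, within a single pass, no two threads touch the same element, so threads in the same pass cannot race with each other. The following is a minimal sketch of that check using the same index arithmetic. `PassTouchesEachElementOnce` is a hypothetical helper, and it assumes the list length `n` is a power of two and that exactly `n / 2` threads run.

```cpp
#include <cstdint>
#include <vector>

// Returns true if, for one pass, the n/2 threads together touch every
// element of an n-element list exactly once.
bool PassTouchesEachElementOnce(std::uint32_t n, std::uint32_t compareDist)
{
  std::vector<std::uint32_t> touches(n, 0);
  for (std::uint32_t threadIndex = 0; threadIndex < n / 2; ++threadIndex)
  {
    // Same index math as the shader (twoCompareDist == 2 * compareDist).
    std::uint32_t index = 2 * compareDist * (threadIndex / compareDist) + threadIndex % compareDist;
    std::uint32_t compareIndex = index + compareDist;
    if (compareIndex >= n)
    {
      return false; // would read past the end of the list
    }
    ++touches[index];
    ++touches[compareIndex];
  }
  for (std::uint32_t count : touches)
  {
    if (count != 1)
    {
      return false;
    }
  }
  return true;
}
```

The check only covers one pass at a time. It says nothing about ordering between passes, which still has to be enforced separately.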

In addition to the general slowdown, adding this to the particle system introduces a perf bug in a specific situation. If the particles are mostly on screen but slightly off screen, there's a significant slowdown. If the particles are in full view, mostly out of view, or entirely out of view, performance is much better. Here are some screenshots showing this perf bug:

 

Here the problem is visible: with the particles slightly out of view, performance drops to ~30 fps.

z5JdPuZ.png

 

Here, the system is entirely in view, and performance is fine at ~60 fps. 

1C2NC1a.png

 

Here, the system is mostly out of view and the performance is fine at ~60 fps 

Jiiwktq.png

 


It's possible that your performance issues arise because your first case is actually fillrate-bound and have absolutely nothing to do with your sort.

Actually, if I remember that paper/presentation correctly, it uses tiling and shared memory in compute shaders to avoid such problems.


Thanks for the replies. I was originally hoping I wouldn't have to implement tiled rendering, but I might have to. I do have a follow-up question, though. Noticing that a lot of time was spent mapping and unmapping the cbuffer, I tried moving the loop into the shader so that the sort is done in one dispatch rather than many, and I don't need to update a cbuffer for each sort pass. However, this leads to incorrect sorting results. I'm a bit confused as to why, since it seems equivalent to me. Here's the updated HLSL code:

Buffer<uint> sortCount;
RWStructuredBuffer<SortData> SortList;

void Swap(uint index, uint compareIndex)
{
  SortData temp = SortList[index];
  SortList[index] = SortList[compareIndex];
  SortList[compareIndex] = temp;
}

[numthreads(1, kSortThreads, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
  //only need half as many threads to sort because each sort pass acts on 2 elements,
  //this should give indices from 0 to 1/2 array size. 
  uint threadIndex = dispatchThreadID.x * kSortThreads + dispatchThreadID.y;

  //num thread groups in the presort * num presort threads is total size of array
  for (uint setSize = 2; setSize <= sortCount[0] * kPresortThreads; setSize *= 2)
  {
    for (uint compareDist = setSize / 2; compareDist > 0; compareDist /= 2)
    {
      uint index = 2 * compareDist * (threadIndex / compareDist) + threadIndex % compareDist;
      uint compareIndex = index + compareDist;

      uint descending = (index / setSize) % 2;
      if (descending)
      {
        //if this is less than other, not descending
        if (SortList[index].zDistance < SortList[compareIndex].zDistance)
        {
          Swap(index, compareIndex);
        }
      }
      else
      {
        //if this is greater than other, not ascending
        if (SortList[index].zDistance > SortList[compareIndex].zDistance)
        {
          Swap(index, compareIndex);
        }
      }

      //Is it possible that other thread groups in the dispatch call don't have to 
      //wait here potentially messing up the results? 
      DeviceMemoryBarrierWithGroupSync();
    }
  }
}

And the updated C++ code

_PreparePresortCS->Use(context);
_PreparePresortCS->UnbindResources(context);
_PresortCS->SetConstantBuffer(context, view, 0u);
_PresortCS->UseIndirect(context, _SortCountBuffer);
_PresortCS->UnbindResources(context);
    
_SortCS->UpdateResources(context);
_SortCS->UseIndirect(context, _SortCountBuffer);
_SortCS->UnbindResources(context);
