# Geometry shader-generated camera-aligned particles seemingly lacking Z writing


## Recommended Posts

I find myself having another relatively baffling issue while playing around with billboard-based GPGPU particle rendering.

This time I've got a proper render in all but excessive z-fighting, which seems to occur because the order in which the billboards (particles) are drawn varies between frames. I tried to record a video reference of the issue, but for whatever reason Fraps decided to only record a black screen tonight. I can try to get a properly recorded video up later if needed, but I thought I would post this before bed.

If I disable my alpha testing and go purely with alpha blending, it appears that the completely transparent pixels of closer particles will sometimes overwrite opaque pixels of particles drawn behind them, suggesting that even fully transparent pixel writes fill out the depth buffer. The billboard quads aren't particularly close to each other at all, so this cannot be a normal z-fighting problem as far as I can tell.

Could it be that I'm forgetting some render state I ought to set? Or is this a common "problem" that has to be solved by ensuring that my individual quads are emitted back-to-front from my geometry shader?

##### Share on other sites

Care to show some shader code? Is your projection matrix set up properly, with sensible znear and zfar values? Is the problem affecting only the particles, while other things render correctly?

Cheers!

##### Share on other sites
Sounds like plain old alpha-blending issues: transparent geometry has to be sorted and drawn back to front. You'd have to sort your particles (using a compute shader, etc.) before this billboard pass.

Alternatively, you can disable depth writes and deal with blend-order artifacts instead of z-test artifacts.
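For illustration, the back-to-front ordering such a sort pass produces can be sketched on the CPU (a Python sketch with made-up names; a real implementation would key and sort in a compute shader, typically on view-space depth rather than plain distance):

```python
def view_depth_key(pos, eye):
    # Squared distance to the eye; sufficient for ordering, no sqrt needed
    return sum((p - e) ** 2 for p, e in zip(pos, eye))

def sort_back_to_front(particles, eye):
    # Farthest particles first, so nearer ones blend over them correctly
    return sorted(particles, key=lambda p: view_depth_key(p, eye), reverse=True)

eye = (0.0, 0.0, 0.0)
particles = [(0.0, 0.0, 2.0), (0.0, 0.0, 9.0), (0.0, 0.0, 5.0)]
print(sort_back_to_front(particles, eye))
# Farthest particle (z = 9) comes out first, nearest (z = 2) last
```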

##### Share on other sites

> Care to show some shader code? Is your projection matrix set up properly, with sensible znear and zfar values? Is the problem affecting only the particles, while other things render correctly?

The code is rather messy at the moment, but basically it goes like this:

1. Using a point-list topology, get each vertex's ID in the geometry shader and use it to index into a StructuredBuffer previously built by two compute shader programs, Update and Emit.
2. Create a quad as an array of four vertices. The quad is aligned with the camera by calculating its right vector as the cross product of the up vector (a static (0, 1, 0)) and (quadCenterPos - EyePos).
3. Project the vertex positions using a world-view-projection matrix. There is nothing wrong with this one; it renders other things just fine, and if you move the viewing position around you can see that there is indeed space in between the individual particles, as there should be.
4. Append the four vertices (bottom left, top left, bottom right, top right) to an output TriangleStream from the GS.
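Step 2 above can be sketched on the CPU like this (plain Python, purely illustrative; the helper names are mine, and in practice this runs in the geometry shader using HLSL's cross() and normalize()):

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = sum(x * x for x in v) ** 0.5
    return tuple(x / length for x in v)

def billboard_right(center, eye, up=(0.0, 1.0, 0.0)):
    # right = normalize(cross(up, quadCenterPos - EyePos)), as in step 2
    return normalize(cross(up, sub(center, eye)))

print(billboard_right(center=(0.0, 0.0, 5.0), eye=(0.0, 0.0, 0.0)))
# A quad straight ahead on +z gets right = (1.0, 0.0, 0.0)
```

Note that with a fixed up vector the quad only rotates about the y axis, which matches the static (0, 1, 0) described above.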

I've reached the same conclusion as Hodgman: they will indeed have to be individually sorted; after all, I already sort my individual transparent meshes by view depth.

The next problem will then be finding an efficient sorting algorithm that can be parallelized. I'm sure that's just a Google search away, though.

Disabling depth writes would probably cause similarly obvious artifacts, since the order of the particles can currently change from one frame to the next depending on how the update threads finish copying data from the old state buffer to the new one; or am I wrong to make that assumption?

Thanks,

Husbjörn

##### Share on other sites

Hmm... so I did a quick test of implementing a recursive merge sort knockoff in a compute shader, performing a separate dispatch call for each merge level.

Unfortunately this seems to be very inefficient (on average my implementation sorts one million integers in slightly over half a second).

The following is a simple, dirty HLSL program for doing the sorting:

cbuffer SortCountData : register(b0) {
    uint ArraySize;   // total number of elements in the buffer
    uint ElemCount;   // length of each already-sorted sub-list in this pass
};

struct sElemData {
    int id;
};

StructuredBuffer<sElemData>   In  : register(t0);
RWStructuredBuffer<sElemData> Out : register(u0);

// One thread merges two adjacent sorted sub-lists of ElemCount elements each
[numthreads(64, 1, 1)]
void MergeSortCS(uint3 threadId : SV_DispatchThreadID) {
    int leftOffset  = threadId.x * ElemCount * 2;
    int rightOffset = leftOffset + ElemCount;
    int leftSize    = ElemCount;
    int rightSize   = (rightOffset + ElemCount >= ArraySize) ? ArraySize - rightOffset : ElemCount;
    int totalSize   = leftSize + rightSize;   // also bounds a partial final group correctly
    int leftId      = 0;
    int rightId     = 0;

    if((uint)leftOffset >= ArraySize)
        return;
    for(int n = 0; n < totalSize; n++) {
        if(leftId >= leftSize) {
            // Left list exhausted; add the remaining (sorted) right elements
            while(n < totalSize)
                Out[leftOffset + n++].id = In[rightOffset + rightId++].id;
            return;
        } else if(rightId >= rightSize) {
            // Right list exhausted; add the remaining (sorted) left elements
            while(n < totalSize)
                Out[leftOffset + n++].id = In[leftOffset + leftId++].id;
            return;
        }
        if(In[leftOffset + leftId].id <= In[rightOffset + rightId].id)
            Out[leftOffset + n].id = In[leftOffset + leftId++].id;
        else
            Out[leftOffset + n].id = In[rightOffset + rightId++].id;
    }
}


I'm swapping the In and Out buffers between dispatches so that the shader is always merging two individually sorted sub-lists.

The number of thread groups for each dispatch is determined as ceil((totalBufferElementCount / (subBufferElementCount * 2)) / 2.0f), where subBufferElementCount = pow(2, pass) and pass runs from zero up to the rounded-up log2() of the total buffer element count.
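As a sanity check, that pass structure can be mirrored on the CPU (a Python sketch; the list swap stands in for ping-ponging the In/Out StructuredBuffers, and each loop iteration inside merge_pass stands in for one GPU thread):

```python
import math

def merge_pass(src, dst, sub):
    # One dispatch: merge adjacent pairs of sorted sub-lists of length `sub`
    n = len(src)
    for offset in range(0, n, sub * 2):
        l, r = offset, min(offset + sub, n)          # heads of the two sub-lists
        l_end, r_end = r, min(offset + sub * 2, n)   # one past the end of each
        out = offset
        while l < l_end and r < r_end:
            if src[l] <= src[r]:
                dst[out] = src[l]; l += 1
            else:
                dst[out] = src[r]; r += 1
            out += 1
        while l < l_end:                             # drain whichever list remains
            dst[out] = src[l]; l += 1; out += 1
        while r < r_end:
            dst[out] = src[r]; r += 1; out += 1

def gpu_style_mergesort(data):
    # `pass` runs from 0 to ceil(log2(n)); the sub-list size doubles each pass
    src, dst = list(data), list(data)
    for p in range(max(1, math.ceil(math.log2(len(data))))):
        merge_pass(src, dst, 2 ** p)
        src, dst = dst, src                          # swap, like the In/Out buffers
    return src

print(gpu_style_mergesort([5, 3, 8, 1, 9, 2, 7, 4, 6]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```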

I tried removing the first passes by doing an initial simple O(n^2) sort of the items into 64-element sub-buffers, so that the compute shader wouldn't have to start with single-element lists, but that didn't seem to improve efficiency in any noticeable way. This indicates that the majority of the slowdown comes from the last passes, where finally a single thread has to walk the entire buffer. However, I can think of no more parallelized way of sorting the entire list; can it really not be done entirely in separate passes (threads)?

I suppose there might be other algorithms that lend themselves better to this type of use; however, I haven't been able to find any adequate descriptions of things like bitonic and radix sorts, which are mentioned in various papers but never really defined.

Since this must surely be a rather common problem to solve, I was wondering if anybody might point out something obvious I've overlooked, suggest a better way to parallelize merge sort (or some other kind of sort), or provide an informative (not "buy the whole paper with ambiguous content by clicking here") source on the aforementioned sorting networks?
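For what it's worth, the bitonic sort mentioned above is a fixed network of compare-and-swap stages, which is why it maps so well to the GPU: every element's comparison partner is known in advance, so each stage is one embarrassingly parallel dispatch. A CPU sketch (Python, assuming a power-of-two element count; pad with sentinel values otherwise):

```python
def bitonic_sort(data):
    # Each (k, j) stage corresponds to one GPU dispatch in which element i is
    # compared against a fixed partner i XOR j; (i & k) picks the direction.
    a = list(data)
    n = len(a)  # must be a power of two
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):                 # all n comparisons are independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 2, 5, 0, 6, 1, 4, 3]))
# [0, 1, 2, 3, 4, 5, 6, 7]
```

Unlike the merge approach, no stage ever degenerates to a single thread walking the whole buffer: the total work is O(n log² n), but it is spread evenly across all threads in every pass.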

##### Share on other sites

Hi.

You should be able to fix the problem with blend states. I'm using GPU particles like you are, and mine all work fine.

Maybe you need to show us an image.

I also noticed some strange z behaviour once when I was messing with the blend state.

##### Share on other sites

Are you sure about that? Doesn't blending work by combining the incoming pixel with the current backbuffer value, so that if you don't draw things back to front, one of your frontal particles may end up blending with the render target's clear colour and thus drawing that on top of other particles that should appear behind it?
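The order dependence is easy to demonstrate numerically. A quick Python sketch of the standard SRC_ALPHA / INV_SRC_ALPHA "over" blend (CPU-side, purely illustrative):

```python
def blend_over(src_rgb, src_a, dst_rgb):
    # out = src.rgb * src.a + dst.rgb * (1 - src.a), i.e.
    # SrcBlend = SRC_ALPHA, DestBlend = INV_SRC_ALPHA
    return tuple(s * src_a + d * (1.0 - src_a) for s, d in zip(src_rgb, dst_rgb))

background = (0.0, 0.0, 0.0)            # render target clear colour
red  = ((1.0, 0.0, 0.0), 0.5)           # half-transparent red particle
blue = ((0.0, 0.0, 1.0), 0.5)           # half-transparent blue particle

near_first = blend_over(*red, blend_over(*blue, background))   # blue behind red
far_first  = blend_over(*blue, blend_over(*red, background))   # red behind blue

print(near_first)  # (0.5, 0.0, 0.25)
print(far_first)   # (0.25, 0.0, 0.5)
```

When the draw order of two overlapping particles changes between frames, the final pixel flips between these two values, which is exactly the flicker described.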

Still, I would be interested in hearing your blend state settings if you believe those might be good enough :)

I rewrote my sorting algorithm to the following, which performs quite a bit better (although still not at a desirable rate, it should be "good enough" for a reasonable particle count, I guess):

cbuffer GlobalData : register(b0) {
    uint BufferSize;
};

cbuffer PassData : register(b1) {
    uint SubSize;
};

Buffer<int>   In  : register(t0);
RWBuffer<int> Out : register(u0);

// First pass: sorts 2-element sub-arrays in place by comparing and swapping their elements
[numthreads(64, 1, 1)]
void SortPairs(uint3 threadId : SV_DispatchThreadID) {
    uint offset = threadId.x * 2;
    if(offset + 1 < BufferSize) {
        if(Out[offset] > Out[offset + 1]) {
            int tmp = Out[offset + 1];
            Out[offset + 1] = Out[offset];
            Out[offset] = tmp;
        }
    }
}

// Merge pass: each thread merges two adjacent sorted sub-lists of SubSize elements
[numthreads(64, 1, 1)]
void MergeSort(uint3 threadId : SV_DispatchThreadID) {
    uint offset = threadId.x * SubSize * 2;
    uint pLeft  = offset;
    uint pRight = offset + SubSize;
    uint lLeft  = min(pRight, BufferSize);            // one past the end of the left list
    uint lRight = min(pRight + SubSize, BufferSize);  // one past the end of the right list

    if(offset < BufferSize) {
        // Elements left in both lists?
        while(pLeft < lLeft && pRight < lRight) {
            if(In[pLeft] <= In[pRight])
                Out[offset++] = In[pLeft++];
            else
                Out[offset++] = In[pRight++];
        }
        // One list has been exhausted; append the rest of the other (already sorted) one
        while(pLeft < lLeft)
            Out[offset++] = In[pLeft++];
        while(pRight < lRight)
            Out[offset++] = In[pRight++];
    }
}


However, I just discovered that the only way I can dispatch the MergeSort shader for the appropriate number of passes (log2(BufferSize)) is to indeed read the append buffer's element count back to the CPU, which I was hoping to avoid. Is there any way around this?

##### Share on other sites

Maybe it's the depth buffer writes?

I have this set in the shader:

DepthStencilState DepthWrites
{
DepthEnable = TRUE;
};

I only have dust, a flame thrower and an explosion type, and they look fine on the terrain; they don't render through the terrain when there is a hill.

The back particles may indeed blend wrong, but I can't see it in an explosion. Not yet, anyway.

Here's some fire:

Can we see an image?

Edited by ankhd

##### Share on other sites

True, doing that gets rid of the clear colour forming a rectangle around the individual particles, and everything looks fine on a per-frame basis.

Because of that, showing an image doesn't help much; individual frame captures look just fine. However, because of the way my particles are updated, their draw order varies from frame to frame, and this is what causes issues: in one frame particle A is drawn before particle B, and in the next frame particle B gets drawn before particle A. This causes quite noticeable flickering when particles overlap. The problem wouldn't be very apparent if the particles all used the same single colour, but as this is just a test to ensure I'll get proper results with multiple colours, each of my particles is blended with a random colour.

Fraps is still refusing to record anything besides a black screen with its FPS watermark on top, so unfortunately I cannot produce a video of the issue either. I guess I could upload an executable if you like?

Edit: my blend states are

SrcBlend       = D3D11_BLEND_SRC_ALPHA
DestBlend      = D3D11_BLEND_INV_SRC_ALPHA
SrcBlendAlpha  = D3D11_BLEND_ONE
DestBlendAlpha = D3D11_BLEND_ZERO

by the way, in case that affects anything.

Edited by Husbjörn

##### Share on other sites

> Hmm... so I did a quick test of implementing a recursive mergesort knockoff in a compute shader, performing a separate dispatch call for each split. [...] Since this must doubtlessly be a rather common problem to solve, I was wondering if anybody might point out something obvious I've overlooked [...]

Hi man! I'm having the same problem that you had. How did you calculate the SortCountData values to pass each frame? Thanks in advance for your time.
