
Member Since 27 Jul 2013
Offline Last Active Feb 15 2014 09:36 AM

Posts I've Made

In Topic: particles and depth testing

08 January 2014 - 04:23 AM

* Disabling depth testing means that the depth buffer doesn't have to be read or written, which slightly reduces your per-pixel bandwidth (saving a 24-bit read + write per pixel).
* Disabling depth writing means that the depth buffer doesn't have to be written, which reduces per-pixel bandwidth by even less (saving a 24-bit write per pixel).
* Enabling depth testing will also enable "early Z" / "HiZ" on modern GPUs (assuming that alpha testing or shader discard/clip statements aren't used -- those features kill early Z). This allows the depth test to occur before the pixel shader is executed (strictly speaking, the GPU is supposed to execute the pixel shader and then throw the results away if the depth test fails -- 'early Z' avoids that wasted work), so the pixel shader is skipped for pixels that don't have to be drawn. However, this is only an advantage when rendering opaque geometry from front-to-back -- it is of no use when rendering translucent geometry from back-to-front.
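To get a feel for the bandwidth numbers quoted above, here is a rough, purely illustrative calculation of what a 24-bit depth read and/or write per pixel adds up to per frame. The resolution is an assumption, and real GPUs pad and compress depth buffers, so treat these as order-of-magnitude figures only:

```python
# Rough per-frame depth-buffer bandwidth estimate.
# Assumes a plain 24-bit (3-byte) depth sample and a 1920x1080 target;
# real hardware typically pads to 32 bits and uses depth compression,
# so these numbers are only illustrative.

WIDTH, HEIGHT = 1920, 1080
BYTES_PER_DEPTH_SAMPLE = 3  # 24 bits

pixels = WIDTH * HEIGHT

# Depth test + depth write enabled: one read and one write per pixel.
read_write_bytes = pixels * BYTES_PER_DEPTH_SAMPLE * 2

# Depth test only (writes disabled): just the read remains.
read_only_bytes = pixels * BYTES_PER_DEPTH_SAMPLE

print(f"test + write: {read_write_bytes / 2**20:.1f} MiB per frame")
print(f"test only:    {read_only_bytes / 2**20:.1f} MiB per frame")
```

This also assumes every pixel is touched exactly once; with overdraw (as with layered particles) the cost scales with the number of overlapping layers.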


Oh nice! Thank you, that is exactly what I was waiting for.

In Topic: particles and depth testing

07 January 2014 - 05:22 PM

Thank you for your replies.


Where performance is concerned, you should be batching as many particles from a single emitter as you can into a single draw call. An updated bounding box around that entire set of particles improves culling and gives you reasonable back-to-front sorting capabilities. That way you sort close-together groups of particles rather than individual particles -- per-particle sorting is highly discouraged for real-time applications -- but the result is close enough.


I am already drawing all particles per emitter in a single draw call. I had never heard about using bounding boxes for particle sorting, but it sounds interesting.
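The group-level sort described above could be sketched like this. Everything here (the `Batch` type, the field names, the camera position) is made up for illustration; the point is just that each emitter's batch is sorted by the distance from the camera to its bounding-box center, not per particle:

```python
# Hypothetical sketch of group-level back-to-front sorting: each emitter's
# particle batch is ordered by the squared distance from the camera to its
# bounding-box center. Farther batches draw first so that nearer translucent
# particles blend over them correctly.

from dataclasses import dataclass

@dataclass
class Batch:
    name: str
    center: tuple  # bounding-box center (x, y, z)

def sq_dist(a, b):
    # Squared distance is enough for ordering; it avoids a sqrt per batch.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sort_back_to_front(batches, camera_pos):
    return sorted(batches, key=lambda b: sq_dist(b.center, camera_pos),
                  reverse=True)

camera = (0.0, 0.0, 0.0)
batches = [Batch("smoke", (0, 0, 5)),
           Batch("fire", (0, 0, 20)),
           Batch("dust", (0, 0, 1))]
ordered = sort_back_to_front(batches, camera)
print([b.name for b in ordered])  # prints ['fire', 'smoke', 'dust']
```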


Anyway, what I meant regarding performance is that if I have the particles sorted anyway for order-dependent blending like alpha blending, I might as well activate depth writing. This would not produce any artifacts, since the particles are drawn in the correct order. Yet I still wonder whether it has any advantage -- maybe some fragments get rejected in step 1 and don't have to be blended in step 2?

In Topic: Compute Shader: ThreadGroupSize and Performance

21 December 2013 - 10:21 PM

Thank you for your answer!


3. I couldn't say for sure without more context...


This is where I got that quote: http://cvg.ethz.ch/teaching/2011spring/gpgpu/GPU-Optimization.pdf (page 15). There isn't much more info there, though.


With my current implementation of my particle system, I have noticed that performance drops if I increase my thread group size from 64 to 128 on my NVIDIA card (with 1 million particles active -> 1 million threads). And I am not using shared memory. All I do is consume() a particle from one buffer, process it, and append() it to the other buffer. Those should be atomic operations. So there must be another reason why bigger thread group sizes might be bad...


Also, I would like to write a few words in my bachelor thesis about why it is critical to use the correct number of threads per thread group. For that I need some reason why it might be bad to have too many threads per thread group, so any theoretical reason would help. (I could not find anything on the web so far.)


While we are at it... there is this GTC presentation: http://www.nvidia.com/content/GTC/documents/1015_GTC09.pdf. On page 44 it says something about thread group size heuristics.



# of thread groups > # of multiprocessors


I guess this is only true if you actually have enough work to do. So if you only need one thread group with a size of 512, you might want to lower the group size to 64 or even 32 and dispatch more groups. But it is not advisable to launch extra thread groups when you only need 32 threads and your group size is already 32, just to keep the other multiprocessors occupied. Am I correct? (Just asking because you have to be super precise when writing papers...)



Amount of allocated shared memory per thread group should be at most half the total shared memory per multiprocessor


Why is that? So that the multiprocessor can already load data for the next thread group to be processed? 



Occupancy is:

- a metric to determine how effectively the hardware is kept busy
- the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps


So this means having more warps queued and ready to execute while the processor is still working on other warps, so that when one warp stalls on a latency (e.g. a memory access), it can be switched out for a warp from the queue?
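The quoted definition of occupancy can be turned into a toy calculator. All the hardware limits below (32 threads per warp, 48 resident warps per multiprocessor, 8 resident groups, 48 KiB of shared memory) are assumed, Fermi-like numbers, not taken from the thread; the point is only to show how group size and shared-memory use interact to limit active warps:

```python
# Toy occupancy calculator matching the quoted definition: the ratio of
# active warps per multiprocessor to the maximum possible active warps.
# All hardware limits below are hypothetical, Fermi-like values.

WARP_SIZE = 32
MAX_WARPS_PER_SM = 48
MAX_GROUPS_PER_SM = 8
SHARED_MEM_PER_SM = 48 * 1024  # bytes

def occupancy(group_size, shared_mem_per_group=0):
    warps_per_group = -(-group_size // WARP_SIZE)  # ceiling division

    # How many groups fit on one SM, per limiting resource:
    by_warps = MAX_WARPS_PER_SM // warps_per_group
    by_groups = MAX_GROUPS_PER_SM
    by_smem = (SHARED_MEM_PER_SM // shared_mem_per_group
               if shared_mem_per_group else by_groups)

    groups = min(by_warps, by_groups, by_smem)
    active_warps = groups * warps_per_group
    return active_warps / MAX_WARPS_PER_SM

print(occupancy(64))   # 8 groups x 2 warps = 16/48 warps active
print(occupancy(128))  # 8 groups x 4 warps = 32/48 warps active
print(occupancy(512))  # 3 groups x 16 warps = 48/48 warps active
```

Note that under these assumed limits a larger group size can actually raise occupancy, so occupancy alone would not explain the 64-vs-128 slowdown above; other factors (register pressure, scheduling granularity) also matter.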

In Topic: Billboard WVP + rotation problem

27 October 2013 - 09:55 AM

A friend of mine found the solution.


Apparently I corrupted the w value of the vector.


And the second problem is actually no problem. 


Anyways thank you!

In Topic: Billboard WVP + rotation problem

27 October 2013 - 05:12 AM

To visualize this I made a simple test scene and a video:



This was done with the settings mentioned above. As you can see there are actually two problems:


1. The depth buffer problem

2. The billboards aren't rotated accordingly. The arrow should point to the right when I rotate the placement.