
Posts I've Made

In Topic: Uber-shader approach

06 February 2016 - 12:23 PM

Mitigate the cost of glUseProgram() by sorting objects by shader, so that objects sharing a program are drawn consecutively without redundant glUseProgram() calls (render queues).
Branches inside shaders are almost free as long as all pixels in a block take the same path. Still, there is a small cost to branching, and it never makes sense to keep a branch if you have to swap shaders anyway (such that the branch is always taken in one shader and never in the other). For example, opaque and translucent objects should always be separate permutations, without a branch for "alpha or not".
A branch is costly if, within a block, some pixels take one path and others take the other. In that case the two paths are serialized, so effectively every pixel pays for both branches.
When to create a permutation is decided by balancing these costs.
L. Spiro

BTW, if one side of a branch (or even a chain of branches) requires high register usage, it will reduce the number of warps in flight, which can have an overall negative impact on performance (your super simple "sub-shader" may run with the same register allocation as your super complex one).

So as a rule of thumb, it is often better to group your complex shaders together in one uber shader and your simpler shaders in another - so slightly less uber ;)
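To make the render-queue point concrete, here is a minimal C++/OpenGL sketch. Everything in it (DrawItem, DrawQueue, sorting on the raw program handle) is an illustrative assumption; a real queue would sort on a packed key that also covers depth, textures, and other state.

#include <algorithm>
#include <vector>
#include <GL/gl.h> // or whatever loader/header your platform uses

// Hypothetical draw item: just enough state to sort on.
struct DrawItem {
    GLuint  program;    // shader program for this object
    GLuint  vao;        // geometry (indices bound in the VAO's element buffer)
    GLsizei indexCount;
};

// Sort by program so items sharing a shader are adjacent, then bind each
// program only when it actually changes.
void DrawQueue(std::vector<DrawItem>& queue) {
    std::sort(queue.begin(), queue.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.program < b.program;
              });

    GLuint boundProgram = 0; // 0 is never a valid program name, forces first bind
    for (const DrawItem& item : queue) {
        if (item.program != boundProgram) {
            glUseProgram(item.program);   // only on state change
            boundProgram = item.program;
        }
        glBindVertexArray(item.vao);
        glDrawElements(GL_TRIANGLES, item.indexCount, GL_UNSIGNED_INT, nullptr);
    }
}

The same pattern extends to any expensive state change: sort so equal state is adjacent, then bind only on change.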

In Topic: QuickSort algorithm

24 April 2013 - 01:30 AM

- Is it possible to get better performance? How?

In addition to Hodgman's implementation changes:
1) Choose a random pivot element rather than the first. This greatly reduces the probability of hitting O(n^2) behaviour on poorly distributed (semi-sorted to fully sorted) input data. You may also want to take the median of the first, middle, and last elements, or the median of a random subset (a trade-off between better pivots and more computation to find them).
2) Drop to a simpler sort when the data gets small enough, e.g. insertion sort at approx. 8-16 elements, to reduce overhead (see the sketch after this list). A less conservative variant (which IIRC std::sort uses) also switches to heap sort once a certain data size or recursion depth is reached. This bounds the stack depth, and heap sort suffers far less from its cache-unfriendly memory access patterns on small data sets than on large ones.
3) If you are using primitive types, use a SIMD sorting network when the data set for a particular recursion is small enough.
4) Separate your sorting keys from your data for cache and swap efficiency.
5) Sort multi-threaded.
6) Sort distributed.
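Here is a minimal C++ sketch of points 1 and 2, i.e. a random pivot plus an insertion-sort cutoff. The names and the cutoff of 16 are assumptions; a production version would also add the bounded-depth heap-sort fallback (introsort-style) and be templated on the element type.

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

namespace {

constexpr std::size_t kInsertionCutoff = 16; // point (2): small ranges
std::mt19937 g_rng{std::random_device{}()};

// Simple insertion sort over the inclusive range [lo, hi].
void InsertionSort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i <= hi; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

void QuickSortImpl(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi <= lo) return;
    if (hi - lo + 1 <= kInsertionCutoff) { InsertionSort(a, lo, hi); return; }

    // Point (1): random pivot defeats semi-sorted/adversarial inputs.
    std::uniform_int_distribution<std::size_t> dist(lo, hi);
    std::swap(a[lo], a[dist(g_rng)]);
    const int pivot = a[lo];

    // Sedgewick-style partition; a[lo] == pivot acts as a sentinel for j.
    std::size_t i = lo, j = hi + 1;
    for (;;) {
        do { ++i; } while (i < hi && a[i] < pivot);
        do { --j; } while (a[j] > pivot);
        if (i >= j) break;
        std::swap(a[i], a[j]);
    }
    std::swap(a[lo], a[j]); // pivot into its final slot j

    if (j > lo) QuickSortImpl(a, lo, j - 1);
    QuickSortImpl(a, j + 1, hi);
}

} // namespace

void QuickSort(std::vector<int>& a) {
    if (!a.empty()) QuickSortImpl(a, 0, a.size() - 1);
}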

In Topic: Octrees: Precomputed visibility and rendering approach

20 April 2013 - 11:36 AM

My idea to compute the PVS was to raycast from each cube corner to all the other cubes. The main problem with this is that it would be really expensive, given that each cube has 8 children. I was thinking of raycasting only to the "bigger" cubes (either by area or by polygon count); this way I could still skip rendering hidden geometry without taking ages to compute the visibility.


I think that you will find that you need to ray-cast from a lot more than the corners to prevent popping/false invisibility.  If you subdivide far enough this won't be an issue, but then you end up with an extremely dense octree that takes ages to compute (most of the computation is redundant, since you are mostly sampling the same lines again and again). For best results you will likely need to densely sample the full line space between your source and destination nodes.  This can be done by ray-casting from all over the surface of your source node to all over the surface of your destination node, stopping as soon as visibility is established.


This problem has been well researched.  You should check out the literature on CPU-based sampling approaches that attempt to minimize the number of rays cast.


You can also use the GPU to do the visibility determination for you, or compute it exactly.

Chapter 4 talks about using the GPU for sampling visibility.  The hardware at the time was dated; I expect that today you could avoid the readback by using occlusion queries or compute-shader processing of the z-buffer.  Chapters 5 and 6 talk about computing the exact visibility set, but the math is a bit hairy, and it is probably overkill for most use cases.
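For reference, here is a rough C++ sketch of the brute-force surface-to-surface sampling described above. SegmentOccluded stands in for your octree ray cast, and all the types and names are made up for illustration.

#include <random>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };

// Assumed to exist in your engine: true if the segment a->b is blocked
// by scene geometry (e.g. your octree ray cast).
bool SegmentOccluded(const Vec3& a, const Vec3& b);

// Uniform random point on the surface of an AABB: pick a face pair
// weighted by area, then a random point on that face.
Vec3 RandomSurfacePoint(const Aabb& b, std::mt19937& rng) {
    const float dx = b.max.x - b.min.x;
    const float dy = b.max.y - b.min.y;
    const float dz = b.max.z - b.min.z;
    std::discrete_distribution<int> face(
        {dy * dz, dy * dz, dx * dz, dx * dz, dx * dy, dx * dy});
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    Vec3 p{b.min.x + u(rng) * dx, b.min.y + u(rng) * dy, b.min.z + u(rng) * dz};
    switch (face(rng)) {
        case 0: p.x = b.min.x; break;  case 1: p.x = b.max.x; break;
        case 2: p.y = b.min.y; break;  case 3: p.y = b.max.y; break;
        case 4: p.z = b.min.z; break;  default: p.z = b.max.z; break;
    }
    return p;
}

// Monte-Carlo visibility: cast random surface-to-surface rays until one
// gets through or we give up.
bool NodesMutuallyVisible(const Aabb& from, const Aabb& to,
                          int maxRays, std::mt19937& rng) {
    for (int i = 0; i < maxRays; ++i) {
        if (!SegmentOccluded(RandomSurfacePoint(from, rng),
                             RandomSurfacePoint(to, rng)))
            return true;  // one unblocked ray proves visibility
    }
    return false;  // conservative guess: may be a false "hidden"
}

Note that a miss only proves nothing got through *this* batch of rays, so too few rays gives false "hidden" results (popping); reducing that error with fewer rays is exactly what the smarter sampling approaches mentioned above are for.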

In Topic: Octrees: Precomputed visibility and rendering approach

19 April 2013 - 01:40 AM

I suggest splitting your static data on node boundaries after you have your visibility graph, since this allows you to associate data with nodes of your tree partition. Then you just draw the static geometry associated with that node if the node is visible. If you have instanced geometry in a node, you can just conservatively make the whole object visible - just be sure to do a scoreboard/mailbox check so that you don't draw it once for each visible node it intersects. You may want to do the same for non-instanced geometry if your subdivision is very fine and results in a lot of splits (since this will result in more draw calls).
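A tiny sketch of the scoreboard/mailbox check, assuming a per-object frame stamp (the names here are made up):

#include <cstdint>
#include <vector>

// Hypothetical object record; an object can be referenced by every
// octree node it intersects.
struct SceneObject {
    std::uint32_t lastDrawnFrame = 0; // the "mailbox" stamp
    // ... mesh, transform, etc.
};

// Called once per visible node, for each object the node references.
// The frame stamp guarantees each object is queued at most once per
// frame, no matter how many visible nodes it straddles.
void EmitIfNotAlreadyDrawn(SceneObject& obj, std::uint32_t frame,
                           std::vector<SceneObject*>& drawList) {
    if (obj.lastDrawnFrame == frame) return; // already queued this frame
    obj.lastDrawnFrame = frame;
    drawList.push_back(&obj);
}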

For dynamic objects, you can easily (via a hierarchical test) figure out which nodes it intersects, and if it intersects a visible node, then it is treated as visible.

BTW, how do you intend to compute the PVS for the nodes? (Which approach/algorithm? It's a difficult problem.)

For best results, I suggest a general axially aligned BSP tree, since you ideally want to keep your leaf nodes cubic.
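As an aside on keeping leaf nodes cubic: with an axially aligned BSP you choose the split axis per node, so a simple heuristic (sketched below; types and names are made up) is to split perpendicular to the longest extent.

struct Vec3f { float x, y, z; };
struct Box   { Vec3f min, max; };

// Returns 0/1/2 for the x/y/z axis; splitting the longest extent at its
// midpoint keeps children as close to cubic as possible.
int ChooseSplitAxis(const Box& node) {
    const float dx = node.max.x - node.min.x;
    const float dy = node.max.y - node.min.y;
    const float dz = node.max.z - node.min.z;
    if (dx >= dy && dx >= dz) return 0;
    if (dy >= dz) return 1;
    return 2;
}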

In Topic: How does tan work exactly?

29 March 2013 - 06:58 PM

First you get some sun, then you get a tan :)

Tsk, tsk. Don't you know that puns about trigonometry are a sin? Why? 'Cos I said so!