peculiar request

This is a long shot, because I strongly believe what I'm attempting is not even possible...

But does anyone happen to know of a way, even a "hackish" one, to restrict the maximum number of output samples/fragments/pixels from a single draw call? Specifically, to cap it at just 1 pixel, with the rest somehow discarded...

I'm using this for a visibility test: I'm rendering octree bounding boxes and only need to know whether they're visible. Obviously I wrap the draw in an occlusion query... but, again, I only need to know if even *one* pixel passes. All the pixels rendered after the first passing pixel are superfluous and negatively affect performance.

Using GL_ANY_SAMPLES_PASSED may allow your driver to perform the more efficient test you're looking for, but this is really just a hint rather than explicitly specified behaviour. The same is even more true of GL_ANY_SAMPLES_PASSED_CONSERVATIVE. I'm not aware of any explicit way of requesting this behaviour, and I suspect that even if one did exist, it might not be possible for some implementations without bypassing optimizations elsewhere; i.e. it may turn out to be slower than just doing the full thing. I've no evidence for that, just a hunch.
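For reference, a minimal untested sketch of issuing such a boolean query; drawBoundingBox() is a stand-in for whatever draws one node's box:

```c
GLuint query;
glGenQueries(1, &query);

/* Typical occlusion-test state: no color or depth writes. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);

glBeginQuery(GL_ANY_SAMPLES_PASSED, query);
drawBoundingBox();   /* stand-in for the node's box draw */
glEndQuery(GL_ANY_SAMPLES_PASSED);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

/* Later - ideally a frame or more later - fetch the boolean result. */
GLuint anyVisible = GL_FALSE;
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &anyVisible);
```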

I would strongly advise against this, unless the driver offers this as an option.

If this is purely for visibility testing of octree nodes, then every fragment should only read a single depth value and not read or write any colors.
You could introduce a global "has any fragment already passed the depth test" flag (and I'm pretty sure this is possible in OpenGL 4 / DX 11) and have the fragment program discard immediately if that flag has already been set.
However, you would then add another read (doubling the number of reads per fragment) and force an actual fragment program to run for every fragment. I don't know if this still holds, but nVidia cards used to render twice as fast when no fragment program was needed.
Also, I believe discarding kills your Hi-Z, so you would most likely increase the number of rasterized fragments rather than decrease it.

Are you sure that rendering a couple of boxes without any fragment programs is actually your bottleneck, and that you are not mistaking the latency for the actual rendering speed?

Are you sure ... that you are not mistaking the latency for the actual rendering speed?

This point is key; I'm assuming that this question is related to the OP's other question here. The term "rudimentary occlusion culling" in that question leads me to suspect that the main cause of the performance differential is the OP fetching the query results in the same frame as the queries are executed. This is virtually guaranteed to stall the pipeline, as in most normal cases the results won't actually be available until a frame or two later. Trying to fetch them in the same frame forces all pending GL operations to flush immediately, and everything stalls until they've completed: in other words, it completely breaks CPU/GPU asynchronous processing.

A better approach is to test whether the results are ready yet (using GL_QUERY_RESULT_AVAILABLE) and, if not, use the last-known-good result. If there is no last-known-good result, assume the object is visible. This is of course a more complex and more conservative approach that will occasionally draw some things that aren't actually visible, but it's better than introducing pipeline stalls.
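Sketched out, assuming some hypothetical per-node bookkeeping (the 'node' struct and its 'query', 'lastVisible' and 'hasResult' fields are placeholders, not a real API):

```c
GLuint available = GL_FALSE;
glGetQueryObjectuiv(node->query, GL_QUERY_RESULT_AVAILABLE, &available);

if (available) {
    GLuint anyPassed = GL_FALSE;
    glGetQueryObjectuiv(node->query, GL_QUERY_RESULT, &anyPassed);
    node->lastVisible = (anyPassed != GL_FALSE);
    node->hasResult   = 1;
}
/* else: fall back to node->lastVisible from an earlier frame. */

/* Never had any result at all? Conservatively assume visible. */
int visible = node->hasResult ? node->lastVisible : 1;
```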

If you still insist on your idea (despite everyone recommending against it), you could use an atomic counter (initialized to 0) and increment it when the first fragment shader invocation passes. Each invocation would test the counter at the start and, if it has already been set to 1, discard the fragment. But it's an ugly hack.
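Roughly, the fragment program for that hack might look like this (a GLSL 4.20 sketch; the counter's backing buffer must be zeroed by the application before each occlusion-test draw):

```glsl
#version 420

// Atomic counter backed by a GL_ATOMIC_COUNTER_BUFFER, which the
// application must reset to zero before each occlusion-test draw.
layout(binding = 0) uniform atomic_uint firstFragmentPassed;

void main()
{
    // If an earlier fragment already got through, bail out immediately.
    if (atomicCounter(firstFragmentPassed) > 0u)
        discard;

    // Record that at least one fragment has now passed.
    atomicCounterIncrement(firstFragmentPassed);
}
```

Note that this is inherently racy - many fragments in flight can all read zero before any increment becomes visible - which is tolerable here, since the goal is only to cut down work; and, as Ohforf noted, the discard itself may defeat early-Z.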


Using GL_ANY_SAMPLES_PASSED may allow your driver to perform the more efficient test you're looking for

This isn't a solution for what I'm requesting. This will simply draw or not draw the entire thing based on a *previous* draw result. I need to not draw specific fragments based on a single fragment passing in *the same* draw call.


Are you sure, that rendering a couple of boxes without any fragment programs is actually your bottleneck

I'm not 100% sure, but I'm *pretty* sure it at least contributes to the performance loss. The bounding boxes take up the entire screen since they are recursively nested within each other, so it ends up being something like 4x the screen resolution in processed pixels. I've tested with fewer pixels processed in this manner and it does indeed affect performance.


Trying to fetch them in the same frame

I don't. But I don't use GL_QUERY_RESULT_AVAILABLE either, because the occurrence of a result not being available is so minuscule that it doesn't really affect performance. Instead I just grab the query result from the last frame - which, as you suggest, is typically already available - right before the begin/end of the next query for the current frame.
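i.e. something like this (a sketch; the 'firstFrame' flag and drawBoundingBox() are assumed bookkeeping, not real API):

```c
GLuint samplesPassed = 1; /* assume visible until a result says otherwise */

if (!firstFrame) {
    /* Last frame's result - by now it is almost always available. */
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samplesPassed);
}

glBeginQuery(GL_SAMPLES_PASSED, query);
drawBoundingBox(); /* stand-in for the node's box draw */
glEndQuery(GL_SAMPLES_PASSED);
```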


you could use an atomic counter

I've considered this, and it's probably what Ohforf was suggesting... But atomic counters are only in GL 4.2+, so the compatibility story isn't great. I'd like to avoid using anything that's restricted to extremely modern versions only. :)

Thanks all.

Using GL_ANY_SAMPLES_PASSED may allow your driver to perform the more efficient test you're looking for

This isn't a solution for what I'm requesting. This will simply draw or not draw the entire thing based on a *previous* draw result. I need to not draw specific fragments based on a single fragment passing in *the same* draw call.
The only difference between samples-passed and any-samples-passed is that the former returns an integer count of how many pixels passed the depth test, while the latter returns a Boolean indicating whether any pixels passed the depth test (basically, returning 'counter > 0').

If the GPU is capable of short-circuiting a draw call as you're requesting, then using the 'any' version of the query is a hint to the driver that it should go ahead and perform this short-circuit optimization.

The any-conservative query is the same, but tells the driver that it's allowed to perform the test against the Hi-Z buffer instead of the Z buffer, which will be quicker but less accurate (it may return true when the ground-truth answer is false).
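Concretely, only the meaning of the fetched value differs (countQuery and anyQuery below are hypothetical, already-completed query objects; GL_SAMPLES_PASSED is core since GL 1.5, GL_ANY_SAMPLES_PASSED since 3.3, and the conservative variant since 4.3):

```c
/* GL_SAMPLES_PASSED: an integer count of samples that passed. */
GLuint count = 0;
glGetQueryObjectuiv(countQuery, GL_QUERY_RESULT, &count);

/* GL_ANY_SAMPLES_PASSED (and _CONSERVATIVE): a boolean, i.e. count > 0. */
GLuint any = GL_FALSE;
glGetQueryObjectuiv(anyQuery, GL_QUERY_RESULT, &any);
```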

So, mhagain has answered the original question perfectly ;-D

The other solutions, implementing an atomic test in the fragment shader, will always be slower, because in a typical occlusion-querying situation the fragment shader does absolutely zero work anyway.

FWIW though, in my experience, GPU occlusion queries are a terrible solution for occlusion culling if you're after performance. I'd personally still recommend CPU-based solutions...

Oh, I wasn't aware that simply using _ANY_ could cause such an early-out... I tend to avoid it, though, because it doesn't seem to work on one of my linux/intel legacy laptops, whereas regular _SAMPLES_ works. Furthermore, I just tested by swapping in _ANY_ and I get the same framerate :] Also, even if _ANY_ does early-exit, it will only be for the query test... the other pixels will still get processed, because if you have the color mask enabled, you obviously still want to SEE them.
