peculiar request

This is a long shot, because I strongly believe what I'm attempting is not even possible...

But does anyone happen to know of a way, even a "hackish" one, to restrict the maximum number of output samples/fragments/pixels from a single draw call? Specifically, to cap it at just 1 pixel, with the rest somehow discarded...

I'm using this for a visibility test: I'm rendering octree bounding boxes and only need to know whether they're visible. Obviously I wrap the draw in an occlusion query... but, again, I only need to know if even *one* pixel passes. All the pixels rendered after the first passing pixel are superfluous and negatively affect performance.

Using GL_ANY_SAMPLES_PASSED may allow your driver to perform the more efficient test you're looking for, but this is really just a hint rather than explicitly specified behaviour. The same is even more true of GL_ANY_SAMPLES_PASSED_CONSERVATIVE. I'm not aware of any explicit way of requesting this behaviour, and I suspect that even if one did exist, it might not be possible for some implementations without bypassing optimizations elsewhere; i.e. it may turn out to be slower than just doing the full thing. I've no evidence for that, just a hunch.
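For reference, a minimal untested sketch of issuing such a boolean query; drawBoundingBox() is a stand-in for whatever draws one node's box:

```c
GLuint query;
glGenQueries(1, &query);

/* Typical occlusion-test state: no color or depth writes. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);

glBeginQuery(GL_ANY_SAMPLES_PASSED, query);
drawBoundingBox();   /* stand-in for the node's box draw */
glEndQuery(GL_ANY_SAMPLES_PASSED);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

/* Later - ideally a frame or more later - fetch the boolean result. */
GLuint anyVisible = GL_FALSE;
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &anyVisible);
```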

I would strongly advise against this, unless the driver offers this as an option.

If this is purely for visibility testing of octree nodes, then every fragment should only read a single depth value and not read or write any colors.
You could introduce a global "has any fragment already passed the depth test" flag (and I'm pretty sure this is possible in OpenGL 4 / DX 11) and have the fragment program discard immediately if that flag has already been set.
However, you would then add another read (doubling the number of reads per fragment) and force an actual fragment program to run for every fragment. I don't know if this still holds, but nVidia cards used to render twice as fast when no fragment program was needed.
Also, I believe discarding kills your Hi-Z, so you would most likely increase the number of rasterized fragments rather than decrease it.

Are you sure that rendering a couple of boxes without any fragment programs is actually your bottleneck, and that you are not mistaking the latency for the actual rendering speed?

Are you sure ... that you are not mistaking the latency for the actual rendering speed?

This point is key; I'm assuming that this question is related to the OP's other question here. The term "rudimentary occlusion culling" in that question leads me to suspect that the main cause of the performance differential is the OP fetching the query results in the same frame as the queries are executed. This is virtually guaranteed to stall the pipeline, as in most normal cases the results won't actually be available until a frame or two later. Trying to fetch them in the same frame forces all pending GL operations to flush immediately, and everything stalls until they've completed: in other words, it completely breaks CPU/GPU asynchronous processing.

A better approach is to test whether the results are ready yet (using GL_QUERY_RESULT_AVAILABLE) and, if not, use the last-known-good result. If there is no last-known-good result, assume the object is visible. This is of course a more complex and more conservative approach that will occasionally draw some things that aren't actually visible, but it's better than introducing pipeline stalls.
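Sketched out, assuming some hypothetical per-node bookkeeping (the 'node' struct and its 'query', 'lastVisible' and 'hasResult' fields are placeholders, not a real API):

```c
GLuint available = GL_FALSE;
glGetQueryObjectuiv(node->query, GL_QUERY_RESULT_AVAILABLE, &available);

if (available) {
    GLuint anyPassed = GL_FALSE;
    glGetQueryObjectuiv(node->query, GL_QUERY_RESULT, &anyPassed);
    node->lastVisible = (anyPassed != GL_FALSE);
    node->hasResult   = 1;
}
/* else: fall back to node->lastVisible from an earlier frame. */

/* Never had any result at all? Conservatively assume visible. */
int visible = node->hasResult ? node->lastVisible : 1;
```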

If you still insist on your idea (despite everyone recommending against it), you could use an atomic counter (initialized to 0) and increment it when the first fragment shader invocation passes. Each invocation would test the counter at the start and, if it has already been set to 1, discard the fragment. But it's an ugly hack.
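Roughly, the fragment program for that hack might look like this (a GLSL 4.20 sketch; the counter's backing buffer must be zeroed by the application before each occlusion-test draw):

```glsl
#version 420

// Atomic counter backed by a GL_ATOMIC_COUNTER_BUFFER, which the
// application must reset to zero before each occlusion-test draw.
layout(binding = 0) uniform atomic_uint firstFragmentPassed;

void main()
{
    // If an earlier fragment already got through, bail out immediately.
    if (atomicCounter(firstFragmentPassed) > 0u)
        discard;

    // Record that at least one fragment has now passed.
    atomicCounterIncrement(firstFragmentPassed);
}
```

Note that this is inherently racy - many fragments in flight can all read zero before any increment becomes visible - which is tolerable here, since the goal is only to cut down work; and, as Ohforf noted, the discard itself may defeat early-Z.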


Using GL_ANY_SAMPLES_PASSED may allow your driver to perform the more efficient test you're looking for

This isn't a solution for what I'm requesting. This will simply draw or not draw the entire thing based on a *previous* draw result. I need to not draw specific fragments based on a single fragment passing in *the same* draw call.


Are you sure, that rendering a couple of boxes without any fragment programs is actually your bottleneck

I'm not 100% sure, but I'm *pretty* sure it at least contributes to the performance loss. The bounding boxes take up the entire screen since they are recursively nested within each other, so it ends up being something like 4x the screen resolution in processed pixels. I've tested with fewer pixels processed in this manner and it does indeed affect performance.


Trying to fetch them in the same frame

I don't. But I don't use GL_QUERY_RESULT_AVAILABLE either, because the occurrence of a result not being available is so minuscule that it doesn't really affect performance. Instead I just grab the query result from the last frame - which, as you suggest, is typically already available - right before the begin/end of the next query for the current frame.
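i.e. something like this (a sketch; the 'firstFrame' flag and drawBoundingBox() are assumed bookkeeping, not real API):

```c
GLuint samplesPassed = 1; /* assume visible until a result says otherwise */

if (!firstFrame) {
    /* Last frame's result - by now it is almost always available. */
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samplesPassed);
}

glBeginQuery(GL_SAMPLES_PASSED, query);
drawBoundingBox(); /* stand-in for the node's box draw */
glEndQuery(GL_SAMPLES_PASSED);
```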


you could use an atomic counter

I've considered this, and it's probably what Ohforf was suggesting... But atomic counters are only in GL 4.2+, so the compatibility story isn't great. I'd like to avoid using anything that's restricted to extremely modern versions only. :)

Thanks all.

Using GL_ANY_SAMPLES_PASSED may allow your driver to perform the more efficient test you're looking for

This isn't a solution for what I'm requesting. This will simply draw or not draw the entire thing based on a *previous* draw result. I need to not draw specific fragments based on a single fragment passing in *the same* draw call.
The only difference between samples-passed and any-samples-passed is that the former returns an integer count of how many pixels passed the depth test, while the latter returns a Boolean indicating whether any pixels passed the depth test (basically, returning 'counter > 0').

If the GPU is capable of short-circuiting a draw call as you're requesting, then using the 'any' version of the query is a hint to the driver that it should go ahead and perform this short-circuit optimization.

The any-conservative query is the same, but tells the driver that it's allowed to perform the test against the Hi-Z buffer instead of the Z buffer, which will be quicker but less accurate (it may return true when the ground-truth answer is false).
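Concretely, only the meaning of the fetched value differs (countQuery and anyQuery below are hypothetical, already-completed query objects; GL_SAMPLES_PASSED is core since GL 1.5, GL_ANY_SAMPLES_PASSED since 3.3, and the conservative variant since 4.3):

```c
/* GL_SAMPLES_PASSED: an integer count of samples that passed. */
GLuint count = 0;
glGetQueryObjectuiv(countQuery, GL_QUERY_RESULT, &count);

/* GL_ANY_SAMPLES_PASSED (and _CONSERVATIVE): a boolean, i.e. count > 0. */
GLuint any = GL_FALSE;
glGetQueryObjectuiv(anyQuery, GL_QUERY_RESULT, &any);
```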

So, mhagain has answered the original question perfectly ;-D

The other solutions, implementing an atomic test in the fragment shader, will always be slower, because in a typical occlusion-querying situation the fragment shader does absolutely zero work anyway.

FWIW though, in my experience, GPU occlusion queries are a terrible solution for occlusion culling if you're after performance. I'd personally still recommend CPU-based solutions...

Oh, I wasn't aware that simply using _ANY_ could cause such an early-out... I tend to avoid it, though, because it doesn't seem to work on one of my linux/intel legacy laptops, whereas regular _SAMPLES_ works. Furthermore, I just tested by swapping in _ANY_ and I get the same framerate :] Also, even if _ANY_ does early-exit, it will only be for the query test... the other pixels will still get processed, because if you have the color mask enabled, you obviously still want to SEE them.
