Depth pre-pass: does OpenGL still execute fragment shader if depth not written and color mask is GL_NONE?

Started by
12 comments, last by Ohforf sake 10 years, 1 month ago
Instead of answering the question (because I don't know), I'll just suggest you use `glDrawBuffer(GL_NONE)` to explicitly tell it not to draw to a color buffer. I don't use a fragment shader with this setting, and everything runs great.

glStencilFunc(GL_NEVER,0,0);

glEnable(GL_STENCIL_TEST);

As Hodge man said, if you write gl_FragDepth, the pixel function has to run before the depth test, so it runs even when zfail occurs. If you don't, rendering a z pre-pass still means writing to the depth buffer, so the pixel function will run wherever z passes. If your pixel function contains expensive color computations that you don't want to run, use another program with a nearly empty pixel function; I don't see a big problem with binding one extra program per frame to generate the z pre-pass. I guess the driver could skip the pixel function entirely only when the color mask, depth mask and stencil mask are all false, but that would mean the draw call has no effect at all, which is not the case for a z pre-pass that writes depth.
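The two-pass setup discussed here can be sketched roughly as follows. This is only an illustration, not anyone's actual renderer: it assumes a current OpenGL context, a Linux-style `<GL/gl.h>` header path, and a hypothetical `draw_scene` callback that binds whatever program is appropriate (for example a depth-only program in the first pass) and issues the draw calls. The function names are made up for this sketch.

```c
#include <GL/gl.h>

/* Hypothetical hook: binds a program and draws all opaque geometry. */
typedef void (*draw_scene_fn)(void);

/* Pass 1: write depth only. Color writes are masked off;
 * glDrawBuffer(GL_NONE) is an alternative way to disable them. */
void depth_prepass(draw_scene_fn draw_scene)
{
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);                                 /* depth writes on  */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);  /* color writes off */
    draw_scene();
}

/* Pass 2: shade only fragments that match the depth laid down in pass 1. */
void shading_pass(draw_scene_fn draw_scene)
{
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);      /* color writes on  */
    glDepthMask(GL_FALSE);    /* depth buffer is already final */
    glDepthFunc(GL_LEQUAL);   /* assumes both passes produce identical depth values */
    draw_scene();
    glDepthMask(GL_TRUE);     /* restore, so a later glClear can clear depth */
}
```

Whether the driver skips the fragment stage in the first pass is still up to the implementation; the sketch only makes sure nothing but depth is written.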

Should I then have a version of my shader program that has only a vertex shader and no fragment shader? [...] The downside of doing a separate shader program for the depth pre-pass is that I'd have to do it for every shader program that has a different vertex shader.


I think you should be able to reuse the shaders for the shadow map generation, so you don't need an additional version (or rather, you need it anyways for the shadows).

As to whether or not you should use separate versions in the first place: I don't have actual numbers, but a vertex program reads a lot of attributes (texture coordinates, normals, tangents, ...) that you don't need for shadow maps or a z pre-pass. The driver might be able to detect that the fragment program can be disabled and that it doesn't need the interpolants for those attributes, but my guess is that it won't recompile the vertex program to strip out all the unnecessary attribute reading and (possibly) transforming.
Also, most shaders only vary in the fragment part, so using a separate shader for the shadow map/z pre-pass should allow the renderer to issue fewer shader program changes or even merge entire draw calls.
So I would expect to see a speedup from separate shader versions, but again: I don't have any real numbers to prove it.

[...] Same result, and you don't even need to know the answer to your question, because it doesn't matter. It's either going to execute your 0 or 1 instruction pixel shader or not.


Actually, I seem to recall that a certain widely used GPU could render at twice its normal speed when fragment programs were explicitly disabled. On that GPU, the 0 or 1 instruction pixel shader would slow down rendering to half the possible speed.

That was NVIDIA GPUs, a long time ago (like... GeForce 8 or so, possibly even GeForce 6).

In general, the same thing is of course still true, but not as such a clear-cut 1:2 ratio. The time it takes to complete a frame is the time it takes to process the vertices and run other stages such as the geometry shader or tessellation, then shade fragments, and finally push the output through the ROPs.

While ALU throughput (that is, shader performance) has gone up tremendously, ROP throughput is more or less the same as it was 5 or 10 years ago, so this matters a lot more.

Depending on how the ROPs work on a particular card, it may make virtually no difference at all, or a huge difference. (ALUs are scalar nowadays, so processing 4 values isn't the same as processing 1 value any more; it might be similar for ROPs, or very different, I can't tell.) Imagine that a ROP can pass a 4-vector plus depth plus stencil (6 values total) during one clock cycle, or 6 depth-only values (also 6 values total). Or, imagine that it processes one output pixel with up to 16 (or any other number) values.

This topic is closed to new replies.
