GS output data that's not needed by the PS - where's it tossed?

7 comments, last by tonemgub 9 years ago

This may be applicable to non-DirectX pipelines, but I'm working with HLSL and DirectX 11. And this is a "curiosity" question.

I have a stream-out (geometry) shader which, if the vertex passes a depth texture sampling test (i.e., it's visible), outputs position, color and vertex ID (the ID is passed in from the vertex shader). The SO buffer gets just the vertex id (forming a "list" of visible vertices). The pixel shader input is just the position and color.

As I only need to determine the list of visible vertices occasionally, I create a "regular" GS from the same file, and bind that shader (rather than the SO shader and SO buffer target) most of the time.

When the "regular" GS is bound, the vertex id isn't used later in the pipeline. What looks like a very small overhead for using the "regular" GS is the pass-through of the vertex ID, so (for now) I'm just curious:

Where in the pipeline downstream of the GS is that extraneous data (the vertex ID) no longer "passed" along? Anyone know?

The reason I'm asking is to know, in future, what penalty (if any) there may be in streaming larger amounts of data (perhaps for debugging?) from the GS that's not used when the SO shader and SO targets aren't bound.

For the curious - a more detailed description of the data flow -

A pointlist of {Position, Color} from the vertex buffer to the VS (which also has SV_VertexID in the signature).

The VS outputs (and GS inputs) { Position, Color, vertex id }

In the GS, if the vertex position passes a depth sample (i.e., it's visible), the GS creates a billboarded quad (~4x4 pixels) to display just the vertex, and outputs 4 structs of { Position, Color, vertex id } as a TriangleStream. If the SO shader is bound (rather than the "regular" GS), the vertex id is streamed to the SO buffer (which gets processed later.)

>>> [ Where in this part of the pipeline is the vertex id no longer passed? ] <<<

The pixel shader input is { Position, Color } and simply outputs the color.
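For readers who want to picture the data flow described above, here is a minimal HLSL sketch of it. It is not the actual code from this project: every name in it (`gDepthTex`, `gPointSampler`, `gViewProj`, `gQuadHalfSize`, `gDepthBias`, the `VERTEXID` semantic, and the entry points) is an assumption made for illustration, as is the exact form of the depth test.

```hlsl
// Hypothetical sketch of the described pipeline; all names are assumptions.
Texture2D<float> gDepthTex     : register(t0); // assumed: previously rendered scene depth
SamplerState     gPointSampler : register(s0); // assumed: point sampler

cbuffer PerFrame : register(b0)
{
    float4x4 gViewProj;      // assumed world-to-clip transform
    float2   gQuadHalfSize;  // assumed half size of the quad, in NDC units
    float    gDepthBias;     // assumed tolerance for the visibility test
};

struct VSIn  { float3 pos : POSITION;    float4 color : COLOR; uint id : SV_VertexID; };
struct VSOut { float4 pos : SV_Position; float4 color : COLOR; uint id : VERTEXID; };
struct GSOut { float4 pos : SV_Position; float4 color : COLOR; uint id : VERTEXID; };
struct PSIn  { float4 pos : SV_Position; float4 color : COLOR; }; // no vertex ID

VSOut VSMain(VSIn v)
{
    VSOut o;
    o.pos   = mul(float4(v.pos, 1.0f), gViewProj);
    o.color = v.color;
    o.id    = v.id;          // pass the system-generated ID downstream
    return o;
}

[maxvertexcount(4)]
void GSMain(point VSOut input[1], inout TriangleStream<GSOut> stream)
{
    float4 clip = input[0].pos;
    if (clip.w <= 0.0f) return;                  // behind the camera: emit nothing

    float3 ndc = clip.xyz / clip.w;
    float2 uv  = float2(ndc.x * 0.5f + 0.5f, 0.5f - ndc.y * 0.5f);

    // A GS has no derivatives, so sample an explicit mip level.
    float sceneDepth = gDepthTex.SampleLevel(gPointSampler, uv, 0);
    if (ndc.z > sceneDepth + gDepthBias) return; // occluded: emit nothing

    static const float2 corners[4] =
    {
        float2(-1.0f, -1.0f), float2(-1.0f, 1.0f),
        float2( 1.0f, -1.0f), float2( 1.0f, 1.0f)
    };

    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        GSOut o;
        o.pos   = float4(ndc.xy + corners[i] * gQuadHalfSize, ndc.z, 1.0f);
        o.color = input[0].color;
        o.id    = input[0].id;   // only consumed when stream-out is bound
        stream.Append(o);
    }
}

float4 PSMain(PSIn p) : SV_Target
{
    return p.color;              // the vertex ID is never read here
}
```

In this sketch the pixel shader's input signature is simply the first two elements of the GS output signature, so the trailing `VERTEXID` element is the "extraneous" data the question is about.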

Please don't PM me with questions. Post them in the forums for everyone's benefit, and I can embarrass myself publicly.

You don't forget how to play when you grow old; you grow old when you forget how to play.


Where in the pipeline downstream of the GS is that extraneous data (the vertex ID) no longer "passed" along? Anyone know?

The pipeline doesn't toss any data by itself. You're doing that yourself, simply by not using the extra data in the downstream shaders.

This is because the pipeline does not really know what inputs and outputs a shader actually takes, so it cannot discard any outputs/inputs by itself. Those kinds of checks are usually done in the shader compiler, or during reflection, and the compiler only checks as many inputs and outputs as it needs to compile the current shader - it doesn't do a complete check of all pipeline shaders (only the FX framework did that, IIRC). It's always up to you to make sure that the shader inputs and outputs match properly.

There is no penalty to not using all inputs in a shader.


Where in the pipeline downstream of the GS is that extraneous data (the vertex ID) no longer "passed" along? Anyone know?

There is no penalty to not using all inputs in a shader.

I would count output writing bandwidth as one.

P.S. In this case that might not be an issue.

I appreciate the responses. My apologies. I should have phrased my question better. My curiosity is more about unused outputs, rather than unused inputs, and where in the pipeline those output registers are ignored/not-copied. While googling in the process of trying to rephrase my question, and doing a bit of experimentation, I think the answer to where unused output (in this case the GS) is no longer "passed along" is "The Rasterizer Stage."

I.e., from what I can determine, the rasterizer examines the input signature of the currently bound pixel shader to determine what data (by semantics) is required, and examines the output signature of the previous stage (in this case, the GS) to determine where that data can be found. At this point (I assume), if some of the data from the last stage isn't needed by the PS, it's simply not copied - i.e., not "passed along."

IF that's correct, then it appears the penalty (as kalle_h mentions) is only in the GS - (perhaps manipulating data and) loading output registers which aren't read later.

Does that sound right?


Stop thinking in terms of registers (which is an abstraction that no longer applies to modern GPUs) and start thinking in terms of memory.

The GS will store data you will later not be using, so you're wasting memory and bandwidth.

When the PS loads the data, it is larger and thus it won't fit in the cache as tightly as possible.

Whether any of this is an issue depends on whether you're ALU, bandwidth or cache pollution bottlenecked.


so you're wasting memory and bandwidth. When the PS loads the data, it is larger and thus it won't fit in the cache as tightly as possible.

That makes sense. I appreciate the explanation.


Jeez, people. :)

You're only really wasting memory if you're not using it. But as OP said, he IS using it - all of the inputs & outputs ARE used, just by different pixel shaders. So this is actually a memory use optimization, because having two separate copies of the same input (but one with the extra data) just for the sake of respecting input/output shader layouts would take up even more than double the memory he is currently using. However, memory use is only a concern for data passed to the first stage of the pipeline - the input assembler. For the data passed between shaders, which is what OP is concerned about, it is a non-issue.

However, the performance loss in this case is not because of the additional outputs from the geometry shader. It is actually because of the extra work that the geometry shader is doing to generate the outputs that might not be used later on by the pixel shader (or whatever shader comes next in the pipeline). In this case, it is better to have two or more different versions of the geometry shader with outputs that match the pixel shader inputs - one geometry shader for each unique set of pixel shader inputs. This is usually achieved by using #ifdefs inside the HLSL... Though it's not really a performance loss, because the geometry shader will always do the same work and output the same data, no matter which of those outputs the pixel shader is using, so the performance of the geometry shader will be constant. It is more like a missed chance at optimization, and you usually want to optimize your shaders as much as possible.
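As a rough illustration of the #ifdef approach, the sketch below keeps one source file but drops the vertex ID from the GS output unless a macro is defined at compile time. The macro name `STREAM_OUT`, the `VERTEXID` semantic, the entry point name, and the fixed quad size are all assumptions for the example, and the visibility test is left out to keep it short.

```hlsl
// Hypothetical sketch: STREAM_OUT is an assumed macro name, defined only
// when compiling the stream-out variant of the geometry shader.
struct GSIn
{
    float4 pos   : SV_Position;
    float4 color : COLOR;
    uint   id    : VERTEXID;   // assumed semantic name for the vertex ID
};

struct GSOut
{
    float4 pos   : SV_Position;
    float4 color : COLOR;
#ifdef STREAM_OUT
    uint   id    : VERTEXID;   // emitted only in the stream-out build
#endif
};

static const float2 kCorners[4] =
{
    float2(-1.0f, -1.0f), float2(-1.0f, 1.0f),
    float2( 1.0f, -1.0f), float2( 1.0f, 1.0f)
};
static const float2 kHalfSize = float2(0.005f, 0.005f); // assumed size in NDC units

[maxvertexcount(4)]
void GSMain(point GSIn input[1], inout TriangleStream<GSOut> stream)
{
    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        GSOut o;
        o.pos     = input[0].pos;
        o.pos.xy += kCorners[i] * kHalfSize * o.pos.w; // NDC offset applied in clip space
        o.color   = input[0].color;
#ifdef STREAM_OUT
        o.id      = input[0].id;                       // written only when it will be streamed out
#endif
        stream.Append(o);
    }
}
```

One way to build the two variants would be to compile the file twice, once with the macro and once without, for example with `fxc /T gs_5_0 /E GSMain /D STREAM_OUT=1 ...` for the stream-out build, or by passing a `D3D_SHADER_MACRO` array to `D3DCompile`. The file names and exact build setup are left out here because they depend on the project.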

Sorry for beating around the bush - I hope I cleared things up a bit.

And AFAIK, the geometry shader uses Output registers as output. What's this about a memory cache?


it is better to have two or more different versions of the geometry shader with outputs that match the pixel shader inputs - one geometry shader for each unique set of pixel shader inputs.

I'm afraid I may have confused the discussion with my post #4. That was intended only as an illustration. The original configuration is two geometry shaders, one a GS without streamout, one with streamout. For convenience (i.e., laziness), both are created from the same bytecode, so both have the same output signature - three output values. There is just one pixel shader, and it inputs 2 values from the GS (whichever is currently bound). The third output from the GS stage is used for streamout when the SO GS is bound. The third output is not used at all when the GS-without-streamout is bound. For the latter case, I was curious what effect, if any, emitting an unused value has.

The optimized approach would be to compile twice (once for the non-streamout config - emit 2 values, once for the SO config - emit 3 values), and create the shaders from separate sets of bytecode. I'm basically lazy and dislike modifying code that works, so, if I were to optimize in that way, I wanted to have a good idea why I'd be doing it.

So, really, I was just curious. :)


Stream Output Stage

Oooh, that explains it. :) I was too lazy to figure out what you meant by "stream-out", sorry. Matias is right.
