Buckeye

GS output data that's not needed by the PS - where's it tossed?


This may be applicable to non-DirectX pipelines, but I'm working with HLSL and DirectX 11. And this is a "curiosity" question.

 

I have a stream-out (geometry) shader which, if the vertex passes a depth texture sampling test (i.e., it's visible), outputs position, color and vertex ID (the ID is passed in from the vertex shader). The SO buffer gets just the vertex id (forming a "list" of visible vertices). The pixel shader input is just the position and color.

 

As I only need to determine the list of visible vertices occasionally, I create a "regular" GS from the same file, and bind that shader (rather than the SO shader and SO buffer target) most of the time.

 

When the "regular" GS is bound, the vertex id isn't used later in the pipeline.  What looks like a very small overhead for using the "regular" GS is the pass-through of the vertex ID, so (for now) I'm just curious:

 

Where in the pipeline downstream of the GS is that extraneous data (the vertex ID) no longer "passed" along? Anyone know?

 

The reason I'm asking is to know, in future, what penalty (if any) there may be in streaming larger amounts of data (perhaps for debugging?) from the GS that's not used when the SO shader and SO targets aren't bound.

 

For the curious - a more detailed description of the data flow -

 

A pointlist of {Position, Color} from the vertex buffer to the VS (which also has SV_VertexID in the signature).

 

The VS outputs (and GS inputs) { Position, Color, vertex id }

 

In the GS, if the vertex position passes a depth sample (i.e., it's visible), the GS creates a billboarded quad (~4x4 pixels) to display just the vertex, and outputs 4 structs of { Position, Color, vertex id } as a TriangleStream. If the SO shader is bound (rather than the "regular" GS), the vertex id is streamed to the SO buffer (which gets processed later.)

 

>>> [ Where in this part of the pipeline is the vertex id no longer passed? ] <<<

 

The pixel shader input is { Position, Color } and simply outputs the color.
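For readers who want to see that data flow as code, here is a rough HLSL sketch of it. It is not the actual shader: every name in it (the VERTEXID semantic, gDepthTex, gPointSampler, the constant buffer and its members) is made up for illustration, and the depth test, the matrix convention and the quad sizing are assumptions. The point is only the signatures: the GS emits three values, the PS reads two.

```hlsl
// Sketch only. All names are hypothetical; only the data layout follows the post.
Texture2D<float> gDepthTex     : register(t0); // assumed scene depth texture
SamplerState     gPointSampler : register(s0); // assumed point sampler

cbuffer PerFrame : register(b0)
{
    float4x4 gViewProj;    // assumed combined view-projection matrix
    float2   gHalfSizeNdc; // assumed half-size of the quad in NDC units
    float    gDepthBias;   // assumed tolerance for the depth comparison
};

struct VSIn  { float3 pos : POSITION;    float4 color : COLOR0; uint id : SV_VertexID; };
struct VSOut { float4 pos : SV_Position; float4 color : COLOR0; uint id : VERTEXID; };
struct GSOut { float4 pos : SV_Position; float4 color : COLOR0; uint id : VERTEXID; };
struct PSIn  { float4 pos : SV_Position; float4 color : COLOR0; }; // no id here

VSOut VSMain(VSIn input)
{
    VSOut o;
    o.pos   = mul(float4(input.pos, 1.0f), gViewProj); // row-vector convention assumed
    o.color = input.color;
    o.id    = input.id; // pass the system-generated vertex ID downstream
    return o;
}

[maxvertexcount(4)]
void GSMain(point VSOut input[1], inout TriangleStream<GSOut> stream)
{
    float4 clip = input[0].pos;
    if (clip.w <= 0.0f)
        return; // behind the camera: emit nothing

    float3 ndc = clip.xyz / clip.w;
    float2 uv  = float2(ndc.x * 0.5f + 0.5f, 0.5f - ndc.y * 0.5f);

    // SampleLevel is used because implicit derivatives are not available in a GS.
    float sceneDepth = gDepthTex.SampleLevel(gPointSampler, uv, 0);
    if (ndc.z > sceneDepth + gDepthBias)
        return; // occluded: emit nothing

    static const float2 corners[4] =
    {
        float2(-1.0f, -1.0f), float2(-1.0f, 1.0f),
        float2( 1.0f, -1.0f), float2( 1.0f, 1.0f)
    };

    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        GSOut o;
        o.pos   = float4(ndc.xy + corners[i] * gHalfSizeNdc, ndc.z, 1.0f);
        o.color = input[0].color;
        o.id    = input[0].id; // written in both configurations, read only by stream-out
        stream.Append(o);
    }
}

float4 PSMain(PSIn input) : SV_Target
{
    return input.color;
}
```

Which GS outputs go to the SO buffer is not decided in the HLSL. It comes from the stream-output declaration entries given to CreateGeometryShaderWithStreamOutput on the C++ side, which in this sketch would name only the VERTEXID element.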

Edited by Buckeye


Where in the pipeline downstream of the GS is that extraneous data (the vertex ID) no longer "passed" along? Anyone know?

 

The pipeline doesn't toss any data by itself. You're doing that yourself, simply by not using the extra data in the downstream shaders.

 

This is because the pipeline does not really know which inputs and outputs a shader actually takes, so it cannot discard any outputs/inputs by itself. Those kinds of checks are usually done in the shader compiler, or during reflection, and the compiler only checks as many inputs and outputs as it needs to compile the current shader - it doesn't do a complete check of all pipeline shaders (only the FX framework did that, IIRC). It's always up to you to make sure that the shader inputs & outputs match properly.

 

There is no penalty to not using all inputs in a shader.

Edited by tonemgub


 


Where in the pipeline downstream of the GS is that extraneous data (the vertex ID) no longer "passed" along? Anyone know?

 

There is no penalty to not using all inputs in a shader.

 

I would count output writing bandwidth as one.

 

P.S. In this case, that might not be an issue.

Edited by kalle_h


I appreciate the responses. My apologies. I should have phrased my question better. My curiosity is more about unused outputs, rather than unused inputs, and where in the pipeline those output registers are ignored/not-copied. While googling in the process of trying to rephrase my question, and doing a bit of experimentation, I think the answer to where unused output (in this case the GS) is no longer "passed along" is "The Rasterizer Stage."

 

I.e., from what I can determine, the rasterizer examines the input signature of the currently bound pixel shader to determine what data (by semantics) is required, and examines the output signature of the previous stage (in this case, the GS) to determine where that data can be found. At this point (I assume), if some of the data from the last stage isn't needed by the PS, it's simply not copied - i.e., not "passed along."

 

IF that's correct, then it appears the penalty (as kalle_h mentions) is only in the GS - (perhaps manipulating data and) loading output registers which aren't read later.

 

Does that sound right?

Edited by Buckeye


Stop thinking in terms of registers (which is an abstraction that no longer applies to modern GPUs) and start thinking in terms of memory.

The GS will store data you will later not be using, so you're wasting memory and bandwidth.

 

When the PS loads the data, it is larger and thus it won't fit in the cache as tightly as possible.

 

Whether any of this is an issue depends on whether you're ALU, bandwidth or cache pollution bottlenecked.

Edited by Matias Goldberg



so you're wasting memory and bandwidth. When the PS loads the data, it is larger and thus it won't fit in the cache as tightly as possible.

 

That makes sense. I appreciate the explanation.


Jeez, people. :)

 

You're only really wasting memory if you're not using it. But as OP said, he IS using it - all of the inputs & outputs ARE used, just by different pixel shaders. So this is actually a memory use optimization, because having two separate copies of the same input (but one with the extra data) just for the sake of respecting input/output shader layouts would take up even more than double the memory he is currently using. However, memory use is only a concern for data passed to the first stage of the pipeline - the input assembler. For the data passed between shaders, which is what OP is concerned about, it is a non-issue.

 

However, the performance loss in this case is not because of the additional outputs from the geometry shader. It is actually because of the extra work that the geometry shader is doing to generate the outputs that might not be used later on by the pixel shader (or whatever shader comes next in the pipeline). In this case, it is better to have two or more different versions of the geometry shader with outputs that match the pixel shader inputs - one geometry shader for each unique set of pixel shader inputs. This is usually achieved by using #ifdefs inside the HLSL... Though it's not really a performance loss, because the geometry shader will always do the same work and output the same data, no matter which of those outputs the pixel shader is using, so the performance of the geometry shader will be constant. It is more like a missed chance at optimization, and you usually want to optimize your shaders as much as possible.

 

Sorry for beating around the bush - I hope I cleared things up a bit.

 

And AFAIK, the geometry shader uses Output registers as output. What's this about a memory cache?

Edited by tonemgub


it is better to have two or more different versions of the geometry shader with outputs that match the pixel shader inputs - one geometry shader for each unique set of pixel shader inputs.

 

I'm afraid I may have confused the discussion with my post #4. That was intended only as an illustration. The original configuration is two geometry shaders, one a GS without streamout, one with streamout. For convenience (i.e., laziness), both are created from the same bytecode, so both have the same output signature - three output values. There is just one pixel shader, and it inputs 2 values from the GS (whichever is currently bound). The third output from the GS stage is used for streamout when the SO GS is bound. The third output is not used at all when the GS-without-streamout is bound. For the latter case, I was curious what effect, if any, emitting an unused value has.

 

The optimized approach would be to compile twice (once for the non-streamout config - emit 2 values, once for the SO config - emit 3 values), and create the shaders from separate sets of bytecode. I'm basically lazy and dislike modifying code that works, so, if I were to optimize in that way, I wanted to have a good idea why I'd be doing it.
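For what it's worth, the two-bytecode approach could look something like the sketch below, using the #ifdef idea mentioned earlier in the thread. This is only an illustration, not the code I'm actually running: STREAM_OUT, GSIn, GSOut, VERTEXID and kHalfSizeNdc are hypothetical names, and the depth test is left out to keep it short.

```hlsl
// Sketch only. STREAM_OUT and all other names here are hypothetical.
struct GSIn
{
    float4 pos   : SV_Position;
    float4 color : COLOR0;
    uint   id    : VERTEXID;
};

struct GSOut
{
    float4 pos   : SV_Position;
    float4 color : COLOR0;
#ifdef STREAM_OUT
    uint   id    : VERTEXID; // declared last, so the first two outputs are the same in both variants
#endif
};

static const float2 kHalfSizeNdc = float2(0.005f, 0.005f); // assumed quad half-size in NDC

[maxvertexcount(4)]
void GSMain(point GSIn input[1], inout TriangleStream<GSOut> stream)
{
    static const float2 corners[4] =
    {
        float2(-1.0f, -1.0f), float2(-1.0f, 1.0f),
        float2( 1.0f, -1.0f), float2( 1.0f, 1.0f)
    };

    float4 center = input[0].pos;

    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        GSOut o;
        // Offset in clip space; scaling by w keeps the quad a fixed size in NDC.
        o.pos   = center + float4(corners[i] * kHalfSizeNdc * center.w, 0.0f, 0.0f);
        o.color = input[0].color;
#ifdef STREAM_OUT
        o.id    = input[0].id; // only the stream-out variant writes the ID
#endif
        stream.Append(o);
    }
}
```

The file would be compiled twice: once with STREAM_OUT defined (for example through a D3D_SHADER_MACRO entry passed to D3DCompile, or /D on the fxc command line) for the stream-out shader, and once without it for the "regular" GS. The pixel shader is the same in both cases.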

 

So, really, I was just curious. :)

Edited by Buckeye

