How to use geometry shader instances to divide the output data?

7 comments, last by Nik02 9 years, 1 month ago

In a geometry shader, I want to output about 108 vertices from 3 input vertices (a triangle), and each output vertex has 12 scalar values. For example:


struct VS_OUTPUT
{
     float4 Pos;
     float2 TextureUV;
     float3 Normal;
     float3 Data;
};

The problem is that I cannot output all the vertices because of the geometry shader output limit, which does not allow more than 1024 scalars (108 * 12 = 1296). Is it possible to use 2 instances of a geometry shader to solve this problem, so that the first instance outputs half of the data and the second one outputs the other half? I have searched the internet but found nothing, and I have tried it but it does not work.

For example


[maxvertexcount(108)]
[instance(2)]
void GS(triangle VS_OUTPUT input[3], inout TriangleStream<GS_OUTPUT> Output,
        uint InstanceID : SV_GSInstanceID)
{
       GS_OUTPUT vertex;
       if (InstanceID == 0)
       {
             // First instance: fill in only position and texture coordinates
             vertex.Pos = ....
             vertex.TextureUV = ........
       }
       else // InstanceID == 1
       {
             // Second instance: fill in only normal and extra data
             vertex.Normal = ....
             vertex.Data = .....
       }

       // Output data
}

What you describe is not possible as is, because the output vertex structure is fixed - even if you initialize only some of its fields, it still necessarily takes up memory for all the fields.

GS supports multiple stream output, though. I don't remember the specifics off the top of my head, but you can use up to 4 output streams, each of which can receive one element. Since you apparently have 4 elements in your desired output struct, this may work for you. If you can use SM5, the limitations regarding stream out are somewhat more relaxed than in SM4.

EDIT:

I misunderstood slightly. You want to output 108*12 = 1296 scalars in a single invocation? This is not practical with a GS. The design intent of the GS is that it outputs only a small amount of geometry, if any, and the hardware limits reflect that intent.

What is your use case?

If you can afford not to care about the actual number of output triangles, lose a little precision, and have output geometry that can be parametrized as a rectangular or triangular patch, the tessellator would be the right choice here.

If the tessellator does not fit your use case, have you tried regular geometry instancing? You can emulate a GS with an (almost) unlimited output buffer by instancing patches, where the per-instance stream holds your input primitives and the per-vertex stream holds the geometry to be used as the patch. Note that in this approach, unlike with the tessellator, the "patch" is supplied by you in a vertex buffer. Hence, it does not need to represent an actual continuous patch; it can consist of entirely separate primitives.
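
A minimal sketch of the instancing idea above, assuming Direct3D 11 and an input layout with two vertex streams. The semantic names (BARYCENTRIC, CORNER0 to CORNER2), the ViewProj constant and the choice to store the template as barycentric weights are all hypothetical; nothing in this thread prescribes them.

```hlsl
// Hypothetical constant buffer; the thread does not name one.
cbuffer PerFrame : register(b0)
{
    float4x4 ViewProj;
};

struct VS_INPUT
{
    // Per-vertex stream: one entry for each of the 108 template vertices,
    // stored as barycentric weights relative to the source triangle.
    float3 Bary : BARYCENTRIC;

    // Per-instance stream: the three corners of one source triangle.
    float3 P0 : CORNER0;
    float3 P1 : CORNER1;
    float3 P2 : CORNER2;
};

struct VS_OUTPUT
{
    float4 Pos : SV_Position;
};

VS_OUTPUT VS(VS_INPUT v)
{
    // Place the template vertex inside the source triangle.
    float3 worldPos = v.Bary.x * v.P0 + v.Bary.y * v.P1 + v.Bary.z * v.P2;

    VS_OUTPUT o;
    o.Pos = mul(float4(worldPos, 1.0f), ViewProj);
    return o;
}
```

With a layout like this, the per-instance elements would be declared as D3D11_INPUT_PER_INSTANCE_DATA with a step rate of 1, and the whole set drawn with something like DrawInstanced(108, triangleCount, 0, 0). The amplification then happens in the input assembler, so the 1024-scalar GS limit never comes into play.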

And finally, you could also use a compute shader to do the geometry amplification.
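
As a rough sketch of the compute shader route, assuming one thread per output vertex and structured buffers for input and output. The buffer names, the TriangleCount constant and the placeholder vertex generation are hypothetical; the real per-vertex logic would replace the marked section.

```hlsl
struct SourceVertex
{
    float3 Pos;
};

// Same 12 scalars per vertex as in the original question.
struct AmplifiedVertex
{
    float4 Pos;
    float2 TextureUV;
    float3 Normal;
    float3 Data;
};

StructuredBuffer<SourceVertex>      SourceVertices    : register(t0);
RWStructuredBuffer<AmplifiedVertex> AmplifiedVertices : register(u0);

// Hypothetical constant buffer holding the number of source triangles.
cbuffer Params : register(b0)
{
    uint TriangleCount;
};

static const uint VERTS_PER_TRIANGLE = 108;

[numthreads(64, 1, 1)]
void CS(uint3 id : SV_DispatchThreadID)
{
    uint outIndex = id.x;
    if (outIndex >= TriangleCount * VERTS_PER_TRIANGLE)
        return; // surplus threads in the last group do nothing

    uint tri   = outIndex / VERTS_PER_TRIANGLE; // which source triangle
    uint local = outIndex % VERTS_PER_TRIANGLE; // which of its 108 outputs

    float3 p0 = SourceVertices[tri * 3 + 0].Pos;
    float3 p1 = SourceVertices[tri * 3 + 1].Pos;
    float3 p2 = SourceVertices[tri * 3 + 2].Pos;

    // Placeholder generation: copies one of the source corners.
    // The real amplification logic would go here.
    AmplifiedVertex v;
    v.Pos       = float4(SourceVertices[tri * 3 + (local % 3)].Pos, 1.0f);
    v.TextureUV = float2(0.0f, 0.0f);
    v.Normal    = normalize(cross(p1 - p0, p2 - p0));
    v.Data      = float3((float)local, 0.0f, 0.0f);

    AmplifiedVertices[outIndex] = v;
}
```

This would be dispatched with enough groups of 64 threads to cover TriangleCount * 108 vertices. A structured buffer cannot be bound directly as a vertex buffer in D3D11, so one common arrangement is for the later vertex shader to read the results through a shader resource view indexed by SV_VertexID.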

The worst case is that you expand the geometry on the CPU and send it to the GPU ready to render by using a dynamic vertex buffer. How many geometry clusters in total are you expecting to generate? This may not be so bad if the total amount of data is not huge.

Niko Suni

Yet another alternative would be to pack your fields into integers (fixed-point precision), so you can store them in fewer registers. This would mean potential precision loss, though, and you'd have to unpack them before rendering.
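
A minimal sketch of such packing, assuming the field is a unit vector (such as Normal) and that 10 bits per component is enough precision. The function names are hypothetical.

```hlsl
// Quantize a unit vector to 10 bits per component and pack it
// into a single 32-bit value (the top 2 bits stay unused).
uint PackUnitVector(float3 n)
{
    // Map [-1, 1] to [0, 1023] and round to the nearest integer.
    float3 scaled = saturate(n * 0.5f + 0.5f) * 1023.0f;
    uint3  q      = (uint3)(scaled + 0.5f);
    return q.x | (q.y << 10) | (q.z << 20);
}

// Inverse of PackUnitVector, up to quantization error.
float3 UnpackUnitVector(uint packedValue)
{
    uint3 q = uint3(packedValue & 1023u,
                    (packedValue >> 10) & 1023u,
                    (packedValue >> 20) & 1023u);
    float3 n = (float3)q / 1023.0f * 2.0f - 1.0f;
    return normalize(n); // renormalize to remove quantization drift
}
```

Packing Normal this way turns 3 scalars into 1, so the vertex drops from 12 to 10 scalars. On its own that gives 108 * 10 = 1080, which is still above 1024, so more than one field would have to be packed. Note also that an integer attribute cannot be interpolated by the rasterizer (it has to be passed as nointerpolation), so a value packed like this stays constant across each triangle when it reaches the pixel shader.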

Niko Suni

Also, I don't remember right away whether the 1024 scalars was the total limit of one GS instance or of several instances. If each instance gets its own 1k, you could output half of the primitives (54*12 = 648 scalars) from one instance, and the other half from another instance.
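
A sketch of that split, assuming the limit is counted per instance (which is not confirmed here) and that the 108 vertices form 36 independent triangles. MakeVertex is a hypothetical placeholder for the real per-vertex logic, and the input semantics are assumptions.

```hlsl
struct GS_INPUT
{
    float4 Pos : POSITION;
};

struct GS_OUTPUT
{
    float4 Pos       : SV_Position;
    float2 TextureUV : TEXCOORD0;
    float3 Normal    : NORMAL;
    float3 Data      : TEXCOORD1;
};

static const uint TRIANGLES_PER_INSTANCE = 18; // 18 * 3 = 54 vertices

// Hypothetical placeholder: builds corner `corner` of output triangle `tri`.
GS_OUTPUT MakeVertex(GS_INPUT input[3], uint tri, uint corner)
{
    GS_OUTPUT v;
    v.Pos       = input[corner].Pos;
    v.TextureUV = float2(0.0f, 0.0f);
    v.Normal    = float3(0.0f, 0.0f, 1.0f);
    v.Data      = float3((float)tri, 0.0f, 0.0f);
    return v;
}

[instance(2)]
[maxvertexcount(54)] // per instance: 54 * 12 = 648 scalars
void GS(triangle GS_INPUT input[3],
        inout TriangleStream<GS_OUTPUT> output,
        uint instanceID : SV_GSInstanceID)
{
    for (uint t = 0; t < TRIANGLES_PER_INSTANCE; ++t)
    {
        // Instance 0 emits triangles 0..17, instance 1 emits 18..35.
        uint tri = instanceID * TRIANGLES_PER_INSTANCE + t;

        for (uint c = 0; c < 3; ++c)
            output.Append(MakeVertex(input, tri, c));

        output.RestartStrip(); // each triangle is its own strip
    }
}
```

The point of the sketch is that each instance writes complete vertices for its own half of the triangles, and that maxvertexcount is lowered to 54; the instances divide the vertex list, not the fields of a vertex. Treat it as something to try rather than a confirmed fix.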

Niko Suni

Thanks, Nik02. I have tried to output half of the primitives from one instance and the other half from another instance, but I still get the same error about the output limit. It is possible to pack a normalized normal vector into a single float value, and the same goes for a normalized view vector, but I am not sure that the interpolation would be correct.

It is possible to stream the output of a GS to a vertex buffer, instead of rendering it directly. However, this would result in an additional draw call. The easiest way - if you really require the use of GS - would be to just use two draw calls to begin with, and draw the first half of the 108 vertices with the first one, then the second half with the second one.

Of course, the ultimate easy way would be to just amplify the geometry on the CPU side, where there are very few limitations regarding the output size. This could also be faster than a complex GS, depending on where your current performance bottleneck is.

Niko Suni

If I amplify the geometry on the CPU, it will consume much more memory. I am thinking about using tessellation to amplify the geometry and then just adding some attributes in the geometry shader. But I do not understand how tessellation works. For instance, I have a prism that is composed of 2 triangles and 3 quads. If I tessellate the 3 quads into smaller triangles in NNNNN-like shapes, what happens to the other two triangles? Will they not be passed to the later stages?

The tessellator will generate a somewhat uniform grid for you and give you the parameters x (for a spline), x, y (for a quad patch) or x, y, z (for a triangle patch). The output is a transient grid mesh - transient in the sense that it doesn't (usually) persist after rendering. All the primitives the tessellator generates within one invocation are of the same type, which is configurable in the hull shader function attributes (outputtopology) - line segments or triangles. You can stream the tessellator output to a GS, though, if you want to further process the generated geometry, and via the GS you can also persist the tessellated geometry to a buffer (but this does consume a potentially large amount of GPU memory as well).

If you have mixed types of faces, then the tessellator doesn't seem like a very good fit. The typical use case is that you have a mesh consisting of quad or tri patch control primitives, and you generate a fine mesh from them on the fly for rendering by using the patch tessellation algorithm of your choice (lerp, N-patch, NURBS, etc.) that you implement in your domain shader. The main difference between the GS and the tessellator is that in the latter, you can't dynamically discard the generated primitives, and that they represent a continuous grid (or a polyline) over the chosen parameter space.
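
To make the triangle-patch case concrete, here is a minimal sketch of a pass-through hull shader and a domain shader that does plain linear interpolation (the "lerp" option). The constant buffer, the uniform TessFactor and the function names are hypothetical.

```hlsl
// Hypothetical constants; TessFactor is assumed to be in the range 1 to 64.
cbuffer PerFrame : register(b0)
{
    float4x4 ViewProj;
    float    TessFactor;
};

struct ControlPoint
{
    float3 Pos : POSITION;
};

struct PatchConstants
{
    float Edges[3] : SV_TessFactor;       // one factor per triangle edge
    float Inside   : SV_InsideTessFactor; // factor for the interior
};

// Runs once per patch: here every edge gets the same factor.
PatchConstants PatchConstantsHS(InputPatch<ControlPoint, 3> patch)
{
    PatchConstants pc;
    pc.Edges[0] = TessFactor;
    pc.Edges[1] = TessFactor;
    pc.Edges[2] = TessFactor;
    pc.Inside   = TessFactor;
    return pc;
}

// Runs once per control point: passes the corner through unchanged.
[domain("tri")]
[partitioning("integer")]
[outputtopology("triangle_cw")]
[outputcontrolpoints(3)]
[patchconstantfunc("PatchConstantsHS")]
ControlPoint HS(InputPatch<ControlPoint, 3> patch,
                uint i : SV_OutputControlPointID)
{
    return patch[i];
}

struct DS_OUTPUT
{
    float4 Pos : SV_Position;
};

// Runs once per generated vertex: bary holds the x, y, z parameters
// of that vertex within the triangle patch.
[domain("tri")]
DS_OUTPUT DS(PatchConstants pc,
             float3 bary : SV_DomainLocation,
             const OutputPatch<ControlPoint, 3> patch)
{
    float3 p = bary.x * patch[0].Pos
             + bary.y * patch[1].Pos
             + bary.z * patch[2].Pos;

    DS_OUTPUT o;
    o.Pos = mul(float4(p, 1.0f), ViewProj);
    return o;
}
```

The domain is fixed per hull shader, so for a prism the quad faces would need either a separate "quad" domain shader pair (with a float2 domain location) in its own draw call, or to be split into two triangle patches each so that everything goes through the "tri" path.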

Sorry if this sounds general; I have difficulty wrapping my head around your intended use case.

Niko Suni

All that said, as general advice, keep in mind that the GPU likes to be fed geometry that is as simple as possible, with as few logic branches as possible, and the CPU is happy (happier than the GPU anyway) with conditional processing.

If you can make the input geometry uniform on the CPU side, you avoid potentially expensive branching (which the GS pretty much represents) and your GPU runs at full steam.

GS does have its uses but it should not be used just for the sake of using it. GS does generally slow down geometry processing but if you can avoid a bigger bottleneck (such as bus bandwidth cap) by using it, it is then worth using.

Niko Suni
