
# Direct3D 10: Running a material system on the GPU


[caution] The content is based on the February CTP of D3D10 - it should hold for the final release, but as with all pre-release software it's subject to change.
[caution] This mini-article is still work-in-progress, I may well flesh it out more and improve it at a later date. As such I'm VERY keen to get your comments (Leave a public comment or send me a private message).


D3D10 allows for much more flexibility in the shader pipeline - the part of your graphics engine that runs entirely on the graphics hardware. With this increased flexibility there are two distinctly different things that you could use it for:
1. Bigger and better effects
2. Optimization - performance or efficiency

Obviously you are likely to see #1 dominating the discussion. People like fancy new visual effects - especially if they just took out a new loan to pay for an expensive high-end GPU [grin]

However, as I see it the second option is potentially the dark horse - a feature with a lot of potential to do interesting things provided people give it the attention it deserves. With greater efficiency you could simplify a number of effects - it could be as simple as having fewer permutations for the application to deal with. Current GPUs (and those expected in the near future) are still very specialised pieces of hardware - but with this new flexibility you could start to consider the GPU as even more of a co-processor.

The Direct3D 10 CTPs have included the "CubeMapGS" sample that does cube-map rendering in a single pass (traditionally it would require 6 passes) - this is an example of using the pipeline more efficiently. In this particular case you could either use that extra efficiency to implement a higher quality effect (e.g. a higher resolution cubemap) or invest the savings into another effect elsewhere (or simply more cube mapping).

Current material systems

By current I mean Direct3D9, but the concepts still apply generally across 3D graphics and other versions of the Direct3D API.

A material for a given piece of geometry can be made up of a number of constants, probably a whole bunch of textures and a pixel shader (simulating a particular lighting model for example). The exact definition of a material is very context sensitive - it will not only vary across different objects in a scene, but also across different levels of technology (look back at the legacy fixed-function D3DMATERIAL9).
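As a rough illustration of what such a definition might contain, here is a hypothetical CPU-side material record - a sketch invented for this article (the MaterialDesc name and all of its fields are assumptions, not taken from any real engine or from D3DMATERIAL9):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a CPU-side material record: a handful of
// shader constants, a list of texture handles and a pixel shader.
// All names here are invented for illustration.
struct MaterialDesc
{
    float  diffuse[4];      // constant: base colour
    float  specularPower;   // constant: lighting model parameter
    int    textureIds[4];   // handles into a texture manager
    size_t textureCount;    // how many of the texture slots are used
    int    pixelShaderId;   // handle to the compiled pixel shader
};
```

The point is simply that a "material" is a bundle of state the application has to bind before drawing - the more materials, the more binding work.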

Conventionally the application will configure the pipeline for a material and then render the geometry that uses it. As you increase the number of materials you also increase the amount of management that the CPU/application has to do. This creates a necessary, but somewhat tight, relationship between the GPU and CPU - yes they do work concurrently, but often only in small batches.

At a very high level you could describe it as a conversation:
Quote:
 CPU: Use material 1
 GPU: Okay
 CPU: Render Object A Subset 1
 GPU: Done, now what?
 CPU: Render Object B Subset 2
 GPU: Done, now what?
 CPU: Render Object C Subset 1
 GPU: Done, now what?
 CPU: Render Object D Subset 1
 GPU: Okay
 CPU: Use material 2
 GPU: Okay
 CPU: Render Object A Subset 2
 GPU: Okay, what next?
 CPU: Render Object B Subset 1
 GPU: Okay, what next?
 CPU: Render Object C Subset 2
 GPU: Okay, what next?
 CPU: Use material 3
 GPU: Okay
 CPU: Render Object A Subset 3
 GPU: Okay, what next?
 ...

There are various parts of the above conversation where either the CPU will be waiting for the GPU to finish or the GPU will be waiting for the CPU to tell it what to do.

Much of the performance optimization work with Direct3D 9 revolves around ways of simplifying the above conversation. Try to get the GPU to do as much work as possible before coming back to the CPU, and when the CPU has to get involved try and make it change as few things as possible. Techniques such as geometry instancing and batching give the GPU nice big blocks of work to keep itself busy, and state management and material systems reduce the number of changes the CPU has to do.

The state management system on the CPU (a more general form of the material system this article discusses) is often a good place to find classic algorithms and data structures - trees, lists, graphs etc...
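For illustration, the sort-by-material idea at the heart of such a system can be sketched in a few lines of C++. This is a hypothetical example (the DrawCall structure, materialId field and both function names are invented here), not code from any particular engine:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw-call record: sorting the queue by material means the
// CPU only has to issue a "use material" change at each boundary.
struct DrawCall
{
    int materialId;
    int objectId;
};

// Count how many material changes a given submission order would cause.
int CountMaterialChanges( const std::vector<DrawCall>& queue )
{
    int changes = 0;
    int current = -1;
    for( const DrawCall& dc : queue )
    {
        if( dc.materialId != current )
        {
            ++changes;
            current = dc.materialId;
        }
    }
    return changes;
}

// Sort so that draw calls sharing a material are adjacent.
void SortByMaterial( std::vector<DrawCall>& queue )
{
    std::sort( queue.begin(), queue.end(),
               []( const DrawCall& a, const DrawCall& b )
               { return a.materialId < b.materialId; } );
}
```

Interleaved submissions touching two materials cause a state change on nearly every draw; after sorting, each material is set exactly once.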

Using the previous conversation example, wouldn't it be much more efficient if we achieved something like this:
Quote:
 CPU: Here's the definition of everything about material 1
 GPU: Okay, got it
 CPU: Here's the definition of everything about material 2
 GPU: Okay, got it
 CPU: Here's the definition of everything about material 3
 GPU: Okay, got it
 CPU: Here's the definition for Object A
 GPU: Okay, got it
 CPU: Here's the definition for Object B
 GPU: Okay, got it
 CPU: Here's the definition for Object C
 GPU: Okay, got it
 CPU: Go render objects A, B and C
 ...

It's a simple example, but it should scale up well for more realistic scenes/worlds where you have many more objects and materials.

A simple example under Direct3D 10

So if we assume that we can upload/configure all of the materials before we start rendering the key question becomes, "I'm rendering this triangle, what material is it?".

The above screenshot shows a simple cube being rendered through Direct3D 10. At first glance there isn't anything particularly groundbreaking here - people have been producing better cubes for decades [grin]

BUT, the only information that is being provided by the application is a set of XYZ coordinates and appropriate indices. No additional information per-vertex or per-primitive.

Direct3D 10 defines three system generated values that the pipeline stages have access to: SV_VertexID, SV_PrimitiveID and SV_InstanceID. These are completely new - under Direct3D 9 they required either emulation or awkward hacks to be viable. Through a combination of these values a given shader unit can know what it is currently drawing. Provide a little extra information to the shader via constants and you've got enough for a material system.

In the above example the pixel shader reads the primitive ID (between 0 and 11 for a cube) and scales between black and white accordingly:

float4 main( in uint id : SV_PrimitiveID ) : SV_TARGET
{
    float c = (float)id / 11.0f;
    return float4( c, c, c, 1.0f );
}
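The mapping is easy to sanity-check on the CPU; this is just a C++ transcription of the shader's arithmetic (the FaceShade function name is invented for this sketch):

```cpp
// CPU-side transcription of the pixel shader above: map a primitive ID
// in [0, 11] onto a greyscale value in [0, 1].
float FaceShade( unsigned int id )
{
    return (float)id / 11.0f;
}
```

Face 0 comes out black, face 11 comes out white, and the ten faces in between get evenly spaced greys.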

Extending the example

The above image uses the same principles as the previous section, but it makes use of conditional execution and the integer instructions that are now a basic part of Direct3D 10 (via Shader Model 4). It simply examines the system generated index and compares it to one of three materials - albeit simple red, green or blue colouring.

I can imagine that more experienced shader-writers will have alarm bells ringing at the thought of dynamic branching in a shader. Such techniques are currently only available under Shader Model 3, and even then only the best-of-the-best GPUs can handle them with sufficient performance. Using them across the board as a basic component of any (and possibly every) shader is risky.

The big problem is that, at the time of writing, no Direct3D 10 hardware is available (those that have it are keeping very quiet!) to gauge just how well it holds up to more complex shader programs. The Direct3D 10 specification is very rigorous when it comes to defining both what features exist and how they work - but crucially it doesn't make any requirements on performance.

float4 GetFaceColour( in uint FaceID )
{
    // Define a default return value
    float4 Colour = float4( 0.0f, 0.0f, 0.0f, 1.0f );

    // Determine if this matches one of our attribute ranges
    if( (FaceID >= 0) && (FaceID <= 3) )
        Colour = float4( 1.0f, 0.0f, 0.0f, 1.0f ); // Red

    if( (FaceID >= 4) && (FaceID <= 7) )
        Colour = float4( 0.0f, 1.0f, 0.0f, 1.0f ); // Green

    if( (FaceID >= 8) && (FaceID <= 11) )
        Colour = float4( 0.0f, 0.0f, 1.0f, 1.0f ); // Blue

    // Return the selected colour
    return Colour;
}

float4 main( in uint id : SV_PrimitiveID ) : SV_TARGET
{
    return GetFaceColour( id );
}

A simple attribute-range based shader running entirely on the GPU

In the previous step the attributes were hard-coded into the shader - as a proof of concept it works reasonably well, but it's not usable in real situations.

Making the above shader more generic so that it uses application-supplied data is a fairly involved process. Firstly we have to introduce a constant buffer into the shader - this wraps up the necessary values and is accessible from the application. Unlike Direct3D 9, where the application had direct access to the constant registers, you now have to upload a whole constant buffer of data. Data in the constant buffer is packed into the actual hardware registers. It's a very tidy system, but it can be a bit of a pain to get correct [wink]

The following constant buffer allows the application to set/define the necessary values:

cbuffer AttributeBuffer
{
    uint2 AttrRange[ 16 ] : packoffset( c0 );   // X is the first face in the range, Y is the last.
    uint Attributes       : packoffset( c16 );  // How many, of a maximum 16, ranges are used.
    float4 Values[ 16 ]   : packoffset( c17 );  // When we match an attribute, we use the corresponding value
};

The packoffset() directives aren't strictly required, but the backing code to this shader was written "stand alone" rather than via the effects framework - thus forcing the binary layout of the constant buffer makes the C/C++ code a bit simpler. If you choose not to force the packing layout then you can always use the reflection interfaces to work out how to create the constant buffer.

The corresponding ID3D10Buffer* can be created by the application and uploaded via PSSetConstantBuffers(). The following fragment of C++ represents the contents of the cbuffer:

// Define a 4-component, 16 byte integer vector:
struct INT4
{
    unsigned __int32 i[4];
};

// Use the previous definition to create a 528 byte constant buffer:
struct PSBuffer
{
    INT4         AttrRange[16];
    INT4         Attributes;
    D3DXVECTOR4  Values[16];
};

PSBuffer AttrBuffer;

// We're going to be using 3 out of the 16 possible attributes
AttrBuffer.Attributes.i[0] = 3;

// Define the red faces
AttrBuffer.AttrRange[ 0 ].i[0] = 0;
AttrBuffer.AttrRange[ 0 ].i[1] = 3;
AttrBuffer.Values[ 0 ]         = D3DXVECTOR4( 1.0f, 0.0f, 0.0f, 1.0f );

// Define the green faces
AttrBuffer.AttrRange[ 1 ].i[0] = 4;
AttrBuffer.AttrRange[ 1 ].i[1] = 7;
AttrBuffer.Values[ 1 ]         = D3DXVECTOR4( 0.0f, 1.0f, 0.0f, 1.0f );

// Define the blue faces
AttrBuffer.AttrRange[ 2 ].i[0] = 8;
AttrBuffer.AttrRange[ 2 ].i[1] = 11;
AttrBuffer.Values[ 2 ]         = D3DXVECTOR4( 0.0f, 0.0f, 1.0f, 1.0f );
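One easy mistake with this kind of mirrored layout is the C++ struct silently drifting out of sync with the cbuffer's 16-byte register packing. A small compile-time check helps catch that; this sketch substitutes an invented 16-byte VEC4 for D3DXVECTOR4 so that it stands alone without the D3DX headers:

```cpp
#include <cstdint>

// 16-byte integer vector, as in the listing above.
struct INT4
{
    uint32_t i[4];
};

// Stand-in for D3DXVECTOR4 (also 16 bytes) so this compiles without D3DX.
struct VEC4
{
    float v[4];
};

// Mirror of the AttributeBuffer cbuffer.
struct PSBuffer
{
    INT4 AttrRange[16];  // registers c0  - c15
    INT4 Attributes;     // register  c16
    VEC4 Values[16];     // registers c17 - c32
};

// 33 registers * 16 bytes = 528 bytes, matching the packoffset() layout.
static_assert( sizeof(PSBuffer) == 33 * 16,
               "PSBuffer must mirror the cbuffer register layout" );
```

If anyone later inserts a field or lets padding creep in, the build breaks instead of the shader quietly reading garbage.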

The following piece of code is an updated version of the GetFaceColour() function from the previous section. It now uses looping as well as dynamic branching:

float4 GetFaceColour( in uint FaceID )
{
    // Define the default
    float4 Colour = float4( 0.0f, 0.0f, 0.0f, 1.0f );

    // Examine the attribute ranges
    for( int i = 0; i < Attributes; i++ )
    {
        if( (FaceID >= AttrRange[i].x) && (FaceID <= AttrRange[i].y) )
            Colour = Values[i];
    }

    // Return the selected colour
    return Colour;
}
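To convince yourself the range lookup behaves as intended, the loop is straightforward to mirror on the CPU. This is a C++ transcription of the shader logic above, using invented stand-in types (Range, FindAttribute) rather than HLSL's uint2 and float4:

```cpp
// One attribute range, mirroring the uint2 entries of AttrRange:
// first/last are the inclusive bounds on the face ID.
struct Range
{
    unsigned int first, last;
};

// CPU mirror of GetFaceColour(): scan the ranges and return the index of
// the matching material's value, or -1 for the default (black).
int FindAttribute( unsigned int faceId, const Range* ranges, int count )
{
    int match = -1;
    for( int i = 0; i < count; ++i )
    {
        if( faceId >= ranges[i].first && faceId <= ranges[i].last )
            match = i;
    }
    return match;
}
```

With the three ranges from the earlier listing (0-3, 4-7, 8-11), faces map to red, green and blue respectively, and anything outside the ranges falls through to the default.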

Whilst we're not really supposed to be dealing with assembly shaders, we can still examine what the HLSL compiler generates:

Quote:
 Compiled with:

fxc10 /T ps_4_0 /E "main" /Ges /Fc "c:\users\Jack\Listing.html" /Cc "c:\Direct3D 10\D3D10Test\D3D10Test\Shaders\PixelShader.psh"

Results in:

ps_4_0
dcl_input_sgv v0.x, primitive_id
dcl_output o0.xyzw
dcl_constantbuffer_dynamic cb0[33]
dcl_temps 2
mov r0.xyzw, l(0, 0, 0, 0x3f800000)
mov r1.x, l(0)
loop
  ilt r1.y, r1.x, cb0[16].x
  not r1.y, r1.y
  breakc_nz r1.y
  uge r1.y, v0.x, cb0[r1.x + 0].x
  uge r1.z, cb0[r1.x + 0].y, v0.x
  and r1.y, r1.y, r1.z
  movc r0.xyzw, r1.yyyy, cb0[r1.x + 17].xyzw, r0.xyzw
  iadd r1.x, r1.x, l(1)
endloop
mov o0.xyzw, r0.xyzw

It's probably worth noting that the generated code may not be optimal yet - in the case of this simple shader it probably is, but there's bound to be plenty of optimizations that the HLSL compiler hasn't got fully sorted yet.

Where next?

That covers the basic principles of how to get a simple material system running on the GPU. The original aim for this article was to offer a simple example that you might find interesting and want to extend. Hopefully you can see the potential for using the new features of the Direct3D 10 pipeline for something other than fancy jaw-dropping visuals. Sure, it's not about to replace CPU-based material systems, and chances are there will be some classes of materials that this method either can't handle or won't be efficient at - but as far as I see it, it's got potential.

A few of my thoughts about where to take this...

Another interesting possibility comes from the unlimited shader length allowed by Direct3D 10. Various people have tried out the concept of "super shaders" either as a static compile-time branched shader or at runtime using proper dynamic branching. The material selection discussed in this article could well be used to select a particular lighting model (especially interesting if you also combine this with LOD lighting [grin]) for a range of faces:

float4 ComputeLightingColour( ... )
{
    int MaterialID = GetMaterialID( PrimitiveID );

    switch( MaterialID )
    {
        case BRUSHED_METAL_MATERIAL:
            return BrushedMetalShader( ... );

        case PLASTIC_MATERIAL:
            return GenericPlasticShader( ... );

        case BRICK_MATERIAL:
            return BrickShader( ... );
    }
}

Once fully expanded/compiled, that sort of construct would almost certainly generate a shader long enough to make any current-generation hardware cry [lol]

That is all for now - I'd be interested to know your thoughts on this, so feel free to leave a public comment or send me a private message [smile]

Keep up the good work!

Hehe, thanks - appreciated!

Time permitting I'm trying to get around to writing up a few of the articles that I've got kicking around in the great expanse between my ears [grin]

Cheers,
Jack

Man, if current ATi/NV shader perf re:branching and such is indicative of what we can hope to see in the D3D10 generation, you're just going to make the cards weep.

[lol] Yeah, I can appreciate it'll possibly hurt them...

BUT, the latest X1900's and so on seem to be pretty smooth when it comes to flow control, so fingers-crossed that that sort of performance profile follows through to SM4..

There is also the other consideration... Yes, dynamic flow control might hurt - but does it hurt more than the CPU<->GPU interaction and the various Set**() calls? I can see my method adding overhead - but it's a question whether that overhead is less than anything the CPU can muster... [oh]

Cheers,
Jack

I thought that part of the design behind D3D10 is, like, near-zero overhead for things like Set*() calls?

The tradeoff would be interesting though, because even with ATi's very high quality branching system, you could still be looking at effectively 4 or 5-digit instruction counts, which without some interesting design could cause some havoc on certain video cards. I'm no expert on the actual hardware, but wouldn't the current 512 or so instructions for a shader basically sit in a little cache for shaders and such? If your supershader gets really big, isn't it possible that you'll basically run off of a performance cliff due to cache thrashing or something?

Quote:
 I thought that part of the design behind D3D10 is, like, near-zero overhead for things like Set*() calls?
Yeah, the emphasis is on zero-overhead for Draw*() calls, but that also seems to follow through to Set*() calls as well.

However, even if the actual Set*() call is instantaneous the CPU/GPU are still "locked" together - it's likely that one will be waiting on the other. That unused/wait time strikes me as being potentially as bad as any per-call overhead...

Quote:
 The tradeoff would be interesting though, because even with ATi's very high quality branching system, you could still be looking at effectively 4 or 5-digit instruction counts, which without some interesting design could cause some havoc on certain video cards.
Yup, it could get pretty heavyweight - but it's almost impossible to tell how much pain it'll cause until we get our hands on some real hardware.

Quote:
 I'm no expert on the actual hardware, but wouldn't the current 512 or so instructions for a shader basically sit in a little cache for shaders and such?
That's how I understand it...

Quote:
 If your supershader gets really big, isn't it possible that you'll basically run off of a performance cliff due to cache thrashing or something?
Yup, although from what I've heard this will be a general issue with D3D10 shaders. It could become the new "caps" - if ATI/NV differ substantially then you could end up using different length shaders according to the detected chipset. Maybe have the ubershader for one IHV but break it into multiple (shorter) passes for the other IHV...

Jack

This would be very cool with XML....

I believe that material system designs will not change a lot between DX9 and DX10, if you already use a next-gen material system.
As usual, the difference between DX9 and DX10 graphics cards will not be so huge in terms of cache and overall performance. If you used the hardware to the max with your DX9 system, you will also have a nice DX10 system. The API changes are interesting, but the hardware performance will dictate what we do, not the API... as it always did, and I do not expect a lot of surprises here... the 360 showed the way.
