OpenGL samplers, textures and texture units (design question)

@Hodgman:
Oh, you want to start a war, don't you? :)

Sorry :P

On current generation hardware

To clarify myself here -- AMD has won the console wars for now, with them supplying Microsoft, Sony and Nintendo with GPU architectures. As far as AAA stuff goes, the GCN architecture is the only one you really have to optimize for; it is the current generation -- acting as both your primary target and your minimum spec for the PC port.
Even though nVidia has a majority market share in PC gaming, they're now the "alternative" GPU that a minority of total consumers will be using.
The assumption then is that if you've optimized it to run well on your min-spec GCN GPU, it will run fine on nVidia cards anyway.

Everything below is relevant to AMD's GCN architecture. nVidia's architecture isn't quite as bindless yet.

Timothy Lottes has two posts with a very thorough analysis of both styles on modern HW.

As much as I like Lottes (I was really looking forward to playing his game, until he shelved it to start work at the Graphics Mafia :( ) he's playing the GL-apologist here, and stopped examining the AMD side of things as soon as he got the conclusions he (and his employer) was looking for.
I won't go into his "Re: Things that drive me nuts about OpenGL" post because it's off topic and I don't have anything nice to say :D
A bunch of issues with his "Bindless and Descriptors" post though--

1- He fails to mention that his "D3D" examples could be used by GL drivers, and his "GL" examples could be used by D3D drivers. There is an API<->Hardware mismatch already, with the drivers converting between API abstractions and hardware realities. If his "GL" examples are indeed more optimal, you can expect that D3D drivers will be using them.
If the API uses D3D11-style split texture-view/sampler objects, it's very easy for the driver to support GL-style hardware realities by merging those two objects prior to submitting the draw-call. Vice-versa is also possible, though much harder on the driver.
That's one good reason that APIs should follow the D3D11-style abstraction -- it allows the driver/hardware designers more flexibility in choosing different solutions, while keeping the drivers clean and simple.
Emulating D3D-style APIs on GL-style hardware is dead easy, emulating GL-style APIs on D3D-style hardware is complex and/or slow.
Modern AMD hardware is D3D-style. nVidia hardware is still leaning towards GL-style. An API that's optimal everywhere should expose the abstraction that's easy to use on both sets of hardware.
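
As a rough sketch of why the "dead easy" direction is dead easy (hypothetical names/types of mine, not any real driver code): with D3D11-style separate view/sampler objects, producing the paired layout that GL-style hardware wants is just a per-binding copy when the draw is submitted:
#include <stdint.h>

typedef struct { uint32_t dwords[4]; } TextureDesc; // assuming 4-DWORD descriptors, as in the tables below
typedef struct { uint32_t dwords[4]; } SamplerDesc;

typedef struct {        // the paired texture+sampler layout that GL-style hardware wants
  TextureDesc texture;
  SamplerDesc sampler;
} CombinedDesc;

// Merging split objects into pairs is just a copy per binding at draw time.
// Going the other way -- splitting already-merged texture+sampler objects back
// into unique views/samplers -- needs de-duplication, which is the complex/slow path.
void emitCombinedDescriptors( const TextureDesc* views, const SamplerDesc* samplers,
                              const int* samplerForView, // which sampler each view uses
                              int viewCount, CombinedDesc* out )
{
  for( int i = 0; i < viewCount; ++i )
  {
    out[i].texture = views[i];
    out[i].sampler = samplers[samplerForView[i]];
  }
}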

2- AMD GCN is a fully bindless architecture. His "AMD GL Non-bindless" and "AMD DX Non-bindless" examples are actually bindless examples - examples of how the D3D/GL drivers themselves are internally using bindless for you, even when you're still using these old non-bindless APIs.
There are no "texture registers" like in the nVidia non-bindless examples; the descriptors being loaded are the actual guts of ShaderResourceView/SamplerState objects, not handles(slots) to special registers containing that information. These Views/States always have to be loaded into SGPRs for use, and you've got a generous number of SGPRs such that it's not really a problem (VGPR pressure is usually a bigger problem, bottlenecking your occupancy).
So, these examples are showing how a slot-based DX/GL API would be emulated on bindless hardware already... and they're also showing how a bindless API would work on bindless hardware!
The "AMD GL Bindless" example is then how to pointlessly emulate nVidia-style indirection into a common register table, which isn't how I'd implement a bindless API on top of this hardware...

3- He mentions that S_LOAD_DWORD can either load one Texture/Sampler descriptor, or, if you pair them, load both in one go (his conclusion: you may as well just pair them all the time).

3.a - Firstly on this, optimizing for scalar instruction counts is really scraping the bottom of the barrel -- he theorizes it might be helpful when occupancy is so low that the hardware can't dual issue any more... but if you're in that situation, you're going to have terrible performance across the board (no memory latency hiding), so you should instead be optimizing to get occupancy back up. If you can't do that, then with such low occupancy you're probably suffering from the horrible latency on your vector loads (64x wider than your scalar loads), so you'd probably want to optimize them first too...

3.b - S_LOAD_DWORD can actually load 1-16 DWORDS though, not just 4/8. So if you have a descriptor table like:
struct Table { 
  SamplerDesc s0;          // 4 DWORDS
  TextureDesc t0, t1, t2;  // 3 x 4 DWORDS
};
^ then you can load all 4 of those descriptors with one load instruction, as the table size is 16 DWORDS.
If we convert that table to his GL version, where textures/samplers are always paired...
struct Table { 
  SamplerDesc s0; TextureDesc t0;  // 8 DWORDS
  SamplerDesc s1; TextureDesc t1;  // 8 DWORDS
  SamplerDesc s2; TextureDesc t2;  // 8 DWORDS
};
...then we need two load instructions as the table size is now 24 DWORDS... The opposite of what he claimed is true -- always pairing your samplers/textures actually results in more load instructions.

3.c - He mentions in the 'D3D' case that you need one load for each texture, plus one load every time a sampler is used for the first time. As shown above in 3.b, this just isn't true, but even if we assumed it was (and that S_LOAD_DWORD only loads either 4 or 8 DWORDS), it's still possible for him to apply his 'GL' optimization here!
e.g. A shader with two textures and one sampler might produce a table like below:
struct Table {
  TextureDesc t0; // Lottes fetch #0  // Alternate: fetch #0  // Reality: fetch #0
  SamplerDesc s0; // Lottes fetch #1  // Alternate: fetch #0  // Reality: fetch #0
  TextureDesc t1; // Lottes fetch #2  // Alternate: fetch #1  // Reality: fetch #0
};
He says that you'd require a load for t0 plus a load for s0 (as it's being used for the first time), then later you'd need another load for t1.
Applying his 'GL' optimization, you'd get a single double-sized load for t0+s0 in one instruction, then later another load for t1.

But as above, in reality you could load that whole table with one instruction anyway...

3.d - Even if we're optimizing to minimize peak scalar-GPR usage, as well as optimizing for scalar instruction count, the D3D-style gives the driver way more options.
Say we've got three textures that all use one sampler.
The shader compiler may decide that it doesn't want to keep around the SGPR data for the sampler all the time. In that case, the driver can choose to waste memory/bandwidth and duplicate the sampler-desc - producing the "GL-style" example with paired textures/samplers and three memory fetches to get them into SGPRs:
struct Table {
  TextureDesc t0;       //fetch #0
  SamplerDesc s0;       //fetch #0
  TextureDesc t1;       //fetch #1
  SamplerDesc s0_clone; //fetch #1
  TextureDesc t2;       //fetch #2
  SamplerDesc s0_clone2;//fetch #2
};
Or it can get fancy by realizing it doesn't need s0_clone, as t1 is actually still contiguous to s0!
struct Table {
  TextureDesc t0;      //fetch #0
  SamplerDesc s0;      //fetch #0 & fetch #1
  TextureDesc t1;      //           fetch #1
  TextureDesc t2;      //fetch #2
  SamplerDesc s0_clone;//fetch #2
};
Or maybe the shader compiler decides that it can keep s0 around in SGPRs, but only between the uses of t0/t1, and that it will have to be re-fetched later for t2. In that case, if we're still also optimizing for scalar instruction count, the driver can produce:
struct Table {
  TextureDesc t0;      //fetch #0
  SamplerDesc s0;      //fetch #0
  TextureDesc t1;      //fetch #0
  TextureDesc t2;      //fetch #1
  SamplerDesc s0_clone;//fetch #1
};
It doesn't have to be one or the other. The driver is free to use hybrids between the GL-style and D3D-style examples.
If the high level API has already merged samplers/textures into one object, the driver is robbed of this flexibility (unless it wants to do the complex stuff from my last post, of finding unique sets, etc).
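
To make that flexibility concrete, here's a rough sketch (hypothetical driver-side code of mine, not from any actual driver) of how a driver could take D3D11-style separate texture/sampler bindings plus a liveness hint from the shader compiler, and spit out any of the layouts above just by choosing where to clone the sampler descriptor:
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t dwords[4]; } Descriptor; // 4-DWORD texture or sampler descriptor

typedef struct {
  int  samplerIndex;     // which API sampler this texture is sampled with
  bool samplerStillLive; // compiler hint: sampler is already resident in SGPRs here
} TextureUse;

// Appends texture descriptors in order, cloning the sampler descriptor into the
// table only where the shader compiler says it won't still be live in registers.
// Different hints reproduce each of the hybrid layouts above.
int buildTable( const Descriptor* textures, const TextureUse* uses, int useCount,
                const Descriptor* samplers, Descriptor* outTable )
{
  int n = 0;
  for( int i = 0; i < useCount; ++i )
  {
    outTable[n++] = textures[i];
    if( !uses[i].samplerStillLive )
      outTable[n++] = samplers[uses[i].samplerIndex];
  }
  return n; // descriptor count (x4 for DWORDS)
}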

4- He mentions this crucial detail but then doesn't factor it into his examples:
"As can be seen in the AMD programming guide under "SGPR Initialization", up to 16 scalars can be pre-loaded before shader start."

Let's say we have a simple shader with: one cbuffer, one sampler, two textures.
That's a descriptor table like:
struct Table { 
  BufferDesc b0;       // 4 DWORDS
  SamplerDesc s0;      // 4 DWORDS
  TextureDesc t0, t1;  // 2 x 4 DWORDS
};
As this descriptor table is exactly 16 DWORDS in size, we get it pre-loaded for free, which removes all the S_LOAD_DWORD instructions from all of his examples.

If we used his GL example though, where we always pair up our Textures/Samplers, our table grows past the 16 DWORD limit, so we have to split it into two structures now:
struct TableBase { 
  BufferDesc b0;
  SamplerDesc s0;
  TextureDesc t0;
  void* extra; // pointer to TableExtra (2 DWORDS)
  //2 spare DWORDS, could have another void* if required
};
struct TableExtra
{
  SamplerDesc s1;
  TextureDesc t1;
};
We then still get the data in TableBase loaded for free, and we can load all the data from TableExtra with a single S_LOAD_DWORD instruction.
For more complex shaders, we end up with this general rule of thumb:
descriptorCount = numBuffers + numTextures + numSamplers;
if( descriptorCount <= 4 )
  loadsRequired = 0;
else
  loadsRequired = ceil( (descriptorCount-3)/4 );
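
As a quick sanity check of that rule of thumb (illustrative C only, using the same 4-DWORD-per-descriptor assumption as the tables above):
#include <stdio.h>

int loadsRequired( int numBuffers, int numTextures, int numSamplers )
{
  int descriptorCount = numBuffers + numTextures + numSamplers;
  if( descriptorCount <= 4 )
    return 0;                           // whole table fits in the 16 pre-loaded SGPRs
  return (descriptorCount - 3 + 3) / 4; // integer ceil( (descriptorCount-3)/4 )
}

int main()
{
  printf( "%d\n", loadsRequired(1, 2, 1) ); // 4 descriptors  -> 0 loads (the cbuffer/sampler/two-texture case above)
  printf( "%d\n", loadsRequired(1, 4, 2) ); // 7 descriptors  -> 1 load (3 in the base table, 4 via one S_LOAD)
  printf( "%d\n", loadsRequired(2, 8, 2) ); // 12 descriptors -> 3 loads
  return 0;
}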
