Is there a way to query max root signature size?

5 comments, last by galop1n 6 years, 7 months ago

I have read that the max root signature size is 64 DWORDs. Apparently it's effectively 64 on NVIDIA and 13 on AMD, but I got this info from a person working at AMD. I can't do that for every vendor out there :P. There has to be a better way to query the max root size through the API without having to contact someone working at the company.

So is there a way to find out the actual max root signature size for the active GPU?

Thank you


The root signature is an abstraction, and the size limit is there to prevent abuse. Underneath, that is not how things work, and there is no 13 limit on AMD whatsoever.

What the AMD guy may have told you is that they store root arguments in up to 13 reserved SGPRs, and if that is not enough, they allocate an extended user-data buffer in memory and store its address in 2 SGPRs. The consequence is that some of your constants or descriptors are accessed through an extra indirection.

99% of the time, the indirection is free, because all of these are loaded with scalar (SGPR) instructions and those loads hide perfectly behind other instructions.

If you want to know how a shader really accesses your data, you can take a look at the ISA by capturing a frame in PIX.

It's 64 on everything.

AMD GPUs have 16 DWORDs of space in hardware available to implement the concept of the root signature. Apparently the driver eats up 3 for itself, leaving you 13. If your root parameters fit into this space, then hopefully the driver will place them directly into these HW registers... otherwise the driver will allocate a temporary table, put some of your root parameters into that table, and place a pointer to the table into a HW register.

However, to further complicate things, you don't know what the actual sizes of the different root parameter types are either! D3D12 says that:
Descriptor tables cost 1 DWORD each.
Root constants cost 1 DWORD each, since they are 32-bit values.
Root descriptors (64-bit GPU virtual addresses) cost 2 DWORDs each.

And this is true as far as D3D12's 64 DWORD limit is concerned... but it might not match the hardware. Perhaps a root descriptor is a different size depending on whether it's a texture or a buffer? Perhaps they're 1 DWORD, or 4 DWORDs? Perhaps the underlying hardware doesn't even have root parameter registers at all -- maybe it's all traditional-style resource binding slots?

If you want to write cross platform stuff, just follow the rules of D3D12's abstraction... because there is no reliable way to get answers to these questions from the underlying hardware.
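
To make those accounting rules concrete, here is a minimal sketch (the helper function is mine, not part of the API) that totals the DWORD cost of a root signature description using the documented costs, so you can check that a layout stays under 64 before serializing it:

#include <d3d12.h>
#include <cstdint>

// Totals the cost of a root signature description using D3D12's documented
// costs: 1 DWORD per descriptor table, 1 DWORD per 32-bit root constant,
// 2 DWORDs per root descriptor (64-bit GPU virtual address).
uint32_t RootSignatureCostInDwords(const D3D12_ROOT_SIGNATURE_DESC& desc)
{
    uint32_t cost = 0;
    for (UINT i = 0; i < desc.NumParameters; ++i)
    {
        const D3D12_ROOT_PARAMETER& p = desc.pParameters[i];
        switch (p.ParameterType)
        {
        case D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE:
            cost += 1;                          // pointer to a descriptor table
            break;
        case D3D12_ROOT_PARAMETER_TYPE_32BIT_CONSTANTS:
            cost += p.Constants.Num32BitValues; // one DWORD per 32-bit constant
            break;
        default:                                // root CBV/SRV/UAV
            cost += 2;                          // 64-bit GPU virtual address
            break;
        }
    }
    return cost; // must be <= 64 for the root signature to be valid
}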

11 minutes ago, Hodgman said:

If you want to write cross platform stuff, just follow the rules of D3D12's abstraction...

Sorry, but what are these rules?

Does it mean that I should just use root descriptors for constantly changing CBVs and descriptor tables for all other resources?

1 hour ago, mark_braga said:

Sorry, but what are these rules?

That the root parameter block size is 64 DWORDs on every GPU, a table takes up 1 DWORD, a root-constant takes up 1 DWORD per element, and a root descriptor takes up 2 DWORDs.

If you really want a particular parameter to go into a HW register instead of being spilled to memory, put it earlier in the root signature.
e.g. Let's say that AMD can support 6 root descriptors in hardware -- if you use 12 root descriptors, the first 6 will go into HW registers (fast for the shader to read, fast for the driver to update), and the other 6 will spill into memory (a very, very slight penalty for the shader setup, more cost for the driver to update). So, if some of those descriptors change frequently, you should put the frequently-changing ones first in the root and the rest afterwards. No matter the actual HW root register count, this ensures that your most dynamic resources are the most likely to land in HW registers.
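
In case a concrete example helps, here is a rough sketch of that ordering using the d3dx12.h helpers (the register assignments are invented for illustration): the per-draw constant buffer goes in parameter 0, the per-frame one next, and the mostly-static descriptor table last.

#include <d3d12.h>
#include <wrl/client.h>
#include "d3dx12.h"  // CD3DX12_* helper structs (from the D3D12 samples / DirectX-Headers)

// Hot data first: the lowest parameter slots are the most likely to stay in HW registers.
Microsoft::WRL::ComPtr<ID3DBlob> BuildExampleRootSignatureBlob()
{
    CD3DX12_DESCRIPTOR_RANGE srvRange;
    srvRange.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 8, 0);  // t0-t7: material textures

    CD3DX12_ROOT_PARAMETER params[3];
    params[0].InitAsConstantBufferView(0);          // b0: per-draw constants (changes every draw)
    params[1].InitAsConstantBufferView(1);          // b1: per-frame constants
    params[2].InitAsDescriptorTable(1, &srvRange);  // rarely-changing descriptors last

    // Cost: 2 + 2 + 1 = 5 DWORDs, comfortably under the 64 DWORD limit.
    CD3DX12_ROOT_SIGNATURE_DESC desc(_countof(params), params, 0, nullptr,
        D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

    Microsoft::WRL::ComPtr<ID3DBlob> blob, error;
    D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1, &blob, &error);
    return blob;  // error handling omitted for brevity
}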

After that, if you want to micro-optimize for different GPUs... yeah, there is no official API to help with this. You're down to asking individual vendors for advice and applying that advice for every different GPU model out there...

Right now AMD suggests putting your frequently changing stuff first, while nVidia says last. I would say neither matters much, because unless you have a very edgy case in your bound resources and access patterns, the difference in performance on the CPU and the GPU is at the level of white noise.

I would even suggest caring less than that, because the trend for rendering large amounts of data is to go bindless, which adds an explicit indirection in the shader from material/object properties to the actual descriptors. Again, the cost is negligible in practice, and it is compensated by the shrinking amount of work the command processor has to do; plus you can prepare many of your bindings in advance in immutable buffers and reuse them from frame to frame instead of spending time flushing everything every frame.
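
For what it's worth, a bindless-style root signature can be tiny. Here is a rough sketch under my own assumptions (a fixed-size table of 1024 SRVs and a single material index; your shader does the actual indexing, e.g. into a texture array declared in HLSL):

#include <d3d12.h>
#include "d3dx12.h"  // CD3DX12_* helper structs

// Two root parameters, 1 + 1 = 2 DWORDs total: a root constant carrying the
// material index, and one big SRV table that is filled once and reused across
// frames. The caller owns params/srvRange so the pointers stored in desc stay valid.
void BuildBindlessStyleRootSignatureDesc(CD3DX12_ROOT_SIGNATURE_DESC& desc,
                                         CD3DX12_ROOT_PARAMETER (&params)[2],
                                         CD3DX12_DESCRIPTOR_RANGE& srvRange)
{
    srvRange.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1024, 0);  // t0..t1023, written once
    params[0].InitAsConstants(1, 0);                          // b0: materialIndex
    params[1].InitAsDescriptorTable(1, &srvRange);            // bound once, reused every frame
    desc.Init(_countof(params), params);
}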

