About registers in HLSL and Optimization


(I've googled for the possibility of writing assembler-like instructions for DirectX 12, but have found nothing beyond hacking the bytecode, something I don't have the patience to do. So I need to deal with HLSL, but I need to make a few things clear.)

I have thought of a simple example that demonstrates exactly what I need to know:

For the example I have two textures of the same size, both 2D and of type DXGI_FORMAT_R8_UINT, let's say 128x128 "pixels". One is an input to the pixel shader, and the pixel shader outputs the result to the other.

I want the shader to take the input byte, XOR it with 44h and write it to the render target.
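
Something like this minimal pixel shader is what I mean (just a sketch; binding the input as a Texture2D<uint> SRV and all the names are my own assumptions):

// Sketch: read one R8_UINT texel, XOR it with 44h, write it to the R8_UINT render target.
Texture2D<uint> InputTex : register(t0);

uint main(float4 pos : SV_Position) : SV_Target
{
    uint value = InputTex.Load(int3(pos.xy, 0)); // one byte per pixel
    return value ^ 0x44;                         // XOR with 44h
}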

My doubt is which of these two will happen:
(assuming, for the example, that the registers used in the GPU's ALUs are 16 bytes wide)

case 1: the HLSL compiler loads the byte into the first component of a 16-byte-wide vector register (or into the lowest 8 bits of the first component of a 4-DWORD-wide vector register),
and the remaining 15 bytes are simply zeroed.

case 2: the HLSL compiler is smart enough to read the data in chunks of 16 bytes, load them into all 16 bytes of a 16-component vector register, broadcast that 44h into another 16-byte-wide register, XOR once and write a single 16-byte chunk to the render target. (But I can't see this happening, because MSDN says the unused components are zeroed. It isn't clear whether they are merely hidden from me or actually zeroed, as in the first case...)

What about variables:

Would this

float fVar = 3.1f;

occupy the same type of register as this

float4 fVector = { 0.2f, 0.3f, 0.4f, 0.1f }; ?


(I'm sorry if I'm posting too often, but it is very hard for me to find anything helpful on Google.)


I think the reason you are unable to fine-tune your shaders to that extent is that the data types themselves are implementation-defined, so a float in your shader might be loaded into a vector register on some GPUs and into a scalar register on others. Some of them don't even have half types, if I am not mistaken.

To my knowledge, only AMD exposes the real shader assembly. What you see for compiled HLSL files is bytecode for an intermediate assembly language, which the driver compiles further down to native GPU instructions, together with all the render pipeline state, when you use the shader.

@turanszkij So, there is yet another compiler that sits in the middle...

For the simple example above, I could, for instance, read the texture as DXGI_FORMAT_R32G32B32A32_UINT and XOR it with 0x44444444. That way I would solve my problem from the high-level language, without losing time on GPU assembler. The problem is that my real project is more complex than this simplified example. I am trying to clear up my doubts by changing my solution. I could try to force some kind of parallelism by using only four-component vector variables in my code, but it is painful to have to guess how it works.
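
As a sketch of what I mean (assuming the same data is viewed as DXGI_FORMAT_R32G32B32A32_UINT, so the texture is 16 times narrower; the names are placeholders):

// Sketch: the same texels viewed as R32G32B32A32_UINT, so every fetch covers
// 16 of the original bytes and one vector XOR handles all of them.
Texture2D<uint4> InputTex16 : register(t0);

uint4 main(float4 pos : SV_Position) : SV_Target
{
    uint4 chunk = InputTex16.Load(int3(pos.xy, 0));
    return chunk ^ 0x44444444u; // the scalar is broadcast to all four components
}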

AFAIK, all recent GPUs use 32-bit registers and nothing else. A vec4 uses 4 registers, a byte uses 1 register. (The exception is AMD Vega, which brings back 16-bit registers with twice the ALU throughput.)

Also, recent GPUs are scalar; there is no native vec4 anymore, and the compiler always emits 4 instructions if you use one.

Native vector instructions date back to pre-Fermi and pre-GCN hardware.

(Correct me if I'm wrong about anything.)

 

1 hour ago, NikiTo said:

case 2: the HLSL compiler is smart enough to read the data in chunks of 16 bytes, load them into all 16 bytes of a 16-component vector register, broadcast that 44h into another 16-byte-wide register, XOR once and write a single 16-byte chunk to the render target. (But I can't see this happening, because MSDN says the unused components are zeroed. It isn't clear whether they are merely hidden from me or actually zeroed, as in the first case...)

I assume that in your code each thread does the XOR on its own data. This means each thread uses its own register, and because other threads can't access that register, what you suggest is impossible, or at least makes no sense. Actually, on every GPU, groups of 4 neighbouring threads can share registers, but this is exposed to the programmer only through extensions, and even if the compiler utilized it there would be no win in shuffling data around just so that one thread does a single XOR while the other 3 threads do nothing in the meantime. (Remember that threads operate in lockstep and all of them need work to do all the time, if possible.)

Related: the only way to share data between threads efficiently is LDS memory, and you can use it only from compute shaders. Pixel shaders are only good for simple brute-force algorithms.

To look at the final assembly code, you can use vendor tools like Nsight or CodeXL. (The latter at least works for OpenCL; I'm not sure how far its DX12 support goes.)

I would expect the compiler to produce 16 parallel XORs in each of those 4 threads (for the simple example above in the OP), giving me 16x4 fully parallel operations.

58 minutes ago, NikiTo said:

I would expect the compiler to produce 16 parallel XORs in each of those 4 threads (for the simple example above in the OP), giving me 16x4 fully parallel operations.
The all and any intrinsics in HLSL suggest to me that there are SIMD registers under the hood.

Ok, but then you can simply read your bytes in blocks of 32 bits, XOR them with 0x44444444 and write back 32 bits again.

I guess what you have in mind is not that simple, in which case the answer again probably becomes: no, the compiler won't do such optimizations for you, but you can do them yourself when possible, e.g. storing 4 counters in one int32 and incrementing all of them with one instruction: counters += 0x01010101.
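
Something like this (a sketch; it only works as long as every packed counter stays below 256, otherwise the carry bleeds into the neighbouring byte):

// Sketch: four 8-bit counters packed into one uint, incremented with a single add.
uint IncrementPackedCounters(uint counters)
{
    return counters + 0x01010101; // bumps all four bytes at once (no carry allowed)
}

uint UnpackCounter(uint counters, uint index) // index 0..3
{
    return (counters >> (index * 8)) & 0xFF;
}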

There are no SIMD or MMX-like registers like we have on the CPU anymore; instead, a GPU processes one instruction across 32/64 threads, each of them operating on its own registers. This is transparent to the programmer, and there is nothing hidden under the hood that transforms your code into something very different.

You may give a more complex example that I cannot solve with a simple trick like the one above...

 

58 minutes ago, NikiTo said:

I try to change my algorithms so that they can run in pixel shaders on the 3D engine, each with its own memory, because I guess the 3D engine is guaranteed to be the best-optimized piece of hardware in any GPU (the compute engine, they say, might not even be present on some GPUs).

From your other thread I have the impression that you are working on an algorithm of some complexity that does not rasterize triangles?

Then compute shaders are for you, and you won't benefit from the graphics pipeline. GPUs without compute capability are more than a decade in the past.

The graphics pipeline has very restricted GPGPU capabilities; you have to do things very inefficiently. (I would even say you can do nothing useful with pixel shaders, except shading pixels.)

It seems you are very wrong with your guess (!) The exception: your algorithm is so tiny and simple that you can do it in a pixel shader without hitting any limitation, but if you don't know compute, you can't be sure. And even then, the pixel shader will likely have the same performance as a compute shader.

 

Example: an N-body simulation of 1024 planets.

Pixel shader: every thread reads ALL 1024 planets to see how their gravity affects the thread's own body.

Compute shader: every thread reads ONLY ITS OWN planet and stores it in LDS memory; now every thread has access to every planet. 15 times faster. Threads can communicate this way, so you can do useful things.
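
A sketch of the compute-shader version (the buffer layout, the 1024-thread group size and all the names are my assumptions, not something from your code):

// Sketch of the N-body idea: each thread loads ONE planet into LDS (groupshared
// memory), then every thread reads all 1024 planets from there instead of from VRAM.
struct Planet
{
    float3 position;
    float  mass;
};

StructuredBuffer<Planet>   Planets   : register(t0);
RWStructuredBuffer<float3> OutForces : register(u0);

groupshared Planet SharedPlanets[1024];

[numthreads(1024, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    SharedPlanets[gi] = Planets[dtid.x];   // each thread fetches only its own planet
    GroupMemoryBarrierWithGroupSync();     // wait until LDS is fully populated

    Planet self  = SharedPlanets[gi];
    float3 force = float3(0, 0, 0);
    for (uint i = 0; i < 1024; ++i)        // gravity from every other planet, read from LDS
    {
        if (i == gi) continue;
        float3 d      = SharedPlanets[i].position - self.position;
        float  distSq = max(dot(d, d), 1e-6);
        force += normalize(d) * (SharedPlanets[i].mass * self.mass / distSq);
    }
    OutForces[dtid.x] = force;
}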

 

As others have alluded to, there's a layer of abstraction here that you're not accounting for. fxc outputs DXBC, which contains instructions that use a virtual ISA. So it's basically emitting code for a virtual GPU that's expected to behave according to the rules defined in the spec for that virtual ISA. The GPU and its driver are then responsible for taking that virtual ISA, JIT-compiling it to native instructions that the GPU can understand (which they sometimes do by going through multiple intermediate steps!), and then executing those instructions in such a way that the final shader results are consistent with what the virtual machine would have produced. This gives GPUs tons of leeway in how they design their hardware and their instructions, which is kind of the whole point of having a virtual ISA. The most famous example is probably recent Nvidia and AMD GPUs, whose threads work in terms of scalar instructions instead of the 4-wide instructions used in DXBC.

Ultimately, what this means for you is that you can often only reason about things in terms of the DXBC virtual ISA. The IHVs will often provide tools that can show you the final hardware-specific instructions for reference, which can occasionally help guide you towards writing more optimal code for a particular generation of hardware. But in the long run hardware can change in all kinds of ways, and you can never make assumptions about how hardware will compute the results of your shader program.

That said, the first thing you should do is compile your shader and look at the resulting DXBC. In the case of loading from an R8_UINT texture and XOR'ing it, it's probably just going to load the integer data into a single component of one of its 16-byte registers and then perform an xor instruction on that. Depending on what else is going on in your program, you might or might not have other data packed into the same register, and the compiler may or may not merge multiple scalar operations into a 2-, 3-, or 4-way vector operation. But again, this can have little bearing on the actual instructions executed by the hardware.

In general, I think your worries about "packing parallel XORs" are a little misplaced when it comes to modern GPUs. GPUs will typically use SIMT-style execution, where a single "thread" running a shader program runs on a single lane of a SIMD unit. So as long as you have lots of threads executing (pixels being shaded, in your case), the XOR will pretty much always be run in parallel across wide SIMD units as a matter of course.

Another point worth making is that the shader compilers (both the HLSL->DXBC one and the DXBC->IHV ISA one) have no idea that your texture is of format R8_UINT. All the compilers are aware of is that your texture has an 'integer' format and might, for all they know, be R32_UINT.

For that reason, even if a GPU did have a 16-byte-wide register, it could do no better than load your 8-bit texel into a 32-bit component of that register. There is no scope for a 16x speedup from doing 16 XORs simultaneously without jumping through the hoops of using R32G32B32A32_UINT.

Note also that D3D11.1 added optional support for Logic Ops in the Blend State, of which XOR is one of the (optionally) supported ops.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

