NikiTo

DX12 About registers in HLSL and Optimization


(I've googled for a way to write assembler-like instructions for DirectX 12, but have found nothing beyond hacking the bytecode, something I don't have the patience to do. So I need to work with HLSL, but first I need to make a few things clear.)

I have thought of a simple example that demonstrates what exactly I need to know:

For the example I have two textures of the same size, both 2D and of format DXGI_FORMAT_R8_UINT, say 128x128 "pixels". One is the input to the pixel shader, and the pixel shader writes its result to the other.

I want the shader to take the input byte, XOR it with 44h, and output it to the render target.
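A minimal sketch of such a pixel shader, assuming the texture is bound as a Texture2D&lt;uint&gt; (the names and register slot are illustrative, not from the thread):

```hlsl
Texture2D<uint> gInput : register(t0);

// Output goes to the DXGI_FORMAT_R8_UINT render target.
uint PSMain(float4 pos : SV_Position) : SV_Target
{
    uint value = gInput.Load(int3(pos.xy, 0)); // read the input byte
    return value ^ 0x44;                       // XOR with 44h
}
```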

My question is which of these two cases will happen
(assuming for the example that the registers used in the GPU's ALUs are 16 bytes wide):

case 1: the HLSL compiler loads the byte into the first component of a 16-byte-wide vector register (or into the lowest 8 bits of the first component of a 4-dword-wide vector register), and the remaining 15 bytes are simply zeroed.

case 2: the HLSL compiler is clever enough to read the data in chunks of 16 bytes, load them into all 16 bytes of a 16-component vector register, broadcast that 44h into another 16-byte-wide register, XOR once, and write a single 16-byte chunk to the render target. (But I can't see this happening, because MSDN says the unused components are zeroed. It isn't clear whether they are just hidden from me or actually zeroed, as in the first case....)

What about variables:

Would this

float fVar = 3.1f;

occupy the same type of register as this

float4  fVector = { 0.2f, 0.3f, 0.4f, 0.1f };  ?


(I'm sorry if I'm posting too often, but it is very hard for me to find anything helpful on Google.)

Edited by NikiTo


I think the reason you are unable to fine-tune your shaders to that extent is that the data types themselves are implementation-defined, so for a float in your shader, some GPUs might load it into a vector register, others into a scalar one. Some of them don't even have half-precision types, if I am not mistaken.

To my knowledge, only AMD exposes the real assembly for shaders. What you see for compiled HLSL files is bytecode for an intermediate assembly language, which the driver compiles further down to native GPU instructions, together with the full render pipeline state, when you use the shader.


@turanszkij So, there is yet another compiler that gets in the middle...

For the simple example above I could, say, read the texture as DXGI_FORMAT_R32G32B32A32_UINT and XOR it with 0x44444444. That way I would solve my problem from the high-level language without losing time on GPU assembler. The problem is that my real project is more complex than this simplified example. I try to clear up my doubts by restructuring my solution. I could try to force some kind of parallelism by using only four-component vector variables in my code, but it is painful to have to guess how it works.
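That widened variant might look like the following sketch, where the same byte data is viewed through an R32G32B32A32_UINT SRV so each invocation covers 16 of the original bytes (names are assumed for illustration):

```hlsl
Texture2D<uint4> gInput : register(t0);

// Each pixel now processes 16 of the original bytes at once.
uint4 PSMain(float4 pos : SV_Position) : SV_Target
{
    uint4 packed = gInput.Load(int3(pos.xy, 0));
    return packed ^ 0x44444444; // the scalar is broadcast to all 4 components
}
```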


AFAIK, all recent GPUs use 32-bit registers and nothing else. A vec4 uses 4 registers; a byte uses 1 register. (The exception is AMD Vega, which brings back 16-bit registers with twice the ALU throughput.)

Also, recent GPUs are scalar; there are no more native vec4 instructions, and the compiler always emits 4 instructions if you use them.

Native vector instructions date back to pre-Fermi (NVIDIA) and pre-GCN (AMD) hardware.

(Correct me if I'm wrong about anything.)

 

1 hour ago, NikiTo said:

case 2: HLSL compiler is wise enough to read the data in chunks of 16 bytes and load them into the 16 bytes of a 16 components vector register, propagate that 44h to another 16 bytes wide register, XOR it once and write a single 16 bytes chunk to the render target. (but I can't see it happen, because it says in MSDN that the unused components are zeroed. Not clear if they are just hidden from me or they was zeroed for real as in the first case....)

I assume in your code each thread does the XOR on its own data. This means each thread uses its own register, and because other threads can't access that register, what you suggest is impossible, or at least makes no sense. Actually, on every GPU groups of 4 neighbouring threads can share registers, but this is only exposed to the programmer through extensions, and even if the compiler utilized it, there would be no win in shuffling data around just so that one thread does a single XOR while the other 3 threads do nothing in the meantime. (Remember, threads operate in lockstep, and all of them need work to do all the time if possible.)

Related: the only way to share data between threads efficiently is LDS memory, and you can use it only from compute shaders. Pixel shaders are only good for simple brute-force algorithms.

To look at the final assembly code, you can use vendor tools like NSight or CodeXL. (The latter at least works for OpenCL; I'm not sure how far its DX12 support goes.)


I would expect the compiler to produce 16 parallel XORs in each of those 4 threads (for the simple example in the OP), giving me 16x4 fully parallel operations.

Edited by NikiTo

58 minutes ago, NikiTo said:

I would expect the compiler to produce 16 parallel XORs in each of those 4 threads(for the simple example above in the OP) giving me 16x4 totally parallel operations.
The all and any instructions in HLSL suggest to me that there are SIMD registers under the hood.

OK, but then you can simply read your bytes in blocks of 32 bits, XOR them with 0x44444444, and write 32 bits back.

I guess what you have in mind is not that simple, and then the answer probably again becomes: no, the compiler won't do such optimizations for you, but you can do them yourself when possible, e.g. storing 4 counters in one int32 and incrementing all of them with one instruction: counters += 0x01010101.
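A sketch of that packed-counter trick (illustrative only; note that each packed counter must stay below 256, since nothing stops a carry from spilling into the neighbouring byte):

```hlsl
uint counters = 0;          // four 8-bit counters packed into one 32-bit value
counters += 0x01010101;     // increments all four counters with one add
counters += 0x01010101;     // ...and again

uint c0 = counters & 0xFF;         // unpack counter 0
uint c1 = (counters >> 8) & 0xFF;  // unpack counter 1
```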

There are no SIMD- or MMX-like registers as we have on the CPU; instead a GPU processes 1 instruction across 32/64 threads, each of them operating on its own registers. This is transparent to the programmer, and there is nothing hidden under the hood that transforms your code into something very different.

You may give a more complex example that I cannot solve with a simple trick like the one above...

 

58 minutes ago, NikiTo said:

I try to change my algorithms so they can run in pixel shaders on the 3D engine, each with its own memory, because I guess the 3D engine is guaranteed to be the best-optimized piece of hardware in any GPU (the compute engine, they say, might not even be present on some GPUs).

From your other thread I have the impression you are working on an algorithm with some complexity that does not rasterize triangles?

Then compute shaders are for you, and you won't benefit from the graphics pipeline. GPUs without compute capability are more than a decade in the past.

The graphics pipeline has very restricted GPGPU capabilities; you have to do things very inefficiently. (I would even say you can do nothing useful in pixel shaders, except shading pixels.)

It seems your guess is very wrong (!). The exception: your algorithm is so tiny and simple that you can do it in a pixel shader without hitting any limitation, but if you don't know compute, you can't be sure. And even then, the pixel shader will likely have the same performance as the compute shader.

 

Example: an N-body simulation of 1024 planets.

Pixel shader: every thread reads ALL 1024 planets to see how their gravity affects the thread's own body.

Compute shader: every thread reads ONLY ITS OWN planet and stores it in LDS memory; now every thread has access to every planet. 15 times faster. Threads can communicate this way, so you can do useful things.
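A compute-shader sketch of that pattern, with a hypothetical Gravity() helper and illustrative resource names:

```hlsl
#define NUM_BODIES 1024

StructuredBuffer<float3> gPositions : register(t0);
RWStructuredBuffer<float3> gAccelerations : register(u0);

groupshared float3 sPositions[NUM_BODIES]; // LDS, shared by the whole group

[numthreads(NUM_BODIES, 1, 1)]
void CSMain(uint gi : SV_GroupIndex)
{
    sPositions[gi] = gPositions[gi];    // each thread loads only its own planet
    GroupMemoryBarrierWithGroupSync();  // wait until LDS is fully populated

    float3 accel = float3(0, 0, 0);
    for (uint i = 0; i < NUM_BODIES; ++i)
        accel += Gravity(sPositions[gi], sPositions[i]); // Gravity() is a
                                        // hypothetical helper; reads hit fast LDS
    gAccelerations[gi] = accel;
}
```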

 

Edited by JoeJ


As others have alluded to, there's a layer of abstraction here that you're not accounting for. fxc outputs DXBC, which contains instructions for a virtual ISA. So it's basically emitting code for a virtual GPU that's expected to behave according to the rules defined in the spec for that virtual ISA. The GPU and its driver are then responsible for taking that virtual ISA, JIT-compiling it to native instructions that the GPU can understand (which they sometimes do by going through multiple intermediate steps!), and then executing those instructions such that the final shader results are consistent with what the virtual machine would have produced. This gives GPUs tons of leeway in how they design their hardware and their instructions, which is kind of the whole point of having a virtual ISA. The most famous example is probably recent Nvidia and AMD GPUs, whose threads work in terms of scalar instructions instead of the 4-wide instructions used in DXBC.

Ultimately what this means is that you can often only reason about things in terms of the DXBC virtual ISA. The IHVs will often provide tools that show you the final hardware-specific instructions for reference, which can occasionally help guide you towards writing more optimal code for a particular generation of hardware. But in the long run hardware can change in all kinds of ways, and you can never make assumptions about how the hardware will compute the results of your shader program.

That said, the first thing you should do is compile your shader and look at the resulting DXBC. In the case of loading from an R8_UINT texture and XOR'ing it, it's probably just going to load the integer data into a single component of one of its 16-byte registers and then perform an xor instruction on that. Depending on what else is going on in your program you might or might not have other data packed into the same register, and the compiler may or may not merge multiple scalar operations into a 2, 3, or 4-way vector operation. But again, this can have little bearing on the actual instructions executed by the hardware.

In general, I think your worries about "packing parallel XORs" are a little misplaced for modern GPUs. GPUs typically use SIMT-style execution, where a single "thread" running a shader program runs on a single lane of a SIMD unit. So as long as you have lots of threads executing (pixels being shaded, in your case), the XOR will pretty much always run in parallel across wide SIMD units as a matter of course.


Another point worth making is that the shader compilers (both the HLSL-to-DXBC one and the DXBC-to-IHV-ISA one) have no idea that your texture is of format R8_UINT. All the compilers know is that your texture has an 'integer' format; it might in fact be R32_UINT.

For that reason, even if a GPU did have a 16-byte-wide register, it could do no better than load your 8-bit texel into a 32-bit component of that register. There is no scope for a 16x speedup from doing 16 XORs simultaneously without jumping through the hoops of using R32G32B32A32_UINT.

Note also that D3D11.1 added optional support for Logic Ops in the Blend State, of which XOR is one of the (optionally) supported ops.



