NikiTo

DX12 About registers in HLSL and Optimization


(I've googled for a way to write assembler-like instructions for DirectX 12, but have found nothing beyond hacking the bytecode, something I don't have the patience for. So I have to deal with HLSL, but I need to make a few things clear.)

I have thought of a simple example that demonstrates what exactly I need to know:

For the example, I have two textures of the same size, both 2D and of type DXGI_FORMAT_R8_UINT, let's say 128x128 "pixels". One is the input to the pixel shader, and the pixel shader writes its result to the other.

I want the shader to take the input byte, XOR it with 44h, and output it to the render target.

My doubt is which of these two will happen
(assuming, for the example, that the registers used in the GPU's ALUs are 16 bytes wide):

case 1: the HLSL compiler loads the byte into the first component of a 16-byte-wide vector register (or into the lowest 8 bits of the first component of a 4-dword-wide vector register), and the remaining 15 bytes are simply zeroed.

case 2: the HLSL compiler is wise enough to read the data in 16-byte chunks, load them into all 16 bytes of a 16-component vector register, broadcast that 44h into another 16-byte-wide register, XOR once, and write a single 16-byte chunk to the render target. (But I can't see this happening, because MSDN says the unused components are zeroed. It isn't clear whether they are just hidden from me or really zeroed, as in the first case...)
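
Just for reference, here is a minimal sketch of the shader I have in mind (the resource names and bindings are only placeholders):

Texture2D<uint> gInput : register(t0);           // the DXGI_FORMAT_R8_UINT input texture

uint main(float4 pos : SV_Position) : SV_Target  // render target is the other R8_UINT texture
{
    int3 coord = int3(int2(pos.xy), 0);          // texel coordinate of this pixel, mip 0
    uint value = gInput.Load(coord);             // the 8-bit texel arrives widened to 32 bits
    return value ^ 0x44;                         // XOR with 44h and write it out
}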

What about variables:

Would this

float fVar = 3.1f;

occupy the same type of register as this

float4  fVector = { 0.2f, 0.3f, 0.4f, 0.1f };  ?


(I'm sorry if I'm posting too often, but it is very hard for me to find anything helpful on Google.)

Edited by NikiTo


I think the reason you are unable to fine-tune your shaders to that extent is that the data types themselves are implementation-defined, so for a float in your shader some GPUs could load it into a vector register, others into a scalar one. Some of them don't even have half types, if I am not mistaken.

To my knowledge, only AMD exposes the real shader assembly. What you see for compiled HLSL files is bytecode for an intermediate assembly, which gets compiled further down by the driver into native GPU instructions, together with all the render pipeline state, when you use the shader.


@turanszkij So there is yet another compiler in the middle...

For the simple example above, I could, for instance, read the texture as DXGI_FORMAT_R32G32B32A32_UINT and XOR it with 0x44444444. That way I would solve my problem from the high-level language, without losing time on assembler for the GPU. The problem is that my project is more complex than this simplified example. I am trying to get rid of my doubts by changing my solution. I could try to force some kind of parallelism by using only four-component vector variables in my code, but it is painful to have to guess how it works.
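
Something along these lines is what I mean, as a sketch only (it assumes the input and the render target are created or viewed as DXGI_FORMAT_R32G32B32A32_UINT, and the names are placeholders):

Texture2D<uint4> gPacked : register(t0);        // the same bytes, viewed as R32G32B32A32_UINT

uint4 main(float4 pos : SV_Position) : SV_Target
{
    int3 coord = int3(int2(pos.xy), 0);
    uint4 block = gPacked.Load(coord);          // 16 input bytes per fetch
    return block ^ 0x44444444;                  // XORs all 16 bytes in one expression
}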


AFAIK, all recent GPUs use 32-bit registers and nothing else. A vec4 uses 4 registers, a byte uses 1 register. (The exception is AMD Vega, which brings back 16-bit registers with twice the ALU performance.)

Also, recent GPUs are scalar: there is no native vec4 anymore, and the compiler always emits 4 instructions if you use one.

Native vector instructions date back to pre-Fermi and pre-GCN hardware.

(Correct me if I'm wrong about anything.)

 

1 hour ago, NikiTo said:

case 2: the HLSL compiler is wise enough to read the data in 16-byte chunks, load them into all 16 bytes of a 16-component vector register, broadcast that 44h into another 16-byte-wide register, XOR once, and write a single 16-byte chunk to the render target. (But I can't see this happening, because MSDN says the unused components are zeroed. It isn't clear whether they are just hidden from me or really zeroed, as in the first case...)

I assume that in your code each thread does the XOR on its own data. This means that each thread uses its own register, and because other threads can't access this register, what you suggest is impossible, or at least makes no sense. Actually, on every GPU, groups of 4 neighbouring threads can share registers, but this is exposed to the programmer only through extensions, and even if the compiler utilized it, there would be no win in packing data around just so that one thread does a single XOR while the other 3 threads do nothing in the meantime. (Remember, threads operate in lockstep, and all of them need work to do all the time, if possible.)

Related: the only way to share data between threads efficiently is LDS memory, and you can use it only in compute shaders. Pixel shaders are only good for simple brute-force algorithms.

To look at the final assembly code, you can use vendor tools like NSight or CodeXL. (The latter at least works for OpenCL; I'm not sure how far its DX12 support goes.)


I would expect the compiler to produce 16 parallel XORs in each of those 4 threads (for the simple example above in the OP), giving me 16x4 fully parallel operations.

Edited by NikiTo

58 minutes ago, NikiTo said:

I would expect the compiler to produce 16 parallel XORs in each of those 4 threads (for the simple example above in the OP), giving me 16x4 fully parallel operations.
The all and any instructions in HLSL suggest to me that there are SIMD registers under the hood.

OK, but then you can simply read your bytes in blocks of 32 bits, XOR them with 0x44444444, and write 32 bits back again.

I guess what you have in mind is not that simple, and then the answer probably again becomes: no, the compiler won't do such optimizations for you, but you can do them yourself where possible, e.g. storing 4 counters in one int32 and incrementing all of them with one instruction: counters += 0x01010101.
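
For illustration, a small sketch of that packed-counters trick (the helper name is made up, and it only works as long as no individual counter exceeds 255, because there is no carry isolation between the packed bytes):

uint4 IncrementPackedCounters(inout uint counters)
{
    counters += 0x01010101;          // one add increments all four 8-bit counters
    return uint4(counters,           // unpack for when the individual values are needed
                 counters >> 8,
                 counters >> 16,
                 counters >> 24) & 0xFF;
}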

There are no SIMD or MMX-like registers anymore as we have on the CPU; instead, a GPU executes one instruction across 32/64 threads, each of them operating on its own registers. This is transparent to the programmer, and there is nothing hidden under the hood that transforms your code into something very different.

You may give a more complex example that I cannot solve with a simple trick like the one above...

 

58 minutes ago, NikiTo said:

I try to change my algorithms in a way that lets them run in pixel shaders on the 3D Engine, each with its own memory, because I guess the 3D Engine is guaranteed to be the best-optimized piece of hardware in any GPU (the Compute Engine, they say, might not even be present on some GPUs).

From your other thread I have the impression that you work on an algorithm of some complexity that does not rasterize triangles?

Then compute shaders are for you, and you won't benefit from the graphics pipeline. GPUs without compute capability are more than a decade back.

The graphics pipeline has very restricted GPGPU capabilities; you need to do things very inefficiently. (I would even say you can do nothing useful with pixel shaders, except shading pixels.)

It seems you are very wrong with your guess(!). Exception: your algorithm is so tiny and simple that you can do it in a pixel shader without hitting any limitation, but if you don't know compute, you can't be sure. And even then, the pixel shader likely has the same performance as a compute shader.

 

Example: an N-body simulation of 1024 planets.

Pixel shader: every thread reads ALL 1024 planets to see how their gravity affects the thread's own body.

Compute shader: every thread reads ONLY ITS OWN planet and stores it in LDS memory; now every thread has access to every planet. 15 times faster. Threads can communicate this way - you can do useful things.
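
Roughly like this, as a sketch only (it assumes a single thread group covering all 1024 planets; the buffer layout, names and the softening constant are made up):

#define BODY_COUNT 1024

StructuredBuffer<float4>   gBodies : register(t0);   // xyz = position, w = mass
RWStructuredBuffer<float3> gAccel  : register(u0);   // resulting acceleration per body

groupshared float4 sharedBodies[BODY_COUNT];          // the LDS copy, filled cooperatively

[numthreads(BODY_COUNT, 1, 1)]
void main(uint gi : SV_GroupIndex)
{
    sharedBodies[gi] = gBodies[gi];                   // every thread reads ONLY ITS OWN planet...
    GroupMemoryBarrierWithGroupSync();                // ...and afterwards can read all of them from LDS

    float4 self  = sharedBodies[gi];
    float3 accel = float3(0, 0, 0);
    for (uint i = 0; i < BODY_COUNT; ++i)
    {
        float3 d      = sharedBodies[i].xyz - self.xyz;
        float  distSq = dot(d, d) + 1e-4f;            // softening avoids division by zero
        accel        += sharedBodies[i].w * d / (distSq * sqrt(distSq));
    }
    gAccel[gi] = accel;
}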

 

Edited by JoeJ


As others have alluded to, there's a layer of abstraction here that you're not accounting for. fxc outputs DXBC, which contains instructions that use a virtual ISA. So it's basically emitting code for a virtual GPU that's expected to behave according to rules defined in the spec for that virtual ISA. The GPU and its driver are then responsible for taking that virtual ISA, JIT-compiling it to native instructions that the GPU can understand (which they sometimes do by going through multiple intermediate steps!), and then executing those instructions in a way that the final shader results are consistent with what the virtual machine would have produced. This gives GPUs tons of leeway in how they can design their hardware and their instructions, which is kind of the whole point of having a virtual ISA. The most famous example is probably recent Nvidia and AMD GPUs, which have their threads work in terms of scalar instructions instead of the 4-wide instructions that are used in DXBC.

Ultimately what this means for you is that you can often only reason about things in terms of the DXBC virtual ISA. The IHVs will often provide tools that can show you the final hardware-specific instructions for reference, which can occasionally help guide you in writing more optimal code for a particular generation of hardware. But in the long run hardware can change in all kinds of ways, and you can never make assumptions about how the hardware will compute the results of your shader program.

That said, the first thing you should do is compile your shader and look at the resulting DXBC. In the case of loading from an R8_UINT texture and XOR'ing it, it's probably just going to load the integer data into a single component of one of its 16-byte registers and then perform an xor instruction on that. Depending on what else is going on in your program you might or might not have other data packed into the same register, and the compiler may or may not merge multiple scalar operations into a 2, 3, or 4-way vector operation. But again, this can have little bearing on the actual instructions executed by the hardware.
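
For example, fxc can dump that DXBC listing for you with something along these lines (adjust the profile and entry point to match your shader):

fxc /T ps_5_0 /E main /Fc shader_dxbc.asm shader.hlsl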

In general, I think that your worries about "packing parallel XORs" are a little misplaced in terms of modern GPUs. GPUs will typically use SIMT-style execution, where a single "thread" running a shader program runs on a single lane of a SIMD unit. So as long as you have lots of threads executing (pixels being shaded, in your case), the XOR will pretty much always run in parallel across wide SIMD units as a matter of course.


Another point worth making is that the shader compilers (both the HLSL->DXBC one and the DXBC->IHV-ISA one) have no idea that your texture is of format R8_UINT. All that the compilers are aware of is the fact that your texture has an 'integer' format; it may in fact be R32_UINT.

For that reason, even if a GPU did have a 16-byte-wide register, it would be able to do no better than load your 8-bit texel into a 32-bit component of that register. There is no scope for a 16x speedup from doing 16 XORs simultaneously without jumping through the hoops of using R32G32B32A32_UINT.

Note also that D3D11.1 added optional support for logic ops in the blend state, and XOR is one of the (optionally) supported ops.

