Reducing byte transfer between C++ and HLSL.

12 comments, last by Matias Goldberg 8 years, 3 months ago

My idea is simple: all the colors I have are in RGB(A) format [0-255, 0-255, 0-255, 0-255].

It seems like a good idea to me to pass only 4 bytes as one unsigned value and somehow (automatically) decode it in the shader to unorm float[3].

That would replace the approach I always use: passing 12 bytes as float[3] directly from C++.

Is it possible to reinterpret 4 bytes as float4 inside a shader?

C++:


struct RdAmbientLight
{
  unsigned colorUNorm = 0xFFFFFFFF;
  //Other data
};

void DefPass2::createPsAmbientColorCb()
{
  CD3D11_BUFFER_DESC constantBufferDesc(sizeof(RdAmbientLight), D3D11_BIND_CONSTANT_BUFFER, D3D11_USAGE_DYNAMIC, D3D11_CPU_ACCESS_WRITE);
  m_device->CreateBuffer(&constantBufferDesc, nullptr, &m_psAmbientCb);
}

HLSL:


cbuffer AmbientLight: register(b1)
{
   unorm float4 Color; //< Some automatic decoding
};
Yes, it is possible, but...

You'll do several orders of magnitude more work unpacking the unsigned on the GPU than you would by just passing in the three floats.
fastcall22, on 03 Jan 2016 - 06:17 AM, said:

Yes, it is possible, but...

Is it possible via some HLSL syntax, or should I unpack it manually (e.g. use << and divide each color value by 255)?

Yes you can just pass it as an int (4 bytes), and then shift and mask to get the 3 bytes out, and then multiply by 1/255.0 to convert to normalized floats.

This may be faster or slower than sending full floats, depending on the situation.
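
For reference, a minimal HLSL sketch of that manual unpack (the function name and the assumption that red sits in the lowest byte are not from this thread):


float4 UnpackColorUNorm(uint packed)
{
  // Shift and mask each byte out, then scale to [0, 1].
  // Assumes red is in the least significant byte of the packed value.
  float4 c;
  c.r = float( packed        & 0xFF) / 255.0;
  c.g = float((packed >> 8)  & 0xFF) / 255.0;
  c.b = float((packed >> 16) & 0xFF) / 255.0;
  c.a = float((packed >> 24) & 0xFF) / 255.0;
  return c;
}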
Hodgman, on 03 Jan 2016 - 06:55 AM, said:

Yes you can just pass it as an int (4 bytes), and then shift and mask to get the 3 bytes out, and then multiply by 1/255.0 to convert to normalized floats.

This may be faster or slower than sending full floats, depending on the situation.

Thank you, Hodgman!

I just thought there was a native way for the video card to do this conversion for free that I wasn't aware of.

If you're desperate, you could create a standard ID3D11Buffer of 4 bytes and a ShaderResourceView of format DXGI_FORMAT_R8G8B8A8_UNORM. That'll auto-unpack for you on the GPU side, but it won't be a constant buffer any more, so it might cost you a tiny bit of GPU performance on some hardware.
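
The shader side of that approach could look something like this sketch (the names are made up, and it assumes the 4-byte buffer's SRV is bound to slot t0):


Buffer<float4> AmbientColorBuf : register(t0); // SRV created with DXGI_FORMAT_R8G8B8A8_UNORM

float4 LoadAmbientColor()
{
  // The UNORM format conversion happens in the buffer load path,
  // so this already returns floats in [0, 1].
  return AmbientColorBuf.Load(0);
}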

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

It depends on how you're sending the data: in buffers or textures, you can specify the R8G8B8A8_UNORM format.
Constant buffers only support 32-bit integers, not 8-bit, so you have to unpack manually.
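
In other words, the cbuffer from the original post would keep the packed value as a uint and unpack it by hand, something like this sketch (member name and byte order are assumptions):


cbuffer AmbientLight : register(b1)
{
  uint ColorUNorm; // packed 8-bit channels, written as-is from the C++ struct
};

float4 AmbientColor()
{
  // Pull each byte out and normalize; assumes red is in the lowest byte.
  uint4 bytes = uint4(ColorUNorm, ColorUNorm >> 8, ColorUNorm >> 16, ColorUNorm >> 24) & 0xFF;
  return float4(bytes) / 255.0;
}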

This is one of the places where OpenGL is ahead of D3D.

OpenGL has unpackUnorm for this. It's cumbersome, but it gets the job done. On most modern hardware this function maps directly to a native instruction. Unfortunately, as far as I know, HLSL has no equivalent.

However, you do have f16tof32, which is the next best thing.
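
If you go the f16tof32 route, the packing changes: the C++ side sends two 16-bit halves per 32-bit constant, so an RGBA color costs 8 bytes instead of 16 (though not as small as the 4-byte unorm packing). A sketch, with a made-up helper name:


float4 UnpackHalf4(uint2 packed)
{
  // Each uint holds two half-precision floats; f16tof32 reads the low 16 bits.
  return float4(f16tof32(packed.x), f16tof32(packed.x >> 16),
                f16tof32(packed.y), f16tof32(packed.y >> 16));
}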

Edit: Someone already wrote some util functions. With extreme luck the compiler recognizes the pattern and issues the native instruction instead of lots of bitshifting, masking and multiplication / division. You can at least check the results on GCN hardware using GPUPerfStudio's ShaderAnalyzer to see if the driver does indeed recognize what you're doing (I don't think it will though...).

Edit: Someone already wrote some util functions. With extreme luck the compiler recognizes the pattern and issues the native instruction instead of lots of bitshifting, masking and multiplication / division. You can at least check the results on GCN hardware using GPUPerfStudio's ShaderAnalyzer to see if the driver does indeed recognize what you're doing (I don't think it will though...).

I'm not even sure GCN has an instruction for what he wants to do. The best I can figure, it would be 4 v_cvt_f32_ubyte[0|1|2|3] and then 4 v_mul_f32 by 1/255.0f.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

I'm not even sure GCN has an instruction for what he wants to do. The best I can figure, it would be 4 v_cvt_f32_ubyte[0|1|2|3] and then 4 v_mul_f32 by 1/255.0f.

Maybe yes, maybe not, but my point is that it's still very far from doing 4 loads, 4 bit shifts, 4 'and' masks, 4 conversions to float and then the 1/255 multiply.

Edit: Checked, and you're right about the instructions. "fragCol = unpackUnorm4x8(val);" outputs (irrelevant ISA code stripped):


  v_cvt_f32_ubyte0  v0, s4                                  // 00000000: 7E002204
  v_cvt_f32_ubyte1  v1, s4                                  // 00000004: 7E022404
  v_cvt_f32_ubyte2  v2, s4                                  // 00000008: 7E042604
  v_cvt_f32_ubyte3  v3, s4                                  // 0000000C: 7E062804
  v_mov_b32     v4, 0x3b808081                              // 00000010: 7E0802FF 3B808081
  v_mul_f32     v0, v4, v0                                  // 00000018: 10000104
  v_mul_f32     v1, v1, v4                                  // 0000001C: 10020901
  v_mul_f32     v2, v2, v4                                  // 00000020: 10040902
  v_mul_f32     v3, v3, v4                                  // 00000024: 10060903

Edit 2: Well, that was disappointing. I checked the manual and GCN does have a single instruction for this conversion; if I'm not mistaken it should be:


tbuffer_load_format_xyzw v[0:3], v0, s[4:7], 0 idxen format:[BUF_DATA_FORMAT_8_8_8_8,BUF_NUM_FORMAT_UNORM]

