Hyunkel

Structured buffer float compression

Hyunkel    401
I have a compute shader that generates procedural planetary terrain and stores vertices in a structured buffer with the following layout:
[CODE]
struct PlanetVertex
{
    float3 Position;
    float3 Normal;
    float Temperature;
    float Humidity;
};
[/CODE]

That's 8 floats at 4 bytes per float -> 32 bytes per vertex.
A terrain node or patch contains 33x33 = 1089 vertices, which is 34848 bytes.
At the highest quality setting, the compute shader will output up to 5000 nodes,
so the buffer needs to be 5000 * 34848 bytes, which is
174240000 bytes or ~166 MB.

Due to the way I handle load balancing between rendering and terrain generation, I need to have this buffer in memory twice, so I use
~332 MB only for vertex data.
This is okay I guess, since a planet is sort of the primary object, but I want to reduce this buffer size if possible.

For example, the normal vector doesn't need 32-bit precision per channel; 16 bits would be more than enough.
As for temperature and humidity, they could even fit in an 8-bit unorm, but I doubt that's available here.

I found that there are f32tof16 and f16tof32 functions in HLSL, which I assume are what I need, but I cannot quite figure out how they're supposed to work:
[url="http://msdn.microsoft.com/en-us/library/windows/desktop/ff471399(v=vs.85).aspx"]http://msdn.microsoft.com/en-us/library/windows/desktop/ff471399(v=vs.85).aspx[/url]

It says here that f32tof16 returns a uint, but isn't that 32 bits as well?

Cheers,
Hyu

Hyunkel    401
Oh, alright, now I get it.
So basically I have to run f32tof16 on each of the two floats I want to store in a single uint (with 16-bit precision), but I have to do the packing myself.
[CODE]
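uint Float2ToF1616(in float2 f)
{
    // f32tof16 returns the half-precision bit pattern in the low 16 bits of a uint,
    // so x stays in the low half and y is shifted into the high half.
    return f32tof16(f.x) | (f32tof16(f.y) << 16);
}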
[/CODE]

Thanks! :)
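
For reading the values back, the inverse could look something like the sketch below. It assumes the same bit layout as Float2ToF1616 above (x in the low 16 bits, y in the high 16 bits). The name F1616ToFloat2 is made up for illustration.

[CODE]
// Hypothetical counterpart to Float2ToF1616: unpacks two half-precision
// values from one uint (x in the low 16 bits, y in the high 16 bits).
float2 F1616ToFloat2(in uint packed)
{
    // f16tof32 converts the half stored in the low 16 bits of its argument.
    // The mask and the shift make it explicit which half is being read.
    float x = f16tof32(packed & 0xFFFF);
    float y = f16tof32(packed >> 16);
    return float2(x, y);
}
[/CODE]

A round trip through these two functions is lossy: a 16-bit float keeps a 10-bit mantissa, which is roughly three significant decimal digits.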

Nik02    4348
If you assume that the normal is a normalized vector, you can pack it to a 2-element vector:

n[sub]z[/sub] = 1.0 - n[sub]x[/sub] - n[sub]y[/sub]

Nik02    4348
[quote name='Hyunkel' timestamp='1338893120' post='4946379']
Oh, alright, now I get it.
So basically I have to run f32tof16 on each of the two floats I want to store in a single uint (with 16-bit precision), but I have to do the packing myself.
[CODE]
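uint Float2ToF1616(in float2 f)
{
    // f32tof16 returns the half-precision bit pattern in the low 16 bits of a uint,
    // so x stays in the low half and y is shifted into the high half.
    return f32tof16(f.x) | (f32tof16(f.y) << 16);
}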
[/CODE]

Thanks! :)
[/quote]

Yes

Hyunkel    401
[quote name='Nik02' timestamp='1338893126' post='4946380']
If you assume that the normal is a normalized vector, you can pack it to a 2-element vector:

n[sub]z[/sub] = 1.0 - n[sub]x[/sub] - n[sub]y[/sub]
[/quote]

The only packing algorithms I know of are these:
[url="http://aras-p.info/texts/CompactNormalStorage.html"]http://aras-p.info/texts/CompactNormalStorage.html[/url]

and they are for view space normal vectors mostly.
I don't really see a quick way to do 2 channel packing, though it would be quite useful if I could.

[quote name='Nik02' timestamp='1338893291' post='4946384']
The underlying reason for this is that modern hardware doesn't actually have 16-bit registers.
[/quote]

Yeah, I'm aware, which is why I thought that f32tof16() would need to take 2 floats as input and got confused.

Madhed    4095
[quote name='Nik02' timestamp='1338893126' post='4946380']
If you assume that the normal is a normalized vector, you can pack it to a 2-element vector:

n[sub]z[/sub] = 1.0 - n[sub]x[/sub] - n[sub]y[/sub]
[/quote]

Did you mean n[sub]z[/sub] = sqrt(1.0 - n[sub]x[/sub][sup]2[/sup] - n[sub]y[/sub][sup]2[/sup]) ?

Hyunkel    401
I always assumed that with this method it would be troublesome to reconstruct the correct sign for z.
But yes, I see how you can just store the sign in the x or y channel since both are biased to a positive range.

Now I'm confused though. This seems excessively easy. Why the need for all the much more complicated methods then?
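
The store-the-sign idea described in this post could be sketched roughly as follows. The function names are hypothetical, the input is assumed to be normalized, and the two resulting values are assumed to end up in a signed format (for example two 16-bit floats).

[CODE]
// Hypothetical helpers illustrating the "store the sign of z in x" idea.
float2 EncodeNormalXY(float3 n)
{
    float2 e = n.xy * 0.5 + 0.5;     // bias x and y from [-1, 1] to [0, 1]
    e.x = (n.z < 0.0) ? -e.x : e.x;  // a negative x now means "z was negative"
    return e;
}

float3 DecodeNormalXY(float2 e)
{
    float zSign = (e.x < 0.0) ? -1.0 : 1.0;
    float2 xy = float2(abs(e.x), e.y) * 2.0 - 1.0;  // undo the bias
    // saturate guards against a slightly negative value caused by rounding.
    float z = zSign * sqrt(saturate(1.0 - dot(xy, xy)));
    return float3(xy, z);
}
[/CODE]

If the biased x is exactly 0 (n.x = -1), the sign cannot be carried, but a unit normal has z = 0 there, so nothing is lost. The weak spot is elsewhere: since dz/dx = -x/z, small rounding errors in x and y are amplified in the reconstructed z whenever z is close to 0.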

Nik02    4348
[quote name='Madhed' timestamp='1338896962' post='4946400']
[quote name='Nik02' timestamp='1338893126' post='4946380']
If you assume that the normal is a normalized vector, you can pack it to a 2-element vector:

n[sub]z[/sub] = 1.0 - n[sub]x[/sub] - n[sub]y[/sub]
[/quote]

Did you mean n[sub]z[/sub] = sqrt(1.0 - n[sub]x[/sub][sup]2[/sup] - n[sub]y[/sub][sup]2[/sup]) ?
[/quote]

Yea, sorry for the confusion.

MJP    19756
[quote name='Hyunkel' timestamp='1338897001' post='4946401']
Now I'm confused though. This seems excessively easy. Why the need for all the much more complicated methods then?
[/quote]

Because other methods may be cheaper and/or give you better precision.

Hyunkel    401
I doubt they'll be much cheaper considering the simplicity of the store-the-sign-and-reconstruct-z method.
However, I can see how other methods could provide more precision. In fact, I did notice precision issues using this very method.
Now that I think about it, the method also only works with signed types, since the sign of z has to be stored.

Nik02    4348
The optimal approach depends on the problem you're trying to solve. If lower precision is enough for your use case, a lower-precision encoding could be just fine. If the best quality is needed, then you should use float3 to store the data at full fidelity and accept the bandwidth/storage impact.

View-space normal storage is a different scenario from per-vertex normal storage. A vertex shader is typically invoked far less often than a pixel shader, so more complex calculations can be done there without hitting a bottleneck.

pcmaster    982
Why not use DXGI_FORMAT_R11G11B10_FLOAT? There are plenty of nice formats. Or use a single R32_UINT and pack your normal manually, no big deal. There's no need to ever use the R32G32B32_FLOAT format to transfer normals! If you can have normals in screen space, definitely pack them according to one of the methods linked. I use #4 to my greatest pleasure in screen space (1 float3 to float2, stored as R16G16_FLOAT).

Regarding HLSL, you just write your float, float2, float3, whatever, and the conversion happens automatically depending on the format of the bound target view (RTV, DSV, UAV?). There is no HLSL construct for "half", and there is no need for one.

There is no point in packing data between the shader stages (such as from vertex shader to hull shader). You can happily send e.g. R8_UNORM buffers to the input assembler (or bind them as SRVs) and your shaders see whatever type (such as float) automatically. The same goes for output.

I'd split your struct into separate streams of positions, normals, temperatures etc., with formats R32G32B32_FLOAT, R11G11B10_FLOAT, R8_UNORM, etc., for example.

Ashaman73    13715
Have you considered data alignment?

[CODE]
struct PlanetVertex
{
    float3 Position;
    float3 Normal;
    float Temperature;
    float Humidity;
};
[/CODE]
That's two 4-component vectors. Considering that you want the data aligned, using 16 bits instead of 32 could be your best bet.

Hyunkel    401
[quote name='pcmaster' timestamp='1338984972' post='4946733']
Why not use DXGI_FORMAT_R11G11B10_FLOAT? There are plenty of nice formats. Or use a single R32_UINT and pack your normal manually, no big deal. There's no need to ever use the R32G32B32_FLOAT format to transfer normals! If you can have normals in screen space, definitely pack them according to one of the methods linked. I use #4 to my greatest pleasure in screen space (1 float3 to float2, stored as R16G16_FLOAT).

Regarding HLSL, you just write your float, float2, float3, whatever, and the conversion happens automatically depending on the format of the bound target view (RTV, DSV, UAV?). There is no HLSL construct for "half", and there is no need for one.

There is no point in packing data between the shader stages (such as from vertex shader to hull shader). You can happily send e.g. R8_UNORM buffers to the input assembler (or bind them as SRVs) and your shaders see whatever type (such as float) automatically. The same goes for output.

I'd split your struct into separate streams of positions, normals, temperatures etc., with formats R32G32B32_FLOAT, R11G11B10_FLOAT, R8_UNORM, etc., for example.
[/quote]

I think you misunderstand what I'm doing, though it is perfectly possible that I'm misunderstanding what you're suggesting.
The generated data comes from a compute shader, which is not part of the normal rendering pipeline.
It is stored in a structured buffer, which can only contain HLSL data types (afaik), and as you mentioned, half is not available.

If I were able to use formats such as R11G11B10_FLOAT there would be no issues.
For the view space normals in my geometry buffer I do in fact use an R16G16 format with packing method #4 and it works brilliantly.


[quote name='Ashaman73' timestamp='1338987545' post='4946736']
Have you considered data alignment?

[CODE]
struct PlanetVertex
{
    float3 Position;
    float3 Normal;
    float Temperature;
    float Humidity;
};
[/CODE]
That's two 4-component vectors. Considering that you want the data aligned, using 16 bits instead of 32 could be your best bet.
[/quote]

Well... no. I have not.
Do I have to consider this with structured buffers?
Right now I have this:

[CODE]
struct PlanetVertex
{
    float3 Position;
    uint2 PackedNormalTempHumidity;
};
[/CODE]

With PackedNormalTempHumidity being packed as
X = float2(Normal.x, Normal.y)
Y = float2(Temperature, Humidity)

There are some minor precision issues with the 2-channel normal vector when z is close to 0, though, so I might go for something like this instead:

[CODE]
struct PlanetVertex
{
    float3 Position;
    uint2 PackedNormalTemp;
    uint PackedHumiditySomethingelse;
};
[/CODE]

With a 3-channel normal vector at 16 bits per channel.
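
A minimal sketch of how that second layout could be filled and read back is shown below. The bit assignment in the comments is an assumption (the post does not spell it out), and PackHalf2, UnpackHalf2, PackVertex and UnpackVertex are hypothetical helper names.

[CODE]
// Assumed bit layout (not confirmed by the post):
//   PackedNormalTemp.x          = Normal.x | (Normal.y    << 16)
//   PackedNormalTemp.y          = Normal.z | (Temperature << 16)
//   PackedHumiditySomethingelse = Humidity | (16 spare bits << 16)
struct PlanetVertex
{
    float3 Position;
    uint2 PackedNormalTemp;
    uint PackedHumiditySomethingelse;
};

uint PackHalf2(float2 f)
{
    return f32tof16(f.x) | (f32tof16(f.y) << 16);
}

float2 UnpackHalf2(uint p)
{
    return float2(f16tof32(p & 0xFFFF), f16tof32(p >> 16));
}

PlanetVertex PackVertex(float3 position, float3 normal, float temperature, float humidity)
{
    PlanetVertex v;
    v.Position = position;
    v.PackedNormalTemp.x = PackHalf2(normal.xy);
    v.PackedNormalTemp.y = PackHalf2(float2(normal.z, temperature));
    v.PackedHumiditySomethingelse = PackHalf2(float2(humidity, 0.0)); // high half unused
    return v;
}

void UnpackVertex(PlanetVertex v, out float3 normal, out float temperature, out float humidity)
{
    float2 a = UnpackHalf2(v.PackedNormalTemp.x);
    float2 b = UnpackHalf2(v.PackedNormalTemp.y);
    normal = normalize(float3(a, b.x)); // renormalize after the lossy round trip
    temperature = b.y;
    humidity = UnpackHalf2(v.PackedHumiditySomethingelse).x;
}
[/CODE]

With this layout the stride is 12 + 8 + 4 = 24 bytes per vertex, and 16 bits are left over for whatever "something else" turns out to be.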

MJP    19756
[quote name='Hyunkel' timestamp='1338996087' post='4946776']
Well... no. I have not.
Do I have to consider this with structured buffers?
[/quote]

No, you don't. It's totally legal to access structures with a stride that's not a multiple of 16 bytes.

pcmaster    982
Hyunkel, yes, I'm familiar with DX11 compute shaders. You don't necessarily need to use StructuredBuffer UAV. You can use several UAVs as outputs of your compute shaders. So instead of a stream (array) of packed interleaved struct data, you might have streams (arrays) of individual struct members. Instead of 1 RWStructuredBuffer, you'd have 4 RWBuffers as targets of your compute shader. The main disadvantage I see is that you use 4 target slots instead of 1 (there should always be at least 8 supported, if I recall well). I believe you can have texture/buffer UAVs as well in cs_5_0 (unlike cs_4_1) but I've actually used RWStructuredBuffer just like you.
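
As a rough illustration of the several-RWBuffers idea, the shader side might look like the sketch below. All resource and entry point names are hypothetical and the values written are placeholders. The formats in the comments are only examples, and using a 4-component format for positions is an assumption made here to stay on a format commonly usable for typed UAV writes.

[CODE]
// The element formats are not declared in HLSL; they come from the DXGI
// format of each UAV created on the API side.
RWBuffer<float4> OutPositions    : register(u0); // e.g. R32G32B32A32_FLOAT (assumption)
RWBuffer<float3> OutNormals      : register(u1); // e.g. R11G11B10_FLOAT
RWBuffer<float>  OutTemperatures : register(u2); // e.g. R8_UNORM
RWBuffer<float>  OutHumidities   : register(u3); // e.g. R8_UNORM

[numthreads(64, 1, 1)]
void WriteVertexStreamsCS(uint3 id : SV_DispatchThreadID)
{
    // Placeholder values; a real shader would generate these procedurally.
    float3 position    = float3(0.0, 0.0, 0.0);
    float3 normal      = float3(0.0, 1.0, 0.0);
    float  temperature = 0.5;
    float  humidity    = 0.5;

    OutPositions[id.x] = float4(position, 1.0);
    // R11G11B10_FLOAT has no sign bits, so bias the normal into [0, 1].
    OutNormals[id.x] = normal * 0.5 + 0.5;
    // UNORM formats expect values in [0, 1]; the conversion happens on write.
    OutTemperatures[id.x] = saturate(temperature);
    OutHumidities[id.x] = saturate(humidity);
}
[/CODE]

The reading side would then undo the bias (n * 2 - 1, followed by normalize). Which formats can be written through a typed UAV depends on the hardware, so it is worth confirming with ID3D11Device::CheckFormatSupport and the D3D11_FORMAT_SUPPORT_TYPED_UNORDERED_ACCESS_VIEW flag before committing to a layout.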

Hyunkel    401
[quote name='MJP' timestamp='1339044850' post='4946949']
[quote name='Hyunkel' timestamp='1338996087' post='4946776']
Well... no. I have not.
Do I have to consider this with structured buffers?
[/quote]

No, you don't. It's totally legal to access structures with a stride that's not a multiple of 16 bytes.
[/quote]
Thanks.

[quote name='pcmaster' timestamp='1339053221' post='4946972']
Hyunkel, yes, I'm familiar with DX11 compute shaders. You don't necessarily need to use StructuredBuffer UAV. You can use several UAVs as outputs of your compute shaders. So instead of a stream (array) of packed interleaved struct data, you might have streams (arrays) of individual struct members. Instead of 1 RWStructuredBuffer, you'd have 4 RWBuffers as targets of your compute shader. The main disadvantage I see is that you use 4 target slots instead of 1 (there should always be at least 8 supported, if I recall well). I believe you can have texture/buffer UAVs as well in cs_5_0 (unlike cs_4_1) but I've actually used RWStructuredBuffer just like you.
[/quote]
Oh, you can use the standard storage formats with RWBuffers?
For some reason I never thought of that.
That does make things a lot easier indeed.
I'll have to see if that performs just as well, but I see no reason why it wouldn't, and the memory footprint should obviously be quite a bit lower.

Thanks for pointing this out! :)
