# DX11 Normalized (Unsigned) Integers vs Floats as Vertex Data

## Introduction:

In general my questions pertain to the differences between floating- and fixed-point data. Additionally I would like to understand when it can be advantageous to prefer fixed-point representation over floating-point representation in the context of vertex data and how the hardware deals with the different data-types. I believe I should be able to reduce the amount of data (bytes) necessary per vertex by choosing the most opportune representations for my vertex attributes. Thanks ahead of time if you, the reader, are considering the effort of reading this and helping me.

I found an old topic that shows this is possible in principle, but I am not sure I understand what the pitfalls are when using fixed-point representation and whether there are any hardware-based performance advantages/disadvantages.

(TLDR at bottom)

## The Actual Post:

To my understanding HLSL/D3D11 offers not just the traditional floating-point model in half-, single-, and double-precision, but also the fixed-point model in the form of signed/unsigned normalized integers in 8-, 10-, 16-, 24-, and 32-bit variants. Both models offer a finite sequence of "grid-points". The obvious difference between the two models is that the fixed-point model offers a constant spacing between values in the normalized range of [0,1] or [-1,1], while the floating-point model allows for smaller "deltas" as you get closer to 0, and larger "deltas" the further you are away from 0.
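To put rough numbers on that difference, here is a minimal C++ sketch. The function names `floatSpacingAt` and `unormSpacing` are made up for illustration and are not part of any D3D API; the sketch only measures the gap between neighbouring representable values.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from x to the next representable 32-bit float above it.
// This gap shrinks as x approaches 0 and grows as x moves away from 0.
float floatSpacingAt(float x)
{
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Constant distance between neighbouring values of an n-bit UNORM,
// which divides [0, 1] into (2^n - 1) equal steps.
double unormSpacing(unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return 1.0 / (std::ldexp(1.0, static_cast<int>(bits)) - 1.0);
}
```

For example, `unormSpacing(16)` is 1/65535 (about 0.000015) everywhere in [0,1], while `floatSpacingAt(1.0f)` is 2^-23 and `floatSpacingAt(0.25f)` is smaller still. So the interesting comparison is between formats of the same size: a half-precision float has a spacing of 2^-11 (about 0.00049) just below 1, which is coarser than a 16-bit UNORM at that end of the range.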

To add some context, let me define a struct as an example:

struct VertexData
{
float position[3]; // 3 x 32 bits
float texCoord[2]; // 2 x 32 bits
float normals[3]; // 3 x 32 bits
}; // Total of 32 bytes

Every vertex gets a position, a coordinate on my texture, and a normal to do some light calculations. In this case we have 8x32=256bits per vertex. Since the texture coordinates lie in the interval [0,1] and the normal vector components are in the interval [-1,1] it would seem useful to use normalized representation as suggested in the topic linked at the top of the post. The texture coordinates might as well be represented in a fixed-point model, because it seems most useful to be able to sample the texture in a uniform manner, as the pixels don't get any "denser" as we get closer to 0. In other words the "delta" does not need to become any smaller as the texture coordinates approach (0,0). A similar argument can be made for the normal-vector, as a normal vector should be normalized anyway, and we want as many points as possible on the sphere around (0,0,0) with a radius of 1, and we don't care about precision around the origin. Even if we have large textures such as 4k by 4k (or the maximum allowed by D3D11, 16k by 16k) we only need as many grid-points on one axis, as there are pixels on one axis. An unsigned normalized 14 bit integer would be ideal, but because it is both unsupported and impractical, we will stick to an unsigned normalized 16 bit integer. The same type should take care of the normal vector coordinates, and might even be a bit overkill.
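The claim that 16 bits are enough to address every texel can be checked with a small C++ sketch. It assumes the usual round-to-nearest encoding (multiply by 65535 and round); `encodeUnorm16`, `decodeUnorm16`, and `allTexelCentresSurvive` are hypothetical helper names, not D3D functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Round a coordinate in [0, 1] to the nearest 16-bit UNORM code.
std::uint16_t encodeUnorm16(float u)
{
    assert(u >= 0.0f && u <= 1.0f);
    return static_cast<std::uint16_t>(std::lround(u * 65535.0f));
}

// The float value that corresponds to a stored 16-bit UNORM code.
float decodeUnorm16(std::uint16_t code)
{
    return static_cast<float>(code) / 65535.0f;
}

// True if the centre of every texel along one axis still falls inside
// the same texel after a UNORM16 round trip.
bool allTexelCentresSurvive(unsigned textureSize)
{
    for (unsigned i = 0; i < textureSize; ++i)
    {
        const float centre = (static_cast<float>(i) + 0.5f) / static_cast<float>(textureSize);
        const float restored = decodeUnorm16(encodeUnorm16(centre));
        if (static_cast<unsigned>(restored * static_cast<float>(textureSize)) != i)
            return false;
    }
    return true;
}
```

For a 16384-texel axis the worst-case rounding error is half a grid step, 0.5/65535 of the texture, which is about an eighth of a texel. Whether that is fine enough for filtered sampling or for coordinates that tile outside [0,1] is a separate question this sketch does not answer.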

struct VertexData
{
float position[3]; // 3 x 32 bits
uint16_t texCoord[2]; // 2 x 16 bits
uint16_t normals[3]; // 3 x 16 bits
}; // Total of 22 bytes

Seems like a good start, and we might even be able to take it further, but before we pursue that path, here is my first question: can the GPU even work with the data in this format, or is all I have accomplished minimizing CPU-side RAM usage? Does the GPU have to convert the texture coordinates back to a floating-point model when I hand them over to the sampler in my pixel shader? I have looked up the data types for HLSL and I am not sure I even comprehend how to declare the vertex input type in HLSL. Would the following work?

struct VertexInputType
{
float3 pos; //this one is obvious
unorm half2 tex; //half corresponds to a 16-bit float, so I assume this is wrong, but this is the only 16-bit type I found on the linked MSDN site
snorm half3 normal; //same as above
}

I assume this is possible somehow, as I have found input element formats such as: DXGI_FORMAT_R16G16B16A16_SNORM and DXGI_FORMAT_R16G16B16A16_UNORM (also available with a different number of components, as well as different component lengths). I might have to avoid 3-component vectors because there is no 3-component 16-bit input element format, but that is the least of my worries. The next question would be: what happens with my normals if I try to do lighting calculations with them in such a normalized-fixed-point format? Is there no issue as long as I take care not to mix floating- and fixed-point data? Or would that work as well? In general this gives rise to the question: how does the GPU handle fixed-point arithmetic? Is it the same as integer-arithmetic, and/or is it faster/slower than floating-point arithmetic?

Assuming that we still have a valid and useful VertexData format, how far could I take this while remaining on the sensible side of what could be called optimization? Theoretically I could use an input element format such as DXGI_FORMAT_R10G10B10A2_UNORM to pack my normal coordinates into a 10-bit fixed-point format, and my vertices (in object space) might even be representable in a 16-bit unsigned normalized fixed-point format. That way I could end up with something like the following struct:

struct VertexData
{
uint16_t pos[3]; // 3 x 16 bits
uint16_t texCoord[2]; // 2 x 16 bits
uint32_t packedNormals; // 10 + 10 + 10 + 2 bits
}; // Total of 14 bytes

Could I use a vertex structure like this without too much performance loss on the GPU side? If the GPU has to execute some sort of unpacking algorithm in the background I might as well let it be. In the end I have a functioning deferred renderer, but I would like to reduce the memory footprint of the huge number of vertices involved in rendering my landscape.
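As a rough illustration of what filling `packedNormals` could look like on the CPU, here is a C++ sketch. It assumes the DXGI_FORMAT_R10G10B10A2_UNORM layout with R in the lowest 10 bits, and because the format is unsigned it also assumes each component is remapped from [-1,1] to [0,1] before encoding (the shader would then undo that with `n * 2 - 1`). All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map one normal component from [-1, 1] to a 10-bit UNORM code (0..1023).
std::uint32_t encodeBiasedUnorm10(float v)
{
    assert(v >= -1.0f && v <= 1.0f);
    return static_cast<std::uint32_t>(std::lround((v * 0.5f + 0.5f) * 1023.0f));
}

// Pack x, y, z into bits 0-9, 10-19 and 20-29; the 2-bit alpha stays 0.
std::uint32_t packNormal10_10_10_2(float x, float y, float z)
{
    return encodeBiasedUnorm10(x)
         | (encodeBiasedUnorm10(y) << 10)
         | (encodeBiasedUnorm10(z) << 20);
}

// CPU-side mirror of the decode plus the [0, 1] -> [-1, 1] remap,
// handy for checking how much precision survives.
float unpackNormalComponent(std::uint32_t packed, unsigned index)
{
    assert(index < 3);
    const std::uint32_t code = (packed >> (10u * index)) & 0x3FFu;
    return static_cast<float>(code) / 1023.0f * 2.0f - 1.0f;
}
```

With this remap the step between neighbouring values is 2/1023, so each component can be off by up to roughly 0.001, and 0 itself is not exactly representable. Whether that is acceptable for lighting is a judgment call for the content at hand.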

TLDR: I have a lot of vertices that I need to render and I want to reduce the RAM usage without introducing crazy compression/decompression algorithms to the CPU or GPU. I am hoping to find a solution by involving fixed-point data types, but I am not exactly sure how that would work.

Edited by chiffre

##### Share on other sites

So you have a lot of questions, which is understandable because this is all fairly complex and confusing. I think the first thing you should probably know is that DirectX shader programs generally only work in terms of 32-bit datatypes. So when you write HLSL code to add two values or transform a vector by a matrix, the actual ALU operations are mostly going to be working with 32-bit floating point values or 32-bit integers. These are the only data formats that are guaranteed to be supported by the hardware in your shader programs, and most games/programs out there work exclusively with 32-bit operations. There are ways to work with operations that run at lower-than-32-bit precision, as well as the "double" type that can be used to access double-precision operations. However, both of these are optional and are only supported by certain GPUs. For most of the DX10 and DX11 era, desktop GPUs only supported 32-bit operations, but recent AMD and Nvidia GPUs have started adding support for fp16 operations. Support for double-precision can be rather patchy, since it's generally only intended for high-end compute usage rather than 3D graphics.

Unsigned normalized integer; which is interpreted in a resource as an unsigned integer, and **is interpreted in a shader as an unsigned normalized floating-point value in the range [0, 1]**. All 0's maps to 0.0f, and all 1's maps to 1.0f. A sequence of evenly spaced floating-point values from 0.0f to 1.0f are represented. For instance, a 2-bit UNORM represents 0.0f, 1/3, 2/3, and 1.0f.

The part that I bolded for emphasis is the important part: it's basically saying that the shader will see a floating point value in the range [0, 1], not an integer value. This lets you write shader code like this:

Texture2D<float4> ColorTexture;
SamplerState ColorSampler;

cbuffer Constants
{
float3 LightDir;
float3 LightColor;
}

float4 PSMain(in float2 uv : UV, in float3 normal : NORMAL) : SV_Target0
{
float3 surfaceColor = ColorTexture.Sample(ColorSampler, uv).xyz;
float3 lighting = surfaceColor * saturate(dot(normal, LightDir)) * LightColor;
return float4(lighting, 1.0f);
}

So the shader is just working entirely in floats, and doesn't care about the data format of the texture. The texture might be R8G8B8A8_UNORM, it might be R16G16B16A16_FLOAT, or it might be BC1_UNORM. It doesn't really matter to the shader, as long as the format is ultimately interpreted as a float. Hence you don't see any unpacking or conversion code here, we just go right to multiplying the texture sample with the lighting. Generally the only time you do care is if you use a UINT or SINT format, since those require decorating the texture with <uint> or <int>. In those cases the integer values will be cast to a 32-bit int type on read, which lets the shader work with them using 32-bit operations.

On a similar note, we can have data conversion happening when the pixel shader's output value gets written to a render target texture. The pixel shader outputs a float4, but the render target might be using R16G16B16A16_FLOAT, which is a common format for HDR rendering. If that's the case, the pixel shader output value will automatically get converted to that format on write, again without the shader really knowing or caring.

The same concepts extend to a few other places in the pipeline. Shaders can read from the "Buffer<>" type in HLSL, which maps to a shader resource view with D3D11_SRV_DIMENSION_BUFFER. This kind of buffer also uses a DXGI_FORMAT, and supports automatic conversion from the packed format to 32-bit values. RWTexture and RWBuffer UAV's can also perform format conversion on write, and possibly also on load depending on the GPU and OS that you're using.

The other place where format conversion is commonly used is the input assembler (IA), which is responsible for reading from your vertex and index buffers and passing the vertex data to your vertex shader. For the formats that the IA supports, conversion works the same way it does for textures: transparently. Your HLSL code just needs to declare the input variable according to the way that the format is interpreted as specified in the docs, which means float for FLOAT/UNORM/SNORM, int for SINT, and uint for UINT formats. So if you have this as your VS input struct:

struct VertexInput
{
float3 pos : POSITION; // semantic names must match your input layout
float2 tex : UV;
float3 normal : NORMAL;
};

Those 3 attributes could be using any of the FLOAT, UNORM, or SNORM formats in the vertex stream. You don't need to change your shader declaration if you change the vertex packing, as long as you're continuing to use formats with one of those 3 modifiers. This means that you can totally use lower-than-32-bit packing for your vertex attributes if you want to save on memory and bandwidth, and you absolutely should do this! You just have to make sure that the precision is adequate for your use cases.
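One way to check whether the precision is adequate is to round-trip some representative data on the CPU and measure the error. The C++ sketch below does this for a unit normal stored as 16-bit SNORM. It assumes the usual conversion rule (clamp, scale by 32767, round to nearest, and treat -32768 as -1.0); the names `encodeSnorm16`, `decodeSnorm16`, and `snorm16AngularError` are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Double3 { double x, y, z; };

// Clamp to [-1, 1], scale by 32767 and round to the nearest integer.
std::int16_t encodeSnorm16(double v)
{
    v = std::max(-1.0, std::min(1.0, v));
    return static_cast<std::int16_t>(std::lround(v * 32767.0));
}

// Both -32768 and -32767 decode to -1.0.
double decodeSnorm16(std::int16_t code)
{
    return std::max(static_cast<double>(code) / 32767.0, -1.0);
}

// Angle in radians between a unit normal and its SNORM16 round trip.
// atan2 of |cross| and dot stays accurate for very small angles.
double snorm16AngularError(Double3 n)
{
    assert(std::abs(n.x * n.x + n.y * n.y + n.z * n.z - 1.0) < 1e-6);
    const Double3 q{ decodeSnorm16(encodeSnorm16(n.x)),
                     decodeSnorm16(encodeSnorm16(n.y)),
                     decodeSnorm16(encodeSnorm16(n.z)) };
    const Double3 c{ n.y * q.z - n.z * q.y,
                     n.z * q.x - n.x * q.z,
                     n.x * q.y - n.y * q.x };
    const double sinPart = std::sqrt(c.x * c.x + c.y * c.y + c.z * c.z);
    const double cosPart = n.x * q.x + n.y * q.y + n.z * q.z;
    return std::atan2(sinPart, cosPart);
}
```

Running something like this over the normals of a real mesh gives a concrete worst-case angle to compare against what the lighting can tolerate; the same pattern should work for other attributes and bit depths.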

As for how the conversion happens, it's really up to the hardware. Texture format conversion is generally handled by dedicated hardware in the texture units, which is necessary because unpacking (and possibly decompression from a BC format) needs to happen before filtering can occur. On the flip side, some hardware (particularly mobile hardware) will patch code into the pixel shader to handle conversion from 32-bit floats to the render target format. For the IA, it depends. Some GPUs still have dedicated hardware and caches for fetching from vertex buffers. Some don't have any hardware for this, and instead will patch in a preamble to your vertex shader that reads and unpacks the vertex data. But even in those cases the decode cost tends to be pretty small compared to the cost of additional memory access, so tightly-packed data is most likely still going to be a win.

##### Share on other sites

Thanks so much for this post. The information is very valuable to me, even if I only pursue D3D11 projects as a hobby (to learn C++ in the process of writing a little game engine).

I hope I didn't, quite literally, ask for too much here.

Quick edit: in the second third of your post you mention this:

8 hours ago, MJP said:

The same concepts extend to a few other places in the pipeline. Shaders can read from the "Buffer<>" type in HLSL, which maps to a shader resource view with D3D11_SRV_DIMENSION_BUFFER. This kind of buffer also uses a DXGI_FORMAT, and supports automatic conversion from the packed format to 32-bit values. RWTexture and RWBuffer UAV's can also perform format conversion on write, and possibly also on load depending on the GPU and OS that you're using.

I have tried out rendering with structured buffers by pulling vertices directly from the buffer with SV_VertexID, and while I couldn't find a performance difference between rendering from structured buffers and the standard method (D3D11_BIND_SHADER_RESOURCE vs D3D11_BIND_VERTEX_BUFFER etc.), I am curious whether I understand you correctly here, as I think UAVs and structured buffers are similar and this info could be quite relevant to me. What I understand is: it is not a given that every DXGI_FORMAT supported by the IA (or by the equivalent preamble in the shader code) for traditional vertex-buffer usage (D3D11_BIND_VERTEX_BUFFER) is also supported when loading data from structured buffers or UAVs in the shader.

Edited by chiffre

##### Share on other sites

StructuredBuffer and ByteAddressBuffer have no format conversion applied to them on read. If you would like to do any packing of lower-precision values in there, you have to unpack them yourself in the shader program. Here's an example of what that might look like:

struct Vertex
{
float3 Position;    // 3x32bit
uint UV;            // 2x16bit UNORM
uint2 Normal;       // 4x16bit SNORM
uint2 Color;        // 4x16bit FLOAT
};

StructuredBuffer<Vertex> Vertices;

float U16NToF32(in uint val)
{
return saturate(val / 65535.0f);
}

float2 S16NToF32_2(in uint val)
{
int2 signextended;
signextended.x = (int)(val << 16) >> 16;
signextended.y = (int)(val & 0xFFFF0000) >> 16;
return max(float2(signextended) / 32767.0f, -1.0f);
}

VSOutput VSMain(in uint VertexID : SV_VertexID)
{
Vertex vertex = Vertices[VertexID];
float3 position = vertex.Position;

// Unpack by converting to full 32-bit values
float2 uv = float2(U16NToF32(vertex.UV & 0xFFFF), U16NToF32(vertex.UV >> 16));
float4 normal = float4(S16NToF32_2(vertex.Normal.x), S16NToF32_2(vertex.Normal.y));
float4 Color = float4(f16tof32(vertex.Color.x), f16tof32(vertex.Color.x >> 16),
f16tof32(vertex.Color.y), f16tof32(vertex.Color.y >> 16));

...
}

The same goes for UAV's. RWStructuredBuffer and RWByteAddressBuffer have no format conversion applied on write, but RWBuffer<> and RWTexture do.
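For completeness, here is a hedged C++ sketch of the CPU side that would produce the 32-bit words the shader above unpacks. It assumes the same layout the shader code implies (first value in the low 16 bits, second in the high 16 bits); the helper names are hypothetical, and the decode functions only mirror the shader so the packing can be tested without a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Low 16 bits hold the first value, high 16 bits the second.
std::uint32_t packHalves(std::uint16_t low, std::uint16_t high)
{
    return static_cast<std::uint32_t>(low)
         | (static_cast<std::uint32_t>(high) << 16);
}

// Extract 16-bit half `index` (0 = low, 1 = high) as a signed value.
// The uint16 -> int16 cast wraps on mainstream compilers and is
// guaranteed to do so from C++20 on.
std::int32_t signExtendHalf(std::uint32_t word, unsigned index)
{
    assert(index < 2);
    const std::uint16_t bits = static_cast<std::uint16_t>(word >> (16u * index));
    return static_cast<std::int16_t>(bits);
}

// Mirror of the shader's SNORM16 decode for one half of the word.
float decodeSnorm16Half(std::uint32_t word, unsigned index)
{
    return std::max(static_cast<float>(signExtendHalf(word, index)) / 32767.0f, -1.0f);
}
```

A round trip such as `decodeSnorm16Half(packHalves(a, b), 0)` should reproduce what the vertex shader computes for the same word, which makes it easier to catch a swapped half or a missing sign extension before the data reaches the GPU.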
