Storing a float as two half floats (GLSL 3.30)

I would like to encode a single precision (32-bit) float in two channels of an RGBA16F framebuffer texture. I've written some code which I believe should work, but I have a few questions that I'm having trouble finding clear answers to.

First, the code. How it works is explained in the comments:

[source lang="plain"]
// we want to store a float in two half floats using some bit hackery
// first we interpret as integers so we can work with the bits
uint bits = floatBitsToUint( normalZDepth );
uvec2 parts = uvec2(
    bits >> 16,             // the upper 16 bits
    bits & 0x0000ffffu );   // the lower 16 bits
// each component's lower 16 bits now hold one half (upper or lower) of the original value's bits
// we want these bits to remain the same when put in 16-bit floats.
// we do this by putting these into normal 32-bit floats such that when these
// 32-bit floats are converted into 16-bit floats, the important bits will be all that remain
// 32-bit float: [ 1 (sign) | 8 (exponent) | 23 (mantissa) ]
// 16-bit float: [ 1 (sign) | 5 (exponent) | 10 (mantissa) ]
// when converting float to half:
// bit 31 (sign) moves to bit 15
// bits 23-30 (exponent) will be truncated such that bits 23-27 move to bits 10-14
// bits 0-22 (mantissa) will be truncated such that bits 13-22 move to bits 0-9
// therefore we construct the following integer to be cast back to float:
// position: [31] [30-28] [27-23] [22-13] [12-0 ]
// bits: [15] ...0... [14-10] [ 9-0 ] ...0...
// combining the contiguous portion of the exponent and mantissa we get:
// position: [31] [30-28] [27-13] [12-0 ]
// bits: [15] ...0... [14-0 ] ...0...
// so the final result is that we shift bit 15 up by 16 to bit 31, and bits 0-14 up by 13 to bits 13-27
uvec2 floatBits = ((parts & 0x8000u) << 16) | ((parts & 0x7FFFu) << 13);
// now just interpret as float - ready to be stored as half floats
vec2 halfBits = uintBitsToFloat( floatBits );

[/source]

1) I'm only guessing that this is how float-to-half conversion works, by truncating the upper bits of the exponent and the lower bits of the mantissa. Though now that I think about it, since the exponent is biased, this isn't quite right (but that should be easy to fix). Could someone explain how this conversion actually occurs? Are there rounding issues that could come up? (The rest of the bits in the mantissa are 0.) Is there an IEEE 754 standard for this conversion, and if so, is GLSL 3.30 required to follow it?

2) Suppose one of the integers ended up holding the value 0111110000000000. Then the resulting (32-bit) intermediate float would be:
0 00011111 00000000000000000000000
which is a "valid" float (not NaN, etc.). But when converting to a 16-bit float, it would become 0 11111 0000000000, which is NaN. (Again I realize this is slightly wrong since I forgot to take the exponent bias into account, but you could easily construct an equivalent example with bias.) This is actually the behavior I desire, since it preserves the correct bits - and I would want other "harmless" bit patterns in the 32-bit float to convert to potentially "bad" values in the 16-bit float, such as underflow, overflow, infinity, denormalization, etc. However, I suspect this is not the case, since, for example, it would not make much sense for a "valid" 32-bit float to become an unrelated error just because the bits happened to line up in that way. So again, I guess this is the same as the first question - how exactly is the float to half conversion performed?

Alternatively, is there some way of telling GLSL to stick a particular set of bits into a half float output, rather than specifying a full float and hoping GLSL converts to half the way I want it to?

Or, worded differently: is there a way to guarantee a one-to-one mapping between half-float bits and 16 bits of an integer?
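To make question 1 a bit more concrete, here is the conversion I imagine an IEEE 754 style implementation performing. This is a rough, untested sketch (would-be denormal results are flushed to zero and NaN payloads are not preserved), and whether GLSL/the hardware actually behaves like this is exactly what I'm asking:

[source lang="plain"]
// rough sketch of a round-to-nearest-even float -> half-bits conversion
// (would-be denormal results flushed to zero, NaN payload not preserved)
uint floatToHalfBits( float f )
{
    uint x    = floatBitsToUint( f );
    uint sign = (x >> 16u) & 0x8000u;
    uint e    = (x >> 23u) & 0xFFu;    // biased 8-bit exponent
    uint m    = x & 0x007FFFFFu;       // 23-bit mantissa

    if (e == 255u)                     // Inf / NaN
        return sign | 0x7C00u | ((m != 0u) ? 0x0200u : 0u);
    if (e > 142u)                      // too large for a half -> Inf
        return sign | 0x7C00u;
    if (e < 113u)                      // too small for a normal half -> zero
        return sign;

    // normal case: re-bias the exponent (127 -> 15) and keep the top 10 mantissa bits
    uint h    = sign | ((e - 112u) << 10u) | (m >> 13u);
    uint rest = m & 0x1FFFu;           // the 13 mantissa bits that get dropped
    if (rest > 0x1000u || (rest == 0x1000u && (h & 1u) != 0u))
        h += 1u;                       // round to nearest even; a carry into the exponent is fine
    return h;
}
[/source]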
Sorry, can't answer your question, but I'm curious: why do you want to do it this way?

For one, we have 32F textures. If you want to encode the depth, 24 bits (2 x 12-bit mantissa) is sufficient, at least for me. And if you want to keep a 32-bit float while supporting older hardware, have you considered that older hardware might have (performance) issues with integer calculations?
I'm using RGBA16F textures for my framebuffer already for HDR, and I'm trying to pack normal/depth into one texture. Right now I'm putting nx and ny into r and g, and it would be nice if I could put (linear depth)*(sign(nz)) into b and a.
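For context, the unpacking I have in mind on the reading side would look roughly like this (untested sketch; gbuffer, uv and reconstructFloat are placeholder names, the last one standing for whatever decode comes out of this thread):

[source lang="plain"]
// untested sketch of unpacking this G-buffer layout
// 'gbuffer' is the RGBA16F texture (a sampler2D), 'uv' the texture coordinate,
// 'reconstructFloat' is a placeholder for the decode of the b/a packing
vec4 g            = texture( gbuffer, uv );
float signedDepth = reconstructFloat( g.ba );   // (linear depth) * sign(nz)
float linearDepth = abs( signedDepth );
vec2 nxy          = g.rg;
// the normal is unit length, so nz can be rebuilt up to its sign,
// and that sign was folded into the stored depth
float nz    = sign( signedDepth ) * sqrt( max( 0.0, 1.0 - dot( nxy, nxy ) ) );
vec3 normal = vec3( nxy, nz );
[/source]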

EDIT: I found a paper describing how one would deal with the special-value cases when converting to/from 16-bit floats. Using that information, I modified my code (and also fixed the bias issue):
[source lang="plain"]
// first we interpret as integers so we can work with the bits
uint bits = floatBitsToUint( normalZDepth );
uvec2 parts = uvec2(
    bits >> 16,             // the upper 16 bits
    bits & 0x0000ffffu );   // the lower 16 bits

// each component's lower 16 bits now hold one half (upper or lower) of the original value's bits
// we want these bits to remain the same when put in 16-bit floats.
// we do this by putting these into normal 32-bit floats such that when these
// 32-bit floats are converted into 16-bit floats, the important bits will be all that remain

// 32-bit float: [ 1 (sign) | 8 (exponent) | 23 (mantissa) ] bias = 127
// 16-bit float: [ 1 (sign) | 5 (exponent) | 10 (mantissa) ] bias = 15
// the full conversion is:
// int16 ==> float ==> half ==> float ==> int16
// therefore, we must ensure that the set of "important" bits in each representation remains unchanged
// the following cases can occur:
// int16 ==> float ==> half ==> float ==> int16
// inf/NaN: s 11111 mmmmmmmmmm ==> s 11111111 mmmmmmmmmm0000000000000 ==> s 11111 mmmmmmmmmm ==> s 11111111 mmmmmmmmmm0000000000000 ==> s 11111 mmmmmmmmmm
// zero/denorm: s 00000 mmmmmmmmmm ==> s 00000000 mmmmmmmmmm0000000000000 ==> s 00000 mmmmmmmmmm ==> s 00000000 mmmmmmmmmm0000000000000 ==> s 00000 mmmmmmmmmm
// normal: s eeeee mmmmmmmmmm ==> s EEEEEEEE mmmmmmmmmm0000000000000 ==> s eeeee mmmmmmmmmm ==> s EEEEEEEE mmmmmmmmmm0000000000000 ==> s eeeee mmmmmmmmmm
// note that in the normal case, the exponent is always in the range [-14,15]
// (EEEEEEEE is the same exponent re-biased for 32-bit floats; its unbiased value is still in the proper range [-14,15], so it fits)

// in all cases, the sign bit and the mantissa remain unmodified
// therefore, we first copy them directly over
uvec2 floatBits = ((parts & 0x8000u) << 16) | ((parts & 0x03FFu) << 13);

// now we deal with the different cases (0x7C00 is the 16-bit exponent mask, 0x7F800000 is the 32-bit exponent mask)
uvec2 exp16 = parts & 0x7C00u;
// simplified ((exp16 >> 10) - 15 + 127) << 23
uvec2 exp32 = (exp16 + 0x1C000u) << 13;
bvec2 expIs00000 = equal( exp16, uvec2( 0u, 0u ) );
bvec2 expIs11111 = equal( exp16, uvec2( 0x7C00u, 0x7C00u ) );
if (expIs00000.x)
    exp32.x = 0u;
if (expIs00000.y)
    exp32.y = 0u;
if (expIs11111.x)
    exp32.x = 0x7F800000u;
if (expIs11111.y)
    exp32.y = 0x7F800000u;

floatBits |= exp32;

// now just interpret as float - ready to be stored as half floats
vec2 halfBits = uintBitsToFloat( floatBits );
[/source]
Untested, but in theory this should work (but of course, only if hardware actually abides by these rules). Does the OpenGL 3.3 spec define how these conversions should occur? Currently trying to find out...
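For completeness, here is the matching decode I would run after reading the two channels back (also untested; halfValues stands for the two sampled channels, and this assumes the half round trip really is bit-exact for every row of the table above, which I'm not yet sure of - the zero/denorm row in particular, since a 32-bit float with a zero exponent field is far smaller in value than the corresponding 16-bit denormal):

[source lang="plain"]
// untested decode sketch, reversing the encode above
// 'halfValues' = the two channels read back from the RGBA16F target (vec2)
uvec2 fb = floatBitsToUint( halfValues );

// sign (bit 31 -> bit 15) and mantissa (bits 13-22 -> bits 0-9)
uvec2 parts = ((fb >> 16) & 0x8000u) | ((fb >> 13) & 0x03FFu);

// exponent: undo the bias adjustment, special-casing all-zeros and all-ones
uvec2 exp32 = fb & 0x7F800000u;
uvec2 exp16 = (exp32 >> 13) - 0x1C000u; // ((E - 127 + 15) << 10); wraps for the special cases handled below
if (exp32.x == 0u)          exp16.x = 0u;
if (exp32.y == 0u)          exp16.y = 0u;
if (exp32.x == 0x7F800000u) exp16.x = 0x7C00u;
if (exp32.y == 0x7F800000u) exp16.y = 0x7C00u;
parts |= exp16;

// reassemble the original 32-bit pattern (x held the upper word, y the lower)
uint bits = (parts.x << 16) | parts.y;
float normalZDepth = uintBitsToFloat( bits );
[/source]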
Not really what you asked, but I'll say it anyway: I would just use two textures (RG16F + R32F).

Edit: if you went with RGBA16F because your framebuffer is 64-bit and you heard you cannot mix differently sized color attachments - that restriction was lifted ages ago (iirc, well before 3.3, which your workaround GLSL requires anyway, as uintBitsToFloat needs at least GLSL 3.30).

Right now I'm putting nx and ny into r and g, and it would be nice if I could put (linear depth)*(sign(nz)) into b and a.

I'm doing exactly this with a very simple encoding/decoding, without any problems:


[source lang="plain"]
// encoding: [depth = 0..1]
scaled_depth = depth * 2048.0;
result.x = floor(scaled_depth);
result.y = fract(scaled_depth);

// decoding: scale both parts back down by 1/2048
result = dot(encoded_depth.xy, vec2(1.0/2048.0, 1.0/2048.0));
[/source]


A standard depth/stencil buffer has only 24 depth bits too, so this works for me for lighting, particle depth tests, etc.
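Written out as functions, the same scheme would look roughly like this (untested sketch; encodeDepth/decodeDepth are just names for illustration):

[source lang="plain"]
// untested paraphrase of the scheme above as helper functions
vec2 encodeDepth( float depth )   // depth in [0,1]
{
    float scaled = depth * 2048.0;
    return vec2( floor( scaled ), fract( scaled ) );
}

float decodeDepth( vec2 encoded )
{
    // both parts are scaled back down by 1/2048
    return ( encoded.x + encoded.y ) * ( 1.0 / 2048.0 );
}
[/source]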

PS: fast decoding could be useful too.
I recommend using the GLM math library. It has support for half floats, and it is designed for OpenGL and shader compatibility.
Current project: Ephenation.
Sharing OpenGL experiences: http://ephenationopengl.blogspot.com/
Thanks for the replies.
tanzanite7 - Whoa, I didn't realize that restriction was dropped! For now I've switched over to an R32F buffer for depth (which actually cut down on total gbuffer size, since I was using two RGBA16F textures before but now I'm using one RGBA16F and one R32F). In terms of older hardware support... if it doesn't support different sized attachments I'll just cut down to 32-bit buffers and get lower quality HDR I guess, though to be honest I'm not thinking too much about that (this is mostly just a hobbyist project).
Ashaman73 - I was hoping to store the exact bits, but if you haven't had any problems with that route then maybe I'll do that instead, as I think I'd have to jump up to 4.2 or 4.3 to "truly" support that type of float packing. Have you played with the value (2048.0) at all to determine if it is optimal for keeping precision?
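For reference, the 4.2 route (GL_ARB_shading_language_packing) would look roughly like this if I ever bump the version (untested sketch; 'value' is just a placeholder for the float being stored):

[source lang="plain"]
// encode: unpackHalf2x16 returns the two floats whose half bit patterns are the
// two 16-bit words of 'bits' (.x = low word, .y = high word); writing them to a
// 16F target should be lossless, since they are exactly representable as halves
// (NaN bit patterns are the one case I wouldn't rely on surviving)
uint bits   = floatBitsToUint( value );
vec2 halves = unpackHalf2x16( bits );

// decode: read the two channels back and reassemble the original bits
uint restoredBits = packHalf2x16( halves );
float restored    = uintBitsToFloat( restoredBits );
[/source]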
larspensjo - the link seems broken. But unfortunately I don't think a math library will help for the problem I'm trying to solve.

Have you played with the value (2048.0) at all to determine if it is optimal for keeping precision?

I'm not 100% sure, but I think 2048 is actually the value I use. That's 2 x 11 bits plus the implicit leading 1, roughly 24 bits, which is enough in my opinion considering that you don't use it for exact z-ordering.
