# small (short) float?

I'm trying to minimize memory usage as much as possible in my program, and I realized a lot of my objects' 'float' members are only being used to cover a 0.0 - 1.0 range, rather than the huge range floats can cover. So I was wondering if some people could help me with the code to create a "small float" datatype. Just as you can add "unsigned" to "int", it would be nice to have a "small" keyword. Or maybe just reuse "short", so that a "short float" would be a 0-to-1 decimal value? I was hoping to do this as painlessly as possible, although I assume it will be pretty advanced.

Finally, before anyone helps me with this: is it even worth it? Will the size still have to be the same in order to handle precision within the 0.0 - 1.0 range? Perhaps it could also be limited to 4 or 5 decimal places of precision? Maybe something like this already exists?

In all honesty, I don't care much about the logistics behind it; I'd be perfectly content with just copying and pasting some code in. Anyway, thanks for any help you can provide.

##### Share on other sites
I don't think there's any smaller data type for floating point numbers than float on the x86 processor. As for speed, I don't think it's worth manually making a 16-bit float, since all the manual bit shifting you'd have to do would be far slower (my guess is something like 100 times) than using the 32-bit version supported by the processor. Although it is a nice thought, I think you should abandon it (unless speed is not an issue and size is everything).

##### Share on other sites
A lot of graphics hardware (like the GameCube's) supports 16-bit packed number formats, where you specify how many bits represent the integer part and how many bits represent the fractional part. You obviously lose a lot of precision, and you do not want to try to do any math on these numbers, but for static data that is just sent to the graphics chip (like texture UVs, normals, and such), this can save a lot of memory (and when you are limited to 24 MB, you need to save everywhere you can :-).
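As a rough illustration of the "choose how many bits are fraction" idea, here is a minimal sketch. The names `PackFixed16` and `UnpackFixed16` are hypothetical, and this does not reproduce any particular graphics chip's format; it only shows the arithmetic for a signed 16-bit value with a caller-chosen number of fractional bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: store v as signed 16-bit fixed point with
// `fracBits` bits after the binary point. Truncates toward zero.
inline std::int16_t PackFixed16(float v, int fracBits)
{
    assert(fracBits >= 0 && fracBits <= 15);
    const float scaled = v * static_cast<float>(1 << fracBits);
    // The value must fit in 16 bits for the chosen split.
    assert(scaled >= -32768.0f && scaled <= 32767.0f);
    return static_cast<std::int16_t>(scaled);
}

// Hypothetical helper: recover the float from the packed value.
inline float UnpackFixed16(std::int16_t packed, int fracBits)
{
    assert(fracBits >= 0 && fracBits <= 15);
    return static_cast<float>(packed) / static_cast<float>(1 << fracBits);
}
```

More fractional bits give finer steps but a smaller range: with 14 fractional bits the range is roughly -2 to 2, with 8 it is roughly -128 to 128.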

[edited by - chiuyan on July 5, 2003 7:13:01 PM]

##### Share on other sites
If you can live with the overhead of dividing each time you need to use the value, you can store it in an unsigned char and divide by 255 when you use it.

    class CTinyFloat
    {
    private:
        unsigned char p_value;
    public:
        // Expects sValue in [0.0f, 1.0f]; no range check is done here.
        void SetValue(const float sValue)
        {
            p_value = (unsigned char)(sValue * 255.0f);
        }
        float GetValue(void) const
        {
            return (float)p_value / 255.0f;
        }
    };

You'll have to check for a valid range in SetValue, since a negative or > 1.0f value will not fit in the unsigned char...
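The range check mentioned above could look something like the following sketch. The function names are hypothetical, and clamping (rather than rejecting) out-of-range input is an assumption; it also rounds to nearest instead of truncating, which halves the worst-case error.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: store a value from [0,1] in one byte.
// Out-of-range input is clamped instead of overflowing the byte.
inline std::uint8_t PackUnitFloat(float v)
{
    if (!(v > 0.0f)) return 0;   // negative, zero, or NaN
    if (v > 1.0f) v = 1.0f;
    // Adding 0.5 rounds to the nearest step instead of truncating.
    const float scaled = v * 255.0f + 0.5f;
    assert(scaled >= 0.5f && scaled < 256.0f);
    return static_cast<std::uint8_t>(scaled);
}

// Hypothetical helper: recover the float from the stored byte.
inline float UnpackUnitFloat(std::uint8_t b)
{
    return static_cast<float>(b) / 255.0f;
}
```

With rounding, the round-trip error for an in-range value is at most half a step, about 0.002.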

##### Share on other sites
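Use an unsigned short or unsigned char. When you want to convert from [0,1] do:

    type quantizedfloat = (type)(x * ((1 << (sizeof(type) << 3)) - 1));

To convert back:

    float regularfloat = (float)quantizedfloat / (float)((1 << (sizeof(type) << 3)) - 1);

(The scale is 2^bits - 1 rather than 2^bits so that 1.0 maps to the largest value the type can hold instead of overflowing.)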

To make it a little faster make sure you do the float->int, int->float conversion yourself to fit your needs. As mentioned, it's a little slower than just using floats and isn't really worth it unless you want to get a massive amount of floats down to a smaller size.

The most useful place for this is in sending vertices to the video card. You can compress them like this, send them to the card faster, then decompress with a vertex shader. There are a few articles in the reference section on vertex quantization.

Note that the above will lose some precision, with char losing more than short. Don't waste your time using this on data that the CPU manipulates, unless you're working on a platform where you need fixed point or something. Look into fixed point for a general solution to this problem, which works for values outside of [0,1].
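To put numbers on "char losing more than short", here is a minimal sketch of the same quantization as a template. The names `Quantize`, `Dequantize`, and `MaxError` are hypothetical, and the code assumes an unsigned 8-bit or 16-bit storage type and input already in [0,1].

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: map x in [0,1] onto the full range of T,
// rounding to the nearest step.
template <typename T>
T Quantize(float x)
{
    static_assert(std::numeric_limits<T>::is_integer &&
                  !std::numeric_limits<T>::is_signed && sizeof(T) <= 2,
                  "T is assumed to be an unsigned 8-bit or 16-bit type");
    assert(x >= 0.0f && x <= 1.0f);
    const float scale = static_cast<float>(std::numeric_limits<T>::max());
    return static_cast<T>(x * scale + 0.5f);
}

// Hypothetical helper: map the stored integer back to [0,1].
template <typename T>
float Dequantize(T q)
{
    return static_cast<float>(q) /
           static_cast<float>(std::numeric_limits<T>::max());
}

// Worst-case round-trip error: half of one quantization step.
template <typename T>
float MaxError()
{
    return 0.5f / static_cast<float>(std::numeric_limits<T>::max());
}
```

For `std::uint8_t` the worst-case error is about 0.002, and for `std::uint16_t` it is about 0.0000076, so a short gets you roughly 4 to 5 reliable decimal places.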

------------
- outRider -

[edited by - outRider on July 5, 2003 7:05:52 PM]

##### Share on other sites
16-bit fixed point is my suggestion.

##### Share on other sites
Yeah, and since the number only needs to cover 0 -> 1, you only need a 1.1.14 format (1 bit for sign, 1 bit for the integer part, and 14 bits for the fraction).

This is based on the fixed point code from my virtual machine testing.

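    struct Fixed16_S // 1.1.14: 1 sign bit, 1 integer bit, 14 fraction bits
    {
        short val;

        inline float FloatVal(void) const
        {
            return val / 16384.0f;
        }
        Fixed16_S &operator+=(const Fixed16_S &f)
        {
            val += f.val;
            return *this;
        }
        Fixed16_S &operator-=(const Fixed16_S &f)
        {
            val -= f.val;
            return *this;
        }
        Fixed16_S &operator*=(const Fixed16_S &f)
        {
            // Widen to int so the intermediate product cannot overflow.
            val = (short)(((int)val * f.val) / 16384);
            return *this;
        }
        Fixed16_S &operator/=(const Fixed16_S &f)
        {
            // Scale up before dividing, otherwise the fraction is lost.
            val = (short)(((int)val * 16384) / f.val);
            return *this;
        }
        Fixed16_S &operator=(const Fixed16_S &f)
        {
            val = f.val;
            return *this;
        }
        Fixed16_S &operator=(const short &v)
        {
            val = (short)(v * 16384); // Set our value to a short! Only -2..1 fit.
            return *this;
        }
        Fixed16_S &operator=(const float &v)
        {
            val = (short)(v * 16384.0f);
            return *this;
        }
    };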

You can now use:

    Fixed16_S Test1, Test2;
    Test1 = 0.5f;
    Test2 = 0.5f;
    Test1 *= Test2;
    printf("%f", Test1.FloatVal()); // Should print out 0.25..

Hope this gives you some idea on how fixed point works. This gives pretty good precision and only uses 16-bits, and also preserves the sign properly. This was originally a 32-bit fixed point struct that I just converted to 16, so typographical errors may have popped up. Also, this can easily be changed into an 8-bit format at the loss of some precision.

##### Share on other sites
Yeah, I agree with the fixed point suggestion.

##### Share on other sites
Hmm... Thanks a bunch to all that helped out.

The floats I am concerned about are all the floats for my particles. There will be an undetermined number of them, but definitely a lot of them, all with multiple float members.

The CPU will be working with these a lot, so from what's been suggested it sounds like it won't be a good tradeoff... Oh well, at least now I know and am not nagged by "but what if I could?", heh. Thanks a lot guys.
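For a sense of what is at stake in memory terms, here is a small sketch comparing two particle layouts. The struct names and members are assumptions made up for illustration, since the thread never shows the real particle class.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical particle with five values that all stay in [0,1].
struct ParticleFloat
{
    float r, g, b, a, life;
};

// The same members stored as 16-bit quantized values.
struct ParticlePacked
{
    std::uint16_t r, g, b, a, life;
};

// Total bytes needed for `count` particles of the given size.
constexpr std::size_t BytesFor(std::size_t count, std::size_t particleSize)
{
    return count * particleSize;
}
```

On a typical platform with 4-byte floats, the packed layout is half the size, for example `BytesFor(100000, sizeof(ParticlePacked))` against `BytesFor(100000, sizeof(ParticleFloat))`. Whether that saving is worth the conversion cost on every update is exactly the tradeoff discussed in this thread.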

##### Share on other sites
quote:
The CPU will be working with these a lot.

In this case I don't think you should use 16 bit, because 32-bit Intel/AMD chips are designed to access dword-aligned memory (i.e. addresses that are multiples of 4 bytes) more quickly than memory that is not dword aligned.

##### Share on other sites
Yeah, floats would be faster than fixed point in most cases, unless you're doing a lot of float -> integer conversions, which it doesn't appear you'd be doing.

Although, most of the work you'd be doing would be subtraction and addition I'd presume, and integer addition/subtraction is plenty fast. Who knows, just use:

typedef float MyFloat;

Then write the program and get it to work, benchmark.. then change the typedef to...

typedef Fixed16_S MyFloat;

Program should still work the same, benchmark and see which was faster. It's not that hard to test out because both float types are treated the same in code, so no changes should have to be made to test them!

--- Edit ---
By the way, in your structs/classes, I mean to use MyFloat as the data type instead of float or Fixed16_S, so it's as simple as changing your typedef to change all the types instead of doing it manually. Once you find the most efficient method, stick to it.
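A minimal sketch of that benchmark setup follows. The `Step` workload and the `TimeSteps` helper are hypothetical stand-ins for a real particle update, and the timing uses `std::chrono::steady_clock` from the standard library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Change this one line to benchmark a different number type.
typedef float MyFloat;

// Hypothetical workload: add a velocity to each position,
// roughly what a particle update would do.
inline void Step(std::vector<MyFloat> &pos, const std::vector<MyFloat> &vel)
{
    assert(pos.size() == vel.size());
    for (std::size_t i = 0; i < pos.size(); ++i)
        pos[i] += vel[i];
}

// Returns the elapsed milliseconds for `iterations` calls to Step.
inline double TimeSteps(std::vector<MyFloat> &pos,
                        const std::vector<MyFloat> &vel, int iterations)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        Step(pos, vel);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For the swap to compile unchanged, the replacement type has to support everything the code does with `MyFloat` (here, `+=` and being stored in a `std::vector`). Run each variant several times with optimizations on, since a single short timing is noisy.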

[edited by - Ready4Dis on July 6, 2003 9:17:13 AM]

##### Share on other sites
If you take a look at the precision loss, it follows a distinct pattern (I just noticed). I'm working on a way to improve accuracy; I will post the results soon.

A GOOD friend will come bail you out of jail...
but, a TRUE friend will be sitting next to you saying, "Damn, we fucked up."
Ingite 3D Game Engine Home -- Just click it

##### Share on other sites
OK, I've done a quick rough version. I think I will fiddle a bit more and see what I can come up with.

So far, on a quick test of 3000 floats, the average error is 0.00209244, which is reasonable considering it takes up exactly half the memory space!

I've implemented 2 methods of error compensation:
• Odd/even compensation. This works nicely, and I think with development it is the one to watch, as the precision loss follows a pattern.
• Average error compensation. This method simply adds the average error level. It's fast and works very well (better than odd/even currently). The demo uses this method.

With minimal modification, the template class float16 should even be able to compress down to 8 bits with a developed error algorithm.
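The float16 code itself is not shown in this thread, so the following is only one possible reading of "average error compensation", not the poster's implementation, and it makes no attempt to reproduce the 0.00209244 figure. The assumption is a truncating 16-bit quantizer: truncation always rounds down, so the decoded value is low by half a step on average, and adding that half step back on decode reduces the mean error. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Truncating 16-bit quantizer for x in [0,1).
inline std::uint16_t Truncate16(float x)
{
    assert(x >= 0.0f && x < 1.0f);
    return static_cast<std::uint16_t>(x * 65536.0f);
}

// Plain decode: always at or below the original value.
inline float DecodePlain(std::uint16_t q)
{
    return static_cast<float>(q) / 65536.0f;
}

// Compensated decode: adds the average truncation error (half a step).
inline float DecodeCompensated(std::uint16_t q)
{
    return (static_cast<float>(q) + 0.5f) / 65536.0f;
}

// Mean absolute round-trip error over evenly spread test values.
inline double MeanAbsError(bool compensated, int samples)
{
    assert(samples > 0);
    double sum = 0.0;
    for (int i = 0; i < samples; ++i)
    {
        const float x = (static_cast<float>(i) + 0.37f) /
                        static_cast<float>(samples);
        const std::uint16_t q = Truncate16(x);
        const float y = compensated ? DecodeCompensated(q) : DecodePlain(q);
        sum += std::fabs(static_cast<double>(y) - static_cast<double>(x));
    }
    return sum / samples;
}
```

Comparing `MeanAbsError(true, 3000)` with `MeanAbsError(false, 3000)` shows the effect. Rounding to nearest when encoding achieves the same improvement without changing the decoder.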

(in about 10 mins after this post plz)

CODE

silvermace007@hotmail.com

compiled on g++ 3.xx
g++ -W -Wall _float.cpp -O6 -o flt

have fun.


[edited by - silvermace on July 7, 2003 6:43:47 AM]
