TheUnnamable

How to handle endianness?

I always wanted to write a library that writes and reads simple data types (integers, floats, etc.) to and from files. I also wanted the library to be as cross-platform as possible: if I write an integer to a file on a Mac, I want to get the same number from that file on Windows.

 

I've actually written that library, but I feel it's overkill. It's designed to handle little-endian data. It gives you a buffer class, to which you can write integers, floats, or arbitrary blobs of data. The overkill happens when writing specific data formats: it constructs everything bit by bit, e.g.:

    buffer& operator<<(buffer& os, unsigned short x)
    {
        if(os.size()<os.tellp()+2){os.resize(os.size()+2);}
        size_t offs=os.tellp()*8; //Bit index to start writing at

        for(byte_t i=0; i<16; i++){os.set_bit(offs+i,(x>>i)&1);}

        os.seekp(2,1);
        return os;
    };

I'm suspecting I could do something like this:

    buffer& operator<<(buffer& os, unsigned short x)
    {
        os.put(x%256); x -= x%256;
        os.put(x/256);
        return os;
    }

And it would still be readable on most platforms.

 

So my questions:

  • If I handle bytes instead of bits, will it still be readable on other platforms? Or on other computers?
  • With the simplified method, how would I handle signed integers? Use its absolute value but write inverted bytes? ( ~x operator )
  • Floats?

Also, a note: another concern while writing this library was conforming to some kind of standard. So, for integers, I've just checked how Windows' calc.exe handles them; for floats, I checked Wikipedia ( http://en.wikipedia.org/wiki/Single-precision_floating-point_format ).

You can check the whole source at https://github.com/elementbound/binio. The interesting parts are binio/buffer.h ( buffer class ) and binio/formats.h and .cpp.

All modern computers I'm aware of have 8-bit bytes, and store integers in 2's complement. So you'd get the same thing on all platforms when you write bytes. Cast signed ints to unsigned before doing any bitwise operations on them (cast them back to signed on reading in, once re-constructed). Handle floats by making the bit pattern into an int, then decomposing to bytes as you would an int.

    union IntOrFloat
    {
        int i;
        float f;
    };

    float f = 1.0f;
    IntOrFloat iof;
    iof.f = f;
    printf("%d", iof.i);

Oh, and you probably want to use defined width types rather than int and float, which will be different on different platforms (e.g. 32/64 bit OSes).

Don't do

    float f = 1.0f;
    int i = *((int *)&f);

because it might not do what you want. Look up 'aliasing' to find out why.
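
The integer advice in this reply (write bytes, convert signed values to unsigned first, use fixed-width types) could be sketched roughly as follows. The helper names and the use of std::vector as a stand-in buffer are assumptions made for illustration; none of this is taken from binio.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

// Hypothetical stand-in for a byte buffer; not binio's buffer class.
typedef std::vector<unsigned char> bytes_t;

// Append a 16-bit value, least significant byte first (little-endian),
// regardless of the byte order of the machine running the code.
inline void put_u16_le(bytes_t& out, uint16_t x)
{
    out.push_back(static_cast<unsigned char>(x & 0xFFu));
    out.push_back(static_cast<unsigned char>((x >> 8) & 0xFFu));
}

// Signed values are converted to unsigned first, so the shifts and masks
// only ever operate on an unsigned bit pattern.
inline void put_i16_le(bytes_t& out, int16_t x)
{
    put_u16_le(out, static_cast<uint16_t>(x));
}

// Rebuild the value from the two bytes starting at offset pos.
inline uint16_t get_u16_le(const bytes_t& in, std::size_t pos)
{
    return static_cast<uint16_t>(in[pos] | (in[pos + 1] << 8));
}

inline int16_t get_i16_le(const bytes_t& in, std::size_t pos)
{
    // Converting an out-of-range unsigned value back to signed is
    // implementation-defined before C++20; mainstream compilers give
    // the expected two's complement result.
    return static_cast<int16_t>(get_u16_le(in, pos));
}

inline void demo_integers()
{
    bytes_t buf;
    put_i16_le(buf, -2);                       // bit pattern 0xFFFE
    assert(buf.size() == 2);
    assert(buf[0] == 0xFE && buf[1] == 0xFF);  // low byte first
    assert(get_i16_le(buf, 0) == -2);
}
```

Because the bytes are produced with shifts and masks rather than by copying memory, the file layout does not depend on the host's endianness, so no swap step is needed with this approach.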


Using the union solution also causes undefined behaviour. You are only allowed to read the last assigned member (with few exceptions, not applicable here).


 

Keep in mind that the primary compilers (VC, GCC and Clang) all have a specific exception to standard aliasing rules for the union variation which makes it valid code.  Obviously for any other compilers you need to check the documentation and verify the behavior.


Thanks for all the valuable information!
I've modified my design a bit. Firstly, for the buffer to work, there must be an init step that determines the machine's native endianness. After that, every newly created buffer uses the native endianness by default. The user can change this for each individual buffer, which gives a bit more freedom.
So when a buffer is set to little-endian and you read from it, it will assume that its source data is little-endian. If the machine's not little-endian, it will swap every element to big-endian (the native order) as it is read.
Similarly, when a buffer is set to little-endian and you write to it, it will store everything in little-endian. So if the machine's not little-endian, it will swap every element from big-endian to little-endian before storing it.
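
As a rough illustration of the init step and the per-element swap described here, a minimal sketch might look like this. The enum and function names are hypothetical and may not match binio's endian_t or its init routine, and the sketch assumes the machine is either plain little-endian or plain big-endian.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdint.h>

// Hypothetical names; binio's actual endian_t may look different.
enum endian_sketch_t { SKETCH_LITTLE, SKETCH_BIG };

// Inspect how the machine lays out a known 16-bit value in memory.
inline endian_sketch_t detect_native_endian()
{
    const uint16_t probe = 0x0102;
    unsigned char first;
    std::memcpy(&first, &probe, 1);   // copy the lowest-addressed byte
    return (first == 0x02) ? SKETCH_LITTLE : SKETCH_BIG;
}

// Reverse the bytes of one element in place.
inline void swap_element(unsigned char* data, std::size_t size)
{
    std::reverse(data, data + size);
}

// Convert one element between native order and the buffer's order.
// The same call serves reads and writes, since a swap is its own inverse.
inline void to_or_from_buffer_order(unsigned char* data, std::size_t size,
                                    endian_sketch_t buffer_endian)
{
    if (buffer_endian != detect_native_endian())
        swap_element(data, size);
}

inline void demo_endian()
{
    unsigned char v[4] = { 0x11, 0x22, 0x33, 0x44 };
    swap_element(v, 4);
    assert(v[0] == 0x44 && v[3] == 0x11);
    // Matching orders leave the element untouched.
    to_or_from_buffer_order(v, 4, detect_native_endian());
    assert(v[0] == 0x44 && v[1] == 0x33);
}
```

In a real library the detection result would be computed once and cached (as the static m_SysEndian member suggests) rather than recomputed on every call.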
 
Right now I'm about to reimplement the writing and reading routines for different formats. The way I'm planning this is similar to the method described in the article incertia linked.

    class buffer
    {
        private:
            ...

            static endian_t m_SysEndian; //Native endianness
            endian_t m_Endian; //Buffer's endianness

            //Internal writers
            void write_raw(const byte_t*, size_t);
            void write_swapped(const byte_t*, size_t);

            //Internal readers
            size_t read_raw(byte_t*, size_t);
            size_t read_swapped(byte_t*, size_t);

            void(buffer::*m_writeelem)(const byte_t*, size_t);
            size_t(buffer::*m_readelem)(byte_t*, size_t);

        public:
            ...

            //Endian-correct I/O
            //These just call m_writeelem and m_readelem
            //I chose to use void* so no casting is needed from the user's side
            void   write_element(const void* d, size_t s);
            size_t read_element(void* d, size_t s);
    };

The buffer's write_element function writes a chunk of data with given size, swapping it if needed. I'm planning to use this when writing integers, like so:

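    buffer& operator<<(buffer& os, short x)
    {
        //Grow only as far as needed to fit 2 more bytes at the write position
        if(os.size()<os.tellp()+2){os.resize(os.tellp()+2);}
        os.write_element(&x,2); //Takes const void*, so no cast is needed
        return os;
    }

    buffer& operator>>(buffer& is, short& ret)
    {
        //Not enough data left: leave ret untouched
        if(is.size()<is.tellg()+2){return is;}
        is.read_element(&ret,2);
        return is;
    }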

I figured I could do the same with floats. Am I correct?

PeterStock showed how not to do it, and the article uses a union construct when handling floats, although I don't understand why. Is it because you can't just do x&255 with a float? Or is there something I'm missing?

Also, I'm not really copying the float, I'm using it directly. Is this a different case?

Here's what I'm doing right now:

    buffer& operator<<(buffer& os, float x)
    {
        os.write_element(&x,4);
        return os;
    };

    buffer& operator>>(buffer& is, float& x)
    {
        if(is.size()<is.tellg()+4){return is;}
        is.read_element(&x,4);
        return is;
    };

Also, about the fixed-width types: I'm planning to use them, but after some research it seems that in C++ they are standard only from C++11 on, through cstdint. Otherwise I can just use stdint.h. Is there any better way? Or some way to detect whether cstdint is available? I think checking for cstdint and falling back to stdint.h if necessary is a good solution. (At least, nothing better comes to mind right now.)



Keep in mind stdint.h is not guaranteed to even exist (it's a C99 header). And having two fallback solutions feels over-engineered to me. Though in all honesty, you are unlikely to find a hosted environment where stdint.h is not available (I mean, even Microsoft supports it now) so it's not a problem unless you really want to target embedded platforms.

 

You need to consider whether it's really worth going through all the portability hoops to make your code portable everywhere - literally - or just on a few popular operating systems and architectures, work out which assumptions you may make to simplify your library, and reason about how to write your code from there.


Floats are a bit of a special case, and certain rules of C++ get in the way of manipulating their binary representation. To start with, you are correct: x&0xFF has no meaning when x is a float, and compilers will reject it as an error. On the other hand, the old standby you still see occasionally, "uint32_t* bx = (uint32_t*)&x;", runs into the C++ aliasing rules; reading through bx is undefined behaviour, though compilers do not always warn about it. In simplified form: you may not access an object through a pointer to an unrelated type (pointers to char and unsigned char are the exception). Aliasing of this kind is a serious problem for compilers, because they optimize on the assumption that it does not happen, and the result can break your code in the most obnoxious and difficult to figure out ways.

 

So, instead of the pointer hackery, a lot of folks work around it with the union trick: union {float f; uint32_t u;}. This was long treated as the clean way to avoid the pointer casts, but strict C++ only allows reading the union member that was most recently written. Because the pattern is so common, the big three compilers (VC, GCC, Clang) make an exception that keeps it working. By the strict standard, the *only* (ignoring minor variations) method of properly dealing with this issue is: "uint32_t u; memcpy( &u, &x, sizeof( uint32_t ) );" On mainstream compilers this gives the same result you would expect from "uint32_t u = *(uint32_t*)&x;", and in fact most compilers detect the small fixed-size memcpy and replace it with a simple mov instruction, so there is no call overhead or anything else involved.

 

So, having said all that: you need these tricks because, just as with a uint32_t, you have to swap the bytes of a float when changing its endianness, but a float has no bit operations to do it with. And, as stated, taking the address of the float and casting it to an integer pointer is also illegal, so it's a bit of a chicken-and-egg problem, which is why you should use the memcpy variation: it is the "proper" thing to do. You can use the technically illegal variations and probably get away with it, but they will likely end up biting you on the ass at some point in the future. You have been warned. :)
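
To make the memcpy variation concrete, here is a minimal sketch of a float round trip through little-endian bytes. The function names are hypothetical, and the sketch assumes float is a 32-bit IEEE 754 value whose byte order matches that of 32-bit integers on the same machine, which holds on mainstream platforms but is not guaranteed by the standard.

```cpp
#include <cassert>
#include <cstring>
#include <stdint.h>

// Compile-time guard for the 32-bit assumption (works before C++11):
// a negative array size is a compile error.
typedef char float_must_be_4_bytes[(sizeof(float) == 4) ? 1 : -1];

// Copy the float's bit pattern into an integer without aliasing tricks.
inline uint32_t float_to_bits(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float bits_to_float(uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Store the bit pattern least significant byte first.
inline void store_f32_le(unsigned char* out, float f)
{
    const uint32_t u = float_to_bits(f);
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<unsigned char>((u >> (8 * i)) & 0xFFu);
}

inline float load_f32_le(const unsigned char* in)
{
    uint32_t u = 0;
    for (int i = 0; i < 4; ++i)
        u |= static_cast<uint32_t>(in[i]) << (8 * i);
    return bits_to_float(u);
}

inline void demo_float()
{
    unsigned char raw[4];
    store_f32_le(raw, 1.0f);   // 1.0f is 0x3F800000 in IEEE 754
    assert(raw[0] == 0x00 && raw[3] == 0x3F);
    assert(load_f32_le(raw) == 1.0f);
}
```

Once the float is an integer, all the usual shift and mask operations are available, so the same byte-wise code used for integers applies.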


Indeed, you are right about that, Bacterius. I figured I'd still keep those fallbacks.

First, if the current compiler supports C++11, cstdint will be used; I check the __cplusplus define for that. Compilers can lie about it, though, so with a simple define the user can choose to use stdint.h instead. I think that will be available for most users, so it won't be an issue.

If neither of those headers is available, there's an emergency fallback, although it's not guaranteed to work. This fallback can be activated with a define, too.
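
A header selection along these lines might be sketched as below. The BINIO_SKETCH_* macro names, the binio_sketch namespace and the typedef names are all hypothetical; the actual defines in the project may differ, and the choice to default to stdint.h on pre-C++11 compilers is an assumption.

```cpp
#include <cassert>
#include <climits>

// Note: MSVC reports __cplusplus as 199711L unless /Zc:__cplusplus is
// passed, which is one way a compiler can "lie" about C++11 support.
#if defined(BINIO_SKETCH_EMERGENCY_TYPES)
    // Last resort: guesses that hold on common platforms, checked below.
    namespace binio_sketch {
        typedef unsigned char  u8;
        typedef unsigned short u16;
        typedef unsigned int   u32;
    }
#elif !defined(BINIO_SKETCH_USE_STDINT_H) && __cplusplus >= 201103L
    #include <cstdint>
    namespace binio_sketch {
        typedef std::uint8_t  u8;
        typedef std::uint16_t u16;
        typedef std::uint32_t u32;
    }
#else
    // User override, or a pre-C++11 compiler that ships the C99 header.
    #include <stdint.h>
    namespace binio_sketch {
        typedef ::uint8_t  u8;
        typedef ::uint16_t u16;
        typedef ::uint32_t u32;
    }
#endif

// Size checks that also compile before C++11's static_assert:
// a negative array size is a compile error.
typedef char binio_sketch_check_u8 [(sizeof(binio_sketch::u8)  * CHAR_BIT == 8)  ? 1 : -1];
typedef char binio_sketch_check_u16[(sizeof(binio_sketch::u16) * CHAR_BIT == 16) ? 1 : -1];
typedef char binio_sketch_check_u32[(sizeof(binio_sketch::u32) * CHAR_BIT == 32) ? 1 : -1];

inline void demo_types()
{
    assert(sizeof(binio_sketch::u32) == 4 * sizeof(binio_sketch::u8));
}
```

The size checks make the emergency branch fail at compile time on a platform where the guesses are wrong, instead of silently writing files with the wrong layout.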

Maybe it's a bit over-engineered, but it will definitely work on modern compilers, and on almost-old compilers too :)

 

Ah, I get it now, AllEightUp! Thanks for clearing that up! Until now, I didn't wholly get the concept of aliasing. I'll rewrite my float-handling routines to use memcpy. You saved my future self from a naughty bite :D

 

EDIT: Also, I've updated the github project in case of interest :)

Edited by TheUnnamable
