How to handle endianness?

8 comments, last by TheUnnamable 10 years, 7 months ago

I always wanted to write a library to write and read simple data types ( integers, floats, etc. ) to/from files. Also, I wanted the library to be as cross-platform as possible - if I write an integer to a file on Mac, I want to get the same number from that file on Windows.

I've actually written that library, but I feel I've gone overkill with it. It's designed to handle little-endian data. It gives you a buffer class, to which you can write integers, floats, or just random blobs of data. The overkill I mentioned happens when writing specific data formats. It constructs everything bit by bit, e.g.:


    buffer& operator<<(buffer& os, unsigned short x)
    {
        if(os.size()<os.tellp()+2){os.resize(os.size()+2);}
        size_t offs=os.tellp()*8; //Bit index to start writing at

        for(byte_t i=0; i<16; i++){os.set_bit(offs+i,(x>>i)&1);}

        os.seekp(2,1);
        return os;
    };

I'm suspecting I could do something like this:


    buffer& operator<<(buffer& os, unsigned short x)
    {
        os.put(x%256); x -= x%256;
        os.put(x/256);
        return os;
    };

And it would still be readable on most platforms.

So my questions:

  • If I handle bytes instead of bits, will it still be readable on other platforms? Or on other computers?
  • With the simplified method, how would I go about signed integers? Use its abs but write inverted bytes? ( ~x operator )
  • Floats?

Also, a note: another concern while writing this library was to conform to some kind of standard. So, for integers, I just checked how Windows' calc.exe handles them; for floats, I checked Wikipedia ( http://en.wikipedia.org/wiki/Single-precision_floating-point_format )

You can check the whole source at https://github.com/elementbound/binio. The interesting parts are binio/buffer.h ( buffer class ) and binio/formats.h and .cpp.

All modern computers I'm aware of have 8-bit bytes, and store integers in 2's complement. So you'd get the same thing on all platforms when you write bytes. Cast signed ints to unsigned before doing any bitwise operations on them (cast them back to signed on reading in, once re-constructed). Handle floats by making the bit pattern into an int, then decomposing to bytes as you would an int.

    union IntOrFloat
    {
        int i;
        float f;
    };

    float f = 1.0f;
    IntOrFloat iof;
    iof.f = f;
    printf("%d", iof.i);

Oh, and you probably want to use defined width types rather than int and float, which will be different on different platforms (e.g. 32/64 bit OSes).

Don't do

    float f = 1.0f;
    int i = *((int *)&f);

because it might not do what you want. Look up 'aliasing' to find out why.
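As a rough illustration of the advice above (write bytes, go through an unsigned type, use fixed-width types), here is a minimal sketch. The helper names put_i32_le and get_i32_le are made up for this example and are not part of binio; the sketch assumes 8-bit bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: append a 32-bit signed integer to `out`,
// least significant byte first (little-endian), on any host.
void put_i32_le(std::vector<std::uint8_t>& out, std::int32_t value)
{
    // Signed-to-unsigned conversion is well defined (modulo 2^32) and
    // preserves the two's complement bit pattern.
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>((u >> (8 * i)) & 0xFFu));
}

// Hypothetical helper: rebuild the value from 4 bytes starting at `pos`.
std::int32_t get_i32_le(const std::vector<std::uint8_t>& in, std::size_t pos)
{
    std::uint32_t u = 0;
    for (int i = 0; i < 4; ++i)
        u |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);

    // Copy the bit pattern back into a signed value without relying on
    // an out-of-range unsigned-to-signed conversion.
    std::int32_t value;
    std::memcpy(&value, &u, sizeof value);
    return value;
}
```

Because only shifts and masks on the value are used, the bytes produced do not depend on the byte order of the machine running the code.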

Much of your worry is what I call overkill. There is a greatly simplified method of dealing with this which is faster not only for the actual read/write but also to implement. My approach has always been to simply write data files in the native endianness of the platform they are written from, so there is no dealing with endianness at all during writes (beyond testing), only during reads. To make this work, you put a header at the front of the file which contains an endian flag. When a platform of a different endianness opens the file, it checks that flag and creates the appropriate read object, which encapsulates all the endian-aware reading. This way, 99.9% of the time you are working in the proper endianness for the machine and don't have to pay for the slower reconstruction that comes with always being endian-aware.
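A minimal sketch of the header-flag idea, assuming a one-byte flag at the start of the file. The names Endian, native_endian, needs_swap, swap32 and load_u32 are hypothetical, and only 32-bit values are shown:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical flag values stored as the first byte of the file header.
enum Endian : std::uint8_t { LittleEndian = 0, BigEndian = 1 };

// Detect the byte order of the machine running this code.
Endian native_endian()
{
    const std::uint16_t probe = 1;
    std::uint8_t first;
    std::memcpy(&first, &probe, 1);  // lowest-addressed byte of the probe
    return first == 1 ? LittleEndian : BigEndian;
}

// The writer stores native_endian() in the header. The reader compares
// that flag with its own byte order once, when the file is opened.
bool needs_swap(std::uint8_t header_flag)
{
    return header_flag != static_cast<std::uint8_t>(native_endian());
}

// Reverse the four bytes of a 32-bit value.
std::uint32_t swap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Load a 32-bit value from raw file bytes, swapping only when required.
std::uint32_t load_u32(const std::uint8_t* p, bool swap)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return swap ? swap32(v) : v;
}
```

When writer and reader share a byte order, needs_swap is false and every load is a plain copy, which is the common case this approach optimizes for.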

Now, the code you post is a problem, let's fix that up:


    buffer& operator<<(buffer& os, unsigned short x)
    {
        uint8_t part0 = uint8_t( x & 0xFF );
        uint8_t part1 = uint8_t( (x & 0xFF00) >> 8 );
        os.put( part0 );
        os.put( part1 );
        return os;
    }

    // by side effect this works also:
    buffer& operator<<(buffer& os, signed short x)
    {
        unsigned short xx;
        memcpy( &xx, &x, sizeof( signed short ) );  // Avoid aliasing, could use pointer tricks on most compilers though.
        os << xx;
        return os;
    }

By using the mask and shift operations you don't need to care whether the data is signed or unsigned: byte for byte the data is identical, and only the order of the bytes in memory matters when reconstructing a complete value.

For floating point values, so long as the machines support IEEE form floats/doubles (pretty much everything out there), you can do exactly the same thing as the above just extend to 32/64 bits and use a temporary 'uint32_t' as the target of the memcpy. Again, the only difference in such a case is byte order and you don't have to manipulate the bits in any manner.
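A small sketch of the memcpy route for floats, assuming float is a 32-bit IEEE 754 value. float_to_bits and bits_to_float are illustrative names, not an existing API:

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Copy the bit pattern of a float into an integer of the same size.
std::uint32_t float_to_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Copy the bit pattern back into a float.
float bits_to_float(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

Once the value is in a uint32_t it can be split into bytes or byte-swapped exactly like any other unsigned integer. On an IEEE 754 machine, float_to_bits(1.0f) is 0x3F800000.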

NOTE: Should have really written the code as:


    buffer& operator<<(buffer& os, unsigned short x)
    {
        os.write( (const char*)&x, sizeof( unsigned short ) );
        return os;
    }

For the write side, just don't care about endian or dealing with the multibyte nature of the data. The reading portion is where you would do the appropriate shifts or swaps if the endian needs to change.
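One way the read side could look, as a sketch only. read_value is a hypothetical helper; the caller is assumed to know already (for example from a file header) whether the stored byte order differs from the machine's:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy sizeof(T) bytes from `src` into a value of a
// trivially copyable type T, reversing the byte order when `swap` is true.
template <typename T>
T read_value(const unsigned char* src, bool swap)
{
    unsigned char tmp[sizeof(T)];
    std::memcpy(tmp, src, sizeof(T));

    if (swap)
        std::reverse(tmp, tmp + sizeof(T));  // flip the byte order

    T value;
    std::memcpy(&value, tmp, sizeof(T));     // no pointer casts needed
    return value;
}
```

The same template covers integers and floats, since it only ever touches the bytes of the value.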

Don't do

    float f = 1.0f;
    int i = *((int *)&f);

because it might not do what you want. Look up 'aliasing' to find out why.

Using the union solution also causes undefined behaviour. You are only allowed to read the last assigned member (with few exceptions, not applicable here).

Keep in mind that the primary compilers (VC, GCC and Clang) all have a specific exception to standard aliasing rules for the union variation which makes it valid code. Obviously for any other compilers you need to check the documentation and verify the behavior.

An article was written about this

Thanks for all the valuable information!
I've modified my design a bit. Firstly, for the buffer to work, there must be an init step to determine the machine's native endianness. After that, every newly created buffer uses the native endianness by default. The user can change this for each individual buffer, which provides a bit more freedom.
So when reading from a buffer set to little-endian, it will assume that its source data is little-endian. If the machine isn't little-endian, it will swap every element to big-endian.
Similarly, when writing to a buffer set to little-endian, it will store everything in little-endian. So if the machine isn't little-endian, it will swap every element to little-endian before storing it.

Right now I'm about to reimplement the writing and reading routines for different formats. The way I'm planning this is similar to the method described in the article incertia linked.


    class buffer
    {
        private:
            ...

            static endian_t m_SysEndian; //Native endianness
            endian_t m_Endian; //Buffer's endianness

            //Internal writers
            void write_raw(const byte_t*, size_t);
            void write_swapped(const byte_t*, size_t);

            //Internal readers
            size_t read_raw(byte_t*, size_t);
            size_t read_swapped(byte_t*, size_t);

            void(buffer::*m_writeelem)(const byte_t*, size_t);
            size_t(buffer::*m_readelem)(byte_t*, size_t);

        public:
            ...

            //Endian-correct I/O
            //These just call m_writeelem and m_readelem
            //I chose to use void* so no casting is needed from the user's side
            void   write_element(const void* d, size_t s);
            size_t read_element(void* d, size_t s);
    };
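For what it's worth, the dispatch through member function pointers could be sketched like this. mini_buffer is a stripped-down stand-in, not the actual binio buffer class, and the byte_t typedef is an assumption:

```cpp
#include <cstddef>
#include <vector>

typedef unsigned char byte_t;  // assumed to match binio's byte_t

// Stand-in for the buffer class; only the write dispatch is shown.
class mini_buffer
{
    std::vector<byte_t> m_data;
    void (mini_buffer::*m_writeelem)(const byte_t*, std::size_t);

    // Append the bytes in the order they sit in memory.
    void write_raw(const byte_t* p, std::size_t n)
    {
        m_data.insert(m_data.end(), p, p + n);
    }

    // Append the bytes of one element in reverse order.
    void write_swapped(const byte_t* p, std::size_t n)
    {
        for (std::size_t i = n; i > 0; --i)
            m_data.push_back(p[i - 1]);
    }

public:
    // `swap` would be true when the buffer's endianness differs from
    // the machine's; the choice is made once, not on every write.
    explicit mini_buffer(bool swap)
        : m_writeelem(swap ? &mini_buffer::write_swapped
                           : &mini_buffer::write_raw) {}

    void write_element(const void* d, std::size_t s)
    {
        (this->*m_writeelem)(static_cast<const byte_t*>(d), s);
    }

    const std::vector<byte_t>& bytes() const { return m_data; }
};
```

Note that write_element must be called once per element (one short, one float), never on a whole array at once, because the swap reverses everything it is handed.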

The buffer's write_element function writes a chunk of data with given size, swapping it if needed. I'm planning to use this when writing integers, like so:


    buffer& operator<<(buffer& os, short x)
    {
        if(os.size()<os.tellp()+2){os.resize(os.size()+2);}
        os.write_element((byte_t*)&x,2);
        return os;
    };

    buffer& operator>>(buffer& is, short& ret)
    {
        if(is.size()<is.tellg()+2){return is;}
        is.read_element(&ret, 2);
        return is;
    };

I figured I could do the same with floats. Am I correct?

PeterStock showed how not to do it, and the article uses a union construct when handling floats, although I don't understand why. Is it because you can't just do x&255 with a float? Or is there something I'm missing?

Also, I'm not really copying the float, I'm using it directly. Is this a different case?

Here's what I'm doing right now:


    buffer& operator<<(buffer& os, float x)
    {
        os.write_element(&x,4);
        return os;
    };

    buffer& operator>>(buffer& is, float& x)
    {
        if(is.size()<is.tellg()+4){return is;}
        is.read_element(&x,4);
        return is;
    };

Also, about the width-defined types. I'm planning to use them, but after some research, it seems to me that they are available only in C++11 - I need to include cstdint. Or I can just use stdint.h. Is there any better way? Or some way to detect if cstdint is available? I think checking if cstdint's available and falling back to stdint.h if necessary is a good solution. ( At least, nothing better comes to my mind right now )


Keep in mind stdint.h is not guaranteed to even exist (it's a C99 header). And having two fallback solutions feels over-engineered to me. Though in all honesty, you are unlikely to find a hosted environment where stdint.h is not available (I mean, even Microsoft supports it now) so it's not a problem unless you really want to target embedded platforms.

You need to consider whether it's really worth going through all the portability hoops to make your code portable everywhere - literally - or just on a few popular operating systems and architectures, work out which assumptions you may make to simplify your library, and reason about how to write your code from there.

Floats are a bit of a special case, and certain rules of C++ get in the way of manipulating them in binary form. To start with, you are correct: x&0xFF has no meaning when x is a float, and compilers will reject it as an error. On the other hand, the old standby you still see occasionally, "uint32_t* bx = (uint32_t*)&x;", runs you into the C++ aliasing rules: reading the float through that pointer is undefined behaviour, though compilers do not always warn about it. In the most simplistic form, you cannot access the same piece of memory through two pointers of different, unrelated types (character types such as unsigned char are the exception, which is why byte-wise copies are fine). Aliasing like this is a serious problem for compilers: they are allowed to assume it never happens, so the resulting optimizations can break your code in the most obnoxious and difficult to figure out ways.

So, instead of the pointer hackery, a lot of folks work around it with the union trick: union {float f; uint32_t u;}. This was long considered the clean method of avoiding the pointer casts, but by the C++ standard, reading a union member other than the one last written is undefined behaviour (C permits it, C++ does not). Unfortunately, this was such a hugely common pattern that the big three compilers (VC, GCC, Clang) decided to make an exception specifically to allow this pattern to continue working. By strict standard, the *only* (ignoring minor variations) method of properly dealing with this issue is: "uint32_t u; memcpy( &u, &x, sizeof( uint32_t ) );" For all intents and purposes, this is identical to "uint32_t u = *(uint32_t*)&x;", and in fact most compilers detect the small fixed-size memcpy and replace it with a simple mov instruction, so there is no call overhead or anything else involved.

So, having said all that, the reason you need these tricks is that, just as with a uint32_t, you have to swap the bytes around when changing the endianness of a float, but you can't do it directly because floats have no bit operations. And, as stated, taking the address of the float and casting it to an integer pointer is also illegal, so it's a bit of a chicken-and-egg problem, which is why you should use the memcpy variation: it is the "proper" thing to do. You can use the technically illegal variations and probably get away with it, but they will likely end up biting you on the ass at some point in the future. You have been warned. :)

Indeed, you are right about that, Bacterius. I figured I'd still keep those fallbacks.

First, if the current compiler supports C++11, cstdint will be used. I'm checking the __cplusplus define. Although compilers can lie, so with a simple define, the user can choose to use stdint.h instead. I think that'll be available for most of the users, so it won't be an issue.
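The selection described above might look roughly like this. BINIO_USE_STDINT_H is a hypothetical macro name chosen for the example, and the emergency fallback is left out:

```cpp
// BINIO_USE_STDINT_H is a made-up opt-out macro for this sketch: define it
// to force <stdint.h> when __cplusplus cannot be trusted. (MSVC, for
// instance, reports 199711L unless /Zc:__cplusplus is passed.)
#if __cplusplus >= 201103L && !defined(BINIO_USE_STDINT_H)
    #include <cstdint>
    // Bring the names into the global namespace so the rest of the code
    // can use them the same way in both branches.
    using std::uint8_t;
    using std::uint16_t;
    using std::uint32_t;
#else
    #include <stdint.h>
#endif

// Compile-time sanity checks that also work before C++11:
// a negative array size is a compile error.
typedef char binio_check_u16[sizeof(uint16_t) == 2 ? 1 : -1];
typedef char binio_check_u32[sizeof(uint32_t) == 4 ? 1 : -1];
```

Either branch ends with the same unqualified type names, so the serialization code itself does not need any further conditionals.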

If none of the above libraries are available, there's an emergency fallback, although it's not guaranteed to work. This fallback can be activated with a define, too.

Maybe it's a bit over-engineered, but it will definitely work on modern compilers, and on almost-old compilers too :)

Ah, I get it now, AllEightUp! Thanks for clearing that up! Until now, I didn't wholly get the concept of aliasing. I'll rewrite my float-handling routines to use memcpy. You saved my future self from a naughty bite :D

EDIT: Also, I've updated the github project in case of interest :)

