Writing Efficient Endian-Independent Code in C++

Published June 04, 2015 by 'No Bugs' Hare, posted by Sergey Ignatchenko

Once upon a time, there was an article published on gamedev.net [Roy2013], which described a way (as its author says, mostly taken from the Quake 2 engine) to deal with little-endian/big-endian issues in games. While this approach is mostly sound ("mostly" because of the unaligned-read issues which will be discussed below), it is not the most efficient one. Better (simpler, faster, and more general) approaches do exist, and they will be discussed below.

What is Endianness

Endianness itself has been described in many different works, including [Roy2013] and [WikipediaEndianness]. Basically, it is the way a CPU stores multi-byte data in memory; little-endian systems store the least significant byte first, and big-endian ones store the most significant byte first. So, if you have uint16_t x = 1234; then x will look as {0xD2, 0x04} on a little-endian system, and as {0x04, 0xD2} on a big-endian system. As a result, code such as send(socket,&x,2); will send different data over the wire depending on system endianness (little-endian systems will send {0xD2,0x04}, and big-endian ones will send {0x04,0xD2}). It is important to note that endianness effects cannot be observed unless we have some kind of cast between pointers to data of different sizes (in the example above, there was an implicit cast from &x, which is uint16_t*, to void*, which is actually treated as a byte pointer by the send() function). In other words, as long as we keep away from casts and stay within arithmetical and bitwise operations without pointer casts, the result is always the same regardless of endianness. 2+3 is always 5, and (((uint16_t)0xAA)<<3)^0xCCCC is always 0xC99C, regardless of the system where our code is running. Let's name such calculations endianness-agnostic.
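
To see the difference in memory directly, here is a minimal sketch (my own illustration, not code from [Roy2013]) which inspects the in-memory representation of the uint16_t above; it prints "d2 04" on a little-endian machine and "04 d2" on a big-endian one:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint16_t x = 1234; //0x04D2
    uint8_t bytes[sizeof(x)];
    std::memcpy(bytes, &x, sizeof(x)); //memcpy avoids any pointer-cast aliasing questions
    std::printf("%02x %02x\n", (unsigned)bytes[0], (unsigned)bytes[1]);
    return 0;
}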

Scope

First of all, where do we need to deal with little-endian/big-endian issues? In fact, there are only two scenarios I know of where it matters. The first is reading files (which might have been written on a different machine), and the second is network communication. From our perspective, both of these cases are essentially the same: we're transferring data from one machine to another.

Serialization/Marshalling

One thing which should be noted for both these data-transfer-between-machines scenarios is that you should never transfer data as a C structure; instead, you should serialize/marshal it. Putting a C structure in a file (which may be read on another machine) or sending it over the network is a Really Bad Idea for several reasons. First of all, when writing/reading a C structure to external storage, you become a hostage of the implicit alignment rules of the compiler you're using. In general, when you have a structure such as struct X { uint8_t a; uint32_t b; }; then sizeof(X) won't be 5 bytes as some might expect; in many cases sizeof(X) will be 8 bytes (1 byte of a, then 3 unused bytes of padding just to keep b aligned on a 4-byte boundary, and then 4 bytes of b), but this is not guaranteed at all. To make things worse, the amount of alignment is not specified by the standard, so when you switch from one compiler to another it may change (not to mention switching between CPUs); to make things even worse, it can be affected by compiler switches, and on a struct-by-struct basis by things such as #pragma pack. If you are using types such as int or long (rather than guaranteed-size types such as uint8_t and uint32_t), things worsen even further (yes, this is possible) due to the different sizes of these types on different platforms. Oh, and don't forget that variable-length strings and C++ containers are clearly off-limits. There are other (rather minor) reasons for avoiding writing C structures directly: you'll write more data than necessary, the written data will include garbage (which will affect the ability to compress it), and so on. However, the most important issue is the (lack of) inter-platform and inter-compiler compatibility mentioned above. These issues are so important that in the networking world sending C structures over the network is universally considered a Big No-No. So, what should you do when you need to send a C structure over the network (or save it to a file)? You should serialize it first (in the networking world the term "marshal" is generally preferred, though it is essentially the same thing).
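
To illustrate the padding issue, here is a minimal sketch (mine, not from the original article; the exact numbers depend on your compiler and its settings) which prints the size and field offsets of struct X; with typical default settings it reports a size of 8 and offsets 0 and 4:

#include <cstdint>
#include <cstddef>
#include <cstdio>

struct X { uint8_t a; uint32_t b; };

int main() {
    //with default alignment rules, 3 bytes of padding are typically inserted after 'a'
    std::printf("sizeof(X)=%zu offsetof(a)=%zu offsetof(b)=%zu\n",
                sizeof(X), offsetof(X, a), offsetof(X, b));
    return 0;
}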

Implementing Serialization

The idea behind serialization is simple: for the struct X above you write one byte of a and 4 bytes of b, avoiding alignment issues. In fact, you can go further and use, for example, VLQ [WikipediaVLQ] variable-length encoding, or put null-terminated strings into your serialized data. One way of serializing data (the one I prefer) is to have serialize/deserialize functions such as:

void serialize_uint16(DataBlock&, uint16_t); //DataBlock should grow as the data is serialized
uint16_t deserialize_uint16(Parser&); //there is a constructor Parser(DataBlock&)

When we have these functions implemented, serializing our struct X will look as follows (and deserializing will look similar):

DataBlock data;
serialize_uint8(data, x.a);
serialize_uint32(data, x.b);

So far so good; now let's see how we can implement our serialize_uint16() function. If we implement it according to [Roy2013], it would look like:

void serialize_uint16(DataBlock& data, uint16_t u16) {
  void* ptr = data.grow(2); //add 2 bytes to the end of data, and return a pointer to these 2 bytes
  u16 = LittleShort(u16); //calls ShortSwap on big-endian systems, and ShortNoSwap on little-endian systems
  *(uint16_t*)ptr = u16; //(*)
}

This would work fine on x86 and x86-64, but on other platforms the line marked as (*) may run into problems. The problem is that our ptr might be either even or odd; and if it is odd, some CPUs will refuse to access 2-byte data at it (they will also usually refuse to access 4-byte data unless its address is a multiple of 4, and so on). This never happens on x86/x86-64, but it does happen on SPARC, and may or may not happen on ARM (unless we specify the __packed qualifier for uint16_t*, which is not universally available).

Another Popular Alternative

Another popular alternative (thanks to Servant of the Lord for reminding me of it) is based on LITTLE_ENDIAN and BIG_ENDIAN macros. In some sense, it can be seen as the same serialize_uint16() as above, but with a different implementation for BigShort() etc.:

//code courtesy of Servant of the Lord
#ifdef LITTLE_ENDIAN
    #define BigShort(x)     ShortSwap(x)
    #define LittleShort(x)  (x) //Do nothing, just 'return' the same variable.
    #define BigLong(x)      LongSwap(x)
    #define LittleLong(x)   (x)
    #define BigFloat(x)     FloatSwap(x)
    #define LittleFloat(x)  (x)
#elif defined(BIG_ENDIAN)
    #define BigShort(x)     (x)
    #define LittleShort(x)  ShortSwap(x)
    #define BigLong(x)      (x)
    #define LittleLong(x)   LongSwap(x)
    #define BigFloat(x)     (x)
    #define LittleFloat(x)  FloatSwap(x)
#else
    #error No idea about endianness
#endif

While it is faster and less bulky than the previous one (see the "Performance Analysis" section below), it has the same problem with unaligned reads/writes on non-x86 platforms :-(. In other words, for serialization purposes it won't work, for example, on SPARC (and it is not guaranteed to work on ARM).
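
The ShortSwap()/LongSwap() helpers referenced by these macros are not shown in this article; as a rough illustration (my own sketch, not code from [Roy2013]; the FloatSwap() counterpart would typically go through a uint32_t), they might look like:

#include <cstdint>

inline uint16_t ShortSwap(uint16_t v) {
    //swap the two bytes of a 16-bit value
    return (uint16_t)((v >> 8) | (v << 8));
}

inline uint32_t LongSwap(uint32_t v) {
    //swap the four bytes of a 32-bit value
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
}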

What Is to be Done?

What is to be done? -- name of the novel by Nikolay Chernyshevsky, 1863 --
The answer is quite simple, actually. Instead of writing two bytes as one chunk, we can always write them byte by byte:

void serialize_uint16(DataBlock& data, uint16_t u16) {
  uint8_t* ptr = data.grow(2);
  *ptr++ = (uint8_t)u16;
  *ptr = (uint8_t)(u16 >> 8);
} //deserializing is very similar

This technique is well known (see, for example, [Pike2012], though the idea had been known for years before); a Really Good Thing about it is that we don't need to care about endianness at all(!). This holds because the code above doesn't perform any casts and calculates all the bytes in a completely endianness-agnostic manner (see the "What is Endianness" section above); and all writes are exactly 1 byte in size, so there is no chance for endianness to manifest itself. While [Pike2012] argues that all the other marshalling methods represent a fallacy, I'm not so sure about it, and will describe an improvement over byte-by-byte marshalling in a moment.
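
For completeness, the byte-by-byte reading side might look like the following (a sketch of mine; it assumes a hypothetical Parser::read_bytes() helper which advances the parser and returns a pointer to the requested bytes, as the Parser interface isn't spelled out above):

uint16_t deserialize_uint16(Parser& parser) {
  const uint8_t* ptr = parser.read_bytes(2); //hypothetical helper: advance by 2 bytes, return pointer to them
  //assemble the value from little-endian wire order; endianness-agnostic, 1-byte reads only
  return (uint16_t)(ptr[0] | ((uint16_t)ptr[1] << 8));
}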

Further Optimization

When I really care about performance (which I usually do, as server-side handling of billions of network messages per day is quite a significant load), I often add special handling for those platforms of the most interest, for example (taken from [NoBugs2015]):

void serialize_uint16(DataBlock& data, uint16_t u16) {
  //assuming little-endian order on the wire
  uint8_t* ptr = data.grow(2);
#if defined(__i386) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
  *(uint16_t*)ptr = u16; //safe and fast, as x86/x64 is ok with unaligned writes
#else
  *ptr++ = (uint8_t)u16;
  *ptr = (uint8_t)(u16 >> 8);
#endif
}

With this approach, we have the best of both worlds: (a) a universal version (the one under #else) which works everywhere, and (b) an optimized version for specific platforms which we know for sure will work efficiently there.
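
The reading side can be specialized in the same manner; here is a sketch of mine (not taken verbatim from [NoBugs2015], and again assuming the hypothetical Parser::read_bytes() helper mentioned above):

uint16_t deserialize_uint16(Parser& parser) {
  const uint8_t* ptr = parser.read_bytes(2);
#if defined(__i386) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
  return *(const uint16_t*)ptr; //x86/x64 is ok with unaligned reads, and the wire order is already little-endian
#else
  return (uint16_t)(ptr[0] | ((uint16_t)ptr[1] << 8)); //endianness-agnostic byte-by-byte read
#endif
}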

Performance Analysis

Now, let's analyze the relative performance of all four approaches: (a) the one from [Roy2013], (b) the LITTLE_ENDIAN/BIG_ENDIAN-based one, (c) the endianness-agnostic one (see [Pike2012]), and (d) the one from [NoBugs2015]. For the purposes of our analysis, let's assume that all the data is already in the L1 cache (this is the most common case for continuous reading; if it isn't, the penalties will be the same for all the methods). Also, let's assume that L1 reads cost 2-3 clocks (L1 latencies on modern x86 CPUs are around 4-5 clocks, but latency doesn't translate directly into overall execution time, so 2-3 clocks works as a reasonable estimate); writes are usually around 1 clock. Also, we won't count the costs of data.grow() for writing and of parser offset management for reading; these can be done in a manner that ensures amortized costs are quite low (of the order of single-digit clocks), and they will be the same regardless of the endianness-handling method.

LITTLE_ENDIAN/BIG_ENDIAN-based marshalling is certainly not bad performance-wise: it can be inlined easily, and has a cost of 1 clock for writing and 2-3 clocks for reading. [Roy2013] causes a function-call-by-pointer on each conversion; most importantly, such calls cannot possibly be inlined. From my experience, on x86 function calls with such parameters usually cost around 15-20 CPU clocks compared to an inlined version (while CALL/RET instructions are cheap, all the required PUSH/POPs and creating/destroying the stack frame, taken together, are not). In addition, it will need roughly 1 clock to write the data (and 2-3 clocks to read it), making the total around 16-21 clocks for writing and around 17-23 clocks for reading. The endianness-agnostic approach can be inlined easily; however, it causes 2 writes/reads instead of 1 (and compilers don't combine them; at least, I've never seen a compiler combine 2 writes/reads), which normally translates into 2 clocks for writing and 4-6 clocks for reading; it also requires shifts and casts, which cost around 2 additional clocks, making the total around 4 clocks for writing and around 6-8 clocks for reading. [NoBugs2015] is optimized for x86 and can be inlined too; the same as the LITTLE_/BIG_ENDIAN one, it has a cost of 1 clock for writing and 2-3 clocks for reading.

Of course, the analysis above is very approximate, and there are other things not taken into account (such as the larger size of inlined functions, which in general may affect caches), but I feel that in most cases these other considerations won't affect the overall picture.

Implementation Notes

One thing to note when implementing marshalling is that in most cases it is simpler to do it using unsigned integers rather than signed ones; while using signed types isn't formally a bad thing, in practice it tends to cause trouble. It isn't that marshalling can't be implemented with signed ints - it is just simpler to implement it with unsigned ints, with one less thing to worry about. For a list of troubles which can be caused by using signed types for marshalling, see the comment by SiCrane below; however, you don't really need to care about them - just use unsigned and you'll be fine :-). Another thing to keep in mind is to use guaranteed-size types such as uint32_t and uint16_t (rather than int and short). You never know where your code will be compiled 5 years from now, and just recently I've seen a guy who needed to fix his code because, when compiling for AVR8, sizeof(int) is 2 (while sizeof(uint32_t) is always 4, regardless of the platform).
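
As a quick illustration of the sign-extension trouble described in SiCrane's comment below (a minimal sketch of mine, assuming a two's-complement machine):

#include <cstdint>
#include <cstdio>

int main() {
    int8_t a = 0x01;
    int8_t b = (int8_t)0xff; //bit pattern 0xff, i.e. -1 as a signed byte
    //sign extension turns b into 0xffffffff before the bitwise OR,
    //so the "reassembled" value comes out as 0xffff instead of the intended 0x01ff
    uint16_t wrong = (uint16_t)((a << 8) | b);
    uint8_t ua = 0x01, ub = 0xff;
    uint16_t right = (uint16_t)((ua << 8) | ub); //0x01ff, as expected
    std::printf("signed: %04x  unsigned: %04x\n", (unsigned)wrong, (unsigned)right);
    return 0;
}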

Summary

The properties of the four ways to handle endianness can be summarized in the following table:

Approach            | Applicability                             | Clocks on x86, write uint16_t | Clocks on x86, read uint16_t | Clocks on x86, write uint32_t | Clocks on x86, read uint32_t
[Roy2013]           | Only C-structure based                    | 16-21                         | 17-23                        | 16-21                         | 17-23
LITTLE_/BIG_ENDIAN  | Only C-structure based                    | 1                             | 2-3                          | 1                             | 2-3
Endianness-agnostic | Both C-structure based and serialization  | 4                             | 6-8                          | 8                             | 12-16
[NoBugs2015]        | Both C-structure based and serialization  | 1                             | 2-3                          | 1                             | 2-3

Of course, on non-x86 platforms the picture won't be as good for [NoBugs2015] as written above, but it will still perform exactly like the endianness-agnostic one, and there will be the option to optimize it for a specific platform (in a manner similar to the x86 optimization) if necessary and possible.

References

[Roy2013] Promit Roy, Writing Endian Independent Code in C++, 2013, http://www.gamedev.net/page/resources/_/technical/general-programming/writing-endian-independent-code-in-c-r3301
[WikipediaEndianness] https://en.wikipedia.org/wiki/Endianness
[WikipediaVLQ] https://en.wikipedia.org/wiki/Variable-length_quantity
[Pike2012] Rob Pike, The Byte Order Fallacy, 2012, http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
[NoBugs2015] 'No Bugs' Hare, 64 Network DO's and DON'Ts for Game Engine Developers, Part IIa: Protocols and APIs, 2015, http://ithare.com/64-network-dos-and-donts-for-game-engine-developers-part-iia-protocols-and-apis/

Article Update Log

Jun 5, 2015: Minor change to the explanation of "why signed is bad" in the Implementation Notes section
Jun 4, 2015: Added LITTLE_/BIG_ENDIAN approach to those being analyzed
Jun 4, 2015: Added 'Implementation Notes' section
Jun 2, 2015: Initial release


Comments

SiCrane

One thing I would emphasize is the importance of using unsigned types when doing these kinds of integer operations. In particular, deserializing from a signed byte array may give weird results when any of the bytes are negative.

June 03, 2015 05:28 AM
Servant of the Lord

It sounds like your problem with Promit's method is purely because of the function pointers?

If so, another common way to do this, avoiding unnecessary function calls, is to use macros at compile time instead of function pointers; that way they can still be inlined.

For example: (translating Promit's 'InitEndian()' function)


#if LITTLE_ENDIAN
    #define BigShort(x)     ShortSwap(x)
    #define LittleShort(x)  (x) //Do nothing, just 'return' the same variable.
    #define BigLong(x)      LongSwap(x)
    #define LittleLong(x)   (x)
    #define BigFloat(x)     FloatSwap(x)
    #define LittleFloat(x)  (x)
#elif BIG_ENDIAN
    #define BigShort(x)     (x)
    #define LittleShort(x)  ShortSwap(x)
    #define BigLong(x)      (x)
    #define LittleLong(x)   LongSwap(x)
    #define BigFloat(x)     (x)
    #define LittleFloat(x)  FloatSwap(x)
#else
    #error blah
#endif

If using macros to redirect to the correct function at compile time, I'd also make the correct functions constexpr, as a hint to the compiler. Or you can put your byte swapping directly into the macros.

June 03, 2015 09:20 PM
Sergey Ignatchenko

@SiCrane: yes, this is one of those things you need to keep in mind when implementing marshalling. I've tried to explain it (without going into too many details) in 'Implementation Notes' section. I'm not 100% sure about the wording there, so further comments are very welcome.

June 04, 2015 10:26 AM
Sergey Ignatchenko

@Servant of the Lord: thanks, I forgot to mention this one. I've added it to the analysis now. In short - while performance-wise it is perfectly fine, it is not safe for marshalling purposes due to unaligned reads/writes (which don't work on quite a few platforms, though they always work on x86). In fact, my own way [NoBugs2015] is a combination of two well-known things - it uses pretty much the same thing as yours for x86 (where unaligned reads are not a problem), and the byte-by-byte thing (which is guaranteed to work everywhere) for those "unknown" and "not so sure" platforms (with an option to rewrite it for a specific platform of interest as soon as performance on that platform becomes an issue).

June 04, 2015 10:52 AM
Servant of the Lord

Ah, thanks for explaining. That makes sense.

June 04, 2015 04:51 PM
SiCrane

Well, while the difference between arithmetic right shift and whatever other implementation choice is used for signed right shift can make a difference in some situations, it doesn't really come up in this case, because whenever you right shift for serialization you don't actually care what gets filled in at those top bits. If you want to worry about implementation-specific details regarding signed/unsigned, I'd start with the fact that casting an integer to a smaller signed type has a completely implementation-defined value if its value can't be represented in that smaller type, which is not the case for a smaller unsigned type. For example, casting the uint16_t 0xff00 to uint8_t will always give you 0x00. Cast it to a signed 8-bit value, and you can get any value the implementation feels like. Fortunately for the intended audience, while platforms exist where this can be an issue, you're unlikely to run into them when doing application programming for any gaming platform that I can think of. I suppose you could also worry about weirdness that happens during conversion of negative numbers between different integer sizes if the platform doesn't use two's complement, but that's probably even more academic for the target audience. There's also the fact that left shifting a negative signed integer is explicitly undefined behavior under C++11. That's right, undefined, not implementation defined. Again, in practice, on the platforms that game programmers work with, it's not an issue and works more or less like you expect.

However, what I was thinking about is (mostly) platform independent: sign extension when signed bytes are converted to larger integer types. So if you're deserializing two bytes a and b to a 16 bit number, the straightforward way to combine them is either (a << 8) + b or (a << 8) | b. If a contains the bit pattern 0x01 and b contains the bit pattern 0xff, then you're going to get the bit pattern 0x01ff out from unsigned types with either computation. With a and b as signed characters you're probably going to get 0x00ff out from the addition version and 0xffff out from the bit-or version (the "probably" referring to the fact that I'm assuming two's complement).

June 05, 2015 03:37 AM
Sergey Ignatchenko

@SiCrane: thanks, I've just added a reference to your comment to the article. I haven't tried using signed ints for marshalling since around my first marshalling code in 1997, so trying to remember what might go wrong is difficult :-).

June 05, 2015 05:26 AM
alh420

Wouldn't this mean throwing away quite a lot of performance on all platforms that are not little-endian?

Writing byte-by-byte instead of word by word can be quite a big performance loss.

I'm not sure about current ARM processors, but at least on older ones it takes the exact same time to read/write a byte as it takes to read/write a full word, so writing byte-by-byte would mean a performance loss of x4!

Making sure all your memory accesses are aligned before pulling/pushing stuff from/to file or network doesn't really seem like that big a deal that it is worth throwing away that much performance for it. Most of the time, the compiler handles it for you.

June 05, 2015 07:59 AM
Sergey Ignatchenko

@Olof Hedman:

> Wouldn't this mean throwing away quite a lot of performance on all platforms that are not little-endian?

Well, performance for whatever-order-you-pass-on-the-network is inherently better on systems-that-use-the-same-order. You can choose to pass it as big-endian and get a performance advantage on big-endian systems, but they're really few and far between these days (while ARM is technically either-little-or-big-endian, AFAIR both iOS and Android use it as little-endian).

As for the performance degradation for byte-by-byte: you're right, it does degrade (see estimates in the table), but have you seen "Further Optimization" section? It allows for exactly the-best-possible way for x86, and can be applied to ARMs too if ARM is one of your primary target platforms (and newer ARMv7 is much more lenient towards unaligned reads than, say, ARMv5).

As for the alignment: as discussed above, "compiler handles it for you" is the root of the problem. The problem is that different compilers handle it differently, and - if you have dozens of different application-level data structures to be passed over the network - making it consistent across all the platforms is such a major and never-ending headache that it is not worth a minor performance loss due to unaligned reads. Add the inability to pass any variable-size (e.g. textual) data efficiently without marshalling - and I have a very strong case against C-structures-over-the-network. Looking at it from a design point of view, by passing a C structure over the network you're exposing too much of your internal implementation to the other side, creating unnecessary coupling, and making separation of concerns significantly dirtier than necessary. Outside of the gaming world, pretty much nobody ever passes application-level C structures over the network, and for a good reason.

June 05, 2015 09:02 AM
Brain

It would be nice to see mention of the Sockets API calls that can be used for endian neutral code.

Rather than reinventing the wheel perhaps we could make use of for example ntohl(), htonl(), etc?

This is a good article though, which definitely will help a lot of people understand and fix the issue.

June 05, 2015 12:27 PM
Sergey Ignatchenko

@braindigitalis: ntohl/htonl is certainly a valid option; however, the "network byte order" they produce is big-endian, and with most of the gaming platforms (and even more importantly - servers) these days being little-endian, they're bound to lose a few clocks even in the case of an ideal implementation (a macro rather than a function, and so on).

In addition, from my experience, when writing marshalling, you're pretty much bound to provide stuff which goes beyond simple little-/big-endian integers/floats (one notable example is strings), so a marshalling library needs to be developed anyway; then marshalling of big-/little-endian stuff becomes an implementation detail of this library (which can be implemented either via one of the methods described above, or via htons/ntohs). And when you already have it as an implementation detail - IMHO the code above is simple enough to implement it yourself (it's certainly not rocket science) :-).

June 05, 2015 02:08 PM
JoshuaWaring

You can accelerate the swapping of bytes by using bswap; you should be able to swap 2, 4 and 8 bytes at a time.

http://www.c-jump.com/CIS77/ASM/DataTypes/T77_0230_bswap.htm

Edit:

Arm also has instructions to swap the byte order

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABJJDDB.html

June 06, 2015 10:16 AM
Sergey Ignatchenko

@Joshua5: you're right, but how will it affect the overall picture? Most of the performance losses in [Roy2013] are due to un-inlinable function calls anyway...

June 06, 2015 12:40 PM
Ryan_001

I've recently been using the boost endian library (http://www.boost.org/doc/libs/1_58_0/libs/endian/doc/index.html), and I quite like it.

June 21, 2015 04:31 PM
Sergey Ignatchenko

I've recently been using the boost endian library (http://www.boost.org/doc/libs/1_58_0/libs/endian/doc/index.html), and I quite like it.

If going boost::, I'd certainly suggest taking a look at boost::Serialization instead (though I cannot vouch for its performance). As noted above, conversions alone are not enough to serialize the data (due to alignment issues, which happen even more often than big-to-little-endian issues). So, storing a C structure and doing the endian conversion later is very risky (as on the other platform the compiler alignment rules may be very different, and you will spend lots of time trying to achieve exactly the same alignment across different compilers; while it is always possible (and there is even an algorithm for doing it "almost for sure"), it is a Big Headache).

June 22, 2015 05:11 AM
WakkawAkka

Great Explanation.

However for most scenarios I think it's more practical to use just one of the existing serialization formats like protocol buffers.

September 14, 2015 09:26 AM
Sergey Ignatchenko

However for most scenarios I think it's more practical to use just one of the existing serialization formats like protocol buffers.

Thanks for mentioning protocol buffers; I took a quick look, and while the IDL idea (their .proto file) is good in general, and streaming (though not C++ iostream) is good in general too, protocol buffers' wire format is suboptimal both size-wise (despite attempts to optimize in this regard) and speed-wise. I am sure that Google can afford having a few extra servers to compensate for being suboptimal, but just like Scrooge McDuck, "I cannot stand waste" :-). In particular, the requirement for the parser to handle fields in arbitrary order is quite a waste for most practical applications (games included), and implementing this requirement leads to quite inefficient parsers (of course, they will still shine against XML, but will inevitably be inferior to linear parsers).

For Python, it might be ok, but for C++ I've seen certain messages processed at an average time of about 1000 CPU clocks, so every pipeline stall (and you will inevitably have a lot of them with an arbitrary-order parser) can count.

If somebody could write an alternative and more efficient wire protocol for protocol buffer's IDL (just like ASN.1 has one IDL and several wire formats - BER, DER, CER, XER) - it would be great.

September 14, 2015 03:10 PM