Jump to content
Site Stability Read more... ×
  • Advertisement

Sergey Ignatchenko

  • Content Count

  • Joined

  • Last visited

Community Reputation

625 Good


About Sergey Ignatchenko

  • Rank

Personal Information

  • Interests

Recent Profile Visitors

The recent visitors block is disabled and is not being shown to other users.

  1. Sergey Ignatchenko

    Static, zero-overhead (probably) PIMPL in C++

      IMO move constructor should do the trick. As in - you're creating an object on-stack, then construct Foo<> (wherever you want), which calls move-constructor from your on-stack object within. 
  2. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

      Thanks for mentioning protocol buffers; I've took a quick look, and while IDL idea (their .proto file) is good in general, and streaming (though not C++ iostream) is good in general too, protocol buffer's wire format is suboptimal both for data sizes (despite attempts to optimize in this regard) and speed-wise. I am sure that Google can afford having a few extra servers to compensate for being suboptimal, but just like Scrooge McDuck, "I cannot stand waste" :-). In particular, a requirement for the parser to handle fields in arbitrary order is quite a waste for most practical applications (games included), and implementing this requirement leads to quite inefficient parsers (of course, they will still shine against XML, but will inevitably be inferior to linear parsers).   For Python, it might be ok, but for C++ I've seen certain messsages processed at an average time of about 1000 CPU clocks, so every pipeline stall (and you will inevitably have a lot of them for arbitrary-order parser) can count.   If somebody could write an alternative and more efficient wire protocol for protocol buffer's IDL (just like ASN.1 has one IDL and several wire formats - BER, DER, CER, XER) - it would be great.
  3. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

    If going boost::, I'd certainly suggest to take a look at boost::Serialization instead (though I cannot vouch for its performance). As noted above, conversions alone are not enough to serialize the data (due to alignment issues, which happen even more often than bit-to-little-endian issues). So, storing C structure and doing endian conversion later, is very risky (as on the other platform compiler alignment rules may be very different, and you will spend lots of time trying to achieve exactly the same alignment across different compilers; while always possible (and there is even an algorithm how to do it "almost for sure"),  it is a Big Headache).
  4. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

    @Joshua5: you're right, but how it will affect overall picture? Most of performance losses in Roy2013 are due to un-inlinable function calls anyway...
  5. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

    @braindigitalis: ntohl/htonl is certainly a valid option; however, "network byte order" they're producing is big-endian, and with most of the gaming platforms (and even more importantly - servers) these days being little-endian, they're bound to lose a few clocks even in the case of ideal implementation (macro rather than function, and so on).   In addition, from my experience, when writing marshalling, you're pretty much bound to provide stuff which goes beyond the simple little-big-endian integers/floats (one notable example is strings), so marshalling library needs to be developed anyway; then marshalling of big-little-endian stuff becomes an implementation detail of this library (which can be implemented either via one of the methods described above, or via htons/ntohs). And when you already have it as an implementation detail  - IMHO the code above is simple enough to implement it yourself (it's certainly not a rocket science) :-).
  6. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

    @Olof Hedman: > Wouldn't this mean throwing away quite a lot of performance on all platforms that are not little-endian? Well, performance for whatever-order-you-pass-on-the-network, is inherently better on systems-that-use-the-same-order. You can choose to pass it as big-endian, and get performance advantage on big-endian systems, but they're really few and far between these days (while ARM is technically either-little-or-big-endian, AFAIR both iOS and Android use it as little-endian).   As for the performance degradation for byte-by-byte: you're right, it does degrade (see estimates in the table), but have you seen "Further Optimization" section? It allows for exactly the-best-possible way for x86, and can be applied to ARMs too if ARM is one of your primary target platforms (and newer ARMv7 is much more lenient towards unaligned reads than, say, ARMv5).   As for the alignment: as discussed above, "compiler handles it for you" is the root of the problem. The problem is that different compilers handle it differently, and - if you have dozens of different application-level data structures to be passed over the network, making it consistent across all the platforms is so major and never-ending headache that it is not worth a minor performance loss due to unaligned reads. Adding inability to pass any variable-size (e.g. textual) data efficiently without marshalling - and I have a very strong case against C-structures-over-the-network. Looking at it from design point of view, by passing C structure over the network you're exposing too much of your internal implementation to the other side, creating unnecessary coupling, and making separation of concerns significantly more dirty than necessary. Outside of gaming world, pretty much nobody ever passes application-level C structures over the network, and for a good reason.
  7. Sergey Ignatchenko

    Writing Endian Independent Code in C++

    NB (and shameless bragging ;-) ): another article with a supposed improvement over this approach has been published on GDNet: http://www.gamedev.net/page/resources/_/technical/general-programming/writing-efficient-endian-independent-code-in-c-r4080
  8. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

    @SICrane: thanks, I've just added a reference to your comment into the article. I didn't try using signed ints for marshalling since around my first marshalling code in 1997, so trying to remember what might go wrong, is difficult :-).
  9. Once upon a time, there was an article published on gamedev.net [Roy2013], which described a way (as he says, mostly taken from Quake2 engine) to deal with Little-Endian/Big-Endian issues in games. While this approach is mostly sound ("mostly" because of unaligned-read issues which will be discussed below), it is not the most efficient one. Better (simpler, faster, and more general) approaches do exist, and they will be discussed below. What is Endianness Endianness itself has been described in many different works, including [[Roy2013] and [WikipediaEndianness]. Basically, it is a way that CPU stores multi-byte data in memory; little-endian systems store least significant byte first, and big-endian ones store most-significant byte first. So, if you have uint16_t x = 1234; then x will look as {0xD2, 0x04} on a little-endian system, and as {0x04, 0xD2} on a big-endian system. As a result, code such as send(socket,&x,2); will send different data over the wire depending on system endianness (little-endian systems will send {0xD2,0x04}, and big-endian ones will send {0x04,0xD2}). It is important to note that endianness effects cannot be observed unless we have some kind of cast between pointers on data of different sizes (in the example above, there was an implicit cast from &x, which is uint16_t*, to void*, which is actually treated as byte pointer by the send() function). In other words, as long as we keep away from casts and stay within arithmetical and bitwise operations without pointer casts, the result is always the same regardless of endianness. 2+3 is always 5, and (((uint16_t)0xAA)[lessthan][lessthan]3)^0xCCCC is always 0xC99C, regardless of the system where our code is running. Let's name such calculations endianness-agnostic. Scope First of all, where do we need to deal with little-endian/big-endian issues? In fact, there are only two scenarios of which I know, where it is important. First one is reading files (which might have been written on a different machine), and another one is network communication. From our perspective, both of these cases are essentially the same: we're transferring data from one machine to another one. Serialization/Marshalling One thing which should be noted for both these data-transfer-between-machines scenarios, is that you should never transfer data as a C structure; instead, you should serialize/marshal it. Putting C structure in a file (which may be read on another machine) or over the network, is a Really Bad Idea for several reasons. First of all, when writing/reading C structure to external storage, you're becoming a hostage of implicit alignment rules of the compiler you're using. In general, when you have a structure such as struct X { uint8_t a; uint32_t b; }; then sizeof(X) won't be 5 bytes as some might expect; in many cases sizeof(X) will be 8 bytes (1 byte of a, then 3 unused bytes of padding just to make b aligned on a 4-byte boundary, and then 4 bytes of b), but this is not guaranteed at all. To make things worse, the amount of alignment is not specified by standards, so when you're switching from one compiler to another one, it may change (not to mention switching between CPUs); to make things even worse, it can be affected by compiler switches and on a struct-by-struct basis by things such as #pragma pack. If you are using types such as int or long (rather than guaranteed-size-types such as uint8_t and uint32_t), things worsen even further (yes, this is possible) - due to different sizes of these types on different platforms. Oh, and don't forget that variable-length strings and C++ containers are clearly off-limits. There are other (rather minor) reasons for avoiding writing C structures directly: you'll write more data then necessary, the data written will include garbage (which will affect the ability to compress it), and so on. However, the most important issue is (lack of) inter-platform and inter-compiler compatibility mentioned above. These issues are so important, that in the networking world sending C structures over the network is universally considered a Big No-No. So, what you should do when you need to send a C structure over the network (or to save it to the file)? You should serialize it first (in network world term "marshal" is generally preferred, though it is essentially the same thing). Implementing Serialization The idea behind serialization is simple: for the struct X above you write one byte of a, and 4 bytes of b, avoiding alignment issues. In fact, you can go further and use, for example, VLQ [WikipediaVLQ] variable-length encoding, or put null-terminated strings into your serialized data. One way of serializing data (the one I prefer), is to have serialize/deserialize functions such as void serialize_uint16(DataBlock&, uint16_t);//DataBlock should grow as the data is serialized uint16_t deserialize_uint16(Parser&);//there is constructor Parser(DataBlock&) When we have these functions implemented, then serializing our struct X will look as DataBlock data; serialize_uint8(data,x.a); serialize_uint32(data,x.b); (and deserializing will look similar). So far so good, now let's see how we can implement our serialize_uint16() function. If we will implement it according to [Roy2013], it would look like: void serialize_uint16(DataBlock& data,uint16_t u16) { void* ptr = data.grow(2);//add 2 bytes to the end of data, and return pointer to these 2 bytes u16 = LittleShort(u16);//calling ShortSwap on big-endian systems, and ShortNoSwap on little-endian systems *(uint16_t*)ptr = u16; //(*) } This would work fine on x86 and x86-64, but on the other platforms the line marked as (*) may run into problems. The problem is that our ptr might be either even or odd; and if it is odd - some CPUs will refuse to read 2-byte data from it (also they will usually refuse to read 4-byte data unless its address is a multiple of 4, and so on). This never happens on x86/x86-64, but happens on SPARC, and may or may not happen on ARM (unless we specify __packed qualifier for uint16_t*, but it is not universally available). Another Popular Alternative Another popular alternative (thanks to Servant of the Lord for reminding me of it) is based on LITTLE_ENDIAN and BIG_ENDIAN macros. In some sense, it can be seen as the same serialize_uint16() as above, but using different implementation for BigShort() etc.: //code courtesy of Servant of the Lord #ifdef LITTLE_ENDIAN #define BigShort(x) ShortSwap(x) #define LittleShort(x) (x) //Do nothing, just 'return' the same variable. #define BigLong(x) LongSwap(x) #define LittleLong(x) (x) #define BigFloat(x) FloatSwap(x) #define LittleFloat(x) (x) #elif defined(BIG_ENDIAN) #define BigShort(x) (x) #define LittleShort(x) ShortSwap(x) #define BigLong(x) (x) #define LittleLong(x) LongSwap(x) #define BigFloat(x) FloatSwap(x) #define LittleFloat(x) (x) #else #error No idea about endianness #endif While it is faster and less bulky than a previous one (see "Performance Analysis" section below), it has the same problem with unaligned read/writes on non-x86 platforms :-(. In other words, for serialization purposes it won't work, for example, on SPARC (and it working for ARM is not guaranteed). What Is to be Done? What is to be done? -- name of the novel by Nikolay Chernyshevsky, 1863 -- The answer is quite simple, actually. Instead of writing two bytes as one chunk, we can always write it byte-by-byte: void serialize_uint16(DataBlock& data,uint16_t u16) { uint8_t* ptr = data.grow(2); *ptr++ = (uint8_t)u16; *ptr = (uint8_t)(u16 >> 8); }//deserializing is very similar This technique is well-known (see, for example, [Pike2012], but the idea is known for years before); a Really Good Thing about it is that we don't need to care about endianness at all(!). This stands because the code above doesn't perform any casts, and calculates all the bytes in a completely endianness-agnostic manner (see "What is Endianness" section above); and all writes are exactly 1 byte in size, so there is no chance for endianness to manifest itself. While [Pike2012] argues that all the other marshalling methods represent a fallacy, I'm not so sure about it, and will describe an improvement over byte-by-byte marshalling in a moment. Further Optimization When I really care about performance (which I usually do, as server-side handling of billions of network messages per day is quite a significant load), I often add a special handling for those platforms of the most interest, for example (taken from [NoBugs2015]): void serialize_uint16(DataBlock& data, uint16_t u16) { //assuming little-endian order on the wire uint8_t* ptr = data.grow(2); #if defined(__i386) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64) *(uint16_t*)ptr = u16; // safe and fast as x86/x64 are ok with unaligned writes #else *ptr++ = (uint8_t)u16; *ptr = (uint8_t)(u16 >> 8); #endif } With this approach, we have the best of both worlds: (a) universal version (the one under #else) which works everywhere, and (b) optimized version for specific platforms which we know for sure will work efficiently there. Performance analysis Now, let's analyze relative performance of all three approaches: (a) the one from [Roy2013], (b) LITTLE_ENDIAN/BIG_ENDIAN based, (c) endianness-agnostic one (see [Pike2012]), and (d) the one from [NoBugs2015]. For the purposes of our analysis let's assume that all the data is already in L1 cache (it is the most common case for continuous reading, if it isn't, penalties will be the same for all the methods). Also, let's assume that L1 reads cost 2-3 clocks (L1 latencies with modern x86 CPUs are around 4-5 clocks, but latency isn't exactly translated to overall execution time, so 2-3 clocks works as a reasonable estimate); writes are usually around 1 clock. Also, we won't count costs of data.grow() for writing and of parser offset management for reading; it can be done in a manner that ensures amortized costs are quite low (of the order of single-digit clocks), and it will be the same regardless of the endianness handling method. LITTLE_ENDIAN/BIG_ENDIAN-based marshalling is certainly not bad performance-wise: it can be inlined easily, and has a cost of 1 clock for writing and 2-3 clocks for reading. [Roy2013] causes a function-call-by-pointer on each conversion; most importantly, such calls cannot possibly be inlined. From my experience, on x86 function calls with such parameters usually cost around 15-20 CPU clocks (while CALL/RET instructions are cheap, all the required PUSH/POPs and creating/destroying stack frame, taken together, are not), compared to inlined version. In addition, it will need roughly 1 clock to write the data (and 2-3 clocks to read), making the total for writing around 16-21 clocks and total for reading around 17-23 clocks. Endianness-agnostic approach can be inlined easily; however, it causes 2 writes/reads instead of 1 (and compilers don't combine them, at least I've never seen a compiler combine 2 writes/reads), which is normally translated into 2 clocks for writing and 4-6 clocks for reading; also, it requires shifts and casts, which also cost around 2 additional clocks, making totals for writing around 4 clocks and total for reading around 6-8 clocks. [NoBugs2015] is optimized for x86, and can be inlined too; same as LITTLE_/BIG_ENDIAN one, it has cost of 1 clock for writing and 2-3 clocks for reading. Of course, the analysis above is very approximate and there are other things not taken into account (such as larger size of inline functions, which in general may affect caches), but I feel that these other considerations in most cases won't affect the overall picture. Implementation Notes One thing to note when implementing marshaling, is that in most cases it is simpler to do it using unsigned integers rather than signed ones; while using signed types isn't formally a bad thing, in practice it tends to cause trouble. Not that it isn't possible to implement marshalling with signed ints - it is just simpler to implement it with unsigned ints, with one less thing to worry about. For a list of troubles which can be caused by using signed stuff for marshalling - see a comment by SICrane below; however, you don't really need to care about them - just use unsigned and you'll be fine :-). Another thing to keep in mind is to use those guaranteed-size types such as uint32_t and uint16_t (rather than int and short). You never know where your code will be compiled in 5 years from now, and just recently I've seen a guy who needed to fix his code because when compiling for AVR8, sizeof(int) is 2 (but sizeof(uint32_t) is always 4, regardless of the platform). Summary Properties of the three ways to handle endianness can be summarized in the following table: [table] [tr] [th][/th] [th]Applicability[/th] [th]Clocks on x86, write uint16_t[/th] [th]Clocks on x86, read uint16_t[/th] [th]Clocks on x86, write uint32_t[/th] [th]Clocks on x86, write uint32_t Clocks on x86, read uint32_t[/th] [/tr] [tr] [td][Roy2013][/td] [td]Only C-structure based[/td] [td]16-21[/td] [td]17-23[/td] [td]16-21[/td] [td]17-23[/td] [/tr] [tr] [td]LITTLE_/BIG_ENDIAN[/td] [td]Only C-structure based[/td] [td]1[/td] [td]2-3[/td] [td]1[/td] [td]2-3[/td] [/tr] [tr] [td]Endianness-agnostic[/td] [td]Both C-structure based and serialization[/td] [td]4[/td] [td]6-8[/td] [td]8[/td] [td]12-16[/td] [/tr] [tr] [td][NoBugs2015][/td] [td]Both C-structure based and serialization[/td] [td]1[/td] [td]2-3[/td] [td]1[/td] [td]2-3[/td] [/tr] [/table] Of course, on non-x86 platforms the picture won't be as good for [Nobugs2015] as it is written, but it will still perform exactly as the endianness-agnostic one, and there will be an option to optimize it for a specific platform (in the manner similar to x86 optimization) if necessary and possible. References [Roy2013] Promit Roy, Writing Endian Independent Code in C++, 2013, http://www.gamedev.net/page/resources/_/technical/general-programming/writing-endian-independent-code-in-c-r3301 [WikipediaEndianness] https://en.wikipedia.org/wiki/Endianness [WikipediaVLQ] https://en.wikipedia.org/wiki/Variable-length_quantity [Pike2012] Rob Pike, The Byte Order Fallacy, 2012, http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html [NoBugs2015] 'No Bugs' Hare, 64 Network DO's and DON'Ts for Game Engine Developers, Part IIa, Protocols and APIs, 2015 http://ithare.com/64-network-dos-and-donts-for-game-engine-developers-part-iia-protocols-and-apis/ Article Update Log Jun 5, 2015: Minor change to explanation "why signed is bad" in Implementation Detail section Jun 4, 2015: Added LITTLE_/BIG_ENDIAN approach to those being analyzed Jun 4, 2015: Added 'Implementation Notes' section Jun 2, 2015: Initial release
  10. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

    @Servant of the Lord: thanks, I've forgot to mention this one. I've added it to the analysis now. In short - while peformance-wise it is perfectly fine, it is not safe for marshalling purposes due to unaligned reads/writes (which don't work on quite a few platforms, though always work on x86). In fact, my own way [NoBugs2015] is a combination of two well-known things - it is a using pretty much the same thing as yours for x86 (where unaligned reads are not a problem), and byte-by-byte thing (which is guaranteed to work everywhere) for those "unknown" and "not so sure" platforms (with an option to rewrite it for specific platform of interest as soon as performance on that platform becomes an issue).
  11. Sergey Ignatchenko

    Writing Efficient Endian-Independent Code in C++

    @SiCrane: yes, this is one of those things you need to keep in mind when implementing marshalling. I've tried to explain it (without going into too many details) in 'Implementation Notes' section. I'm not 100% sure about the wording there, so further comments are very welcome.
  • Advertisement

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!