Endian-independent binary files?

Started by
7 comments, last by larspensjo 11 years, 4 months ago
I'm trying to write binary data on Android and read it from Windows (and also the other way around). I tried just using the std::fstream in binary mode alone. The thing is, when I read the data from Windows that I wrote in Android I get lots of corrupted values. I read that ARM is a bi-endian processor so it could be little or big endian depending on the device? How can binary file formats be written to be endian independent?
In practice endianness is fixed by the OS and ABI, not chosen per device: Android runs ARM in little-endian mode, so the processor's bi-endian capability doesn't matter here.

You are probably looking for ::htons and its ilk http://msdn.microsoft.com/en-gb/library/windows/desktop/ms738557(v=vs.85).aspx
"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

I'm trying to write binary data on Android and read it from Windows (and also the other way around). I tried just using the std::fstream in binary mode alone. The thing is, when I read the data from Windows that I wrote in Android I get lots of corrupted values. I read that ARM is a bi-endian processor so it could be little or big endian depending on the device? How can binary file formats be written to be endian independent?


Just use big-endian and write/read the data on the Android side using the DataOutputStream and DataInputStream classes (they read and write big-endian regardless of platform).
I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!
I'm writing using the NDK. Is there any way to tell if a variable is big endian already? ntohl and htonl just flip the byte order and don't check anything right?
Not quite: on a big-endian platform ntohl is a no-op, and on a little-endian one it swaps the bytes. That choice is baked in at compile time rather than checked at runtime, so you need to recompile if you ship binaries for platforms of different endianness.
"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley
A given value doesn't have any implicit endianness. You need to know the endianness of whatever wrote the data. The most robust approach for then reading the data would be to store the endianness as header information in the file, so regardless of what platform reads the file, it knows how to handle the data.
You cannot tell from looking at a variable whether it's little-endian or big-endian, except if you already know the contents of the variable (this is the basis of many runtime endianness-detection tricks).

One solution is to always enforce one particular endianness. Using ntohl as suggested above is a portable and relatively easy way of doing that for single integers.

Another approach is knowing the endianness of a file before reading it (and writing whatever is native). To know which endianness was used writing data to a file, you need to start your file with a well-known magic value. Since you know that value, you know what it must look like, and what it looks like when endianness is incorrect.
If you read the first word and it comes out correctly, just read the rest of your file. If it comes out with wrong byte order, you need to flip all values. If it comes out as something else, it's not a file of the type you expected. Microsoft's infamous byte order mark is nothing else but exactly that.

If you read the first word and it comes out correctly, just read the rest of your file. If it comes out with wrong byte order, you need to flip all values. If it comes out as something else, it's not a file of the type you expected. Microsoft's infamous byte order mark is nothing else but exactly that.

Are you talking about the Unicode BOM? I don't think that has anything to do with Microsoft (other than indirectly via their involvement in the working group). Though if you know better, I'd be interested to hear more about it.
One concrete example of a file format that does something like this is TIFF:
Every TIFF begins with a 2-byte indicator of byte order: "II" for little-endian (aka "intel byte ordering", circa 1980) and "MM" for big-endian (aka "motorola byte ordering", circa 1980) byte ordering. The third byte represents the number 42 which happens to be the ASCII character "*", also represented by hexadecimal 2A, selected because this is the binary pattern 101010 and "for its deep philosophical significance". The 4th byte is represented by a 0, an ASCII "NULL". All words, double words, etc., in the TIFF file are assumed to be in the indicated byte order. The TIFF 6.0 specification says that compliant TIFF readers must support both byte orders (II and MM); writers may use either. (source)
It is common to see confusion about how to handle endianness (though most got it right above). The trick is to care about the defined format of the data, and not to use any mechanism that depends on the current architecture. See the excellent blog post The byte order fallacy by Rob Pike.
Current project: Ephenation.
Sharing OpenGL experiences: http://ephenationopengl.blogspot.com/

