Dirge

Unicode (U-8) file reading madness...


So while doing some file system work on a project recently, I realized certain files would load with some weird garbage characters before the first character and after the last one. I wasn't quite sure what this was, but after a LOT of debugging I finally noticed in a text editor that the character set was being specified as UTF-8 (Unicode). This was completely unintentional, but likely the result of the file's previous history as an XML file (I renamed the extension to mark it as a new file format). Now, my code makes no attempt to read binary data in wchar (wide string) format, but is there some way I can detect in the future whether a file is likely to have been stored in Unicode, so I can correctly convert it to standard (multi-byte) ASCII for my parsing code (read flags or Unicode magic number checks)? I certainly understand the situation where an XML file should be stored in Unicode and I'd like to code for that scenario, but in all likelihood it's not an issue that will affect my project (except in this scenario). Thanks very much ahead of time for any insights.

UTF-8: a text file containing only ASCII characters looks exactly the same as it would with pure ASCII encoding. If the file contains any byte > 127, then it's definitely using some non-ASCII characters.

UTF-16: there's likely to be a Byte Order Mark at the beginning. If there isn't, then any ASCII character, which UTF-16 encodes as two bytes, will be preceded by '\0' (or followed by '\0', depending on endianness).

Bottom line: search for BOM at the beginning, then if it's not there, search for any '\0' or >127 bytes in the text.
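
As a rough sketch of that check order (BOM first, then a scan for '\0' or >127 bytes), something like the following could work. The names `Guess` and `guess_encoding` are made up for illustration, and the result is only a guess: a UTF-32 little-endian BOM also starts with FF FE, and legacy 8-bit code pages also produce bytes >127.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for illustration; not from any library.
enum class Guess { Ascii, Utf8Bom, Utf16LE, Utf16BE, Utf16NoBom, NonAscii };

// Heuristic only: looks for a BOM, then for '\0' or >127 bytes.
// `bytes` is the raw file content, read in binary mode.
Guess guess_encoding(const std::string& bytes) {
    auto b = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    const std::size_t n = bytes.size();

    // 1. Byte Order Mark (U+FEFF) in its encoded forms.
    if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return Guess::Utf8Bom;
    // Caveat: a UTF-32 LE BOM (FF FE 00 00) would also match the next line.
    if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE) return Guess::Utf16LE;
    if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF) return Guess::Utf16BE;

    // 2. No BOM: a '\0' byte suggests UTF-16, a byte >127 suggests
    //    UTF-8 or some 8-bit code page (we cannot tell which from this alone).
    bool high_byte_seen = false;
    for (std::size_t i = 0; i < n; ++i) {
        if (b(i) == 0) return Guess::Utf16NoBom;
        if (b(i) > 127) high_byte_seen = true;
    }
    return high_byte_seen ? Guess::NonAscii : Guess::Ascii;
}
```

Reading the file in binary mode matters here, since text mode may translate bytes before the check sees them.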

Also, Raymond Chen had an interesting insight into this issue.



Quote:
Wikipedia on "Byte Order Mark"
A Byte Order Mark (BOM) is the character at code point U+FEFF ("zero-width no-break space"), when that character is used to denote the endianness of a string of UCS/Unicode characters encoded in UTF-16 or UTF-32 and/or as a marker to indicate that text is encoded in UTF-8, UTF-16 or UTF-32.
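
In UTF-8, U+FEFF is encoded as the three bytes EF BB BF, which would explain garbage characters at the very start of a file read as plain ASCII. A minimal sketch for dropping them before parsing, assuming the whole file has already been read into a `std::string` (the helper name `strip_utf8_bom` is made up for this example):

```cpp
#include <string>

// Hypothetical helper: returns the content without a leading UTF-8 BOM.
// Content that does not start with EF BB BF is returned unchanged.
std::string strip_utf8_bom(std::string bytes) {
    // compare() clamps the length, so strings shorter than 3 bytes are safe.
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        bytes.erase(0, 3);
    }
    return bytes;
}
```

If the rest of the file is pure ASCII, the stripped result can go straight into an ASCII-only parser.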

Guest Anonymous Poster
The lesson learned here is that all text should always be considered useless, wrong, or dangerous if you do not know which encoding it arrives in.

Sure, there are statistical methods to estimate the encoding of a string, but they are little more than an educated guess, especially if the sample is small.

Bottom line is: Always be aware of which encoding text from unknown sources is in. If you have no way of knowing beforehand, at least let the user specify it somehow. Then convert it to an encoding suitable for your output, if necessary.
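
One way to sketch "let the caller specify, then convert" is below. It assumes UTF-8 is the working encoding; the `Encoding` enum and the `to_utf8` and `append_utf8` names are invented for this example. UTF-8 input is passed through without validation, and a leading BOM is not removed.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical: the encodings the caller is allowed to declare.
enum class Encoding { Utf8, Utf16LE, Utf16BE };

// Append one Unicode code point to `out` as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert raw bytes in the caller-declared encoding to UTF-8.
// Throws std::runtime_error on malformed UTF-16.
std::string to_utf8(const std::string& bytes, Encoding enc) {
    if (enc == Encoding::Utf8) return bytes;  // passed through, not validated
    if (bytes.size() % 2 != 0) throw std::runtime_error("odd UTF-16 byte count");

    const bool le = (enc == Encoding::Utf16LE);
    // Read the 16-bit code unit starting at byte offset i.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        const std::uint32_t a = static_cast<unsigned char>(bytes[i]);
        const std::uint32_t b = static_cast<unsigned char>(bytes[i + 1]);
        return le ? (a | (b << 8)) : ((a << 8) | b);
    };

    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a partner
            if (i + 3 >= bytes.size()) throw std::runtime_error("truncated surrogate pair");
            const std::uint32_t low = unit(i + 2);
            if (low < 0xDC00 || low > 0xDFFF) throw std::runtime_error("bad low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 2;  // consume the low surrogate as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The point of the explicit `Encoding` parameter is that the function never guesses: a wrong declaration either throws or produces visibly wrong text, rather than failing silently somewhere downstream.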

Everything else will lead to unexpected failures at some point.
