acron86

Unicode File IO woes...

This topic is 4437 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


This Unicode is really getting on my tits. I'm a real C++ zealot, so I don't know a lot of the old C I/O functions. Just check this code out, Chris:
tempString = new WCHAR[128];
currentElement.AttributeWChars("path", tempString);

FILE *m_file = _wfopen(tempString,L"r");

    if(m_file)
    {
        fgetws(tempString, 128, m_file); // Temporary - skips first line of file.

        while( !feof(m_file) )
        {
            tempString = new WCHAR[128];
            fgetws(tempString, 128, m_file);
            m_textStrings.push_back(tempString);
        }

        fclose(m_file);
    }
And the Unicode file looks like this:
3
0 This is a test! #
1 Welcome To Go-Go-Golf! #
2 レポート、ブログにて掲載中 (2005/5/30). ●, ダンジョン シ #
Not sure if you'll see it, but '2' is followed by Japanese characters. Using C++ file handling I can read these lines in nicely, but the Japanese characters come in as '?'. tinyXML doesn't support Unicode, so we have to do it ourselves, and the characters come in as garbage. I found these C-style I/O functions, which apparently open the file but return 3 characters of junk the first time we call fgetws(tempString, 128, m_file);. After that, every fgetws just returns a blank string ( "" ). Any ideas? :/

What character encoding is the "Unicode file" using?
MSDN says
Quote:
fgetwc is the wide-character version of fgetc; it reads c as a multibyte character or a wide character according to whether stream is opened in text mode or binary mode.

So if you open the file in text mode (as in that example), it tries to read multibyte characters (UTF-8? It seems to depend on the locale); if you open it in binary mode, it reads wchar_t (UTF-16) values directly. So it matters how your file represents the Unicode characters as bytes.
(_wfopen just accepts Unicode filenames - the opened file is treated exactly the same as with fopen.)
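To make the text-versus-binary point concrete, here is a minimal portable sketch that reads a UTF-16 file byte by byte instead of relying on fgetws. It assumes the file is little-endian UTF-16 with an optional byte order mark (which is what WordPad's "Unicode" format appears to be); readUtf16LeLines is a hypothetical helper name, not something from the original code. It probably also explains the "3 characters of junk": in text mode the two BOM bytes and the '3' are each read as a separate character, and the zero byte that follows ends the string.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: reads a file assumed to be UTF-16LE and returns its
// lines as UTF-16 code units, without the BOM or the line terminators.
std::vector<std::u16string> readUtf16LeLines(const char* path)
{
    std::vector<std::u16string> lines;
    std::FILE* file = std::fopen(path, "rb"); // binary mode: bytes are not translated
    if (!file)
        return lines;

    std::u16string current;
    unsigned char pair[2];
    bool firstUnit = true;
    while (std::fread(pair, 1, 2, file) == 2)
    {
        // Assemble one code unit from two bytes, low byte first.
        const char16_t unit = static_cast<char16_t>(pair[0] | (pair[1] << 8));
        if (firstUnit)
        {
            firstUnit = false;
            if (unit == 0xFEFF) // skip the byte order mark if present
                continue;
        }
        if (unit == u'\r')
            continue; // drop the CR of a CR LF pair
        if (unit == u'\n')
        {
            lines.push_back(current);
            current.clear();
            continue;
        }
        current.push_back(unit);
    }
    if (!current.empty())
        lines.push_back(current); // last line without a trailing newline
    std::fclose(file);
    return lines;
}
```

On Windows, where wchar_t is 16 bits wide, opening with L"rb" and calling fgetws gets a similar result with less code; the sketch above just avoids depending on that.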

Ahhhh

Well, WordPad's "Unicode Text Document". I assume that's UTF-16.

How can I open it in binary mode?

EDIT: fgetws (after formatting) is returning:

0x00ee84f8 "2 ì0Ý0ü0È00Ö0í0°0k0f0²c-N (2005/5/30). Ï%, À0ó0¸0ç0ó0 ·0 " wchar_t *

...for the above Japanese string. Argh.

EDIT 2:

Fixed :) Opening the file in binary ( "rb" ) fixed it!

WordPad does seem to save as UTF-16, so you have to open it in binary mode with
FILE *m_file = _wfopen(tempString, L"rb");

When I do that, I get the correct 0x0012fde4 "2 レポート、ブログにて掲載中 (2005/5/30). ●, ダンジョン シ #" in the debugger.


(Incidentally, the calls to "new WCHAR[128]" will result in lots of memory leaks, since you're never delete[]ing the arrays. Something like "WCHAR tempString[128];" should work the same but without the leaking.)
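A rough sketch of that leak-free loop, assuming the container can hold std::wstring rather than raw pointers (the original m_textStrings may well be declared differently, and readWideLines is a made-up name). It also tests the return value of fgetws instead of feof, so a failed read at the end of the file doesn't push a stale line.

```cpp
#include <cstdio>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: reads every remaining line from an already-open stream.
std::vector<std::wstring> readWideLines(std::FILE* file)
{
    std::vector<std::wstring> lines;
    wchar_t buffer[128]; // stack buffer, so there is nothing to delete[]

    // fgetws returns a null pointer at end of file or on error.
    while (std::fgetws(buffer, 128, file) != nullptr)
        lines.push_back(buffer); // std::wstring copies the characters out of the buffer

    return lines;
}
```

Each stored line keeps its trailing newline, as fgetws leaves it in place, and a line longer than 127 characters ends up split across several entries.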

Edit: Ah, just noticed your edit [smile]. By the way, the "Show Preview" in this window doesn't really like Unicode...
