• Advertisement
Sign in to follow this  

Sqlite strange characters

This topic is 3613 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hi, I've built a sqlite database for my game but when I try to read strings containing characters like å, ä, ö, ü etc. they become messsed up. An example: Altmünster --> Altmünster Böheimkirchen --> Böheimkirchen After having a closer look at the database file (using notepad) I saw that the strange characters were in the database itself. So my question is, how can I get rid of the strange characters and get the correct once instead? The editor I'm using (SQLite Database Browser) displays the correct characters even though they are not present in the database file so I figure I can read them to my game the same way the editor does. I guess it has something to do with text encoding which I don't know much about...

Share this post


Link to post
Share on other sites
Advertisement
Every string you manipulate has a certain encoding. Typical encodings are latin-1 and utf-8. When a string is meant to be utf-8, but is used as if it were latin-1, then errors similar to the one you mention appear.

In short, you have to do one of three things:

  1. Set up your database to store strings as utf-8 instead of latin-1.

  2. Re-encode your string from utf-8 to latin-1 (but you'll risk losing some characters which don't exist in latin-1) using the encoding manipulation functions of your language.

  3. Escape utf-8 characters so that they can be represented as plain latin-1, then unescape them when retrieving them from the database.

Share this post


Link to post
Share on other sites
Thanks for the quick reply.

Quote:
1. Set up your database to store strings as utf-8 instead of latin-1.
How do I do this? The editor I'm using won't let me (as far I know) choose anything like that when I create the database.

Quote:
2. Re-encode your string from utf-8 to latin-1 (but you'll risk losing some characters which don't exist in latin-1) using the encoding manipulation functions of your language.
I'm using c++, what is the name of that function? and is it expensive doing it this way?

Quote:
3. Escape utf-8 characters so that they can be represented as plain latin-1, then unescape them when retrieving them from the database.
I don't understand what you mean.


Sorry for being a pain in the a** ;)

Btw, in the editor I'm using you can choose to export the database as a sql-file, checking it out with notepad you see that all the correct characters are present, but in the sqlite-file it is not. Does sqlite and sql use different text encodings?

Share this post


Link to post
Share on other sites
Firstly, make sure you are using the latest version of SQLite, which is version 3. Version 2 appears to require that you explicitly build the library in UTF-8 or iso8859 format, which makes life harder.

It appears that you are feeding in ISO-8859 strings, which are then being converted to UTF-8. UTF-8 is a Unicode encoding, which is basically a way of storing characters that can handle not just English, Swedish, and other Western European characters, but characters from all over the world. To do this however, it must make some sacrifices, and one of those is that some of your Swedish characters are now represented by 2 bytes rather than 1.

It may be possible to find a setting that forces your strings to be stored in ISO-8859 format - if so, that is the easiest way to get what you want. Otherwise, you have to look at encoding UTF-8 as ISO-8859, which involves being able to translate some characters and discard others, and which may require an external library like iconv.

Share this post


Link to post
Share on other sites
I'm gonna google and see if I can find something about forcing strings to be stored in the ISO-8859 format, I'm using sqlite3 btw.

I'm also gonna check out iconv, the only thing I'm worried about is how expensive it is to have to convert everything, there is plenty of text.

Share this post


Link to post
Share on other sites
Quote:
the only thing I'm worried about is how expensive it is to have to convert everything, there is plenty of text


How many queries per second do you perform? How many rows in the table? If under 1000,1 million respectively, performance shouldn't be a concern.

Also, such conversions often do not impact the actual storage and algorithms, they merely serve as a flag on how to display the data. Under one encoding, ASCII character 0x4e is displayed as u, and under other as ü.

Share this post


Link to post
Share on other sites
Well performance shouldn't be a problem then. Solving the ugly characters still is though. I found iconv but I don't know how to use it, I can't find any samples only the manual that doesn't help much.

Share this post


Link to post
Share on other sites
Dabo, forget iconv until you have seen if it's possible to have your database store in ISO-8859 format. If so, that is the correct thing to do. Since sqlite says the following, I think the problem might be in the database editor you use. "SQLite is not particular about the text it receives and is more than happy to process text strings that are not normalized or even well-formed UTF-8 or UTF-16. Thus, programmers who want to store IS08859 data can do so using the UTF-8 interfaces. As long as no attempts are made to use a UTF-16 collating sequence or SQL function, the byte sequence of the text will not be modified in any way."

Share this post


Link to post
Share on other sites
UTF-8 is just a way of interpreting bytes. SQLite says that if you ask to perform text-based operations on the data, it will interpret the bytes as UTF-8. The interfaces it talks about are just the usual ways of passing strings in and pulling them out. If you just push in data and pull it out, it will/should be unaltered, and the fact that it was never valid UTF-8 while it was being stored is not an issue.

I doubt the C++ wrapper would change this, but I couldn't guarantee it. Write a trivial example without the wrapper to test it. It should be only about 10 or 20 lines of code to do an INSERT and a SELECT.

If SQLite doesn't appear to act like this, and you're sure you're not performing text-based functions on the data while it's in there, then you probably need to speak to the SQLite people about why the behaviour doesn't match the documentation.

Share this post


Link to post
Share on other sites
After messing around with the sample from the official website, I managed to read som data from a database without the wrapper, but I still did get those ugly characters.

What do you mean by "and you're sure you're not performing text-based functions on the data while it's in there"? The only thing I do with the strings I read from the database is printing them using cout.

Share this post


Link to post
Share on other sites
I mean SQLite functions. It doesn't matter what you do to the data after you pull it back out again. I suggest you contact the SQLite people and find out why the documentation I quoted isn't holding true for you.

Share this post


Link to post
Share on other sites
Quote:
Original post by Zahlman
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

(This is four and a half years old now, and lots of people still don't know these things.)

Thanks for the article link. I generally knew most of that, but the historical perspective was very helpful. Original poster: Read this article.

Share this post


Link to post
Share on other sites
Quote:
Original post by Kylotan
I mean SQLite functions. It doesn't matter what you do to the data after you pull it back out again. I suggest you contact the SQLite people and find out why the documentation I quoted isn't holding true for you.


I see if I can get a hold on them.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement