Sqlite strange characters

Started by
13 comments, last by dabo 16 years ago
Hi, I've built a sqlite database for my game but when I try to read strings containing characters like å, ä, ö, ü etc. they become messsed up. An example: Altmünster --> Altmünster Böheimkirchen --> Böheimkirchen After having a closer look at the database file (using notepad) I saw that the strange characters were in the database itself. So my question is, how can I get rid of the strange characters and get the correct once instead? The editor I'm using (SQLite Database Browser) displays the correct characters even though they are not present in the database file so I figure I can read them to my game the same way the editor does. I guess it has something to do with text encoding which I don't know much about...
Advertisement
Every string you manipulate has a certain encoding. Typical encodings are latin-1 and utf-8. When a string is meant to be utf-8, but is used as if it were latin-1, then errors similar to the one you mention appear.

In short, you have to do one of three things:

  1. Set up your database to store strings as utf-8 instead of latin-1.

  2. Re-encode your string from utf-8 to latin-1 (but you'll risk losing some characters which don't exist in latin-1) using the encoding manipulation functions of your language.

  3. Escape utf-8 characters so that they can be represented as plain latin-1, then unescape them when retrieving them from the database.
Thanks for the quick reply.

Quote:1. Set up your database to store strings as utf-8 instead of latin-1.
How do I do this? The editor I'm using won't let me (as far I know) choose anything like that when I create the database.

Quote:2. Re-encode your string from utf-8 to latin-1 (but you'll risk losing some characters which don't exist in latin-1) using the encoding manipulation functions of your language.
I'm using c++, what is the name of that function? and is it expensive doing it this way?

Quote:3. Escape utf-8 characters so that they can be represented as plain latin-1, then unescape them when retrieving them from the database.
I don't understand what you mean.


Sorry for being a pain in the a** ;)

Btw, in the editor I'm using you can choose to export the database as a sql-file, checking it out with notepad you see that all the correct characters are present, but in the sqlite-file it is not. Does sqlite and sql use different text encodings?
Firstly, make sure you are using the latest version of SQLite, which is version 3. Version 2 appears to require that you explicitly build the library in UTF-8 or iso8859 format, which makes life harder.

It appears that you are feeding in ISO-8859 strings, which are then being converted to UTF-8. UTF-8 is a Unicode encoding, which is basically a way of storing characters that can handle not just English, Swedish, and other Western European characters, but characters from all over the world. To do this however, it must make some sacrifices, and one of those is that some of your Swedish characters are now represented by 2 bytes rather than 1.

It may be possible to find a setting that forces your strings to be stored in ISO-8859 format - if so, that is the easiest way to get what you want. Otherwise, you have to look at encoding UTF-8 as ISO-8859, which involves being able to translate some characters and discard others, and which may require an external library like iconv.
I'm gonna google and see if I can find something about forcing strings to be stored in the ISO-8859 format, I'm using sqlite3 btw.

I'm also gonna check out iconv, the only thing I'm worried about is how expensive it is to have to convert everything, there is plenty of text.
Quote:the only thing I'm worried about is how expensive it is to have to convert everything, there is plenty of text


How many queries per second do you perform? How many rows in the table? If under 1000,1 million respectively, performance shouldn't be a concern.

Also, such conversions often do not impact the actual storage and algorithms, they merely serve as a flag on how to display the data. Under one encoding, ASCII character 0x4e is displayed as u, and under other as ü.
Well performance shouldn't be a problem then. Solving the ugly characters still is though. I found iconv but I don't know how to use it, I can't find any samples only the manual that doesn't help much.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

(This is four and a half years old now, and lots of people still don't know these things.)
Dabo, forget iconv until you have seen if it's possible to have your database store in ISO-8859 format. If so, that is the correct thing to do. Since sqlite says the following, I think the problem might be in the database editor you use. "SQLite is not particular about the text it receives and is more than happy to process text strings that are not normalized or even well-formed UTF-8 or UTF-16. Thus, programmers who want to store IS08859 data can do so using the UTF-8 interfaces. As long as no attempts are made to use a UTF-16 collating sequence or SQL function, the byte sequence of the text will not be modified in any way."
But I thought my database used the UTF-8 interfaces, that's why I get these strange characters. Or what do they mean by UTF-8 interfaces?

Could it be the c++ wrapper I use to read the data from the database?
http://www.alhem.net/project/sqlite/html/index.html

This topic is closed to new replies.

Advertisement