[web] Setting server charset/encoding

Started by
6 comments, last by markr 19 years, 9 months ago
Running my HTML through the W3C's validator turns up "No character encoding found". As far as I can tell that's supposed to be transmitted alongside the actual document (much like a MIME type?) rather than embedded inside the document itself. How can I actually set this up? My hosting doesn't provide direct access to the server config; instead it uses the cPanel web interface. What kind of settings/options should I be looking out for? Or do I really need to worry about this? One other thing I suppose I could do would be to send this in a PHP header at the top of the individual files. Is this worth doing if I can't manage to configure it otherwise?
It actually is embedded in the document itself. http://www.w3.org/International/O-charset.html
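A minimal sketch of what that W3C page describes: declare the charset in a meta element inside the document's head. The ISO-8859-1 value here is just an example; use whatever encoding your files are actually saved in.

```html
<head>
  <!-- Declares the encoding inside the document itself -->
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>Example page</title>
</head>
```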
el
Thanks, setting the encoding in the head of the document worked. :) But how can the browser read the encoding without knowing the encoding beforehand? Isn't this the equivalent of opening the box with the crowbar inside?
I don't get it either. ISO-8859-1? UTF-8? It doesn't seem to affect the encoding. It's all ANSI text, isn't it? And like you said, the box is opened without the crowbar...
Trust me, if you start using MSXML to create your pages, your encoding nightmares have only just begun.
Quote:Original post by Boder
I don't get it either. ISO-8859-1? UTF-8? It doesn't seem to affect the encoding. It's all ANSI text, isn't it? And like you said, the box is opened without the crowbar...


You need to put the charset in so that characters with accents and the like show correctly even in browsers set up for other languages. Look at the bugs page here for examples (no charset in the new forums yet).

Jay
You just need to configure the web server to send an encoding with HTML documents by default; this is pretty easy, and I'm fairly sure it can be done with a .htaccess file.
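A sketch of the .htaccess approach, assuming an Apache server (the usual case on cPanel hosts) with overrides enabled: the `AddDefaultCharset` directive makes Apache append a charset to the Content-Type header of text responses by default.

```apache
# Send "charset=ISO-8859-1" with text/html responses that
# don't already specify a charset (the value is an example)
AddDefaultCharset ISO-8859-1
```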

It's certainly possible to do it in a .php script, and it can in theory (but should not be in practice) be set differently on a per-page basis.
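The per-script version is essentially one line, provided it runs before any output is sent. The charset value here is an example, matching whatever encoding the file is saved in:

```php
<?php
// Must be called before any HTML or whitespace is emitted,
// otherwise the headers have already gone out
header('Content-Type: text/html; charset=ISO-8859-1');
?>
```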

In order to maintain your sanity, you should probably keep all HTML pages on your site in the same encoding.

If your page is in English or another West-European language (like French, German, Italian, Spanish and some others), then you can just set it to ISO-8859-1 and stop worrying about it.

If using PHP, steer away from UTF-8 (and UTF-anything-else) if at all possible, because PHP's support for it is terrible. In fact, if using PHP, steer away from any multibyte encoding; it is just awful at handling them.

I would personally not consider developing a web site which required multi-byte character sets in PHP. On the other hand, I don't speak any languages which require multi-byte character sets, so that's fairly unlikely to come up.

Mark
Quote:Original post by OrangyTang
But how can the browser read the encoding without knowing the encoding beforehand? Isn't this the equivalent of opening the box with the crowbar inside?


That is a very strange question. I think the answer must be: the browser starts reading in ASCII, and if the bytes look like ASCII (or at least an ASCII-compatible character set), it continues until it finds a charset declaration, then discards what it has read so far and starts again in the declared encoding.

If, of course, it doesn't appear to be ASCII at all, the browser has to assume one of the non-ASCII-compatible encodings, probably UTF-16, and repeat the exercise.

The same is true of XML, except that the default encoding for XML is UTF-8, and XML documents MUST state their encoding before having any data if they are not in UTF-8. This means the parser never has to read any non-ASCII characters before it reaches the encoding. This must be enormously helpful to parser developers, as they should be able to easily tell what encoding a document's in almost straight away.
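For illustration, an XML encoding declaration looks like this; it must appear at the very start of the document, before any other data:

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Without the encoding attribute, a parser must assume UTF-8
     (or UTF-16, detected from a byte-order mark) -->
<greeting>déjà vu</greeting>
```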

Of course a lot of HTML documents on the web are served in a different encoding than they are actually in, and some others have conflicting encodings in the Content-type: header and in the document.

For this reason, I think most modern browsers attempt to second-guess the encoding of the document. Just how they do this is really quite weird, not to mention that it must be error-tolerant. I guess it may be by statistical analysis of the characters to try to determine the most likely encoding.

I recently discovered that some pages on my web site claimed to be in ISO-8859-1 (a previous developer accidentally left the headers there) but were actually in UTF-8. This wasn't visible in browsers, because they were "clever enough" to suss it out. Google's cache, however, had assumed they were in an 8-bit encoding, and was spitting UTF escapes out.

Here's a good question for you:

It may be technically feasible to write a document which looks like one thing in one encoding (complete with meta http-equiv charset) and another thing in another encoding (at least if one of them is UTF-16). Such a page would look right in either of the two encodings, so what you see depends on which one the browser tries first :)

Mark

This topic is closed to new replies.
