[web] MIME Format

6 comments, last by ZergUser 15 years, 5 months ago
Is every byte within a MIME message a US-ASCII one? I read that MIME messages support text in character sets other than US-ASCII. However, it does this not by actually including the raw bytes for a particular character set. Instead, those raw bytes are encoded into US-ASCII ones, which are then included in the message along with headers describing the used encoding method. Is this correct? Thanks.
I don't know myself, but you can probably find out in an RFC (http://tools.ietf.org/rfc/index)
xdXD
Yeah, I scanned RFC 822 and RFC 2045. I just wanted to see if anyone else might know whether or not my conclusion is correct.
IMHO your conclusion is incorrect. And I think that maybe you're confusing two things.

There are four things at work here:
1) The character set of the text itself. E.g. unicode, gb2312 or latin-1

2) How it is encoded. US-ASCII, UTF-8, shift_jis, etcetera.

The "charset" setting tells you both things. EUC-CN means that it's a gb2312 text encoded into 8bit using EUC-CN. UTF-8 is unicode text encoded into 8bit.

3) The domain (7bit, 8bit or binary). I.e. what characters are actually present in the raw message.

4) The transfer encoding. That is, how to map (2) to (3). For example, quoted-printable or base64 (or nothing, if (2) already fits into (3)).

The "Content-Transfer-Encoding" setting tells you both these things. It determines how the above is put in the MIME block. "7bit", "quoted-printable" and "base64" all look like US-ASCII. The emphasis is on "look like": it's not the same thing. It's merely convenient that US-ASCII takes up only 7 bits, which means that you can display a 7bit stream using US-ASCII characters. The result does not have to make sense, though, since it's not the same.

The "8bit" and "binary" settings do not fall into 7 bit. 8bit can be displayed using extended ASCII (not the same as US-ASCII). "binary" is raw binary data and can be anything you want (like raw unicode, or even a movie).
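
To make the three domains concrete, here is a rough Python sketch that guesses which domain a block of bytes falls into. The function name classify_domain is made up for illustration, and the rules are a simplification of RFC 2045 (it only checks for NUL bytes, high bytes and line length, and ignores the rule about bare CR or LF).

```python
def classify_domain(data: bytes) -> str:
    """Guess the MIME domain of raw bytes (simplified, hypothetical helper)."""
    lines = data.split(b"\r\n")
    # 7bit and 8bit data are line-oriented: at most 998 octets per line.
    short_lines = all(len(line) <= 998 for line in lines)
    if b"\x00" in data or not short_lines:
        return "binary"
    # No byte has its high bit set, so the data fits into 7 bits.
    if all(byte < 0x80 for byte in data):
        return "7bit"
    return "8bit"


print(classify_domain(b"plain text\r\n"))
print(classify_domain("caf\u00e9\r\n".encode("utf-8")))
print(classify_domain(b"\x00\x01\x02"))
```

Note that the function never looks at what the bytes mean. The domain only describes which byte values occur, not which characters they stand for.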

So, text can be encoded twice in the message. Say I have a piece of Chinese text in gb2312, which is 16-bit. I can use the "charset" EUC-CN to map that to 8bit, then use the "transfer-encoding" base64 to map those 8bit bytes into 7bit, suitable for sending in an e-mail.
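
The two steps can be sketched in Python as follows. This is only an illustration: the sample text is my own choice, and it assumes that Python's "gb2312" codec (which lists "euc-cn" as an alias) produces the EUC-CN bytes meant here.

```python
import base64

text = "\u4e2d\u6587"  # the two characters for "Chinese language", sample input only

# Step 1 (charset): map the characters to 8-bit bytes.
# Assumption: Python's "gb2312" codec emits the EUC-CN byte encoding.
eight_bit = text.encode("gb2312")

# Step 2 (transfer encoding): map the 8-bit bytes into the 7-bit domain.
seven_bit = base64.b64encode(eight_bit)

print(eight_bit)   # bytes with the high bit set
print(seven_bit)   # only bytes that look like US-ASCII

# Decoding reverses the steps in the opposite order.
restored = base64.b64decode(seven_bit).decode("gb2312")
print(restored == text)
```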

But that same piece of text can also be represented in unicode, which I can map to 8-bit using UTF-8. When I set the "transfer encoding" to 8-bit, nothing changes. My message is 8bit wide. Not suitable for e-mail (which has a 7bit limit*) but suitable for other applications that use Mime.

(*) Formally. These days, virtually all e-mail programs and MTAs can deal with 8bit messages.

[Edited by - Sander on November 7, 2008 1:37:29 AM]

Sander Marechal [Lone Wolves][Hearts for GNOME][E-mail][Forum FAQ]

Quote:3) The domain (7bit, 8bit or binary). I.e. what characters are actually present in the raw message.

Is a raw MIME message US-ASCII encoded?

I'm trying to understand how to decode a simple raw MIME message. If the above is true, you would have to parse this as a 7-bit byte stream. So, you feed the body into a decoder based on the content transfer encoding. You feed this output into another decoder based on the coded character set and character-encoding scheme, which are both identified by the "charset" parameter. Then, you have an output sequence of characters.

This is correct for the most part?


Almost. It's not always a 7-bit byte stream. It may also be an 8-bit or binary stream.

If the content transfer encoding is quoted-printable or base64 then you need to decode that first. Now you have a stream that is encoded according to the charset setting. Sometimes this is a charset that you can display directly (like utf-8). Sometimes you may need to decode it further (e.g. decode EUC-CN to get gb2312).

Sometimes you may need to convert the text to a different character set after that before you can print it. E.g. convert gb2312 to unicode, then encode it as utf-8, because you don't have functions to output gb2312 but only a function to output utf-8.

Pseudocode:
if content-transfer-encoding == base64:
    text = base64decode(text)
elseif content-transfer-encoding == quoted-printable:
    text = decode_quoted_printable(text)

if is_printable(charset)    // e.g. utf-8, latin-1 or us-ascii
    print(text)
else
    text = iconv(charset, unicode, text)
    text = utf8_encode(text)
    print(text)


So, don't worry about whether the content-transfer-encoding is 7bit, 8bit or binary. Just treat them all as a binary stream. Only a content-transfer-encoding of base64 or quoted-printable is interesting because you need to decode these first.
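
For anyone who wants something runnable, the same idea might look roughly like this in Python, using only the standard library. The function name decode_body and its parameters are invented for this sketch. It assumes you have already pulled the transfer encoding and charset out of the headers, and it goes straight to a Python string instead of doing the iconv step.

```python
import base64
import quopri


def decode_body(raw: bytes, transfer_encoding: str, charset: str) -> str:
    """Undo the transfer encoding, then the charset (hypothetical helper)."""
    cte = transfer_encoding.strip().lower()
    if cte == "base64":
        data = base64.b64decode(raw)
    elif cte == "quoted-printable":
        data = quopri.decodestring(raw)
    else:
        # 7bit, 8bit and binary are identity encodings: nothing to undo.
        data = raw
    # The charset names both the character set and its byte encoding.
    return data.decode(charset)


print(decode_body(b"Caf=C3=A9", "quoted-printable", "utf-8"))
print(decode_body(b"hello", "7bit", "us-ascii"))
```

An unknown charset name raises LookupError and malformed bytes raise UnicodeDecodeError, so real code would want to handle both.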


Quote:Original post by ZergUser
Is a raw MIME message US-ASCII encoded?


As Sander says, the message may not be 7-bit US-ASCII. However, the header names are. So you can decode the headers in (mostly) US-ASCII, but the actual body might be something else -- depending on the values in the Content-Type and Content-Transfer-Encoding headers.
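
As a rough illustration of letting those two headers drive the decoding, Python's standard email package can do the work. The raw message below is a made-up minimal example, not taken from a real mail.

```python
import email

# Hypothetical minimal message: ASCII headers, base64 body holding UTF-8 text.
raw = (
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"Q2Fmw6k=\r\n"
)

msg = email.message_from_bytes(raw)

cte = msg["Content-Transfer-Encoding"]   # tells us how to get the bytes back
charset = msg.get_content_charset()      # tells us how to read those bytes

# decode=True undoes base64 or quoted-printable and returns bytes.
payload = msg.get_payload(decode=True)
body = payload.decode(charset)

print(cte, charset, body)
```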

Also, some headers might have their values encoded in a non-US-ASCII character set but those two headers at least will be OK.
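
Such header values normally travel as RFC 2047 "encoded words", which are themselves written in US-ASCII. A small sketch of decoding one with Python's standard library follows. The helper name decode_header_value and the sample value are made up for illustration.

```python
from email.header import decode_header, make_header


def decode_header_value(raw_value: str) -> str:
    """Turn a header value with RFC 2047 encoded words into readable text."""
    # decode_header splits the value into (bytes or str, charset) parts;
    # make_header joins the parts back into a single Header object.
    return str(make_header(decode_header(raw_value)))


print(decode_header_value("=?iso-8859-1?q?Caf=E9?="))
print(decode_header_value("Plain ASCII subject"))
```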

MIME is a pretty messy standard. It's been extended and updated so many times it's not funny....
I think I have a better understanding now. MIME messages are 8-bit byte streams, that are typically encoded in a 7-bit, 8-bit, or binary format, where the chosen one depends on the input requirements of the underlying transport.

Thanks for all the comments.

