Using Unicode with a TCP Stream

13 comments, last by hplus0603 11 years, 10 months ago
Are you asking how to parse UTF-8 or are you asking how to send a stream of octets over TCP? The two subjects are not really related in any way.

Stephen M. Webb
Professional Free Software Developer

You don't need to encode the serialized object data to UTF-8 to send it. Just send it as raw bytes with a length prefix. The fact that send() takes a char* is not string-related; you can treat it as a void*.
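
For illustration only, here is a minimal sketch of that length-prefix approach on a connected BSD-style socket. The helper names (sendAll, sendBlob) and the 4-byte big-endian prefix are just assumptions for the example, not anything prescribed in this thread:

#include <cstddef>
#include <cstdint>
#include <sys/types.h>
#include <sys/socket.h>   // send(); use the Winsock equivalents on Windows

// Keep calling send() until every byte has gone out, because send()
// may write fewer bytes than requested.
static bool sendAll(int sock, const void* data, size_t len)
{
    const char* p = static_cast<const char*>(data);
    while (len > 0) {
        ssize_t n = send(sock, p, len, 0);
        if (n <= 0) return false;   // error or connection closed
        p += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Send a raw blob with a 4-byte big-endian length prefix.
// The payload is just bytes; no UTF-8 (or any text encoding) is involved.
static bool sendBlob(int sock, const void* data, uint32_t len)
{
    uint8_t prefix[4] = {
        uint8_t(len >> 24), uint8_t(len >> 16),
        uint8_t(len >> 8),  uint8_t(len)
    };
    return sendAll(sock, prefix, sizeof prefix) && sendAll(sock, data, len);
}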

Varint is not the same encoding method as UTF-8. UTF-8 and Varint both solve the problem of "how do I pack various integers in the range 0 .. N into a sequence of smaller integers in the range 0 .. 255?" For UTF-8, the "integers" are Unicode code points. UTF-8 has the benefit that you can tell how long the encoding of a code point is by looking at the first byte, but it tops out at encoding 31 bits of data in its original form (21 bits under the current standard). It is also less efficient than Varint. Varint has the benefit that you know exactly which bytes are stop bytes -- every stop byte has the 0x80 bit cleared. It is also more efficient than UTF-8, requiring the same number or a smaller number of bytes for any particular code point. Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.
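
As a rough sketch of the varint scheme described above (low 7 bits per byte, least-significant group first, 0x80 set on every byte except the stop byte; real formats differ in byte order and error handling, so treat this as an illustration rather than a standard):

#include <cstddef>
#include <cstdint>
#include <vector>

// Encode an unsigned value, least-significant 7 bits first.
// Every byte except the last has 0x80 set; the stop byte has it clear.
std::vector<uint8_t> encodeVarint(uint64_t value)
{
    std::vector<uint8_t> out;
    do {
        uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value != 0) byte |= 0x80;   // more bytes follow
        out.push_back(byte);
    } while (value != 0);
    return out;
}

// Decode a value and advance 'pos' past it.
// No bounds or overflow checking -- it is only a sketch.
uint64_t decodeVarint(const std::vector<uint8_t>& in, size_t& pos)
{
    uint64_t value = 0;
    int shift = 0;
    for (;;) {
        uint8_t byte = in[pos++];
        value |= uint64_t(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // stop byte reached
        shift += 7;
    }
    return value;
}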
enum Bool { True, False, FileNotFound };
Ok, silly me, I thought "Varint" was somehow related to UTF-8 xD (well, they're similar in that both are variable-length byte encodings, but still very different). I know what I'm going to do now. Thanks for the replies, everyone.
So you're saying that with UTF-8 encoding, when you use the English language (well, ASCII characters) the last bit is always in the binary state of 1, as this is a full character by itself? This really makes sense if it's true. And other non-ASCII characters may have several 7-bit data bytes with the last bit (the 8th bit) in the binary state of 0 (unless of course it's the last byte of the character, indicating that you have all the data needed to decode)?

You got it the other way around: ASCII characters always have their highest bit clear (the value is less than 0x80), while the bytes encoding non-ASCII characters always have their highest bit set (they are all 0x80 or above). But yeah, the idea is the same.
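
For reference, here is a small sketch of the standard UTF-8 byte classes in code (generic helpers written for this explanation, not something from the thread):

#include <cstdint>

// In UTF-8:
//   0xxxxxxx  (0x00-0x7F)  ASCII, a complete one-byte character
//   10xxxxxx  (0x80-0xBF)  continuation byte, never starts a character
//   110xxxxx / 1110xxxx / 11110xxx  lead byte of a 2/3/4-byte sequence
bool isAscii(uint8_t b)        { return b < 0x80; }
bool isContinuation(uint8_t b) { return (b & 0xC0) == 0x80; }

// Sequence length implied by the first byte of a character
// (0 if the byte cannot start a character).
int sequenceLength(uint8_t b)
{
    if (b < 0x80)          return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;   // continuation byte or invalid lead byte
}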

Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.

It's due to error handling: with UTF-8, if you try to read from the middle of a character you will get an invalid byte sequence (and thereby the program will know to skip it), whereas with varint it will just look like a different value instead (which can potentially cause trouble). Since UTF-8 was meant to be used for communication between devices, this kind of error handling is important.
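
As a sketch of that resynchronization property (a generic illustration, not code from the thread): a reader that lands in the middle of a UTF-8 character can skip continuation bytes until it reaches a byte that can start a character, whereas a misaligned varint read just decodes silently to a different, plausible-looking value.

#include <cstddef>
#include <cstdint>

// Given a buffer we may have entered mid-character, advance to the next
// byte that can start a UTF-8 sequence (ASCII or a lead byte).
size_t resyncUtf8(const uint8_t* buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0) == 0x80)  // skip continuation bytes
        ++pos;
    return pos;   // index of the next possible character start (or len)
}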

To the OP: as for how varint works, just look up MIDI variable-length values =P
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

It's due to error handling: with UTF-8, if you try to read from the middle of a character you will get an invalid byte sequence (and thereby the program will know to skip it), whereas with varint it will just look like a different value instead


Any protocol or implementation that requires you to deal with somehow ending up in the middle of an encoded character is inherently broken. Being able to tell that it's broken isn't going to help you figure out what you were supposed to be doing with that data.
enum Bool { True, False, FileNotFound };

This topic is closed to new replies.
