Using Unicode with a TCP Stream

13 comments, last by hplus0603 11 years, 10 months ago
Are you asking how to parse UTF-8 or are you asking how to send a stream of octets over TCP? The two subjects are not really related in any way.

Stephen M. Webb
Professional Free Software Developer

You don't need to encode the serialized object data to UTF-8 to send it. Just send it as raw bytes with a length prefix. The fact that send() takes a char* is not string-related; you can treat it as a void*.
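
For illustration only, here is a minimal sketch of that length-prefix approach on a connected BSD-style socket. The helper names (sendAll, sendBlob) and the 4-byte big-endian prefix are just assumptions for the example, not anything prescribed in this thread:

#include <cstddef>
#include <cstdint>
#include <sys/types.h>
#include <sys/socket.h>   // send(); use the Winsock equivalents on Windows

// Keep calling send() until every byte has gone out, because send()
// may write fewer bytes than requested.
static bool sendAll(int sock, const void* data, size_t len)
{
    const char* p = static_cast<const char*>(data);
    while (len > 0) {
        ssize_t n = send(sock, p, len, 0);
        if (n <= 0) return false;   // error or connection closed
        p += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Send a raw blob with a 4-byte big-endian length prefix.
// The payload is just bytes; no UTF-8 (or any text encoding) is involved.
static bool sendBlob(int sock, const void* data, uint32_t len)
{
    uint8_t prefix[4] = {
        uint8_t(len >> 24), uint8_t(len >> 16),
        uint8_t(len >> 8),  uint8_t(len)
    };
    return sendAll(sock, prefix, sizeof prefix) && sendAll(sock, data, len);
}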

Varint is not the same encoding method as UTF-8. UTF-8 and Varint both solve the problem of "how do I pack various integers in the range 0 .. N into a sequence of smaller integers in the range 0 .. 255?" For UTF-8, the "integers" are Unicode code points. UTF-8 has the benefit that you can tell how long the encoding of a code point is by looking at the first byte, but it tops out at encoding 31 bits of data in its original form (21 bits under the current standard). It is also less efficient than Varint. Varint has the benefit that you know exactly which bytes are stop bytes -- every stop byte has the 0x80 bit cleared. It is also more efficient than UTF-8, requiring the same number or a smaller number of bytes for any particular code point. Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.
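
As a rough sketch of the varint scheme described above (low 7 bits per byte, least-significant group first, 0x80 set on every byte except the stop byte; real formats differ in byte order and error handling, so treat this as an illustration rather than a standard):

#include <cstddef>
#include <cstdint>
#include <vector>

// Encode an unsigned value, least-significant 7 bits first.
// Every byte except the last has 0x80 set; the stop byte has it clear.
std::vector<uint8_t> encodeVarint(uint64_t value)
{
    std::vector<uint8_t> out;
    do {
        uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value != 0) byte |= 0x80;   // more bytes follow
        out.push_back(byte);
    } while (value != 0);
    return out;
}

// Decode a value and advance 'pos' past it.
// No bounds or overflow checking -- it is only a sketch.
uint64_t decodeVarint(const std::vector<uint8_t>& in, size_t& pos)
{
    uint64_t value = 0;
    int shift = 0;
    for (;;) {
        uint8_t byte = in[pos++];
        value |= uint64_t(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // stop byte reached
        shift += 7;
    }
    return value;
}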
enum Bool { True, False, FileNotFound };
Ok, silly me, I thought "Varint" was somehow related to UTF-8 xD (well, they're similar in that both are variable-length byte encodings, but still very different). I know what I'm going to do now. Thanks for the replies, everyone.
So you're saying that with UTF-8 encoding, when you use the English language (well, ASCII characters) the last bit is always in the binary state of 1, as this is a full character by itself? This really makes sense if it's true. And other non-ASCII characters may have several 7-bit data bytes with the last bit (the 8th bit) in the binary state of 0 (unless of course it's the last byte of the character, indicating that you have all the data needed to decode)?

You got it the other way around: ASCII characters always have their highest bit clear (the value is less than 0x80), while the bytes encoding non-ASCII characters always have their highest bit set (they are all 0x80 or above). But yeah, the idea is the same.
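
For reference, here is a small sketch of the standard UTF-8 byte classes in code (generic helpers written for this explanation, not something from the thread):

#include <cstdint>

// In UTF-8:
//   0xxxxxxx  (0x00-0x7F)  ASCII, a complete one-byte character
//   10xxxxxx  (0x80-0xBF)  continuation byte, never starts a character
//   110xxxxx / 1110xxxx / 11110xxx  lead byte of a 2/3/4-byte sequence
bool isAscii(uint8_t b)        { return b < 0x80; }
bool isContinuation(uint8_t b) { return (b & 0xC0) == 0x80; }

// Sequence length implied by the first byte of a character
// (0 if the byte cannot start a character).
int sequenceLength(uint8_t b)
{
    if (b < 0x80)          return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;   // continuation byte or invalid lead byte
}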

Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.

It's due to error handling: with UTF-8, if you try to read from the middle of a character you will get an invalid byte sequence (and thereby the program will know to skip it), whereas with varint it will just look like a different value instead (which can potentially cause trouble). Since UTF-8 was meant to be used for communication between devices, this kind of error handling is important.
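
As a sketch of that resynchronization property (a generic illustration, not code from the thread): a reader that lands in the middle of a UTF-8 character can skip continuation bytes until it reaches a byte that can start a character, whereas a misaligned varint read just decodes silently to a different, plausible-looking value.

#include <cstddef>
#include <cstdint>

// Given a buffer we may have entered mid-character, advance to the next
// byte that can start a UTF-8 sequence (ASCII or a lead byte).
size_t resyncUtf8(const uint8_t* buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0) == 0x80)  // skip continuation bytes
        ++pos;
    return pos;   // index of the next possible character start (or len)
}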

To the OP: as for how varint works, just look up MIDI variable-length values =P
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

It's due to error handling: with UTF-8, if you try to read from the middle of a character you will get an invalid byte sequence (and thereby the program will know to skip it), whereas with varint it will just look like a different value instead


Any protocol or implementation that requires you to deal with somehow ending up in the middle of an encoded character is inherently broken. Being able to tell that it's broken isn't going to help you figure out what you were supposed to be doing with that data.
enum Bool { True, False, FileNotFound };

This topic is closed to new replies.
