Using Unicode with a TCP Stream

I'm going to begin coding another C# game server for learning purposes soon (I've made servers before, and each time I try to re-implement them better and better). It will use TCP (it's not a first-person shooter and I don't want to waste time adding reliability on top of UDP for no reason) and will also use the delimiter messaging system. This means all data sent through the network will be in Unicode format (using UTF-8 encoding).

There is a problem though: this time I am switching from 1-byte ASCII to Unicode. A Unicode character in UTF-8, however, can be more than 1 byte long (up to 4 bytes!), and since TCP is really a stream, how will I know if I have received every Unicode character in full? I might receive half a character, and what happens if I try to decode bytes that make up only half of a character? Would I have to stick with ASCII, use a different messaging system, or use some sort of special strategy?

All replies are appreciated. Thanks, Xanather.
The UTF-8 format makes it possible to detect where a character starts, and once you have the first byte of a character, you know how long the whole character should be.
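For example, here is a quick C# sketch of working out a character's length from its first byte (a rough illustration only; the class and method names are just placeholders):

using System;

// Rough sketch: the first byte of a UTF-8 sequence tells you how many
// bytes the whole character takes; continuation bytes always look like 10xxxxxx.
static class Utf8Helper
{
    public static int SequenceLength(byte lead)
    {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII, one byte
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: two-byte character
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: three-byte character
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: four-byte character
        throw new ArgumentException("Continuation or invalid byte, not a lead byte.");
    }
}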
I would recommend using the length-data format, rather than the data-delimiter format, though. It makes everything easier IMO.
enum Bool { True, False, FileNotFound };
Now that I think about it, a length-data format would probably be better for a TCP stream, and I wouldn't have to reserve characters for delimiters or check whether all the bytes of a Unicode character have arrived. Thanks.
One quick question before I re-attempt making another server. I think I have come up with a good idea for this length-data networking format: a 32-bit integer (4 bytes long) representing the length of the message about to arrive, followed by one byte representing the core type of the message (e.g. is it a Unicode string, a serialized object, or maybe something lower-level).

So really it looks like this (what the bytes look like):

234:255:234:234:0
The first 4 bytes represent the 32-bit integer and the last byte (which is 0) represents the core message type; 0 in the last byte in this case means a Unicode string message. What is your opinion on this strategy? I should probably be more confident in what I come up with, but sadly I have this thing where I have to learn things the RIGHT way, even though really there is no one right way in coding xD.
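To make that concrete, here's a rough C# sketch of reading such a [4-byte length][1-byte type][payload] frame off a TCP stream. I'm assuming the length is little-endian and counts only the payload; ReadExactly and the class name are just placeholders, not from any library:

using System;
using System.Net.Sockets;

static class MessageReader
{
    // Reads one [4-byte length][1-byte type][payload] message from the stream.
    public static (byte Type, byte[] Payload) ReadMessage(NetworkStream stream)
    {
        byte[] header = ReadExactly(stream, 5);
        int length = BitConverter.ToInt32(header, 0); // little-endian on typical platforms
        byte type = header[4];
        byte[] payload = ReadExactly(stream, length);
        return (type, payload);
    }

    // TCP is a stream, so a single Read() may return fewer bytes than requested;
    // keep looping until the full count has arrived.
    private static byte[] ReadExactly(NetworkStream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new Exception("Connection closed mid-message.");
            offset += read;
        }
        return buffer;
    }
}

If Type is 0, the payload would then be decoded with Encoding.UTF8.GetString(payload).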

Anyway, your opinion (or anyone's opinion) on this strategy would be appreciated.

Thanks, Xanather.
Just one data point for you: good old 7-bit ASCII is a proper subset of the UTF-8 encoding of Unicode. In other words, an ASCII string is a UTF-8 string. You can save the last byte of your message header, since it will always have the same value.

Stephen M. Webb
Professional Free Software Developer

Yes, I am aware of that, thanks anyway though. Isn't it the case that once you start using symbols beyond the English-language characters, a character goes from 1 byte to 2 (or more) bytes? Also, what if I want to send serialized objects? How would you do that without the core message types, i.e. that 1 byte I was talking about in my previous post?
ASCII characters (0 to 127) always take one byte (and are stored as-is). Characters not present in ASCII (128 onwards) take multiple bytes.
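You can see this quickly in C# (the byte counts in the comments are for these exact example characters):

using System;
using System.Text;

class Utf8ByteCountDemo
{
    static void Main()
    {
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));   // 1 byte (ASCII)
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"));   // 2 bytes
        Console.WriteLine(Encoding.UTF8.GetByteCount("日"));  // 3 bytes
        Console.WriteLine(Encoding.UTF8.GetByteCount("😀"));  // 4 bytes
    }
}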
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.
I think a 4-byte length prefix isn't needed for a game. Sending four gigabytes of data (the maximum representable by 4 bytes) over any network will take a long time, far longer than you want to wait for an individual game message.

I find myself using single bytes when I'm talking about messages (individual units of information), and double bytes when talking about packets (the full data transmission unit, like a UDP packet or a "network tick" packet).
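As a rough illustration of that convention (the field sizes and all names here are just my own sketch, not any fixed format):

using System;
using System.Collections.Generic;
using System.IO;

static class PacketBuilder
{
    // Packs several small messages into one packet:
    // [2-byte packet length][1-byte message length][message bytes]... repeated.
    public static byte[] Build(IEnumerable<byte[]> messages)
    {
        var body = new MemoryStream();
        foreach (byte[] msg in messages)
        {
            if (msg.Length > byte.MaxValue)
                throw new ArgumentException("Message too long for a 1-byte length prefix.");
            body.WriteByte((byte)msg.Length);
            body.Write(msg, 0, msg.Length);
        }

        var packet = new MemoryStream();
        ushort bodyLength = checked((ushort)body.Length); // throws if the packet is too big
        packet.WriteByte((byte)(bodyLength & 0xFF)); // low byte first (little-endian)
        packet.WriteByte((byte)(bodyLength >> 8));   // high byte
        body.WriteTo(packet);
        return packet.ToArray();
    }
}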

Another useful optimization is the varint: values 0 through 127 are encoded as a single byte; higher values are encoded 7 bits at a time, with the high bit set to 1 on every byte except the last. To decode, keep reading bytes, masking off the high bit and shifting by 7, until you get a byte with a clear high bit. You can also encode negative numbers this way by sending a "negate this" bit as the last value bit (thus packing 6 bits of numeric info into the last byte).
For a stream over TCP (or a file), this is useful when you want to be able to go really large once in a blue moon, but don't want to pay the overhead for each and every little unit of data.
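A minimal unsigned-only C# sketch of that idea (leaving out the negative-number trick; the names are placeholders):

using System;
using System.IO;

static class VarInt
{
    // Writes 7 bits per byte, lowest bits first; a set high bit means "more bytes follow".
    public static void Write(Stream stream, uint value)
    {
        while (value >= 0x80)
        {
            stream.WriteByte((byte)(value | 0x80)); // low 7 bits plus continuation flag
            value >>= 7;
        }
        stream.WriteByte((byte)value); // final byte has a clear high bit
    }

    // Keep reading bytes, masking off the high bit and accumulating 7 bits at a time,
    // until a byte with a clear high bit ends the value.
    public static uint Read(Stream stream)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }
}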
enum Bool { True, False, FileNotFound };
Wow, thank you for that, I never knew that. :)

So you're saying that with UTF-8 encoding, when you use the English language (well, ASCII characters), the last bit is always in the binary state of 1, because that byte is a full character by itself? This really makes sense if it's true. And other, non-ASCII characters may have several 7-bit data bytes with the last (8th) bit in the binary state of 0 (unless, of course, it's the last byte of the character, indicating that you have all the data needed to decode it)?

What I don't understand, though, is how you could pack 6 bits of numeric info into the last byte?

(I didn't type that too well... heh)
It sounds like you're making this too complicated. Send whatever you need and don't worry too much about payload size; you can optimize it later if needed. Optimization before profiling is usually wasted effort.
