Using Unicode with a TCP Stream


14 replies to this topic

#1 Xanather   Members   -  Reputation: 712


Posted 16 June 2012 - 08:34 AM

I'm going to begin coding another C# game server for learning purposes soon (I've made servers before, and each time I try to re-implement them better). It will use TCP (it's not a first-person shooter, so there is no point spending time adding reliability on top of UDP) and will also use a delimiter-based messaging system. This means all data sent through the network will be Unicode text (using UTF-8 encoding).

There is a problem though: this time I am switching from 1-byte-per-character ASCII to Unicode. A UTF-8 character, however, can be more than 1 byte long (up to 4 bytes!), and since TCP is really a stream, how will I know when I have received a whole character? I might receive half of one, and what happens if I try to decode bytes that make up only half a character? Would I have to use ASCII instead, or a different messaging system, or some sort of special strategy?

All replies are appreciated. Thanks Xanather.


#2 hplus0603   Moderators   -  Reputation: 5693


Posted 16 June 2012 - 09:47 AM

The UTF-8 format makes it possible to detect where a character starts, and once you have the lead byte, you know how long the character should be.
I would recommend using the length-data format rather than the data-delimiter format, though. It makes everything easier IMO.
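
Purely as a sketch, length-prefixed framing in C# could look something like this (ReadMessage and ReadExact are illustrative names, and the 4-byte little-endian prefix is just one possible convention, not something prescribed here):

// Read one length-prefixed UTF-8 message from a TCP stream.
using System;
using System.IO;
using System.Text;

static class Framing
{
    static void ReadExact(Stream stream, byte[] buffer, int count)
    {
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
    }

    public static string ReadMessage(Stream stream)
    {
        byte[] header = new byte[4];
        ReadExact(stream, header, 4);                  // 4-byte length prefix
        int length = BitConverter.ToInt32(header, 0);  // machine endianness; agree on one for the wire
        byte[] payload = new byte[length];
        ReadExact(stream, payload, length);            // wait for the whole body
        return Encoding.UTF8.GetString(payload);       // decode only complete messages, so no half characters
    }
}

Because the payload is decoded only once all of its bytes have arrived, the "half a character" problem from the original post never comes up.
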
enum Bool { True, False, FileNotFound };

#3 Xanather   Members   -  Reputation: 712


Posted 16 June 2012 - 09:58 AM

Now that I think about it, for a TCP stream a length-data format would probably be better: I wouldn't have to reserve characters as delimiters or check whether all the bytes of a character have arrived. Thanks.

Edited by Xanather, 16 June 2012 - 09:59 AM.


#4 Xanather   Members   -  Reputation: 712


Posted 18 June 2012 - 05:03 AM

One quick question before I re-attempt making another server. I think I have come up with a good idea for this length-data networking format: a 32-bit integer (4 bytes long) representing the length of the message about to arrive, followed by one byte representing the core type of the message (e.g. is it a Unicode string, a serialized object, or maybe something lower-level).

So really it looks like this (what the bytes look like):

234:56:234:234:0
The first 4 bytes are the 32-bit length integer (each shown here as a value from 0 to 255) and the last byte (0) represents the core message type; 0 in this case means a Unicode string message. What is your opinion on this strategy? I should probably be more confident in my own ideas, but sadly I have this thing where I have to learn things the RIGHT way, even though really there is no single right way in coding xD.
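
As a rough sketch of that header in C# (WriteString and MessageTypeString are made-up names, and whether the length counts the type byte is a choice you would have to pin down):

// Write a message as: 4-byte length, 1 type byte, then the UTF-8 payload.
using System;
using System.IO;
using System.Text;

static class Writer
{
    const byte MessageTypeString = 0;   // 0 = Unicode string, per the scheme above

    public static void WriteString(Stream stream, string text)
    {
        byte[] payload = Encoding.UTF8.GetBytes(text);
        byte[] length = BitConverter.GetBytes(payload.Length);  // 4 bytes, machine endianness
        stream.Write(length, 0, 4);
        stream.WriteByte(MessageTypeString);
        stream.Write(payload, 0, payload.Length);               // length here counts only the payload
    }
}
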

A reply with your opinion (or anyone's opinion) on this strategy would be appreciated.

Thanks, Xanather.

#5 Bregma   Crossbones+   -  Reputation: 5406


Posted 18 June 2012 - 05:36 AM

Just one data point for you: good old 7-bit ASCII is a proper subset of the UTF-8 encoding of Unicode. In other words, an ASCII string is a UTF-8 string. You can save the last byte of your message header, since it will always have the same value.
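
One quick way to convince yourself of this (the class name is only illustrative):

// An ASCII string encodes to exactly the same bytes under ASCII and UTF-8.
using System;
using System.Linq;
using System.Text;

class AsciiSubsetDemo
{
    static void Main()
    {
        byte[] ascii = Encoding.ASCII.GetBytes("hello");
        byte[] utf8  = Encoding.UTF8.GetBytes("hello");
        Console.WriteLine(ascii.SequenceEqual(utf8));   // True
    }
}
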

Edited by Bregma, 18 June 2012 - 05:36 AM.

Stephen M. Webb
Professional Free Software Developer

#6 Xanather   Members   -  Reputation: 712


Posted 18 June 2012 - 06:15 AM

Yes, I am aware of that, thanks anyway though. Isn't it the case that once you start using symbols other than English-language characters, the encoding moves from 1 byte to 2 or more bytes? Also, what if I want to send serialized objects? How would you do that without the core message type, that 1 byte I was talking about in my previous post?

#7 Sik_the_hedgehog   Crossbones+   -  Reputation: 1833


Posted 18 June 2012 - 08:51 AM

ASCII characters (0 to 127) always take one byte (and are stored as-is). Characters not present in ASCII (128 onwards) take multiple bytes.
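
A small illustration of those byte counts in C# (the sample characters are arbitrary):

// How many UTF-8 bytes a few characters take.
using System;
using System.Text;

class ByteCountDemo
{
    static void Main()
    {
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));           // 1 byte  (ASCII)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9"));      // 2 bytes (e with acute accent)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u20AC"));      // 3 bytes (euro sign)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\U0001D11E"));  // 4 bytes (musical G clef, outside the BMP)
    }
}
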
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#8 hplus0603   Moderators   -  Reputation: 5693


Posted 18 June 2012 - 09:47 AM

I think 4 bytes for a length prefix isn't needed for a game. Sending four gigabytes of data (the max representable by 4 bytes) over any network will take a long time, much longer than you want to wait for an individual game message.

I find myself using single bytes when I'm talking about messages (individual units of information) and double bytes when talking about packets (the full data-transmission unit, like a UDP packet or a "network tick" packet).

Another useful optimization is the varint: values 0 through 127 are encoded as a single byte; higher values are encoded 7 bits at a time, with the high bit set to 1 on every byte except the last. To decode, keep reading bytes, masking off the high bit and shifting left by 7 each time, until you get a byte with a clear high bit. You can also encode negative numbers this way by sending a "negate this" bit as the last value bit (thus packing 6 bits of numeric info into the last byte).
For a stream over TCP (or a file), this is useful if you want to be able to go really large once in a blue moon but don't want to pay the overhead for each and every little unit of data.
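
A sketch of that varint scheme in C#, using the common least-significant-bits-first layout (the exact bit order is an assumption; the negative-number trick mentioned above is left out for brevity):

// 7 data bits per byte; the high bit is set on every byte except the last (the stop byte).
using System;
using System.IO;

static class Varint
{
    public static void Write(Stream stream, uint value)
    {
        while (value >= 0x80)
        {
            stream.WriteByte((byte)(value | 0x80));  // more bytes follow
            value >>= 7;
        }
        stream.WriteByte((byte)value);               // stop byte: high bit clear
    }

    public static uint Read(Stream stream)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (uint)(b & 0x7F) << shift;     // mask off the high bit, shift into place
            if ((b & 0x80) == 0) return result;      // clear high bit means we are done
            shift += 7;
        }
    }
}

With this layout a value of 0..127 costs one byte and 128..16383 costs two, which is why it suits the "usually small, occasionally huge" lengths described above.
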

enum Bool { True, False, FileNotFound };

#9 Xanather   Members   -  Reputation: 712


Posted 19 June 2012 - 05:35 AM

Wow, thank you for that, I never knew that.

So you're saying that with UTF-8 encoding, when you use English-language (well, ASCII) characters, the last bit is always in the binary state of 1, since that byte is a full character by itself? That really makes sense if it's true. And other, non-ASCII characters may have several bytes of 7 data bits each, with the last (8th) bit in the binary state of 0 (unless, of course, it's the last byte of the character, indicating that you have all the data needed to decode it)?

What I don't understand, though, is how you could pack 6 bits of numeric info into the last byte?

(I didn't type that too well... heh)

Edited by Xanather, 19 June 2012 - 05:55 AM.


#10 Rasterman   Members   -  Reputation: 206


Posted 19 June 2012 - 09:39 AM

It sounds like you're making this too complicated. Send whatever you need and don't worry too much about payload size; you can optimize later if needed. Optimization before profiling is usually wasted effort.

#11 Bregma   Crossbones+   -  Reputation: 5406


Posted 19 June 2012 - 09:52 AM

Are you asking how to parse UTF-8 or are you asking how to send a stream of octets over TCP? The two subjects are not really related in any way.
Stephen M. Webb
Professional Free Software Developer

#12 hplus0603   Moderators   -  Reputation: 5693


Posted 19 June 2012 - 11:43 AM

You don't need to encode the serialized object data to UTF-8 to send it. Just send it as raw bytes with a length prefix. The fact that send() takes a char* is not string-related; you can treat it as a void*.

Varint is not the same encoding method as UTF-8. UTF-8 and varint both solve the problem of "how do I pack various integers in the range 0 .. N into a sequence of smaller integers in the range 0 .. 255?" For UTF-8, the "integers" are Unicode code points. UTF-8 has the benefit that you know how long the encoding of a code point is just by looking at the first byte, but it tops out at 31 bits of data in its original design (21 bits, or four bytes, in the current standard). It is also less efficient than varint. Varint has the benefit of knowing exactly which bytes are stop bytes -- every stop byte has the 0x80 bit cleared -- and it is more efficient than UTF-8, requiring the same number of bytes or fewer for any particular code point. Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.
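
To make the "look at the first byte" point concrete, here is a hedged sketch of reading a sequence length from a UTF-8 lead byte (limited to the current 4-byte standard):

// Number of bytes in the UTF-8 sequence that starts with this lead byte.
static class Utf8
{
    public static int SequenceLength(byte lead)
    {
        if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII, one byte
        if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
        return 0;                             // continuation byte or invalid lead byte
    }
}
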

enum Bool { True, False, FileNotFound };

#13 Xanather   Members   -  Reputation: 712


Posted 19 June 2012 - 09:48 PM

OK, silly me, I thought "varint" was somehow related to UTF-8 xD (well, they are related in that both could encode Unicode code points, but they are still very different). I know what I'm going to do now. Thanks for the replies, everyone.

Edited by Xanather, 19 June 2012 - 09:56 PM.


#14 Sik_the_hedgehog   Crossbones+   -  Reputation: 1833


Posted 19 June 2012 - 11:28 PM

So you're saying that with UTF-8 encoding, when you use English-language (well, ASCII) characters, the last bit is always in the binary state of 1, since that byte is a full character by itself? That really makes sense if it's true. And other, non-ASCII characters may have several bytes of 7 data bits each, with the last (8th) bit in the binary state of 0 (unless, of course, it's the last byte of the character, indicating that you have all the data needed to decode it)?

You got it the other way around: ASCII characters always have their highest bit clear (the value is less than 0x80); the bytes of non-ASCII characters always have their highest bit set (they are all 0x80 or above). But yeah, the idea is the same.

Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.

It's due to error handling: with UTF-8, if you start reading in the middle of a character you get an invalid byte sequence (and the program thereby knows to skip it); with varint it will simply look like a different value (which can potentially cause trouble). Since UTF-8 was meant to be used for communication between devices, this kind of error handling is important.
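
A minimal sketch of that "skip it" behaviour, assuming you only want to resynchronize to the next character boundary (the method name is made up):

// Advance past UTF-8 continuation bytes (10xxxxxx) until the next lead byte.
static class Utf8Resync
{
    public static int SkipToNextLeadByte(byte[] data, int index)
    {
        while (index < data.Length && (data[index] & 0xC0) == 0x80)
            index++;                 // continuation bytes can never start a character
        return index;                // index of the next lead byte, or data.Length
    }
}
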

To the OP: as for how varint works, just look up MIDI variable-length values =P
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#15 hplus0603   Moderators   -  Reputation: 5693


Posted 20 June 2012 - 04:53 PM

It's due to error handling: with UTF-8, if you start reading in the middle of a character you get an invalid byte sequence (and the program thereby knows to skip it); with varint it will simply look like a different value


Any protocol or implementation that requires you to deal with somehow ending up in the middle of an encoded character is inherently broken. Being able to tell that the data is broken isn't going to help you figure out what you were supposed to do with it.
enum Bool { True, False, FileNotFound };



