#1 Members - Reputation: 634
Posted 16 June 2012 - 08:34 AM
There is a problem though: this time I am switching from 1-byte-long ANSI to Unicode. A Unicode character, however, can be more than 1 byte long (up to 4 bytes in UTF-8), and since TCP is really a stream, how will I know whether I have received all Unicode characters in full? I might receive half a Unicode character; what happens if I try to decode bytes that make up only half of a Unicode character? Would I have to use ANSI, a different messaging system, or some sort of special strategy?
All replies are appreciated. Thanks Xanather.
#2 Moderators - Reputation: 3292
Posted 16 June 2012 - 09:47 AM
I would recommend using the length-data format, rather than the data-delimiter format, though. It makes everything easier IMO.
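As a rough illustration of the length-data format, here is a minimal Python sketch. The names `FrameBuffer` and `frame` are made up for this example, and the 4-byte big-endian length prefix is an assumption, not something the thread specifies. The point is that bytes are only decoded once a whole message has arrived, so a character split across two TCP reads never reaches the decoder.

```python
import struct


class FrameBuffer:
    """Accumulates raw TCP bytes and returns complete length-prefixed messages."""

    # Assumption: 4-byte unsigned length prefix in network (big-endian) byte order.
    HEADER = struct.Struct("!I")

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add received bytes; return every message that is now complete."""
        self._buf += chunk
        messages = []
        while True:
            if len(self._buf) < self.HEADER.size:
                break  # length prefix not fully received yet
            (length,) = self.HEADER.unpack_from(self._buf, 0)
            end = self.HEADER.size + length
            if len(self._buf) < end:
                break  # payload not fully received yet
            payload = bytes(self._buf[self.HEADER.size:end])
            del self._buf[:end]
            # Safe to decode: a complete payload holds only whole characters.
            messages.append(payload.decode("utf-8"))
        return messages


def frame(text: str) -> bytes:
    """Prefix the UTF-8 payload with its length in bytes (not characters)."""
    payload = text.encode("utf-8")
    return FrameBuffer.HEADER.pack(len(payload)) + payload
```

Feeding the buffer one byte at a time, the worst case for a stream, still yields each string intact, because `feed` simply returns an empty list until a full frame is available.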
#3 Members - Reputation: 634
Posted 16 June 2012 - 09:58 AM
Edited by Xanather, 16 June 2012 - 09:59 AM.
#4 Members - Reputation: 634
Posted 18 June 2012 - 05:03 AM
So really it looks like this (what the bytes look like):
234:255:234:234:0
The first 4 bytes represent the 32-bit integer and the last byte (which is 0) represents the core message type. A 0 in the last byte in this case represents a Unicode string message. What is your opinion on this strategy? I should probably be more confident in what I come up with, but sadly I have this thing where I have to learn things the RIGHT way, though really there is no right way in coding xD.
Even just a reply with your opinion (or anyone's opinion) on this strategy would be appreciated.
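A header like the one described could be sketched in Python as follows. The big-endian byte order, the choice that the length counts only the payload (not the type byte), and the names `pack_message` and `unpack_message` are all assumptions made for illustration.

```python
import struct

# Assumed layout: 4-byte big-endian payload length, then a 1-byte message type.
HEADER = struct.Struct("!IB")
MSG_UNICODE_STRING = 0  # type 0 = Unicode string message, as in the post


def pack_message(msg_type: int, payload: bytes) -> bytes:
    """Build header + payload for one message."""
    return HEADER.pack(len(payload), msg_type) + payload


def unpack_message(data: bytes):
    """Return (msg_type, payload, remaining_bytes), or None if data is incomplete."""
    if len(data) < HEADER.size:
        return None  # header itself has not fully arrived
    length, msg_type = HEADER.unpack_from(data, 0)
    end = HEADER.size + length
    if len(data) < end:
        return None  # payload has not fully arrived
    return msg_type, data[HEADER.size:end], data[end:]
```

The receiver would keep appending incoming bytes to a buffer and call `unpack_message` until it stops returning `None`, then dispatch on `msg_type`.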
Thanks, Xanather.
#5 Members - Reputation: 2762
Posted 18 June 2012 - 05:36 AM
Edited by Bregma, 18 June 2012 - 05:36 AM.
#6 Members - Reputation: 634
Posted 18 June 2012 - 06:15 AM
#7 Members - Reputation: 957
Posted 18 June 2012 - 08:51 AM
#8 Moderators - Reputation: 3292
Posted 18 June 2012 - 09:47 AM
I find myself using single bytes when I'm talking about messages (individual units of information), and double bytes when talking about packets (the full data transmission unit, like a UDP packet or a "network tick" packet).
Another useful optimization is varint: values 0 through 127 are encoded as a single byte; higher values are encoded 7 bits at a time, with the highest bit set to 1 on every byte except the last. To decode, keep reading bytes, masking off the high bit and shifting by 7, until you get a byte with a clear high bit. You can also encode negative numbers this way by sending a "negate this" bit as the last value bit (thus packing 6 bits of numeric info into the last byte).
For a stream over TCP (or a file), this is useful if you want to be able to go really large once in a blue moon, but don't want to pay the overhead for each and every little unit of data.
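The unsigned case might look like this in Python. The post does not say which 7-bit group goes first; this sketch assumes least-significant group first, and the function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first (assumed)."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)  # stop byte: high bit clear
    return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_pos); raise ValueError if the data ends mid-value."""
    result = 0
    shift = 0
    while pos < len(data):
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # mask off the high bit
        if not (byte & 0x80):
            return result, pos  # clear high bit marks the last byte
        shift += 7
    raise ValueError("truncated varint")
```

Small values cost one byte, and a length of a few gigabytes still fits in five, which is the trade-off described above.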
#9 Members - Reputation: 634
Posted 19 June 2012 - 05:35 AM
So you're saying that with UTF-8 encoding, when you use the English language (well, ASCII characters), the last bit is constantly in the binary state of 1, as this is a full character itself? This really makes sense if this is true. And other non-ASCII characters may have several 7-bit data bytes with the last bit (8th bit) having a binary state of 0 (unless, of course, it's the last byte of the character, indicating that you have all the data needed to decode)?
What I don't understand though is how you could pack 6 bits of numeric info into the last byte?
(I didn't type that too well... heh)
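One possible reading of the "negate this" bit, sketched in Python: the stop byte has its high bit clear, its next bit (0x40) carries the sign, and only the remaining 6 bits carry numeric data. The bit position, the low-order-first group ordering, and the function names are assumptions for illustration, not a confirmed format.

```python
def encode_signed_varint(value: int) -> bytes:
    """Sign-and-magnitude varint: the final byte holds a sign bit plus 6 data bits."""
    sign = 0x40 if value < 0 else 0
    mag = -value if value < 0 else value
    out = bytearray()
    while mag > 0x3F:  # more than 6 bits left: emit a 7-bit continuation byte
        out.append((mag & 0x7F) | 0x80)
        mag >>= 7
    out.append(sign | mag)  # stop byte: high bit clear, bit 6 = "negate this"
    return bytes(out)


def decode_signed_varint(data: bytes, pos: int = 0):
    """Return (value, next_pos); raise ValueError if the data ends mid-value."""
    result = 0
    shift = 0
    while pos < len(data):
        byte = data[pos]
        pos += 1
        if byte & 0x80:
            result |= (byte & 0x7F) << shift  # continuation byte: 7 data bits
            shift += 7
        else:
            result |= (byte & 0x3F) << shift  # stop byte: 6 data bits
            return (-result if byte & 0x40 else result), pos
    raise ValueError("truncated varint")
```

With this layout, values from -63 to 63 fit in a single byte, and anything larger spills into continuation bytes exactly as in the unsigned scheme.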
Edited by Xanather, 19 June 2012 - 05:55 AM.
#12 Moderators - Reputation: 3292
Posted 19 June 2012 - 11:43 AM
Varint is not the same encoding method as UTF-8. UTF-8 and Varint both solve the problem of "how do I pack various integers in the range 0 .. N into a sequence of smaller integers with the range 0 .. 255?" For UTF-8, the "integers" are Unicode code points. UTF-8 has the benefit that you know how long the encoding of a code point is by looking at the first byte, but it tops out at encoding 31 bits of data in its original six-byte form (the current standard stops at four bytes). It is also less efficient than Varint. Varint has the benefit that you know exactly which bytes are stop bytes: every stop byte has the 0x80 bit cleared. It is also more efficient than UTF-8, requiring the same number or a smaller number of bytes for any particular code point. Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.
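The size claim is easy to check with a few lines of Python. This is only a sketch: `varint_len` and `utf8_len` are hypothetical helper names, and the varint is assumed to carry 7 payload bits per byte as described earlier in the thread.

```python
def varint_len(value: int) -> int:
    """Bytes needed for an unsigned varint at 7 payload bits per byte."""
    n = 1
    while value > 0x7F:
        value >>= 7
        n += 1
    return n


def utf8_len(code_point: int) -> int:
    """Bytes Python's UTF-8 encoder uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# A few sample code points: 'A', 'é', the euro sign, and an emoji.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}: UTF-8 {utf8_len(cp)} bytes, varint {varint_len(cp)} bytes")
```

For example, the euro sign U+20AC takes three bytes in UTF-8 but two as a varint, because UTF-8 spends extra bits in every byte on its self-describing structure.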
#13 Members - Reputation: 634
Posted 19 June 2012 - 09:48 PM
Edited by Xanather, 19 June 2012 - 09:56 PM.
#14 Members - Reputation: 957
Posted 19 June 2012 - 11:28 PM
So you're saying that with UTF-8 encoding, when you use the English language (well, ASCII characters), the last bit is constantly in the binary state of 1, as this is a full character itself? This really makes sense if this is true. And other non-ASCII characters may have several 7-bit data bytes with the last bit (8th bit) having a binary state of 0 (unless, of course, it's the last byte of the character, indicating that you have all the data needed to decode)?
You got it the other way around: ASCII characters always have their highest bit clear (the value is less than 0x80), while non-ASCII characters always have the highest bit set (their bytes are all 0x80 or above). But yeah, the idea is the same.
Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.
It's due to error handling: with UTF-8, if you try to read from the middle of a character you will get an invalid byte sequence (and thereby the program will know to skip it), whereas with varint it will look like a different value instead (which can potentially cause trouble). Since UTF-8 was meant to be used for communication between devices, this kind of error handling is important.
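The difference can be seen in a small Python sketch. The helper names are hypothetical, and the varint layout (low-order 7-bit group first) is an assumption carried over from earlier in the thread.

```python
def decode_varint(data: bytes, pos: int = 0) -> int:
    """Decode one unsigned varint starting at pos (low-order group first, assumed)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result
        shift += 7


def is_invalid_utf8(data: bytes) -> bool:
    """True if strict UTF-8 decoding rejects the bytes."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


euro = "€".encode("utf-8")   # three bytes: E2 82 AC
varint_300 = b"\xac\x02"     # the value 300 as a varint

print(is_invalid_utf8(euro[1:]))      # starting mid-character is detected
print(decode_varint(varint_300, 1))   # starting mid-value just gives another number
```

Starting one byte late, the UTF-8 decoder raises an error because a continuation byte cannot begin a character, while the varint decoder happily returns a small, wrong value.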
To the OP: as for how varint works, just look up MIDI variable-length values =P
#15 Moderators - Reputation: 3292
Posted 20 June 2012 - 04:53 PM
It's due to error handling: with UTF-8, if you try to read from the middle of a character you will get an invalid byte sequence (and thereby the program will know to skip it), whereas with varint it will look like a different value instead
Any protocol or implementation that requires you to deal with somehow ending up in the middle of an encoded character is inherently broken. Being able to tell that it's broken isn't going to help you figure out what you were supposed to be doing with that data.






