Xanather

Using Unicode with a TCP Stream

14 posts in this topic

I'm going to begin coding another C# game server for learning purposes soon (I've made servers before, and each time I try to re-implement it better and better). It will use TCP (it's not a first-person shooter and I [b]do not[/b] want to waste time adding reliability on top of UDP for no reason) and will also use a delimiter-based messaging system. This means all data sent through the network will be in Unicode format (using UTF-8 encoding).

There is a problem though: this time I am switching from a single-byte ANSI encoding to Unicode. A UTF-8 character, however, can be more than 1 byte long (up to 4 bytes!), and since TCP is really a stream, how will I know if I have received all the Unicode characters in full? I might receive half of a character, so what happens if I try to decode bytes that make up only half of a character? Would I have to use ANSI, a different messaging system, or some sort of special strategy?

All replies are appreciated. Thanks, Xanather.

The UTF-8 format makes it possible to detect where a character starts, and once you have the first byte of a character, you know how many bytes long it should be.
I would recommend using the length-data format rather than the data-delimiter format, though. It makes everything easier IMO.
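
Something like this on the receiving side, for example (just a rough C# sketch, untested; the names are made up and it assumes a 4-byte little-endian length prefix):

[code]
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

static class MessageReader
{
    // Reads one length-prefixed UTF-8 message from the stream.
    public static string ReadMessage(NetworkStream stream)
    {
        byte[] lengthBytes = ReadExactly(stream, 4);
        int length = BitConverter.ToInt32(lengthBytes, 0); // little-endian on typical hardware

        byte[] payload = ReadExactly(stream, length);
        return Encoding.UTF8.GetString(payload);
    }

    // TCP is a stream, so a single Read() may return fewer bytes than requested;
    // keep reading until the whole chunk has arrived.
    private static byte[] ReadExactly(NetworkStream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0)
                throw new IOException("Connection closed mid-message.");
            offset += read;
        }
        return buffer;
    }
}
[/code]

That way you never hand the decoder a partial character, because you never decode a partial message.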

Now that I think about it, for a TCP stream a length-data format would probably be better; I wouldn't have to reserve characters as delimiters or check whether all the bytes of a character have arrived. Thanks. Edited by Xanather

One quick question before I re-attempt making another server. I think I have come up with a good idea for this length-data networking format: a 32-bit integer (4 bytes long) representing the length of the incoming message, and maybe one byte after this 32-bit integer representing the core type of the message (e.g. is it a Unicode string, a serialized object, or maybe something lower level).

So the bytes of the header would look something like this:

17:0:0:0:0
The first 4 bytes represent the 32-bit integer (here a message length of 17) and the last byte (which is 0) represents the core message type; 0 in this case means a Unicode string message. What is your opinion on this strategy? I should probably be more confident in my own ideas, but sadly I have this thing where I want to learn things the RIGHT way, even though really there is no single right way in coding xD.
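
In code I'm picturing something roughly like this for building a message (just a sketch of the idea, I haven't actually written it yet, and the names are made up):

[code]
using System;
using System.IO;
using System.Text;

static class Messages
{
    // Core message type ids: 0 = Unicode string, other values reserved
    // for things like serialized objects.
    public const byte TypeString = 0;

    public static byte[] BuildStringMessage(string text)
    {
        byte[] payload = Encoding.UTF8.GetBytes(text);

        using (var ms = new MemoryStream())
        {
            // 4-byte length (here it counts only the payload bytes)
            ms.Write(BitConverter.GetBytes(payload.Length), 0, 4);
            // 1-byte core message type
            ms.WriteByte(TypeString);
            // the UTF-8 encoded string itself
            ms.Write(payload, 0, payload.Length);
            return ms.ToArray();
        }
    }
}
[/code]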

Anyway, just a reply with your opinion (or anyone else's opinion) on this strategy would be appreciated.

Thanks, Xanather.

Just one data point for you: good old 7-bit ASCII is a proper subset of the UTF-8 encoding of Unicode. In other words, an ASCII string [i]is[/i] a UTF-8 string. You can save the last byte of your message header, since it will always have the same value. Edited by Bregma

Yes, I am aware of that, thanks anyway though. Isn't it the case that once you start using symbols other than English-language characters, they go from 1 byte to 2 (or more) bytes? Also, what if I want to send serialized objects? How would you do that without the core-message-type byte I was talking about in my previous post?

ASCII characters (0 to 127) always take one byte (and are stored as-is). Characters not present in ASCII (128 onwards) take multiple bytes.
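
You can see it directly with the UTF-8 encoder (quick illustration, assuming .NET since you mentioned C#):

[code]
using System;
using System.Text;

class ByteCountDemo
{
    static void Main()
    {
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));  // 1 - plain ASCII
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"));  // 2 - Latin-1 supplement
        Console.WriteLine(Encoding.UTF8.GetByteCount("中")); // 3 - CJK
    }
}
[/code]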

I think 4 bytes for a length prefix isn't needed for a game. Sending a four-gigabyte stream of data (the maximum representable by 4 bytes) over any network will take a long time, longer than you want to wait for an individual game message.

I find myself using single bytes when I'm talking about messages (individual units of information) and double bytes when talking about packets (the full data transmission unit, like a UDP packet or a "network tick" packet).

Another useful optimization is the varint: values 0 through 127 are encoded as a single byte; higher values are encoded 7 bits at a time, with the highest bit set to 1 on every byte except the last. To decode, keep reading bytes, masking off the high bit and shifting by 7, until you get a byte with a clear high bit. You can also encode negative numbers this way by sending a "negate this" bit as the last value bit (thus packing 6 bits of numeric info into the last byte).
For a stream over TCP (or a file) this is useful if you want to be able to go really large once in a blue moon, but don't want to pay the overhead for each and every little unit of data.
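
A rough sketch of that scheme, in case it helps (untested, written against a plain Stream just for illustration, and ignoring the negative-number trick):

[code]
using System.IO;

static class VarInt
{
    // Writes a value 7 bits at a time; every byte except the last has its high bit set.
    public static void Write(Stream stream, uint value)
    {
        while (value >= 0x80)
        {
            stream.WriteByte((byte)(value | 0x80)); // high bit set: more bytes follow
            value >>= 7;
        }
        stream.WriteByte((byte)value);              // high bit clear: last byte
    }

    // Keeps reading bytes, masking off the high bit and shifting by 7,
    // until it sees a byte with a clear high bit.
    public static uint Read(Stream stream)
    {
        uint result = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0)
                throw new EndOfStreamException();
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
            shift += 7;
        }
    }
}
[/code]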

Wow, thank you for that, I never knew that :)

So you're saying that with UTF-8 encoding, when you use the English language (well, ASCII characters), the highest bit (the 8th bit) is always set to 1, since that is a full character by itself? This really makes sense if it's true. And other, non-ASCII characters may have several 7-bit data bytes with the highest bit set to 0 (unless, of course, it's the last byte of the character, indicating that you have all the data needed to decode)?

What I don't understand, though, is how you could pack 6 bits of numeric info into the last byte?

(I didn't type that too well... heh) Edited by Xanather

It sounds like you're making this too complicated. Send whatever you need and don't worry too much about payload size; you can optimize it later if needed. Optimization before profiling is usually wasted effort.

Are you asking how to parse UTF-8 or are you asking how to send a stream of octets over TCP? The two subjects are not really related in any way.

You don't need to encode the serialized object data to UTF-8 to send it. Just send it as raw bytes with a length prefix. The fact that send() takes a char* is not string-related; you can treat it as a void*.

Varint is not the same encoding method as UTF-8. UTF-8 and Varint both solve the problem of "how do I pack various integers in the range 0 .. N into a sequence of smaller integers in the range 0 .. 255?" For UTF-8, the "integers" are Unicode code points. UTF-8 has the benefit that you know how long the encoding of a code point is by looking at the first byte, but it tops out at encoding 31 bits of data IIRC. It is also less efficient than Varint. Varint has the benefit of knowing exactly which bytes are stop bytes -- every stop byte has the 0x80 bit cleared. It is also more efficient than UTF-8, requiring the same number or a smaller number of bytes for any particular code point. Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.

Ok, silly me, I thought "Varint" was somehow related to UTF-8 xD (well, they are related in the sense that both can encode Unicode code points, but they are still very different). I know what I'm going to do now, thanks for the replies everyone. Edited by Xanather

[quote name='Xanather' timestamp='1340105737' post='4950544']So you're saying that with UTF-8 encoding, when you use the English language (well, ASCII characters), the highest bit (the 8th bit) is always set to 1, since that is a full character by itself? This really makes sense if it's true. And other, non-ASCII characters may have several 7-bit data bytes with the highest bit set to 0 (unless, of course, it's the last byte of the character, indicating that you have all the data needed to decode)?[/quote]
You got it the other way around: ASCII characters always have their highest bit clear (the value is less than 0x80), while the bytes of non-ASCII characters always have their highest bit set (they are all 0x80 or above). But yeah, the idea is the same.
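
In other words, you can classify any byte just by looking at its top two bits, something like this (rough sketch):

[code]
static class Utf8Bytes
{
    // Plain ASCII byte: a complete one-byte character (bit pattern 0xxxxxxx).
    public static bool IsAscii(byte b) { return (b & 0x80) == 0x00; }

    // Continuation byte (10xxxxxx): the middle or end of a multi-byte character.
    public static bool IsContinuation(byte b) { return (b & 0xC0) == 0x80; }

    // Lead byte of a multi-byte character (110xxxxx, 1110xxxx or 11110xxx).
    public static bool IsLeadByte(byte b) { return (b & 0xC0) == 0xC0; }
}
[/code]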

[quote name='hplus0603' timestamp='1340127781' post='4950649']Varint is not a widely accepted standard for encoding Unicode code points, but it could do the job just fine.[/quote]
It's due to error handling: with UTF-8, if you start reading in the middle of a character you will get an invalid byte sequence (so the program knows to skip it), whereas with a varint it will simply look like a different value (which can potentially cause trouble). Since UTF-8 was meant to be used for communication between devices, this kind of error handling is important.
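
In C# you can make the decoder strict so it actually reports this instead of silently substituting replacement characters (sketch, untested):

[code]
using System;
using System.Text;

class StrictDecodeDemo
{
    static void Main()
    {
        // The second constructor argument turns on strict decoding: invalid
        // sequences throw instead of quietly becoming U+FFFD replacement characters.
        var strictUtf8 = new UTF8Encoding(false, true);

        byte[] half = { 0xA9 }; // a lone continuation byte, i.e. the "middle" of a character

        try
        {
            strictUtf8.GetString(half);
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("Invalid UTF-8 sequence detected.");
        }
    }
}
[/code]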

To the OP: as for how varint works, just look up MIDI variable-length values =P

[quote name='Sik_the_hedgehog' timestamp='1340170121' post='4950842']
It's due to error handling: with UTF-8, if you start reading in the middle of a character you will get an invalid byte sequence (so the program knows to skip it), whereas with a varint it will simply look like a different value
[/quote]

Any protocol or implementation that requires you to deal with somehow ending up in the middle of an encoded character is inherently broken. Being able to tell that it's broken isn't going to help you figure out what you were supposed to be doing with that data.