
Multithreading message queue ... performance concerns?


I'm currently working on a multiplayer server in Java that uses separate threads to listen for incoming data via TCP.

Now, I've worked in Java and written some minor multithreaded applications before, but I've never done anything on this scale.

 

My problem is that even though the server "works", the code itself is quite ugly, so I'm trying to refactor it without sacrificing a huge amount of performance.

 

So I create two threads on the server for each connected client:

- The first thread waits for incoming TCP messages.

- The second one processes data, sends data back to the client (without being blocked by the TCP thread), does movement validation, etc.

 

 

Now here is the thing:

 

The first thread, which handles the incoming TCP data, also sends data back or does some other funky stuff (like setting specific flags for the player or accessing other systems to update player coordinates, etc.).

This makes the code quite ugly, as I have stuff like this lying around (on the TCP thread):

case 201: // incoming message ID
	// reset run
	player.getCheckpointList().getLock().writeLock().lock();
	try {
		player.getCheckpointList().clear();
	} finally {
		player.getCheckpointList().getLock().writeLock().unlock();
	}
	break;

case 202:
	// checkpoint activated
	int time = StreamConverter.buffer_read_u32(input);

	player.getCheckpointList().getLock().writeLock().lock();
	try {
		player.getCheckpointList().add(time);
	} finally {
		player.getCheckpointList().getLock().writeLock().unlock();
	}
	break;

case 203:
	// goal activated
	long goalTime = StreamConverter.buffer_read_u32(input);
	// TODO: check if the checkpoint count is valid
	PlayerClientData data = player.getPlayerDataCopy(); // copy to avoid multithreading issues
	// TODO: better hash for player identification (e.g. include the player skin in addition to the name)
	int hash = data.getName().hashCode();
	// add to the highscore state (replaces the entry if one already exists)
	server.getGameState().getHighscoreState().addEntry(new LeaderboardEntry(goalTime, data.getName(), (byte) 1, hash));

	// clear the checkpoint list
	player.getCheckpointList().getLock().writeLock().lock();
	try {
		player.getCheckpointList().clear();
	} finally {
		player.getCheckpointList().getLock().writeLock().unlock();
	}
	break;

The thread which handles the incoming TCP traffic accesses other parts of the system and can send messages back. This means it has to use locks (in one form or another) to prevent race conditions.

 

Now I figured it would be a good idea to move all the response code to the other thread (which is doing the exact same stuff anyway) and use a message queue to pass messages from one thread to the other.

 

This would:

 

A. Make the TCP thread have only one task: listening for TCP messages and notifying the other thread, without worrying about responding to the client or updating data on the server side.

 

B. Make the other thread the only place where any kind of data is processed or sent to the client. (Which would make it cleaner and more unified.)

 

 

Now the issue I'm worrying about is the performance of messages:

The only way I can think of implementing this is some kind of message object which is put into the queue by one thread and then read by the other thread.

Now, given that Java uses a GC and every single message would require allocating a new object on the heap, this could be quite problematic in the long run, especially as I want to support up to 64 clients in real time. (64 threads would generate a lot of garbage in a very short time, which would require more frequent GC pauses, although I have no idea how big a performance impact this would have.)
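For reference, a minimal sketch of what that handoff could look like with java.util.concurrent's ArrayBlockingQueue (the Message class here is a hypothetical stand-in for whatever data the packets actually carry):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MessageQueueSketch {
    // Hypothetical message type: an ID plus raw payload bytes.
    static final class Message {
        final int id;
        final byte[] payload;
        Message(int id, byte[] payload) { this.id = id; this.payload = payload; }
    }

    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: applies back-pressure instead of growing without limit.
        BlockingQueue<Message> queue = new ArrayBlockingQueue<>(1024);

        // TCP thread side: enqueue and go straight back to reading the socket.
        queue.put(new Message(202, new byte[] {0, 0, 0, 42}));

        // Processing thread side: take() blocks until a message is available.
        Message m = queue.take();
        System.out.println("got message id " + m.id);
    }
}
```

This is only the queue mechanics; whether the per-message allocation matters is exactly the GC question raised above.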

 

 

Does anyone have experience with message queues and their performance impact?

Is this a common solution for servers (or any kind of multithreaded application)?

Generally it is well understood that one-thread-per-connection is a bad idea; scaling to two threads per connection is going to be even worse, as you've discovered.

A better approach is to have a small pool of IO threads that take turns pulling data off any ready sockets. They then pass this to a single simulation thread which may in turn spawn jobs on a standard multithreaded job system for parallelization.


Moving to Network forum for additional feedback.


You can use an ExecutorService for this, I guess: http://winterbe.com/posts/2015/04/07/java8-concurrency-tutorial-thread-executor-examples/

 

The TCP thread would create a pooled executor with a fixed number of threads (i.e., Executors.newFixedThreadPool(threads)), and when receiving a packet it would push a task to it and keep listening. The ExecutorService automatically handles which thread gets which task for you. This way you only have a couple of threads (say, number of cores - 1, so as to leave the VM one thread to work with).
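A minimal sketch of that setup; the task body here is just a placeholder print, where a real task would parse and process the received packet:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorSketch {
    public static void main(String[] args) throws InterruptedException {
        // Leave one core for the JVM itself, as suggested above.
        int workers = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // The listening thread would submit one task per received packet;
        // the pool decides which worker thread runs it.
        pool.submit(() ->
            System.out.println("processing packet on " + Thread.currentThread().getName()));

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```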

 

Of course, if each task mutates shared data, you'd have to apply the necessary synchronization.


Generally it is well understood that one-thread-per-connection is a bad idea; scaling to two threads per connection is going to be even worse, as you've discovered.

A better approach is to have a small pool of IO threads that take turns pulling data off any ready sockets. They then pass this to a single simulation thread which may in turn spawn jobs on a standard multithreaded job system for parallelization.


Moving to Network forum for additional feedback.

I've read a bit on that topic. As far as I know it's generally considered "bad practice", but from what I was able to gather, this only starts to become an issue once the server has to manage upwards of one hundred threads (like 200 or 300). (Disclaimer: I'm definitely not an expert in this field, so my assumptions can of course be entirely wrong.)

Given that I've never worked with this style of server programming (I think for Java there is the NIO API) and the current server was already somewhat finished/usable, I stuck with the current architecture. (But I'll consider rewriting it if this turns out to be a big issue.)

 

The main question is whether creating message objects on the TCP socket threads, which are then read by the second thread, will bog down performance due to the huge amount of garbage generated.

 

 

 

 

The TCP thread would create a pooled executor with a fixed number of threads (i.e., Executors.newFixedThreadPool(threads)), and when receiving a packet it would push a task to it and keep listening. The ExecutorService automatically handles which thread gets which task for you. This way you only have a couple of threads (say, number of cores - 1, so as to leave the VM one thread to work with).

That's actually a neat idea. But this would require using the "single thread for all connections" architecture, which I have very little to no experience with. This would probably be the best solution in terms of code structure/performance. But even then this would require pretty much two threads per player. One thread has to be active in order to do real-time validation of player data (verifying movement data, and sometimes sending information when certain conditions are met, etc.), while the other threads (generated by the thread executor) would also send responses back to the client or fiddle around with other systems (if they need to).

 

What I would like to do is move all those operations into the one thread that also does the validation of player data (in order to reduce the number of places where locks on systems are taken and to have a central place for sending data back to the player).

 

Of course, I could have one single thread waiting for incoming TCP data from all clients, which then creates (as an example) message objects that get fed into a message queue, so that other threads can read them and take the corresponding actions.

 

 

The big question is: how does a message-queue system fare in a managed language in terms of performance (given the huge number of message objects that get allocated on the heap)?

If your world contains shared state (everybody wants the same pickups, collides against the same walls, go through the same doors, shoot at each other, etc) then using multiple threads for accessing the same world is a bad idea.

A typical simple server app in Java may use a single thread per TCP socket to send/receive data, perhaps, but it's better to use NIO.
And, when receiving data, you want to post it in a queue to your simulation code, which you want to run once per tick: harvesting messages from all the players, running the simulation, and then sending updates as necessary to the players.
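As a sketch of that per-tick harvesting (messages are plain strings here purely for illustration; a real server would use typed message objects):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class TickLoopSketch {
    public static void main(String[] args) {
        // Network threads enqueue here; the simulation thread drains once per tick.
        ConcurrentLinkedQueue<String> inbox = new ConcurrentLinkedQueue<>();
        inbox.add("move player 1");
        inbox.add("chat hello");

        // One simulation tick: harvest everything that has arrived so far.
        List<String> thisTick = new ArrayList<>();
        String msg;
        while ((msg = inbox.poll()) != null) {
            thisTick.add(msg);
        }
        System.out.println("handling " + thisTick.size() + " messages this tick");
        // ... run the simulation, then send updates to the players ...
    }
}
```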


If your world contains shared state (everybody wants the same pickups, collides against the same walls, go through the same doors, shoot at each other, etc) then using multiple threads for accessing the same world is a bad idea.

A typical simple server app in Java may use a single thread per TCP socket to send/receive data, perhaps, but it's better to use NIO.

I think I will look further into that NIO stuff in the next few days. (Now I'm really curious how exactly this works.)

 

 

 

And, when receiving data, you want to post it in a queue to your simulation code, which you want to run once per tick: harvesting messages from all the players, running the simulation, and then sending updates as necessary to the players.

One question though: creating messages will have an overhead. (I'll want to store specific data depending on the message: movement data will have vectors, chat messages will have the received string, etc.) This means that I'll have to use simple objects which represent such data. Those will then be passed to a queue by the TCP thread.

 

Isn't this going to create a lot of garbage (which the garbage collector will then have to clean up) due to constant message object allocations? I know that Java is fast at creating new objects thanks to preallocated memory, but receiving messages (and allocating new message objects) from 64 players simultaneously seems like a lot. I don't really like the idea of having garbage constantly cleaned up by the GC and thus decreasing the overall performance/responsiveness of the server application.


This means that i'll have to use simple Objects which represent such data. Those will then be passed to a queue by the TCP thread.


Yes, one of the problems with Java is that it only has reference types (objects), not value types (other than simple primitives).
This adds more pressure on the garbage collector.
Generally, I'd recommend defining an object that contains all of the network-visible state for an object, and keeping that around as "the view of the object."
You can then link that object to all queues of outgoing data for clients -- "send this view to that client."
The queue-message objects need to be allocated more dynamically, which you can generally solve by keeping a free list of messages rather than calling new and then losing track of each one.
Finally, the message to the client goes into a byte buffer. Typically, you allocate one byte buffer up front with a bigger size (say, 65k) and then use some slice of it for your actual work.
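A minimal sketch of such a free list, with a hypothetical Message class standing in for the real message type (note that later replies in this thread argue pooling can backfire for short-lived objects, so profile before committing to it):

```java
import java.util.ArrayDeque;

public class MessagePoolSketch {
    // Hypothetical reusable message carrying an ID and a payload length.
    static final class Message {
        int id;
        int length;
    }

    private final ArrayDeque<Message> freeList = new ArrayDeque<>();

    // Reuse a message from the free list instead of allocating a new one.
    Message obtain() {
        Message m = freeList.poll();
        return (m != null) ? m : new Message();
    }

    // Return a processed message so the next packet can reuse it.
    void release(Message m) {
        freeList.push(m);
    }

    public static void main(String[] args) {
        MessagePoolSketch pool = new MessagePoolSketch();
        Message a = pool.obtain();
        pool.release(a);
        Message b = pool.obtain();
        System.out.println("reused: " + (a == b));
    }
}
```

A real pool would also need a size cap and, if shared between threads, a concurrent deque or per-thread free lists.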

That being said, if GC performance is the worst problem your game will have, you'll be doing very well!
There are good tools to profile those kinds of problems and there are known work-arounds (such as what I indicate above.)


That's actually a neat idea. But this would require to use the "single thread for all connections" architecture which i've very little to no experience with. This would probably be the best solution in terms of code structure/performance. But even then this would require pretty much 2 threads per player.
Not really. The TCP thread is just a polling thread. All it does is receive a packet, then it immediately sets up a task and hands it to the thread pool. The TCP thread does nothing but listen at the socket, no matter the player. The point of the TCP thread is that it is thin and does minimal work.

 

You'd have one thread, plus a bunch of worker threads, for any number of players. The number of players here doesn't affect the number of threads, since the thread pool keeps them alive (most likely parked if they're doing nothing) to process further tasks, which might or might not belong to the same player; it's irrelevant.

 

Also bear in mind that creating a thread is a costly operation, so you want that pool regardless, even if you only use one thread for TCP and one thread for the rest.

 

Isn't this going to create a lot of garbage (which then the Garbage Collector will have to clean up?) due to constant messageobject allocations?
Yes. For objects that "escape" threads and methods (after JIT inlining), heap allocation is inevitable. So every once in a while a few packets might take more time to get processed (i.e., when the GC runs). Bear in mind that not all GCs are born the same: short-lived objects get collected much faster than long-lived objects. HotSpot has a few object generations, each with different characteristics regarding GC.

 

The thing is that you're missing the forest for the trees. Your objective shouldn't be "every packet must have 10 ms latency" but rather "[high percentage] of packets must have 10 ms latency tops". The point is not to never run the GC, but to keep GC below your required limits. Some packets might need more time than others, and that's fine, as long as you keep them below your limits.

 

Also bear in mind that new objects don't really affect GC time. Garbage collection actually doesn't collect garbage at all; it collects the stuff that isn't garbage, the objects that you're still using.

 

Short-lived objects (like your messages, I assume; you'd need to profile it) will probably live in the eden/young generation and never make it to the old gen. Eden/young gen collections get triggered when you run out of space in that gen, but what HotSpot does is copy out the live objects, not the dead ones. No-longer-reachable memory doesn't get touched. So the GC will only take longer than usual if you have tons of recently allocated live objects, whereas if you used them and discarded them right away, they don't impact the GC time. Allocation of short-lived objects impacts the frequency of GC, not the duration. For long-lived objects it's the opposite: long-lived objects get compacted, so if you have many of those, collection will take longer. This is why object pools are discouraged: they "artificially" increase the liveness of a bunch of tiny objects, which the GC will later have to move around and compact each time an old-gen GC happens.

 

So your usual techniques for avoiding GC might trade a bunch of shorter GCs for one big stop-the-world GC that increases your latency a lot for a few specific packets. That is not what you really want, especially since interpolation techniques on the client side can mask short delays very well. You want to keep the overall throughput constant, even if it's not the highest it can be.

 

As always, first and foremost, try it out. If it ever becomes a problem you can refactor.


Your objective shouldn't be "every packet must have 10 ms latency" but rather "[high percentage] of packets must have 10 ms latency tops"


Well, maybe! It is totally possible to build systems that guarantee latencies. However, doing so IN JAVA is a lot harder, because of the way Java works.
Or, to put it another way: If you are OK with setting some kind of percentile performance, and then optimizing until you get there, knowing that you will never be perfect, then Java is a fine language.
Any language with a GC is similar, although some languages are better at reducing the amount of heap GC (value types in C# help, optimizing away the stack in Haskell helps, reducing the size of the heap to whatever a single message loop uses in Erlang helps, etc.)
There even exist GCs with predictable latency -- incremental generational systems that mark N links for each new allocation are guaranteed to always make progress and never take more than N amount of time per allocation -- but these collectors have other drawbacks (no guaranteed upper bound on memory used, for example) and aren't generally available in typical desktop/server tool chains.


 

Well, maybe! It is totally possible to build systems that guarantee latencies.
Of course. It's been done for high-frequency trading, also in Java. There is the G1 collector in HotSpot, which can be set up to try to reach latency targets. There is Azul Systems' GC for HotSpot, which is designed for "zero-latency GC" (at lower overall throughput, I hear), but if the OP needed such a thing, he/she wouldn't be asking us anything, I believe.


Thank you guys for the input.

 

I was reading into the whole NIO stuff again over the past few days. I even got it working (to some extent) with the game client. But I have to say, it's hard.

(Compared to the thread-per-client approach.)

 

So I managed to merge the non-TCP player threads into one single thread which iterates over all registered players and sends data back and/or processes other stuff (validating coordinates, etc.).

Now the TCP side... 

I also merged all TCP threads (thread per socket) into one NIO thread, but it gets complicated very fast.

 

First, the thread reads the data from the sockets, which then has to be temporarily buffered in a separate byte buffer. (As I could receive only a fragment of a full message, or multiple messages at once.)

 

Then I have to loop over this buffer and parse the data in order to find out whether one full message was actually received and is stored in that buffer.

Then I have to copy that message out, create a message object, put the byte data into the message object and send it to the other thread, which then responds to/processes this message.

The bytes in the temporary message buffer then need to be shifted (to fill the gap where the full message was "cut out") in order to avoid breaking the parsing process.

 

All that stuff seems kind of heavyweight for a single thread which needs to handle up to 64 connections at once. (Especially the whole parsing and message-creation process.)

My idea was to create a separate thread which does the parsing/message creation alongside the TCP listener thread in order to split that load across two cores, but then I need additional locks to avoid race conditions or errors in the message buffer, which would be shared between the two threads. (And I have no idea whether the locks would make the situation even worse if they interfere with each other too much.)

 

 

I have no idea how to proceed. (I'm really considering ditching this.)

There are 3 options now:

 

a) Proceed further with the NIO stuff (which is kind of complicated; in addition, I really have no clue whether the general server architecture I'm building is that much better than the thread-per-client approach. > Lack of experience and knowledge on my part.)

b) Use the thread-per-socket model but merge the other non-TCP threads into one single thread which iterates over all registered clients. (The server would then have a maximum of 64 threads for the sockets + 3 to 4 other threads for all the other systems running in the background.)

c) Revert to the original architecture (1 reader and 1 processing/writer thread per client. Total number of threads would be 64*2 = 128 + 3 to 4 other threads.)

 

It would be a good idea to benchmark that stuff. (But I really have no clue how. I've never written any kind of server before, nor benchmarked a server application. Do I have to write software which emulates 64 clients? Is this the common way of doing things?)


Do i have to write a software which emulates 64 clients? Is this the common way of doing things?


Yes, and yes.

If your regular game client can run without graphics/sound/input, from a command line, with a script controlling it, then you may be able to run many instances of your regular client from the command line on the same machine, and get a few machines to get to the number of simultaneous clients you need.
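A minimal sketch of spinning up scripted connections; the loopback server here is only a stand-in so the example runs on its own, where a real load test would point the bots at the actual game server:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class LoadTestSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in server so this sketch is self-contained.
        try (ServerSocket server = new ServerSocket(0)) {
            int bots = 8; // scale this up towards 64 for a real test
            List<Socket> clients = new ArrayList<>();
            List<Socket> accepted = new ArrayList<>();
            for (int i = 0; i < bots; i++) {
                Socket bot = new Socket("localhost", server.getLocalPort());
                accepted.add(server.accept());
                OutputStream out = bot.getOutputStream();
                out.write(new byte[] {(byte) 201}); // scripted message, e.g. "reset run"
                out.flush();
                clients.add(bot);
            }
            System.out.println("connected bots: " + clients.size());
            for (Socket s : clients) s.close();
            for (Socket s : accepted) s.close();
        }
    }
}
```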


Most of the stuff that sounds awkward there would exist whether I/O is being handled in threads or not. So perhaps you are overestimating the complexity.

 

Message parsing can be simplified if your messages start with their size. Then, knowing if you have a full message or not can be as easy as comparing the buffer size with the size value at or near the start of the buffer. Only when you have a full message do you then have to worry about parsing it, but you can extract the message data and hand it off to another thread for parsing if you like. Don't forget to check that the supplied size value is reasonable...
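A sketch of that size check, assuming a 4-byte big-endian length prefix (the header layout and the size limit are assumptions for illustration, not the poster's actual protocol):

```java
import java.nio.ByteBuffer;

public class FramingSketch {
    static final int MAX_MESSAGE_SIZE = 64 * 1024; // sanity limit, assumed

    // Returns the payload of the next complete message, or null if more bytes
    // are needed. Assumes each message starts with a 4-byte big-endian length.
    static byte[] tryReadMessage(ByteBuffer buf) {
        if (buf.remaining() < 4) return null;     // length field not complete yet
        buf.mark();
        int size = buf.getInt();
        if (size < 0 || size > MAX_MESSAGE_SIZE) {
            throw new IllegalStateException("bad frame size: " + size);
        }
        if (buf.remaining() < size) {             // payload not complete yet
            buf.reset();                          // rewind so we re-read the length later
            return null;
        }
        byte[] payload = new byte[size];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(3).put(new byte[] {1, 2, 3});
        buf.flip();
        byte[] msg = tryReadMessage(buf);
        System.out.println("payload length: " + (msg == null ? -1 : msg.length));
    }
}
```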

 

You can also avoid shifting any data after you extract a message if you use a ring buffer, making that a free operation.


Why not use something like KryoNet?

 

https://github.com/EsotericSoftware/kryonet

 

Looked at it, but it looks like you need to use their proprietary client in order to connect to their server. (And I can't do that, as the client isn't written in Java.)

Additionally, I plan to release the source code of the server (for potential modding purposes), and not having to worry about any kind of code licensing is a nice bonus.

 

 

What I did for the time being is rewrite the server and confine most of the work to specific threads. (And add a message queue for passing messages from the TCP receiver thread to the rest of the system.)

The TCP communication still works with the "one connection per thread" approach, but I can now easily swap that out in the future thanks to better code abstraction in the server source code. (Once I get NIO working properly.)

 

I still have to profile that stuff though...


 

Looked at it, but it looks like you need to have their proprietary client in order to connect to their server. (And i can't do that as the client isn't written in java.)

 

Additionally, i plan to release the source code of the server (for potential modding purposes) and not having to worry about any kind of code licensing is a nice bonus.

 

 

 

 

Any networking library is going to have 'server' and 'client' entities that wrap up all the socket stuff and hide the details. These are just regular old objects that you use to create your actual server and client software. They aren't requiring you to use any sort of proprietary software, which seems to be what you are alluding to.

And Kryonet is open source with a liberal license (the BSD three-clause). It won't impact releasing the source of your project at all, so there's no "code licensing" to worry about. If you're doing this as a learning exercise, that's fine, but if you want to just get stuff done, Kryonet is a fine option.


 

Looked at it, but it looks like you need to have their proprietary client in order to connect to their server.
Proprietary? 

 

My bad (bad wording). What I meant was that you need their client Java code/API in order to connect to the server (which I sadly can't use, as the game isn't written in Java).

 

 

 

Looked at it, but it looks like you need to have their proprietary client in order to connect to their server. (And i can't do that as the client isn't written in java.)

 

Additionally, i plan to release the source code of the server (for potential modding purposes) and not having to worry about any kind of code licensing is a nice bonus.

 

 

 

 

Any networking library is going to have 'server' and 'client' entities that wrap up all the socket stuff and hide the details. These are just regular old objects that you use to create your actual server and client software. They aren't requiring you to use any sort of proprietary software, which seems to be what you are alluding to.

And Kryonet is open source with a liberal license (the BSD three-clause). It won't impact releasing the source of your project at all, so there's no "code licensing" to worry about. If you're doing this as a learning exercise, that's fine, but if you want to just get stuff done, Kryonet is a fine option.

 

The problem is that while the server is written in Java, the game isn't. This means I would have to port the client-side code in order to use it. (Which isn't worth it, due to too many language differences.)

 

But thank you all for responding and giving me advice on this topic.


I've never done stuff with Java threads, but I may add that you should look into how the threads interact with the current GC to prevent memory leaks (you would probably need to kill the thread/unreference the thread/sleep the thread to get the garbage collected). Probably a thread queue would do well enough.

The problem is that while the server is written in java, the game isn't. This means that i would have to port the client side code in order to use it. (Which isn't worth it due to too much language differences.)


You're already doing that anyway, so how does using Kryonet mean you can't continue? You can use Kryonet or any other networking library to implement the server, and then use any language + library combination you want to implement the client. All the client has to do is open a single socket, connect to the server, and start parsing packets. It need neither know nor care how the server is implemented.
 


KryoNet has a built-in serialisation system that won't be compatible in this case. Maybe it's possible to use it just for the connections and ignore all the other stuff, but that seems like an awkward path to take.

