Structuring a simulation over several threads

Started by
7 comments, last by All8Up 12 years, 3 months ago
So, I've been working on the server for my hobby game, so far I've done the simple thing of just running it all in one single thread - doing the "simplest thing that will work", sadly - it does not work any more as one thread is not enough to handle all the sending and receiving of data plus the simulation at once. My initial though was to split the sending and receiving to their separate threads, and run the simulation in one thread, and then communicating between the three threads using message queues. This works decently well, but there is still going to be a point where one thread for the simulation is not going to be enough.

The whole task of truly multi-threading any application is not something undertaken lightly, I have written several big system running on N-core systems, where N is pretty big (32+). But all of these problem domains have not been as complex as a real time simulation are and have had problem domains which are easy to partition over several threads as they were highly isolated.

My first though and the initial implementation I built was to split the server into N threads, where each thread gets a certain subset of the players and actors on the server, and this thread is responsible for receiving data, sending data and running the simulation for the players and actors that it has been assigned. Now, this does actually work pretty well and I've been able to run a huge amount of actors on the server. The problem then comes when one actor in thread T0 needs to access an actor in thread T1.

I once again tried to go with the "simplest thing that works" and just do a simple locking scheme, while you do something to actor X which is not owned by your current thread you have to lock and release him with a normal Monitor.Enter / Monitor.Exit pair. This obviously does not scale and the lock contention becomes insane, when having a lot of interactions the server is barely faster then a single threaded, which is to be expected as doing locking on a large scale essentially makes your application sequential in the same way that a single thread does (not quite true, but the gains are not nearly as big as a truly concurrent implementation), on top of this add all the context switching excessive locking causes and you will not gain so much performance.

So, I decided try to revamp the locking and implemented a thread safe reference counter for the objects shared between threads, so that one thread could say "I need this object right now" or even "I need this object for some time" - but it would not grant it exclusive access to the object, it would just make sure that the object does not vanish under it from being deleted from the world (for example an NPC that could get despawned from it's owning thread while another thread is trying to buy an item from it) .

This still leaves a problem though, with two threads possibly modifying the object at the same time (currently two different threads could say "I need this object right now", and they both will be allowed to access it), which obviously does not work. And this is where I'm at currently, I have a couple of ideas on how to solve this though. I'm looking for feedback on these ideas, or general comments or explanation of how problems like this have been solved before in commercial games or any literature that I could read.

IDEA 1: Creating an even more fine-grained locking scheme which allows for both read and write locks to be held on shared objects.

Basically implementing a fine-grained locking scheme which would allow threads to both lock an actor for reading and writing, only allowing one writer but several readers obviously, there are several ways this could be accomplished by using wait handles built into windows or more light weight methods using Monitor.Enter/Exit and pair of sync objects.

IDEA 2: Allow each thread to be fully self containing, exposing no shared objects to other threads and instead have read-only or some type of auto updating proxy objects which the other threads can use

The basic idea is to not share any writable data between the threads, and instead expose read only objects to the other threads which either act as proxies for the main objects or contain the data needed by the other thread in a read-only format. This has the added benefit of removing the different between "a different thread" and "a different process" and "a different machine" as it would in theory be possible to create an abstraction layer which removes the different communication mechanisms for "between threads", "between processes" and "between machines". How updates to objects would be handled I'm not 100% sure of, probably some incoming message queue that other threads can push wanted changes to objects onto, which then is de-queued in sequential order on the owning thread.

As I said I'm interested in any feed back on this type of problem, I posted it in Multiplayer / Networking because it's specifically for a multi-player server, hope I picked the right forum!
Advertisement

So, I've been working on the server for my hobby game, so far I've done the simple thing of just running it all in one single thread - doing the "simplest thing that will work", sadly - it does not work any more as one thread is not enough to handle all the sending and receiving of data plus the simulation at once. My initial though was to split the sending and receiving to their separate threads, and run the simulation in one thread, and then communicating between the three threads using message queues. This works decently well, but there is still going to be a point where one thread for the simulation is not going to be enough.

The whole task of truly multi-threading any application is not something undertaken lightly, I have written several big system running on N-core systems, where N is pretty big (32+). But all of these problem domains have not been as complex as a real time simulation are and have had problem domains which are easy to partition over several threads as they were highly isolated.

My first though and the initial implementation I built was to split the server into N threads, where each thread gets a certain subset of the players and actors on the server, and this thread is responsible for receiving data, sending data and running the simulation for the players and actors that it has been assigned. Now, this does actually work pretty well and I've been able to run a huge amount of actors on the server. The problem then comes when one actor in thread T0 needs to access an actor in thread T1.

I once again tried to go with the "simplest thing that works" and just do a simple locking scheme, while you do something to actor X which is not owned by your current thread you have to lock and release him with a normal Monitor.Enter / Monitor.Exit pair. This obviously does not scale and the lock contention becomes insane, when having a lot of interactions the server is barely faster then a single threaded, which is to be expected as doing locking on a large scale essentially makes your application sequential in the same way that a single thread does (not quite true, but the gains are not nearly as big as a truly concurrent implementation), on top of this add all the context switching excessive locking causes and you will not gain so much performance.

So, I decided try to revamp the locking and implemented a thread safe reference counter for the objects shared between threads, so that one thread could say "I need this object right now" or even "I need this object for some time" - but it would not grant it exclusive access to the object, it would just make sure that the object does not vanish under it from being deleted from the world (for example an NPC that could get despawned from it's owning thread while another thread is trying to buy an item from it) .

This still leaves a problem though, with two threads possibly modifying the object at the same time (currently two different threads could say "I need this object right now", and they both will be allowed to access it), which obviously does not work. And this is where I'm at currently, I have a couple of ideas on how to solve this though. I'm looking for feedback on these ideas, or general comments or explanation of how problems like this have been solved before in commercial games or any literature that I could read.

IDEA 1: Creating an even more fine-grained locking scheme which allows for both read and write locks to be held on shared objects.

Basically implementing a fine-grained locking scheme which would allow threads to both lock an actor for reading and writing, only allowing one writer but several readers obviously, there are several ways this could be accomplished by using wait handles built into windows or more light weight methods using Monitor.Enter/Exit and pair of sync objects.

IDEA 2: Allow each thread to be fully self containing, exposing no shared objects to other threads and instead have read-only or some type of auto updating proxy objects which the other threads can use

The basic idea is to not share any writable data between the threads, and instead expose read only objects to the other threads which either act as proxies for the main objects or contain the data needed by the other thread in a read-only format. This has the added benefit of removing the different between "a different thread" and "a different process" and "a different machine" as it would in theory be possible to create an abstraction layer which removes the different communication mechanisms for "between threads", "between processes" and "between machines". How updates to objects would be handled I'm not 100% sure of, probably some incoming message queue that other threads can push wanted changes to objects onto, which then is de-queued in sequential order on the owning thread.

As I said I'm interested in any feed back on this type of problem, I posted it in Multiplayer / Networking because it's specifically for a multi-player server, hope I picked the right forum!

Other than most enterprise applications, time is a very limited factor in real-time games/simulations. So you want to avoid almost any locking, cause it will kill performance quite fast.

Your second idea seems to be a good idea plus some buffers. The basic idea is, that you use some kind of lock-step update, that is , you need to syncronise your threads, but only seldomly(once per frame). Here's the basic idea:


#start_sync // only one thread has exclusive access to all objects

// distribution/proxy phase
for each modified object do
distribute all actions to target objects
update proxy of object
end

#end_sync

distribute object to threads

per thread:
for each object do
update object according to actions // gathered in distribution phase
interact with other proxies (read-only)
save (inter-)actions with other objects // enqueue action in a thread-only or object-only queue to avoid syncs
end


The basic idea is, that the communication of an object with other objects is handled by messages which are saved at the object itself only. In a short syncronisation phase, all messages will be distributed, but after that, all objects can continue to work in an isolated way by only "seeing" the proxy objects (a limited, read only view of the real object).
separating network and simulation in different threads is definitly a good approach, and a step in the right direction. Transporting a message from simulation to network can be done with a lockless queue, and the same back. You dont want your simulation thread to wait for messages to get transported.

Your second idea is definitly the most scalable. As you are saying, you can now add processes instead of threads, which will scale to whatever hw you can afford.
I would have focused on designing for process scaling, because; if you have several cores you can start several servers on one machine and just get them to talk together, but if your designing for thread scaling you are limiting yourself to the a single machine.

I think you may find this page interesting http://www.zeromq.org/ - it avoids locks like the devil, and uses lock free techniques. I find the design very facinating, and reading the intro could spawn some ideas ;) at least it did for me!
www.ageofconan.com
Other than most enterprise applications, time is a very limited factor in real-time games/simulations. So you want to avoid almost any locking, cause it will kill performance quite fast.

Your second idea seems to be a good idea plus some buffers. The basic idea is, that you use some kind of lock-step update, that is , you need to syncronise your threads, but only seldomly(once per frame). Here's the basic idea:


#start_sync // only one thread has exclusive access to all objects

// distribution/proxy phase
for each modified object do
distribute all actions to target objects
update proxy of object
end

#end_sync

distribute object to threads

per thread:
for each object do
update object according to actions // gathered in distribution phase
interact with other proxies (read-only)
save (inter-)actions with other objects // enqueue action in a thread-only or object-only queue to avoid syncs
end


The basic idea is, that the communication of an object with other objects is handled by messages which are saved at the object itself only. In a short syncronisation phase, all messages will be distributed, but after that, all objects can continue to work in an isolated way by only "seeing" the proxy objects (a limited, read only view of the real object).


Thanks for your feedback, yes this definitely seems like the way to go about it, with read-only proxy object, I've started on this implementation and it seems to be working pretty well!



separating network and simulation in different threads is definitly a good approach, and a step in the right direction. Transporting a message from simulation to network can be done with a lockless queue, and the same back. You dont want your simulation thread to wait for messages to get transported.

Your second idea is definitly the most scalable. As you are saying, you can now add processes instead of threads, which will scale to whatever hw you can afford.
I would have focused on designing for process scaling, because; if you have several cores you can start several servers on one machine and just get them to talk together, but if your designing for thread scaling you are limiting yourself to the a single machine.

I think you may find this page interesting http://www.zeromq.org/ - it avoids locks like the devil, and uses lock free techniques. I find the design very facinating, and reading the intro could spawn some ideas ;) at least it did for me!


My current solution is looking to be like this, both process and thread based:

1) Based around processes, which can either communicate over TCP/IP if on different machines or over either a loopback-socket or a block of shared memory if on the same machine

2) Each process have the 2+N threads

3) Thread #1 is responsible for receiving and sending data from and to the clients

4) Thread #2 is responsible for communicating with other simulation processes on the same machine or remote machines, sends and receives data from them to keep the local universe "up to date" in regards to what exists around it in forms of actors on other servers, etc.

5) Threads #N are responsible for the simulation itself, this can be from 1 to how many threads you wish.

I think you may find this page interesting http://www.zeromq.org/ - it avoids locks like the devil, and uses lock free techniques. I find the design very facinating, and reading the intro could spawn some ideas ;) at least it did for me!


My current solution is looking to be like this, both process and thread based:
[/quote]

You should look at lockless containers/pipelines; they are sometimes quite useful for in-process synchronization.

I think that, for high scalability, you need to re-think your strategy, though. You shouldn't need to "call into" objects at all. Instead, each individual object looks at the world, the way it sees the world, and takes actions, which may impact the world. The trick is, you want an object to only be able to impact the world in the future, and only take inputs from the world in the past. The bigger you can make this window, the higher a tolerance for latency your system will have.
If you step the world, say, 50 times a second, and say that you can only affect the world for the next step, not the current step, then you have 20 milliseconds of latency to play with. This means that each object could potentially run on a different machine, and still interact with objects running on other machines. The draw-back is that collisions might be "squishy" or "one-sided," because you can't have two objects collaborate to resolve a particular collision pair for a particular time step. Each object needs to independently resolve its "half" of the collision.

The way an object affects the world might be to enqueue messages to other objects within a neighborhood, and it may be to update proxy information about the object (say, the position and orientation of a collision shape), and it may be other such "lightweight" operations -- sending data, rather than running code.

Now, you may still have problems where each simulation server needs to serve the area for the entire simulation, because it has objects that are "spread out;" at that point you probably want to reduce cross-server messaging by loosely tying areas to servers. However, you still want more than one server being able to serve objects in the same area, so you don't suffer a density limitation because of computation limits.

I don't know how dense you want to make the simulation, and how complex the simulation really is, but the above is pretty much where state of the art systems in dense, distributed simulation end up.
enum Bool { True, False, FileNotFound };
If your 'turn' cycle time is fast you may be able to get away with a double buffered state mechanism where the
current simulation turn resolution/arbitration is being performed on LAST turns state data (and the result will
go into the new state buffer and they will flip every turn. That way there is no lock on the data used to determine
the turns results as they are READ-ONLY and the new (next) state data is assembled independantly until
the cycle is done and then it is locked down (distributed) in turn.

With a pipeline scheme you can have other processes doing client bound data packaging while the next turn
results are being processed (client inputs came in and are cut off at the turnover point and those have their own
pipeline for decoding, validation, routing again off of the read-only previous state (maybe a triple buffered scheme)

Redundant copies of turn state (and the load of multiple distribution) can be done but if it amounts to alot
(many dynamic objects) it can significantly load down your cluster. Localized entity clustering with fixed
areas/zones -- or better -- load leveling with shifted spacial boundries would minimize the redundancy into hopefully
small 'edge' overlap areas (though the coding is alot more complicated than a brute force transmit everything
to everyone scheme).

Some games (like space games) have very short interaction distances and low entity counts within most areas.
Low count for the entire game world might make a 'everything to everybody' distribution not really a problem
but as the numbers go up the N-squared demon quickly kills you. No matter what you do, if all the players insist
on being in the same small area you may have to have heavy handed congestion handling that wont appear normal.
--------------------------------------------[size="1"]Ratings are Opinion, not Fact
The problem comes with the fact that one object can modify another object. No matter what, you cant get around the fact that one object will be able to write to another object during gameplay. Having said that, when you introduce threading into the mix, there just isnt a way around the fact that you will have to provide some type of object locking. There are ways to minimize the locking. You could reduce the world into a grid (which is normally done anyway) and then iterate over the world with a parallel_for loop. You can minimize the locking by making the grid sized fairly large, for example: 1000m x1000m. You can then assign a type of border to each grid that exists within the 1000 x 1000 square, maybe 100m border?

Anything writes that occur to an object within that border must acquire a lock because it could be a neighboring thread attempting to write. This means that anything within the 900 x 900 square can preform work without getting any type of lock. The sizes are completely arbitrary and can and will be different. Its the idea . . . .
Wisdom is knowing when to shut up, so try it.
--Game Development http://nolimitsdesigns.com: Reliable UDP library, Threading library, Math Library, UI Library. Take a look, its all free.
You can make a totally lockless, yet still threaded, simulation engine. Well -- all threads need to synchronize for each tick, but that could be done with polling if you want to be really, really, lockless ;-)

The rule is that simulating an object can read the world in its current state, and can read events generated from the previous simulation step by any other object, but cannot write to any object. All it writes is its own next state, and it generates events for any other objects -- those events will be processed next step, not this step.

You can use lockless FIFOs or pre-allocated lockless queues for all the events, because you know the list of events won't be read until the next step (post-tick-sync) so any temporary out-of-order queue manipulation doesn't matter. If you care about consistency, you can then sort the generated events according to some consistency rule as part of "ticking over" to the next step.

You can then take this approach all the way to a software architecture, where inputs/needs, state, and events are pre-declared/pre-allocated as pointers to plain data, and the simulation code can just read/write through those pointers. You then sort those pointers (in RAM) by object, so all data can be properly pre-fetched and you reduce cache misses.

For interactions across nodes (networking,) you would have to either pipeline the simulation one more step (delaying events/interactions/input by an additional step,) or you will have to make sure that the latency of sending events/state across the network is covered by the end of simulation for one tick, and the start of simulation for the next tick -- with 3 millisecond transmission delay (very, very aggressive!) and 60 Hz simulation, this means your simulation can't use more than 80% of each CPU (3 milliseconds latency over 16.6 milliseconds step time gets wasted).
enum Bool { True, False, FileNotFound };

So, I've been working on the server for my hobby game, so far I've done the simple thing of just running it all in one single thread - doing the "simplest thing that will work", sadly - it does not work any more as one thread is not enough to handle all the sending and receiving of data plus the simulation at once. My initial though was to split the sending and receiving to their separate threads, and run the simulation in one thread, and then communicating between the three threads using message queues. This works decently well, but there is still going to be a point where one thread for the simulation is not going to be enough.

IDEA 1: Creating an even more fine-grained locking scheme which allows for both read and write locks to be held on shared objects.
IDEA 2: Allow each thread to be fully self containing, exposing no shared objects to other threads and instead have read-only or some type of auto updating proxy objects which the other threads can use


I go with "none of the above". Sorry, that's not meant to be asshole, just a silly answer with a serious point. Currently I process about 10k objects a frame with ~190 +-30 locks per update at 60fps. It screams through my simulation (non-trivial) with only about 15% cpu utilization on the 4 cores I give it to play on. In real usage, I only run it at 10 fps and it hardly shows on the cpu utilization, the 60fps, that's my stress test. So, how. There are no secrets involved, just avoid locking of any type like the plague, and unfortunately like was suggested lockless, is only partially helpful as enough contention and it fails also due to cache line syncs constantly blowing out the CPU internal communications channels. (Not to mention false sharing issues and such which can bite even harder than os level synchronizations if you don't watch out for them.)

It's a pain in the ass but I've turned my entire pipeline on it's side. Where I might normally write "update object position, update object orientation, check collision, if( ai controlled ) update AI, etc" in a linear function I chop the function up into multiple pieces. I then use what I call "inverse access rules". If I look at another objects position while trying to figure out how to move, I can't write to my position until a latter step. If I write to my position then no one can look at my position in this step of processing. The same holds true for systems, if I use an SAS system I can read from it but I can't change my position in it until later.

OK, that's the high level which sounds like nonsense, here's a specific example:

void UpdatePosition( float deltaT )
{
SAS& sas = Engine::Service< SAS >(); // Get spacial awareness system.
SAS::GoVector_t nearObjects( sas.Query( mPosition, mRange ) ); // Get near objects.

if( !nearObjects.empty() )
{
math::Vector3f average = nearObjects::begin()->Position();
SAS::GoVector_t::const_iterator iBegin = nearObjects.begin()+1;
SAS::GoVector_t::const_iterator iEnd = nearObjects.end();
for( ; iBegin!=iEnd; ++iBegin )
{
average = (*iBegin)->Position();
}
average /= nearObjects.size();
}

// Calculate movement based on near object average position.
// Kinda brain dead flocking system.
....

// Assign the position. This is illegal in my model.......
this->Position( ... ) !!!!! Bad bad,

// Instead:
mUpdatedPosition = newPos;
}

I can't assign my position in the above because I both looked at positions in other objects but also I accessed the spacial awareness system which is dependent on the position of the objects. So, instead, you just store the position in a temporary and the next function is:

void AssignPosition( float deltaT )
{
this->Position( mUpdatedPosition );
}

What this split does is that instead of a bunch of fine grained calls to mutex locking or atomic operations, you only need 1 mutex in order to do the work. You run all the UpdatePosition functions wait till all threads have completed then run all AssignPosition functions. There are "no" locks except the one splitting the two function calls. You take a small hit involving memory access due to having to reload the positions (can mostly be avoided) but you've just done something with two functions with a single lock which could have taken 1000's of locks otherwise. And, it scales like a bat outta hell over as many cores as you throw at it.

Unfortunately I know all too well how painful breaking up the code like this can be and I can't say it's for everyone. I rather like it though, it's a another aspect of avoiding monolithic styles by allowing you to add/remove "steps" in the processing without having to update hundreds of functions to deal with a new step. (All the dependencies are broken already, adding/removing things is just a matter of changing the list of steps.) Some things are dynamic of course, so I stated +-30 locks because I run a very simple continuous sphere collision system to trigger adding/removing to Bullet collision worlds. It can add prior time items (i.e. points of collision which happened within the deltaT duration) which in really complicated cases can be a number of random steps of processing to be dealt with.

I've actually been interested in trying a very simple combination of some old school and new stuff and building a decent example of this system for folks, I'm always talking about it and putting it in games, should put my typing where my money is eventually. :) Basically take boost and the old school Duff's device to build a multi-core stackless coroutine system which does a completely user space manual threading model with basically the same principles by default. I.e. a mutex would be a lockless add to a fifo and yield (all local stack is blown, makes it only semi-painful) so the worker thread keeps going without locking. The only os locks would then be when threads run out of work, which is to be expected. Maybe if we could get some interested folks I could be convinced to setup a little git repository with the basic starting framework and such. (I won't be able to do a lot right now, current paying work is very painful.)

As to all the networking, it's just another step in this explicit organization. It works just fine in practice.

This topic is closed to new replies.

Advertisement