InvalidPointer

Multicore design... harder than this, somehow?


I've been making major strides towards the completion of a commercial-grade 3D engine as of late (many thanks to you guys here for answering my questions!) and I've now come to the dreaded beast of overall engine multicore behavior. I did a lot of research on the subject, considered the multitude of contemporary concurrency problems, and came up with what I think is a bulletproof lockless system over the course of an afternoon. I'm going to discuss it in the context of entity updates/game logic, as that's what I'd consider the most headache-inducing aspect of this field of design.

As with Intel's Smoke and many other 'highly-threaded' architectures, a core scheduler/monitor class prepares n worker threads for execution. In my implementation, this basically entails creation (or hypothetical reuse via an object pool, etc.) of a 'done list' and a 'work queue' for each of the worker threads. Both are linked lists accessed only by their respective threads and as such have no mutexes assigned to them. Work queue construction is arbitrary, but I figured an even division of all entities to process makes the most sense. Work balance may be an issue, but the solution is fairly straightforward and is discussed further below.

While the work queue covers what to *do*, the 'done queue' covers what's been *done*, and it's what I'd consider the strength of the architecture. When a thread finishes processing an entity/whatever, it just writes the result to a temporary buffer; later, a separate 'cleanup' thread iterates over all n threads' 'done queues' and commits them to the 'real' database for later use. By doing this we completely remove the need for locks, since we can guarantee only one thread will ever be accessing either list, so there's no need for exclusion.

There are two minor performance issues I can see with this. First, some redundant reads/writes for the transferral. That's tricky to solve in general, but it's actually a non-issue with the rest of my engine design: all of my entities, etc. use a handle-based system whereby a pointer is replaced with an 'index' into a separate list of 'real' pointers. One could hypothetically just replace the pointers instead of copying over the whole object, saving some cycles but potentially aggravating the second issue: cache misses/locality. Pointer-swapping would just brutally axe-murder locality, AFAIK. That one's trickier, but with some clever allocation and prefetching magic you might be able to put a silver bullet in that monster as well.

Finally, we come to the issue of work balancing and variable update times. I'd love a nice O(1) update function, but I really fail to see that ever happening. You might be able to tinker with the design to allow threads to do a few interlocked ops on one another's 'work queues' and share the wealth, but that might cause issues and be more trouble than it's worth. I do have a thread pool, and releasing a finished worker back to it makes sense, so it can work on something else (preferably unrelated), which further reinforces the idea that LOCKS CAN BURN IN DEVELOPMENT HELL MUAHAHAHAHAHAA.

*ahem* So anyway, given the huge fuss everyone's making on this subject and the reasonably short time in which I came up with a solution, I figure I must've forgotten something important or overlooked an aspect of the design. What is it? It seems too good to be true :S

[Edited by - InvalidPointer on October 21, 2008 8:21:37 PM]

I think the biggest problem is that entity updates are rarely independent of other entities. If entity A needs to know the position of entity B, how do you handle that without either a) creating a read-only copy of entity B for each update, or b) using some sort of locking to ensure that thread 1 does not update entity B's position while thread 2 is trying to read it?

You could possibly "batch" updates so that if entity A "knows" it's going to need information about entity B, it gets queued to the same worker thread, but then you've just added more problems: how do you know entity A will need access to entity B's state? Besides, it is rare that entities can be batched like that and still be separated out efficiently (e.g. usually entity A needs entity B's state, entity B needs entity C's state, and so on).

Also, you mention a "cleanup" thread which iterates over the "done" list. How do you protect the "done" list while the worker thread is writing to it and the "cleanup" thread is trying to read from it at the same time?

In the same vein, how does the game add things to the worker's "work" queues at the same time as the workers are reading from them?

Quote:
So anyway, seeing the huge fuss everyone's making on this subject and the reasonably short time I came up with a solution, I figure I must've forgotten something important or overlooked an aspect of the design. What is it? It seems too good to be true :S
I'm not crystal clear on how your design works, so let me try and clarify here:
* All entities in the database that need to be updated are identified.
* Indexes/ptrs to these entities are then divided among the worker threads 'work lists'.
* The worker threads make a copy of each entity in its 'work list' and call an Update func on the copy.
* If Update needs to read any data from another entity, it will read data from the database, not from one of these "thread private" copies.
* The copies are placed into a 'done list'.
* At the end of the update cycle, the copied entities in the 'done lists' are written back to the database.

^^ Is that correct?

Quote:
Original post by InvalidPointer
As with Intel's Smoke and many other 'highly-threaded' architectures, a core scheduler/monitor class prepares n worker threads for execution. In my implementation, this would basically entail creation (or hypothetical reuse via an object pool, etc.) of a 'done list' and a 'work queue' for each of the worker threads. Both are linked lists only accessed by their respective threads and as such do not have mutexes assigned to them. Work queue construction is arbitrary, but I thought doing an even division of all entities to process makes the most sense. Work balance may be an issue, but the solution is fairly straightforward and will be discussed more later.
My worker pool works in a similar fashion. Each worker thread has its own private (mutex-less) queue of tasks to run. The workers fill up their own private queues from a shared public queue (using lock-free/interlocked techniques). Each worker thread grabs more than one task at a time from the public queue to minimise contention.

Quote:
While the work queue covers what to *do*, the 'done queue' covers what's been *done* and is what I would consider to be the strength of the architecture. When a thread finishes processing an entity/whatever, it just writes the result to a temporary buffer until a separate 'cleanup' thread iterates over all n thread 'done queues' and commits them to the 'real' database for later use. By doing this we completely remove the need for locks as we can guarantee only one thread is ever going to be accessing both lists and as such there's no need for exclusion.
Each thread has its own 'done queue', which is basically a list of delayed writes to a database, right? Doesn't this mean that writing to the database requires a lock (etc), as two threads might be going through their 'done queues' at the same time?
[EDIT] Oh, I see there is another thread that processes the 'done queues'. Does this mean that the overall update kinda looks like this?
[Serial Task]
Create list of entities to update.
Divide list over N threads.
[Parallel Task]
Update entities, output to thread's done queue.
[Serial Task]
Iterate through all done queues, commit data.

(i.e. the clean-up thread can safely use the done queues because the worker threads are guaranteed to not be updating any entities when the clean-up thread is running.)

[Edited by - Hodgman on October 21, 2008 8:47:55 PM]

Quote:
Original post by Codeka
I think the biggest problem is that entity updates are rarely independent of other entities. If entity A needs to know the position of entity B, how do you handle that without either a) creating a read-only copy of entity B for each update, or b) using some sort of locking to ensure that thread 1 does not update entity B's position while thread 2 is trying to read it?

You could possibly "batch" updates so that if entity A "knows" it's going to need information about entity B, it gets queued to the same worker thread, but then you've just added more problems: how do you know entity A will need access to entity B's state? Besides, it is rare that entities can be batched like that and still be separated out efficiently (e.g. usually entity A needs entity B's state, entity B needs entity C's state, and so on).

Also, you mention a "cleanup" thread which iterates over the "done" list. How do you protect the "done" list while the worker thread is writing to it and the "cleanup" thread is trying to read from it at the same time?

In the same vein, how does the game add things to the worker's "work" queues at the same time as the workers are reading from them?


Yeah, I wasn't particularly explicit about that. The point is that there's no need for a lock on the central DB because there are no writes to it while updates are still being done. See www.flounder.com/no_synchronization.htm for an explanation; that's actually the article that got my mind going. Ultimately, I can read from the DB with impunity.

On a similar note, I don't need locks for the per-thread lists -- they're per-thread, so they aren't going to change unless the specific owning thread does something to them, and guarding against *other* threads is the entire point of a lock. Additionally, I aim to *defer* the final updates and the actual writes until I know nothing still needs to read from the database. Again, no need for a lock there.


Quote:
Original post by Hodgman
Quote:
So anyway, seeing the huge fuss everyone's making on this subject and the reasonably short time I came up with a solution, I figure I must've forgotten something important or overlooked an aspect of the design. What is it? It seems too good to be true :S
I'm not crystal clear on how your design works, so let me try and clarify here:
* All entities in the database that need to be updated are identified.
* Indexes/ptrs to these entities are then divided among the worker threads 'work lists'.
* The worker threads make a copy of each entity in its 'work list' and call an Update func on the copy.
* If Update needs to read any data from another entity, it will read data from the database, not from one of these "thread private" copies.
* The copies are placed into a 'done list'.
* At the end of the update cycle, the copied entities in the 'done lists' are written back to the database.

^^ Is that correct?

Quote:
Original post by InvalidPointer
As with Intel's Smoke and many other 'highly-threaded' architectures, a core scheduler/monitor class prepares n worker threads for execution. In my implementation, this would basically entail creation (or hypothetical reuse via an object pool, etc.) of a 'done list' and a 'work queue' for each of the worker threads. Both are linked lists only accessed by their respective threads and as such do not have mutexes assigned to them. Work queue construction is arbitrary, but I thought doing an even division of all entities to process makes the most sense. Work balance may be an issue, but the solution is fairly straightforward and will be discussed more later.
My worker pool works in a similar fashion. Each worker thread has its own private (mutex-less) queue of tasks to run. The workers fill up their own private queues from a shared public queue (using lock-free/interlocked techniques). Each worker thread grabs more than one task at a time from the public queue to minimise contention.

Quote:
While the work queue covers what to *do*, the 'done queue' covers what's been *done* and is what I would consider to be the strength of the architecture. When a thread finishes processing an entity/whatever, it just writes the result to a temporary buffer until a separate 'cleanup' thread iterates over all n thread 'done queues' and commits them to the 'real' database for later use. By doing this we completely remove the need for locks as we can guarantee only one thread is ever going to be accessing both lists and as such there's no need for exclusion.
Each thread has its own 'done queue', which is basically a list of delayed writes to a database, right? Doesn't this mean that writing to the database requires a lock (etc), as two threads might be going through their 'done queues' at the same time?

You pretty much have it. I had originally figured there's no real reason the determination of what to update (a 'dirty flag,' if you will) and the actual update couldn't be merged, but other than a few semantic differences like that, you're right on the money. I also thought that just having the threads each iterate over a certain span of indices in the handle table made the most sense -- that way no copying is involved in building the work lists.

Object creation might mess things up a bit now that I think about it. I'll have to sleep on that one.

EDIT: Yeah, you got it. Although 'serial task' is a bit of a misnomer -- I think of it more as 'a bit that requires one available thread.' The others can do other threadly things in the meantime, such as prepping display lists, sound, rendering calls, whatever.

Quote:
Original post by InvalidPointer
Yeah, you got it. Although 'serial task' is a bit of a misnomer -- I think of it more as 'a bit that requires one available thread.' The others can do other threadly things in the meantime, such as prepping display lists, sound, rendering calls, whatever.
Yeah, I meant "serial with respect to other entity-related tasks". I have a class (FrameManager) in my system that handles the whole "update cycle" -- in this class I keep a FrameState variable/enumeration (mostly for debugging).

When someone calls fm.Update I assert that fm.state == eIdle and then set it to ePreUpdate. Then I get all the entities/threads ready to start updating. Then I set state to eUpdate and set all the worker threads going. After all the workers are finished, I set state to ePostUpdate and do clean-up tasks (such as processing your done lists) and then set state to eIdle again.

Tracking these states has been a very useful development/debugging tool -- if you put lots of assertions in your database/entities then you can pick up any threading/sequencing problems straight away, such as someone trying to write to the database while the FM is in the eUpdate state, or someone creating a new entity during ePostUpdate, etc...

Quote:
Object creation might mess things up a bit now that I think about it. I'll have to sleep on that one.
The problem that I had with object creation was figuring out how to assign ID numbers to new objects in a deterministic manner.

Determinism is a key requirement of my system (if I run the same simulation, I should get the same results, even if different entities are updated in a different order), and object IDs are used for network replication, save-games, etc.
e.g. In a multi-player shared simulation, where the simulation is run on all PCs (e.g. for an RTS game), if different PCs assign different IDs to a new object, then the network will lose sync.

The solution I found was to assign IDs at the end of the frame (i.e. in the ePostUpdate state) and to sort all new objects into a deterministic order.
In my system, objects can only be created during the update stage, so while updating I keep track of which entity is currently being updated. When a new object is created, I pair it with the ID of this (currently updating) entity.
Then in ePostUpdate I sort the new objects by their paired IDs.
I use TLS for this, e.g.

thread_local<Entity*> g_tlsCurrentlyUpdating;

void RunUpdateEntityTask( Entity& plzUpdateMe )
{
    g_tlsCurrentlyUpdating = &plzUpdateMe;
    plzUpdateMe.Update();
    g_tlsCurrentlyUpdating = 0;
}

Entity::Entity( FrameManager& fm )
{
    fm.EntityConstructed( this );
}

void FrameManager::EntityConstructed( Entity* p )
{
    ASSERT( state == eUpdate );
    ASSERT( g_tlsCurrentlyUpdating != NULL );
    // Pair the new object with the ID of the entity that spawned it.
    constructed.push_back( std::make_pair( p, g_tlsCurrentlyUpdating->GetId() ) );
}

void FrameManager::DoPostUpdate()
{
    ASSERT( state == eUpdate );
    state = ePostUpdate;

    // Sort new objects by their creator's ID, then hand out fresh IDs in that order.
    std::stable_sort( constructed.begin(), constructed.end(), sortOnPairDotSecond );

    for( size_t i = 0; i < constructed.size(); ++i )
        constructed[i].first->SetId( GetNewObjectId() );
}

Just a quick note. Have you taken a look at SGI's Performer docs? Its threaded rendering is basically a lock-free threaded pipeline divided into three distinct phases.

App->Cull->Draw

IIRC each phase sits in its own thread, with its own copy of the state, and at any point in time the code in each phase sees only its own frame's data. The cost is up to three frames of latency between the code currently executing and that frame actually being drawn; the advantage is that inter-frame latency becomes roughly the time of the slowest phase. In an ideal world...


I recall from one of the mags (Game Developer?) that triple-buffered graphics rendering is done with one buffer for the current display, the next for the frame being built, and a third to get started on if the second frame gets done early (so you can make up time if the next+1 frame takes an extra amount of time). The third frame can be discarded if some significant shift (such as the player changing course) takes place.

Note that this would have to integrate with the simulation buffering system the OP describes, where the relevant data for each frame is finalized as early as possible. The current set is protected by using massed delta updates at the frame changeover; the third frame could start on static scenery rendering (or things like wave effects) which won't rely on object data yet.

Cyclic interlocks control the task-list postings and access to the different frame data sets. Keeping all CPUs busy can be hard, but some tasks like AI pathfinding can be run across 'turns' and so fill in on CPUs which are otherwise idle (i.e. waiting for the single thread that does the lockless DB update pass).

Quote:
Original post by wodinoneeye

I recall from one of the mags (Game Developer?) that triple-buffered graphics rendering is done with one buffer for the current display, the next for the frame being built, and a third to get started on if the second frame gets done early (so you can make up time if the next+1 frame takes an extra amount of time). The third frame can be discarded if some significant shift (such as the player changing course) takes place.

Note that this would have to integrate with the simulation buffering system the OP describes, where the relevant data for each frame is finalized as early as possible. The current set is protected by using massed delta updates at the frame changeover; the third frame could start on static scenery rendering (or things like wave effects) which won't rely on object data yet.

Cyclic interlocks control the task-list postings and access to the different frame data sets. Keeping all CPUs busy can be hard, but some tasks like AI pathfinding can be run across 'turns' and so fill in on CPUs which are otherwise idle (i.e. waiting for the single thread that does the lockless DB update pass).

That's an interesting thought, and one I'll have to consider in more depth. That may be the issue here -- I've been thinking about this solely in the context of a single frame, and there might be stalls waiting for everything to catch up, which can kill perf. Thanks for the tip.

