Quote: To take advantage of heavily multi-cored CPUs (Sun Niagara2 -- 64 cores!) I believe we need a totally different programming paradigm (message passing / state machines), and we don't really have the languages and tools (like debuggers!) to make good use of that paradigm yet. Also, I think you'll want to use L2 cache as a message queue, so you can avoid going through main RAM for the messaging; currently there's not quite enough control to make sure that that happens. Lockable/assignable L2 cache is probably in our future.
I spent some time verifying and implementing the proposed system, and while I do not have such huge hardware, that does not affect the solution itself.
The problem is formalized with a very simple primitive:
class ObjectPtr {
    void send( Message msg ) {
        if ( is_local ) {
            m_object->messagequeue.push( msg );
            MARK_PENDING( m_object ); // notify demuxer that m_object needs attention
        } else {
            socket.send( msg );
        }
    }
    Object *m_object;
};

class Object {
    void process_pending() {
        int n = 10; // process at most n messages, then yield to others
        while ( !messagequeue.empty() && n-- ) {
            handlemessage( messagequeue.pop() );
        }
    }
    State object_local_state;
};
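To make the primitive concrete, here is a minimal runnable sketch of the local-send path, assuming a plain std::queue-backed mailbox and a string payload as a stand-in for Message; the remote (socket) branch and the demuxer notification are elided, since they only matter once the system is distributed.

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical minimal Message: just a payload string.
struct Message { std::string payload; };

struct Object {
    std::queue<Message> messagequeue;
    std::vector<std::string> handled; // stand-in for object_local_state

    void handlemessage(const Message& m) { handled.push_back(m.payload); }

    // Process at most n messages, then yield to other objects.
    int process_pending(int n = 10) {
        int processed = 0;
        while (!messagequeue.empty() && n--) {
            handlemessage(messagequeue.front());
            messagequeue.pop();
            ++processed;
        }
        return processed;
    }
};

struct ObjectPtr {
    Object* m_object = nullptr; // local case only in this sketch
    void send(Message msg) {
        // is_local == true path; the remote path would serialize to a socket.
        m_object->messagequeue.push(std::move(msg));
        // A real implementation would MARK_PENDING(m_object) here.
    }
};
```

Note how the n-message budget in process_pending is what gives the scheduler its fairness: a flooded object cannot starve its siblings.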
The processing model results in this:
---Socket-->[Demuxer]-------->{Objects}
                ^                 | //Send
                +-----------------+------>[Serializer]--->{Net}
The serializer runs in its own thread, and the demuxer hands out worker threads to objects with non-empty message queues (most likely only one or two workers per processor; the single-worker case is special in that it can run in a completely single-threaded implementation with no change to the code).
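A sketch of that scheduling loop, in the single-worker special case so it stays runnable without threads: the demuxer keeps a pending queue of objects, gives each a bounded slice of work, and re-queues any object whose mailbox still has messages. All names here (Demuxer, mark_pending, run_until_idle) are illustrative, not from the original.

```cpp
#include <deque>
#include <queue>

struct Msg { int value; };

struct Obj {
    std::queue<Msg> mailbox;
    int sum = 0;
    // Process at most `budget` messages; return true if more remain.
    bool process_pending(int budget) {
        while (!mailbox.empty() && budget--) {
            sum += mailbox.front().value;
            mailbox.pop();
        }
        return !mailbox.empty();
    }
};

// Single-worker demuxer: the special case noted above, with no change
// needed to the object code to run it single-threaded.
struct Demuxer {
    std::deque<Obj*> pending;
    void mark_pending(Obj* o) { pending.push_back(o); }
    // One scheduling run: give each pending object a slice of work;
    // objects with leftover messages go to the back of the line (fairness).
    void run_until_idle(int budget_per_slice = 10) {
        while (!pending.empty()) {
            Obj* o = pending.front();
            pending.pop_front();
            if (o->process_pending(budget_per_slice))
                pending.push_back(o);
        }
    }
};
```

The multi-worker version would pop from the same pending queue under a lock (or a lock-free deque), with the invariant that an object is held by at most one worker at a time.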
While this works fine in theory, a problem arises with message storage. If used for true IPC, messages must be persisted (either on the heap or via the network), and with lock-free queues, allocation and de-allocation become a challenge in themselves. Using a lock-free allocator solves this problem, but introduces a livelock if the allocator is exhausted: e.g. an object that produces messages at a 1:2 ratio will spin on the allocator waiting to receive an allocation, while the demuxer waits for its processing to complete. There might be a solution for this at some higher level.
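One higher-level escape hatch is to make exhaustion observable instead of spinning on it: a fixed-pool allocator whose allocate() reports failure lets the producing object yield back to the demuxer and drain its own mailbox (freeing messages) before retrying. The sketch below is single-threaded for clarity, so it is not itself lock-free; a real version would use an atomic free-list.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-pool message allocator. Instead of blocking when the
// pool is exhausted (the livelock scenario above), allocate() returns
// nullptr so the caller can yield and drain its own queue first.
template <std::size_t N>
class MessagePool {
    std::array<std::array<unsigned char, 64>, N> slots_; // 64-byte messages
    std::array<void*, N> free_;  // simple free-list stack
    std::size_t top_ = 0;
public:
    MessagePool() {
        for (std::size_t i = 0; i < N; ++i) free_[top_++] = slots_[i].data();
    }
    void* allocate() { return top_ ? free_[--top_] : nullptr; }
    void release(void* p) { free_[top_++] = p; }
    std::size_t available() const { return top_; }
};
```

The key design point is that nullptr is a legitimate answer, turning the allocator from a blocking dependency into a backpressure signal.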
One side-effect of the design I hadn't noticed at first is that there is no longer any need for object shadowing. Since all calculations are performed on the object that owns the queue, or the data is passed through a message to another object, the execution context is always local to the thread to which the object has been assigned.
In many ways I feel this is close to the state machine model mentioned in the quote. In this example, each state machine has its own thread, and messages form both the input of one state machine and the output of another.
The issue I suspect will be most problematic is acknowledgment, or feedback: there are no guarantees as to when a certain action will be acknowledged (one node may be very busy and take seconds to respond when the expected time is milliseconds). Then again, since this is about real-time gameplay, if such a disconnect occurs, the solution would likely lie elsewhere, such as redistributing the objects across the nodes.
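One way to surface that condition, rather than solve it, is a per-destination acknowledgment tracker: each outgoing message records a deadline, and a growing overdue count is the signal that the target node is overloaded and its objects may need redistributing. Everything here (AckTracker and its methods) is a hypothetical sketch, not part of the original design.

```cpp
#include <chrono>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical acknowledgment tracker: each outgoing message records a
// deadline; expired entries indicate the remote node is responding far
// slower than expected.
class AckTracker {
    std::unordered_map<int, Clock::time_point> deadlines_; // msg id -> deadline
public:
    void expect(int msg_id, Clock::duration timeout) {
        deadlines_[msg_id] = Clock::now() + timeout;
    }
    void acknowledged(int msg_id) { deadlines_.erase(msg_id); }
    // Count messages whose ack is overdue at time `now`.
    int overdue(Clock::time_point now) const {
        int n = 0;
        for (const auto& [id, dl] : deadlines_) if (now > dl) ++n;
        return n;
    }
};
```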
This design also reduces inter-node traffic (no shadowed copies), albeit at the expense of larger inter-object messages. So far, I've found that the most complex action I need to express takes 4 messages (per user command), with most actions being executable in direct response to the command with no additional messages.
I won't know about real-world behaviour until the implementation is in a workable form and split across a few machines.
Another consequence of this model is that it becomes trivial to use a DHT as storage. Real-world analysis will need to show the actual need for traffic balancing and for explicit grouping of objects on a single machine; I suspect that with the moderately complex logic of a typical MMO, this overhead will be less than originally expected.
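A minimal sketch of how explicit grouping could combine with DHT-style placement: hash a shared group key (a hypothetical choice, e.g. the zone an object lives in) rather than the object's own id, so grouped objects deterministically land on the same node.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical placement rule for an N-node DHT: objects that share a
// group key (e.g. a zone name) are placed on the same node, giving
// explicit object grouping for free.
std::uint64_t node_for(const std::string& group_key, std::uint64_t node_count) {
    return std::hash<std::string>{}(group_key) % node_count;
}
```

A production system would use consistent hashing instead of modulo, so that adding or removing a node only relocates a small fraction of the objects.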