
Cooperative multitasking (coroutines) vs. true multithreading


Guys, I'm developing an MMO game and I'm using coroutines to implement the game logic instead of true multithreading. Could you please review and critique this approach? I wonder if anyone around is using anything similar. OK, here it goes...

The World thread is where the game simulation logic happens: players interact with each other, NPC AI makes decisions, etc. The World is single-threaded on purpose: putting synchronization logic into the World classes would, IMHO, make them too difficult to develop, debug and maintain. Furthermore, I believe it could even hurt performance due to critical-section overhead.

All World tasks (which are basically boost::fiber coroutines) are executed by a fairly simple scheduler which controls their execution and enforces a fixed time budget: currently 10 ms for all input-handling tasks (see below) and 5 ms for the rest. There is an important rule which must be strictly followed by all World tasks: any expensive operation (e.g. IO) MUST be executed asynchronously. To achieve this, all expensive operations run in separate threads while the currently executing World task waits for completion via the boost::fibers "yield" call, so that other World tasks can run in cooperative-multitasking fashion.

Of course, as I said above, there are other threads besides World. For example, all database-related operations are executed in a separate thread which provides the necessary tools for asynchronous execution of SQL queries (I'm using MySQL as a backend). Using boost::future it's possible to submit an async query from the World thread and wait for its completion without blocking. The same scheme is used for all other expensive operations (logging, A* searches, etc.). Networking is also implemented in a separate thread using the async facilities of boost::asio. Incoming packets are transformed into World input-handling tasks and added to the World scheduler. The World communicates with the networking thread by adding outgoing packets to a special queue.

As for possible scalability issues, I'm currently experimenting with the idea of running separate World location instances in separate processes. This way it should be possible to spread the load among multiple server boxes in the future.

So, guys, what do you think about all of the above?

P.S. boost::fiber and boost::future are experimental libraries which can be found in the boost vault - http://www.boostpro.com/vault/

[Edited by - pachanga on November 7, 2009 8:33:33 AM]

That sounds like a fairly common approach. If you debug the system, it will probably run well. The biggest drawback is that it won't scale across multiple cores, as you yourself note.

The main danger with fibers is when one fiber allocates a resource or schedules a callback, and another fiber either deletes that resource or destroys the first fiber while the callback is still outstanding. In both cases you will get hard crashes that can be difficult to debug, because the crash happens after the fiber stack is switched in and the real "culprit" context is lost.

Also remember that fibers are fairly expensive, in that they each need a stack of some size. If you have code that likes allocating large buffers on the stack ("wchar_t path1[1024], path2[1024], path3[1024];") then you will have to budget for the worst case for each fiber. A separate "state machine" approach where an object has to come all the way back to the caller to "yield" will avoid that problem, but the code will be more cumbersome to write.

The Python world has published a lot on this approach; if you google for "twisted" and "stackless python" you'll probably find some worthwhile reading. However, I know of several C++ Linux and Windows systems that do more or less the same thing (although what, specifically, goes into separate fibers depends on the job).

Why not just use Stackless Python, Erlang, Scala, Clojure, Lua, ...? C++ is a pain for this type of thing.

Fibers are OK if you want to minimize the impact of thread context switches. But as soon as you go multi-core, or scale across multiple machines, they don't really offer much anymore compared to the issues they introduce (exception propagation, stack size).

Quote:
It's single threaded on purpose: putting synchronization logic into World classes would make them too difficult to develop, debug and maintain


Fibers alone do not solve the data consistency problem. Unless you ensure that each fiber only manipulates the data it owns, fibers are just as bad as other forms of concurrency:

Future balance("player/balance");
Future debt("player/debt");

// another fiber may run between these reads and the update
player.update("player/balance", balance->get() - debt->get());

Nothing guarantees that this is safe, even with no multithreading going on. You either need some form of transactional semantics, or locks. STM comes closest to this.

Quote:
Original post by Antheus
Why not just use Stackless Python, Erlang, Scala, Clojure, Lua, .... C++ is a pain for this type of thing.


It's not that painful provided one has good libraries available :) And it's not an option at the moment - lots of code is already written in C++. BTW, the AI is actually scripted in Lua; it's the core World logic which is implemented in C++.

Quote:

Fibers are ok if you want to minimize impact of thread context switches. But as soon as you go multi-core, or scale across multiple machines, it doesn't really offer much anymore, compared to the issues it introduces (exception propagation, stack size).


So another option (in the case of C++) would be to add synchronization logic into every World class? I just wonder if anyone around has done this and whether they were happy with the decision in the end.

Quote:
Original post by hplus0603
Also remember that fibers are fairly expensive, in that they each need a stack of some size. If you have code that likes allocating large buffers on the stack ("wchar_t path1[1024], path2[1024], path3[1024];") then you will have to budget for the worst case for each fiber. A separate "state machine" approach where an object has to come all the way back to the caller to "yield" will avoid that problem, but the code will be more cumbersome to write.


You are absolutely right about possible stack issues. Currently I'm using a 16 KB stack for each coroutine and it seems to be quite enough. I'm using STL containers instead of built-in arrays, and any large data is stored on the heap. It's also possible to customize a task in its constructor and request more memory.


Quote:

The Python world has published a lot on this approach; if you google for "twisted" and "stackless python" you'll probably find some worthwhile reading.


Thanks, I'll have a closer look.

Quote:

However, I know of several C++ Linux and Windows systems that do more or less the same thing (although what, specifically, goes into separate fibers depends on the job).


BTW, in my implementation of tasks it's also possible to specify whether a task should be a coroutine or a simple functor. For example, the task responsible for responding to client pings is a functor, while the task responsible for player login is a coroutine, since it interacts with the database, login server, etc.

Quote:
Original post by hplus0603
That sounds like a fairly common approach.


I know that some folks around here (including you :) ) develop and maintain pretty large MMO games; I'm just curious whether this approach is actually used in production. Or maybe, in the long term, it's better to use full-fledged multithreading and thus put explicit synchronization logic into each World class?

[Edited by - pachanga on November 7, 2009 8:57:57 AM]

We have used co-routines for an application server for MMOs/virtual worlds for a very long time, and it's worked pretty well. We did not use it for simulation, because simulation is generally better expressed as functors/stepping anyway. However, we have started feeling the pain of not being able to effectively make use of multi-core. The best we can do is put multiple server processes on the same multi-core machine, which is less efficient because the working set of the physical machine becomes bigger.

Quote:
Original post by hplus0603
We have used co-routines for an application server for MMOs/virtual worlds for a very long time, and it's worked pretty well. We did not use it for simulation, because simulation is generally better expressed as functors/stepping anyway. However, we have started feeling the pain of not being able to effectively make use of multi-core. The best we can do is put multiple server processes on the same multi-core machine, which is less efficient because the working set of the physical machine becomes bigger.


Thank you very much for sharing. Do you have any ideas what the alternative could be? It would be very interesting to hear. I personally have a gut feeling that explicit locking in domain logic is wrong and that a different model should be used instead - for example, something similar to the Erlang model, where objects don't share any data and interact with each other only via messages.

Something like this: generally, an update will only affect a single object, or perhaps a pair of objects, so per-object locking might work well. Put together a work queue and a worker thread pool; do per-object locking; have all operations declare the objects they need up front, and always lock objects in ascending numerical object ID order to avoid deadlock.

Quote:
Original post by hplus0603
Something like this: Generally, an update will only affect a single object, or perhaps a pair of objects. Doing per-object locking might work well. Put together a queue of work, and a worker thread pool; do object locking; have all operations declare objects they need up front and always lock objects in ascending numerical object ID order to avoid deadlock.


Thanks for the idea. BTW, if anyone is interested, I started a new, more general topic here.

Quote:
Original post by pachanga
So another option(in case of C++) would be to add synchronization logic into every World class? I just wonder if anyone around did this and if he was happy with this decision at the end.

Four options I would prefer over having to do low-level locking:
- explicit message passing for everything, i.e. the Actor/Erlang model;
- making the core simulation almost entirely single-threaded and factoring the slow operations out into message passing (perhaps even to separate machines, and/or over HTTP, etc.);
- a multi-process simulation where entities are owned by one process and updated in a single-threaded fashion whenever two entities don't interact; when two do interact, they are migrated to the same process;
- a multi-process simulation where all multiple-entity interactions are handled by escrow objects that are passed across processes.

Each of these is really just message passing in some form. The question is where you put that dividing line and what your messages consist of.

Quote:
Original post by hplus0603
We have used co-routines for an application server for MMOs/virtual worlds for a very long time, and it's worked pretty well. We did not use it for simulation, because simulation is generally better expressed as functors/stepping anyway. However, we have started feeling the pain of not being able to effectively make use of multi-core. The best we can do is put multiple server processes on the same multi-core machine, which is less efficient because the working set of the physical machine becomes bigger.


Interestingly, the new Go language just released by Google provides the same basic set of functionality as Stackless Python: microthreads (Go: goroutines, Stackless: tasklets) and channels.

However, where Stackless microthreads are thread-bound, because tasklets use the C stack of the thread they were created on (sharing it by being copied on and off above the scheduler), Go supports multi-core by allocating microthread stacks on the heap rather than using the C stack.

When a Go microthread makes a system call, the other microthreads sharing that thread/core are handed off to a scheduler on another thread. To get the same functionality in Stackless, you'd need to serialise the other running tasklets and unserialise them in the scheduler on another thread, which would be very costly, whereas Go's migration costs almost nothing. And even then, every call available in the Python interpreter that would block in the operating system would need to be identified and wrapped - Go already has that covered.

Now, because Stackless shares the C stack, each microthread only uses thread stack memory while it is actually running, and how much it uses is not an issue (though on systems with limited memory, like the Nintendo DS, that's not the case). Go microthreads use a minimum of 4 KB of heap for their stack and grow it as needed - but I guess that's the prerogative of a language with its own custom compiler.

Anyway, just some thoughts that came up while reading this topic.

[Edited by - rmtew on November 11, 2009 5:56:14 PM]

Quote:
Original post by Antheus
... You either need some form of transactional semantics, or locks. STM comes closest to this.


Locks... don't forget your locks. I know you guys are referring to C/C++ (I'm developing a C#, cross-platform client/server), and I don't really know anything about boost, but I'm still able to follow along - at least enough to agree that asynchronous operations are THE way to go. Even with async operations, although the 'threading' is done at the system level, you can't just fire and forget: you still need to watch for those system-level processes clashing over a resource.

