Multithreaded Engine Design
Currently I'm learning how to use Ogre, and I wanted to start making a small shooting game.
To be able to reuse my work later, I wanted to create a small multithreaded game engine, so it would scale better:
The basic layout:
http://c.imagehost.org/0388/SampleEngine.jpg
After starting the game it would look like this:
- GameEngine is created.
- Sub-threads containing all of the subsystems shown on the image (probably more) are created by the GameEngine class.
- Each Subsystem contains at least one World object and a process() function, as well as a pointer to the GameEngine class.
- GameEngine will call process() for each Subsystem, which will keep processing as long as GameEngine doesn't call stop().
- Calling pause() pauses processing till process() is called again.
After one full turn of processing (one frame rendered, one physics timestep, etc) the Subsystem calls GameEngine->SyncWorld(world) to get the changes into the main game world and to update its own subworld.
The subworld from each Subsystem doesn't contain a copy of the whole game world, but only the relevant things.
To ensure this, the pointers to irrelevant stuff will be set to NULL. If the GameEngine encounters a NULL pointer while syncing, it will just skip those parts. Furthermore, everything contains a "changed" flag to speed up syncing. This will only work in the Subsystem's world copy; when syncing from the main world back to the subsystem I would have to copy everything (save for the NULL-pointer sections). (I could add a revision number so only things which really changed would be copied, but I guess that would be too much overhead.)
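The sync scheme described above (NULL pointers for irrelevant sections, a "changed" flag to limit copying) could be sketched roughly like this. All names here (Section, World, syncWorld) are illustrative, not from the original post, and the real engine would obviously sync much richer state:

```cpp
#include <memory>
#include <vector>

// Hypothetical world "section": one chunk of game state a subsystem may care about.
struct Section {
    int data = 0;
    bool changed = false;  // the "changed" flag to speed up syncing
};

// A world copy: sections the subsystem doesn't care about stay null.
struct World {
    std::vector<std::shared_ptr<Section>> sections;
};

struct GameEngine {
    World mainWorld;

    // Merge a subsystem's copy into the main world and refresh the copy,
    // skipping null (irrelevant) sections entirely.
    void syncWorld(World& sub) {
        for (size_t i = 0; i < sub.sections.size(); ++i) {
            auto& s = sub.sections[i];
            if (!s) continue;              // irrelevant to this subsystem: skip
            if (s->changed) {              // push the subsystem's change out
                mainWorld.sections[i]->data = s->data;
                s->changed = false;
            } else {                       // pull the latest main-world state back
                s->data = mainWorld.sections[i]->data;
            }
        }
    }
};
```

Note that the unchanged branch still copies everything relevant back, which matches the "I would have to copy everything" concern in the post.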
Is this a reasonable design?
Are there any flaws which I haven't seen?
[Edited by - Mononofu on July 23, 2008 1:23:59 PM]
Quote:Original post by Mononofu
Is this a reasonable design?
Are there any flaws which I haven't seen?
No. Well, it is possible to get the job done, so yes.
Yes
How do you handle updates when both the physics engine and the AI modify the world? Your single-core performance is going to be terrible because you are adding a lot of overhead. Your dual-core performance will be better, but still not good. It doesn't scale up or down in general.
There's an interesting video of two Intel software guys talking about multithreading in games.
Multi-threading for Games Developers
Quote:Original post by stonemetal
How do you handle updates when both the physics engine and the AI modify it?
My fault, I didn't write how I plan to do this:
The AI just writes actions the player/an NPC wants to perform; it DOESN'T actually modify the world. It's the physics subsystem which reads these in, modifies the world accordingly and then simulates it.
(This would involve a simple list where the AI system appends all actions and the physics system removes all actions every processing turn.)
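The "simple list" handoff between AI and physics could be sketched as a mutex-guarded queue. The class and member names here are illustrative, and actions are plain strings just for the example:

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// AI/input threads append actions; the physics thread drains the whole
// list once per processing turn.
class ActionQueue {
public:
    void push(std::string action) {
        std::lock_guard<std::mutex> lock(mutex_);
        actions_.push_back(std::move(action));
    }

    // Physics calls this once per turn: takes everything, leaves the list empty.
    std::vector<std::string> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<std::string> out;
        out.swap(actions_);  // O(1) swap keeps the lock held only briefly
        return out;
    }

private:
    std::mutex mutex_;
    std::vector<std::string> actions_;
};
```

The swap-and-return keeps the critical section tiny, so AI appends and the physics drain rarely block each other for long.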
Quote:Original post by stonemetal
Your single core performance is going to be terrible because you are adding a lot of overhead. Your dual core performance will be better but still not good. It doesn't scale up or down in general.
Well, I believe it would scale up pretty well, but you're right regarding single-core performance. I guess I have to do something about this, but hey, this is the reason why I posted my design here: I've never done something like this before, so I have to learn it now ^^
@Silicon Dude: Thanks for the link, I'll watch it.
Edit: I just had an idea:
I could change the syncing behaviour so everything is only synced with the subsystems which really need that data.
For example, the AI subsystem could just sync with the physics subsystem, and the Input subsystem only has to sync with the physics and AI subsystems. I'll update the image to better show what I mean.
Edit2: I rethought this idea and found that it's not really necessary, since it wouldn't reduce blocking.
--> Blocking only occurs if one syncing thread wants to read something another thread is already writing.
This can occur if:
- the AI writes to actions while Input wants to write there, or the other way round
- AI or Input write to actions while physics wants to read it
- physics writes to geometry while audio or graphics want to read it
( I haven't thought about network yet )
[Edited by - Mononofu on July 23, 2008 1:57:21 PM]
Quote:Original post by Mononofu
Well, I believe it would scale up pretty well, but you're right regarding single core performance. I guess I have to do something about this, but hey, this is the reason why I posted my design example here: I never did something like this before, so I have to learn it now ^^
@Silicon Dude: Thanks for the link, I'll watch it.
It doesn't scale past six threads. You have six subsystems and each gets a thread, so you can't scale beyond six cores. According to Intel, their next-gen processors are going to have 2-8 cores running 4-16 threads. On the low end you are going to be scheduling more threads than they have processors to run, and you lose some performance to thread overhead (hence the not scaling down part). If they have more than six processors, then you don't see any benefit from the extra processors (hence the not scaling up part). Second, your program will only run as fast as the slowest component. Say graphics takes 1/10th of a second and, for the sake of argument, everything else takes 1/20th of a second; in such a situation, 50% of the time you are utilizing one core while everything else in the system sits idle. The better design is to multithread within each subsystem rather than around each subsystem.
Sorry for off-topic, but what program did you use to generate that SampleEngine.jpg? Not Paint, was it?
Just to not be completely useless, let me link you to an interesting and informative past thread here.
Quote:Original post by stonemetal
It doesn't scale past six threads. You have six subsystems and each gets a thread, so you can't scale beyond six cores. According to Intel, their next-gen processors are going to have 2-8 cores running 4-16 threads. On the low end you are going to be scheduling more threads than they have processors to run, and you lose some performance to thread overhead (hence the not scaling down part). If they have more than six processors, then you don't see any benefit from the extra processors (hence the not scaling up part).
This is just an EXAMPLE, not the final layout of the program. It won't be a problem to add more threads: a worker pool for physics, a worker pool for AI, a worker pool for pathfinding, etc.
But you are right regarding the overhead; this could be a problem, especially with single-core CPUs, maybe with dual-core too. Do you have a better idea / a solution for this problem?
Quote:Original post by stonemetal
Second, your program will only run as fast as the slowest component. Say graphics takes 1/10th of a second and, for the sake of argument, everything else takes 1/20th of a second; in such a situation, 50% of the time you are utilizing one core while everything else in the system sits idle. The better design is to multithread within each subsystem rather than around each subsystem.
I think you misunderstood me. Of course the other threads won't wait for the graphics system; why should they? That's the whole point behind using local copies of the global "World" object: every subsystem processes one step, commits the result, processes one step, etc., without caring about the other subsystems.
For example, the physics subsystem could run at 100 TPS (timesteps per second xD) and the graphics at 34 FPS and they wouldn't interfere, except when one of them writes data to the world object while the other reads it (and of course only if the writing and reading hit the exact same section). I think blocking should be minimal.
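One common way to let a 100 TPS producer and a 34 FPS consumer run without waiting on each other's full processing turn is a double-buffered snapshot: physics writes into a back buffer and flips, and graphics always reads the most recently published front buffer. This is a sketch of that idea, not anything from the original post, with a deliberately tiny WorldState:

```cpp
#include <array>
#include <atomic>
#include <mutex>

struct WorldState { double time = 0.0; };  // stand-in for real world data

class SnapshotBuffer {
public:
    void publish(const WorldState& s) {        // physics, e.g. 100 times/sec
        int back = 1 - front_.load();
        {
            std::lock_guard<std::mutex> lock(locks_[back]);
            buffers_[back] = s;
        }
        front_.store(back);                    // flip: readers now see the new state
    }

    WorldState latest() {                      // graphics, e.g. 34 times/sec
        int f = front_.load();
        std::lock_guard<std::mutex> lock(locks_[f]);
        return buffers_[f];
    }

private:
    std::array<WorldState, 2> buffers_;
    std::array<std::mutex, 2> locks_;
    std::atomic<int> front_{0};
};
```

The per-buffer mutexes only guard against a flip landing mid-read; with one writer and short copies, contention stays close to the "minimal blocking" the post hopes for.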
@shurcool: I'm using Dia's UML objects for this; it works pretty well and you can export the final result to almost anything. I think you can even generate code from it.
And thanks for the link!
PS: just for archive: the current layout of my engine
PPS: WTF can't I use BB Codes here? Do I have to use HTML?
Edit: I could just add an option to disable multithreading if the user wants, couldn't I? That would solve the problem of overhead on a single-core CPU.
[Edited by - Mononofu on July 23, 2008 2:51:30 PM]
This is the basic model of distributing logic across different systems. Currently it's regarded as non-scalable, since it's better suited for achieving redundancy (if one system goes down, the others aren't affected) than scalability.
Under this model, the slowest system will determine how fast things go. If physics can't keep up, all other threads will idle. If AI can't keep up, physics will keep on running, but nothing will be changing. And so on.
There are currently only a handful of large-scale models which enable scalable concurrency. None is perfect; each is suitable for specific tasks.
Dr. Dobb's article, one of Gamasutra's articles (there are many more), as well as some general advice.
Locks and thread-per-system are the basic approach. But true scalability will invariably focus on reduction/elimination of shared state, and on replication rather than sharing of resources.
All of this is further complicated by the fact that for real-time systems, everything needs to work on the slowest possible hardware. So anything that scales beyond that will be mostly optional fluff (more particles, larger FoV), but nothing that affects the logic (a different physics time-step, or even a dynamic time-step).
This is likely somewhat easier on consoles, where hardware specs are very well defined and there's no need to worry about arbitrary scalability, since the upper and lower bounds are either identical or very close to each other. Unlike PCs, where you have 1-16 cores and 0.5-16 GB RAM, not to mention the differences between all other components.
Quote:For example, the physics subsystem could run at 100 TPS (timesteps per second xD) and the graphics at 34 FPS and they wouldn't interfere
One problem comes from the fact that these systems, even if not hard-linked, depend on each other. Unless physics updates the frame, there's no point for the renderer to do anything: nothing has changed.
If anything, when one system is too slow, the others should give up their time and take over some of its tasks.
For example, having physics update at a micro time-step doesn't really achieve anything, since you are bound by user input. You cannot simulate a car 50 seconds in advance if the user only controls it 5 times a second.
Hm.
Sounds logical.
So it would be better if I used a normal game loop and used multithreaded libs for CPU-heavy tasks like physics, as well as (libs with) background threads for audio, network and input?
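The "normal game loop" alternative is usually written with a fixed physics timestep and an accumulator, so physics can run at 100 TPS regardless of the frame rate. A sketch, with frame timing simulated by a fixed value so the example is deterministic (a real loop would query a clock):

```cpp
// Run `frames` frames, each taking `frameTime` seconds, stepping physics
// at a fixed timestep. Returns how many physics steps were executed.
int runLoop(int frames, double frameTime) {
    const double dt = 0.01;        // fixed physics timestep: 100 TPS
    double accumulator = 0.0;
    int physicsSteps = 0;

    for (int i = 0; i < frames; ++i) {
        accumulator += frameTime;  // time the last frame took
        while (accumulator >= dt) {
            // physics.step(dt) would go here
            accumulator -= dt;
            ++physicsSteps;
        }
        // render() would go here, once per frame
    }
    return physicsSteps;
}
```

Heavy work inside a step (physics, pathfinding) can then be farmed out to worker threads, which is the "multithread within each subsystem" direction suggested earlier in the thread.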
Edit:
I finally came to this conclusion:
My suggestion:
+ threads will run constantly, no matter what other threads are doing - constant workload
- probably a lot of overhead due to copying all that world stuff - I don't know how much time that will take
Your suggestion:
+ not much overhead
- if one part is purely serial, it will force the other cores to do nothing
I don't really have any experience in practice with these things, so I would like to hear some opinions on this topic before I finally decide. Meanwhile, I'll read everything you linked.
Edit: Finally read all posted articles.
My conclusion: Since I'm not able to create a fully data-centric engine (I can't possibly create everything [physics, ...] by myself), I will go for the asynchronous approach, because it seems to be faster than the concurrent one. Furthermore, it scales better.
[Edited by - Mononofu on July 23, 2008 5:05:56 PM]
If you are going to thread-pool each individual subsystem, I don't think you gain anything from also threading the subsystems at the coarser grain (unless you have more cores than any single subsystem can employ), but you get saddled with synchronization issues that will kill lesser systems. Thread pools are nice because you can have a small pool on a single-core system and not have much overhead.
Have you looked at using something like OpenMP or Intel's TBB library?
This topic is closed to new replies.