Parallel Game Engine Architecture (Intel Smoke)

4 comments, last by steven katic 15 years, 1 month ago
Have you guys read the Intel Smoke article? It should be quite old news by now, and I suppose most of you engine developers saw it a long time ago (first on the Intel site, then at Dr. Dobb's, and wherever else it has been posted). Now they have just republished the article at Gamasutra, where it caught my attention once again.

Gamasutra: Sponsored Feature: Designing the Framework of a Parallel Game Engine or the old (Nov 2008) link to the same article at Intel website. Source and exe download is available here: And if you're lazy, there's just a video here:
YouTube: Intel Nehalem Smoke Demo

I tried to search for a thread where the design would have been discussed before, but couldn't find one.

My first question is: what is your take on the article? Has anyone followed the design all the way to their own implementation? Has anyone read the article in detail?

I've played with the demo for a while, and have tweaked the code a bit to try to get some estimates of how it might perform in my application, but it's difficult. My issues with the demo were the following.

Most of the threads run at <5% most of the time: Geometry, AI, Audio, Scripting, Input, Animation, Procedural Trees, Volumetric Smoke. Two threads, Graphics and Physics, run at about 10%. The big performance hog in the whole application is Procedural Fire, which takes 30%-50% of the time. Now, this is a really bad demo in this respect, as procedural fire simulation is not something you would need a "generic" parallel architecture for: you could easily contain the fire simulation to run on its own non-shared data without having to complicate the grand-scale application design. And it is questionable how much sense it makes to parallelize Input updates, for example.

The performance statistics seem smelly. I had a case where the procedural fire was reported to take 118% of the time. What does that mean? Also, if you sum all those percentages up, most of the time the sum is less than 50%. Does that mean the other 50% is spent in sequential synchronization points?

The resulting performance compared to the amount of visual complexity (disregarding the fire) is underwhelming. On my Intel quadcore system, all the 8 threads go at about 90%, but I get less than 30 fps most of the time. Looking at the number of objects in this scene, I would be expecting a lot more.

To the article then. I'm surprised at how low the quality of the writing seems to be, coming from Intel. I'm not a native English speaker, so that might affect my ability to consume the text, but to me it seems to be filled with difficult ways of putting things. For example,
Quote:The Smoke article, paragraph 3. "The interfaces are the means of communication between the engine and the systems. Systems implement the interface so that the engine can get access to a system’s functionality, and the engine implements the interface so that the systems can access the managers."
Quote:The Smoke article, paragraph 4.2. "The managers, even though they are singletons, are only directly available to the framework which means that the different systems do not have access to them."
The first paragraph says the systems can access the managers, but the second contradicts it by saying the managers are not accessible to the systems. After looking at the code, it is clear that the individual systems do gain access to the managers, but the article is horribly inaccurate in ways like this.

Now, the point of this post is obviously not to focus on bashing the demo. What I'd like to know is what you think as game developers about their overall design? Did you get something out of it? Did you/Are you going to model your next architecture according to something similar, or even be bold to take it directly as the base design? Do you think it's flawed/useless/using a wrong approach? Do you think it's overcomplicated? I'm being really cautious about giving too much time to entertain this kind of design in our next project, especially since that demo doesn't really make me go "wow, cool!" in any way (it's slightly the opposite). But I don't want to turn down a new idea, hence trying to raise some discussion.

Then the obvious second question: how do you utilize multiple cores in your engine? No need to write a "minitutorial"; I'm quite familiar with several techniques, and so far I've usually gone with one of two approaches. In an old project I used simple OpenMP-style "parallel for" data parallelization inside a sequential game loop, and in more recent projects I've usually used "hardcoded" threading, where I manually craft the thread systems that are needed (usually just main loop, AI, renderer and data loading) and explicitly define their synchronization boundaries.

The first approach (OpenMP) is just devilishly easy: small one-liners added to your foreach(particle), foreach(skinnedobject), foreach(physicsupdatable), etc.
Also, complex subroutines like PVS queries from an octree (and updates) have usually deserved their own multithreaded approach, but as the whole app is sequential, it's very easy to reason about.

The second approach is not that difficult to implement either. The possible "worry" about it is that you explicitly write the threading system (thread code, identifying what data is shared and needs to be locked, how to signal/message between threads, possible timing/waiting semantics, how to do updates back and forth and so on) for each threaded task separately, which of course requires effort. The good side is that all the threads you create and their cooperation are manually crafted to yield the best possible performance. There will be no unnecessary data in shared memory and so forth.

Now, I'm more interested in what you think about these kinds of more "generic" parallel architectures, where the whole system revolves around frame tasks being solved by different threaded subsystems, the synchronization mechanisms of which are specified in an abstract way, like in the Intel Smoke demo. Do you think there is a big point to those? Are you doing something like this? What do you think about flexibility or performance in this kind of architecture? If anyone has a summary of profiling statistics from their projects, those would be interesting to read as well.

Last note, I'd like to keep any discussion away from Larrabee/CUDA/GPGPU/alternatives.

Last last note: does anyone have Game Programming Gems 6, which has the article "Managing High-Level Script Execution Within Multithread Environments", and could you do a quick one- or two-line review of it?
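For readers who have not used the first approach, here is a minimal sketch of a "parallel for" inside an otherwise sequential game loop. It uses Python's standard `concurrent.futures` thread pool as a stand-in for OpenMP, and the names `integrate`, `update_particles` and `run_frames` are made up for illustration. Pure-Python work will not actually speed up this way in CPython because of the interpreter lock, so treat it as a picture of the control flow only.

```python
from concurrent.futures import ThreadPoolExecutor


def integrate(particle, dt):
    """Advance one particle; it touches only its own data, so no locking is needed."""
    position, velocity = particle
    return (position + velocity * dt, velocity)


def update_particles(pool, particles, dt):
    # The "parallel for": fan the loop body out over the pool, then join.
    # map() preserves input order, and list() waits for every item,
    # so the caller still sees an ordinary sequential step.
    return list(pool.map(lambda p: integrate(p, dt), particles))


def run_frames(particles, frames, dt, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(frames):  # the game loop itself stays sequential
            particles = update_particles(pool, particles, dt)
            # other sequential stages (AI, rendering, ...) would go here
    return particles
```

The point of the design is that all parallelism is confined to the inside of one loop, so the rest of the frame can be reasoned about as single-threaded code.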
Finally for reference, here are some related game multithreading articles around the web, hoping they might ease the discussion:

A short survey of different techniques (not that descriptive, read the references section):
Gamasutra: Ville Mönkkönen: Multithreaded Game Engine Architectures

Threading for game logic updates. Very interesting, but I'm more interested in threading at the whole application level:
AiGameDev.com: Parallel Game Logic with Independent Entity Updates
AiGameDev.com: Alex J. Champandard: Hierarchical Logic and Multi-threaded Game AI

Very good source for general high-performance threading programming:
Dr Dobb's: Go Parallel Blog

Dr Dobb's applied to games:
Gamasutra: Gabb, Lake: Threading 3D Game Engine Basics
GarageGames: Eric Preisz: Multithreading in games- the future, the scam!
Gamasutra: Tommy Refenes: Sponsored Feature: Multi-Threading Goo!: A Programmer’s Diary
Quote:Original post by clb

The performance statistics seem smelly.


The way performance metrics are gathered is inaccurate. On one system, the counters don't work at all. IIRC, it uses Windows performance counters, which in quite a few cases don't work at all, aren't installed, or report only highly inaccurate information.

Quote:The resulting performance compared to the amount of visual complexity (disregarding the fire) is underwhelming. On my Intel quadcore system, all the 8 threads go at about 90%, but I get less than 30 fps most of the time. Looking at the number of objects in this scene, I would be expecting a lot more.


It's not about FPS on 1, 2, 4 or 8 threads. It's about a single architecture scaling almost linearly from 1 to n threads transparently. That is the difficult part. And such a framework defines a way to split work both horizontally and vertically (as opposed to the usual focus on one way only).

Naive scalability as the number of cores increases is typically a log(n)- or sqrt(n)-like function. Simply splitting sub-systems across cores manually will result in 90% utilization on 2 cores and 75% on 4 cores, which is fairly easy to show. Performance of naive partitioning can even fall under 50% on 8-core systems (worse than 4-core) due to synchronization overhead.

Quote:What I'd like to know is what you think as game developers about their overall design? Did you get something out of it? Did you/Are you going to model your next architecture according to something similar, or even be bold to take it directly as the base design? Do you think it's flawed/useless/using a wrong approach? Do you think it's overcomplicated?


It's one way to approach scalability transparently. Due to too many gotchas and platform differences, there is no real silver bullet.

Quote:I'm being really cautious about giving too much time to entertain this kind of design in our next project, especially since that demo doesn't really make me go "wow, cool!" in any way (it's slightly the opposite).


Counter-point: can you design a system which will scale linearly and transparently from 1 to n cores? Optimizing for n==X is "easy". But having an application scale to an arbitrary number of cores is something else entirely.

The trick here is that n can be 1 (worst-case scenario, but it must still be supported), any power of 2 (2, 4, 8), or an arbitrary value: 17, or 55, or 6. Any kind of manual partitioning will fail at some large enough n.

In practice, it would be perfectly possible to manually tune thread use for each individual n < 64 case; it would just be too time-consuming. Imagine running every benchmark 64 times, and studying thread contention where one change may improve half of the cases but worsen the other half.

Quote:Do you think there is a big point to those? Are you doing something like this? What do you think about flexibility or performance in this kind of architecture?
You've mentioned both OpenMP and hard-coded systems. The presented architecture merges both, and adds a few others into the mix.

There is no silver bullet. The Smoke demo merely shows how the best of each world can be merged. The particle system could utilize OpenMP, physics could use PhysX, and logic could use Lua, while the framework as a whole makes sure all these tasks start appropriately.

Quote:Last note, I'd like to keep any discussion away from Larrabee/CUDA/GPGPU/alternatives.
No need. Smoke explicitly supports, and even suggests, embarrassingly parallel designs, which are encapsulated inside a single system. I believe the particle system is designed as a single sequential system, but can be vectorized (or was that part commented out?).

Scalable and concurrent design is not about threads or cores, but about how a sequential process and the data it uses can be partitioned in such a way that part or all of it can be performed concurrently. While vectorization may be one solution, it doesn't solve the problem of interdependent or strictly sequential tasks and dependencies.

The demo however still doesn't solve one problem, namely the dynamic task ordering. It's documented that system dependencies are resolved (via topological sort IIRC?) only once during start-up.

Being able to dynamically determine dependencies for each frame, based on the tasks that need to be performed, could bring a more scalable design, but the overhead is almost guaranteed to negate any benefit of doing so.
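To make the start-up ordering concrete, here is a rough sketch of resolving system dependencies once with a topological sort. It is not taken from the Smoke source: the system names and the `DEPENDS_ON` table are hypothetical, and it relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: each system maps to the systems that must
# finish before it may run in a frame. The names are illustrative only.
DEPENDS_ON = {
    "input": set(),
    "ai": {"input"},
    "physics": {"ai"},
    "animation": {"ai"},
    "graphics": {"physics", "animation"},
}


def build_stages(depends_on):
    """Group systems into stages; systems inside one stage do not depend
    on each other, so they could be issued to worker threads together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


STAGES = build_stages(DEPENDS_ON)  # computed once, at start-up
```

With this table, physics and animation end up in the same stage. Rebuilding the stages every frame from the tasks actually pending would be the dynamic variant discussed above, at the cost of running the sort each frame.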

Quote:On my Intel quadcore system, all the 8 threads go at about 90%, but I get less than 30 fps most of the time. Looking at the number of objects in this scene, I would be expecting a lot more.


I don't really recall if it was Smoke or one of the related presentations, but I remember seeing a single thread being a congestion point. After all systems have completed, the main thread performs some work while all other cores sit idle.

I don't remember why exactly that is required, but it's possible to avoid it to a degree, which results in closer to 100% utilization.

The demo (and architecture) is still limited by the slowest sequential part.
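The effect of a sequential part can be put into numbers with Amdahl's law, which is a standard model and not something the Smoke article is known to use. The sketch below assumes a made-up serial fraction of 10% of the frame; the function names are illustrative.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when `serial_fraction` of the frame
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def utilization(serial_fraction, cores):
    """Speedup divided by core count: the share of the hardware
    that is doing useful work."""
    return amdahl_speedup(serial_fraction, cores) / cores


# Print cores, speedup and utilization for an assumed 10% serial part.
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.1, n), 2), round(utilization(0.1, n), 2))
```

Under that assumption the model gives roughly 91% utilization on 2 cores, 77% on 4 and 59% on 8, which is in the same ballpark as the 90% and 75% figures mentioned earlier in this reply. Real engines also pay synchronization overhead that this formula ignores.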

But at the same time, I don't recall the particle system being the bottleneck. In my case it was physics, and turning it off increased the FPS by 4 times or so. Fire and water however were never an issue.
Quote:Original post by Antheus
It's not about FPS on 1, 2, 4 or 8 threads. It's about a single architecture scaling almost linearly from 1 to n threads transparently. That is the difficult part. And such a framework defines a way to split work both horizontally and vertically (as opposed to the usual focus on one way only).

Quote:
Counter-point: can you design a system which will scale linearly and transparently from 1 to n cores? Optimizing for n==X is "easy". But having an application scale to an arbitrary number of cores is something else entirely.

Quote:
It's one way to approach scalability transparently. Due to too many gotchas and platform differences, there is no real silver bullet.



Thanks for the quick reply Antheus.

Naturally, I don't expect there to be linear speedup in total application performance, or a silver bullet, or anything of the sort. And I don't think I claimed anything like this either. (I would also like to use the opportunity to state that any replies referring to premature optimization will be ignored. Hope it is clear why.)

And I would hope to get something better out of the discussion than being content with stating that "meh, it's just one way of doing this, there's no real answer." What's particularly good about doing it this way? What's bad about doing it like this compared to other methods?
Ok, since you linked to my articles at the bottom of your post, I'll bite!

I agree that the Smoke demo isn't an ideal showcase. There are some n^2 nearest neighbor searches for the horse AI in the code, and the flocking of the birds just goes at 8x the speed when you enable all the threads... so there are many things that are flawed, but at least they're trying :-)


We're trying out various multi-threading ideas as part of the AiGameDev.com Sandbox, and overall the modular architecture impressed me. We're using something similar, based on the MVC-style architecture that Killzone had (see "The Guerrilla Guide to Game Code" on Gamasutra). What strikes me is that this is a good architecture regardless of whether you use multi-threading or not...

So once you boil it down, the big difference is in this paragraph of the Intel article:

"In order for a game engine to truly run parallel, with as little synchronization overhead as possible, it will need to have each system operate within its own execution state with as little interaction as possible to anything else that is going on in the engine. Data still needs to be shared however, but now instead of each system accessing a common data location to say, get position or orientation data, each system has its own copy. This removes the data dependency that exists between different parts of the engine. Notices of any changes made by a system to shared data are sent to a state manager which then queues up all the changes, called messaging. Once the different systems are done executing, they are notified of the state changes and update their internal data structures, which is also part of messaging. Using this mechanism greatly reduces synchronization overhead, allowing systems to act more independently."


So the main question is what you think of this part of the code, as the rest seems relatively sensible to me.
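For discussion purposes, the mechanism in the quoted paragraph can be sketched roughly as below. This is a single-threaded illustration, not Smoke's actual code: `StateManager`, `System`, `notify`, `distribute` and `apply` are hypothetical names, and a real multithreaded version would need a thread-safe queue for the pending changes.

```python
from dataclasses import dataclass, field


@dataclass
class StateManager:
    """Queues change notices during a frame and hands them out afterwards."""
    pending: list = field(default_factory=list)

    def notify(self, source, obj_id, attr, value):
        # Called by a system while it executes; nothing is shared yet.
        self.pending.append((source, obj_id, attr, value))

    def distribute(self, systems):
        # Called once all systems have finished executing for the frame.
        changes, self.pending = self.pending, []
        for system in systems:
            for source, obj_id, attr, value in changes:
                if source != system.name:  # the source already has this value
                    system.apply(obj_id, attr, value)


class System:
    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.local = {}  # private copy of state: {obj_id: {attr: value}}

    def set(self, obj_id, attr, value):
        # Write to the private copy, then post a notice of the change.
        self.local.setdefault(obj_id, {})[attr] = value
        self.manager.notify(self.name, obj_id, attr, value)

    def apply(self, obj_id, attr, value):
        # Update the private copy from another system's change.
        self.local.setdefault(obj_id, {})[attr] = value
```

Between `set` and `distribute`, the other systems keep working with their old copy, which is where the one-frame latency of this design comes from.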

Alex
AiGameDev.com

Join us in Vienna for the nucl.ai Conference 2015, on July 20-22... Don't miss it!

After studying the whole "paper" in depth, I've come to the following conclusion:

+ Distributed state seems like a key ingredient to making computation scale well if you don't mind a bit of latency.
+ The idea of having interfaces between the different managers is a good way to implement these systems modularly.
+ The observer model ties in well to the MVC pattern, and it seems generally sensible.


Those were the positives, now the negatives:

- I don't get the "conflict resolution" part, or at least their entire solution to it... For example, they talk about having multiple systems set the position and orientation of an object. That sounds like bad design to me, I'd always want to have more control over this.
- I'd prefer to have one and only one "controller" responsible for each change, and let it grab data from multiple sources if necessary. This may potentially introduce a little more latency if done wrong though.
- I'm wondering how a traditional blackboard ties into this approach, as it would sound much more sensible.


I think they're on the right track, but they haven't taken it quite far enough yet. Having tasks themselves responsible for resolving conflicts (rather than their ad-hoc policies for replacing data) seems like a better model to me, and certainly more proven. See Erlang.
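As a rough illustration of the "one and only one controller" idea, the sketch below gives every (object, attribute) pair a single owner and rejects writes from anyone else, so no conflict policy is needed. The class and method names are hypothetical and do not come from the Smoke code or from Erlang.

```python
class OwnershipError(Exception):
    """Raised when a controller touches state it does not own."""


class OwnedState:
    """Shared attributes where each (object, attribute) pair has exactly one writer."""

    def __init__(self):
        self._owner = {}
        self._value = {}

    def claim(self, controller, obj_id, attr):
        key = (obj_id, attr)
        # The first claim wins; a repeated claim by the same controller is harmless.
        current = self._owner.setdefault(key, controller)
        if current != controller:
            raise OwnershipError(f"{key} is already owned by {current}")

    def write(self, controller, obj_id, attr, value):
        key = (obj_id, attr)
        if self._owner.get(key) != controller:
            raise OwnershipError(f"{controller} does not own {key}")
        self._value[key] = value

    def read(self, obj_id, attr):
        # Anyone may read; the owner can gather inputs from several sources.
        return self._value[(obj_id, attr)]
```

A system that wants to influence an attribute it does not own would send a request to the owner, which then decides how to combine the inputs.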


Thoughts?

Alex


Re: My take on the article?

It's an insignificant baby step on a long journey towards game development using multi-cores (Hey, if there are going to be quite a few cores in there, how can we use them?).

Well, I cannot help thinking that the most important feature of this article is that its content is secondary to its actual purpose. Particularly if you consider for a moment that the medium is the message.

It is designed to create interest/discussion/promotion in the use of multithreading in the wider game development community as the future of game development.

The content seems ordinary (remember, my bias is that the medium is the message here), as is the writing style (which the OP mentions). But that's OK: its tone certainly isn't that of a seminal academic paper targeted towards experts/researchers (that you usually have to pay for). Grass roots article for the grass roots?

One thought that arose when I read it, was that it sounds like a database developer describing the concept of 'maintenance of referential integrity in a distributed database system' applied to a multithreaded game engine (maybe?).

The clearest message this article gives me is that multithreading is the future of game development on the Intel multi-core CPUs of the future. The distribution of this article and the other sponsored Intel articles is part of the strategy to entice the game development community to participate in the development of solutions that utilise multi-core CPUs (a participative approach that is to be commended?).

I see a very interesting future:

If the game development community embraces the current promotion of multithreading, it will become conventional practice.

If not: maybe the multithreading will have to be hidden in (yet) another (opaque?) layer of abstraction between the person and the machine, or else a non-multithreading method of utilizing multi-cores will have to be found and used.

I see a cpu with multi-cores where some cores are dedicated to graphics. ATI/AMD seem nicely placed for the future in this regard.

So as you can see, I seem to have skewed off into some la la land while reading the article... no wonder I didn't address its contents! Sorry about that (I couldn't get past the OP's first question). Hopefully the next poster will get the thread back on track, i.e. the article contents the OP asks about.

This topic is closed to new replies.
