Quote:
Original post by clb
The performance statistics seem smelly.
The way performance metrics are gathered is inaccurate. On one system, the counters don't work at all. IIRC, it uses Windows performance counters, which in quite a few cases aren't installed, don't work, or report only highly inaccurate information.
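For reference, reading one of those counters through the PDH API looks roughly like the sketch below. This is just to show the mechanism being relied on, not Smoke's actual instrumentation; the counter path and the complete lack of error handling are my own simplifications.

```cpp
// Minimal PDH sketch: sample total CPU utilization twice and print it.
// Illustrative only -- not the Smoke demo's code.
#include <windows.h>
#include <pdh.h>
#include <cstdio>
#pragma comment(lib, "pdh.lib")

int main() {
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PdhOpenQueryW(nullptr, 0, &query);
    PdhAddCounterW(query, L"\\Processor(_Total)\\% Processor Time", 0, &counter);

    PdhCollectQueryData(query);        // rate counters need two samples
    Sleep(1000);
    PdhCollectQueryData(query);

    PDH_FMT_COUNTERVALUE value;
    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, nullptr, &value);
    std::printf("CPU: %.1f%%\n", value.doubleValue);

    PdhCloseQuery(query);
}
```

If the counter isn't installed or is broken on a given machine, everything after the first call silently returns garbage, which is exactly the problem.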
Quote:
The resulting performance compared to the amount of visual complexity (disregarding the fire) is underwhelming. On my Intel quadcore system, all the 8 threads go at about 90%, but I get less than 30 fps most of the time. Looking at the number of objects in this scene, I would be expecting a lot more.
It's not about FPS on 1, 2, 4 or 8 threads. It's about a single architecture scaling almost linearly from 1 to n threads, transparently. That is the difficult part. And such a framework defines a way to split work both horizontally and vertically (as opposed to the usual focus on only one of the two).
Naive scalability as the number of cores increases is typically a log(n)- or sqrt(n)-like function. Simply splitting sub-systems across cores manually will result in something like 90% utilization on 2 cores and 75% on 4 cores, which is somewhat easy to show. Performance of naive partitioning can even fall under 50% on 8-core systems (worse than on 4 cores) due to synchronization overhead.
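To make the horizontal/vertical distinction concrete, here is a toy sketch (nothing to do with Smoke's actual code, names are made up): two subsystems run side by side ("vertical"), and one of them additionally splits its own data across workers ("horizontal").

```cpp
// Toy sketch: "vertical" = whole subsystems run concurrently,
// "horizontal" = one subsystem partitions its own data across workers.
#include <algorithm>
#include <cstdio>
#include <future>
#include <vector>

// Horizontal split: one system's data divided into `workers` chunks.
void update_particles(std::vector<float>& particles, unsigned workers) {
    std::vector<std::future<void>> jobs;
    const size_t chunk = (particles.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const size_t begin = w * chunk;
        const size_t end = std::min(begin + chunk, particles.size());
        jobs.push_back(std::async(std::launch::async, [&particles, begin, end] {
            for (size_t i = begin; i < end; ++i)
                particles[i] += 0.016f;            // dummy integration step
        }));
    }
    for (auto& j : jobs) j.get();
}

int main() {
    std::vector<float> particles(100000, 0.0f);
    std::vector<float> rigid_bodies(50000, 0.0f);

    // Vertical split: two independent subsystems launched concurrently.
    auto physics = std::async(std::launch::async, [&rigid_bodies] {
        for (float& b : rigid_bodies) b += 0.016f; // dummy physics step
    });
    auto fx = std::async(std::launch::async, [&particles] {
        update_particles(particles, 4);            // horizontal split inside
    });
    physics.get();
    fx.get();
    std::puts("frame done");
}
```

Only doing the vertical split is what I mean by manual partitioning: you get at most as many busy cores as you have subsystems, and the longest one dominates the frame.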
Quote:
What I'd like to know is what you think as game developers about their overall design? Did you get something out of it? Did you/Are you going to model your next architecture according to something similar, or even be bold to take it directly as the base design? Do you think it's flawed/useless/using a wrong approach? Do you think it's overcomplicated?
It's one way to approach scalability transparently. Due to too many gotchas and platform differences, there is no real silver bullet.
Quote:
I'm being really cautious about giving too much time to entertain this kind of design in our next project, especially since that demo doesn't really make me go "wow, cool!" in any way (it's slightly the opposite).
Counter-point: can you design a system which will scale linearly and transparently from 1 to n cores? Optimizing for n == X is "easy". But having an application scale to an arbitrary number of cores is something else entirely.
The trick here is that n can be 1 (the worst-case scenario, but it must still be supported), any power of 2 (2, 4, 8), or an arbitrary value: 17, or 55, or 6. Any kind of manual partitioning will fail at some large enough n.
In practice it would be perfectly possible to manually tune thread use for each individual n < 64 case; it would just be too time consuming. Imagine running every benchmark 64 times and studying thread contention, where one change may improve half of the cases but worsen the other half.
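The usual way around tuning for every individual n is to never hard-code the thread count at all: cut the frame's work into many small tasks and let a pool sized from the detected core count drain them. A rough sketch, assuming nothing about Smoke's internals (the task granularity here is made up):

```cpp
// Size the worker pool from whatever the machine reports, and feed it
// many small tasks instead of one big task per hard-coded thread.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                       // the query is allowed to return 0

    const size_t task_count = 1024;          // deliberately much larger than n
    std::atomic<size_t> next{0};
    std::atomic<size_t> done{0};

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) {
        pool.emplace_back([&] {
            size_t t;
            while ((t = next.fetch_add(1)) < task_count) {
                // ... do task t here (a small, possibly uneven chunk of work) ...
                done.fetch_add(1);
            }
        });
    }
    for (auto& w : pool) w.join();
    std::printf("%zu tasks on %u workers\n", done.load(), n);
}
```

Because the workers pull tasks instead of being assigned fixed slices, the same binary balances itself on 2, 6 or 17 cores without per-n benchmarking.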
Quote:
Do you think there is a big point to those? Are you doing something like this? What do you think about flexibility or performance in this kind of architecture?
You've mentioned both OpenMP and hard-coded systems. The presented architecture merges both, and adds a few others into the mix.
There is no silver bullet. The Smoke demo merely shows how the best of each world can be merged. The particle system could utilize OpenMP, physics could use PhysX, and logic could use Lua, while the overall framework makes sure all these tasks start appropriately.
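As an example of the "best of each world" idea, the data-parallel inner loop of something like a particle system is a natural fit for OpenMP, while the framework above it only needs to know when the task as a whole starts and finishes. A hedged sketch with illustrative names, not Smoke's code:

```cpp
// Inside one subsystem's task: a flat data-parallel loop handed to OpenMP.
// The scheduling framework just sees "update the particle system" as one task.
// Compile with OpenMP enabled (e.g. -fopenmp).
#include <omp.h>
#include <cstdio>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz; };

void update_particles(std::vector<Particle>& ps, float dt) {
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(ps.size()); ++i) {
        ps[i].x += ps[i].vx * dt;
        ps[i].y += ps[i].vy * dt;
        ps[i].z += ps[i].vz * dt;
    }
}

int main() {
    std::vector<Particle> ps(100000, Particle{0, 0, 0, 1, 1, 1});
    update_particles(ps, 0.016f);
    std::printf("threads available to OpenMP: %d\n", omp_get_max_threads());
}
```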
Quote:
Last note, I'd like to keep any discussion away from Larrabee/CUDA/GPGPU/alternatives.
No need; Smoke explicitly supports, and even suggests, embarrassingly parallel designs, which are encapsulated inside a single system. I believe the particle system is designed as a single sequential system, but it can be vectorized (or was that part commented out?).
Scalable and concurrent design is not about threads or cores, but about how a sequential process and the data it uses can be partitioned in such a way that part or all of it can be performed concurrently. While vectorization may be one solution, it doesn't solve the problem of interdependent or strictly sequential tasks and dependencies.
The demo, however, still doesn't solve one problem: dynamic task ordering. It's documented that system dependencies are resolved (via topological sort, IIRC?) only once during start-up.
Being able to dynamically determine dependencies for each frame, based on the tasks that actually need to be performed, could give a more scalable design, but the overhead is almost guaranteed to negate any benefit of doing so.
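For reference, that one-time ordering of systems by their declared dependencies is essentially a topological sort over a small graph, something like the following (Kahn's algorithm; the system names and dependencies are made up, this is not Smoke's code):

```cpp
// One-time ordering of systems by declared dependencies (Kahn's algorithm).
#include <cstdio>
#include <map>
#include <queue>
#include <string>
#include <vector>

int main() {
    // system -> systems it depends on (illustrative names only)
    std::map<std::string, std::vector<std::string>> deps = {
        {"input",     {}},
        {"ai",        {"input"}},
        {"physics",   {"ai"}},
        {"particles", {"physics"}},
        {"render",    {"physics", "particles"}},
    };

    std::map<std::string, int> indegree;
    std::map<std::string, std::vector<std::string>> dependents;
    for (auto& [sys, ds] : deps) {
        indegree[sys];                       // ensure every system has an entry
        for (auto& d : ds) {
            ++indegree[sys];
            dependents[d].push_back(sys);
        }
    }

    std::queue<std::string> ready;
    for (auto& [sys, deg] : indegree)
        if (deg == 0) ready.push(sys);

    // Each system popped here has all of its inputs satisfied and could be
    // handed to the scheduler; Smoke does this once, at start-up.
    while (!ready.empty()) {
        std::string sys = ready.front(); ready.pop();
        std::printf("run: %s\n", sys.c_str());
        for (auto& next : dependents[sys])
            if (--indegree[next] == 0) ready.push(next);
    }
}
```

Doing this once is cheap; re-deriving the graph every frame from the tasks that are actually live is where the overhead I mentioned comes in.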
Quote:
On my Intel quadcore system, all the 8 threads go at about 90%, but I get less than 30 fps most of the time. Looking at the number of objects in this scene, I would be expecting a lot more.
I don't really recall if it was Smoke or one of the related presentations, but I remember seeing a single thread being a congestion point: after all systems have completed, the main thread performs some work while all the other cores sit idle.
I don't remember exactly why that is required, but it's possible to avoid it to a degree, which gets results closer to 100% utilization.
The demo (and the architecture) is still limited by its slowest sequential part.
But at the same time, I don't recall the particle system being the bottleneck. In my case it was physics, and turning it off increased the FPS by 4 times or so. Fire and water, however, were never an issue.
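Regarding being limited by the slowest sequential part above: Amdahl's law puts a hard cap on it. A back-of-the-envelope sketch (the 20% serial fraction is a number I picked purely for illustration):

```cpp
// Amdahl's law: speedup(n) = 1 / (serial + (1 - serial) / n).
// With 20% of the frame in a serial phase, 8 cores top out at ~3.3x.
#include <cstdio>

int main() {
    const double serial = 0.20;                      // assumed serial fraction
    for (int n : {1, 2, 4, 8, 16}) {
        double speedup = 1.0 / (serial + (1.0 - serial) / n);
        std::printf("%2d cores: %.2fx speedup, %.0f%% utilization\n",
                    n, speedup, 100.0 * speedup / n);
    }
}
```

Which is why shaving down that end-of-frame serial section matters more than making the already-parallel systems a bit faster.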