Slow parallel code

Started by
9 comments, last by Narf the Mouse 11 years, 9 months ago
According to SlimTune (which seems more reliable than Eqatec), the current slowdown in my code is a parallel loop that logs draw commands (by way of a MultiThreadedRenderer). This slowdown acts like a switch: with fewer than x^3 objects (where x is 27), it runs at or close to 90 FPS; at x = 27, it drops to about 48 FPS. At x = 30, it's 40 FPS, so it probably isn't directly the increased number of models.

Previous thread on this performance test

Highlights:

1) It's .Net/C#
2) It's a 3D game engine.
3) The objects are, as stated, 3D models (boxes, untextured, unlit, although both are possible).
4) The boxes are placed at regular intervals all around the camera.
5) None of them are instanced, although they all share the same mesh. None of them move, although the game engine has to support moving objects.
6) Using DirectX 9 (I want my engine to support from DX 9 up).
7) The speed profiler is SlimTune.
8) One call is made per model and about 48% of the code's time is spent in this parallel loop. (The profiler, of course, adds its own overhead, so the number may not be fully accurate.) Millisecond timing does not seem to be available.
9) The parallel code is tested faster than the single-threaded code. Single-threaded gets about 22 FPS, compared to about 48 FPS multithreaded, when x = 27.
10) The MultiThreadedRenderer queues commands in a ConcurrentStack when told to Render and executes them on a call to .Finish( ) (not shown).

Code:

// One cached Effect per worker thread; light state is only re-sent when a thread's effect changes.
System.Threading.ThreadLocal<Effect> effect =
    new System.Threading.ThreadLocal<Effect>( ) ;

// Frustum culling: keep only the models whose bounding box intersects the view frustum.
var drawn = models.AsParallel( ).AsUnordered( ).Where(
    model => IntersectionTests.AABBXFrustum( model.BoundingAABB, frustum )
    ).ToList( ) ;
// These two parallel loops are each as fast, +/- 1-2 FPS
/* drawn.AsParallel( ).AsUnordered( ).ForAll(
    model => */
System.Threading.Tasks.Parallel.For(
    0, drawn.Count, t =>
    {
        Model model = drawn[ t ] ;
        if ( effect.Value != model.UseEffect )
        {
            effect.Value = model.UseEffect ;
            foreach ( var light in scene.Lights )
                Renderer.Render( effect.Value, light ) ;
        }
        model.Draw( deltaTime, Renderer ) ;
    }
    ) ;


Questions:

1) Where does the slowdown come from?
2) How can I reduce or eliminate it?

Thoughts:

1) Cache coherency issue?
2) Inefficient partitioning?
Much of the time is spent in the loop because it's the wrong unit of work to make parallel.

There is an overhead to create the threads, distribute data between threads, and so on.
Then after the overhead you do a tiny bit of work and make an asynchronous render call. Those calls need to re-synchronize internally to prevent the rendering system from getting corrupted, so it essentially undoes the distributed work.
Then you pay more overhead to clean up all those threads.

It is good that you are measuring. Look at your measurements and find the most compute-intensive stuff first.

Look for large blocks of actual compute-intensive work, or small units of blocking functions (stuff that stalls the application). Those are things to do in parallel.
What frob said.

While doing the intersection test in parallel is fine, once you get to drawing you only want to interact with a D3D device from one thread at a time, because all the internal locks etc. are going to trash your performance.
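
The split described above (cull in parallel, submit from one thread) could look roughly like the sketch below. It is only an illustration under assumptions: `SketchModel`, `CullThenDraw` and the `draw` callback are made-up names, and the `InsideFrustum` flag stands in for the real AABB-versus-frustum test.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the engine's model type.
public sealed class SketchModel
{
    public int Id;
    public bool InsideFrustum;   // placeholder for the AABB-vs-frustum result
}

public static class CullThenDraw
{
    // Phase 1: culling touches no shared state, so PLINQ can spread it over cores.
    public static List<SketchModel> Cull(IEnumerable<SketchModel> models)
    {
        return models.AsParallel().Where(m => m.InsideFrustum).ToList();
    }

    // Phase 2: submission runs on the calling thread only.
    // 'draw' is a placeholder for whatever finally talks to the device.
    public static int Submit(List<SketchModel> visible, Action<SketchModel> draw)
    {
        foreach (var model in visible)
            draw(model);
        return visible.Count;
    }
}
```

Whether this beats the fully parallel loop depends on how expensive the per-model prep work is, so it is worth timing both.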
Read it again:

10) The MultiThreadedRenderer queues commands in a ConcurrentStack when told to Render and executes them on a call to .Finish( ) (not shown). - It's not actually rendering from multiple threads, just queueing/doing prep work from multiple threads. The rendering takes place on a single thread, which executes the commands. This is faster because the major slowdown is the prep work.
9) The parallel code is tested faster than the single-threaded code. Singlethreaded gets about 22 FPS, compared to about 48 FPS multithreaded, when x = 27. - The multi-threaded code is faster. Single-threaded also does less work overall, so it's not the amount of work either is doing - my multi-threaded code is just faster than my single-threaded code.
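
Since the MultiThreadedRenderer itself isn't shown, here is a minimal sketch of the queue-then-execute pattern described in point 10. `QueuedRendererSketch` is a made-up name, and commands are modelled as plain `Action` delegates, which is an assumption about the real design.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch; not the actual MultiThreadedRenderer.
public sealed class QueuedRendererSketch
{
    private readonly ConcurrentStack<Action> commands = new ConcurrentStack<Action>();

    // Safe to call from any worker thread: only records the command.
    public void Render(Action command)
    {
        commands.Push(command);
    }

    // Call from the single render thread: executes everything queued so far.
    // Returns the number of commands executed.
    public int Finish()
    {
        int executed = 0;
        Action command;
        while (commands.TryPop(out command))
        {
            command();
            executed++;
        }
        return executed;
    }
}
```

Two things to keep in mind with this shape: a stack pops in LIFO order, so commands do not run in the order they were queued, and every worker thread pushes onto the same head, so with roughly 20,000 pushes per frame the stack is a shared contention point and a likely source of per-frame allocations.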
For x = 27 and with one draw-call per object you're looking at ~20,000 draw calls per frame. Get some instancing in there, I think.


At the moment, I'm not working on instancing, and I'd really like to know why the parallel code experiences a sudden drop in speed.


You mentioned that the code chunk you posted takes 48% of the time in one test. Do you know for sure that it is the part that is growing out of proportion to the other part as x increases? Like, as you increase x does that percentage go up, or down, or stay the same?

I would time the first parallel loop, the second parallel loop, and your actual rendering code, all separately, and see which one is growing faster than the others as you increase x. You don't even need SlimTune, just use a System.Diagnostics.Stopwatch and draw the times, or percentages of frametime, on the screen. That way you can at least verify that you are targeting exactly the right section.
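
A minimal sketch of that kind of manual timing, assuming nothing about the engine: `SectionTimer` is a made-up helper, and the three sections you would wrap (culling, command logging, rendering) are whatever your frame loop actually calls.

```csharp
using System;
using System.Diagnostics;

public static class SectionTimer
{
    // Runs 'section' and returns the elapsed wall-clock time in milliseconds.
    public static double TimeMs(Action section)
    {
        var stopwatch = Stopwatch.StartNew();
        section();
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }

    // Converts a section time into a percentage of the whole frame.
    public static double PercentOfFrame(double sectionMs, double frameMs)
    {
        return frameMs <= 0.0 ? 0.0 : 100.0 * sectionMs / frameMs;
    }
}
```

Usage would be something like `double cullMs = SectionTimer.TimeMs( () => CullModels( ) );` for each section, where `CullModels` is a placeholder for your own code. Compare how each number grows as x goes from 26 to 28.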

Also, if you are not doing it already, I would make sure you compare tests with the same percentage of objects visible -- either all, or none, or some constant value like 50%. If you compare 60% objects visible at x=27 to 40% objects visible at x=28 you will see changes in the relative timing of different sections that are only based on that percentage, not on x.

Also consider memory allocation and garbage collection. As you start allocating bigger chunks of RAM it starts getting more and more expensive, and the GC may start to thrash more and more often. One VERY SUSPICIOUS fact is that an array of 27^3 32-bit references (19,683 x 4 = 78,732 bytes) is on the order of the size at which objects start getting put in the separate large object heap (85,000 bytes). I wrote a program to time how long it took per allocation averaged over 100,000 allocations, and I get this chart:


[Chart: allocscale2.png, average time per allocation]

(EDIT: The graph is actually milliseconds for a pair of allocations -- first allocating an array of that size, and then calling ToList on it.)

Notice that the Y axis is milliseconds, though -- so if it is affecting you it is probably because the number of objects in your heap is much larger than my test program's, or you allocate many times per frame.
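
To make the large-object-heap arithmetic concrete, here is a small sketch that estimates the payload size of an array of x^3 references and compares it with the 85,000-byte threshold. `LohEstimate` is a made-up helper; it ignores the array header and says nothing about what PLINQ or ToList actually allocate internally.

```csharp
using System;

public static class LohEstimate
{
    // Objects of 85,000 bytes or more are placed on the large object heap.
    public const int LargeObjectThresholdBytes = 85000;

    // Payload size of an array holding x^3 references (array header ignored).
    public static long ReferenceArrayBytes(int x, int bytesPerReference)
    {
        return (long)x * x * x * bytesPerReference;
    }

    // True if such an array would be at or above the threshold.
    public static bool IsLikelyLargeObject(int x, int bytesPerReference)
    {
        return ReferenceArrayBytes(x, bytesPerReference) >= LargeObjectThresholdBytes;
    }
}
```

With 4-byte references, x = 27 gives 78,732 bytes (just under the threshold) and x = 28 gives 87,808 bytes (over it). In a 64-bit process references are 8 bytes, so pass `IntPtr.Size` to check your own case. A `List<T>` that grows by doubling can also hold a backing array noticeably larger than its Count, so the real crossover may sit at a different x than this estimate suggests.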

Maybe try and see how frequently GCs are happening. There are some performance counters that can tell you lots of details about this. GCs can strike anywhere in your main loop even if they're caused by allocations in a localized position so it would be useful to rule that out first.
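
If you'd rather check from inside the game than from perfmon, one option is to sample `GC.CollectionCount` once per frame. The sketch below assumes you call it from your frame loop; `GcFrameCounter` is a made-up name.

```csharp
using System;

public sealed class GcFrameCounter
{
    // Last observed collection count for each GC generation.
    private readonly int[] lastCounts = new int[GC.MaxGeneration + 1];

    public GcFrameCounter()
    {
        for (int gen = 0; gen < lastCounts.Length; gen++)
            lastCounts[gen] = GC.CollectionCount(gen);
    }

    // Returns how many collections of the given generation have happened
    // since the previous call for that generation.
    public int CollectionsSinceLastSample(int generation)
    {
        int now = GC.CollectionCount(generation);
        int delta = now - lastCounts[generation];
        lastCounts[generation] = now;
        return delta;
    }
}
```

If the gen0 delta is non-zero on most frames once x reaches 27 but not at x = 26, that would point fairly strongly at allocation pressure rather than at the threading itself.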

[quote name='Narf the Mouse' timestamp='1340411728' post='4951874']
[quote name='mhagain' timestamp='1340406776' post='4951863']
For x = 27 and with one draw-call per object you're looking at ~20,000 draw calls per frame. Get some instancing in there, I think.
[/quote]

At the moment, I'm not working on instancing, and I'd really like to know why the parallel code experiences a sudden drop in speed.
[/quote]

I don't know how you came to those numbers (did the original poster explain more in his other thread maybe?), but if there really are 20,000 draw calls then I agree, it's pretty clear where the bottleneck lies.

General advice is to stay well below 500 draw calls (and 200 for Xbox 360 class hardware if console support is of relevance). For a moderate number of objects (eg. a few hundred) use instancing. For everything above that use batching, which can handle tens of thousands of objects (see Minecraft, for example) of low complexity and is often a good candidate for threading.
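
For anyone unsure what batching means here: the idea is to copy the geometry of many static objects into one big vertex/index buffer so they go out in a single draw call. A minimal CPU-side sketch, assuming translation-only placement and made-up `Vec3` and `StaticBatcher` types (a real engine would apply full transforms and upload the result to a vertex buffer):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal vector type for the sketch.
public struct Vec3
{
    public float X, Y, Z;
    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }
}

public static class StaticBatcher
{
    // Copies the shared mesh once per position into one combined vertex and index array.
    public static void Build(
        Vec3[] meshVertices, int[] meshIndices, IList<Vec3> positions,
        out Vec3[] batchVertices, out int[] batchIndices)
    {
        batchVertices = new Vec3[meshVertices.Length * positions.Count];
        batchIndices = new int[meshIndices.Length * positions.Count];

        for (int i = 0; i < positions.Count; i++)
        {
            int vertexBase = i * meshVertices.Length;
            int indexBase = i * meshIndices.Length;
            Vec3 p = positions[i];

            // Bake the object's translation into its copy of the vertices.
            for (int v = 0; v < meshVertices.Length; v++)
            {
                Vec3 m = meshVertices[v];
                batchVertices[vertexBase + v] = new Vec3(m.X + p.X, m.Y + p.Y, m.Z + p.Z);
            }

            // Offset the indices so they point at this object's copy.
            for (int k = 0; k < meshIndices.Length; k++)
                batchIndices[indexBase + k] = meshIndices[k] + vertexBase;
        }
    }
}
```

This only pays off for objects that don't move, since a moved object means rebuilding (part of) the batch. Also note that with 16-bit indices a single batch tops out at 65,536 vertices, so tens of thousands of boxes would need several batches or 32-bit indices.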

The engine will use instancing (what's batching? thanks), but before then, I want the ordinary objects to draw blazingly fast, too. I'm using C#, so speed has to be a higher priority than normal. (I'm also backbraining some ideas for using SIMD instructions using a C++ dll, but that's not really relevant to this test, which is focused on multithreading architecture)

Edit: Yeah, there's more explanation in the previous thread. In short, in the test, I'm drawing X^3 objects around the camera, frustum pruning, then drawing. At x > 26 (refined the test a little), using only integers, the speed drops by ~half. It then drops in a more ( 1 / linear ) fashion, which means the drop is something other than directly the number of objects drawn.

I'll check if it's the garbage collector - Does anyone know of a free/cheap memory profiler? Manually checking probably wouldn't be quite as easy.


CLR Profiler is good and free. I seem to remember there being some trick to making it work with XNA but I don't recall what it was at the moment. It will tell you what you're allocating, how much, where and when.

You can use the performance counters exposed by .NET to view what the GC is doing. Basically you go Start -> Run "perfmon.exe" and add counters from under ".NET CLR Memory" to your graph. You can set it up to only show GC info for your specific program too.
http://msdn.microsof...y/x2tyfybc.aspx
http://blogs.msdn.co.../03/148029.aspx

Also, one trick I have used is to just make a single object that is only referenced via a WeakReference, and every frame check if that WeakReference's Target is now null. In my game I plot these events on a scrolling graph and their frequency tells me how often gen0 collections are happening, which tells me how much garbage I am generating per frame. But that method requires more game-specific setup than the CLR Profiler/Performance Counter route.
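
A rough sketch of that WeakReference trick, under the assumption that you poll it once per frame; `GcSentinel` is a made-up name and the graphing part is left out.

```csharp
using System;

public sealed class GcSentinel
{
    // The sentinel object is reachable only through this weak reference,
    // so the next collection that covers it will clear Target.
    private WeakReference sentinel = new WeakReference(new object());

    public int CollectionsSeen { get; private set; }

    // Call once per frame. Returns true if a collection happened since the
    // last time the sentinel was replaced.
    public bool Poll()
    {
        if (sentinel.Target != null)
            return false;

        CollectionsSeen++;
        sentinel = new WeakReference(new object());   // arm a fresh sentinel
        return true;
    }
}
```

Each detected collection costs two tiny allocations for the replacement sentinel, which is negligible next to whatever is generating the garbage in the first place. Several collections between two polls are counted as one, so this gives a lower bound on GC frequency.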
