Multi-Threading FTW: Part 1

posted in Skipping a D
Published May 13, 2009
Advertisement
These entries are a bit rapid fire so far, but that's mainly the result of having a bunch of backlogged topics that finally added up enough for me to break down and write them. Things will slow down pretty soon.

Switching gears to a more concrete issue this time, I'm going to talk about the performance challenges of these algorithms, and a few things I've done to fight back. This is part 1 of a 2 part entry, because it got pretty long as I was writing it. Part 2 will have a bonus that other XNA developers might hopefully find useful.

First, before we can improve performance, we have to know where we stand. As discussed in the previous entry, a tesseract is composed of 40 tetrahedra, each of which must potentially be sliced to produce a 3D scene. Each slicing consists of 6 intersection checks, the successful of which will yield a linear interpolation factor that we use to generate a 3D vertex from the two 4D vertices of the edge. For a modest scene of say 100 visible tesseracts (meaning none of them are early-outed by lying entirely outside the camera realm) we then have a full 24,000 intersection checks. Many of these will require further linear interpolations to generate the geometry. I'd estimate roughly half on average for visible polychora.

That's a LOT of computation, especially on the 360, which means I was in for a rude awakening, and indeed it came. I want to ideally run at 60fps, so I spawned visible tesseracts in my test app to see how many I could get before dropping below that. Going in to this I had read up about the performance pitfalls of XNA, did my best to avoid created garbage, used ref passing for structs where I could; thought I had all my ducks in a row. Well, I got to... (drum roll)

25.

25 tesseracts. Release build. No debugger. I was... less than encouraged. Was this all a waste? Should I just pack it in and call it quits? My dream was crumbling, the sky was falling, the-

Oops. Logic error... The geometry was being sliced twice per frame instead of once.

50.

Okay, so that's... less slow. You can't scoff at a 2x performance boost right off the bat I suppose. This includes disabling the back-face calculation I mentioned in the previous entry, which bought me about 3 or 4 I think (I really wish I had kept detailed notes of this as it was happening). Even so, I was starting to face harsh reality. The naive approach of just plowing through all the work on the main thread with ears plugged and humming a tune pretending the other cores don't exist was right down the toilet. It was time to step up and get with the times. It was time to use... (zoom in on face) THREADS.

Reading the XNA forums had previously made it clear to me that the standard .NET thread pool was ill-suited for use on the 360 because on that platform, you must manually assign your threads to a hardware thread slot; no scheduler does the work for you. So it was time to go googling again. I eventually came upon a thread pool implementation designed for use on the 360. I didn't feel comfortable just slapping it into my engine though. This is also a learning exercise after all, and parallelism is an interesting topic, so I instead used the code as a guide and wrote my own.

It went pretty well actually. I used a standard .NET queue to store the work items, spun up some worker threads, assigned them to 360 hardware threads, and used an AutoResetEvent to block the workers until the main thread called DoTasks on my task component. When this happened, the main thread would one at a time awaken a worker by signaling the wait handle, waiting for the worker to signal back with another wait handle that it was activated. Once activated, the worker would lock the task queue, dequeue a work item, then run it. Once the main thread saw that there were no more tasks to be dispatched, it would wait for all outstanding tasks to complete, then return.

This wasn't too bad for a first attempt I thought, so I set about to make my game use it. Every polychoron was now a work item, slicing its geometry into a thread local buffer. Once all work items were done, the main thread would render each of the worker thread geometry buffers.

Unsurprisingly, there was a bug. I used a loop to spin-up my worker threads, and use the loop index inside a closure delegate to assign each thread a unique index so they can write to their own element of shared data arrays. For some reason though, all the indices were coming out as 2, which was the terminal loop value on my windows build, instead of being 0 and 1 like they should be. As it turns out, C# closures are mutable, and by the time the worker threads were activated, the loop counter had progressed to its final value. As a fix, I added a wait handle to the initialization so the main thread would wait until the worker was done reading the local values before proceeding to the next iteration. I'm suspicious, although not sure (because I haven't actually ran it) that the linked implementation might suffer from it as well. If anyone has used it (specifically on the 360) I'd be curious to know how it went, otherwise perhaps I'll investigate at some point.

Anyway, this worked out pretty nicely, and I was quite pleased to see my new performance number:

100.

The clouds are parting now. Starting to get to the point where I consider it adequate, especially if I was willing to settle for 30fps, because then I could reach around 200. I declared a preliminary victory and changed focus to other things, like the editor.

Something kept bugging me though. The 360 was now using 3 hardware threads to do the work instead of only 1, but I was "only" getting about 2x performance. I initially shrugged it off as the fact that 2 of those threads are actually on the same core, but then I thought "why not let the main thread do work too instead of just waiting for the other threads to do it." So I tried to add yet another worker thread and assign it to the same hardware thread as the main thread. Much to my dismay though, this had essentially no impact on performance.

I shrugged it off again, and again went back to the editor. Working on an editor is dry business though, so as a distraction I decided to completely tear down the task manager and try to fix 3 main problems that I suspected. First, activating workers one at a time felt inefficient, there has to be a way to activate them all at once and let them take care of acquiring their own work items. Second, I was still convinced that getting the main thread to consume work items as well would help. Third, I wanted to try and make it lockless to further reduce any blocking. Experts frequently warn against trying that, and for good reason as I'll detail next time, but still I wanted to try, even if just for fun. If it didn't work I could always roll back...

TO BE CONTINUED...
Previous Entry Geometry Slicing
0 likes 1 comments

Comments

Moe
Quote:
TO BE CONTINUED...

Dang, I hate being left in suspense! :P
May 13, 2009 05:43 PM
You must log in to join the conversation.
Don't have a GameDev.net account? Sign up!
Profile
Author
Advertisement
Advertisement