Parallel For Loops

Posted by ApochPiQ, 01 March 2010 · 119 views

One of the big promises of the Epoch language is the ability to automatically move code around to various bits of host hardware. For example, suppose I write an app that relies on certain GPGPU logic, such as image filtering. Then, someone runs the app on a machine that lacks capable GPU hardware. Epoch programs should transparently relocate the GPGPU code over to the primary CPU, and apply CPU-side optimizations such as attempting to vectorize loops.

In other words, "write once, run everywhere. No, seriously - everywhere." Except Epoch doesn't suck like Java does [grin]


To provide this degree of flexibility, the Epoch VM obviously needs a serious arsenal of CPU-side parallelization tricks. It's no good to write solid, parallel code and then have it run in a single CPU thread.

A prime example is the "parallel for" concept, where a given set of calculations can be performed in parallel. In a traditional setting, you might see these calculations simply run in serial, in a single thread. The parallel-for construct allows you to split up that loop into chunks, and then feed each chunk to a worker thread to do the actual computations.
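
To make the chunking idea concrete, here is a minimal sketch in C++. It is not Epoch's implementation or syntax; `parallel_for`, its parameters, and the one-chunk-per-worker policy are all assumptions made for illustration. It splits an index range into contiguous chunks and hands each chunk to its own worker thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not part of Epoch): run body(i) for every i in
// [begin, end), splitting the range into contiguous chunks, one per worker.
void parallel_for(std::size_t begin, std::size_t end,
                  const std::function<void(std::size_t)>& body,
                  unsigned workers = std::thread::hardware_concurrency()) {
    if (begin >= end) return;
    if (workers == 0) workers = 1;  // hardware_concurrency() may report 0
    const std::size_t total = end - begin;
    // Never start more workers than there are iterations.
    workers = static_cast<unsigned>(std::min<std::size_t>(workers, total));
    const std::size_t chunk = (total + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t lo = begin; lo < end; lo += chunk) {
        const std::size_t hi = std::min(end, lo + chunk);
        // Each worker runs its own chunk serially.
        pool.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    // Wait for every chunk to finish before returning to the caller.
    for (auto& t : pool) t.join();
}
```

This only works when the iterations are independent of one another; if `body(i)` writes to state that another iteration reads, the chunks race. A real implementation would also likely reuse a thread pool rather than spawning threads on every loop.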

As I write this, I'm putting the finishing touches on Epoch's very own parallelfor loop. It's taken a couple of hours to really get all the semantics right, but the actual process of adding the control structure was surprisingly easy, albeit time-consuming. This gives me a lot of hope for future expansions to the Epoch parallelization repertoire.


Of course, with Epoch, the big news right now is Release 9; as I've mentioned before, I plan to debut R9 at GDC'10 this year. (Don't worry, I'll post the release package on the project site the same day [smile])

That leaves me with scant few hours to finish up the release package. I'm down to evenings and potentially a small chunk of time on Saturday, and then Sunday afternoon I leave for San Francisco. Nothing like a little bit of pressure to keep you on your toes...


The only really significant chunk of work left is to add the CPU failover logic so that when a suitable GPU is not present, the CUDA extension defers to standard CPU execution. This is slightly important because my demo machine (a.k.a. my notebook) doesn't have a CUDA-ready GPU. It'd look kind of bad to present the project and show it failing to work correctly [grin]
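
The shape of that failover can be sketched in a few lines of C++. Everything here is hypothetical (`Backend`, `run_with_failover`, and the probe callbacks are illustration-only names, not the Epoch CUDA extension's API): each backend exposes a capability probe, and the work runs on the first backend that reports itself usable. With CUDA, the GPU probe could plausibly be built on `cudaGetDeviceCount`, though that is an assumption about how one might do it rather than a description of what Epoch does.

```cpp
#include <functional>
#include <string>
#include <vector>

// All names below are hypothetical; this is not the Epoch CUDA extension's API.
struct Backend {
    std::string name;
    std::function<bool()> available;                 // capability probe
    std::function<void(std::vector<float>&)> run;    // executes the work in place
};

// Try each backend in priority order and run the work on the first usable one.
// Returns the name of the backend that ran, or an empty string if none could.
std::string run_with_failover(const std::vector<Backend>& backends,
                              std::vector<float>& data) {
    for (const auto& b : backends) {
        if (b.available()) {
            b.run(data);
            return b.name;
        }
    }
    return {};
}
```

Listing the GPU backend first and a plain CPU backend last gives the "defer to standard CPU execution" behavior: on a machine without a suitable GPU the first probe fails and the same work falls through to the CPU path.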

After that, it's down to lots of small detail work; getting the release ready is a fairly involved process, as I'm doing my best not to release totally broken code. Unfortunately, many of these tasks are hard to predict and plan around, so I have no idea at this point if I'll be able to hit my desired R9 deadline.


But, hey, you can sleep when you're dead, right?




Quote:
Original post by ApochPiQ
A prime example is the "parallel for" concept, where a given set of calculations can be performed in parallel. In a traditional setting, you might see these calculations simply run in serial, in a single thread. The parallel-for construct allows you to split up that loop into chunks, and then feed each chunk to a worker thread to do the actual computations.


I've been tinkering on a little GPGPU library myself (though nowhere near as ambitious as Epoch's transparent facility), so I'm following your discoveries with great interest. You got me wondering if Epoch also needs to deal with GPU latency & sync issues. I'm just using the DX9 API (XNA actually) to do my GPU stuff, so it's entirely possible your CUDA based code doesn't suffer from this. I sure hope for you it doesn't [smile]

Anyway, uploading inputs to the GPU and downloading results causes a pipeline stall for me, which seems to be the key limiting factor to performance. Do you expect this pitfall in Epoch too? If so, how will you handle it? If not, why not, pray tell?

Yeah, there's definitely some overhead for shuttling data onto and off of the device. The only real solution I can offer at the moment is to ensure that you only offload sufficiently large working sets, to amortize the cost of spending all that time on the data bus.
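
As a rough illustration of what "sufficiently large" means, here is a back-of-the-envelope cost model in C++. It is a sketch under stated assumptions, not measured data: `OffloadCosts` and `worth_offloading` are hypothetical names, the model is linear, and every cost parameter would have to be measured on the actual hardware.

```cpp
#include <cstddef>

// Hypothetical linear cost model; all values must be measured, not guessed.
struct OffloadCosts {
    double fixed_overhead_s;     // per-offload latency (launch, sync, stalls)
    double transfer_s_per_elem;  // upload plus download time per element
    double gpu_s_per_elem;       // GPU compute time per element
    double cpu_s_per_elem;       // CPU compute time per element
};

// Offloading pays off only when the fixed overhead and the bus time are
// outweighed by the per-element compute savings across the working set.
bool worth_offloading(std::size_t n, const OffloadCosts& c) {
    const double gpu_total =
        c.fixed_overhead_s + n * (c.transfer_s_per_elem + c.gpu_s_per_elem);
    const double cpu_total = n * c.cpu_s_per_elem;
    return gpu_total < cpu_total;
}
```

Under this model the break-even size is the fixed overhead divided by the per-element savings, so small working sets stay on the CPU and only large ones justify the trip across the bus.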

I haven't tried to do anything in a realtime environment yet, and specifically I haven't tried running, say, a Direct3D rendering application interleaved with GPGPU work. That'll probably be a righteous beast to accomplish, because of all the data shuffling.

But for R9 I have what I want, which is a demonstration that the GPU is faster than my VM (big surprise there - the VM sucks for performance [grin]). All I really needed for now was a proof of concept, and that seems to work beautifully, so I'm happy. I'll tackle the performance stuff later on, probably around the time I start working on tuning up the rest of the VM.


Thanks for the feedback!
