R9 Progress

Posted by ApochPiQ, 03 March 2010 · 151 views

The first version of CPU failover support is now present in the master source for Epoch Release 9. This feature allows a program to run GPGPU code directly on the main CPU if a suitable GPU is not present. The failover is completely transparent; the program need not do anything special to benefit from this feature.


Here's a quick example:
//
// GPUTHREADS.EPOCH
//
// This program demonstrates running threaded computations on the GPU,
// courtesy of CUDA. A thread is created for each logical CPU core in
// the host machine, and the computations are run on both the CPU and
// GPU, using as many threads as possible in the GPU case.
//
// Note that both the GPU and CPU threaded functions invoke the same
// Epoch source code; there is no need to explicitly write the GPU-side
// code in another language, as the JIT cross-compiler will take care
// of generating the appropriate CUDA code for execution on the GPU.
//


extension("EpochCUDA")


//
// Win32 API access
//
structure SystemInfoType :
(
    integer16(architecture),
    integer16(reserved),
    integer(pagesize),
    integer(minappaddress),
    integer(maxappaddress),
    integer(activeprocmask),
    integer(numprocessors),
    integer(proctype),
    integer(allocgranularity),
    integer16(proclevel),
    integer16(procrevision)
)

external "kernel32.dll" GetSystemInfo : (SystemInfoType ref(info)) -> ()
external "winmm.dll" timeGetTime : () -> (integer)


//
// Program entry point; execution begins here
//
entrypoint : () -> ()
{
    integer(cpucount, GetCPUCount())

    if(cpucount <= 0)
    {
        debugwritestring("Failed to determine number of logical CPUs available!")
        return()
    }


    if(!IsCUDAAvailable())
    {
        debugwritestring("WARNING: No CUDA-capable hardware found!")
        debugwritestring("Code targeted for CUDA will run on the CPU instead")
        debugwritestring("")
    }


    debugwritestring("Generating test data...")

    array(input1, CUDAGenerateDataArray())
    array(input2, CUDAGenerateDataArray())
    array(output, CUDAGenerateEmptyArray())


    debugwritestring("Computation batch, CPU, " ; cast(string, cpucount) ; " threads (1 per core), " ; cast(string, length(input1)) ; " elements:")

    integer(starttime, 0)
    integer(finishtime, 0)
    real(elapsedtime, 0.0)

    starttime = timeGetTime()

    parallelfor(i, 0, length(input1), cpucount)
    {
        compute(i, input1, input2, output)
    }

    finishtime = timeGetTime()
    elapsedtime = cast(real, finishtime - starttime) / 1000.0

    debugwritestring(cast(string, elapsedtime) ; " seconds")



    debugwritestring("Computation batch, CUDA, " ; cast(string, length(input1)) ; " elements:")

    starttime = timeGetTime()

    cudafor(i, 0, length(input1))
    {
        compute(i, input1, input2, output)
    }

    finishtime = timeGetTime()
    elapsedtime = cast(real, finishtime - starttime) / 1000.0

    debugwritestring(cast(string, elapsedtime) ; " seconds")
}



//
// Actual worker kernel function; doesn't do much, just runs some throw-away calculations
//
compute : (integer(index), real array(input1), real array(input2), real array ref(output)) -> (integer(result, 0))
{
    writearray(output, index, readarray(input1, index) + readarray(input2, index))
}



//
// Helper for getting the logical number of CPUs on the host machine
//
GetCPUCount : () -> (integer(retval, 0))
{
    SystemInfoType(info, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    GetSystemInfo(info)
    retval = info.numprocessors
}



Aside from the user-friendly "CUDA isn't available" warning, this program does nothing related to checking for GPGPU support. The EpochCUDA extension library simply reports that it is not willing to execute code, so the virtual machine just runs the appropriate stuff directly on the CPU. The cudafor construct basically decays into a regular parallelfor and runs on the CPU, should adequate hardware not be available.
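
As a rough illustration of that fallback, here is a minimal sketch in Python rather than Epoch. It is not the Epoch VM's actual code: GpuExtension, try_execute, and the Python versions of cuda_for and parallel_for are hypothetical stand-ins, and the "GPU" path is just a placeholder loop. The point is only the control flow: offer the loop to the extension, and if it declines, run the same body on CPU threads.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(body, start, stop, workers):
    """Run body(i) for every i in [start, stop) on a pool of CPU threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(body, range(start, stop)))


class GpuExtension:
    """Hypothetical stand-in for an extension library such as EpochCUDA."""

    def __init__(self, available):
        self.available = available

    def try_execute(self, body, start, stop):
        # Returning False means "not willing to execute this code".
        if not self.available:
            return False
        for i in range(start, stop):  # placeholder for a real GPU launch
            body(i)
        return True


def cuda_for(extension, body, start, stop, cpu_workers):
    """Offer the loop to the extension; decay into parallel_for if it declines."""
    if extension.try_execute(body, start, stop):
        return "gpu"
    parallel_for(body, start, stop, cpu_workers)
    return "cpu"


input1 = [1.0, 2.0, 3.0, 4.0]
input2 = [10.0, 20.0, 30.0, 40.0]
output = [0.0] * len(input1)


def compute(i):
    # Same throw-away kernel as the Epoch example: element-wise sum.
    output[i] = input1[i] + input2[i]


# No GPU is "present" here, so the loop body runs on CPU threads instead.
path_taken = cuda_for(GpuExtension(available=False), compute, 0, len(input1), cpu_workers=2)
```

The calling code is identical whether or not the extension accepts the work, which is what makes the failover transparent to the program.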


I'd ramble more about the cool goodies coming up in R9, but frankly my brain is totally shot, and I need to get some sleep.

I'm down to 4 remaining tasks before I can package and ship R9, plus a hefty dose of TODO comments sprinkled around the code, so there's no shortage of work left to do. On the plus side, the absolutely-mandatory jobs are all small and quick, and I can just whittle away at the TODO list as I get spare time before GDC.


Haha, spare time. Me make funny joke.






Very nifty. I really have to look into CUDA in more detail when I have some spare time [smile]

My main point of wonder is how you manage to get that compute code to run on the GPU transparently. Do you emit it as CUDA instructions at build/interpretation time, or is this something facilitated by CUDA itself?

The code is cross-compiled to CUDA, which is then passed to the NVCC compiler to produce an assembly-like code file. This file can then be handed to any CUDA-enabled set of drivers, which take care of actually converting the code to something runnable on the GPU.

The timing of all this and the specifics vary based on how the Epoch program is being executed. For a program executed directly from source, all the cross-compilation and assembly happens just prior to execution of the program entry point. For binary files and packaged .EXEs, things are slightly different; the assembly form of the CUDA code is packed in with the rest of the binary, and loaded at app start time. This eliminates the need for the end user to have the CUDA SDK installed, as well as making it trivial to swap between the original Epoch code and the cross-compiled CUDA, since both representations are stored in the file.
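
To make the "both representations are stored in the file" idea concrete, here is a small Python sketch. It assumes a made-up record layout: PackedKernel, its field names, and select_representation are hypothetical and do not describe Epoch's real binary format. It only shows how a loader could pick between the two stored forms at app start.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackedKernel:
    """Hypothetical record for one kernel inside a packaged binary."""
    name: str
    epoch_code: str                # original representation, always present
    gpu_assembly: Optional[str]    # pre-built GPU assembly, if it was generated


def select_representation(kernel, gpu_available):
    """Pick which stored representation to run, decided at load time."""
    if gpu_available and kernel.gpu_assembly is not None:
        return ("gpu", kernel.gpu_assembly)
    return ("cpu", kernel.epoch_code)


# Placeholder payloads; a real file would hold actual code in both fields.
kernel = PackedKernel("compute", "<epoch code>", "<gpu assembly>")

on_gpu_machine = select_representation(kernel, gpu_available=True)
on_cpu_machine = select_representation(kernel, gpu_available=False)
```

Because the assembly form is already in the record, nothing in this path needs a compiler toolchain on the end user's machine.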

I've been following your progress on Epoch with quite some interest. It looks like you've been making great strides towards a useful language.

As someone who has recently dived into CUDA, I'm wondering how you generate the code for CUDA, as getting optimal performance is not easy. I converted some Doom3 animation code to CUDA, and optimized it on CPU as well, and although I could improve the CPU code easily, it's taken lots of twiddling to get good performance out of the CUDA code. Granted, I'm running on an integrated 9400M GPU, which will partially be to blame for the bad performance. However, you'll never easily get optimal performance on GPU with this kind of automatic code generation. But then I guess that's not your goal? Rather, you just want to get 'better' performance on GPU with minimal work?

I'm looking forward to your next posts.

Rick

You are of course correct that automatically cross-compiled code is never going to rival hand-tuned code, even with a good optimizing compiler in between; by nature, tools like CUDA are hypersensitive to issues like memory layout, pipeline stalls, conditional branching, and so on.

My goal isn't to automatically produce blazing fast code, although I'm not convinced that this is impossible to do (someone with a good understanding of the memory architecture could probably write some basic static code transformations that would help with things like coalescing reads, etc.). I'm personally much more interested in the ability to move that code around to different types of processing hardware on the fly.
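
One example of the kind of layout transformation meant here is converting an array-of-structures buffer into a structure-of-arrays buffer, so that neighboring GPU threads reading the same field touch adjacent elements. The Python sketch below is only an illustration of the index arithmetic; the function names are hypothetical, and it is not something the Epoch compiler is stated to do.

```python
def aos_index(thread_id, field, fields_per_record):
    """Offset of `field` for `thread_id` in an array-of-structures layout."""
    return thread_id * fields_per_record + field


def soa_index(thread_id, field, record_count):
    """Offset of the same value in a structure-of-arrays layout."""
    return field * record_count + thread_id


def to_soa(aos, fields_per_record):
    """Rearrange a flat AoS buffer into a flat SoA buffer."""
    record_count = len(aos) // fields_per_record
    soa = [0.0] * len(aos)
    for t in range(record_count):
        for f in range(fields_per_record):
            soa[soa_index(t, f, record_count)] = aos[aos_index(t, f, fields_per_record)]
    return soa


# Four records of three fields each, one record per thread.
aos = [float(v) for v in range(12)]
soa = to_soa(aos, 3)

# Offsets touched when threads 0..3 all read field 0 at the same time.
aos_offsets = [aos_index(t, 0, 3) for t in range(4)]  # strided: 0, 3, 6, 9
soa_offsets = [soa_index(t, 0, 4) for t in range(4)]  # contiguous: 0, 1, 2, 3
```

In the AoS layout the four threads hit every third element, while in the SoA layout they hit four consecutive elements, which is the access pattern that coalesced reads depend on.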

I guess my feeling at this point is that it's easy to write slow code in any language; I don't think it's reasonable to expect a language like Epoch to automatically speed everything up for you. So people who write GPGPU code in Epoch still need to be familiar with the realities of the platform, and write code accordingly. It isn't too hard to write Epoch code that cross-compiles into pretty efficient code, provided you know how to write good CUDA kernels in the first place.

Hmm, I went to brush up on my CUDA knowledge a bit (reading this), but coming from Java/C# it looks rather daunting. I'd like to think I know a thing or two about GPUs, but the syntax and semantics in CUDA make it look like a whole new ballgame. For me it's easier to grasp the concepts by writing pixel shaders accessing constants and textures than this CUDA stuff pretending to be a regular program.

On the other hand, Epoch looks a lot friendlier to write with the transparent marshaling and failover. Like any complacent programmer I'm sceptical about new languages, but I have to say Epoch is slowly winning me over [smile]

From what I read there's quite some optimization to be done on typical CUDA code, particularly memory access, to make it run perfectly. Is this something you want to expose in Epoch as well? I personally subscribe to the idea that if there's no value in running code on the GPU without optimizations, it's probably safe to say the calculation is conceptually not worth running on the GPU in the first place. So what I'm trying to say is that I'd be perfectly fine with Epoch not exposing this.

Edit - gotta type faster [smile]
