Poor OpenCL performance

I'm learning OpenCL for a project, and so far I'm a little disappointed in the performance I've been getting with a really basic kernel, so I'm hoping there's just something I'm missing. Here's the kernel I'm using; all it does is calculate a gradient that's written to a 640x480 texture:

__kernel void debug(__write_only image2d_t resultTexture)
{
    int2 imgCoords = (int2)(get_global_id(0), get_global_id(1));
    int2 imgDims = (int2)(get_image_width(resultTexture), get_image_height(resultTexture));

    float4 imgVal = (float4)((float)imgCoords.x / (float)imgDims.x,
                             (float)imgCoords.y / (float)imgDims.y,
                             0.0f, 1.0f);
    write_imagef(resultTexture, imgCoords, imgVal);
}


My video card is an Nvidia GeForce GTX 285M. With this kernel running in a release (C++) build I'm getting ~750 FPS. That's not low... but it's not as high as I was expecting; I figure calculating this gradient in GLSL on this card would probably give me quite a bit more. I know GLSL is optimized for this sort of thing whereas raw OpenCL is not, so it could just be that, but I wanted to make sure before I get into more complex things, since I plan to really tax this card once I figure out the intricacies of OpenCL. Here is the code I'm using each frame to execute the kernel:

void CLContext::runKernelForScreen(int screenWidth, int screenHeight) {
    cl_int result = CL_SUCCESS;
    cl::Event ev;
    cl::NDRange localRange = cl::NDRange(32, 16);
    cl::NDRange globalRange = cl::NDRange(screenWidth, screenHeight);

    //make sure OpenGL isn't using anything
    glFlush();

    //get the OpenGL shared objects
    result = _commandQueue.enqueueAcquireGLObjects(&_glObjects, 0, &ev);
    ev.wait();
    if (result != CL_SUCCESS) {
        throw OCException(LookupErrorString(result));
    }

    //set the argument to be the image
    _primaryKernel.setArg(0, _screenTextureImage);

    //enqueue operations to perform on the texture
    result = _commandQueue.enqueueNDRangeKernel(_primaryKernel, cl::NullRange, globalRange, cl::NullRange, 0, &ev);
    ev.wait();
    if (result != CL_SUCCESS) {
        throw OCException(LookupErrorString(result));
    }

    result = _commandQueue.enqueueReleaseGLObjects(&_glObjects, 0, &ev);
    ev.wait();
    if (result != CL_SUCCESS) {
        throw OCException(LookupErrorString(result));
    }

    _commandQueue.finish();
}


I profiled this and found that the bulk of the time is spent on the ev.wait() lines. Commenting those out doesn't do any direct harm, but it only yields around a 100 FPS gain; at that point the execution time is almost entirely in _commandQueue.finish(), for obvious reasons.
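For what it's worth, the next thing I was going to try is timing the kernel itself with OpenCL event profiling instead of host-side timers. Something roughly like this (untested sketch; it assumes a queue created with CL_QUEUE_PROFILING_ENABLE, and _device here is just a stand-in for whatever cl::Device the context was built on):

// Untested sketch: time the kernel with OpenCL event profiling.
// Requires a queue created with CL_QUEUE_PROFILING_ENABLE; "_device" is
// hypothetical here (whichever cl::Device the context was created with).
cl::CommandQueue profilingQueue(_context, _device, CL_QUEUE_PROFILING_ENABLE);

// (the GL objects would still need to be acquired before this launch,
//  exactly as in the per-frame code above)
cl::Event ev;
profilingQueue.enqueueNDRangeKernel(_primaryKernel, cl::NullRange,
                                    globalRange, cl::NullRange, 0, &ev);
ev.wait();

cl_ulong start = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
double kernelMs = (end - start) * 1.0e-6;   // profiling timestamps are in nanoseconds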

If it matters at all, I'm initializing the OpenGL texture as such:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, screenWidth, screenHeight, 0, GL_RGBA, GL_FLOAT, NULL);

And the respective OpenCL texture object is created with:
_screenTextureImage = cl::Image2DGL(_context, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, textureId, &err);

Lastly, in addition to profiling from the host side, I've also used gDebugger to try to see where the issue is, but the tool (at least as I'm capable of using it) doesn't yield much performance data other than to say that on average the kernel uses around 17% of the GPU. I've tried Parallel Nsight as well, but it seems a bit daunting to me in its complexity.

Hopefully I've preempted most of the questions concerning how I'm doing things and someone can make some sense of all this. Is my head on straight here? I don't think I'll be surprised either way if I hear that this is the kind of performance I should or should not expect from OpenCL on this hardware, but like I said I feel like I'd be getting a bit more at this stage from GLSL.
I have a similar GPU and have experienced similar performance when using OpenCL. I've seen CUDA use about half as much time in some cases for the exact same (simple) kernel! You can also verify this by looking at Nvidia's compute SDK: they have a few samples implemented in OpenCL, CUDA, and DirectCompute, and the OpenCL versions are always the slowest for me.

I can't imagine what's going on behind the scenes, but I eventually stopped trying to speed up my OpenCL and switched to using pixel shaders for just about everything. OpenGL 4.2 now supports atomic operations in shaders, so you can do almost the same things in a shader that you would do in OpenCL, except it runs much faster, for who knows what reason haha.
Thanks for the reply. Well, that sucks! I'm doing this for an independent study at my college and I don't really have the time to restart in CUDA :( . I'm gonna wait and see if anyone else weighs in with similar experience and then make a decision. Thanks again for the info, even though it's not at all what I wanted to hear haha.
The fact that commenting out ev.wait(), which in fact does "nothing", gives a huge boost suggests that OpenCL as such is not really to blame; it's a scheduling thing. Waiting on an event means being taken off the ready-to-run list, being put back on it when the event is set, and only actually running again when the next time slice becomes available. If you do this thousands of times and time slices are, say, 15-20 milliseconds, this can be a long, long time.

Have you tried increasing the scheduler's frequency? (I'm not sure how to do it under any OS other than Windows, where that would be timeBeginPeriod(1).)
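On Windows that would be something along these lines (from memory, so double-check it):

#include <windows.h>
#include <mmsystem.h>   // timeBeginPeriod / timeEndPeriod; link with winmm.lib

void runWithFastTimer()
{
    timeBeginPeriod(1);     // request 1 ms scheduler granularity

    // ... render / compute loop goes here ...

    timeEndPeriod(1);       // always restore the previous period before exiting
}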

Alternatively, push a hundred kernels onto the command queue and let them execute, then block in finish() and see how long it took all of them to run. I'm sure it will be much faster. You're not benchmarking OpenCL here, you're benchmarking waiting on an event...
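Roughly like this, reusing your member names (untested, just to show the idea):

// Untested sketch: enqueue many launches back to back, block only once at
// the end, and time the whole batch. No per-launch cl::Event, no wait().
const int launches = 100;
cl::NDRange globalRange(screenWidth, screenHeight);

_commandQueue.enqueueAcquireGLObjects(&_glObjects);
_primaryKernel.setArg(0, _screenTextureImage);
for (int i = 0; i < launches; ++i) {
    _commandQueue.enqueueNDRangeKernel(_primaryKernel, cl::NullRange,
                                       globalRange, cl::NullRange);
}
_commandQueue.enqueueReleaseGLObjects(&_glObjects);
_commandQueue.finish();   // single blocking point -- measure around this call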

Sorry, I'm really new to OpenCL and a lot of what you said is lost on me...

What do you mean by pushing a hundred kernels onto the command queue at a time? Currently I'm just enqueueing one NDRange kernel per frame and letting it execute with a global work size equal to the dimensions of the texture I'm writing to. I don't think I understand what you mean.

Also, I can't quite see what you're talking about with the ev.wait() calls. Are you saying they should be slowing it down, or that they shouldn't be? Do I have too many or too few? I just figured out how to use the CUDA Toolkit's Visual Profiler, and it reported a very low compute utilization (~24%), so if my GPU is idle most of the time I'm sure I'm not getting the performance I could be in theory... I just don't quite get how to fix that. I've split up the work each work-item does about as much as possible (I'm using a different kernel than the one originally posted, but it's still fairly basic), so I'm not sure how to increase the amount that gets done at one time. I'm still using an NDRange of width x height and letting the driver decide what local workgroup size to use; could that be the problem?
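If specifying the local size explicitly is the way to go, I'm guessing it would look something like this (just a guess on my part, not tested; the 16x16 size is arbitrary):

// Just a guess (untested): pick an explicit local size and round the global
// size up to a multiple of it, instead of passing cl::NullRange for local.
cl::NDRange localRange(16, 16);                       // 16x16 is an arbitrary choice
size_t globalX = ((screenWidth  + 15) / 16) * 16;     // round up to a multiple of 16
size_t globalY = ((screenHeight + 15) / 16) * 16;
cl::NDRange globalRange(globalX, globalY);

// The kernel would then need a bounds check, since the rounded-up global
// size can launch work-items outside the image:
//     if (imgCoords.x >= imgDims.x || imgCoords.y >= imgDims.y) return;
result = _commandQueue.enqueueNDRangeKernel(_primaryKernel, cl::NullRange,
                                            globalRange, localRange, 0, &ev);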
