BadEggGames

OpenCL not working on every pixel


Hi all

 

I have a simple kernel that sets a pixel color.

 

I have a screen that is 600*600 pixels. The kernel changes the color of every pixel except for the last 50 or so. Imagine a blue screen with a small 1 pixel wide black line on the top right.

 

I enqueue my kernel like so:

error = clEnqueueNDRangeKernel(queue, kernel,
		1,
		NULL,
		&DATA_SIZE,
		NULL, 0, NULL, NULL);

And read it like:

cl_float4 *ptr = (cl_float4 *)clEnqueueMapBuffer(queue,
		buffer2,
		CL_TRUE,
		CL_MAP_READ,
		0,
		DATA_SIZE * sizeof(cl_float4),
		0, NULL, NULL, NULL);

DATA_SIZE = 600*600

 

Why is it not applying the kernel to the last 50 or so pixels?

 

Does NDRANGE have to be a multiple of a certain number or something?

 

Any help would be great.

 

Thanks :)

 


You passed NULL as the work group size - I would not expect this to work at all.

Probably the driver raised it to 1, but you get only a fraction of the possible performance this way.

 

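Work group size = the number of threads (work-items) that run together in one work group. It does not strictly have to be a power of two, but in OpenCL 1.x the global size must be evenly divisible by it, and powers of two are the usual choice.

The hardware runs threads in batches of 64 on AMD (wavefront) and 32 on Nvidia (warp). If you use fewer than that, the remaining threads in the batch do nothing.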
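
To put a rough number on the idle threads, here is a minimal sketch in plain C. It assumes a simplified model in which every work group is padded up to a whole number of hardware batches; the helper name `lane_utilization` is made up for illustration and is not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of hardware lanes doing useful work for a given work group size.
   Assumption (simplified model): each work group occupies a whole number of
   batches of simd_width lanes (64 for a wavefront, 32 for a warp). */
static double lane_utilization(size_t local_size, size_t simd_width)
{
    assert(local_size > 0 && simd_width > 0);
    /* Number of batches needed to hold the work group, rounded up. */
    size_t batches = (local_size + simd_width - 1) / simd_width;
    return (double)local_size / (double)(batches * simd_width);
}
```

Under this model a work group of 1 on a 64-wide wavefront keeps 1 lane in 64 busy, while a work group of 64 keeps all of them busy. Real drivers and schedulers are more involved, so treat this as an estimate only.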

 

 
// Case 1: assuming 64 threads work on one pixel:
 
size_t count = 600 * 600;                 // number of pixels
size_t local_size = 64;                   // work group size
size_t global_size = local_size * count;  // total number of threads needed to do all work
clEnqueueNDRangeKernel(queues[0], kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
 
// Case 2 (alternative to case 1): assuming each thread works on only one pixel:
 
size_t count = 600 * 600;
size_t local_size = 64;
// Round up to the next multiple of local_size. If the division leaves a remainder,
// the kernel must check that its global id does not go beyond count.
size_t global_size = local_size * ((count + local_size - 1) / local_size);
clEnqueueNDRangeKernel(queues[0], kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

 

... this is a bit confusing at first, because you need to handle it in the kernel as well to get the expected indexing, see https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/workItemFunctions.html
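
The rounding and the bounds check can be tried out on the CPU. The following is only a sketch in plain C, not real kernel code: `round_up_global` and `simulate_launch` are hypothetical helper names, and the nested loops stand in for the work groups and work-items of a 1D launch with no global offset, where the global id equals group id * local size + local id.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= count. */
static size_t round_up_global(size_t count, size_t local_size)
{
    assert(local_size > 0);
    return local_size * ((count + local_size - 1) / local_size);
}

/* CPU stand-in for a 1D launch over `count` pixels.
   Visits every (group, local) pair, rebuilds the global id, and skips the
   padding work-items that lie beyond count.
   Returns the number of pixels actually written. */
static size_t simulate_launch(float *pixels, size_t count, size_t local_size)
{
    size_t global_size = round_up_global(count, local_size);
    size_t num_groups = global_size / local_size;
    size_t written = 0;

    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            size_t gid = group * local_size + local; /* what get_global_id(0) would return */
            if (gid >= count)
                continue; /* padding work-item: must not touch the buffer */
            pixels[gid] = 1.0f;
            ++written;
        }
    }
    return written;
}
```

For 10 pixels and a work group size of 4, the launch covers 12 work-items, and the check keeps the last 2 from writing past the end of the buffer. In an actual kernel the same test would be an early return when `get_global_id(0)` is not less than the pixel count, which has to be passed in as a kernel argument.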


Thanks Joe.

 

My original problem was that my image was 600x600 and it needed to be a multiple of 256.

 

I changed it to 512x512 and the full image gets rendered.
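
As a quick check of the arithmetic, a small sketch in plain C. The helper name `leftover_items` is made up, and it assumes that only whole groups of 256 work-items get processed, which nothing in this thread confirms for the driver in question.

```c
#include <stddef.h>

/* Number of items left over when count is split into whole groups of
   group_size. Assumption: only complete groups are processed. */
static size_t leftover_items(size_t count, size_t group_size)
{
    return count % group_size;
}
```

With this, 600*600 = 360000 leaves 64 items over for a group size of 256, while 512*512 = 262144 leaves none. 360000 does divide evenly by 64, so a work group size of 64 would also cover the full 600x600 image without resizing it.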

What I don't understand is that every realtime raytracer that I have looked at sets up NDRange like I have.

Another thing I don't understand: I run the same program on the CPU and get 1 frame per second, but I get 30 fps out of the GPU using NULL as the local size. If the CPU has a faster clock speed and setting the local size to NULL makes the GPU run one kernel at a time, how come the GPU is 30 times faster?

 

Thanks again

 

edit:


Never mind, i just read the following in the opencl spec:

 

 

 

local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.