
OpenCL not working on every pixel

This topic is 690 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hi all


I have a simple kernel that sets a pixel color.


I have a screen that is 600*600 pixels. The kernel changes the color of every pixel except for the last 50 or so. Imagine a blue screen with a small 1 pixel wide black line on the top right.


I create my buffer like so:

error = clEnqueueNDRangeKernel(queue, kernel,

And read it like:

cl_float4 *ptr = (cl_float4 *)clEnqueueMapBuffer(queue,
		DATA_SIZE * sizeof(cl_float4),

DATA_SIZE = 600*600


Why is it not applying the kernel to the last 50 or so pixels?


Does NDRANGE have to be a multiple of a certain number or something?


Any help would be great.


Thanks :)



You set a work group size of zero - this should not work at all.

The driver probably raised it to 1, but you get only a fraction of the possible performance this way.


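Work group size = the number of threads (work-items) that run together in one work group. A power of two is the usual choice, and in OpenCL 1.x the global size must be evenly divisible by it.

The hardware runs threads in batches of 64 on AMD (wavefront) and 32 on Nvidia (warp). If you use fewer than that, the remaining threads do nothing.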


// Option A: assuming 64 threads work on one pixel:
size_t count = 600 * 600;                // number of pixels
size_t local_size = 64;                  // work group size
size_t global_size = local_size * count; // total number of threads needed to do all work
clEnqueueNDRangeKernel(queues[0], kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

// Option B (use instead of A): assuming each thread works on only one pixel:
size_t count = 600 * 600;
size_t local_size = 64;
// Round count up to a multiple of local_size. If the division has a remainder,
// the kernel must check that its index stays below count.
size_t global_size = local_size * ((count + local_size - 1) / local_size);
clEnqueueNDRangeKernel(queues[0], kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);


... this is a bit confusing at first, because you need to handle it in the kernel as well to get the expected indexing, see https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/workItemFunctions.html
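To make the rounding and the in-kernel check concrete, here is a minimal sketch in plain C. It is a CPU stand-in and not real OpenCL: the names round_up_global_size, set_pixel_guarded and emulate_launch are hypothetical, and gid only plays the role that get_global_id(0) would have inside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Round count up to the next multiple of local_size. */
static size_t round_up_global_size(size_t count, size_t local_size)
{
    assert(local_size > 0);
    return local_size * ((count + local_size - 1) / local_size);
}

/* Stand-in for a one-work-item-per-pixel kernel body.
   gid plays the role of get_global_id(0). Work-items in the padded
   tail (gid >= count) must not touch the buffer.
   Returns 1 if a pixel was written, 0 otherwise. */
static int set_pixel_guarded(float *pixels, size_t count, size_t gid, float value)
{
    if (gid >= count)
        return 0;
    pixels[gid] = value;
    return 1;
}

/* Emulates the launch by calling the body once per global id.
   Returns how many pixels were actually written. */
static size_t emulate_launch(float *pixels, size_t count, size_t local_size, float value)
{
    size_t global_size = round_up_global_size(count, local_size);
    size_t written = 0;
    for (size_t gid = 0; gid < global_size; ++gid)
        written += (size_t)set_pixel_guarded(pixels, count, gid, value);
    return written;
}
```

For 600*600 pixels a local size of 64 divides evenly, so nothing is padded. With a local size of 256 the global size is rounded up to 360192, and the guard is what keeps the 192 extra work-items from writing past the end of the buffer.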


Thanks Joe.


My original problem was that my image was 600x600 and it needed to be a multiple of 256.


I changed it to 512x512 and the full image gets rendered.
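A quick check of the arithmetic, as a sketch only. It assumes the driver ended up with work groups of 256 and ran only whole groups, which is a guess (nothing here confirms what the driver actually did). The helper name uncovered_items is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of trailing work-items left over when global_size is split
   into whole work groups of local_size. */
static size_t uncovered_items(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return global_size % local_size;
}
```

Under that assumption, uncovered_items(600 * 600, 256) is 64, which is in the same range as the "last 50 or so" pixels that stayed black, while uncovered_items(512 * 512, 256) is 0.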

What I don't understand is that every realtime raytracer that I have looked at sets up NDRange like I have.

Another thing I don't understand is that I run the same program on the CPU and get 1 frame per second, and I get 30 fps out of the GPU using NULL as the local size. If the CPU has a faster clock speed and setting the local size to NULL makes the GPU run one kernel at a time, how come the GPU is 30 times faster?


Thanks again



Never mind, I just read the following in the OpenCL spec:




local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.
Edited by BadEggGames
