OpenCL beginner

Started by
6 comments, last by Yours3!f 12 years, 6 months ago
Hi,

I'm trying to get a little bit of experience with OpenCL. I've successfully written a kernel and it works, but I still have some questions. I've watched and read many tutorials, but some things are still not clear.

So here's the kernel; it's supposed to add two arrays of numbers element by element and output the sums:

__kernel void add(__global float* a, __global float* b, __global float* result)
{
int id = get_global_id(0);
result[id] = a[id] + b[id];
}


and here's how I use it:

const int size = 512;
float* a = new float[size];
float* b = new float[size];
float* result = new float[size];

for(int c = 0; c < size; c++)
{
a[c] = (float)(c * 2); //values 0, 2, 4, 6, 8, ... 1022
b[c] = (float)((size - c) * 2); //values 1024, 1022, 1020, ... 2
}

cl_int error;
const int mem_size = sizeof(float) * size;
cl_mem a_buffer = clCreateBuffer(objs::get()->the_compute_context.the_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, mem_size, a, &error);
assert(error == CL_SUCCESS);
cl_mem b_buffer = clCreateBuffer(objs::get()->the_compute_context.the_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, mem_size, b, &error);
assert(error == CL_SUCCESS);
cl_mem result_buffer = clCreateBuffer(objs::get()->the_compute_context.the_context, CL_MEM_WRITE_ONLY, mem_size, NULL, &error);
assert(error == CL_SUCCESS);

the_render.compute.set_kernel_arg("add", 0, sizeof(cl_mem), &a_buffer);
the_render.compute.set_kernel_arg("add", 1, sizeof(cl_mem), &b_buffer);
the_render.compute.set_kernel_arg("add", 2, sizeof(cl_mem), &result_buffer);

const size_t local_ws = 64;
const size_t global_ws = 512;
the_render.compute.execute("add", objs::get()->the_compute_context.the_queue, 1, &global_ws, &local_ws);

the_render.compute.read_back(objs::get()->the_compute_context.the_queue, result_buffer, CL_TRUE, 0, mem_size, result);

for(int c = 0; c < size; c++)
{
std::cout << result[c] << "\n";
}


So what happens: when I set size to 512 (as now) it outputs "1024" 512 times, which is the expected behaviour. When I change size to 513 it should output "1026" (2 * size) 513 times, right? But it doesn't: it outputs some random value... In addition, I can't set global_ws (the global work size) to more than 512, because that makes my app crash. Why does that happen?

Best regards,
Yours3!f
BUMP :)

anyone???
Did you notice that this is an OpenGL forum?

So don't wonder why nobody answers your question. It is purely OpenCL related. OpenCL and OpenGL are totally different APIs: OpenCL is a compute API, while OpenGL is a graphics API. At first glance the difference is just one letter, but they target completely different applications.





Yes I did, I just didn't know where else to post it, so I posted it here... There's no OpenCL forum here. Maybe we need one.
OpenCL can be used for graphics, see http://www.idevgames.com/forums/thread-8649.html
In fact this is why I'm trying to learn it.






Here I come.

"It outputs some random value...": you see values you did not expect because your kernel was not even executed. Why?

Because OpenCL always validates the launch parameters before sending work to the GPU. Which parameters must be valid?

Check your variables "global_ws" (when you put 513 there) and "local_ws".

First, global_ws % local_ws must be 0.

Second, according to the OpenCL limits, local_ws must be greater than 0 and less than or equal to the value returned by clGetDeviceInfo with the parameter CL_DEVICE_MAX_WORK_ITEM_SIZES (the element for the dimension index you are using).

There are other rules, but you don't have a problem with them yet.

This prints the local work size limits for each dimension:



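#include <cstdio>
#include <CL/cl.h> // <OpenCL/opencl.h> on macOS

cl_uint dim = 0; // CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS is returned as a cl_uint
size_t lws[256] = {0}, s = 0; // 256 entries is more than enough

// device_id is the cl_device_id you are running on
::clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(dim), &dim, &s);
::clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(lws), lws, &s);
for(cl_uint i = 0; i < dim; i++)
printf("local work size [%u] : %zu\n", i, lws[i]); // print the element, not the array pointer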




In your code you are using dimension index 0.

When any of the rules above is violated, OpenCL always returns one of these error codes:

CL_INVALID_WORK_DIMENSION

CL_INVALID_WORK_GROUP_SIZE

CL_INVALID_WORK_ITEM_SIZE




When you said "crashes application", what kind of crash was it? (For example "assertion failed" with an error code not equal to CL_SUCCESS?)

If you set local_ws to 1 it should work anyway, but the computation would probably be slower than on the CPU. (That depends on which graphics card you are using... On my PC it is slower.)





OpenCL Specification: (clEnqueueNDRangeKernel)

arg. local_work_size can also be NULL in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.

end.

So, if you set the local_work_size argument of clEnqueueNDRangeKernel (you are using this function, aren't you?) to NULL, you don't need to worry about how to divide the global work into work-groups (OpenCL will do it).




A question in this topic was "what is the difference between OpenCL and OpenGL?":

It's easier to ask "what do OpenCL and OpenGL have in common?": the GPU and its memory.

Using OpenCL together with OpenGL you can:

Access the memory of VBOs and process them with an OpenCL kernel (see clCreateFromGLBuffer)

Do the same with textures and renderbuffers (see clCreateFromGLTexture2D, clCreateFromGLTexture3D, clCreateFromGLRenderbuffer)




I have successfully integrated OpenCL into my engine and worked with large amounts of data processed by the GPU. OpenCL is worth knowing. Good luck with your work, guys!

Best wishes, FXACE.
Thank you for replying, this is what I've been looking for :)

I implemented all the checks you advised.

FXACE wrote: When you said "crashes application" what kind of crash was it? (Like "assertion failed" with an error code which is not equal to CL_SUCCESS?)
It was a SIGSEGV... and as it turned out I have some driver limitations on how large my global_ws can be. It cannot be greater than 512, and this is a hardware limitation. So I dug deeper and found that I need to loop over the data and execute with offsets. This way, for a data set of 1024 elements I have to iterate twice.

FXACE wrote: If you would set local_ws to 1 it should work any way but speed of computation always would (slower?) than CPU. (In this case what graphics card you are using... On my pc - slower)
Well, I tried with size = 2^23 and local_ws = 1, and it was 100 ms slower than with local_ws = NULL. I have an AMD Radeon HD 5670 (I only had like $75 for a GPU lol)


FXACE wrote: So, If you would set local_work_size arg of clEnqueueNDRangeKernel (you are using this function aren't you?) to NULL you don't need to worry about how to divide global work to elements (OpenCL will do it).
Yes, I do use clEnqueueNDRangeKernel; in fact execute() is clEnqueueNDRangeKernel plus some checking.
That's a good idea :) Is there a way to do this with global_ws, or to get global_ws with clGetDeviceInfo?
 



No, there is no similar option for global_work_size, because OpenCL cannot know whether you want the GPU to run the kernel over all elements of the array or only over part of it.

Best wishes, FXACE.



ok, thanks!
