OpenCL is very slow compared to the CPU.

Started by
10 comments, last by WhiskyJoe 11 years, 2 months ago

Of course, the problem is the round trip: the data goes from memory across the PCI-e bus, gets processed, then comes back across the PCI-e bus and into memory.

One thing that I believe you're overlooking is that modern processors can also process 4 float operations at once (SIMD), so that probably accounts for quite a bit of it as well. The CPU will be limited by memory bandwidth in this scenario too.

Between those two issues, the CPU could finish the work before the data has even been transferred across the PCI-e bus.
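
To make the "4 float operations at once" point concrete, here is a minimal C sketch of an array add using SSE intrinsics. The function name add_arrays and the element-wise add workload are assumptions for illustration (the thread does not show the CPU code being compared). The SSE path is only compiled where the compiler defines __SSE__; elsewhere the function falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && c != NULL);

#if defined(__SSE__)
    /* Process 4 floats per iteration; unaligned loads/stores are used
       so the caller does not have to align the buffers. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar loop: handles the tail, or everything when SSE is unavailable. */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

For a kernel this cheap, a loop like this is likely to be limited by memory bandwidth rather than arithmetic, which is the same point made above.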

I agree. After I did this:
clGetDeviceInfo(device[0], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &work_group, &ret_work_group);
local_ws = work_group;
global_ws = var_size / work_group;
the CPU was still faster.
But after I changed the function to:
c[iGID] = a[iGID] + sqrt(b[iGID] * b[iGID]);
the GPU was about 3.5 times faster. So it works. Thanks for your help, guys. But I'm wondering about global and local groups. If I have that right, the local group is the number of GPU processors that run the kernel function simultaneously, and global is the count of such passes. Is that right? How should I calculate the global group properly?
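
On the global/local question: in OpenCL, the global work size passed to clEnqueueNDRangeKernel is the total number of work-items (usually one per array element), not the number of passes, and the local work size is the number of work-items in each work-group. The number of groups is then global / local. So dividing var_size by work_group, as in the snippet above, would launch far fewer work-items than there are elements, assuming var_size is the element count. A minimal sketch of the usual rounding follows; the helper names round_up_global_size and work_group_count are hypothetical and not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n_items up to the next multiple of local_size.
   OpenCL 1.x requires the global size to be a multiple of the local size. */
size_t round_up_global_size(size_t n_items, size_t local_size)
{
    size_t remainder;
    assert(local_size > 0);
    remainder = n_items % local_size;
    return remainder == 0 ? n_items : n_items + (local_size - remainder);
}

/* Hypothetical helper: number of work-groups that will be launched. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return global_size / local_size;
}

/* Possible use with the variables from the post (not compiled here):
 *   local_ws  = work_group;
 *   global_ws = round_up_global_size(var_size, local_ws);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &global_ws, &local_ws, 0, NULL, NULL);
 * where queue and kernel are assumed to have been created earlier.
 */
```

With 1000 elements and a local size of 256, this gives a global size of 1024 (4 groups), so the kernel would need a guard such as if (iGID < n) to skip the 24 padding work-items. Also note that CL_DEVICE_MAX_WORK_GROUP_SIZE is the device's upper limit; clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE reports the limit for a specific kernel, which can be lower.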

Take a look at this: http://3dgep.com/?p=2192

It's an introduction to OpenCL and explains some of the concepts. The other CUDA-related articles might also be of use for understanding how the GPU "works".

This topic is closed to new replies.
