
OpenCL is very slow compared to CPU.


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
12 replies to this topic

#1 _Flame1_   Members   -  Reputation: 99

0 Likes

Posted 31 January 2013 - 08:48 AM

Hello. I've created my first program with opencl.

__kernel void vector_add_gpu(__global const float* a,
                             __global const float* b,
                             __global float* c,
                             int iNumElements)
{
    // get index into global data array
    int iGID = get_global_id(0);
    // bound check (equivalent to the limit on a 'for' loop in standard/serial C code)
    if (iGID < iNumElements)
    {
        // add the vector elements
        c[iGID] = a[iGID] + b[iGID];
    }
}
I have a fairly big buffer of numbers (about 240 MB). OpenCL takes about 5 times longer than a plain CPU loop. Is that okay, or is something wrong? With a more complicated function (c[iGID] = a[iGID] + sqrt(b[iGID] * b[iGID]);) the difference is much bigger (about 150 times). :) Thank you.

 

P.S. Sorry, my previous case was wrong; I forgot to put the OpenCL file in the folder. :)




#2 samoth   Crossbones+   -  Reputation: 5039

4 Likes

Posted 31 January 2013 - 08:55 AM

Two possible reasons:
  • Your OpenCL kernel actually runs on the CPU (you didn't say which implementation you're using).
  • Your OpenCL kernel runs on a GPU, but the runtime is absolutely dominated by PCIe transfer latency, not execution speed.
Note that adding together two values on a GPU is a ridiculously small amount of work compared to PCIe bandwidth (or even GPU memory bandwidth). It is therefore not surprising that any measurements you make turn out "kind of strange".

Also, launching a kernel and synchronizing for the result isn't completely "free" either.

Try again with a much more complicated kernel, and you'll likely see a much bigger (50-100 times) difference.

Edited by samoth, 31 January 2013 - 08:56 AM.


#3 _Flame_   Members   -  Reputation: 113

0 Likes

Posted 31 January 2013 - 02:53 PM

Sorry guys, but OpenCL is extremely slow compared to the CPU in my case. It can't be explained by memory bandwidth alone. The video card is a GF 6800 with PCI Express 3.0. And no matter how much data I calculate, the CPU is faster anyway. I don't think the kernel runs on the CPU, since I chose CL_DEVICE_TYPE_GPU. In any case, CPU emulation couldn't be 5 times slower than the CPU itself. :) As I said before, a more complicated function only makes the difference bigger.

You can see code here - http://pastebin.com/M3kjrLtM

Edited by _Flame_, 31 January 2013 - 02:59 PM.


#4 mhagain   Crossbones+   -  Reputation: 8286

1 Like

Posted 31 January 2013 - 03:43 PM

gf 6800

 

It was a great card in its time, but as of today it's almost 9 years old, and it never supported any GPGPU stuff in hardware to begin with.


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#5 phantom   Moderators   -  Reputation: 7597

1 Like

Posted 31 January 2013 - 03:53 PM

I'm not convinced you are using the GPU how you think you are.

While you are telling OpenCL to launch 0x4000000 threads, you are also telling it that each work group consists of a single thread. That wastes a vast amount of GPU resources: it will launch 0x4000000 warps or wavefronts but use only one thread out of the 'preferred_work_group_size_multiple' threads in each one.

If you run 'clinfo' in a cmd window, it should report the preferred work group size multiple; set the local_ws value to that and you should end up using all the GPU threads instead of just one in each warp/wavefront launched.
(For example, on my card this value is 64, so a work group of fewer than 64 threads is going to waste resources.)

Edited by phantom, 31 January 2013 - 03:53 PM.


#6 Ravyne   GDNet+   -  Reputation: 8188

3 Likes

Posted 31 January 2013 - 04:04 PM

It was a great card in its time, but as of today it's almost 9 years old, and it never supported any GPGPU stuff in hardware to begin with.

 

Maybe he meant a 680? That's the current top of the line.

 

However, it sounds like the OP probably isn't using the thing right -- you can't just spawn a billion threads and expect it to run faster.

 

OP, you first have to have a reasonable understanding that your problem is suitable for OpenCL or other similar libraries. In most cases, you also can't just throw a best-in-class serial algorithm at a GPU and expect it to speed up. To get real gains you very often have to tailor-make a new algorithm that's suitable for parallel execution, and which can take advantage of the other resources that a modern GPU shares across groups of threads, but not equally with all threads. GPU is an entirely different world of performance expectations and trade-offs.

 

It's possible for CPU algorithms to be faster for some problems, but the fact that the performance disparity gets worse and worse as the problem size grows tends to indicate that something is fundamentally wrong with your approach.



#7 Geometrian   Crossbones+   -  Reputation: 1603

0 Likes

Posted 31 January 2013 - 05:24 PM

OP, you first have to have a reasonable understanding that your problem is suitable for OpenCL or other similar libraries. In most cases, you also can't just throw a best-in-class serial algorithm at a GPU and expect it to speed up. To get real gains you very often have to tailor-make a new algorithm that's suitable for parallel execution, and which can take advantage of the other resources that a modern GPU shares across groups of threads, but not equally with all threads. GPU is an entirely different world of performance expectations and trade-offs.

Excellent point. OpenCL is geared toward repeated, heavy, parallel calculations. I am almost certain that you are bus-bound. Sure, you can add two numbers in parallel on the GPU, but you first have to get each one there from the CPU (kind of). So I would never expect any performance improvement from the trivial addition kernel you have, no matter how large your data is.

Try a more complicated calculation: e.g., for each x, calculate x! or x!!. You'll get thousands or billions of multiply operations, depending on the type (integers overflow, so huge factorials won't run indefinitely). This will outweigh your data transfer cost by a lot, and you'll definitely see performance gains.


And a Unix user said rm -rf *.* and all was null and void...|There's no place like 127.0.0.1|The Application "Programmer" has unexpectedly quit. An error of type A.M. has occurred.

#8 Bacterius   Crossbones+   -  Reputation: 9306

4 Likes

Posted 31 January 2013 - 05:57 PM

It's not possible to explain it through just memory bandwidth.

You would be surprised.


The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#9 _Flame_   Members   -  Reputation: 113

0 Likes

Posted 31 January 2013 - 05:58 PM

Maybe he meant 680?

Yes. :)

While you are telling OpenCL to launch 0x4000000 threads you are telling it that each work group consists of a single thread, which means you are wasting a vast amount of GPU resources as it will be launching 'preferred_work_group_size_multiple * 0x4000000' warps or wave fronts but only using one thread in each one.

It's a real problem. I don't understand what local and global groups are, or how I should choose their values for the best performance. I've tried setting the local group to different values, but I got an error.

Edited by _Flame_, 31 January 2013 - 06:07 PM.


#10 Bacterius   Crossbones+   -  Reputation: 9306

1 Like

Posted 31 January 2013 - 06:09 PM

It's a real problem. I don't understand what local and global groups are, or how I should choose such values for best performance. I've tried to set the local group to a value different from zero, but I got an error.

You should query your device for CL_DEVICE_MAX_WORK_GROUP_SIZE using clGetDeviceInfo, and use the returned value as the local work group size. Also make sure your global work size is a multiple of the local work group size (which will probably end up being some power of two between 64 and 1024). For best performance you would let the OpenCL compiler work out the optimal work group size by analyzing the kernel, but this should be plenty for now.

 

But still, as said above, just adding two numbers in a kernel isn't enough work for the GPU to show any performance advantage. You'll spend most of your time transferring buffers across the PCIe bus, and your GPU will even be constrained by its own memory bandwidth (which is saying something, as your GTX 680 has a ludicrous global memory bandwidth of about 200 GB/s).




#11 jeff8j   Members   -  Reputation: 780

0 Likes

Posted 31 January 2013 - 09:26 PM

Of course the problem is the trip from memory down the PCIe bus, then the processing, then back up the PCIe bus and into memory.

 

One thing I believe you're overlooking is that modern processors can also process 4 float operations at once, so that probably accounts for quite a bit of it as well. The CPU will be limited by memory bandwidth in this scenario too.

 

Between those two issues, the CPU could finish the job before the data even gets across the PCIe bus.


Firefox youtube video and audio downloader MP3 MP4 OGG WEBM

https://addons.mozilla.org/en-US/firefox/addon/simple-youtube-converter/


#12 _Flame1_   Members   -  Reputation: 99

0 Likes

Posted 01 February 2013 - 08:29 AM

I agree. After I did this:
clGetDeviceInfo(device[0], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(size_t), &work_group, &ret_work_group);
local_ws = work_group;
global_ws = var_size / work_group;
the CPU was still faster.
But after I changed the function to:
c[iGID] = a[iGID] + sqrt(b[iGID] * b[iGID]);
the GPU was about 3.5 times faster. So it works. Thanks for your help, guys. But I'm still wondering about global and local groups. If I have it right, the local group is the number of GPU processors that run the kernel function simultaneously, and the global size is the number of such passes. Is that right? How should I calculate the global size properly?

Edited by _Flame1_, 01 February 2013 - 08:54 AM.


#13 Rld_   Members   -  Reputation: 1527

1 Like

Posted 06 February 2013 - 05:47 AM

Take a look at this: http://3dgep.com/?p=2192

 

It's an introduction to OpenCL and explains some of the concepts; the other CUDA-related articles might also be of use for understanding how the GPU "works".





