Differences between OpenMP, OpenACC, OpenCL, SIMD, and MIMD

7 comments, last by JoeJ 5 years, 10 months ago

I asked this same question on Stack Overflow. The comments on that question contain some useful information.

 

What are the differences between OpenMP, OpenACC, OpenCL, SIMD, and MIMD? Also, which cases is each one best suited for?

What I currently know:

  • OpenCL and CUDA are for GPU programming. They take advantage of the fact that GPUs have a lot of cores.
  • CUDA is proprietary to NVIDIA and only works on its GPUs, whilst OpenCL is multiplatform.
  • OpenMP is for CPU wise parallelism.
  • OpenACC also seems to be for CPU task parallelism.
  • SIMD is for executing an operation on multiple cores (CPU only?).
  • MIMD seems to be for executing multiple operations on multiple cores (CPU only?).

What I aim to learn here is which libraries are best suited for optimizing an algorithm on the CPU & GPU. Hopefully, I would like to use only one library to do both.

Hide yo cheese! Hide yo wife!

1 hour ago, thecheeselover said:

SIMD is for executing an operation on multiple cores (CPU only?).

This is wrong. SIMD on a CPU means executing a single instruction on a single core, but on a wide register that holds multiple numbers, e.g. 4 or 8. This has nothing to do with multiple cores, which can of course execute multiple independent programs, each possibly using different SIMD instructions.
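
To make the "one instruction, one core, several numbers" idea concrete, here is a minimal C++ sketch. It assumes an x86 CPU with SSE (128-bit registers holding 4 floats) and falls back to a plain loop elsewhere. The function name `add_arrays` and the macro `SKETCH_HAVE_SSE` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics: 128-bit registers holding 4 floats
#define SKETCH_HAVE_SSE 1  // hypothetical macro, only used in this sketch
#endif

// Adds two float arrays element by element, all on ONE core.
std::vector<float> add_arrays(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE
    // Each iteration issues a single add instruction that processes 4 floats.
    for (; i + 4 <= a.size(); i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);  // load 4 floats (unaligned is fine)
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&out[i], _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or everything when SSE is unavailable.
    for (; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}
```

No threads are involved here. Running this on several cores at once would be a separate step layered on top.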

1 hour ago, thecheeselover said:

OpenCL and CUDA are for GPU programming.

OpenCL works on CPU as well, CUDA does not. OpenCL on CPU will likely utilize SIMD instructions AND multiple cores. (But in practice you use it mainly for GPUs, because on CPU you can do this simply with regular programming, which is more comfortable.)

1 hour ago, thecheeselover said:

OpenMP is for CPU wise parallelism.

Yes, I use it for some speedups with very little effort. Writing the multithreading myself, e.g. with a job system, performs much better but is more work.
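
As a rough sketch of the "very little effort" part: with OpenMP, a single pragma on an ordinary loop is often enough. The function name `parallel_dot` is made up for this example, and it assumes the compiler is invoked with OpenMP enabled (e.g. `-fopenmp` on GCC/Clang). Without that flag the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Dot product of two vectors. With OpenMP enabled, the loop iterations are
// split across CPU threads and the partial sums are combined at the end.
double parallel_dot(const std::vector<double>& a,
                    const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    const long long n = static_cast<long long>(a.size());
    // reduction(+:sum) gives each thread a private sum and adds them up after.
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The threaded result can differ from the serial one by floating-point rounding, since the additions happen in a different order.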

 

Never heard of OpenACC or MIMD (the latter probably means multiple instruction, multiple data, so regular CPU multithreading, but GPUs can do this as well to some degree). It's clear you are interested in parallel programming, but you are mixing too many things, so there is no short but good answer or advice I can give you.
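
If MIMD is read as "regular CPU multithreading", a minimal sketch could look like the following: two threads run different instructions on different data at the same time. The names `Results` and `sum_and_max` are hypothetical. The sketch uses only the C++ standard library, which on some toolchains needs `-pthread` at link time.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

struct Results {
    long long sum = 0;  // written only by the first thread
    int maximum = 0;    // written only by the second thread
};

// Two different operations on two different inputs, running concurrently.
Results sum_and_max(const std::vector<int>& a, const std::vector<int>& b) {
    assert(!b.empty());
    Results r;
    std::thread t1([&] { r.sum = std::accumulate(a.begin(), a.end(), 0LL); });
    std::thread t2([&] { r.maximum = *std::max_element(b.begin(), b.end()); });
    t1.join();  // wait for both threads before reading the results
    t2.join();
    return r;
}
```

Each thread writes to its own field, so no locking is needed in this small case.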

1 hour ago, thecheeselover said:

Hopefully, I would like to use only one library to do both.

For CPU you don't need a library. For GPU, notice that OpenCL 1.x offers about the same functionality as Compute Shaders in DirectX, OpenGL and Vulkan. OpenCL 2.0 has functionality not available there, but Nvidia does not support it on consumer hardware. This makes Compute Shaders a good option if you do graphics as well. For me Vulkan is almost 2 times faster than OpenCL, but OpenCL is MUCH easier to use. For beginners, I would recommend OpenCL to learn GPGPU. For games Compute Shaders are the way to go, but not necessarily the best way to learn.

On 6/10/2018 at 10:45 PM, JoeJ said:

For me Vulkan is almost 2 times faster than OpenCL

Why? Do you have a test or other information?

3DGraphics,Direct3D12,Vulkan,OpenCL,Algorithms

13 minutes ago, Andrey OGL_D3D said:

Why? Do you have a test or other information?

I have a very big project covering different algorithms, so this is an average factor, not a coincidence.

I've measured this on AMD GPU mainly, but on NV it is similar. There are 3 reasons:

1. Vulkan compiler is better, maybe 15%. (CL is initially mostly faster, but after optimizing, VK wins. Exceptions are rare.)

2. Prerecorded command buffers. In VK you can record the whole program flow (invoking programs, with memory barriers in between) to a GPU command buffer, and per frame you only say 'run this command buffer'. So there is almost no CPU<->GPU communication necessary.

3. Indirect Dispatch. With VK you can set the workload for a later shader from the current shader. With OpenCL you need to download the result of the current shader to the CPU, and set the workload for the later shader from there (a real performance killer).

 

So the combination of points 2 and 3 allows doing complex work without any CPU<->GPU communication. This means you often have recorded dispatches that turn out to have zero work at runtime. This has a cost (the memory barriers are still executed), but overall it explains the big speedup I see. (I think I have about 100 dispatches.)

This is all about OpenCL 1.x (!) I assume CL 2.0 has similar / better things as well, but i've never used it. 

If you consider porting to Vulkan, the work is mainly on the host side. The Vulkan API results in 10 times more code than OpenCL.

For the shaders I use a C preprocessor to make my CL code look like GLSL. Basically I only have to write function headers twice, and there is a bit of clutter with #ifdefs, but it's no problem to work with. (I also solve things like missing pointers in GLSL with #defines, no downsides there.)
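
A rough idea of what such a preprocessor layer could look like is sketched below. This is not the actual setup described above: the switches `TARGET_GLSL` and `TARGET_OPENCL`, the macros `GLOBAL_ID_X` and `LOCAL_MEM`, and all function names are hypothetical. Only `gl_GlobalInvocationID`, `shared`, `get_global_id` and `__local` are real GLSL/OpenCL names. The third branch is a host fallback added so the shared code also compiles as plain C++.

```cpp
// Hypothetical target switches; a build script would define at most one.
#if defined(TARGET_GLSL)
    #define GLOBAL_ID_X  int(gl_GlobalInvocationID.x)
    #define LOCAL_MEM    shared
#elif defined(TARGET_OPENCL)
    #define GLOBAL_ID_X  ((int)get_global_id(0))
    #define LOCAL_MEM    __local
#else
    // Host fallback (an assumption of this sketch) for testing on the CPU.
    #include <cassert>
    static int g_fake_global_id = 0;
    #define GLOBAL_ID_X  (g_fake_global_id)
    #define LOCAL_MEM    static
    void set_fake_global_id(int id) {
        assert(id >= 0);
        g_fake_global_id = id;
    }
#endif

// Shared code: written once, valid in GLSL, OpenCL C and C++.
float scale_and_offset(float value, float scale, float offset) {
    return value * scale + offset;
}

// Index of the element this work item handles, for a given stride.
int element_index(int stride) {
    return GLOBAL_ID_X * stride;
}
```

The kernel entry points would still be written once per target, since the way buffers are declared and bound differs between the languages.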

 

JoeJ, thank you for the detailed information. From my point of view, the biggest performance loss when using OpenCL comes from CPU/GPU communication, and also from sharing graphics resources with OpenCL. Am I right?

What do you think about the OpenCL future? for example Khronos has plans with unique Vulkan and OpenCL interoperability.

I am trying to use OpenCL in my 3D engine, but I see reduced performance when using OpenCL for frustum culling, indirect drawing, and particle system simulation.

nVidia has some OpenCL Driver bugs: https://devtalk.nvidia.com/default/topic/1035913/cuda-programming-and-performance/clcreatefromd3d11buffernv-returns-cl_invalid_d3d11_resource_nv-for-id3d11buffer-with-d3d11_resource_-/

But some projects still use OpenCL: Bullet Physics, AMD Radeon Rays, AMD Pro Render

 


1 hour ago, Andrey OGL_D3D said:

JoeJ, thank you for the detailed information. From my point of view, the biggest performance loss when using OpenCL comes from CPU/GPU communication, and also from sharing graphics resources with OpenCL. Am I right?

Yes for the communication, but what do you mean by sharing graphics resources?

1 hour ago, Andrey OGL_D3D said:

What do you think about the OpenCL future? for example Khronos has plans with unique Vulkan and OpenCL interoperability.

I hope this includes CL 2.0 features. I really want kernels to enqueue other kernels.

But even support for CL 1.x would be nice. The one problem with GLSL is that, due to the lack of pointers, you cannot cast reserved LDS to a different type. (Also C is just nicer than those shading languages.)

So I think fusing CL into VK is a good thing and I look forward to it. For some things (e.g. a preprocessing tool) I'd stick to CL as is because of its ease of use. CL should not be completely abandoned, like Apple already did.

1 hour ago, Andrey OGL_D3D said:

I am trying to use OpenCL in my 3D engine, but I see reduced performance when using OpenCL for frustum culling, indirect drawing, and particle system simulation.

Yeah, compute shaders are better there, as you can run them async with rendering. On the other hand, personally i found CL faster than OpenGL compute (AMD 1.1 and NV 2.0 times faster! Even though GL has indirect dispatch.) This dates back 5 years i guess. Not sure if this is still the case. I assume your slowdown comes from CL <-> graphics interop, but i have no experience myself with this. VK and DX12 really are the cure for all that mess.

2 hours ago, Andrey OGL_D3D said:

But some projects still use OpenCL: Bullet Physics, AMD Radeon Rays, AMD Pro Render

Makes sense to me, I'd do the same. Rays is a tool, and Bullet can't know which API its users are using. (And GPU physics is just an experiment without practical sense anyway, IMHO.)

Personally I think there is no more place for CL in game runtimes, but I really do not understand all these thoughts that it will eventually be discontinued. There is no replacement for it. VK/DX12 is too involved, CUDA is NV exclusive, so what should people use for GPGPU then? Pretty sure it will stay. (If only NV would support 2.0 - they do on Quadro and Tesla GPUs!)

51 minutes ago, JoeJ said:

but what do you mean by sharing graphics resources?

OpenGL/Direct3D can share buffers and textures with OpenCL using clCreateFromD3D11BufferKHR/clCreateFromGLBuffer and clCreateFromGLTexture/clCreateFromD3D11TextureXDKHR.

I think that any call to clEnqueueAcquireGLObjects/clEnqueueReleaseGLObjects or clEnqueueAcquireD3D11ObjectsKHR/clEnqueueReleaseD3D11ObjectsKHR can decrease performance, but I don't have complete information about it. I will try using notifications via clCreateUserEvent/clSetEventCallback instead of clWaitForEvents. We may also need to find out where the performance is lost by profiling with CL_QUEUE_PROFILING_ENABLE. What do you think?

1 hour ago, JoeJ said:

Yeah, compute shaders are better there, as you can run them async with rendering. 

But then we need separate code (GLSL/HLSL compute shaders) for each API if we need to support OpenGL/Direct3D11. SPIR-V can help to share common code in this case.

1 hour ago, JoeJ said:

On the other hand, personally i found CL faster than OpenGL compute (AMD 1.1 and NV 2.0 times faster! Even though GL has indirect dispatch.) This dates back 5 years i guess.

Have you done it without CL<->GL/D3D interop ?

So, in some cases OpenCL can be faster than VK/GL/D3D compute shaders in case where we needn't share resources ?

Thanks for your answers, they will help me support GPU compute.


35 minutes ago, Andrey OGL_D3D said:

Have you done it without CL<->GL/D3D interop ?

So, in some cases OpenCL can be faster than VK/GL/D3D compute shaders in case where we needn't share resources ?

Yes, I have never used interop. I used graphics just to visualize results and did not care about graphics performance back then.

So I cannot comment on interop, but people always complained about it.

Profiling compute alone revealed that CL was faster on every vendor, and I was really surprised by NV's bad GL performance, having assumed they would care less about CL. But that was at a time when compute was pretty new, so I guess it's better now.

However, it might be worth implementing some of your compute stuff in DX11/GL and comparing performance. I assume the differences are still larger than they should be.

This topic is closed to new replies.
