OpenCL Driver/Runtime


So I have isolated some portions of an application that lend themselves to being parallelized. I already have some multi-threading in place that spins off threads to distribute these "tasks" over the available CPU cores.

Now I am interested in (optionally) delegating some of these tasks to a second GPU and would like to use OpenCL to do so. After an (albeit tiny) bit of research, it seems that detecting and initializing a GPU as a compute device is anything but straightforward. The OpenCL driver situation in particular is rather confusing to me: it seems that all the major vendors supply OpenCL drivers/runtimes that need to be shipped and installed with my application? Because of this I am also wondering whether these drivers are specific to each vendor's hardware, and whether I'd have to ship all possible drivers with the application and then choose the "right" one to install depending on the client hardware?

Anyone have any experience with this or a reference to an article/blog that gives an overview of this situation?


For me, OpenCL turned out to be a big letdown although it looked really cool and promising at first.

About the driver: this part is very simple. The user already installed OpenCL when installing the driver for their graphics card (without even knowing!). That driver includes a vendor-specific component and the stub DLL that you use. Nothing to do for you, nothing to distribute or install.

If no OpenCL has been installed on the user's machine (a 10-year-old graphics card?), there's nothing you can do about it.

In simple terms, you either link against opencl.lib (using the opencl.dll that is already present) or load the DLL/.so dynamically (I prefer that, having had trouble linking directly, and dynamic loading isn't very hard), and this stub forwards your calls to the "secret" implementation of the platform/device combo that you use. Your work is basically the same as with OpenGL when using 2/3/4 functionality or extensions.
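To illustrate the dynamic-loading route, here is a minimal sketch for Windows. It assumes the stock Khronos CL/cl.h header is available at compile time for the types; the typedef name is mine, error handling is trimmed, and clGetPlatformIDs stands in for any entry point you'd pull this way.

/* Minimal sketch: load OpenCL.dll at runtime instead of linking
   against opencl.lib. Assumes the Khronos CL/cl.h header for the
   types; error handling trimmed for brevity. */
#include <windows.h>
#include <stdio.h>
#include <CL/cl.h>

typedef cl_int (CL_API_CALL *PFN_clGetPlatformIDs)(
    cl_uint, cl_platform_id *, cl_uint *);

int main(void)
{
    HMODULE lib = LoadLibraryA("OpenCL.dll");
    if (!lib) {
        printf("No OpenCL runtime installed - fall back to the CPU path.\n");
        return 1;
    }

    PFN_clGetPlatformIDs pGetPlatformIDs =
        (PFN_clGetPlatformIDs)GetProcAddress(lib, "clGetPlatformIDs");

    cl_uint numPlatforms = 0;
    if (pGetPlatformIDs &&
        pGetPlatformIDs(0, NULL, &numPlatforms) == CL_SUCCESS)
        printf("%u OpenCL platform(s) found.\n", numPlatforms);

    FreeLibrary(lib);
    return 0;
}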

You basically need to write a small GLEW for CL. Searching the internet for "OpenCL ICD loader" gave me a BSD-licensed library for CL 1.0 on Apple when I tried a year or two ago; it only needed some minor fixups to work with Windows, and I had to add a few tidbits for CL 1.1 (which is like 2 minutes of work once you have the skeleton!).

So far so good. Now comes the nasty part. Identifying the "correct" device to use isn't easy or straightforward. OpenCL is maximally flexible, maximally portable, maximally heterogeneous and whatnot, and this is maximally shit. There is no single good way to choose the "correct" thing.

The only approach that reliably works (works at all, or works without an explicit round trip) for the "usual" use case, where you want to consume the output in some kind of rendering, is creating a compatible CL context that lives on the same device as an existing GL context. For this you need an extension (which is effectively omnipresent, but in theory it could still be missing... and what do you do then?), and despite all the "portability" this requires platform-specific code, grrr...
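Roughly, the WGL flavour of that extension (cl_khr_gl_sharing) looks like the sketch below. The helper name is mine; it assumes a GL context is current on the calling thread and a CL 1.2 runtime (on 1.1 you'd fetch the function via clGetExtensionFunctionAddress instead), with error handling trimmed.

/* Sketch: create a CL context on the same device as the current
   GL context (Windows/WGL path, cl_khr_gl_sharing). Assumes a GL
   context is current on this thread; error handling trimmed. */
#include <windows.h>
#include <CL/cl.h>
#include <CL/cl_gl.h>

cl_context create_shared_context(cl_platform_id platform)
{
    cl_context_properties props[] = {
        CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
        CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0
    };

    /* Ask the extension which CL device drives the GL context. */
    clGetGLContextInfoKHR_fn pGetGLContextInfo =
        (clGetGLContextInfoKHR_fn)clGetExtensionFunctionAddressForPlatform(
            platform, "clGetGLContextInfoKHR");
    if (!pGetGLContextInfo)
        return NULL; /* extension missing: fall back to plain CL or CPU */

    cl_device_id device = NULL;
    pGetGLContextInfo(props, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                      sizeof(device), &device, NULL);

    cl_int err = CL_SUCCESS;
    return clCreateContext(props, 1, &device, NULL, NULL, &err);
}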

Now of course, you might not want a context that lives on the same device, but instead want to use another device (you've explicitly said so, too). If you have two GPUs, it makes sense, for example, to use one for graphics and one for physics. And it "just works", right?

Sadly, this isn't well supported, or supported at all. You must do some manual copying back and forth through the host to make it work (which may be slower than doing the work on the CPU or on the main GPU), even though common sense tells you "hey, I have SLI/Crossfire, the driver could do this an order of magnitude faster and easier, without me even knowing". The round trip ends up looking something like the sketch below. Maybe there is a way to get this working, but I'm not aware of it. In my experience, everything except "create CL context from GL context" sucks big time.
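For concreteness, a sketch of that manual hop, assuming two command queues on two separate contexts and pre-created buffers; the function name and the blocking transfers are just for illustration.

/* Sketch: move a result from the compute GPU to the render GPU by
   staging it through host memory. Blocking calls for clarity. */
#include <stdlib.h>
#include <CL/cl.h>

void copy_via_host(cl_command_queue computeQueue, cl_mem srcOnGpu2,
                   cl_command_queue renderQueue,  cl_mem dstOnGpu1,
                   size_t size)
{
    void *staging = malloc(size);

    /* Device 2 -> host ... */
    clEnqueueReadBuffer(computeQueue, srcOnGpu2, CL_TRUE, 0, size,
                        staging, 0, NULL, NULL);
    /* ... host -> device 1. This is the hop you'd hope SLI/Crossfire
       could short-circuit, but it has to cross PCIe twice. */
    clEnqueueWriteBuffer(renderQueue, dstOnGpu1, CL_TRUE, 0, size,
                         staging, 0, NULL, NULL);

    free(staging);
}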

Other than the "create from GL" approach, you can enumerate platforms and devices and choose whatever you want, but if you search the internet, you'll be surprised to find that hardly anyone does anything but pick the first platform and the first device that come up. You wonder why? Because that's the only approach that isn't totally convoluted and that actually works fine. You can easily write 50-100 lines of code just to figure out which device to create a context for, and what you end up with may not be the best choice at all.
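That ubiquitous "first GPU wins" heuristic boils down to something like this (helper name mine, error handling trimmed; a serious picker would also weigh vendor, memory size, and so on):

/* Sketch: enumerate platforms and take the first GPU device found. */
#include <stdio.h>
#include <CL/cl.h>

cl_device_id pick_first_gpu(void)
{
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(8, platforms, &numPlatforms);

    for (cl_uint i = 0; i < numPlatforms; ++i) {
        cl_device_id device = NULL;
        cl_uint numDevices = 0;
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU,
                           1, &device, &numDevices) == CL_SUCCESS
            && numDevices > 0) {
            char name[256] = {0};
            clGetDeviceInfo(device, CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            printf("Using: %s\n", name);
            return device;
        }
    }
    return NULL; /* no GPU device on any platform */
}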

Oh wow. Thanks a lot for the detailed reply!

Indeed, in my case I am already using one of the GPUs for the (OpenGL) rendering context and would like to use the second GPU for some things that are currently done on the CPU (again, optionally of course, reinforced by all the gotchas you mention).

I hadn't actually thought about using some sort of Crossfire/SLI technology here, but it makes a lot of sense. For example, one of the uses in my case would be tessellation, and the trip from

CPU -> 2nd GPU -> CPU (result) -> 1st GPU (render)

could then be trimmed down to

CPU -> 2nd GPU -> 1st GPU (render)

and that would be great, but I guess the tech is not really there yet in reality. I did indeed also hear some horror stories about OpenCL/OpenGL interop breaking randomly with driver releases and therefore not really being used outside of scientific (i.e. controlled-environment) settings.

Is there any hope of these things being fixed in the not-so-distant future? Is CUDA maybe worth a closer look in terms of reliability (even if it's hardware-specific)?

Edit:

Also wondering if the situation is any better or worse on OS X? Specifically looking at hardware like their new dual-AMD Mac Pro machines (where CUDA is obviously a no-go).

CUDA means "will never, not ever, not even a bit, run on AMD or Intel", which is a dealbreaker for me. Though of course, if that little detail doesn't matter to you, then CUDA is a whole lot better.

Actually, OpenCL tries to do the right thing; it just doesn't quite get there, because it's too complex/obscure (and went a step too far), and because the interoperability is too badly implemented.

I would wish for something like the user being able to define in the control panel which GPUs are eligible for computing, and you just say "give me a compute device", then throw a kernel and some input and output buffer objects at it, and the rest is the driver's problem.

The only thing I'd really want to know is whether my device can efficiently interoperate with the main GPU (for that, there'd need to be a feature flag at context creation). I really don't want to know on which card a buffer lives or where a kernel executes, or anything else. I really don't want to know what it takes for the GPU to use my generated data as vertex input to draw some stuff. If it needs to be copied over Crossfire, the driver should just do it; if it needs to go down and back up PCIe, then do that. If it's the same GPU, even better.

Of course, in theory there exist those mythical OpenCL accelerator cards, but nobody has them, and you wouldn't want to use them anyway, so all that is purely hypothetical. And then there are CPU implementations, but you can likely write equally fast (or possibly faster, since you are not bound by the API contract and the execution model) code on the CPU with less trouble. What you realistically want is to use the GPU (or one GPU if there are several), and you want this fast and with little trouble.

Compute shaders may very well be an alternative to OpenCL, as they're basically just what one wants (and with one less dependency!). Unfortunately, they're only available from OpenGL 4.3 onward, i.e. on recent hardware. There is no downgraded compute-shader version akin to OpenCL 1.0, which basically runs fine on 10-year-old hardware and is kind of sufficient for 97.5% of everything.
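For comparison, the compute-shader equivalent of a trivial CL kernel is pleasantly short. A sketch assuming a GL 4.3 context with function pointers already loaded (e.g. via GLEW); the helper name and the "double every element" kernel are just for illustration, and compile/link error checks are trimmed.

/* Sketch: GL 4.3 compute shader that doubles every element of a
   shader storage buffer, compiled and dispatched from plain GL. */
#include <GL/glew.h>

static const char *src =
    "#version 430\n"
    "layout(local_size_x = 64) in;\n"
    "layout(std430, binding = 0) buffer Data { float v[]; };\n"
    "void main() { v[gl_GlobalInvocationID.x] *= 2.0; }\n";

void dispatch_double(GLuint ssbo, GLuint count)
{
    GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
    glShaderSource(shader, 1, &src, NULL);
    glCompileShader(shader);

    GLuint prog = glCreateProgram();
    glAttachShader(prog, shader);
    glLinkProgram(prog);

    glUseProgram(prog);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
    glDispatchCompute((count + 63) / 64, 1, 1);
    /* Make the writes visible to whoever reads the buffer next. */
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    glDeleteShader(shader);
    glDeleteProgram(prog);
}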

