Cases for multithreading OpenGL code?


I have wanted to support multiple contexts used in separate threads with shared resources in this convenience GL wrapper I've been fiddling with. My goal hasn't been to expose everything the API can do, just to expose things in a nicer way, but multi-context support has seemed like a good fit, aligning the wrapper with a pretty big aspect of the underlying system. However, multithreaded code is hard, so I finally started to question whether supporting multiple contexts with resource sharing is even worth it.

If the intended use is PC gaming with a single, simple window and so on (a single-person project too, to put things to scale, currently targeting version 3.3 if that has any relevance), what reasons would there be to take the harder route? My understanding is that the benefits might actually be pretty limited, but my knowledge of the various implementations and their capabilities is definitely shallow.


The long and short of it is that multiple contexts and context switching are so horrifically broken on the driver side, across all vendors for all platforms, that there is nothing to gain and everything to lose in going down this road. If you want to go down the multithreaded GL road in any productive way whatsoever, it'll be through persistent mapped buffers and indirect draws off queued up buffers. More info here: http://www.slideshare.net/CassEveritt/beyond-porting
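For reference, the persistent-mapping part of that approach looks roughly like the sketch below. This is only a minimal illustration, assuming GL 4.4 or ARB_buffer_storage (so beyond the 3.3 target in the original post) and that a function loader such as GLAD has already been initialised:

```cpp
// Minimal sketch of a persistently mapped buffer (GL 4.4 / ARB_buffer_storage).
#include <glad/glad.h>   // or whichever GL loader you use

// Creates an immutable buffer and maps it once; worker threads can then write
// vertex/draw data through the returned pointer without ever calling GL.
GLuint CreatePersistentBuffer(GLsizeiptr size, void** outWritePtr)
{
    const GLbitfield flags = GL_MAP_WRITE_BIT
                           | GL_MAP_PERSISTENT_BIT   // pointer stays valid while GL uses the buffer
                           | GL_MAP_COHERENT_BIT;    // writes become visible without explicit flushes

    GLuint buffer = 0;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_ARRAY_BUFFER, buffer);
    glBufferStorage(GL_ARRAY_BUFFER, size, nullptr, flags);        // immutable storage is required
    *outWritePtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size, flags);
    return buffer;
}
```

You still need fences (glFenceSync/glClientWaitSync) before reusing a region the GPU might still be reading, which is why this pattern is usually combined with double- or triple-buffering the mapped range.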

The benefit of multi-core support in D3D11 is that you can create resources (textures/shaders/etc.) without blocking the main thread.

So in theory GL can give you the same benefit if required - just be sure to test extensively on every vendor/driver/GPU that you can!

In theory, D3D11 also lets you submit draw calls from multiple threads, but in practice all drivers just use this to send messages back to the main thread, which still does all the work.
D3D12 and Mantle are fixing this, but GL isn't... The closest GL feature is multi-draw-indirect, where you still only have one "GL thread"/context, but you fill indirect buffers with draw parameters however you like (such as writing to them from other threads).
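To make that multi-draw-indirect idea concrete, here is a rough sketch. It assumes GL 4.3 / ARB_multi_draw_indirect; worker threads fill an array of command structs (for example inside a persistently mapped GL_DRAW_INDIRECT_BUFFER), and only the GL thread issues the actual draw:

```cpp
// Layout of one indirect draw command, as read by GL from the indirect buffer.
struct DrawElementsIndirectCommand
{
    GLuint count;          // number of indices for this draw
    GLuint instanceCount;  // number of instances
    GLuint firstIndex;     // offset into the index buffer
    GLint  baseVertex;     // value added to each index
    GLuint baseInstance;   // first instance ID
};

// GL thread, after the workers have finished writing their commands:
void SubmitQueuedDraws(GLuint indirectBuffer, GLsizei drawCount)
{
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr,     // read commands from the bound indirect buffer
                                drawCount,   // how many commands to execute
                                0);          // 0 = commands are tightly packed
}
```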

I agree with Promit; I have recently been researching multiple contexts for my book and the results are:

#1: It only serves a very focused purpose. If you deviate from that purpose even a tiny bit you would be better off just using a single context.

The steps you have to take to synchronize resources involve unbinding the resource from all contexts and then flushing all contexts.

A lock-step system that would keep each context on its own thread and perform these flushes etc. at its own leisure is not practical, so you will end up moving contexts over to other threads anyway. This means covering the whole game loop in a critical section and making a context current (wglMakeCurrent/glXMakeCurrent) every frame, which is a performance issue.

If you are moving all the contexts over to the loading thread and then restoring them back to their original threads, you may as well just be using a single context; at least then you can avoid all the extra flushing etc.

Which means you can only work with multiple contexts in a way that never requires them to leave their assigned threads.

This means they can be used for, and only for, loading resources that did not previously exist. They cannot be used to reload resources or update them. They can't be used to issue draw calls from multiple threads, etc.

#2: Now that it is clear what purpose multiple contexts serve, what is the value of that purpose? Virtually nothing.

In terms of spent CPU cycles, most of the loading process is consumed by file access, loading data to memory, and perhaps parsing it if it is not already in a format that can be directly fed to OpenGL.

The part where you actually call an OpenGL function to create the OpenGL resource is a fairly small part of the process. If it, and only it, is done on the main thread while everything else is loaded on a second thread, you will hardly see any skips in the frame rate.
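A rough sketch of that split is below; the names (PendingTexture, g_pending, DecodeImageFile) are made up for the example, and a GL header/loader is assumed to be included already. The worker thread only does file access and decoding, and the main thread does the one cheap GL call:

```cpp
#include <mutex>
#include <queue>
#include <vector>

struct PendingTexture
{
    int width = 0, height = 0;
    std::vector<unsigned char> rgba;   // already decoded into a GL-ready format
};

PendingTexture DecodeImageFile(const char* path);   // hypothetical file read + decode, provided elsewhere

static std::mutex                 g_pendingMutex;
static std::queue<PendingTexture> g_pending;

// Worker thread: no GL calls at all.
void LoaderThread(const char* path)
{
    PendingTexture tex = DecodeImageFile(path);
    std::lock_guard<std::mutex> lock(g_pendingMutex);
    g_pending.push(std::move(tex));
}

// Main thread, once per frame: the small part that actually touches OpenGL.
void CreatePendingTextures()
{
    std::lock_guard<std::mutex> lock(g_pendingMutex);
    while (!g_pending.empty()) {
        const PendingTexture& t = g_pending.front();
        GLuint id = 0;
        glGenTextures(1, &id);
        glBindTexture(GL_TEXTURE_2D, id);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, t.width, t.height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, t.rgba.data());
        g_pending.pop();
    }
}
```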

Overall, its only usefulness is that the contexts can be assigned to a thread once and never changed, and the only way that can actually work in practice is if all the resources created on the secondary context are entirely fresh new resources (not updates, and not delete-and-create-new, since the new resource will likely get the same ID as the old one and cause bugs on the main context), and ultimately that isn't where the overhead lies.

Simply pointless.

L. Spiro


Alright, I honestly hadn't expected this to be so clear-cut. A bit sad, and maybe a little ironic, that GPU rendering can't be parallelized from the client side, with OpenGL anyway.

Thank you for sharing the knowledge!

Well, it's only a partly parallelisable problem, as the GPU is reading from a single command buffer (well, in the GL/D3D model; the hardware doesn't work quite the same, as Mantle shows by giving you three command queues per device, but still...), so at some point your commands have to get into that stream (be it by physically adding to a chunk of memory or by inserting a jump instruction to a block to execute), so there is always going to be a single thread/sync point.

However, command sub-buffer construction is a highly parallelisable thing; consoles have been doing it for ages. The problem is that the OpenGL mindset seems to be 'this isn't a problem - just multi-draw all the things!', and the D3D11 "solution" was a flawed one because of how the driver works internally.

D3D12 and Mantle should shake this up and hopefully show that parallel command buffer construction is a good thing and that OpenGL needs to get with the program (or, as someone at Valve said, it'll get chewed up by the newer APIs).

Doing texture streaming in parallel works great on all GPUs I've tested on, which include a shitload of Nvidia GPUs, at least an AMD HD7790 and a few Intel GPUs. It's essentially stutter-free.

Here's a silly but related question. If I use a second context to upload data that takes, for example, a second to transfer over the PCI bus, will the main rendering thread stall while it waits for its own per-frame data, or will the driver split the larger data into chunks, thereby allowing the two threads to interleave their data?
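Since the driver's behaviour here isn't specified, one conservative way to sidestep the question is to split the transfer yourself and push a bounded slice per frame from the rendering thread. A sketch with made-up names, assuming the pixel data has already been decoded to RGBA8:

```cpp
#include <algorithm>
#include <cstddef>

struct StreamingUpload
{
    GLuint texture = 0;
    int width = 0, height = 0;
    int nextRow = 0;                        // rows uploaded so far
    const unsigned char* pixels = nullptr;  // decoded RGBA8 data, kept alive until done
};

// Call once per frame; returns true when the whole texture has been uploaded.
bool ContinueUpload(StreamingUpload& up, int rowsPerFrame)
{
    const int rows = std::min(rowsPerFrame, up.height - up.nextRow);
    if (rows <= 0)
        return true;

    glBindTexture(GL_TEXTURE_2D, up.texture);
    glTexSubImage2D(GL_TEXTURE_2D, 0,
                    0, up.nextRow,          // x/y offset of the slice
                    up.width, rows,         // slice dimensions
                    GL_RGBA, GL_UNSIGNED_BYTE,
                    up.pixels + static_cast<std::size_t>(up.nextRow) * up.width * 4);
    up.nextRow += rows;
    return up.nextRow == up.height;
}
```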

Well, it's only a partly parallelisable problem, as the GPU is reading from a single command buffer [...]

Good clarification. Makes sense that even if the GPU has lots of pixel/vertex/compute units, the system controlling them isn't necessarily as parallel-friendly. For a non-hardware person the number three sounds like a curious choice, but in any case it seems to make some intuitive sense to have the number close to a common number of CPU cores. That's excluding hyper-threading, but that's an Intel thing so it doesn't matter to folks at AMD. (Though there are the consoles with more cores...)

I'm wishing for something nicer than OpenGL to happen too, but it's probably going to take some time for things to actually change. Not on Windows here, so the wait is likely going to be even longer. Might as well use GL in the meantime.

Doing texture streaming in parallel works great on all GPUs I've tested on [...]

Creating resources or uploading data on a second context is what I've mostly had in mind earlier. I did try to find info on this, but probably didn't use the right terms, because I got the impression that truly parallel data transfer isn't that commonly supported.

I've now thought that if I'm going to add secondary context support anyway, it will be in a very constrained way, so that the other context (or its wrapper, to be specific) won't be a general-purpose one but will target things like resource loading specifically. That should let me keep the complexity at bay.
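As a sketch of what such a constrained loader context might look like (not a definitive design): it only ever creates brand-new textures and hands them to the main context together with a fence. This assumes the two contexts were created with resource sharing enabled (wglShareLists, the share argument of glXCreateContext, or the windowing library's equivalent) and uses GL 3.2+ sync objects, which 3.3 has:

```cpp
struct LoadedTexture
{
    GLuint id    = 0;
    GLsync ready = nullptr;   // signalled once the upload has completed
};

// Loader thread, with the secondary (shared) context current:
LoadedTexture UploadOnLoaderContext(const unsigned char* rgba, int w, int h)
{
    LoadedTexture out;
    glGenTextures(1, &out.id);
    glBindTexture(GL_TEXTURE_2D, out.id);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, rgba);

    out.ready = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();   // make sure the fence (and the upload) actually reaches the GPU
    return out;
}

// Main thread, with the main context current, before the texture's first use:
void WaitUntilUsable(LoadedTexture& tex)
{
    if (tex.ready) {
        glWaitSync(tex.ready, 0, GL_TIMEOUT_IGNORED);   // server-side wait, no CPU stall
        glDeleteSync(tex.ready);
        tex.ready = nullptr;
    }
}
```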

As we've pretty much got the answer to the original question, I'm going to take a moment to quickly (and basically) cover a thing :)

For a non-hardware person the number three sounds like a curious choice, but in any case it seems to make some intuitive sense to have the number close to a common number of CPU cores. [...]


So, the number '3' has nothing to do with CPU core counts; when it comes to GPU/CPU reasoning very little of one directly impacts the other.

A GPU works by consuming 'command packets'; the OpenGL calls you make get translated by the driver into bytes the GPU can natively read and understand, in the same way a compiler transforms your code to binary for the CPU.

The OpenGL and D3D11 model of a GPU presents a case where the command stream is handled by a single 'command processor', the hardware that decodes the command packets to make the GPU do its work. For a long time this was probably the case too, so the conceptual model 'works'.

However, a recent GPU, such as AMD's Graphics Core Next series, is a bit more complicated than that, as the interface that deals with the commands isn't a single block but in fact three, each of which can consume a stream of commands.

First is the 'graphics command processor'; this can dispatch graphics and compute workloads to the GPU hardware (the glDraw/glDispatch family of functions) and is where your commands end up.

Second are the 'compute command processors'; these can handle compute-only workloads. They are not exposed via GL; I think OpenCL can kind of expose them, but with Mantle they get a separate command queue. (The driver might make use of them behind the scenes as well.)

Finally there are the 'DMA commands', a separate command queue used to move data to/from the GPU. In OpenGL this is handled behind the scenes by the driver (but in Mantle it would allow you to kick off your own uploads/downloads as required).

So the command queues exposed by Mantle more closely mirror the operation of the hardware (it still hides some details), which explains why you have three: one for each type of command work the GPU can do.

If you are interested, AMD have made a lot of this detail available, which is pretty cool.
(Annoyingly, NV are very conservative about their hardware details, which makes me sad :( )

To be clear, you don't need to know this stuff, although I personally find it interesting. This is also a pretty high-level overview of the situation, so don't take it as a "this is how GPUs work!" kind of thing :)

