

Cases for multithreading OpenGL code?


12 replies to this topic

#1 Ubik   Members   -  Reputation: 691


Posted 15 May 2014 - 03:15 PM

I have wanted to support multiple contexts used in separate threads with shared resources in this convenience GL wrapper I've been fiddling with. My goal hasn't been to expose everything GL can do, but to expose a useful subset in a nice way; still, multi-context support has seemed like a good thing, aligning the wrapper with a pretty big aspect of the underlying system. However, multithreaded code is hard, so I finally started to question whether supporting multiple contexts with resource sharing is even worth it.

 

If the intended use is PC gaming - a single simple window and so on (a one-person project too, to put things in perspective, currently targeting version 3.3 if that has any relevance) - what reasons would there be to take the harder route? My understanding is that the benefits might actually be pretty limited, but my knowledge of the various implementations and their capabilities is definitely limited too.




#2 Promit   Moderators   -  Reputation: 7560


Posted 15 May 2014 - 03:20 PM

The long and short of it is that multiple contexts and context switching are so horrifically broken on the driver side, across all vendors for all platforms, that there is nothing to gain and everything to lose in going down this road. If you want to go down the multithreaded GL road in any productive way whatsoever, it'll be through persistent mapped buffers and indirect draws off queued up buffers. More info here: http://www.slideshare.net/CassEveritt/beyond-porting


Edited by Promit, 15 May 2014 - 03:22 PM.


#3 Hodgman   Moderators   -  Reputation: 31781


Posted 15 May 2014 - 04:26 PM

The benefit of multi-core in D3D11 is that you can create resources (textures/shaders/etc.) without blocking the main thread.

So in theory GL can give you the same benefit if required - just be sure to test extensively on every vendor/driver/GPU that you can!

In theory, D3D11 also lets you submit draw calls from multiple threads, but in practice all drivers just use this to send messages back to the main thread, which still does all the work.
D3D12 and Mantle are fixing this, but GL isn't... The closest GL feature is multi-draw-indirect, where you still only have one "GL thread"/context, but you fill indirect buffers with draw parameters however you like (such as writing to them from other threads).

#4 L. Spiro   Crossbones+   -  Reputation: 14197


Posted 15 May 2014 - 06:20 PM

I agree with Promit; I have recently been researching multiple contexts for my book and the results are:

 

#1: It only serves a very focused purpose.  If you deviate from that purpose even a tiny bit, you would be better off just using a single context.

The steps you have to take to synchronize resources involve unbinding the resource from all contexts and then flushing all contexts.

A lock-step system that keeps each context on its own thread and performs these flushes etc. at its own leisure is not practical, so you will end up moving contexts over to other threads anyway.  This means covering the whole game loop in a critical section and a make-current call (wglMakeCurrent()/glXMakeCurrent()) every frame, which is a performance issue.

 

If you are moving all the contexts over to the loading thread and then restoring them back to their original threads, you may as well just be using a single context; at least then you can avoid all the extra flushing etc.

 

Which means you can only work with multiple contexts in a way that never requires them to leave their assigned threads.

This means they can be used for and only for loading of resources that previously did not exist.  They cannot be used to reload resources or update them.  They can’t be used to issue draw calls from multiple threads etc.

 

 

#2: Now that it is clear what purpose multiple contexts serve, what is the value of that purpose?  Virtually nothing.

In terms of spent CPU cycles, most of the loading process is consumed by file access, loading data to memory, and perhaps parsing it if it is not already in a format that can be directly fed to OpenGL.

The part where you actually call an OpenGL function to create the OpenGL resource is a fairly small part of the process.  If it, and only it, is done on the main thread while the loading of everything else happens on a second thread, you will hardly see any skips in the frame rate.

 

 

Overall, its only usefulness is that the contexts can be assigned to a thread once and never be changed, and the only way that can actually work in practice is if all the resources created on the secondary context are entirely fresh new resources (not updates, and not delete-and-create-new, since the new resource will likely reuse the old one's ID and cause bugs on the main context) - and ultimately that isn't where the overhead lies.

Simply pointless.

 

 

L. Spiro


It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#5 Ubik   Members   -  Reputation: 691


Posted 16 May 2014 - 12:12 AM

Alright, I honestly hadn't expected this to be so clear-cut. A bit sad, and maybe a little ironic too, that GPU rendering can't be parallelized from the client side - with OpenGL, anyway.

 

Thank you for sharing the knowledge!



#6 phantom   Moderators   -  Reputation: 7554


Posted 16 May 2014 - 05:29 AM

Well, it's only a partly parallelisable problem, as the GPU is reading from a single command buffer (in the GL/D3D model, anyway; the hardware doesn't work quite the same, as Mantle shows by giving you three command queues per device), so at some point your commands have to get into that stream (be it by physically adding to a chunk of memory or by inserting a jump instruction to a block to execute). You are always going to have a single thread/sync point.

However, command sub-buffer construction is a highly parallelisable thing; consoles have been doing it for ages. The problem is that the OpenGL mindset seems to be 'this isn't a problem - just multi-draw all the things!', and the D3D11 "solution" was flawed because of how the driver works internally.

D3D12 and Mantle should shake this up and hopefully show that parallel command buffer construction is a good thing and that OpenGL needs to get with the program (or, as someone at Valve said, it'll get chewed up by the newer APIs).

#7 theagentd   Members   -  Reputation: 602


Posted 16 May 2014 - 07:02 AM

Doing texture streaming in parallel works great on all GPUs I've tested on, which include a shitload of Nvidia GPUs, at least an AMD HD7790 and a few Intel GPUs. It's essentially stutter-free.



#8 mark ds   Members   -  Reputation: 1477


Posted 16 May 2014 - 10:02 AM

Here's a silly but related question. If I use a second context to upload data that takes, for example, a second to transfer over the PCI bus, will the main rendering thread stall while it waits for its own per-frame data, or will the driver split the larger transfer into chunks, thereby allowing the two threads to interleave their data?



#9 Ubik   Members   -  Reputation: 691


Posted 16 May 2014 - 11:09 AM

phantom, on 16 May 2014 - 05:29 AM, said:

Well, it's only a partly parallelisable problem, as the GPU is reading from a single command buffer (in the GL/D3D model, anyway; the hardware doesn't work quite the same, as Mantle shows by giving you three command queues per device), so at some point your commands have to get into that stream (be it by physically adding to a chunk of memory or by inserting a jump instruction to a block to execute). You are always going to have a single thread/sync point.

However, command sub-buffer construction is a highly parallelisable thing; consoles have been doing it for ages. The problem is that the OpenGL mindset seems to be 'this isn't a problem - just multi-draw all the things!', and the D3D11 "solution" was flawed because of how the driver works internally.

D3D12 and Mantle should shake this up and hopefully show that parallel command buffer construction is a good thing and that OpenGL needs to get with the program (or, as someone at Valve said, it'll get chewed up by the newer APIs).

Good clarification. Makes sense that even if the GPU has lots of pixel/vertex/compute units, the system controlling them isn't necessarily as parallel-friendly. To a non-hardware person the number three sounds like a curious choice, but it seems to make some intuitive sense to have the number close to a common number of CPU cores. That's excluding hyper-threading, but that's an Intel thing, so it doesn't matter to folks at AMD. (Though there are the consoles with more cores...)

 

I'm wishing for something nicer than OpenGL to happen too, but it's probably going to take some time for things to actually change. Not on Windows here, so the wait is likely going to be longer still. Might as well use GL in the meantime.

 

theagentd, on 16 May 2014 - 07:02 AM, said:

Doing texture streaming in parallel works great on all GPUs I've tested on, which include a shitload of Nvidia GPUs, at least an AMD HD7790 and a few Intel GPUs. It's essentially stutter-free.

Creating resources or uploading data on a second context is mostly what I had in mind earlier. I did try to find info on this, but probably didn't use the right terms, because I got the impression that genuinely parallel data transfer isn't that commonly supported.

 

I've now thought that if I'm going to add secondary context support anyway, it will be in a very constrained way, so that the other context (or its wrapper, to be specific) won't be a general-purpose one but will target things like resource loading specifically. That should let me keep the complexity at bay.



#10 phantom   Moderators   -  Reputation: 7554


Posted 16 May 2014 - 03:42 PM

As we've pretty much got the answer to the original question, I'm going to take a moment to quickly (and basically) cover a thing. :)

Ubik, on 16 May 2014 - 11:09 AM, said:

Good clarification. Makes sense that even if the GPU has lots of pixel/vertex/compute units, the system controlling them isn't necessarily as parallel-friendly. To a non-hardware person the number three sounds like a curious choice, but it seems to make some intuitive sense to have the number close to a common number of CPU cores. That's excluding hyper-threading, but that's an Intel thing, so it doesn't matter to folks at AMD. (Though there are the consoles with more cores...)


So, the number '3' has nothing to do with CPU core counts; when it comes to GPU/CPU reasoning very little of one directly impacts the other.

A GPU works by consuming 'command packets'; the OpenGL calls you make are translated by the driver into bytes the GPU can natively read and understand, in the same way a compiler transforms your code into binary for the CPU.

The OpenGL and D3D11 model of a GPU presents the command stream as handled by a single 'command processor': the hardware which decodes the command packets to make the GPU do its work. For a long time this was probably the case too, so the conceptual model 'works'.

However, a recent GPU, such as AMD's Graphics Core Next series, is a bit more complicated than that: the interface which deals with the commands isn't a single block but in fact three, each of which can consume a stream of commands.

First is the 'graphics command processor'; this can dispatch graphics and compute workloads to the GPU hardware (the glDraw/glDispatch families of functions) and is where your commands end up.

Second are the 'compute command processors'; these handle compute-only workloads. They are not exposed via GL - I think OpenCL can kind of expose them, but with Mantle each is a separate command queue. (The driver might make use of them behind the scenes as well.)

Finally, there are 'DMA commands': a separate command queue for moving data to/from the GPU, which OpenGL handles behind the scenes in the driver (but which Mantle would let you use to kick off your own uploads/downloads as required).

So the command queues as exposed by Mantle more closely mirror the operation of the hardware (it still hides some details), which explains why you have three: one for each type of command work the GPU can do.

If you are interested, AMD have made a lot of this detail available, which is pretty cool.
(Annoyingly, NV are very conservative about their hardware details, which makes me sad.)

To be clear, you don't need to know this stuff, although I personally find it interesting. This is also a pretty high-level overview of the situation, so don't take it as a "this is how GPUs work!" kind of thing. :)

#11 Ubik   Members   -  Reputation: 691


Posted 16 May 2014 - 04:13 PM

Ah, so it's more like different handlers for different kinds of tasks rather than three general-purpose units - though by the sound of it, even the distinction between drawing and computing isn't that clear. The latter was, for whatever reason, my first assumption. From that perspective it felt reasonable to think there would be no reason to have more of them than there are CPU cores to feed them with commands. Teaches me not to make quick assumptions!

 

Thanks for taking the time to tell about this. Even though I obviously haven't delved into these lower level matters much, they are interesting to me too.



#12 Hodgman   Moderators   -  Reputation: 31781


Posted 16 May 2014 - 05:55 PM

Yeah there's overlap between the draw+compute queue and the compute-only queue.
The idea here is kinda like hyper-threading -- if the current draw/dispatch command from the graphics queue isn't using 100% of the GPU's resources (maybe it's stalled, waiting for memory) then the GPU can use those idle resources to perform some work for the compute queue.
In the future, more and more compute-only queues will likely be added too. More queues = more independent jobs ready to run = more chances for the GPU's scheduler to actually make use of all of its resources.

AFAIK, extra draw queues aren't added simply because of the complexity. Draw commands require special scheduling and fixed-function logic but, most importantly, are stateful (they rely on a state machine, even at the hardware level). Compute (dispatch) commands are stateless at the hardware level (everything required to perform one is bundled into the command), making it much easier to juggle parallel execution, or to constantly start/stop processing them.

The DMA queue is similar to the others, but it can't process draw/dispatch/compute tasks at all; it can only perform (async) memcpy tasks. Whenever the memory controller is not busy, it will try to do the next item in the DMA queue -- allowing it to stay busy if you've queued up a lot of work for it.
It might get magically used by things like glBufferData if you're lucky ;)

Note though that any number of CPU threads (or applications!) can share these queues. Console games can have 8 threads each producing their own "command buffers", and then doing some very minor synchronization to insert these command buffers into a single GPU queue.
Mantle/D3D12 will allow you to do the same, greatly increasing the amount of commands that the CPU is able to produce each frame ;-)

If Intel/nVidia actually get on board and form a Mantle committee with AMD, then I could see it seriously competing with GL, even replacing GL in future games... Unless the GL committee lifts their game!

#13 Ubik   Members   -  Reputation: 691


Posted 17 May 2014 - 06:30 AM

It might even make some sense for Khronos to have some kind of "low-level GL" specification alongside OpenGL, because that way the latter could have an actual cross-platform and cross-vendor implementation built on the former. Then the folks who still want or need to use OpenGL would have just one set of quirks to fight against. Well, as long as the LLGL actually had uniformly working implementations...





