multiple threads, OpenGL, and worker threads...

Started by
15 comments, last by cr88192 11 years ago
earlier, I was off moving a few things from my main thread to secondary threads, and made a discovery:
http://www.opengl.org/wiki/OpenGL_and_multithreading

notable here is that using wglShareLists, it is apparently possible for multiple threads to both access OpenGL (at least on Windows...).
(ADD: apparently in the Linux case, sharing is mostly done via the share-list argument to glXCreateContext).


this is, in effect, pretty nifty.

however, I am doing this in a slightly funky manner:
effectively, work-items are submitted via a "work queue", with worker threads which grab items off the queue, then invoke their supplied function-pointers.

(this being in contrast to using a dedicated thread for the task).


previously, there was no way of saying which worker thread would get which work items, so there was a potential issue of OpenGL-specific work-items being grabbed up by a worker thread which didn't have OpenGL available.

a solution then was to add the concept of a worker-ID (and make some "clever" changes to the work-queue code), where worker creation functions could be "registered", which would return an info structure containing function-pointers to be called when beginning/ending execution of the new worker thread (or also deny creation of new workers), ... and making it so that work-items would only be executed with a worker of the matching type (the default/generic workers were then given an ID of 0).
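A minimal sketch of the worker-ID idea described above, in C++. The engine's actual code is not shown in this thread, so every name here (`TypedWorkQueue`, `runWorker`, the begin/end callbacks) is hypothetical, and the GL context setup is replaced by plain callbacks so the example runs without a window:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical names throughout. Worker type 0 is the generic worker.
struct WorkItem {
    int workerType;               // which kind of worker may run this item
    std::function<void()> fn;     // the supplied function
};

class TypedWorkQueue {
public:
    void submit(int workerType, std::function<void()> fn) {
        assert(fn && "work item needs a callable");
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(WorkItem{workerType, std::move(fn)});
        }
        cond_.notify_all();       // wake everyone; only a matching worker takes it
    }

    // Body of one worker thread. onBegin/onEnd stand in for the registered
    // begin/end callbacks (e.g. making a shared GL context current).
    void runWorker(int workerType,
                   const std::function<void()>& onBegin,
                   const std::function<void()>& onEnd) {
        onBegin();
        for (;;) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [&] {
                    return stopping_ || findMatch(workerType) != items_.end();
                });
                auto it = findMatch(workerType);
                if (it == items_.end())
                    break;        // stopping, and nothing left for this type
                fn = std::move(it->fn);
                items_.erase(it);
            }
            fn();                 // run the item outside the lock
        }
        onEnd();
    }

    // Lets workers drain their remaining items and then return.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cond_.notify_all();
    }

private:
    std::deque<WorkItem>::iterator findMatch(int workerType) {
        for (auto it = items_.begin(); it != items_.end(); ++it)
            if (it->workerType == workerType)
                return it;
        return items_.end();
    }

    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<WorkItem> items_;
    bool stopping_ = false;
};

// Usage sketch: one generic worker (type 0) and one "GL" worker (type 1).
// Returns true if every type-1 item ran on the type-1 worker's thread.
bool demoTypedQueue() {
    TypedWorkQueue queue;
    std::thread::id glThreadId;
    std::atomic<int> onGlThread{0}, genericCount{0};
    std::thread glWorker([&] {
        queue.runWorker(1, [&] { glThreadId = std::this_thread::get_id(); }, [] {});
    });
    std::thread generic([&] { queue.runWorker(0, [] {}, [] {}); });
    for (int i = 0; i < 8; ++i) {
        queue.submit(1, [&] { if (std::this_thread::get_id() == glThreadId) ++onGlThread; });
        queue.submit(0, [&] { ++genericCount; });
    }
    queue.stop();
    glWorker.join();
    generic.join();
    return onGlThread == 8 && genericCount == 8;
}
```

The single shared deque with a linear scan keeps the sketch short. One queue per worker type would avoid both the scan and the wake-everyone `notify_all`, at the cost of a little more bookkeeping.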

a little logic later, and now I seem to have worker threads which can access GL (and things like rebuilding the terrain can now happen asynchronously, and is also a little bit faster).

(granted, yes, did have to go add mutexes and similar in a few places... mostly as the terrain-building code wasn't exactly thread-safe).


thoughts?...

As far as I know, it is not safe to draw to a window created in another thread using OpenGL. Source: Interview with John Carmack on the Doom3 rendering code (2012, third answer). He doesn't go into details but says:

[...] on windows, OpenGL can only safely draw to a window that was created by the same thread. We created the window on the launch thread, but then did all the rendering on a separate render thread. It would be nice if doing this just failed with a clear error, but instead it works on some systems and randomly fails on others for no apparent reason. [...]

I'm not that well versed in graphics programming though. If this doesn't apply to what you are doing feel free to ignore/correct me :)


these separate worker threads are more related to things like loading, geometry generation, and possibly some offline rendering (using FBOs or similar).

probably, there isn't as much of an advantage for normal rendering tasks, since the main rendering thread is largely already using up the GPU's resources, so it more makes sense for more CPU-bound or IO-bound tasks (like loading files and textures and similar).


these would be tasks which don't draw onto the main screen, and don't necessarily need to happen in-sync with the rendering-frame.

the advantage of having a shared OpenGL state in these worker threads, is so that these tasks can be done without needing to coordinate the process with the main thread (for example, still needing to rely on the main thread to upload the texture or load vertex-arrays into a VBO or similar). (granted, some coordination is still needed for creating these worker threads).

or such...

The first thing I tend to suggest is not to use the shared lists as a solution here. That feature has a very specific (and rather outdated) use case and really is not intended for what you are likely trying to accomplish. Shared lists were primarily meant for display lists, where you record GL calls (glBegin/glEnd and so on) between glNewList/glEndList, and while the mechanism has been updated to share more things, it is still not your best solution. You should be using buffers for most things by now, and those are not very well integrated into the sharing system, since you need fairly precise control over when a buffer is mapped/unmapped and, as such, when it gets copied over to the driver side.

Now, having said that, is threading the API calls worth it? It really depends. Most drivers already use a separate thread between the user space calls and the actual driver side. What is the actual goal of the threading then? Mostly, threading between the GL API calls and your primary thread is intended to keep your primary thread from blocking on calls which can/will block in certain cases, e.g. glMap/glUnmap/glFlush etc. The work queue likely solves this, but I suspect it is just hiding the issue and causing other problems which you haven't noticed.

What other problems? Well, if you are specifically feeding the work into a single thread anyway, you probably end up with that worker thread saturated by graphics calls so it is effectively a dedicated graphics thread already. Ok, so what is the problem with that? Well it is very likely that you are slowing down your entire engine since the number of tasks in the queue likely went up considerably and you are increasing contention. Additionally, most of the task based systems tend to use a worker thread somewhat like:

thread:
    wait for work in queue
    do work
    repeat

So you are effectively multiplying contention on your task queue in two ways: first by pushing more work into the queue, and second by only allowing a single thread to pull rendering tasks off the queue, so all the other threads are contending for less available work.

So, to figure out if you really want to thread this, I'd look at your performance and goals. I only tend to thread out my graphics driver for three primary purposes:

1. Running the underlying game simulation at a fixed frame rate no matter what the graphics frame rate may be. For instance, if I'm working on a networked game, I might run the underlying simulation at 10-30FPS and completely decouple that from the rendering using interpolated "snap shots" of game state. In this case the graphics thread makes a lot of sense.

2. Performing large amounts of geometry generation (procedural terrain, trees etc) in worker threads. Separating the work from the API so you don't hold glMap/glUnmap calls open for long stretches is generally important. I.e. do the work in a memory buffer, then pass it to the driver, which does glMap/memcpy/glUnmap and frees the memory. This isn't the "best" solution, but it usually prevents big hitches.

3. Large load of geometry being loaded from an async file io thread. Similar to #2.
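The staging approach from #2 and #3 can be sketched roughly as follows. This is only an illustration under assumptions: `StagedMesh`, `UploadQueue` and the `mapFn`/`unmapFn` callbacks are made-up names, and the callbacks stand in for glMapBuffer/glUnmapBuffer so the example runs without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// A finished mesh in ordinary CPU memory; no GL involved yet.
struct StagedMesh {
    int chunkId;
    std::vector<float> vertices;
};

// Queue of finished meshes, filled by workers and drained by the GL thread.
class UploadQueue {
public:
    void push(StagedMesh mesh) {
        std::lock_guard<std::mutex> lock(mutex_);
        ready_.push_back(std::move(mesh));
    }

    // Called on the thread that owns the GL context, e.g. once per frame.
    // mapFn stands in for glMapBuffer and must return writable storage of at
    // least `bytes` bytes; unmapFn stands in for glUnmapBuffer.
    template <class MapFn, class UnmapFn>
    std::size_t drain(MapFn mapFn, UnmapFn unmapFn) {
        std::vector<StagedMesh> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(ready_);          // hold the lock only for the swap
        }
        for (StagedMesh& mesh : batch) {
            const std::size_t bytes = mesh.vertices.size() * sizeof(float);
            if (bytes == 0)
                continue;                // nothing to copy for an empty mesh
            void* dst = mapFn(mesh.chunkId, bytes);
            assert(dst != nullptr);
            std::memcpy(dst, mesh.vertices.data(), bytes);  // map window stays short
            unmapFn(mesh.chunkId);
        }
        return batch.size();             // staging memory is freed with `batch`
    }

private:
    std::mutex mutex_;
    std::vector<StagedMesh> ready_;
};

// Usage sketch with a fake "GPU buffer" per chunk instead of real GL buffers.
std::size_t demoStagedUpload() {
    UploadQueue queue;
    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id)
        workers.emplace_back([&queue, id] {
            // Pretend mesh generation: 100 triangles' worth of floats.
            queue.push(StagedMesh{id, std::vector<float>(300, float(id))});
        });
    for (std::thread& w : workers)
        w.join();

    std::vector<std::vector<float>> fakeGpu(4);
    return queue.drain(
        [&](int id, std::size_t bytes) -> void* {
            fakeGpu[id].resize(bytes / sizeof(float));
            return fakeGpu[id].data();
        },
        [](int) {});
}
```

The point of the shape is that workers never touch the API at all, and the thread that does touch it only holds each mapping for the duration of one memcpy.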

If those reasons fit into your reasons, I can make suggestions on how to go about things. It's really a tricky subject though.

I think as long as you create a context for each worker thread, immediately use wglShareLists, make it current, and then only use it for creating or uploading OpenGL objects and not for drawing, it should work. Whether it's efficient would be the next question.

I would not try to share one context between different threads by only making it current briefly every time it's needed.

Edit: wglShareLists is intended for all kinds of objects, not just displaylists. They just kept the name from the old version when there were no other objects.

I think as long as you create a context for each worker thread, immediately use wglShareLists, make it current, and then only use it for creating or uploading OpenGL objects and not for drawing, it should work. Whether it's efficient would be the next question.

This is a valid use, unfortunately when I was doing this the whole thing was extremely inconsistent across drivers. Some of them supported some things and not others, crashed if you uploaded a texture which was accidentally in use in an invalid state, etc etc. After figuring out how to do things without using shared lists (and whatever else) I didn't see any reason to go back, it was much more consistent and allowed a better architecture to boot.

I would not try to share one context between different threads by only making it current briefly every time it's needed.

This I know for a fact is bad. At 15k+ cycles to make the context valid on a new thread it was just pitiful. I tried basically the same thing as the generic thread queue solution and destroyed performance. :)

Edit: wglShareLists is intended for all kinds of objects, not just displaylists. They just kept the name from the old version when there were no other objects.

I realize they share a lot beyond display lists, and the systems have been updated, as mentioned, to contain more of the new stuff. After the initial snafus, and then just going and writing things in a different manner, I never revisited them nor cared to. Hopefully they are considerably better nowadays, but I can't suggest them myself.

yes, it is part of the point to create new contexts and use ShareLists.


from what I read, it shares the various kinds of objects (textures, VBOs, FBOs, ...).

I mostly just ended up moving the geometry-generation for the voxel terrain over to worker threads (which basically takes 16x16x16 cubes of voxels and converts them into triangle meshes), mostly because this task is CPU intensive and resulted in a framerate hit whenever the player was moving around (like, it is kind of lame to have 40fps standing in one spot, only to have it drop to around 10fps whenever the player is walking around). moving this over to threads has somewhat reduced the performance impact (albeit still leading to occasional bugs, ...).


the main other things I am considering at the moment are model and texture loading, and probably also video decoding (at the moment, I am still using my custom MJPEG variant for video-maps).

...
status:
added multithreaded texture loading (partial).

the (low-resolution) base-texture is still loaded in the main render-thread, mostly because normally the caller expects the base-texture number and resolution to be available immediately (in many areas, the resolution is used in ST coordinate calculations, and likewise for the texture number).

however, the high-resolution texture and normal/specular/... textures have been moved to worker threads.

I ended up having to mutex-protect uploading the textures, as otherwise I seemed to be getting garbled mixes of multiple textures (the texture upload code in my case may not be thread-safe).
likewise, it was observed that it is necessary to call "glFinish()" after uploading the textures, otherwise the texture spends a while basically just looking like random garbage until randomly popping into being the expected texture some-time (often many seconds) later.


then also ended up mutex-protecting a few other things, like my inflate code (which currently uses shared buffers, *1), and also the rotating allocator (basically, an allocator which allocates temporary-use memory from a ring-buffer, generally used for temporary string buffers for parsing things and similar). the rotating allocator basically works under the premise that everything will be done using the memory by the time the rover comes around again and ends up overwriting it. (this is frequently used by many of my loaders, and the multithreaded loading was causing them to step on each other).

*1: this saves the cost of having to allocate/free the buffers between uses, but probably does more harm if used from multiple threads, as then only a single thread can inflate something at a time.
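For illustration, a rotating allocator of the kind described above might look roughly like this once it is mutex-protected. This is a hypothetical sketch, not the engine's code: the class name, the alignment policy and the use of `std::mutex` are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch of a "rotating" allocator: a fixed ring of bytes with a
// rover that only moves forward and wraps. Nothing is ever freed; old
// allocations are silently overwritten once the rover comes around again.
class RotatingAllocator {
public:
    // Assumes capacity > 0.
    explicit RotatingAllocator(std::size_t capacity)
        : buffer_(capacity), rover_(0) {}

    void* alloc(std::size_t size) {
        // Round the size up so that every returned offset stays aligned.
        const std::size_t align = alignof(std::max_align_t);
        size = (size + align - 1) / align * align;
        assert(size <= buffer_.size() && "request larger than the whole ring");

        std::lock_guard<std::mutex> lock(mutex_);  // rover update must not interleave
        if (rover_ + size > buffer_.size())
            rover_ = 0;                            // wrap: overwrite the oldest data
        void* p = buffer_.data() + rover_;
        rover_ += size;
        return p;
    }

private:
    std::vector<unsigned char> buffer_;
    std::size_t rover_;
    std::mutex mutex_;
};
```

Note that the lock only keeps two threads from being handed the same bytes at the same moment. It does nothing for the underlying premise that callers are finished with their memory before the rover wraps, and with several loader threads allocating at once the ring wraps correspondingly sooner. A separate ring per thread would be one way around both problems.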

also slightly worrying is that the VFS might not be thread-safe in some areas, but would be harder to protect as many of these areas are potentially also re-entrant (VFS FS-drivers also using VFS calls), so naively mutex locking them would have an unreasonable likelihood of producing deadlocks (absent reorganizing the code). the main risk area has to do with operations like mounting/unmounting and also things like "mkdir" / "rmdir" / ..., which are fairly infrequent (nearly all mounting happens at initial engine start-up, ...).


next up, probably model loading...
maybe also video-maps.

will have to think up how to best handle "model which isn't quite loaded yet". it is a tradeoff between either skipping it, or drawing a temporary placeholder. for now, will probably skip it (less effort), and maybe later add a placeholder.
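One common way to handle the "not quite loaded yet" case is a per-model ready flag that the loader publishes last. The sketch below assumes C++11 atomics; `Model`, `finishLoading` and `drawModel` are made-up names, and the actual drawing is left out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

struct Model {
    std::vector<float> vertices;      // filled in by the loader thread
    std::atomic<bool> ready{false};   // published last, with release ordering
};

enum class DrawResult { Skipped, Drawn };

// Loader side: fill in the data first, then publish the flag.
void finishLoading(Model& m, std::vector<float> data) {
    m.vertices = std::move(data);
    m.ready.store(true, std::memory_order_release);
}

// Render side: never touches m.vertices unless the flag has been observed.
DrawResult drawModel(const Model& m) {
    if (!m.ready.load(std::memory_order_acquire))
        return DrawResult::Skipped;   // or draw a placeholder here instead
    assert(!m.vertices.empty());      // this sketch assumes loaded models are non-empty
    // ... real drawing would go here ...
    return DrawResult::Drawn;
}
```

The release/acquire pair is what guarantees that a renderer which sees `ready == true` also sees the fully written vertex data. Switching from skipping to a placeholder later only changes the early-return branch.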


ADD: update again:
model-loading and video-decoding are now also moved off to the work-queue.

for some reason, LOD leveling has gotten a bit glitchy (despite using mutex locking around it). then again, I suspect it was buggy already before this.

current major threads:
main thread (manages GC and work-queues);
render thread (client and rendering);
server thread (server tick: physics, game-logic, ...);
GL worker threads (variable number: handle model/texture loading, video decoding, ...).

framerates have gone up somewhat, and there are fewer stalls.
most of the execution time still goes into the renderer.

as-is, with a quad-core, the engine runs at around 40-60% load (vs 25% when it was still mostly single-threaded).

the advantage of having a shared OpenGL state in these worker threads, is so that these tasks can be done without needing to coordinate the process with the main thread (for example, still needing to rely on the main thread to upload the texture or load vertex-arrays into a VBO or similar). (granted, some coordination is still needed for creating these worker threads).

This is a big fallacy. Shared contexts work, and they work fine (and I can't tell of any mysterious failures, at least not if you create them "correctly", i.e. using wglCreateContextAttribsARB). However, it does not work without coordinating. You only don't see it happening, and also it sometimes does not work the way you think -- sometimes you will run into subtle bugs. The driver will do some heavy lifting to coordinate what you're doing, which results in about one millisecond of added frame time (depends on card). It also seems to make little or no difference whether you properly schedule updates (i.e. only modify objects that are not in use) or not -- though that particular thing may be simply because drivers to date are not handling this path optimally since nobody uses it anyway. Might change in the future (or not), who knows.

In some cases, you can, despite driver synchronization, generate undesired rendering effects, too. The driver will usually (pretty much guaranteed) make sure that you are not overwriting a buffer it is currently rendering from. However, it doesn't coordinate modifications on several objects in a meaningful way (how could it do that!). Which means you may end up modifying a vertex buffer, and then a texture, and the driver will properly synchronize so none of them is garbled -- but it will pick up the old texture with the new vertices for rendering.

There's a chapter about this very thing in OpenGL Insights as well (chapter "Asynchronous Transfers", incidentally this happens to be one of the free sample chapters), if you don't trust my word alone.

A millisecond is a lot of time (if you are aiming for 16.6ms frame time), so shared contexts are not what you want to use most of the time. You will usually get better performance if you have just one context, and your render thread maps buffers and passes a raw pointer to a worker thread. The worker can then fill the buffer with something meaningful, and finally the render thread unmaps the buffer.
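A rough sketch of that map / hand off / unmap pattern, under stated assumptions: `FakeBuffer` is a stand-in for a real buffer object so the example runs without a context (in real code `map()` and `unmap()` would be glMapBuffer or glMapBufferRange and glUnmapBuffer, called only on the render thread), and `std::async` stands in for whatever worker mechanism the engine uses.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for a GL buffer object. Only the render thread calls map()/unmap().
class FakeBuffer {
public:
    explicit FakeBuffer(std::size_t floats) : storage_(floats) {}
    float* map()   { assert(!mapped_); mapped_ = true;  return storage_.data(); }
    void   unmap() { assert(mapped_);  mapped_ = false; }
    const std::vector<float>& contents() const { return storage_; }

private:
    std::vector<float> storage_;
    bool mapped_ = false;
};

// Worker side: plain memory writes through the raw pointer, no GL calls at all.
void fillVertices(float* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = static_cast<float>(i) * 0.5f;
}

// Render-thread side: map, hand the raw pointer to a worker, unmap when done.
float demoMapHandoff() {
    const std::size_t count = 1024;
    FakeBuffer buffer(count);

    float* ptr = buffer.map();                                     // render thread
    std::future<void> done =
        std::async(std::launch::async, fillVertices, ptr, count);  // worker thread

    // ... the render thread can keep drawing other things here, but must not
    // draw from this buffer while it is still mapped ...

    done.get();        // wait until the worker has finished writing
    buffer.unmap();    // render thread again
    return buffer.contents()[count - 1];
}
```

In a real frame loop the render thread would more likely poll the future (for example with `wait_for` and a zero timeout) once per frame rather than block in `get()`, and unmap on the first frame where the worker has finished.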

@cr88192, with all your mutexes about the place, how much of a performance increase have you measured over your original single-threaded GL code?

