loading stuff is a fairly CPU-intensive process though.

samoth, on 18 Mar 2013 - 07:58, said:
This is a big fallacy. Shared contexts work, and they work fine (and I can't tell of any mysterious failures, at least not if you create them "correctly", i.e. using wglCreateContextAttribsARB). However, it does not work without coordination. You just don't see it happening, and sometimes it does not work the way you think -- sometimes you will run into subtle bugs. The driver will do some heavy lifting to coordinate what you're doing, which results in about one millisecond of added frame time (depending on the card). It also seems to make little or no difference whether you properly schedule updates (i.e. only modify objects that are not in use) or not -- though that may simply be because drivers to date don't handle this path optimally, since nobody uses it anyway. That might change in the future (or not), who knows.

Quote
the advantage of having a shared OpenGL state in these worker threads is that these tasks can be done without needing to coordinate the process with the main thread (for example, still needing to rely on the main thread to upload the texture or load vertex arrays into a VBO or similar). (granted, some coordination is still needed for creating these worker threads.)
In some cases, despite driver synchronization, you can generate undesired rendering effects too. The driver will usually (pretty much guaranteed) make sure that you are not overwriting a buffer it is currently rendering from. However, it doesn't coordinate modifications across several objects in a meaningful way (how could it?). Which means you may end up modifying a vertex buffer and then a texture, and the driver will properly synchronize so that neither is garbled -- but it may still pick up the old texture with the new vertices for rendering.
There's a chapter about this very thing in OpenGL Insights as well (the chapter "Asynchronous Buffer Transfers", which incidentally happens to be one of the free sample chapters), if you don't trust my word alone.
A millisecond is a lot of time (if you are aiming for 16.6ms frame time), so shared contexts are not what you want to use most of the time. You will usually get better performance if you have just one context, and your render thread maps buffers and passes a raw pointer to a worker thread. The worker can then fill the buffer with something meaningful, and finally the render thread unmaps the buffer.
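The map-then-hand-off pattern described above can be sketched roughly as follows. This is a minimal illustration, not anyone's actual engine code: a plain array stands in for the pointer that glMapBuffer would return (there is no GL context here), and the `fill_job` / `fill_buffer_via_worker` names are invented for the example.

```c
/* Sketch: render thread "maps" a buffer, hands the raw pointer to a
 * worker thread, waits, then would "unmap". No real GL calls here. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
    float *ptr;   /* pointer the render thread obtained (e.g. from glMapBuffer) */
    size_t count; /* number of floats the worker should write */
} fill_job;

/* Worker thread: writes vertex data through the raw pointer; it never
 * touches the GL context itself. */
static void *fill_vertices(void *arg)
{
    fill_job *job = (fill_job *)arg;
    for (size_t i = 0; i < job->count; i++)
        job->ptr[i] = (float)i; /* placeholder vertex data */
    return NULL;
}

/* Render-thread side: map, delegate the fill, then unmap
 * (glUnmapBuffer in real code, once the worker signals completion). */
static int fill_buffer_via_worker(float *mapped, size_t count)
{
    fill_job job = { mapped, count };
    pthread_t worker;
    if (pthread_create(&worker, NULL, fill_vertices, &job) != 0)
        return -1;
    /* joining per buffer is simplistic; a real engine would use a
     * completion flag or semaphore so the render thread isn't blocked */
    pthread_join(worker, NULL);
    return 0;
}
```

The point of the pattern is that only one thread ever issues GL calls; the workers only ever see raw memory.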
most of my models are in ASCII text formats, and textures need to be decompressed (from PNG or JPEG) and converted to DXT prior to upload, ... this does at least take a lot of this work out of the main render thread (reducing obvious stalls, ...).
not having to deal with inter-thread communication is easier though, as it avoids a lot of the awkwardness of logic working around shared object states, and also the need for (explicit) event-driven / message-passing type stuff.
whether or not OpenGL needs to synchronize internally is less of an issue here (it may affect performance, but not how much code needs to be written).
the main alternative considered, had I gone this route previously (before knowing about this GL feature), would likely have involved putting all of the texture uploads and similar into an event queue (where the renderer thread would spin in a loop and invoke event-handling logic for each texture to be uploaded, ...). (whether this would be faster or slower, I don't really know.)
note that the current implementation doesn't rule out the possibility of going over to an event-queue if really needed.
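For illustration, the event-queue alternative described above might look something like the following sketch. The `upload_queue` type and function names are invented here (nothing from the actual engine): workers push decoded texture data onto a locked queue, and the render thread drains it once per frame and performs the actual glTexImage2D-style uploads itself.

```c
/* Sketch of a mutex-protected upload queue: workers produce decoded
 * images, the render thread consumes and uploads them. */
#include <assert.h>
#include <pthread.h>

#define UPLOADQ_MAX 64

typedef struct {
    void *items[UPLOADQ_MAX]; /* decoded images awaiting upload */
    int head, tail, count;
    pthread_mutex_t lock;
} upload_queue;

static void uploadq_init(upload_queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
}

/* Worker side: enqueue a decoded image; returns 0 if the queue is full. */
static int uploadq_push(upload_queue *q, void *pixels)
{
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->count < UPLOADQ_MAX) {
        q->items[q->tail] = pixels;
        q->tail = (q->tail + 1) % UPLOADQ_MAX;
        q->count++;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Render-thread side: pull up to 'max' pending items; the caller then
 * uploads each one through the normal GL texture-upload path. */
static int uploadq_drain(upload_queue *q, void **out, int max)
{
    int n = 0;
    pthread_mutex_lock(&q->lock);
    while (q->count > 0 && n < max) {
        out[n++] = q->items[q->head];
        q->head = (q->head + 1) % UPLOADQ_MAX;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

Draining into a local array (rather than calling GL while holding the lock) keeps the critical section short, which matters if workers are pushing frequently.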
~ 10 fps while walking around, to often more around 30 fps.

Hodgman, on 18 Mar 2013 - 08:17, said:
@cr88192, with all your mutexes about the place, how much of a performance increase have you measured over your original single-threaded GL code?
moving the video-map decoding to workers got framerates often up from around 20-30 fps to around 40-50 fps (except it seems when looking at water, or when a lot of character models come into the scene, both of which are still "less than ideal").
video-maps are used mostly for fires and a few other misc animated-texture effects (and currently involve decoding video frames in a modified M-JPEG format; I have designed/implemented a faster-decoding format which goes straight to DXTn, but haven't switched over to it yet).
generally though, locking is needed to prevent threads from doing things like stomping on shared state and similar (like incorrectly updating linked lists or rovers, ...).
note that I am generally using the faster "Critical Section" mutexes rather than the slower "CreateMutex"/"WaitForSingleObject" mutexes, mostly as past testing had shown the latter to be pretty slow (~ 1 us to lock or unlock), though they are better-behaved in some cases. (I had noted all this before when dealing with multi-threading in my script VM and MM/GC.)
FWIW, 1 us is actually a surprisingly long time, so I am not sure what exactly MS is doing in there.
in my own threading wrapper, these are called Mutex and FastMutex objects.
I was originally using custom-written spinlocks for FastMutex, but then observed that Critical Section objects were similarly fast.
on Linux there is no difference, as the default "pthread_mutex_t" seems to behave more like the Critical Section objects, so I just use the default mutexes in both cases.
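A wrapper like the FastMutex described above might look roughly like this. The Mutex/FastMutex names come from the post, but the implementation below is my own guess at the obvious one: CRITICAL_SECTION on Windows, and plain pthread_mutex_t on Linux, where (as noted) the two cases collapse into one.

```c
/* Sketch of a cross-platform FastMutex wrapper: Critical Sections on
 * Windows, default pthread mutexes elsewhere. */
#include <assert.h>

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION FastMutex;
static void fastmutex_init(FastMutex *m)    { InitializeCriticalSection(m); }
static void fastmutex_lock(FastMutex *m)    { EnterCriticalSection(m); }
static void fastmutex_unlock(FastMutex *m)  { LeaveCriticalSection(m); }
static void fastmutex_destroy(FastMutex *m) { DeleteCriticalSection(m); }
#else
#include <pthread.h>
typedef pthread_mutex_t FastMutex;
static void fastmutex_init(FastMutex *m)    { pthread_mutex_init(m, NULL); }
static void fastmutex_lock(FastMutex *m)    { pthread_mutex_lock(m); }
static void fastmutex_unlock(FastMutex *m)  { pthread_mutex_unlock(m); }
static void fastmutex_destroy(FastMutex *m) { pthread_mutex_destroy(m); }
#endif
```

Both paths stay in user space on the uncontended case, which is why they avoid the ~1 us cost of the kernel-object mutexes mentioned above.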
did record a video (note, recording limited to 15fps):
there are still some stalls, but when going around on that track it is much smoother than it was previously. there is still an issue of the engine using up most of the available address space for a 32-bit process (reducing memory footprint is an ongoing issue; the main alternative would be going x64-only...).