multiple threads, OpenGL, and worker threads...

15 comments, last by cr88192 11 years, 1 month ago

samoth, on 18 Mar 2013 - 07:58, said:

the advantage of having shared OpenGL state in these worker threads is that these tasks can be done without needing to coordinate the process with the main thread (for example, still needing to rely on the main thread to upload the texture or load vertex arrays into a VBO or similar). (granted, some coordination is still needed for creating these worker threads).

This is a big fallacy. Shared contexts work, and they work fine (and I can't report any mysterious failures, at least not if you create them "correctly", i.e. using wglCreateContextAttribsARB). However, it does not work without coordinating. You just don't see it happening, and it also sometimes does not work the way you think -- sometimes you will run into subtle bugs. The driver will do some heavy lifting to coordinate what you're doing, which results in about one millisecond of added frame time (depends on the card). It also seems to make little or no difference whether you properly schedule updates (i.e. only modify objects that are not in use) or not -- though that particular thing may simply be because drivers to date do not handle this path optimally, since nobody uses it anyway. Might change in the future (or not), who knows.
In some cases, you can, despite driver synchronization, generate undesired rendering effects, too. The driver will usually (pretty much guaranteed) make sure that you are not overwriting a buffer it is currently rendering from. However, it doesn't coordinate modifications on several objects in a meaningful way (how could it do that!). Which means you may end up modifying a vertex buffer, and then a texture, and the driver will properly synchronize so none of them is garbled -- but it will pick up the old texture with the new vertices for rendering.

There's a chapter about this very thing in OpenGL Insights as well (chapter "Asynchronous Transfers"; incidentally, this happens to be one of the free sample chapters), if you don't trust my word alone.

A millisecond is a lot of time (if you are aiming for 16.6ms frame time), so shared contexts are not what you want to use most of the time. You will usually get better performance if you have just one context, and your render thread maps buffers and passes a raw pointer to a worker thread. The worker can then fill the buffer with something meaningful, and finally the render thread unmaps the buffer.
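For illustration, a minimal sketch of that map / worker-fill / unmap pattern (loadAndDecodeModelInto is a made-up stand-in for the slow CPU-side work; the VBO is assumed to already exist, and error handling is omitted):

#include <GL/glew.h>
#include <chrono>
#include <cstddef>
#include <future>

void loadAndDecodeModelInto(void *dst, size_t size);   // hypothetical slow CPU-side loader

GLuint vbo;                // created elsewhere with glGenBuffers/glBufferData
size_t vboSize;            // size passed to glBufferData
std::future<void> pending;

void beginStreamUpdate()                        // render thread
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, vboSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    // the worker only ever sees a raw pointer, never the GL context
    pending = std::async(std::launch::async,
                         [ptr] { loadAndDecodeModelInto(ptr, vboSize); });
}

void endStreamUpdateIfReady()                   // render thread, e.g. once per frame
{
    if (pending.valid() &&
        pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
    {
        // note: nothing may draw from this VBO while it is still mapped
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glUnmapBuffer(GL_ARRAY_BUFFER);         // the data becomes visible to GL here
        pending = std::future<void>();
    }
}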


loading stuff is a fairly CPU-intensive process though.

most of my models are in ASCII text formats, textures need to be decompressed (from PNG or JPEG) and converted to DXT prior to upload, ... this does at least take a lot of this work out of the main render thread (reducing obvious stalls, ...).


not having to deal with inter-thread communication is easier though, as it avoids a lot of the awkwardness of logic working around shared object state, and also the need for doing (explicit) event-driven / message-passing type stuff.

whether or not OpenGL needs to synchronize internally is less of an issue here (it may affect performance, but not how much code needs to be written).

the main alternative I had considered, had I gone this way previously (before knowing about this GL feature), would likely have involved putting all of the texture uploads and similar into an event queue (where the renderer thread would spin in a loop and invoke event-handling logic for all the textures to be uploaded, ...). (whether this would be faster or slower, I don't really know).

note that the current implementation doesn't rule out the possibility of going over to an event-queue if really needed.
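for reference, a rough sketch of what such an upload queue could look like (PendingTexture and the function names are made up for illustration): workers push already-decoded images, and the render thread drains the queue once per frame.

#include <GL/glew.h>
#include <mutex>
#include <queue>
#include <vector>

struct PendingTexture {
    GLuint tex;                        // texture object to fill
    int width, height;
    std::vector<unsigned char> rgba;   // already decoded by a worker
};

std::mutex uploadLock;
std::queue<PendingTexture> uploadQueue;

void queueTextureUpload(PendingTexture t)       // called from worker threads
{
    std::lock_guard<std::mutex> g(uploadLock);
    uploadQueue.push(std::move(t));
}

void drainTextureUploads()                      // called once per frame on the render thread
{
    std::lock_guard<std::mutex> g(uploadLock);
    while (!uploadQueue.empty()) {
        PendingTexture &t = uploadQueue.front();
        glBindTexture(GL_TEXTURE_2D, t.tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, t.width, t.height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, t.rgba.data());
        uploadQueue.pop();
    }
}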

Hodgman, on 18 Mar 2013 - 08:17, said:
@cr88192, with all your mutexes about the place, how much of a performance increase have you measured over your original single-threaded GL code?

from ~10 fps while walking around, up to now often around 30 fps.

moving the video-map decoding to workers got framerates often up from around 20-30 fps to around 40-50 fps (except it seems when looking at water, or when a lot of character models come into the scene, both of which are still "less than ideal").

video-maps are used mostly for fires and a few other misc animated-texture effects (and currently involve decoding video frames in a modified M-JPEG format; I had designed/implemented a faster-decoding format which goes straight to DXTn, but haven't switched over to it yet).


generally though, locking is needed to prevent threads from doing things like stomping on shared state and similar (like incorrectly updating linked lists or rovers, ...).

note that I am generally using the faster "Critical Section Object" mutexes, rather than the slower "CreateMutex"/"WaitForSingleObject" mutexes, mostly as past testing had shown the latter to be pretty slow themselves (~1 us to lock or unlock), though they are better behaved in some cases. (I had noted all this before when dealing with multi-threading in my Script-VM and MM/GC).

FWIW, 1 us is actually a surprisingly long time, so I am not sure what exactly MS is doing in there.


in my own threading wrapper, these are called Mutex and FastMutex objects.
I was originally using custom-written spinlocks for FastMutex, but then observed that Critical Section objects were similarly fast.

on Linux, there is no difference, as the default Linux "pthread_mutex_t" seems to behave more like the Critical Section objects, thus I just use the default mutexes for both cases.
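roughly, such a wrapper amounts to something like this (a simplified sketch of the split described above, not the engine's actual code):

#ifdef _WIN32
#include <windows.h>

struct FastMutex {
    CRITICAL_SECTION cs;
    FastMutex()  { InitializeCriticalSection(&cs); }
    ~FastMutex() { DeleteCriticalSection(&cs); }
    void lock()   { EnterCriticalSection(&cs); }
    void unlock() { LeaveCriticalSection(&cs); }
};
#else
#include <pthread.h>

struct FastMutex {
    pthread_mutex_t m;
    FastMutex()  { pthread_mutex_init(&m, nullptr); }
    ~FastMutex() { pthread_mutex_destroy(&m); }
    void lock()   { pthread_mutex_lock(&m); }
    void unlock() { pthread_mutex_unlock(&m); }
};
#endif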


did record a video (note, recording limited to 15fps):



there are still some stalls, but when going around on that track, it is much smoother than it was previously, though there is still an issue of the engine using up most of the available address space for a 32-bit process (reducing memory footprint is still an ongoing issue, the main alternative being to go x64-only...).

Wait. You're rendering this at 50 fps? Does the original above run at 50 fps? The green framerate at the top of the screen is often around 30 and sometimes dips as low as a single-digit number. Perhaps it's the video capture framerate? What is your hardware?

Previously "Krohm"

most of my models are in ASCII text formats, textures need to be decompressed (from PNG or JPEG) and converted to DXT prior to upload, ... this does at least take a lot of this work out of the main render thread (reducing obvious stalls, ...).

That's something one normally does once during the build. Loading models as ASCII text or loading PNG and converting to DXT on the fly is kind of a design mistake. You really do not want to do this. Videos used as textures may be the only "legitimate" exception since video formats just compress so much better than DXT, so the saved bandwidth may be worth it.

Not only does using general formats (in particular text formats) waste CPU time completely unnecessarily, but an offline DXT compressor will also be able to deliver much better quality than your on-the-fly PNG-to-DXT transcoder (quality DXT compression is non-trivial). Any PNG you convert to DXT, you could equally well store as DXT at better quality in the first place.

That aside, even if you insist on doing the above, there is nothing that hinders you from doing the slow stuff in a worker thread while still only using one OpenGL context. This avoids the driver synchronization overhead. More or less everybody is doing this (because disk access is always "slow"). Load and decompress/transcode/whatever in a worker thread, map a buffer in the main thread, fill the memory pointed to in the worker. Unmap the buffer when ready. Data will usually only be ready several frames later anyway, since a single disk access is on the order of one full frame time. Therefore, there is really not much need for many ill-advised little syncs in the middle of the frame. Fire off requests, and see how many became ready at the end of the frame when you're not doing anything but wait for vertical sync.
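As a sketch, the "fire off requests, collect at end of frame" part can be as simple as a list of outstanding requests with a ready flag (the names are illustrative, and a real implementation would use a thread pool rather than detached threads):

#include <atomic>
#include <cstddef>
#include <functional>
#include <memory>
#include <thread>
#include <vector>

struct LoadRequest {
    std::function<void()> work;       // worker side: read the file, decode/transcode into a mapped pointer
    std::function<void()> finalize;   // render-thread side: unmap the buffer, flag the resource as usable
    std::atomic<bool> done{false};
};

std::vector<std::shared_ptr<LoadRequest>> inFlight;

void fireOff(std::shared_ptr<LoadRequest> req)      // render thread
{
    inFlight.push_back(req);
    std::thread([req] { req->work(); req->done.store(true); }).detach();
}

void collectAtEndOfFrame()                          // render thread, while waiting on vsync
{
    for (size_t i = 0; i < inFlight.size(); ) {
        if (inFlight[i]->done.load()) {
            inFlight[i]->finalize();                // GL calls happen only here
            inFlight.erase(inFlight.begin() + i);
        } else {
            ++i;
        }
    }
}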

whether or not OpenGL needs to synchronize internally is less of an issue here (it may affect performance, but not how much code needs to be written)

If you can live with the overhead, that is fine (though if you have 40-50 fps, you cannot in my opinion). However, as I pointed out earlier, there is another issue with this synchronization. The driver does not know what you're going to do, so it cannot synchronize in a "meaningful way". Which means you pay for something, but don't get anything back. You must in addition use an event object anyway, or it will not work reliably.

Otherwise, updates will "work" insofar as every individual update will be consistent, but you have no control over when it takes effect. In the very worst case, you might end up wondering why your render thread crashes (because the driver decided that it's OK to only make that buffer you're reading from available to the render thread some time later). Therefore you must sync explicitly, which in summary means you sync once more than would otherwise be necessary.
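For example, one way to make that hand-off explicit across shared contexts is a fence object (GL 3.2 / ARB_sync), which plays the role of the "event object" here; a minimal polling sketch:

#include <GL/glew.h>
#include <atomic>

std::atomic<GLsync> uploadFence{nullptr};

// worker thread, on its shared context, after issuing the upload:
void finishUpload()
{
    GLsync f = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();                               // make sure the upload and the fence reach the GPU
    uploadFence.store(f);
}

// render thread, before first use of the uploaded object:
bool uploadReady()
{
    GLsync f = uploadFence.load();
    if (!f)
        return false;
    GLenum r = glClientWaitSync(f, 0, 0);    // zero timeout: poll, don't block
    if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
        glDeleteSync(f);
        uploadFence.store(nullptr);
        return true;
    }
    return false;
}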

Mutex is slower than critical sections [...] 1 us is actually a surprisingly long time, so I am not sure what exactly MS is doing in there.

There are several reasons for this. First, entering a critical section is merely atomically incrementing an integer in the normal, non-contended case. In the moderately contended case, it spins a few dozen or so times. Only in the very unusual, highly contended case is it a system call that blocks the thread. A mutex is always a system call that blocks the thread when there is any kind of contention. Also, a mutex is a heavyweight kernel object that can be used to synchronize different processes (whereas a critical section is a memory address plus a per-process keyed event object). When your thread blocks, it will be unblocked when the mutex is signalled and the scheduler next runs. This is "works as intended".

You can use keyed events (the underlying "block" mechanism used in critical sections) to build your own mutex which altogether can be roughly 30% faster than a critical section, depending on your design. I advise against that, however. Rather, get your threading and synchronization correct so you need only a few sync points. This will not only make a 30% difference, but a 300% or maybe 3,000% difference.
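For illustration, a much-simplified lock with the shape described above (this is not how CRITICAL_SECTION is actually implemented -- the real one parks the thread on a keyed event rather than yielding -- but the fast path is the same idea):

#include <atomic>
#include <thread>

struct ToyLock {
    std::atomic<int> state{0};               // 0 = free, 1 = held

    void lock() {
        // non-contended case: a single atomic exchange, no system call
        if (state.exchange(1, std::memory_order_acquire) == 0)
            return;
        // moderately contended case: spin a few dozen times
        for (int i = 0; i < 40; ++i)
            if (state.exchange(1, std::memory_order_acquire) == 0)
                return;
        // heavily contended case: give up the time slice (a real lock blocks in the kernel)
        while (state.exchange(1, std::memory_order_acquire) != 0)
            std::this_thread::yield();
    }
    void unlock() { state.store(0, std::memory_order_release); }
};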

on Linux, there is no difference, as the default Linux "pthread_mutex_t" seems to behave more like the Critical Section objects

That's because this is what they are. They're a spinlock plus a futex. Incidentally, a futex works for processes, not just threads, but that is no surprise because Linux does not have support for threads at all (it only supports processes that share an address space).

Wait. You're rendering this at 50 fps? Does the original above run at 50 fps? The green framerate at the top of the screen is often around 30 and sometimes dips as low as a single-digit number. Perhaps it's the video capture framerate? What is your hardware?

it isn't always 50 fps; rather it varies depending on where the player is looking (for example, looking at water also currently hurts the framerate pretty badly).

also, in most of the video the player is moving (especially on the track), which previously would have often meant single-digit framerates the whole time (rather than just occasionally), mostly because of the performance hit from streaming and rebuilding the terrain geometry. (it is at least better than it was before).


also, the engine is doing video capture in the above case, and video capture/encoding isn't free either, usually resulting in a considerable framerate hit (originally, prior to some earlier optimizations, video capture tended to make things almost unplayable).

I had considered before the possibility of writing a faster and more specialized JPEG encoder for this case (the in-engine video capture uses M-JPEG). (mostly as raw frames eat large amounts of HDD space).


hardware:
CPU: AMD Athlon II X4 at 2.8 GHz;
RAM: 16GB DDR3 (PC3-12800);
video: GeForce GTX 460.


most of my models are in ASCII text formats, textures need to be decompressed (from PNG or JPEG) and converted to DXT prior to upload, ... this does at least take a lot of this work out of the main render thread (reducing obvious stalls, ...).

That's something one normally does once during the build. Loading models as ASCII text or loading PNG and converting to DXT on the fly is kind of a design mistake. You really do not want to do this. Videos used as textures may be the only "legitimate" exception since video formats just compress so much better than DXT, so the saved bandwidth may be worth it.

Not only does using general formats (in particular text formats) waste CPU time completely unnecessarily, but an offline DXT compressor will also be able to deliver much better quality than your on-the-fly PNG-to-DXT transcoder (quality DXT compression is non-trivial). Any PNG you convert to DXT, you could equally well store as DXT at better quality in the first place.

That aside, even if you insist on doing the above, there is nothing that hinders you from doing the slow stuff in a worker thread while still only using one OpenGL context. This avoids the driver synchronization overhead. More or less everybody is doing this (because disk access is always "slow"). Load and decompress/transcode/whatever in a worker thread, map a buffer in the main thread, fill the memory pointed to in the worker. Unmap the buffer when ready. Data will usually only be ready several frames later anyway, since a single disk access is on the order of one full frame time. Therefore, there is really not much need for many ill-advised little syncs in the middle of the frame. Fire off requests, and see how many became ready at the end of the frame when you're not doing anything but wait for vertical sync.


as can be noted, I had considered moving to a DXT-based format for both textures and video, but thus far haven't done so, mostly as PNG and JPEG are much more convenient for use with graphics programs (and I don't have any sort of explicit "process resources for game" stage).

using lots of ASCII formats was more so that data can be examined / edited manually when needed.

actually, most of the 3D models are in a variant of the AC3D file format (partly originally because it was ASCII text).

I had previously considered the possibility of moving over to something like IQM or similar.


the world is currently stored in a binary format, but mostly as voxels.

I had considered the possibility of moving the client/server interface to send geometry (rather than voxel data), which would mostly make all the geometry-building a server-side task. eventually, caching prebuilt geometry in the world-regions could also possibly make sense.


whether or not OpenGL needs to synchronize internally is less of an issue here (it may affect performance, but not how much code needs to be written)

If you can live with the overhead, that is fine (though if you have 40-50 fps, you cannot in my opinion). However, as I pointed out earlier, there is another issue with this synchronization. The driver does not know what you're going to do, so it cannot synchronize in a "meaningful way". Which means you pay for something, but don't get anything back. You must in addition use an event object anyway, or it will not work reliably.
Otherwise, updates will "work" insofar as every individual update will be consistent, but you have no control over when it takes effect. In the very worst case, you might end up wondering why your render thread crashes (because the driver decided that it's OK to only make that buffer you're reading from available to the render thread some time later). Therefore you must sync explicitly, which in summary means you sync once more than would otherwise be necessary.


could be, but as is, it is better than having an obvious stall every time something loaded, which was the previous issue.


but, 40-50 is pretty good.

there was a time, a little earlier on in the engine development, when I thought I was doing pretty good when it was consistently breaking 10 fps, then 20 fps and later 30 fps became the goals.

then the Doom 3 source came out, and upon looking at it, I realized partly how terrible all my rendering architecture was, but it is a long road trying to make it (gradually) suck less (and also make the code less of a horrid mess as well...).


Mutex is slower than critical sections [...] 1 us is actually a surprisingly long time, so I am not sure what exactly MS is doing in there.

There are several reasons for this. First, entering a critical section is merely atomically incrementing an integer in the normal, non-contended case. In the moderately contended case, it spins a few dozen or so times. Only in the very unusual, highly contended case is it a system call that blocks the thread. A mutex is always a system call that blocks the thread when there is any kind of contention. Also, a mutex is a heavyweight kernel object that can be used to synchronize different processes (whereas a critical section is a memory address plus a per-process keyed event object). When your thread blocks, it will be unblocked when the mutex is signalled and the scheduler next runs. This is "works as intended".

You can use keyed events (the underlying "block" mechanism used in critical sections) to build your own mutex which altogether can be roughly 30% faster than a critical section, depending on your design. I advise against that, however. Rather, get your threading and synchronization correct so you need only a few sync points. This will not only make a 30% difference, but a 300% or maybe 3,000% difference.

on Linux, there is no difference, as the default Linux "pthread_mutex_t" seems to behave more like the Critical Section objects

That's because this is what they are. They're a spinlock plus a futex. Incidentally, a futex works for processes, not just threads, but that is no surprise because Linux does not have support for threads at all (it only supports processes that share an address space).


there are still a few obvious things to work on, like probably making the Inflate code usable from multiple threads, and also allowing more parallelism for the texture conversions and uploads (so that less of the process needs to be locked).
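for the Inflate part, the gist is just giving each thread its own stream state instead of sharing one decoder; as an illustration with zlib (the engine's own Inflate code may differ, and error handling is trimmed):

#include <zlib.h>
#include <cstddef>
#include <vector>

// each worker calls this with its own buffers; no shared state, so no lock needed
std::vector<unsigned char> inflateBuffer(const unsigned char *src, size_t srcLen,
                                         size_t expectedOutLen)
{
    std::vector<unsigned char> out(expectedOutLen);

    z_stream zs = {};                  // per-call (and thus per-thread) decoder state
    inflateInit(&zs);

    zs.next_in   = const_cast<Bytef *>(src);
    zs.avail_in  = static_cast<uInt>(srcLen);
    zs.next_out  = out.data();
    zs.avail_out = static_cast<uInt>(out.size());

    inflate(&zs, Z_FINISH);
    inflateEnd(&zs);

    out.resize(zs.total_out);
    return out;
}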

I had considered before the possibility of writing a faster and more specialized JPEG encoder for this case (the in-engine video capture uses M-JPEG). (mostly as raw frames eat large amounts of HDD space).

It might be simpler to write out DXT compressed video frames, and re-encode the video offline.

Outerra has a writeup on this technique.
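For reference, the usual real-time approach is a per-4x4-block min/max endpoint fit, which is much lower quality than an offline compressor but cheap; a rough DXT1 sketch (illustrative only, not necessarily what Outerra does):

#include <algorithm>
#include <cstdint>

static uint16_t toRGB565(const uint8_t *p)
{
    return uint16_t(((p[0] >> 3) << 11) | ((p[1] >> 2) << 5) | (p[2] >> 3));
}

// block: 16 RGBA pixels (64 bytes), out: 8 bytes of DXT1
void compressBlockDXT1(const uint8_t *block, uint8_t *out)
{
    uint8_t minc[3] = {255, 255, 255}, maxc[3] = {0, 0, 0};
    for (int i = 0; i < 16; ++i)
        for (int c = 0; c < 3; ++c) {
            minc[c] = std::min(minc[c], block[i * 4 + c]);
            maxc[c] = std::max(maxc[c], block[i * 4 + c]);
        }

    uint16_t c0 = toRGB565(maxc), c1 = toRGB565(minc);   // c0 >= c1 by construction
    out[0] = uint8_t(c0 & 0xff); out[1] = uint8_t(c0 >> 8);
    out[2] = uint8_t(c1 & 0xff); out[3] = uint8_t(c1 >> 8);

    if (c0 == c1) {                    // flat block: every index selects c0
        out[4] = out[5] = out[6] = out[7] = 0;
        return;
    }

    uint32_t indices = 0;
    for (int i = 0; i < 16; ++i) {
        // project each pixel onto the max->min axis and pick the nearest of the 4 levels
        int d = 0, len = 0;
        for (int c = 0; c < 3; ++c) {
            d   += (maxc[c] - block[i * 4 + c]) * (maxc[c] - minc[c]);
            len += (maxc[c] - minc[c]) * (maxc[c] - minc[c]);
        }
        int t = (d * 3 + len / 2) / len;                   // 0..3, 0 = nearest c0
        static const uint32_t remap[4] = {0, 2, 3, 1};     // step -> DXT1 index: c0, 2/3*c0, 1/3*c0, c1
        indices |= remap[t] << (i * 2);
    }
    out[4] = uint8_t(indices & 0xff);         out[5] = uint8_t((indices >> 8) & 0xff);
    out[6] = uint8_t((indices >> 16) & 0xff); out[7] = uint8_t(indices >> 24);
}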

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

swiftcoder, on 19 Mar 2013 - 14:03, said:


cr88192, on 19 Mar 2013 - 13:09, said:
I had considered before the possibility of writing a faster and more specialized JPEG encoder for this case (the in-engine video capture uses M-JPEG). (mostly as raw frames eat large amounts of HDD space).

It might be simpler to write out DXT compressed video frames, and re-encode the video offline.

Outerra has a writeup on this technique.

yes, this is also possible, since converting to DXT is a bit faster than converting to JPEG.
the partial drawback though, granted, is the need for offline re-encoding (and/or needing a special codec).

basically, this is a drawback for the common case of recording video and then loading it up in MovieMaker or similar, as manual transcoding would be needed first.

yes, I did note that the article provided a codec for the specific format they are using, so this is still possible.


the most obvious "clever trick" optimization for a JPEG encoder is basically hard-coding most or all of the tables (sort of like in MPEG), which could shave off several major steps of the encoding process (and mostly just give more room for micro-optimizing stuff). (basically: fork the existing encoder, and probably hard-code and micro-optimize everything).

profiler results show most of the CPU time is going into video recording, and of this, primarily stuff related to VLC / entropy coding blocks (followed by DCT and colorspace-conversion). not sure how much of a speedup could be gained here, probably fairly small though...

so, may still need to think on it...

profiler results show most of the CPU time is going into video recording, and of this, primarily stuff related to VLC / entropy coding blocks (followed by DCT and colorspace-conversion). not sure how much of a speedup could be gained here, probably fairly small though...

The colour space conversion should be trivial to move to the GPU, right before read-back occurs.

Entropy encoding is probably not so simple :)
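For illustration, such a pass might just be a full-screen quad rendered into an offscreen target with a small fragment shader doing the BT.601 conversion, right before the read-back (a sketch with the standard full-range coefficients, kept here as a C++ string constant; not tied to any particular engine):

static const char *rgbToYCbCrFrag = R"(
#version 120
uniform sampler2D frame;
void main()
{
    vec3 rgb = texture2D(frame, gl_TexCoord[0].st).rgb;
    float y  =  0.299 * rgb.r + 0.587 * rgb.g + 0.114 * rgb.b;
    float cb = -0.169 * rgb.r - 0.331 * rgb.g + 0.500 * rgb.b + 0.5;
    float cr =  0.500 * rgb.r - 0.419 * rgb.g - 0.081 * rgb.b + 0.5;
    gl_FragColor = vec4(y, cb, cr, 1.0);
}
)";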

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]


profiler results show most of the CPU time is going into video recording, and of this, primarily stuff related to VLC / entropy coding blocks (followed by DCT and colorspace-conversion). not sure how much of a speedup could be gained here, probably fairly small though...

The colour space conversion should be trivial to move to the GPU, right before read-back occurs.

Entropy encoding is probably not so simple :)


fiddling with it still hasn't gotten it all that much faster...

luckily, I am only sometimes recording video.


side note:
nothing special is currently done for the read-back; basically it just, at the end of the frame (after rendering everything else), issues a glReadPixels() call and passes the buffer contents off to the video encoder. yeah, probably a poor way to do it, I know... this is basically also how screenshots are done, except at a faster rate (and on a timer).

luckily, the actual video encoding happens in its own thread.
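for what it's worth, the usual improvement here is reading into a pixel buffer object and mapping it a frame later, so glReadPixels returns immediately instead of stalling; a rough sketch (handOffToEncoder is a made-up stand-in for the hand-off to the encoder thread):

#include <GL/glew.h>

void handOffToEncoder(const void *pixels, int w, int h);   // hypothetical

GLuint readbackPbo[2];
int frameIndex = 0;

void initReadback(int w, int h)
{
    glGenBuffers(2, readbackPbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, w * h * 4, nullptr, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

void captureFrame(int w, int h)        // call at the end of the frame
{
    // start an asynchronous read into this frame's PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo[frameIndex]);
    glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);   // returns without waiting

    // map the other PBO, which was filled a frame ago and should be ready by now
    // (the very first frame maps an unwritten buffer; a real version would skip it)
    int prev = 1 - frameIndex;
    glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo[prev]);
    void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (pixels) {
        handOffToEncoder(pixels, w, h);            // copy or queue for the encoder thread
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    frameIndex = prev;
}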

