multiple threads, OpenGL, and worker threads...


18 replies to this topic

#1 BGB   Crossbones+   -  Reputation: 1554


Posted 16 March 2013 - 03:23 AM

earlier, I was off moving a few things from my main thread to secondary threads, and made a discovery:
http://www.opengl.org/wiki/OpenGL_and_multithreading

notable here is that, using wglShareLists, it is apparently possible for multiple threads to each access OpenGL (at least on Windows...).
(ADD: apparently in the Linux case, sharing is instead mostly requested via an argument to glXCreateContext.)
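for reference, the Windows-side setup is roughly like the following (a minimal sketch with made-up names, not my actual engine code; assumes the main context and window DC already exist, and that the worker only uses the shared context for uploads and similar):

/* main thread: create a second context that shares objects with the main one */
HDC   dc        = wglGetCurrentDC();
HGLRC mainCtx   = wglGetCurrentContext();
HGLRC workerCtx = wglCreateContext(dc);
wglShareLists(mainCtx, workerCtx);   /* call before the new context owns any objects */
/* ... hand dc + workerCtx off to the worker thread ... */

/* worker thread, at startup: */
wglMakeCurrent(dc, workerCtx);
/* GL calls (texture uploads, VBO fills, ...) are now legal on this thread */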


this is, in effect, pretty nifty.

however, I am doing this in a slightly funky manner:
effectively, work-items are submitted via a "work queue", with worker threads which grab items off the queue, then invoke their supplied function-pointers.

(this being in contrast to using a dedicated thread for the task).


previously, there was no way of saying which worker thread would get which work items, so there was a potential issue of OpenGL-specific work-items being grabbed up by a worker thread which didn't have OpenGL available.

a solution was to add the concept of a worker-ID (along with some "clever" changes to the work-queue code): worker creation functions can be "registered", and return an info structure containing function-pointers to be called when beginning/ending execution of a new worker thread (or to deny creation of new workers), ... work-items are then only executed by a worker of the matching type (the default/generic workers were given an ID of 0).
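roughly, the work-item / worker-ID side ends up looking something like this (a simplified sketch with made-up names, not the actual engine code):

typedef struct WorkItem_s {
    void (*fn)(void *data);      /* supplied function pointer */
    void *data;
    int worker_id;               /* 0 = generic worker, other IDs = special workers (GL, ...) */
    struct WorkItem_s *next;
} WorkItem;

/* each worker only pulls items whose worker_id matches its own */
WorkItem *WorkQueue_Fetch(WorkQueue *queue, int my_worker_id)
{
    WorkItem **pp, *item = NULL;
    WorkQueue_Lock(queue);
    for (pp = &queue->head; (item = *pp); pp = &item->next) {
        if (item->worker_id == my_worker_id) {
            *pp = item->next;    /* unlink the item */
            break;
        }
    }
    WorkQueue_Unlock(queue);
    return item;                 /* NULL if nothing matched */
}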

a little logic later, and now I seem to have worker threads which can access GL (and things like rebuilding the terrain can now happen asynchronously, which is also a little bit faster).

(granted, yes, I did have to go add mutexes and similar in a few places... mostly as the terrain-building code wasn't exactly thread-safe).


thoughts?...

Edited by cr88192, 16 March 2013 - 12:41 PM.



#2 bonus.2113   Members   -  Reputation: 630


Posted 16 March 2013 - 03:00 PM

As far as I know, it is not safe to use OpenGL to draw to a window that was created in another thread. Source: Interview with John Carmack on the Doom 3 rendering code (2012, third answer). He doesn't go into details, but says:

 

[...] on windows, OpenGL can only safely draw to a window that was created by the same thread. We created the window on the launch thread, but then did all the rendering on a separate render thread. It would be nice if doing this just failed with a clear error, but instead it works on some systems and randomly fails on others for no apparent reason. [...]

 

I'm not that well versed in graphics programming though. If this doesn't apply to what you are doing, feel free to ignore/correct me :)



#3 BGB   Crossbones+   -  Reputation: 1554


Posted 16 March 2013 - 04:06 PM

As far as I know, it is not safe to use OpenGL to draw to a window that was created in another thread. Source: Interview with John Carmack on the Doom 3 rendering code (2012, third answer). He doesn't go into details, but says:

[...] on windows, OpenGL can only safely draw to a window that was created by the same thread. We created the window on the launch thread, but then did all the rendering on a separate render thread. It would be nice if doing this just failed with a clear error, but instead it works on some systems and randomly fails on others for no apparent reason. [...]

 
I'm not that well versed in graphics programming though. If this doesn't apply to what you are doing, feel free to ignore/correct me :)


these separate worker threads are more related to things like loading, geometry generation, and possibly some offline rendering (using FBOs or similar).

probably, there isn't as much of an advantage for normal rendering tasks, since the main rendering thread is largely already using up the GPU's resources, so it makes more sense for CPU-bound or IO-bound tasks (like loading files and textures and similar).


these would be tasks which don't draw onto the main screen, and don't necessarily need to happen in-sync with the rendering-frame.

the advantage of having shared OpenGL state in these worker threads is that these tasks can be done without needing to coordinate with the main thread (for example, without still relying on the main thread to upload the texture or load vertex-arrays into a VBO or similar). (granted, some coordination is still needed for creating these worker threads).

or such...

#4 AllEightUp   Moderators   -  Reputation: 4186


Posted 16 March 2013 - 04:22 PM

The first thing I tend to suggest is not to use shared lists as the solution here; that feature has a very specific (and rather outdated) use case and really is not intended for what you are likely trying to accomplish. Shared lists were primarily used for display lists, where you are recording GL calls via glBegin/glEnd, and while sharing has been updated to cover more kinds of objects, it is still not your best solution. You should be using buffers for most things by now, and those are not very well integrated into the sharing system, since you need fairly precise control over when a buffer is mapped/unmapped and, as such, when it gets copied over to the driver side.

 

Now, having said that: is threading the API calls worth it? It really depends. Most drivers already use a separate thread between the user-space calls and the actual driver side. What is the actual goal of the threading then? Mostly, threading between your primary thread and the thread calling the GL API is intended to avoid having your primary thread block on the calls which can/will block, i.e. glMap/glUnmap/glFlush etc. can/will block in certain cases. The work queue likely solves this, but I suspect it is just hiding the issue and causing other problems which you haven't noticed.

 

What other problems? Well, if you are specifically feeding the work into a single thread anyway, you probably end up with that worker thread saturated by graphics calls, so it is effectively a dedicated graphics thread already. OK, so what is the problem with that? Well, it is very likely that you are slowing down your entire engine, since the number of tasks in the queue likely went up considerably and you are increasing contention. Additionally, most task-based systems tend to use a worker thread somewhat like:

 

thread:
  wait for work in queue
  do work
  repeat

 

So you are effectively multiplying contention on your task queue: first by pushing more work into the queue, and second by only allowing a single thread to pull rendering tasks off the queue, so all the other threads are contending for less available work.

 

So, to figure out if you really want to thread this, I'd look at your performance and goals.  I only tend to thread out my graphics driver for three primary purposes:

 

1.  Running the underlying game simulation at a fixed frame rate no matter what the graphics frame rate may be. For instance, if I'm working on a networked game, I might run the underlying simulation at 10-30 FPS and completely decouple that from the rendering using interpolated "snapshots" of game state. In this case the graphics thread makes a lot of sense.

2.  Performing large amounts of geometry generation (procedural terrain, trees, etc.) in worker threads. Separating the work from the API so you don't consistently hold glMap/glUnmap calls open is generally important. I.e. doing the work in a memory buffer and passing it to the driver, which does glMap/memcpy/glUnmap and frees the memory, isn't the "best" solution, but it usually prevents big hitches (see the sketch after this list).

3.  Large amounts of geometry being loaded from an async file I/O thread. Similar to #2.
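For #2, the pattern is roughly the following (a sketch with hypothetical names; the GL calls are assumed to run on the thread which owns the context):

/* worker thread: generate geometry into plain memory, no GL calls here */
Vertex *verts = malloc(count * sizeof(Vertex));
generate_terrain_chunk(chunk, verts, count);    /* hypothetical generator */
push_finished_chunk(chunk, verts, count);       /* hand off to the GL thread */

/* GL thread: copy into the VBO and free the scratch memory */
glBindBuffer(GL_ARRAY_BUFFER, chunk->vbo);
void *dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
if (dst) {
    memcpy(dst, verts, count * sizeof(Vertex));
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
free(verts);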

 

If those fit your reasons, I can make suggestions on how to go about things. It's really a tricky subject though.



#5 wintertime   Members   -  Reputation: 1645


Posted 16 March 2013 - 04:25 PM

I think as long as you create a context for each worker thread, immediately use wglShareLists, make it current, and then only use it for creating or uploading OpenGL objects (not for drawing), it should work. Whether it's efficient would be the next question.

I would not try to share one context between different threads by only making it current briefly every time it's needed.

 

Edit: wglShareLists is intended for all kinds of objects, not just display lists. They just kept the name from the old version when there were no other kinds of objects.


Edited by wintertime, 16 March 2013 - 04:27 PM.


#6 AllEightUp   Moderators   -  Reputation: 4186


Posted 16 March 2013 - 05:17 PM

I think as long as you create a context for each worker thread, immediately use wglShareLists, make it current, and then only use it for creating or uploading OpenGL objects (not for drawing), it should work. Whether it's efficient would be the next question.

This is a valid use; unfortunately, when I was doing this, the whole thing was extremely inconsistent across drivers. Some of them supported some things and not others, crashed if you uploaded a texture which was accidentally in use in an invalid state, etc. etc. After figuring out how to do things without using shared lists (and whatever else) I didn't see any reason to go back; it was much more consistent and allowed a better architecture to boot.

I would not try to share one context between different threads by only making it current briefly every time it's needed.

This I know for a fact is bad. At 15k+ cycles to make the context valid on a new thread, it was just pitiful. I tried basically the same thing as the generic thread-queue solution and it destroyed performance.  :)

Edit: wglShareLists is intended for all kinds of objects, not just display lists. They just kept the name from the old version when there were no other kinds of objects.

I realize they share a lot beyond display lists, and that the system has been updated, as mentioned, to contain more of the new stuff. After the initial snafus, and then just going and writing things in a different manner, I never revisited them nor cared to. Hopefully they are considerably better nowadays, but I can't recommend them myself.



#7 BGB   Crossbones+   -  Reputation: 1554


Posted 16 March 2013 - 05:32 PM

I think as long as you create a context for each worker thread, immediately use wglShareLists, make it current, and then only use it for creating or uploading OpenGL objects (not for drawing), it should work. Whether it's efficient would be the next question.
I would not try to share one context between different threads by only making it current briefly every time it's needed.
 
Edit: wglShareLists is intended for all kinds of objects, not just display lists. They just kept the name from the old version when there were no other kinds of objects.

yes, creating new contexts and using ShareLists is part of the point.


from what I read, it shares the various kinds of objects (textures, VBOs, FBOs, ...).

I mostly just ended up moving the geometry-generation for the voxel terrain over to worker threads (this basically takes 16x16x16 cubes of voxels and converts them into triangle meshes), mostly because this task is CPU-intensive and resulted in a framerate hit whenever the player was moving around (like, it is kind of lame to have 40 fps standing in one spot, only to have it drop to around 10 fps whenever the player is walking around). moving this over to threads has somewhat reduced the performance impact (albeit while still leading to occasional bugs, ...).


the main other things I am considering at the moment are model and texture loading, and probably also video decoding (at the moment, I am still using my custom MJPEG variant for video-maps).

...

#8 BGB   Crossbones+   -  Reputation: 1554


Posted 17 March 2013 - 03:21 PM

status:
added multithreaded texture loading (partial).

the (low-resolution) base-texture is still loaded in the main render-thread, mostly because normally the caller expects the base-texture number and resolution to be available immediately (in many areas, the resolution is used in ST coordinate calculations, and likewise for the texture number).

however, the high-resolution texture and normal/specular/... textures have been moved to worker threads.

I ended up having to mutex-protect uploading the textures, as otherwise I seemed to be getting garbled mixes of multiple textures (the texture-upload code in my case may not be thread-safe).
likewise, it was observed that it is necessary to call "glFinish()" after uploading the textures, otherwise the texture spends a while basically just looking like random garbage until randomly popping into being the expected texture some time (often many seconds) later.
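roughly, the upload path now looks something like this (a simplified sketch with made-up names, not the exact code):

EnterCriticalSection(&gl_upload_lock);   /* upload code isn't thread-safe yet */
glBindTexture(GL_TEXTURE_2D, texnum);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, xs, ys, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels);
glFinish();    /* without this the texture shows up as garbage for a while */
LeaveCriticalSection(&gl_upload_lock);

(possibly a glFlush plus a fence/sync object would be lighter-weight than a full glFinish, but the above is what is being done for now.)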


then I also ended up mutex-protecting a few other things, like my inflate code (which currently uses shared buffers, *1), and also the rotating allocator (basically, an allocator which hands out temporary-use memory from a ring-buffer, generally used for temporary string buffers for parsing things and similar). the rotating allocator basically works under the premise that everything will be done using the memory by the time the rover comes around again and overwrites it. (this is frequently used by many of my loaders, and the multithreaded loading was causing them to step on each other.)

*1: this saves the cost of having to allocate/free the buffers between uses, but probably does more harm if used from multiple threads, as then only a single thread can inflate something at a time.
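for reference, the rotating allocator is basically along these lines (a simplified sketch, not the exact code; it assumes callers are done with the memory before the rover wraps around):

#define ROT_SIZE (1 << 20)
static unsigned char rot_buf[ROT_SIZE];
static size_t rot_pos;
static CRITICAL_SECTION rot_lock;      /* added for the multithreaded loaders */

void *rot_alloc(size_t sz)
{
    void *ptr;
    sz = (sz + 15) & ~(size_t)15;      /* keep allocations 16-byte aligned */
    EnterCriticalSection(&rot_lock);
    if (rot_pos + sz > ROT_SIZE)
        rot_pos = 0;                   /* wrap; older allocations are assumed dead by now */
    ptr = rot_buf + rot_pos;
    rot_pos += sz;
    LeaveCriticalSection(&rot_lock);
    return ptr;
}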

also slightly worrying is that the VFS might not be thread-safe in some areas, but these would be harder to protect, as many of them are potentially also re-entrant (VFS FS-drivers also using VFS calls), so naively mutex-locking them would have an unreasonable likelihood of producing deadlocks (absent reorganizing the code). the main risk area has to do with operations like mounting/unmounting and things like "mkdir" / "rmdir" / ..., which are fairly infrequent (nearly all mounting happens at initial engine start-up, ...).


next up, probably model loading...
maybe also video-maps.

will have to think about how best to handle "model which isn't quite loaded yet". it is a tradeoff between either skipping it or drawing a temporary placeholder. for now, will probably skip it (less effort), and maybe later add a placeholder.


ADD: update again:
model-loading and video-decoding are now also moved off to the work-queue.

for some reason, LOD leveling has gotten a bit glitchy (despite using mutex locking around it). then again, I suspect it was buggy already before this.

current major threads:
main thread (manages GC and work-queues);
render thread (client and rendering);
server thread (server tick: physics, game-logic, ...);
GL worker threads (variable number: handle model/texture loading, video decoding, ...).

framerates have gone up somewhat, and there are fewer stalls.
most of the execution time still goes into the renderer.

as-is, on a quad-core, the engine runs at around 40-60% CPU load (vs 25% when it was still mostly single-threaded).

Edited by cr88192, 17 March 2013 - 07:25 PM.


#9 samoth   Crossbones+   -  Reputation: 4684


Posted 18 March 2013 - 06:53 AM

the advantage of having shared OpenGL state in these worker threads is that these tasks can be done without needing to coordinate with the main thread (for example, without still relying on the main thread to upload the texture or load vertex-arrays into a VBO or similar). (granted, some coordination is still needed for creating these worker threads).

 

This is a big fallacy. Shared contexts work, and they work fine (and I can't tell of any mysterious failures, at least not if you create them "correctly", i.e. using wglCreateContextAttribsARB). However, it does not work without coordinating. You only don't see it happening, and also it sometimes does not work the way you think -- sometimes you will run into subtle bugs. The driver will do some heavy lifting to coordinate what you're doing, which results in about one millisecond of added frame time (depends on card). It also seems to make little or no difference whether you properly schedule updates (i.e. only modify objects that are not in use) or not -- though that particular thing may be simply because drivers to date are not handling this path optimally since nobody uses it anyway. Might change in the future (or not), who knows.

In some cases, you can, despite driver synchronization, generate undesired rendering effects, too. The driver will usually (pretty much guaranteed) make sure that you are not overwriting a buffer it is currently rendering from. However, it doesn't coordinate modifications on several objects in a meaningful way (how could it do that!). Which means you may end up modifying a vertex buffer, and then a texture, and the driver will properly synchronize so none of them is garbled -- but it will pick up the old texture with the new vertices for rendering.

 

There's a chapter about this very thing in OpenGL Insights as well (the "Asynchronous Transfers" chapter; incidentally, this happens to be one of the free sample chapters), if you don't trust my word alone.

 

A millisecond is a lot of time (if you are aiming for 16.6ms frame time), so shared contexts are not what you want to use most of the time. You will usually get better performance if you have just one context, and your render thread maps buffers and passes a raw pointer to a worker thread. The worker can then fill the buffer with something meaningful, and finally the render thread unmaps the buffer.
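A rough sketch of that single-context pattern (hypothetical names, error handling omitted):

/* render thread: map the buffer and hand the raw pointer to a worker */
glBindBuffer(GL_ARRAY_BUFFER, job->vbo);
job->dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
submit_job(job);                  /* worker fills job->dst with vertex data */

/* render thread, some frames later, once the worker has signalled completion: */
glBindBuffer(GL_ARRAY_BUFFER, job->vbo);
glUnmapBuffer(GL_ARRAY_BUFFER);   /* only now is the buffer used for drawing */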


Edited by samoth, 18 March 2013 - 11:31 AM.


#10 Hodgman   Moderators   -  Reputation: 29514


Posted 18 March 2013 - 07:12 AM

@cr88192, with all your mutexes about the place, how much of a performance increase have you measured over your original single-threaded GL code?



#11 BGB   Crossbones+   -  Reputation: 1554


Posted 18 March 2013 - 01:03 PM

samoth, on 18 Mar 2013 - 07:58, said:

Quote
the advantage of having shared OpenGL state in these worker threads is that these tasks can be done without needing to coordinate with the main thread (for example, without still relying on the main thread to upload the texture or load vertex-arrays into a VBO or similar). (granted, some coordination is still needed for creating these worker threads).

This is a big fallacy. Shared contexts work, and they work fine (and I can't tell of any mysterious failures, at least not if you create them "correctly", i.e. using wglCreateContextAttribsARB). However, it does not work without coordinating. You only don't see it happening, and also it sometimes does not work the way you think -- sometimes you will run into subtle bugs. The driver will do some heavy lifting to coordinate what you're doing, which results in about one millisecond of added frame time (depends on card). It also seems to make little or no difference whether you properly schedule updates (i.e. only modify objects that are not in use) or not -- though that particular thing may be simply because drivers to date are not handling this path optimally since nobody uses it anyway. Might change in the future (or not), who knows.
In some cases, you can, despite driver synchronization, generate undesired rendering effects, too. The driver will usually (pretty much guaranteed) make sure that you are not overwriting a buffer it is currently rendering from. However, it doesn't coordinate modifications on several objects in a meaningful way (how could it do that!). Which means you may end up modifying a vertex buffer, and then a texture, and the driver will properly synchronize so none of them is garbled -- but it will pick up the old texture with the new vertices for rendering.

There's a chapter about this very thing in OpenGL Insights as well (the "Asynchronous Transfers" chapter; incidentally, this happens to be one of the free sample chapters), if you don't trust my word alone.

A millisecond is a lot of time (if you are aiming for 16.6ms frame time), so shared contexts are not what you want to use most of the time. You will usually get better performance if you have just one context, and your render thread maps buffers and passes a raw pointer to a worker thread. The worker can then fill the buffer with something meaningful, and finally the render thread unmaps the buffer.


loading stuff is a fairly CPU intensive process though.

most of my models are in ASCII text formats, textures need to be decompressed (from PNG or JPEG) and converted to DXT prior to upload, ... this does at least take a lot of this work out of the main render thread (reducing obvious stalls, ...).


not having to deal with inter-thread communication is easier though, as it avoids a lot of the awkwardness of logic working around shared-object-states, and also the need for doing (explicit) event-driven / message-passing type stuff.

whether or not OpenGL needs to synchronize internally is less of an issue here (it may affect performance, but not how much code needs to be written).

the main alternative considered, had I gone this way previously (before knowing about this GL feature), would likely have involved putting all of the texture uploads and similar into an event queue (where the renderer thread would spin in a loop and invoke event-handling logic for all the textures to be uploaded, ...). (whether this would be faster or slower, I don't really know.)

note that the current implementation doesn't rule out the possibility of going over to an event-queue if really needed.

Hodgman, on 18 Mar 2013 - 08:17, said:
@cr88192, with all your mutexes about the place, how much of a performance increase have you measured over your original single-threaded GL code?

from ~10 fps while walking around, to often more like 30 fps.

moving the video-map decoding to workers got framerates up from around 20-30 fps to around 40-50 fps (except, it seems, when looking at water, or when a lot of character models come into the scene, both of which are still "less than ideal").

video-maps are used mostly for fires and a few other misc animated-texture effects (and currently involve decoding video frames in a modified M-JPEG format; I had designed/implemented a faster-decoding format which goes straight to DXTn, but haven't switched over to it yet).


generally though, locking is needed to prevent threads from doing things like stomping on shared state and similar (like incorrectly updating linked lists or rovers, ...).

note that I am generally using the faster "Critical Section Object" mutexes, rather than the slower "CreateMutex"/"WaitForSingleObject" mutexes, mostly as past testing had shown these latter ones to themselves be pretty slow (~1 us to lock or unlock), though they are better behaved in some cases. (I had noted all this before when dealing with multi-threading in my Script-VM and MM/GC.)

FWIW, 1 us is actually a surprisingly long time, so I am not sure what exactly MS is doing in there.


in my own threading wrapper, these are called Mutex and FastMutex objects.
I was originally using custom-written spinlocks for FastMutex, but then observed that Critical Section objects were similarly fast.

on Linux, there is no difference, as the default Linux "pthread_mutex_t" seems to behave more like the Critical Section objects, thus I just use the default mutexes for both cases.
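the wrapper is then basically just along these lines (a sketch, not the exact code):

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION FastMutex;
#define FastMutex_Init(m)    InitializeCriticalSection(m)
#define FastMutex_Lock(m)    EnterCriticalSection(m)
#define FastMutex_Unlock(m)  LeaveCriticalSection(m)
#else
#include <pthread.h>
typedef pthread_mutex_t FastMutex;
#define FastMutex_Init(m)    pthread_mutex_init((m), NULL)
#define FastMutex_Lock(m)    pthread_mutex_lock(m)
#define FastMutex_Unlock(m)  pthread_mutex_unlock(m)
#endif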


did record a video (note, recording limited to 15fps):



there are still some stalls, but when going around on that track, it is much smoother than it was previously, though there is still the issue of the engine using up most of the available address space for a 32-bit process (reducing memory footprint is still an ongoing issue, the main alternative being to go x64-only...).

Edited by cr88192, 18 March 2013 - 02:01 PM.


#12 Krohm   Crossbones+   -  Reputation: 3052


Posted 19 March 2013 - 03:42 AM

Wait. You're rendering this at 50 fps? Does the original above run at 50 fps? The green framerate at the top of the screen is often around 30 and sometimes dips as low as a single-digit number. Perhaps it's the video capture framerate? What is your hardware?



#13 samoth   Crossbones+   -  Reputation: 4684


Posted 19 March 2013 - 07:38 AM

most of my models are in ASCII text formats, textures need to be decompressed (from PNG or JPEG) and converted to DXT prior to upload, ... this does at least take a lot of this work out of the main render thread (reducing obvious stalls, ...).

That's something one normally does once during the build. Loading models as ASCII text or loading PNG and converting to DXT on the fly is kind of a design mistake. You really do not want to do this. Videos used as textures may be the only "legitimate" exception since video formats just compress so much better than DXT, so the saved bandwidth may be worth it.

Not only does using general formats (in particular text formats) waste CPU time which is absolutely unnecessary, but an offline DXT compressor will also be able to deliver a much better quality than your on-the-fly PNG-to-DXT transcoder (quality DXT compression is non-trivial). Any PNG you convert to DXT, you could equally well store as DXT at better quality in the first place.

That aside: even if you insist on doing the above, there is nothing that hinders you from doing the slow stuff in a worker thread while still only using one OpenGL context. This avoids the driver synchronization overhead. More or less everybody is doing this (because disk access is always "slow"). Load and decompress/transcode/whatever in a worker thread, map a buffer in the main thread, fill the memory pointed to in the worker, and unmap the buffer when ready. Data will usually only be ready several frames later anyway, since a single disk access is on the order of one full frame time. Therefore, there is really not much need for many ill-advised little syncs in the middle of the frame. Fire off requests, and see how many became ready at the end of the frame when you're not doing anything but waiting for vertical sync.
 

whether or not OpenGL needs to synchronize internally is less of an issue here (it may affect performance, but not how much code needs to be written)

If you can live with the overhead, that is fine (though if you have 40-50 fps, you cannot in my opinion). However, as I pointed out earlier, there is another issue with this synchronization. The driver does not know what you're going to do, so it cannot synchronize in a "meaningful way". Which means you pay for something, but don't get anything back. You must in addition use an event object anyway, or it will not work reliably.

Otherwise, updates will "work" insofar as every individual update will be consistent, but you have no control over when it takes effect. In the very worst case, you might end up wondering why your render thread crashes (because the driver decided that it's OK to only make that buffer you're reading from available to the render thread some time later). Therefore you must sync explicitly, which means, in summary, that you end up syncing once more than would otherwise be necessary.
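For completeness, a sketch of what that explicit sync can look like using GL 3.2 / ARB_sync fence objects (the surrounding hand-off code is assumed):

/* worker/upload context, after writing into the shared texture or buffer: */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();    /* make sure the fence (and the upload) are actually submitted */
/* ... hand 'fence' to the render thread along with the object id ... */

/* render thread, before first use of the object: */
glWaitSync(fence, 0, GL_TIMEOUT_IGNORED);   /* or glClientWaitSync to block the CPU */
glDeleteSync(fence);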
 

Mutex is slower than critical sections [...] 1 us is actually a surprisingly long time, so I am not sure what exactly MS is doing in there.

There are several reasons for this. First, entering a critical section is merely atomically incrementing an integer in the normal, non-contended case. In the moderately contended case, it spins a few dozen or so times. Only in the very unusual, highly contended case is it a system call that blocks the thread. A mutex is always a system call that blocks the thread when there is any kind of contention. Also, a mutex is a heavyweight kernel object that can be used to synchronize different processes (whereas a critical section is a memory address plus a per-process keyed event object). When your thread blocks, it will be unblocked when the mutex is signalled and the scheduler next runs. This is "works as intended".

You can use keyed events (the underlying "block" mechanism used in critical sections) to build your own mutex, which altogether can be roughly 30% faster than a critical section, depending on your design. I advise against that, however. Rather, get your threading and synchronization correct so you need only a few sync points. This will not only make a 30% difference, but a 300% or maybe 3,000% difference.
 

on Linux, there is no difference, as the default Linux "pthread_mutex_t" seems to behave more like the Critical Section objects

That's because this is what they are. They're a spinlock plus a futex. Incidentally, a futex works for processes, not just threads, but that is no surprise because Linux does not have support for threads at all (it only supports processes that share an address space).


Edited by samoth, 19 March 2013 - 07:42 AM.


#14 BGB   Crossbones+   -  Reputation: 1554


Posted 19 March 2013 - 12:04 PM

Wait. You're rendering this at 50 fps? Does the original above run at 50 fps? The green framerate at the top of the screen is often around 30 and sometimes dips as low as a single-digit number. Perhaps it's the video capture framerate? What is your hardware?

it isn't always 50 fps; it varies depending on where the player is looking (for example, looking at water also currently hurts the framerate pretty badly).

also, in most of the video the player is moving (especially on the track), which previously would have often meant single-digit framerates the whole time (rather than just occasionally), mostly because of the performance hit from streaming and rebuilding the terrain geometry. (it is at least better than it was before.)


also, the engine is doing video-capture in the above case, and video capture/encoding isn't free either, usually resulting in a considerable framerate hit (originally, prior to some earlier optimizations, video capture tended to make things almost unplayable).

I had considered before the possibility of writing a faster and more specialized JPEG encoder for this case (the in-engine video capture uses M-JPEG). (mostly as raw frames eat large amounts of HDD space).


hardware:
CPU: AMD Athlon II X4 at 2.8 GHz;
RAM: 16GB DDR3 (PC3-12800);
video: GeForce GTX 460.

#15 BGB   Crossbones+   -  Reputation: 1554


Posted 19 March 2013 - 12:46 PM


most of my models are in ASCII text formats, textures need to be decompressed (from PNG or JPEG) and converted to DXT prior to upload, ... this does at least take a lot of this work out of the main render thread (reducing obvious stalls, ...).

That's something one normally does once during the build. Loading models as ASCII text or loading PNG and converting to DXT on the fly is kind of a design mistake. You really do not want to do this. Videos used as textures may be the only "legitimate" exception since video formats just compress so much better than DXT, so the saved bandwidth may be worth it.

Not only does using general formats (in particular text formats) waste CPU time which is absolutely unnecessary, but an offline DXT compressor will also be able to deliver a much better quality than your on-the-fly PNG-to-DXT transcoder (quality DXT compression is non-trivial). Any PNG you convert to DXT, you could equally well store as DXT at better quality in the first place.

That aside: even if you insist on doing the above, there is nothing that hinders you from doing the slow stuff in a worker thread while still only using one OpenGL context. This avoids the driver synchronization overhead. More or less everybody is doing this (because disk access is always "slow"). Load and decompress/transcode/whatever in a worker thread, map a buffer in the main thread, fill the memory pointed to in the worker, and unmap the buffer when ready. Data will usually only be ready several frames later anyway, since a single disk access is on the order of one full frame time. Therefore, there is really not much need for many ill-advised little syncs in the middle of the frame. Fire off requests, and see how many became ready at the end of the frame when you're not doing anything but waiting for vertical sync.


as can be noted, I had considered moving to a DXT-based format for both textures and video, but thus far haven't done so, mostly as PNG and JPEG are much more convenient for use with graphics programs (and I don't have any sort of explicit "process resources for game" stage).

using lots of ASCII formats was more so that data can be examined / edited manually when needed.

actually, most of the 3D models are in a variant of the AC3D file format (partly originally because it was ASCII text).

I had previously considered the possibility of moving over to something like IQM or similar.


the world is currently stored in a binary format, but mostly as voxels.

I had considered the possibility of moving the client/server interface to send geometry (rather than voxel data), which would mostly make all the geometry-building be a server-side task. eventually, caching prebuilt geometry in the world-regions could also possibly make sense.


whether or not OpenGL needs to synchronize internally is less of an issue here (it may affect performance, but not how much code needs to be written)

If you can live with the overhead, that is fine (though if you have 40-50 fps, you cannot in my opinion). However, as I pointed out earlier, there is another issue with this synchronization. The driver does not know what you're going to do, so it cannot synchronize in a "meaningful way". Which means you pay for something, but don't get anything back. You must in addition use an event object anyway, or it will not work reliably.
Otherwise, updates will "work" insofar as every individual update will be consistent, but you have no control over when it takes effect. In the very worst case, you might end up wondering why your render thread crashes (because the driver decided that it's OK to only make that buffer you're reading from available to the render thread some time later). Therefore you must sync explicitly, which means, in summary, that you end up syncing once more than would otherwise be necessary.


could be, but as-is, it is better than having an obvious stall every time something loads, which was the previous issue.


but, 40-50 is pretty good.

there was a time, a little earlier in the engine's development, when I thought I was doing pretty well when it was consistently breaking 10 fps; then 20 fps and later 30 fps became the goals.

then the Doom 3 source came out, and upon looking at it, I realized partly how terrible all of my rendering architecture was, but it is a long road trying to make it (gradually) suck less (and also make the code less of a horrid mess as well...).


Mutex is slower than critical sections [...] 1 us is actually a surprisingly long time, so I am not sure what exactly MS is doing in there.

There are several reasons for this. First, entering a critical section is merely atomically incrementing an integer in the normal, non-contended case. In the moderately contended case, it spins a few dozen or so times. Only in the very unusual, highly contended case is it a system call that blocks the thread. A mutex is always a system call that blocks the thread when there is any kind of contention. Also, a mutex is a heavyweight kernel object that can be used to synchronize different processes (whereas a critical section is a memory address plus a per-process keyed event object). When your thread blocks, it will be unblocked when the mutex is signalled and the scheduler next runs. This is "works as intended".

You can use keyed events (the underlying "block" mechanism used in critical sections) to build your own mutex, which altogether can be roughly 30% faster than a critical section, depending on your design. I advise against that, however. Rather, get your threading and synchronization correct so you need only a few sync points. This will not only make a 30% difference, but a 300% or maybe 3,000% difference.
 

on Linux, there is no difference, as the default Linux "pthread_mutex_t" seems to behave more like the Critical Section objects

That's because this is what they are. They're a spinlock plus a futex. Incidentally, a futex works for processes, not just threads, but that is no surprise because Linux does not have support for threads at all (it only supports processes that share an address space).


there are still a few obvious things to work on, like probably making the Inflate code usable from multiple threads, and also allowing more parallelism for the texture conversions and uploads (so that less of the process needs to be locked).

#16 swiftcoder   Senior Moderators   -  Reputation: 9860


Posted 19 March 2013 - 12:58 PM

I had considered before the possibility of writing a faster and more specialized JPEG encoder for this case (the in-engine video capture uses M-JPEG). (mostly as raw frames eat large amounts of HDD space).

It might be simpler to write out DXT compressed video frames, and re-encode the video offline.

 

Outerra has a writeup on this technique.
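Roughly, the kind of fast DXT1 block encoder involved looks like this (an illustrative sketch, not Outerra's code: per-channel min/max endpoints, nearest palette entry per pixel):

#include <stdint.h>
#include <string.h>

static uint16_t pack565(const uint8_t *c)
{
    return (uint16_t)(((c[0] >> 3) << 11) | ((c[1] >> 2) << 5) | (c[2] >> 3));
}

/* src: 16 RGBA pixels (one 4x4 block, row-major); dst: 8 output bytes */
void dxt1_encode_block(const uint8_t *src, uint8_t *dst)
{
    uint8_t lo[3] = {255, 255, 255}, hi[3] = {0, 0, 0}, pal[4][3];
    uint32_t bits = 0;
    int i, j, k;

    /* endpoints: per-channel min and max of the block */
    for (i = 0; i < 16; i++)
        for (k = 0; k < 3; k++) {
            uint8_t v = src[i * 4 + k];
            if (v < lo[k]) lo[k] = v;
            if (v > hi[k]) hi[k] = v;
        }

    uint16_t c0 = pack565(hi), c1 = pack565(lo);
    dst[0] = (uint8_t)(c0 & 0xFF); dst[1] = (uint8_t)(c0 >> 8);
    dst[2] = (uint8_t)(c1 & 0xFF); dst[3] = (uint8_t)(c1 >> 8);

    if (c0 <= c1) {                 /* (near-)flat block: every pixel uses endpoint 0 */
        memset(dst + 4, 0, 4);
        return;
    }

    /* rebuild the 4-entry palette the decoder will see (c0 > c1 selects 4-color mode) */
    for (k = 0; k < 3; k++) {
        pal[0][k] = hi[k];
        pal[1][k] = lo[k];
        pal[2][k] = (uint8_t)((2 * hi[k] + lo[k]) / 3);
        pal[3][k] = (uint8_t)((hi[k] + 2 * lo[k]) / 3);
    }

    /* 2-bit index per pixel, nearest palette entry; pixel 0 ends up in the low bits */
    for (i = 15; i >= 0; i--) {
        int best = 0, bestd = 1 << 30;
        for (j = 0; j < 4; j++) {
            int d = 0;
            for (k = 0; k < 3; k++) {
                int e = (int)src[i * 4 + k] - (int)pal[j][k];
                d += e * e;
            }
            if (d < bestd) { bestd = d; best = j; }
        }
        bits = (bits << 2) | (uint32_t)best;
    }
    memcpy(dst + 4, &bits, 4);      /* assumes a little-endian target */
}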


Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#17 BGB   Crossbones+   -  Reputation: 1554


Posted 19 March 2013 - 03:29 PM

swiftcoder, on 19 Mar 2013 - 14:03, said:


cr88192, on 19 Mar 2013 - 13:09, said:
I had considered before the possibility of writing a faster and more specialized JPEG encoder for this case (the in-engine video capture uses M-JPEG). (mostly as raw frames eat large amounts of HDD space).

It might be simpler to write out DXT compressed video frames, and re-encode the video offline.

Outerra has a writeup on this technique.

yes, this is also possible, since converting to DXT is a bit faster than converting to JPEG.
the partial drawback, granted, is the need for offline re-encoding (and/or a special codec).

basically, this is a drawback for the common case of recording video and then loading it up in MovieMaker or similar, as manual transcoding would be needed first.

yes, I did note that the article provided a codec for the specific format they are using, so this is still possible.


the most obvious "clever trick" optimization for a JPEG encoder is basically hard-coding most or all of the tables (sort of like in MPEG), which could shave several major steps off the encoding process (and mostly just give more room for micro-optimizing stuff). (basically: fork the existing encoder, and probably hard-code and micro-optimize everything.)

profiler results show most of the CPU time is going into video recording, and of this, primarily stuff related to VLC / entropy coding blocks (followed by DCT and colorspace-conversion). not sure how much of a speedup could be gained here, probably fairly small though...

so, may still need to think on it...

Edited by cr88192, 19 March 2013 - 03:34 PM.


#18 swiftcoder   Senior Moderators   -  Reputation: 9860


Posted 19 March 2013 - 07:23 PM

profiler results show most of the CPU time is going into video recording, and of this, primarily stuff related to VLC / entropy coding blocks (followed by DCT and colorspace-conversion). not sure how much of a speedup could be gained here, probably fairly small though...

The colour space conversion should be trivial to move to the GPU, right before read-back occurs.

Entropy encoding is probably not so simple :)
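Something along these lines would do it (a sketch: full-range BT.601 coefficients, old-style GLSL kept as a C string):

static const char *rgb_to_ycbcr_fs =
    "uniform sampler2D frame;\n"
    "void main() {\n"
    "    vec3 rgb = texture2D(frame, gl_TexCoord[0].xy).rgb;\n"
    "    float y  = dot(rgb, vec3(0.299, 0.587, 0.114));\n"
    "    float cb = dot(rgb, vec3(-0.168736, -0.331264, 0.5)) + 0.5;\n"
    "    float cr = dot(rgb, vec3(0.5, -0.418688, -0.081312)) + 0.5;\n"
    "    gl_FragColor = vec4(y, cb, cr, 1.0);\n"
    "}\n";

Draw a full-screen quad with the rendered frame bound as a texture and this shader active, then read that result back instead of the raw RGB frame.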


Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#19 BGB   Crossbones+   -  Reputation: 1554


Posted 19 March 2013 - 11:42 PM


profiler results show most of the CPU time is going into video recording, and of this, primarily stuff related to VLC / entropy coding blocks (followed by DCT and colorspace-conversion). not sure how much of a speedup could be gained here, probably fairly small though...

The colour space conversion should be trivial to move to the GPU, right before read-back occurs.

Entropy encoding is probably not so simple :)


fiddling with it still hasn't gotten it all that much faster...

luckily, I am only sometimes recording video.


side note:
little special is currently done for the read-back: basically, at the end of the frame (after rendering everything else), it just issues a glReadPixels() call and passes the buffer contents off to the video encoder. yeah, probably a poor way to do it, I know... this is basically also how screenshots are done, except at a faster rate (and on a timer).

luckily, the actual video encoding happens in its own thread.
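one possible improvement here (a sketch only, not what the engine currently does) would be double-buffered PBO read-back, so the glReadPixels doesn't stall the render thread:

/* setup (once): two pixel-pack buffers sized for a full frame */
static GLuint pbo[2];
static int frame_idx;
glGenBuffers(2, pbo);
for (int i = 0; i < 2; i++) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

/* per frame: start an async read into one PBO, map the one filled last frame */
int cur = frame_idx & 1, prev = (frame_idx + 1) & 1;
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[cur]);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);   /* returns quickly */

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[prev]);
void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (pixels) {                       /* the very first frame maps an empty buffer */
    send_frame_to_encoder(pixels, width, height);   /* hypothetical hand-off */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
frame_idx++;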

Edited by cr88192, 20 March 2013 - 12:01 AM.




