bvanevery

DX11 multithreading - why bother?


DX11 allows the graphics pipeline to be multithreaded. But why would I want to spend my available CPU parallelism on feeding the graphics pipeline? Games have other jobs to do, like physics simulation and AI. Coarse parallelism would seem to do fine here, and the app would be easier to write, debug, and port to other platforms.

Maybe you say you want to use the GPU for physics simulation and AI, and so you need a tighter coupling between producer threads and the consuming graphics pipeline. Fine, but then you've locked yourself into DX11 HW. My 2.5 year old laptop is DX10 class HW, for instance. Also, your physics and AI code would be API specific. Not only does this limit you to Microsoft platforms, but GPUs do not have the nicest set of programming languages and tools available. We put up with GPUs when we want things to be fast; they're pretty much a detriment to programmer productivity.

What am I missing here? Does anyone have a compelling rationale for bothering with more tightly coupled multithreading? Cynically, this seems like a way for Microsoft / NVIDIA / ATI to push perceived bells and whistles and sell HW "upgrades". Maybe they really can show a pure graphics benefit on high end HW with a lot of CPU cores. But most consumers don't have high end HW, and there's more to games than pure graphics. DX11 is way ahead of the installed base. Last I checked, consumers are only just now getting around to Vista / Windows 7 and DX10 class HW, and that took ~3 years. Do you want to waste all your time chasing around the top tier of game players? Some games have lost a lot of money doing that, like Crysis.
Also, the performance results I've seen on my midrange consumer HW are not compelling.

MultiThreadedRendering11 demo, D3D11, Vsync off (640x480), R8G8B8A*_UNORM_SRGB (MS1, Q0), NVIDIA 8600M GT laptop, 256MB dedicated memory, driver 195.62 (this is DX10 class HW), windowed, with mouse focus in window:

~22 fps  Immediate
~20 fps  Single Threaded, Deferred per Scene
~21 fps  Multi Threaded, Deferred per Scene
~20 fps  Single Threaded, Deferred per Chunk
~18 fps  Multi Threaded, Deferred per Chunk

Methodology: I manually observed the demo window and picked the fps values that occurred most frequently. I went through all the settings twice, just in case some system process happened to slow something down; these values seem reasonably stable. I didn't worry much about fractions. I wouldn't regard a difference of ~1 fps as significant, as it's probably a 0.5 fps difference; ~2 fps is observable, however.

To the extent that multithreading matters at all, it seems to slow things down slightly. This demo does not make a compelling case for bothering with DX11 multithreading on midrange consumer HW. Does anyone have some code that demonstrates an actual benefit?

DX11 multithreading needs to be supported by the hardware, otherwise it's just a software fallback and it's slower that way than immediate mode, obviously. AFAIK no pre-DX11 card supports it.

http://msdn.microsoft.com/en-us/library/ff476893%28VS.85%29.aspx

The other point is: FPS is an outdated metric. It is fine for single-threaded games, but with multithreading it only shows how many frames per second the graphics device can render the scene. In the background, the game can run at whatever speed it wants and perform much more complex work. The more complex a scene becomes, the more important multithreading will be.

Multithreading the graphics pipeline is nothing new to DirectX. DX9 had some parallel ability that many game companies made use of.

Why go to all of the dev and test effort to make a parallel API if no one wants it? Well, people do want it; game companies want it. DX10 had no multithreading abilities, and many, many requests came in asking for it. So let's look at some of the reasons why.

Object creation is slow. It can stall your rendering thread any time your app discovers that it needs to create a new object. These calls are slow enough that MS, ATI, NVIDIA all wrote white papers telling developers to avoid creating and destroying resources during the application runtime. The API supports multithreaded creates so that you can defer to the driver to pick the best times to create objects -- for instance when it has a few spare cycles -- which allows your rendering thread to continue its work -- which is to get stuff drawn to the screen.

Next, DX11 supports deferred contexts. These allow multiple threads to build command lists at the same time, and let the DX runtime perform validation on separate threads in advance. DX10 was an API redesign where one of the many goals was to reduce the CPU overhead of API calls. CPU overhead was a huge problem for DX9 -- many game companies were limited in what they could get on screen just because the API ate too much CPU. DX10 reduced that cost significantly, in some places by a factor of 10-100. However, some calls were difficult to trim down because the validation was necessary, or perhaps the driver had a lot of work to do. Being able to build command lists on separate CPU threads allows some of that work to take place in parallel, and in advance of actually trying to draw the data. Several game studios are already taking advantage of deferred contexts and are seeing improvements in performance, even when using the CPU fallback for lack of driver support.

So, DX9 would allow roughly 2000 API calls per frame before the API became a bottleneck; DX10 is around 12000; DX11 should be even higher when using deferred contexts. These are call limits based on using the whole API to do actual work, not just calling some API like SetPrimitiveTopology() x number of times. The trouble is that studios are trying to put more and more stuff on the screen, and will surely take advantage of anything that can be provided performance-wise.

Plus, your engine has to do a lot of CPU work anyway to make draw calls. It has to build matrix transformations, sort objects and draw calls, and make all sorts of decisions on what to draw and how to draw it. All of this could be done in parallel with big wins -- provided that your app actually has enough work to do that these things become bottlenecks.

A consumer won't need high end hardware to take advantage of multithreading. It's all about preventing the GPU from being starved of data to crunch.

I don't think that there are many drivers out yet that fully support the multithreading APIs. This feature requires a lot of effort to get right and is a huge test burden -- but they will come out eventually.

The DX10.1 feature level supports hardware multithreading. This means that there is a reasonably sized slice of hardware out there already that can support this stuff once drivers arrive.

AAA Games take 2-4 years to develop - about the time span you pointed out required to adopt a new technology. Interesting how that works.

Vendor lock-in is not an insurmountable problem for developers. The reality is that there are lots of game engines that wrap the graphics system into a layer so that they can run on Xbox, or PC, or PlayStation. These problems have been solved over and over again and are just part of reality. These same game engines have multithreaded deferred contexts built in because it makes a difference. DX11 gives them a way to map their engine API more closely to the hardware, which results in a bigger win. There's no reason why an API should really lock you to any vendor if you layer your software. You want to support someone else? Then target them too.

GPU tools and languages have been getting better and better over the years. Sure it's not as ideal as native tools, but it's getting there. With the spreading use of DX, compute, CUDA, opencl, etc. more and more people invest in GPU technologies which means that the whole infrastructure continues to improve. Lack of perfect tools shouldn't stop you from leveraging the amazing power of the GPU -- though I admit there are areas of debugging that are still frustrating but they will get better. People with a lot of practice writing shaders can actually get a lot done. It's not python, but it's also not asm.

A new API or hardware rev will always be ahead of the install base at launch time. This is not new.

Not all of the available APIs are needed by every developer. Multithreading probably falls into one of those categories of optimization -- why do it if you don't have a problem? Granted, multithreading normally requires a lot more forethought in code design, but I guarantee that if you're not seeing a win, it's because you're not running a scenario that it was designed to fix -- which is CPU and DX API bottlenecking.

Quote:
Original post by bvanevery
why do I want to use my available CPU parallelism on...
Why *don't* you want to use available parallelism on *everything*?
The game I'm writing at the moment is based on a SPMD (single program multiple data) type architecture, where essentially the same code is executed on every thread, with each thread processing a different range of the data. Every thread does physics together, then they all do AI together, then they all do rendering together, etc...

Well, there is a point, in that you don't have to use multithreading if the situation doesn't require it. Just be flexible and pick the best suited tools/options/solutions for your project.

Quote:
Original post by darkelf2k5
DX11 multithreading needs to be supported by the hardware, otherwise it's just a software fallback and it's slower that way than immediate mode, obviously. AFAIK no pre-DX11 card supports it.

http://msdn.microsoft.com/en-us/library/ff476893%28VS.85%29.aspx


That's not a HW support issue, that's a driver support issue. Theoretically, a DX11 multithreading application architecture should benefit a DX10 class card, if the drivers have been updated. In practice, I don't know if IHVs have updated their drivers, or will update them. It's quite possible that they'll be cheap bastards and expect people to just buy DX11 HW. If that happens in practice, then DX11 multithreading will have no benefit whatsoever on older HW.

I suppose I'll have to check my own driver. NVIDIA's support of older laptop HW has been notoriously poor. They dumped the problem in OEMs' laps for some silly reason. For quite some time, their stock drivers refused to install on laptops; you had to get your driver from the OEM. Of course, the OEMs don't care about updating their drivers very often, so you end up with really old drivers that don't have current features and fixes. Only recently did NVIDIA start to offer a stock driver that will work on laptops. There is still a disconnect as far as their most current drivers go; for instance, the recently released OpenGL 3.3 driver will not install by default on my laptop. I have been getting around these problems using laptopvideo2go.com, a website that adds .inf files to enable the drivers on laptops. This doesn't help the general deployment situation, however.

Quote:
Original post by Pyrogame
The other point is: FPS is an old value. It is ok for single threaded games. But for multi-threading, this only shows you how many frames the graphics device can render a scene.


There is no readout for "CPU load" in the MultiThreadedRendering11 demo. This is unfortunate as it would be useful diagnostic information. That's part of why I asked if anyone had code that demonstrates an actual benefit.

Quote:
In background, the game can run the speed it wants and perform much complex things. The more complex a scene becomes, the more important multi-threading will be.


I think you may have missed the point. You don't need DX11 multithreading to do multithreading in your app. You can have an AI thread, a physics thread, or whatever. Your multithreading architecture will be simpler to write and debug, and it will not be tied to DX11.

Quote:
Original post by Hodgman
Quote:
Original post by bvanevery
why do I want to use my available CPU parallelism on...
Why *don't* you want to use available parallelism on *everything*?


Because the debugging will drive you nuts.

Because it can easily become premature optimization.

A current high end desktop CPU has 8 hardware threads, and that number is only going to rise in the future. What possible reason could MS have for not improving multithreaded support? Coarse parallelism in games is okay up to 4 threads, maybe 6. Moving past that will require us to move beyond the rather naive approach of one graphics thread.
Quote:
Quote:
Quote:
Original post by bvanevery
why do I want to use my available CPU parallelism on...
Why *don't* you want to use available parallelism on *everything*?


Because the debugging will drive you nuts.
Jeez, it's not like these are problems never tackled before. People in other segments of software have been dealing with these issues for ages.

Quote:
Original post by DieterVW
AAA Games take 2-4 years to develop - about the time span you pointed out required to adopt a new technology. Interesting how that works.


For an indie working on shorter development cycles, these adoption timelines make no sense. Yes, the way it works is whatever "heavyweight" development wants: NVIDIA / ATI / Microsoft / EA all pushing their core product, using lots of programmer worker bees to do it. It's mainly for selling more HW, more OSes, and more AAA games. Except that it clearly doesn't sell AAA games if you get on the tech bandwagon too early, as happened with Crysis. So it's mainly about selling more HW and more OSes... except that most consumers have wised up.

Quote:
Original post by Promit
A current high end desktop CPU has 8 hardware threads, and that number is only going to rise in the future. What possible reason could MS have for not improving multithreaded support? Coarse parallelism in games is okay up to 4 threads, maybe 6. Moving past that will require us to move beyond the rather naive approach of one graphics thread.
Quote:
Quote:
Quote:
Original post by bvanevery
why do I want to use my available CPU parallelism on...
Why *don't* you want to use available parallelism on *everything*?


Because the debugging will drive you nuts.
Jeez, it's not like these are problems never tackled before. People in other segments of software have been dealing with these issues for ages.


It's been tackled before, and it will be tackled over and over again forever. It will still drive you nuts. As in, make development more expensive and time consuming.


Everything makes development more expensive and time consuming. That's why major projects now have 30M-50M budgets. What on earth does any of it have to do with DX11 multithreading? Or indies?

Quote:
Original post by Promit
Everything makes development more expensive and time consuming. That's why major projects now are 30M-50M budgets. What on earth does any of it have to do with DX11 multithreading? Or indies?


Indies don't spend 30M..50M, DUH. It's about what API investments make sense from a money standpoint, and what's a trap / treadmill.

So you're saying that DX11 is a terrible choice for indies because the optional multithreading support doesn't work well on your laptop?

Quote:
Original post by bvanevery
Quote:
Original post by Pyrogame
The other point is: FPS is an old value. It is ok for single threaded games. But for multi-threading, this only shows you how many frames the graphics device can render a scene.


There is no readout for "CPU load" in the MultiThreadedRendering11 demo. This is unfortunate as it would be useful diagnostic information. That's part of why I asked if anyone had code that demonstrates an actual benefit.


My current engine runs a test scene on a single HT (hardware thread) at ~80 FPS with 16% global CPU load. If I enable all 8 threads, this boosts the engine to ~3k FPS with nearly 50% global CPU load. Because the CPU uses Hyperthreading, 50% is a very good value. With only 2 HTs enabled on different cores, I get 2.5k FPS with 24% load.

Of course, my engine doesn't only render things; it calculates some other stuff (zero-gravity physics without collision detection, no AI). It renders a GUI, which renders the world in a window. The entire engine is based on a job manager, which creates at least 4 job workers (for example, on a single HT). If the system has more than 4 HTs, more job workers are created. Then all the work is done by jobs. If the engine wants to calculate something, a job is created and attached to a job worker at runtime. Every camera (the world camera, GUI camera, shadow camera, etc.) has its own rendering job. Each job can have a state machine, which can pause the calculation if the job has a dependency on another job. Because of this, I do not have any job or thread that is called "the main renderer". All the rendering jobs can use the immediate context to execute their prepared deferred contexts. But you have to synchronize access to the device context to do this, because the context itself does not have a thread-safe API (the driver blocks the calls, so you get an exception, but not a self-destructing graphics card ^^).

DX11 delivers multithreading support in the form of deferred contexts, which in my opinion is a very nice feature.

Quote:
Original post by Promit
So you're saying that DX11 is a terrible choice for indies because the optional multithreading support doesn't work well on your laptop?


Not just my laptop, probably 90% of the installed base.

So, what you are saying is that because 90% (which sounds like a bullstat to me) of people currently can't, we shouldn't come up with technology to use in the future now?

So what, in your world do we wait until everyone has at least 8 cores, then dump this tech on people and say 'hey! get good at it now!'? That's just madness, and doing so would just stop progress.

No one says 'because you are using DX11 you must use multithreading', yet at the same time if you are targeting high end systems (my current target hardware is DX11 cards, 4 core / 8 thread systems) then it gives you a wonderful chunk of flexibility.

Oh, and if you are careful, then frankly MT code is easy to write; hell, back when I was 21 and pretty green I wrote an application which would query 50K game servers in less than 3 mins using a multi-threaded app. At the time I had practically zero experience with networking and threads, and I wrote the whole thing in about 8 weeks; when it went live it never once crashed or gave the wrong output, despite running every 3 mins, 24h a day.

And that was with raw threads; with task-based systems it is even easier these days to do MT with existing libraries (be it MS's Concurrency Runtime, Intel's Threading Building Blocks or the .NET concurrency stuff).

Sure, you'll get bugs, but if you are careful with what you write they won't be that hard to figure out... so either it's easier than people make out or I'm some sort of coding/design/multithreading god... come to think of it, I'm good with either answer [grin]

Quote:
Original post by phantom
So, what you are saying is that because 90% (which sounds like a bullstat to me)


Indeed. I was being too kind.

Quote:
of people currently can't then we shouldnt come up with technology to use in the future now?


I've watched the DX10 API impasse for ~3 years. Have fun watching the paint dry with DX11 for ~3 as well. The reality is that most games start life on consoles and they have DX9 HW specs.

Yes, your link just proved my point somewhat; if you drill down into the numbers then you'll see that 26.54% of people have 4 core CPUs, which is an increase of 3.5% over Jan's numbers.

Now, maybe my maths isn't too hot so remind me; what is 100 - 26.54? Is it 90? I can't recall?

As for DX10, it was an API strangled by the FUD thrown at Vista; DX11 on the other hand had games AT LAUNCH which supported it.

And you also didn't answer my question; how are we, as game programmers, meant to test out multi-threaded designs without the API support there? Because I'd lay money on MS's next console supporting DX11 style multi-threaded submission and more cores in general so by learning how to do things NOW means we'll be better positioned in the future.

But hey, if you want to stay with single threaded stuff here is the scoop; no one is going to stop you. Carry on as you were and all that.. the rest of us will be over here, trying to advance the state of the art instead of holding back advancement...

Quote:
Original post by bvanevery
The reality is that most games start life on consoles and they have DX9 HW specs.


On consoles you can multithread your command buffer generation. PC was the odd man out in this regard until D3D11 came along.

Quote:
Original post by phantom
Yes, your link just proved my point somewhat; if you drill down into the numbers then you'll see that 26.54% of people have 4 core CPUs, which is an increase of 3.5% over Jan's numbers.


3.29% are DX11 systems. Read my original post. I'm not against multithreading, I'm against multithreading that's tied to the DX11 API. Most of the installed base does not have enough cores to waste them on 3D graphics. Games have got other things they need to do.

Quote:
the rest of us will be over here, trying to advance the state of the art instead of holding back advancement...


You mean like Crysis? You learn slowly.

