BenS1

[DX11] Command Lists on a Single Threaded Renderer

12 posts in this topic

One of the major new features of DirectX 11 is its support for multithreaded rendering using Immediate and Deferred Contexts; however, it seems to me that the ability to create a Command List could be beneficial even for a single-threaded renderer. Is this correct?

Basically, a Command List is a more efficient way of submitting a batch of state and draw commands than calling each API function separately, so even if you do all your rendering on one thread it would still seem more efficient to use Command Lists to replay repeated sequences of actions.
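For example, I was imagining something along these lines (just a rough sketch, assuming the usual device, immediate context and pipeline objects already exist, with all error handling omitted):

    // Record the repeated state/draw calls once, up front, on a deferred context.
    ID3D11DeviceContext* deferredContext = nullptr;
    device->CreateDeferredContext(0, &deferredContext);

    deferredContext->IASetInputLayout(inputLayout);
    deferredContext->VSSetShader(vertexShader, nullptr, 0);
    deferredContext->PSSetShader(pixelShader, nullptr, 0);
    deferredContext->DrawIndexed(indexCount, 0, 0);

    ID3D11CommandList* commandList = nullptr;
    deferredContext->FinishCommandList(FALSE, &commandList);

    // Then, every frame, replay the whole batch with a single call.
    immediateContext->ExecuteCommandList(commandList, FALSE);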

If this is correct, then why isn't it mentioned more? Why are Deferred Contexts documented almost exclusively as a multithreading feature?

Thanks
ben

I'm pretty sure I remember reading somewhere that the runtime/drivers weren't optimized for this case. However, I still think it would be worth trying out, especially as the drivers get better support for deferred command list generation.

[quote name='MJP' timestamp='1321297420' post='4883879']
I'm pretty sure I remember reading somewhere that the runtime/drivers weren't optimized for this case. However, I still think it would be worth trying out, especially as the drivers get better support for deferred command list generation.
[/quote]
You are right, I just ran a test on 10,000 cubes, single threaded:
[list]
[*]Immediate: 200 fps
[*]Deferred: 150 fps
[/list]
So a deferred context in a single-threaded application is slower (at least on my machine). Note that checking the threading support for command lists for my graphics card (AMD 6970M) returns false, so I assume that it is not supported natively by the driver but "emulated" by DX11...
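(For reference, this is roughly how I query that support flag - a minimal sketch against an existing ID3D11Device:)

    // Ask the runtime whether the driver natively supports command lists;
    // FALSE means the DX11 runtime emulates them in software.
    D3D11_FEATURE_DATA_THREADING threadingCaps = {};
    HRESULT hr = device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                             &threadingCaps, sizeof(threadingCaps));
    bool driverCommandLists = SUCCEEDED(hr) && threadingCaps.DriverCommandLists;
    bool driverConcurrentCreates = SUCCEEDED(hr) && threadingCaps.DriverConcurrentCreates;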

I'm not sure if the AMD GPUs support it yet, but my GTX 470 on the latest drivers says that it does. I'll have to check on my HD 6970 when I get home.

If your test is easily packageable, I would be happy to try it out on my machine to see how it performs.

[quote name='xoofx' timestamp='1321370413' post='4884183']
[quote name='MJP' timestamp='1321297420' post='4883879']
I'm pretty sure I remember reading somewhere that the runtime/drivers weren't optimized for this case. However, I still think it would be worth trying out, especially as the drivers get better support for deferred command list generation.
[/quote]
You are right, I just ran a test on 100,000 cubes, single threaded:
[list]
[*]Immediate: 200 fps
[*]Deferred: 150 fps
[/list]
So a deferred context in a single-threaded application is slower (at least on my machine). Note that checking the threading support for command lists for my graphics card (AMD 6970M) returns false, so I assume that it is not supported natively by the driver but "emulated" by DX11...
[/quote]
I would caution against making blanket statements about the performance of immediate vs. deferred submission - it depends entirely on what your renderer does while it is submitting work to the API. For example, if your engine does lots of work in between the API calls that it makes, then it would likely be beneficial to utilize multiple threads, which could reduce the total time needed to process a rendering path. On the other hand, if your submission routines are very bare bones and only make API calls, there could be some benefit to compiling a long list of commands into a command list and then reusing it from frame to frame. This will depend on the hardware, the driver, your engine, and the application that is using your engine - you need to profile and see if it is worth it in a given context. You could even test it dynamically on the first startup of your application and then choose the appropriate rendering method.

To that end, I would suggest setting up your rendering code so that it doesn't know whether it is using a deferred or immediate context, which lets you delay the decision about which method to use for as long as possible. That is just my own suggestion though - I have found it to be useful, but your mileage may vary!
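As a rough illustration of what I mean (only a sketch, and the names are made up):

    // The submission code only ever sees an ID3D11DeviceContext*, so it neither
    // knows nor cares whether it was handed the immediate or a deferred context.
    void SubmitScene(ID3D11DeviceContext* context);  // all state/draw calls go through this

    if (useDeferredPath)
    {
        ID3D11DeviceContext* deferred = nullptr;
        device->CreateDeferredContext(0, &deferred);

        SubmitScene(deferred);

        ID3D11CommandList* commandList = nullptr;
        deferred->FinishCommandList(FALSE, &commandList);
        immediateContext->ExecuteCommandList(commandList, FALSE);

        commandList->Release();
        deferred->Release();
    }
    else
    {
        SubmitScene(immediateContext);
    }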

[quote name='xoofx' timestamp='1321370413' post='4884183']
[quote name='MJP' timestamp='1321297420' post='4883879']
I'm pretty sure I remember reading somewhere that the runtime/drivers weren't optimized for this case. However, I still think it would be worth trying out, especially as the drivers get better support for deferred command list generation.
[/quote]
You are right, I just ran a test on 100,000 cubes, single threaded:
[list]
[*]Immediate: 200 fps
[*]Deferred: 150 fps
[/list]
So a deferred context in a single-threaded application is slower (at least on my machine). Note that checking the threading support for command lists for my graphics card (AMD 6970M) returns false, so I assume that it is not supported natively by the driver but "emulated" by DX11...
[/quote]

Thanks all

xoofx, in your test do you re-create the command list every frame, or do you create the command list once at startup and then just re-execute it every frame?

I'm amazed that AMD cards don't support multithreading at the driver level yet!

I'd be interested in seeing the results on an NVIDIA card too.

Thanks
Ben

I have released the executable test, along with some analysis of the results, [url="http://code4k.blogspot.com/2011/11/direct3d11-multithreading-micro.html"]here[/url].

Of course, I agree with Jason Z's comments about taking this kind of result with care, and with the fact that a renderer can easily be built to switch transparently between a deferred context and an immediate context.

To respond to your initial question, BenS1: it seems that hardware support for command lists doesn't change things much (preparing a command list once and running it on an immediate context) compared to using the default Direct3D 11 behavior.

The problem with using a deferred context in a single-threaded system is that you are doing more work per core in that situation: you have to prepare the CL, which takes some extra CPU overhead as the driver needs to do things, and then you have to access it again to send it to the card properly. Spread across multiple threads the cost-per-setup drops significantly and, if you batch them, your submission architecture will benefit greatly from code cache reuse (and, depending on how it's stored, maybe some data cache reuse too).

Any speed gain also depends very much on what you are doing. In a test case at work which was set up to be heavily CPU bound, switching on multithreaded CL support when NV's drivers were updated to fix it did give us a speed boost, but it wasn't that much. I then spent some time playing with the test case and discovered that when we got over a certain threshold of data per CL, we started spending more and more time in the buffer swapping function than anywhere else in the submission, due to the driver having to do more. (I can't recall the specifics, but from what I do recall drivers are limited memory-wise or something like that... basically we blew a buffer right out.)

However, up until that point the MT CL rendering WAS making a significant difference to our CPU time usage and we had near-perfect scaling [b]for the test case[/b].

The key point from all this: MT CL, if implemented by the drivers, will help, but ONLY with your CPU time.

I make a point of saying this because there is no 'hardware support' for CLs; Command Lists are purely a CPU-side thing. The difference is between letting the DX11 runtime cache the commands or letting the driver cache them and optimise them. (AMD still lacks support for this, although it is apparently 'coming soon'.)

(Also, as a side note, I do recall reading that 'create, store and reuse' isn't an optimal pattern for command lists. The runtime isn't really set up for this case and assumes you'll be remaking them each frame, which is a fair assumption: you can't chain them together to adjust each other's state, and most command lists will change each frame in a 'real world' situation, so it is best to test against that.)

[quote name='phantom' timestamp='1321785963' post='4885837']
I then spent some time playing with the test case and discovered that when we got over a certain threshold of data per CL, we started spending more and more time in the buffer swapping function than anywhere else in the submission, due to the driver having to do more. (I can't recall the specifics, but from what I do recall drivers are limited memory-wise or something like that... basically we blew a buffer right out.)[/quote]
This could come from the Map/Unmap on buffers: with an immediate context the driver can give you what is effectively direct DMA access to GPU memory, but with a deferred context it has to copy to a temporary buffer (which is probably in system RAM; I'm not sure whether it is in memory shared with the GPU)...

[quote name='phantom' timestamp='1321785963' post='4885837']
I make a point of saying this because there is no 'hardware support' for CLs; Command Lists are purely a CPU-side thing. The difference is between letting the DX11 runtime cache the commands or letting the driver cache them and optimise them. (AMD still lacks support for this, although it is apparently 'coming soon'.)
[/quote]
Indeed, if it is natively supported by the driver, it can be optimized. A coworker also found a performance boost on NVIDIA when they introduced support for command lists, though on AMD it is already fine without support from the driver... probably AMD's command buffer is already laid out the same way the Direct3D 11 command buffer is laid out...

[quote name='xoofx' timestamp='1321774033' post='4885817']
I have released the executable test, along with some analysis of the results, [url="http://code4k.blogspot.com/2011/11/direct3d11-multithreading-micro.html"]here[/url].

Of course, I agree with Jason Z's comments about taking this kind of result with care, and with the fact that a renderer can easily be built to switch transparently between a deferred context and an immediate context.

To respond to your initial question, BenS1: it seems that hardware support for command lists doesn't change things much (preparing a command list once and running it on an immediate context) compared to using the default Direct3D 11 behavior.
[/quote]

Wow, great article! Thanks.

It's a shame that the results show that command lists aren't really a faster way of repeating the same drawing commands over and over for a single-threaded renderer.

I suspect they have the potential to be faster if the driver developers had sufficient motivation to optimise this area of their code, especially if they optimised the command list when you call FinishCommandList. I guess the problem is that the driver has no idea whether you're only going to use the command list once and throw it away (in which case the act of optimising the command list may cost more than the potential gains), or whether you're going to create the command list once and execute it many times (in which case optimising the list may be beneficial).

I guess we'd need a tweak to the API so that you can either pass in a boolean to FinishCommandList to tell the driver whether the command list should be optimised or not, or maybe there could be a separate explicit OptimizeCommandList method.

Thanks again for your detailed analysis.

Thanks
Ben

[quote name='phantom' timestamp='1321785963' post='4885837']
The problem with using a deferred context in a single-threaded system is that you are doing more work per core in that situation: you have to prepare the CL, which takes some extra CPU overhead as the driver needs to do things, and then you have to access it again to send it to the card properly. Spread across multiple threads the cost-per-setup drops significantly and, if you batch them, your submission architecture will benefit greatly from code cache reuse (and, depending on how it's stored, maybe some data cache reuse too).

<snip>

(Also, as a side note, I do recall reading that 'create, store and reuse' isn't an optimal pattern for command lists. The runtime isn't really set up for this case and assumes you'll be remaking them each frame, which is a fair assumption: you can't chain them together to adjust each other's state, and most command lists will change each frame in a 'real world' situation, so it is best to test against that.)
[/quote]

Thanks Phantom, but in my case I was thinking of creating the command list once and then executing it for each frame.

As I'm sure you know, a command list containing a constant buffer will only contain references (or pointers) to the constant buffer and not the actual data contained in the buffer itself, so an app can still change the data in the constant buffer from frame to frame without having to create a new command list.

So, for example, I was thinking:
1. At startup, create a command list (DrawTankCL) that draws a tank at a position defined in a Constant Buffer ("TankCB")
2. Update TankCB.position on the CPU based on user input, physics etc.
3. ExecuteCommandList(DrawTankCL)
4. Repeat from step 2.

As you can see, the command list is created once and executed over and over, and yet the tank's position is still dynamic.
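In rough code, the per-frame part would look something like this (just a sketch - I'm assuming TankCB was created with D3D11_USAGE_DEFAULT so it can be updated with UpdateSubresource):

    // DrawTankCL is the command list recorded once at startup; only the
    // constant buffer contents change from frame to frame.
    tankData.position = newPosition;                                   // step 2: CPU-side update
    immediateContext->UpdateSubresource(TankCB, 0, nullptr, &tankData, 0, 0);
    immediateContext->ExecuteCommandList(DrawTankCL, FALSE);           // step 3: replay the recorded draws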

It's a shame that this "create, store, reuse" pattern is not optimised in the drivers.

Anyway, at least now I know the answer, so I can code my game accordingly.

Thanks for your help
Ben

There are two problems with your idea.

Firstly, you are being too fine-grained with your CL for it to really be useful. There is a good PDF from GDC 2011 which covers some of this (google: Jon Jansen DX11 Performance Gems, that should get you to it). The main thing is that a CL has overhead, apparently on the order of a few dozen API calls, so doing too little work in one is going to be a problem as it will just get swamped by that overhead. Depending on your setup, scenes or material groups are better fits for CL building and execution.

Secondly, you run the risk of suffering a stall at step 2. The driver buffers commands and the GPU should be working at the same time as you execute other work, so there is a chance that when you come to update in step 2 you could be waiting a 'significant' amount of time for the GPU to be done with your buffer and release it so that you can update it again. A discard-style lock or other update method [i]might[/i] avoid the problem - I've not tried it myself - but it still presents an issue.
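To be clear about what I mean by a discard-style update: if TankCB were a dynamic buffer (D3D11_USAGE_DYNAMIC with CPU write access), the per-frame update would look roughly like this - a sketch only, and whether it actually dodges the stall is something you'd have to profile:

    // WRITE_DISCARD lets the driver hand back a fresh region of memory rather
    // than making the CPU wait for the GPU to finish with the old contents.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(immediateContext->Map(TankCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        memcpy(mapped.pData, &tankData, sizeof(tankData));
        immediateContext->Unmap(TankCB, 0);
    }
    immediateContext->ExecuteCommandList(DrawTankCL, FALSE);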

[quote name='xoofx' timestamp='1321790455' post='4885852']
Indeed, if it is natively supported by the driver, it can be optimized. A coworker also found a performance boost on NVIDIA when they introduced support for command lists, though on AMD it is already fine without support from the driver... probably AMD's command buffer is already laid out the same way the Direct3D 11 command buffer is laid out...
[/quote]

NV is a strange beast; before they had 'proper' support they kind of emulated it by spinning up a 'server' thread and serialising the CL creation via that. Amusingly, if any of your active threads ended up on the same core as the server thread it tended to murder performance, but by staying clear of it you could get a small improvement. Once the drivers came out which did the work correctly, this problem went away.

In our test, NV with proper support soundly beat AMD without it; this was a GTX 470 vs a 5870 on otherwise basically identical hardware (i7 CPUs; the NV one had a few hundred MHz over the AMD one, but not enough to explain the performance delta seen). AMD's performance was more in line with the single-threaded version. However, our test was a very heavily CPU-bound one: 15,000 draw calls spread over 6 cores, each one drawing a single flat-shaded cube. Basically an API worst-case nightmare ;)

(Amusing side note: the same test/code on an X360 @ 720p could render at a solid 60 fps with a solid 16.6 ms frame time. That's command lists being generated each frame over 6 cores; it shows just how much CPU overhead/performance loss you take when running on Windows :( )

  • Similar Content

    • By amadeus12
      I wrote a compute shader to blur an image, but it returns a black texture.
      I'm not using an FX file or the effects library; I use HLSL files and bind them to the pipeline myself through the Direct3D API.
      I have 2 HLSL files which do the vertical and horizontal blur.
      Here is my CPU code which executes the compute shaders:
      void BoxApp::callComputeShaderandBlur(ID3D11DeviceContext * dc, ID3D11ShaderResourceView * inputSRV, ID3D11UnorderedAccessView * inputUAV, int blurcount)
      {
          for (int i = 0; i < blurcount; i++)
          {
              dc->CSSetShader(m_CSH, 0, 0);
              dc->CSSetShaderResources(0, 1, &inputSRV);
              dc->CSSetUnorderedAccessViews(0, 1, &mBlurOutPutTexUAV, 0);
            
              UINT numGroupsX = (UINT)ceilf(m_Width / 256.0f);
              dc->Dispatch(numGroupsX, m_Height, 1);
             
              dc->CSSetShaderResources(1, 0, 0);
              dc->CSSetUnorderedAccessViews(1, 0, 0, 0);
              dc->CSSetShader(m_CSV, 0, 0);
              dc->CSSetShaderResources(0, 1, &mBlurOutPutTexSRV);
              dc->CSSetUnorderedAccessViews(0, 1, &inputUAV, 0);
              UINT numGroupY = (UINT)ceilf(m_Height / 256.0f);
              dc->Dispatch(m_Width, numGroupY, 1);
              dc->CSSetShaderResources(1, 0, 0);
              dc->CSSetUnorderedAccessViews(1, 0, 0, 0);
          }
          dc->CSSetShaderResources(1, 0, 0);
          dc->CSSetUnorderedAccessViews(1, 0, 0, 0);
          dc->CSSetShader(0, 0, 0);
      }
      If I don't call this function, everything is fine. (I rendered my scene to an off-screen render target, used that texture as the quad's texture, and rendered it to the real render target; that worked fine.)
      That means there's a problem in the compute shader code.
      None of the resources or views are null pointers; I checked.
      All HRESULTs are S_OK.
       
      Here are my 2 shaders.

      This is CSH.hlsl:
      static float gWeights[11] =
      {
          0.05f, 0.05f, 0.1f, 0.1f, 0.1f, 0.2f, 0.1f, 0.1f, 0.1f, 0.05f, 0.05f,
      };
      static const int gBlurRadius = 5;
      Texture2D gInput;
      RWTexture2D<float4> gOutput;
      #define N 256
      #define CacheSize (N + 2*gBlurRadius)
      groupshared float4 gCache[CacheSize];
      [numthreads(N, 1, 1)]
      void main(int3 groupThreadID : SV_GroupThreadID,
          int3 dispatchThreadID : SV_DispatchThreadID)
      {
          //
          // Fill local thread storage to reduce bandwidth.  To blur 
          // N pixels, we will need to load N + 2*BlurRadius pixels
          // due to the blur radius.
          //
          // This thread group runs N threads.  To get the extra 2*BlurRadius pixels, 
          // have 2*BlurRadius threads sample an extra pixel.
          if (groupThreadID.x < gBlurRadius)
          {
              // Clamp out of bound samples that occur at image borders.
              int x = max(dispatchThreadID.x - gBlurRadius, 0);
              gCache[groupThreadID.x] = gInput[int2(x, dispatchThreadID.y)];
          }
          if (groupThreadID.x >= N - gBlurRadius)
          {
              // Clamp out of bound samples that occur at image borders.
              int x = min(dispatchThreadID.x + gBlurRadius, gInput.Length.x - 1);
              gCache[groupThreadID.x + 2 * gBlurRadius] = gInput[int2(x, dispatchThreadID.y)];
          }
          // Clamp out of bound samples that occur at image borders.
          gCache[groupThreadID.x + gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy - 1)];
          // Wait for all threads to finish.
          GroupMemoryBarrierWithGroupSync();
          //
          // Now blur each pixel.
          //
          float4 blurColor = float4(0, 0, 0, 0);
          [unroll]
          for (int i = -gBlurRadius; i <= gBlurRadius; ++i)
          {
              int k = groupThreadID.x + gBlurRadius + i;
              blurColor += gWeights[i + gBlurRadius] * gCache[k];
          }
          gOutput[dispatchThreadID.xy] = blurColor;
      }
      And this is CSV.hlsl:
       
      static float gWeights[11] =
      {
              0.05f, 0.05f, 0.1f, 0.1f, 0.1f, 0.2f, 0.1f, 0.1f, 0.1f, 0.05f, 0.05f,
      };
      static const int gBlurRadius = 5;
      Texture2D gInput;
      RWTexture2D<float4> gOutput;
      #define N 256
      #define CacheSize (256 + 2*5)
      groupshared float4 gCache[CacheSize];

      [numthreads(1, N, 1)]
      void main(int3 groupThreadID : SV_GroupThreadID,
          int3 dispatchThreadID : SV_DispatchThreadID)
      {
          //
          // Fill local thread storage to reduce bandwidth.  To blur 
          // N pixels, we will need to load N + 2*BlurRadius pixels
          // due to the blur radius.
          //
          // This thread group runs N threads.  To get the extra 2*BlurRadius pixels, 
          // have 2*BlurRadius threads sample an extra pixel.
          if (groupThreadID.y < gBlurRadius)
          {
              // Clamp out of bound samples that occur at image borders.
              int y = max(dispatchThreadID.y - gBlurRadius, 0);
              gCache[groupThreadID.y] = gInput[int2(dispatchThreadID.x, y)];
          }
          if (groupThreadID.y >= N - gBlurRadius)
          {
              // Clamp out of bound samples that occur at image borders.
              int y = min(dispatchThreadID.y + gBlurRadius, gInput.Length.y - 1);
              gCache[groupThreadID.y + 2 * gBlurRadius] = gInput[int2(dispatchThreadID.x, y)];
          }
          // Clamp out of bound samples that occur at image borders.
          gCache[groupThreadID.y + gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy - 1)];

          // Wait for all threads to finish.
          GroupMemoryBarrierWithGroupSync();
          //
          // Now blur each pixel.
          //
          float4 blurColor = float4(0, 0, 0, 0);
          [unroll]
          for (int i = -gBlurRadius; i <= gBlurRadius; ++i)
          {
              int k = groupThreadID.y + gBlurRadius + i;
              blurColor += gWeights[i + gBlurRadius] * gCache[k];
          }
          gOutput[dispatchThreadID.xy] = blurColor;
      }
       
       
      Sorry about my poor English.
      Please help, I'm really sad...
      I've spent the whole day on this...
      It doesn't work...
      Feels bad, man.
    • By maxest
      I implemented DX queries after this blog post:
      https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/
      Queries work perfectly fine... for as long as I don't use VSync or any other form of Sleep. Why would that happen? I record queries right before my Compute/Dispatch code, record right after, and then read the results (spinning on GetData while it returns S_FALSE).
      When I don't VSync, my code takes a consistent 0.39-0.4 ms. After turning VSync on it starts at something like 0.46 ms, after a second it bumps up to 0.61 ms, and a few seconds after that I get something like 1.2 ms.
      I also used this source:
      http://reedbeta.com/blog/gpu-profiling-101/
      The difference here is that the author uses the disjoint query for the whole Render() function instead of using one per particular measurement. When I implemented it this way the timings were inconsistent (like the 0.46, 0.61, 1.2 above) regardless of VSync.
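      For reference, my measurement pattern looks roughly like this (a simplified sketch following the MJP post; the disjointQuery/startQuery/endQuery objects are assumed to have been created with D3D11_QUERY_TIMESTAMP_DISJOINT and D3D11_QUERY_TIMESTAMP descs):

      // Wrap the two timestamps for this measurement in one disjoint query.
      context->Begin(disjointQuery);
      context->End(startQuery);                 // timestamp just before the compute work
      context->Dispatch(groupsX, groupsY, 1);
      context->End(endQuery);                   // timestamp just after the compute work
      context->End(disjointQuery);

      // Later: spin on GetData until the results are available.
      D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjoint = {};
      while (context->GetData(disjointQuery, &disjoint, sizeof(disjoint), 0) == S_FALSE) {}
      UINT64 start = 0, end = 0;
      while (context->GetData(startQuery, &start, sizeof(start), 0) == S_FALSE) {}
      while (context->GetData(endQuery, &end, sizeof(end), 0) == S_FALSE) {}

      if (!disjoint.Disjoint)
      {
          double milliseconds = double(end - start) / double(disjoint.Frequency) * 1000.0;
      }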
    • By Jemme
      Howdy
      I've got a WPF level editor and a C++ DirectX DLL.

      Here are the main functions:
      public static class Engine
      {
          //DX dll
          //Init
          [DllImport("Win32Engine.dll", CallingConvention = CallingConvention.Cdecl)]
          public static extern void Initialize(IntPtr hwnd, int Width, int Height);

          //Messages / Input
          [DllImport("Win32Engine.dll", CallingConvention = CallingConvention.Cdecl)]
          public static extern void HandleMessage(IntPtr hwnd, int msg, int wParam, int lParam);

          //Load
          [DllImport("Win32Engine.dll", CallingConvention = CallingConvention.Cdecl)]
          public static extern void Load();

          //Update
          [DllImport("Win32Engine.dll", CallingConvention = CallingConvention.Cdecl)]
          public static extern void Update();

          //Draw
          [DllImport("Win32Engine.dll", CallingConvention = CallingConvention.Cdecl)]
          public static extern void Draw();

          //Shutdown
          [DllImport("Win32Engine.dll", CallingConvention = CallingConvention.Cdecl)]
          public static extern void ShutDown();
      }

      Okay, so what is the proper way to get the window hosted inside a control and then pump the engine?
      At the moment I have it inside a (WinForms) panel and use:
       
      protected override void OnSourceInitialized(EventArgs e)
      {
          base.OnSourceInitialized(e);
          HwndSource source = PresentationSource.FromVisual(this) as HwndSource;
          source.AddHook(WndProc);
      }

      private static IntPtr WndProc(IntPtr hwnd, int msg, IntPtr wParam, IntPtr lParam, ref bool handled)
      {
          Engine.HandleMessage(hwnd, msg, (int)wParam, (int)lParam);
          Engine.Update();
          Engine.Draw();
          return IntPtr.Zero;
      }

      But there are just a few problems:
      - Messages come from everywhere, not just the panel (due to using the main window)
      - The input doesn't actually work
      - It's super duper ugly code-wise

      On the C++ side, the normal engine (non-editor) uses this pump:
      while (msg.message != WM_QUIT)
      {
          while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
          {
              TranslateMessage(&msg);
              DispatchMessage(&msg);

              //Input
              if (msg.message == WM_INPUT)
              {
                  //Buffer size
                  UINT size = 512;
                  BYTE buffer[512];
                  GetRawInputData((HRAWINPUT)msg.lParam, RID_INPUT, (LPVOID)buffer, &size, sizeof(RAWINPUTHEADER));
                  RAWINPUT *raw = (RAWINPUT*)buffer;
                  if (raw->header.dwType == RIM_TYPEKEYBOARD)
                  {
                      bool keyUp = raw->data.keyboard.Flags & RI_KEY_BREAK;
                      USHORT keyCode = raw->data.keyboard.VKey;
                      if (!keyUp)
                      {
                          Keyboard::SetKeyState(keyCode, true);
                      }
                      else
                      {
                          Keyboard::SetKeyState(keyCode, false);
                      }
                  }
              }
          }
          time->Update();
          engine->Update(time->DeltaTime());
          engine->Draw();
      }

      Not the nicest loop, but it works for now for testing and things.

      Now the editor version's code is:
       
      //Initalize enigne and all sub systems
      extern "C"
      {
          //Hwnd is a panel usually
          DLLExport void Initialize(int* hwnd, int Width, int Height)
          {
              engine = new Engine();
              time = new Timer();
              time->Update();

              if (engine->Initialize(Width, Height, (WINHANDLE)hwnd))
              {
                  //WindowMessagePump();
              }
              else
              {
                  //return a fail?
              }
          }
      }

      extern "C"
      {
          DLLExport void HandleMessage(int* hwnd, int msg, int wParam, int lParam)
          {
              //Input
              if (msg == WM_INPUT)
              {
                  //Buffer size
                  UINT size = 512;
                  BYTE buffer[512];
                  GetRawInputData((HRAWINPUT)lParam, RID_INPUT, (LPVOID)buffer, &size, sizeof(RAWINPUTHEADER));
                  RAWINPUT *raw = (RAWINPUT*)buffer;
                  if (raw->header.dwType == RIM_TYPEKEYBOARD)
                  {
                      bool keyUp = raw->data.keyboard.Flags & RI_KEY_BREAK;
                      USHORT keyCode = raw->data.keyboard.VKey;
                      if (!keyUp)
                      {
                          Keyboard::SetKeyState(keyCode, true);
                      }
                      else
                      {
                          Keyboard::SetKeyState(keyCode, false);
                      }
                  }
              }
          }
      }

      //Load
      extern "C" { DLLExport void Load() { engine->Load(); } }

      //Update
      extern "C" { DLLExport void Update() { time->Update(); engine->Update(time->DeltaTime()); } }

      //Draw
      extern "C" { DLLExport void Draw() { engine->Draw(); } }

      //ShutDown Engine
      extern "C" { DLLExport void ShutDown() { engine->ShutDown(); delete time; delete engine; } }
      Any advice on how to do this properly would be much appreciated.
      P.S. In my opinion the loop should kind of stay the same, but allow WPF to push the messages through somehow, with the loop within C++ still calling the update and draw, so:
       
      //Gets message from C# somehow
      while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
      {
          TranslateMessage(&msg);
          DispatchMessage(&msg);
          //Input
      }
      time->Update();
      engine->Update(time->DeltaTime());
      engine->Draw();
      //returns back to c#

      ^ Somehow have that in C++ as the message pump, called from WPF.
      Thanks
    • By data2
      I'm an experienced programmer specializing in computer graphics, mainly using Direct3D 9.0c, OpenGL and general algorithms. Currently, I am evaluating Direct2D as the rendering technology for a professional application dealing with medical image data. As for rendering, it is an x64 desktop application in windowed mode (not fullscreen).
       
      Already with my very initial steps I struggle with a task I thought would be a no-brainer: Rendering a single-channel bitmap on screen.
       
      Running on a Windows 8.1 machine, I create an ID2D1DeviceContext with a Direct3D swap chain buffer surface as render target. The swap chain is created from a HWND and buffer format DXGI_FORMAT_B8G8R8A8_UNORM. Note: See also the code snippets at the end.
       
      Afterwards, I create a bitmap with pixel format DXGI_FORMAT_R8_UNORM and alpha mode D2D1_ALPHA_MODE_IGNORE. When calling DrawBitmap(...) on the device context, a debug breakpoint is triggered with the debug message "D2d DEBUG ERROR - This operation is not compatible with the pixel format of the bitmap".
       
      I know that this output is quite clear. Also, when changing the pixel format to DXGI_FORMAT_R8G8B8A8_UNORM with DXGI_ALPHA_MODE_IGNORE everything works well and I see the bitmap rendered. However, I simply cannot believe that! Graphics cards have supported single-channel textures forever - every 3D graphics application can use them without thinking twice. That goes without saying.
       
      I tried to find anything here and on Google, without success. The only hint I could find was the MSDN Direct2D page listing the supported pixel formats. The documentation suggests - by not mentioning it - that DXGI_FORMAT_R8_UNORM is indeed not supported as a bitmap format. I also found posts talking about alpha masks (using DXGI_FORMAT_A8_UNORM), but that's not what I'm after.

      What am I missing that I can't convince Direct2D to create and draw a grayscale bitmap? Or is it really true that Direct2D doesn't support drawing of R8 or R16 bitmaps??
       
      Any help is really appreciated, as I don't know how to solve this. If I can't get these trivial basics to work, I think I'd have to stop digging deeper into Direct2D :-(.
       
      And here are the relevant code snippets. Please note that they might not compile since I ported this on the fly from my C++/CLI code to plain C++. Also, I threw away all error checking and other noise:
       
      Device, Device Context and Swap Chain Creation (D3D and Direct2D):
      // Direct2D factory creation
      D2D1_FACTORY_OPTIONS options = {};
      options.debugLevel = D2D1_DEBUG_LEVEL_INFORMATION;
      ID2D1Factory1* d2dFactory;
      D2D1CreateFactory(D2D1_FACTORY_TYPE_MULTI_THREADED, options, &d2dFactory);

      // Direct3D device creation
      const auto type = D3D_DRIVER_TYPE_HARDWARE;
      const auto flags = D3D11_CREATE_DEVICE_BGRA_SUPPORT;
      ID3D11Device* d3dDevice;
      D3D11CreateDevice(nullptr, type, nullptr, flags, nullptr, 0, D3D11_SDK_VERSION, &d3dDevice, nullptr, nullptr);

      // Direct2D device creation
      IDXGIDevice* dxgiDevice;
      d3dDevice->QueryInterface(__uuidof(IDXGIDevice), reinterpret_cast<void**>(&dxgiDevice));
      ID2D1Device* d2dDevice;
      d2dFactory->CreateDevice(dxgiDevice, &d2dDevice);

      // Swap chain creation
      DXGI_SWAP_CHAIN_DESC1 desc = {};
      desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
      desc.SampleDesc.Count = 1;
      desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
      desc.BufferCount = 2;
      IDXGIAdapter* dxgiAdapter;
      dxgiDevice->GetAdapter(&dxgiAdapter);
      IDXGIFactory2* dxgiFactory;
      dxgiAdapter->GetParent(__uuidof(IDXGIFactory), reinterpret_cast<void **>(&dxgiFactory));
      IDXGISwapChain1* swapChain;
      dxgiFactory->CreateSwapChainForHwnd(d3dDevice, hwnd, &swapChainDesc, nullptr, nullptr, &swapChain);

      // Direct2D device context creation
      const auto options = D2D1_DEVICE_CONTEXT_OPTIONS_NONE;
      ID2D1DeviceContext* deviceContext;
      d2dDevice->CreateeviceContext(options, &deviceContext);

      // create render target bitmap from swap chain
      IDXGISurface* swapChainSurface;
      swapChain->GetBuffer(0, __uuidof(swapChainSurface), reinterpret_cast<void **>(&swapChainSurface));
      D2D1_BITMAP_PROPERTIES1 bitmapProperties;
      bitmapProperties.dpiX = 0.0f;
      bitmapProperties.dpiY = 0.0f;
      bitmapProperties.bitmapOptions = D2D1_BITMAP_OPTIONS_TARGET | D2D1_BITMAP_OPTIONS_CANNOT_DRAW;
      bitmapProperties.pixelFormat.format = DXGI_FORMAT_B8G8R8A8_UNORM;
      bitmapProperties.pixelFormat.alphaMode = D2D1_ALPHA_MODE_IGNORE;
      bitmapProperties.colorContext = nullptr;
      ID2D1Bitmap1* swapChainBitmap = nullptr;
      deviceContext->CreateBitmapFromDxgiSurface(swapChainSurface, &bitmapProperties, &swapChainBitmap);

      // set swap chain bitmap as render target of D2D device context
      deviceContext->SetTarget(swapChainBitmap);
      D2D single-channel Bitmap Creation:
      const D2D1_SIZE_U size = { 512, 512 };
      const UINT32 pitch = 512;
      D2D1_BITMAP_PROPERTIES1 d2dProperties;
      ZeroMemory(&d2dProperties, sizeof(D2D1_BITMAP_PROPERTIES1));
      d2dProperties.pixelFormat.alphaMode = D2D1_ALPHA_MODE_IGNORE;
      d2dProperties.pixelFormat.format = DXGI_FORMAT_R8_UNORM;
      char* sourceData = new char[512*512];
      ID2D1Bitmap1* d2dBitmap;
      deviceContext->DeviceContextPointer->CreateBitmap(size, sourceData, pitch, d2dProperties, &d2dBitmap);
      Bitmap drawing (FAILING):
      deviceContext->BeginDraw();
      D2D1_COLOR_F d2dColor = {};
      deviceContext->Clear(d2dColor);

      // THIS LINE FAILS WITH THE DEBUG BREAKPOINT IF SINGLE CHANNELED
      deviceContext->DrawBitmap(bitmap, nullptr, 1.0f, D2D1_INTERPOLATION_MODE_LINEAR, nullptr);

      swapChain->Present(1, 0);
      deviceContext->EndDraw();