DX11 and multicore CPU

I have been reading some of the Intel TBB documentation.
From what I understand, generating tasks and executing them with the task scheduler is very useful for application logic (in my case, a game).
I was wondering whether tasks and the scheduler can also be used for rendering operations in DX11.
I seem to recall that a recent DirectX 11 SDK includes an example that explains how to speed up rendering on multicore CPUs.
Where can I find some documentation?
Google turns up a lot of articles, but no technical ones covering C++ and DX11.

I suppose any tutorial on multithreading will do.

Multithreading is how we take advantage of rising core counts and simultaneous thread execution.
It's not really a DX11 thing. AFAIK DX11 doesn't care whether your CPU is Qualcomm or Intel, single or multicore. Please correct me if I'm wrong.

If you want tutorials on how to split up repeated DX work, I think you can look at just about anything related to graphics and multithreading. Study the graphics pipeline and get a feeling for which things are thread safe and which are not.

For instance, a geometry generator or texture loader could easily run in a separate thread without crashing a DX application.

But I'm rambling. Maybe that's not really what you wanted to know.

[Edited by - SuperVGA on November 30, 2010 6:57:23 AM]

Direct3D 11 natively supports multithreading, and allows for multithreaded resource creation as well as multithreaded command list creation. The latter point is more interesting for performance, and could significantly reduce the amount of CPU overhead involved with submitting all of the API calls.

There is a sample in the DXSDK that shows how to implement the multithreaded command list generation. The Hieroglyph 3 engine also implements multithreading, so you can take a look at that too.
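
To make the record-then-submit idea concrete, here is a minimal sketch in portable C++. It is not the SDK sample's code and it does not call Direct3D at all: `CommandList`, `DeferredContext` and `RenderFrame` are hypothetical stand-ins that only model the pattern of one deferred context per worker thread, with the finished lists submitted in order from a single thread.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command list: an ordered record of calls.
using CommandList = std::vector<std::string>;

// Hypothetical stand-in for a deferred context: it records calls
// instead of executing them.
struct DeferredContext {
    CommandList recorded;

    void Draw(int objectId) {
        recorded.push_back("Draw " + std::to_string(objectId));
    }

    // Hands over everything recorded so far and resets the context.
    CommandList Finish() {
        CommandList out;
        out.swap(recorded);
        return out;
    }
};

// Records objects [0, objectCount) on workerCount threads, then "submits"
// the finished lists in order on the calling thread.
std::vector<std::string> RenderFrame(int objectCount, int workerCount) {
    assert(workerCount > 0);
    std::vector<CommandList> lists(workerCount);
    std::vector<std::thread> workers;

    for (int w = 0; w < workerCount; ++w) {
        workers.emplace_back([&lists, w, objectCount, workerCount] {
            DeferredContext ctx;  // one context per thread, never shared
            const int begin = objectCount * w / workerCount;
            const int end = objectCount * (w + 1) / workerCount;
            for (int i = begin; i < end; ++i) {
                ctx.Draw(i);
            }
            lists[w] = ctx.Finish();  // each thread writes only its own slot
        });
    }
    for (auto& t : workers) {
        t.join();
    }

    // Submission stays on one thread, in a fixed order.
    std::vector<std::string> submitted;
    for (const auto& list : lists) {
        submitted.insert(submitted.end(), list.begin(), list.end());
    }
    return submitted;
}
```

In real Direct3D 11 code the corresponding calls would be `ID3D11Device::CreateDeferredContext` to create each per-thread context, `ID3D11DeviceContext::FinishCommandList` on the worker, and `ID3D11DeviceContext::ExecuteCommandList` on the immediate context.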

Any insight into when multithreaded command list generation is beneficial? When I run the SDK sample, the multithreaded paths take a few extra milliseconds, despite using extra CPU time (which I suppose means it's actually multithreaded at least, haha).

I assume it must have something to do with the length of the command lists.

It does indeed have to do with the size of the command lists, and it also depends on how many cores your machine has. If you run it on a single or dual core machine, it may not help at all (and may actually decrease performance).

If your app is CPU bound though, it could be quite beneficial to parallelize the CPU submission calls. I don't have any hard numbers, but I have been developing some stress tests to see if I can find where it starts to benefit from having the multiple cores. Hopefully within the next couple weeks I will be able to add the stress test to my engine distribution...
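
As a rough sketch of how those two factors (core count and amount of work per command list) could feed a decision, something like the following might be a starting point. `ChooseWorkerCount` and its `minDrawsPerWorker` threshold are hypothetical names, not part of any API, and the threshold is a placeholder you would have to tune by measuring your own application.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical heuristic. A return value of 1 means "stay single-threaded
// and use the immediate context only".
unsigned ChooseWorkerCount(unsigned hardwareThreads,
                           unsigned drawCalls,
                           unsigned minDrawsPerWorker) {
    assert(minDrawsPerWorker > 0);

    // On single and dual core machines the extra threads may not pay off.
    if (hardwareThreads <= 2) {
        return 1;
    }

    // How many workers would each get a command list of useful length.
    const unsigned byWork = drawCalls / minDrawsPerWorker;
    // Leave one hardware thread free for the submitting thread.
    const unsigned byCores = hardwareThreads - 1;

    return std::max(1u, std::min(byWork, byCores));
}
```

The first argument would typically come from `std::thread::hardware_concurrency()`, which is allowed to return 0 when the value is unknown; this sketch treats that case as single-threaded.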

The main "problem" right now is that while drivers support multi-threaded resource creation they (when I last checked) didn't natively support multi-threaded command list creation.

So, the runtime and driver kind of 'fake' it right now (effectively storing everything up and then submitting as normal); although I think we are getting closer to native support for this functionality.

The main point being that, at the moment, any stress test won't give an accurate view of how things will work in the future :)

The 'Driver Command Lists' ability reported by the runtime is based on the driver's ability to simultaneously create command lists from multiple threads. However, multithreading can still pay off, depending on how much work is done to generate a rendering sequence. For example, if the transformation matrix concatenation is performed on each thread for independent rendering passes, then the number of transforms each core has to perform is the total divided by the number of cores on the machine (neglecting memory access pattern effects). This will still have a positive effect on performance, regardless of whether the command lists are created in parallel or sequentially.
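
A minimal sketch of that kind of CPU-side split, in portable C++ with no Direct3D calls: each thread concatenates the world matrices for its own slice of objects with a shared view-projection matrix. `Mat4`, `Multiply` and `ConcatenateParallel` are hypothetical helpers written for this illustration, and a real engine would more likely use a task scheduler and a SIMD math library than raw `std::thread`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Mat4 = std::array<float, 16>;  // row-major 4x4 matrix

// Plain 4x4 matrix product, r = a * b.
Mat4 Multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            for (int k = 0; k < 4; ++k) {
                r[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
            }
        }
    }
    return r;
}

// Concatenates every world matrix with viewProj, splitting the objects
// into threadCount contiguous slices, one slice per thread.
std::vector<Mat4> ConcatenateParallel(const std::vector<Mat4>& world,
                                      const Mat4& viewProj,
                                      unsigned threadCount) {
    assert(threadCount > 0);
    std::vector<Mat4> out(world.size());
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = world.size() * t / threadCount;
            const std::size_t end = world.size() * (t + 1) / threadCount;
            // Slices do not overlap, so no locking is needed.
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = Multiply(world[i], viewProj);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return out;
}
```

The results are identical to a serial loop; only the wall-clock time changes, and only if there are enough objects to outweigh the cost of starting the threads.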

Out of curiosity, I just ran the multithreaded rendering sample from the SDK on both my laptop and my desktop. Here are the configurations and the results:

Laptop: Core 2 Duo 1.6 GHz, 2 GB, Vista, 8600M GT - Single threaded ~5-6 FPS, Multi-threaded ~8-9 FPS
Desktop: Phenom II X4 3.2 GHz, 4 GB, Win7, 5700 series Radeon - Single threaded ~22-24 FPS, Multi-threaded ~44-45 FPS

So in both cases I see a significant speedup with MT activated: roughly 1.5x on the laptop and close to 2x on the desktop. Also interesting is the fact that the laptop GPU is a DX10 part, meaning that the CPU savings are achieved with a DX10 driver under the DX11 runtime. I suspect that if you see degraded performance in this sample, you are probably GPU bound, meaning that the parallelism can't help you anyway.

Hmm, interesting. I updated my GPU drivers just now and it has improved the performance of the multithreaded path. Immediate mode still appears to be the fastest, so like you said, I must be GPU bound.

Desktop: i7 3.4 GHz, 64-bit Win7, HD5750

Immediate ~65 fps
ST Def/Scene ~39 fps
MT Def/Scene ~54 fps
ST Def/Chunk ~50 fps
MT Def/Chunk ~55 fps
