• Hi guys, when I do picking followed by ray-plane intersection the results are all wrong. I am pretty sure my ray-plane intersection is correct so I'll just show the picking part. Please take a look:

    // get projection_matrix
    DirectX::XMFLOAT4X4 mat;
    DirectX::XMStoreFloat4x4(&mat, projection_matrix);

    float2 v;
    v.x = (((2.0f * (float)mouse_x) / (float)screen_width) - 1.0f) / mat._11;
    v.y = -(((2.0f * (float)mouse_y) / (float)screen_height) - 1.0f) / mat._22;

    // get inverse of view_matrix
    DirectX::XMMATRIX inv_view = DirectX::XMMatrixInverse(nullptr, view_matrix);
    DirectX::XMStoreFloat4x4(&mat, inv_view);

    // create ray origin (camera position)
    float3 ray_origin;
    ray_origin.x = mat._41;
    ray_origin.y = mat._42;
    ray_origin.z = mat._43;

    // create ray direction
    float3 ray_dir;
    ray_dir.x = v.x * mat._11 + v.y * mat._21 + mat._31;
    ray_dir.y = v.x * mat._12 + v.y * mat._22 + mat._32;
    ray_dir.z = v.x * mat._13 + v.y * mat._23 + mat._33;

That should give me a ray origin and direction in world space but when I do the ray-plane intersection the results are all wrong.
If I click on the bottom half of the screen, ray_dir.z becomes negative (more so as I click lower). I don't understand how that can be; shouldn't it always be pointing down the z-axis?
I had this working in the past but I can't find my old code.

• Hi,
I finally managed to get the DX11 emulating Vulkan device working, but everything is flipped vertically now because Vulkan has a different clip space. What are the best practices out there to keep these implementations consistent? I tried using a vertically flipped viewport, and while it works on an Nvidia 1050, the Vulkan debug layer is throwing error messages that this is not supported in the spec, so it might not work on other hardware. There is also the possibility of flipping the clip space position Y coordinate before writing it out in the vertex shader, but that requires changing and recompiling every shader. I could also bake it into the camera projection matrices, though I want to avoid that because then I need to track down every place in the engine where I upload matrices... Any chance of an easy extension or something? If not, I will probably go with changing the vertex shaders.
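
The "bake it into the projection matrix" option mentioned above could look roughly like the sketch below. It assumes a row-major matrix used with row vectors (clip = v * M, as in DirectXMath and `mul(float4(pos, 1), M)`), so clip-space Y comes from the second column. `Mat4` and `flipClipY` are hypothetical names for illustration only, not part of any API.

```cpp
#include <array>
#include <cassert>

// Row-major 4x4 matrix, row-vector convention: clip = v * M (an assumption).
using Mat4 = std::array<std::array<float, 4>, 4>;

// Returns a copy of the projection matrix whose clip-space Y output is negated.
// With row vectors, clip.y is the dot product of v with column 1, so negating
// that column negates clip.y for every input vertex.
Mat4 flipClipY(Mat4 proj) {
    for (auto& row : proj) {
        row[1] = -row[1];
    }
    return proj;
}
```

Note that flipping Y also reverses the on-screen triangle winding, so front-face or cull-mode settings may need to change along with it.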

• Hello,
in my game engine I want to implement my own bone weight painting tool, that is, a virtual brush for painting weights onto a mesh.
I have already implemented my own "dual quaternion skinning" animation system with "morphs" (= blend shapes) and "bone driven" "corrective morphs" (= a morph that depends on a bending or twisting bone).
But now I have no idea what the best method is to implement a brush painting system.
Just some proposals:
a. I would build a kind of additional "vertex structure" that helps me find the indices of the surrounding (neighbouring) vertices for a given "central vertex" index
b. the structure should also give the distance from each neighbouring vertex to the given "central vertex"
c. calculate the strength of the color added to the "central vertex" and the neighbouring vertices with a formula using linear or quadratic distance falloff
d. the central vertex would be detected as the vertex that is hit by an orthogonal projection from my cursor (= brush) in world space onto the mesh
But my problem is that several vertices can be hit simultaneously. E.g. I want to paint the inward side of the left leg; the right leg will also be hit.
I think this problem is quite typical and there are standard approaches that I don't know.
Any help or tutorials are welcome.
P.S. I am working with SharpDX, DirectX11
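
The falloff in proposal (c) could be sketched like this. It is a minimal illustration in C++ rather than SharpDX/C#, and the names `Falloff` and `brushWeight` are made up for the example. It only covers the weight formula, not the neighbour lookup or the hit detection.

```cpp
#include <cassert>

enum class Falloff { Linear, Quadratic };

// Brush strength for a vertex at `distance` from the central vertex.
// Returns 1 at the brush center and 0 at or beyond the brush radius.
float brushWeight(float distance, float radius, Falloff falloff) {
    assert(radius > 0.0f);
    if (distance <= 0.0f) return 1.0f;
    if (distance >= radius) return 0.0f;
    const float t = 1.0f - distance / radius;  // 1 at center, 0 at the edge
    return falloff == Falloff::Linear ? t : t * t;
}
```

Halfway across a brush of radius 2, for example, `brushWeight(1.0f, 2.0f, Falloff::Linear)` gives 0.5 and the quadratic variant gives 0.25.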

• Hi, I'm implementing a simple 3D engine based on DirectX11. I'm trying to render a skybox with a cubemap on it and to do so I'm using DDS Texture Loader from DirectXTex library. I use texassemble to generate the cubemap (texture array of 6 textures) into a DDS file that I load at runtime. I generated a cube "dome" and sample the texture using the position vector of the vertex as the sample coordinates (so far so good), but I always get the same face of the cubemap mapped on the sky. As I look around I always get the same face (and it wobbles a bit if I move the camera). My code:
    // Texture.cpp:
    Texture::Texture(const wchar_t *textureFilePath, const std::string &textureType) : mType(textureType)
    {
        //CreateDDSTextureFromFile(Game::GetInstance()->GetDevice(), Game::GetInstance()->GetDeviceContext(), textureFilePath, &mResource, &mShaderResourceView);
        CreateDDSTextureFromFileEx(Game::GetInstance()->GetDevice(), Game::GetInstance()->GetDeviceContext(), textureFilePath, 0, D3D11_USAGE_DEFAULT, D3D11_BIND_SHADER_RESOURCE, 0, D3D11_RESOURCE_MISC_TEXTURECUBE, false, &mResource, &mShaderResourceView);
    }

    // SkyBox.cpp:
    void SkyBox::Draw()
    {
        // set cube map
        ID3D11ShaderResourceView *resource = mTexture.GetResource();
        Game::GetInstance()->GetDeviceContext()->PSSetShaderResources(0, 1, &resource);

        // set primitive topology
        Game::GetInstance()->GetDeviceContext()->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);

        mMesh.Bind();
        mMesh.Draw();
    }

    // Vertex Shader:
    cbuffer Transform : register(b0)
    {
        float4x4 viewProjectionMatrix;
    };

    float4 main(inout float3 pos : POSITION) : SV_POSITION
    {
        return mul(float4(pos, 1.0f), viewProjectionMatrix);
    }

    // Pixel Shader:
    SamplerState cubeSampler;
    TextureCube cubeMap;

    float4 main(in float3 pos : POSITION) : SV_TARGET
    {
        float4 color = cubeMap.Sample(cubeSampler, pos.xyz);
        return color;
    }

I tried both functions from the DDS loader but I keep getting the same result. All results I found on the web are about the old SDK toolkits, but I'm using the new DirectXTex lib.
• By B. /
Hi Guys,
I want to draw the shadows of a directional light, but the shadows always disappear if I translate my mesh (a cube) too far outside the bounds of my orthographic projection matrix.
This is my code (based on an XNA sample that I recoded for my project):

    // Matrix that will rotate points into the direction of the light
    Matrix lightRotation = Matrix.LookAtLH(Vector3.Zero, lightDir, Vector3.Up);

    BoundingFrustum cameraFrustum = new BoundingFrustum(Matrix.Identity);

    // Get the corners of the frustum
    Vector3[] frustumCorners = cameraFrustum.GetCorners();

    // Transform the positions of the corners into the direction of the light
    for (int i = 0; i < frustumCorners.Length; i++)
        frustumCorners[i] = Vector4F.ToVector3(Vector3.Transform(frustumCorners[i], lightRotation));

    // Find the smallest box around the points
    BoundingBox lightBox = BoundingBox.FromPoints(frustumCorners);
    Vector3 boxSize = lightBox.Maximum - lightBox.Minimum;
    Vector3 halfBoxSize = boxSize * 0.5f;

    // The position of the light should be in the center of the back panel of the box.
    Vector3 lightPosition = lightBox.Minimum + halfBoxSize;
    lightPosition.Z = lightBox.Minimum.Z;

    // We need the position back in world coordinates so we transform
    // the light position by the inverse of the light's rotation
    lightPosition = Vector4F.ToVector3(Vector3.Transform(lightPosition, Matrix.Invert(lightRotation)));

    // Create the view matrix for the light
    this.view = Matrix.LookAtLH(lightPosition, lightPosition + lightDir, Vector3.Up);

    // Create the projection matrix for the light
    // The projection is orthographic since we are using a directional light
    int amount = 25;
    this.projection = Matrix.OrthoOffCenterLH(boxSize.X - amount, boxSize.X + amount, boxSize.Y + amount, boxSize.Y - amount, -boxSize.Z - amount, boxSize.Z + amount);

I believe the bug is that cameraFrustum is built from Matrix.Identity. I also tried a translation matrix of my camera position and also the view matrix of my camera, but without success.
Can anyone tell me how to always draw the shadows of my directional light wherever my camera currently is in the scene?
Greets
Benjamin

# DX11 ID3D11Query reporting weird results

I implemented DX queries after this blog post:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Queries work perfectly fine... as long as I don't use VSync or any other form of Sleep. Why would that happen? I record a query right before my Compute/Dispatch code, another right after it, and then read the results (spinning on GetData while it returns S_FALSE).
Without VSync my code consistently takes 0.39-0.4 ms. After turning VSync on it starts at something like 0.46 ms, bumps up to 0.61 ms after a second, and a few seconds later I get something like 1.2 ms.
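
For context, the step that turns a pair of timestamp values into milliseconds can be sketched as below. This is a portable illustration, not the code used in this thread: `TimestampDisjointData` is a hypothetical stand-in mirroring the two fields of `D3D11_QUERY_DATA_TIMESTAMP_DISJOINT` (Frequency and Disjoint), and `gpuIntervalMs` is a made-up helper name.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for D3D11_QUERY_DATA_TIMESTAMP_DISJOINT (assumption: same meaning of fields).
struct TimestampDisjointData {
    std::uint64_t frequency;  // GPU timestamp ticks per second
    bool disjoint;            // true if the timestamps in this interval are unreliable
};

// Converts a begin/end timestamp pair to milliseconds.
// Returns false and leaves outMs untouched when the sample should be thrown away.
bool gpuIntervalMs(std::uint64_t beginTicks, std::uint64_t endTicks,
                   const TimestampDisjointData& disjointData, double& outMs) {
    if (disjointData.disjoint || disjointData.frequency == 0 || endTicks < beginTicks) {
        return false;
    }
    outMs = static_cast<double>(endTicks - beginTicks) * 1000.0 /
            static_cast<double>(disjointData.frequency);
    return true;
}
```

With a frequency of 1,000,000 ticks per second, a difference of 400 ticks corresponds to 0.4 ms. Samples flagged as disjoint are discarded rather than converted.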

I also used this source:
http://reedbeta.com/blog/gpu-profiling-101/
The difference here is that the author uses the disjoint query for the whole Render() function instead of one per particular measurement. When I implemented it this way the timings were inconsistent (like the 0.46, 0.61, 1.2 above) regardless of VSync.

Are you spinning on the query results immediately, or do you wait a frame before trying to get the results?

This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

> Are you spinning on the query results immediately, or do you wait a frame before trying to get the results?

I tested both. No difference.

10 hours ago, ajmiles said:

> This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.
>
> By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

I thought about something along those lines but quickly came to a conclusion that it should not take place. I thought that everything should go and take as much time as in no-VSync case because it's the Present where the waiting happens; why would any redundant work happen in my actual computation time?
I just checked how much time Present takes with VSync and indeed it's something around 15 ms, with some variance of course. So still it's a mystery to me why the computation code I profile would take more time in VSync mode. Wonder if that would also be the case under D3D12.

EDIT: Encompassing the whole Render function with one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation. It behaves exactly the same as a Begin/End of the disjoint query right before and after the block we're profiling.

Edited by maxest

Even if you time only the work you're interested in (and not the whole frame), it's still going to take a variable amount of time depending on how high the GPU's clock speed happens to be at that point in time.

If the GPU can see it's only doing 2ms of work every 16ms, then it may underclock itself by a factor of 3-4x such that the 2ms of work ends up taking 6ms-8ms instead.

What's happening is something like this:

1) At 1500MHz, your work takes 0.4ms and ~16.2ms is spent idle at the end of the frame.
2) The GPU realises it could run a bit slower and still be done in plenty of time so it underclocks itself just a little bit to save power.
3) At 1200MHz, your work takes 0.5ms and ~16.1ms is spent idle at the end of the frame.
4) Still plenty of time spare, so it underclocks itself even further.
5) At 900MHz, your work takes 0.6ms and ~16.0ms is spent idle at the end of the frame.
6) *Still* plenty of time spare, so it dramatically underclocks itself.
7) At 500MHz, your work takes 3x longer than it did originally, now costing 1.2ms. There's still 15.4ms of idle time at the end of the frame, so this is still OK.
8) At this point the GPU may not have any lower power states to clock down to, so the work never takes any more than 1.2ms.
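
The arithmetic behind the steps above can be sketched as follows, assuming the work is purely clock-bound so that its duration scales inversely with GPU frequency. The function names are hypothetical, and the 16.6 ms frame length is inferred from the work and idle figures in the list.

```cpp
#include <cassert>

// Time the same work would take at a different clock speed, assuming
// execution time is inversely proportional to frequency (a simplification).
double scaledWorkMs(double baseWorkMs, double baseClockMHz, double newClockMHz) {
    assert(baseClockMHz > 0.0 && newClockMHz > 0.0);
    return baseWorkMs * baseClockMHz / newClockMHz;
}

// Idle time left in a frame once the work is done (roughly 60 Hz by default).
double idleMs(double workMs, double frameMs = 16.6) {
    return frameMs - workMs;
}
```

Starting from 0.4 ms at 1500 MHz, this gives 0.5 ms at 1200 MHz and 1.2 ms at 500 MHz, matching the list. At 900 MHz strict inverse scaling gives about 0.67 ms, so the 0.6 ms quoted there is best read as a rounded illustration.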

In D3D12 we (Microsoft) added an API called ID3D12Device::SetStablePowerState, in part to address this problem.

This API fixes the GPU's clock speed to something it can always run at without having to throttle back due to thermal or power limitations. So if your GPU has a "Base Clock" of 1500MHz but can periodically "Boost" to 1650MHz, we'll fix the clock speed to 1500MHz. Note that this API does not work on end users' machines as it requires Debug bits to be installed, so it can't be used in retail titles. Note also that performance will likely be worse than on an end user's machine because we've artificially limited the clock speed below the peak to ensure a stable and consistent clock speed. With this in place, profiling becomes easier because the clock speed is known to be stable across runs and won't clock up and down as in your situation.

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Edited by SoldierOfLight

@ajmiles: Thank you so much for this detailed explanation. I hadn't thought about the GPU changing its clock speed. This makes more sense than performing some redundant work :).

I have checked what you proposed. I took a simple DX12 sample, called SetStablePowerState with true (I needed to turn on Developer Mode on my Windows 10; I wasn't aware of its existence) and then called a permanent Sleep. Then I ran my application. Now regardless of whether I use VSync or not, or call Sleep in my app or not, I get a consistent 0.46 ms. That's more than the 0.4 ms I get without VSync and without SetStablePowerState, but at least it's stable. So as I understand it, the GPU is working at a lower clock speed than it could (without Boost) but this speed is fixed.

I have one more case whose results I don't entirely understand. I have code of this form:

-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profile
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler


The GPU profiler reports 5 ms whereas CPU reports 2-3 ms. If anything, should the CPU timer not report time bigger than GPU? I download around 1 MB of data. When I measure with CPU timer only CopyResource and Map I get around 1 ms.

I would just like to ask one more related thing. In my search for reliable counters I stumbled upon this (https://msdn.microsoft.com/en-us/library/windows/desktop/ff476364(v=vs.85).aspx) but could find no simple example of usage. Does it work at all?

Edited by maxest

> That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Interesting, it might be that we haven't pushed anything out yet with that change in. It still exists in the Creators Update SDK and whatever release of Windows 10 'maxest' is running it still seems to work.

I'll follow up with you offline why we decided the API wasn't useful. It feels like it still has value in scenarios where you want a consistent time from run-to-run and want to analyse whether an algorithmic change improves performance or not. Even if it doesn't give you real numbers for any user in the real world, consistency across runs still seems useful during development / optimisation.

38 minutes ago, maxest said:

> I have one more case whose results I don't entirely understand. I have code of this form:
>
> -- Begin CPU Profiler (with QueryPerformanceCounter etc.)
> -- Begin GPU Profile
> Map
> -- End GPU Profiler
> do something with mapped data
> Unmap
> -- End CPU Profiler
>
> The GPU profiler reports 5 ms whereas CPU reports 2-3 ms. If anything, should the CPU timer not report time bigger than GPU? I download around 1 MB of data. When I measure with CPU timer only CopyResource and Map I get around 1 ms.

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make being a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than 1 by 1. The Begin/End Query packets are no different. It may be that the Timestamp query you've inserted after the "Map" is the first command after a batch of commands is sent to the GPU and therefore it isn't immediately sent to the GPU after the CopyResource/Map events have executed. Therefore, my theory is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to get flushed and the GPU starts executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?
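
To make the theory concrete, here is a toy model of batched submission. It is not how the D3D11 runtime or any driver is implemented, and all types and names are invented for the example. Each batch reaches the GPU at some flush time, the GPU idles until then, and a timestamp records the GPU time at the moment it is reached.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Command {
    double gpuCostMs;  // GPU execution time; ignored for timestamps
    bool isTimestamp;  // true for a timestamp query, false for real work
};

struct Batch {
    double flushTimeMs;             // time at which the batch is handed to the GPU
    std::vector<Command> commands;  // commands in submission order
};

// Plays the batches in order and returns the GPU time of every timestamp.
std::vector<double> runBatches(const std::vector<Batch>& batches) {
    std::vector<double> stamps;
    double gpuClockMs = 0.0;
    for (const Batch& batch : batches) {
        // The GPU cannot start a batch before it has been flushed to it.
        gpuClockMs = std::max(gpuClockMs, batch.flushTimeMs);
        for (const Command& cmd : batch.commands) {
            if (cmd.isTimestamp) {
                stamps.push_back(gpuClockMs);
            } else {
                gpuClockMs += cmd.gpuCostMs;
            }
        }
    }
    return stamps;
}

// Interval between the first two timestamps, as a profiler would report it.
double measuredMs(const std::vector<Batch>& batches) {
    const std::vector<double> stamps = runBatches(batches);
    assert(stamps.size() >= 2);
    return stamps[1] - stamps[0];
}
```

In this model, a 1 ms copy whose closing timestamp lands in a second batch that is only flushed at 5 ms reports 5 ms, while the same copy with the closing timestamp in the first batch reports 1 ms.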

I've never seen D3D11_COUNTER used before, but Jesse (SoldierOfLight) may know whether it ever saw any use.

Edited by ajmiles

As far as counters go, they're all for IHV-specific counters. In D3D10 there were API-defined counters, but they were deprecated in D3D11.

The current model for performance counters is the plugin model exposed by PIX.

Also I just checked, and apparently I was wrong about SetStablePowerState: we did keep it around, we just moved it from requiring the D3D12 debug layers to requiring developer mode. My bad.

Edited by SoldierOfLight

