
[SlimDX] D3D Performance Issue in Comparison with OpenGL


I'm currently porting a Java-based OpenGL (LWJGL) demo to SlimDX (.NET 4.0, DirectX 11), but I'm running into massive performance problems. The demo uses cascaded shadow maps. I don't think there should be a large difference in performance between DirectX 11 and OpenGL.
I tried a few things to find the reason for DirectX performing slower:
With DrawIndexed commented out, the demo loop takes equally long in both versions.
The main impact can already be seen when I render the shadow geometry to the screen (instead of to the texture array that would be used in the CSM scenario). Just rendering all of the model's triangles with a solid color takes 7.1 ms for DirectX and 4.3 ms for OpenGL, which is far too big a difference to ignore.
Currently I'm stuck and have no idea where the penalty could come from.
Here's the shader (couldn't be any simpler...):


float4x4 worldViewProj;

float4 VS( float4 pos : POSITION ) : SV_POSITION {
    float4 output;

    output = mul(pos, worldViewProj);
    return output;
}

float4 PS( ) : SV_Target {
    return float4(1.0, 0.0, 0.0, 1.0);
}

technique10 Render {
    pass P0 {
        SetGeometryShader( 0 );
        SetVertexShader( CompileShader( vs_5_0, VS() ) );
        SetPixelShader( CompileShader( ps_5_0, PS() ) );
    }
}


And here's the code using the shader:


public void Load(Device device)
{
    var bytecode = ShaderBytecode.CompileFromFile("ZOnly.fx", "fx_5_0", ShaderFlags.WarningsAreErrors, EffectFlags.None);
    effect = new Effect(device, bytecode);
    technique = effect.GetTechniqueByIndex(0);
    pass = technique.GetPassByIndex(0);

    ShaderSignature signature = pass.Description.Signature;
    inputLayout = new InputLayout(device, signature, new[] {
        new InputElement("POSITION", 0, SlimDX.DXGI.Format.R32G32B32_Float, 0, 0),
    });

    var solidParentOp = new BlendStateDescription();
    solidParentOp.RenderTargets[0].BlendOperationAlpha = BlendOperation.Add;
    solidParentOp.RenderTargets[0].BlendOperation = BlendOperation.Add;
    solidParentOp.RenderTargets[0].DestinationBlend = BlendOption.Zero;
    solidParentOp.RenderTargets[0].DestinationBlendAlpha = BlendOption.Zero;
    solidParentOp.RenderTargets[0].SourceBlend = BlendOption.One;
    solidParentOp.RenderTargets[0].SourceBlendAlpha = BlendOption.One;
    solidParentOp.RenderTargets[0].RenderTargetWriteMask = ColorWriteMaskFlags.All;
    solidParentOp.RenderTargets[0].BlendEnable = false;
    solidParentOp.AlphaToCoverageEnable = false;
    solidParentOp.IndependentBlendEnable = false;
    solidBlendState = BlendState.FromDescription(device, solidParentOp);

    var dssdSolid = new DepthStencilStateDescription
    {
        IsDepthEnabled = true,
        IsStencilEnabled = false,
        DepthWriteMask = DepthWriteMask.All,
        DepthComparison = Comparison.Less
    };
    depthStencilState = DepthStencilState.FromDescription(device, dssdSolid);

    var rsDesc = new RasterizerStateDescription {
        FillMode = FillMode.Solid,
        CullMode = CullMode.Back,
        IsScissorEnabled = false,
        IsFrontCounterclockwise = false,
        DepthBias = 0,
        SlopeScaledDepthBias = 0,
        IsMultisampleEnabled = false,
        IsDepthClipEnabled = false,
        IsAntialiasedLineEnabled = false
    };
    rasterizerState = RasterizerState.FromDescription(device, rsDesc);

    MVPVariable = effect.GetVariableByName("worldViewProj").AsMatrix();
}

public void Init(Device device)
{
    device.ImmediateContext.OutputMerger.DepthStencilState = depthStencilState;
    device.ImmediateContext.OutputMerger.BlendState = solidBlendState;
    device.ImmediateContext.Rasterizer.State = rasterizerState;
    device.ImmediateContext.InputAssembler.InputLayout = inputLayout;
    device.ImmediateContext.InputAssembler.PrimitiveTopology = SlimDX.Direct3D11.PrimitiveTopology.TriangleList;
    device.ImmediateContext.InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(BasicVertexData.GlobalVertexBuffer, BasicVertexData.GlobalVertexStride, 0));
    device.ImmediateContext.InputAssembler.SetIndexBuffer(BasicVertexData.GlobalIndexBuffer, SlimDX.DXGI.Format.R32_UInt, 0);
}

public void UpdateParams(Device device, MatricesInfo matricesInfo, Light light)
{
    MVPVariable.SetMatrix(matricesInfo.getMVP());
    pass.Apply(device.ImmediateContext);
}

public void Render(Device device, BasicVertexData bvd)
{
    if (bvd.Indices.Length > 0)
    {
        device.ImmediateContext.DrawIndexed(bvd.Indices.Length, bvd.StartIndex, 0);
    }
}

Load is called only once during initialisation. Init is called once per frame, UpdateParams whenever the MVP matrix changes, and Render once for each object.
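
For illustration, the intended call pattern would look roughly like this in the frame loop; the pass object, scene list, light, swap chain and the SetWorld helper are placeholder names, not code from the demo:

// Hypothetical frame loop showing the call order described above.
zOnlyPass.Load(device);                              // once, at startup

while (running)
{
    zOnlyPass.Init(device);                          // once per frame: states + IA setup

    foreach (var obj in sceneObjects)                // placeholder scene list
    {
        matricesInfo.SetWorld(obj.World);            // hypothetical helper updating the MVP
        zOnlyPass.UpdateParams(device, matricesInfo, light);
        zOnlyPass.Render(device, obj.VertexData);    // one DrawIndexed per object
    }

    swapChain.Present(0, SlimDX.DXGI.PresentFlags.None);
}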

Do you have any idea what I'm missing here? (I guess it's something as simple as forgetting to disable alpha testing, preventing early-z rejection, or disabling back-face culling.)

Thanks,
Stefan


The main impact can already be seen when I render the shadow geometry to the screen (instead of to the texture array that would be used in the CSM scenario). Just rendering all of the model's triangles with a solid color takes 7.1 ms for DirectX and 4.3 ms for OpenGL, which is far too big a difference to ignore.


Are we talking CPU or GPU performance here? How exactly are you measuring the time difference?


Are we talking CPU or GPU performance here? How exactly are you measuring the time difference?


GPU performance. As I said, the performance is comparable when the rendering call itself (i.e. DrawIndexed) is commented out. In this case the shader does nothing more than render many red triangles directly to the screen.
The measurement happens from frame to frame in the render loop: the current time is stored in a member and an averaged difference is computed. (Ah yes, and VSync is not enabled.)

EDIT: Just to make sure, I double-checked my FPS and frame durations against FRAPS, and my measurement is fine.
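
For reference, a frame-to-frame timer of this kind can be sketched with System.Diagnostics.Stopwatch; this is only an illustration of the averaging idea, not the demo's actual timing code:

using System.Diagnostics;

// Minimal frame timer: stores the previous timestamp and keeps an
// exponentially smoothed average of the frame-to-frame delta.
class FrameTimer
{
    readonly Stopwatch clock = Stopwatch.StartNew();
    double lastMs;
    public double AverageMs { get; private set; }

    public void Tick()
    {
        double now = clock.Elapsed.TotalMilliseconds;
        double delta = now - lastMs;
        lastMs = now;
        // Smooth over many frames so single hitches don't dominate.
        AverageMs = AverageMs == 0 ? delta : AverageMs * 0.95 + delta * 0.05;
    }
}

Calling Tick() once per frame gives the averaged frame time in AverageMs (and 1000 / AverageMs approximates the FPS).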

Yours,
Stefan

Well, you can't accurately profile GPU or CPU performance just by measuring your frame time. That will just tell you your overall performance, which is basically the max of your CPU time and your GPU time. Ideally you want to use PIX or another GPU tool that will perform the necessary GPU timing so that you can better isolate your bottleneck. It's really easy to do in PIX: just create a new experiment targeting your executable with the "Statistics for each frame" option checked, and then let it run for a little while. When you're done, the timeline view up top will show a graph of CPU time and GPU time for each frame.

Either way, 7.1 ms (or even 4.3 ms for that matter) sounds like a really long time just for some depth-only rendering, unless you're doing this on a very weak GPU.
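
As an alternative (or complement) to PIX, D3D11 timestamp queries can measure GPU time directly around the draw calls. The sketch below uses the query API as SlimDX exposes it as far as I recall (Query, QueryDescription, QueryType.Timestamp/TimestampDisjoint, DeviceContext.Begin/End and a generic GetData readback); the exact readback method and the disjoint-data struct layout are assumptions to verify against your SlimDX version:

using SlimDX.Direct3D11;

// Assumed layout of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT for the readback.
struct TimestampDisjointData
{
    public long Frequency;   // GPU timestamp ticks per second
    public int Disjoint;     // non-zero: the two timestamps are not comparable
}

class GpuTimer
{
    readonly Query disjoint, start, end;

    public GpuTimer(Device device)
    {
        disjoint = new Query(device, new QueryDescription { Type = QueryType.TimestampDisjoint });
        start    = new Query(device, new QueryDescription { Type = QueryType.Timestamp });
        end      = new Query(device, new QueryDescription { Type = QueryType.Timestamp });
    }

    public void Begin(DeviceContext context)
    {
        context.Begin(disjoint);
        context.End(start);       // timestamp queries are issued with End() only
    }

    public void Finish(DeviceContext context)
    {
        context.End(end);
        context.End(disjoint);
    }

    // Read back a few frames later, once the GPU has caught up.
    public double ElapsedMilliseconds(DeviceContext context)
    {
        var dis = context.GetData<TimestampDisjointData>(disjoint);   // assumed generic readback
        long t0 = context.GetData<long>(start);
        long t1 = context.GetData<long>(end);
        return dis.Disjoint != 0 ? 0.0 : (t1 - t0) * 1000.0 / dis.Frequency;
    }
}

Wrapping just the shadow passes in Begin/Finish would show whether the extra milliseconds are really spent on the GPU or on the CPU side.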


Well, you can't accurately profile GPU or CPU performance just by measuring your frame time. That will just tell you your overall performance, which is basically the max of your CPU time and your GPU time. Ideally you want to use PIX or another GPU tool that will perform the necessary GPU timing so that you can better isolate your bottleneck. It's really easy to do in PIX: just create a new experiment targeting your executable with the "Statistics for each frame" option checked, and then let it run for a little while. When you're done, the timeline view up top will show a graph of CPU time and GPU time for each frame.

Either way, 7.1 ms (or even 4.3 ms for that matter) sounds like a really long time just for some depth-only rendering, unless you're doing this on a very weak GPU.


You're certainly right that the measured time is the overall performance; still, commenting out the draw calls shows that the rendering causes the difference between DirectX and OpenGL (and since the rendering takes much more time, I'd say it's not a CPU performance problem).

I'm rendering 1,673,088 triangles per frame; I don't think that's actually very bad for a laptop (if I'm not completely wrong, that's about 400 MTris/sec with OpenGL).

Using PIX didn't give me any new information. I see that per frame there are 1104 DIP calls, no DPUP, DIPUP or Lock calls, 12 SetRenderState calls, 12 SetVertexShader calls, 12 SetPixelShader calls, 0 SetRenderTarget calls, 0 SetTextureStageState calls, 0 misc FF state changes, and the time spent in DIP calls is 281373.1.

Any ideas?

Commenting out the Draw call isn't a good performance experiment, since Draw calls are a major source of API/Driver overhead on the CPU. In fact with such a high number of DIP calls, it's very likely that the driver overhead is what's slowing you down. Did you look at the graph in the timeline view in PIX to see what your CPU/GPU timings are for each frame? That was the important part.
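
As one hedged illustration of what "fewer DIP calls" can mean with the setup posted above: since all objects already share the global vertex/index buffers, ranges that happen to be contiguous in the index buffer could be merged into a single DrawIndexed. This is only a sketch; it assumes the batch is sorted by StartIndex and that every range uses the same states and shader pass:

// Merge draw ranges that are contiguous in the shared index buffer so that
// several objects become one DrawIndexed call.
public void RenderMerged(Device device, IList<BasicVertexData> batch)
{
    int start = -1, count = 0;
    foreach (var bvd in batch)
    {
        if (bvd.Indices.Length == 0) continue;

        if (start >= 0 && bvd.StartIndex == start + count)
        {
            count += bvd.Indices.Length;                       // extend the current range
        }
        else
        {
            if (count > 0)
                device.ImmediateContext.DrawIndexed(count, start, 0);
            start = bvd.StartIndex;                            // begin a new range
            count = bvd.Indices.Length;
        }
    }
    if (count > 0)
        device.ImmediateContext.DrawIndexed(count, start, 0);  // flush the last range
}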


Commenting out the Draw call isn't a good performance experiment, since Draw calls are a major source of API/Driver overhead on the CPU. In fact with such a high number of DIP calls, it's very likely that the driver overhead is what's slowing you down. Did you look at the graph in the timeline view in PIX to see what your CPU/GPU timings are for each frame? That was the important part.


I've attached a screenshot of the timeline.
[attachment=976:pix_timeline.jpg]
What conclusions do you draw from that picture? Am I right in interpreting it as the GPU having quite a lot of idle time that could be used for rendering?

Another thing I noticed is that the performance difference depends on the way the depth values are stored.
In the first case ("depth texture") I'm rendering to a depth texture and have no color target bound. In the second case ("color texture") I have both a depth and a color target, render the linear depth value to the color channel, and use that color texture as a shader resource.
In the "depth texture" case the difference between OpenGL and DirectX is really large (128 FPS for OpenGL vs. 89 FPS for DirectX). OpenGL uses a texture array with the format GL_DEPTH_COMPONENT32F; DirectX uses R32_Typeless for the Texture2D, D32_Float for the DepthStencilView, and R32_Float for the ShaderResourceView.
In the "color texture" case both are much closer (95 FPS in OpenGL vs. 89 in DirectX). OpenGL uses a GL_R32F color texture array and a single GL_DEPTH_COMPONENT32F depth render target. DirectX 11 uses an R32_Float texture array (the same format for the RenderTargetView and ShaderResourceView) and a single D32_Float texture for the depth buffer.

What can be seen is that OpenGL slows down quite a bit when I use a color texture for the depth value (somewhat expected), whereas it makes no difference in DirectX!?
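
For reference, the "depth texture" path described above (R32_Typeless array, D32_Float DSV per cascade slice, R32_Float SRV over the array) looks roughly like this in SlimDX; the sizes, the per-cascade slice handling, and the exact description property names are assumptions to check against your SlimDX build:

using SlimDX.Direct3D11;
using SlimDX.DXGI;

// Typeless depth texture array, readable as R32_Float in the shader.
var texDesc = new Texture2DDescription
{
    Width = 2048,                       // placeholder shadow-map size
    Height = 2048,
    MipLevels = 1,
    ArraySize = 4,                      // one slice per CSM cascade
    Format = Format.R32_Typeless,
    SampleDescription = new SampleDescription(1, 0),
    Usage = ResourceUsage.Default,
    BindFlags = BindFlags.DepthStencil | BindFlags.ShaderResource,
    CpuAccessFlags = CpuAccessFlags.None,
    OptionFlags = ResourceOptionFlags.None
};
var shadowArray = new Texture2D(device, texDesc);

// One DSV per cascade slice, typed as D32_Float for rendering.
var dsvDesc = new DepthStencilViewDescription
{
    Format = Format.D32_Float,
    Dimension = DepthStencilViewDimension.Texture2DArray,
    MipSlice = 0,
    FirstArraySlice = 0,                // change per cascade
    ArraySize = 1
};
var cascadeDsv = new DepthStencilView(device, shadowArray, dsvDesc);

// SRV over the whole array, typed as R32_Float for sampling.
var srvDesc = new ShaderResourceViewDescription
{
    Format = Format.R32_Float,
    Dimension = ShaderResourceViewDimension.Texture2DArray,
    MostDetailedMip = 0,
    MipLevels = 1,
    FirstArraySlice = 0,
    ArraySize = 4
};
var shadowSrv = new ShaderResourceView(device, shadowArray, srvDesc);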

The idle time indicates that your CPU is taking much longer than the GPU to finish a frame, and the GPU is idling while waiting for more commands from the CPU. In other words, you're heavily CPU-bound. I would suspect that the large number of Draw calls is what's slowing you down. You can try running a profiler to ensure that you're actually spending lots of time in DX functions.


The idle time indicates that your CPU is taking much longer than the GPU to finish a frame, and the GPU is idling while waiting for more commands from the CPU. In other words, you're heavily CPU-bound. I would suspect that the large number of Draw calls is what's slowing you down. You can try running a profiler to ensure that you're actually spending lots of time in DX functions.


First of all: thanks a lot for your patience, MJP!
I've modified the model so that it consists of a single mesh only. Since the scene consists of three of those models and I'm rendering 4 CSM layers + 1 color pass, I'm now at 15 DIP calls.
Instead of 89 FPS I now get 90.9 FPS. The PIX timeline is attached. (For OpenGL the FPS go up from 128 to 136.)
The empty demo loop (only the DIP calls commented out) takes 0.5 ms.

All in all I'm just puzzled. The demo renders just as fast as before, but now the GPU is never idle. The GPU frames look very strange and appear to overlap in the timeline view (I've overlaid frames 148 and 150 in the screenshot):
[attachment=989:pix_timeline2.jpg]

The GPU duration in the event view looks more sensible (ranging from 9623084 to 12773134).
Now I'm lost. The FPS don't show much difference, so reducing the DIP calls and the number of materials doesn't help. The CPU-bound issue is gone, but nothing is gained.
The demo loop allows for 2000 FPS, so I can't see that it would cause any blocking. Any ideas?
