JoeJ

Extremely long compile times and bad performance on Nvidia


...continuing from this older thread: https://www.gamedev.net/topic/686395-simple-shader-causes-nvidia-driver-to-hang/
After days of debugging I found out: the driver does not hang, it just takes a very long time to return from calls to vkCreateComputePipelines()
(about 5 minutes for a simple shader like the one in the code snippet :( ).

I'm using a GTX 670.

It takes 3 hours to compile all the shaders of my project.
A few shaders compile in seconds like they should, but almost all of them need minutes.

It is strange that changing a number in the code snippet makes the problem go away - do you think there could be some limit on practical buffer sizes?
I use 12 buffers for a total of 170 MB, but I could split them into more, smaller buffers. (All storage buffers, all compute shaders.)

The performance also doesn't seem right:

FuryX: 1.4 ms
7950: 2.3 ms

GTX 670: 30 ms (!)

I did expect NV to perform worse than AMD here, but a factor of more than 10 seems at least twice too high.
However, no bad spot shows up in the profiler - the relative runtimes match those I see on AMD, there's just a constant scale factor.

Anyone had similar issues?

I'll try my OpenCL implementation to see if it runs faster...



EDIT:

OpenCL performance:

FuryX: 2.2 ms
GTX 670: 12.5 ms

This looks familiar to me and makes sense.
It reminds me of OpenGL compute shaders, where OpenCL was two times faster than OpenGL on Nvidia.

Why do I always need to learn this the hard way? Argh!
Seems I'll have to go to DX12 and hope they care a bit more there.

F***!







 
#version 450

layout (local_size_x = 64) in;

layout (std430, binding = 0) buffer bS_V4 { vec4 _G_sample_V4[]; };
layout (std430, binding = 1) buffer bS_SI { uint _G_sample_indices[]; };
layout (std430, binding = 2) buffer bLL { uint _G_lists[]; };
layout (std430, binding = 3) buffer bDBG { uint _G_dbg[]; };

void main ()
{
	uint lID = gl_LocalInvocationID.x;

	uint listIndex = gl_WorkGroupID.x * 64 + lID;
	if (listIndex < 100)
	{
		_G_dbg[8] = _G_lists[786522]; // changing to a smaller number like 78652 would compile fast
	}
}
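
For context, this is roughly how the pipeline gets created and timed on the host side - a simplified sketch, not the exact code from my project (device, shaderModule and pipelineLayout are assumed to exist already; note the second argument is VK_NULL_HANDLE because I'm not using a pipeline cache yet):

#include <vulkan/vulkan.h>
#include <chrono>
#include <cstdio>

// Create one compute pipeline from an already compiled SPIR-V module and
// measure how long the driver needs to return from the call.
VkPipeline createAndTimeComputePipeline (VkDevice device,
                                         VkShaderModule shaderModule,
                                         VkPipelineLayout pipelineLayout)
{
	VkComputePipelineCreateInfo info = {};
	info.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
	info.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
	info.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
	info.stage.module = shaderModule;
	info.stage.pName = "main";
	info.layout = pipelineLayout;

	VkPipeline pipeline = VK_NULL_HANDLE;

	auto t0 = std::chrono::high_resolution_clock::now();
	vkCreateComputePipelines (device, VK_NULL_HANDLE, 1, &info, nullptr, &pipeline); // no pipeline cache
	auto t1 = std::chrono::high_resolution_clock::now();

	printf ("vkCreateComputePipelines took %.1f s\n",
	        std::chrono::duration<double> (t1 - t0).count()); // ~300 s for the shader above

	return pipeline;
}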


Sounds like you ought to make a test case and send it to NVidia. I'm sure this is the sort of thing that someone on their team would want to know about and to fix. Whether you can get the information to the relevant person is another matter, posting here would be my first port of call: https://devtalk.nvidia.com/default/board/166/vulkan/.

 

I don't have much experience with the compute side of things, but I wonder whether the SPIR-V of the shader looks as trivial as the GLSL. Maybe there are some clues in there.
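
If you don't have spirv-dis at hand, even a crude word/instruction count can tell you whether the module is as small as expected - something like this quick sketch (assuming you wrote the compiled module to a file, here called shader.spv):

#include <cstdint>
#include <cstdio>
#include <vector>

// Rough SPIR-V size check: prints the word and instruction count of a module.
// The header is 5 words; each instruction stores its word count in the upper
// 16 bits of its first word.
int main ()
{
	FILE* f = fopen ("shader.spv", "rb");
	if (!f) { printf ("could not open shader.spv\n"); return 1; }

	fseek (f, 0, SEEK_END);
	long size = ftell (f);
	fseek (f, 0, SEEK_SET);

	std::vector<uint32_t> words (size / 4);
	fread (words.data(), 4, words.size(), f);
	fclose (f);

	if (words.size() < 5 || words[0] != 0x07230203) // SPIR-V magic number
	{
		printf ("not a SPIR-V module\n");
		return 1;
	}

	size_t instructions = 0;
	for (size_t i = 5; i < words.size(); )
	{
		uint32_t wordCount = words[i] >> 16;
		if (wordCount == 0) break; // malformed module
		i += wordCount;
		++instructions;
	}

	printf ("%zu words, %zu instructions\n", words.size(), instructions);
	return 0;
}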

Thanks. I'll do that, but first I'll test with a 1070 and run my shader in someone else's project to make sure it's nothing on my side.
I've already wasted so much time on this, a few more hours won't matter :)

I loaded my shader into one of Sascha Willems' examples.
The first time it took 72 seconds to compile the simple shader.
The second time it took only one second, because Sascha uses the pipeline cache (which I'm not).

Going back to my own project: the shader has a different filename and date, but the NV driver recognized it is the same shader and took it from its cache, so it also took only one second.
(Notice they do this although I'm still not using a pipeline cache, so the driver apparently keeps a cache of its own.)

Then, after changing a number in the shader, it takes 72 seconds again in my project.

I don't know if it's possible to ship pipeline cache results with a game so the user does not need to wait 3 hours, but at least it's a great improvement :)
Unfortunately lots of my shaders have a variant per tree level, so even with the pipeline cache I have to wait up to half an hour to see the effect of a single changed shader.
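
If I read the spec right, persisting (or even shipping) the cache should be possible: vkGetPipelineCacheData serializes it to a blob, and pInitialData feeds that blob back into vkCreatePipelineCache on the next run. A rough sketch of what I mean (file handling simplified, no error checking):

#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// Load a previously saved pipeline cache at startup; pass the returned handle
// as the second argument of vkCreateComputePipelines.
VkPipelineCache loadPipelineCache (VkDevice device, const char* path)
{
	std::vector<char> initialData;
	if (FILE* f = fopen (path, "rb"))
	{
		fseek (f, 0, SEEK_END);
		initialData.resize (ftell (f));
		fseek (f, 0, SEEK_SET);
		fread (initialData.data(), 1, initialData.size(), f);
		fclose (f);
	}

	VkPipelineCacheCreateInfo info = {};
	info.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
	info.initialDataSize = initialData.size();
	info.pInitialData = initialData.empty() ? nullptr : initialData.data();

	VkPipelineCache cache = VK_NULL_HANDLE;
	vkCreatePipelineCache (device, &info, nullptr, &cache);
	return cache;
}

// Save the cache at shutdown so the next run can reuse it.
void savePipelineCache (VkDevice device, VkPipelineCache cache, const char* path)
{
	size_t size = 0;
	vkGetPipelineCacheData (device, cache, &size, nullptr); // query size first
	std::vector<char> data (size);
	vkGetPipelineCacheData (device, cache, &size, data.data());

	if (FILE* f = fopen (path, "wb"))
	{
		fwrite (data.data(), 1, size, f);
		fclose (f);
	}
}

Whether a blob saved on one machine actually helps on another user's GPU and driver version is a different question, of course.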

Very interesting: for the first 10 frames I get good VK performance of 10 ms with the GTX 670! I did not notice that yesterday.

I assume the 670 goes into some kind of power saving mode after 10 frames (it's an abrupt change from 10 to 30 ms).
Most of my frame time is spent on GPU<->CPU data transfer, causing 1 FPS, so the 670 might think it's time to rest.
I'll see once I'm done and the transfer is not necessary anymore.

10 ms seems right. Some years back an AMD 280X was twice as fast as a Kepler Titan and IIRC 4-5 times faster than a GTX 670 in compute.
I still wonder why there's such a huge difference (the 670 and the 7950 have similar specs), but it's probably a hardware limit.
Still waiting and hoping for the 1070 to perform better...




So in the end there seems to be nothing wrong at all.
It's good to see that NV is faster with VK than with OpenCL 1.2 too,
and the wait on the compiler is OK with the pipeline cache (but maybe I can still improve this by telling the driver I want my current GPU only, not all existing NV generations).

Edit:
Forgot to mention: I can confirm it works to have both AMD and NV in one machine. Can use one for rendering and the other for compute. Can optimize for both etc. Awesome! :)


Forgot to mention: I can confirm it works to have both AMD and NV in one machine. Can use one for rendering and the other for compute. Can optimize for both etc. Awesome! :)

You mean it's possible for a game to simultaneously use my integrated Intel graphics together with my GTX 1070? Cool :-)


You mean it's possible for a game to simultaneously use my integrated Intel graphics together with my GTX 1070?


I think so, but I don't have an iGPU, and some people say it gets turned off if nothing is plugged into it and a dedicated GPU is detected.
But I guess this is not true for the modern APIs or even OpenCL; it should be available for compute.

Although an iGPU throttles the CPU cores due to heat and bandwidth, I think it's perfect for things like physics.

(Just checked: NV allows using the GTX 670 for PhysX even though I'm using an AMD GPU - some years back they prevented this :) )
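
For reference, this is really just device enumeration in Vulkan: every GPU with a driver shows up as its own VkPhysicalDevice, and each one you want to use gets its own VkDevice. A minimal sketch (assuming 'instance' has already been created):

#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// List every Vulkan-capable GPU in the machine (e.g. AMD + NV, or iGPU + dGPU).
void listGPUs (VkInstance instance)
{
	uint32_t count = 0;
	vkEnumeratePhysicalDevices (instance, &count, nullptr);
	std::vector<VkPhysicalDevice> gpus (count);
	vkEnumeratePhysicalDevices (instance, &count, gpus.data());

	for (uint32_t i = 0; i < count; ++i)
	{
		VkPhysicalDeviceProperties props;
		vkGetPhysicalDeviceProperties (gpus[i], &props);
		printf ("GPU %u: %s (vendorID 0x%X)\n", i, props.deviceName, props.vendorID);
	}
	// Create a separate VkDevice for each physical device you want to use.
	// Resources are not shared between them, so data has to be read back and
	// re-uploaded (or shared via external memory extensions).
}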


I assume the 670 goes into some kind of power saving mode after 10 frames (it's an abrupt change from 10 to 30 ms).


That can be fixed in the driver settings (power management mode: prefer maximum performance).

the wait on the compiler is OK with the pipeline cache (but maybe I can still improve this by telling the driver I want my current GPU only, not all existing NV generations)


This is already the case; after switching to the GTX 1070, all shaders had to be recompiled again.
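
That matches the way the cache blob is laid out: according to the spec its header starts with vendor and device IDs plus a driver-specific UUID, so a cache from a different GPU or driver version simply doesn't apply. A small sketch of how such a check could look (not code from my project):

#include <vulkan/vulkan.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Check whether a saved pipeline cache blob belongs to the current GPU/driver.
// Header layout (VK_PIPELINE_CACHE_HEADER_VERSION_ONE): uint32 headerSize,
// uint32 headerVersion, uint32 vendorID, uint32 deviceID,
// uint8 pipelineCacheUUID[VK_UUID_SIZE].
bool cacheMatchesDevice (const void* blob, size_t blobSize, VkPhysicalDevice gpu)
{
	if (blobSize < 16 + VK_UUID_SIZE)
		return false;

	const uint8_t* bytes = static_cast<const uint8_t*> (blob);
	uint32_t vendorID, deviceID;
	memcpy (&vendorID, bytes + 8,  sizeof (uint32_t));
	memcpy (&deviceID, bytes + 12, sizeof (uint32_t));

	VkPhysicalDeviceProperties props;
	vkGetPhysicalDeviceProperties (gpu, &props);

	return vendorID == props.vendorID
	    && deviceID == props.deviceID
	    && memcmp (bytes + 16, props.pipelineCacheUUID, VK_UUID_SIZE) == 0;
}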

Compute performance with Pascal is much better than with Kepler:

AMD FuryX: 1.37 ms
NV 1070: 2.01 ms
AMD 7950: 2.3 ms
NV 670: 9.5 ms

Currently I have optimized my shaders for the FuryX and the GTX 670, but the 1070 runs better with the settings from the Fury.
At the moment I don't see a reason to do much vendor-specific optimization at all - it would be nice if this holds true for future chips as well.
