Extreme long compile times and bad performance on Nvidia


#1 JoeJ   Members   

Posted 16 February 2017 - 01:37 PM

...continuing from this older thread: https://www.gamedev.net/topic/686395-simple-shader-causes-nvidia-driver-to-hang/
After days of debugging I found out: the driver does not hang, it just takes a very long time to return from calls to vkCreateComputePipelines()
(about 5 minutes for a simple shader like the one in the code snippet :( ).
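
For reference, the 5 minutes are simply how long the call blocks. A minimal sketch of how I time it (assuming a VkDevice, VkPipelineLayout and VkShaderModule already exist - the names are placeholders):

// Minimal sketch: time a single compute pipeline creation.
// 'device', 'pipelineLayout' and 'shaderModule' are assumed to exist already.
#include <vulkan/vulkan.h>
#include <chrono>
#include <cstdio>

VkPipeline createTimedComputePipeline(VkDevice device,
                                      VkPipelineLayout pipelineLayout,
                                      VkShaderModule shaderModule)
{
    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = shaderModule;
    stage.pName  = "main";

    VkComputePipelineCreateInfo info = {};
    info.sType  = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    info.stage  = stage;
    info.layout = pipelineLayout;

    VkPipeline pipeline = VK_NULL_HANDLE;

    auto t0 = std::chrono::high_resolution_clock::now();
    VkResult res = vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &info, nullptr, &pipeline);
    auto t1 = std::chrono::high_resolution_clock::now();

    std::printf("vkCreateComputePipelines: %s, %.1f s\n",
                res == VK_SUCCESS ? "VK_SUCCESS" : "error",
                std::chrono::duration<double>(t1 - t0).count());
    return pipeline;
}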

I'm using a GTX 670.

It takes 3 hours to compile all shaders of my project.
A few shaders compile in seconds like they should, but almost all take minutes.

It is strange that changing a number in the code snippet makes the problem go away - do you think there could be some practical limit on buffer sizes?
I use 12 buffers for a total of 170 MB, but I could split them into more, smaller buffers. (All storage buffers, all compute shaders.)

Also, the performance does not seem right:

FuryX: 1.4 ms
7950: 2.3 ms

GTX 670: 30 ms (!)

I did expect NV to perform much worse than AMD, but a factor of more than 10 seems at least twice too much.
However, no bad spot shows up in the profiler - relative runtimes match those I see on AMD, there is just a constant scale factor.

Anyone had similar issues?

I'll try my OpenCL implementation to see if it runs faster...



EDIT:

OpenCL performance:

FuryX: 2.2 ms
GTX 670: 12.5 ms

This looks familiar to me and makes sense.
It reminds me of OpenGL compute shaders, where OpenCL was twice as fast as OpenGL on Nvidia.

Why do I always need to learn this the hard way? Argh!
Seems I'll have to go to DX12, hoping they care a bit more there.

F***!







 
#version 450

layout (local_size_x = 64) in;

layout (std430, binding = 0) buffer bS_V4 { vec4 _G_sample_V4[]; };
layout (std430, binding = 1) buffer bS_SI { uint _G_sample_indices[]; };
layout (std430, binding = 2) buffer bLL { uint _G_lists[]; };
layout (std430, binding = 3) buffer bDBG { uint _G_dbg[]; };

void main ()
{
	uint lID = gl_LocalInvocationID.x;

	uint listIndex = gl_WorkGroupID.x * 64 + lID;
	if (listIndex < 100)
	{
		_G_dbg[8] = _G_lists[786522]; // changing to a smaller number like 78652 would compile fast
	}
}

Edited by JoeJ, 16 February 2017 - 02:21 PM.


#2 C0lumbo   Members   

Posted 16 February 2017 - 05:22 PM

Sounds like you ought to make a test case and send it to NVidia. I'm sure this is the sort of thing that someone on their team would want to know about and fix. Whether you can get the information to the relevant person is another matter; posting here would be my first port of call: https://devtalk.nvidia.com/default/board/166/vulkan/.

 

I don't have much experience with the compute side of things, but I wonder whether the SPIR-V of the shader looks as trivial as the GLSL. Maybe there are some clues in there.
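
One quick way to check is to run spirv-dis (from SPIRV-Tools) on the binary. If you'd rather do it in code, here is a rough sketch that just dumps the SPIR-V header and instruction count - the file name is only a placeholder:

// Rough sketch: print the SPIR-V header and instruction count of a module,
// to check whether the binary is as small as the GLSL suggests.
// "shader.comp.spv" is a hypothetical file name.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream file("shader.comp.spv", std::ios::binary | std::ios::ate);
    if (!file) { std::printf("cannot open file\n"); return 1; }

    size_t bytes = static_cast<size_t>(file.tellg());
    std::vector<uint32_t> words(bytes / 4);
    file.seekg(0);
    file.read(reinterpret_cast<char*>(words.data()), static_cast<std::streamsize>(words.size() * 4));

    if (words.size() < 5 || words[0] != 0x07230203u)
    {
        std::printf("not a SPIR-V module\n");
        return 1;
    }

    // Header: magic, version, generator, bound, schema - then the instruction stream.
    std::printf("version: %u.%u  id bound: %u\n",
                (words[1] >> 16) & 0xFF, (words[1] >> 8) & 0xFF, words[3]);

    size_t instructionCount = 0;
    for (size_t i = 5; i < words.size(); )
    {
        uint16_t wordCount = static_cast<uint16_t>(words[i] >> 16); // high 16 bits = instruction length in words
        if (wordCount == 0) break; // malformed stream
        i += wordCount;
        ++instructionCount;
    }
    std::printf("instructions: %zu (%zu words total)\n", instructionCount, words.size());
    return 0;
}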



#3 JoeJ   Members   

Posted 17 February 2017 - 03:48 AM

Thanks. I'll do that, but first I'll test with a 1070 and run my shader in another person's project to make sure it's nothing on my side.
I've already wasted so much time on this, a few more hours won't matter :)

#4 JoeJ   Members   


Posted 17 February 2017 - 08:17 AM

I loaded my shader into one of Sascha Willems' examples.
The first time it took 72 seconds to compile the simple shader.
The second time it took only one second, because Sascha uses the pipeline cache while I am not.

Going back to my own project, the shader has a different filename and date, but the NV driver recognized it as the same shader and took it from its cache, so it also took only one second.
(Notice they do this although I'm still not using a pipeline cache.)

Then, after changing a number in the shader, it takes 72 seconds again in my project.

I don't know if it's possible to ship pipeline cache results with a game so the user does not need to wait 3 hours, but at least it's a great improvement :)
Unfortunately, lots of my shaders have a mutation per tree level, so even with the pipeline cache I have to wait up to half an hour to see the effect of a single changed shader.
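
For what it's worth, the cache itself is easy to use and can be written to disk, so shipping precompiled results should at least be possible in principle. A minimal sketch (the file name is a placeholder, error handling is omitted, and a real loader should also validate the blob's header against the current GPU before reusing it):

// Minimal sketch: use a VkPipelineCache and save it to disk so the next run
// (or a shipped build) can skip most of the driver's compile work.
// Assumes 'device' exists; "pipeline_cache.bin" is just a placeholder path.
#include <vulkan/vulkan.h>
#include <fstream>
#include <vector>

VkPipelineCache createCache(VkDevice device)
{
    // Try to seed the cache with data from a previous run.
    std::vector<char> initialData;
    std::ifstream in("pipeline_cache.bin", std::ios::binary | std::ios::ate);
    if (in)
    {
        initialData.resize(static_cast<size_t>(in.tellg()));
        in.seekg(0);
        in.read(initialData.data(), static_cast<std::streamsize>(initialData.size()));
    }

    VkPipelineCacheCreateInfo info = {};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = initialData.size();
    info.pInitialData    = initialData.empty() ? nullptr : initialData.data();

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;
}

// Pass 'cache' instead of VK_NULL_HANDLE when building pipelines:
//   vkCreateComputePipelines(device, cache, 1, &createInfo, nullptr, &pipeline);

void saveCache(VkDevice device, VkPipelineCache cache)
{
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<char> data(size);
    vkGetPipelineCacheData(device, cache, &size, data.data());

    std::ofstream out("pipeline_cache.bin", std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
}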

#5 JoeJ   Members   


Posted 17 February 2017 - 10:54 AM

Very interesting: for the first 10 frames I get good VK performance of 10 ms with the GTX 670! - I did not notice that yesterday.

I assume the 670 goes into some kind of power saving mode after 10 frames (it's an abrupt change from 10 to 30 ms).
Most of my frame time is spent on GPU<->CPU data transfer, causing 1 FPS, so the 670 might think it's time to rest.
I'll see when I'm done and the transfer is not necessary anymore.

10 ms seems right. Some years back the AMD 280X was twice as fast as a Kepler Titan and IIRC 4-5 times faster than the GTX 670 in compute.
I still wonder why there's such a huge difference (the 670 and 5970 have similar specs), but it's probably a hardware limit.
Still waiting and hoping for the 1070 to perform better...




So finally there seems to be nothing wrong at all.
It's good to see that NV is faster with VK than with OpenCL 1.2,
and the waiting on the compiler is OK with the pipeline cache (but maybe I can still improve this by telling the driver I want my current GPU only, not all existing NV generations).

Edit:
Forgot to mention: I can confirm it works to have both AMD and NV in one machine. I can use one for rendering and the other for compute, optimize for both, etc. Awesome! :)
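
Nothing special is needed for that - Vulkan simply enumerates every capable GPU, and you create one logical device per GPU you want to use. A minimal sketch (assuming a VkInstance already exists):

// Minimal sketch: list all Vulkan-capable GPUs in the machine.
// Each one (AMD, NV, iGPU) can get its own VkDevice and do independent work.
// Assumes 'instance' was created elsewhere.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

void listGpus(VkInstance instance)
{
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> gpus(count);
    vkEnumeratePhysicalDevices(instance, &count, gpus.data());

    for (uint32_t i = 0; i < count; ++i)
    {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(gpus[i], &props);
        std::printf("GPU %u: %s (vendor 0x%04X, device 0x%04X)\n",
                    i, props.deviceName, props.vendorID, props.deviceID);
        // From here, create one VkDevice per physical device you want to use,
        // e.g. one for rendering and one for compute.
    }
}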

Edited by JoeJ, 17 February 2017 - 11:08 AM.


#6 mike44   Members   


Posted 19 February 2017 - 02:15 AM

Forgot to mention: I can confirm it works to have both AMD and NV in one machine. I can use one for rendering and the other for compute, optimize for both, etc. Awesome! :)

You mean it's possible for a game to simultaneously use my integrated Intel graphics and my GTX 1070? Cool :-)



#7 JoeJ   Members   


Posted 19 February 2017 - 01:03 PM

You mean it's possible for a game to simultaneously use my integrated Intel graphics and my GTX 1070?


I think so, but I don't have an iGPU, and some people say it gets turned off if no display is connected to it and a dedicated GPU is detected.
But I guess this is not true for modern APIs or even OpenCL; it should be available for compute.

Although an iGPU throttles the CPU cores due to heat and bandwidth, I think it's perfect for things like physics.

(Just checked: NV allows using the GTX 670 for PhysX even though I'm using an AMD GPU - some years back they prevented this :) )

#8 JoeJ   Members   


Posted 20 February 2017 - 03:31 AM

I assume the 670 goes into some kind of power saving mode after 10 frames (it's an abrupt change from 10 to 30 ms).


This can be fixed in the driver settings (prefer maximum performance).

the waiting on the compiler is OK with the pipeline cache (but maybe I can still improve this by telling the driver I want my current GPU only, not all existing NV generations)


This is already the case: after switching to the GTX 1070, all shaders had to be recompiled.
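
That matches the blob format: the data from vkGetPipelineCacheData starts with a header containing the vendor ID, device ID and a pipeline cache UUID, so a cache built on Kepler is useless on Pascal. A sketch of checking those fields before reusing a saved blob (manual offsets per the version-one header layout; a little-endian host is assumed):

// Sketch: check whether a saved pipeline cache blob was produced for the
// GPU/driver we are running on, by comparing the version-one header fields
// against VkPhysicalDeviceProperties. Little-endian host assumed.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstring>
#include <vector>

bool cacheMatchesDevice(const std::vector<char>& blob, VkPhysicalDevice gpu)
{
    if (blob.size() < 16 + VK_UUID_SIZE)
        return false;

    uint32_t headerLength = 0, headerVersion = 0, vendorID = 0, deviceID = 0;
    uint8_t  uuid[VK_UUID_SIZE] = {};
    std::memcpy(&headerLength,  blob.data() + 0,  4);
    std::memcpy(&headerVersion, blob.data() + 4,  4);
    std::memcpy(&vendorID,      blob.data() + 8,  4);
    std::memcpy(&deviceID,      blob.data() + 12, 4);
    std::memcpy(uuid,           blob.data() + 16, VK_UUID_SIZE);

    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(gpu, &props);

    return headerLength >= 16 + VK_UUID_SIZE
        && headerVersion == VK_PIPELINE_CACHE_HEADER_VERSION_ONE
        && vendorID == props.vendorID
        && deviceID == props.deviceID
        && std::memcmp(uuid, props.pipelineCacheUUID, VK_UUID_SIZE) == 0;
}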

Compute performance with Pascal is much better than with Kepler:

AMD FuryX: 1.37ms
NV 1070: 2.01 ms
AMD 7950: 2.3 ms
NV 670: 9.5 ms

Currently I have optimized my shaders for the FuryX and GTX 670, but the 1070 runs better with the settings from the Fury.
At the moment I don't see a reason to do much vendor-specific optimization at all - it would be nice if this holds true for future chips as well.