
Topics I've Started

Are GPU drivers optimizing pow(x,2)?

09 October 2012 - 05:23 AM

On current NVIDIA GPUs, the power function is usually about six times slower than a single MAD instruction. The HLSL compiler itself doesn't optimize it and simply converts the pow into a LOG, a MUL and an EXP instruction. But most constant powers up to x^32 would actually be faster to calculate using just MUL instructions. The question is: should I bother optimizing this myself, or do the drivers usually optimize something like this? Here is the function I would use if I had to optimize it myself:

float constpow(float x, uint y)
{
	if (y == 0)
		return 1; //Cost 0

	if (y == 1)
		return x; //Cost 0

	float x2 = x * x; //Cost 1

	if (y == 2)
		return x2; //Cost 1

	if (y == 3)
		return x2 * x; //Cost 2

	float x4 = x2 * x2; //Cost 2

	if (y == 4)
		return x4; //Cost 2

	if (y == 5)
		return x4 * x; //Cost 3

	if (y == 6)
		return x4 * x2; //Cost 3

	if (y == 7)
		return x4 * x2 * x; //Cost 4

	float x8 = x4 * x4; //Cost 3

	if (y == 8)
		return x8; //Cost 3

	if (y == 9)
		return x8 * x; //Cost 4

	if (y == 10)
		return x8 * x2; //Cost 4

	if (y == 11)
		return x8 * x2 * x; //Cost 5

	if (y == 12)
		return x8 * x4; //Cost 4

	if (y == 13)
		return x8 * x4 * x; //Cost 5

	if (y == 14)
		return x8 * x4 * x2; //Cost 5

	float x16 = x8 * x8; //Cost 4

	if (y == 16)
		return x16; //Cost 4

	if (y == 17)
		return x16 * x; //Cost 5

	if (y == 18)
		return x16 * x2; //Cost 5

	if (y == 20)
		return x16 * x4; //Cost 5

	if (y == 24)
		return x16 * x8; //Cost 5

	if (y == 32)
		return x16 * x16; //Cost 5

	return pow(x, y); //Fallback for exponents without a cheap multiplication chain
}

If the drivers did this themselves, it would probably be better to just leave the pow(x, y) in place, because they know best when to optimize it. I'd only use this when y is a compile-time constant, and I obviously don't want any dynamic branching here.
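For illustration, a minimal usage sketch (the function name and material constant here are hypothetical): since y is a compile-time constant, all the comparisons should fold away at compile time.

// Hypothetical usage: kExponent is a compile-time constant, so the
// branches in constpow fold away, leaving only the MUL instructions.
float BlinnSpecular(float NdotH)
{
	static const uint kExponent = 16; // assumed per-material constant
	return constpow(saturate(NdotH), kExponent); // should compile to 4 MULs
}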

Blinn-Phong Specular Exponent to Trowbridge-Reitz Roughness

17 September 2012 - 07:54 AM

Is there a good formula to convert the specular exponent (glossiness) of the Blinn-Phong NDF to the roughness value of the Trowbridge-Reitz NDF?
I've tried something like (formula image missing), but that doesn't work that well. Is there a better approximation?
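One commonly cited mapping converts the exponent n to a roughness via α = sqrt(2 / (n + 2)); note it was derived for the Beckmann NDF and is only borrowed for Trowbridge-Reitz, so treat it as an assumption rather than an exact answer (function name hypothetical):

// Commonly quoted exponent-to-roughness mapping; only an approximation,
// originally derived for Beckmann rather than Trowbridge-Reitz.
float RoughnessFromExponent(float n) // n = Blinn-Phong specular exponent
{
	return sqrt(2.0 / (n + 2.0));
}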

I'm currently changing the BRDF to an actual Cook-Torrance BRDF with a Trowbridge-Reitz distribution, Schlick Fresnel and a Smith-Trowbridge-Reitz geometry factor. The BRDF itself takes only 27 clock cycles on a Fermi or Kepler GPU (computing NDotL, NDotH, LDotH, ... not included). It's fast enough, so there's no reason for me to use a weak approximation of Cook-Torrance. But all my models still store Blinn-Phong glossiness, which is why I need to convert them.

I would actually like to use an approximation for the geometry factor, though. That's the worst part of the whole BRDF: the two square roots alone take 12 clock cycles.
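For context, a minimal sketch of the kind of BRDF described above (my own reconstruction from the terms named, not the actual 27-cycle shader; a2 is the squared Trowbridge-Reitz roughness):

// Cook-Torrance specular term with Trowbridge-Reitz (GGX) distribution,
// Schlick Fresnel and a Smith-style geometry factor.
// Illustrative reconstruction only, not the original shader.
static const float PI = 3.14159265;

float D_TrowbridgeReitz(float NdotH, float a2)
{
	float d = NdotH * NdotH * (a2 - 1.0) + 1.0;
	return a2 / (PI * d * d);
}

float3 F_Schlick(float3 F0, float LdotH)
{
	return F0 + (1.0 - F0) * pow(1.0 - LdotH, 5.0);
}

// One of the two expensive square roots lives in each G1 call
// (one for the light direction, one for the view direction).
float G1_Smith(float NdotX, float a2)
{
	return 2.0 * NdotX / (NdotX + sqrt(a2 + (1.0 - a2) * NdotX * NdotX));
}

float3 SpecularBRDF(float3 F0, float a2,
                    float NdotL, float NdotV, float NdotH, float LdotH)
{
	float  D = D_TrowbridgeReitz(NdotH, a2);
	float3 F = F_Schlick(F0, LdotH);
	float  G = G1_Smith(NdotL, a2) * G1_Smith(NdotV, a2);
	return D * F * G / (4.0 * NdotL * NdotV);
}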

[Compute Shader] Groupshared memory as slow as VRAM

10 September 2012 - 12:40 AM

As far as I know, I have one of the earlier and cheaper mobile graphics cards that support DX11: an AMD Radeon HD 5730M.

I've started optimizing graphics algorithms by porting them to compute shaders and improving them with shared memory and thread synchronization. This way I could improve the runtime of my bloom from O(n) to O(log n) per pixel.

But that was only the theoretical runtime. In practice, the algorithm performed much worse than the original linear algorithm. I'm pretty sure I know the reason: instead of, say, 32 read operations and 1 write operation, the algorithm now needs 1 read from VRAM, 5 reads from groupshared memory, 5 writes to groupshared memory and 1 write to VRAM.
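The access pattern looks roughly like this (a hypothetical reduction kernel just to show the structure, not my actual bloom shader; resource names and bindings are made up):

// Hypothetical compute-shader sketch of the pattern described above:
// one VRAM read, a logarithmic pass over groupshared memory, one VRAM write.
Texture2D<float4>   Input  : register(t0);
RWTexture2D<float4> Output : register(u0);

groupshared float4 gCache[1024];

[numthreads(1024, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
	gCache[gi] = Input[dtid.xy];                // 1 read from VRAM
	GroupMemoryBarrierWithGroupSync();

	[unroll]
	for (uint stride = 512; stride > 0; stride >>= 1)
	{
		if (gi < stride)
			gCache[gi] += gCache[gi + stride];  // groupshared read + write
		GroupMemoryBarrierWithGroupSync();
	}

	if (gi == 0)
		Output[dtid.xy] = gCache[0];            // 1 write to VRAM
}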

Groupshared memory, supposedly backed by L1 cache, should be way faster than 32 reads from VRAM, and thanks to the logarithmic runtime there are far fewer operations overall. Yet it's way slower: 8 ms instead of 0.5 ms. The slowdown could be caused by memory bank conflicts, but could they really cause such an enormous slowdown?

To me it looks like my graphics card might not back groupshared memory with actual on-chip L1 storage at all. It performs just as badly as a UAV residing in VRAM would. So maybe the driver simply uses 32 KB of reserved VRAM as groupshared memory. Could that be the case, or is it the bank conflicts?

I wish there were tools that could shed more light on such problems. Graphics cards and their tools should be more transparent about what's actually going on, so that developers could improve their algorithms even further.

Update: After reading through NVIDIA's CUDA documentation, I don't think my shaders cause any bank conflicts at all. Each half warp (16 threads) always accesses 16 different memory banks; a whole thread group (1024 threads) just accesses them multiple times, which is normal and has nothing to do with bank conflicts.
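For reference, the distinction (hypothetical indices, assuming 16 banks of 4-byte words as in the CUDA documentation; groupIndex stands for the thread's index within the half warp):

groupshared float gData[1024];

// Conflict-free: thread i of a half warp reads word i,
// so 16 threads hit 16 different banks.
float a = gData[groupIndex];

// Conflicting: a stride of 16 words maps every thread of the
// half warp to the same bank, serializing the access.
float b = gData[groupIndex * 16];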

[Solved] Just a little math question

05 September 2012 - 12:19 PM

I have a function:
(formula image missing)

and I need two functions g(x) and h(y) where the following is true:
(formula image missing)

I don't know if it's possible, but it would greatly improve the quality of my algorithm.

Is there a way to write HLSL 5 Assembly?

08 August 2012 - 10:05 AM

Is there a way to write HLSL 5 (Shader Model 5) assembly by hand? I don't really like the instructions the HLSL compiler emits. I could hand-optimize about 30% of the instructions if I could write the assembly directly.

Just a simple dumb example:
[source lang="cpp"]mov r0.y, l(-8.656170)mul r1.z, r0.x, r0.y[/source]
r0.y never gets used again after these instructions.

Or this:
[source lang="cpp"]
mul r0.y, r0.w, r0.y
mul r0.x, r0.z, r0.x
[/source]
instead of:
[source lang="cpp"]
mul r0.xy, r0.zw, r0.xy
[/source]