
HLSL fast trig functions



#1 NotTakenSN   Members   -  Reputation: 149


Posted 05 February 2013 - 09:07 PM

Does HLSL have access to fast trig functions that sacrifice accuracy for speed? I know CUDA has __sinf(x), __cosf(x), etc., which can be an order of magnitude faster than their sinf(x), cosf(x) counterparts. I could have sworn I read about it somewhere before, but I just can't find it on Google or MSDN anymore.




#2 Chris_F   Members   -  Reputation: 2225


Posted 05 February 2013 - 09:59 PM

On modern GPUs, trig functions only cost you one cycle, maybe less these days. How much faster does it need to be?

 

http://msdn.microsoft.com/en-us/library/windows/desktop/ff471376%28v=vs.85%29.aspx

 

What you see is what you get.



#3 Ryan_001   Prime Members   -  Reputation: 1339


Posted 05 February 2013 - 10:40 PM

Chris_F, on 05 February 2013 - 09:59 PM, said:
"On modern GPUs, trig functions only cost you one cycle, maybe less these days. How much faster does it need to be?
http://msdn.microsoft.com/en-us/library/windows/desktop/ff471376%28v=vs.85%29.aspx"

 

I'm pretty sure that isn't accurate. It depends on the architecture, but the trig functions are not executed by the standard ALUs but by a special function unit (SFU). The ratio of ALUs to SFUs depends on the architecture, but can be around 8:1 or less. For example, you can see the layout of ALUs and SFUs in the Fermi architecture on page 8 of this whitepaper: http://www.nvidia.ca/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Each 'Streaming Multiprocessor' contains 32 ALUs and 4 SFUs, so trig functions would take 8 cycles as opposed to 1 (provided all the threads in a warp need to execute the trig function).



#4 Chris_F   Members   -  Reputation: 2225


Posted 05 February 2013 - 10:50 PM

Ryan_001, on 05 February 2013 - 10:40 PM, said:
"It depends on the architecture, but the trig functions are not executed by the standard ALUs but by a special function unit (SFU). [...] So trig functions would take 8 cycles as opposed to 1 (provided all the threads in a warp need to execute the trig function)."

 

http://www.gamedev.net/topic/322422-number-of-gpu-cycles-for-cos-and-sin-functions/

 

And that was back in 2005. The number of execution units has nothing to do with the number of cycles per instruction, but rather the number of instructions per clock.


Edited by Chris_F, 05 February 2013 - 10:52 PM.


#5 Hodgman   Moderators   -  Reputation: 29304


Posted 05 February 2013 - 11:16 PM

NotTakenSN, on 05 February 2013 - 09:07 PM, said:
"Does HLSL have access to fast trig functions that sacrifice accuracy for speed? I know CUDA has __sinf(x), __cosf(x), etc., which can be an order of magnitude faster than their sinf(x), cosf(x) counterparts."

AFAIK, HLSL (or Cg) doesn't have any approximate/alternative sin/cos functions.



Ryan_001, on 05 February 2013 - 10:40 PM, said:
"It depends on the architecture, but the trig functions are not executed by the standard ALUs but by a special function unit (SFU). [...] So trig functions would take 8 cycles as opposed to 1 (provided all the threads in a warp need to execute the trig function)."

Chris_F, on 05 February 2013 - 10:50 PM, said:
"http://www.gamedev.net/topic/322422-number-of-gpu-cycles-for-cos-and-sin-functions/
And that was back in 2005. The number of execution units has nothing to do with the number of cycles per instruction, but rather the number of instructions per clock."


There are no citations in that thread for how many cycles anything takes.
 
Googling, though, I found this citation, which is very old: http://http.developer.nvidia.com/CgTutorial/cg_tutorial_chapter10.html
"In many cases, Standard Library functions compile to single assembly instructions that run in just one GPU clock cycle! Examples of such functions are sin, cos, lit, dot, and exp."
 
On modern GPUs, they will take up one instruction slot and it will likely take 1 cycle (or, amortized, less than 1) to issue the instruction, but without knowing how the processor is pipelined, it'd be impossible to know how many cycles until the results are available after the instruction is issued.
 
 
Instructions per clock does have an impact on cycles per instruction. If there are multiple sub-processors that each support a different (possibly overlapping) subset of the instruction set, then things get interesting.
E.g. say the processor can dual-issue (2 instructions per clock) to two sub-processors, but sub-unit A only supports 50% of the instruction set while sub-unit B supports the whole thing. We'll call instructions that can't be handled by the 'A' unit "B-instructions", and the ones that can be handled by both "A-instructions".
If a program has a B-instruction at most every second instruction, then everything runs smoothly, but if a program is made up only of B-instructions, then the processor can't dual-issue any more, and it will run twice as slow as a result.
This then gets more complicated when it only takes 1 cycle to issue an instruction, but multiple cycles for it to complete. If B-instructions take 2 cycles to complete, then we need to ensure that there are 3 "A-instructions" between each of them, instead of just 1, in order to keep everything running smoothly (smooth = instructions have an amortized cost of 1 cycle, even multi-cycle ones). In this situation, a program made up of 100 B-instructions could take the same amount of time as a program made up of 100 B's and 300 A's! When profiling the two programs, though, each would show a drastically different (averaged) cycles-per-instruction value (due to a lower instructions-per-clock value).
 
As Ryan_001 mentioned, it may depend on the surrounding code. If you've got a long series of muls with a sin in the middle, the sin might have an impact of 1 extra cycle... but if you've got a long series of sins, then they might each take 8 cycles, etc...
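 
To make that concrete, here is a purely illustrative HLSL fragment (not taken from any real shader; the names and constants are made up for this example): the first loop is dominated by mul/add work with a single sin in the middle, while the second is nothing but back-to-back sin calls, so it leans entirely on the transcendental units.

// Illustrative sketch only: ALU-heavy code with one sin vs. transcendental-heavy code.
float SinMixExample(float2 uv)
{
    float a = uv.x;
    [unroll]
    for (int i = 0; i < 16; ++i)
        a = a * 1.0001f + uv.y;   // plain mul/add (MAD) work
    a += sin(a);                  // a single transcendental in the middle

    float b = uv.y;
    [unroll]
    for (int j = 0; j < 16; ++j)
        b = sin(b);               // a long series of transcendentals
    return a + b;
}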

Edited by Hodgman, 05 February 2013 - 11:52 PM.


#6 Ryan_001   Prime Members   -  Reputation: 1339


Posted 05 February 2013 - 11:28 PM

As Hodgman alluded to, the SFUs do compute trig functions in a single cycle, but there are only 4 of them per SM. From http://www.pgroup.com/lit/articles/insider/v2n1a5.htm (the link doesn't seem to work when clicked, but it does if you copy and paste it):

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp. On a Tesla, the 8 cores in a group are quad-pumped to execute one instruction for an entire warp, 32 threads, in four clock cycles. Each Tesla core has integer and single-precision floating point functional units; a shared special function unit in each multiprocessor handles transcendentals and double-precision operations at 1/8 the compute bandwidth. A Fermi multiprocessor double-pumps each group of 16 cores to execute one instruction for each of two warps in two clock cycles, for integer or single-precision floating point. For double-precision instructions, a Fermi multiprocessor combines the two groups of cores to look like a single 16-core double-precision multiprocessor; this means the peak double-precision throughput is 1/2 of the single-precision throughput.

To my knowledge, though, instructions are not reordered and/or pipelined in a GPU the way they are in a CPU, but I don't know for certain. The docs I read seemed to imply that GPUs use warp scheduling to hide latency, not instruction reordering, though I could be wrong there...


Edited by Ryan_001, 05 February 2013 - 11:59 PM.


#7 MJP   Moderators   -  Reputation: 10828


Posted 06 February 2013 - 12:14 AM

The number of "cycles" an instruction takes is completely dependent on the architecture, and also even the exact meaning of that number might depend on the specifics of that architecture. For instance if you read up on AMD Southern Island's ISA, you'll find that one of the 4 SIMD units on a CU will execute a "normal" 32-bit FP instruction (MUL, MAD, etc.) in 4 cycles. This is because each SIMD unit has 16 lanes, so it takes 4 full cycles to complete the operation for all threads in a wavefront. Double-precision, integer, and transcendental instructions can run anywhere from 1/2 to 1/16 rate, so they might take anywhere from 8 to 64 cycles to complete.



#8 NotTakenSN   Members   -  Reputation: 149


Posted 06 February 2013 - 02:53 AM

So I'm assuming no one knows of any such HLSL functions (I thought it might have been some [attribute] modifier). The strange thing is that CUDA has a bunch of functions that sacrifice accuracy for speed, including square roots, exponentials, and trigonometric functions. This is detailed in the CUDA best practices guide under instruction optimization: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

I suppose it might just have a larger math library than HLSL.



#9 mhagain   Crossbones+   -  Reputation: 7803


Posted 06 February 2013 - 04:12 AM

NotTakenSN, on 06 February 2013 - 02:53 AM, said:
"The strange thing is that CUDA has a bunch of functions that sacrifice accuracy for speed, including square roots, exponentials, and trigonometric functions. [...] I suppose it might just have a larger math library than HLSL."

 

You seem to be working on the assumption that these trig functions are somehow provided by some form of software library. They're not; they execute on hardware, in hardware units. That has a number of implications, but the main one that's relevant here is this: if a faster trig function is not implemented by the hardware, then you just cannot use one.

 

Now, looking at how this bears on HLSL vs CUDA.  CUDA is in a position where it just runs on a single vendor's hardware, so the hardware capabilities it exposes can be tuned for that single vendor.  It doesn't need to support what other vendors may or may not be able to do.  HLSL, on the other hand, must run everywhere.

 

The best possible case, if HLSL were to expose such functions, would be that they just compile down to the same hardware instruction(s) as the regular ones.


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#10 MJP   Moderators   -  Reputation: 10828


Posted 06 February 2013 - 03:28 PM

HLSL essentially targets a virtual machine, where the supported instruction set of that VM is dictated by the shader model. When the shader gets compiled, it outputs assembly using the instructions and registers that are supported by the shader model, and then when you actually load the shader on the device, the driver will JIT compile that assembly to the hardware-specific ISA of the GPU. This means there can (and will) be instructions that are supported by the hardware ISA but aren't supported by the shader model VM. So for instance Nvidia hardware might support an approximate sin/cos instruction in addition to the normal version, but since SM5.0 assembly only has the normal sin/cos instructions, HLSL can't expose it. For it to be exposed in HLSL, the approximate instruction would have to be added to the spec of a future shader model. However, for this to happen the instruction typically needs to be supported by both Nvidia and AMD hardware. This has already happened several times in the past; for example, SM5.0 added approximate reciprocal and reciprocal square root functions (rcp() and rsqrt()) as well as coarse-grained and fine-grained derivative functions. These instructions have existed for quite some time in GPUs, but until they were added to the SM5.0 spec they couldn't be directly targeted. Instead, the driver might decide to use the approximate instructions when JIT compiling the shader.
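
As a rough illustration of those SM5.0 intrinsics, here is a minimal HLSL sketch (the function and variable names are made up for this example; the derivative intrinsics are only valid in pixel shaders):

// Minimal sketch of the SM5.0 intrinsics mentioned above (illustrative names only).
float3 ApproxExample(float3 v, float u, float x)
{
    float invX   = rcp(x);            // approximate reciprocal
    float invLen = rsqrt(dot(v, v));  // approximate reciprocal square root
    float dudxC  = ddx_coarse(u);     // coarse-grained derivative
    float dudxF  = ddx_fine(u);       // fine-grained derivative
    return v * invLen * (invX + dudxC + dudxF);
}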

As mhagain already explained, CUDA has the advantage of targeting a much more specific range of hardware. This means that Nvidia can expose the hardware more directly, which can give higher performance at the expense of tightly coupling your code with Nvidia hardware.


Edited by MJP, 06 February 2013 - 03:31 PM.


#11 NotTakenSN   Members   -  Reputation: 149


Posted 06 February 2013 - 10:39 PM

Thanks for the insightful and detailed responses, everybody. Do you think future versions of HLSL will support this, though? Even with the differences between AMD and Nvidia architectures, I would think it wouldn't be too hard to create an assembly instruction that results in the fast trig functions on Nvidia hardware and the normal trig functions on AMD hardware. Doesn't the JIT compiler know what hardware is being used? I don't think the compiler should use the fast trig functions without being explicitly told to do so, because accuracy may be important for some applications. I just don't understand why there isn't an assembly instruction for this. Just because the function isn't supported by both vendors shouldn't mean it can't be exploited by HLSL at all. There just needs to be an assembly instruction that uses fast trig operations when the supporting hardware is detected. Seems simple to me... but then, I'm no expert.



#12 MJP   Moderators   -  Reputation: 10828


Posted 07 February 2013 - 12:47 PM

I couldn't really answer those questions for sure. I don't have any insider info on the process Microsoft uses to decide what goes into the specification, or what criteria are used to decide whether to add an instruction.

The JIT compiler definitely knows what hardware is being used... it has to, since its job is to produce microcode for that specific hardware. In general it won't be able to make assumptions about the required precision or accuracy of a calculation, so I'm pretty sure that in most cases it won't try to swap out a sin or cos with an approximate version. However, vendors will definitely tweak their drivers to make optimizations for specific high-profile games, so that they can get higher performance in benchmarks. I wouldn't be surprised if those optimizations included shader tweaks that adjust precision or accuracy.



#13 Adam_42   Crossbones+   -  Reputation: 2437


Posted 07 February 2013 - 05:18 PM

It might be worth experimenting with half floats, if they provide enough precision. It's possible the JIT will pick different instructions based on what types are involved, but I've not tried it.

 

If you need faster trig functions you could try approximating them with a texture lookup - you can use the texture wrapping to handle the repetition so it's only a couple of instructions. A texture could also get you sin(x) and cos(x) in a single lookup.
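
A minimal sketch of that idea (the texture, sampler, and function names here are my own, and the sampler is assumed to be created with wrap addressing and linear filtering):

// Sketch of a sin/cos lookup texture: .r holds sin and .g holds cos over one period.
Texture1D<float2> SinCosLUT;     // e.g. 256 texels covering [0, 2*pi)
SamplerState      WrapSampler;   // created with D3D11_TEXTURE_ADDRESS_WRAP

static const float TWO_PI = 6.28318530718f;

float2 FastSinCos(float angle)
{
    // Wrap addressing handles angles outside [0, 2*pi) automatically.
    return SinCosLUT.SampleLevel(WrapSampler, angle / TWO_PI, 0);
}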

 

To find out what the GPU JIT compiler actually does, there are vendor tools available (AMD's GPU ShaderAnalyzer, for example).



#14 MJP   Moderators   -  Reputation: 10828


Posted 07 February 2013 - 05:31 PM

Modern AMD and Nvidia GPUs don't have any ALU support for half-precision floating point. In fact they removed support for half precision from HLSL, and then they recently added it back in for Direct3D 11.1 (so that they could support mobile GPUs).
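
For reference, the Direct3D 11.1 route is exposed in HLSL through the minimum-precision types; a tiny sketch follows (note these are only a hint, and the driver is free to evaluate them at full 32-bit precision):

// Minimum-precision sketch: the driver may evaluate this at 16-bit or full 32-bit precision.
min16float FalloffExample(min16float d)
{
    min16float one = 1.0;          // implicit conversion from the float literal
    return saturate(one - d * d);
}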



#15 mhagain   Crossbones+   -  Reputation: 7803


Posted 07 February 2013 - 07:27 PM

All of this raises the question: just how much are you using these functions that you really feel the need for faster versions of them? Have you actually benchmarked and determined that these particular functions are a bottleneck for you, or is this some kind of relatively vague "faster versions of these would be nice" thing?

 

Personally, I've done full-screen post-processing effects with 2 sins per pixel, and my own benchmarks have shown ROP to be so dominant that it would take some pretty damn heavy shaders to even register with any comparable significance. In summary, I doubt fast versions are even needed outside of some weird extreme use cases.


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.




