HLSL fast trig functions


Does HLSL have access to fast trig functions that sacrifice accuracy for speed? I know CUDA has __sinf(x), __cosf(x), etc., which can be an order of magnitude faster than their sinf(x) and cosf(x) counterparts. I swear I read about this somewhere before, but I just can't find it on Google or MSDN anymore.


On modern GPUs, trig functions only cost you one cycle, maybe less these days. How much faster does it need to be?

http://msdn.microsoft.com/en-us/library/windows/desktop/ff471376%28v=vs.85%29.aspx

What you see is what you get.


I'm pretty sure that isn't accurate. It depends on the architecture, but trig functions are not executed by the standard ALUs; they run on a special function unit (SFU). The ratio of ALUs to SFUs depends on the architecture, but it can be 8:1 or even worse. For example, you can see the layout of ALUs and SFUs in the Fermi architecture here: http://www.nvidia.ca/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf on page 8. Each 'Streaming Multiprocessor' contains 32 ALUs and 4 SFUs, so trig functions would take 8 cycles as opposed to 1 (provided all the threads in a warp need to execute the trig function).
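Working out the per-warp throughput implied by those Fermi numbers (a rough back-of-the-envelope figure, assuming every thread in the warp executes the trig instruction):

    \[
      \text{ALU op: } \frac{32\ \text{threads}}{32\ \text{ALUs}} = 1\ \text{cycle per warp},
      \qquad
      \text{SFU op (sin/cos): } \frac{32\ \text{threads}}{4\ \text{SFUs}} = 8\ \text{cycles per warp}
    \]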


http://www.gamedev.net/topic/322422-number-of-gpu-cycles-for-cos-and-sin-functions/

And that was back in 2005. The number of execution units has nothing to do with the number of cycles per instruction; it determines the number of instructions per clock.


AFAIK, HLSL (or Cg) doesn't have any approximate/alternative sin/cos functions.
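For reference, the trig intrinsics HLSL does expose are just the standard ones; a minimal sketch (the Rotate2D helper is made up for illustration, and sincos() is not an approximation -- it simply returns both results from one call):

    // Standard HLSL trig intrinsics -- there is no __sinf-style "fast" variant.
    // sincos() evaluates both results in one call, at the same precision as
    // calling sin() and cos() separately.
    float2 Rotate2D(float2 v, float angle)
    {
        float s, c;
        sincos(angle, s, c);
        return float2(v.x * c - v.y * s,
                      v.x * s + v.y * c);
    }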





There are no citations in that thread for how many cycles anything takes, though.

Googling though, I found this citation, which is very old: http://http.developer.nvidia.com/CgTutorial/cg_tutorial_chapter10.html
"In many cases, Standard Library functions compile to single assembly instructions that run in just one GPU clock cycle! Examples of such functions are sin , cos , lit , dot , and exp"

On a modern GPU, they will take up one instruction slot, and it will likely take 1 cycle (or, amortized, less than 1) to issue the instruction; but without knowing how the processor is pipelined, it's impossible to know how many cycles pass before the results are available after the instruction is issued.


Instructions per clock does have an impact on cycles per instruction. If there are multiple sub-processors that each support a different (possibly overlapping) subset of the instruction set, then things get interesting.
e.g. say the processor can dual-issue (2 instructions per clock) to two sub-processors, but sub-unit A only supports 50% of the instruction set while sub-unit B supports the whole thing. We'll call instructions that can't be handled by the 'A' unit "B-instructions", and the ones that can be handled by both "A-instructions".
If a program only has a B-instruction every two or more instructions, then everything runs smoothly, but if a program is made up solely of B-instructions, then the processor can't dual-issue any more, and it runs half as fast as a result.
This gets more complicated when it only takes 1 cycle to issue an instruction, but multiple cycles for it to complete. If B-instructions take 2 cycles to complete, then we need to ensure that there are 3 "A-instructions" between each of them, instead of just 1, in order to keep everything running smoothly (smooth = instructions have an amortized cost of 1 cycle, even multi-cycle ones). In this situation, a program made up of 100 B-instructions could take the same amount of time as a program made up of 100 B's and 300 A's! When profiling the two programs, though, each would show a drastically different (averaged) cycles-per-instruction value (due to a lower instructions-per-clock value).
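Putting numbers on that last example, under the same assumptions (dual-issue, B-instructions take 2 cycles to complete and can't overlap each other):

    \[
      \text{100 B's alone: } 100 \times 2 = 200\ \text{cycles},
      \qquad
      \text{100 B's} + \text{300 A's: } \frac{400\ \text{instructions}}{2\ \text{per clock}} = 200\ \text{cycles}
    \]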

As Ryan_001 mentioned, it may depend on the surrounding code. If you've got a long series of muls with a sin in the middle, the sin might only add 1 extra cycle... but if you've got a long series of sins, then they might each take 8 cycles, etc...
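A purely hypothetical pair of HLSL fragments illustrating those two cases (the loop counts and constants are made up; only the instruction mix matters):

    // Case 1: mostly ALU work with a single sin() in the middle -- the one SFU
    // instruction can slot in between the mads, so it may only add ~1 cycle.
    float MostlyMuls(float x)
    {
        float acc = x;
        [unroll]
        for (int i = 0; i < 16; ++i)
            acc = acc * 1.0001f + 0.5f;   // mad-friendly ALU work
        return acc * sin(x);              // lone SFU instruction
    }

    // Case 2: back-to-back sin() calls -- every instruction needs the SFU, so
    // the ALU:SFU ratio (e.g. 8:1 on Fermi) becomes the bottleneck.
    float AllSins(float x)
    {
        float acc = x;
        [unroll]
        for (int i = 0; i < 16; ++i)
            acc = sin(acc);               // chain of SFU instructions
        return acc;
    }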

As Hodgman alluded to, the SFUs do compute trig functions in a single cycle, but there are only 4 of them per SM. From http://www.pgroup.com/lit/articles/insider/v2n1a5.htm (which doesn't seem to work when clicked on, but if you copy-paste it, it works...):

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp. On a Tesla, the 8 cores in a group are quad-pumped to execute one instruction for an entire warp, 32 threads, in four clock cycles. Each Tesla core has integer and single-precision floating point functional units; a shared special function unit in each multiprocessor handles transcendentals and double-precision operations at 1/8 the compute bandwidth. A Fermi multiprocessor double-pumps each group of 16 cores to execute one instruction for each of two warps in two clock cycles, for integer or single-precision floating point. For double-precision instructions, a Fermi multiprocessor combines the two groups of cores to look like a single 16-core double-precision multiprocessor; this means the peak double-precision throughput is 1/2 of the single-precision throughput.

To my knowledge, though, instructions are not reordered and/or pipelined in a GPU like they are in a CPU, but I don't know for certain. The docs I read seemed to imply that GPUs use warp scheduling to hide latency rather than instruction reordering, though I could be wrong there...

The number of "cycles" an instruction takes is completely dependent on the architecture, and even the exact meaning of that number might depend on the specifics of that architecture. For instance, if you read up on the AMD Southern Islands ISA, you'll find that one of the 4 SIMD units on a CU will execute a "normal" 32-bit FP instruction (MUL, MAD, etc.) in 4 cycles. This is because each SIMD unit has 16 lanes, so it takes 4 full cycles to complete the operation for all threads in a wavefront. Double-precision, integer, and transcendental instructions can run anywhere from 1/2 to 1/16 rate, so they might take anywhere from 8 to 64 cycles to complete.
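Plugging in the Southern Islands numbers from that paragraph (a 64-thread wavefront over a 16-lane SIMD):

    \[
      \text{full-rate op: } \frac{64\ \text{threads}}{16\ \text{lanes}} = 4\ \text{cycles per wavefront},
      \qquad
      \text{at } \tfrac{1}{2} \text{ to } \tfrac{1}{16} \text{ rate: } 4 \times 2 = 8 \ \text{ up to } \ 4 \times 16 = 64\ \text{cycles}
    \]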

So I'm assuming no one knows of such HLSL functions (I thought it might've been some [attribute] modifier). The strange thing is that CUDA has a bunch of functions that sacrifice accuracy for speed, including square roots, exponentials, and trigonometric functions. This is detailed in the CUDA best practices guide under instruction optimization: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

I suppose it might just have a larger math library than HLSL.


You seem to be working on the assumption that these trig functions are somehow provided by some form of software library. They're not; they execute in hardware, in dedicated hardware units. That has a number of implications, but the main one that's relevant here is this: if a faster trig function is not implemented by the hardware, then you just cannot use one.

Now, consider how this bears on HLSL vs CUDA. CUDA is in a position where it only runs on a single vendor's hardware, so the capabilities it exposes can be tuned for that single vendor; it doesn't need to support what other vendors may or may not be able to do. HLSL, on the other hand, must run everywhere.

The best possible case, if HLSL were to expose such functions, would be that they just compile down to the same hardware instruction(s) as the regular ones.


HLSL essentially targets a virtual machine, where the supported instruction set of that VM is dictated by the shader model. When a shader gets compiled, the compiler outputs assembly using the instructions and registers supported by that shader model, and then when you actually load the shader on the device, the driver JIT-compiles that assembly to the hardware-specific ISA of the GPU. This means there can (and will) be instructions that are supported by the hardware ISA but aren't supported by the shader model VM. So for instance Nvidia hardware might support an approximate sin/cos instruction in addition to the normal version, but since SM5.0 assembly only has the normal sin/cos instructions, HLSL can't expose it.

For it to be exposed in HLSL, the approximate instruction would have to be added to the spec of a future shader model. However, for that to happen, the instruction typically needs to be supported by both Nvidia and AMD hardware. This has already happened several times in the past: for example, SM5.0 added approximate reciprocal and reciprocal square root functions (rcp() and rsqrt()) as well as coarse-grained and fine-grained derivative functions. These instructions had existed in GPUs for quite some time, but until they were added to the SM5.0 spec they couldn't be directly targeted. Instead, the driver might decide on its own to use the approximate instructions when JIT-compiling the shader.

As mhagain already explained, CUDA has the advantage of targeting a much more specific range of hardware. This means that Nvidia can expose the hardware more directly, which can give higher performance at the expense of tightly coupling your code with Nvidia hardware.
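For reference, the intrinsics mentioned above look roughly like this in HLSL (a trivial pixel-shader sketch; the function itself is made up for illustration, and rcp() plus the coarse/fine derivative intrinsics require SM5.0):

    // Explicit intrinsics exposed by the shader model, rather than the generic
    // 1.0 / x, 1.0 / sqrt(x), or ddx()/ddy() forms.
    float3 Sm5Intrinsics(float3 n, float d, float2 uv)
    {
        float  invD      = rcp(d);                                      // approximate reciprocal (SM5.0)
        float  invLen    = rsqrt(dot(n, n));                            // reciprocal square root
        float2 dUvCoarse = float2(ddx_coarse(uv.x), ddy_coarse(uv.x));  // coarse-grained derivatives (SM5.0)
        float2 dUvFine   = float2(ddx_fine(uv.x),   ddy_fine(uv.x));    // fine-grained derivatives (SM5.0)
        return n * invLen * invD + float3(dUvCoarse + dUvFine, 0.0f);
    }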

