AFAIK, HLSL (or Cg) doesn't have any approximate/alternative sin/cos functions.
Does HLSL have access to fast trig functions that sacrifice accuracy for speed? I know CUDA has __sinf(x), __cosf(x), etc., which can be an order of magnitude faster than their sinf(x), cosf(x) counterparts. I swear I read about it somewhere before, but I just can't find it on Google or MSDN anymore.
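For what it's worth, fast-math sin implementations are typically just low-degree polynomial approximations, so you can always roll your own in a shader. Here's a sketch of the well-known parabola approximation, written in Python for clarity — my own illustration, not CUDA's or HLSL's actual implementation:

```python
import math

def fast_sin(x):
    """Cheap sin approximation: a parabola fitted on [-pi, pi], plus one
    refinement pass (the classic 0.225 blend). Max abs error is on the
    order of 1e-3 -- fine for flavour effects, not for science."""
    # Range-reduce to [-pi, pi].
    x = (x + math.pi) % (2.0 * math.pi) - math.pi
    # First parabolic pass: exact at 0, +-pi/2, and +-pi.
    y = (4.0 / math.pi) * x - (4.0 / (math.pi ** 2)) * x * abs(x)
    # Refinement: blend y with y*|y| to pull the bulge back toward sin.
    return 0.225 * (y * abs(y) - y) + y
```

Ported to HLSL this is just a handful of mads and an abs, with no transcendental instruction at all.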
There are no citations in that thread for how many cycles anything takes.
I'm pretty sure that isn't accurate. It depends on the architecture, but the trig functions are not executed by the standard ALUs; they're handled by a special function unit (SFU). The ratio of ALUs to SFUs depends on the architecture, but can be 8:1 or even worse. For example, you can see the layout of ALUs and SFUs in the Fermi architecture here: http://www.nvidia.ca/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf on page 8. Each 'Streaming Multiprocessor' contains 32 ALUs and 4 SFUs, so a trig function would take 8 cycles as opposed to 1 (provided all the threads in a warp need to execute the trig function).
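Running the whitepaper's numbers (32 ALUs, 4 SFUs per SM) — my arithmetic, assuming perfect throughput and ignoring pipelining:

```python
# Fermi SM, per the whitepaper linked above: 32 ALUs ("CUDA cores"), 4 SFUs.
ALUS_PER_SM = 32
SFUS_PER_SM = 4
WARP_SIZE = 32  # threads per warp

# Throughput cost, in cycles, for one full warp to pass through each unit type.
alu_cycles_per_warp = WARP_SIZE // ALUS_PER_SM  # 32 threads / 32 units = 1
sfu_cycles_per_warp = WARP_SIZE // SFUS_PER_SM  # 32 threads / 4 units  = 8

print(f"ALU op: {alu_cycles_per_warp} cycle/warp, "
      f"SFU op (e.g. sin): {sfu_cycles_per_warp} cycles/warp")
```

That's where the 8-vs-1 figure comes from: it's purely the unit ratio, not any property of the sin instruction itself.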
On modern GPUs, trig functions only cost you one cycle, maybe less these days. How much faster does it need to be?
What you see is what you get.
And that was back in 2005. The number of execution units doesn't determine the number of cycles per instruction; it determines the number of instructions per clock.
Googling though, I found this citation, which is very old: http://http.developer.nvidia.com/CgTutorial/cg_tutorial_chapter10.html
"In many cases, Standard Library functions compile to single assembly instructions that run in just one GPU clock cycle! Examples of such functions are sin, cos, lit, dot, and exp"
On a modern GPU, they will take up one instruction slot and it will likely take 1 cycle (or, amortized, less than 1) to issue the instruction, but without knowing how the processor is pipelined, it's impossible to know how many cycles pass after issue before the result is actually available.
Instructions per clock does have an impact on cycles per instruction. If there are multiple sub-processors that each support a different (possibly overlapping) subset of the instruction set, then things get interesting.
e.g. say the processor can dual-issue (2 instructions per clock) to two sub-processors, but sub-unit A only supports 50% of the instruction set while sub-unit B supports the whole thing. We'll call instructions that can't be handled by the 'A' unit "B-instructions", and the ones that can be handled by both "A-instructions".
If at most every other instruction in a program is a B-instruction, then everything runs smoothly; but if a program is made up entirely of B-instructions, then the processor can't dual-issue any more, and it runs twice as slow as a result.
This then gets more complicated when it only takes 1 cycle to issue an instruction, but multiple cycles for it to complete. If B-instructions take 2 cycles to complete, then we need to ensure that there are three A-instructions between each of them, instead of just one, in order to keep everything running smoothly (smooth = instructions have an amortized cost of 1 cycle, even multi-cycle ones). In this situation, a program made up of 100 B-instructions could take the same amount of time as a program made up of 100 B's and 300 A's! When profiled, though, the two programs would show drastically different (averaged) cycles-per-instruction values, because their instructions-per-clock values differ.
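To make the arithmetic concrete, here's that hypothetical worked through in Python — the machine, the 2-cycle figure, and the 1:3 ratio are all from the example above, not from any real GPU:

```python
# Hypothetical dual-issue machine from the example above (not a real GPU):
# 2 instructions issue per clock, and a B-instruction occupies its unit
# for 2 cycles, so unit B sustains at most one B every 2 cycles.
ISSUE_WIDTH = 2  # instructions per clock
B_CYCLES = 2     # cycles a B-instruction ties up its unit

def pure_b_cycles(n_b):
    # All-B program: everything serializes on unit B, one B per 2 cycles.
    return n_b * B_CYCLES

def mixed_cycles(n_b, n_a):
    # With three A's per B the machine can dual-issue every cycle,
    # so total cycles = total instructions / issue width.
    return (n_b + n_a) // ISSUE_WIDTH

print("100 B's alone:    ", pure_b_cycles(100), "cycles")      # 200
print("100 B's + 300 A's:", mixed_cycles(100, 300), "cycles")  # 200
print("IPC:", 100 / pure_b_cycles(100), "vs", 400 / mixed_cycles(100, 300))
```

Same wall-clock time, but a profiler would report an average CPI of 2.0 for the first program and 0.5 for the second, which is exactly the discrepancy being described.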
As Ryan_001 mentioned, it may depend on the surrounding code. If you've got a long series of muls with a sin in the middle, the sin might have an impact of 1 extra cycle... but if you've got a long series of sins, then they might each take 8 cycles, etc...