On modern GPUs, trig functions only cost you one cycle, maybe less these days. How much faster do they need to be?
I'm pretty sure that isn't accurate. It depends on the architecture, but trig functions are executed not by the standard ALUs but by special function units (SFUs). The ratio of ALUs to SFUs varies by architecture and can be 8:1 or even less. For example, the Fermi layout of ALUs and SFUs is shown on page 8 of the whitepaper: http://www.nvidia.ca/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Each 'Streaming Multiprocessor' contains 32 ALUs and 4 SFUs, so a trig function would take 8 cycles as opposed to 1 (provided all the threads in a warp need to execute the trig function).
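To make the arithmetic explicit, here is a rough back-of-the-envelope sketch in Python. It only models issue throughput for a single warp using the Fermi figures quoted above (32 ALUs and 4 SFUs per SM, 32 threads per warp). The function name `issue_cycles` is made up for illustration, and the model deliberately ignores instruction latency, pipelining, and multiple warp schedulers, so treat it as an estimate rather than a description of what the hardware actually does.

```python
import math

def issue_cycles(warp_size: int, units: int) -> int:
    """Cycles needed to start one instruction for every thread in a warp
    when only `units` threads can be started per cycle (simplified model)."""
    return math.ceil(warp_size / units)

WARP_SIZE = 32    # threads per warp
ALUS_PER_SM = 32  # Fermi figure quoted in the post above
SFUS_PER_SM = 4   # Fermi figure quoted in the post above

alu_cycles = issue_cycles(WARP_SIZE, ALUS_PER_SM)
sfu_cycles = issue_cycles(WARP_SIZE, SFUS_PER_SM)

print(f"ALU op for a full warp: {alu_cycles} cycle(s)")
print(f"SFU op for a full warp: {sfu_cycles} cycle(s)")
```

Under those assumptions the ALU case comes out to 1 cycle and the SFU case to 8. If only some threads in the warp hit the trig call, or the architecture has a different ALU-to-SFU ratio, the numbers change accordingly.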
And that was back in 2005. The number of execution units has nothing to do with the number of cycles per instruction; what it determines is the number of instructions per clock.