
Ryan_001

Posted 05 February 2013 - 11:59 PM

As Hodgman alluded to, the SFUs do compute trig functions in a single cycle, but there are only 4 of them per SM, so a full 32-thread warp can't get all of its results in one cycle.  From http://www.pgroup.com/lit/articles/insider/v2n1a5.htm (the link doesn't seem to work when clicked, but it does if you copy and paste it):

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp. On a Tesla, the 8 cores in a group are quad-pumped to execute one instruction for an entire warp, 32 threads, in four clock cycles. Each Tesla core has integer and single-precision floating point functional units; a shared special function unit in each multiprocessor handles transcendentals and double-precision operations at 1/8 the compute bandwidth. A Fermi multiprocessor double-pumps each group of 16 cores to execute one instruction for each of two warps in two clock cycles, for integer or single-precision floating point. For double-precision instructions, a Fermi multiprocessor combines the two groups of cores to look like a single 16-core double-precision multiprocessor; this means the peak double-precision throughput is 1/2 of the single-precision throughput.

To my knowledge, though, instructions are not reordered and/or pipelined in a GPU the way they are in a CPU, but I don't know for certain.  The docs I read seemed to imply that GPUs use warp scheduling to hide latency rather than instruction reordering, though I could be wrong there...
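
For what it's worth, the cycle counts in the quoted passage fall out of dividing the warp size by the number of units that can work on it. Here is a rough sketch of that arithmetic. The function name is made up for illustration, and it assumes each unit handles one thread per clock, which is a simplification on my part and not something the article states for the SFUs:

```python
def cycles_per_warp_instruction(units: int, warp_size: int = 32) -> int:
    """Clock cycles to push one instruction through a whole warp.

    Assumption: each unit processes one thread per clock, so the warp
    is fed through the units in warp_size / units passes.
    """
    if units <= 0 or warp_size % units != 0:
        raise ValueError("units must be a positive divisor of warp_size")
    return warp_size // units


tesla_alu = cycles_per_warp_instruction(8)   # 8 cores, quad-pumped
fermi_alu = cycles_per_warp_instruction(16)  # 16 cores, double-pumped
sfu_trig = cycles_per_warp_instruction(4)    # 4 SFUs per SM (figure from this post)

print(tesla_alu, fermi_alu, sfu_trig)  # prints: 4 2 8
```

So under those assumptions a trig instruction would occupy the SFUs for about 8 clocks per warp, even though each individual result only takes one cycle.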

