Relation between TFLOPS and Threads in a GPU?

1 comment, last by Ohforf sake 9 years ago

What's the relation between teraflops and threads in a GPU? For example, the GTX Titan X has 7 TFLOPS of compute performance, but how many threads does it have?

If they are different concepts, can you please explain both of them?


There's no relationship, because GPUs have no real "threads" in the CPU sense. The choice (in DirectCompute) to use the word "thread" is so bogus I cannot believe it made it into the final documentation. That said...

It depends on the chip family and even on the specific segment.

The number of operations executed per second is just: $$ops = processingElements \times clockRate_{Hz}$$
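As a quick sanity check, here is a minimal sketch of that formula in Python; the processing-element count and clock rate are placeholder round numbers, not figures quoted from any datasheet:

```python
# Sketch of ops = processingElements * clockRate (placeholder values).
processing_elements = 3072        # what marketing calls "cores"
clock_rate_hz = 1_000_000_000     # assumed 1 GHz, a round number

ops_per_second = processing_elements * clock_rate_hz
print(f"{ops_per_second:,} operations per second")  # 3,072,000,000,000
```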

Which takes us to the magic world of processing elements: what are those?

They are the part of the ALU carrying out the useful work. Many people think one PE ≈ one thread, and given current GPU capabilities you can indeed treat them that way, but in practice a PE is a much more fine-grained element, and you are given the choice of how to group PEs to make up "threads".

The native concept of a "thread" for a GPU is the wavefront (AMD GCN, OpenCL) or the warp (NV). They are basically the same thing: packs of 64 or 32 processing elements, respectively.

I am going to use the word "thread" for your convenience, but be warned it is an inaccurate term.

Marketing wants you to believe a PE is a thread, but if that were the case, a single CPU thread using SSE would be quad-threaded.

The number of "threads" executing at a given time (assuming you always saturate the device) is

$$threads=processingElements / threadSize$$

So for example the GM200 Titan X has 3072 "cores" (marketing jargon), which are really 3072 PEs (CL jargon). With a warp size of 32, you have 96 threads in flight. WRONG! You have 96 warps!

If tomorrow NV decided their warp size was 16, you'd have 96 × 2 = 192 warps.

This is at each given clock.
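Putting the second formula into code, again as a sketch using the GM200 numbers from above (3072 PEs, warp size 32):

```python
# warps in flight = processingElements / threadSize
processing_elements = 3072   # GM200 "cores", i.e. PEs
warp_size = 32               # NV warp; an AMD GCN wavefront would be 64

warps_in_flight = processing_elements // warp_size
print(warps_in_flight)       # 96 warps executing per clock

# If the warp size were hypothetically 16 instead:
print(processing_elements // 16)  # 192 warps
```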

During processing, the GPU will switch across several warps. The number of warps in flight depends on the device and the actual program being executed. There's usually an upper bound, but I'm not well versed in NV architectures.

EDIT: I messed up the second formula somehow.

Previously "Krohm"

Peak performance (in FLoating point OPerations per Second = FLOPS) is the theoretical upper limit on how many computations a device can sustain per second. If a Titan X were doing nothing other than computing 1 + 2 * 3, it could do that 3 072 000 000 000 times per second, and since there are two operations in there (an addition and a multiplication), that amounts to 6 144 000 000 000 FLOPS, or about 6.144 TFLOPS. But you only get that speed if you never read any data, never write back any results, and never do anything other than a multiply followed by an addition.
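For a back-of-envelope check of those numbers, here is the same arithmetic in Python; the 1 GHz clock is an assumption chosen so the result matches the round figures above, not the card's exact boost clock:

```python
# Peak FLOPS = cores * clock * 2 (the multiply and the add in 1 + 2 * 3)
cores = 3072
clock_hz = 1_000_000_000          # assumed 1 GHz for round numbers
flops_per_core_per_cycle = 2      # one addition + one multiplication

peak_flops = cores * clock_hz * flops_per_core_per_cycle
print(peak_flops / 1e12, "TFLOPS")  # 6.144 TFLOPS
```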

A "thread" (and Krohm rightfully warned of its use as a marketing buzzword) is generally understood to be an execution context. If a device executes a program, this refers to the current state, such as the current position in the program, the current values of the local variables, etc.

Threads and peak performance are two entirely different things!

Some compute devices (some Intel CPUs, some AMD CPUs, Sun Niagara CPUs, and most GPUs) can store more than one execution context aka "thread" on the chip, so that they can interleave the execution of both/all of them. For CPUs at least, this sometimes falls under the term "hardware threads". This is done for performance reasons, but it does not affect the theoretical peak performance of the device, only how much of it you can actually use. And the relationship between the maximum number of hardware threads, the number of hardware threads actually used, and the achieved performance is very complicated: it depends on lots of different factors like memory throughput, memory latency, access patterns, the actual algorithm, and so on.
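As a toy illustration of why that relationship is complicated, here is the usual back-of-envelope argument for why interleaving more threads can help hide memory latency; all the numbers below are invented for the example, not measurements of any real GPU:

```python
# Toy model: while one warp waits on memory, other resident warps can issue work.
# All figures here are assumptions for illustration only.
memory_latency_cycles = 400      # assumed stall time for a load
cycles_between_loads = 20        # assumed compute work per load, per warp

# Rough number of warps needed so something is always ready to issue:
warps_needed = memory_latency_cycles / cycles_between_loads
print(warps_needed)  # 20.0 -- with fewer resident warps the ALUs sit idle
```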

So if this is what you are asking about, then you might have to look into how GPUs work and how certain algorithms make use of that.

This topic is closed to new replies.
