Hypothetical raw GPU programming


You guys are wayyy over my head with this stuff. I'm kinda with fir on this; I only have a vague notion of what a GPU does, but I figure it's like he says, a vast array of memory as data input, a similar vast array as output, and a set of processors that read and process instructions from yet another array of memory to transform the input to the output. Is that not the case?


At a high level, yes, that could be the case, but that's taking the bird's-eye view of things :)

Do all the processing units always work in lock-step or can they be divided into subgroups each processing a different program on different input sets?


Yes and no.

This is where things get fun, as it immediately depends on the architecture at hand. I'll deal with AMD's latest GCN because they have opened a lot of docs on how it works.

The basic unit of the GPU, the building block, is the "Compute Unit" or "CU" in their terminology.

The CU itself is made up of a scheduler, 4 SIMD units (each 16 lanes wide), a scalar unit, a branch/message unit, a local data store, 4 banks of vector registers, a bank of scalar registers, texture filter units, texture load/store units and an L1 cache.

The scheduler is where the work comes in, and it's where things get complicated right away, as it can keep multiple program kernels in flight. A single scheduler can keep up to 2560 threads in flight at once, and each cycle it can issue up to 5 instructions to the various units from any of the kernels it has in flight.

The work itself is divided up into 'wavefronts'; these are groupings of 64 threads which will execute in lock step.

So the work is spread as up to 10 waves of 64 threads on each of the 4 SIMD units.
Each of these waves could come from a different program.

Each clock cycle one SIMD is considered for execution, at which point each wave on that SIMD gets a chance to execute an instruction (at most one), and up to 5 instructions in total can be issued. The candidate types are: vector ALU, vector memory read/write/atomic, scalar (see below), branch, local data share, export or global data share, and special instructions. Note: there are more instruction types than can be issued at once, and only one instruction of each type can be issued per clock.

(The scalar unit is an execution unit in its own right; the scheduler issues instructions to it, but they can be ALU, memory or flow control instructions. Up to one per clock can be issued.)

The SIMD units aren't vectored per-thread, however; a single instruction across a wavefront takes 4 cycles on a 16-wide SIMD. So if you were doing a vec4 + vec4 on SIMD0, it would take 4 cycles per component before the result was ready and the next instruction could be issued - the work is effectively issued as 4 add instructions, each across 64 threads run in groups of 16. (However, during those other 3 cycles the scheduler will be considering SIMD1-3 for execution, so work is still being done on the CU.)
(For sanity's sake, however, we basically pretend that all 64 threads in a wavefront execute at the same time; it's basically the same thing from a logical point of view.)
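
To make that timing concrete, here's a tiny host-side C sketch (not GPU code, just a toy model of the description above, with made-up input data) of a vec4 + vec4 over one wavefront being issued as 4 component adds, each covering the 64 threads in 4 groups of 16 lanes:

```
#include <stdio.h>

#define WAVE_SIZE  64   /* threads per wavefront (GCN)        */
#define SIMD_WIDTH 16   /* lanes the SIMD processes per cycle */

int main(void)
{
    float a[WAVE_SIZE][4], b[WAVE_SIZE][4], out[WAVE_SIZE][4];
    int cycles = 0;

    /* dummy per-thread input data */
    for (int t = 0; t < WAVE_SIZE; ++t)
        for (int c = 0; c < 4; ++c) { a[t][c] = (float)t; b[t][c] = (float)c; }

    /* a per-thread vec4 + vec4 is issued as 4 separate add instructions... */
    for (int component = 0; component < 4; ++component) {
        /* ...and each instruction covers the 64 threads as 4 groups of 16 lanes */
        for (int group = 0; group < WAVE_SIZE / SIMD_WIDTH; ++group) {
            for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
                int t = group * SIMD_WIDTH + lane;
                out[t][component] = a[t][component] + b[t][component];
            }
            ++cycles;   /* one cycle per 16-lane group */
        }
    }

    /* 4 instructions x 4 cycles each = 16 cycles on this one SIMD */
    printf("cycles: %d, thread 63 result: (%g, %g, %g, %g)\n",
           cycles, out[63][0], out[63][1], out[63][2], out[63][3]);
    return 0;
}
```

The point is simply that the vec4 the programmer sees and the 16-lane hardware SIMD are different axes: the former becomes 4 instructions, the latter determines how many cycles each of those instructions takes for a 64-thread wave.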

So, in one CU, at any given time, up to 10 programs can be running per SIMD, with 40 programs in flight in the CU managing up to 2560 threads of data. This is a theoretical maximum, however, as it depends on what resources the CU has; the vector register banks are statically allocated, so if one program comes along and grabs all of them on one SIMD then no more work can be issued on that SIMD until it has completed. This register file is 64KB in size, which means you have 16384 registers (64KB / 4 bytes) per SIMD; however, this is statically shared across all wavefronts, so if, for example, you have a program where each thread requires 84 registers, the SIMD can only maintain 3 wavefronts in flight as it doesn't have the resources for any more (3 x 64 x 84 = 16128; to issue another wavefront from the same kernel would require another 5376 registers it doesn't have space for). (In theory the SIMD could be handed another program which only required 3 VGPRs to work, so another wavefront could be launched, but in practice that is unlikely.)
(SGPRs are also limited across the whole CU, as the scalar unit is shared between all SIMDs.)
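
As a worked version of the register maths above, here's a small C sketch (the function name and the second, 24-VGPR case are just for illustration) that computes how many wavefronts fit on one SIMD for a given per-thread VGPR count:

```
#include <stdio.h>

/* GCN-style VGPR budget per SIMD, numbers taken from the post above */
#define VGPR_FILE_BYTES    (64 * 1024)  /* 64KB of vector registers per SIMD */
#define BYTES_PER_VGPR     4            /* one 32-bit register per thread    */
#define WAVE_SIZE          64           /* threads per wavefront             */
#define MAX_WAVES_PER_SIMD 10           /* hardware limit per SIMD           */

/* how many wavefronts of a kernel fit on one SIMD, given its VGPR usage */
static int max_waves_for_kernel(int vgprs_per_thread)
{
    int total_vgprs    = VGPR_FILE_BYTES / BYTES_PER_VGPR;   /* 16384 */
    int vgprs_per_wave = vgprs_per_thread * WAVE_SIZE;
    int fit            = total_vgprs / vgprs_per_wave;
    return fit < MAX_WAVES_PER_SIMD ? fit : MAX_WAVES_PER_SIMD;
}

int main(void)
{
    /* the example from the post: 84 VGPRs per thread -> only 3 waves fit */
    printf("84 VGPRs/thread -> %d wavefronts per SIMD\n", max_waves_for_kernel(84));
    /* a lighter (hypothetical) kernel can reach the full 10-wave occupancy */
    printf("24 VGPRs/thread -> %d wavefronts per SIMD\n", max_waves_for_kernel(24));
    return 0;
}
```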

So, given a simple program which is only 64 threads in size, the flow is:
- Program is handed off to the CU
- CU's scheduler assigns it to a SIMD unit
- Each clock cycle the scheduler looks at a SIMD unit and decides which instructions from which wavefronts are executed.

If you have more than 64 threads in the group of work, then this would be broken up and spread across either different SIMDs or different wavefronts on the same SIMD. It will always reside on the same CU, however; this is because of the memory barriers etc. needed to treat the execution as one group.
(The 64-thread limit is useful to know, however, because if you write code which fits into a single wavefront you can assume all 64 threads are at the same place at the same time, so you can drop atomic operations when working on the local data store, etc.)

There is also a lot not covered here, as the GPU requires you to manage the cache yourself for memory read/write operations, and there is a lot of complex detail, most of which is hidden by the graphics/compute API of choice, which will Do The Right Thing for you.

Of course a GPU isn't made up of just one CU; an R290X, for example, has 44 CUs, which means it can have up to 112,640 work items in flight at once.
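
The 112,640 figure is just the product of the numbers already given; a trivial C check:

```
#include <stdio.h>

int main(void)
{
    /* R290X-style GCN part, numbers from the posts above */
    int cus            = 44;   /* compute units                 */
    int simds_per_cu   = 4;    /* SIMD units per CU             */
    int waves_per_simd = 10;   /* wavefronts each SIMD can hold */
    int wave_size      = 64;   /* threads per wavefront         */

    int waves   = cus * simds_per_cu * waves_per_simd;   /* 1760    */
    int threads = waves * wave_size;                     /* 112,640 */

    printf("wavefronts in flight: %d\n", waves);
    printf("work items in flight: %d\n", threads);
    return 0;
}
```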

Pulling back out from the CU we arrive at the Shader Engine; this is a grouping of N CUs together with the geometry processor, rasterizer and ROP/render backend units - the geometry processor and rasterizer push work into the CUs; the ROPs take 'exports' and do the various graphics blending operations etc. to write data out.

Stepping back up from that again we come to the Global Data Store and L2 cache which is shared between all the Shader Engines.

Feeding all of this is the GPU front end, which consists of a Graphics Command Processor and Asynchronous Compute Engines (ACEs); AMD GPUs have one GCP and up to 8 ACEs, all of which operate independently of each other. The GCP handles traditional graphics programming tasks (as well as compute), whereas the ACEs are only for compute work. While the GCP only handles the graphics queue, the ACEs can handle multiple command queues (up to 8 each), meaning that you have 64+ ways of feeding commands into the GPU.

The ACEs can operate out-of-order internally (theoretically allowing you to do task graphs on the GPU) and per-cycle can create a workgroup and dispatch one wavefront from that workgroup to the CUs.
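
The '64+ ways' figure is likewise simple arithmetic (one graphics queue plus 8 ACEs with up to 8 queues each); a quick C sketch:

```
#include <stdio.h>

int main(void)
{
    /* GCN front end, numbers from the post above */
    int graphics_queues = 1;   /* the Graphics Command Processor's queue */
    int aces            = 8;   /* Asynchronous Compute Engines           */
    int queues_per_ace  = 8;   /* command queues each ACE can service    */

    printf("ways to feed the GPU: %d\n",
           graphics_queues + aces * queues_per_ace);   /* 65 */
    return 0;
}
```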

So, a compute flow would be;
- work is presented to GCP or ACE
- workgroup is created and wavefront dispatched to a CU
- CU associates wavefront with SIMD
- each clock cycle the CU's scheduler looks at a SIMD and dispatches work from the wavefronts on it.

Data fetches themselves in the CU are effectively 'raw' and pointer based; typically some VGPRs or SGPRs are used to pass in tables of data, effectively base addresses, from which the memory can be fetched. (There is a whole L1/L2 cache architecture in place.)
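
As a loose illustration of the 'tables of base addresses' idea, here's a hypothetical host-side C sketch; the struct layout, field names and stride handling are invented for illustration and are nothing like a real GCN resource descriptor:

```
#include <stdint.h>
#include <stdio.h>

/* hypothetical resource table: just base addresses a kernel would receive
 * through scalar registers and then offset per thread */
typedef struct {
    uint64_t vertex_buffer;    /* base address of vertex data */
    uint64_t constant_buffer;  /* base address of constants   */
    uint64_t output_buffer;    /* base address for results    */
} ResourceTable;

/* what a single "thread" conceptually does: base + thread_id * stride */
static float load_vertex_x(const ResourceTable *rt, uint32_t thread_id,
                           uint32_t stride_bytes)
{
    const float *addr = (const float *)(uintptr_t)
                        (rt->vertex_buffer + (uint64_t)thread_id * stride_bytes);
    return *addr;
}

int main(void)
{
    float verts[4][3] = { {1,0,0}, {2,0,0}, {3,0,0}, {4,0,0} };
    ResourceTable rt = { (uint64_t)(uintptr_t)verts, 0, 0 };

    for (uint32_t t = 0; t < 4; ++t)
        printf("thread %u reads x = %.1f\n",
               t, load_vertex_x(&rt, t, (uint32_t)sizeof(float) * 3));
    return 0;
}
```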

There are probably other things I've missed (bank conflicts on Local data store springs to mind...) but keep in mind this is specific to AMD's GCN architecture (and if you want to know more/details then AMD's developer page is a good place to go; white papers and presentations can be found there - even I had to reference one to keep the numbers/details straight in my head).

NV is slightly different, and the mobile architectures are going to be very different again (they work on a binned/tiled rendering system, so their data flow is different), as are the older GPUs - and in a few years probably the newer ones too.
For the gory details of a relatively simple[1] GPU, Broadcom have released documentation for the Raspberry Pi's GPU.


[1] Simpler than the AMD GCN described by phantom at least.

I mean, I can describe how a traditional CPU works down to the NAND gate level (and possibly further), but I'd be interested in learning about GPU internals more.

Phantom pretty much described how it works (in the current generation), but to give a very basic comparison to CPUs:

Take your i7 CPU: It has (amongst other things) various caches, scalar and vectorized 8-wide ALUs, 4 cores and SMT (intel calls it "hyperthreading") that allows for 2 threads per core.
Now strip out the scalar ALUs, ramp up the vectorized ALUs from 8-wide to 32-wide and increase their number, allow the SMT to run 64 instead of 2 "threads"/warps/wavefronts per core (note that on GPUs, every SIMD lane is called a thread), and put in 8 of those cores instead of just 4. Then increase all ALU latencies by a factor of about 3, all cache and memory latencies by a factor of about 10, and also increase memory throughput by a significant factor (don't have a number, sorry).
Add some nice stuff like texture samplers, shared memory (== local data store) and some hardware support for divergent control flows, and you arrive more or less at an NVidia GPU.
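
As a rough back-of-envelope comparison of that transformation (treating SMT threads, warps and SIMD lanes very loosely as 'lanes in flight', which is a big simplification), in C:

```
#include <stdio.h>

int main(void)
{
    /* the CPU -> GPU "transformation" from the post, as raw lane counts */
    int cpu_cores = 4, cpu_smt   = 2,  cpu_simd_width = 8;   /* i7-style     */
    int gpu_cores = 8, gpu_warps = 64, gpu_simd_width = 32;  /* NVidia-style */

    printf("CPU lanes in flight: %d\n", cpu_cores * cpu_smt   * cpu_simd_width); /*    64 */
    printf("GPU lanes in flight: %d\n", gpu_cores * gpu_warps * gpu_simd_width); /* 16384 */
    return 0;
}
```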

Again, Phantom's description is way more accurate, but if you think in CPU terms, those are probably the key differences.

Does such hardware operate on one 'lite' address space?

I understand from the above that in such hardware there are two kinds of threads: one is strictly parallel 'threads' that share the same instruction pointer (I lost track of how many such threads there are in this case, but someone mentioned around 22 thousand such scalar channels, or so), but there are also real, separate execution track machines, where each one has a separate instruction pointer and can execute a distinct chunk of assembly code.

If so, taking this second approach, the code to execute on each of those track machines must be provided to them in some form.

My main question is: how are those codes provided to them? Is there some overall assembly routine, some coordinating program, that assigns separate assembly routines to the track machines and coordinates them?

Two things:

Microcode and hardware.

As with any other processing unit, you can manage it in hardware (like Zilog's Z80 CPU) or, as is most common these days, with microprogramming. Some parts are better built in hardware; others are better built in some type of microcode.

The front-end of the GPU processes very high-level/complex instructions - basically the result of Draw/Dispatch commands from GL/D3D. This front-end reads/executes these commands, which results in work occurring in the shader cores.

E.g. The front-end might execute an instruction that says to execute a compute shader for 128x128 items. It then creates 128x128=16384 "threads" and groups them into 16384/64=256 "waves" (because it uses 64-wide SIMD to work on 64 "threads" at once). Each of these "waves" is like a CPU thread, having its own instruction pointer, execution state/register file, etc... The GPU then basically "hyperthreads" those 256 "waves". If it's only got 1 "processor", then it will only execute 1 wave (64 "threads") at a time. If it has to stall due to a cache miss/etc, it will save the execution state and switch to a different wave (which will have its own instruction pointer, etc).
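
The thread/wave arithmetic in that example, as a trivial C sketch:

```
#include <stdio.h>

int main(void)
{
    /* the example above: a 128x128 compute dispatch on 64-wide hardware */
    int items_x = 128, items_y = 128;
    int wave_size = 64;

    int threads = items_x * items_y;                      /* 16384 */
    int waves   = (threads + wave_size - 1) / wave_size;  /*   256 */

    printf("%d threads -> %d waves of %d\n", threads, waves, wave_size);
    return 0;
}
```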


Well, maybe most of it is clear; this is a set of processors that execute assembly code. Though one important thing was not clearly answered (I know that maybe it is hard to answer): is this set of processors fully programmable, or is some part of this general assembly code flow hardcoded in hardware?

For example, if there is a set of 64 worker processors (waves), there must be some scheduler/coordinator over them. Is this a fully programmable assembly executor, or is it something more constrained and not flexibly programmable?

Is this architecture like 1 scheduling processor + 64 worker processors?

I also do not know how constrained the worker 'processors' are - are they able to run any code like a real CPU?

The "worker processors", as you call them, are Turing complete.

Edit: I'm assuming that with "scheduling processor" you are referring to the warp/CU schedulers.
There is no scheduling processor. The warp schedulers (there can be more than one per core) are hardcoded. They have to make a decision every cycle, within the cycle. No piece of software can do that. They are like the SMT scheduler in the CPU: you might be able to influence them, but you can't program them. And while you can use _mm_pause to hint a yield on the CPU side, to my knowledge the common APIs do not support something similar for the GPU. It might be that the drivers can change certain scheduling policies, but if they can, they don't expose it in the APIs.

I think NVidia once considered putting a general-purpose ARM core on the GPU die for some driver/management stuff, but I think they never actually went through with it.


Hm, that's sad news. I had hoped it would be fully programmable: one scheduling processor and a number of working but flexible processors.

Now it seems that, as there is no scheduling processor, even though the working processors are physically flexible, the absence of such flexible scheduling commanders makes them less flexible to use (though those are speculations).

I don't quite see what this scheduling device is doing; I see you say it's something like microcode in a CPU, that is, dispatching one assembly stream into channels, blocks, etc. If so, does that mean that the GPU is able to execute only one input assembly stream and only parallelises it internally? So even if the IPs (instruction pointers) are separate, those processors are not free to use, as they are covered by something like a microcode manager?

PS: Is that input stream some assembly stream (that is later transformed into separate assembly streams in waves), or is this input stream more like an input array of data to process?



(again, focusing on AMD's as it has the most documentation out there).

You are thinking about things at the wrong level; the GPU is doing more than one thing at once across multiple SIMDs inside multiple compute units (CUs). When talking at this level it's generally best not to refer to the GPU at all, but to the internal units.

A stream of instructions is directed at a SIMD in a CU, and each SIMD can maintain 10 such instruction streams itself (so it has 10 instruction pointers). Each CU has four SIMDs, so it can keep 40 instruction streams in flight at once (each one made up of 64 threads, or instances, of the instruction stream, which can have their own data but execute the same instruction).

However, the SIMDs don't decide what is executed next, because the CU has shared resources the programs need to use, which is why each CU has a scheduler deciding what to run next. The simplest part of this is deciding which SIMD unit to look at for instruction streams (it uses a simple round-robin system); after that it looks at all the wavefronts/instruction streams being executed on that SIMD and decides what to run next.

The choice is based upon the current state of the CU; for example, if one wavefront wants to execute a scalar instruction but the scalar unit is currently busy, then it won't get to execute. The same goes for local memory reads and writes, as well as global reads and writes; if other wavefronts have taken up the resource then the work can't be carried out.

The reason this needs to be pretty quick is that each clock cycle the scheduler has to look at the state of up to 10 wavefronts and decide which instructions to execute; this isn't something which would work very well written in software, as a single clock cycle would, at best, be enough to run one instruction.
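
Purely as an illustration of that decision logic (the real hardware does this in fixed-function logic within a single cycle; the instruction types, the one-shot wavefronts and the issue limits here are simplified and made up), a rough C sketch of a scheduler tick - round-robin over the SIMDs, then issue at most one pending instruction of each type from that SIMD's wavefronts:

```
#include <stdbool.h>
#include <stdio.h>

#define NUM_SIMDS       4
#define WAVES_PER_SIMD 10

typedef enum { INSTR_NONE, INSTR_VALU, INSTR_VMEM, INSTR_SCALAR,
               INSTR_BRANCH, INSTR_LDS, INSTR_TYPE_COUNT } InstrType;

static const char *instr_names[] = { "none", "valu", "vmem", "scalar", "branch", "lds" };

/* next instruction each resident wavefront wants to issue (INSTR_NONE = idle) */
static InstrType next_instr[NUM_SIMDS][WAVES_PER_SIMD];

/* one scheduler tick: pick a SIMD round-robin, then issue at most one
 * instruction of each type from the wavefronts resident on that SIMD */
static void scheduler_tick(int cycle)
{
    int  simd = cycle % NUM_SIMDS;               /* round-robin SIMD selection */
    bool type_issued[INSTR_TYPE_COUNT] = { false };

    for (int w = 0; w < WAVES_PER_SIMD; ++w) {
        InstrType want = next_instr[simd][w];
        if (want != INSTR_NONE && !type_issued[want]) {
            type_issued[want] = true;            /* that unit is taken this cycle */
            printf("cycle %d: SIMD%d wave %d issues %s\n",
                   cycle, simd, w, instr_names[want]);
            next_instr[simd][w] = INSTR_NONE;    /* pretend the wave is now done  */
        }
        /* else: the unit this wave needs is busy, so it waits for a later cycle */
    }
}

int main(void)
{
    /* a few made-up pending instructions */
    next_instr[0][0] = INSTR_VALU;
    next_instr[0][1] = INSTR_VALU;    /* has to wait: only one VALU issue per tick */
    next_instr[0][2] = INSTR_SCALAR;
    next_instr[1][0] = INSTR_VMEM;

    for (int cycle = 0; cycle < 8; ++cycle)
        scheduler_tick(cycle);
    return 0;
}
```

Running it shows wave 1 on SIMD0 only getting its vector ALU instruction issued on a later pass, because wave 0 took that slot the first time SIMD0 was considered.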

So, if you want to think about it at the GPU level, then take the R290X version of the GCN core: it can be running 44 CUs * 4 SIMDs * 10 waves of work at any given time; that work could come from one program or from 1760 different programs/instruction streams (which equates to 112,640 instances of programs running at once), and every cycle each CU looks at one of its SIMDs and schedules work from the waves resident there.

