hypothetical raw gpu programming


I mean, I can describe how a traditional CPU works down to the NAND gate level (and possibly further), but I'd be interested in learning about GPU internals more.

Phantom pretty much described how it works (in the current generation), but to give a very basic comparison to CPUs:

Take your i7 CPU: It has (amongst other things) various caches, scalar and vectorized 8-wide ALUs, 4 cores and SMT (Intel calls it "hyperthreading") that allows for 2 threads per core.
Now strip out the scalar ALUs, ramp up the vectorized ALUs from 8-wide to 32-wide and increase their number, allow the SMT to run 64 instead of 2 "threads"/warps/wavefronts per core (note that on GPUs, every SIMD lane is called a thread) and put in 8 of those cores instead of just 4. Then increase all ALU latencies by a factor of about 3, all cache and memory latencies by a factor of about 10, and also memory throughput by a significant factor (don't have a number, sorry).
Add some nice stuff like texture samplers, shared memory (== local data store) and some hardware support for divergent control flows, and you arrive more or less at an NVidia GPU.
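
If it helps to see that mapping in code, here is a minimal CUDA sketch (not part of the original post; kernel and buffer names are just illustrative): each CUDA "thread" is one SIMD lane, a warp of 32 threads corresponds to one 32-wide vector instruction, and the hardware keeps many warps resident per core to hide the higher latencies mentioned above.

```cuda
// Minimal sketch (CUDA as a stand-in): each "thread" below is one SIMD lane,
// a warp of 32 threads maps to one 32-wide vector instruction, and many warps
// are kept resident per core so the scheduler can hide ALU/memory latency.
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Global lane index from the built-in block/thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid is rounded up to whole warps
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block = 8 warps; the hardware interleaves many such
    // warps per core, which is the GPU analogue of very wide SMT.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```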

Again, Phantom's description is way more accurate, but if you think in CPU terms, those are probably the key differences.

This is a nice picture, but there is still some confusion about what is called a thread here.

You describe some 32-wide float SIMDs - then you write about 64 threads, each one working on a 32-wide float vector? (Do you mean the official documentation calls each scalar channel a thread here, i.e. there would be 32x64 scalar threads?)

As to those 64 big threads, are they independent - does each one have its own code it executes (?) and its own instruction pointer?

That would be a clearer description than Phantom's, though Phantom gave more info about this scheduler thing.

When speaking about the scheduler, do I understand correctly that those 64 big threads are managed by that scheduler? Here I do not understand, or at least I'm not sure - I suspect that this scheduler sits between the workloads and those 64 big threads.

I suspect that each workload is a separate assembly program, and that threads are dynamically assigned to those workloads; maybe that could make some sense.

If this picture is correct, it would be like 64 cores, each one working on a 32-wide float SIMD, so it really is a whole bunch of processing power, but I'm not sure if the way I see it here is compatible with the description.

PS Anyway it seems clearer now; maybe the details are not that important, but I probably got the general idea:

an input assembly routine (or a few parallel routines) that is consumed by the scheduler and dispatched "in width" to up to 64 32-wide SIMD threads.

This is different than I thought, because it implies that this input assembly is defined over some width of data - I mean not normal scalar assembly, but some kind of width-assembly.

Yet my original question was how those input assembly routines are provided for execution, and also how the results are taken back (there must be some function pointers interpreted by the hardware as routines to execute, or something like that).

I am also curious about the results: if I provide three workloads, can I run them asynchronously, then get a signal that the first is done, then use its result as an input for some next workload, etc.? I mean, can I build some pre-scheduler loop that constantly provides workloads and consumes the results - that was the 'scheduling code' I had somewhat in mind. Is there something like that to run on the GPU, or is this just to write on the CPU side?


What is a SIMD unit, and what is its size - is it a float4/int4 vector on each SIMD? I suspect so, but I cannot be sure.
Further, there is talk about 10 wavefronts - why 10? How many CUs are in this card?

Also, I don't understand what "SIMD unit" means, and what "thread" means here: when speaking about 4 groups of 16 SIMD units, is it meant that there are 64 'threads', each one working on float4 (or int4, I don't know) 'data packs'?


A SIMD unit is the part of the CU which is doing VGPR based ALU work. So if you issue an instruction to add two vectors together this is the unit which does the work.
The SIMD units are scalar in nature however; you have 16 threads which execute the same instruction at the same time but the data is different. Each one is working on a 32 bit float or int, or 64bit double, during this work. This means that vectorised work requires more clock cycles to complete as they are done as separate operations. So a vec2 + vec2 would take 2 instructions per thread to complete (x+x & y+y).

A 'thread' is an instance of data grouped together; they aren't quite the same as CPU threads because CPU threads operate independently, whereas on a GPU you'll have a number of threads executing the same instruction (64 on AMD, 32 on NVidia are typical numbers). So instruction-wise they move in lock step but data-wise they are separate.

The 4 groups of SIMD units mean just that; you have 4 groups of 16 threads which are operating on different wavefronts independently. Work is never scheduled across SIMD units and once assigned to a SIMD unit it won't be moved off.

Each SIMD, per wavefront, works on 64 threads at a time; as it has room for 16 threads to be executing at once this means that for any given instruction it takes at least 4 clock cycles for it to complete and for more work to be issued. So, returning again to our vec2 + vec2 example this would take 8 clock cycles to complete (assuming 32bit float);

0 : Threads 0 - 15 execute x+x
1 : Threads 16 - 31 execute x+x
2 : Threads 32 - 47 execute x+x
3 : Threads 48 - 63 execute x+x
4 : Threads 0 - 15 execute y+y
5 : Threads 16 - 31 execute y+y
6 : Threads 32 - 47 execute y+y
7 : Threads 48 - 63 execute y+y

Note; this might not happen like this as between cycle 3 and 4 an instruction from a different wavefront might be issued so in wall clock time it could take longer than 8 cycles to complete the operation.

While this is how the GPU works internally we conceptually think of it as all 64 threads operating on the same instruction at the same time; because nothing can pre-empt the work during an instruction being operated on over the 4 clock cycles you can treat it as if all 64 operations happened at once as the observable result is the same.
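
To make the decomposition concrete, here is a minimal CUDA sketch (illustrative only, not GCN assembly): the vec2 addition in the source compiles down to two independent scalar float adds per thread, which the SIMD unit then issues component by component across the wavefront, as in the cycle list above.

```cuda
// Minimal sketch (CUDA, illustrative): a "vec2 + vec2" in source code is two
// independent scalar float adds per thread; the SIMD unit issues the
// x-components for the whole wavefront, then the y-components, exactly as in
// the cycle diagram above.
__global__ void add_vec2(int n, const float2 *a, const float2 *b, float2 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 r;
        r.x = a[i].x + b[i].x;   // scalar add #1 (x+x)
        r.y = a[i].y + b[i].y;   // scalar add #2 (y+y)
        out[i] = r;
    }
}
```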

The number of wavefronts per SIMD is simply a case of that's how the hardware was designed; probably AMD did some simulation work and, between that and the number of transistors required to support more, 10 was the sweet spot. It also makes sense given the nature of the scheduler: it can dispatch up to 5 instructions per clock from 5 of the 10 wavefronts, which means you theoretically have twice as many wavefronts 'waiting' to issue work as can be serviced, but it also means that if some stall out you have others waiting to take over. If, for example, wavefront 0 is waiting on data from memory to come in and can't issue work, then there are still another 9 to choose from to try and issue all 5 instructions from.

The number of CU depends on the cost of the GPU; a top end 290X will have 44, others will have less; this is just a function of the cost of the hardware, nothing more.

when speaking about the scheduler, do I understand correctly that those 64 big threads are managed by that scheduler? here I do not understand, or at least I'm not sure - I suspect that this scheduler sits between the workloads and those 64 big threads
I suspect that each workload is a separate assembly program
and threads are dynamically assigned to those workloads, maybe that could make some sense


The CU scheduler dispatches work to SIMD units; those SIMD units work in groups of 64 threads, 16 at a time, as described in my reply above.
The workloads can be separate or the same programs, depending on the work requirements; you could have 40 instances of the same program running, or 40 different programs working on the same CU, split across the 4 SIMDs.
The threads are not dynamically assigned; an instance of the program is assigned to the SIMD at start up, registers are allocated, and that work will always stay on that SIMD unit and will always execute in banks of 64 threads. (You can ask for fewer threads, but that just means cycles go to waste: the difference between what you asked for and the closest multiple of 64 at or above it is simply ignored. So if you only dispatch 32 work units then 32 threads go unused. If you dispatch 96 then you'll require two groups of 64 threads to be dispatched and again 32 will go unused.)

All allocation of workloads and registers is static for the life time of the program.
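
A sketch of what that means in practice, using CUDA as a stand-in and a 64-wide group to match the wavefront size discussed here (the names are illustrative): the launch is rounded up to whole groups and the surplus threads simply do nothing.

```cuda
// Minimal sketch (CUDA as a stand-in; a 64-wide group is used to match the
// wavefront size discussed here, even though CUDA's own hardware group is a
// 32-wide warp).  The launch is rounded up to whole groups; surplus threads
// fail the bound check and idle, which is the wasted work described above.
__global__ void process(int n, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;       // lanes beyond the requested work count do nothing
    data[i] *= 2.0f;
}

void launch(int n, float *d_data)
{
    const int groupSize = 64;                      // one "bank" of threads
    int groups = (n + groupSize - 1) / groupSize;  // e.g. n = 96 -> 2 groups (128 threads)
    process<<<groups, groupSize>>>(n, d_data);     // 32 of those 128 threads go unused
}
```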

this is different than I thought, because it implies that this input assembly is defined over some width of data - I mean not normal scalar assembly but some kind of width-assembly

yet my original question was how those input assembly routines are provided for execution and also how the results are taken back (there must be some function pointers interpreted by the hardware as routines to execute, or something like that)




There is no 'width assembly' (beyond the requirement to enable 64bit float mode, but that would be a mode switch in the instruction stream itself) as all SIMD units are scalar; vector operations in GLSL/HLSL/OpenCL are decomposed to scalar operations and these are what the SIMD units see. The number of workgroups required is handled outside the CU at the GPU command processor stage where either the graphics command processor or async compute engine consumes instruction packets to setup the CU to perform work.

The work is provided by the front end command processors which consume their own instruction stream.
The process for setting up an execution would look something like this;
- host DMAs program code into GPU memory
- command inserted into command processor's instruction stream telling it where to find the program code and the parameters for it
- command processor executes instructions to setup workgroup and dispatch work to CU
- CU scheduler is given data (internally routed) which includes address of program code in GPU memory
- CU scheduler assigns this address as the instruction pointer to the SIMD that will deal with it
- CU scheduler then schedules instructions from any SIMD workloads it has internally

This is very much like how a normal CPU works in many regards, in that an instruction pointer is loaded and execution proceeds from there; the only difference is that the program has to be uploaded by a host and then two schedulers are involved in dispatching the work (first as a group and then at a per-instance level).

To get the results back to the host you'd have to copy them back from GPU memory, either via a DMA transfer or by having the memory in the CPU's address range and accessing directly.
Either way you'd get whatever the gpu wrote out.

The GPU can also send back details to the host via a return channel/memory stream, which allows you to do things like look for markers and know when instructions are complete, so you know when it is safe to operate on the memory.
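
As a rough host-side sketch of that whole flow, here is what it looks like through the CUDA runtime (which hides the command-processor details, but the shape is the same: upload, kick off work, wait on a marker, read the results back; all names are illustrative).

```cuda
// Minimal sketch (CUDA runtime as a stand-in for the host-side flow above).
#include <cstdio>

__global__ void square(int n, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main()
{
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice); // DMA up

    cudaEvent_t done;
    cudaEventCreate(&done);

    square<<<n / 64, 64>>>(n, dev);   // command enters the GPU's instruction stream
    cudaEventRecord(done);            // "marker" written after the work completes

    cudaEventSynchronize(done);       // host waits on the marker
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost); // DMA back

    printf("host[3] = %f\n", host[3]);   // expect 9.0
    cudaEventDestroy(done);
    cudaFree(dev);
    return 0;
}
```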

I am also curious about the results: if I provide three workloads, can I run them asynchronously, then get a signal that the first is done, then use its result as an input for some next workload, etc.? I mean, can I build some pre-scheduler loop that constantly provides workloads and consumes the results - that was the 'scheduling code' I had somewhat in mind. Is there something like that to run on the GPU, or is this just to write on the CPU side?


In theory, if directed to the right front end command processor, then yes.
The async command processors in the GCN architecture can communicate with each other, which would allow you to set up task graphs between them to do as you say; this would be done using flags and signals in memory, with each ACE waiting on and signalling the correct one.
However last I checked this wasn't currently exposed on the PC.
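
As a hedged sketch of the same idea through a public API rather than the ACEs themselves, here is how "signal when the first workload is done, feed its result to the next" looks with CUDA streams and events (kernel names are made up for illustration). The GPU-side event plays the role of the in-memory signal, so the host only waits for the final result.

```cuda
// Minimal sketch (CUDA streams/events as a stand-in for the ACE-to-ACE
// signalling described above): workload B waits on a GPU-side signal from
// workload A without the host scheduling each step.
__global__ void workloadA(int n, float *buf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i;
}

__global__ void workloadB(int n, const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void run_chain(int n, float *d_a, float *d_b)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t aDone;
    cudaEventCreate(&aDone);

    workloadA<<<(n + 63) / 64, 64, 0, s0>>>(n, d_a);
    cudaEventRecord(aDone, s0);            // GPU-side "signal" after A finishes

    cudaStreamWaitEvent(s1, aDone, 0);     // B's queue waits on that signal
    workloadB<<<(n + 63) / 64, 64, 0, s1>>>(n, d_a, d_b);

    cudaStreamSynchronize(s1);             // host only waits for the final result

    cudaEventDestroy(aDone);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```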

Anyhow, I get talking about cool hardware and I start to ramble -- Long story short, yes you can do what you want to do today, but the tricky part is that GPUs just aren't organized like a CPU or even a bunch of CPUs (Midgard is the exception, Knights Landing to a lesser extent) and so you can't just expect it to run CPU-style code well. A big part of making code go fast on a GPU is partitioning the problem into manageable, cache-and-divergence-coherent chunks, which tends to either be super-straightforward (easy) or requires you to pull the solution entirely apart and put it back together in a different configuration (hard).

Is there something that could be mentioned as a cause for this?

(that each of those cores cannot execute in an 'independent direction'?)

I'm not sure if you're looking for search terms or what, but I can summarize the driving force behind why CPUs and GPUs are so different at a hardware level.

CPU cores are designed to do only one thing at a time. To get more stuff done per wall-clock-time-unit while working on only one thing at a time, CPUs have evolved to reduce the amount of time from reading an instruction to having the results. When you do only one thing at a time, the only way to do more is to do that one thing more quickly.

In a modern CPU, only between 1-2 percent of the transistors comprise the ALU, registers, and basic control decoding and execution logic that your CS100 class tells you a CPU is, and that you think of as doing the work of the processor -- the other 98 percent of transistors implement and manage structures that exist only to make that 2% "CS100 model" of the CPU go fast. Caches reduce memory latency, out-of-order execution finds instructions that are independent and sends them to available execution units at the same time -- SMT (hyperthreading) does the same but kicks in when there aren't enough independent instructions in a single program thread, branch predictors track which way your branches (ifs and loops) go most often so that the rest of the fancy footwork isn't mis-spent on doing the wrong thing. As a consequence of only doing one thing, that one thing can walk all over memory doing what it likes without dragging anyone else along -- it can branch here and there without much care.

GPU cores, on the other hand, are designed to do many versions of the same thing all at once. To get more stuff done per wall-clock-time-unit while being able to do many things at once, GPUs have evolved not to do the same number of versions in less time, but to do many more versions in the same time. When you can do many things at a time, the best way to do more is to just multiply the number of workers, rather than making fewer workers go faster -- the trick is to manage the complexity that results.

In a GPU, a much, much larger percentage of the transistors actually do the work that you think of it as doing. They achieve such a high percentage by simplifying or simply not having the kinds of structures that CPUs use to go fast, then making up for it by putting lots of copies of what's left on the chip. There are still caches, of course; those are always a win. Something like SMT (hyperthreading) is there, but in the simplified form Phantom described. There is no out-of-order execution and no branch prediction. The most important and fundamental thing that GPUs do to manage complexity, though, is that while each "thread" can have a certain amount of data input and state that's unique to itself, it *has* to share its code with ~64 (typical) siblings, and that program is executed in lock-step. The benefit of this to silicon complexity is that you spread that free-loading control logic that's not doing any "work" among 64 "workers" that are; the downside is that none of the workers can go off and do their own thing -- if the 64 workers reach an 'if' statement, and even 1 of the workers goes in a different direction than all the others, then all the workers have to do all the work of both the 'true' and the 'false' code-paths, then decide which version of the work to keep and which to throw away. On the hardware level, the 64 workers are inseparable, and so the programmer who wants his code to go fast on a GPU takes the burden on himself to try and make sure that his solution is structured in a way where all the workers can go down the 'true' or 'false' code-path together -- then all the throw-away work can be avoided and all the workers are doing useful work all of the time.
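
A tiny CUDA sketch of that divergence cost (illustrative; the functions and keys are made up): if neighbouring lanes disagree at the branch, the group executes both paths with lanes masked off, whereas data partitioned so that each group agrees runs only one path.

```cuda
// Minimal sketch (CUDA, illustrative): when lanes of the same group disagree
// at a branch, the hardware executes both paths with lanes masked off, so the
// group pays for 'true' plus 'false'.  Partitioning the data so that whole
// groups take the same path avoids the thrown-away work.
__device__ float path_true(float x)  { return x * x + 1.0f; }
__device__ float path_false(float x) { return x * 0.5f - 1.0f; }

__global__ void branchy(int n, const int *key, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // If key[] alternates per element, adjacent lanes diverge and both paths
    // run for every group.  If the input is pre-partitioned so that each
    // 64-element group shares one key value, only one path runs per group.
    if (key[i] != 0)
        data[i] = path_true(data[i]);
    else
        data[i] = path_false(data[i]);
}
```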

However, this GPU model of brigades of workers marching in lock-step is probably starting to sound rather restrictive compared to the free-wheeling CPU model, and you're absolutely right. CPUs, for all their single-mindedness, tackle problems in any shape with equal capability. But the hive-mind of GPUs is not so flexible, every problem must be adapted to its collective way of thinking, and the problems that easily lend themselves to this way of thinking are the ones that will shine most brightly on a GPU (this is also part of the reason GPUs have been able to change the underlying architecture so drastically, because the programmer speaks to the hive-mind, rather than the individual workers). Some kinds of problems don't map well at all, for a variety of reasons -- perhaps the code is too branchy, the dataset too small, the memory accesses too sparse, or the latency to travel the PCIe bus to the GPU and back again overwhelms any benefit of parallel execution. Luckily for GPUs, however, a great many interesting problems map naturally, or well enough with clever coding, that GPUs are extremely interesting for some problems -- aside from graphics, GPUs can be used in artificial vision, physics simulations, weather simulation, financial and risk analysis, gas and oil exploration, and many others.

All of the reasons above are why CPUs and GPUs are so different at the transistor level, and why you have to approach them so differently. If you approach one as the other, you trade away all of its strengths for its weaknesses -- and you will very likely find that the resulting code will run no better, if not worse, than where it ran to begin with.

throw table_exception("(╯°□°)╯︵ ┻━┻");

Alright, this got yet more confusing now, but overall it was helpful. I will read about it more, but I got some basic picture. (Right now I don't want to go deeper into it, but some day, when I read a bit and clarify things, I can return with more detailed questions.)

