"I mean, I can describe how a traditional CPU works down to the NAND gate level (and possibly further), but I'd be interested in learning about GPU internals more."

Phantom pretty much described how it works (in the current generation), but to give a very basic comparison to CPUs:
Take your i7 CPU: it has (amongst other things) various caches, scalar and vectorized 8-wide ALUs, 4 cores and SMT (Intel calls it "hyperthreading") that allows for 2 threads per core.
Now strip out the scalar ALUs, ramp up the vectorized ALUs from 8-wide to 32-wide and increase their number, allow the SMT to run 64 instead of 2 "threads"/warps/wavefronts per core (note that on GPUs, every SIMD lane is called a thread) and put in 8 of those cores instead of just 4. Then increase all ALU latencies by a factor of about 3, all cache and memory latencies by a factor of about 10, and also memory throughput by a significant factor (don't have a number, sorry).
Add some nice stuff like texture samplers, shared memory (== local data store) and some hardware support for divergent control flows, and you arrive more or less at an NVidia GPU.
Again, Phantom's description is way more accurate, but if you think in CPU terms, those are probably the key differences.
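One of the points above ("hardware support for divergent control flows", with every SIMD lane counted as a thread) can be sketched in code. This is a toy model in plain Python, not real hardware behavior or any GPU API; it just shows the lockstep-with-mask idea:

```python
# Toy model (not real hardware): one "warp" of 32 lanes executing an
# if/else in lockstep. The hardware keeps a mask of active lanes; both
# branches are walked through, and each lane commits a result only
# while its mask bit is set.
WARP_SIZE = 32

def run_warp(x):
    """x: list of 32 ints, one per lane. Computes x[i]*2 if even, else x[i]+1."""
    assert len(x) == WARP_SIZE
    out = [0] * WARP_SIZE
    mask_then = [v % 2 == 0 for v in x]   # lanes taking the 'then' branch
    # 'then' branch: stepped through by the whole warp, committed only
    # where the mask is set
    for i in range(WARP_SIZE):
        if mask_then[i]:
            out[i] = x[i] * 2
    # 'else' branch: same instructions-for-everyone idea, inverted mask
    for i in range(WARP_SIZE):
        if not mask_then[i]:
            out[i] = x[i] + 1
    return out

print(run_warp(list(range(32)))[:4])  # lanes 0..3 -> [0, 2, 4, 4]
```

The cost model falls out of this picture: when the lanes disagree on a branch, the warp pays for both sides.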
This is a nice picture, but there is still some confusion about what is called a thread here: you describe some float32 SIMDs, then you write about 64 threads, each one working on float32? (You mean the official documentation calls each scalar channel a thread here, i.e. there would be 32x64 scalar threads?)
As to those 64 big threads, are they independent, each one with its own code it executes and its own instruction pointer? That would be a clearer description than phantom's, though the phantom user gave more info about this scheduler thing. Speaking of the scheduler, I understand that those 64 big threads are managed by that scheduler? Here I do not understand, or at least I'm not sure; I suspect that this scheduler sits between the workloads and those 64 big threads.
I suspect that each workload is a separate assembly program and that threads are dynamically assigned to those workloads; maybe that would make some sense. If this picture is correct, it would be like 64 cores, each one working on a float32 SIMD, so it really is a whole bunch of processing power, but I'm not sure the way I see it here is compatible with the description.
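The role of the per-core scheduler mentioned above can be sketched as well. This is a deliberately simplified toy model (the names and numbers are illustrative, not a real hardware interface): a core keeps many resident warps, and every cycle it issues from one that is not stalled, which is how memory latency gets hidden:

```python
import collections

# Toy scheduler (simplified, not real hardware): a core holds a set of
# resident "big threads" (warps). Each cycle the scheduler issues an
# instruction from some warp that is ready; warps still waiting on a
# long-latency operation (e.g. a memory load) are skipped.
Warp = collections.namedtuple("Warp", "wid ready_at pc")

def pick_ready(warps, cycle):
    """Return the first resident warp whose stall has expired, or None."""
    for w in warps:
        if w.ready_at <= cycle:
            return w
    return None  # every warp is stalled: the core idles this cycle

warps = [Warp(wid=0, ready_at=5, pc=12),   # waiting on a memory load
         Warp(wid=1, ready_at=0, pc=40),   # ready now
         Warp(wid=2, ready_at=2, pc=8)]

print(pick_ready(warps, cycle=0).wid)  # warp 1 issues while warp 0 waits
```

With 64 resident warps per core instead of 3, there is almost always some warp ready to issue, which is the point of the large SMT count in the comparison above.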
PS: Anyway, it seems clearer now; maybe the details are not that important, but I probably got the general idea: an input assembly routine (or a few parallel routines) is consumed by the scheduler and dispatched "in width" to up to 64 32-wide SIMD threads. This is different than I thought, because it requires the input assembly to be defined over some width of data, i.e. not normal scalar assembly but some kind of width-assembly.
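For what it's worth, the "width-assembly" intuition is usually turned around in the SPMD model GPUs expose: the program you write looks scalar (one thread's work, parameterized by a lane/thread id), and the hardware fans it out across the width. A toy sketch of that model (the function names are mine, not any real API):

```python
# Toy SPMD model: 'kernel' below is ordinary scalar code for ONE
# thread; 'dispatch' fans it out across a 32-wide warp by varying the
# lane id. This is roughly how a GPU instruction stream can stay
# scalar-looking even though execution is 32-wide.
WARP_SIZE = 32

def kernel(lane_id, a, b):
    """Scalar per-thread program: each lane handles one element."""
    return a[lane_id] + b[lane_id]

def dispatch(kernel, *args):
    """Run one copy of the scalar kernel per lane, conceptually in lockstep."""
    return [kernel(lane, *args) for lane in range(WARP_SIZE)]

a = list(range(32))
b = [10] * 32
print(dispatch(kernel, a, b)[:3])  # -> [10, 11, 12]
```

So there is no special width-aware source program; the width comes from the dispatch, not from the code you write.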
Yet my original question was how those input assembly routines are provided for execution, and also how results are taken back (there must be some way, e.g. some function pointers interpreted by the hardware as routines to execute, or something like that). I am also curious about the results: if I provide three workloads, can I run them asynchronously, then get a signal that the first is done, then use its result as an input for some next workload, and so on? I mean, can I build some pre-scheduler loop that constantly provides workloads and consumes the results? That was the 'scheduling code' I had somewhat in mind. Is there something like that to run on the GPU, or is this just to be written on the CPU side?
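That pre-scheduler loop typically lives on the CPU side in current APIs: you enqueue workloads asynchronously and wait on completion signals (events/fences) before submitting dependent work. A minimal CPU-side sketch of the control flow, using a Python thread pool as a stand-in for a real GPU command queue (the workload functions are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# CPU-side "pre-scheduler" sketch: submit workloads asynchronously,
# wait for a completion signal, then feed the result into the next
# workload. A real GPU API would use command queues plus events or
# fences instead of a thread pool, but the control flow is the same.
def workload_a():
    return [i * i for i in range(8)]   # stand-in for a GPU job

def workload_b(data):
    return sum(data)                   # consumes workload_a's result

with ThreadPoolExecutor() as pool:
    fut_a = pool.submit(workload_a)    # enqueue; returns immediately
    result_a = fut_a.result()          # the "signal": block until done
    fut_b = pool.submit(workload_b, result_a)
    print(fut_b.result())              # -> 140
```

Independent workloads can be kept in flight simultaneously this way; only the dependent submission has to wait for the signal.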