When speaking about the scheduler, do I understand correctly that those 64 big threads are managed by that scheduler? This is the part I do not understand, or at least I'm not sure about; I suspect that the scheduler sits between the workloads and those 64 big threads.
I suspect that each workload is a separate assembly program
and threads are dynamically assigned to those workloads; maybe that would make some sense.
The CU scheduler dispatches work to SIMD units; those SIMD units work in groups of 64 threads, 16 at a time, as described in my reply above.
The workloads can be separate programs or the same program, depending on the work requirements; you could have 40 instances of the same program running, or 40 different programs working on the same CU, split across the 4 SIMDs.
The threads are not dynamically assigned; an instance of the program is assigned to a SIMD at start-up, registers are allocated, and that work will always stay on that SIMD unit and will always execute in banks of 64 threads. You can ask for fewer threads, but that just means cycles go to waste: the difference between what you asked for and the next multiple of 64 above it is ignored. So if you only dispatch 32 work units then 32 threads go unused. If you dispatch 96 then two groups of 64 threads have to be dispatched and again 32 go unused.
All allocation of workloads and registers is static for the lifetime of the program.
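The rounding described above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `wavefront_usage` and the constant `GROUP_SIZE` are made up for illustration and do not correspond to any real API.

```python
GROUP_SIZE = 64  # threads per group, as described above


def wavefront_usage(work_items):
    """Return (groups dispatched, threads left unused) for a dispatch size."""
    if work_items <= 0:
        return 0, 0
    groups = -(-work_items // GROUP_SIZE)  # ceiling division
    unused = groups * GROUP_SIZE - work_items
    return groups, unused


for n in (32, 64, 96):
    groups, unused = wavefront_usage(n)
    print(f"{n} work items -> {groups} group(s), {unused} thread(s) unused")
```

For 32 and 96 work items this reports 32 unused threads in both cases, matching the examples in the text.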
This is different from what I thought, because it implies the input assembly is defined over some width of data; I mean not normal scalar assembly but some kind of 'width assembly'.
Yet my original question was how those input assembly routines are provided for execution and also how the results are taken back (there must be some way, some function pointers interpreted by the hardware as routines to execute, or something like that).
There is no 'width assembly' (beyond the requirement to enable 64-bit float mode, but that would be a mode switch in the instruction stream itself) as all SIMD units are scalar; vector operations in GLSL/HLSL/OpenCL are decomposed to scalar operations and these are what the SIMD units see. The number of workgroups required is handled outside the CU, at the GPU command processor stage, where either the graphics command processor or an async compute engine consumes instruction packets to set up the CU to perform work.
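To make the point about decomposition concrete, here is a toy Python model of a source-level vec4 multiply being issued as four scalar instructions, each applied across all 64 threads of a group. It is an illustration under assumed names (`scalar_mul`, `vec4_mul`, `LANES`), not actual GCN instructions or compiler output.

```python
LANES = 64  # one group of threads running the same instruction


def scalar_mul(a, b):
    """One scalar instruction: every thread multiplies its own pair of values."""
    return [x * y for x, y in zip(a, b)]


def vec4_mul(va, vb):
    """A 'vec4 * vec4' from the shader source, issued as four scalar multiplies."""
    # va and vb each hold four components (x, y, z, w); every component is a
    # list of LANES values, one per thread.
    return [scalar_mul(ca, cb) for ca, cb in zip(va, vb)]


# Component c of thread t holds the value t + c; vb is 2.0 everywhere.
va = [[float(lane + c) for lane in range(LANES)] for c in range(4)]
vb = [[2.0] * LANES for _ in range(4)]
result = vec4_mul(va, vb)
print(len(result), "scalar instructions, each over", len(result[0]), "threads")
```

The program each thread runs only ever contains the scalar operations; the width of 64 comes from how many threads execute it together, not from the instruction encoding.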
The work is provided by the front end command processors which consume their own instruction stream.
The process for setting up an execution would look something like this:
- host DMAs program code into GPU memory
- command inserted into command processor's instruction stream telling it where to find the program code and the parameters for it
- command processor executes instructions to set up workgroup and dispatch work to CU
- CU scheduler is given data (internally routed) which includes address of program code in GPU memory
- CU scheduler assigns this address as the instruction pointer to the SIMD that will deal with it
- CU scheduler then schedules instructions from any SIMD workloads it has internally
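The steps above can be sketched as a toy Python model. Everything here is hypothetical: the command format, the addresses, the class and function names, and in particular the "least-loaded SIMD" placement policy are assumptions made for illustration, not a description of the real hardware.

```python
from collections import deque

NUM_SIMDS = 4  # SIMDs per CU, as in the discussion above


class ComputeUnit:
    def __init__(self):
        # Each SIMD keeps the instruction pointers of the work resident on it.
        self.simds = [[] for _ in range(NUM_SIMDS)]

    def accept(self, code_address):
        # Assumed policy: place the new work on the least-loaded SIMD.
        index = min(range(NUM_SIMDS), key=lambda i: len(self.simds[i]))
        self.simds[index].append(code_address)  # becomes the instruction pointer
        return index


def run_command_processor(command_stream, cu):
    """Consume the command stream and dispatch each group to the CU."""
    placements = []
    while command_stream:
        cmd = command_stream.popleft()
        if cmd["op"] == "dispatch":
            for _ in range(cmd["groups"]):
                placements.append(cu.accept(cmd["code"]))
    return placements


gpu_memory = {0x1000: "program code"}  # stands in for the host's DMA upload
stream = deque([{"op": "dispatch", "code": 0x1000, "params": 0x2000, "groups": 5}])
cu = ComputeUnit()
print(run_command_processor(stream, cu))  # SIMD index chosen for each group
```

Once placed, a group never moves: its address stays in the list of the SIMD it was given, which mirrors the static allocation described earlier.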
In many regards this is very much like how a normal CPU works, in that an instruction pointer is loaded and execution proceeds from there; the only differences are that the program has to be uploaded by a host and that two schedulers are involved in dispatching the work (first as a group and then at a per-instance level).
To get the results back to the host you'd have to copy them back from GPU memory, either via a DMA transfer or by having the memory in the CPU's address range and accessing it directly.
Either way you'd get whatever the GPU wrote out.
The GPU can also send details back to the host via a return channel/memory stream, which allows you to do things like look for markers and know when instructions are complete, so you know when it is safe to operate on the memory.
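The marker idea can be sketched on the host side with a thread standing in for the GPU. This is a simulation only: `fake_gpu`, `wait_for_fence` and the `shared` dictionary are hypothetical stand-ins for GPU work, a host-side wait and shared memory, and no real driver API is involved.

```python
import threading
import time

# Stands in for memory visible to both the GPU and the host.
shared = {"fence": 0, "result": None}


def fake_gpu(fence_value):
    """Pretend GPU: write the result first, then the completion marker."""
    shared["result"] = sum(range(1000))
    shared["fence"] = fence_value  # the marker is written last


def wait_for_fence(expected, timeout=5.0):
    """Host side: poll the marker until it reaches the expected value."""
    deadline = time.monotonic() + timeout
    while shared["fence"] < expected:
        if time.monotonic() > deadline:
            return False
        time.sleep(0.001)
    return True


worker = threading.Thread(target=fake_gpu, args=(1,))
worker.start()
ok = wait_for_fence(1)
worker.join()
print(ok, shared["result"])
```

The important part is the ordering: the host only reads `result` after it has seen the marker, which is what makes it safe to operate on the memory.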
I am also curious about how results are handled: if I provide three workloads, can I run them asynchronously, then get a signal
that the first is done, then use its result as an input for some next
workload, etc.? I mean, can I build some pre-scheduler loop
that constantly provides workloads and consumes the results? That was the 'scheduling code' I had somewhat in mind.
Is there something like this that runs on the GPU, or is it just something to write on the CPU side?
In theory, if directed to the right front end command processor, then yes.
The async command processors in the GCN architecture can communicate with each other, which would allow you to set up task graphs between them to do as you say; this would be done using flags and signals in memory, with each ACE waiting on and signalling the correct one.
However, last I checked, this wasn't exposed on the PC.
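The flag-and-signal pattern itself is easy to show with host threads standing in for the ACEs. This is a sketch of the dependency chain only; the stage functions, the `results` dictionary and the use of `threading.Event` as the "flag in memory" are assumptions for illustration and say nothing about how the hardware implements it.

```python
import threading

results = {}
done_a = threading.Event()  # flag set when stage A has written its output
done_b = threading.Event()  # flag set when stage B has written its output


def stage_a():
    results["a"] = 21
    done_a.set()  # signal whoever is waiting on A


def stage_b():
    done_a.wait()  # wait for A's flag before consuming its result
    results["b"] = results["a"] * 2
    done_b.set()


def stage_c():
    done_b.wait()  # wait for B's flag before consuming its result
    results["c"] = results["b"] + 1


# Start the stages in reverse order to show that the flags, not the launch
# order, decide when each stage actually runs.
threads = [threading.Thread(target=f) for f in (stage_c, stage_b, stage_a)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Each stage only touches its input after the producing stage has raised its flag, which is the same wait-then-signal handshake described for the ACEs.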