Hypothetical raw GPU programming

33 comments, last by fir 9 years, 9 months ago

Hm, that's sad news. I hoped it to be fully programmable: one scheduling processor and a couple of working but flexible processors.


That is actually the architecture behind the Cell processor. GPUs work differently. Phantom already described the GCN architecture, and for all intents and purposes of this discussion, the NVidia architectures are very similar. The biggest difference is that they had to invent new names for literally everything.

Now it seems that since there is no scheduling processor, even though the working processors are physically flexible, the absence of such flexible scheduling commanders makes them less flexible to use (though that is speculation).


While good scheduling is actually very non-trivial in a GPU, especially with those timing constraints, I don't see why a fixed scheduling policy would make the GPU less flexible. Think of it like this: on the CPU, you also don't control the schedulers, neither the hardware ones nor the software OS schedulers.

Maybe you should download Cuda/OpenCL/... and give it a spin. Things will be a lot clearer once you have actual hands-on experience.
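For illustration, a first hands-on experiment in CUDA might look like the sketch below (an untested, minimal example with made-up names): you only describe the per-thread work, and the hardware decides when and where the blocks actually run.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element; the block/thread indices give a global index.
__global__ void scaleKernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    // 256 threads per block; the GPU's schedulers decide where blocks run.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("done\n");
    return 0;
}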


Well, I thought that such GPU scheduling is quite different from CPU scheduling of threads - the CPU schedules threads for many apps, while on the GPU you want to write code whose scheduling you control yourself; you need some way to express that (I mean something like manual thread management in your app). So I think if this kind of scheduler is fixed, you cannot do such a thing as in normal desktop coding, where you could run 4 threads manually and assign tasks to them - or is this (or would it be) possible on the GPU too?

With the CPU schedulers, I meant the piece of hardware that schedules (micro) instructions from either one (most CPUs), two (bigger Intel and AMD), four (e.g. Sun Niagara) or eight (e.g. Sun SPARC T4) threads onto a bunch of execution units. They are basically the same as the CU/warp schedulers that schedule instructions from multiple warps/wavefronts onto the GPU's execution units. In both cases, you have no control over them.

For bigger work chunks, you can always use persistent threads on the GPU and then do some basic form of scheduling yourself, similarly to how you can create a bunch of worker threads on the CPU and then schedule tasks onto them.
But you cannot schedule the instructions onto the execution units yourself. Not on any GPU and not on any CPU.
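A rough, hypothetical sketch of that persistent-threads pattern in CUDA: blocks stay resident and pull task indices from a global counter, so the task-level scheduling happens in your code, while instruction-level scheduling stays with the hardware. All names here are made up for illustration.

#include <cuda_runtime.h>

__device__ unsigned int g_nextTask;  // global work counter; zero it from the host
                                     // via cudaMemcpyToSymbol before launching

// 'in' and 'out' are assumed to hold numTasks * blockDim.x floats.
__global__ void persistentKernel(const float* in, float* out, unsigned int numTasks)
{
    __shared__ unsigned int taskId;
    for (;;) {
        if (threadIdx.x == 0)
            taskId = atomicAdd(&g_nextTask, 1u);  // one thread grabs the next task
        __syncthreads();
        if (taskId >= numTasks)
            return;                               // queue drained, block exits
        // The whole block cooperates on task 'taskId'; placeholder work here.
        unsigned int i = taskId * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;
        __syncthreads();  // everyone is done reading taskId before the next grab
    }
}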


If by "schedule the instructions onto the execution units yourself." you mean the think i got on my mind i mean called cores on chosen assembly code buffers i imagine it this low lewel way - I imegined it such as desktop thread sheduling is managed by os and GPU has no OS (or does he?)

So I am trying to imagine this as just a set of processors and assembly chunks and procedures (the same way you see a processor working on linear RAM, and that's all). I would like to build this kind of picture, but related to GPU hardware.

I will reread the posts here, as at least half of it I didn't get at all (I have no time and skills to read many docs, but will try just to 'deduce' something from this info - it is worse but consumes less time; in the future I will try to read more).

PS: Does the GPU have an OS? If so, is it some assembly routines loaded by a driver, or maybe some kind of flash ROM OS, or maybe some hardware-coded 'microkernel for microcode', or what? (Sorry for the speculation and weak knowledge, but it is complex and hard to digest the info.)

With the CPU schedulers, I meant the piece of hardware that schedules (micro) instructions from either one (most CPUs), two (bigger Intel and AMD), four (e.g. Sun Niagara) or eight (e.g. Sun SPARC T4) threads onto a bunch of execution units. They are basically the same as the CU/warp schedulers that schedule instructions from multiple warps/wavefronts onto the GPU's execution units. In both cases, you have no control over them.

PS: It is confusing what you call CPU scheduling - 1) dispatching an assembly stream to some internal processor channels and 'stations', or 2) something like hyperthreading, where you have two assembly streams (two instruction pointers, etc.) but 'schedule' them onto one processor, or 3) normal scheduling of threads by the OS onto 4 cores, or something like that?

I got confused. Anyway, I think it is worth talking/reflecting on things at such a 'higher' level of abstraction to clarify the view (how the CPU works at a low level is well known to everybody; the GPU seems to be very much the same kind of system as the CPU - it is just like a second computer inside the first computer :/ )

Both companies do expose a kind of assembly language for recent GPUs if you look around for it. It is entirely possible to write assembly programs for the GPU or build a compiler that can target them. But the mapping isn't quite as 1:1 as on, say, x86 (even on x86 you're only talking to a logical 'model' of an x86 CPU, and the actual micro-code execution is more RISC-like).
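To make that concrete: on NVidia you can embed PTX (the virtual ISA that the 'logical model' remark refers to) directly in CUDA code via asm(); the driver then compiles it to the real hardware instructions, much like x86 decodes into micro-ops. A trivial, hedged example:

// Inline PTX in CUDA: add two integers with an explicit PTX instruction.
__device__ int addViaPtx(int a, int b)
{
    int result;
    asm("add.s32 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}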

It seems fun to me that present processors work like interpreters: they read the code and interpret/recompile it on the fly. Sad that programmers cannot, for example, reprogram this internal interpreter ;\ (there would be a processor program written to execute the assembly, and that program could itself be reprogrammed - it would be cool) - or, as another option, provide an already statically compiled version (as compiled code usually works faster than runtime interpretation).

Anyhow, I get talking about cool hardware and I start to ramble :) -- Long story short, yes, you can do what you want to do today, but the tricky part is that GPUs just aren't organized like a CPU or even a bunch of CPUs (Midgard is the exception, Knights Landing to a lesser extent), and so you can't just expect them to run CPU-style code well. A big part of making code go fast on a GPU is partitioning the problem into manageable, cache-and-divergence-coherent chunks, which tends to either be super straightforward (easy) or require you to pull the solution entirely apart and put it back together in a different configuration (hard).
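As a hypothetical illustration of the divergence part: within one warp/wavefront all lanes share a single instruction stream, so a data-dependent branch makes the hardware run both paths with inactive lanes masked off.

__global__ void divergentKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // If neighbouring threads disagree on this predicate, the warp executes
    // BOTH branches serially with lanes masked. Partitioning the data so
    // whole warps take the same path is the "pull it apart and put it back
    // together" step mentioned above.
    if (in[i] > 0.0f)
        out[i] = sqrtf(in[i]);      // path A
    else
        out[i] = -in[i] * in[i];    // path B
}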

Is there something that could be mentioned as the cause of this? (That each of those cores cannot execute in an 'independent direction'?)

I have no time and skills to read many docs, but will try just to 'deduce' something from this info - it is worse but consumes less time; in the future I will try to read more.


You might as well stop until you've got the time, then; my initial explanation contains pretty much all the details, but you are asking questions which are already answered - you just lack the base knowledge to make sense of them.

Your comparison with CPUs is still incorrect because a CPU only schedules instructions from a single stream per core/hardware thread; it requires the OS to task switch. A GPU is automatically scheduling work from up to 5 threads, from a group of 10, per clock BEFORE the instructions are decoded and run on the correct unit - and that is just the CU level.

You REALLY need to go and read plenty of docs if you didn't understand my explanation because this isn't an easy subject matter at all if you want to understand the low level stuff.


I have not read that yet (scanned it only); I read text on topics I don't know well quite slowly and get tired easily - probably I will read it tomorrow morning.


The basic unit of the GPU, the building block, is the "Compute Unit" or "CU" in their terminology.

The CU itself is made up of a scheduler, 4 groups of 16 SIMD units, a scalar unit, a branch/message unit, local data store, 4 banks of vector registers, a bank of scalar registers, texture filter units, texture load/store units and an L1 cache.

I have not read the whole thing yet (will do tomorrow), but some parts of it are confusing:

1 CU = 4 groups of 16 SIMD units

What is a SIMD unit, and what is its size - is it a float4/int4 vector on each SIMD? I may suspect so, but I cannot be sure.

Further, there is talk about 10 wavefronts - why 10? And how many CUs are in this card?

Also, I don't understand what 'SIMD unit' means, and what 'thread' means here - when speaking about 4 groups of 16 SIMD units, is it meant that there are 64 'threads', each one working on float4 (or int4, I don't know) 'data packs'?

This seems most probable to me, but I cannot be sure, and it is one more obstacle to understanding the further description.
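One hedged way to answer the "how many CUs / how wide is a SIMD unit" questions for a concrete card is simply to ask the runtime; in CUDA terms (where the CU is called an SM and the wavefront a warp) it might look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // multiProcessorCount = number of SMs (NVidia's counterpart of GCN CUs);
    // warpSize = SIMT width (32 on NVidia vs. the 64-wide GCN wavefront).
    printf("%s: %d SMs, warp size %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize);
    return 0;
}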

This topic is closed to new replies.
