
Hypothetical raw GPU programming


fir    460

I wonder if it would be possible to run raw assembly on a GPU.

(Maybe I should call them G-CPUs, since they are perhaps just a number of simplified CPUs of some kind, or something like that.)

Could someone elaborate on this - is a GPU just a row of simplistic CPUs?

What would such code look like - a number of memory spaces, each one filled with assembly code, which you then run?

 

 

fastcall22    10840
CPUs and GPUs are both processors, but they specialize in different areas: GPUs excel at massively parallel processing, whereas CPUs excel at general-purpose processing. You might want to look into OpenCL, which lets you run compute code on video cards.
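To give a concrete taste of that data-parallel style, here is a minimal sketch written in CUDA for the sake of a runnable example (OpenCL code looks very similar); the kernel name and sizes are just illustrative:

#include <cstdio>

// Each GPU thread handles exactly one array element; the loop over
// elements is replaced by launching one thread per element.
__global__ void add_arrays(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));       // unified memory for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // 256 threads per block, enough blocks to cover all n elements.
    add_arrays<<<(n + 255) / 256, 256>>>(a, b, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);                // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}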

fir    460

It depends very much on the internal architecture. Because GPUs have sort of hidden behind the problem they focus on, the underlying silicon has changed radically even in just the 10 or so years since they became really programmable. Off the top of my head, AMD had VLIW architectures with issue widths of 5 and 4 going back to the HD5x00 and HD6x00 series, single-SIMD-per-core before that, multiple-SIMD-per-core more recently in GCN 1.0, and GCN 1.1/2.0 with the same basic architecture but better integration into the system's memory hierarchy. From nVidia, you've had half-sized, double-pumped cores, single 'cores' with very many relatively independent ALUs, and most recently Maxwell, which shrunk the number of ALUs per core back down.

 

Both companies do expose a kind of assembly language for recent GPUs if you look around for it. It is entirely possible to write assembly programs for the GPU or build a compiler that targets them. But the mapping isn't quite as 1:1 as on, say, x86 (even on x86 you're only talking to a logical 'model' of an x86 CPU, and the actual micro-code execution is more RISC-like).

 

If you branch out from the PC and look at the mobile GPUs you find in phones and tablets, then you have tiled architectures too. ARM's latest Mali GPU architecture, Midgard, is something really unique: every small-vector ALU is completely execution-independent of any other, so every pixel can go down a different codepath with no penalty at all, which is something no other GPU can do. In a normal GPU the penalty for divergent branches (an 'if' where the condition is true for some pixels and false for others) grows quickly with the number of divergent branches in the codepath, which can become severe.
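As a rough illustration of that divergence cost on a typical SIMD GPU, here is a hedged CUDA kernel-only sketch (the function and array names are made up, and the exact cost model varies by hardware):

// Illustrative only: 'expensive_edge_filter', 'pixels' and 'is_edge' are made-up names.
__device__ float expensive_edge_filter(const float* p, int i, int n)
{
    float acc = 0.0f;
    for (int k = -4; k <= 4; ++k) {             // pretend this is costly
        int j = min(max(i + k, 0), n - 1);      // clamp to the valid range
        acc += p[j];
    }
    return acc / 9.0f;
}

__global__ void shade(const float* pixels, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    bool is_edge = pixels[i] > 0.5f;            // condition differs per lane
    if (is_edge) {
        // On a lockstep SIMD GPU the whole wave steps through this path...
        out[i] = expensive_edge_filter(pixels, i, n);
    } else {
        // ...and then also through this one, with inactive lanes masked off,
        // so a divergent wave pays for both sides. Midgard, as described
        // above, avoids that penalty because each ALU runs independently.
        out[i] = pixels[i];
    }
}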

 

Then, you have something similar in Intel's MIC platform, which was originally going to be a high-end GPU ~5 years ago. The upcoming incarnation of MIC is Knights Landing, which has up to 72 customized x86-64 cores based on the most recent Silvermont Atom core. It's been customized by having x87 floating point chopped off, each physical core runs 4 hyper-threads, each physical core has 2 512-bit SIMD units, and it's got up to 8 GB of on-package RAM on a 512-bit bus giving ~350 GB/s of bandwidth.

 

Anyhow, I get talking about cool hardware and I start to ramble. Long story short: yes, you can do what you want to do today, but the tricky part is that GPUs just aren't organized like a CPU or even a bunch of CPUs (Midgard is the exception, Knights Landing to a lesser extent), so you can't expect them to run CPU-style code well. A big part of making code go fast on a GPU is partitioning the problem into manageable, cache-and-divergence-coherent chunks, which tends to be either super straightforward (easy) or to require you to pull the solution entirely apart and put it back together in a different configuration (hard).

 

I have a problem understanding that, as my knowledge is low - I reread it a few times. Probably to know how a given GPU is exactly built I would have to work at one of the companies making them :/

But I can show my simple schematic picture of this and ask for some clarification if possible. To me the GPU world seems to consist of these parts:

- Input VRAM (containing some textures, geometry and other things)

- Output VRAM (containing some framebuffers etc.)

- Some CPUs (I don't know anything about this, but I imagine they are something like normal CPUs driven by some assembly, though maybe this assembly is a bit simpler (?); people also say they are close to x86 SSE assembly, at least by the type of registers? - I don't know)

- Some assembly program (or programs) - there must be some program if there are CPUs - but it's a total unknown to me whether this is one piece of code or many programs, one for each CPU, and whether those programs are clones of one program or all different

The question of what those programs look like is one unknown; the other important unknown is whether such hardware (I mean the GPU), when executing all this transformation from input VRAM to output VRAM, uses only those assembly programs and those CPUs, or whether it also has some other kind of hardware that does some transforms but is not a CPU-plus-assembly arrangement, some other 'hardware construct' (maybe something hardcoded in transistors, not programmable by assembly - if there are such things).

I'm speculating.

SeanMiddleditch    17565

I reread it a few times. Probably to know how a given GPU is exactly built I would have to work at one of the companies making them :/


Each individual GPU can use an entirely different instruction set, even within the same series of GPU. AMD rather publicly switched from a "VLIW5" to a "VLIW4" architecture recently, which necessitated an entirely different instruction set (and during the transition, some GPUs they released used the old version while other variations in the same product line used the new one). Even within a broad architecture like AMD's VLIW4, each card may have minor variations in its instruction set that are abstracted by the driver's low-level shader compiler.

Your only sane option is to compile to a hardware-neutral IR like SPIR (https://www.khronos.org/spir) or PTX (which is NVIDIA-specific). SPIR is the Khronos Group's intended solution to this problem; it allows a multitude of languages and APIs to target GPUs without having to deal with the unstable instruction sets.
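For the NVIDIA side specifically, you can look at both layers through the toolkit itself. A sketch, assuming a file named kernel.cu (exact flags can vary between CUDA toolkit versions):

# emit the portable PTX intermediate representation
nvcc -ptx kernel.cu -o kernel.ptx
# compile normally, then dump the GPU-specific machine code ("SASS")
nvcc -c kernel.cu -o kernel.o
cuobjdump -sass kernel.o

The PTX is the stable, documented layer; the SASS beneath it is the per-generation instruction set described above as unstable.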

Some CPUs (I don't know anything about this, but I imagine they are something like normal CPUs driven by some assembly, though maybe this assembly is a bit simpler (?); people also say they are close to x86 SSE assembly


GPUs are not like CPUs. They are massive SIMD units. They'd be most similar to doing SSE/AVX/AVX512 coding, except _everything_ is SIMD (memory fetches/stores, comparisons, branches, etc.). A program instance in a GPU is really a collection of ~64 or so cores all running in lockstep. That's why branching in GPU code is so bad; in order for one instance to go down a branch _all_ instances must go down that branch (and ignore the results of doing so on instances that shouldn't be on that branch).
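On NVIDIA hardware the lockstep group (a "warp") is 32 wide rather than 64, but the idea is the same. A small CUDA sketch of how a thread finds its group, assuming the block size is a multiple of 32 (the output array names are made up):

__global__ void show_lockstep(int* warp_of, int* lane_of, unsigned* vote)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;          // which lockstep group this thread belongs to
    int lane = tid % 32;          // its position within that group

    warp_of[tid] = warp;
    lane_of[tid] = lane;

    // All 32 lanes execute this together: __ballot_sync evaluates the
    // predicate on every lane named in the mask and hands each lane the
    // combined 32-bit result, which only works because they advance in step.
    vote[tid] = __ballot_sync(0xffffffffu, lane % 2 == 0);
}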

You might want to Google "gpu architecture" or check over those SPIR docs.

Bregma    9202

You guys are wayyy over my head with this stuff.  I'm kinda with fir on this; I only have a vague notion of what a GPU does, but I figure it's like he says, a vast array of memory as data input, a similar vast array as output, and a set of processors that read and process instructions from yet another array of memory to transform the input to the output.  Is that not the case?

 

Do all the processing units always work in lock-step or can they be divided into subgroups each processing a different program on different input sets?

 

Is there a separate processor that divides up the data and feeds it or controls the main array of processors as appropriate?

 

I mean, I can describe how a traditional CPU works down to the NAND gate level (and possibly further), but I'd be interested in learning about GPU internals more.

fir    460

Me too, especially to learn (/discuss the most important knowledge) in an easy way; docs harder than the Intel manuals can be an obstacle. Most important would be to get a picture of what this assembly code looks like and how it's executed - for example, whether it is some long linear assembly routine

like

 

start:

  ..assembly..

  ..assembly..

 

  ..assembly..

  ..assembly..

 

  ..assembly..

end.

 

one long routine that is given to a pack of 64 processors to consume, or whether it has some other structure.

 

Some parts of the pipeline are programmable by the client programmer, but what about the other parts - are those programmed by some internal assembly code, or what? It's hard to find answers, but it would be interesting to know.

Ohforf sake    2052

I mean, I can describe how a traditional CPU works down to the NAND gate level (and possibly further), but I'd be interested in learning about GPU internals more.

Phantom pretty much described how it works (in the current generation), but to give a very basic comparison to CPUs:

Take your i7 CPU: it has (amongst other things) various caches, scalar and vectorized 8-wide ALUs, 4 cores and SMT (Intel calls it "hyperthreading") that allows for 2 threads per core.
Now strip out the scalar ALUs, ramp up the vectorized ALUs from 8-wide to 32-wide and increase their number, allow the SMT to run 64 instead of 2 "threads"/warps/wavefronts per core (note that on GPUs, every SIMD lane is called a thread) and put in 8 of those cores instead of just 4. Then increase all ALU latencies by a factor of about 3, all cache and memory latencies by a factor of about 10, and also memory throughput by a significant factor (I don't have a number, sorry).
Add some nice stuff like texture samplers, shared memory (== local data store) and some hardware support for divergent control flow, and you arrive more or less at an NVidia GPU.

Again, Phantom's description is way more accurate, but if you think in CPU terms, those are probably the key differences.

fir    460

Does such hardware operate on one flat address space?

I understand from the above that in such hardware there are two kinds of threads: one is strictly parallel 'threads' that share the same instruction pointer (I lost track of how many such threads there are, but someone mentioned something like 22 thousand of those scalar channels), but there are also real separate execution-track machines, where each one has a separate instruction pointer and can execute a distinct chunk of assembly code.

If so - taking this second view - the code to execute on each of those track machines must be provided to them in some form. My main question is: how is that code provided to them? Is there some overseeing assembly routine, some program that assigns separate assembly routines to the track machines and coordinates them?

Tribad    981

Two things: microcode and hardware.

As with any other processing unit, you can implement it entirely in hardware, like Zilog's Z80 CPU, or, as is most common these days, with microprogramming. Some parts are better built in hardware; others are better built in some type of microcode.

Hodgman    51234
The front-end of the GPU processes very high-level/complex instructions - basically the result of Draw/Dispatch commands from GL/D3D. This front-end reads/executes these commands, which results in work occurring in the shader cores.

E.g. The front-end might execute an instruction that says to execute a compute shader for 128x128 items. It then creates 128x128=16384 "threads" and groups them into 16384/64=256 "waves" (because it uses 64-wide SIMD to work on 64 "threads" at once). Each of these "waves" is like a CPU thread, having its own instruction pointer, execution state/register file, etc... The GPU then basically "hyperthreads" those 256 "waves". If it's only got 1 "processor", then it will only execute 1 wave (64 "threads") at a time. If it has to stall due to a cache miss/etc, it will save the execution state and switch to a different wave (which will have its own instruction pointer, etc).
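In CUDA terms, that 128x128 dispatch might be written roughly like this (a sketch with a made-up kernel name; on NVIDIA hardware the waves are 32 wide, so the 16384 threads would become 16384/32 = 512 of them):

__global__ void process_item(float* items)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    items[y * 128 + x] *= 2.0f;   // one "thread" per item of the 128x128 grid
}

// Host side: dispatch the 128x128 items as 16x16 groups of 8x8 threads each.
// The GPU front-end packs those 16384 threads into waves on its own.
void launch(float* items_on_device)
{
    dim3 block(8, 8);
    dim3 grid(128 / 8, 128 / 8);
    process_item<<<grid, block>>>(items_on_device);
}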

fir    460

The front-end of the GPU processes very high-level/complex instructions - basically the result of Draw/Dispatch commands from GL/D3D. This front-end reads/executes these commands, which results in work occurring in the shader cores.

E.g. The front-end might execute an instruction that says to execute a compute shader for 128x128 items. It then creates 128x128=16384 "threads" and groups them into 16384/64=256 "waves" (because it uses 64-wide SIMD to work on 64 "threads" at once). Each of these "waves" is like a CPU thread, having its own instruction pointer, execution state/register file, etc... The GPU then basically "hyperthreads" those 256 "waves". If it's only got 1 "processor", then it will only execute 1 wave (64 "threads") at a time. If it has to stall due to a cache miss/etc, it will save the execution state and switch to a different wave (which will have its own instruction pointer, etc).

 

Well, maybe most of it is clear: this is a set of processors that execute assembly code. Though one important thing was not clearly answered (I know it may be hard to answer) - whether this set of processors is fully programmable, or whether some part of this general assembly code flow is hardcoded in hardware.

For example, if there is a set of 64 worker processors (waves), there must be some scheduler/coordinator over them - is that a fully programmable assembly executor, or is it something more constrained and not flexibly programmable?

Is this architecture like 1 scheduling processor + 64 worker processors?

I also do not know how constrained the worker 'processors' are - are they able to run any code, like a real CPU?

Ohforf sake    2052
The "worker processors", as you call them, are turing complete.

Edit: I'm assuming, with "scheduling processor" you are referring to the warp/CU schedulers.
There is no scheduling processor. The warp schedulers (there can be more then one per core) are hardcoded. They have to make a decision every cycle, within the cycle. No piece of software can do that. They are like the SMT scheduler in the CPU, you might be able to influence them, but you can't program them. And while you can use _mm_pause to hint a yield on the CPU side, to my knowledge the common APIs do not support s.th. similar for the GPU. It might be, that the drivers can change certain scheduling policies, but if they can they don't expose it in the APIs.

I think NVidia once considered, putting a general purpose ARM core on the GPU die for some driver/management stuff, but I think they never actually went through with it. Edited by Ohforf sake

fir    460

The "worker processors", as you call them, are turing complete.

Edit: I'm assuming, with "scheduling processor" you are referring to the warp/CU schedulers.
There is no scheduling processor. The warp schedulers (there can be more then one per core) are hardcoded. They have to make a decision every cycle, within the cycle. No piece of software can do that. They are like the SMT scheduler in the CPU, you might be able to influence them, but you can't program them. And while you can use _mm_pause to hint a yield on the CPU side, to my knowledge the common APIs do not support s.th. similar for the GPU. It might be, that the drivers can change certain scheduling policies, but if they can they don't expose it in the APIs.

I think NVidia once considered, putting a general purpose ARM core on the GPU die for some driver/management stuff, but I think they never actually went through with it.

Hm, that's sad news; I hoped it would be fully programmable - one scheduling processor and a couple of working but flexible processors.

Now it seems that since there is no scheduling processor, even though the working processors are physically flexible, the absence of such a flexible scheduling commander makes them less flexible to use (though that is speculation).

I don't quite see what this scheduling device is doing; I see you say it's something like microcode in a CPU, dispatching one assembly stream into channels, blocks, etc. If so, does that mean the GPU is able to execute only one input assembly stream and only parallelizes it internally? So even if the IPs (instruction pointers) are separate, those processors are not free to use, since they are covered by something like a microcode manager?

PS: is that input stream some assembly stream (that is later transformed into separate assembly streams in waves), or is it more like an input array of some data to process?


_the_phantom_    11250

I don't quite see what this scheduling device is doing; I see you say it's something like microcode in a CPU, dispatching one assembly stream into channels, blocks, etc. If so, does that mean the GPU is able to execute only one input assembly stream and only parallelizes it internally? So even if the IPs (instruction pointers) are separate, those processors are not free to use, since they are covered by something like a microcode manager?


(Again, focusing on AMD as it has the most documentation out there.)

You are thinking about things at the wrong level; the GPU is doing more than one thing at once across multiple SIMDs inside multiple compute units (CUs) - when talking at this level it's generally best not to refer to the GPU at all but to the internal units.

A stream of instructions is directed at a SIMD in a CU, and each SIMD can maintain 10 such instruction streams itself (so it has 10 instruction pointers). Each CU has four SIMDs, so it can keep 40 instruction streams in flight at once (each one made up of 64 threads, or instances, of the instruction stream, which can have their own data but execute the same instruction).

However, the SIMDs don't decide what is executed next, because the CU has shared resources the programs need to use, which is why each CU has a scheduler deciding what to run next. The simplest part of this is deciding which SIMD unit to look at to get each instruction stream (it uses a simple round-robin system); after that it looks at all the wavefronts/instruction streams being executed and decides what to run next.

The choice is based upon the current state of the CU; for example if one wavefront wants to execute a scalar instruction but the scalar unit is currently busy then it won't get to execute. Same goes for local memory reads and writes as well as global reads and writes; if other SIMD wavefronts have taken up the resource then the work can't be carried out.

The reason this needs to be pretty quick is each clock cycle the scheduler has to look at the state of up to 10 wavefronts and decide which instructions to execute; this isn't something which is going to work very well if written in software as a single clock cycle would, at best, be enough to run one instruction.

So, if you want to think about it at the GPU level: if we take the R9 290X version of the GCN core, it can be running 44 CU * 4 SIMD * 10 waves of work at any given time; that work could be from one program or from 1760 different programs/instruction streams (which equates to 112,640 instances of programs running at once), and every cycle 1/10th of those are looked at and work is scheduled to run.
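Spelling out that arithmetic as plain host-side code, using only the figures quoted in the post above:

#include <cstdio>

int main()
{
    // Figures quoted above for the R9 290X flavour of GCN.
    const int compute_units    = 44;
    const int simds_per_cu     = 4;
    const int waves_per_simd   = 10;
    const int threads_per_wave = 64;

    int waves   = compute_units * simds_per_cu * waves_per_simd;  // 1760
    int threads = waves * threads_per_wave;                       // 112640

    printf("wavefronts in flight: %d\n", waves);
    printf("thread instances:     %d\n", threads);
    return 0;
}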

Ohforf sake    2052

Hm, that's sad news; I hoped it would be fully programmable - one scheduling processor and a couple of working but flexible processors.


That is actually the architecture behind the Cell processor. GPUs work differently. Phantom already described the GCN architecture, and for all intents and purposes of this discussion, the NVidia architectures are very similar. The biggest difference is that they had to invent new names for literally everything.

Now it seems that since there is no scheduling processor, even though the working processors are physically flexible, the absence of such a flexible scheduling commander makes them less flexible to use (though that is speculation).


While good scheduling is actually far from trivial in a GPU, especially under those time constraints, I don't see why a fixed scheduling policy would make the GPU less flexible. Think of it like this: on the CPU you also don't control the schedulers, neither the hardware ones nor the software OS schedulers.

Maybe you should download CUDA/OpenCL/... and give it a spin. Things will be a lot clearer once you have actual hands-on experience.

fir    460

 

Hm, that's sad news; I hoped it would be fully programmable - one scheduling processor and a couple of working but flexible processors.


That is actually the architecture behind the Cell processor. GPUs work differently. Phantom already described the GCN architecture, and for all intents and purposes of this discussion, the NVidia architectures are very similar. The biggest difference is that they had to invent new names for literally everything.

Now it seems that since there is no scheduling processor, even though the working processors are physically flexible, the absence of such a flexible scheduling commander makes them less flexible to use (though that is speculation).


While good scheduling is actually far from trivial in a GPU, especially under those time constraints, I don't see why a fixed scheduling policy would make the GPU less flexible. Think of it like this: on the CPU you also don't control the schedulers, neither the hardware ones nor the software OS schedulers.

Maybe you should download CUDA/OpenCL/... and give it a spin. Things will be a lot clearer once you have actual hands-on experience.

 

 

Well, I thought that such GPU scheduling is quite different from CPU scheduling of threads - the CPU schedules threads for many apps, while here on the GPU you want to write code whose scheduling you control yourself; you need some way to express that (I mean something like manual thread management in your app). So I think that if this kind of scheduler is fixed, you cannot do the kind of thing you do in normal desktop coding, where you can run 4 threads manually and assign tasks to them - or is this (or would it be) possible on the GPU too?

Ohforf sake    2052
With the CPU schedulers, I meant the piece of hardware that schedules (micro) instructions from either one (most CPUs), two (bigger Intel and AMD), four (e.g. Sun Niagara) or eight (e.g. Sun SPARC T4) threads onto a bunch of execution units. They are basically the same as the CU/warp schedulers, which schedule instructions from multiple warps/wavefronts onto the GPU's execution units. In both cases, you have no control over them.

For bigger work chunks, you can always use persistent threads on the GPU and then do some basic form of scheduling yourself, similar to how you can create a bunch of worker threads on the CPU and then schedule tasks onto them.
But you cannot schedule the instructions onto the execution units yourself. Not on any GPU, and not on any CPU.
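A rough CUDA sketch of that persistent-threads idea, in which a fixed pool of blocks keeps pulling task indices from a global counter, so the program does its own coarse scheduling on top of whatever the hardware schedulers do (the counter, kernel and task layout are all made up for illustration):

// Global work counter; zero it with cudaMemcpyToSymbol before each launch.
__device__ int next_task;

__global__ void persistent_worker(float* tasks, int task_count)
{
    while (true) {
        // Thread 0 of each block grabs the next task index for the whole block.
        __shared__ int my_task;
        if (threadIdx.x == 0)
            my_task = atomicAdd(&next_task, 1);
        __syncthreads();

        if (my_task >= task_count)
            return;                        // no work left, this block retires

        // Pretend "processing a task" means scaling one value per thread.
        int i = my_task * blockDim.x + threadIdx.x;
        tasks[i] *= 2.0f;
        __syncthreads();                   // everyone done before the next grab
    }
}

Here the hardware still decides which waves issue instructions each cycle; the atomically incremented counter only decides which chunk of work each block picks up next.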

fir    460

With the CPU schedulers, I meant the piece of hardware that schedules (micro) instructions from either one (most CPUs), two (bigger Intel and AMD), four (e.g. Sun Niagara) or eight (e.g. Sun SPARC T4) threads onto a bunch of execution units. They are basically the same as the CU/warp schedulers, which schedule instructions from multiple warps/wavefronts onto the GPU's execution units. In both cases, you have no control over them.

For bigger work chunks, you can always use persistent threads on the GPU and then do some basic form of scheduling yourself, similar to how you can create a bunch of worker threads on the CPU and then schedule tasks onto them.
But you cannot schedule the instructions onto the execution units yourself. Not on any GPU, and not on any CPU.

If by "schedule the instructions onto the execution units yourself." you mean the think i got on my mind i mean called cores on chosen assembly code buffers i imagine it this low lewel way - I imegined it such as desktop thread sheduling is managed by os and GPU has no OS (or does he?)

so i m trying to imagine this just a set of processors and assembly chunks and procedures (same as you see a processor working on a linear ram and thats all0 I would like to build this time of picture but related to gpu hardware

I will reread the posts here yet as at least a half of it i didnt get at all (got no time and skills to read many docs but will try just to 'deduce' something from this info - it is worse but consumes less time, than in futre i will try to read more )

 

ps has GPU the OS? If so is it some asseembly routines loaded by some driver, or maybe some kind of flash rom os, or maybe some hardware coded 'microkernel for microcode' or what? (sory for speculations and weak knowledge but it is complex etc, hard to digest the info)


fir    460

With the CPU schedulers, I meant the piece of hardware that schedules (micro) instructions from either one (most CPUs), two (bigger Intel and AMD), four (e.g. Sun Niagara) or eight (e.g. Sun SPARC T4) threads onto a bunch of execution units. They are basically the same as the CU/warp schedulers, which schedule instructions from multiple warps/wavefronts onto the GPU's execution units. In both cases, you have no control over them.
 

 

PS: it is confusing what you call CPU scheduling - 1) dispatching an assembly stream to some internal processor channels and 'stations', or 2) something like hyperthreading, where you have two assembly streams (two instruction pointers etc.) but you 'schedule' them onto one processor, or 3) the normal scheduling of threads by the OS onto 4 cores, or something like that?

I got confused. Anyway, I think it is worth talking about/reflecting on things at such a 'higher' level of abstraction to clarify the view of things (how a CPU works at a low level is well known to everybody; the GPU seems to be very much the same kind of system as a CPU - it's like a second computer inside the first computer :/)
