Hypothetical raw GPU programming

Started by fir
33 comments, last by fir 9 years, 9 months ago

I wonder if it would be possible to run raw assembly on a GPU (maybe I should call them G-CPUs, as those are maybe just a number of some kind of simplified CPUs, or something).

Could maybe someone elaborate on this - is a GPU just a row of simplistic CPUs?

How would such code look - a number of memory spaces, and then I fill each one with assembly code and run them?

CPUs and GPUs are both processors, but they specialize in different areas. GPUs excel at massively parallel processing whereas CPUs excel more at general processing. You might want to look into OpenCL, which does support video cards.

It depends very much on the internal architecture. Because GPUs have sort of hidden behind the problem they focus on, the underlying silicon has been changed radically even in just the 10 or so years since they've become really programmable. Off the top of my head, from AMD you had VLIW with an issue width of 5 and then 4 going back to the HD5x00 and HD6x00 series, single-SIMD-per-core before that, multiple-SIMD-per-core recently in GCN 1.0, and GCN 1.1/2.0 with the same basic architecture as that but better integration into the system's memory hierarchy. From nVidia, you've had half-sized, double-pumped cores, single 'cores' with very many relatively-independent ALUs, and most recently (Maxwell) a design that shrank back the number of ALUs per core.

Both companies do expose a kind of assembly language for recent GPUs if you look around for it. It is entirely possible to write assembly programs for the GPU or build a compiler that can target them. But the mapping isn't quite as 1-to-1 as on, say, x86 (and even on x86 you're only talking to a logical 'model' of an x86 CPU, while the actual micro-code execution is more RISC-like).
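
To make that concrete, here is a minimal hypothetical sketch of how close you can get on the NVIDIA side without touching the real ISA: CUDA lets you embed PTX (NVIDIA's virtual assembly) in a kernel through inline asm. The kernel name and the add-two-arrays task are invented for illustration; the driver still translates the PTX to the card-specific machine code.

// Hypothetical sketch: embedding PTX, NVIDIA's virtual assembly, in a CUDA kernel.
__global__ void add_with_ptx(const int* a, const int* b, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        int r;
        // add.s32 is a PTX instruction; %0..%2 are virtual register operands.
        asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a[i]), "r"(b[i]));
        out[i] = r;
    }
}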

If you branch out from the PC and look at mobile GPUs that you find in phones and tablets, then you have tiled architectures too. ARM's latest Mali GPU architecture, Midgard, is something really unique: every small-vector ALU is completely execution-independent of every other, and as a consequence every pixel could go down a different codepath for no penalty at all, which is something no other GPU can do. In a normal GPU the penalty for divergent branches (an 'if' where the condition is true for some pixels and false for others) is proportional to the square of the number of divergent branches in the codepath, which can quickly become severe.
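
To picture what a divergent branch means on a normal GPU, here is a hypothetical CUDA sketch (the condition and both branch bodies are made up): when the lanes of one lockstep group disagree on the 'if', the group runs both sides one after the other with the inactive lanes masked off.

// Divergence sketch: lanes of one warp/wavefront that take different sides of
// the 'if' force the group to execute BOTH paths, masking off inactive lanes.
__global__ void divergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)              // some lanes true, some false -> divergence
        out[i] = sqrtf(in[i]);     // path A runs with the 'false' lanes masked
    else
        out[i] = 0.0f;             // path B runs with the 'true' lanes masked
}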

Then you have something similar in Intel's MIC platform, which was originally going to be a high-end GPU ~5 years ago. The upcoming incarnation of MIC is Knights Landing, which is up to 72 customized x86-64 processors based on the most recent Silvermont Atom core. It's been customized by having x87 floating point chopped off, each physical core runs 4 hyper-threads, each physical core has 2 512-bit SIMDs, and it's got up to 8 GB of on-package RAM on a 512-bit bus giving ~350 GB/s of bandwidth.

Anyhow, I get talking about cool hardware and I start to ramble :) -- Long story short, yes, you can do what you want to do today, but the tricky part is that GPUs just aren't organized like a CPU or even a bunch of CPUs (Midgard is the exception, Knights Landing to a lesser extent), so you can't just expect them to run CPU-style code well. A big part of making code go fast on a GPU is partitioning the problem into manageable, cache-and-divergence-coherent chunks, which tends to either be super-straightforward (easy) or require you to pull the solution entirely apart and put it back together in a different configuration (hard).
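
As one hypothetical illustration of that chunking, here is the standard CUDA pattern of staging a tile of the input in on-chip shared memory and having the block work on it cooperatively; the kernel name, the 256-thread tile size and the sum-reduction task are all invented for the sketch.

// Partitioning sketch: each block copies one tile into fast on-chip shared
// memory, reduces it cooperatively, and writes a single partial sum to VRAM.
__global__ void tile_sum(const float* in, float* partial, int n)
{
    __shared__ float tile[256];                  // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction inside the tile; the active lanes stay contiguous,
    // so divergence and cache behaviour remain coherent within the chunk.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];           // one value per chunk
}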

throw table_exception("(╯°□°)╯︵ ┻━┻");


I've got a problem understanding that, as my knowledge is low - I reread it a few times. Probably to know how (some) GPU is exactly built I would have to work at some company making them :/

But I can show my simple schematic picture of this and ask for some clarification if possible. To me the GPU world seems to consist of these parts:

- Input VRAM (containing some textures, geometry and such things)

- Output VRAM (containing some framebuffers etc.)

- Some CPUs (I don't know anything about these, but I imagine they are something like normal CPUs driven by some assembly, though maybe this assembly is a bit simpler (?). They also say they are close to x86 SSE assembly, at least by the type of registers? I don't know.)

- Some assembly program (or programs) - there must be some program if there are CPUs - but it is a total unknown to me whether this is one piece of code or many programs, one for each CPU, and whether those programs are clones of one program or different from each other.

This question of what those programs look like is one unknown; the other important unknown is whether such hardware (I mean the GPU), when executing all this transformation from input VRAM to output VRAM, uses only those assembly programs and those CPUs, or whether it also has some other kind of hardware that does some transforms but is not such a CPU+assembly thing - some other 'hardware construct' (maybe something hardcoded in transistors, not programmable by assembly - if there are such things).

I'm speculating.

Probably to know how (some) GPU is exactly built I would have to work at some company making them :/


Each individual GPU can use an entirely different instruction set, even in the same series of GPU. AMD rather publicly switched from a "VLIW5" to a "VLIW4" architecture recently, which necessitates an entirely different architecture (and during the transition, some GPUs they released used the old version while other variations in the same product line used the new version). Even within a broad architecture like AMD's VLIW4, each card may have minor variations in its instruction set that are abstracted by the driver's low-level shader compiler.

Your only sane option is to compile to a hardware-neutral IR like SPIR (https://www.khronos.org/spir) or PTX (which is NVIDIA-specific). SPIR is the Khronos group's intended solution to this problem: it allows a multitude of languages and APIs to all target GPUs without having to deal with the unstable instruction sets.
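
As a hypothetical sketch of what that IR route looks like in practice on the NVIDIA side (SPIR would be the analogous path through OpenCL), you can ask the compiler to stop at PTX and inspect the result; the file and kernel names here are made up.

// ptx_demo.cu -- emit the virtual ISA instead of machine code with:
//   nvcc -ptx ptx_demo.cu -o ptx_demo.ptx
// The .ptx output is human-readable virtual assembly; the driver JIT-compiles
// it to the real, card-specific instruction set when the program is loaded.
__global__ void scale(float* data, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}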

Some CPUs (I don't know anything about these, but I imagine they are something like normal CPUs driven by some assembly, though maybe this assembly is a bit simpler (?). They also say they are close to x86 SSE assembly)


GPUs are not like CPUs. They are massive SIMD units. They'd be most similar to doing SSE/AVX/AVX512 coding, except _everything_ is SIMD (memory fetches/stores, comparisons, branches, etc.). A program instance in a GPU is really a collection of ~64 or so cores all running in lockstep. That's why branching in GPU code is so bad; in order for one instance to go down a branch _all_ instances must go down that branch (and ignore the results of doing so on instances that shouldn't be on that branch).
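
Here's a small hypothetical CUDA sketch of that "collection of lanes in lockstep" view (on NVIDIA hardware the lockstep group is a 32-wide warp rather than 64): each thread can work out which lane of the group it is, and warp-level instructions let the lanes trade registers directly precisely because they execute together. The kernel name and the per-warp sum are invented; the output buffer is assumed to have one slot per warp launched.

// Lockstep sketch: the 32 lanes of a warp execute together, so they can swap
// registers with warp shuffles instead of going through memory.
__global__ void warp_sum(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;       // position inside the 32-wide lockstep group
    float v = (i < n) ? in[i] : 0.0f;

    // Each step folds the upper half of the warp onto the lower half.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if (lane == 0)                     // lane 0 now holds the sum for its warp
        out[i / 32] = v;
}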

You might want to go Google "gpu architecture" or check over those SPIR docs.

Sean Middleditch – Game Systems Engineer – Join my team!

There aren't any tools for directly generating and running the raw ISA of a GPU. GPUs are intended to be used through drivers and 3D graphics or GPGPU APIs, which abstract away a lot of the specifics of the hardware. AMD publicly documents the shader ISA, register set, and command buffer format of their GPUs. With this you technically have enough information to build your own driver layer for setting up a GPU's registers and memory, issuing commands, and running your raw shader code. However, this would be incredibly difficult in practice, and would require a lot of general familiarity with GPUs, your operating system, and the specific GPU that you're targeting. And of course, by the time you've finished there might be new GPUs on the market with different ISAs and registers.

You guys are wayyy over my head with this stuff. I'm kinda with fir on this; I only have a vague notion of what a GPU does, but I figure it's like he says, a vast array of memory as data input, a similar vast array as output, and a set of processors that read and process instructions from yet another array of memory to transform the input to the output. Is that not the case?

Do all the processing units always work in lock-step or can they be divided into subgroups each processing a different program on different input sets?

Is there a separate processor that divides up the data and feeds it or controls the main array of processors as appropriate?

I mean, I can describe how a traditional CPU works down to the NAND gate level (and possibly further), but I'd be interested in learning about GPU internals more.

Stephen M. Webb
Professional Free Software Developer

Me too, especially to learn (/discuss the most important knowledge) in an easy way. Some docs harder than the Intel manuals can be an obstacle. Most important would be to get a picture of how this assembly code looks and how it's executed - for example, whether it is some long linear assembly routine

like

start:

..assembly..

..assembly..

..assembly..

..assembly..

..assembly..

end.

- one long routine that is given to a pack of 64 processors to consume - or whether it has some other structure.

Some parts of the pipeline are programmable by the client programmer, but what about the other parts - are those programmed by some internal assembly code, or what? It's hard to find answers, but it would be interesting to know.

You guys are wayyy over my head with this stuff. I'm kinda with fir on this; I only have a vague notion of what a GPU does, but I figure it's like he says, a vast array of memory as data input, a similar vast array as output, and a set of processors that read and process instructions from yet another array of memory to transform the input to the output. Is that not the case?

50,000-foot view? Yes. Think of it like doing CPU assembly: you write assembly as if you have this idealized x86 processor with a certain number of registers and a certain CISC-like instruction set. That's what you write, that's what gets stored on your hard disk. But when you send that to the CPU, it does all kinds of crazy transformations and executes an entirely different, though equivalent, program at the lower levels. If you were to write a GPU program in SPIR or PTX, which is as close to a GPU assembly language as is practical, it's the same situation, except that a) you might be executing on wildly different underlying architectures, each with its own perf consequences, and b) the logical leap between SPIR/PTX and GPU silicon is probably an order of magnitude more removed than the one between x86 assembly and your CPU silicon.

Furthermore, almost nothing on a GPU behaves as a CPU does -- not caching, not branching, not latency, not throughput -- good news though, the same old math works (except when it doesn't :) )
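
One concrete, made-up example of the "caching and latency don't behave like a CPU" point: what dominates on a GPU is whether the lanes of a lockstep group touch adjacent addresses in the same instruction (coalescing). Both CUDA kernels below do the same amount of work, but the strided one typically runs noticeably slower; the names and the stride parameter are invented for the sketch.

// Coalesced: consecutive lanes read consecutive floats -> few wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive lanes read addresses far apart -> many narrow transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];   // scatter the lanes across memory
}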

Do all the processing units always work in lock-step or can they be divided into subgroups each processing a different program on different input sets?

Is there a separate processor that divides up the data and feeds it or controls the main array of processors as appropriate?

In every programmable GPU I'm aware of, with the exception of ARM's new Midgard and Intel's MIC if we're including it, yes -- there's always lock-step execution. Sticking just to recent architectures, AMD's GCN compute block has 4 16-wide SIMD ALUs (in typical 4-vector code, each would correspond to x, y, z, and w), and there's only one program counter per block, IIRC. This is where my own knowledge starts to get a bit fuzzy unless I'm looking at docs, but the take-away is that you're certainly in lock-step across 16 lanes physically, and, I think, across the full 64 ALUs as more of a practical matter.
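
The practical upshot of that 64-wide grouping is mostly in how you size your workgroups. A hypothetical CUDA-style sketch (the kernel, its task and the 256-thread block size are invented): keep the block size a multiple of the lockstep width so no lanes sit idle.

// Trivial kernel, only here so the launch below is self-contained.
__global__ void fill(float* data, float value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = value;
}

// Launch-size sketch: 256 is a multiple of both 32 (NVIDIA warp) and 64 (GCN wavefront).
void launch_fill(float* data, float value, int n)
{
    const int block = 256;
    const int grid  = (n + block - 1) / block;  // enough blocks to cover all n elements
    fill<<<grid, block>>>(data, value, n);
}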

At a higher level, you can put different workloads on different compute blocks, and your GPU has between 4 and ~48 of those. The workloads you send the GPU are meted out to the blocks by a higher-level unit. IIRC, in the past this unit could only handle two workloads at a time, but the most recent GCN cards can handle 8. You can think of them a bit like hyperthreads -- mostly the duplication is there so that free compute blocks don't go to waste -- and the workloads themselves are typically very short-lived. The increase from 2 to 8 workloads is possibly even a bit premature from a client-code perspective; I think it's just anticipating that soon GPUs will be able to issue very-small-grained sub-workloads to themselves -- the workloads that come across the PCIe bus are big enough to not really need more than the two.
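
A rough, hypothetical CUDA-flavoured sketch of those "several independent workloads in flight": kernels submitted on different streams are allowed to run concurrently, so compute blocks freed up by one workload can pick up the other. The kernels, sizes and stream count are all invented.

#include <cuda_runtime.h>

__global__ void work_a(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
__global__ void work_b(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

int main()
{
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Two streams = two independent queues the GPU front-end may schedule
    // onto whatever compute blocks happen to be free.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    work_a<<<n / 256, 256, 0, s0>>>(a, n);
    work_b<<<n / 256, 256, 0, s1>>>(b, n);

    cudaDeviceSynchronize();   // wait for both workloads to finish
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}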

throw table_exception("(╯°□°)╯︵ ┻━┻");

Here are some good resources to read:

Background: How GPUs work

Midgard Architecture Explored

throw table_exception("(╯°□°)╯︵ ┻━┻");

This topic is closed to new replies.
