xaxazak

Some (tricky) technical GPU questions.


Firstly, if anyone knows of any decent resources for learning the sort of details I'm asking about, please point me at them. I'm happy to do a lot of reading.

Here's a handful of GPU questions I'm having trouble finding answers to. I thought it'd be easier to ask them in bulk rather than as multiple separate questions. I hope that's OK.

I asked on Khronos (https://forums.khronos.org/showthread.php/13413-A-pile-of-technical-GPU-questions-sorry-), but they didn't have a category for it and it didn't get any response.

I'm really asking about standard practice in immediate-rendering GPUs (Nvidia/ATI/maybe Intel).

 

1. Terminology
Is there a standard terminology for GPU shading components yet? What’s the best way to refer to:

  • The element responsible for a single texel output (eg CUDA core). (= Lane? Unit?)
  • The block of elements (above) whose instructions are performed together (SIMD). (= Core?)
  • The component responsible for managing tasks and cores. (= Thread dispatcher?)

I will use lane and core for the rest of this uberquestion.


2. Memory addressing
Is GPU access to graphics memory ever virtual (ie, via page tables)?

Can the driver/GPU choose to move resources to different parts of physical memory (eg to avoid contention when running multiple applications)?


3. Per-primitive user data
GPUs don’t support per-primitive (or per-tessellation-patch, etc.) user data (eg, per-triangle colors/normals) yet, right? Is there any technical reason why? Implicit per-primitive data is already required by cores (interpolation constants and flat values). This seems to be a common request, and data does seem to be getting wasted as a result.


4. ROP texel ordering
How is the order preserved when sending finished texels to the ROPs (render-output units)? Where/how do out-of-order texels queue when the previous primitive hasn’t been fully processed by the ROPs?


5. TMUs and cores
Can any lane/core use any TMU (texture-mapping unit) (assuming it has the same texture loaded) or are they grouped somehow? Is there a texture-request queue or is there some other scheduling method?


6. Identical texture metadata
For two textures with identical metadata in the same memory heap, is switching a TMU between the textures necessarily any more complex than simply changing the TMU’s texture pointer offset (ignoring resulting cache misses)?

 

7. Data "families"

There seem to be many data “families” available to core lanes:

  • Per-lane:
    1. Private lane variables. (Read/Write).
    2. Lane location/index (differentiating lanes within a core). (Read-only).
    3. Derivatives (per pair/quad?). (Read/Write(ish)).
  • Per-core (read-only):
    1. Per-primitive(or patch, etc) constant data. Interpolation constants etc.
    2. Draw-call-constant data (uniforms, descriptor set data).
  • RAM-based stuff (TMU, buffer array data, input attachments, counters, etc).

Does that make sense?

Are B1 and B2 (the two per-core families above) stored in the same area? Are they stored per-core or shared between cores somehow? They’re often identical between many cores, but IIUC other cores can be performing different tasks.

How does the task-manager/thread-dispatch write to each core's B1/B2? In bulk / all-at-once, or granularly? Are these writes significant performance-wise? (kinda technical but related to a shader-design issue I have).

 

Thanks for all input.


Is GPU access to graphics memory ever virtual (ie, via page tables)?
Yes, GPUs use virtual memory... I don't think it was always like that, but I don't know when it started.

GPUs don’t support per-primitive (or per-tessellation-patch, etc.) data (eg, per-triangle colors/normals) yet, right?
You could do it via a geometry shader, couldn't you?

How is the order preserved when sending finished texels to the ROPs (render-output units)? Where/how do out-of-order texels queue when the previous primitive hasn’t been fully processed by the ROPs?
I don't know how it's implemented, but IIRC the ROPs have some sort of ordering mechanism. See ROVs (rasterizer-ordered views) for an idea of what I'm talking about.
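HLSL actually exposes that ordering guarantee directly through ROVs. A minimal sketch (resource and function names are hypothetical) of a raster-ordered read-modify-write - the hardware serializes overlapping pixels in primitive order, which is the same guarantee the ROPs give fixed-function blending:

// Programmable "over" blending via a Rasterizer Ordered View.
RasterizerOrderedTexture2D<float4> gAccum : register(u1);

void PSMain(float4 pos : SV_Position, float4 src : COLOR0)
{
    uint2 p = uint2(pos.xy);
    float4 dst = gAccum[p];                 // ordered read
    gAccum[p] = src + dst * (1.0 - src.a);  // ordered write, in primitive order
}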

Can any lane/core use any TMU (texture-mapping unit) (assuming it has the same texture loaded) or are they grouped somehow? Is there a texture-request queue or is there some other scheduling method?
Different GPUs are organized differently, but from what I remember they are grouped one way or another in all cases. For example, Nvidia shares texture units inside an "SM".

GPUs don’t support per-primitive (or per-tessellation-patch, etc.) data (eg, per-triangle colors/normals) yet, right?
You could do it via a geometry shader, couldn't you?

 

That shouldn't be necessary in most cases. For normals, using the screen-space derivatives of the position usually works well enough. Furthermore, both DirectX and OpenGL support "flat"/"nointerpolation" for vertex attributes. In such cases, the "provoking vertex" determines the value for the entire primitive (see https://www.khronos.org/opengl/wiki/Primitive#Provoking_vertex). You may need to duplicate a few vertices, but not all of them.
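A minimal HLSL sketch of the flat-attribute approach (struct and semantic names are hypothetical; GLSL's equivalent qualifier is "flat"):

struct VSOutput
{
    float4 position : SV_Position;
    nointerpolation float3 faceColor : COLOR0; // provoking vertex's value, passed to every pixel unchanged
    float2 uv : TEXCOORD0;                     // interpolated as usual
};

float4 PSMain(VSOutput input) : SV_Target
{
    // Every pixel of the triangle sees the same faceColor.
    return float4(input.faceColor, 1.0);
}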

Edited by l0calh05t

1) Lane, SIMD, Core, Package is understandable by everyone... But no, the GPU vendors use their own terminology.

D3D calls them threads, thread-groups and devices... But these terms are also a bit higher level. One thread maps to one lane, but one thread group may map to several SIMD registers...

2) yep, GPUs use virtual addresses these days. Not too long ago they only used physical addressing, which was the cause of many limitations.

3) as above - geometry/hull shaders, and non-interpolated vertex shader outputs.

4) highly GPU-specific magic. For Nvidia it's anyone's guess; AMD publishes full details on their site.

5) for AMD, from memory (probably wrong):
A GPU has several shader engines, each of which has several shader cores. Each core has 4x SIMD16 units that act as SIMD64. They can "hyperthread" up to 10 instances of a kernel (so 640 threads). Each core has two texture units, each with an L1 cache; the engine has an L2. Cache-miss latency is about 1000 cycles, but issuing a texture fetch takes less than 1 cycle. All the threads in a core share those two fetch units, controlled by some HW arbiter.
The fetch units are not configured with a particular texture - there is no such thing as a texture binding any more. Texture metadata is stored in cbuffers now, and it is sent to the fetch unit along with the UV coordinates for each fetch.
You could put 10,000 texture-descriptor structures into a cbuffer, have the shader dynamically index them, and never have to change your texture bindings. Old APIs get in the way of these new abilities.
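In HLSL/D3D12 terms that looks roughly like the sketch below (the buffer layout and names are hypothetical; NonUniformResourceIndex is only needed when the index can vary per lane):

Texture2D gTextures[] : register(t0);  // unbounded descriptor range
SamplerState gSampler : register(s0);

struct Material { uint albedoIndex; }; // hypothetical per-material record
StructuredBuffer<Material> gMaterials : register(t0, space1);

float4 PSMain(float2 uv : TEXCOORD0,
              nointerpolation uint matId : MATERIAL_ID) : SV_Target
{
    uint tex = gMaterials[matId].albedoIndex;
    // Which texture gets sampled is decided by data, not by a bind call.
    return gTextures[NonUniformResourceIndex(tex)].Sample(gSampler, uv);
}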

6) see above. Not applicable any more.

7) derivatives are just a per-lane variable subtracted from itself, basically - your lane's value minus your neighbour's value.
New GPUs let you do arbitrary shuffling/rotating of per-lane variables to enable inter-lane communication.

Draw call constant data (uniforms, descriptors) lives in memory just like texels or vertex buffers. They're fetched via the L2 just like texels.

Each engine and each core has a small amount of local RAM (LDS/GDS) that can be used to store things like VS outputs, GS-generated data, etc... Or this data might be streamed into temporary RAM locations, as decided by the driver.

Compute shaders can read/write to a core's LDS (Local RAM) using the groupshared keyword in HLSL.
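For example, a minimal compute-shader sketch (sizes arbitrary):

groupshared float sTile[64];  // lives in the core's LDS

RWStructuredBuffer<float> gData : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 tid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    sTile[tid.x] = gData[dtid.x];               // stage my value in LDS
    GroupMemoryBarrierWithGroupSync();          // wait for the whole group
    float neighbour = sTile[(tid.x + 1) % 64];  // read another lane's value
    gData[dtid.x] = neighbour;
}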


Per-primitive user data

Not needed.

For the vertex stage this is meaningless, but on the other hand GPUs support (and have supported for some years) the current primitive ID (such as gl_PrimitiveID) in any stage that deals with primitives. That ID plus a shader buffer into which you index, and there you go.
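In HLSL terms it's roughly the sketch below (the buffer contents are hypothetical; gl_PrimitiveID's HLSL equivalent is SV_PrimitiveID):

struct PrimData { float3 faceNormal; float3 faceColor; }; // hypothetical per-triangle record
StructuredBuffer<PrimData> gPrimData : register(t0);

float4 PSMain(float4 pos : SV_Position,
              uint primId : SV_PrimitiveID) : SV_Target
{
    // One fetch per pixel: address = base + primId * stride.
    PrimData p = gPrimData[primId];
    return float4(p.faceColor * saturate(p.faceNormal.z), 1.0);
}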

Sorry about the ultra-late reply. I was kinda distracted by events recently and I'm just getting back to coding.
 

That shouldn't be necessary in most cases. For normals, using the screen-space derivatives of the position usually works well enough.


Isn't using the provoking vertex and flat "interpolation" kinda inefficient, though?

Flat data can sometimes be larger than the "true" vertex data, even to the point where, on average, over half your total vertex data is unused. It seems like such a good target for improving efficiency.

Screen-space derivatives seem like they would often be inefficient too: you're effectively calculating the same values repeatedly in every quad, rather than just having them constant. I guess it could be slightly faster in some cases with tiny triangles, but for ones covering 100+ texels I'd be surprised if it were more efficient than sending the data to the cores. And the results are less accurate and less flexible.
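(For reference, the derivative trick being discussed is roughly the sketch below; the cross product is what gets recomputed in every quad.)

// Face normal reconstructed from screen-space derivatives of the
// interpolated world-space position. The sign depends on winding
// and handedness conventions.
float3 FaceNormalFromDerivatives(float3 worldPos)
{
    return normalize(cross(ddx(worldPos), ddy(worldPos)));
}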
 


Thanks heaps for the detailed reply, you've cleared up a number of questions.
 

Lane, SIMD, Core, Package is understandable by everyone


"Core" seems to be used both for individual lane hardware and for SIMD blocks ... and GCN. But I guess the alternative is to use vendor-specific terms.
 

For Nvidia it's anyone's guess; AMD publishes full details on their site.


Cool. I'll go look that up.


 

Not needed.

For the vertex stage this is meaningless, but on the other hand GPUs support (and have supported for some years) the current primitive ID (such as gl_PrimitiveID) in any stage that deals with primitives. That ID plus a shader buffer into which you index, and there you go.

Thanks. That fixes around 3/4 of the issue.

Internally, though, there's still arithmetic occurring: data address = buffer pointer + (buffer index × primitive-data-object size).
You might get lucky and have a power-of-two primitive-data-object size, so you can use a shift rather than a multiply.
You might also get lucky and have the compiler compute this offset once per core rather than per lane if it figures out that it's static.
But it's still repeatedly doing an identical calculation that only needs doing once - if at all - occupying silicon and burning joules.
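As a concrete sketch of that arithmetic (the 32-byte record size is hypothetical):

uint PrimDataAddress(uint basePointer, uint primId)
{
    const uint stride = 32;               // hypothetical per-primitive record size
    return basePointer + primId * stride; // general case: one multiply per fetch
    // With a power-of-two stride this becomes basePointer + (primId << 5).
}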

 

Not an expert on the topic myself, but here is a series of blog posts you may find useful, if you haven't seen it yet: link.

Thanks. I just reread that, and I think I missed some pages the first time. Definitely worth reading (twice).


You might get lucky and have a power-of-two primitive-data-object size, so you can use a shift rather than a multiply.

Multiplies are pretty cheap nowadays; they're nothing to worry about. Basically all GPUs have hardware multipliers with a throughput of 1 per clock (IIRC a latency of approximately 5 cycles). But in all seriousness, calculating an effective address is nothing to worry about.


You might also get lucky and have the compiler compute this offset once per core rather than per lane if it figures out that it's static.
AMD cores are dual-issue between the SIMD-64 vector instruction pipe and the scalar instruction pipe. Their compiler looks for operations that are constant across all lanes and compiles them as scalar instructions instead of vector instructions. These often end up being almost free, as most shaders are bottlenecked by vector instructions and leave the scalar pipe largely idle.
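A sketch of the kind of expression that gets hoisted that way (cbuffer and names hypothetical):

cbuffer PerDraw : register(b0) { float scale; float bias; };

float4 PSMain(float value : VALUE0) : SV_Target
{
    // 'scale * bias' is uniform across the draw, so the compiler can
    // evaluate it once on the scalar pipe instead of once per lane.
    float k = scale * bias;
    // Only this multiply needs a per-lane (vector) instruction.
    return float4(value * k, 0, 0, 1);
}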

There's no way they could actually shove flat values into core registers, and leave them there the entire time the pixel shader is processing the primitive? Typically, every pixel needs to read them.

I guess current hardware doesn't have registers that would suit that usage, though.


There's no way they could actually shove flat values into core registers, and leave them there the entire time the pixel shader is processing the primitive? Typically, every pixel needs to read them.

Is there really any reason to do that systemically?

If every pixel uses them, they are going to wind up cached anyway. It seems more useful as a compiler/cache optimisation than an actual hardware design decision.

On NVidia, yes - they do have a big bank of "constant" registers that they pre-load with cbuffer data at the start of a draw.
On AMD, cbuffers are just like any other buffer... however, they act differently in practice because each lane (e.g. pixel) is reading the same addresses.
AMD does have some constant registers - but only 16 of them (each 32 bits, so 64 bytes total) per draw - so they're usually used to store cbuffer descriptors/etc rather than actual constants. This is all exposed to the user in D3D12, so you get to choose how to use those 64 bytes!
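In D3D12 that choice shows up as root constants. A sketch of the HLSL side (the RootSignature attribute syntax is real; the names and the 8-dword split are hypothetical):

#define MyRS "RootConstants(num32BitConstants=8, b0)"

cbuffer RootConsts : register(b0)
{
    float4 tint;   // 4 of the dwords
    uint4  params; // 4 more
};

[RootSignature(MyRS)]
float4 PSMain(float4 pos : SV_Position) : SV_Target
{
    return tint * (float)params.x;
}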

AMD shader cores are basically just SIMD64 CPU cores that compute 64 pixels at a time. They have a bunch of 32-bit registers used when doing work that is uniform across all lanes (e.g. loading a constant) and a bunch of 32bit_simd64 registers used for lane-varying (per-pixel) work, and there are instructions that operate on each type of register.

Say you write a pixel shader like:
int y = ... // Some per pixel value
float x = mySrv[y];
And one like:
float x = myConstant;

These would compile into pseudo-asm like:
int32_simd64 y = ... // Some per pixel value
SRV mySrv = gather_uniform(cb0, 0); // fetch SRV0's descriptor from a driver-generated cbuffer, for use by all lanes
float32_simd64 x = gather32x64(mySrv, y); // load 64 floats from that SRV
and:
float32 x = gather_uniform(cb1, 4); // load one float from cb1 at offset 4 bytes
As above, once one core does one of these loads, the data will be present in the L2$ and subsequent cores will be able to load that same data very quickly.

Is it possible to efficiently preserve that data across multiple pixel-shader warps on the same primitive? So, basically, only set the register values for the first warp of a primitive.
I guess you could check whether the primitive ID has changed, but the cost of that might be more than the benefit unless your per-primitive data is big (and, as you say, AMD only has 64 bytes).

I guess that'd only be a benefit for large triangles, though.


Is it possible to efficiently preserve that data across multiple pixel-shader warps on the same primitive?

No, different warps of pixels may well be executing on different "cores", and would have to communicate with each other via memory. There also isn't necessarily a "first warp of a primitive" -- ideally dozens of warps would be scheduled to start simultaneously :)


Note that, going by the slides we've seen, AMD is set to introduce a Primitive shader stage that replaces both the Vertex & Geometry stages in their upcoming VEGA GPU.

Not sure what it will do.

Geometry shaders are effectively unusable; you should consider them defunct - and it's only them that are under threat.


Note that, going by the slides we've seen, AMD is set to introduce a Primitive shader stage that replaces both the Vertex & Geometry stages in their upcoming VEGA GPU.
Not sure what it will do.
Geometry shaders are effectively unusable; you should consider them defunct - and it's only them that are under threat.

NVidia supports a "fast geometry shader" extension to D3D11 and GL :wink:
AMD's actual hardware shader pipeline in GCN already doesn't match the high-level logical pipeline - e.g. your vertex shader code doesn't always execute as part of the vertex shader stage :o

I haven't read anything about the new arch, but ideally GS code would be able to immediately schedule VS warps. A pass-through GS could actually just be bolted onto the front of the VS code and run in a single stage. Maybe that's what NV already does?


Is it possible to efficiently preserve that data across multiple pixel-shader warps on the same primitive?

No, different warps of pixels may well be executing on different "cores", and would have to communicate with each other via memory. There also isn't necessarily a "first warp of a primitive" -- ideally dozens of warps would be scheduled to start simultaneously :)

If it's flat data then it's constant, so there's no need for communication between cores. For example, if 10 cores execute 5 warps each for a single primitive then, on each core, the first (consecutive) warp copies the flat data from cache (or from memory, if you're unlucky) into registers.
The following warps could somehow determine that they're working on the same primitive, and then just use the data already in the registers without needing to load it from cache.

So you're duplicating the flat-data copy once per core - but that's better than once per warp.

Of course, unless you add hardware support (which I'd imagine would be extremely cheap and tiny), you'll need a warp-coherent if-statement that can determine whether this is the first warp - possibly by comparing the primitive ID with a register. I don't know whether that additional cost would be worse than the cache fetches it avoids.

Edited by xaxazak


If it's flat data then it's constant, so there's no need for communication between cores. For example, if 10 cores execute 5 warps each for a single primitive then, on each core, the first (consecutive) warp copies the flat data from cache (or from memory, if you're unlucky) into registers.
The following warps could somehow determine that they're working on the same primitive, and then just use the data already in the registers without needing to load it from cache.

So you're duplicating the flat-data copy once per core - but that's better than once per warp.

There's an issue with that: you don't have the guarantee that all warps will be working on the same primitive.
Half of warp A could be working on triangle X, and the other half of warp A could be working on triangle Y. GPUs make some effort to keep everything convergent, but if they were to restrict triangle X to one set of warps and triangle Y to another set of warps, it would quickly get very inefficient.

 

I am curious: why are you asking these extremely low-level questions? Knowing the insides of your GPU is important, especially if you want to squeeze the last drop out of it, both for the techniques you want to implement and for the performance you want to achieve.

However, without specifying a particular set of HW, GPUs are very heterogeneous. They're not like x86 CPUs, which all work relatively similarly because they have to produce perfectly identical results.

Although there is some common ground, more than half of these answers will change within 2 years. Specializing in particular HW is more useful (i.e. GCN is present in PC, Xbox One & PS4; PowerVR is present in Metal-capable iOS devices). For example, you ask about TMUs, yet TMUs no longer exist as such a concept; it's much more complex and very GPU-specific.

For instance, Mali GPUs do not have threadgroup/LDS memory at all; they emulate it via RAM fetches. Therefore any optimization that relies on the use and reuse of threadgroup data on GCN (and other GPUs) hurts a lot on Mali.

It's like learning how to drive a car and asking how the atoms in the car's battery move from one end to the other to power the car's instruments. Yes, if you want to be among the top 3 drivers in the world, perhaps that knowledge could be of use to you; however, you need to sit in the car and feel the wheel first.

Btw this is a nice resource on latency hiding on GCN.

If you want to learn the deep internals of each HW, I recommend you start by reading the vendors' manuals. The presentations from SIGGRAPH and GDC are also very useful.

Edited by Matias Goldberg


If it's flat data then it's constant, so there's no need for communication between cores. For example, if 10 cores execute 5 warps each for a single primitive then, on each core, the first (consecutive) warp copies the flat data from cache (or from memory, if you're unlucky) into registers.
The following warps could somehow determine that they're working on the same primitive, and then just use the data already in the registers without needing to load it from cache.

So you're duplicating the flat-data copy once per core - but that's better than once per warp.

There's an issue with that: you don't have the guarantee that all warps will be working on the same primitive.
Half of warp A could be working on triangle X, and the other half of warp A could be working on triangle Y. GPUs make some effort to keep everything convergent, but if they were to restrict triangle X to one set of warps and triangle Y to another set of warps, it would quickly get very inefficient.

Ah, well, there goes that idea, I guess (although, given that huge triangles do still turn up sometimes, you could possibly still gain if an "if" could determine whether a warp is single-primitive - depending on how significant avoiding those cache reads is).

I am curious: why are you asking these extremely low-level questions? Knowing the insides of your GPU is important, especially if you want to squeeze the last drop out of it, both for the techniques you want to implement and for the performance you want to achieve.
However, without specifying a particular set of HW, GPUs are very heterogeneous. They're not like x86 CPUs, which all work relatively similarly because they have to produce perfectly identical results.

Mostly, I think it's useful and important to get an understanding of how stuff works internally. I understand that there are huge differences between some technologies, but I'm not interested in mobile - like a sportscar fan doesn't care about tractors.

For this case specifically, two reasons:
1 - If what I'm suggesting turns out to be possible with the current SPIR-V instruction set, I might give it a go.
2 - The issue of redundant flat-value calculations seems like an extremely clear duplication of effort; it screams out for optimization. samoth's reply already helped to remove a lot of the redundancy, but it seems strange that it isn't mentioned more often, given that this is something total beginners frequently ask about (eg, "how do I send per-face data?").

If you want to learn the deep internals of each HW, I recommend you start by reading the vendors' manuals.

I've read a bit recently, but I still need a lot more. I'll take a look at those links, thanks.


>like a sportscar fan doesn't care about tractors.
I care about both

Sorry, I meant to say doesn't necessarily care. I'd just forgotten the word for a second, and forgot that I'd forgotten it when I posted.

