Some (tricky) technical GPU questions.


Firstly, if anyone knows of any decent resources for learning about the kind of details I'm asking about, please let me know. I'm happy to do a lot of reading.

Here's a handful of GPU questions I'm having trouble finding answers to. I thought it'd be easier to ask these in bulk rather than multiple questions. I hope that's OK.

I asked on Khronos (https://forums.khronos.org/showthread.php/13413-A-pile-of-technical-GPU-questions-sorry-), but there wasn't a good category for it and it didn't get any responses.

I'm really asking about standard practice in immediate-mode rendering GPUs (Nvidia/ATI/maybe Intel).

1. Terminology
Is there a standard terminology for GPU shading components yet? What’s the best way to refer to:

  • The element responsible for a single texel output (eg CUDA core). (= Lane? Unit?)
  • The block of elements (above) whose instructions are performed together (SIMD). (= Core?)
  • The component responsible for managing tasks and cores. (= Thread dispatcher?)

I will use lane and core for the rest of this uberquestion.


2. Memory addressing
Is GPU access to graphics memory ever virtual (ie, via page tables)?

Can the driver/GPU choose to move resources to different parts of physical memory (eg to avoid contention when running multiple applications)?


3. Per-primitive user data
GPUs don’t support per-primitive (or per-tessellation-patch, etc.) data (eg, per-triangle colors/normals) yet, right? Is there any technical reason why not? Cores already require implicit per-primitive data (interpolation constants and flat values). This seems to be a common request, and without it data does seem to be going to waste.


4. ROP texel ordering
How is order preserved when sending finished texels to ROPs (render output units)? Where/how do out-of-order texels queue when the previous primitive hasn’t been fully processed by the ROPs?


5. TMUs and cores
Can any lane/core use any TMU (texture-mapping unit) (assuming it has the same texture loaded) or are they grouped somehow? Is there a texture-request queue or is there some other scheduling method?


6. Identical texture metadata
For two textures with identical metadata in the same memory heap, is switching a TMU between textures necessarily any more complex than simply changing the TMU’s texture pointer offset (ignoring resulting cache misses)?

7. Data "families"

There seem to be many data “families” available to core lanes:

  • Per-lane:
    1. Private lane variables. (Read/Write).
    2. Lane location/index (differentiating lanes within a core). (Read-only).
    3. Derivatives (per pair/quad?). (Read/Write(ish)).
  • Per-core (read-only):
    1. Per-primitive(or patch, etc) constant data. Interpolation constants etc.
    2. Draw-call-constant data (uniforms, descriptor set data).
  • RAM-based stuff (TMU, buffer array data, input attachments, counters, etc).

Does that make sense?

Are B1 and B2 stored in the same area? Are they stored per-core or shared between cores somehow? They’re often identical between many cores, but IIUC other cores can be performing different tasks.

How does the task manager / thread dispatcher write to each core's B1/B2? In bulk, all at once, or granularly? Are these writes significant performance-wise? (Kinda technical, but related to a shader-design issue I have.)

Thanks for all input.


Is GPU access to graphics memory ever virtual (ie, via page tables)?
Yes, GPUs use virtual memory... I don't think it was always like that, but I don't know when it started.

GPUs don’t support per-primitive (or per-tessellation-patch, etc.) data (eg, per-triangle colors/normals) yet, right?
You could do it via a geometry shader, couldn't you?

How is order preserved when sending finished texels to ROPs (render output units)? Where/how do out-of-order texels queue when the previous primitive hasn’t been fully processed by the ROPs?
I don't know how it's implemented, but IIRC the ROPs have some sort of ordering mechanism. See ROVs (rasterizer ordered views) for an idea of what I'm talking about.
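For a feel of what that ordering guarantee looks like when it's exposed to shaders, here's a minimal GLSL sketch using GL_ARB_fragment_shader_interlock (roughly the OpenGL analogue of ROVs); the image binding and blend math are placeholders, not anyone's real implementation:

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : enable

// "Ordered" interlock: overlapping fragments enter the critical
// section in primitive submission order, which is the same ordering
// guarantee the fixed-function ROPs provide.
layout(pixel_interlock_ordered) in;

// Hypothetical render target accessed as an image.
layout(binding = 0, rgba8) coherent uniform image2D target;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    // This read-modify-write is ordered against other fragments that
    // cover the same pixel; the blend math is just a placeholder.
    vec4 dst = imageLoad(target, p);
    imageStore(target, p, dst * 0.5 + vec4(0.25));
    endInvocationInterlockARB();
}
```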

Can any lane/core use any TMU (texture-mapping unit) (assuming it has the same texture loaded) or are they grouped somehow? Is there a texture-request queue or is there some other scheduling method?
Different GPUs are organized differently, but from what I remember, in all cases they are grouped one way or another. For example, Nvidia shares texture units inside an "SM".

-potential energy is easily made kinetic-

GPUs don’t support per-primitive (or per-tessellation-patch, etc.) data (eg, per-triangle colors/normals) yet, right?
You could do it via a geometry shader, couldn't you?

That shouldn't be necessary in most cases. For normals, using the screen-space derivatives of the position usually works well enough. Furthermore, both DirectX and OpenGL support "flat"/"nointerpolation" for vertex attributes. In such cases, the "provoking vertex" determines the value for the entire primitive (see https://www.khronos.org/opengl/wiki/Primitive#Provoking_vertex). You may need to duplicate a few vertices, but not all of them.
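As a minimal GLSL illustration of both techniques (the variable names are just for the example):

```glsl
#version 450

// (a) "flat": no interpolation; the provoking vertex's value applies
//     to the whole triangle.
layout(location = 0) flat in vec3 vFlatNormal;

// (b) Derive the face normal from screen-space derivatives of the
//     interpolated world-space position.
layout(location = 1) in vec3 vWorldPos;

layout(location = 0) out vec4 fragColor;

void main()
{
    // dFdx/dFdy give vWorldPos's rate of change across the 2x2 quad;
    // their cross product is the geometric face normal.
    vec3 faceNormal = normalize(cross(dFdx(vWorldPos), dFdy(vWorldPos)));

    // Visualize it here; a real shader would use it for lighting.
    fragColor = vec4(faceNormal * 0.5 + 0.5, 1.0);
}
```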

1) Lane, SIMD, core, package: these are understandable by everyone... But no, the GPU vendors each use their own terminology.

D3D calls them threads, thread-groups and devices... But these terms are also a bit higher level. One thread maps to one lane, but one thread group may map to several SIMD registers...

2) Yep, GPUs use virtual addresses these days. Not too long ago they only used physical addressing, which was the cause of many limitations.

3) As above: geometry/hull shaders, and non-interpolated vertex shader outputs.

4) Highly GPU-specific magic. For Nvidia it's anyone's guess; AMD publishes full details on their site.

5) For AMD, from memory (probably wrong):
A GPU has several shader engines, which each have several shader cores. Each core has 4x SIMD16 units that act as SIMD64. They can "hyperthread" up to 10 instances of a kernel (so 640 threads). Each core has two texture units, each with an L1; the engine has an L2. Cache-miss latency is about 1000 cycles, but issuing a texture fetch takes less than 1 cycle. All the threads in a core share those two fetch units, controlled by some HW arbitrator.
The fetch units are not configured with a particular texture. There is no such thing as texture bindings. Texture metadata is stored in cbuffers now, and it is sent to the fetch unit along with the UV coordinates for each fetch.
You could put 10000 texture descriptor structures into a cbuffer, have the shader dynamically index them, and never have to change your texture bindings. Old APIs get in the way of these new abilities.
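As a sketch of what that style looks like, assuming Vulkan-flavoured GLSL with the descriptor-indexing feature (the set/binding layout and push-constant block are invented for the example):

```glsl
#version 450
#extension GL_EXT_nonuniform_qualifier : enable

// One big, rarely-rebound texture table (needs Vulkan's
// runtimeDescriptorArray / descriptor-indexing feature).
layout(set = 0, binding = 0) uniform sampler2D textures[];

// Per-draw index into the table, passed as a push constant here.
layout(push_constant) uniform PushData { uint textureIndex; } pc;

layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 fragColor;

void main()
{
    // nonuniformEXT marks the index as possibly differing between lanes.
    fragColor = texture(textures[nonuniformEXT(pc.textureIndex)], vUV);
}
```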

6) See above; not applicable any more.

7) Derivatives are basically just a per-lane variable subtracted from itself: your lane's value minus your neighbour's value.
New GPUs let you do arbitrary shuffling/rotating of per-lane variables, to enable inter-lane communication.
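In GLSL this is exposed through the subgroup extensions; here's a small compute-shader sketch of the XOR-shuffle trick (the buffer layout is made up for illustration):

```glsl
#version 450
#extension GL_KHR_shader_subgroup_shuffle : enable

layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Data { float values[]; };

void main()
{
    uint i = gl_GlobalInvocationID.x;
    float mine = values[i];

    // XOR with 1 swaps values between each even/odd lane pair, the
    // same neighbour exchange that derivatives are built from.
    float neighbour = subgroupShuffleXor(mine, 1u);

    // "Neighbour minus me": a derivative up to sign, which depends on
    // which side of the pair this lane sits.
    values[i] = neighbour - mine;
}
```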

Draw-call-constant data (uniforms, descriptors) lives in memory just like texels or vertex buffers, and it's fetched via the L2 just like texels.

Each engine and each core has a small amount of local RAM (LDS/GDS) that can be used to store things like VS outputs, GS-generated data, etc... Or this data might be streamed into temporary RAM locations, as decided by the driver.

Compute shaders can read/write to a core's LDS (Local RAM) using the groupshared keyword in HLSL.
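The GLSL equivalent is the shared qualifier in a compute shader; a minimal sketch (the binding and sizes are arbitrary):

```glsl
#version 450

layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Data { float values[]; };

// Lives in the core's local RAM (LDS on AMD), shared by the group.
shared float tile[64];

void main()
{
    uint local  = gl_LocalInvocationID.x;
    uint global = gl_GlobalInvocationID.x;

    // Stage one value per thread into LDS...
    tile[local] = values[global];
    barrier();

    // ...then any thread in the group can read its neighbours' values
    // without another trip to external memory.
    values[global] = tile[(local + 1u) % 64u];
}
```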

Per-primitive user data

Not needed.

For the vertex stage this is meaningless, but on the other hand, GPUs support (and have supported for some years) the current primitive ID (such as gl_PrimitiveID) in any stage that deals with primitives. That ID plus a shader buffer into which you index, and there you go.
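For instance, a fragment shader along these lines (the struct contents and binding are hypothetical) pulls per-triangle data by primitive ID:

```glsl
#version 450

// One record per triangle, uploaded by the application.
struct PrimitiveData {
    vec4 color;
};

layout(std430, binding = 1) readonly buffer PrimitiveBuffer {
    PrimitiveData primitives[];
};

layout(location = 0) out vec4 fragColor;

void main()
{
    // gl_PrimitiveID counts primitives within the current draw call.
    fragColor = primitives[gl_PrimitiveID].color;
}
```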

Not an expert on the topic myself, but here is a series of blog posts you may find useful, if you haven't seen it yet: link.

Sorry about the ultra-late reply. I was kinda distracted by events recently and I'm just getting back to coding.

That shouldn't be necessary in most cases. For normals, using the screen-space derivatives of the position usually works well enough.


Isn't using the provoking vertex and flat "interpolation" kinda inefficient, though?

Flat data can sometimes be larger than "true" vertex data, even to the point where, on average, over half your total vertex data is unused. It seems like such a good target for improving efficiency.

Screen-space derivatives seem like they would often be inefficient too. You're effectively calculating the same values repeatedly on every quad-lane, rather than just having them as constants. I guess it could be slightly faster in some cases with tiny triangles, but for triangles covering 100+ texels I'd be surprised if it were more efficient than sending the data to the cores. And derivatives are less accurate and less flexible.


Thanks heaps for the detailed reply, you've cleared up a number of questions.

Lane, SIMD, core, package: these are understandable by everyone


"Core" seems to be used both for individual lane hardware and for SIMD blocks ... and GCN. But I guess the alternative is to use vendor-specific terms.

For Nvidia it's anyone's guess; AMD publishes full details on their site.


Cool. I'll go look that up.



Not needed.

For the vertex stage this is meaningless, but on the other hand, GPUs support (and have supported for some years) the current primitive ID (such as gl_PrimitiveID) in any stage that deals with primitives. That ID plus a shader buffer into which you index, and there you go.

Thanks. That fixes around 3/4 of the issue.

Internally, though, there's still arithmetic occurring: data address = buffer pointer + (buffer index * primitive-data-object size).
You might get lucky and have a power-of-two primitive-data-object size so you can use a shift operator rather than a multiply.
You might also get lucky and have the compiler compute this offset once per core rather than per lane if it figures out that it's static.
But it's still repeatedly doing an identical calculation that only needs doing once (if at all), occupying silicon and burning joules.
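For what it's worth, the power-of-two case can be arranged deliberately. In this made-up GLSL layout, the record has a 32-byte std430 stride, so the index * size multiply can lower to a shift:

```glsl
#version 450

// Illustrative layout only: two vec4s give a 32-byte std430 stride.
struct PrimData {
    vec4 colorAndRoughness;  // bytes 0..15
    vec4 normalAndPad;       // bytes 16..31
};

layout(std430, binding = 2) readonly buffer PrimBuffer {
    PrimData prims[];
};

layout(location = 0) out vec4 fragColor;

void main()
{
    // Effective address: base + gl_PrimitiveID * 32, where the
    // multiply can compile to "gl_PrimitiveID << 5".
    fragColor = prims[gl_PrimitiveID].colorAndRoughness;
}
```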


Not an expert on the topic myself, but here is a series of blog posts you may find useful, if you haven't seen it yet: link.

Thanks. I just reread that, and I think I missed some pages the first time. Definitely worth reading (twice).

You might get lucky and have a power-of-two primitive-data-object size so you can use a shift operator rather than a multiply.

Multiplies are pretty cheap nowadays; it's nothing to worry about. Basically all GPUs have hardware multipliers with a throughput of 1 per clock (IIRC a latency of approximately 5 cycles). But in all seriousness, calculating the effective address is nothing to worry about.

-potential energy is easily made kinetic-

You might also get lucky and have the compiler compute this offset once per core rather than per lane if it figures out that it's static.
AMD cores are dual-issue between the SIMD-64 vector instruction pipe and the scalar instruction pipe. Their compiler will look for operations that are constant across all lanes and compile them as scalar instructions instead of vector instructions. These often end up being almost free, as most shaders are bottlenecked by vector instructions and are largely idle on the scalar pipe, making scalar operations "free".
Is there no way they could actually shove flat values into core registers, and leave them there the entire time the pixel shader is processing the primitive?
Typically, every pixel needs to read them.

I guess current hardware doesn't have registers that would suit that usage, though.
