xaxazak

Members
  • Content count

    0

Community Reputation

131 Neutral

About xaxazak

  • Rank
    Newbie
  1. Sorry, I meant to say doesn't necessarily care. I'd just forgotten the word for a second and forgot that I'd forgotten it when I posted.
  2. There's an issue with that: You don't have the guarantee that all warps will be working on the same primitive. Half of Warp A could be working on triangle X, and the other half of Warp A could be working on triangle Y. GPUs make some effort to keep everything convergent, but if they were to restrict triangle X to one set of warps and triangle Y to another set of warps, it would get very inefficient quickly.

     Ah, well there goes that idea I guess (although, given that huge triangles still exist sometimes, you could possibly still gain if it's possible to use an "if" to determine whether a warp is single-primitive, depending on how significant avoiding those cache reads is).

     Mostly I think it's useful and important to get an understanding of how stuff works internally. I understand that there are huge differences between some technologies, but I'm not interested in mobile, like a sportscar fan doesn't care about tractors.

     For this case, specifically, two reasons:
     1 - If what I'm suggesting turns out to be possible with the current SPIR-V instruction set I might give it a go.
     2 - The issue of redundant flat surface calcs seems like an extremely clear duplication of effort. It screams out for optimization. samoth's reply already helped to remove a lot of redundancy, but it seems strange that it's not mentioned more often, as this is an issue that total beginners often ask about (eg, "how do I send per face data").

     I've read a bit recently, but I still need a lot more. I will take a look at those links, thanks.
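The "if" mentioned above (is this warp single-primitive?) can be modelled on the CPU as a check that every lane holds the same primitive ID. This is only a toy sketch: the function name and the list-of-IDs representation are made up for illustration, and on real hardware the test would presumably be a warp-wide vote (SPIR-V's OpGroupNonUniformAllEqual looks like the relevant instruction, but whether it is cheap enough here is an open question).

```python
def warp_is_single_primitive(lane_primitive_ids):
    """Return True if every lane in the (modelled) warp shades the same primitive.

    lane_primitive_ids is a hypothetical stand-in for the per-lane primitive ID
    that a real shader would read from a built-in input.
    """
    first = lane_primitive_ids[0]
    return all(pid == first for pid in lane_primitive_ids)


# A 32-lane warp that lies entirely inside one large triangle.
uniform_warp = [42] * 32

# A 32-lane warp split between triangle X (id 7) and triangle Y (id 8),
# as in the "half of Warp A" case described above.
split_warp = [7] * 16 + [8] * 16

uniform_result = warp_is_single_primitive(uniform_warp)
split_result = warp_is_single_primitive(split_warp)
```

Only the uniform warp could take the "reuse what is already in registers" path; the split warp would have to fall back to per-lane fetches.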
  3. No, different warps of pixels may well be executing on different "cores", and would have to communicate with each other via memory. There also isn't necessarily a "first warp of a primitive" -- ideally dozens of warps would be scheduled to start simultaneously :)

     If it's flat data then it's constant, so there's no need for communication between cores. For example, if 10 cores execute 5 warps each for a single primitive, then on each core the first of its (consecutive) warps copies the flat data from cache (or memory if you're unlucky) into registers. The following warps could somehow determine they're using the same primitive and then just use the data in the registers without needing to load it from cache. So you're duplicating the flat data copy once per core, but that's better than once per warp.

     Of course, unless you add hardware support (which I'd imagine would be extremely cheap and tiny) you'll need a warp-consistent if statement that is able to determine whether it's the first warp, possibly by comparing the primitive ID with a register. I don't know if that additional cost would be worse than the cache fetches it avoids.
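To put numbers on the 10-cores-by-5-warps example above, here is a minimal CPU-side model that counts flat-data loads under the two schemes. Everything in it (the `Core` class, `run_warp`, the schedule layout) is hypothetical and only illustrates the counting; it says nothing about what the primitive-ID compare would cost on actual hardware.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Core:
    """Toy model of one SIMD core with a register slot holding cached flat data."""
    cached_primitive_id: Optional[int] = None
    flat_loads: int = 0

    def run_warp(self, primitive_id: int) -> None:
        # Warp-consistent check: reload flat data only if the primitive changed.
        if self.cached_primitive_id != primitive_id:
            self.flat_loads += 1
            self.cached_primitive_id = primitive_id


def loads_per_warp(schedule: List[List[int]]) -> int:
    """Baseline: every warp fetches the flat data from cache once."""
    return sum(len(warps) for warps in schedule)


def loads_with_register_reuse(schedule: List[List[int]]) -> int:
    """Proposed scheme: each core reloads only when its primitive ID changes."""
    total = 0
    for warps in schedule:
        core = Core()
        for primitive_id in warps:
            core.run_warp(primitive_id)
        total += core.flat_loads
    return total


# 10 cores, each running 5 consecutive warps of the same primitive (id 7).
schedule = [[7] * 5 for _ in range(10)]
baseline = loads_per_warp(schedule)            # 50 loads
reused = loads_with_register_reuse(schedule)   # 10 loads

# Worst case for the scheme: warps on one core alternate between two primitives.
interleaved = loads_with_register_reuse([[1, 2, 1, 2]])  # 4 loads, no saving
```

The interleaved case shows why this would mostly help large triangles: if consecutive warps on a core rarely share a primitive, the compare is paid for with nothing saved.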
  4. Is it possible to efficiently preserve that data across multiple pixel-shader warps on the same primitive? So basically just set the register values for the first warp of a primitive. I guess you could check if the primitive ID has changed, but the cost of that might be more than the benefit unless your per-primitive data was big (and as you say, AMD only has 64 bytes). I guess that'd only be a benefit for large triangles though.
  5. There's no way they could actually shove flat values into core registers? And leave them there the entire time the pixel shader is processing the primitive. Typically, every pixel needs to read them. I guess current hardware doesn't have registers that would suit that usage, though.
  6. Sorry about the ultra-late reply. I was kinda distracted by events recently and I'm just getting back to coding.

     Isn't using the provoking vertex and flat "interpolation" kinda inefficient, though? Flat data can sometimes be larger than "true" vertex data, even to a point where, on average, over half your total vertex data is unused. It seems like such a good target for improving efficiency.

     Screen space derivatives seem like they would often be inefficient too. You're effectively calculating the same values repeatedly on every quad-lane, rather than just having it constant. I guess it could be slightly faster in some cases with tiny triangles, but for ones with 100+ texels I'd be surprised if it were more efficient than sending the data to the cores. And they're less accurate and less flexible.

     Thanks heaps for the detailed reply, you've cleared up a number of questions.

     Core seems to be used for both individual lane hardware and for SIMD blocks ... and GCN. But I guess the alternative is to use vendor-specific terms.

     Cool. I'll go look that up.

     Thanks. That fixes around 3/4 of the issue. Internally, though, there's still arithmetic occurring - data address = buffer pointer + (buffer index * primitive-data-object size). You might get lucky and have a power-of-two primitive-data-object size so you can use a shift operator rather than a multiply. You might also get lucky and have the compiler compute this offset once per core rather than per lane if it figures out that it's static. But it's still repeatedly doing an identical calculation that only needs doing once - if at all - occupying silicon and burning joules.

     Thanks. I just reread that, and I think I missed some pages the first time. Definitely worth reading (twice).
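The address arithmetic described above (data address = buffer pointer + buffer index * object size) can be written out as a small sketch. The function names, the base address 0x1000, the 64-byte object size and the 32-lane warp width are all assumptions chosen for illustration, not anything a particular driver or compiler is known to do.

```python
def primitive_data_address(buffer_base, primitive_index, object_size):
    """General case: one multiply and one add per lookup."""
    return buffer_base + primitive_index * object_size


def primitive_data_address_pow2(buffer_base, primitive_index, log2_object_size):
    """Power-of-two object size: the multiply becomes a left shift."""
    return buffer_base + (primitive_index << log2_object_size)


base = 0x1000          # hypothetical buffer base address
index = 3              # hypothetical primitive index
size = 64              # hypothetical per-primitive object size in bytes (2**6)

addr_mul = primitive_data_address(base, index, size)
addr_shift = primitive_data_address_pow2(base, index, 6)

# Every lane of a 32-wide warp shading the same primitive computes the
# identical address, which is the redundancy being pointed out.
lane_addresses = [primitive_data_address(base, index, size) for _ in range(32)]
distinct_addresses = set(lane_addresses)
```

Both forms give the same address, and the 32 per-lane computations collapse to a single distinct value, so in principle the result only needs computing once per primitive.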
  7. Firstly, if anyone knows of any decent resources for learning details like I'm asking about, can you tell me plzthx. I'm happy to do a lot of reading.

     Here's a handful of GPU questions I'm having trouble finding answers to. I thought it'd be easier to ask these in bulk rather than as multiple questions. I hope that's OK. I asked on Khronos (https://forums.khronos.org/showthread.php/13413-A-pile-of-technical-GPU-questions-sorry-)), but they didn't have a category for it and it didn't get any response.

     I'm really asking about standard practice in immediate-rendering GPUs (Nvidia/ATI/maybe Intel).

     1. Terminology
        Is there a standard terminology for GPU shading components yet? What’s the best way to refer to:
        • The element responsible for a single texel output (eg CUDA core). (= Lane? Unit?)
        • The block of elements (above) whose instructions are performed together (SIMD). (= Core?)
        • The component responsible for managing tasks and cores. (= Thread dispatcher?)
        I will use lane and core for the rest of this uberquestion.

     2. Memory addressing
        Is GPU access to graphics memory ever virtual (ie, via page tables)? Can the driver/GPU choose to move resources to different parts of physical memory (eg to avoid contention when running multiple applications)?

     3. Per-primitive user data
        GPUs don’t support per-primitive (or per-tessellation-patch etc) data (eg, per-triangle colors/normals) yet, right? Is there any technical reason why? Implicit per-primitive data is required by cores (interpolation constants and flat values). This seems to be a common request, and data does seem to be being wasted.

     4. ROP texel ordering
        How is the order preserved when sending finished texels to ROPs (render-output-units)? Where/how do out-of-order texels queue when the previous primitive hasn’t been fully processed by the ROPs?

     5. TMUs and cores
        Can any lane/core use any TMU (texture-mapping unit) (assuming it has the same texture loaded) or are they grouped somehow? Is there a texture-request queue or is there some other scheduling method?

     6. Identical texture metadata
        For two textures with identical metadata in the same memory heap, is switching a TMU between textures necessarily any more complex than simply changing the TMU’s texture pointer offset (ignoring resulting cache misses)?

     7. Data "families"
        There seem to be many data “families” available to core lanes:
        A. Per-lane:
           1. Private lane variables. (Read/Write).
           2. Lane location/index (differentiating lanes within a core). (Read-only).
           3. Derivatives (per pair/quad?). (Read/Write(ish)).
        B. Per-core (read-only):
           1. Per-primitive (or patch, etc) constant data. Interpolation constants etc.
           2. Draw-call-constant data (uniforms, descriptor set data).
        C. RAM-based stuff (TMU, buffer array data, input attachments, counters, etc).
        Does that make sense?
        Are B1 and B2 stored in the same area? Are they stored per-core or shared between cores somehow? They’re often identical between many cores, but IIUC other cores can be performing different tasks.
        How does the task-manager/thread-dispatch write to each core's B1/B2? In bulk / all-at-once, or granularly? Are these writes significant performance-wise? (kinda technical but related to a shader-design issue I have).

     Thanks for all input.