About xaxazak

  1. That works, too. Currently I'm assigning n to a static template member of a "suicide" struct which has static_assert(fail_bool&lt;T&gt;) (that's not the actual name). It's slightly more verbose, but it lets me put a message inside the static_assert. Gonna browse the template variable stuff in the standard later tonight and see if I can figure out the legalese of it myself.
  2. Ah, sorry. So replace "should(n't) be" with "AFAIK is(n't)". To me, it seems like "constexpr int n;" fails on (compile-time) instantiation, which, for a non-template, is the same as definition. But "template &lt;T&gt; constexpr int n;" isn't an instantiation. Any use of n&lt;int&gt; shouldn't give zero; it should fail to compile. But n&lt;SZug&lt;int&gt;&gt; could compile if n&lt;SZug&lt;T&gt;&gt; were specialized. At least, that's how template structs work: you can write lots of illegal stuff in them and they don't fail until you try to (compile-time) instantiate them. (However, I notice that static members of uninstantiated template structs require valid definitions if constexpr, but not if const. While that does seem sensible, it also seems a bit inconsistent - maybe the spec explicitly requires this.) But if I look at it the way you do, I also see your point. And while I think it could be useful (I was using it), it's very easy to get the same effect using static_assert. But regardless, either G++ or LLVM is getting it wrong, or the spec is ambiguous. So something needs a fix.
  3. True, there is occasional ambiguity (it's not supposed to be there, but the spec isn't perfect). But do you have any reason to think ambiguity exists in these examples? They are pretty simple, fairly significant, and in no way obscure. If there is ambiguity here, it's significant. So if either of these is in fact a real ambiguity, then I think it needs a defect report sent to the C++ standards committee (if one doesn't already exist).

I acknowledge that that was probably the intent - at least initially. But C++ isn't meant to be about intent and spirit; it's meant to be about following explicit rules. It's important for compilers to attempt to follow the text to the letter, because if you let them diverge, you get libraries tested on different compilers that can't be used together.

Regarding example #1, I think it's unlikely that a compiler is going to forget about optimizing a template function call simply because its definition comes later (which is common and not illegal), so I doubt allowing this would incur much compiler dev time. One big use-case here is avoiding cyclic dependencies. Having constexpr functions call each other recursively can be useful.

Using const instead of constexpr doesn't change the result here. G++ still accepts; Clang still rejects. "constexpr int n;" shouldn't be legal. The issue is whether "template &lt;T&gt; constexpr int n;" should be. The "detect bugs at compile time" part was about attempting to use a fully-specialized version of the variable when that version hasn't been given a value. So "int x = n&lt;int&gt;;" will error unless n has been specialized for int.
  4. These aren't C++2a features - sorry, I guess the 1st line was misleading (I just use 2a for everything). Both claim to fully support C++17, and using -std=c++17 instead makes no difference here (nor does -pedantic or -Wall).

I do sometimes (many of my posts have spec references, including issue #1 here), but I'm not always that great at remembering where the pertinent details are (it's a huge document), and it can take a very long time for me to hunt down relevant details. The topic issues are spread across many separate sections. That's why I'm after a 2nd opinion before I post bugs (after checking for duplicates).

I'm not really sure that's accurate. In almost all cases, code is either legal or illegal C++ (although there are a handful of extra features like runtime-size arrays - usually they can be disabled, eg with -pedantic). I'd say the specification gives no freedom regarding whether code should compile or not (except for explicitly undefined behavior, extensions, and builtins and other internal details, which are hidden inside ::std).
  5. All compiled using -std=c++2a.

First: defining a constexpr member function of a template after use.

```cpp
template <typename tTYPE> struct SFoo { constexpr void foo(); };

int main() { SFoo<int> foo; foo.foo(); }

template <typename tTYPE> constexpr void SFoo<tTYPE>::foo() {}
```

GCC compiles and links this fine. Clang doesn't instantiate foo(), and thus fails to link. NOTE: constexpr implies inline. I think LLVM is wrong here - the definition just needs to be provided somewhere. Am I right? (From this post.)

Second: default initialization of an unused constexpr template variable.

```cpp
template <typename T> constexpr int n;

int main() {}
```

Again, GCC accepts it, but clang++ complains. If you want to only use specializations of n, requiring a value here is unnecessary (and could prevent detecting bugs at compile time).

Are both of these LLVM bugs?
  6. I'm afraid I totally screwed that up, my apologies. I had this issue years ago and I'm only now working to tidy up the code, but it appears I forgot the details of the issue. My issue is actually both specializations AND overloads. And the specializations are more (albeit not entirely) "obedient", so you're totally correct there. I did have some issues with specializations too, but either in combination with overloads or in cases where I actually wanted to specialize after they'd been used (altering the literal-string specialization after earlier assertions had already used it - which I'll agree was actually due to bad design). So I may as well just ask about overloads, since fixing that will also fix all related specialization issues. And I should give an actual explicit code example, too. Toggle commenting out "#define USER_CODE_BETWEEN" to see the issue.

```cpp
#include <cstdlib>   // for EXIT_SUCCESS
#include <iostream>

//#define USER_CODE_BETWEEN // <====--- Toggle This Line.

//=== INITIAL LIBRARY CODE ===//
template <typename T> void foo(T const& arg) { std::cout << "default = " << arg << std::endl; }

//=== USER CODE (BETWEEN) ===//
#ifdef USER_CODE_BETWEEN
void foo(int const& arg) { std::cout << "overload = " << arg << std::endl; }
#endif

//=== FINAL LIBRARY CODE ===//
template <typename T> void bar(T const& arg) { foo(arg); }

//=== USER CODE (AFTER) ===//
#ifdef USER_CODE_FIRST
void foo(int const& arg) { std::cout << "overload = " << arg << std::endl; }
#endif

int main() {
    char c = 'c';
    int i  = 1;
    bar(c);
    bar(i);
    return EXIT_SUCCESS;
}
```

USER_CODE_BETWEEN is how I want it to work, but that can't be achieved if there's only one header that includes both INITIAL LIBRARY CODE and FINAL LIBRARY CODE.
  7. NOTE: This is a cross-post, sorry. I asked this question (in a worse way) on StackOverflow, but it's not really a StackOverflow-type question.

Say you want to write a library, which you want to:

  • Provide a user-specializable template type (eg CFoo).
  • Provide a function that uses that type (eg bar(CFoo& foo)) - this function needs to use any user specializations.
  • Be included by a single header file.

I don't think it's possible to achieve all three, because specializations of CFoo need to occur after CFoo's declaration but before bar()'s definition. If both are provided inside the same header, there's no sane way to get user code between the two. Looking at it another way, I'm looking for a way to delay compiling the definition of bar() until the user's code has been compiled.

The mechanisms I have considered to mitigate this problem are (copied and tidied from the linked post):

  1. Crafting some sort of single "late_call" forwarding template function, defined at the end of the source, which uses some mechanism to deduce the target function from its parameters. Unfortunately, I can only see how this could work in very rare special cases. [--] Mostly doesn't work. [-] Requires a #include at the end of the source.
  2. Creating an extendable list of headers to include, via the preprocessor, then including them all at the end via a single final #include. It's possible to hack a list like this with a fixed number of places using a lot of #defines. [-] Artificial limit. [-] Uses macro #includes, which screw up some tools. [-] Ugly as hell. [-] Requires a #include at the end of the source.
  3. Creating my own manual pragma-type command, then writing an external tool to run over the preprocessed code and move stuff about before compiling. [+] Works perfectly. [+] Nothing needs to be added to the end of the source. [--] This pretty much ensures nobody will ever want to use my library, and I'll probably hate it myself too.
  4. Creating a "late.hpp", in which I add #includes for every delayed definition, guarded by #ifdefs to check whether they're needed. [-] Requires a #include at the end of the source. [--] Breaks modularity.
  5. Manually adding a list of delayed-definition headers at the end of each source file. [--] Breaks modularity. Source files may indirectly acquire new delayed-definition requirements if other implementations change. [-] Ugly. [--] Potential source of bugs.

I'm currently using #4.

Does anyone have any better ideas? NOTE: I'm especially interested in methods that would work for multiple independent libraries.
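A minimal sketch of mechanism #4 as I understand it. This is a header fragment, not a complete program, and every file and macro name here is hypothetical:

```cpp
// late.hpp - included once, at the end of each user source file.
// Each library header that declares a delayed definition also defines a
// guard macro, so only the "late" headers that are needed get pulled in.
#ifdef MYLIB_CFOO_DECLARED
	#include "mylib/bar_late.hpp"      // defines bar(CFoo&) after user specializations
#endif
#ifdef OTHERLIB_THING_DECLARED
	#include "otherlib/thing_late.hpp" // another library's delayed definitions
#endif
```

The modularity problem the post mentions shows up here: whenever any implementation starts needing a new delayed definition, every user source file's late.hpp trailer has to be kept in sync.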
  8. Sorry, I meant to say doesn't necessarily care. I'd just forgotten the word for a second and forgot that I'd forgotten it when I posted.
  9. There's an issue with that: you don't have the guarantee that all warps will be working on the same primitive. Half of Warp A could be working on triangle X, and the other half of Warp A could be working on triangle Y. GPUs make some effort to keep everything convergent, but if they were to restrict triangle X to one set of warps and triangle Y to another set of warps, it would get very inefficient quickly.

Ah, well, there goes that idea I guess (although, given that huge triangles still exist sometimes, you could possibly still gain if it's possible to use an "if" to determine whether a warp is single-primitive, depending on how significant avoiding those cache reads is).

Mostly I think it's useful and important to get an understanding of how stuff works internally. I understand that there are huge differences between some technologies, but I'm not interested in mobile - like a sportscar fan doesn't care about tractors. For this case, specifically, there are two reasons. 1 - If what I'm suggesting turns out to be possible with the current SPIR-V instruction set, I might give it a go. 2 - The issue of redundant flat-surface calcs seems like an extremely clear duplication of effort. It screams out for optimization. samoth's reply already helped to remove a lot of redundancy, but it seems strange that it's not mentioned more often, as this is an issue that total beginners often query (eg "how do I send per-face data?").

I've read a bit recently, but I still need a lot more. I will take a look at those links, thanks.
  10. No, different warps of pixels may well be executing on different "cores", and would have to communicate with each other via memory. There also isn't necessarily a "first warp of a primitive" - ideally dozens of warps would be scheduled to start simultaneously :)

If it's flat data then it's constant, so there's no need for communication between cores. For example, if 10 cores execute 5 warps each for a single primitive, then, for each core, the first (consecutive) warp on each core copies the flat data from cache (or memory, if you're unlucky) into registers. The following warps could somehow determine they're using the same primitive and then just use the data in the registers without needing to load it from cache. So you're duplicating the flat-data copy once per core - but that's better than once per warp. Of course, unless you add hardware support (which I'd imagine would be extremely cheap and tiny), you'll need a warp-consistent if statement that is able to determine whether it's the first warp - possibly by comparing primitive ID with a register. I don't know if that additional cost would be worse than the cache fetches it avoids.
  11. Is it possible to efficiently preserve that data across multiple pixel-shader warps on the same primitive? So basically just set the register values for the first warp of a primitive. I guess you could check if the primitive ID has changed, but the cost of that might be more than the benefit unless your per-primitive data was big (and as you say, AMD only has 64 bytes). I guess that'd only be a benefit for large triangles though.
  12. There's no way they could actually shove flat values into core registers? And leave them there the entire time the pixel shader is processing the primitive. Typically, every pixel needs to read them. I guess current hardware doesn't have registers that would suit that usage, though.
  13. Sorry about the ultra-late reply. I was kinda distracted by events recently and I'm just getting back to coding.

Isn't using the provoking vertex and flat "interpolation" kinda inefficient, though? Flat data can sometimes be larger than "true" vertex data, even to a point where, on average, over half your total vertex data is unused. It seems like such a good target for improving efficiency. Screen-space derivatives seem like they would often be inefficient too. You're effectively calculating the same values repeatedly on every quad-lane, rather than just having them constant. I guess it could be slightly faster in some cases with tiny triangles, but for ones with 100+ texels I'd be surprised if it were more efficient than sending the data to the cores. And they're less accurate and less flexible.

Thanks heaps for the detailed reply, you've cleared up a number of questions.

"Core" seems to be used for both individual lane hardware and for SIMD blocks ... and GCN. But I guess the alternative is to use vendor-specific terms.

Cool. I'll go look that up.

Thanks. That fixes around 3/4 of the issue. Internally, though, there's still arithmetic occurring: data address = buffer pointer + (buffer index * primitive-data-object size). You might get lucky and have a power-of-two primitive-data-object size, so you can use a shift rather than a multiply. You might also get lucky and have the compiler compute this offset once per core rather than per lane if it figures out that it's static. But it's still repeatedly doing an identical calculation that only needs doing once - if at all - occupying silicon and burning joules.

Thanks. I just reread that, and I think I missed some pages the first time. Definitely worth reading (twice).
  14. Firstly, if anyone knows of any decent resources for learning details like I'm asking about, can you tell me plzthx. I'm happy to do a lot of reading. Here's a handful of GPU questions I'm having trouble finding answers to. I thought it'd be easier to ask these in bulk rather than as multiple questions; I hope that's OK. I asked on Khronos (https://forums.khronos.org/showthread.php/13413-A-pile-of-technical-GPU-questions-sorry-), but they didn't have a category for it and it didn't get any response. I'm really asking about standard practice in immediate-rendering GPUs (Nvidia/ATI/maybe Intel).

1. Terminology. Is there a standard terminology for GPU shading components yet? What's the best way to refer to:
  • The element responsible for a single texel output (eg CUDA core). (= Lane? Unit?)
  • The block of elements (above) whose instructions are performed together (SIMD). (= Core?)
  • The component responsible for managing tasks and cores. (= Thread dispatcher?)
I will use "lane" and "core" for the rest of this uberquestion.

2. Memory addressing. Is GPU access to graphics memory ever virtual (ie, via page tables)? Can the driver/GPU choose to move resources to different parts of physical memory (eg to avoid contention when running multiple applications)?

3. Per-primitive user data. GPUs don't support per-primitive (or per-tessellation-patch, etc) data (eg per-triangle colors/normals) yet, right? Is there any technical reason why? Implicit per-primitive data is already required by cores (interpolation constants and flat values). This seems to be a common request, and data does seem to be being wasted.

4. ROP texel ordering. How is order preserved when sending finished texels to ROPs (render-output units)? Where/how do out-of-order texels queue when the previous primitive hasn't been fully processed by the ROPs?

5. TMUs and cores. Can any lane/core use any TMU (texture-mapping unit) (assuming it has the same texture loaded), or are they grouped somehow? Is there a texture-request queue, or is there some other scheduling method?

6. Identical texture metadata. For two textures with identical metadata in the same memory heap, is switching a TMU between textures necessarily any more complex than simply changing the TMU's texture pointer offset (ignoring resulting cache misses)?

7. Data "families". There seem to be many data "families" available to core lanes:
  A. Per-lane:
    1. Private lane variables. (Read/Write.)
    2. Lane location/index (differentiating lanes within a core). (Read-only.)
    3. Derivatives (per pair/quad?). (Read/Write(ish).)
  B. Per-core (read-only):
    1. Per-primitive (or patch, etc) constant data. Interpolation constants etc.
    2. Draw-call-constant data (uniforms, descriptor set data).
  C. RAM-based stuff (TMU, buffer array data, input attachments, counters, etc).
Does that make sense? Are B1 and B2 stored in the same area? Are they stored per-core or shared between cores somehow? They're often identical between many cores, but IIUC other cores can be performing different tasks. How does the task-manager/thread-dispatch write to each core's B1/B2? In bulk / all at once, or granularly? Are these writes significant performance-wise? (Kinda technical, but related to a shader-design issue I have.)

Thanks for all input.