Infinisearch

Member
  • Content Count

    1194
  • Joined

  • Last visited

Community Reputation

3053 Excellent

6 Followers

About Infinisearch

  • Rank
    Contributor

Personal Information

  • Role
    Programmer
  • Interests
    Business
    Design
    Programming


  1. Infinisearch

    Multiple AppendStructuredBuffers

    Divergence, whether branch or data, wouldn't parallelize anything, unless I'm missing what you're trying to say. IIRC in AMD GCN the L1D cache isn't banked while the LDS is banked. If your atomic counters per append list are in LDS there is an opportunity to parallelize the increments of the write pointers (I think). Then I think (I'm not sure) either memory coalescing on access to the L1D cache would take care of parallelizing what can be parallelized in the actual appending, or all appends are serialized. But my question had to do with contention on the same atomic counter... I was wondering if the coalescing hardware did some kind of trick by realizing it's the same atomic counter and somehow parallelizing multiple increments.
  2. I don't know Vulkan but to get a clearer idea of what might be happening I'll ask the following question: What exactly do you mean by z-fighting in this context? Do you mean nothing is passing your depth test in the second pass or something else?
  3. Infinisearch

    Multiple AppendStructuredBuffers

    A question about append buffers with regard to SIMT execution. If you have a shader in which all lanes access the same append buffer, does the hardware accelerate this in some way? Or is it serialized atomic adds equal to the number of lanes?
  4. That's understandable, but shared memory per thread group is well within your grasp, so I think you should at least take that into account. Also, trying to save registers isn't necessarily a bad thing and might pay off, but using CodeXL or whatever profiling tool you're using sounds like a good idea before you go crazy explicitly trying to save registers with no hope of reaching a target.
  5. I think I made a mistake in this section of my post. I say both 40% occupancy and 80%=40%*2, but this is wrong. If you have one thread group accessing 32k of shared memory then you can only afford two thread groups on that CU, so the 6*2=12 thread groups is correct. But if the thread group size is only 256, that's 256/64=4 waves per thread group, which with two thread groups leads to 8 waves per CU. A CU holds at most 40 waves, so 8/40 = 20% occupancy. I accidentally used the occupancy per SIMD and totally forgot I was calculating per CU, sorry for the mistake. So if you wanted to keep a thread group size of 256 but increase occupancy you'd need to reduce shared memory usage per thread group. So staying with the original thinking of 80% occupancy and a thread group size of 256 you'd have 32 waves/4=8 thread groups per CU, which means 64k/8=8k shared memory per thread group. If you need more shared memory per thread group and are still aiming for 80% occupancy, see if you can increase your thread group size. I think this is wrong too since my earlier math/logic was in error. Let me try the math again... 80% occupancy = 8 waves per SIMD. A SIMD has a 64k register file organized as 4-byte-wide registers, so 64k/4=16384 registers. 16384 registers/64 lanes=256 VGPRs. 256 VGPRs/8 waves=32 VGPRs per wave. So I guess I got that part right.
  6. Oh yeah, I forgot to mention that in my quoted example I was going for 32 VGPRs minimum as a goal, which is what the above quote provides (32 VGPRs). If you need more registers, either reduce occupancy or reduce thread group size. I'm not surprised because of the price difference; I'm surprised because the 1070 beats the Fury X in almost all the benchmarks I've seen except for a double precision compute benchmark and a couple of what I assume are single precision benchmarks. Just look at this benchmark comparison summary: https://www.anandtech.com/bench/product/1720?vs=1731 edit - also in the above quote you said a 280X, you sure you didn't mean a 290X?
  7. Actually I think it's 10 wavefronts per SIMD, not per CU, so it's 6*4*10=240. At least according to the link I posted two posts ago and according to this crash course on GCN: https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah Really? That's pretty surprising since a 1070 is on par in performance with a 980 Ti, which usually beats or matches a Fury X. Have you tried with CUDA?
  8. Is that considered the runtime or something else?
  9. How would you iterate over the right register? Wouldn't that require self-modifying code or the ability to index a register based on a register? If the latter instruction is available then you do have constant time access. The only other option I can think of would be a case statement. Don't forget that while each CU has 64k of shared memory, D3D compute shaders limit you to 32k of shared memory per thread group. Do you mean theoretical GFLOPS or do you mean measured performance? You don't have 6 waves on your GPU... you have 6 CUs on your GPU. So going by what JoeJ said earlier (which is the type of stuff the link in my last post covers) you would have a thread group size of 256 accessing 32k of shared memory, which if I'm not mistaken leads to 40% occupancy. If you use the main queue to dispatch your compute shader then you will run 6*2=12 thread groups (80%=40%*2) of 256 threads at a time on your GPU while using all the shared memory. So while your GPU only has 384 total lanes you can run 12*256=3072 threads at once while utilizing your GPU efficiently.
  10. Did you try creating the resource using the GPU just to try? I'm not sure about this, I'm just guessing, but doesn't an upload heap have to be able to be mapped to GPU address space? If so, maybe there's a limitation as to how much can be mapped at a time. I was wondering what exactly the graphics memory manager is and where it's 'located' (runtime, user-mode driver...)? Is it what evicts and loads committed resources into VRAM?
  11. IIRC that's pre-2006/7. In 2006/7 the first GPU with unified shaders came out. Unified shaders mean that the cores can be used for graphics or compute. They do use the same cores, which implies the same chips. As far as async compute goes, you need to have 'matching' shaders. This basically means not all shaders run well together. You don't have to use async compute, it's optional. Also, a graphics workload and a compute workload can execute at the same time on the same CU/SM, but again they have to be 'compatible'. Finally, I found out that on GCN AMD does/can indeed run all threads in a thread group in lock step. I read it in the following link, which is a nice blog post on how to get good compute performance on AMD GCN graphics cards. If you understand it you'll get a better understanding of how to use async compute as well. https://gpuopen.com/optimizing-gpu-occupancy-resource-usage-large-thread-groups/
  12. Infinisearch

    Rendering lights in deferred shading

    Oh, I had forgotten about a presentation I had that might be of interest to you. I'll attach it here since I couldn't find it on the net with a quick Google search. It's a paper on implementing a lot of shadows; at the time they were able to get a lot of shadow-casting lights on the GeForce Titan of the time. (I think that's a Kepler-based Titan, which equates to a 1060 or 1070 nowadays.) The paper and presentation can be found here: http://www.cse.chalmers.se/~uffe/publications.htm Just search for shadows. a12-olsson.pdf
  13. OK, so in my last post I messed up big time. I forgot something I had read with regard to shared memory. Anyway, long story short, you do need to use GroupMemoryBarrierWithGroupSync. Basically a thread group size larger than the SIMT width gets broken down into multiple waves (AMD) / warps (NVIDIA), but there is no guarantee of synchronization between them, so you have to use a sync point. Even if some hardware did sync up related warps/waves, without a guarantee that all hardware works this way you need to standardize the use of the barrier with sync to keep the hardware inconsistencies transparent. Sorry for my terrible memory causing trouble. I also watched a bit of this video, which confirmed NVIDIA hardware (at least of the Kepler generation) treats all warps independently. (First 10 minutes) http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4
  14. Infinisearch

    Rendering lights in deferred shading

    I was just wondering what is commonly used currently? Seeing as how Battlefield 3 was the first game with tiled deferred, it's been around for a while now.
  15. Infinisearch

    Rendering lights in deferred shading

    Yeah, that would handle multiple shadow maps being "bound to the pipeline" at the same time. But I was referring to how one knows which lights actually have shadows associated with them. Are the first N lights dedicated to lights with shadow maps? Or is there a reference to a shadow map per light?
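
The GCN occupancy arithmetic worked through in posts 5, 7 and 9 above can be sketched in a few lines of Python. This is only an illustration under the figures quoted in those posts (64 lanes per wave, 4 SIMDs per CU, 10 waves per SIMD, 64k of LDS per CU, a 64k VGPR file per SIMD with 4-byte registers); the function and constant names are made up for this sketch and do not come from any AMD or D3D API.

```python
# Figures quoted in the posts above (assumed, not queried from hardware).
WAVE_SIZE = 64                   # lanes per wavefront
SIMDS_PER_CU = 4
MAX_WAVES_PER_SIMD = 10
LDS_BYTES_PER_CU = 64 * 1024     # shared memory per CU
VGPR_FILE_BYTES_PER_SIMD = 64 * 1024
VGPR_BYTES = 4                   # one register is 4 bytes wide


def lds_limited_occupancy(group_size, lds_bytes_per_group):
    """Return (groups per CU, waves per CU, occupancy) when LDS is the limit."""
    waves_per_group = group_size // WAVE_SIZE
    groups_per_cu = LDS_BYTES_PER_CU // lds_bytes_per_group
    max_waves_per_cu = SIMDS_PER_CU * MAX_WAVES_PER_SIMD
    # Never report more waves than the CU can hold.
    waves_per_cu = min(groups_per_cu * waves_per_group, max_waves_per_cu)
    return groups_per_cu, waves_per_cu, waves_per_cu / max_waves_per_cu


def vgprs_per_lane(waves_per_simd):
    """VGPR budget per lane for a given number of resident waves per SIMD."""
    registers = VGPR_FILE_BYTES_PER_SIMD // VGPR_BYTES   # 16384
    per_lane_total = registers // WAVE_SIZE              # 256
    return per_lane_total // waves_per_simd


def max_waves_on_gpu(num_cus):
    """Upper bound on wavefronts in flight across the whole GPU."""
    return num_cus * SIMDS_PER_CU * MAX_WAVES_PER_SIMD


if __name__ == "__main__":
    print(lds_limited_occupancy(256, 32 * 1024))  # 32k LDS per group
    print(lds_limited_occupancy(256, 8 * 1024))   # 8k LDS per group
    print(vgprs_per_lane(8))
    print(max_waves_on_gpu(6))
```

With a 256-thread group and 32k of shared memory this gives 2 groups and 8 waves per CU (20% occupancy), and 8k per group gives 8 groups and 32 waves (80%), matching the corrected numbers in post 5. The sketch only models the LDS and VGPR limits discussed in the thread; other limits (SGPRs, for example) are ignored.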