wavefronts in-flight

Started by
11 comments, last by Adam Miles 5 years, 5 months ago

Hello.

The GCN paper says that one of its SIMD engines can have up to 10 wavefronts in-flight. Does that mean it can run 10 wavefronts simultaneously? If so, how? By pipelining them? AFAIK wavefronts are scheduled by a scheduler. How does the scheduler interact with the SIMD engine to make this possible? Do these 10 wavefronts belong to only one instruction?

 


The idea is to switch to other wavefronts while the current one waits on global memory operations. So they do not run simultaneously, and their instructions do not need to be in sync; they can even be wavefronts from different dispatches.

The most important thing to know is that all 10 wavefronts share the CU's registers and LDS memory. That's why we aim for low register and LDS usage, so we have as many wavefronts in flight as possible to hide memory latency.
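As a rough sketch of how register pressure caps occupancy: the 256-VGPR budget and the 10-wave hardware limit below are GCN-era figures, and the helper function is made up for illustration, not any vendor API.

```python
# Illustrative model of register-limited occupancy on a GCN-style SIMD.
# Assumed (GCN-era) figures: 256 VGPRs per SIMD, at most 10 resident waves.
def wavefronts_in_flight(vgprs_per_wave, vgpr_budget=256, hw_max=10):
    """How many wavefronts one SIMD can keep resident, limited by registers."""
    return min(hw_max, vgpr_budget // vgprs_per_wave)

print(wavefronts_in_flight(60))   # 4 waves fit: 4 * 60 = 240 <= 256
print(wavefronts_in_flight(24))   # register-wise 10 would fit; capped at hw_max
```

Shaving a shader from 60 down to 51 VGPRs would bump this from 4 to 5 resident waves, which is why small register savings can matter.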

Thank you.

This article confused me - https://www.realworldtech.com/cayman/5/

Quote

ALU wavefronts take 8 cycles to execute. The first 4 cycles are for reading the register file, one quarter-wavefront at a time. The second 4 cycles are for actually executing the operations, again a quarter-wavefront at a time. To hide the 8 cycle back-to-back latency (most CPUs have only a single cycle), the two separate ALU wavefronts (even and odd) execute in an interleaved fashion. First one wavefront accesses the register file, while the other executes; then they switch. This alternation continues until one finishes and is replaced by another wavefront. This is conceptually similar to fine-grained multi-threading, where the two threads switch every 4 cycles, but do not simultaneously execute.

I thought the author was talking about 2 wavefronts.
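The interleaving the quote describes can be sketched as a little timeline. This is a toy model of the even/odd alternation, not a hardware-accurate simulation:

```python
# Toy timeline of the Cayman-style interleave: two wavefronts alternate
# every 4 cycles between register-file access and execution.
def timeline(cycles):
    rows = []
    for c in range(cycles):
        phase = (c // 4) % 2                  # flips every 4 cycles
        even = "read" if phase == 0 else "exec"
        odd = "exec" if phase == 0 else "read"
        rows.append((c, even, odd))
    return rows

for c, even, odd in timeline(8):
    print(f"cycle {c}: even-WF {even:4}  odd-WF {odd:4}")
```

So at any given cycle exactly one of the two is executing; they hide each other's register-file latency rather than truly running at the same time.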

I assume 'Cayman' is an older, by now totally outdated architecture compared to current GCN. At least the article is from 2010. ;)

GCN was a big change from previous generations, so the older stuff is not that useful to know.

Yes, but it helps with understanding, and Cayman is not entirely different from GCN. For example Cayman has 8 wavefronts in-flight whereas GCN has 10, and it should work the same way.

So with those 10 wavefronts buffered, the CU can interleave the memory reads and the execution of different wavefronts. And these 10 wavefronts can belong to different instructions. Does it improve performance every time? What about cases where read operations are prefetched? AFAIK the compiler can do such an optimization.

2 hours ago, _Flame_ said:

and it should work same way.

I doubt this, but I don't know the details. However, I recommend this guide, which covers older GPUs too, though I have not read that part: http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf

OpenCL 1.x is the same as compute shaders; it just lacks the advanced command list stuff and indirect dispatch we see in DX12 / VK. So the guide is worth reading, especially the chapter about memory access patterns.

2 hours ago, _Flame_ said:

Does it improve performance every time?

Usually yes. But too many wavefronts in flight can cause cache thrashing, which can even reduce performance. Consoles can reduce occupancy on demand; on PC you can only reserve more LDS than necessary or waste registers to limit occupancy. I have used this trick on older NV GPUs, where it indeed improved performance slightly in rare cases, but I have not tried it again in recent years. I have heard it can be important with async compute running beside rendering, but I have no experience with that myself.
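The LDS-padding trick can be illustrated with a toy calculation. The 64 KiB LDS size is a typical GCN-era per-CU figure, and the function is hypothetical, just to show the arithmetic:

```python
# Illustrative: LDS is shared per CU, so declaring more LDS per workgroup
# caps how many workgroups (and thus wavefronts) can be resident at once.
# Assumed figure: 64 KiB of LDS per CU (typical for GCN-era hardware).
def lds_limited_groups(lds_bytes_per_group, lds_total=64 * 1024):
    """How many workgroups' LDS allocations fit in one CU."""
    return lds_total // lds_bytes_per_group

print(lds_limited_groups(4 * 1024))    # 16 groups fit with 4 KiB each
print(lds_limited_groups(16 * 1024))   # padding to 16 KiB allows only 4
```

So deliberately over-allocating LDS (4 KiB needed, 16 KiB declared) is one of the few PC-side knobs for forcing occupancy down.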

In practice I almost always see a linear increase in performance with better occupancy, at least up to 8 wavefronts, not so much beyond that. But this always depends on what you do.

2 hours ago, _Flame_ said:

What about cases when read operations are prefetched? AFAIK compiler can do such optimization.

I assume what you mean is that the compiler may rearrange instructions so the read happens earlier (which may require more registers to hold the data). This likely helps if the data is in cache; a global read can take 400-800 cycles(!).

 

My own experience here is: it is totally worth spending much time on optimizing register and LDS usage. Be sure to use a debug tool that shows you these numbers, otherwise you are working blindfolded.

Typically I end up with a speedup of a factor of 4 after optimization. (On NV this number is much smaller.) But I'm talking about pretty complex compute shaders; with simple stuff there might not be many options.

27 minutes ago, JoeJ said:

This can help likely if the data is in cache, global read can take 400-800 cycles(!).

Wow, so it takes 4 cycles to read data even if it's already in cache, and 4 cycles to execute one wavefront (since it takes 1 cycle to run 16 items and one wavefront is 64 items).

Thank you, now it's very clear.

1 minute ago, _Flame_ said:

Wow, it takes 4 cycles to read data even if it's already in cache

I think this takes much longer too; even accessing LDS takes longer than 4 cycles, I guess (anyone know some numbers?).

What you mean is probably reading data from registers.

You want to have data in registers if possible (which is now accessible even from neighboring threads with SM 6).

Secondly, you want to cache data in LDS memory, which is much faster than global memory. This is also the common way to share data across the whole workgroup, and it is usually the key to efficient parallel algorithms. It's also the main difference between pixel and compute shaders. If the term 'prefix sum' is new to you in this context, I recommend the OpenGL Super Bible chapter about compute shaders no matter what API you use - it was enlightening to me :)
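For anyone new to prefix sums: here is a sequential emulation of the Hillis-Steele inclusive scan a workgroup would run in LDS. The "threads" are just a loop here; on a GPU each index would be one thread, with a barrier between the read and write phases of each step:

```python
# Hillis-Steele inclusive prefix sum, emulated sequentially.
# On a GPU, each index i is one thread and `snapshot` stands in for
# the barrier that separates reads from writes in LDS.
def inclusive_scan(data):
    lds = list(data)              # stands in for shared/LDS memory
    n = len(lds)
    stride = 1
    while stride < n:
        snapshot = list(lds)      # all threads read before any write
        for i in range(n):        # one loop iteration ~ one GPU thread
            if i >= stride:
                lds[i] = snapshot[i] + snapshot[i - stride]
        stride *= 2
    return lds

print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```

The scan finishes in log2(n) barrier-separated steps, which is why it is such a common building block for compaction and allocation inside a workgroup.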

Finally, you want to access global memory as little as possible.

 

The 4 cycles per instruction should be correct, because one GCN SIMD is 16 lanes wide, and a wavefront is executed in 4 alternating steps. But AFAIK this has no effect on how we should program.

On 10/21/2018 at 11:16 PM, JoeJ said:

The 4 cycles per instruction should be correct, because one GCN SIMD is 16 lanes wide, and a wavefront is executed in 4 alternating steps. But AFAIK this has no effect on how we should program. 

Indeed, we shouldn't think about this.

To understand the parallelism on the AMD GCN CU SIMDs, it helped me to realise that there is a fixed number of registers (let's say 256 VGPRs, each 64 lanes of 32 bits) and all wavefronts executing on one SIMD (the smallest unit) occupy a fixed portion of those. So if your shader needs 60 VGPRs, the GPU can schedule 256/60 = 4 64-thread wavefronts in parallel (out of the maximum of 10). That means WF0 will occupy registers 0-59, WF1 registers 60-119, WF2 registers 120-179 and WF3 registers 180-239. Registers 240-255 will remain unused unless a different shader's wavefronts are ready to squeeze in in parallel. To each individual WF, the registers locally look like VGPR0-59 (each one worth 32 bits of memory, fitting 1 float or int, for example).

An important observation is that the wavefronts don't get swapped in or out of the register array. At each clock step, only one of the 4 scheduled wavefronts (in our 60 VGPR case) actually executes instructions. Put differently, they don't really run in parallel on that SIMD. However, as soon as WF0 hits a memory instruction, which is going to take many hundreds of cycles, it will be paused and WF1 can start executing its instructions. That's the latency hiding. Other CUs' SIMDs, of course, run in parallel to this and have their own register arrays. On one of the consoles, for example, there are 2 shader engines, each SE has 9 compute units, each CU has 4 SIMDs and each SIMD can schedule up to 10 WFs. So the GPU is actually executing only 2*9*4 = 72 instructions at each clock step (however, each of those 72 instructions is run for 64 threads at once!), yet up to 720 wavefronts can be in flight, because many memory requests are also in flight in parallel. If I made an error with the numbers, please excuse me :)
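The arithmetic above can be checked with a quick script. The shader engine / CU counts are the ones quoted in the post, not a claim about any particular GPU:

```python
# Checking the in-flight arithmetic from the post above
# (GCN-like console GPU; counts are the ones quoted, purely illustrative).
shader_engines = 2
cus_per_se = 9
simds_per_cu = 4
waves_per_simd = 10       # max resident wavefronts per SIMD
threads_per_wave = 64

simds = shader_engines * cus_per_se * simds_per_cu
print(simds)                      # 72: instructions issued per clock step
print(simds * waves_per_simd)     # 720: wavefronts that can be in flight
print(simds * threads_per_wave)   # 4608: threads covered by those 72 instructions
```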

If you're bandwidth-bound, your overall speed will depend on the amount of data, and the ALU is almost free. But if you add much more expensive ALU work, the actual execution of instructions from different WFs can get serialised (depending on the scheduler), because each WF has work to do instead of waiting on memory.
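A back-of-envelope way to reason about bandwidth- vs ALU-bound is a roofline-style balance point. The peak numbers here are invented for illustration, not any real GPU's specs:

```python
# Roofline-style sketch: compare a kernel's arithmetic intensity
# (FLOPs per byte moved) against the machine balance point.
# Assumed peaks are illustrative, not real hardware figures.
peak_flops = 4e12          # 4 TFLOP/s of ALU throughput
peak_bw = 200e9            # 200 GB/s of memory bandwidth
machine_balance = peak_flops / peak_bw   # FLOPs/byte needed to saturate ALU

def bound(flops_per_byte):
    return "ALU-bound" if flops_per_byte > machine_balance else "bandwidth-bound"

print(machine_balance)     # 20.0 FLOPs per byte on these assumed peaks
print(bound(2.0))          # well below balance: bandwidth-bound
print(bound(50.0))         # well above balance: ALU-bound
```

Below the balance point, extra ALU work is effectively free because the wavefronts were waiting on memory anyway; above it, their execution starts to serialise as described.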

The above concerns individual instructions. Scheduling of the individual wavefronts is handled by a circuit (the shader processor input, SPI) which has some maximum throughput and is shared between the CUs. That means it isn't able to schedule a new WF every clock cycle, but I believe this isn't usually a bottleneck.

I wanted to post a PDF with the details but I can't find it :D

On 10/21/2018 at 10:16 PM, JoeJ said:

I think this takes much longer too, even accessing LDS takes longer than 4 cycles i guess (anyone knows some numbers?)

I once wrote a shader to try and measure this and got numbers of around:
K$ (Constant Cache): ~16 cycles
L1 Hit: ~116 cycles
L2 Hit: ~170 cycles

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

This topic is closed to new replies.
