Is high ILP still important when a CU executes only 2 threadGroups?

Started by NikiTo
8 comments, last by MJP 5 years, 8 months ago

My LDS usage is at the maximum allowed. Each thread group consumes 32 KB of LDS, so with the 64 KB of LDS on a GCN CU, one CU can handle at most 2 thread groups at once.

Because this setup has low occupancy, I tried to add instruction-level parallelism, using the whole pool of available registers. I can easily reach an ILP level of 16 now in most of my shader code. The idea is to compensate for the low occupancy with high ILP.

I wonder if it is worth it at all to make my shaders' code longer in order to add ILP in this specific scenario.
Does high ILP make any sense in the hypothetical case of only one thread group per CU?


Hi! Just out of curiosity, since I'm a bit out of context: how do you "add" instruction level parallelism? CUs always execute a whole wavefront's worth of instructions (64 threads on AMD GCN executing the same instruction) at once, in lock-step. But I'm sure you know this.

If you have such low occupancy, you'll suffer on memory operations, since the GPU won't be able to switch to a different thread group and execute other instructions to hide the memory latency. If bandwidth is blocking you 90% of the time, optimising the other 10% (ALU) isn't going to help you much.
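
To put numbers on that last point (illustrative figures, not measured): if memory stalls take 90% of the time, then even doubling your ALU throughput caps the overall speedup at 1 / (0.9 + 0.1/2) ≈ 1.05×, i.e. about 5%.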

Profile profile profile? :)

22 minutes ago, pcmaster said:

how do you "add" instruction level parallelism?


// Before: one dependent add/divide chain, iterated 16 times
for (int i = 0; i < 16; i++) {
  // load fl0, fl1, fl2 from LDS
  fl0 = fl0 + fl1;
  fl0 = fl0 / fl2;
  // write fl0 back to LDS
}

//==========================================
// After: the loop unrolled into 16 independent chains.
// fl0N is the N-th copy of fl0, fl1N of fl1, fl2N of fl2.

// load all the registers from LDS
// ...

fl01 = fl01 + fl11;
fl02 = fl02 + fl12;
fl03 = fl03 + fl13;
fl04 = fl04 + fl14;

// ...

fl012 = fl012 + fl112;
fl013 = fl013 + fl113;
fl014 = fl014 + fl114;
fl015 = fl015 + fl115;

fl01 = fl01 / fl21;
fl02 = fl02 / fl22;
fl03 = fl03 / fl23;
fl04 = fl04 / fl24;

// ...

fl012 = fl012 / fl212;
fl013 = fl013 / fl213;
fl014 = fl014 / fl214;
fl015 = fl015 / fl215;

// write all the registers back to LDS
// ...
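
The intent of the second form: the 16 add/divide chains are completely independent of each other, so no instruction has to wait on the result of a previous chain before it can issue.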

[attached image: Capture.JPG]

... I think we still need more help. Neither the code nor the pic (of unknown source), nor a bit of googling, explains what ILP might be or what you expect from it. (I only find that the term seems more relevant to pre-GCN hardware, but what does it mean?)

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4

Around minute 19.

If I understand it correctly, the stalling warps execute a single instruction pass of the neighbouring busy warp. Like stealing ops from it.

On the CPU I think this is called instruction reordering, and it always covers ILP-friendly code. But on the GPU, if I understand it, it only covers it if there is a warp stalling.

Actually, I don't understand it completely, which is why I asked here. They say there are various execution units: some for 32-bit integers, some for 32-bit floats, fewer for 64-bit floats, etc. So I am not sure whether, given 8 ALUs for 32-bit floats, each of those ALUs can execute a warp if the code has enough ILP. That would make a setup with twice the occupancy but half the ILP as good as a setup with half the occupancy but twice the ILP.

Higher occupancy leaves me fewer registers to use, while low occupancy gives me unused registers I could spend on more ILP to try to compensate.
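
In numbers, the trade being asked about: 4 wavefronts with 4 independent chains each expose 16 independent instruction streams to the scheduler, and so do 2 wavefronts with 8 chains each; the question is whether the issue hardware treats the two cases the same.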

 

30 minutes ago, NikiTo said:

If I understand it correctly, the stalling warps execute a single instruction pass of the neighbouring busy warp. Like stealing ops from it.

Audio is still broken here, but just reading the text, I would assume the compiler arranges instructions to load early and depend on the data as late as possible. (Which is the primary reason you can't count registers by hand, because you don't know how many registers the compiler uses to store data loaded early.)
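
A minimal sketch of that "load early, use late" arrangement (hypothetical buffer names, nothing from the OP's actual shader):

Buffer<float> bufA;             // hypothetical inputs
Buffer<float> bufB;
Buffer<float> bufC;
RWBuffer<float> bufOut;

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // All three loads are issued up front and are in flight at once,
    // each occupying a VGPR until its value is consumed...
    float a = bufA[id.x];
    float b = bufB[id.x];
    float c = bufC[id.x];

    // ...and the first use is sunk as late as possible, so the memory
    // latency overlaps whatever ALU work can be placed in between.
    float r = a * b + c;
    bufOut[id.x] = r;
}

Every value loaded early lives in a VGPR until its last use, which is exactly why hand-counting registers undercounts what the compiler really allocates.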

If there is only one wavefront per CU due to the LDS or register limit, there is no option to 'steal' work from another. So with 'only' 1 or 2 wavefronts, all you can do is aim for low bandwidth with good caching and high ALU utilization. I think the utilization of the various integer/float units is quite independent of that limitation and too fine-grained to compensate for a stall in practice.

LDS access does not stall (its latency is far lower than memory), but you probably know that.

AFAIK, some NV GPUs have more LDS (96 vs 64 KB), and they have more VGPRs in general, but no scalar registers. Their generations are quite different from each other.

Did you try using procedural fake data instead of loading from memory, or writing just a fraction of the results? I often do this to see how costly the memory operations are.
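
A sketch of that trick, assuming a simple compile-time switch (all names hypothetical):

#define FAKE_DATA 1    // hypothetical switch: 1 = procedural data, 0 = real loads

Buffer<float> inputBuffer;      // hypothetical resources
RWBuffer<float> outputBuffer;

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
#if FAKE_DATA
    // Procedural stand-in with a similar value range; zero bandwidth cost.
    float v = sin((float)id.x * 0.001f);
#else
    float v = inputBuffer[id.x];    // the real load being measured
#endif

    float r = v * v + 1.0f;         // stands in for the shader's actual ALU work

#if FAKE_DATA
    // Write almost never, but keep a data dependency so the compiler
    // can't dead-strip the whole computation.
    if (r == 123456.0f)
        outputBuffer[0] = r;
#else
    outputBuffer[id.x] = r;
#endif
}

Comparing the timings of the two variants tells you how much of the shader's cost is the memory traffic itself.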

 

How big are your thread groups? Larger than the HW wavefront size? 

13 hours ago, JoeJ said:

(Which is the primary reason you can't count registers by hand, because you don't know how many registers the compiler uses to store data loaded early.)

I declare all my variables at the very beginning of the code. I even declare the counters I use inside loops outside of them. I do this because I read somewhere that some compilers are bad and don't free the temporary variables created inside loops when the loop ends, causing them to run out of registers for no reason. This way I know at least the absolute minimum number of registers my code needs. If the compiler can spot the spans of time where my variables can be reused, and reuses them, I'd give all my kudos to that compiler XD
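
For reference, the pattern being described looks roughly like this (illustrative only):

// Everything declared up front, loop counters included:
int i;
float acc = 0.0f;
float tmp = 0.0f;

for (i = 0; i < 16; i++)
{
    tmp = acc * 0.5f;   // reuse tmp rather than declaring a fresh local
    acc = acc + tmp;
}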
 

9 hours ago, Hodgman said:

How big are your thread groups? Larger than the HW wavefront size?

My thread groups are 512 threads, on AMD hardware.

2 * 512 threads is 1024 threads per CU, which works out to an occupancy of 4 for each of the 4 SIMDs on a CU. While an occupancy of 4 isn't amazing, it's not terrible either. I would be careful about using more than 64 VGPRs, since if you go above 64 your occupancy will drop below 4 and your performance may suffer.
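
Spelling that arithmetic out with GCN's numbers (64 threads per wavefront, 4 SIMDs per CU, a maximum of 10 waves per SIMD, 256 VGPRs per thread available on each SIMD):

2 groups x 512 threads = 1024 threads per CU
1024 threads / 64 threads per wave = 16 wavefronts per CU
16 wavefronts / 4 SIMDs = 4 wavefronts per SIMD (occupancy 4 of 10)

On the VGPR side, waves per SIMD = floor(256 / VGPRs per thread): 64 VGPRs gives exactly 4 waves, matching the LDS limit, while 65+ VGPRs gives floor(256 / 65) = 3 waves and occupancy drops.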

