My 1055T has 128KB L1, 512KB L2 and 6MB L3. I haven't really researched what other processors have these days, but the SolidBlockBuffer data array is 3468 bytes in total, so I assume the CPU should be able to keep the whole thing in cache. That matters because, even though I iterate in the z direction (which is adjacent access), I still do the +- x and y jumps as well.
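As a rough sanity check on that assumption, here is the arithmetic as a small sketch. The 64-byte cache line size and the even split of the 128KB L1 into instruction and data halves are my assumptions, not measured values, and the helper names are made up for illustration.

```python
import math

# Assumptions (not measured): 64-byte cache lines, and only half of the
# 128KB L1 figure being available for data.
CACHE_LINE_BYTES = 64
L1_DATA_BYTES = 64 * 1024

def cache_lines_spanned(size_bytes, line_bytes=CACHE_LINE_BYTES):
    """Cache lines touched by a contiguous buffer that starts on a line boundary."""
    return math.ceil(size_bytes / line_bytes)

def fits_in_cache(size_bytes, cache_bytes):
    """True if the whole buffer could be resident in a cache of the given size."""
    return size_bytes <= cache_bytes

buffer_bytes = 3468  # size of the SolidBlockBuffer data array quoted above
print(cache_lines_spanned(buffer_bytes))           # number of lines the array spans
print(fits_in_cache(buffer_bytes, L1_DATA_BYTES))  # whether it fits in the assumed L1 data cache
```

Under those assumptions the array is a small fraction of L1, so the x and y jumps should only miss while the array is first being pulled in, or if other data evicts it in between.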
There are some other accesses to other data that I think I can rework to avoid in the average "nothing to render here" case.
I'm going to dig through the CodeXL docs to see if there is anything useful there in addition to the plain sample profile data I have. I'd guess it can at least give the number of cache misses in a region of code, so I can see if it's excessive.