Graphics cards now have 100+ texture units (e.g. on the GTX 1080). Each of these units can fetch and filter a texture, but suppose I write a single shader with a single texture fetch:
uniform sampler2D tex;
varying vec2 texCoord;

void main() {
    gl_FragColor = texture2D(tex, texCoord);
}
Will that map the texture to only 1 texture unit, or will the work be spread out across as many of those 100 units as possible? I don't see why it wouldn't spread the work. But here is why I ask:
Someone once told me, and I have now verified, that using 3 separate source textures is faster than using 1 texture array containing 3 textures. My terrain shader has definitely confirmed this:
With 3 normal textures + 3 diffuse textures bound to 6 different texture slots: 108 FPS.
With 1 normal array (3 textures) + 1 diffuse array (3 textures): 83 FPS (obviously still fetching 3 times from each texture array).
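For reference, the two variants I'm comparing look roughly like this (sampler names and layer indices are illustrative; the array version assumes GLSL 1.30+ for `sampler2DArray`):

```glsl
// Variant A: 6 separate samplers on 6 texture units -> 108 FPS
uniform sampler2D diffuse0, diffuse1, diffuse2;
uniform sampler2D normal0, normal1, normal2;

// Variant B: 2 texture arrays, 3 layers each -> 83 FPS
uniform sampler2DArray diffuseArray; // layers 0..2
uniform sampler2DArray normalArray;  // layers 0..2

// In variant B every fetch goes through the same sampler,
// just with a different layer index:
// vec4 d0 = texture(diffuseArray, vec3(uv, 0.0));
// vec4 d1 = texture(diffuseArray, vec3(uv, 1.0));
// vec4 d2 = texture(diffuseArray, vec3(uv, 2.0));
```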
So what is going on at the hardware level when I perform fetches from the texture array? Is it failing to schedule the fetches across as many texture units as possible, or does a bound texture literally get exactly 1 texture unit, so that all requests for that texture are blocked until the previous fetch returns?
If the latter is the case, then wouldn't it make sense for a screen-space algorithm such as HBAO, which takes 16 samples per pixel, to bind the depth buffer to units 0 through 15 and then perform 1 fetch on each of those samplers in the shader, so that they all fetch in parallel? I'm not aware of anyone ever doing that, and I can't think of any other reason you would need 100 textures bound at once. That leads me to believe that the hardware will always try to spread fetches across all 100 units, and that something stupid is simply happening with these texture arrays.
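The hypothetical HBAO setup I'm describing would look something like this. This is purely a sketch of the idea, not something I've seen done; the sampler array and offsets are made up, and indexing a sampler array with a loop counter assumes a GLSL version where that index counts as dynamically uniform:

```glsl
// Hypothetical: the same depth texture bound to 16 texture units,
// one sampler per HBAO sample, hoping the 16 fetches run in parallel.
uniform sampler2D depthSamplers[16]; // all bound to the one depth buffer
uniform vec2 sampleOffsets[16];      // per-sample offsets around the pixel

float sampleOcclusion(vec2 uv) {
    float ao = 0.0;
    for (int i = 0; i < 16; ++i) {
        // one fetch per sampler instead of 16 fetches on one sampler
        float d = texture2D(depthSamplers[i], uv + sampleOffsets[i]).r;
        ao += d; // placeholder for the real horizon-angle computation
    }
    return ao / 16.0;
}
```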