Texture Units


Graphics cards are now up to 100+ texture units (e.g. on the GTX 1080). These are 100+ units that can each fetch and filter a texture, but if I write a single shader with a single texture fetch:

uniform sampler2D uTexture;   // placeholder sampler name
varying vec2 vTexCoord;       // placeholder UV varying

void main()
{
    gl_FragColor = texture2D(uTexture, vTexCoord);
}

Will that map the texture to only 1 texture unit, or will it spread that work out across as many of those 100 units as possible? I don't see why it wouldn't spread the work. But here is why I ask:

Someone once told me, and I have now verified, that using 3 separate source textures instead of 1 texture array containing 3 textures is faster. My terrain shader has definitely confirmed this.

With 3 normal textures + 3 diffuse textures bound to 6 different texture slots = 108 FPS

With 1 normal array (3 slices) + 1 diffuse array (3 slices) = 83 FPS (obviously still fetching 3 times from each texture array).
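
For reference, a minimal GLSL sketch of the two setups being compared (sampler and variable names are made up, not my actual terrain shader, and the blend logic is omitted):

#version 330

// Setup A: 3 separate 2D textures bound to 3 different slots
uniform sampler2D uDiffuse0, uDiffuse1, uDiffuse2;
// Setup B: 1 array texture containing the same 3 images as slices
uniform sampler2DArray uDiffuseArray;

in vec2 vUV;
out vec4 fragColor;

vec4 fetchSeparate()   // 3 fetches, each from a different texture
{
    return texture(uDiffuse0, vUV)
         + texture(uDiffuse1, vUV)
         + texture(uDiffuse2, vUV);
}

vec4 fetchArray()      // 3 fetches, all from the same array texture
{
    return texture(uDiffuseArray, vec3(vUV, 0.0))
         + texture(uDiffuseArray, vec3(vUV, 1.0))
         + texture(uDiffuseArray, vec3(vUV, 2.0));
}

void main()
{
    fragColor = fetchSeparate();   // swap to fetchArray() for the other test
}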

So what is going on at the hardware level when I perform fetches from the texture array? Is the work not being scheduled across as many texture units as possible, or does each texture literally get exactly 1 texture unit, with all requests for that texture blocked until it gets a fetch back?

If that is the case, then wouldn't it make sense, for a screen-space algorithm such as HBAO that takes 16 samples per pixel, to bind the depth buffer to unit 0, unit 1, unit 2, unit 3, ... unit 15, and then perform 1 fetch on each of those samplers in the shader, so that they all fetch in parallel? (A sketch of that idea is below.) I'm not aware of anyone ever having done that, or of why you would ever need 100 textures bound at once, which leads me to believe that the hardware will always try to spread fetches across all 100 units, and that something stupid is simply happening with these texture arrays.
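
Something like this, purely to illustrate the idea (the uniform names and offsets are made up, and as the replies below point out, binding the same texture to multiple slots doesn't actually help):

#version 330

// Bind the SAME depth buffer to several texture slots...
uniform sampler2D uDepth0, uDepth1, uDepth2, uDepth3;   // ...and so on, up to uDepth15
uniform vec2 uSampleOffsets[16];                        // hypothetical HBAO tap offsets

in vec2 vUV;

float gatherDepth()
{
    // ...then spread the taps across those samplers, hoping each one
    // gets serviced by its own hardware texture unit in parallel.
    return texture(uDepth0, vUV + uSampleOffsets[0]).r
         + texture(uDepth1, vUV + uSampleOffsets[1]).r
         + texture(uDepth2, vUV + uSampleOffsets[2]).r
         + texture(uDepth3, vUV + uSampleOffsets[3]).r;
         // ...and so on for the remaining taps
}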

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


In your performance test, there's a 14ms difference between the two tests. To me, that sounds like a bug, and I'd keep drilling into that situation to find out why :o
They really should be quite similar...
The array case should have less CPU overhead, as there are fewer API calls to bind the resources, and it requires less set-up time per pixel, as there's only one resource descriptor to load instead of three. Perhaps there's more GPU overhead due to having to add a slice index multiplied by the slice stride... But that's like one instruction, which would require a hell of a lot of pixels to be drawn to add up to 14ms!

As for texture units, there used to be a fixed number of registers that would hold texture descriptors (the guts of an SRV), which set a hard limit, and mapped perfectly to these "slot based" APIs.
These days, GPUs are "bindless", where fixed descriptor registers have been replaced with general purpose registers. Descriptors are stored anywhere in RAM, and the first thing that a shader does is fetch the descriptors from RAM and into some GPRs. When performing a texture fetch, the addresses to read pixel data from are computed by reading the GPRs that hold the texture descriptor and the GPRs that hold the texture coordinates. It's mostly just general purpose memory fetching.
In this new model, the number of textures bound at once is essentially limitless. Only a finite number of descriptors (SRVs) will fit into the register space at once (there's a finite number of GPRs available), but you could always fetch a new descriptor from memory right before doing the texture fetch, removing this hard limit. The API limit of 128 on D3D11 is completely arbitrary.
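
To make that concrete, here's a minimal GLSL sketch of the bindless style of texture access, assuming the GL_ARB_bindless_texture extension (the buffer layout and names are made up; on bindless hardware the driver generates this kind of plumbing for you even when you use the old slot-based API):

#version 450
#extension GL_ARB_bindless_texture : require

// Hypothetical buffer of 64-bit texture handles sitting in ordinary GPU memory.
// The app fills it via glGetTextureHandleARB + glMakeTextureHandleResidentARB.
layout(std430, binding = 0) buffer TextureHandles
{
    uvec2 handles[];
};

in vec2 vUV;
flat in int vTextureIndex;   // which texture this draw/pixel wants
out vec4 fragColor;

void main()
{
    // "Binding" is just reading a descriptor from memory and using it - no fixed slots.
    fragColor = texture(sampler2D(handles[vTextureIndex]), vUV);
}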

On both old and new hardware, binding the same texture twice is useless - it just adds extra CPU overhead, and on modern GPUs it forces them to fetch these extra descriptors from RAM per wavefront.
The actual texture-fetching hardware (confusingly, also sometimes called a texture unit) is unrelated to and decoupled from these API binding points / slots / SRVs. A GPU computation core will automatically make use of all of its available memory channels, which may include one or more dedicated texture fetching and filtering hardware units.

Yeah, I'll have to get an exact test, but I adjusted my FPS numbers already, so it's not as bad: looking more like 83 FPS vs 98 FPS, so about a 2ms difference. My original numbers were just wrong.
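
(In frame times, that's 1000/83 ≈ 12.0 ms vs 1000/98 ≈ 10.2 ms, i.e. roughly a 1.8 ms gap per frame.)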

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

I haven't ever tested that situation myself, but it still seems odd to me that there's a difference that's measurable :)
Other things to consider are resource tiling patterns. Texture fetches are random-access, but accesses that are close in time to each other tend to be clustered in regions (1D for buffers, 2D for textures, 3D for volume textures). GPUs exploit this by rearranging the array indexing. Instead of typical row-major array addressing, where pixel indices increment horizontally across rows, textures typically use a Z-curve (Morton order), which is better at keeping spatially close texels close together in memory. It could be that in your test, the 3x 2D textures are picking a better tiling pattern than your 1x 2D-array texture is :(
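
To illustrate what Z-curve (Morton) indexing means, here's a small GLSL-style sketch of the bit interleaving involved; actual GPU tiling formats are proprietary and more involved than this, so treat it purely as an illustration:

// Spread the low 16 bits of v out with zeros in between: ...dcba -> ...0d0c0b0a
uint part1By1(uint v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8u)) & 0x00FF00FFu;
    v = (v | (v << 4u)) & 0x0F0F0F0Fu;
    v = (v | (v << 2u)) & 0x33333333u;
    v = (v | (v << 1u)) & 0x55555555u;
    return v;
}

// Z-curve (Morton) index: texels with nearby (x, y) get nearby memory addresses,
// whereas with row-major order, (x, y) and (x, y + 1) are a whole row apart.
uint mortonIndex(uint x, uint y)
{
    return part1By1(x) | (part1By1(y) << 1u);
}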

In D3D, the resource creation flags will likely be used to pick these memory layouts, etc., which is why it's important not to specify flags such as D3D11_BIND_UNORDERED_ACCESS unless you have to.

My guess would be that it's a caching issue. If each texture unit generally samples 1 texture, it can have a better cache around that memory. Now that it is sampling the same UV coordinate but in a 2nd or 3rd slice, it may just not have as good cache behavior.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

If each texture unit generally samples 1 texture

Like I said before, the modern hardware concept of a texture unit does not match up with the API's concept of a texture unit.
The API exposes 128 "texture binding slots", but the hardware might have as few as one texture-decoding hardware block per four pixels.
e.g. in this diagram of the AMD GCN architecture:
[image: block diagram of an AMD GCN Compute Unit]
Inside a Compute Unit, there's a 64-thread wavefront (made up of 4x SIMD16 processors working together to act as SIMD64), but there are only 16 "texture-fetch units", which means every 4 pixels get to share 1 texture-fetch unit! Worse, there are only 4 texture filtering units, so every 16 pixels have to share one texture filter unit!!

Also, the entire wavefront shares a single L1 cache, and all the compute units (there might be, say, a dozen or so CUs on the chip) share a single L2 cache. It's actually worse than that, because a CU can "hyperthread" up to 10 wavefronts at the same time, meaning you've got up to 640 threads sharing the same 16 texture-fetch units, 4 texture-filtering units and 16KB of L1 cache :o

The API/shader's idea of a "texture unit" is a lie. At the very beginning of a shader, it first fetches the structs that describe all of the bound textures into the scalar registers.
e.g. your shader begins with some hidden / auto-generated code, something like:


// Pseudocode, not real HLSL: each "bound texture" is just a small struct in memory.
struct TextureDescriptor { u16 width, height, depth, mips, format; void* data; };
StructuredBuffer<TextureDescriptor> _internalBindings;
// Preload the descriptor for every bound slot into registers:
TextureDescriptor t0 = _internalBindings.Load(0);
TextureDescriptor t1 = _internalBindings.Load(1);
...
TextureDescriptor t127 = _internalBindings.Load(127);

When your shader tries to read from a texture, say "API texture slot#0", it uses the data in this hidden variable t0 to do so.

