Image Load/Store vs Shader Storage Buffer Object


I'm having some questions about image load/store and SSBOs (and also UBOs).

I will move solved questions down to the Facts list.

Questions:

* Why is a float[] array not tightly packed when declared in an SSBO? There is always a stride of sizeof(vec4).

* ...and why does this make sense?

* What are the typical use cases for the two?

Facts (or educated guesses):

* Do SSBOs and textures both reside in global memory? Normally yes.

* Is access to SSBOs uncached? Only the L2 cache is used.

* Is access to textures via load/store cached (texture cache)? Yes, it uses the texture cache.

* Is the actual load and store via image load/store done by the texturing hardware? Yes, it goes through the texture memory pipeline.

* Which of the two is more performant? The texture cache has better 2D spatial locality, but higher latency.

* Where do UBOs reside in memory? Uniform buffers are served by the constant cache (only one float per clock per warp).

* Can I also write to a uniform buffer? No.


You can read the specs:

https://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt

https://www.opengl.org/registry/specs/ARB/shader_storage_buffer_object.txt

https://www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt

Anything not defined there is left up to the implementation. In practice I think UBOs are generally placed in faster local memory and limited to around 64K, while SSBOs and image load/store are treated the same as textures. I think modern GPUs have a pretty much unified memory hierarchy, with a coherent L2 cache but no coherency at L1; the L1 can be disabled to get coherent access between cores. Regarding packing, there are four different memory layout qualifiers (shared, packed, std140, std430). Uniforms are implicitly constant, so there is no writing to a UBO.
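
A minimal GLSL sketch of how those qualifiers look in practice (block and member names here are invented for illustration):

layout(std140) uniform LightData       // UBO: std140 only, read-only in the shader
{
	vec4  lightPositions[8];           // array stride 16 bytes
	float intensities[8];              // std140 rounds the float-array stride up to 16 bytes as well
};

layout(std430) buffer ResultData       // SSBO: may use std430 and is writable
{
	float results[];                   // tightly packed under std430, stride 4 bytes
};

"shared" and "packed" are the remaining two qualifiers: the driver picks the layout, so offsets and strides have to be queried from the program at run time.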

I'm guessing here a lot, but this is based on what is exposed/documented in cuda for recent nvidia cards.

* Do SSBOs and textures both reside in global memory?

The GPU can read its dedicated GPU memory (== "global memory" from the GPU's perspective) but also the system memory. I believe it is possible to read a texture from system memory without transferring it to VRAM first, but usually you don't want to do that. AFAIK the GPU can also write to system memory, so it should be possible for the OpenGL implementation to make an SSBO reside in system memory. You might be able to force this with the correct flags when generating the buffer. Otherwise GPU memory should be the default location.

* Is access to SSBOs uncached?

Usually, every global memory read goes through an L2 cache and there is no reason for the OpenGL implementation to bypass that for SSBOs (except maybe for atomic operations). They are, however, not cached in L1, at least not by the hardware. The L1 is divided into two halves: one half is used to cache spilled registers, the other can be thought of as a software-controlled cache. All bets are off on how much the latter is utilized.
Also, since the caches are not a substitute for a reasonably sized register file (like on x86), their latency is significantly higher than what you are used to from the CPU.
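
For what it's worth, the cases where this caching behaviour becomes visible to the programmer are the "coherent" qualifier and atomics. A small compute-shader sketch (binding and names invented):

#version 430
layout(local_size_x = 64) in;

layout(std430, binding = 0) coherent buffer Counters
{
	uint  hits;
	float samples[];
};

void main()
{
	uint i = gl_GlobalInvocationID.x;
	samples[i] = float(i) * 0.5;   // ordinary store, goes out through the L2
	atomicAdd(hits, 1u);           // atomic, presumably resolved at the L2/memory level
	memoryBarrierBuffer();         // make the store visible to other invocations
}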

* Is access to textures via load/store cached (texture cache)?
* Is the actual load and store done by the texturing hardware?

The texture sampling hardware can only read. Same for the texture cache, it's read-only (it's also a separate memory pipeline with different coherency). I'm assuming that the CUDA equivalent to image load/store is "surfaces". According to the CUDA documentation, surface reads are cached in the texture cache. My guess is that image loads are texture-cached and image stores are more or less ordinary writes to global memory.
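
For reference, this is what the two operations look like in a compute shader (a minimal sketch, image format and binding chosen arbitrarily):

#version 430
layout(local_size_x = 8, local_size_y = 8) in;
layout(rgba8, binding = 0) uniform image2D img;

void main()
{
	ivec2 p = ivec2(gl_GlobalInvocationID.xy);
	vec4  c = imageLoad(img, p);          // the load, presumably texture-cached
	imageStore(img, p, vec4(1.0) - c);    // the store, more or less a plain write to global memory
}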

* Why is a float[] array not tightly packed when declared in an SSBO? There is always a stride of sizeof(vec4).
* ...and why does this make sense?

Is it? Sorry, I have no idea why.

* Which of the two is more performant?
* What are the typical use cases for the two?

By "the two" I assume you refer to texture LoadStore and SSBOs?

The texture cache and the texture memory layout are optimized to improve cache locality for nearby texture reads (where "nearby" actually means nearby texels in 2D textures). Since this layout is not simply row major (or column major) but some swizzled layout that might change from generation to generation, you need a specialized write/store mechanism for it; hence the need for image load/store.
So image load/store should be quite efficient when your reads are close together. However, the texture memory pipeline is longer than the normal memory pipeline. So if you don't need a cache that's optimized for 2D locality, an SSBO should be faster.
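
To make the contrast concrete, here is a sketch of the two access patterns (names and bindings invented; it assumes the image width equals gl_NumWorkGroups.x * 8):

#version 430
layout(local_size_x = 8, local_size_y = 8) in;

layout(r32f, binding = 0) readonly uniform image2D heightMap;

layout(std430, binding = 1) buffer Flat
{
	float samples[];
};

void main()
{
	ivec2 p = ivec2(gl_GlobalInvocationID.xy);

	// 2D-local neighbourhood reads: a good fit for image load/store (texture cache).
	float blur = 0.25 * (imageLoad(heightMap, p + ivec2( 1, 0)).r +
	                     imageLoad(heightMap, p + ivec2(-1, 0)).r +
	                     imageLoad(heightMap, p + ivec2( 0, 1)).r +
	                     imageLoad(heightMap, p + ivec2( 0,-1)).r);

	// Linear, per-thread streaming writes: a good fit for a plain SSBO.
	uint width = gl_NumWorkGroups.x * 8u;
	samples[gl_GlobalInvocationID.y * width + gl_GlobalInvocationID.x] = blur;
}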

* Do UBOs reside in local memory, so they're normally only 64k, but faster than SSBOs?
* Can I also write to a uniform buffer?

The CUDA counterpart to ordinary uniforms or uniform buffers is "constant memory". Constant memory is cached by the constant cache and is read-only (in CUDA, "local memory" is the L1-backed memory for spilling registers, which is very different from constant memory). So no writing there. It is, however, quite fast.
The primary attribute of the constant cache is that it can only serve one float to a warp in each cycle, so all threads of a warp should request the same float. For example, if you have a vertex shader where each thread reads the same ProjectionView[0][0] value, then this is serviced in one cycle. If, however, every thread reads a different value like someUniform[threadIndex], then it takes 32 cycles (with a 32-thread warp) for the hardware to complete that.
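
A fragment-shader sketch of the two access patterns (block and variable names invented):

#version 430

layout(std140, binding = 0) uniform Materials
{
	vec4 colors[256];
};

flat in int materialId;   // the same value for every fragment of a primitive
out vec4 fragColor;

void main()
{
	// Every thread of the warp reads the same element: one broadcast from the constant cache.
	fragColor = colors[materialId];

	// An index that differs per thread, e.g. colors[int(gl_FragCoord.x) & 255],
	// would instead force the reads to be serialized.
}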

Thank you very much Ohforf sake. Very good post.

Now for the last open question:

* Why is a float[] array not tightly packed when declared in an SSBO? There is always a stride of sizeof(vec4).

* ...and why does this make sense?

I read that GPUs preferably have 128-bit registers.

So this could explain the odd alignment. I also vaguely remember that if it were tightly packed, the ALU would have to mask out single 32-bit packets and store them in a register.

I think the article refers to CPU registers. NVidia hasn't used vector registers since the 8000 series (except for some video decoding stuff), and AFAIK the same goes for AMD GPUs.

On NVidia, the ALUs are 32 wide (hence the 32 threads per warp that should always perform the same operation) but the scatter/gather capable load/store units can do a significant amount of shuffling.

Oh, interesting fact:

I just played around with my SSBO and changed:


layout(shared) buffer Classification
{
	float	h[MAX_FACES];
};

to


layout(std430) buffer Classification
{
	float	h[MAX_FACES];
};

Now the stride is sizeof(float).

But since my application is very performance-critical (the algorithm will be running for hours), I still want to know what's going on.

std430 now gives me a "good" alignment, but "shared" offers an implementation-specific, optimized layout.

So my GPU (Radeon 6950) must have a reason to stride it the way it does?!
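
If the "shared" layout really does pad the array to vec4 strides, one common workaround (sketched here with an invented accessor, not code from this thread) is to pack four floats per vec4 yourself, since a vec4 array has a 16-byte stride under std140 and std430 (and in practice under "shared" as well):

layout(std140) buffer Classification
{
	vec4 h4[(MAX_FACES + 3) / 4];   // four of the original floats per element, no wasted space
};

float getH(uint i)
{
	return h4[i >> 2][i & 3u];      // element i/4, component i%4
}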

This topic is closed to new replies.
