I'm
guessing here a lot, but this is based on what is exposed/documented in cuda for recent nvidia cards.
* SSBO and Textures both reside in global memory. ???
The GPU can read its dedicated GPU memory (== "global memory" from the GPU's perspective) but also the system memory. I believe it is possible to read a texture from system memory without transferring it to VRAM first, but usually you don't want to do that. AFAIK the GPU can also write to system memory, so it should be possible for the OpenGL implementation to make an SSBO reside in system memory. You might be able to force this with the correct flags when generating the buffer. Otherwise GPU memory should be the default location.
* Access to SSBOs is uncached ???
Usually, every global memory read goes through an L2 cache and there is no reason for the OpenGL implementation to bypass that for SSBOs (except maybe for atomic operations). They are, however, not cached by the L1, at least not by the hardware. The L1 is divided into two halves: one half is used to cache spilled registers, the other can be thought of as a software-controlled cache. All bets are off on how much the latter is utilized.
Also, since the caches are not a substitute for a reasonably sized register file (like on x86), their latency is significantly higher than what you are used to from the CPU.
* Access to textures via LoadStore is cached (texture-cache) ???
* The actual load and store via LoadStore is done by texturing-hardware ???
The texture sampling hardware can only read. Same for the texture cache, it's read-only (it's also a separate memory pipeline with different coherency). I'm assuming that the cuda equivalent to LoadStore is "surfaces". According to the cuda documentation, surface reads are cached in the texture cache. My guess is that LoadStore loads are texture-cached and LoadStore stores are more or less ordinary writes to global memory.
* why is a float[]-array not tightly packed when declared in an SSBO ??? There is always a stride of sizeof(vec4).
* ... why does this make sense ???
If you declared the buffer block with the std140 layout, that is expected: std140 rounds the array stride of scalars (and vec2s) up to vec4 alignment. For SSBOs you can use the std430 layout instead, which should pack a float[] tightly.
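To make the stride rule concrete, here is a small sketch of how I read the std140/std430 array rules from the OpenGL spec (my simplification, only covering the scalar/vector case, not structs or matrices):

```python
def array_stride(elem_size, elem_align, layout):
    """Array element stride under the std140/std430 buffer layout rules
    (simplified sketch: scalars and vectors only, no structs/matrices)."""
    if layout == "std140":
        # std140 rounds the element alignment up to that of a vec4 (16 bytes).
        align = max(elem_align, 16)
    else:  # std430
        # std430 drops that rounding, so elements keep their natural alignment.
        align = elem_align
    # The stride is the element size rounded up to the alignment.
    return (elem_size + align - 1) // align * align

print(array_stride(4, 4, "std140"))  # float[] in std140 -> 16 (a vec4 stride)
print(array_stride(4, 4, "std430"))  # float[] in std430 -> 4 (tightly packed)
```

So the vec4 stride you observed matches the std140 default; switching the block to `layout(std430)` should remove it.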
* which of the two is more performant ???
* what are typical use-cases for the two ???
By "the two" I assume you refer to texture LoadStore and SSBOs?
The texture cache and the texture memory layout are optimized to improve cache locality for nearby texture reads (where nearby actually means nearby texels in 2D textures). Since this layout is not simply row major (or column major) but some swizzled layout that might change from generation to generation, you need a specialized write/store mechanism for it. Hence the need for LoadStore.
So LoadStore should be quite efficient when your reads are close together. However, the texture memory pipeline is longer than the normal memory pipeline. So if you don't need a cache that's optimized for 2D locations, then SSBOs should be faster.
* UBOs reside in local memory, so it's normally only 64k, but faster than SSBOs ???
* can I also write to an Uniform-Buffer ???
The cuda counterpart for ordinary uniforms or uniform buffers is "constant memory". Constant memory is cached by the constant cache and is read only (in cuda, "local memory" is the L1-backed memory for spilling registers, which is very different from constant memory). So no writing there. It is however quite fast.
The primary attribute of the constant cache is that it can only service one float to a warp in each cycle. So all threads of a warp should request the same float. For example, if you have a vertex shader where each thread reads the same ProjectionView[0][0] value, then this is serviced in one cycle. If, however, every thread reads a different value like someUniform[threadIndex], then it takes 32 cycles (with a 32-thread warp) for the hardware to complete that.
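As a toy cost model (my simplification, not an official performance formula): if the constant cache can broadcast exactly one address per cycle, a warp's read costs as many cycles as there are distinct addresses requested:

```python
def constant_read_cycles(addresses):
    """Cycles for one warp to read from constant memory, assuming the
    constant cache broadcasts exactly one address per cycle
    (a simplified model, not an official formula)."""
    return len(set(addresses))

WARP = 32
# All threads read the same ProjectionView[0][0] -> one broadcast:
print(constant_read_cycles([0] * WARP))         # 1
# Each thread reads someUniform[threadIndex] -> fully serialized:
print(constant_read_cycles(list(range(WARP))))  # 32
```

This is why per-thread indexed data belongs in an SSBO or global memory rather than a uniform buffer.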