Big array in GLSL causes OUT_OF_MEMORY


Any idea why the code below gives me "GL_OUT_OF_MEMORY", or "internal malloc failed" errors? Because I can't believe I'm actually running out of memory.

For the first time I'm using SSBOs to get relatively large amounts of data (~50..100 MB) into the video card. Making the buffer & setting its size with glBufferData doesn't seem to give any problems. But loading/compiling a shader with this code kills it:


	struct TestProbe {
		vec4	staticLight[6];
	}; // 96 bytes

	layout (std430, binding = 1) buffer ProbeSystem {
		TestProbe	testProbes[262144];
	}; // 96 x 262144 = 24 MB

Making the array smaller eventually cures it, but why would this be an issue? I also tried a smaller 2D array ("testProbes[64][4096]"), but no luck there either. My guess is that I forgot something, and the GPU is trying to reserve this memory in the wrong (uniform?) area or something... OR, maybe this just can't be done in a fragment shader and I need a compute shader instead?


Can you use testProbes[]; instead of testProbes[262144]; ?
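Roughly like this, as a minimal sketch (with a runtime-sized array, the element count comes from whatever size you passed to glBufferData, not from the shader):

	struct TestProbe {
		vec4	staticLight[6];
	};

	layout (std430, binding = 1) buffer ProbeSystem {
		TestProbe	testProbes[];	// runtime-sized: length is taken from the bound buffer
	};

	// testProbes.length() gives the element count at runtime, if you ever need it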

Also, are you planning on writing into the buffer from the shader? SSBOs are designed for cases where shaders will read and write the buffer. For the read-only case, you can use a buffer texture.

>> testProbes [ ]

And... it's gone :) The error message, I mean. Not sure if things actually work, my code isn't far enough to really test, but at least the "out of memory" errors are gone.

>> Read & Write?

Wasn't planning to write via a shader. At least not yet. Never tried "buffer textures", but looking at the link, one element can only contain up to 4 x 32 bits? My test code is rather simplistic here, but the later version will have multiple arrays and bigger structs. I'm trying to make an octree-like system with probes.

That brings me to another question. Does it make any (performance) difference when dealing with either small or large structs? Because the actual buffer content will be something like this:


struct AmbiProbeCell {
	ivec2	childMask;	// 4x4x4 (64 bit) sub-cells
	ivec2	childOffset;
}; // 16 bytes

struct Probe {
	vec4	properties;
	vec4	staticLight[6];
	vec4	skyLight[6];
}; // 208 bytes (13 x vec4)

layout (std430, binding = 1) buffer ProbeArray {
	Probe	probes[];
};

layout (std430, binding = 2) buffer ProbeTree {
	AmbiProbeCell	cells[];
};

The idea is that I query through the tree cells first (thus jumping from one small struct to another). Leaf cells then contain an index into the other "Probes" array, which holds the much larger structs. So basically I have a relatively small array and a large one. I could also join them both into a single (even bigger) array. But as you can guess, while traversing the tree I'll be jumping from one element to another. Does (struct) size matter here?

Personally I moved away from using any kind of structs on the GPU pretty early, when initial tests showed performance differences of 1000% (!).

I suggest you try some variations to see what's best for you; it depends on memory access patterns, which depend on the use case.
My experience tells me that SOA like this may be a lot better:

vec4 properties[MAX_PROBES];
vec4 staticLight[6 * MAX_PROBES];
vec4 skyLight[6 * MAX_PROBES]; // (Note: you can pack all 3 in one big SSBO and use your own indexing for proper access, which may save some registers)
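Packed into one SSBO that could look roughly like this (just a sketch; MAX_PROBES and the helper names are illustrative, and the ordering is only one of several possibilities):

	#define MAX_PROBES 262144u

	layout (std430, binding = 1) buffer ProbeData {
		vec4 probeData[];	// all properties, then staticLight face by face, then skyLight
	};

	// manual SOA indexing: neighbouring probe indices stay contiguous per face
	vec4 Properties (uint i)            { return probeData[i]; }
	vec4 StaticLight(uint i, uint face) { return probeData[MAX_PROBES * (1u + face) + i]; }
	vec4 SkyLight   (uint i, uint face) { return probeData[MAX_PROBES * (7u + face) + i]; }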


The advantage may be that if we have a workgroup processing parallel code like this:

vec4 myLight = GetLight(staticLight[threadID], normal);
myLight += GetLight(skyLight[threadID], normal) * (1-myLight.w);

First ALL threads access staticLight, then ALL threads access skyLight.
So even though those instructions follow each other, it's better to make the memory layout compact according to parallel access than compact for sequential access.
So SOA over AOS.
(At least that's the conclusion I've drawn from my results.)

But I'm talking more about compute and generating the data than about shading and just using the data.
There may be a difference; spend some time to try it all :)

Never tried "Buffer Textures", but looking at the link, one element can only contain up to 4 x 32bits ? My testcode is rather simplistic here, but the later version will have multiple arrays and bigger structs. Trying to make an octree-alike system with probes.

My GL is rusty, but HLSL has syntax for doing typed loads as with textures (1 to 4 channels, big enum of formats), raw 32-bit loads with your own typecasting, or syntactic sugar where you define a struct. I would be disappointed if GL didn't have equivalents.
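For GLSL the rough equivalents would be along these lines (a hedged sketch from memory; identifiers are placeholders and the Probe struct is the one defined above):

	// typed, read-only loads like a texture fetch (buffer texture)
	uniform samplerBuffer probeTexels;	// format is set on the CPU side via glTexBuffer

	// raw 32-bit words, reinterpreted manually
	layout (std430, binding = 3) buffer RawWords { uint words[]; };

	// "structured" loads via a struct member, as in the snippets above
	layout (std430, binding = 4) buffer StructuredProbes { Probe structured[]; };

	vec4 LoadExamples(int index)
	{
		vec4  typed = texelFetch(probeTexels, index);		// typed load
		float raw   = uintBitsToFloat(words[index]);		// raw load + typecast
		vec4  props = structured[index].properties;		// structured load
		return typed + props * raw;
	}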

>> That brings me to another question. Does it make any (performance) difference when dealing with either small or large structs?

The amount of data fetched by a shader is one of the most important performance factors. Memory latency is something like 1k cycles, so waiting on a fetch is slow. GPUs try to have a lot of threads in flight to hide latency, but YMMV.
As for struct vs manual loads, that's down to how good the compiler is. The raw GPU ASM code will likely have instructions for doing 32/64/96/128 bit loads from 4-byte aligned memory addresses. All your buffer loading code will compile down to these instructions in every case.
I'm wondering what you're trying to do here. Am I right that it's something like this:

* For each pixel of a deferred frame
* Start by getting a node index from a uniform grid
* Traverse an octree from that to find the best-fitting cell (probe) for that pixel (or interpolate multiple cells to keep things smooth)
* Light it accordingly?

Is this how Quantum Break worked and is it really possible to do it fast enough?
There is the problem of divergent code flow and data access during traversal - it seems very inefficient to me.
How does this justify against using lightmaps, where all this can be preprocessed at least for static geometry?

Albeit very skeptical, I'm interested because I considered something similar to apply my GI data to game models.
But I believe there are better options:
1. Lightmaps (basically a LUT to address the proper lobe(s) directly)
2. Render the lobes to a screen-space sized buffer, read from that while shading (needs to store multiple lobes per pixel to handle edge cases, so already worse)
3. Render the scene to a G-buffer first, then 'render' lobes and accumulate pixels (same idea, just deferred - sounds a bit better)
4. While deferred shading, traverse the lobe tree to find the proper fit (essentially what we're talking about - slowest method, probably even slower than updating the lobe tree in realtime for me)

You're pretty much right. What I'm trying to do is nicely described in this paper:

https://mediatech.aalto.fi/~ari/Publications/SIGGRAPH_2015_Remedy_Notes.pdf

So, the world is broken down into a tree structure. Each cell is connected to 1 probe, and eventually divides further into 4x4x4 (64) subcells. The advantage is that you don't need tens of thousands of probes in a uniform 3D grid. The disadvantage is, well, you need to traverse the tree first before you know which probe to pick for any given pixel position.

The traversal in my case jumps deeper up to 3 times. So the first fetch will be a large cell. A cell has an offset and a bitmask (int64), where each bit tells whether there is a deeper cell or not. Using this offset and the number of bits counted, we know where to access the next cell.

If no deeper cell was found, the same counting mechanism tells where to fetch the actual probe data. The probe in my case is basically a cubemap with 1x1 faces. Plus it stores a few more details, like which specular probe to use, or stuff like fog thickness. All in all, big data (50+ MB in my case).
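If I read that correctly, one descent step boils down to a bit-count trick roughly like this (only a sketch; the helper and the exact offset conventions are made up here, using the AmbiProbeCell layout from above):

	// One descent step: 'child' is the 0..63 sub-cell the shaded position falls into
	// at the current level. Returns true if there is a deeper cell to jump to.
	bool Descend(inout int cellIndex, out int probeIndex, uint child)
	{
		probeIndex = -1;
		AmbiProbeCell cell = cells[cellIndex];
		uvec2 mask = uvec2(cell.childMask);	// 64-bit mask split over two 32-bit halves
		bool hasChild;
		int  rank;				// number of set bits before 'child'
		if (child < 32u) {
			hasChild = (mask.x & (1u << child)) != 0u;
			rank     = bitCount(mask.x & ((1u << child) - 1u));
		} else {
			hasChild = (mask.y & (1u << (child - 32u))) != 0u;
			rank     = bitCount(mask.x) + bitCount(mask.y & ((1u << (child - 32u)) - 1u));
		}
		if (hasChild)
			cellIndex = cell.childOffset.x + rank;	// jump to the deeper cell
		else
			probeIndex = cell.childOffset.y + rank;	// leaf: index into probes[] (assumed convention)
		return hasChild;
	}

At most three of those steps per pixel, then one probe fetch.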

Currently I use "traditional" lightmaps, but I'm having several problems: UV mapping issues in some cases, though those will most likely be gone if the lightmaps simply refer to a probe (your 1st solution). Still, it doesn't work too well for dynamic objects / particles / translucent stuff (glass).

Splatting the probes (I think that is your option 3) onto screen-space G-buffers (depth/normal/position) is probably much easier. Like deferred lighting, each probe would render a cube (sized according to the tree + some overlap with neighbours to get interpolation) and apply its light data to whatever geometry it intersects.

Downside might be the large number of cubes overlapping each other, giving potential fill-rate issues. Plus particles and such require a slightly different approach. There is also a chance of light leaking (splatting probes from neighbouring rooms), though I think we can mask that with some "room ID" number or something.

What I did in the past is simply making a uniform 3D grid - thus LOTS of probes EVERYWHERE. I injected the probes surrounding the camera into a 32x32x32 3D texture. Simple & fast, but no GI, popping for distant stuff, and a lot of probes (+ baking time) wasted on empty space. Also sensitive to light leaks in some cases.

Hmmm, branching factor of 64 and max 3 levels - makes sense. I think this makes the divergence problem quite acceptable.
I agree this seems a better option than splatting.

Thanks for your explanation, it makes me think.

I hope to see a blog post about your results then... :)

Promised. IF I can make it work, that is :D

Then again, you guys made me think again as well. The problem is always how to get the right probe(s) somehow. I was just thinking maybe probes can inject their ID (array index) into a 3D texture. Thus:

* Make a volume texture.

*** Since it only has to store a single int this time, it can be a relatively big texture. For example, 256 x 128 (less height needed) x 256 x R32 = 32 MB only

* Volume texture follows your camera

* Render all probes as Points into the Volume texture:

*** Biggest probes (the larger cells) first. These would inject multiple points (using geometry shader)

*** Smaller probes overwrite the IDs of the bigger cells they are inside.

*** Leak reduction: foreground room probes will overwrite rooms further away. Doesn't always work, but often it should.

* Anything that renders will fetch the probe ID by simply using its (world - camera offset) position.

* Use the ID to directly fetch from the probe array

The probes may still be an SSBO (sorry, the topic drifted off as usual hehe). It could be done with textures as well, but I find the idea of having 12 or 13 textures messy - not sure if it matters performance-wise... Of course, the ID-injection step also takes time, but I know from experience it's pretty cheap. And from there on, anything (particles, glass, volumetric fog raymarching) can figure out its probe relatively easily.
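On the shading side that lookup would be roughly this (just a sketch; probeVolume, volumeOrigin and volumeInvSize are placeholder names for however the camera-following volume ends up positioned):

	uniform isampler3D probeVolume;	// R32I volume holding probe array indices (nearest filtering)
	uniform vec3 volumeOrigin;	// world-space corner of the camera-following volume
	uniform vec3 volumeInvSize;	// 1.0 / world-space extent of the volume

	vec4 FetchStaticLight(vec3 worldPos, uint face)
	{
		vec3 uvw   = (worldPos - volumeOrigin) * volumeInvSize;
		int  index = texture(probeVolume, uvw).r;	// probe ID injected by the point-rendering pass
		return probes[index].staticLight[face];		// same probe SSBO as before
	}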

But I'm 100% sure I forgot about a few big BUTs here :)

If all is static, couldn't you precompute the grid for the whole scene, compress and stream it?

