spek

Big array in GLSL causes OUT_OF_MEMORY


Any idea why the code below gives me "GL_OUT_OF_MEMORY" or "internal malloc failed" errors? I can't believe I'm actually running out of memory.

For the first time I'm using SSBOs to get relatively large amounts of data (~50..100 MB) onto the video card. Making the buffer & setting its size with glBufferData doesn't seem to give any problems. But loading/compiling a shader with this code kills it:

	struct TestProbe {
		vec4	staticLight[6];
	}; // 96 bytes

	layout (std430, binding = 1) buffer ProbeSystem {
		TestProbe	testProbes[262144];
	}; // 96 x 262144 = 24 MB

Making the array smaller eventually cures it, but why would this be an issue? I also tried a smaller 2D array ("testProbes[64][4096]"), but no luck there either. My guess is that I forgot something - the GPU trying to reserve this memory in the wrong (uniform?) area or something... Or maybe this just can't be done in a fragment shader, and I need a compute shader instead?


Can you use testProbes[]; instead of testProbes[262144]; ?

Also, are you planning on writing into the buffer from the shader? SSBOs are designed for cases where shaders will read and write the buffer. For the read-only case, you can use a buffer texture.
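For clarity, the unsized form looks like this (a sketch - the element count is then implied by the size of the buffer you bind with glBufferData):

	struct TestProbe {
		vec4	staticLight[6];
	}; // 96 bytes

	layout (std430, binding = 1) buffer ProbeSystem {
		TestProbe	testProbes[]; // must be the last member of the block
	};

	// The shader can still query the count if needed:
	// int count = testProbes.length();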

Edited by Hodgman


>> testProbes[]

And... it's gone :) The error message, I mean. Not sure if things actually work - my code isn't far enough along to really test - but at least the "out of memory" errors are gone.

 

>> Read & Write?

Wasn't planning to write via a shader. At least not yet. Never tried "Buffer Textures", but looking at the link, one element can only contain up to 4 x 32 bits? My test code is rather simplistic here, but the later version will have multiple arrays and bigger structs. Trying to make an octree-like system with probes.

 

That brings me to another question. Does it make any (performance) difference when dealing with either small or large structs? Because the actual buffer content will be something like this:

struct TreeCell {
	ivec2	childMask;   // 4x4x4 sub-cells = 64 bits, one per child
	ivec2	childOffset;
}; // 16 bytes

struct Probe {
	vec4	properties;
	vec4	staticLight[6];
	vec4	skyLight[6];
}; // 13 x vec4 = 208 bytes


layout (std430, binding = 1) buffer ProbeArray {
	Probe	probes[];
};

layout (std430, binding = 2) buffer ProbeTree {
	TreeCell	cells[];
};

The idea is that I query through the TreeCells first (thus jumping from one small struct to another). Leaf cells then contain an index into the other "Probes" array, which holds the much larger structs. So basically I have one relatively small array and one large one. I could also join them both into a single (even bigger) array. But as you see, while traversing the tree I'll be jumping from one element to another. Does (struct) size matter here?

Personally I moved away from using any kind of structs on the GPU pretty early, when initial tests showed performance differences of 1000% (!).

I suggest you try some variations to see what's best for you; it depends on memory access patterns, which depend on the use case.
My experience tells me that SoA like this may be a lot better:

layout (std430, binding = 1) buffer Properties  { vec4 properties[];  }; // MAX_PROBES elements
layout (std430, binding = 2) buffer StaticLight { vec4 staticLight[]; }; // 6 * MAX_PROBES elements
layout (std430, binding = 3) buffer SkyLight    { vec4 skyLight[];    }; // 6 * MAX_PROBES elements
// (Note: you can pack all three into one big SSBO and use your own indexing, which may save some registers.)
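The packed variant from that note would use manual offsets, e.g. (a sketch; MAX_PROBES and the region order are assumptions, and whether each region is probe-major or face-major inside is yet another AoS/SoA choice to profile):

layout (std430, binding = 1) buffer PackedProbeData {
	vec4 data[]; // [properties | staticLight | skyLight], back to back
};

vec4 getProperties(int probe)            { return data[probe]; }
vec4 getStaticLight(int probe, int face) { return data[MAX_PROBES + probe * 6 + face]; }
vec4 getSkyLight(int probe, int face)    { return data[MAX_PROBES * 7 + probe * 6 + face]; }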


The advantage may be that if we have a workgroup processing parallel code like this:

vec4 myLight = GetLight(staticLight[threadID], normal);
myLight += GetLight(skyLight[threadID], normal) * (1-myLight.w);

First ALL threads access staticLight, then ALL threads access skyLight.
So even though those instructions follow each other, it's better to make the memory layout compact for parallel access than compact for sequential access.
So SoA over AoS.
(At least that's the conclusion I've drawn from my results.)

But I'm talking more about compute and generating the data than about shading and just using the data.
There may be a difference - spend some time trying it all :)

Edited by JoeJ


>> Never tried "Buffer Textures", but looking at the link, one element can only contain up to 4 x 32 bits? My test code is rather simplistic here, but the later version will have multiple arrays and bigger structs. Trying to make an octree-like system with probes.

My GL is rusty, but HLSL has syntax for doing typed loads as with textures (1 to 4 channels, a big enum of formats), raw 32-bit loads with your own typecasting, or syntactic sugar where you define a struct. I would be disappointed if GL didn't have equivalents.
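For reference, GL's rough equivalents look like this (a sketch; the names are made up):

// Typed loads (like HLSL's typed buffers): a buffer texture, fetched per
// element with texelFetch. The format is chosen on the GL side (e.g. GL_RGBA32F).
uniform samplerBuffer probeColors;
vec4 typedLoad(int i) { return texelFetch(probeColors, i); }

// Raw 32-bit loads with manual typecasting (like ByteAddressBuffer):
layout (std430, binding = 3) buffer RawWords { uint words[]; };
float rawLoadFloat(int wordIndex) { return uintBitsToFloat(words[wordIndex]); }

// ...and the struct-in-an-SSBO form shown earlier covers the syntactic-sugar case.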
 

>> That brings me to another question. Does it make any (performance) difference when dealing with either small or large structs?

The amount of data fetched by a shader is one of the most important performance factors. Memory latency is something like 1k cycles, so waiting on a fetch is slow. GPUs try to keep a lot of threads in flight to hide that latency, but YMMV.
As for structs vs. manual loads, that's down to how good the compiler is. The raw GPU ASM will likely have instructions for doing 32/64/96/128-bit loads from 4-byte-aligned memory addresses. All your buffer-loading code will compile down to these instructions in every case.

I'm wondering what you're trying to do here.
Am I right that it's something like this:

For each pixel of a deferred frame:
* start by getting a node index from a uniform grid,
* traverse an octree from there to find the best-fitting cell (probe) for that pixel (or interpolate multiple cells to keep things smooth),
* light it accordingly?

Is this how Quantum Break worked, and is it really possible to do it fast enough?
There is the problem of divergent code flow and data access during traversal - it seems very inefficient to me.
How does this justify itself against lightmaps, where all of this can be preprocessed, at least for static geometry?

Albeit very skeptical, I'm interested, because I considered something similar to apply my GI data to game models.
But I believe there are better options:
1. Lightmaps (basically a LUT to address the proper lobe(s) directly).
2. Render the lobes to a screen-space-sized buffer and read from that while shading (needs to store multiple lobes per pixel to handle edge cases, so already worse).
3. Render the scene to a G-buffer first, then 'render' the lobes and accumulate per pixel (same idea, just deferred - sounds a bit better).
4. While deferred shading, traverse the lobe tree to find the proper fit (eventually what we are talking about - the slowest method, probably even slower than updating the lobe tree in realtime for me).


You're pretty much right. What I'm trying to do is nicely described in this paper:

https://mediatech.aalto.fi/~ari/Publications/SIGGRAPH_2015_Remedy_Notes.pdf

So, the world is broken down into a tree structure. Each cell is connected to one probe and eventually divides further into 4x4x4 (64) subcells. The advantage is that you don't need tens of thousands of probes in a uniform 3D grid. The disadvantage is, well, that you need to traverse the tree before you know which probe to pick for any given pixel position.

 

The traversal in my case jumps deeper up to 3 times. The first fetch will be a large cell. A cell has an offset and a bitmask (int64), where each bit tells whether a deeper cell exists or not. Using this offset and the number of set bits counted so far, we know where to access the next cell.

If no deeper cell is found, the same counting mechanism tells where to fetch the actual probe data. The probe in my case is basically a cubemap with 1x1 faces. Plus it carries a few more details, like which specular probe to use, or things like fog thickness. All in all, big data (50+ MB in my case). The counting step is sketched below.
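In GLSL the counting step could look roughly like this (a sketch of what I mean; childIndex, 0..63, would come from the position within the cell):

// Count the set bits of the 64-bit child mask (stored as two uints)
// that sit below bit 'childIndex'. offset + that count gives the index
// of the child cell (or probe) in the next array.
int bitsBelow(uvec2 mask, int childIndex) {
	if (childIndex < 32)
		return bitCount(mask.x & ((1u << childIndex) - 1u));
	return bitCount(mask.x) + bitCount(mask.y & ((1u << (childIndex - 32)) - 1u));
}

bool hasChild(uvec2 mask, int childIndex) {
	return ((((childIndex < 32) ? mask.x : mask.y) >> (childIndex & 31)) & 1u) != 0u;
}

// Using the TreeCell struct from earlier:
// int next = cells[current].childOffset.x + bitsBelow(uvec2(cells[current].childMask), childIndex);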

 

Currently I use "traditional" lightmaps, but I'm having several problems. UV-mapping issues in some cases, though those will most likely be gone if the lightmaps simply refer further to a probe (your first option). Still, it doesn't work too well for dynamic objects / particles / translucent stuff (glass).

Splatting the probes (I think that's your option 3) onto screen-space G-buffers (depth/normal/position) is probably much easier. As with deferred lighting, each probe would render a cube (sized according to the tree, plus some overlap with its neighbours to get interpolation) and apply its light data to whatever geometry it intersects.

The downside might be the large number of cubes overlapping each other, giving potential fill-rate issues. Plus, particles and such require a slightly different approach. There is also a chance of light leaking (splatting probes from neighbouring rooms), though I think we can mask that with some "room ID" number or something.

 

What I did in the past was simply make a uniform 3D grid - thus LOTS of probes EVERYWHERE. I injected the probes surrounding the camera into a 32x32x32 3D texture. Simple & fast, but no GI, popping for distant stuff, and a lot of probes (+ baking time) wasted on empty space. Also sensitive to light leaks in some cases.

Hmmm, a branching factor of 64 and max 3 levels - makes sense. I think this makes the divergence problem quite acceptable.
I agree this seems a better option than splatting.

Thanks for your explanation, it makes me think.

I hope to see a blog post about your results then... :)


Promised. IF I can make it work, that is :D

Then again, you guys made me think again as well. The problem is always how to get to the right probe(s) somehow. I was just thinking: maybe the probes can inject their ID (array index) into a 3D texture. Thus:

* Make a volume texture.
  * Since it only has to store a single int this time, it can be a relatively big texture. For example, 256 x 128 (we need less height here) x 256 x R32 = 32 MB only.
* The volume texture follows the camera.
* Render all probes as points into the volume texture:
  * Biggest probes (the larger cells) first. These would inject multiple points (using a geometry shader).
  * Smaller probes overwrite the IDs of the bigger cells they sit inside.
  * Leak reduction: foreground room probes overwrite rooms further away. Doesn't always work, but often it should.
* Anything that renders fetches the probe ID by simply using its (world - camera offset) position.
* Use the ID to directly fetch from the probe array (rough sketch below).
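The last two steps could look as simple as this (a sketch; the volume parameters are hypothetical, and Probe is the struct from earlier):

uniform isampler3D probeIdVolume; // R32I volume that follows the camera
uniform vec3 volumeOrigin;        // world-space corner of the volume
uniform vec3 volumeInvExtent;     // 1.0 / world-space size of the volume

layout (std430, binding = 1) buffer ProbeArray { Probe probes[]; };

Probe fetchProbe(vec3 worldPos) {
	vec3 uvw = (worldPos - volumeOrigin) * volumeInvExtent;
	ivec3 texel = ivec3(uvw * vec3(textureSize(probeIdVolume, 0)));
	int id = texelFetch(probeIdVolume, texel, 0).r;
	return probes[id];
}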

 

The probes may still be in an SSBO (sorry, the topic drifted off as usual, hehe). It could be done with textures as well, but I find the idea of having 12 or 13 textures messy - not sure if it matters performance-wise... Of course, the ID-injection step also takes time, but I know from experience it's pretty cheap. And from there on, anything (particles, glass, volumetric fog raymarching) can figure out its probe relatively easily.

But I'm 100% sure I forgot about a few big BUTs here :)


The problem in my case is that there is no "whole scene". The world is divided into smaller sectors (mainly rooms and corridors in my case), which are loaded on the fly when nearby enough. That certainly doesn't make this story easier, because the tree itself is made of multiple sub-trees, and the probe array is also filled dynamically. If a sector gets unloaded, it releases a slot (X probes/cells), which can then be claimed by another sector that gets loaded.

But I don't see a problem with that:
you could use a fixed number for the maximum visible lobes (e.g. 0x10000) and preallocate enough memory once.

Then you use a ring buffer that contains indices to the available lobes.
Initially all are available, so we initialize the ring buffer with all of them:

for (int i = 0; i < 0x10000; i++) ringbuffer[i] = i;

We then use a head index into the ring buffer that points to the next available lobe:
int head = 0;
and a tail index that points to the first used lobe:
int tail = 0x10000;

Then at runtime, as the camera moves around, we constantly request, free or keep nodes.
To request a node, simply read ringbuffer[head] and increase head;
to release a node, write the freed index to ringbuffer[tail] and increase tail.
(Always binary-AND those indices with 0xFFFF to keep the ring a ring and avoid out-of-range access.)
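On the GPU the request/release can be done with atomics. A minimal sketch (assuming requests and releases stay balanced so the ring never overflows, and using made-up names):

layout (std430, binding = 0) buffer LobeRing {
	uint head;               // next available slot
	uint tail;               // one past the last free slot (starts at 0x10000)
	uint freeSlots[0x10000]; // initialized with 0 .. 0xFFFF
};

uint requestSlot() { // claim a free lobe slot
	uint h = atomicAdd(head, 1u);
	return freeSlots[h & 0xFFFFu];
}

void releaseSlot(uint slot) { // hand a lobe slot back to the ring
	uint t = atomicAdd(tail, 1u);
	freeSlots[t & 0xFFFFu] = slot;
}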


If you already know this simple concept (does it have a name?), think of how it could solve the problem you mentioned.
If not - it really is that simple; just the levels of indirection are a bit confusing, but not too deep (I probably have a bug in my explanation).


So, no matter which parts of your scene you use at any moment - if all those parts have consistent world coordinates, you CAN precompute that single index grid,
and you can manage both (blocks of the index grid and the lobes) with the mechanism I explained.
The mechanism is efficient on the GPU.
The downside is that it places things at random memory locations, but '3D' means spatial order is never perfect in any case.


Well, first, as for the original question(s): the SSBO works, and the querying works - thank you guys! Got some ambient light baked into a tree & probe array, as described in the first posts. That was the good news :) The bad news is that (indeed) it's terribly slow. FPS crumbled from ~50 to ~20. And I'm not even taking multiple probes for interpolation yet :(

Now, my laptop isn't a graphical powerhouse, and obviously my methods are most likely less optimized than what the Quantum Break paper does behind the curtains. Maybe they do the ambient pass at a lower resolution as well. Plus I didn't really play with different memory layouts & compressed struct sizes yet.

Traversing the tree jumps through 16-byte (vec4) sized structs, with 3 or 4 jumps. Yet the code that figures out which subcell to access seems a bit complex to me. And reducing my original 388-byte probe to 48 bytes (6 colors per probe) didn't help, at least. Then again, it's still a huge struct. But all in all... not very promising. Or am I doing something terribly wrong? I can paste the GLSL code if you guys are interested.

 

EDIT: Doh. 20 FPS because I was drawing all the probes in debug mode at the same time. Without that, the performance actually isn't that bad. Still not 100% convinced, but it's bedtime now, hehe.

 

@JoeJ

I think I'm missing the part where these precomputed indices are stored... How does surfaceX know it's connected to index 1234? Or, given a certain pixel on screen (knowing its position, normal, and eventually which tree / offset it used, all baked into G-buffers), what tells me this index? A lightmap?

Now, you mentioned "all static" earlier; note that I'd like to use the same data for moving objects / particles as well. And maybe volumetric fog / raymarching, if not too expensive (and that would certainly kill the GPU with the sluggish method I currently use for traversing trees).

Edited by spek


>> I think I'm missing the part where these precomputed indices are stored... How does surfaceX know it's connected to index 1234? Or, given a certain pixel on screen (knowing its position, normal, and eventually which tree / offset it used, all baked into G-buffers), what tells me this index?


The index comes from indexing the int-volume-texture you mentioned with the world position of the fragment.
The returned index points to one of all possible lobes. But because only a subset of lobes is in use,
you need to maintain an indirection table that maps this 'global' index to the actual lobe in memory.

The indirection table is the only thing for which you need the whole scene in memory, and it has the size of the total lobe count.
The lobes and blocks of the index volume texture can be partially in memory (that's what I mean by streaming; the mechanism mentioned earlier is a way to implement it).

At the end you still have a sparse distribution of lobes but a dense distribution of indices covering the entire space, so moving objects / volumetrics can be done as intended.
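Put together, the lookup is two indirections (a sketch; the names are illustrative):

uniform isampler3D indexVolume; // dense: a 'global' lobe index for every cell of space

layout (std430, binding = 4) buffer Indirection {
	int residentSlot[]; // sized to the total lobe count; maps global index -> resident slot
};
layout (std430, binding = 5) buffer Lobes {
	Probe lobes[]; // only the resident subset
};

Probe fetchLobe(ivec3 texel) {
	int globalId = texelFetch(indexVolume, texel, 0).r;
	int slot = residentSlot[globalId]; // handle slot < 0 (not resident) as needed
	return lobes[slot];
}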


You really need to take care of occupancy here, to compensate for the waiting on all those memory reads needed to resolve the indirections.
But I expect low register usage, so this should be possible.
(The same applies to the tree traversal in your current approach. Use tools to figure out occupancy. If you don't, you're flying blindfolded.)



I did some SSBO vs. texture comparisons in the past, but the difference was negligible for me.
Textures are probably faster with spatial ordering and ordered access, but if you don't have that, you can use SSBOs without doubts.
If you were to use bricks of 4x4x4 lobes or something, textures could be a win (e.g. Crassin's initial voxel octree tracing paper).
If you have completely random or perfectly ordered access, both should end up equally fast.

So it should not matter a lot, but AoS vs. SoA should make a big difference.

