Big array in GLSL causes OUT_OF_MEMORY

spek · 2017-04-07T05:52:14

Any idea why the code below gives me "GL_OUT_OF_MEMORY" or "internal malloc failed" errors? I can't believe I'm actually running out of memory. For the first time I'm using SSBOs to get a relatively large amount of data (~50 .. 100 MB) onto the video card. Creating the buffer and setting its size with glBufferData doesn't seem to give any problems. But loading/compiling a shader with this code kills it:

    struct TestProbe {
        vec4 staticLight[6];
    }; // 96 bytes

    layout (std430, binding=1) buffer ProbeSystem {
        TestProbe testProbes[262144];
    }; // 96 x 262144 = 24 MB

Making the array smaller eventually cures it, but why would this be an issue? I also tried a smaller 2D array ("testProbes[64][4096]"), but no luck there either. My guess is that I forgot something, and the GPU is trying to reserve this memory in the wrong (uniform?) area or something... OR, maybe this just can't be done in a fragment shader, and I need a compute shader instead?
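The sizes quoted in the shader comments can be double-checked on the CPU side. The following is only a sketch in C++: `Vec4`, `TestProbeCpu` and `probeBufferBytes` are hypothetical names that do not appear in the post, and it assumes the usual std430 rule that an array of vec4 has a 16-byte stride, so the C++ mirror struct has the same size as the GLSL one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU-side mirror of the GLSL types (names are assumptions).
struct Vec4 { float x, y, z, w; };              // same size as a GLSL vec4
struct TestProbeCpu { Vec4 staticLight[6]; };   // mirrors struct TestProbe

static_assert(sizeof(Vec4) == 16, "vec4 should be 16 bytes");
static_assert(sizeof(TestProbeCpu) == 96, "6 x 16 bytes = 96 bytes per probe");

// Number of bytes a buffer holding probeCount probes needs.
inline std::size_t probeBufferBytes(std::size_t probeCount) {
    assert(probeCount <= SIZE_MAX / sizeof(TestProbeCpu)); // guard against overflow
    return probeCount * sizeof(TestProbeCpu);
}
```

For 262144 probes this gives 25,165,824 bytes, which is the 24 MB from the shader comment and well below the 50 .. 100 MB mentioned above.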


Started by spek April 04, 2017 11:57 PM

12 comments, last by JoeJ 7 years ago

spek

1,244

Author

April 06, 2017 03:43 PM

Problem in my case is that there is no "whole scene". The world is divided in smaller sectors (mainly rooms and corridors in my case), and are loaded on the fly, when nearby enough. Which certainly doesn't make this story easier, because the tree itself is made of multiple sub-trees, and also the probe array is dynamically filled. If a sector gets unloaded, it releases a slot (X probes/cells), which can then be claimed by another sector that will be loaded.

Tower22 Blog

JoeJ

4,181

April 06, 2017 07:31 PM

But i don't see a problem with that,
you could use a fixed number for the maximum visible lobes (e.g. 0x10000) and preallocate enough memory once.

Then you use a ringbuffer that contains indices to available lobes.
Initially all are available, so we initialize the ringbuffer with all of them:

for (int i=0; i<0x10000; i++) ringbuffer[i] = i;

we then use a head index to the ringbuffer to point to the next available lobe:
int head = 0;
and a tail index that points to the first used lobe:
int tail = 0x10000;

Then at runtime, when the camera moves around, we constantly request, free or keep nodes.
To request a node, simply read and increase head,
to release a node, just read and increase tail.
(Always bitwise AND those indices with 0xFFFF to keep the ring a ring and avoid out-of-range access)
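A minimal C++ sketch of this ringbuffer of free lobe indices might look as follows. It is one possible reading of the description, not necessarily the exact scheme meant here: the names `LobeRing`, `kLobeCount`, `kRingMask` and `kNoLobe` are made up for illustration, and it assumes that releasing a lobe writes the freed index back at the tail before increasing it. Head and tail are kept as free-running counters and are only masked when indexing, so `tail - head` is always the number of available lobes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kLobeCount = 0x10000;          // must be a power of two
constexpr uint32_t kRingMask  = kLobeCount - 1;   // 0xFFFF
constexpr uint32_t kNoLobe    = 0xFFFFFFFFu;      // returned when nothing is free

struct LobeRing {
    std::vector<uint32_t> ring;   // indices of available lobes
    uint32_t head = 0;            // next available entry
    uint32_t tail = kLobeCount;   // one past the last available entry

    LobeRing() : ring(kLobeCount) {
        // Initially every lobe is available.
        for (uint32_t i = 0; i < kLobeCount; i++) ring[i] = i;
    }

    // Number of lobes that can still be requested (wrap-around safe).
    uint32_t available() const { return tail - head; }

    // Take the next available lobe index, or kNoLobe if none is left.
    uint32_t request() {
        if (head == tail) return kNoLobe;
        return ring[head++ & kRingMask];
    }

    // Hand a lobe index back so it can be claimed again later.
    void release(uint32_t lobe) {
        assert(lobe < kLobeCount);
        assert(available() < kLobeCount);   // releasing more than was requested
        ring[tail++ & kRingMask] = lobe;
    }
};
```

Released indices come back in whatever order sectors are unloaded, which is why lobes end up at scattered memory locations after a while.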

If you already know about this simple concept (does it have a name?), think of how this could solve the problem you mentioned.
If not - it really is that simple, just the levels of indirection are a bit confusing but not too deep (probably i have a bug in my explanation).

So, no matter what parts of your scene you use at a moment - if all those parts have consistent world coords you CAN precompute that single index grid,
and you can manage both (block of index grid and lobes) with a mechanism i have explained.
The mechanism is efficient on GPU.
Downside is it orders things at random memory locations, but '3D' means spatial order is not perfect in any case.

spek

1,244

Author

April 06, 2017 11:49 PM

Well first, as for the original question(s): SSBO works, and the querying works - thank you guys! Got some ambient baked into a tree & probe array, as described in the first posts. That was the good news :) The bad news is that (indeed) it's terribly slow. FPS crumbled from ~50 to ~20. And I'm not even taking multiple probes for interpolation yet :(

Now my laptop isn't a graphical powerhouse, and obviously my methods are most likely less optimized than what the Quantum Break papers do behind the curtains. Maybe they do the ambient pass at a lower resolution as well. Plus I didn't really play with different memory layouts & compressed struct sizes yet.

Traversing the tree jumps through 16-byte (vec4) sized structs, with 3 or jumps. Yet the code to figure out which subcell to access seems a bit complex to me. And reducing my original 388-byte probe to 48 bytes (6 colors per probe) didn't help at least. Then again it's still a huge struct. But all in all... not very promising. Or am I doing something terribly wrong? I can paste the GLSL code if you guys are interested.

EDIT: Doh. 20 FPS because I was drawing all probes in debug at the same time. Without that the performance actually isn't that bad. Still not 100% convinced, but it's bedtime now hehe.

@JoeJ

I think I'm missing the part where these pre-computed indices are stored... How does surfaceX know it's connected to index 1234? Or, given a certain pixel on screen (knowing its position, normal, and eventually which Tree / Offset it used, all baked into g-Buffers), what tells me this index? A lightmap?

Now you mentioned "all static" earlier, note I'd like to use the same data for moving objects / particles as well. And maybe volumetric fog/raymarching, if not too expensive (and that would certainly kill the GPU with the sluggish method I now tried for traversing trees).

Tower22 Blog

JoeJ

4,181

April 07, 2017 05:52 AM

I think I'm missing the part where these pre-computed indices are stored... How does surfaceX know it's connected to index 1234? Or, given a certain pixel on screen (knowing its position, normal, and eventually which Tree / Offset it used, all baked into g-Buffers), what tells me this index?

The index comes from indexing the int-volume-texture you mentioned with the world position of the fragment.
The returned index points to one of all lobes. But because only a subset of lobes is in use,
you need to maintain an indirection table that maps this 'global' index to the actual lobe in memory.

The indirection table is the only thing where you need to have the whole scene in memory, and it has the size of total lobe count.
Lobes and blocks of index volume texture can be partial in memory (that's what i mean with streaming. The mentioned mechanism is a way to implement it).

At the end you still have a sparse distribution of lobes but a dense distribution of indices covering the entire space,
so moving objects / volumetrics can be done as intended.
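The two-step lookup described above (world position -> index volume -> indirection table -> lobe in memory) could be sketched on the CPU like this. All names (`ProbeLookup`, `slotAt`, `kNotResident`) and the uniform-grid layout are assumptions for illustration; in the actual renderer this would be a shader reading an int volume texture and a buffer.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr uint32_t kNotResident = 0xFFFFFFFFu;   // lobe is not loaded right now

struct ProbeLookup {
    float originX, originY, originZ;    // world-space minimum corner of the grid
    float cellSize;                     // edge length of one grid cell
    int dimX, dimY, dimZ;               // grid resolution
    std::vector<uint32_t> indexGrid;    // dense: one 'global' lobe index per cell
    std::vector<uint32_t> indirection;  // per global lobe: slot in memory or kNotResident

    // Map a world position to the memory slot of its lobe.
    uint32_t slotAt(float wx, float wy, float wz) const {
        int cx = static_cast<int>(std::floor((wx - originX) / cellSize));
        int cy = static_cast<int>(std::floor((wy - originY) / cellSize));
        int cz = static_cast<int>(std::floor((wz - originZ) / cellSize));
        if (cx < 0 || cy < 0 || cz < 0 || cx >= dimX || cy >= dimY || cz >= dimZ)
            return kNotResident;        // outside the precomputed grid

        // First read: dense index grid gives the global lobe index.
        uint32_t global = indexGrid[(cz * dimY + cy) * dimX + cx];
        assert(global < indirection.size());

        // Second read: indirection table gives the slot actually in memory.
        return indirection[global];
    }
};
```

When a sector is streamed in, its entries in `indirection` would be set to the slots it claimed, and reset to `kNotResident` when it is unloaded.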

You really need to care for high occupancy here to compensate the waiting on all those memory reads necessary to resolve indirections.
But i expect low register usage so this should be possible.
(The same applies to the tree traversal you use in your current approach. Use tools to figure out occupancy. If you don't you're blindfolded)

I did some SSBO vs. texture comparisons in the past but the difference was negligible for me.
Probably textures are faster with spatial ordering and ordered access, but if you don't have this you can use SSBOs without doubts.
If you would use bricks of 4x4x4 lobes or something, textures could be a win (e.g. Crassin's initial voxel octree tracing paper).
If you have completely random or perfectly ordered access, both should end up being equally fast.

So it should not matter a lot, but AoS vs SoA should make a big difference.
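To make the AoS vs SoA distinction concrete, here is a small C++ sketch of the two layouts for probes with six directional colors. The type and function names are hypothetical, and which layout is faster depends on the access pattern, so this only shows the difference in memory arrangement, not a measured result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Color { float r, g, b, a; };

// AoS (array of structures): the six colors of one probe are contiguous.
struct ProbeAoS { Color dir[6]; };

// SoA (structure of arrays): one array per direction, indexed by probe.
struct ProbesSoA { std::vector<Color> dir[6]; };

// Rearrange an AoS probe list into the SoA layout.
ProbesSoA toSoA(const std::vector<ProbeAoS>& probes) {
    ProbesSoA out;
    for (int d = 0; d < 6; d++) {
        out.dir[d].resize(probes.size());
        for (std::size_t i = 0; i < probes.size(); i++)
            out.dir[d][i] = probes[i].dir[d];
    }
    return out;
}

// Read the color of probe i for direction d from the SoA layout.
Color sampleSoA(const ProbesSoA& probes, int d, std::size_t i) {
    assert(d >= 0 && d < 6);
    assert(i < probes.dir[d].size());
    return probes.dir[d][i];
}
```

With AoS, fetching one direction of one probe pulls in memory next to the other five directions; with SoA, neighbouring probes' colors for the same direction sit next to each other.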
