Map Buffer Range Super Slow?


I am testing out some different ways to write data into a VBO, and I'm a bit confused.
I started comparing glBufferSubData against glMapBufferRange with the UNSYNCHRONIZED flag, and I'm finding that glBufferSubData does significantly better: it pulls in ~950 FPS compared to the ~150 FPS I get from glMapBufferRange with the UNSYNCHRONIZED flag.
 
I am not doing anything special here. My test is 10,000 untextured 32x32 quads (colored red), with positions randomized once at init time and spread out over an 800x600 window.

 

So I'm wondering what gives; I would have thought the numbers would be the other way around. I thought the UNSYNCHRONIZED flag was supposed to tell the driver not to block, whereas glBufferSubData does cause a block?
 
 
glBufferSubData

void SpriteBatcher::Render(Matrix4 &projection)
{
    glUseProgram(shaderProgram.programID);
    glUniformMatrix4fv(shaderProgram.uniforms[0].location, 1, GL_FALSE, projection.data);

    // 12 floats per quad: 4 corners x 3 floats (x, y, z).
    // verts is an ordinary CPU-side array, so reading it back here is cheap.
    for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
    {
        verts[i]      = pos[i];              // corner at (x, y)
        verts[i + 1]  = pos[i + 1];
        verts[i + 2]  = 0.0f;

        verts[i + 3]  = verts[i];            // corner at (x, y + 32)
        verts[i + 4]  = verts[i + 1] + 32.0f;
        verts[i + 5]  = 0.0f;

        verts[i + 6]  = verts[i] + 32.0f;    // corner at (x + 32, y + 32)
        verts[i + 7]  = verts[i + 1] + 32.0f;
        verts[i + 8]  = 0.0f;

        verts[i + 9]  = verts[i] + 32.0f;    // corner at (x + 32, y)
        verts[i + 10] = verts[i + 1];
        verts[i + 11] = 0.0f;
    }

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, verts);

    glBindVertexArray(vao);
    glDrawElements(GL_TRIANGLES, MAX_SPRITE_BATCH_SIZE * INDICES_COUNT_PER_QUAD, GL_UNSIGNED_SHORT, (const void*)0);
}

glMapBufferRange with the UNSYNCHRONIZED flag

void SpriteBatcher::Render(Matrix4 &projection)
{
    glUseProgram(shaderProgram.programID);
    glUniformMatrix4fv(shaderProgram.uniforms[0].location, 1, GL_FALSE, projection.data);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);

    if (pointer == NULL)
        throw std::runtime_error("Null pointer on map");

    for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
    {
        pointer[i]      = pos[i];
        pointer[i + 1]  = pos[i + 1];
        pointer[i + 2]  = 0.0f;

        // Note: the right-hand sides below read back from the mapped pointer;
        // this turns out to matter (see the follow-up posts).
        pointer[i + 3]  = pointer[i];
        pointer[i + 4]  = pointer[i + 1] + 32.0f;
        pointer[i + 5]  = 0.0f;

        pointer[i + 6]  = pointer[i] + 32.0f;
        pointer[i + 7]  = pointer[i + 1] + 32.0f;
        pointer[i + 8]  = 0.0f;

        pointer[i + 9]  = pointer[i] + 32.0f;
        pointer[i + 10] = pointer[i + 1];
        pointer[i + 11] = 0.0f;
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);

    glBindVertexArray(vao);
    glDrawElements(GL_TRIANGLES, MAX_SPRITE_BATCH_SIZE * INDICES_COUNT_PER_QUAD, GL_UNSIGNED_SHORT, (const void*)0);
}

You might also want to set the GL_MAP_INVALIDATE_RANGE_BIT flag, indicating that you'll overwrite the whole range; this hints to GL not to copy the old data before returning from the map call.
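
For reference, a minimal sketch of that flag combination (reusing the buffer and size names from the code above):

GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0,
    MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD,
    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT); // promise to overwrite every byte of the range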


You might also want to set the GL_MAP_INVALIDATE_RANGE_BIT flag, indicating that you'll overwrite the whole range; this hints to GL not to copy the old data before returning from the map call.

 
I went and tested the INVALIDATE_BUFFER and INVALIDATE_RANGE flags, and both actually drop the FPS to ~100.
I also tried orphaning the buffer, but there was no improvement:

// Orphan code snippet
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, NULL, GL_DYNAMIC_DRAW);
GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
if (pointer == NULL)
    throw std::runtime_error("Null pointer on map");

According to the NVIDIA video on AZDO (Approaching Zero Driver Overhead), mapping a buffer is slow even with the unsynchronized bit.
I'm not sure why; I think it's tied to some multithreading synchronization happening inside the driver.


I found the problem!
 
The problem lies here:

for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
{
    pointer[i]      = pos[i];
    pointer[i + 1]  = pos[i + 1];
    pointer[i + 2]  = 0.0f;

    pointer[i + 3]  = pointer[i];
    pointer[i + 4]  = pointer[i + 1] + 32.0f;
    pointer[i + 5]  = 0.0f;

    pointer[i + 6]  = pointer[i] + 32.0f;
    pointer[i + 7]  = pointer[i + 1] + 32.0f;
    pointer[i + 8]  = 0.0f;

    pointer[i + 9]  = pointer[i] + 32.0f;
    pointer[i + 10] = pointer[i + 1];
    pointer[i + 11] = 0.0f;
}

Turns out lines like this one: pointer[i + 9] = pointer[i] + 32.0f;
count as reading the buffer!

 

So after changing all the lines to read only from the pos array, like the first two lines do, my FPS shot up to ~1050!
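
For completeness, here is the corrected loop as described: it writes only through the mapped pointer and reads only from the CPU-side pos array.

for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
{
    // All reads come from pos (ordinary cached CPU memory);
    // the mapped pointer is only ever written.
    pointer[i]      = pos[i];
    pointer[i + 1]  = pos[i + 1];
    pointer[i + 2]  = 0.0f;

    pointer[i + 3]  = pos[i];
    pointer[i + 4]  = pos[i + 1] + 32.0f;
    pointer[i + 5]  = 0.0f;

    pointer[i + 6]  = pos[i] + 32.0f;
    pointer[i + 7]  = pos[i + 1] + 32.0f;
    pointer[i + 8]  = 0.0f;

    pointer[i + 9]  = pos[i] + 32.0f;
    pointer[i + 10] = pos[i + 1];
    pointer[i + 11] = 0.0f;
}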




Turns out lines like this one: pointer[i + 9] = pointer[i] + 32.0f;
count as reading the buffer!
Yep! I didn't even look at that code before... but this is really bad for performance.

 

The best case for a map call is that the pointer returned by glMap* is an actual pointer into GPU RAM, and that the OS has marked those pages of addresses as uncached and write-combined.

 

When you write to those addresses, the CPU doesn't bother also writing the data to its L1/L2/L3 caches, because it knows you're not going to read it again. It also doesn't write to RAM immediately; it buffers up each small write in a write-combining buffer, and when enough data has accumulated, it flushes the whole buffer through to GPU RAM in a single bulk transfer.

 

All this is wonderful... until you try to read from the buffer!

At that point, the CPU has to stall, prematurely flush the write-combining buffer, wait for that transfer to actually reach GPU RAM, and then issue a read request to copy data from GPU RAM back to the CPU. Furthermore, when reading from normal cached RAM, you transfer a bulk amount (e.g. 64 bytes) and store it in the cache, since you'll likely want to read nearby data soon too, then move the requested amount (e.g. 4 bytes) from the cache into a CPU register; in the best case, that means a single RAM transaction to read 16 floats. Here, though, because the memory is uncached, every single read results in its own RAM transaction. Reading 16 floats means waiting on 16 "cache misses".
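
One common way to stay write-only, sketched here with a hypothetical scratch array (not from the original code): build the vertex data in ordinary cached CPU memory, where read-backs are cheap, then copy it into the mapped pointer in one streaming pass.

// Hypothetical scratch buffer in normal cached memory, e.g. a class member.
// (memcpy needs <cstring>.)
static GLfloat scratch[12 * MAX_SPRITE_BATCH_SIZE];
for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
{
    scratch[i]     = pos[i];
    scratch[i + 1] = pos[i + 1];
    scratch[i + 2] = 0.0f;
    // ... remaining corners may read scratch[i] freely; it's cached memory ...
}
memcpy(pointer, scratch, sizeof(scratch)); // one pass over the write-combined mapping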


Turns out lines like this one: pointer[i + 9] = pointer[i] + 32.0f;
count as reading the buffer!
[...]
So after changing all the lines to read only from the pos array, my FPS shot up to ~1050!

Out of curiosity, do you see any difference now between only UNSYNCHRONIZED, and UNSYNCHRONIZED with INVALIDATE?


Out of curiosity, do you see any difference now between only UNSYNCHRONIZED, and UNSYNCHRONIZED with INVALIDATE?


I looked into it. Using the flag combo:
GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT

I'm pulling ~1050 FPS.

When using the flag combo:
GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT or

GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT

There is a slight drop, bringing me down to ~1040 FPS

Now when doing orphaning using glBufferData:

glBufferData(GL_ARRAY_BUFFER, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, NULL, GL_DYNAMIC_DRAW);	
GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);

I get roughly the same FPS as with the GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT combo, but it fluctuates for a while before settling around ~1038 to ~1040 FPS.

Also, regarding orphaning (glBufferData with NULL) versus the GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT flag combo: are these considered to do the same thing? In the past I had graphical artifacts when using only the GL_MAP_INVALIDATE_BUFFER_BIT flag, but cleared them up by orphaning, so I'm not sure whether I was doing it correctly.


Also, regarding orphaning (glBufferData with NULL) versus the GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT flag combo: are these considered to do the same thing? In the past I had graphical artifacts when using only the GL_MAP_INVALIDATE_BUFFER_BIT flag, but cleared them up by orphaning, so I'm not sure whether I was doing it correctly.

No, they're completely different things.

Orphaning allocates an entirely new buffer for you to write new data into, and any new draw commands will reference this new buffer.

Any existing draw commands which have already been submitted (but haven't yet been consumed by the GPU) will still use the old allocation, with the old data, so there's no chance of graphical corruption. After all those commands are executed by the GPU, the driver will garbage-collect this orphaned allocation.

 

Unsynchronized mapping just gives you a pointer to the existing allocation for that buffer, with zero synchronization or safety. You're making a sacred promise to the driver that you will not overwrite any part of the data that could potentially be used by existing draw commands. You need to implement your own ring-buffer or similar allocation strategy, and use GL fences/events to tell when it's safe to overwrite different parts of the buffer.

 

From your timing data, we can guess that the extra GPU memory allocation management involved in orphaning is costing you about 10 µs per buffer per frame... which is pretty good!



Unsynchronized mapping just gives you a pointer to the existing allocation for that buffer, with zero synchronization or safety. You're making a sacred promise to the driver that you will not overwrite any part of the data that could potentially be used by existing draw commands. You need to implement your own ring-buffer or similar allocation strategy, and use GL fences/events to tell when it's safe to overwrite different parts of the buffer.


Is orphaning seen as the fastest way, or the better choice, to handle VBOs when mapping, since the driver will automatically (and I assume as quickly as possible) get a new block of memory, and may even reuse or recycle previously allocated buffers?

Or would it be optimal to have 6-8 VBOs at the ready, with fences, since the fences will only block when they need to (e.g. when the GPU can't process the data fast enough)?


A buffer sized for about three frames' worth of data is generally safe enough. You can map with unsynchronized, append, and when the buffer fills you just begin again at the start without needing to fence or orphan. I'd still feel safer adding a fence, just in case you hit a non-typical frame.


 

Using GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT I'm pulling ~1050 FPS. [...] There is a slight drop, bringing me down to ~1040 FPS.

As a small piece of advice, please don't give timing values in FPS; use time instead. 1050 -> 1040 FPS is just 0.009 milliseconds, while 30 -> 20 FPS would be well above 10 ms. In the context of computer graphics, FPS is pretty meaningless save for a few very specific purposes. It's called frames per second for a reason; otherwise people would just call it frequency.
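
To convert: frame time in milliseconds is 1000 / FPS, so 1050 -> 1040 FPS is 1000/1040 - 1000/1050 ≈ 0.009 ms, while 30 -> 20 FPS is 1000/20 - 1000/30 ≈ 16.7 ms.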


^ Yes! What that guy said! I hate seeing FPS measurements; they totally skew perception of the results.

 

With those particular numbers, I'd just use glBufferSubData and not bother.

 

So I'm wondering what gives; I would have thought the numbers would be the other way around. I thought the UNSYNCHRONIZED flag was supposed to tell the driver not to block, whereas glBufferSubData does cause a block?

 

That isn't necessarily true. glBufferSubData works by passing the data to the driver, which makes a copy of it and then returns from the call. The driver is then free to copy it to the GPU immediately, buffer up a couple of commands first, or use whatever data-updating scheme it prefers.

 

With glMapBuffer, the driver gives you a pointer into the driver's memory space, so no intermediate copy is made; you write directly into the driver's memory.

 




A buffer sized for about three frames' worth of data is generally safe enough. You can map with unsynchronized, append, and when the buffer fills you just begin again at the start without needing to fence or orphan. I'd still feel safer adding a fence, just in case you hit a non-typical frame.


I have tried this before, but had a lot of issues with graphical artifacts; my quads would warp or stretch out in awkward ways.
So I ended up switching to orphaning only when my buffer was full, but from what you describe that might be kind of bad, since I end up orphaning no matter what, even if the data in my buffer hasn't been used in a while.
 
When it comes to fences, how would this work? I understand how the actual fence works (I basically stall if the GPU has not finished with its commands).

But how would I know that my fence is valid? Valid in the sense that I am only waiting on the parts of my buffer that I need to wait on.

Or is that handled automatically, since I am only mapping a range of my buffer and not the entire thing?

When it comes to fences, how would this work? I understand how the actual fence works (I basically stall if the GPU has not finished with its commands).

But how would I know that my fence is valid? Valid in the sense that I am only waiting on the parts of my buffer that I need to wait on.

Or is that handled automatically, since I am only mapping a range of my buffer and not the entire thing?

 

What you're basically doing is using it as a ring buffer. Say you create a buffer big enough to hold 4000 objects. In frame 1 you write 1000 objects into locations 0 to 999. In frame 2 you draw another 1000 objects, but this time you write them into locations 1000 to 1999, adjusting the parameters of your draw calls accordingly. In frame 3 you draw another 1000 from locations 2000 to 2999, and in frame 4 you use 3000 to 3999. When frame 5 comes around, you can say with some confidence that the commands you issued in frame 1 have already been processed by the GPU, so it's now safe to go back and write to locations 0 to 999 again. Use a fence to ensure that it's safe, blocking until the GPU is ready before you do so, if you wish; otherwise orphan; otherwise choose to live a little dangerously.

 

Each frame you're appending to a part of the buffer that you absolutely know is not being used by any previously queued-up draw calls: either the fence ensures it, or orphaning ensures it, or enough frames have passed (GPUs typically buffer up to 3 frames; I added one extra in this example for headroom). Either way, you know it because you haven't issued any draw calls from that part of the buffer yet, or because the GPU has already finished the calls you previously issued from it. That part is perfectly safe. What's important is that the fence (or orphaning) isn't necessary every frame; it's only needed in the frame where you wrap back to the start of the buffer and begin again. A sketch of the scheme follows.
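
A minimal sketch of that scheme, with made-up names (kSections, kSectionBytes, fences) purely for illustration:

// One VBO divided into kSections ranges; each range is reused only after
// the GPU has finished the frame that last sourced vertices from it.
static const int        kSections     = 4;
static const GLsizeiptr kSectionBytes = MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD;
GLsync fences[kSections] = {};
int    section           = 0;

// Start of frame: reclaim this frame's section, then map it unsynchronized.
if (fences[section])
{
    // Only blocks if the GPU is more than kSections - 1 frames behind.
    glClientWaitSync(fences[section], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1));
    glDeleteSync(fences[section]);
    fences[section] = 0;
}
glBindBuffer(GL_ARRAY_BUFFER, vbo);
GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER,
    section * kSectionBytes, kSectionBytes,
    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
// ... fill the section (write-only!), glUnmapBuffer, issue draws from it ...

// End of frame: fence the draws that read this section, then advance.
fences[section] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
section = (section + 1) % kSections;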


A buffer sized for about three frames' worth of data is generally safe enough. You can map with unsynchronized, append, and when the buffer fills you just begin again at the start without needing to fence or orphan. I'd still feel safer adding a fence, just in case you hit a non-typical frame.

Never do this without fences! You can't assume a maximum of two frames of latency unless you use a fence to enforce it.
You're gambling that your users won't be GPU-bottlenecked, and that their driver isn't going to "help" by adding extra command buffering; the wager is graphical corruption against the prize of not having to make a handful of API calls at the end of each frame. That's not a good wager.

The simplest solution is to place a fence at the end of every frame and have the CPU block on the previous frame's fence (max 1 frame of latency) or the frame before that (max 2 frames of latency).

You can then safely size your buffers to max_size_required_per_frame * (max_latency + 1), and otherwise safely make assumptions about maximum latency. A sketch is below.
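
A sketch of that per-frame fencing for a maximum of 2 frames of latency, again with hypothetical names:

GLsync frameFence[3] = {}; // one slot per potentially in-flight frame
int    frameIndex    = 0;

// Call at the end of every frame.
void EndOfFrame()
{
    frameFence[frameIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    frameIndex = (frameIndex + 1) % 3;

    // The slot we just advanced to holds the fence from two frames ago;
    // blocking on it caps GPU latency at 2 frames.
    if (frameFence[frameIndex])
    {
        glClientWaitSync(frameFence[frameIndex], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1));
        glDeleteSync(frameFence[frameIndex]);
        frameFence[frameIndex] = 0;
    }
}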

BTW, many games have done this for a long time, just to put a limit on buffered GPU frames, since more buffering means more input lag. FPS or "twitchy" games often limit themselves to one frame of latency, which has the bonus of reducing the memory requirements of your ring buffers.
