Map Buffer Range Super Slow?

Started by
14 comments, last by Hodgman 9 years, 1 month ago

I am testing out some different ways to write data into a VBO and I am sort of confused.
I started comparing glBufferSubData and glMapBufferRange with the GL_MAP_UNSYNCHRONIZED_BIT flag, and I'm finding that glBufferSubData is doing significantly better: glBufferSubData pulls in ~950 FPS, compared to the ~150 FPS that glMapBufferRange with the unsynchronized flag gets.

I am not doing anything special here. My test is 10,000 untextured 32x32 quads (colored red), with positions randomized once at init time so the quads are spread out over an 800x600 window.

So I'm wondering what gives; I would have thought the numbers would be the other way around. I thought the GL_MAP_UNSYNCHRONIZED_BIT flag was supposed to tell the driver not to block, whereas glBufferSubData can cause a block?


glBufferSubData


void SpriteBatcher::Render(Matrix4 &projection)
{
    glUseProgram(shaderProgram.programID);
    glUniformMatrix4fv(shaderProgram.uniforms[0].location, 1, GL_FALSE, projection.data);

    // Build the four (x, y, z) corners of each quad from the prerandomized positions.
    for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
    {
        verts1[i]      = pos[i];
        verts1[i + 1]  = pos[i + 1];
        verts1[i + 2]  = 0.0f;

        verts1[i + 3]  = verts1[i];
        verts1[i + 4]  = verts1[i + 1] + 32.0f;
        verts1[i + 5]  = 0.0f;

        verts1[i + 6]  = verts1[i] + 32.0f;
        verts1[i + 7]  = verts1[i + 1] + 32.0f;
        verts1[i + 8]  = 0.0f;

        verts1[i + 9]  = verts1[i] + 32.0f;
        verts1[i + 10] = verts1[i + 1];
        verts1[i + 11] = 0.0f;
    }

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, verts1);

    glBindVertexArray(vao);
    glDrawElements(GL_TRIANGLES, MAX_SPRITE_BATCH_SIZE * INDICES_COUNT_PER_QUAD, GL_UNSIGNED_SHORT, (const void*)0);
}

glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT


void SpriteBatcher::Render(Matrix4 &projection)
{
	glUseProgram(shaderProgram.programID);
	glUniformMatrix4fv(shaderProgram.uniforms[0].location, 1, GL_FALSE, projection.data);


	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);

	if (pointer == NULL)
		throw std::runtime_error("Null pointer on map");

	for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
	{
		pointer[i] = pos[i];
		pointer[i + 1] = pos[i + 1];
		pointer[i + 2] = 0.0f;

		pointer[i + 3] = pointer[i];
		pointer[i + 4] = pointer[i + 1] + 32.0f;
		pointer[i + 5] = 0.0f;

		pointer[i + 6] = pointer[i] + 32.0f;
		pointer[i + 7] = pointer[i + 1] + 32.0f;
		pointer[i + 8] = 0.0f;

		pointer[i + 9] = pointer[i] + 32.0f;
		pointer[i + 10] = pointer[i + 1];
		pointer[i + 11] = 0.0f;
	}
	glUnmapBuffer(GL_ARRAY_BUFFER);

	glBindVertexArray(vao);
	glDrawElements(GL_TRIANGLES, MAX_SPRITE_BATCH_SIZE * INDICES_COUNT_PER_QUAD, GL_UNSIGNED_SHORT, (const void*)0);
}
You might also want the MAP_INVALIDATE_RANGE_BIT flag set, indicating you'll be overwriting the whole range, hinting to GL to not copy the old data before returning from the map function.

I went and tested out the INVALIDATE_BUFFER and INVALIDATE_RANGE flags, and both actually drop the FPS to ~100.
I also tried orphaning the buffer, but there was no improvement.


//Orphan code snip
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, NULL, GL_DYNAMIC_DRAW);
GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
if (pointer == NULL)
     throw std::runtime_error("Null pointer on map");
According to NVIDIA's AZDO (Approaching Zero Driver Overhead) talk, mapping a buffer is slow even with the unsynchronized bit.
I'm not sure why; I think it's tied to some multithreading synchronization happening inside the driver.

I found the problem!

The problem lies here:


for (int i = 0; i < 12 * MAX_SPRITE_BATCH_SIZE; i += 12)
{
    pointer[i] = pos[i];
    pointer[i + 1] = pos[i + 1];
    pointer[i + 2] = 0.0f;

    pointer[i + 3] = pointer[i];
    pointer[i + 4] = pointer[i + 1] + 32.0f;
    pointer[i + 5] = 0.0f;

    pointer[i + 6] = pointer[i] + 32.0f;
    pointer[i + 7] = pointer[i + 1] + 32.0f;
    pointer[i + 8] = 0.0f;

    pointer[i + 9] = pointer[i] + 32.0f;
    pointer[i + 10] = pointer[i + 1];
    pointer[i + 11] = 0.0f;
}

Turns out lines like these: pointer[i + 9] = pointer[i] + 32.0f;
count as reading the buffer!

So after changing all the lines to just use the pos array, like the first two lines do, my FPS shot up to ~1050 FPS!
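For reference, the write-only version of the loop would look roughly like this. It's a sketch: `FillQuads` is a made-up helper name, and `dst` stands in for the pointer returned by glMapBufferRange. The key point is that every right-hand side reads only from the CPU-side `pos` array or from locals, never from the mapped buffer.

```cpp
// Sketch of the corrected fill loop. `dst` stands in for the mapped
// pointer; `pos` is the CPU-side position array, which in this thread
// uses the same stride-12 indexing as the vertex data. Nothing on the
// right-hand side of any assignment reads from `dst`.
static void FillQuads(float* dst, const float* pos, int quadCount, float size)
{
    for (int i = 0; i < 12 * quadCount; i += 12)
    {
        const float x = pos[i];     // held in locals/registers, so the
        const float y = pos[i + 1]; // other three corners never read `dst`

        dst[i]     = x;        dst[i + 1]  = y;        dst[i + 2]  = 0.0f;
        dst[i + 3] = x;        dst[i + 4]  = y + size; dst[i + 5]  = 0.0f;
        dst[i + 6] = x + size; dst[i + 7]  = y + size; dst[i + 8]  = 0.0f;
        dst[i + 9] = x + size; dst[i + 10] = y;        dst[i + 11] = 0.0f;
    }
}
```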


Turns out lines like these: pointer[i + 9] = pointer[i] + 32.0f;
count as reading the buffer!
Yep! I didn't even look at that code before... but this is really bad for performance.

The best case for a map call is that the pointer returned by glMap* is an actual pointer into GPU-RAM, and that the OS will have marked these pages of addresses as being uncached and write-combined.

When you write to those addresses, the CPU doesn't bother also writing the data to its L1/L2/L3 caches, because it knows you're not going to be reading it again. It also doesn't write to RAM immediately: it buffers up each small write into a write-combining buffer of some large size, and when enough data has been written, it flushes that whole buffer through to GPU-RAM in a single bulk transfer.

All this is wonderful... until you try to read from the buffer!

At that point, the CPU has to stall, prematurely flush out the write-combine buffer, wait for that transfer to actually reach GPU-RAM, then issue a read request to copy data from GPU-RAM back to the CPU. Furthermore, normally when reading data from RAM, you'll transfer a bulk amount (e.g. 64 bytes) and store it in the cache, as you'll likely want to read some nearby data soon too, and then move the requested amount (e.g. 4 bytes) from the cache to a CPU register -- in the best case, this means you do one RAM transaction when reading 16 floats! In this case though, because we're using uncached memory, every single read request results in a RAM transaction. Trying to read 16 floats == waiting on 16 "cache misses".

Out of curiosity, do you see any difference now between only UNSYNCHRONIZED, and UNSYNCHRONIZED with INVALIDATE?

I looked into it and using the flag combo:
GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT

I'm pulling about ~1050 FPS

When using either of the flag combos:

GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT or
GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT

there is a slight drop, bringing me down to ~1040 FPS.

Now when doing orphaning using glBufferData:


glBufferData(GL_ARRAY_BUFFER, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, NULL, GL_DYNAMIC_DRAW);	
GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);

I get roughly the same FPS as with the GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT combo, but it fluctuates more for a bit before settling around ~1038 to ~1040 FPS.

Also, regarding orphaning (glBufferData with NULL) and the GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT flag combo: are these considered to be doing the same thing? In the past I have had some issues with graphical artifacts when trying only the GL_MAP_INVALIDATE_BUFFER_BIT flag, but I cleared them up using orphaning, so I'm unsure whether I was doing it correctly.


No, they're completely different things.

Orphaning allocates an entirely new buffer for you to write new data into, and any new draw commands will reference this new buffer.

Any existing draw commands which have already been submitted (but haven't yet been consumed by the GPU) will still use the old allocation, with the old data, so there's no chance of graphical corruption. After all those commands are executed by the GPU, the driver will garbage-collect this orphaned allocation.

Unsynchronized mapping just gives you a pointer to the existing allocation for that buffer, with zero synchronization or safety. You're making a sacred promise to the driver that you will not overwrite any part of the data that could potentially be used by existing draw commands. You need to implement your own ring-buffer or similar allocation strategy, and use GL fences/events to tell when it's safe to overwrite different parts of the buffer.
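The ring-buffer bookkeeping described above can be sketched like this. It's a sketch only: the struct and member names are made up for illustration, fence handles are stored as opaque integers, and the GL calls that would sit at each step (glFenceSync after submitting draws from a region, glClientWaitSync plus glDeleteSync before reusing it) are named in comments rather than called.

```cpp
#include <cstdint>

// Hypothetical bookkeeping for a ring buffer over one VBO, split into
// kRegions equally-sized regions. The CPU writes region N this frame
// while the GPU may still be reading regions N-1 and N-2.
struct RingBookkeeping
{
    static const int kRegions = 3;       // triple-buffered
    int64_t fence[kRegions] = {0, 0, 0}; // 0 = region not in flight
    int     current = 0;

    // Before mapping region `current` with GL_MAP_UNSYNCHRONIZED_BIT:
    // if a fence is pending, the GPU may still be reading this region,
    // so glClientWaitSync on it first, then glDeleteSync it.
    bool MustWait() const { return fence[current] != 0; }

    // After submitting the draws that read region `current`: record the
    // handle returned by glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
    // and advance to the next region. Returns the region just written.
    int Advance(int64_t newFence)
    {
        fence[current] = newFence;
        int written = current;
        current = (current + 1) % kRegions;
        return written;
    }
};
```

With three regions, MustWait() only reports a real wait when the GPU has fallen more than two frames behind, so the fences block only when they actually need to.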

From your timing data, we can guess that the extra GPU memory allocation management involved in orphaning is costing you about 10μs per buffer per frame... which is pretty good!

Is orphaning seen as the fastest way, or the better choice, to handle VBOs when mapping, since the driver will automatically (and, I assume, as quickly as possible) get a new block of memory, and maybe even reuse/recycle previously allocated buffers?

Or would it be optimal to have 6-8 VBOs at the ready with fences, since the fences will only block when they need to (e.g. when the GPU is not able to process the data fast enough)?

This topic is closed to new replies.
