Map Buffer Range Super Slow?

14 comments, last by Hodgman 9 years, 1 month ago

A buffer sized for about 3 frames worth of data is generally safe enough. You can map with unsynchronized, append, and when the buffer fills you just begin again at the start of the buffer without needing to fence or orphan. I'd still feel safer adding a fence just in case you hit a non-typical frame.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Out of curiosity, do you see any difference now between only UNSYNCHRONIZED, and UNSYNCHRONIZED with INVALIDATE?


I looked into it and using the flag combo:
GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT

I get about 1050 FPS

When using either of these flag combos:
GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT
GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT

There is a slight drop, bringing me down to ~1040 FPS

Now when doing orphaning using glBufferData:


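// Orphan: ask for fresh storage of the same size, so the old storage can stay with the GPU until it is done.
glBufferData(GL_ARRAY_BUFFER, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, NULL, GL_DYNAMIC_DRAW);
// Map the whole buffer for writing without waiting on pending GPU work.
GLfloat *pointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, 0, MAX_SPRITE_BATCH_SIZE * VERTEX_DATA_SIZE_PER_QUAD, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);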

I get roughly the same FPS as with the GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT combo, but it fluctuates more for a bit before settling at ~1038 to ~1040 FPS

Also, regarding orphaning (glBufferData with NULL) and the GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_BUFFER_BIT flag combo: are these considered to do the same thing? In the past I had some graphical artifacts when using only the GL_MAP_INVALIDATE_BUFFER_BIT flag, and orphaning cleared them up, so I'm unsure whether I was doing it correctly

A small piece of advice: please don't give timing values in FPS. Use time instead. 1050 -> 1040 FPS is a difference of just 0.009 milliseconds per frame. Meanwhile 30 -> 20 FPS is about 16.7 ms per frame, well above 10 ms. In the context of computer graphics, FPS is pretty meaningless save for a few very specific purposes. It's called frames per second for a reason, otherwise people would just call it frequency :P
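To make the FPS-versus-time point concrete, here is a minimal sketch in C that converts frame rates to per-frame times. The function names frame_time_ms and frame_time_increase_ms are made up for this illustration; they are not from any library.

```c
#include <assert.h>

/* Convert a frame rate (frames per second) to time per frame in milliseconds. */
double frame_time_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Extra milliseconds per frame when the rate drops from fps_before to fps_after. */
double frame_time_increase_ms(double fps_before, double fps_after)
{
    return frame_time_ms(fps_after) - frame_time_ms(fps_before);
}
```

Calling frame_time_increase_ms(1050.0, 1040.0) gives roughly 0.009 ms, while frame_time_increase_ms(30.0, 20.0) gives roughly 16.7 ms: the same 10 FPS drop, three orders of magnitude apart in actual cost.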

^ Yes! What that guy said! I hate seeing FPS measurements. They totally skew perception of the results.

With those particular numbers I'd just use glBufferSubData and wouldn't even bother.

So I'm wondering what gives. I would think the numbers would be the other way around? I thought the UNSYNCHRONIZED flag was supposed to tell the GPU not to block, whereas glBufferSubData does cause a block on the GPU?

That isn't necessarily true. glBufferSubData works by passing the data to the driver, the driver making a copy of it, then returning from the call. The driver is then free to copy it to the GPU immediately, buffer a couple of commands first, or use whatever data updating scheme it prefers.

With glMapBuffer, the driver gives you a pointer into the driver's memory space, so no intermediary copy is made; you just write to the driver's memory.
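As a rough illustration of the difference described above, here is a toy model in C that only counts CPU-side copies. It does not call OpenGL, and ToyDriver, toy_buffer_sub_data and toy_map are hypothetical names invented for this sketch; what a real driver does internally will vary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: stands in for a driver, makes no OpenGL calls. */
typedef struct {
    unsigned char staging[256]; /* the driver's private copy of the caller's data */
    unsigned char storage[256]; /* the buffer's backing memory */
    size_t bytes_copied;        /* bytes copied by the "driver" itself */
} ToyDriver;

/* Models the glBufferSubData path: snapshot the data, then transfer it. */
void toy_buffer_sub_data(ToyDriver *drv, const void *src, size_t size)
{
    assert(size <= sizeof drv->staging);
    memcpy(drv->staging, src, size);          /* driver copies, caller may now reuse src */
    drv->bytes_copied += size;
    memcpy(drv->storage, drv->staging, size); /* transfer; a real driver may defer this */
    drv->bytes_copied += size;
}

/* Models the mapped path: hand out a pointer, the caller writes in place. */
unsigned char *toy_map(ToyDriver *drv)
{
    return drv->storage;
}
```

In this model the sub-data path moves every byte twice, while the mapped path leaves bytes_copied at zero because the caller's own write is the only one. That is the "no intermediary copy" point, not a claim about which path is faster on a given driver.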

EDIT: Mother fucking quote blocks, how do they work?

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

A buffer sized for about 3 frames worth of data is generally safe enough. You can map with unsynchronized, append, and when the buffer fills you just begin again at the start of the buffer without needing to fence or orphan. I'd still feel safer adding a fence just in case you hit a non-typical frame.


I have tried this in the past, but had a lot of issues with graphical artifacts. Parts of my quads would warp or stretch out in awkward ways.
So I ended up switching to orphaning only when my buffer was full, but given what you mention above, that might be kind of bad, since I end up orphaning no matter what, even if the data in my buffer has not been in use for a while :(

When it comes to fences how would this work? I understand how the actual fence works (I basically stall if the GPU is not finished with its commands)

But how would I know that my fence is valid? Valid in the sense that I am only waiting on the parts of my buffer that I need to wait on.

Or is it something that is automatically done, since I am only mapping a range of my buffer and not the entire thing?

What you're basically doing is using it as a ring buffer. Say you create a buffer big enough to hold 4000 objects. In frame 1 you write 1000 objects into locations 0 to 999. In frame 2 you're drawing another 1000 objects but this time you write them into locations 1000 to 1999, adjusting the parameters of your draw calls accordingly. In frame 3 you draw another 1000 from locations 2000 to 2999. Frame 4 you use 3000 to 3999. When frame 5 comes around you can say with some confidence that the commands you issued in frame 1 have already been processed by the GPU so it's now safe to go back and write to locations 0 to 999 again. Use a fence to ensure that it's safe by blocking until the GPU is ready before you do so, if you wish; otherwise orphan, otherwise choose to live a little dangerously.

Each frame you're appending to a part of the buffer that you absolutely know for certain is not being used by any previously queued draw calls. Either the fence ensures it, or orphaning ensures it, or enough frames have passed (GPUs typically buffer up 3 frames; I added one extra in this example for headroom). Either way, you know it because you haven't issued any draw calls from that part of the buffer yet, or the GPU has already finished with the calls you previously issued from it. That part is perfectly safe to do. What's important is that the fence (or orphaning) isn't necessary every frame; it's only necessary in the frame in which you go back to the start of the buffer and begin again from that point.
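The bookkeeping for the ring buffer described above can be sketched in plain C with no GL calls. RingAllocator and ring_reserve are hypothetical names for this example, and the unit (objects or bytes) is up to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical append-only allocator over a single buffer object. */
typedef struct {
    size_t capacity; /* total size of the buffer */
    size_t head;     /* next free position */
} RingAllocator;

/*
 * Reserve `count` contiguous slots and return their starting offset.
 * *wrapped is set to true only when the allocator had to restart at
 * offset 0; that is the one moment a fence wait (or an orphan) is
 * needed before writing.
 */
size_t ring_reserve(RingAllocator *ring, size_t count, bool *wrapped)
{
    assert(count <= ring->capacity);
    *wrapped = false;
    if (ring->head + count > ring->capacity) {
        ring->head = 0; /* not enough room left: start again from the beginning */
        *wrapped = true;
    }
    size_t offset = ring->head;
    ring->head += count;
    return offset;
}
```

With a capacity of 4000 and 1000 objects per frame, frames 1 to 4 get offsets 0, 1000, 2000 and 3000 with wrapped false, and frame 5 gets offset 0 with wrapped true. The returned offset (scaled to bytes) is what you would pass to glMapBufferRange and use as the base for that frame's draw calls.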

A buffer sized for about 3 frames worth of data is generally safe enough. You can map with unsynchronized, append, and when the buffer fills you just begin again at the start of the buffer without needing to fence or orphan. I'd still feel safer adding a fence just in case you hit a non-typical frame.

Never do this without fences! You can't assume a maximum of two frames of latency unless you use a fence to enforce it.
You're gambling that your users won't be GPU-bottlenecked and that their driver won't try to help by adding extra command buffering. You're betting graphical corruption against the prize of not having to make a handful of API calls at the end of each frame. That's not a good wager.

The simplest solution is to place a fence at the end of every frame and have the CPU block on the previous frame's fence (max 1 frame latency) or the previous previous frame's fence (max 2 frames latency).

You can then safely size your buffers to be max_size_required_per_frame * (max_latency + 1), and otherwise safely make assumptions about max latency.
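Here is a small sketch of the arithmetic behind that scheme, in plain C without GL calls. The function names are hypothetical. In a real renderer the fence itself would be created with glFenceSync at the end of each frame and waited on with glClientWaitSync; the code below only works out which frame's fence to wait on and where each frame writes.

```c
#include <assert.h>
#include <stddef.h>

/* Ring size from the formula above: per-frame maximum times (max_latency + 1). */
size_t ring_size_bytes(size_t max_bytes_per_frame, unsigned max_latency)
{
    return max_bytes_per_frame * ((size_t)max_latency + 1);
}

/* Byte offset of the region that frame `frame_index` writes into. */
size_t frame_region_offset(unsigned long frame_index,
                           size_t max_bytes_per_frame, unsigned max_latency)
{
    return (size_t)(frame_index % (max_latency + 1)) * max_bytes_per_frame;
}

/*
 * Which frame's fence the CPU blocks on at the end of frame `frame_index`.
 * max_latency 1 means the previous frame, 2 means the one before that.
 * Returns -1 while too few frames have been submitted to need a wait.
 */
long frame_to_wait_at_end_of(unsigned long frame_index, unsigned max_latency)
{
    assert(max_latency >= 1);
    if (frame_index < max_latency) {
        return -1;
    }
    return (long)(frame_index - max_latency);
}
```

The two pieces fit together: after the wait at the end of frame N, frame N+1 writes into the region last used by frame N - max_latency, which is exactly the frame whose fence was just waited on.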

[edit]
BTW, many games have done this for a long time, just to put a limit on buffered GPU frames, as more buffering results in more input lag. First-person shooters and other "twitchy" games often limit themselves to one frame's latency, which also has the bonus of reducing the memory requirements of your ring buffers.

This topic is closed to new replies.
