Multiple VAOs and VBOs


However, no one seems to be interested in doing anything about it.

Post it to Timothy Lottes/Graham Sellers/Christophe Riccio's Twitter. See what they say...

They are aware of it.

To be clear, I doubt VAOs perform that badly. While you've got Valve and a couple of scattered devs saying otherwise, on the other end you have every single OpenGL driver developer out there saying they perform better. And to go against that kind of people, you'd need a few more big names than just Valve to say it, someone that either is known for being a graphics powerhouse (Crytek, DICE) so they're bound to know what the fuck they're doing, or someone who has been working with OpenGL for a long ass time (say, Carmack).

So far I haven't seen such complaints from other developers that have been porting new games to OpenGL lately (Firaxis, Aspyr, 4A Games, etc.).

Maybe you should do some research on my qualifications before implying that I don't know what I'm talking about.

someone that either is known for being a graphics powerhouse (Crytek, DICE) so they're bound to know what the fuck they're doing, or someone who has been working with OpenGL for a long ass time (say, Carmack).

...Someone like Eric Lengyel, perhaps?

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


Maybe you should do some research on my qualifications before implying that I don't know what I'm talking about.

Ohh sorry man, I didn't mean to imply you didn't know what you were talking about. I thought you were saying that I should take Valve's opinion alone on the subject (it wouldn't be the first time Valve says XYZ and it becomes the absolute truth), and that's what I was disagreeing with.


They are aware of it.

Sweet

Cass Everitt: VAO is a giant state object that can change all VA state plus needs name lookup and mem deref which usually misses in the cache. App structs are smaller, cache better, and pollute less GL state. Naive VAO impl is awful. But even optimized has trouble beating non-VAO path.

Basically it boils down to giant sweeps of state change with a big struct vs. making only the fine-grained state changes you need? Looks like D3D state structs vs. all the more fine-grained OpenGL calls for setting the same state...

EDIT: Oops, quoting Twitter doesn't come out very nicely in the editor...
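Something like this is what I imagine the app-side struct approach looking like (a rough sketch with made-up names; only a slice of vertex state is shown, and real code would track more than this):

typedef struct VertexAttribState
{
    GLuint     buffer; //VBO this attribute sources from
    GLint      size;   //component count (1-4)
    GLenum     type;   //e.g. GL_FLOAT
    GLsizei    stride;
    GLsizeiptr offset;
} VertexAttribState;

static VertexAttribState cached[16]; //last state we actually sent to GL

static void setVertexAttrib( GLuint idx, const VertexAttribState *s )
{
    //Fine-grained filtering: only touch GL if something actually changed.
    if( cached[idx].buffer != s->buffer || cached[idx].size   != s->size   ||
        cached[idx].type   != s->type   || cached[idx].stride != s->stride ||
        cached[idx].offset != s->offset )
    {
        glBindBuffer( GL_ARRAY_BUFFER, s->buffer );
        glVertexAttribPointer( idx, s->size, s->type, GL_FALSE,
                               s->stride, (const GLvoid*)s->offset );
        cached[idx] = *s;
    }
}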

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

All of this is consistent with an observation both myself and (IIRC) L Spiro made here a coupla years back. There's ever-increasing real-world evidence that yes, they are slower, with only AMD's GL driver guy claiming otherwise.

Despite that, I do like using VAOs in cases where I can afford to burn the performance because they do make state management/filtering easier and cleaner.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


with only AMD's GL driver guy claiming otherwise.
To be fair, I've seen that the techniques he describes require very few VAO changes, which was what Matias was referring to. I saw some posts from him (well, from him and Matias) on the Ogre forums, let me find them...

Here: http://www.ogre3d.org/forums/viewtopic.php?f=25&t=81060

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

You're mixing the round robin method (using two VBOs) with unsynchronized mapping.

Normally when you use round robin, it's because you don't have the unsynchronized methods available.

When you've got unsynchronized access, you use one VBO, but allocate twice (or three times) the size you need. When mapping, you lock a different subregion each frame. For example, if you need to update 32MB, create a 96MB buffer. On frame 0 lock region [0; 32), on frame 1 lock [32; 64), on frame 2 lock [64; 96), then on frame 3 lock region [0; 32) again, and so on. It's still a double-buffer scheme, but with one buffer (well, in my example a triple-buffer scheme, which is often what's recommended).

VERY IMPORTANT: Your code is broken, because you're using unsynchronized flags without fencing. That means you may write to a buffer while the GPU is still using it, so glitches or crashes can happen (or even a full system hang/BSOD).
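For reference, the snippet below assumes setup roughly along these lines (hypothetical, but it declares the vbo, fences, index and size the code uses):

const GLsizeiptr size = 32 * 1024 * 1024; //32MB of dynamic data per frame
GLuint vbo;
glGenBuffers( 1, &vbo );
glBindBuffer( GL_ARRAY_BUFFER, vbo );
glBufferData( GL_ARRAY_BUFFER, size * 3, 0, GL_DYNAMIC_DRAW ); //96MB, three regions
GLsync fences[3] = { 0, 0, 0 };
int index = 0; //which region we'll write this frame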


//Wait for this region's fence to complete
if( fences[index] )
{
    GLbitfield waitFlags    = 0;
    GLuint64 waitDuration   = 0;
    while( true )
    {
        GLenum waitRet = glClientWaitSync( fences[index],
                                           waitFlags, waitDuration );
        if( waitRet == GL_ALREADY_SIGNALED || waitRet == GL_CONDITION_SATISFIED )
        {
            glDeleteSync( fences[index] );
            fences[index] = 0;
            break;
        }

        if( waitRet == GL_WAIT_FAILED )
        {
            //Fatal error! (Out of memory? Driver error? GPU was removed?)
            break;
        }

        const GLuint64 kOneSecondInNanoSeconds = 1000000000;
        //After the first try, we need to start flushing, and wait for a looong time.
        waitFlags    = GL_SYNC_FLUSH_COMMANDS_BIT;
        waitDuration = kOneSecondInNanoSeconds;
    }
}

glBindBuffer( GL_ARRAY_BUFFER, vbo );
//Map only this frame's region, with the Write and Unsynchronized bits
GLvoid *data = glMapBufferRange( GL_ARRAY_BUFFER, index * size, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT );
//... fill 'data' with this frame's content ...
glUnmapBuffer( GL_ARRAY_BUFFER );

// .... 

glDrawElements( GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0 );
//The fence needs to be created after you're done with the commands that read the VBO.
//Noob mistake: don't create the fence right after you've unmapped it when you
//will still issue another command that reads from it (like a draw call).
fences[index] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

index = (index + 1) % 3; //Triple buffer scheme

The apitest sample has code showing how to do this (note most of it is written with GL4 in mind).

I'm sort of having trouble understanding how mixing these could be bad. Not saying that is what you mean, but that is how it reads to me. I would think this would act as an extra safety net to the whole unsynchronized methodology.

I basically see using glMapBufferRange + the unsynchronized flag as 'Put this data in the VBO right now, I do not care how or what you do just put it in there'. Which could lead to things not drawing right if you accidentally map to a VBO that is being used in drawing. If I use Round Robin with 3 VBOs or more and they all get mapped with glMapBufferRange + the unsynchronized flag, I would think that the only way it would fail is if my GPU is falling behind really really badly or something is seriously wrong. Even then we could just add a fence in there as a triple safety check, but I kind of think that is going too far at that point.

Is mixing those two methods just overkill? Or am I thinking so crazily that it becomes a performance killer? I can only think of a few downsides, like needing 2 glBindBuffer calls per frame (assuming I have separate VBOs), but even then I could fix that by using one giant VBO, offsets when mapping, and glDrawRangeElements when drawing.

I basically see using glMapBufferRange + the unsynchronized flag as 'Put this data in the VBO right now, I do not care how or what you do just put it in there'. Which could lead to things not drawing right if you accidentally map to a VBO that is being used in drawing.

This is only correct if you have a possibility of writing to a portion of the buffer that may be currently used for drawing.

The classic streaming buffer method appends data to the buffer, so you can be absolutely certain that the region you're writing to is not being used for drawing. When there is no more space in the buffer you "orphan" it. What this means is that if the entire buffer is no longer in use the GL implementation will just continue using it, otherwise it will hand you a new block of memory and (potentially) free the old buffer memory when it is no longer being used; meanwhile the old memory can continue being used for draw calls without any issues.

Over a few frames things settle down and the driver is no longer allocating new blocks of memory but is instead just handing back blocks of memory that had previously been used. So in other words it's not necessary to do your own multi-buffering, because the driver itself is automatically multi-buffering for you behind the scenes.

At this stage it's worth highlighting that this buffer update model is well-known and widely used in D3D-land (where it's called "discard/no-overwrite") and has existed since D3D8, if not earlier; i.e. it has close to 15 years of real-world usage behind it. So it's not some kind of voodoo magic that you may not be able to rely on; it's a well-known and widely-understood usage pattern that driver writers anticipate and optimize around.
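Sketched out, the pattern looks something like this (a minimal sketch with made-up names and buffer size; it assumes the streaming VBO is already bound to GL_ARRAY_BUFFER):

#define STREAM_BUFFER_SIZE (4 * 1024 * 1024) //illustrative size
static GLintptr streamOffset = 0;

static GLvoid* streamAlloc( GLsizeiptr bytes )
{
    GLbitfield access = GL_MAP_WRITE_BIT;
    if( streamOffset + bytes > STREAM_BUFFER_SIZE )
    {
        //Buffer full: orphan it. The driver hands back a fresh (or recycled)
        //block while the GPU keeps reading from the old one.
        streamOffset = 0;
        access |= GL_MAP_INVALIDATE_BUFFER_BIT; //"discard"
    }
    else
    {
        //Appending to a region no pending draw call references.
        access |= GL_MAP_UNSYNCHRONIZED_BIT; //"no-overwrite"
    }
    GLvoid *ptr = glMapBufferRange( GL_ARRAY_BUFFER, streamOffset, bytes, access );
    streamOffset += bytes;
    return ptr;
}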

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

I basically see using glMapBufferRange + the unsynchronized flag as 'Put this data in the VBO right now, I do not care how or what you do just put it in there'. Which could lead to things not drawing right if you accidentally map to a VBO that is being used in drawing.


This is only correct if you have a possibility of writing to a portion of the buffer that may be currently used for drawing.

The classic streaming buffer method appends data to the buffer, so you can be absolutely certain that the region you're writing to is not being used for drawing. When there is no more space in the buffer you "orphan" it. What this means is that if the entire buffer is no longer in use the GL implementation will just continue using it, otherwise it will hand you a new block of memory and (potentially) free the old buffer memory when it is no longer being used; meanwhile the old memory can continue being used for draw calls without any issues.

Over a few frames things settle down and the driver is no longer allocating new blocks of memory but is instead just handing back blocks of memory that had previously been used. So in other words it's not necessary to do your own multi-buffering, because the driver itself is automatically multi-buffering for you behind the scenes.

At this stage it's worth highlighting that this buffer update model is well-known and widely used in D3D-land (where it's called "discard/no-overwrite") and has existed since D3D8, if not earlier; i.e. it has close to 15 years of real-world usage behind it. So it's not some kind of voodoo magic that you may not be able to rely on; it's a well-known and widely-understood usage pattern that driver writers anticipate and optimize around.

I actually was doing a sprite batcher in DX11 that used the discard/no-overwrite flags, but L. Spiro got onto me about doing it this way and the cost of orphaning, pushing that manual buffering was better. To be honest I didn't want to believe him, because I always thought 'No, the driver knows how to handle it best'. But I started to question that with papers and posts like:

http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-AsynchronousBufferTransfers.pdf (Section 28.3.2: Buffer Respecification (Orphaning) )

http://www.java-gaming.org/index.php?topic=28209.0 (Reply #5: VBO FPS Numbers)

These give out performance numbers, or state that orphaning is only as good as the driver's implementation.

So I'm not really sure what to believe, which is why I'm always stuck in an 'Am I even doing this right?' state.

I'm sort of having trouble understanding how mixing these could be bad. Not saying that is what you mean, but that is how it reads to me. I would think this would act as an extra safety net to the whole unsynchronized methodology.

It's not that it's bad (btw, mixing both doesn't act as a safety net). The thing is that you needlessly create a problem for yourself when you need to render with VAOs.

The VAO saves the VBO that is bound.
If you use three VBOs, either you modify the state of the VAO every frame, or you have three VAOs (one per VBO). With just one VBO but different ranges, you need only one VAO and never have to modify it.
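For example (a sketch, assuming GL 3.2's glDrawElementsBaseVertex and made-up names like vao, vertsPerRegion and indexCount), the per-frame region can be selected at draw time so the VAO is never touched:

glBindVertexArray( vao ); //attribute pointers set up once, never modified
GLint baseVertex = index * vertsPerRegion; //index = frame % 3
glDrawElementsBaseVertex( GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
                          0, baseVertex );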

I basically see using glMapBufferRange + the unsynchronized flag as 'Put this data in the VBO right now, I do not care how or what you do just put it in there'.

That's correct.

Which could lead to things not drawing right if you accidentally map to a VBO that is being used in drawing.

That's the best thing that can happen. The worst thing that can happen is a full system crash (BSOD, or not even that, a DOS-style lockup needing a hard reset). It depends on the GPU architecture and the motherboard (bus).
You must synchronize.

Note that for dynamic content (and assuming you will be discarding the whole contents every frame), you just need one fence per frame. You don't need one fence per buffer.
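In sketch form (hypothetical names), that means one fence array indexed by frame, signaled after the frame's last draw, no matter how many dynamic buffers were written:

GLsync frameFences[3] = { 0, 0, 0 };

//End of frame, after the last command that reads any dynamic buffer:
frameFences[index] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
//Next time around, wait on frameFences[index] (as in the code above)
//before writing any region belonging to that frame.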

If I use Round Robin with 3 VBOs or more and they all get mapped with glMapBufferRange + the unsynchronized flag, I would think that the only way it would fail is if my GPU is falling behind really really badly or something is seriously wrong.

If you don't use round robin, it's the same thing, because in the example I gave you, you would be writing to a region that the GPU is not using right now.
Remember, the point is not whether the VBO as a whole is in use by the GPU while you're writing from the CPU. What's important is that the dword (4 bytes) you're writing to from the CPU is not currently being read by the GPU (to avoid a crash), and that the region of memory you're writing to has already been read by all the GPU commands issued so far (to avoid a race condition causing graphic corruption).

Is mixing those two methods just overkill?

Yes, because you gain nothing from mixing them, and you complicate your code by having more VBOs and more VAOs.
