glFenceSync correct usage

I am testing out the use of a sync fence object and I believe I have it implemented correctly (I occasionally get GLenum 37146 (GL_ALREADY_SIGNALED) back from glClientWaitSync, though). But I am a little skeptical, since I would think that orphaning a 4 MB buffer when it is full would do worse than just starting over at the beginning. Not that the sync version should necessarily do better, but I would have expected it to. The difference is not that big, but I am still curious as to why:

Orphan Delta Time (In Seconds): ~0.00049
Sync Delta Time (In Seconds): ~0.00053

So I am left wondering: am I doing it correctly?

Sync version below:
//Map buffer Method
void MapBuffer()
{
	if (mapPointer != NULL)
		return;

	// In the Orphan version, this fence code would be gone
	fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0); // Create the fence (flags must be 0)
	GLenum state = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000); // Timeout after 1 second
	glDeleteSync(fence); // Delete the fence object
	if (state != GL_CONDITION_SATISFIED) // GL_ALREADY_SIGNALED also means the wait succeeded; only TIMEOUT_EXPIRED/WAIT_FAILED are real problems
		std::cout << "Fence blocked! CODE: " << state << std::endl;
	

	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	if (mapBufferOffset == mapBufferLength)
	{
		mapBufferOffset = 0;
		baseVertex = 0;

		//glBufferData(GL_ARRAY_BUFFER, mapBufferLength, NULL, GL_DYNAMIC_DRAW); // <-- orphaned here in the Orphan version
	}

	mapPointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, mapBufferOffset, mapBufferLength - mapBufferOffset, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
	if (mapPointer == NULL)
		throw std::runtime_error("Sprite Batcher Error. Map Pointer is NULL");
}

//Main loop in a nutshell
int main()
{
      while(run)
      {
          MapBuffer();
          DrawSprite(1);
          DrawSprite(2);
          UnmapBuffer();
          Render();
      }
}


Orphaning can actually be really fast; in the best case the driver already has an appropriately sized block of memory ready and waiting, so all it needs to do is swap pointers. If you orphan with the kind of frequency the driver expects (i.e. regularly enough), it's going to be able to keep on just swapping pointers and will never actually have to do an allocation at all.

In other words - and this is driver-internal behaviour so I can't give specifics - the driver doesn't necessarily have to release the memory used for an orphaned buffer as soon as all pending draw calls on it have completed. It can decide to defer cleaning it up until a few frames later, just in case it needs to hand it back to you again before that time. That's why the docs generally advise that the first operation you do on a dynamic buffer each frame is to orphan it. By contrast, if you only orphan every 5/6/7-or-more frames, the driver may have already released the memory and will have to do an allocation (really clever drivers will be able to detect this pattern and adjust their behaviour to match it, but it may take a fair few frames to detect it, apply the heuristic and settle things down).

What I'm saying is that when you say "I am a little skeptical, since I would think that orphaning a 4 MB buffer when it is full would do worse" - don't be. The orphan/append/append/append/orphan pattern is something that's been used for dynamic buffers in D3D since the days of D3D7 (GL only gained it much later) and is a well-understood pattern that driver writers know to expect. So don't be skeptical without first profiling and determining its performance characteristics; in particular, don't write a mess of code because you think it may be slow otherwise: profile it first.
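For illustration, the orphan/append pattern looks roughly like this. This is only a sketch, reusing the vbo / mapBufferOffset / mapBufferLength / baseVertex / mapPointer names from your MapBuffer() above rather than a drop-in implementation:

//Rough sketch of the orphan/append pattern (names reused from MapBuffer above)
void MapBufferOrphaning()
{
	if (mapPointer != NULL)
		return;

	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	if (mapBufferOffset == mapBufferLength)
	{
		//Buffer is full: orphan it. The driver detaches the old storage (still
		//referenced by any in-flight draws) and gives us fresh storage, ideally
		//by just swapping an internal pointer.
		glBufferData(GL_ARRAY_BUFFER, mapBufferLength, NULL, GL_DYNAMIC_DRAW);
		mapBufferOffset = 0;
		baseVertex = 0;
	}

	//Append into the unused tail; GL_MAP_UNSYNCHRONIZED_BIT is safe because we
	//never rewrite a range that a pending draw call might still be reading.
	mapPointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, mapBufferOffset,
		mapBufferLength - mapBufferOffset, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
}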


Your sync logic looks very dangerous to performance...

You're pushing a fence into the command buffer, and then immediately telling the CPU to block until the GPU reaches the fence?
Not only is that stalling the CPU, but it's going to starve the GPU as well. It's a complete CPU/GPU sync point.

A synthetic 10000fps test like this isn't going to show you the actual impact of a CPU/GPU sync like this.
e.g. If you put that code into a real game running at 40fps, you might see it suddenly regress to 20fps because of the syncing.

Your sync logic looks very dangerous to performance...


Then how should it look? I only managed to find one example-ish piece of code on fences, which is why I'm very unsure whether it actually works properly / helps performance.

You're pushing a fence into the command buffer, and then immediately telling the CPU to block until the GPU reaches the fence?
Not only is that stalling the CPU, but it's going to starve the GPU as well. It's a complete CPU/GPU sync point.


Isn't that what I want to do? Wait until the GPU is done with the previous commands, such as glDrawElements, etc.?


Isn't that what I want to do? Wait until the GPU is done with the previous commands, such as glDrawElements, etc.?
Yeah, but you also want to avoid synchronizing the two processors / causing them to wait on each other. Usually you avoid that kind of synchronisation by allocating more memory (double buffering). Either use two (or more) buffers and switch which one you're using per frame, or allocate the buffer to be twice (or more) as large as usual and switch which range of the buffer you're using every frame.

You can either do it fine-grained with one fence per map, or coarse with one fence per frame. I find the per-frame method to be much simpler and faster.

Each frame, keep track of which range of the buffer you've written data into. On the next frame, you can't overwrite any of the previous frame's ranges -- you have to write into new ranges.

If you're unable to allocate a new unused range (e.g. the buffer is too small), then (go and fix your code to allocate a bigger buffer so this doesn't happen, or) you have to orphan the buffer and get yourself a new one as a last resort. Let's say that we've always allocated a big enough buffer though, so we can ignore the orphaning case.

At the end of each frame, insert a fence.

Once the GPU has reached that fence, the range of the buffer used by that frame is now free to use again. At the start of every frame, wait on the fence from two frames earlier.

e.g. At the start of Frame #3, you'll wait on the fence that was issued at the end of Frame #1.

Buffer: [[Buffer Range A][Buffer Range B]]

Frame 1 commands:

Draw using range A.
Draw using range A.
Draw using range A.
Fence #1.

Frame 2 commands:

Draw using range B.
Draw using range B.
Draw using range B.
Fence #2.

Frame 3 commands:

Client wait on Fence #1

Draw using range A.
Draw using range A.
Draw using range A.
Fence #3.

This avoids a CPU/GPU sync, as the GPU has a healthy buffer of one frame's worth of commands between when a fence is issued and when it is waited on.
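To make that concrete, here's a rough sketch of the per-frame version. The names (NUM_RANGES, rangeSize, frameFence, currentRange) are made up for the example, and the orphan fallback and error handling are left out:

//Rough sketch of the per-frame fence scheme described above
const int NUM_RANGES = 2;           //double-buffered ranges within one VBO
GLsizeiptr rangeSize;               //e.g. half of the total buffer size
GLsync frameFence[NUM_RANGES] = {}; //fence issued the last time each range was used
int currentRange = 0;

void BeginFrame()
{
	currentRange = (currentRange + 1) % NUM_RANGES;

	//Wait for the fence issued the last time this range was used -- i.e. the
	//fence from two frames ago. The GPU has had a whole frame of commands in
	//between, so this wait should almost never actually block.
	if (frameFence[currentRange] != NULL)
	{
		glClientWaitSync(frameFence[currentRange], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000);
		glDeleteSync(frameFence[currentRange]);
		frameFence[currentRange] = NULL;
	}

	//Map only this frame's range; unsynchronized is fine because the fence
	//above guarantees the GPU is finished reading it.
	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	mapPointer = (GLfloat*)glMapBufferRange(GL_ARRAY_BUFFER, currentRange * rangeSize,
		rangeSize, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
}

void EndFrame()
{
	//...glUnmapBuffer and the draw calls that read this frame's range go here...

	//Issue the fence right after the last command that reads this range.
	frameFence[currentRange] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}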

To add to Hodgman's post, also note that a fence becomes signalled only after every other command that was posted to the queue prior to it has completed. Which means that while your code is correct (in the sense of working without errors), it is also almost guaranteed to sync with things you don't care about (assuming you're not just drawing these sprites).

You should always insert a fence as soon as possible (that is, as soon as every draw call that uses the protected buffer has been submitted), not some time later as in your example.
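In terms of the main loop from the original post, that means the fence belongs right after the last submitted command that reads the buffer, not at the top of MapBuffer() on the next frame. A sketch only (frameFence/currentRange as in the snippet above, and assuming Render() is where the draws that read the buffer get submitted):

while (run)
{
	MapBuffer();    //waits on the fence from two frames ago, then maps this frame's range
	DrawSprite(1);
	DrawSprite(2);
	UnmapBuffer();
	Render();       //submits the draw calls that actually read the buffer

	//Fence goes here, immediately after everything that touches the buffer,
	//so it doesn't also end up waiting on unrelated work issued later.
	frameFence[currentRange] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}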

