#1 theagentd

Posted 20 June 2013 - 12:43 PM

I'm sorry for taking so long to respond. Work is killing me...

 

@mhagain

My manual "synchronization" is actually no synchronization at all. I'm depending on a rolling buffer approach where I allocate and resize VBOs as they are needed and then ensure that the same VBO is not reused until at least 6 frames have passed. 6 frames is a lot of time and should be much longer than the OpenGL driver is prepared to let the GPU fall behind before stalling the CPU, and neither decreasing or increasing this value has an effect on performance (although low values introduce artifacts of course). Regardless, the fact that I am using GL_MAP_UNSYNCHRONIZED_BIT should disable all synchronization and be the fastest way of doing this. I don't really care if this is not 100% correct or safe at this point, I'm just saying that at the moment I'm not doing any 2-way communication with the GPU at all, so I don't see any possible way that the performance problems on Nvidia cards are my fault.

 

@marcClintDion

My instancing algorithm is as simple as it can get. I simply upload a buffer (using the VBO handling described above) filled with the 3D position of each instance (16-bit values, padded from 6 to 8 bytes), which is read in the shader as a per-instance attribute (glVertexAttribDivisor(instancePositionLocation, 1)). Everything is then drawn with a single call to glDrawElementsInstanced().
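In rough C-style code the setup boils down to something like this (instanceVbo, indexCount and instanceCount are placeholder names, and the exact format flags are assumptions rather than copied from my code):

    /* Per-instance position: 3 x 16-bit values padded to 8 bytes,
       so it is read as 4 shorts with a stride of 8. */
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glEnableVertexAttribArray(instancePositionLocation);
    glVertexAttribPointer(instancePositionLocation, 4, GL_SHORT, GL_FALSE, 8, (void*)0);
    glVertexAttribDivisor(instancePositionLocation, 1);   /* advance once per instance */

    /* Every instance is drawn with a single call. */
    glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, NULL, instanceCount);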

 

Concerning performance, 5 out of the 7 cards perform as I expect. The AMD HD5500 and the Intel HD5000 are both heavily limited by fragment performance, not vertex performance. I'd also argue that the AMD cards are too slow at pseudo-instancing, not the other way around. The performance numbers also add up when comparing the cards:

 

GTX 295 vs HD7790: the GTX 295 was only running on one GPU. With both GPUs enabled I get around 90% higher FPS, which is very close to the HD7790, and the two cards have very similar theoretical compute performance.

 

 

Instancing render() method: http://pastie.org/8063921 (simplified) Shader: http://pastie.org/8063953
Pseudo-instancing render() method: http://pastie.org/8063921 (simplified) Shader: http://pastie.org/8063948
 
 
I've also tested the performance of simply using glBufferData() instead of glMapBufferRange(..., GL_MAP_UNSYNCHRONIZED_BIT). Here are the test results from my GTX 295 with both GPUs enabled. This is a new scene, so these numbers are not comparable to the ones in my previous post.
 
Instancing + glMapBufferRange: 112 FPS
Instancing + glBufferData: 141 FPS
Pseudo-instancing: 153 FPS
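For reference, the glBufferData() variant is just the classic "respecify the whole buffer every frame" approach, roughly like this (instanceData and instanceBytes are placeholder names):

    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    /* Hand the driver a complete new data store each frame; with GL_STREAM_DRAW
       the driver is free to orphan the old storage instead of synchronizing. */
    glBufferData(GL_ARRAY_BUFFER, instanceBytes, instanceData, GL_STREAM_DRAW);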
 
Similar performance numbers have been confirmed on Nvidia cards from every series from the 200 series up to the 600 series, on laptops as well as high-end and low-end desktop GPUs. So far the simplest solution seems to be to let Nvidia cards use the pseudo-instancing version while everything else uses instancing + glMapBufferRange.
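Picking the path could be as simple as checking the vendor string at startup; something like the following sketch (a guess on my part, not tested code, and renderInstanced/renderPseudoInstanced are placeholder names):

    /* Needs <string.h> for strstr(). NVIDIA's vendor string contains "NVIDIA". */
    const char* vendor = (const char*)glGetString(GL_VENDOR);
    int useHardwareInstancing = (vendor == NULL || strstr(vendor, "NVIDIA") == NULL);

    if (useHardwareInstancing) {
        renderInstanced();        /* instancing + glMapBufferRange */
    } else {
        renderPseudoInstanced();  /* pseudo-instancing path for Nvidia */
    }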
 
Sorry for the wall of text... >_<
