OpenGL Mind if we discuss VBO streaming?

This topic is 1959 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

Ah, well I guess my computer was choosing "Integrated Graphics" when running my app. It's usually pretty good about choosing the right graphics chip, but not this time.

I set it to use Dedicated globally, and that did give each method a significant performance boost, with Method 3 performing about 33% better than Method 1.

When rendering 600k verts, I now get 131fps with Method 1 and 175fps with Method 3.

600k verts! That doesn't sound too shabby to me. =)

You have to remember that the driver is responsible for doing all the dirty work on the backend, and it doesn't guarantee that any of the above methods will do what they are supposed to do, only that the resulting image should be the same. So Method 1 might actually be doing what Method 3 does on the backend. Again, it's completely up to the driver to implement things however it sees fit.

That is an interesting comparison Geometrian. I always thought transform feedback would have minimal impact on performance compared to compute shaders and OpenCL, but from your example it seems it does make a lot of difference. Are there any settings you could tweak to optimize the performance?

Perhaps this is going a bit off topic, but I am curious to know. Does anyone know how compute shaders compare to OpenCL and transform feedback?

Yeah, I'm actually extremely interested in that comparison. When I read about transform feedback, that's pretty much exactly what it sounded like to me -- OpenCL-like functionality within OpenGL. I have to say, Geometrian, your post was very enlightening.

slicer4ever, that's a good point.

I've done some work with compute shaders in D3D (not GL) and in general I've found that they're a good bit slower for this kind of simpler use case, partially because of resource binding rules and partially because current hardware incurs some overhead when switching compute shaders on and off each frame. Future hardware should be better, of course, and more complex use cases do exist where the gain from using a CS outweighs the overhead; something like this is just currently more likely to come out on the wrong side of the tradeoff.

In all cases your option (3) should be faster, but that's not the only bottleneck in your code. You're also drawing a lot of particles so you have quite substantial fillrate overhead too, and with a sufficiently high number of particles that's just going to swamp the time taken to do buffer updates (especially when you come to do more complex per-fragment ops on them). That's what I'd identify as a major cause of your times coming out more equal than they otherwise would be.

Interesting. Do you suppose I could test that by making sure that the particles are out of view, and thus clipped?

About test setup nr 3:

I would use a sync object instead of orphaning - no need to trash the memory, just reuse it (with range invalidation [a GPU->CPU copy is VERY costly and unnecessary here] and unsynchronized mapping, of course). Also, a buffer size overhead of ~3x-4x should be more than enough (my own streaming world-geometry buffer has a lag of only 2 frames). Test it please - I'm expecting an improvement.
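For reference, the sync-object approach described above could be sketched roughly as follows. This is only the bookkeeping, under the assumption that the buffer is split into equal per-frame regions. All names (`FencedRing`, `FenceOps`, `FenceHandle`) are hypothetical, and the fence is modelled as a plain integer handle so the logic can be run without a GL context. In a real renderer the two callbacks would wrap `glFenceSync` and `glClientWaitSync`/`glDeleteSync`, and the returned offset would go to `glMapBufferRange` with `GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical names throughout; not from the thread or from any library.
// A fence is modelled as an opaque non-zero handle so the bookkeeping can be
// exercised without a GL context. In real code the handle would be a GLsync.
using FenceHandle = std::uint64_t;

struct FenceOps {
    // Would wrap glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0).
    std::function<FenceHandle()> insert;
    // Would wrap glClientWaitSync(...) followed by glDeleteSync(...).
    std::function<void(FenceHandle)> wait;
};

// Splits one streaming buffer into equal regions and guards each region with
// a fence, so the CPU never overwrites data the GPU may still be reading.
class FencedRing {
public:
    FencedRing(std::size_t regionBytes, std::size_t regionCount, FenceOps ops)
        : regionBytes_(regionBytes),
          fences_(regionCount, FenceHandle{0}),
          ops_(std::move(ops)) {
        assert(regionBytes > 0 && regionCount > 0);
    }

    // Call before writing this frame's data. Waits only if the region was
    // used before and its fence is still pending. Returns the byte offset of
    // the region to map.
    std::size_t beginFrame() {
        if (fences_[current_] != 0) {
            ops_.wait(fences_[current_]);
            fences_[current_] = 0;
        }
        return current_ * regionBytes_;
    }

    // Call after issuing the draw calls that read the region.
    void endFrame() {
        fences_[current_] = ops_.insert();
        current_ = (current_ + 1) % fences_.size();
    }

private:
    std::size_t regionBytes_;
    std::vector<FenceHandle> fences_;
    FenceOps ops_;
    std::size_t current_ = 0;
};
```

With three regions, the first three frames never wait. The fourth frame waits on the fence inserted after the first, which by then has normally been signalled already, so the wait is expected to return immediately in the common case.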

* CPU bound: per-vertex calculations on the CPU side can be quite costly - IF you cannot do that on the GPU side then uninterleaving the vertex positions might speed it up considerably by using vector instructions (SSE2 or whatever else is available).
* GPU bound: hell no, the GPU is bored to death ... it is hardly doing anything with those vertices in comparison.
* Transfer bound: I somewhat suspect your test implementation is mostly CPU bound, but transfer probably still takes quite a lot of time. If your streaming is done right (show relevant code?) then you cannot get any better with that and need to consider alternatives (transform feedback / OpenCL as mentioned).
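To illustrate the uninterleaving suggestion in the CPU-bound point, here is a minimal sketch. It assumes a particle with only a position and a velocity and a plain Euler step. The struct and function names are made up for the example and are not from the original test code. With each component in its own contiguous array, the update loops become simple streams that a compiler can often auto-vectorize. Whether it actually does depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved layout (array of structures): fields alternate in memory.
// Hypothetical particle type, assumed for illustration only.
struct ParticleAoS {
    float px, py, pz;
    float vx, vy, vz;
};

// Uninterleaved layout (structure of arrays): one contiguous array per field.
struct ParticlesSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
};

// Copies interleaved particles into the uninterleaved layout.
ParticlesSoA uninterleave(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    const std::size_t n = in.size();
    out.px.reserve(n); out.py.reserve(n); out.pz.reserve(n);
    out.vx.reserve(n); out.vy.reserve(n); out.vz.reserve(n);
    for (const ParticleAoS& p : in) {
        out.px.push_back(p.px); out.py.push_back(p.py); out.pz.push_back(p.pz);
        out.vx.push_back(p.vx); out.vy.push_back(p.vy); out.vz.push_back(p.vz);
    }
    return out;
}

// Euler step: position += velocity * dt, one tight loop per component.
// Each loop touches two contiguous float arrays, which suits SIMD well.
void integrate(ParticlesSoA& p, float dt) {
    const std::size_t n = p.px.size();
    for (std::size_t i = 0; i < n; ++i) p.px[i] += p.vx[i] * dt;
    for (std::size_t i = 0; i < n; ++i) p.py[i] += p.vy[i] * dt;
    for (std::size_t i = 0; i < n; ++i) p.pz[i] += p.vz[i] * dt;
}
```

The trade-off is that the position arrays then have to be uploaded as separate attribute streams (or re-interleaved before upload), so this is only worth trying if profiling shows the per-vertex math is the bottleneck.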

Orphaning is actually an incredibly cheap operation - D3D has been doing it since version 7 back in the 1990s or thereabouts, and drivers/hardware are built around this usage scheme. With orphaning what happens is the first one or two times you may get some memory allocation, but after that the driver is able to detect that the programmer is doing this and will keep previously used copies of the buffer memory around, flipping between them as required - classic double-buffering but managed for you by the driver. So memory trashing doesn't occur, the driver will just hand back a chunk of memory that had previously been used and all is well.

If you're still concerned about that, there is an alternative method. If you size your buffer so that it's always at least large enough for 3 or more frames of data, you don't even need to orphan at all - by the time you reset your position back to 0 the GPU will already have finished with the vertexes at the start of the buffer, so you can just map the range as normal. This obviously needs you to know a bit more about how much data you're pushing, what your maximums are, etc, but if your use case meets those requirements it can be used and can work very well. No need for sync objects and everything runs smoothly.
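The oversized-buffer scheme above boils down to a write cursor that wraps back to 0. A minimal sketch of that cursor follows. It assumes the maximum bytes per frame is known up front, and the names (`capacityForFrames`, `StreamCursor`) are hypothetical. It covers only the offset arithmetic. The returned offset would be the one passed to `glMapBufferRange`, and nothing here checks that the GPU has really finished with the wrapped-over data.

```cpp
#include <cassert>
#include <cstddef>

// Buffer size for the scheme: the worst-case frame times the number of
// frames assumed to be in flight (3 per the post above; hypothetical helper).
inline std::size_t capacityForFrames(std::size_t maxBytesPerFrame,
                                     std::size_t framesInFlight = 3) {
    return maxBytesPerFrame * framesInFlight;
}

// Hands out consecutive byte ranges and wraps to offset 0 when the next
// request would run past the end of the buffer.
class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacityBytes) : capacity_(capacityBytes) {
        assert(capacityBytes > 0);
    }

    // Reserves `bytes` and returns the offset at which to map and write.
    std::size_t reserve(std::size_t bytes) {
        assert(bytes <= capacity_);
        if (head_ + bytes > capacity_) {
            head_ = 0;  // wrap: relies on the GPU being done with the start
        }
        const std::size_t offset = head_;
        head_ += bytes;
        return offset;
    }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
};
```

This cursor relies entirely on the sizing assumption holding. If a frame ever pushes more than the assumed maximum, or the driver queues more frames than expected, the wrap can land on data still in use.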

A bit confusing reply :/

[quote name='mhagain' timestamp='1346843460' post='4976770']
Orphaning is actually an incredibly cheap operation ...[/quote]
Yes, and not doing something at all is even cheaper than doing it needlessly (*).

Which is why I recommended ...

[quote name='mhagain' timestamp='1346843460' post='4976770']If you size your buffer so that it's always at least large enough for 3 or more frames of data, you don't even need to orphan at all - by the time you reset your position back to 0 the GPU will already have finished with the vertexes at the start of the buffer, so you can just map the range as normal.[/quote]
... wait ... you propose the exact same option. Except ...

[quote name='mhagain' timestamp='1346843460' post='4976770']
No need for sync objects and everything runs smoothly.[/quote]
... I would still use it purely for sanity reasons (GL does not specify how many frames are allowed in the command buffer; i.e., SwapBuffers does not imply a glFinish nowadays [thankfully], as modern drivers add it to the command buffer like all the other commands - up to a driver-specific limit).

(*) Hm, on second thought, for such a small buffer it indeed should not make much difference. I use orphaning exclusively for use-once streams as they all happen to be few and small in my case. Such exceptions excepted - I prefer to use one big buffer for all the use-once-or-perhaps-more streaming (instead of thousands of separate buffers for them [4K-40K real case, ~2-8K per second flying in/out]) and do the syncing/management on my own (no cons and plenty of pros).
