OpenGL GL_MAP_PERSISTENT_BIT performance problem

I create a buffer object with GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT and then I map it with GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_FLUSH_EXPLICIT_BIT.
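In code, that setup is roughly this (error checking omitted; buf and size are placeholders):

    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);

    /* Persistent mapping requires immutable storage (GL 4.4 / ARB_buffer_storage). */
    glBufferStorage(GL_ARRAY_BUFFER, size, NULL,
                    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT);

    /* Map once; the pointer stays valid for as long as the buffer stays mapped. */
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                                 GL_MAP_INVALIDATE_BUFFER_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT |
                                 GL_MAP_FLUSH_EXPLICIT_BIT);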

 

According to usage examples I've seen, I write to the pointer and do a flush. No unmapping. I was initially doing this for uniform buffer objects holding mainly transforms, so I'd be writing data multiple times per frame. I didn't notice any problems.

 

However, when I tried to use this with pixel buffer objects, I hit a big problem. In my setup, another application draws into shared memory, and I load that memory as a texture in my application. To make the upload asynchronous, I use the standard technique of ping-ponging two PBOs: initially (without the persistent and flush bits) I would map one PBO and write data, then unmap it, while the other one had an asynchronous texture subimage operation in flight. The next frame they switch.

When I changed the map-write-unmap sequence to a single persistent map followed by write-flush each frame, the framerate of everything dropped very significantly. How can doing this once per frame have so much impact, when it seems to work fine with UBOs in my first use case? Is there an issue with persistent mapping and simply the amount of memory being transferred (an HD-resolution texture during most frames)? I assume I'm probably doing something wrong, perhaps in how I call the flush operation (right now it's immediately after the write), but I really don't know. The feature is fairly new to OpenGL, so perhaps drivers aren't well optimized for it yet, but that doesn't seem likely (I'm on a GTX 680). Any suggestions? I was hoping to actually get an improvement by saving on the map/unmap calls...
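For reference, the upload path now looks roughly like this (names, format, and type are placeholders; error checking omitted):

    int write_idx  = frame & 1;       /* PBO written by the CPU this frame   */
    int upload_idx = write_idx ^ 1;   /* PBO consumed by the texture upload  */

    /* CPU side: copy the shared-memory frame into the persistently mapped PBO. */
    memcpy(pbo_ptr[write_idx], shared_mem, frame_bytes);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[write_idx]);
    glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, frame_bytes);

    /* GPU side: kick off the asynchronous upload from last frame's PBO.
       The last argument is an offset into the bound PBO, not a pointer. */
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[upload_idx]);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);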

I understand the part about GL_MAP_UNSYNCHRONIZED_BIT causing the client and server threads to synchronize, but I don't understand exactly how GL_MAP_COHERENT_BIT works. Looking at http://www.opengl.org/wiki/GLAPI/glMapBufferRange, there's also a comment that seems to contradict the claim in the presentation: "Obviously, there's a reason why you don't get the coherent behavior by default. That reason being performance. You should try to live with the explicit synchronization mechanisms if it is at all possible." So which is it? And, in the context of this coherent flag, what would the effect of (1) GL_MAP_INVALIDATE_BUFFER_BIT and (2) GL_MAP_FLUSH_EXPLICIT_BIT be? My best guess is that, without the unsynchronized bit, I'd use either the coherent bit or explicit flush, but not both (but then what does the invalidate bit do in the former case)?

No contradiction. GL_MAP_COHERENT_BIT costs performance; it's just cheaper than the alternatives in the cases the presentation covers. Notice that in the presentation they used 3x the buffer size to help. Also notice the "Not a good fit for:" section of the talk, which lists where coherent performs worse than the alternatives.

 

1. GL_MAP_INVALIDATE_BUFFER_BIT makes the buffer's previous contents undefined; only data written after the invalidate is defined. Performance would depend on the implementation, but I'd suspect no benefit if you're using coherent, since you would have already taken the hit before you can invalidate the data.

2. GL_MAP_FLUSH_EXPLICIT_BIT causes any modifications that are not explicitly flushed to be undefined. It sounds like this would fully override coherent, although I might be missing something about coherent since I haven't ever used it.
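I haven't tested this, but as a sketch the difference between the two visibility models would be (ptr, target, offset, src, len are placeholders):

    /* Explicit-flush path: write, then tell GL exactly which bytes changed. */
    memcpy((char *)ptr + offset, src, len);
    glFlushMappedBufferRange(target, offset, len);

    /* Coherent path: writes become visible to the GPU without any flush call,
       at whatever ongoing cost the implementation pays to guarantee that. */
    memcpy((char *)ptr + offset, src, len);
    /* nothing further needed before the draw that reads the data */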

So what do you think is the best approach for (1) frequently modified uniforms, such as transforms, and (2) a large amount of data modified every frame, such as large textures?

 

[Edit:] I have additional confusion because OpenGL Insights (http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-AsynchronousBufferTransfers.pdf) recommends using GL_MAP_UNSYNCHRONIZED_BIT with multiple buffers or multiple ranges, instead of orphaning or round-robin...

Does OpenGL Insights even account for the existence of GL_MAP_COHERENT_BIT? I'd suspect uniforms that are modified every frame are well-suited for coherent, since you aren't making a separate call to sync or orphan. A lot of the benefits they describe for AMD's pinned memory apply to GL_MAP_COHERENT_BIT too, except that you get the synchronization without a separate call.
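For the uniforms case, I'd expect something along these lines (untested sketch, triple-buffered like the talk suggests; ubo, block_size, binding, frame, and transforms are placeholders):

    GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferStorage(GL_UNIFORM_BUFFER, 3 * block_size, NULL, flags);
    char *ptr = (char *)glMapBufferRange(GL_UNIFORM_BUFFER, 0, 3 * block_size, flags);

    /* Each frame: write into one third of the buffer, bind that range, fence it.
       block_size must be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. */
    int section = frame % 3;
    memcpy(ptr + section * block_size, &transforms, sizeof(transforms));
    glBindBufferRange(GL_UNIFORM_BUFFER, binding, ubo,
                      section * block_size, block_size);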

 

I wouldn't want to guess what gives the best performance for a large amount of data changed every frame, since processing a huge chunk of data is obviously much less likely to bottleneck on the OpenGL call itself.

 

If you're at the optimization stage, it's definitely worth investigating coherent's performance. Just be sure to wrap ClientWaitSync like the presentation suggested.

Just to be clear, should the glFenceSync() be placed right after the last GPU command that uses the written data? And when I'm doing this explicit fencing, should I then not also be orphaning with ...INVALIDATE..., because the orphaning might reallocate another chunk of memory for the buffer?

 

Also, if manually flushing, should glFlushMappedBufferRange() be called right after modifying the data, or just before using the data? It's not clear to me from the description, which says that it "indicates" the data has been modified. Does this in essence start the DMA transfer, or wait for its completion on the GPU side?

glFenceSync injects a sync object into the command buffer. That syncObj is basically set to a value of false (if it were a bool), and when every command preceding syncObj in the command buffer finishes, syncObj gets set to true. Somewhere later in your code, you have "glSyncWait(syncObj, 0, GL_TIMEOUT_IGNORED);" and that instruction does nothing other than wait until syncObj is true. Orphaning is not used for this method since you're responsible for making sure memory isn't being used for two different things at once.

 

glFlushMappedBufferRange() should be called when you want OpenGL to know there are changes it must pick up before executing instructions using the data. You want to use it as early as possible so you don't stall waiting to get the data, but not so early that you'd need to flush constantly. I guess use it after making a block of updates, but not after each individual update. In essence, it queues the DMA transfer and returns. It does not wait for the data transfer to complete.
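In other words, batch the writes and flush the whole range once. A rough sketch (mapped, count, and compute_value are placeholders; the target is whatever you mapped, GL_UNIFORM_BUFFER here):

    /* Many small writes into the mapped pointer... */
    for (int i = 0; i < count; ++i)
        mapped[i] = compute_value(i);

    /* ...then one flush covering the whole modified range. */
    glFlushMappedBufferRange(GL_UNIFORM_BUFFER, 0, count * sizeof(*mapped));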

Do you mean glClientWaitSync() rather than glSyncWait() (actually, glWaitSync())? I do, after all, need to block the client thread before it does the memcpy.

Yea, sorry about that. glSyncWait() isn't even a function. You definitely want glClientWaitSync() when you need your client to wait. I tried looking up the functions online quickly to give you more accurate code, but I think I just made it less accurate. Wrap this stuff with anything that will help you debug, use a timeout instead of blocking forever, etc. Often you'll set a tight timeout during development so that, for performance, you know every time you're waiting and can adjust the buffer size or whatever, then loop with a more generous timeout while waiting to sync.
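Something like this is the shape of it (sketch; the fence/region bookkeeping is up to you, and the timeout values are arbitrary):

    /* After the last GPU command that reads the region: */
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    /* Next frame, before the CPU writes that region again: */
    GLenum r = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                1000000);            /* 1 ms: tight dev timeout */
    while (r == GL_TIMEOUT_EXPIRED) {
        /* We're stalling on the GPU -- log it, grow the buffer, etc. */
        r = glClientWaitSync(fence, 0, 100000000);   /* 100 ms retries */
    }
    if (r == GL_WAIT_FAILED) { /* handle the error */ }
    glDeleteSync(fence);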

 

If you read stuff from other people, be warned that I'm not the only one extremely sloppy with my notation on syncs. You might see glFence when someone means sync since it's how you generate the sync object. You might see "sync" meaning anything related to the entire process. You might even see a sequence of code with sync in the wrong place (before/after where it should be). It's because they're just reminding you that you need to do the sync. The whole thing feels the same as locks where people just throw the word "lock" around to say "and whatever I'm talking about will need locks done right, but I'm not doing that for you."

 

Best of luck, and don't be afraid to ask questions if you run into trouble. I won't pretend it's the easiest and most straightforward way to render things, and I'm probably not very good at explaining it either =)

Just a note: looks like the fence is needed even if you use explicit flush. That's not what I expected...
