
OpenGL: Getting depth values


I'm attempting to read back depth values for every screen coordinate of a 640*480 window, once per frame. I know this has received attention here before, but the application is slightly unusual, and I would appreciate any advice on the best approach.

I need to extract _only_ the matrix of depth data that results from a particular viewpoint on the scene. Lighting, textures, and even whether anything is displayed on screen at all are _not_ requirements of the application. Strangely, I suppose, I'm using OpenGL despite not needing any kind of visualisation of the results it gives. Currently I do render each frame to the screen, because that lets me read depth components across the entire window with glReadPixels(). I have heard this described as bad practice, and I'm aware there are also performance issues related to the type of buffer you read into, its alignment, the system hardware, and so on. Despite quite a bit of experimenting with glReadPixels(), I have not been able to reach an acceptable level of performance. I'm aware that pixel buffer objects might give a performance improvement, but I'm not sure whether either of these approaches will offer the best solution.

Reading old posts on the forum has made me aware of feedback mode. This seems a potentially better fit, as I don't need the results displayed on screen. I have no previous experience of the technique, but I was considering using glFeedbackBuffer() with GL_3D as the feedback buffer type to try to recover depth data for the entire window. Is this a valid use of feedback mode, and is it likely to be faster than the "render to screen, then read depths with glReadPixels()" approach described above?

Thanks in advance for any help/comments.

[For anyone interested in where the application requirements come from: it is an implementation of a particle filter (http://en.wikipedia.org/wiki/Particle_filter). I use OpenGL to draw an articulated 3D object consisting of about 20 component parts, each with between 1 and 3 degrees of freedom. This object is viewed from a fixed point and compared against video image evidence (by this I mean a frame of real-world video from a video camera). I must probe around 1000 object configurations for depth data to compare against every individual frame of video. With video running at 30-60Hz, that is tens of thousands of configurations per second to be probed for depth data. The application need not run in real time, but it must be manageable. At the moment my glReadPixels() approach gives ~12fps, which equates to over an hour to process 1 second of video. As there is no need to visualise any of the output, only to grab the xyz data, I'm hopeful that a performance gain is possible, but perhaps not.]
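
For concreteness, here is a minimal sketch of the glReadPixels() route described above: render the scene, read GL_DEPTH_COMPONENT for the whole window, and convert each window-space depth value to an eye-space distance. It assumes a standard perspective projection and the default glDepthRange(0, 1); the function and constant names are illustrative, not taken from the poster's code.

/* Grab the 640x480 depth buffer once per frame and convert window-space
 * depth values in [0,1] to positive eye-space distances.  Assumes a
 * standard perspective projection and the default glDepthRange(0, 1). */
#include <GL/gl.h>

#define WIDTH  640
#define HEIGHT 480

static float depth_buf[WIDTH * HEIGHT];   /* allocated once, reused every frame */

/* Invert the perspective depth mapping: d = 0 gives zNear, d = 1 gives zFar. */
static float depth_to_eye(float d, float zNear, float zFar)
{
    return (zFar * zNear) / (zFar - d * (zFar - zNear));
}

/* Call after drawing one object configuration; fills eye_depth (WIDTH*HEIGHT). */
void read_depth_frame(float zNear, float zFar, float *eye_depth)
{
    glReadPixels(0, 0, WIDTH, HEIGHT, GL_DEPTH_COMPONENT, GL_FLOAT, depth_buf);

    for (int i = 0; i < WIDTH * HEIGHT; ++i)
        eye_depth[i] = depth_to_eye(depth_buf[i], zNear, zFar);
}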

I don't think feedback is what you want (I can see it being a lot slower).
I just ran a very old benchmark of mine on my NVIDIA GeForce 7600 GS:
I'm getting ~150 million pixels/sec with glReadPixels(..., GL_DEPTH_COMPONENT, ...),
i.e. 640x480 at over 400fps.

Have you looked into PBOs? (There's info plus a demo on the NVIDIA developer site.)
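
For anyone wanting to reproduce that kind of figure, a rough timing harness along the following lines isolates readback throughput. This is only a sketch, not zeds' actual benchmark, and now_seconds() stands in for whatever high-resolution timer is available (e.g. gettimeofday on Linux).

/* Time a number of depth readbacks and report Mpixels/sec. */
#include <GL/gl.h>
#include <stdio.h>
#include <stdlib.h>

extern double now_seconds(void);    /* platform-specific timer, not shown here */

void benchmark_depth_readback(int w, int h, int frames)
{
    float *buf = malloc((size_t)w * h * sizeof(float));

    glFinish();                      /* make sure pending rendering is finished */
    double t0 = now_seconds();
    for (int i = 0; i < frames; ++i)
        glReadPixels(0, 0, w, h, GL_DEPTH_COMPONENT, GL_FLOAT, buf);
    double dt = now_seconds() - t0;

    printf("%.1f Mpixels/sec\n", (double)w * h * frames / (dt * 1e6));
    free(buf);
}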

zeds,

That's an interesting result. When I remove my glReadPixels() call I get ~70fps; when I add it back in, the framerate drops to ~12fps.

- If you take out your call to glReadPixels(), what kind of performance increase do you get on your benchmark framerate, i.e. is it anything like my jump of about 5x above?

- Could you tell me how you're calling glReadPixels()? How many depth values does your benchmark read per call? My code is below; I'm trying to take all ~300,000 depth values in the window at once.

Here's how I make my call:
float *fmem = malloc(640*480*sizeof(float));
glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, fmem);

I think the PBO idea is a good one, but I want to make sure of a few things before moving on from glReadPixels(). My fps results above are based on a 1000-frame test, where the 1000 glReadPixels() calls add 70 seconds in total compared with a run where they aren't called. That works out to under 5 million pixels per second coming back to the application.

- Could I be suffering from the lack of a decent graphics card here? Or perhaps I'm calling glReadPixels() incorrectly?
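
One cheap sanity check, offered purely as a sketch (it may well not be the cause here): make sure the call isn't silently raising a GL error, and rule out pack alignment as a factor.

/* Sanity checks around the readback: tighten pack alignment (harmless for
 * GL_FLOAT data, but rules it out) and verify that no GL error is raised. */
#include <GL/gl.h>
#include <stdio.h>

void checked_depth_read(float *fmem)
{
    glPixelStorei(GL_PACK_ALIGNMENT, 1);
    glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, fmem);

    GLenum err = glGetError();
    if (err != GL_NO_ERROR)
        fprintf(stderr, "glReadPixels failed: 0x%x\n", err);
}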

Still don't know what graphics card I have in this machine, but using GPUBench I get the following results for glReadPixels() (see http://graphics.stanford.edu/projects/gpubench/test_readback.html for details). They don't read GL_DEPTH_COMPONENT, but I was still interested to see them (the window size for the test is 512*512 by default):

Fixed Hostmem GL_RGBA Mpix/sec: 46.54 MB/sec: 177.53
Fixed Hostmem GL_ABGR_EXT Mpix/sec: 1.48 MB/sec: 5.66
Fixed Hostmem GL_BGRA Mpix/sec: 46.23 MB/sec: 176.36
Float Hostmem GL_RGBA Mpix/sec: 12.55 MB/sec: 191.48
Float Hostmem GL_ABGR_EXT Mpix/sec: 0.47 MB/sec: 7.11
Float Hostmem GL_BGRA Mpix/sec: 12.47 MB/sec: 190.22

I've looked at the GPUBench source code and made some very slight changes to my glReadPixels() calls to bring my code in line with theirs. My performance is essentially unchanged, however.

I think I will give feedback mode a try before moving on. I'll post again if I conclude anything other than what zeds predicted above.

Regarding PBOs, my concern is that all they give me is the potential for a non-blocking call to read the depth data. As I don't have much work to give the application in the meantime (before I actually use the depth data), I don't think I have much chance of a performance increase. The quote below, from Dominik Göddeke's tutorial, might be interesting to anyone else considering this approach:

"Conventional transfers require a pipeline stall on the GPU to ensure that the data being read back is synchronous with the state of computations. PBO-accelerated transfers are NOT able to change this behaviour, they are only asynchronous on the CPU side. This behaviour cannot be changed at all due to the way the GPU pipeline works. This means in particular that PBO transfers from the GPU will not deliver any speedup with the application covered in this tutorial, they might even be slower than conventional ones. They are however asynchronous on the CPU: If an application can schedule enough work between initiating the transfer and actually using the data, true asynchronous transfers are possible and performance might be improved in case the data format allows this. ... To benefit from PBO acceleration, a lot of independent work needs to be scheduled between initiating the transfer and requesting the data".

The full tutorial is available at http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial3.html
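
For reference, the pattern the quote alludes to is a double-buffered ("ping-pong") readback: start this frame's transfer into one PBO, then map the PBO that was filled during the previous frame. The sketch below assumes GL_ARB_pixel_buffer_object (or OpenGL 2.1) is available and uses illustrative names; it is not a drop-in implementation.

/* Double-buffered PBO readback: kick off an asynchronous transfer into one
 * PBO while mapping the PBO filled on the previous frame. */
#include <GL/glew.h>                 /* or any other extension loader */

#define WIDTH  640
#define HEIGHT 480
#define DEPTH_BYTES (WIDTH * HEIGHT * sizeof(float))

static GLuint pbo[2];
static int frame = 0;

void init_depth_pbos(void)
{
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, DEPTH_BYTES, NULL, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

/* Call once per frame after rendering; processes the previous frame's depth. */
void read_depth_async(void (*process)(const float *depth))
{
    int cur  = frame & 1;
    int prev = cur ^ 1;

    /* Start this frame's readback into a PBO.  With a pack buffer bound the
     * last argument is a byte offset, so pass 0; the call returns without
     * waiting for the data on the CPU side. */
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[cur]);
    glReadPixels(0, 0, WIDTH, HEIGHT, GL_DEPTH_COMPONENT, GL_FLOAT, 0);

    /* Map the PBO filled last frame; its transfer should have completed by
     * now, so mapping ideally does not stall.  Skip the very first frame. */
    if (frame > 0) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[prev]);
        const float *depth = (const float *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (depth) {
            process(depth);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    frame++;
}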

Quote:

Here's how I make my call:
float *fmem = malloc(640*480*sizeof(float));
glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, fmem);


I hope you're not doing that each frame, i.e. allocating the memory every time.

My results are from an old benchmarking app I wrote many years ago (from memory, even my GeForce2 MX at the time managed >10 million pixels/sec).

1000 readbacks of 640x480 GL_DEPTH_COMPONENT as GL_FLOAT should come nowhere near taking 70 seconds.

Here's the output from my testing (as you can see, depth readback rates should be pretty close to the colour rates). So if you're getting

Fixed Hostmem GL_RGBA Mpix/sec: 46.54 MB/sec: 177.53

you should be seeing something similar for depth, which you're not. Try removing everything except the glReadPixels() call and see if that really is the bottleneck.


glReadPixels: DEPTH_COMPONENT -- UNSIGNED_BYTE 170.111 Mpixels/sec
glReadPixels: DEPTH_COMPONENT -- UNSIGNED_SHORT 170.111 Mpixels/sec
glReadPixels: DEPTH_COMPONENT -- FLOAT 145.572 Mpixels/sec
glReadPixels: DEPTH_COMPONENT -- UNSIGNED_INT 140.837 Mpixels/sec
glReadPixels: DEPTH_STENCIL_NV -- UNSIGNED_INT_24_8_NV 150.722 Mpixels/sec
---
glReadPixels: LUMINANCE -- UNSIGNED_BYTE 144.398 Mpixels/sec
glReadPixels: LUMINANCE -- UNSIGNED_SHORT 23.865 Mpixels/sec
glReadPixels: LUMINANCE -- UNSIGNED_INT 16.529 Mpixels/sec
glReadPixels: LUMINANCE -- FLOAT 25.871 Mpixels/sec
glReadPixels: ALPHA -- UNSIGNED_BYTE 186.673 Mpixels/sec
glReadPixels: ALPHA -- UNSIGNED_SHORT 184.746 Mpixels/sec
glReadPixels: ALPHA -- UNSIGNED_INT 184.746 Mpixels/sec
glReadPixels: ALPHA -- FLOAT 175.333 Mpixels/sec
glReadPixels: RED -- UNSIGNED_BYTE 171.744 Mpixels/sec
glReadPixels: RED -- UNSIGNED_SHORT 144.398 Mpixels/sec
glReadPixels: RED -- UNSIGNED_INT 119.305 Mpixels/sec
glReadPixels: RED -- FLOAT 150.722 Mpixels/sec
glReadPixels: RGB -- UNSIGNED_BYTE 141.954 Mpixels/sec
glReadPixels: BGR -- UNSIGNED_BYTE 163.580 Mpixels/sec
glReadPixels: RGBA -- UNSIGNED_BYTE 149.380 Mpixels/sec
glReadPixels: BGRA -- UNSIGNED_BYTE 165.191 Mpixels/sec
glReadPixels: RGB -- FLOAT 45.222 Mpixels/sec
glReadPixels: BGR -- FLOAT 46.668 Mpixels/sec
glReadPixels: RGB -- UNSIGNED_SHORT_5_6_5 154.718 Mpixels/sec
glReadPixels: RGB -- UNSIGNED_SHORT_5_6_5_REV 148.061 Mpixels/sec
glReadPixels: RGBA -- FLOAT 38.000 Mpixels/sec
glReadPixels: BGRA -- FLOAT 37.680 Mpixels/sec
glReadPixels: RGBA -- UNSIGNED_INT_8_8_8_8 166.834 Mpixels/sec
glReadPixels: BGRA -- UNSIGNED_INT_8_8_8_8 142.029 Mpixels/sec
glReadPixels: RGBA -- UNSIGNED_INT_8_8_8_8_REV 149.380 Mpixels/sec
glReadPixels: BGRA -- UNSIGNED_INT_8_8_8_8_REV 166.730 Mpixels/sec

I'm not doing the malloc each frame; sorry, that was misleading.
The good benchmark results are what leave me so confused. I know you're right that it should be much faster. If I remove just the single glReadPixels() line, the 1000-frame run does indeed complete about 70 seconds faster (about 15 seconds in total). Something is wrong here.

I found out yesterday that the card in this machine is an ATI EAX300SE 128Mb PCIe.

The only explanation I can come up with at the moment is an ATI driver problem under Linux. (Now that I've said that, it's bound to turn out to be a stupid coding mistake on my part.)

1) My benchmarks were indeed good, but they were run under Windows.
2) I do all my OpenGL work in Debian Linux.
3) I have seen people mention ATI Linux driver problems on other forums, specifically mentioning glReadPixels(), e.g. http://www.gpgpu.org/forums/viewtopic.php?t=3353&view=previous&sid=3f7fb23c04d396ca28cd5493ff624753

I don't know what the best next step is. I have an NVIDIA GeForce 6600 GT PCIe sitting on my desk, but swapping the cards could be a problem as I don't own this machine. I've yet to check whether any more recent ATI drivers are available.

Found another PC running Debian Linux with a very similar spec, _but_ with an NVIDIA graphics card. I ran exactly the same code on both my PC (ATI card) and the alternative machine (NVIDIA card); the results are below.

1000-frame test, duration:

ATI:
Window size 640*512: 3min 32sec (readback on), 8sec (readback off)
Window size 214*512: 1min 23sec (readback on)

NVidia:
Window size 640*512: 19sec (readback on), 4sec (readback off)
Window size 214*512: 10sec (readback on)

[The readback-off cases aren't entirely fair, as I also dropped a big per-frame array loop that I shouldn't have. To give an idea, the ATI would be 12sec with readback off and the array loop left in, so you could scale up the 4sec NVIDIA result a little.]

But regardless of that, and despite not knowing what model the NVIDIA card is (it appears faster than the ATI in general rendering), I'm sure there is some problem with the ATI card's readback under Linux. See the jump up to 3min 32sec: an overhead of ~200 seconds. [I was wrong to quote an overhead of 70sec on readback for 1000*glReadPixels(0,0,640,480,...) in earlier posts; that figure was for 1000*glReadPixels(0,0,214,512,...).]

Perhaps this will be helpful to anyone struggling with slow glReadPixels() under Linux in the future.

Is it possible to upload the video evidence to the GPU and do the comparison there instead? That would possibly yield an increase in speed.
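
In practice that would mean rendering the model's depth to a texture, uploading a per-frame reference map derived from the video, and letting a fragment shader compute the per-pixel difference, so that only a reduced result (e.g. a sum built by repeated downsampling) ever needs to come back over the bus. Below is a very rough sketch of such a comparison shader, written as it might be embedded in the C source; the uniform names and the squared-error metric are purely illustrative and not from the original application.

/* Fragment shader sketch: compare the model's rendered depth against a
 * reference map uploaded from the video, outputting squared error per pixel. */
static const char *compare_fragment_shader =
    "uniform sampler2D modelDepth;   /* depth rendered from the 3D model  */\n"
    "uniform sampler2D videoDepth;   /* reference map derived from video  */\n"
    "void main()\n"
    "{\n"
    "    float a = texture2D(modelDepth, gl_TexCoord[0].xy).r;\n"
    "    float b = texture2D(videoDepth, gl_TexCoord[0].xy).r;\n"
    "    float d = a - b;\n"
    "    gl_FragColor = vec4(d * d);\n"
    "}\n";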

Jerax,
Yes, I think that's a nice idea. Looking at gpgpu.org, the techniques I'd need for general-purpose GPU computation look relatively tough (to me, at least), but I think you're right that it's the way to go for further performance increases. I'll keep testing the approach using the readback technique for now, and if it's successful I'll look at this option again.

Re. glReadPixels(): I've replaced my machine's ATI EAX300SE 128MB PCIe with the NVIDIA GeForce 6600 GT 128MB PCIe. The final result for my benchmark under Linux is now:

1000-frame test, duration:

NVIDIA 6600 GT 128MB PCIe:
Window size 640*512: 16sec (readback on)

This is manageable for my application.
