Getting depth values

7 comments, last by jd_24
I'm attempting to read out depth values for every screen coordinate of a 640*480 window at each frame. I know this has received attention here before, but the application is slightly unusual, and I would appreciate any advice on the best approach.

I need to extract _only_ the matrix of orthogonal depth data that results from a particular viewpoint on the scene. Choices regarding lighting, textures, or even whether the data is displayed to the screen at all are _not_ requirements of the application. Strangely, I suppose, I'm using OpenGL despite not requiring any kind of visualisation of its results.

Currently I display each frame to the screen because it allows me to read depth components across the entire window with glReadPixels(). I have heard this described as bad practice(?), and I am aware that there are various performance issues related to the type of buffer you read into, its alignment, the system hardware, etc. Despite quite a bit of playing around with glReadPixels(), I am not able to achieve an acceptable level of performance. I'm aware that pixel buffer objects might give a performance improvement, but I'm not sure whether either of these approaches offers the best solution.

Reading old posts on the forum has made me aware of feedback mode. This seems a potentially better kind of approach, as I don't require that the results be displayed to screen. I have no previous experience with the technique, but I was considering using glFeedbackBuffer() with GL_3D as the feedback buffer type to try to recover depth data for the entire window. Is this a valid use of feedback mode, and is it likely to offer a performance increase over the display-to-screen-then-read-depths-with-glReadPixels() approach described above?

Thanks in advance for any help/comments.

[For anyone interested in where the application requirements come from, it is an implementation of a particle filter (http://en.wikipedia.org/wiki/Particle_filter). I use OpenGL to draw an articulated 3D object consisting of about 20 component parts, each with between 1 and 3 degrees of freedom. This object is viewed from a fixed point and is compared against video image evidence (by this I mean a frame of real-world video from a video camera). I must probe around 1000 object configurations for depth data to compare with every individual frame of video, and with video evidence running at 30-60Hz that means tens of thousands of configurations per second to be probed. Although the application need not run in real time, it must be manageable. At the moment, my glReadPixels() approach gives ~12fps, which equates to over an hour to process 1 second of video evidence. As there is no need to visualise any of the output, only to grab the xyz data, I am hopeful that a performance gain is possible, but perhaps not.]
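In case it makes the setup clearer, the core of the per-frame loop currently looks roughly like this. It's a simplified fragment only; set_model_pose(), draw_articulated_model() and compare_with_video_frame() are placeholders for my own code, not real calls:

float *depth = malloc(640 * 480 * sizeof(float));      /* allocated once, reused */

for (int i = 0; i < num_configurations; ++i) {          /* ~1000 per video frame */
    set_model_pose(configurations[i]);                  /* placeholder */
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    draw_articulated_model();                           /* placeholder */
    glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, depth);
    weights[i] = compare_with_video_frame(depth);       /* particle weighting */
}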
I don't think feedback mode is what you want (I can see it being a lot slower) - it returns transformed per-vertex data, not the rasterised per-pixel depths.
I just ran a very old benchmark of mine on my NVIDIA GF7600GS
and I'm getting ~150 million pixels/sec with glReadPixels(GL_DEPTH),
i.e. 640x480 at over 400fps.

Have you looked into PBOs? (There's info plus a demo on the NVIDIA developer site.)
zeds,

That's an interesting result. When I remove my glReadPixels() call I get ~70fps; when I add it back in, the framerate drops to ~12fps.

- If you take out your call to glReadPixels(), what kind of performance increase do you get on your benchmark framerate? i.e. is it anything like my jump of about 5x, above?

- Could you tell me how you're calling glReadPixels()? How many depth values does your benchmark code read per call? My code is below; I'm trying to read all ~300,000 depth values in the window at once.

Here's how I make my call:
float *fmem = malloc(640*480*sizeof(float));
glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, fmem);

I think the PBO idea is a good one, but I want to make sure of some things before I move on from glReadPixels(). My fps results above are based on a 1000-frame test, in which the 1000 glReadPixels() calls add 70 seconds in total versus a run where they aren't called. That works out to under 5 million pixels per second coming back to the app.

-Could I be suffering from the lack of a decent graphics card here? Or perhaps I'm making my call to glReadPixels() incorrectly?
Still don't know what graphics card I have in this machine, but using GPUBench I get the following results for glReadPixels() (see http://graphics.stanford.edu/projects/gpubench/test_readback.html for details). They don't read GL_DEPTH_COMPONENT, but I was still interested to see them (the window size for the test is 512*512 by default):

Fixed Hostmem GL_RGBA Mpix/sec: 46.54 MB/sec: 177.53
Fixed Hostmem GL_ABGR_EXT Mpix/sec: 1.48 MB/sec: 5.66
Fixed Hostmem GL_BGRA Mpix/sec: 46.23 MB/sec: 176.36
Float Hostmem GL_RGBA Mpix/sec: 12.55 MB/sec: 191.48
Float Hostmem GL_ABGR_EXT Mpix/sec: 0.47 MB/sec: 7.11
Float Hostmem GL_BGRA Mpix/sec: 12.47 MB/sec: 190.22

I've looked at the GPUBench source code, and made some very slight changes to my glReadPixels() calls to bring my code in line with theirs. My performance is pretty much unchanged, however.

I think I will give feedback mode a try before I move on. I'll post if I conclude anything other than what zeds predicted above.

Regarding PBOs, I'm concerned that all they will give me is the potential for a non-blocking call to read the depth info. As I don't have much other work I can give the app to do in the meantime (before I actually try to use the depth data), I don't think I have much chance of a performance increase; the rough pattern I have in mind is sketched after the quote below. The quote, from Dominik Göddeke's tutorial, might be interesting to anyone else considering this approach.

"Conventional transfers require a pipeline stall on the GPU to ensure that the data being read back is synchronous with the state of computations. PBO-accelerated transfers are NOT able to change this behaviour, they are only asynchronous on the CPU side. This behaviour cannot be changed at all due to the way the GPU pipeline works. This means in particular that PBO transfers from the GPU will not deliver any speedup with the application covered in this tutorial, they might even be slower than conventional ones. They are however asynchronous on the CPU: If an application can schedule enough work between initiating the transfer and actually using the data, true asynchronous transfers are possible and performance might be improved in case the data format allows this. ... To benefit from PBO acceleration, a lot of independent work needs to be scheduled between initiating the transfer and requesting the data".

Full tutorial available at http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial3.html
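For reference, the pattern I'd be testing is roughly the one below. This is only a sketch based on my reading of the tutorial, assuming GL 2.1-style pixel pack buffers (older drivers would need the ARB-suffixed names); I haven't actually tried it yet.

/* Set up once: a pixel pack buffer big enough for one depth frame. */
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, 640 * 480 * sizeof(GLfloat), NULL, GL_STREAM_READ);

/* Each frame: with a pack buffer bound, the read returns immediately on the CPU side. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, (void *)0);

/* ...the GPU still has to finish the frame, so useful CPU work has to happen
   here for the asynchrony to buy anything... */

GLfloat *depth = (GLfloat *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (depth) {
    /* use depth[0 .. 640*480-1] */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);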
Quote:
Here's how I make my call:
float *fmem = malloc(640*480*sizeof(float));
glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, fmem);


I hope you're not doing that each frame, i.e. allocating the memory every time.

My results are from an old benchmarking app I wrote many years ago (from memory, even my GF2 MX at the time did >10 million pixels/sec).

1000 readbacks of 640x480 GL_DEPTH_COMPONENT with GL_FLOAT should be nowhere near 70 seconds.

Here's the output from my testing (as you can see, depth readback should be pretty close to colour readback).
So if you get
Fixed Hostmem GL_RGBA Mpix/sec: 46.54 MB/sec: 177.53
you should be seeing something similar for depth (which you're not).
Try removing everything except the readpixels and see if that's truly the bottleneck.


glReadPixels: DEPTH_COMPONENT -- UNSIGNED_BYTE 170.111 Mpixels/sec
glReadPixels: DEPTH_COMPONENT -- UNSIGNED_SHORT 170.111 Mpixels/sec
glReadPixels: DEPTH_COMPONENT -- FLOAT 145.572 Mpixels/sec
glReadPixels: DEPTH_COMPONENT -- UNSIGNED_INT 140.837 Mpixels/sec
glReadPixels: DEPTH_STENCIL_NV -- UNSIGNED_INT_24_8_NV 150.722 Mpixels/sec
---
glReadPixels: LUMINANCE -- UNSIGNED_BYTE 144.398 Mpixels/sec
glReadPixels: LUMINANCE -- UNSIGNED_SHORT 23.865 Mpixels/sec
glReadPixels: LUMINANCE -- UNSIGNED_INT 16.529 Mpixels/sec
glReadPixels: LUMINANCE -- FLOAT 25.871 Mpixels/sec
glReadPixels: ALPHA -- UNSIGNED_BYTE 186.673 Mpixels/sec
glReadPixels: ALPHA -- UNSIGNED_SHORT 184.746 Mpixels/sec
glReadPixels: ALPHA -- UNSIGNED_INT 184.746 Mpixels/sec
glReadPixels: ALPHA -- FLOAT 175.333 Mpixels/sec
glReadPixels: RED -- UNSIGNED_BYTE 171.744 Mpixels/sec
glReadPixels: RED -- UNSIGNED_SHORT 144.398 Mpixels/sec
glReadPixels: RED -- UNSIGNED_INT 119.305 Mpixels/sec
glReadPixels: RED -- FLOAT 150.722 Mpixels/sec
glReadPixels: RGB -- UNSIGNED_BYTE 141.954 Mpixels/sec
glReadPixels: BGR -- UNSIGNED_BYTE 163.580 Mpixels/sec
glReadPixels: RGBA -- UNSIGNED_BYTE 149.380 Mpixels/sec
glReadPixels: BGRA -- UNSIGNED_BYTE 165.191 Mpixels/sec
glReadPixels: RGB -- FLOAT 45.222 Mpixels/sec
glReadPixels: BGR -- FLOAT 46.668 Mpixels/sec
glReadPixels: RGB -- UNSIGNED_SHORT_5_6_5 154.718 Mpixels/sec
glReadPixels: RGB -- UNSIGNED_SHORT_5_6_5_REV 148.061 Mpixels/sec
glReadPixels: RGBA -- FLOAT 38.000 Mpixels/sec
glReadPixels: BGRA -- FLOAT 37.680 Mpixels/sec
glReadPixels: RGBA -- UNSIGNED_INT_8_8_8_8 166.834 Mpixels/sec
glReadPixels: BGRA -- UNSIGNED_INT_8_8_8_8 142.029 Mpixels/sec
glReadPixels: RGBA -- UNSIGNED_INT_8_8_8_8_REV 149.380 Mpixels/sec
glReadPixels: BGRA -- UNSIGNED_INT_8_8_8_8_REV 166.730 Mpixels/sec

I'm not doing the malloc each frame; sorry, that was misleading.
The good benchmark results are what leave me so confused. I know you're right that it should be much faster. If I remove just the one glReadPixels() line _only_, the 1000-frame run does indeed complete 70 seconds faster (about 15sec in total). There's something wrong here.
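For what it's worth, the timing is just wall-clock time around the 1000-frame loop, roughly as below. This is a simplified fragment: render_scene(), fmem and the window size variables stand in for my own code.

#include <sys/time.h>

struct timeval t0, t1;
gettimeofday(&t0, NULL);

for (int frame = 0; frame < 1000; ++frame) {
    render_scene();                                  /* draw the model */
    glReadPixels(0, 0, win_w, win_h,                 /* the single line I toggle on/off */
                 GL_DEPTH_COMPONENT, GL_FLOAT, fmem);
    /* swap buffers etc. */
}

gettimeofday(&t1, NULL);
printf("1000 frames in %.2f sec\n",
       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1.0e6);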

I found out yesterday that the card in this machine is an ATI EAX300SE 128MB PCIe.

The only explanation I can come up with at the moment is an ATI driver problem under Linux. (Now that I've said that, it's bound to turn out to be a stupid coding mistake on my part.)

1) My benchmarks were indeed good, but they were run under windows.
2) I do all my OpenGL work in Debian Linux.
3) I have seen people mention ATI Linux driver problems on other forums, specifically mentioning glReadPixels() e.g. http://www.gpgpu.org/forums/viewtopic.php?t=3353&view=previous&sid=3f7fb23c04d396ca28cd5493ff624753

Don't know what the best next step is. I have an NVIDIA GeForce 6 series 6600GT PCIe sitting on my desk, but switching them over could be a problem as I don't own this machine. I've yet to look at whether any more recent ATI drivers are available.
Found another PC running Debian Linux, very similar spec _but_ with an NVIDIA graphics card. I ran exactly the same code on both my PC (ATI card) and the alternative machine (NVIDIA card); results are below.

1000 frame test, duration:

ATI:
Window size 640*512: 3min 32sec (readback on), 8sec (readback off)
Window size 214*512: 1min 23sec (readback on)

NVidia:
Window size 640*512: 19sec (readback on), 4sec (readback off)
Window size 214*512: 10sec (readback on)

[The readback-off cases aren't entirely fair, as I also dropped a big per-frame array loop that I shouldn't have. To give an idea, the ATI figure would be 12sec with readback off and the array loop left in, so you could scale up the 4sec NVIDIA result a little.]

But regardless of that, and of the fact that I don't know what model the NVIDIA card is (it appears faster than the ATI in general rendering), I'm sure there is some problem with the ATI card's readback under Linux. See the jump up to 3min 32sec: an overhead of ~200 seconds. [I was wrong to quote an overhead of 70sec on readback for 1000*glReadPixels(0,0,640,480,...) in earlier posts; it was for 1000*glReadPixels(0,0,214,512,...).]

Perhaps this could be helpful info if someone is struggling with slow glReadPixels() under Linux in the future.
Is it possible to upload the video evidence to the GPU and do the comparison there instead? That would possibly yield an increase in speed.
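Very roughly, something like the sketch below (just to illustrate the idea; it assumes a GLSL-capable card and leaves out the texture/FBO setup): render your model's depth into a texture, upload the image derived from the video frame as a second texture, and let a fragment shader compute the per-pixel difference, which can then be reduced on the GPU before reading back something much smaller than a full 640x480 float buffer.

/* Sketch only: per-pixel comparison done on the GPU.  modelDepth holds the
   rendered depth (copied into a texture), referenceDepth holds the image
   derived from the video frame. */
static const char *compare_fs =
    "uniform sampler2D modelDepth;\n"
    "uniform sampler2D referenceDepth;\n"
    "void main() {\n"
    "    float d = texture2D(modelDepth,     gl_TexCoord[0].st).r;\n"
    "    float r = texture2D(referenceDepth, gl_TexCoord[0].st).r;\n"
    "    gl_FragColor = vec4(abs(d - r));\n"
    "}\n";

/* Compile with glCreateShader(GL_FRAGMENT_SHADER) / glShaderSource /
   glCompileShader, link into a program, bind both textures and draw one
   full-screen quad; the resulting error image can be summed on the GPU so
   only a handful of values ever come back to the CPU. */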
Jerax,
Yes, I think that's a nice idea. Looking at gpgpu.org, the kind of techniques I'd need for general-purpose GPU computation look relatively tough (to me, at least), but I think you're right that it's the way to go for a performance increase. I'll keep testing the approach with the readback technique for now; if it's successful, I'll look again at this option.

Re. glReadPixels(): I've replaced my machine's ATI EAX300SE 128MB PCIe with the NVIDIA GeForce 6600GT 128MB PCIe. The final result for my benchmark under Linux is now:

1000 frame test, duration:

NVIDIA 6600GT 128MB PCIe:
Window size 640*512: 16sec (readback on)

This is manageable for my application.

