About jd_24
  1. Yes, you've put it nicely. Markerless is tough, you're right, but tracking is working quite well. My main problem is speed. Takes a long time to crunch through each frame of image evidence. I evaluate the Weighting Function for ~1000 hypotheses (guessed poses) for each frame. [1000*Canny edge detection + 1000*glReadPixels()] *per frame* on a 320*240 window is, of course, quite slow. It needs a much better approach to finding the edge and surface sample points. I'll get there eventually.
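A back-of-envelope sketch of why the brute-force approach above is slow (illustrative arithmetic only, using the figures from the post):

```c
#include <assert.h>

/* Pixel traffic for the brute-force weighting function: every hypothesis
   (guessed pose) re-renders the scene and reads the whole window back. */
long pixels_per_frame(long hypotheses, long width, long height)
{
    return hypotheses * width * height;
}
```

At 1000 hypotheses on a 320*240 window, that is 1000 * 320 * 240 = 76.8 million depth reads per frame of image evidence (roughly 300 MB of float depth data), before the 1000 Canny passes are even counted.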
  2. I need to think about all the suggestions in the last reply. But I'm still not doing a good job of explaining my application! The paper reference above might help. In summary: the cones (about 10 of them) will form a low degree-of-freedom (~30) representation of 3D human pose. I *do not* need to actually draw anything, no visualisation is required; all I need is the coordinates of the edge and surface points, and their density may be low in practice, perhaps just 10-20 points per cone. These coordinates are used as (x,y,z) locations at which to sample the image evidence (real-world depth data). I can say more if my application is still a mystery.

    I know from my own testing that turning on backface culling and then plotting the quads (cones) in feedback mode causes the resulting buffer size to approximately halve, i.e. about half the quads are not being drawn. I will try a similar test for cone-cone occlusion with the depth test on at some point. I do not draw the caps of the cones, you are correct. But I also do not want surface points that are occluded by a quad belonging to the same cone (see right-hand image, above), so backface culling is necessary for this reason too. I don't know much about the hardware clipping functionality. Each frame I process contains a unique configuration of cones (they are all allowed to rotate and translate), so the occlusion testing would be an entirely new problem each time.
  3. Probably easier to give you a paper reference: www.cs.berkeley.edu/~daf/appsem/Tracking/papers/cvpr.pdf The bit this post relates to specifically is the calculation of the Weighting Function (Section 7, p5). So it compares a guess about the current pose (a particle) with image evidence. I have implemented a version of the algorithm that attempts to track human movement using stereocam data. Results are promising. However, it finds the edge points using Canny edge detection on the z-buffer, and it finds the surface points with glReadPixels() on the whole scene. So: (1) It's slow. (2) I can't vary the sampling density. It needs work!
  4. I'll begin with the last point, as I think a better explanation is needed.

    Quote: "However, not really knowing what you are doing, i'm thinking there is still a faster way to do what you want which doesn't require a geometry shader..."

    As part of a scheme unrelated to any kind of OpenGL optimisation, I need to extract two arrays of 3D vertex coordinates from a set of cones. The first array must contain the coordinates of all the edge vertices (left-hand image), and the second the set of vertices lying on the cone's surface (right-hand image). I think the last post suggests an almost general-purpose-GPU-type scheme to find these two arrays. This is quite advanced for my level of OpenGL; I think if I could first understand a more basic but nevertheless effective approach to the problem, I might do a better job of understanding the best method of optimisation. For example, I am not currently using FBOs in my approach at all. Although I can see this will likely be necessary in due course, I want to devise a good plan first.

    Some comments about these sets of vertices: (1) It should be possible to change the number of these points on each cone (although the surface density of their distribution is always constant across the cone). I plan to do this by changing the value of the variable nverts, which specifies how many quads are used to render a cone. If nverts = 10, then I have 20 points defining the top and bottom 'ends' (circles) of a cone. If these points share an index then it is possible to interpolate, say, 6 more points along the vector top - bottom. The spacing is not correct, but the blue end-points in the right-hand image show top[i=n] and bottom[i=n], and the connecting row of points (augmented with small green dots) shows the idea of the interpolated points. (2) The two arrays should *not* include any points that are not visible from the camera viewpoint, whether due to self-occlusion by the cone itself or occlusion by another cone.
(3) To find the edge array I need the visible 'corners' of the cones, marked in red on the left-hand image, in order to interpolate between them. For a while I thought that finding the visible 'corners' of the cone was similar to taking a bounding-box max & min of X & Y, but it is not: I found this does not hold for many cone orientations (disregarding occlusion, for now). Getting the 4 red points in the left-hand image, in the general case, is not trivial. If the quads were not tessellated and backface culling was turned on, then these 4 points would be the only vertices in the feedback buffer plotted *only once*. As each quad is decomposed into 2 triangles, only two of the vertices remain unique. Even given this information, I'm not sure the search for unique vertices in the feedback buffer would be particularly efficient.

Question: Is it true that if I enabled the depth test (and backface culling) before drawing all my cones to the feedback buffer, then only the vertices that compose a quad with a normal facing the camera *and* that are not occluded by another cone will make it into the buffer? If this is true then it would completely negate the need for my glReadPixels() occlusion-checking idea, which would be a great help.

Comment: I would be left with a feedback buffer containing the locations of the surviving vertices from the top and bottom arrays for each cone. But I can see that I wouldn't always be 'safe' interpolating between these points, due to possible occlusion in between, and the top and bottom arrays are likely to be of different lengths for a cone occluded by another cone anyway. Each cone will have to be decomposed into a set of smaller cones or rings, which will negate the need for the interpolation idea. But finding the vertices in the feedback buffer that belong to the edge array is still a (very similar) problem.
  5. Thanks for the many ideas! I'll mention a few more things about what I'm trying to achieve. I haven't returned to the problem yet, but if/when I make progress I'll repost, in case the final approach is useful to someone else. The quads form a cone. They are constructed by looping over the number of vertices used to approximate the circles at each end of the cone, so I believe it is correct to say that they are drawn from an indexed array. The dimensions of the cone and the number of vertices used to draw it must be free to vary. [I'm probably not approaching the problem in the best way, but...] The cone is free to rotate and translate randomly between frames. I wish to find the 4 2-D vectors (x,y) that describe the 'visible corners' of the cone in screen coordinates at any frame. This was tough (for me, at least). My proposed scheme was as follows: (1) Enable backface culling. (2) Allow my cone-drawing function to plot all the component quads that form the cone to a feedback buffer. (3) In the feedback buffer, only the subset of quads (described by their component vertices) with normals pointing towards the camera will 'survive'. The order is not (well) known; the dropped quads may occur at any point in the indexed array. (4) However, the 4 2-D vectors are the only vertices in the array that are plotted once only; every other point is plotted twice. (5) Point (4) is only true if quads were drawn. As the quads are tessellated into triangles, only 2 of the 4 vectors remain unique. Darn. The following points are probably true(?): (A) If I draw the point output only, there will be no concept of normals and no backface culling will be performed. (B) If my OpenGL skills were any good, I could test the quad normals myself instead of relying on feedback mode to perform the backface culling. (But I think that if I don't impose an orthogonal view, then world coordinates for (x,y) would not correspond to the screen position anyway.
Currently I'm only trusting the feedback buffer coordinates!) If anyone has read this far(!), the final problem I consider may be of interest... There are in fact multiple cones, all free to translate and rotate and to occlude each other. I want only the points that are *visible* to the camera; visible now means that they compose a quad whose normal faces the camera, and that they are not occluded by another cone. Backface culling alone will not give me this. So, having found all 'corner points' (x,y[,z]) from the feedback buffer, I then render the scene as normal, and read every pixel containing a corner point using glReadPixels(...,GL_DEPTH_COMPONENT,...) to verify that the corresponding depth does indeed lie within some small error tolerance of the z-value found in the feedback buffer. If not, this point has been occluded by another cone and can be disregarded.
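The verification step just described reduces to a single comparison per corner point. A minimal sketch, assuming both values are window-space depths in [0,1] (as glReadPixels with GL_DEPTH_COMPONENT / GL_FLOAT returns) and an illustrative tolerance:

```c
#include <assert.h>
#include <math.h>

/* A feedback-buffer corner point is treated as visible only if the depth
   buffer value at its pixel matches its own feedback z within eps.
   Otherwise another cone has been drawn in front of it. */
int corner_visible(float z_feedback, float z_depthbuf, float eps)
{
    return fabsf(z_feedback - z_depthbuf) <= eps;
}
```

The right eps depends on the depth-buffer precision and near/far planes, so it would need tuning rather than a fixed constant.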
  6. I'm rendering a GL_QUAD_STRIP in feedback mode, and I notice that my quadrilaterals are being decomposed into triangles. I'm sure this is good performance-wise, but I'd like to stop it if possible. Is there any way to stop these quads being tessellated into triangles?
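As far as I know, the GL specification permits implementations to decompose polygons in feedback mode, so the tessellation probably can't be prevented. But since the two triangles of a quad share an edge, the four corners can be recovered from the six feedback vertices by collapsing duplicates. A hedged sketch (exact float comparison is used because the duplicated vertices come from the same transform and usually compare bit-equal; an epsilon compare would be safer in general):

```c
#include <assert.h>

typedef struct { float x, y, z; } fbvert;

/* Collapse duplicate vertices: given the 2*3 vertices of the two
   triangles a quad was tessellated into, the survivors are the 4 quad
   corners. O(n^2) is fine at this size. Returns the unique count. */
int unique_vertices(const fbvert *in, int n, fbvert *out)
{
    int m = 0;
    for (int i = 0; i < n; ++i) {
        int seen = 0;
        for (int j = 0; j < m; ++j) {
            if (in[i].x == out[j].x && in[i].y == out[j].y &&
                in[i].z == out[j].z) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            out[m++] = in[i];
    }
    return m;
}
```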
  7. Getting depth values
    Jerax, yes, I think that's a nice idea. Looking at gpgpu.org, the kind of techniques I'd need to employ for general-purpose GPU computation look relatively tough (to me, at least), but I think you're right that it's the way to go for performance increases. I'll keep testing the approach using the readback technique for now, but if it's successful then I'll look again at this option. Re. glReadPixels(): I've replaced my machine's ATI EAX300SE 128Mb PCIe with an NVidia GeForce 6 Series 6600GT 128Mb PCIe. The final result for my benchmark under Linux is now: 1000 frame test, NVidia 6600GT 128Mb PCIe, window size 640*512: 16sec (readback on). This is manageable for my application.
  8. Found another PC running Debian Linux, very similar spec _but_ with an NVidia graphics card. I ran exactly the same code on both my PC (ATI card) and the alternative machine (NVidia card); results are below.

    1000 frame test, duration:
    ATI: window size 640*512: 3min 32sec (readback on), 8sec (readback off); window size 214*512: 1min 23sec (readback on)
    NVidia: window size 640*512: 19sec (readback on), 4sec (readback off); window size 214*512: 10sec (readback on)

    [The readback-off cases aren't entirely fair, as I also dropped a big array loop every frame that I shouldn't have done. To give an idea, ATI would be 12sec with readback off and the array loop left in, so you could scale up the 4sec NVidia result a little.] But regardless of that, and the fact that I don't know what model the NVidia card is, it appears faster in general rendering than the ATI... I'm sure that there is some problem with the ATI card's readback under Linux. See the jump up to 3min 32sec: an overhead of ~200 seconds. [I was wrong to quote an overhead of 70sec on readback for 1000*glReadPixels(0,0,640,480,...) in earlier posts. It was for 1000*glReadPixels(0,0,214,512,...). Perhaps this could be helpful info if someone is struggling with slow glReadPixels() under Linux in the future.]
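From those numbers the per-call cost of the ATI readback can be estimated (a rough sketch; the ~12sec readback-off figure with the array loop restored comes from the bracketed note, and 3min 32sec is 212 seconds):

```c
#include <assert.h>
#include <math.h>

/* Overhead per glReadPixels() call, from total runtimes with and
   without readback over a fixed number of frames. */
double readback_cost_per_frame(double t_on_sec, double t_off_sec,
                               int frames)
{
    return (t_on_sec - t_off_sec) / (double)frames;
}
```

For the ATI card at 640*512: (212 - 12) / 1000 = 0.2 seconds per call, so the readback alone caps that setup at about 5fps regardless of rendering speed.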
  9. I'm not doing the malloc each frame; sorry, that was misleading. The good benchmark is what is leaving me so confused. I know you're right that it should be much faster. If I remove just the one glReadPixels() line _only_, then the 1000 frame run does indeed complete 70 seconds faster (about 15sec in total). There's something wrong here. I found out yesterday that the card in this machine is an ATI EAX300SE 128Mb PCIe. The only explanation I can come up with at the moment is an ATI driver problem for Linux. (Now I've said that, it's bound to be me making a stupid coding mistake.) 1) My benchmarks were indeed good, but they were run under Windows. 2) I do all my OpenGL work in Debian Linux. 3) I have seen people mention ATI Linux driver problems on other forums, specifically mentioning glReadPixels(), e.g. http://www.gpgpu.org/forums/viewtopic.php?t=3353&view=previous&sid=3f7fb23c04d396ca28cd5493ff624753 Don't know what the best next step is. I have an NVidia GeForce 6 Series 6600GT PCIe sitting on my desk, but switching them over could be a problem as I don't own this machine. I've yet to look at whether any more recent ATI drivers are available.
  10. Still don't know what graphics card I have in this machine, but using GPUBench I get the following results for glReadPixels() (http://graphics.stanford.edu/projects/gpubench/test_readback.html has details). They don't read GL_DEPTH_COMPONENT, but I was still interested to see them (window size for the test is 512*512 by default):

    Fixed Hostmem GL_RGBA:      46.54 Mpix/sec  (177.53 MB/sec)
    Fixed Hostmem GL_ABGR_EXT:   1.48 Mpix/sec    (5.66 MB/sec)
    Fixed Hostmem GL_BGRA:      46.23 Mpix/sec  (176.36 MB/sec)
    Float Hostmem GL_RGBA:      12.55 Mpix/sec  (191.48 MB/sec)
    Float Hostmem GL_ABGR_EXT:   0.47 Mpix/sec    (7.11 MB/sec)
    Float Hostmem GL_BGRA:      12.47 Mpix/sec  (190.22 MB/sec)

    I've looked at the GPUBench source code and made some very slight changes to my glReadPixels() calls to bring my code in line with theirs. My performance is pretty much unchanged, however. I think I will give feedback mode a try before I move on; I'll post if I conclude anything other than what zeds predicted above. Regarding PBOs, I'm concerned that all they will give me is the potential for a non-blocking call to read the depth info. As I don't have much work I can give the app to do in the meantime (before I actually try to use the depth data), I don't think I have much chance of a performance increase. The quote from Dominik Göddeke's tutorial below might be interesting to anyone else considering this approach: "Conventional transfers require a pipeline stall on the GPU to ensure that the data being read back is synchronous with the state of computations. PBO-accelerated transfers are NOT able to change this behaviour, they are only asynchronous on the CPU side. This behaviour cannot be changed at all due to the way the GPU pipeline works. This means in particular that PBO transfers from the GPU will not deliver any speedup with the application covered in this tutorial, they might even be slower than conventional ones. They are however asynchronous on the CPU: If an application can schedule enough work between initiating the transfer and actually using the data, true asynchronous transfers are possible and performance might be improved in case the data format allows this. ... To benefit from PBO acceleration, a lot of independent work needs to be scheduled between initiating the transfer and requesting the data." Full tutorial available at http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial3.html
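As a sanity check on the GPUBench figures above, the MB/sec column appears to be just Mpix/sec times the pixel size, reported in binary megabytes. A quick check, assuming 4 bytes per fixed RGBA pixel and 16 bytes per float RGBA pixel:

```c
#include <assert.h>
#include <math.h>

/* Convert a Mpix/sec readback rate to MB/sec (binary megabytes),
   given the size of one pixel in bytes. */
double mb_per_sec(double mpix_per_sec, int bytes_per_pixel)
{
    return mpix_per_sec * 1e6 * (double)bytes_per_pixel
         / (1024.0 * 1024.0);
}
```

mb_per_sec(46.54, 4) and mb_per_sec(12.55, 16) reproduce the 177.53 and 191.48 MB/sec GL_RGBA rows to within rounding, so the float path moves slightly more bytes but far fewer pixels.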
  11. zeds, that's an interesting result. When I remove my glReadPixels() call I get ~70fps; when I add it in, the framerate drops to ~12fps.
    - If you take out your call to glReadPixels(), what kind of performance increase do you get on your benchmark framerate, i.e. is it anything like my jump of about 5x, above?
    - Could you tell me how you're calling glReadPixels()? How many depth values does your benchmark code read per call? My code is below; I'm trying to take all ~300,000 depth values in the window at once. Here's how I make my call:

    float *fmem = malloc(640*480*sizeof(float));
    glReadPixels(0, 0, 640, 480, GL_DEPTH_COMPONENT, GL_FLOAT, fmem);

    I think the PBO idea is a good one, but I want to make sure of some things before I move on from glReadPixels(). My fps results above are based on a 1000 frame test, where the 1000 glReadPixels() calls add 70 seconds in total versus a run where they aren't called. That looks like under 5 million pixels per second coming back to the app.
    - Could I be suffering from the lack of a decent graphics card here? Or perhaps I'm making my call to glReadPixels() incorrectly?
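The "under 5 million pixels per second" figure follows directly from those timings. A worked check (70 seconds of added time for 1000 full-window reads):

```c
#include <assert.h>

/* Effective readback throughput in pixels/sec, from the extra time
   the glReadPixels() calls add over a run of N frames. */
double readback_pixels_per_sec(long w, long h, long frames,
                               double extra_seconds)
{
    return (double)(w * h * frames) / extra_seconds;
}
```

640*480*1000 pixels over 70 seconds is about 4.4 Mpix/sec, or roughly 17.5 MB/sec of float depth data, far below what a PCIe bus should manage.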
  12. I'm attempting to read out depth values for every screen coordinate in a 640*480 window at each frame. I know this has received attention here before, but the application is slightly unusual, and I would appreciate any advice on the best approach. I need to extract _only_ the matrix of orthogonal depth data that results from a particular viewpoint on the scene. Choices regarding lighting or textures, or even whether the data is displayed to the screen at all, are _not_ requirements of the application. Strangely, I suppose, I'm using OpenGL despite not requiring any kind of visualisation of the results it gives.

    Currently I do display each frame to the screen, because that allows me to read depth components across the entire window with glReadPixels(). I have heard this described as bad practice(?), and I am aware that there are also various performance issues related to the type of buffer you read into, its alignment, system hardware, etc. Despite quite a bit of playing around with glReadPixels(), I am not able to achieve an acceptable level of performance. I'm aware that pixel buffer objects might give a performance improvement, but I'm not sure whether either of these approaches will offer the best solution. Reading old posts on the forum has made me aware of feedback mode. This seems a potentially better type of approach, as I don't require that the results be displayed to screen. I have no previous experience of this technique, but I was considering using glFeedbackBuffer() with GL_3D as the feedback buffer type to try and recover the depth data for the entire window. Is this a valid use of feedback mode, and is it likely to offer a performance increase over the display-to-screen-then-read-depths-with-glReadPixels() approach described above? Thanks in advance for any help/comments.

    [For anyone interested in where the application requirements come from: it is an implementation of a particle filter (http://en.wikipedia.org/wiki/Particle_filter). I use OpenGL to draw an articulated 3D object consisting of about 20 component parts, each with between 1 and 3 degrees of freedom. This object is viewed from a fixed point and is used for comparison with video image evidence (by this I mean a frame of real-world video from a video camera). I must probe around 1000 object configurations for depth data to compare with every individual frame of video evidence. With video evidence running at 30-60Hz, there are tens of thousands of configurations per second to be probed for depth data. Although the application need not run in real time, it must be manageable. At the moment, my glReadPixels() approach gives ~12fps, which equates to over an hour to process 1 second of video evidence. As there is no need to visualise any of the output, only to grab the xyz data, I am hopeful that a performance gain is possible, but perhaps not.]
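The "over an hour per second of video" figure follows from those numbers. A quick sketch of the arithmetic, assuming one rendered configuration per OpenGL frame at the measured ~12fps:

```c
#include <assert.h>

/* Wall-clock hours needed to process one second of video evidence:
   configs_per_frame poses probed per video frame, video at video_hz,
   renderer delivering render_fps configurations per second. */
double hours_per_video_second(double configs_per_frame, double video_hz,
                              double render_fps)
{
    return configs_per_frame * video_hz / render_fps / 3600.0;
}
```

1000 configurations at 60Hz video and 12fps rendering gives roughly 1.4 hours per second of video; even 30Hz video needs about 42 minutes, so a large constant-factor speedup in the depth grab is the priority.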