About TimothyFarrar

  1. PortAudio has a better Windows interface (WDM-KS) than the DirectSound path (emulated on Vista) used by SDL.
  2. First, I see a lot of claims that NVidia's Linux drivers are slower than the Windows ones. Does anyone care to post something that actually verifies this? Actual glGetQueryObjectiv(query,GL_QUERY_RESULT,&result) timed results would be excellent. It's just too easy to make this claim without doing any testing, or after screwing up the testing in such a way that the performance difference is due to something other than the graphics driver. Personally I don't know the answer, and would love to see some evidence.

     As for porting to Linux, here is what you need to know. For your 7800+ card, just go to NVidia's website, download the drivers, and run the installer. Any sane Linux distribution will already have the X Window System installed with GL support, so getting OpenGL 2.1 + GLSL to work is just a matter of installing the NVidia driver.

     As for compiling, the following will give you an idea of what you will need for GL and for interfacing with X11, plus some extras which will give you a basic idea of some optimizations. Dummy example of how to compile and build a single-file program:

         gcc program.c -o program -O3 -I/usr/X11R6/include -fomit-frame-pointer -msse3 -march=nocona -ffast-math -mno-ieee-fp -mfpmath=sse -L/usr/X11R6/lib -L/usr/lib -lm -lXext -lX11 -lGL

         -I/usr/X11R6/include                -> typical path to X11 headers
         -O3                                 -> optimize level 3
         -fomit-frame-pointer                -> remove this optimization if you intend to debug your program
         -msse3 -march=nocona                -> cpu settings for 64-bit system (better code generation)
         -ffast-math -mno-ieee-fp -mfpmath=sse -> IEEE-violating optimized math
         -L/usr/X11R6/lib                    -> typical library search path for X11 shared libraries
         -L/usr/lib                          -> typical library search path for GL shared libraries
         -lm                                 -> link the math library
         -lXext -lX11                        -> link the X11 shared libraries
         -lGL                                -> link the GL library

     As for porting your interface to bring up a window and get keyboard and mouse input, you have a few options: roll your own direct interface to X11, or use another library (like libsdl, as was previously suggested). Rolling your own is actually really easy, as X11 is easy as pie to work with. Finding good docs might not be, however. I posted a simple example of how to get up a window in X11 here,
  3. On which physical cpu/core does a process run?

    Yeah, a getcpu() syscall will probably never happen in the main branch. Still, your own personal kernel patch might be the best option. I've done it before (a custom modified http server running kernel side), and it's not bad if you are using it for testing only.

    If you do go the kernel modification route, I think on more recent kernels there is a common vdso page which is mapped into every process. This page is read/execute only and supplies the entry/exit for the vsyscall. I'm not sure if this has been done yet, but one kernel optimization effort was to get a vgettimeofday area into the vdso page, so that you could simply read the area to get the time of day instead of doing an actual syscall. Since the vdso page is always virtually mapped to the same physical page in the kernel, you might be able to write the pids in there as well (from inside the scheduler), one entry per processor holding the pid currently running on it. Then just poll from userspace to find which processor your current pid is on.

    Also, I'm not 100% up to date on glibc2 pthreads anymore, so you might have to check that each thread still gets its own pid. I believe there is a way to create threads (processes on Linux) which share pids, so you might need a second number for identification... In theory it should work...

    BTW, if you do find/implement a good way of getting the current cpu id from userland, please post what you did :)
  4. On which physical cpu/core does a process run?

    Oh, if you go the kernel modification route, this will help if you intend to add a syscall to return CPU info to a userspace process: http://tldp.org/HOWTO/html_single/Implement-Sys-Call-Linux-2.6-i386/

    BTW, it looks like a getcpu() vsyscall has been proposed and tested before on x86-64: http://sourceware.org/ml/libc-alpha/2006-06/msg00024.html

    [Edited by - TimothyFarrar on January 17, 2008 4:47:20 PM]
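    For readers on a kernel that does expose a getcpu system call (an assumption here; it was not something the posts above could rely on), a minimal sketch of asking for the current CPU from userspace could look like the following. The helper name current_cpu is made up for illustration, and the raw syscall() wrapper is used only to keep the example free of feature-test macros.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns the index of the CPU the calling thread
 * was running on at the moment of the call, or -1 if the kernel does
 * not provide the getcpu system call. */
static int current_cpu(void)
{
    unsigned int cpu = 0;
    unsigned int node = 0;

    /* Assumption: SYS_getcpu is defined by the system headers. */
    if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
        return -1;
    return (int)cpu;
}
```

    The value is only a snapshot: the scheduler may migrate the thread right after the call returns, so it is best treated as a hint for logging rather than something to base locking decisions on.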
  5. On which physical cpu/core does a process run?

    Probably one of the best questions on this forum in a while. I don't know of any easy way to do this. You could try to modify the kernel to log the information (which is going to be a tremendous amount of data, BTW, and will slow everything down). Doing this from user space in your app would get messy.

    How often does a process (a thread is a process on Linux) switch CPUs? My guess is that it happens more often than you would expect.

    You could keep track of the gettimeofday() time of common thread entry/blocking points (after IO blocking or a mutex, etc). That won't tell you what CPU you are on, but it will give you an idea of what really matters, i.e. how the threads are interacting and blocking. Also, gettimeofday() is a syscall, so you might want to use RDTSC on x86 chips to simply read the CPU clock cycle counter instead (but keep in mind that the low 32 bits wrap around quickly...),

        // for 32-bit machines
        typedef signed long long is8;
        static inline is8 VCycle(void) {
            is8 x;
            // "=A" returns the 64-bit counter in edx:eax
            asm volatile("rdtsc" : "=A" (x));
            return x;
        }

        // or for when running in 64-bit mode
        typedef unsigned int iu4;
        typedef signed long is8;
        static inline is8 VCycle(void) {
            iu4 aa, dd;
            // rdtsc puts the low half in eax and the high half in edx
            asm volatile("rdtsc" : "=a" (aa), "=d" (dd));
            return ((is8)aa) | (((is8)dd) << 32);
        }

    BTW, I know the 64bit one works, because I use it all the time, forget if I tested the 32bit version!
  6. How to use SSE, inline assembly

    I believe gcc has some built-in vectorization ability that needs neither inline assembly nor intrinsics (see -ftree-vectorize, which -O3 turns on). Might want to google that. Intrinsics are really easy to use, however, and also tend to port really well between Windows/Unix/Mac on different compilers.
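    As a rough illustration of the intrinsics route, here is a sketch that adds two float arrays four elements at a time with SSE. The function name add_floats is hypothetical, and the scalar fallback is an addition for compilers that do not define __SSE__.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
static void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Process four floats per iteration; unaligned loads keep it simple. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

    If the arrays are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps can replace the unaligned variants.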
  7. 3D textures

    I believe for DX10.1 you can have cubemap arrays where {x,y,z} selects the position in the cubemap, and {w} selects the cubemap from the array of cubemaps. This is known as a texture array in OpenGL, but there is no support for cubemap texture arrays in OpenGL.

    As for rendering into multiple sides of a cubemap in one pass, you can do this with a geometry shader, selecting the face to draw into. Last I checked, this was really slow on even the newest hardware (something about geometry shaders limiting the thread parallelism on the GPU). Not sure if it was a driver issue, but drawing 64K triangles was faster simply by doing 6 passes, one to each of the faces individually. But don't take my word for it, try it and see what's fastest for you.

    Since there is no filtering across cubemap faces, cubemaps are really not that useful in my opinion, especially in your case of really diffuse cubemaps with 1x1 pixel faces. Bi/tri-linear filtering only works after selecting a face to filter in. Spherical harmonics are probably closer to what you are looking for.
  8. I've been using TBOs also, but not with integers. Perhaps it is a driver bug; I'd suggest checking that you have the newest drivers (the latest Linux 64-bit driver was released Sept 18th, I don't know about Windows off-hand). If it is simply a driver bug, then a possible workaround would be to try the Cg function floatToRawIntBits() (according to a post on the NVidia Dev Forum; I have not tried it yet). You can use Cg functions in GLSL if you leave off the #version line in your shader. You will get a warning, but it will compile with NVidia drivers.
  9. Radiosity + Participating media

    Quote: Original post by Poons
        my radiosity renderer works quite well now

    Screen shot?
  10. Use acos rather than atan2?

    Quote: Original post by Zukix
        instead of using atan2

    Perhaps this might be useful,

        cos(atan(x)) = 1/sqrt(1+x^2)
        sin(atan(x)) = x/sqrt(1+x^2)
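    For anyone wanting to check the two identities, a short derivation follows. The symbol \theta and the atan2 extension at the end are additions for illustration and assume (x, y) is not the origin.

```latex
% Let \theta = \arctan x, so \tan\theta = x and \theta \in (-\pi/2, \pi/2),
% which means \cos\theta > 0.
\begin{align*}
1 + \tan^2\theta &= \frac{1}{\cos^2\theta}
  \quad\Longrightarrow\quad
  \cos(\arctan x) = \frac{1}{\sqrt{1 + x^2}}, \\
\sin\theta &= \tan\theta \,\cos\theta
  \quad\Longrightarrow\quad
  \sin(\arctan x) = \frac{x}{\sqrt{1 + x^2}}.
\end{align*}
% The same idea for \phi = \operatorname{atan2}(y, x) with (x, y) \neq (0, 0):
\begin{align*}
\cos\phi = \frac{x}{\sqrt{x^2 + y^2}}, \qquad
\sin\phi = \frac{y}{\sqrt{x^2 + y^2}}.
\end{align*}
```

    So when only the sine and cosine of the angle are needed, one square root and one division can stand in for the atan2 call plus two trig calls.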
  11. Anyone using half floats in vertex buffers as a form of compression (instead of full 32-bit floats)?

      What I am really wondering is whether the newer GPUs (like the GeForce 8 series) have the ability, when fetching vertex attributes, to expand half floats to full-size floats in dedicated hardware (like automatically expanding a 32-bit RGBA vertex color value into four floating point values), meaning that the drivers don't use hidden vertex shader operations (or unified shader operations) to do the conversion. If there is dedicated parallel hardware, then there should be no extra shader instruction cost to trade against the improved memory bandwidth of using half floats in vertex buffers. Anyone know the answer to this?

      Here is some background on my reasoning. From what I can gather, it seems as if prior to the GeForce 8 the NVidia cards had dedicated GPU instructions, in the fragment shader, to pack and unpack half floats (and bytes) from full-size floats (instructions commonly used for G-buffer compression). From the CUDA PTX guide and from what I have been told by others, the GeForce 8 now emulates pack and unpack in shaders with multiple integer instructions (so to pack 2 FP16 values: 2 FP32->FP16 conversion instructions, then 1 integer shift right, and then 1 integer OR). It might cost up to 10 instructions to pack 4 bytes into a float.

      Also, it seems as if the GeForce 8 is doing attribute interpolation in the fragment shaders using hidden fragment shader operations. So this makes me wonder if the GeForce 8 is also grabbing vertex attributes using hidden vertex shader instructions, and if so there would be a 4 cycle cost per pair of half float vertex attributes streamed in from a vertex buffer (for only a 50% reduction of memory bandwidth).
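      To make the pack sequence concrete (two FP32->FP16 conversions, one shift, one OR), here is a CPU-side sketch in C. It is not what any driver or GPU does internally; float_to_half and pack_half2 are hypothetical names, and the conversion truncates the mantissa and flushes half-precision denormals to zero to stay short.

```c
#include <stdint.h>
#include <string.h>

/* Convert an IEEE 754 single to a half (1 sign, 5 exponent, 10 mantissa bits).
 * Simplifications: truncation instead of rounding, denormals become zero. */
static uint16_t float_to_half(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret without aliasing issues */

    uint32_t sign = (bits >> 16) & 0x8000u;
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t mant = bits & 0x7FFFFFu;

    if (exp == 0xFFu)                         /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x0200u : 0u));

    int e = (int)exp - 127 + 15;              /* rebias the exponent */
    if (e >= 31)
        return (uint16_t)(sign | 0x7C00u);    /* too large: becomes Inf */
    if (e <= 0)
        return (uint16_t)sign;                /* too small: becomes signed zero */

    return (uint16_t)(sign | ((uint32_t)e << 10) | (mant >> 13));
}

/* Pack two halves into one 32-bit word: a in the low 16 bits, b in the high 16. */
static uint32_t pack_half2(float a, float b)
{
    return (uint32_t)float_to_half(a) | ((uint32_t)float_to_half(b) << 16);
}
```

      For example, pack_half2(1.0f, -2.0f) yields 0xC0003C00, since 1.0 is 0x3C00 and -2.0 is 0xC000 in half precision.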
  12. Quote: Original post by jd_24
          Probably easier to give you a paper reference: www.cs.berkeley.edu/~daf/appsem/Tracking/papers/cvpr.pdf

      Didn't see this post before my last. BTW, that paper is tough to quickly get a full understanding of.

      So you are simply sampling 10 cones * 20 points = 200 points per frame. Seems to me that you should simply compute the projection (in 2D screen space) of the circle caps of the cones, which can be approximated by ellipses (unless the cone is perpendicular to the screen), then use the longest ends of the ellipses to construct the end points of the line segments. Then use linear interpolation in both 2D screen space (the offset to grab depth from the z buffer) and 3D camera space (to have a z value to compare to) to grab the points on the line. Then simply do a depth lookup for each point from a z buffer computed from your stereo image pair.

      If you need to check occlusion of the cones with only 200 points per frame and only 10 cones, I would simply do the 200*9 checks algorithmically to see if each point is inside the bounds of the other cones. I probably would not even run it through either an FBO or transform feedback.
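      One way to read "inside the bounds of the other cones" is a plain point-in-truncated-cone test, sketched below. This is an interpretation, not the method from the paper: the Vec3 type and the point_in_cone name are hypothetical, each cone is described by two axis end points with a radius at each end, and the end caps are taken perpendicular to the axis.

```c
typedef struct { double x, y, z; } Vec3;

static double dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static Vec3 sub3(Vec3 a, Vec3 b)
{
    Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

/* Returns 1 if p lies inside the truncated cone whose axis runs from
 * a (radius ra) to b (radius rb), 0 otherwise. No square roots needed. */
static int point_in_cone(Vec3 p, Vec3 a, double ra, Vec3 b, double rb)
{
    Vec3 axis = sub3(b, a);
    Vec3 ap   = sub3(p, a);
    double len2 = dot3(axis, axis);
    if (len2 <= 0.0)
        return 0;                              /* degenerate cone */

    double t = dot3(ap, axis) / len2;          /* position along the axis, 0..1 */
    if (t < 0.0 || t > 1.0)
        return 0;                              /* beyond one of the caps */

    double dist2 = dot3(ap, ap) - t * t * len2; /* squared distance from the axis */
    double r = ra + t * (rb - ra);              /* cone radius at that position */
    return dist2 <= r * r;
}
```

      Running each of the 200 sample points against the 9 other cones is 1800 calls per frame, which is small enough to do on the CPU.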
  13. First, transform feedback is going to capture all your vertex points because it happens before drawing. For the stuff I do, I don't use back-face culling, so you might want to double check when that happens in the pipeline.

      So the reason you need back-face culling is because you are not drawing the caps of the cones? That makes sense. Drawing Z-only filled quads into a frame buffer object (FBO) with only a depth texture bound and back-face culling on should give you what you want (only the front-most Z values for front-facing pixels).

      If you didn't care about occlusion, you could use a quick algorithm to simply project the ends of the cone to figure out the bounding "corners" of the two ellipses which compose the red end points. Easy to do without ever drawing anything. I do something similar to this when finding, say, a bounding quad or triangle for a motion blurred circle (a circle stretched in the direction of motion, used for billboarding motion blurred particles).

      If ultimately you are going to be using the result of all this information for drawing something, it might be better to simply do a Z-only pass and then use the hardware Z culling for future drawing passes instead of going to extremes to compute occlusion. Hardware Z culling is extremely efficient (hierarchical Z checks cull large regions before a fragment shader runs). Then only use some other method of grouped occlusion for culling entire fully occluded cones.

      Still near impossible to give good advice without knowing what you are ultimately using this for :)
  14. OpenGL Gamma correction in a scene

    How about doing a final post-processing fragment shader pass which adjusts the gamma?
  15. # of Interpolated Varyings, Hidden Costs?

    Quote: Original post by Pragma
        There are two costs associated with attributes. The first cost is the space used in passing attributes between the vertex and fragment shaders. Before g80, there was just a fixed maximum number of attributes you could pass and beyond that your shader would simply fail to compile. I'm not sure whether g80 acts the same way. The second cost is in perspective correction of the interpolated attributes. Each interpolated attribute costs one division per component in the fragment shader. Of course this cost shouldn't apply to flat varyings.

    Been looking for more info on this, and I just read something about ATI's HD series doing the interpolation in hardware at something like 80 FP32s per clock cycle. It also appears that the GeForce 8 series does interpolation as stream processor ops in the fragment shader. Still found no info on the costs associated with the triangle setup side...
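    To show where the per-component cost in the quoted post comes from, here is a sketch of perspective-correct interpolation along an edge between two vertices, written as plain C rather than shader code. The function name interp_attribs and its parameters are hypothetical, and this version takes one reciprocal per fragment followed by a multiply per component, which is one common arrangement and not necessarily what any particular GPU does.

```c
/* Perspective-correct interpolation of n attribute components between two
 * vertices with clip-space w values w0 and w1. t is the interpolation
 * weight measured in screen space (0 at vertex 0, 1 at vertex 1). */
static void interp_attribs(const double *a0, double w0,
                           const double *a1, double w1,
                           double t, int n, double *out)
{
    double inv_w0 = 1.0 / w0;                       /* per-vertex setup */
    double inv_w1 = 1.0 / w1;
    double inv_w  = (1.0 - t) * inv_w0 + t * inv_w1; /* 1/w is linear in screen space */
    double w      = 1.0 / inv_w;                     /* one reciprocal per fragment */

    for (int i = 0; i < n; ++i) {
        /* a/w is also linear in screen space; multiply by w to recover a. */
        double a_over_w = (1.0 - t) * a0[i] * inv_w0 + t * a1[i] * inv_w1;
        out[i] = a_over_w * w;                       /* one multiply per component */
    }
}
```

    With w0 = 1 and w1 = 3, the screen-space midpoint (t = 0.5) of an attribute going from 0 to 1 comes out as 0.25 rather than 0.5, which is the perspective correction at work. A flat varying skips all of this and just copies one vertex's value.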