Dave Eberly

Member Since 22 Aug 2004
Offline Last Active Oct 15 2012 01:13 AM

Posts I've Made

In Topic: XNAMath vs D3DX10Math && *.fx files vs *.psh and *.vsh

15 October 2012 - 01:11 AM

...generally requires you to write more code since you have to explicitly load and store SIMD values.

Or, if you are careful, you can use 16-byte alignment directives so that the variables you care about are automatically 16-byte aligned, which means you do not have to explicitly load/store SIMD values. The "care" is in dynamic allocation. For example, if you have an STL container of SIMD values requiring 16-byte alignment, then you need to use a custom allocator. If you have 16-byte-aligned members in a class/struct, then dynamic allocation of that class/struct must produce 16-byte-aligned memory.
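
As a rough illustration of the custom-allocator point, here is a minimal sketch of an allocator that over-allocates with malloc and rounds the pointer up to a 16-byte boundary. The names Vec4, AlignedAllocator, AlignedVec4Array and IsAligned16 are hypothetical, chosen only for this example; the sketch omits overflow checks, and a C++17 compiler's default allocator already honors over-aligned types, so this mainly matters for older toolchains.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical 4-float value; alignas covers stack and static instances only.
struct alignas(16) Vec4 { float x, y, z, w; };

// Hypothetical allocator: over-allocate, align the pointer, and stash the
// original malloc pointer just before the aligned block for deallocate().
template <typename T, std::size_t Alignment = 16>
struct AlignedAllocator {
    static_assert((Alignment & (Alignment - 1)) == 0, "power of two required");
    static_assert(Alignment >= sizeof(void*), "alignment too small");

    using value_type = T;
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        void* raw = std::malloc(n * sizeof(T) + Alignment + sizeof(void*));
        if (!raw) throw std::bad_alloc();
        std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        std::uintptr_t aligned =
            (start + Alignment - 1) & ~static_cast<std::uintptr_t>(Alignment - 1);
        assert(aligned % Alignment == 0);
        reinterpret_cast<void**>(aligned)[-1] = raw;  // remember what to free
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

// Container whose storage is 16-byte aligned regardless of the default heap.
using AlignedVec4Array = std::vector<Vec4, AlignedAllocator<Vec4>>;

bool IsAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

The same idea applies to a class/struct with aligned members: give it class-specific operator new/delete that route through an aligned allocation function.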

In Topic: Are GPU drivers optimizing pow(x,2)?

15 October 2012 - 01:05 AM

AMD has a tool called GPU Shader Analyzer that will take HLSL/GLSL and show you the actual machine instructions generated by the driver's compiler (you select which GPU you want to target). It can also estimate performance for you and analyze the bottlenecks. It's quite useful for answering these kinds of questions because you can change the HLSL dynamically and watch how the generated code changes.

The GUI version appears to limit you to Shader Model 3. Running from a command line, you can get to Shader Model 5 (in theory), but it crashes for me on my Windows 8 machine. I have not resorted to trying this on a Windows 7 machine. The performance counter libraries AMD provides allow you to instrument manually, and they appear to give information similar to what the GUI performance tool provides. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic to verify that all DX reference counts go to zero on program termination, you have to disable those...

In Topic: What's happening under the hood when several GPUs are used?

15 October 2012 - 12:58 AM


I wonder what's happening at the low level when I have several GPUs and many displays connected to them. For example I got 4 video cards and 16 displays. The cards are not connected with sli/crossfire. And I want to render something on all these screens. For example a window which has its part on every display. Let's say this window uses Direct3d.

Which video card does actual rendering? Can I change that?
Is it possible to render the same scene on every GPU without transferring lots of data between them each frame?
Is it possible to use GPGPU on one card and use another one for rendering?

Can anyone point me to docs/examples which show how all this stuff works?

Thank you.

If you have two GPUs with SLI enabled, enumeration of adapters leads to a "single" adapter. If you disable SLI, enumeration shows two adapters. If Adapter0 has the monitor attached to it and Adapter1 has no monitor, then making "draw" calls to Adapter1 gives a noticeable decrease in frame rate compared to the SLI-enabled case. The shader output on Adapter1 has to make its way to the monitor somehow. Of course this statement has the implication that you can make rendering calls on both adapters even though only one has a monitor attached.

If you have to read-back from one GPU and upload to another, you'll see a performance hit. On a single GPU, you can share a 2D texture created by one device with another device (on my AMD Radeon HD cards, I can actually share structured buffers, but that is not part of the DirectX documentation--and this does not work on NVIDIA cards). I believe DX11.1 has improved support for sharing resources, but I don't recall what the improvements are off the top of my head (they are mentioned online in the MSDN docs).

I tend to use the primary GPU for rendering (visual) and the other for compute shaders, but the output from my compute shaders is read-back and not ever used for visual display (on the machine that generated that data).

An experiment I have not yet tried is to have SLI disabled and two monitors, one per graphics card, and examine the performance.

In Topic: Figuring out ellipsoid-ellipsoid collision detection

06 September 2012 - 11:50 PM

My advice is to skip ellipsoids. Use a bounding box or a k-DOP (or some convex polyhedron with a small number of faces), then use separating axis tests. The coding is just a lot easier, and the numerical robustness issues in determining ellipsoid-ellipsoid intersection can be avoided.
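
To show how little code the simplest case needs, here is a minimal sketch of a separating axis test for two axis-aligned bounding boxes, where the only candidate axes are the three coordinate axes. The AABB struct and Overlaps function are hypothetical names for this example; oriented boxes and k-DOPs need more candidate axes, but the structure of the test is the same.

```cpp
// Hypothetical axis-aligned box given by its minimum and maximum corners.
struct AABB {
    double min[3];
    double max[3];
};

// Separating axis test restricted to the coordinate axes: the boxes are
// disjoint exactly when their projections fail to overlap on some axis.
// Touching boxes are reported as overlapping.
bool Overlaps(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i) {
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) {
            return false;  // axis i separates the boxes
        }
    }
    return true;  // no separating axis found
}
```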

That said, there is an implementation of the ellipsoid-ellipsoid intersection at my web site.

Regarding capsules, they are a better choice than ellipsoids because the intersection/separation tests are simpler to implement. For capsule-capsule sweep, a simple implementation uses bisection over the desired time interval. At each time you apply a (static) capsule-capsule overlap test, which reduces to a computation of distance between line segments and a comparison involving this distance and capsule radii. You can avoid the iteration--I have pseudocode for this in my Game Physics 2nd edition (section 6.3.2).
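
The static capsule-capsule overlap test can be sketched as follows. This is not the code from the book; it is a minimal version that uses the widely published clamped closest-point computation for two segments, with hypothetical names (Vec3, Capsule, SegmentSegmentDistSq, CapsulesOverlap) and an arbitrary epsilon for degenerate and parallel cases. A robust implementation needs more care than this.

```cpp
struct Vec3 { double x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(double s, Vec3 v) { return {s * v.x, s * v.y, s * v.z}; }
double Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
double Clamp01(double t) { return t < 0.0 ? 0.0 : (t > 1.0 ? 1.0 : t); }

// Squared distance between segments [p1,q1] and [p2,q2].
double SegmentSegmentDistSq(Vec3 p1, Vec3 q1, Vec3 p2, Vec3 q2) {
    const double eps = 1e-12;  // assumed tolerance for degenerate cases
    Vec3 d1 = q1 - p1, d2 = q2 - p2, r = p1 - p2;
    double a = Dot(d1, d1), e = Dot(d2, d2), f = Dot(d2, r);
    double s = 0.0, t = 0.0;
    if (a <= eps && e <= eps) {
        // Both segments are points.
    } else if (a <= eps) {
        t = Clamp01(f / e);  // first segment is a point
    } else {
        double c = Dot(d1, r);
        if (e <= eps) {
            s = Clamp01(-c / a);  // second segment is a point
        } else {
            double b = Dot(d1, d2);
            double denom = a * e - b * b;  // zero when segments are parallel
            s = (denom > eps) ? Clamp01((b * f - c * e) / denom) : 0.0;
            t = (b * s + f) / e;
            // If t left [0,1], clamp it and recompute s for the clamped t.
            if (t < 0.0) { t = 0.0; s = Clamp01(-c / a); }
            else if (t > 1.0) { t = 1.0; s = Clamp01((b - c) / a); }
        }
    }
    Vec3 diff = (p1 + s * d1) - (p2 + t * d2);
    return Dot(diff, diff);
}

// Hypothetical capsule: a segment [p,q] swept by a sphere of given radius.
struct Capsule { Vec3 p, q; double radius; };

// Capsules overlap when the segment distance is at most the sum of radii.
bool CapsulesOverlap(const Capsule& c0, const Capsule& c1) {
    double rsum = c0.radius + c1.radius;
    return SegmentSegmentDistSq(c0.p, c0.q, c1.p, c1.q) <= rsum * rsum;
}
```

A bisection-based sweep would call CapsulesOverlap on the capsules positioned at each sampled time.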

Regarding bounded cylinders, the game physics book and a document at my web site show the complexity of intersection testing via separating axes. Turns out that it is simpler to do cylinder-cylinder intersection with infinite cylinders, then clip the intersection set based on the finite cylinder heights. Not an exact test (result is not a closed form solution), but effective. I have a document about this at my web site and a sample application (for the infinite cylinder-cylinder test).

In Topic: Approximating Sine?

14 August 2012 - 10:40 PM

Taylor polynomials provide local approximations to a function. Better is to use global approximations that minimize some norm. My standard is to use minimax approximations (minimize the L-infinity norm for the difference between polynomial and function). The math for generating the polynomial coefficients is heavy, but the results are pleasing. DirectX Math used to use Taylor polynomials for sine and cosine, but the version shipping with Windows 8 (and DX 11.1) now uses minimax approximations.
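
To make the local-versus-global distinction concrete, here is a small sketch that compares the maximum error on [-pi/2, pi/2] of the degree-5 Taylor polynomial with that of a degree-5 polynomial interpolating sine at Chebyshev nodes. The Chebyshev interpolant is a stand-in, not a true minimax polynomial (computing that needs something like the Remez algorithm), but it spreads the error over the whole interval in a similar way. All function names here are hypothetical, and this is not the DirectX Math implementation.

```cpp
#include <algorithm>
#include <cmath>

const double kPi = 3.14159265358979323846;
const double kHalfPi = kPi / 2.0;

// Degree-5 Taylor polynomial of sin(x) about x = 0 (a local approximation).
double SinTaylor5(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0)));
}

// Degree-5 polynomial through sin at 6 Chebyshev nodes on [-pi/2, pi/2],
// evaluated in Lagrange form (a global, near-minimax approximation).
double SinChebyshev5(double x) {
    const int n = 6;
    double nodes[n];
    for (int k = 0; k < n; ++k) {
        nodes[k] = kHalfPi * std::cos((2 * k + 1) * kPi / (2.0 * n));
    }
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double term = std::sin(nodes[i]);
        for (int j = 0; j < n; ++j) {
            if (j != i) term *= (x - nodes[j]) / (nodes[i] - nodes[j]);
        }
        sum += term;
    }
    return sum;
}

// Sampled estimate of the L-infinity error against std::sin on [-pi/2, pi/2].
template <typename F>
double MaxError(F approx, int samples = 10001) {
    double maxErr = 0.0;
    for (int i = 0; i < samples; ++i) {
        double x = -kHalfPi + kPi * i / (samples - 1);
        maxErr = std::max(maxErr, std::fabs(approx(x) - std::sin(x)));
    }
    return maxErr;
}
```

Comparing MaxError(SinTaylor5) with MaxError(SinChebyshev5) shows the Taylor error concentrated at the ends of the interval, while the interpolant's error is smaller and more evenly distributed for the same degree.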