
Dave Eberly


Community Reputation

1175 Excellent

About Dave Eberly

  • Rank
    Advanced Member


  1. Or, if you are careful, you can use 16-byte alignment directives so that the variables you care about are automatically 16-byte aligned, which means you do not have to load and store SIMD values explicitly. The "care" is in dynamic allocation. For example, if you have an STL container of SIMD values that require 16-byte alignment, you need to use custom allocators. If you have 16-byte-aligned members in a class/struct, dynamic allocation of that class/struct must produce 16-byte-aligned memory.
  2. The GUI version appears to limit you to Shader Model 3. Running from the command line, you can (in theory) get to Shader Model 5, but it crashes for me on my Windows 8 machine. I have not yet tried this on a Windows 7 machine. The performance counter libraries that AMD provides allow you to instrument manually, and they appear to give information similar to what the GUI performance tool provides. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic to verify that all DX reference counts go to zero on program termination, you have to disable those...
  3. If you have two GPUs with SLI enabled, enumeration of adapters produces a "single" adapter. If you disable SLI, enumeration shows two adapters. If Adapter0 has the monitor attached and Adapter1 has no monitor, making "draw" calls to Adapter1 gives a noticeable decrease in frame rate compared to the SLI-enabled case, because the shader output on Adapter1 has to make its way to the monitor somehow. This also implies that you can make rendering calls on both adapters even though only one has a monitor attached. If you have to read back from one GPU and upload to another, you'll see a performance hit. On a single GPU, you can share a 2D texture created by one device with another device (on my AMD Radeon HD cards I can actually share structured buffers, but that is not part of the DirectX documentation, and it does not work on NVIDIA cards). I believe DX11.1 has improved support for sharing resources, but I don't recall the details off the top of my head (they are described in the MSDN docs online). I tend to use the primary GPU for rendering (visual) and the other for compute shaders, but the output from my compute shaders is read back and never used for visual display (on the machine that generated the data). An experiment I have not yet tried is to disable SLI, attach two monitors (one per graphics card), and examine the performance.
  4. My advice is to skip ellipsoids. Use a bounding box or a k-DOP (or some convex polyhedron with a small number of faces), then use separating axis tests. The coding is a lot easier, and you avoid the numerical robustness issues in determining ellipsoid-ellipsoid intersection. That said, there is an implementation of ellipsoid-ellipsoid intersection at my web site. Capsules are a better choice than ellipsoids because the intersection/separation tests are simpler to implement. For a capsule-capsule sweep, a simple implementation uses bisection over the desired time interval. At each time you apply a (static) capsule-capsule overlap test, which reduces to computing the distance between line segments and comparing that distance to the sum of the capsule radii. You can avoid the iteration; I have pseudocode for this in my Game Physics, 2nd edition (section 6.3.2). Regarding bounded cylinders, the game physics book and a document at my web site show the complexity of intersection testing via separating axes. It turns out to be simpler to do cylinder-cylinder intersection with infinite cylinders and then clip the intersection set based on the finite cylinder heights. This is not an exact test (the result is not a closed-form solution), but it is effective. I have a document about this at my web site and a sample application (for the infinite cylinder-cylinder test).
  5. Approximating Sine?

    Taylor polynomials provide local approximations to a function. It is better to use global approximations that minimize some norm. My standard is to use minimax approximations (minimize the L-infinity norm of the difference between the polynomial and the function). The math for generating the polynomial coefficients is heavy, but the results are pleasing. DirectX Math used to use Taylor polynomials for sine and cosine, but the version shipping with Windows 8 (and DX 11.1) now uses minimax approximations.
  6. How do you multithread in Directx 11?

    Deferred contexts require a lot of care to get right in a multithreaded setting. For most of my applications, I don't bother. Instead, I create the device with multithreading support and use it for resource creation and destruction. Done right, resource creation can occur in one CPU thread while another CPU thread is busy with the previous resources. When the second thread is ready to process new data, the data is (hopefully) already in GPU memory. Always make sure you profile. For some of my applications, it is faster to create/destroy resources each frame than to map/unmap an already existing resource.
  7. Compressed quaternions

    The PDF link is still active, and it discusses the fitting algorithm for N-dimensional quantities (for quaternions, N = 4). The math is the same as for 3D space (N = 3). As mentioned, the fitted curve is close to the unit hypersphere but not always exactly on it, so you can evaluate and then normalize. My website has sample code for fitting in 3D, but the code can easily be extended to quaternions.
  8. Hieroglyph 3 Rendering engine Question

    The code is good quality. More importantly, purchase the book that goes with it: Practical Rendering & Computation with Direct3D 11, by J. Zink, M. Pettineo, and J. Hoxley. I have made it required reading for the engineers on my real-time graphics team.
  9. SSE vector normalization

    Although yours is the standard way folks do the normalization, for large components the dot product overflows. If you need something that is robust for all finite floating-point inputs, scale by the maximum absolute component before computing the length (note that _mm_dp_ps requires SSE4.1):

        #include <smmintrin.h>  // SSE4.1 (_mm_dp_ps); also pulls in the SSE/SSE2 intrinsics

        // Returns (m, m, m, m), where m is the maximum absolute value over the four lanes of v.
        inline __m128 MaximumAbsoluteComponent (__m128 const v)
        {
            // -0.0f has only the sign bit set, so andnot clears the sign bit of every lane.
            // (_mm_set1_ps(0x80000000u) converts the integer to the float 2147483648.0f,
            // which is not a sign mask.)
            __m128 SIGN = _mm_set1_ps(-0.0f);
            __m128 vAbs = _mm_andnot_ps(SIGN, v);
            __m128 max0 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(0,0,0,0));
            __m128 max1 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(1,1,1,1));
            __m128 max2 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(2,2,2,2));
            __m128 max3 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(3,3,3,3));
            max0 = _mm_max_ps(max0, max1);
            max2 = _mm_max_ps(max2, max3);
            max0 = _mm_max_ps(max0, max2);
            return max0;
        }

        // The dot product below uses only x, y, z (mask 0x7F), so this normalizes a 3D vector
        // stored in the x, y, z lanes; w is assumed to be 0.
        inline __m128 Normalize (__m128 const v)
        {
            // Compute the maximum absolute value component.
            __m128 maxComponent = MaximumAbsoluteComponent(v);

            // Divide by the maximum absolute component. This is potentially a divide by zero.
            __m128 normalized = _mm_div_ps(v, maxComponent);

            // Set to zero when the original vector is zero (the mask is all zeros then).
            __m128 zero = _mm_setzero_ps();
            __m128 mask = _mm_cmpneq_ps(zero, maxComponent);
            normalized = _mm_and_ps(mask, normalized);

            // (sqrLength, sqrLength, sqrLength, sqrLength): mask 0x7F multiplies x, y, z
            // and broadcasts the sum to all four lanes.
            __m128 sqrLength = _mm_dp_ps(normalized, normalized, 0x7F);

            // (length, length, length, length)
            __m128 length = _mm_sqrt_ps(sqrLength);

            // Divide by the length to normalize. This is potentially a divide by zero.
            normalized = _mm_div_ps(normalized, length);

            // Set to zero when the original vector is zero. Infinite components are considered
            // an unexpected condition and are not handled here.
            normalized = _mm_and_ps(mask, normalized);
            return normalized;
        }
  10. Null space of a matrix

    The method of solving the system likely depends on the specifics of your problem. For example, this paper has a subproblem that involves solving a large sparse linear system whose matrix has a null space of dimension 1. The authors show that the conjugate gradient method produces a solution that is unique once the null-space component is projected out. The iterates always stay in the subspace orthogonal to the null space, so numerically the solver is quite robust.
  11. I think this is a hard problem theoretically. For a practical solution, have you thought about rasterizing the rectangle and polygons to a high-resolution grid? Rasterize the rectangle first. Then rasterize your polygons one at a time, keeping track of which rectangle pixels have been written (somewhat like a stencil buffer) and which polygons were rasterized to previously unwritten pixels. Once all pixels have been written, the process terminates. (You also have to handle the case where some rectangle pixels are still unwritten after all polygons have been processed.)
  12. SLMATH library and SSE optimisation problem.

    std::vector should be able to support alignment through custom allocators. However, if you are using MSVS 2010, the Dinkumware STL it uses has a bug: std::vector::resize does not do the right thing for aligned types (fixed in MSVS 2011). For MSVS 2010, you'll have to roll your own std::vector (perhaps copy what Dinkumware does and fix resize).
  13. Support for C++ math libraries

    What is a "thriving" library? If you find a library that has the features you want, does it matter whether it is "thriving" (according to whatever your definition of "thriving" is)? What features are you looking for in a library? Such information might make it easier for folks to point you to something you can use.
  14. ID3D11ShaderReflection question

    I found another post of yours that mentions the assignment of each element of a float[] to a single register. That was helpful. When I query for the array member, I see Rows=1, Columns=1, and Elements=5, which seemed strange (a 1x1 array with 5 elements?). Because the query does not tell me that I have an "array", I suppose I can infer it from Rows==1, Columns==1, and Elements > 1.

    The other information that seemed strange involves cbuffer Whatever { struct Something {...}; Something A; Something B[2]; }. The number of bytes reported for A differs from the number reported for B. Of course, for B the number of bytes covers both array items, but it was not twice the size of A. And it appears I have to assume that both array items of B use the same number of bytes. Reverse engineering (compiling some shaders that access the .x components of various struct members) made it clear that the compiler was following some set of rules.

    So I think that by knowing how arrays are mapped to registers, and by trapping the rows/columns/elements case I mentioned, I can infer the packing of the struct (which I want so that I know how to typecast the mapped cbuffer data properly). Thanks.
  15. Capsule-Capsule Collision Tutorial

    "Needs more diagrams." No, it needs more mathematics. (Two capsules intersect when the distance between their line-segment axes is at most the sum of their radii.)
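The posts on 16-byte alignment and on SLMATH both mention custom allocators for STL containers of SIMD data. Below is a minimal sketch of such an allocator, assuming a C++17 compiler (it relies on the aligned forms of operator new and operator delete that take std::align_val_t). The names AlignedAllocator, AlignedFloatVector, and IsAligned are hypothetical. The allocator gives a std::vector<float> a 16-byte-aligned data() pointer, so aligned loads and stores such as _mm_load_ps can be used on its contents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <new>
#include <vector>

// Minimal C++17 allocator that returns memory aligned to Alignment bytes.
template <typename T, std::size_t Alignment>
struct AlignedAllocator
{
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");

    using value_type = T;

    // Needed because the default rebind cannot handle the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;

    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n)
    {
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
        {
            throw std::bad_array_new_length();
        }
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }

    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

// All instances with the same alignment can free each other's memory.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// A vector of floats whose data() pointer is 16-byte aligned.
using AlignedFloatVector = std::vector<float, AlignedAllocator<float, 16>>;

// Returns true if the address p is a multiple of alignment.
inline bool IsAligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Since C++17, std::allocator and new-expressions respect the alignment requirement of the type being allocated (including types with alignas(16) members), so a custom allocator like this matters mainly when the element type itself, such as float, is not over-aligned but you still want aligned SIMD access to the buffer.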
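The capsule posts describe a static overlap test (the distance between the segment axes compared with the sum of the radii) and a simple bisection sweep. Here is a minimal C++17 sketch of both, assuming each capsule moves with a constant velocity over [0, tMax] and that the capsules cannot pass completely through each other within the interval (the sketch only bisects when they are separated at the start and overlapping at the end). The names Capsule, SegmentSegmentDistanceSquared, and FirstContactTime are illustrative. The segment-distance routine is the standard clamped closest-point computation, not the pseudocode from Game Physics.

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { double x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
Vec3 operator-(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
Vec3 operator*(Vec3 a, double s) { return { a.x * s, a.y * s, a.z * s }; }
double Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// A capsule is the set of points within 'radius' of the segment [p0, p1].
struct Capsule { Vec3 p0, p1; double radius; };

// Squared distance between segments [p0, p1] and [q0, q1], computed from the clamped
// parameters s, t of the closest points p0 + s*d1 and q0 + t*d2.
double SegmentSegmentDistanceSquared(Vec3 p0, Vec3 p1, Vec3 q0, Vec3 q1)
{
    const double eps = 1e-12;
    Vec3 d1 = p1 - p0, d2 = q1 - q0, r = p0 - q0;
    double a = Dot(d1, d1), e = Dot(d2, d2), f = Dot(d2, r);
    double s = 0.0, t = 0.0;  // both segments degenerate: distance between p0 and q0
    if (a <= eps && e > eps)
    {
        t = std::clamp(f / e, 0.0, 1.0);  // first segment is a point
    }
    else if (a > eps)
    {
        double c = Dot(d1, r);
        if (e <= eps)
        {
            s = std::clamp(-c / a, 0.0, 1.0);  // second segment is a point
        }
        else
        {
            double b = Dot(d1, d2), denom = a * e - b * b;
            s = (denom > 0.0) ? std::clamp((b * f - c * e) / denom, 0.0, 1.0) : 0.0;
            t = (b * s + f) / e;
            if (t < 0.0) { t = 0.0; s = std::clamp(-c / a, 0.0, 1.0); }
            else if (t > 1.0) { t = 1.0; s = std::clamp((b - c) / a, 0.0, 1.0); }
        }
    }
    Vec3 diff = (p0 + d1 * s) - (q0 + d2 * t);
    return Dot(diff, diff);
}

// Static test: overlap when the axis distance is at most the sum of the radii.
bool CapsulesOverlap(const Capsule& A, const Capsule& B)
{
    double rSum = A.radius + B.radius;
    return SegmentSegmentDistanceSquared(A.p0, A.p1, B.p0, B.p1) <= rSum * rSum;
}

Capsule Translated(const Capsule& c, Vec3 offset) { return { c.p0 + offset, c.p1 + offset, c.radius }; }

// Bisection sweep for capsules moving with constant velocities vA, vB over [0, tMax].
// Returns the approximate first contact time, or -1 if they are separated at both ends.
double FirstContactTime(const Capsule& A, Vec3 vA, const Capsule& B, Vec3 vB,
                        double tMax, int iterations)
{
    assert(tMax > 0.0 && iterations > 0);
    auto overlapAt = [&](double t) {
        return CapsulesOverlap(Translated(A, vA * t), Translated(B, vB * t));
    };
    if (overlapAt(0.0)) return 0.0;
    if (!overlapAt(tMax)) return -1.0;
    double lo = 0.0, hi = tMax;  // invariant: separated at lo, overlapping at hi
    for (int i = 0; i < iterations; ++i)
    {
        double mid = 0.5 * (lo + hi);
        if (overlapAt(mid)) hi = mid; else lo = mid;
    }
    return hi;
}
```

For example, two unit-length capsules of radius 0.5 whose axes start 3 units apart, with one moving toward the other at 3 units per second, first touch at t = 2/3, and 40 bisection steps recover that time to well under 1e-6. The non-iterative approach mentioned in the post avoids this loop entirely.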
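To illustrate the point in the sine post about global versus local approximations, the sketch below compares a degree-7 Taylor polynomial for sine with a degree-7 Chebyshev interpolant on [-pi/2, pi/2]. Chebyshev interpolation is swapped in here because it is short to implement and is close to minimax. It is not the Remez-style minimax fit the post describes, and no DirectX Math coefficients are reproduced. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

// Degree-7 Taylor polynomial of sin(x) about x = 0 (a local approximation):
// x - x^3/6 + x^5/120 - x^7/5040, written in nested form.
double TaylorSine7(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0 * (1.0 - x2 / 42.0)));
}

// Chebyshev coefficients of sin on [-h, h] from n Chebyshev nodes. The resulting
// polynomial of degree n - 1 interpolates sin at those nodes.
std::vector<double> ChebyshevSineCoefficients(double h, int n)
{
    assert(n >= 2);
    std::vector<double> c(n, 0.0);
    for (int j = 0; j < n; ++j)
    {
        double sum = 0.0;
        for (int k = 0; k < n; ++k)
        {
            double theta = kPi * (k + 0.5) / n;  // node u_k = cos(theta) in [-1, 1]
            sum += std::sin(h * std::cos(theta)) * std::cos(j * theta);
        }
        c[j] = 2.0 * sum / n;
    }
    return c;
}

// Evaluates sum_j c[j] T_j(x / h) - c[0] / 2 using the three-term recurrence for T_j.
double ChebyshevEvaluate(const std::vector<double>& c, double h, double x)
{
    double u = x / h;  // map [-h, h] to [-1, 1]
    double tPrev = 1.0, tCurr = u;
    double result = 0.5 * c[0] + c[1] * u;
    for (std::size_t j = 2; j < c.size(); ++j)
    {
        double tNext = 2.0 * u * tCurr - tPrev;
        result += c[j] * tNext;
        tPrev = tCurr;
        tCurr = tNext;
    }
    return result;
}

// Maximum |p(x) - sin(x)| over evenly spaced samples of [-h, h], endpoints included.
double MaxSineError(const std::function<double(double)>& p, double h, int samples = 10000)
{
    double maxError = 0.0;
    for (int i = 0; i <= samples; ++i)
    {
        double x = -h + 2.0 * h * i / samples;
        maxError = std::max(maxError, std::fabs(p(x) - std::sin(x)));
    }
    return maxError;
}
```

With the same degree, the Taylor error peaks at the interval endpoints (about 1.6e-4 at x = pi/2), while the standard Chebyshev interpolation error bound for eight nodes is about 7e-6. A true minimax fit distributes the error even more evenly.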
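The multithreading post describes creating resources on one CPU thread while another thread works with the previously created ones. Below is a minimal sketch of that handoff, using only the C++17 standard library. StagedResource, HandoffQueue, and RunPipeline are hypothetical; StagedResource stands in for data that would be created through a multithreaded ID3D11Device. No Direct3D calls are made, and the queue-based design is an assumption, not the post's code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a resource created on the creation thread.
struct StagedResource
{
    int frame;
    std::vector<float> data;
};

// Thread-safe FIFO: the creation thread pushes, the consuming thread pops.
template <typename T>
class HandoffQueue
{
public:
    void Push(T item)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mItems.push(std::move(item));
        }
        mCondition.notify_one();
    }

    // Called by the producer when no more items will arrive.
    void Close()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mClosed = true;
        }
        mCondition.notify_all();
    }

    // Blocks until an item is available; returns std::nullopt once closed and drained.
    std::optional<T> Pop()
    {
        std::unique_lock<std::mutex> lock(mMutex);
        mCondition.wait(lock, [this] { return !mItems.empty() || mClosed; });
        if (mItems.empty())
        {
            return std::nullopt;
        }
        T item = std::move(mItems.front());
        mItems.pop();
        return item;
    }

private:
    std::mutex mMutex;
    std::condition_variable mCondition;
    std::queue<T> mItems;
    bool mClosed = false;
};

// Creates numFrames resources on a worker thread and consumes them on the calling thread.
// Returns the frame numbers in the order they were consumed.
std::vector<int> RunPipeline(int numFrames)
{
    assert(numFrames >= 0);
    HandoffQueue<StagedResource> queue;
    std::thread creator([&queue, numFrames] {
        for (int frame = 0; frame < numFrames; ++frame)
        {
            // A real application would create the GPU resource here.
            queue.Push(StagedResource{ frame, std::vector<float>(1024, static_cast<float>(frame)) });
        }
        queue.Close();
    });

    std::vector<int> consumed;
    while (std::optional<StagedResource> resource = queue.Pop())
    {
        // A real application would render or dispatch with the resource here.
        consumed.push_back(resource->frame);
    }
    creator.join();
    return consumed;
}
```

Whether this beats mapping and unmapping existing resources is exactly what the post suggests measuring with a profiler.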
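The compressed-quaternion post suggests evaluating the fitted 4D curve and then normalizing. Below is a minimal sketch of that evaluate-then-normalize step, assuming a cubic Bezier curve with given quaternion control points, evaluated with de Casteljau's algorithm. The least-squares fitting from the PDF is not reproduced here, and all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Quat = std::array<double, 4>;  // components (w, x, y, z)

Quat Lerp(const Quat& a, const Quat& b, double t)
{
    Quat r{};
    for (std::size_t i = 0; i < 4; ++i)
    {
        r[i] = a[i] + t * (b[i] - a[i]);
    }
    return r;
}

// de Casteljau evaluation of a cubic Bezier curve in 4D.
Quat EvaluateCubicBezier(const std::array<Quat, 4>& ctrl, double t)
{
    Quat a = Lerp(ctrl[0], ctrl[1], t);
    Quat b = Lerp(ctrl[1], ctrl[2], t);
    Quat c = Lerp(ctrl[2], ctrl[3], t);
    Quat d = Lerp(a, b, t);
    Quat e = Lerp(b, c, t);
    return Lerp(d, e, t);
}

double Norm(const Quat& q)
{
    return std::sqrt(q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]);
}

// The curve is only close to the unit hypersphere, so project the result back onto it.
Quat EvaluateAndNormalize(const std::array<Quat, 4>& ctrl, double t)
{
    Quat q = EvaluateCubicBezier(ctrl, t);
    double length = Norm(q);
    assert(length > 0.0);  // assumption: the curve never passes through the origin
    for (double& component : q)
    {
        component /= length;
    }
    return q;
}
```

With unit control points spread along a rotation about the z axis, the raw curve point at t = 0.5 has a norm of roughly 0.97, which is why the normalization step is needed before using the result as a rotation.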
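The overflow described in the SSE normalization post is easy to reproduce in scalar code. This sketch uses plain C++ rather than SSE so that it runs anywhere. It contrasts the textbook normalization with the scale-by-maximum-component approach used in the SSE version; the function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Vec3f = std::array<float, 3>;

// Textbook normalization: the squared length overflows to +infinity for large components.
Vec3f NormalizeNaive(const Vec3f& v)
{
    float sqrLength = v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    float length = std::sqrt(sqrLength);
    return { v[0] / length, v[1] / length, v[2] / length };
}

// Divide by the largest absolute component first, so the squared length lies in [1, 3].
Vec3f NormalizeRobust(const Vec3f& v)
{
    float maxComponent = std::max({ std::fabs(v[0]), std::fabs(v[1]), std::fabs(v[2]) });
    if (maxComponent == 0.0f)
    {
        return { 0.0f, 0.0f, 0.0f };  // zero vector maps to zero, as in the SSE version
    }
    Vec3f s = { v[0] / maxComponent, v[1] / maxComponent, v[2] / maxComponent };
    float length = std::sqrt(s[0] * s[0] + s[1] * s[1] + s[2] * s[2]);
    assert(length >= 1.0f);  // at least one scaled component has magnitude 1
    return { s[0] / length, s[1] / length, s[2] / length };
}
```

For (1e30, 1e30, 1e30), the squared length 3e60 exceeds the float range, so the naive version divides by infinity and returns zeros, while the scaled version returns about 0.57735 in each component.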
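The null-space post relies on conjugate gradient staying in the subspace orthogonal to a one-dimensional null space. Below is a minimal sketch, assuming a symmetric positive semidefinite matrix, a right-hand side in the range of the matrix, and a zero initial guess. A small dense path-graph Laplacian (whose null space is spanned by the all-ones vector) stands in for the paper's large sparse system, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // dense row-major; a sparse format would be used in practice

Vector Multiply(const Matrix& A, const Vector& x)
{
    Vector y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
    {
        for (std::size_t j = 0; j < x.size(); ++j)
        {
            y[i] += A[i][j] * x[j];
        }
    }
    return y;
}

double Dot(const Vector& a, const Vector& b)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        sum += a[i] * b[i];
    }
    return sum;
}

// Conjugate gradient for symmetric positive semidefinite A with b in range(A).
Vector ConjugateGradient(const Matrix& A, const Vector& b, int maxIterations, double tolerance)
{
    assert(A.size() == b.size());
    Vector x(b.size(), 0.0);  // x0 = 0
    Vector r = b;             // r0 = b - A * x0
    Vector p = r;
    double rr = Dot(r, r);
    for (int iteration = 0; iteration < maxIterations && std::sqrt(rr) > tolerance; ++iteration)
    {
        Vector Ap = Multiply(A, p);
        double alpha = rr / Dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i)
        {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rrNew = Dot(r, r);
        double beta = rrNew / rr;
        for (std::size_t i = 0; i < p.size(); ++i)
        {
            p[i] = r[i] + beta * p[i];
        }
        rr = rrNew;
    }
    return x;
}
```

Starting from x = 0 keeps every iterate in span{b, Ab, A^2 b, ...}, which lies in the range of A, the orthogonal complement of the null space for a symmetric matrix. For the 4x4 path Laplacian with b = (1, 0, 0, -1), the returned solution is (1.5, 0.5, -0.5, -1.5), whose entries sum to zero.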
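The rasterization approach in the rectangle-coverage post can be prototyped on the CPU. This sketch samples pixel centers of an axis-aligned rectangle, uses an even-odd point-in-polygon test in place of a real rasterizer, and records which polygons write previously unwritten pixels. The names and the pixel-center sampling are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
using Polygon = std::vector<Point>;

// Even-odd (crossing-number) point-in-polygon test; works for convex and nonconvex polygons.
bool Contains(const Polygon& poly, Point p)
{
    assert(poly.size() >= 3);
    bool inside = false;
    for (std::size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++)
    {
        const Point& a = poly[i];
        const Point& b = poly[j];
        if ((a.y > p.y) != (b.y > p.y))
        {
            double xCross = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y);
            if (p.x < xCross)
            {
                inside = !inside;
            }
        }
    }
    return inside;
}

struct CoverageResult
{
    std::vector<std::size_t> contributing;  // polygons that wrote at least one new pixel
    bool fullyCovered = false;              // true if every rectangle pixel was written
};

// Samples the rectangle [minCorner, maxCorner] at resolution x resolution pixel centers and
// processes polygons in order, stopping once every pixel has been written.
CoverageResult RasterCoverage(Point minCorner, Point maxCorner,
                              const std::vector<Polygon>& polygons, int resolution)
{
    assert(resolution > 0);
    std::vector<bool> written(static_cast<std::size_t>(resolution) * resolution, false);
    std::size_t remaining = written.size();
    double dx = (maxCorner.x - minCorner.x) / resolution;
    double dy = (maxCorner.y - minCorner.y) / resolution;

    CoverageResult result;
    for (std::size_t k = 0; k < polygons.size() && remaining > 0; ++k)
    {
        bool wroteNewPixel = false;
        for (int row = 0; row < resolution; ++row)
        {
            for (int col = 0; col < resolution; ++col)
            {
                std::size_t index = static_cast<std::size_t>(row) * resolution + col;
                if (written[index]) continue;
                Point center{ minCorner.x + (col + 0.5) * dx, minCorner.y + (row + 0.5) * dy };
                if (Contains(polygons[k], center))
                {
                    written[index] = true;
                    --remaining;
                    wroteNewPixel = true;
                }
            }
        }
        if (wroteNewPixel) result.contributing.push_back(k);
    }
    result.fullyCovered = (remaining == 0);
    return result;
}
```

A polygon lying entirely inside pixels that are already written contributes nothing, and the fullyCovered flag handles the case in which some pixels remain unwritten after all polygons have been processed.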
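The cbuffer observations in the ID3D11ShaderReflection post are consistent with the HLSL constant-buffer packing rules: each array element and each struct starts on a new 16-byte register, and other members move to the next register only when they would straddle a 16-byte boundary. The sketch below encodes those rules as assumed here (verify them against the reflection data for your shaders); the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kRegisterBytes = 16;  // one float4 constant register

inline std::size_t RoundUp(std::size_t value, std::size_t multiple)
{
    assert(multiple > 0);
    return (value + multiple - 1) / multiple * multiple;
}

// Offset of a member that would otherwise start at 'offset'. Arrays and structs start on a
// register boundary; other members move to the next register only if they would straddle
// a 16-byte boundary.
inline std::size_t PlaceMember(std::size_t offset, std::size_t size, bool isArrayOrStruct)
{
    bool straddles = (offset % kRegisterBytes) + size > kRegisterBytes;
    return (isArrayOrStruct || straddles) ? RoundUp(offset, kRegisterBytes) : offset;
}

// Bytes occupied by an array: every element except the last is padded to whole registers.
inline std::size_t ArrayBytes(std::size_t elementBytes, std::size_t count)
{
    return count == 0 ? 0 : RoundUp(elementBytes, kRegisterBytes) * (count - 1) + elementBytes;
}
```

Under these rules a float[5] occupies 4*16 + 4 = 68 bytes, one element per register. If Something is assumed to be { float4 a; float b; } (20 bytes), then B[2] occupies 32 + 20 = 52 bytes rather than 40 and starts at offset 32, which matches the post's observation that B is not twice the size of A.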