Dave Eberly

Members
  • Content count

    591

Community Reputation

1173 Excellent

About Dave Eberly

  • Rank
    Advanced Member

Personal Information

  1. [quote name='MJP' timestamp='1350161730' post='4989884'] ...generally requires you to write more code since you have to explicitly load and store SIMD values. [/quote] Or, if you are careful, you can use 16-byte alignment directives so that the variables you care about are automatically 16-byte aligned, which means you do not have to explicitly load/store SIMD values. The "care" is in dynamic allocation; for example, if you have an STL container of SIMD values requiring 16-byte alignment, then you need to use a custom allocator. If you have 16-byte-aligned members in a class/struct, you need dynamic allocation of that class/struct to produce 16-byte-aligned memory.
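As a rough illustration of the custom-allocator point, here is a minimal sketch of an allocator that hands a standard container 16-byte-aligned storage. The names `AlignedAllocator`, `Vec4`, and `IsAligned16` are hypothetical, `Vec4` is only a portable stand-in for a SIMD type such as `__m128`, and the over-allocate-and-adjust scheme is one of several ways to do this, not necessarily what the post had in mind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-in for a SIMD type such as __m128: 16 bytes, 16-byte alignment required.
struct alignas(16) Vec4 { float v[4]; };

// Hypothetical minimal allocator: over-allocates with malloc, returns an
// Alignment-aligned pointer, and stores the raw pointer just before it.
template <typename T, std::size_t Alignment = 16>
struct AlignedAllocator
{
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    using value_type = T;

    // Explicit rebind is needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(AlignedAllocator<U, Alignment> const&) {}

    T* allocate(std::size_t n)
    {
        std::size_t const bytes = n * sizeof(T) + Alignment + sizeof(void*);
        void* raw = std::malloc(bytes);
        if (!raw) { throw std::bad_alloc(); }
        std::uintptr_t const start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        std::uintptr_t const aligned =
            (start + Alignment - 1) & ~static_cast<std::uintptr_t>(Alignment - 1);
        reinterpret_cast<void**>(aligned)[-1] = raw;  // remember what to free
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t)
    {
        std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(AlignedAllocator<T, A> const&, AlignedAllocator<U, A> const&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(AlignedAllocator<T, A> const&, AlignedAllocator<U, A> const&) { return false; }

// Returns true when p is a multiple of 16 bytes.
inline bool IsAligned16(void const* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With this, `std::vector<Vec4, AlignedAllocator<Vec4>>` keeps its storage 16-byte aligned across reallocations. The same idea applies to a class with aligned members: its dynamic allocation has to go through something that guarantees the alignment.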
  2. [quote name='chris77' timestamp='1349968638' post='4989124'] AMD has a tool called [url="http://developer.amd.com/tools/gpu/shader/Pages/default.aspx"]GPU Shader Analyzer[/url] that will take HLSL/GLSL and show you the actual machine instructions generated by the driver's compiler (you select which GPU you want to target). It can also estimate performance for you and analyze the bottlenecks. It's quite useful for answering these kinds of questions because you can change the HLSL dynamically and watch how the generated code changes. [/quote] The GUI version appears to limit you to Shader Model 3. Running from a command line, you can get to Shader Model 5 (in theory), but it crashes for me on my Windows 8 machine. I have not resorted to trying this on a Windows 7 machine. The performance counter libraries AMD provides allow you to instrument manually, and they appear to give information similar to what the GUI performance tool gives. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic to verify that all DX reference counts go to zero on program termination, you have to disable those...
  3. [quote name='valyard' timestamp='1350054320' post='4989491'] Hi. I wonder what's happening at the low level when I have several GPUs and many displays connected to them. For example I got 4 video cards and 16 displays. The cards are not connected with sli/crossfire. And I want to render something on all these screens. For example a window which has its part on every display. Let's say this window uses Direct3d. Which video card does actual rendering? Can I change that? Is it possible to render the same scene on every GPU without transferring lots of data between them each frame? Is it possible to use GPGPU on one card and use another one for rendering? Can anyone point me to docs/examples which show how all this stuff works? Thank you. [/quote] If you have two GPUs with SLI enabled, enumeration of adapters leads to a "single" adapter. If you disable SLI, enumeration shows two adapters. If Adapter0 has the monitor attached to it and Adapter1 has no monitor, and you make "draw" calls to Adapter1, you'll see a noticeable decrease in frame rate compared to the SLI-enabled case. The shader output on Adapter1 has to make its way to the monitor somehow. Of course this statement has the implication that you can make rendering calls on both adapters even though only one has a monitor attached. If you have to read back from one GPU and upload to another, you'll see a performance hit. On a single GPU, you can share a 2D texture created by one device with another device (on my AMD Radeon HD cards, I can actually share structured buffers, but that is not part of the DirectX documentation--and this does not work on NVIDIA cards). I believe DX11.1 has improved support for sharing resources, but I don't recall what the improvements are off the top of my head (they are mentioned online in the MSDN docs).
I tend to use the primary GPU for rendering (visual) and the other for compute shaders, but the output from my compute shaders is read-back and not ever used for visual display (on the machine that generated that data). An experiment I have not yet tried is to have SLI disabled and two monitors, one per graphics card, and examine the performance.
  4. My advice is to skip ellipsoids. Use a bounding box or a k-DOP (or some convex polyhedron with a small number of faces), then use separating axis tests. The coding is just a lot easier, and the numerical robustness issues in determining ellipsoid-ellipsoid intersection can be avoided. That said, there is an implementation of the ellipsoid-ellipsoid intersection at my web site. Regarding capsules, they are a better choice than ellipsoids because the intersection/separation tests are simpler to implement. For capsule-capsule sweep, a simple implementation uses bisection over the desired time interval. At each time you apply a (static) capsule-capsule overlap test, which reduces to a computation of the distance between line segments and a comparison involving this distance and the capsule radii. You can avoid the iteration--I have pseudocode for this in my Game Physics 2nd edition (section 6.3.2). Regarding bounded cylinders, the game physics book and a document at my web site show the complexity of intersection testing via separating axes. It turns out that it is simpler to do cylinder-cylinder intersection with infinite cylinders, then clip the intersection set based on the finite cylinder heights. This is not an exact test (the result is not a closed-form solution), but it is effective. I have a document about this at my web site and a sample application (for the infinite cylinder-cylinder test).
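The static capsule-capsule overlap test mentioned above can be sketched as follows. This is not the implementation from the web site or the book; it uses the well-known clamped closest-point computation for two segments, and the names `Vec3`, `Capsule`, `SqrDistanceSegmentSegment`, and `CapsulesOverlap` are hypothetical. The fixed epsilon is a simplification, so treat it as a starting point rather than a numerically robust version.

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline Vec3 operator*(float s, Vec3 a) { return { s * a.x, s * a.y, s * a.z }; }
inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline float Clamp01(float t) { return std::min(std::max(t, 0.0f), 1.0f); }

// Squared distance between segments [p0,p1] and [q0,q1].
inline float SqrDistanceSegmentSegment(Vec3 p0, Vec3 p1, Vec3 q0, Vec3 q1)
{
    Vec3 const d1 = p1 - p0, d2 = q1 - q0, r = p0 - q0;
    float const a = Dot(d1, d1), e = Dot(d2, d2), f = Dot(d2, r);
    float const eps = 1e-12f;  // simplification: absolute tolerance
    float s = 0.0f, t = 0.0f;
    if (a <= eps && e <= eps) {
        // Both segments degenerate to points; s = t = 0.
    } else if (a <= eps) {
        t = Clamp01(f / e);              // first segment is a point
    } else {
        float const c = Dot(d1, r);
        if (e <= eps) {
            s = Clamp01(-c / a);         // second segment is a point
        } else {
            float const b = Dot(d1, d2);
            float const denom = a * e - b * b;  // zero when parallel
            s = (denom > eps) ? Clamp01((b * f - c * e) / denom) : 0.0f;
            t = (b * s + f) / e;
            if (t < 0.0f)      { t = 0.0f; s = Clamp01(-c / a); }
            else if (t > 1.0f) { t = 1.0f; s = Clamp01((b - c) / a); }
        }
    }
    Vec3 const diff = (p0 + s * d1) - (q0 + t * d2);
    return Dot(diff, diff);
}

struct Capsule { Vec3 p0, p1; float radius; };

// Capsules overlap when the axis segments are no farther apart than the
// sum of the radii (compared in squared form to avoid a square root).
inline bool CapsulesOverlap(Capsule const& c0, Capsule const& c1)
{
    assert(c0.radius >= 0.0f && c1.radius >= 0.0f);
    float const rSum = c0.radius + c1.radius;
    return SqrDistanceSegmentSegment(c0.p0, c0.p1, c1.p0, c1.p1) <= rSum * rSum;
}
```

A bisection sweep would call `CapsulesOverlap` on the capsules positioned at each sampled time.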
  5. Approximating Sine?

    Taylor polynomials provide local approximations to a function. Better is to use global approximations that minimize some norm. My standard is to use minimax approximations (minimize the L-infinity norm for the difference between polynomial and function). The math for generating the polynomial coefficients is heavy, but the results are pleasing. DirectX Math used to use Taylor polynomials for sine and cosine, but the version shipping with Windows 8 (and DX 11.1) now uses minimax approximations.
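To make the "local" part concrete, here is a small sketch that measures the error of the degree-7 Taylor polynomial of sine about 0 on two intervals. It does not compute minimax coefficients (those need something like the Remez algorithm and are not given in the post); it only shows that the Taylor error, which behaves roughly like x^9/9!, is tiny near the expansion point and much larger at the end of the interval. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Degree-7 Taylor polynomial of sin(x) about x = 0, in Horner form.
inline double SinTaylor7(double x)
{
    double const x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))));
}

// Largest |SinTaylor7(x) - sin(x)| over numSamples + 1 evenly spaced
// samples of x in [0, xmax].
inline double MaxAbsError(double xmax, int numSamples)
{
    assert(numSamples > 0);
    double maxError = 0.0;
    for (int i = 0; i <= numSamples; ++i) {
        double const x = xmax * i / numSamples;
        maxError = std::max(maxError, std::fabs(SinTaylor7(x) - std::sin(x)));
    }
    return maxError;
}
```

Comparing `MaxAbsError` on [0, pi/8] with [0, pi/2] shows the error piling up at the far endpoint. A minimax polynomial of the same degree spreads the error evenly over the interval, which is why its worst case is smaller.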
  6. DX11 How do you multithread in Directx 11?

    The deferred context requires a lot of care getting things right (in a multithreaded manner). For most of my applications, I don't bother. Instead, I create the device multithreaded and use it for resource creation and destruction. Done right, you can get the resource creation occurring in one CPU thread while another CPU thread is busy with the previous resources. When the second thread is ready to process new data, it (hopefully) is available in GPU memory. Always make sure you profile. For some of my applications it is faster to create/destroy resources each frame rather than map/unmap an already existing resource.
  7. Compressed quaternions

    [quote name='kettle' timestamp='1344687439' post='4968395'] hi,any details about quaternions fitting with Bezier curve or B-Spline? . i just don't konw how to calculate the control points. in 3-D space, i do it with least square method, thanks [/quote] The PDF link is still active, and it discusses the fitting algorithm for N-dimensional quantities (for quaternions, N = 4). It is the same math as for 3-D space (N=3). As mentioned, the fitted curve is close to the unit hypersphere but not always exactly on it, so you can evaluate and then normalize. My website has sample code for fitting in 3D, but the code can be extended easily to quaternions.
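The "evaluate and then normalize" step can be sketched like this. A cubic Bezier curve stands in for whatever fitted curve you end up with (the fitting itself is covered by the PDF and is not reproduced here), and `Quat`, `EvaluateCubicBezier`, and `EvaluateUnit` are hypothetical names. The component order inside `Quat` does not matter for this step.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Quat = std::array<double, 4>;  // treated as a point in 4-D

inline Quat Lerp(Quat const& a, Quat const& b, double t)
{
    Quat r;
    for (int i = 0; i < 4; ++i) { r[i] = (1.0 - t) * a[i] + t * b[i]; }
    return r;
}

// de Casteljau evaluation of a cubic Bezier curve with 4-D control points.
inline Quat EvaluateCubicBezier(std::array<Quat, 4> const& c, double t)
{
    assert(0.0 <= t && t <= 1.0);
    Quat const a = Lerp(c[0], c[1], t), b = Lerp(c[1], c[2], t), d = Lerp(c[2], c[3], t);
    Quat const e = Lerp(a, b, t), f = Lerp(b, d, t);
    return Lerp(e, f, t);
}

inline double Length(Quat const& q)
{
    return std::sqrt(q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]);
}

// Evaluate the curve, then project the result back onto the unit hypersphere.
inline Quat EvaluateUnit(std::array<Quat, 4> const& c, double t)
{
    Quat q = EvaluateCubicBezier(c, t);
    double const len = Length(q);
    assert(len > 0.0);  // the curve must not pass through the origin
    for (double& v : q) { v /= len; }
    return q;
}
```

The raw curve value generally has length slightly different from 1 even when all control points are unit length, so the normalization is what makes the result usable as a rotation.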
  8. Hieroglyph 3 Rendering engine Question

    [quote name='riverreal' timestamp='1344644678' post='4968261'] Is the Hieroglyph 3 well written? I am trying to learn the right way to make dynamic classes for a engine like rendering system using direct3d 11. Is there another good open source engine using direct3d? I want to know how the professionals make the thinks. [/quote] The code is good quality. More importantly, purchase the book that goes with it: Practical Rendering & Computation with Direct3D 11, by J. Zink, M. Pettineo, and J. Hoxley. I have asked the engineers on my real-time graphics team to read it.
  9. SSE vector normalization

    [quote name='RobTheBloke' timestamp='1343299455' post='4963243'] [CODE]
inline const CVector3SSE& CVector3SSE::Normalize()
{
    static const __m128 almostZero = _mm_set1_ps(1e-5f);
    __m128 dp = _mm_dp_ps(m_fValsSSE, m_fValsSSE, 0x7F);
    const __m128 cmp = _mm_gt_ps(dp, almostZero);
    dp = _mm_rsqrt_ps(dp);
    m_fValsSSE = _mm_mul_ps(m_fValsSSE, _mm_and_ps(dp, cmp));
    return *this;
}
[/CODE] [/quote] Although yours is the standard way folks do the normalization, for large components the dot product overflows. If you need something that is robust for all finite floating-point inputs, try this: [CODE]
#include <smmintrin.h>  // SSE4.1, needed for _mm_dp_ps

inline __m128 MaximumAbsoluteComponent (__m128 const v)
{
    // -0.0f has only the sign bit set, so this is the per-lane sign mask.
    // (_mm_set1_ps(0x80000000u) would convert the integer to the float
    // value 2147483648.0f, which is not a sign mask.)
    __m128 SIGN = _mm_set1_ps(-0.0f);
    __m128 vAbs = _mm_andnot_ps(SIGN, v);
    __m128 max0 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(0,0,0,0));
    __m128 max1 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(1,1,1,1));
    __m128 max2 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(2,2,2,2));
    __m128 max3 = _mm_shuffle_ps(vAbs, vAbs, _MM_SHUFFLE(3,3,3,3));
    max0 = _mm_max_ps(max0, max1);
    max2 = _mm_max_ps(max2, max3);
    max0 = _mm_max_ps(max0, max2);
    return max0;
}

inline __m128 Normalize (__m128 const v)
{
    // Compute the maximum absolute value component.
    __m128 maxComponent = MaximumAbsoluteComponent(v);

    // Divide by the maximum absolute component. This is potentially a
    // divide by zero.
    __m128 normalized = _mm_div_ps(v, maxComponent);

    // Set to zero when the original length is zero.
    __m128 zero = _mm_setzero_ps();
    __m128 mask = _mm_cmpneq_ps(zero, maxComponent);
    normalized = _mm_and_ps(mask, normalized);

    // (sqrLength, sqrLength, sqrLength, sqrLength)
    __m128 sqrLength = _mm_dp_ps(normalized, normalized, 0x7F);

    // (length, length, length, length)
    __m128 length = _mm_sqrt_ps(sqrLength);

    // Divide by the length to normalize. This is potentially a divide
    // by zero.
    normalized = _mm_div_ps(normalized, length);

    // Set to zero when the original length is zero or infinity. In the
    // latter case, this is considered to be an unexpected condition.
    normalized = _mm_and_ps(mask, normalized);
    return normalized;
}
[/CODE]
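For readers without SSE4.1 at hand, the same idea can be shown in scalar form. This is only a sketch with hypothetical names (`Float3`, `NormalizeNaive`, `NormalizeRobust`), assuming IEEE single-precision arithmetic: the naive version squares the components directly and overflows for large inputs, while the scaled version first divides by the largest absolute component so the intermediate squared length stays between 1 and 3.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

// Standard approach: the squared length overflows to +infinity when the
// components are large (for example around 1e30f).
inline Float3 NormalizeNaive(Float3 v)
{
    float const sqrLength = v.x * v.x + v.y * v.y + v.z * v.z;
    float const invLength = 1.0f / std::sqrt(sqrLength);
    return { v.x * invLength, v.y * invLength, v.z * invLength };
}

// Scale by the maximum absolute component before computing the length.
// Returns the zero vector for zero or non-finite input.
inline Float3 NormalizeRobust(Float3 v)
{
    float const maxAbs = std::max({ std::fabs(v.x), std::fabs(v.y), std::fabs(v.z) });
    if (maxAbs == 0.0f || !std::isfinite(maxAbs)) {
        return { 0.0f, 0.0f, 0.0f };
    }
    float const sx = v.x / maxAbs, sy = v.y / maxAbs, sz = v.z / maxAbs;
    float const length = std::sqrt(sx * sx + sy * sy + sz * sz);  // in [1, sqrt(3)]
    return { sx / length, sy / length, sz / length };
}
```

With all three components equal to `1e30f`, the naive version ends up multiplying by `1/infinity` and returns the zero vector, while the scaled version returns a unit-length result.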
  10. Null space of a matrix

    The method of solving the system likely depends on the specifics of your problem. For example, this paper [url="http://www.ima.umn.edu/preprints/apr99/1611.pdf"]http://www.ima.umn.edu/preprints/apr99/1611.pdf[/url] has a subproblem that involves solving a large sparse linear system whose matrix has null space of dimension 1. The authors show that using the conjugate gradient method leads to a solution that is unique among values when you project out the null space. The iterations always keep you on the projection space, so numerically the solver is quite robust.
  11. [quote name='HPSC' timestamp='1334074004' post='4929926'] I have an ordered set of polygons and I am intersecting them with a rectangle, how do I tell when the rectangle is filled so I can stop checking? This is in 2D. I am trying to get the minimal set of polygons which cover the rectangle. The order of the set is the priority of use and the polygons are simple (don't have holes or self intersecting). So if my set is {A,B,C,D} and A B D intersect the rectangle but A contains B then I want the set that contains just {A, D}. Tried googling but wasn't sure what to google really, I was thinking it would fall under polygon coverage but that didn't seem to produce results of what I was looking for. Thanks [/quote] I think this is a hard problem theoretically. For a practical solution, have you thought about rasterizing the rectangle and polygons to a high-resolution grid? Rasterize the rectangle first. Rasterize your polygons one at a time, keeping track of which rectangle pixels are written to (sort of like a stencil buffer) and which polygons have been rasterized to previously unwritten pixels. Once you have rasterized to all pixels, the process terminates. (You also have to handle the case where some rectangle pixels remain unwritten after all polygons have been processed.)
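The rasterization idea can be sketched as below. Assumptions that are mine and not from the post: the rectangle is mapped to a `width` x `height` grid with one sample at each pixel center, polygons are already expressed in grid coordinates, and a per-pixel even-odd point-in-polygon test replaces a real scanline rasterizer. All names are hypothetical. The result is only as accurate as the grid resolution, and processing in priority order is not guaranteed to give a truly minimal set.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };
using Polygon = std::vector<Point2>;

// Even-odd (crossing number) point-in-polygon test for a simple polygon.
inline bool Contains(Polygon const& poly, Point2 p)
{
    bool inside = false;
    for (std::size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
        Point2 const& a = poly[i];
        Point2 const& b = poly[j];
        if ((a.y > p.y) != (b.y > p.y)) {
            double const xCross = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y);
            if (p.x < xCross) { inside = !inside; }
        }
    }
    return inside;
}

struct CoverResult
{
    std::vector<std::size_t> used;  // indices of polygons that wrote new pixels
    bool fullyCovered;              // false if some pixels were never written
};

// Rectangle is [0,width] x [0,height] in grid units. Polygons are processed
// in priority order and the loop stops once every pixel has been written.
inline CoverResult SelectCoveringPolygons(std::vector<Polygon> const& polygons,
                                          int width, int height)
{
    assert(width > 0 && height > 0);
    std::vector<char> written(static_cast<std::size_t>(width) * height, 0);
    std::size_t numWritten = 0;
    CoverResult result{ {}, false };
    for (std::size_t k = 0; k < polygons.size() && numWritten < written.size(); ++k) {
        bool wroteNew = false;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                std::size_t const idx = static_cast<std::size_t>(y) * width + x;
                if (!written[idx] && Contains(polygons[k], { x + 0.5, y + 0.5 })) {
                    written[idx] = 1;
                    ++numWritten;
                    wroteNew = true;
                }
            }
        }
        if (wroteNew) { result.used.push_back(k); }
    }
    result.fullyCovered = (numWritten == written.size());
    return result;
}
```

The `fullyCovered` flag is the "not all pixels written" case: the caller decides what to do when the polygons run out before the rectangle is filled.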
  12. SLMATH library and SSE optimisation problem.

    [quote name='RobinsonUK' timestamp='1330698757' post='4918600'] Thanks. It does, yes. It's a giant pain, so I think I'll just switch off SSE in the project settings and live without the performance boost. [/quote] std::vector [b]should[/b] be able to support alignment through custom allocators. However, if you are using MSVS 2010, the dinkumware STL they use has a bug in that the std::vector resize does not do the right thing (fixed in MSVS 2011). For MSVS 2010, you'll have to roll your own std::vector (maybe copy what dinkumware does and "fix" the resize).
  13. Support for C++ math libraries

    [quote name='wood_brian' timestamp='1330821135' post='4919023'] The [url="http://webEbenezer.net"]C++ Middleware Writer[/url] (CMW) has marshalling support for [url="http://webEbenezer.net/misc/MarshallingFunctions.hh"]::std::complex[/url] and IEEE floating point types, but doesn't have support for math libraries beyond that. I'm wanting to[b] [i]avoid[/i][/b] [b]adding support though for [i]libraries that aren't thriving[/i][/b]. I have some scientific programming experience, but not enough to know which libraries are thriving and which aren't. Of course sometimes a library seems to be dying and then is miraculously saved so it isn't easy to determine exactly how things will turn out, but am still interested in hearing thoughts on this. I've looked at a library called [url="http://shoup.net/ntl/"]NTL [/url]-- there are references to gcc 2.95 and other references that seem it isn't being maintained. Is there an alternative to NTL out there that's doing better? I'm also interested in Rational or fixed integer libraries that are used as alternatives to floating point calculations, but again am not sure of what is being used. Just remembered the CMW supports ::std::valarray but that seems to have been a waste of time. I'd like to avoid the valarrays. Tia. [/quote] What is a "thriving" library? If you find a library that has features you want, does it matter whether it is "thriving" (according to whatever your definition is for "thriving")? What features are you looking for in a library? Such information might make it easier for folks to point you to something you can use.
  14. ID3D11ShaderReflection question

    I found another post of yours that mentions the assignment of each element of a float[] to a single register. That was helpful. When I query for the array member, I saw Rows=1, Columns=1, and Elements=5, which seemed strange (a 1x1 array with 5 elements?). Because the query does not tell me I have an "array", I suppose that I can infer it from Rows==1 and Columns==1 and Elements > 1. The other information that seemed strange: cbuffer Whatever { struct Something {...}; Something A; Something B[2]}. The number of bytes reported for A is different from the number of bytes reported for B. Of course, for B the number of bytes is for both array items, but it was not [b]twice[/b] the size of A. And it appears I'd have to [b]assume[/b] that both array items of B use the same number of bytes. Reverse engineering (by compiling some shaders that access the .x components of various struct members) made it clear that there was some set of rules the compiler was using. So I think that knowing how arrays are mapped to registers and knowing to trap the rows/columns/elements case I mentioned, I can infer the packing of the struct (which I want so I know how to typecast the mapped cbuffer data properly). Thanks.
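One way to sanity-check the array sizes is a small calculation like the one below. It assumes the usual HLSL constant-buffer packing rule that every array element starts on a 16-byte register boundary while the last element is not padded out to a full register. That rule is my reading of the packing behavior, not something stated in the post, so compare the numbers against what reflection actually reports. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Round a byte count up to the next multiple of 16 (one float4 register).
inline std::size_t RoundUp16(std::size_t bytes)
{
    return (bytes + 15) & ~static_cast<std::size_t>(15);
}

// Byte size of an array of `count` elements under the assumed rule: each
// element starts on a 16-byte boundary and the final element is not padded.
inline std::size_t ArrayByteSize(std::size_t elementBytes, std::size_t count)
{
    assert(count > 0);
    return (count - 1) * RoundUp16(elementBytes) + elementBytes;
}
```

Under this rule a `float[5]` occupies 4 * 16 + 4 = 68 bytes, and a struct of 12 bytes used as `B[2]` occupies 16 + 12 = 28 bytes, which is not twice the 12 bytes of a single instance. That would be consistent with the mismatch between the sizes of A and B described above.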
  15. Capsule-Capsule Collision Tutorial

    [quote name='wildbunny' timestamp='1329070052' post='4912301'] [quote name='rkeene' timestamp='1329010267' post='4912133'] I just did this tutorial and sample code. I found it difficult to find this topic on the web, so after figuring it out I thought I'd post it here. [url="http://thunderfist-podium.blogspot.com/2012/02/capsule-capsule-collision-in-games.html"]http://thunderfist-p...n-in-games.html[/url] Enjoy. Any suggestions for improvements would be nice. [/quote] Needs more diagrams [img]http://public.gamedev.net//public/style_emoticons/default/smile.png[/img] [/quote] No, it needs more mathematics [img]http://public.gamedev.net//public/style_emoticons/default/tongue.png[/img] . (Two capsules intersect when the distance between their line-segment axes is smaller than the sum of their radii.)