
#4958584 CSM (based on nvidia's paper) swimming

Posted by on 12 July 2012 - 05:25 PM

Stabilizing the shadowmap requires a few steps:
  • Pad the shadowmap with 1 additional texel beyond what is required, then translate the shadowmap projection by an offset so that it stays centered on a whole texel at all times. This stops the crawling when you translate the camera.
  • Mapping the visible part of the view frustum to a sphere before projecting it into a texture protects it from crawls caused by rotating the camera.
  • If your camera's field of view animates, the shadows will also crawl, as the view frustum will dynamically make the fit sphere larger or smaller. This can be hidden by rounding the FOV up into buckets of increments that affect the sphere fitting, or you can just live with it if you don't change the FOV much or at all. The bucket strategy avoids the crawl, but you will get pops when changing buckets instead. If you constrain the min and max FOV well enough you could probably use a single value and never see a pop.
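A minimal sketch of the texel-snapping step, assuming an orthographic shadow projection; all names here are illustrative, not from any particular engine:

```cpp
#include <cmath>

// Snap the shadow projection's center to whole-texel increments so the
// shadow map does not crawl as the camera translates. 'shadowMapSize' is
// the resolution in texels; 'worldWidth' is the width of the orthographic
// shadow volume in world units (assumed names).
void SnapToTexelGrid(float& centerX, float& centerY,
                     float worldWidth, int shadowMapSize)
{
    const float worldUnitsPerTexel = worldWidth / (float)shadowMapSize;
    centerX = std::floor(centerX / worldUnitsPerTexel) * worldUnitsPerTexel;
    centerY = std::floor(centerY / worldUnitsPerTexel) * worldUnitsPerTexel;
}
```

As long as the padded projection is at least one texel wider than the fit sphere, the snapped center never exposes an uncovered edge.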

#4937519 Number of arrays CPU can prefetch on

Posted by on 04 May 2012 - 07:18 PM

I would download the Intel Optimization Manuals from Intel's website. There is a lot of information, but Chapter 7 (Optimizing Cache Usage) should have most of your answers.


x64/x86 CPUs have extremely sophisticated hardware predictive prefetching capabilities, so generally you shouldn't need to explicitly prefetch data in your code. The first iteration of something can be an exception, since code can frequently 'surprise' the hardware prefetcher, and you would need to prefetch much farther in advance in the code yourself. This is frequently not very practical, and you have to eat the first L3 miss.
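For the rare cases where explicit prefetching is warranted, a hedged sketch (the prefetch distance here is a guess that has to be tuned per workload, and on modern parts this loop is usually no faster than the plain version):

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

// Walk a large array while prefetching a fixed distance ahead. The hardware
// prefetcher usually makes this unnecessary; 128 floats (~8 cache lines) is
// an illustrative tuning guess, not a recommendation.
float SumWithPrefetch(const float* data, size_t count)
{
    const size_t kPrefetchAhead = 128;
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i)
    {
        if (i + kPrefetchAhead < count)
            _mm_prefetch((const char*)(data + i + kPrefetchAhead), _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```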

#4923514 releasing a game built with old DirectX SDK's

Posted by on 19 March 2012 - 09:14 PM

If you aren't linking against D3DX you can generally just distribute the binaries without needing to install the DirectX redist. A Windows version check in the installer (or in the app itself, if you are using delay-loading of the D3D DLLs to avoid startup failures) should be sufficient.



#4909984 C++ SIMD/SSE optimization

Posted by on 05 February 2012 - 06:11 PM

It's easier to read the various *mmintrin.h headers (there are 7 or 8 of them now) in the MSVC header directory and see what's there.

The MSDN docs are a jumbled mess (and spread across multiple 'docs'; SSE, SSE2, SSE4, and some AVX are fairly separate doc-wise).

#include <mmintrin.h>   // MMX
#include <xmmintrin.h>  // SSE1
#include <emmintrin.h>  // SSE2

#include <pmmintrin.h>  // Intel SSE3
#include <tmmintrin.h>  // Intel SSSE3 (the extra S is not a typo)
#include <smmintrin.h>  // Intel SSE4.1
#include <nmmintrin.h>  // Intel SSE4.2
#include <wmmintrin.h>  // Intel AES instructions
#include <immintrin.h>  // Intel AVX instructions
//#include <intrin.h>   // Includes all MSVC intrinsics: all of the above plus the CRT and win32/win64 platform intrinsics

#4908266 If developers hate Boost, what do they use?

Posted by on 31 January 2012 - 08:43 PM

It's less a question about Boost and more along the lines of 'How do you choose which parts of the C++ language and libraries to use?'

I've never met anyone that admitted to even using C++ iostreams, let alone liking them or using them for anything beyond an academic environment (i.e. homework).

STL and Boost pretty much require exception handling to be enabled. This is a dealbreaker for a lot of people, especially with codebases older than the modern, exception-safe style of C++. You are more or less forced into 'C with Classes', plus type traits and the STL/Boost templates that don't allocate memory.

RAII design more or less requires exception handling to be useful, as you can't put any interesting code in constructors without being able to unwind (i.e. two-phase initialization is required instead). The cleanup-on-scope-exit aspect is still useful without exception handling, though, since destructors aren't supposed to throw anyway.

STL containers have poor to non-existent control over their memory management strategies. You can replace the 'allocator' for a container, but it is near useless when the nodes of a linked list are forced to use the same allocator as the data they point to, ruling out fixed-size allocators for those objects, etc. This is a lot of the motivation behind EASTL: having actual control, as the standard libraries are 'too generic'.

And memory management ties heavily into threading: we use Unreal Engine here, which approaches the 'ridiculous' side of the spectrum in the amount of dynamic memory allocation it does at runtime. The best weapon to fight this (as we cannot redesign the engine) is to break up the memory management into lots of heaps and fixed-size allocators, so that any given allocation is unlikely, or guaranteed not, to contend with a lock from other threads. Stack-based allocators are also a big help, but are very not-C++-like.

My rule of thumb for using these libraries is: if it doesn't allocate memory, it is probably OK to use.

  • std::sort from <algorithm> is quite useful even without proper STL containers, and outperforms qsort by quite a lot due to being able to inline everything.
  • Type traits (either MS extensions, TR1, or Boost) can make your own templates quite a bit easier to write.
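For example, std::sort works on a plain array with no container or allocation involved, and the comparator can be inlined (unlike the function pointer qsort takes); the names below are illustrative:

```cpp
#include <algorithm>
#include <cstddef>

struct Particle { float depth; int id; };

// A named comparator the compiler can inline into the sort loop.
inline bool ByDepth(const Particle& a, const Particle& b)
{
    return a.depth < b.depth;
}

// Sorts in place: no heap allocation, no STL container required.
void SortParticles(Particle* particles, size_t count)
{
    std::sort(particles, particles + count, ByDepth);
}
```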

I've also never seen the need for thread libraries; the code just isn't that interesting or difficult to write (and libraries tend to do things like making the stack size hard to set, or everyone uses the library in their own code and you end up with 22 thread pools and 400 threads, etc.).

#4907800 C++ SIMD/SSE optimization

Posted by on 30 January 2012 - 04:57 PM

Awesome, this works right out of the box! This code is around 30% faster than the native c++ code. Thanks for the fast response Zoner!

The loop will likely need to be unrolled 2-4 more times as to pipeline better (i.e. use more registers until it starts spilling over onto the stack)

If the data is aligned, the load and store can use the aligned 'non-u' versions instead.

I am using the non-u version, but it didn't make much difference. Also unrolling the loop (4 times) didn't have a significant impact, although I re-used the same variables. By "use more registers" did you mean I should introduce more variables for each unrolled loop sequence?

SIMD intrinsics can only be audited by looking at optimized code (unoptimized SIMD code is pretty horrific). Basically, when an algorithm gets too complicated it has to spill various XMM registers onto the stack, so you have to build the code, check out the asm in a debugger, and see whether it is doing that or not. This is much less of a problem with 64-bit code, as there are twice as many registers to work with.

Re-using the same variables should work for a lot of code, although making the pointers __restrict will probably be necessary so the compiler can schedule the code more aggressively. If the restrict is helping, the resulting asm should look something like:

read A
do work A
read B
do work B
store A
do more work on B
read C
store B
do work C
store C

rather than the fully serialized:
read A
do work A
store A
read B
do work B
store B
read C
do work C
store C
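In C-like terms (a scalar stand-in, since the scheduling idea is the same as for the XMM version): two independent iterations are in flight before either store, and __restrict promises the pointers don't alias so the compiler may interleave the work. Names are illustrative:

```cpp
#include <cstddef>

// Sketch of the unrolling advice: the 'A' and 'B' work are independent,
// so with non-aliasing pointers the compiler can overlap them.
void ScaleTwoAtATime(const float* __restrict src, float* __restrict dst,
                     size_t count)
{
    size_t i = 0;
    for (; i + 2 <= count; i += 2)
    {
        float a = src[i + 0] * 2.0f;  // 'A' work
        float b = src[i + 1] * 2.0f;  // 'B' work, independent of A
        dst[i + 0] = a;
        dst[i + 1] = b;
    }
    for (; i < count; ++i)            // remainder
        dst[i] = src[i] * 2.0f;
}
```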

#4907781 C++ SIMD/SSE optimization

Posted by on 30 January 2012 - 03:56 PM

static const __m128i GAlphaMask = _mm_set_epi32(0xFF000000, 0xFF000000, 0xFF000000, 0xFF000000); // make this a global, not a local in the function

void foo()
{
  unsigned int* f = (unsigned int*)frame;
  unsigned int* k = (unsigned int*)alphaKey;

  size_t numitems = mFrameHeight * mFrameWidth;
  size_t numloops = numitems / 4;
  size_t remainder = numitems - numloops * 4;
  for (size_t index = 0; index < numloops; ++index)
  {
    __m128i val = _mm_loadu_si128((__m128i*)f);
    __m128i valmasked = _mm_and_si128(val, GAlphaMask);
    __m128i shiftA = _mm_srli_epi32(valmasked, 8);
    __m128i shiftB = _mm_srli_epi32(valmasked, 16);
    __m128i shiftC = _mm_srli_epi32(valmasked, 24);
    __m128i result = _mm_or_si128(_mm_or_si128(shiftA, shiftB), _mm_or_si128(shiftC, GAlphaMask));
    _mm_storeu_si128((__m128i*)k, result);
    f += 4;
    k += 4;
  }
  // TODO - finish the 'remainder' items with non-SIMD code
}

The loop will likely need to be unrolled 2-4 more times as to pipeline better (i.e. use more registers until it starts spilling over onto the stack)

If the data is aligned, the load and store can use the aligned 'non-u' versions instead.

#4905182 Forcing Alignment of SSE Intrinsics Types onto the Heap with Class Hierachies

Posted by on 22 January 2012 - 02:03 PM

Also, I haven't replaced the heap in my homebrew project yet, but I have written a fairly large SIMD library for it (FPU, multiple versions of SSE, and AVX supported); the global heap is all I've had to hook:

This is ultimately Windows code, as it has an aligned heap available out of the box (_aligned_malloc).

mDEFAULT_ALIGNMENT is 16 in my codebase. Ideally you would pass in the alignof of the type here, but the C++ ABI only passes the size to operator new (and you don't have the ability to get the type information either).

void* mAlloc(zSIZE size, zSIZE alignment)
{
	void* pointer = _aligned_malloc(size, alignment);
	if (pointer == nullptr)
		throw std::bad_alloc();
	return pointer;
}

void mFree(void* pointer)
{
	_aligned_free(pointer);
}

void* operator new(zSIZE allocationSize)
{
	return mAlloc(allocationSize, mDEFAULT_ALIGNMENT);
}

void* operator new[](zSIZE allocationSize)
{
	return mAlloc(allocationSize, mDEFAULT_ALIGNMENT);
}

void operator delete(void* pointer)
{
	mFree(pointer);
}

void operator delete[](void* pointer)
{
	mFree(pointer);
}

#4905180 Forcing Alignment of SSE Intrinsics Types onto the Heap with Class Hierachies

Posted by on 22 January 2012 - 02:01 PM

A combination of an aligned allocator and compiler-specific alignment attributes should suffice. For Visual C++, look at __declspec(align). For GCC, look at __attribute__((aligned)).

My intrinsic wrappers assert the alignment in the constructors and copy constructors. It can be useful to leave these asserts enabled in release builds for a while, as on Windows it seems allocations from the debug heap are aligned sufficiently for SSE2 anyway.

The SSE types (__m128 and friends) already have declspec align 16 applied to them for you. Placing one as a member in a struct will promote the struct's alignment.

Looking at the original data structures from the first post, the compiler should be generating 12 bytes of padding before the worldMatrix member in struct Transform, and also 4 bytes of padding between the base class and the derived struct (as their alignments are different):

	struct TestA
	{
		zBYTE bytey;
	};

	struct TestB : public TestA
	{
		vfloat vecy;
	};

	zSIZE sizeA = sizeof(TestA);
	zSIZE alignA = alignof(TestA);
	zSIZE sizeB = sizeof(TestB);
	zSIZE alignB = alignof(TestB);
	zSIZE offset_bytey = offsetof(TestB, bytey);
	zSIZE offset_vecy = offsetof(TestB, vecy);



#4833576 Return Values

Posted by on 10 July 2011 - 09:50 PM

Wow, thanks for being so thorough!

I'm perplexed as to why so many heavyweight math libraries seem to worry about this!
D3DX being one of them.

Ok, so it's probably still better to use:

Vector3DAdd(a, b, dest);

rather than:

dest = a + b;

Right? To avoid temp objects?
Also, I hear that if the return value is named, the compiler will have to construct/destruct it regardless.

I'm trying to see the assembly myself but the compiler optimizes out all my test code :P

Also, just to get this out of the way, yes, the profiler is telling me that vector math could go to be faster since my physics engine is choking atm.

I've written a few math libraries, and the temporaries are rarely a problem, provided you keep the structs containing the objects POD-like. The last big complication I've seen is that sometimes SIMD wrapper classes don't interoperate well with exception handling being enabled. Not a problem; exceptions can be turned off! :)

D3DX is structured the way it is for a few reasons:
  • It is C-like on purpose; D3D has had a history of being supported for C to some degree, as you can still find all those wonderfully ugly macros in the headers for calling the COM functions from C.
  • The D3DX DLL is designed to be able to run on a wide variety of hardware (aka real old); most applications pick a minspec for the CPU instruction set and use that. SSE2 is a pretty solid choice these days (and is also the guaranteed minimum for x64). However, if you write the math using D3DX and 'plain C/C++' it will work on all platforms.
  • The x86 ABI can't pass SSE types by value, so pointers (or references) are used instead.
  • The x64 ABI can pass SSE datatypes by value at the source level, but on the backend they are always passed by pointer, so only inlined code can keep the values in registers. With this restriction you might as well explicitly use pointers or references in the code, so you can see the 'cost' better, and also so it cross-compiles back to 32-bit.
  • A lot of 'basic' SIMD math operations take more than two arguments, which don't fit into existing C++ operators. This basically causes you to structure the lowest level of a math library in terms of C-like function primitives, which the operator overloads can use as needed to provide syntactic sugar.
  • Some functions return more than one value, which gets rather annoying without wrapping the result in a struct or some tuple container, so a lot of the time it's easier to just have multiple out arguments. For example: a function that computes sin and cos simultaneously can frequently be done at the same or similar cost as either sin or cos alone on quite a bit of hardware. Another example: matrix inversion functions can also return the determinant, as they have to compute it anyway as part of the inversion.
The bulk of the functions in D3DX are provided so you can operate on the types with a known working set of library functions. The speed really isn't there unless you use the platform intrinsics yourself and get perfect inlining out of them. The slower algorithms (matrix multiplication, matrix inversion, quaternion slerp) are an exception, as they do quite a bit of work per call, and you wouldn't normally inline a function that large anyway.

Microsoft did take the time to do some runtime patch-ups of the function calls to dispatch to CPU-specific functions (SSE, SSE2, etc.), so you basically end up with this mix of 'better than plain C code' and 'worse than pure SIMD code'.
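The primitive-plus-sugar layering described above can be sketched like this (names and types are illustrative, not D3DX's):

```cpp
// A C-like three-argument primitive at the bottom, with operator overloads
// as optional syntactic sugar on top, mirroring how D3DX keeps its lowest
// level C-like with out-parameters.
struct Vec3 { float x, y, z; };

// Primitive: three inputs don't fit a binary C++ operator, so expose a
// plain function with an out argument.
inline void Vec3MulAdd(Vec3& out, const Vec3& a, const Vec3& b, const Vec3& c)
{
    out.x = a.x * b.x + c.x;
    out.y = a.y * b.y + c.y;
    out.z = a.z * b.z + c.z;
}

// Sugar layered on top for the binary cases that do fit an operator.
inline Vec3 operator+(const Vec3& a, const Vec3& b)
{
    Vec3 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    return r;
}
```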

#4833250 Mip maps..... no understanding

Posted by on 09 July 2011 - 11:22 PM

The mipmaps themselves are (typically) a series of down-resed copies of the original image; each mip is half the size of the one above it. A full mip chain goes all the way down to 1x1 (regardless of whether the image is square or not). A complete mip chain is not technically required, but not providing one can cause some pretty bad performance problems in some cases.
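As a quick sketch of how long a full chain is (the helper name is mine, not any particular API's): halve each dimension, rounding down and clamping to 1, until you reach 1x1.

```cpp
#include <algorithm>
#include <cstdint>

// A full mip chain halves each dimension (clamped to 1) until 1x1, even
// for non-square images; e.g. a 256x64 texture has 9 levels.
uint32_t CountMipLevels(uint32_t width, uint32_t height)
{
    uint32_t levels = 1;
    while (width > 1 || height > 1)
    {
        width  = std::max(width  / 2, 1u);
        height = std::max(height / 2, 1u);
        ++levels;
    }
    return levels;
}
```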

The lower the resolution of the mipmap, the better it maps to the cache on the GPU, which speeds things up (quite a lot, actually). The hardware normally picks which mipmap level to display quite well automatically, except when working with screen-space style effects. The filtering modes work in three 'dimensions':

mag filter - filter applied when the image is up-resed (typically when you are already rendering the largest mip level and there isn't another one to switch to)
min filter - filter applied when the image is down-resed
mip filter - filter applied between mip levels (on, off, linear)

When the mip filter is set to linear, the hardware picks a blend of two mipmap levels to display, so the effect looks more seamless. The UV you feed into the fetch causes the hardware to fetch the color from two mip levels, and it automatically crossfades them together. If you set the mip filter to nearest, it will only fetch one mip, and this will typically generate seams in the world you render where the resolution of the texture jumps (as the hardware selects mip levels automatically in most cases). This is faster, however, since it only has to do half the work.

When the mag filter is set to linear, the hardware fetches a 2x2 block of pixels from a single mip level and crossfades them together with a bilinear filter. If the filter is set to anisotropic, it uses a proprietary multi-sample kernel to sample multiple sets of pixels from the image in various patterns. The number of samples corresponds to the anisotropy setting (from 2 to 16), at a substantial cost to performance in most cases. However, it helps maintain image quality when the polygons are nearly parallel to the camera, and this can be pretty important for text on signs, stripes on roads, and other objects that tend to mip to transparent values too fast (chain-link fences).

You can set the hardware in quite a few configurations, as these settings are more or less independent of each other.

#4832710 First person weapon with different FOV in deferred engine

Posted by on 08 July 2011 - 04:00 AM

We have done this with a forward renderer for Brothers in Arms: Hell's Highway and Borderlands. Aliens: Colonial Marines is deferred and uses nearly the same setup for the weapons. I've always called it the foreground FOV hack; it is becoming less hack-like and more like a real feature :) But it is still a hack, because the render pipelines on the platforms we develop for are fairly different from each other, and it sometimes needs some hand-holding when the pipeline is modified. We work pretty much entirely with DX9, so some approaches can be improved by moving to 10/11 (particularly in the case of sampling the depth buffer as a texture).

First off, the good news:

Provided the near and far planes are the same for both projections, the depth values will be equivalent. What is different is the FOV, so the screen-space XY positions of the pixels will come out different, which generally only matters when de-projecting a screen pixel back into the world. For most effects that do this, the depth is usable as-is for depth-based fog and whatnot. This only requires making your artists cope with the same near plane as the world (and they will beg and scream for a closer plane for scopes and things that get right up on the near plane, but you have to say NO! You get the custom FOV, but you get it with this limitation).
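The depth equivalence can be checked numerically: the standard D3D-style projected depth (z/w) contains no FOV term at all, only the near/far planes and the view-space depth, which is why a weapon rendered with a different FOV still produces depths comparable with the world's.

```cpp
#include <cmath>

// Projected depth (z/w) from the third row of a standard D3D perspective
// projection: z' = q*z - q*zn, with w = z and q = zf/(zf - zn).
// Note the FOV never appears.
float ProjectedDepth(float zView, float zNear, float zFar)
{
    const float q = zFar / (zFar - zNear);
    return q - q * zNear / zView;
}
```

Depth maps near to 0 and far to 1 regardless of the projection's FOV.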

And the bad news:

The depths are quite literally equivalent, which means the weapon will draw 'in the world' and have the rather annoying behavior of poking through walls you walk up to. So the fix is to render the gun several times, primarily for depth-only or stencil rendering. One possibility (and the one we use) is to render the gun with a viewport set to a depth range of 0/0.0000001 in order to get the gun to occlude the world. This is good for performance reasons, but bad if you have post-process effects that absolutely must be able to sample pixels from 'behind the gun'. This is a trade-off someone has to sign off on. Performance usually wins that argument though, so we have opted to have the guns occlude everything (including hardware occlusion queries!). Another possibility is to render a pass to create a stencil mask of the weapon and occlude with that, but there are some complications that need to be understood, which I will talk about down near the bottom of this post.

Forward renderers can, for the most part, just draw the gun later in the frame at their leisure, after clearing depth (another thing I need to explain later) and then drawing the gun. Deferred rendering doesn't have it as easy, as you need the gun to exist properly in the GBuffers when doing the lighting passes, both for performance reasons and to accept real-time shadowmaps properly along with the world.

More good news:

Aside from the case of the gun or the player needing to cast shadows, the weapon will generally look just fine lit (and shadowed!) with the not-quite-correct de-projected world-space position. The depth will be completely correct from the view origin's point of view, and the gun itself won't be too far from a correct XY screen position, so it will light and shadow just fine. UNLESS you attach a tiny light directly to the gun, at which point the light needs its position adjusted to be in the gun's coordinate system instead of the world's, so the gun looks correct when lit by the light. Muzzle-flash sprites and whatnot have a similar problem, but in reverse, in that the sprite needs to be placed in the world correctly relative to the gun's barrel.

More bad news:

Getting the gun into the GBuffer properly and without breaking performance can be a bit tricky. We store a version of the scene depth's W coordinate in the alpha channel of the framebuffer (which is also the same buffer as the GBuffer's emissive buffer when it is generated). This is true of the PC and PS3. Rendering is basically: clear depth, render gun depth to the super-tight viewport, render depth, render scene GBuffer, clear Z, render gun to GBuffer, then perform lighting, translucency, post-process, etc. We can clear Z in this setup because the rest of the engine reads the alpha-channel version of the depth for everything. The XBOX version reads the Z-buffer directly, so we have to preserve world depth values; instead of 'clear Z' we render the gun twice, once with a depth-always write and the second time with the traditional less-equal test. This is necessary because the viewport-clamped depths are not something you want the game to be using. This particular method is an extremely bad idea for PC DX9 hardware in general (NVIDIA's in particular).

The hardware is going to fight you:

You might be tempted to use oDepth in a shader. This is a bad idea, in that the hardware disables its early depth & stencil reject feature when the pixel shader outputs a custom depth. It is also not necessary for getting guns showing up correctly with a custom FOV. And it is a bad idea because you will also need to run a pixel shader when doing depth-only rendering, which is extremely slow (hardware LOVES rendering depth-only, no-pixel-shader setups!). This is also the same reason why you should limit the use of masked textures in shadowmap rendering, as they are significantly slower to render into the shadowmap (somewhere between 4 and 20x slower; it's kind of insane how big a difference it can be).

Getting it visually correct is not the real challenge. The real challenges lie in how many ways the hardware can break and performance can go off a cliff.

The early-depth and early-stencil reject behaviors of the hardware are particularly finicky, and get progressively worse the older the hardware is. NVIDIA's names for these culls are ZCull and SCull. ATI (AMD, whatever) calls them Hi-Z and Hi-Stencil. These early rejects can be disabled both by some combinations of render states and by changing your depth-test direction in the middle of the frame. When these early rejects are not working, your pixel shaders will execute for all pixels, even if the depth or stencil tests kill the pixels. The result will be visually correct, but the official location for the depth and stencil tests is after the pixel shader.

Writing depth or stencil while testing depth or stencil will disable the early-reject for the draw calls doing this. This is sometimes unavoidable, but luckily only affects the specific draw calls that are setup this way.

On a lot of NVIDIA hardware, if you change the depth-test direction (like I mentioned doing a pass of 'always' before 'lessequal' in order to fix the Z-buffer on the XBOX), ZCull and SCull will be disabled UNTIL THE NEXT DEPTH & STENCIL CLEAR. I expect this to be better or a non-problem with the GeForce 280 series and newer, but haven't looked into it for sure. This also means you should always clear both at least at the start of the frame (and use the APIs to do it, not render a quad). This also makes the trick of alternating lessequal/greaterequal depth tests every other frame to avoid depth clears a colossally bad idea.

The early-stencil test is very limited. On most hardware it pretty much caches the result of a group of stencil tests over some block-size number of pixels and compresses it down to a few bits. This means that using the stencil buffer for storing anything other than 0 and 'non-zero' is pretty much worthless, and if you test for anything other than ==0 or !=0, the early stencil reject is not likely to work for you. It also means sharing the stencil buffer with a second mask is extremely difficult if you care about performance, and I definitely don't recommend trying it unless you can afford a second depth buffer with its own stencil buffer.

#4826787 Write BITMAPINFOHEADER image data to IDirect3DTexture9

Posted by on 23 June 2011 - 08:11 AM

A managed texture should be fine for most purposes.

The pBits pointer from the lock is the address of the upper left corner of the texture.

The lock structure contains how many bytes to advance the pointer to get to the next row (it might be padded!).

A general texture-copy loop that doesn't require format conversion (ARGB-to-BGRA swizzling, etc.) can use memcpy, but needs to be written correctly:

const BYTE* src = bitmapinfo.pointer.whatever;
size_t numBytesPerRowSrc = bitmapinfo.width * (bitmapinfo.bitsperpixel / 8); // warning: pseudocode

BYTE* dst = (BYTE*)lock.pBits;
size_t numBytesPerRowDst = lock.Pitch;

size_t numBytesToCopyPerRow = min(numBytesPerRowSrc, numBytesPerRowDst);

for (size_t y = 0; y < numRows; ++y)
{
  memcpy(dst, src, numBytesToCopyPerRow);
  src += numBytesPerRowSrc;
  dst += numBytesPerRowDst;
}

As an optimization, you can test whether (numBytesPerRowSrc == numBytesPerRowDst) and do it with a single memcpy instead of a loop. You are most likely to run into the pitch being padded with non-power-of-two textures, and in other special cases (a 1x2 U8V8 texture, whose natural row is 2 bytes, will likely yield a pitch of at least 4 bytes, for instance).

If you need to do format conversion, it's easiest to cast the two pointers to structs mapped to the pixel layouts and replace the memcpy with custom code.
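A minimal sketch of that struct-cast approach, assuming hypothetical BGRA-to-RGBA layouts (the struct and function names are illustrative):

```cpp
#include <cstdint>
#include <cstddef>

// Map each pixel layout to a struct and swizzle per pixel instead of memcpy.
struct PixelBGRA { uint8_t b, g, r, a; };
struct PixelRGBA { uint8_t r, g, b, a; };

// Converts one row; call once per row, advancing by each surface's pitch.
void ConvertRowBGRAtoRGBA(const void* srcRow, void* dstRow, size_t numPixels)
{
    const PixelBGRA* src = (const PixelBGRA*)srcRow;
    PixelRGBA* dst = (PixelRGBA*)dstRow;
    for (size_t i = 0; i < numPixels; ++i)
    {
        dst[i].r = src[i].r;
        dst[i].g = src[i].g;
        dst[i].b = src[i].b;
        dst[i].a = src[i].a;
    }
}
```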

#4823820 Depth of Field

Posted by on 15 June 2011 - 04:05 PM

There is no one right answer, but there are many wrong ones :)

Typically a pipeline looks something like this:

depth pass
opaque pass (possibly with forward lighting)
opaque lighting pass (either deferred or additional passes for multiple lights)
opaque post processing (screen space depth fog)
translucency pass (glass, particles, etc)
post processing (dof, bloom, distortion, color correction, etc)
final post processing (gamma correction)

SSAO is ideally part of the lighting pass (and even better if it can be made to affect only diffuse lighting).
Screen-space fog is easy to do for opaque surfaces (as they have easy-to-access depth information), but then you need to solve fog for translucency.
In Unreal, DOF and bloom are frequently combined into the same operation; this restricts the bloom kernel quite a bit, but it is fast.

So to answer the question: if it is wrong, change it. Screen space algorithms are pretty easy to swap or merge, especially compared to rendering a normal scene. A natural order should fall out pretty quickly.

#4818883 For being good game programmer ,good to use SDL?

Posted by on 02 June 2011 - 04:41 PM

I've never seen much of a point for using SDL for a few reasons:

  • It probably has the wrong open-source license (it's unusable as-is on platforms that don't have DLL support), and for the platforms where you would need the commercial license, it won't come with code to run on that platform (PS3, XBOX, etc.) anyway, which more or less makes me wonder why the commercial license of SDL costs money.
  • You still have to deal with shaders (GLSL, Cg, HLSL), which arguably is where a huge portion of the code is going to live. Supporting more than one flavor is a huge amount of work, which can be mitigated with a language-neutral shader generator (editor, etc.), which is itself a huge amount of work to create.
  • Graphics APIs, in the grand scheme of things, aren't all that complicated, since the hardware does all of the work; using the APIs raw, or even making a basic wrapper for one, is a pretty trivial thing to do.
  • For C++, the really time-consuming things end up being serialization, proper localization support and string handling, and memory management (multiple heaps, streaming textures, etc.).