Member Since 03 Jul 2006
### #5309078 Faster Sin and Cos

Posted on 01 September 2016 - 05:52 PM

You can make those functions significantly faster (at least on PC) by rearranging the expressions to cut down on dependencies between instructions. This lets the CPU pipeline work more efficiently.

The downside is that you lose a little accuracy. You may be able to get some of that back by tweaking the constants, and/or the bracketing for the adds.

Here's what I ended up with:

```
#include <cstdint>  // int32_t

// PI is assumed to be defined elsewhere as a float constant (approximately 3.14159265f).

float Sin(float x)
{
    // Remove a whole number of multiples of PI, leaving a remainder in (-PI, PI).
    int32_t i32I = int32_t( x * (1.0f / PI) );
    x = (x - float( i32I ) * PI);

    // Compute the powers up front so the term multiplies don't form one long dependency chain.
    float fX2 = x * x;
    float fX4 = fX2 * fX2;
    float fX6 = fX2 * fX4;
    float fX8 = fX4 * fX4;
    float fX10 = fX6 * fX4;
    float fX12 = fX6 * fX6;
    float fX14 = fX6 * fX8;

    // An odd multiple of PI was removed: sin(x + PI) = -sin(x), so flip the sign.
    return (i32I & 1) ?
        -x * (float( 1.00000000000000000000e+00 ) +
            (fX2 * float( -1.66666671633720397949e-01 )) +
            ((fX4 * float( 8.33333376795053482056e-03 )) +
            (fX6 * float( -1.98412497411482036114e-04 ))) +
            ((fX8 * float( 2.75565571428160183132e-06 )) +
            (fX10 * float( -2.50368472620721149724e-08 ))) +
            ((fX12 * float( 1.58849267073435385100e-10 )) +
            (fX14 * float( -6.58925550841432672300e-13 )))
        ):
        x * (float( 1.00000000000000000000e+00 ) +
            (fX2 * float( -1.66666671633720397949e-01 )) +
            ((fX4 * float( 8.33333376795053482056e-03 )) +
            (fX6 * float( -1.98412497411482036114e-04 ))) +
            ((fX8 * float( 2.75565571428160183132e-06 )) +
            (fX10 * float( -2.50368472620721149724e-08 ))) +
            ((fX12 * float( 1.58849267073435385100e-10 )) +
            (fX14 * float( -6.58925550841432672300e-13 )))
        );
}

float Cos(float x)
{
    // Same range reduction as Sin.
    int32_t i32I = int32_t( x * (1.0f / PI) );
    x = (x - float( i32I ) * PI);

    float fX2 = x * x;
    float fX4 = fX2 * fX2;
    float fX6 = fX2 * fX4;
    float fX8 = fX4 * fX4;
    float fX10 = fX6 * fX4;
    float fX12 = fX6 * fX6;
    float fX14 = fX6 * fX8;

    // An odd multiple of PI was removed: cos(x + PI) = -cos(x), so negate the whole series.
    return (i32I & 1) ?
        float( -1.00000000000000000000e+00 ) - (
            (fX2 * float( -5.00000000000000000000e-01 )) +
            ((fX4 * float( 4.16666641831398010254e-02 )) +
            (fX6 * float( -1.38888671062886714935e-03 ))) +
            ((fX8 * float( 2.48006890615215525031e-05 )) +
            (fX10 * float( -2.75369927749125054106e-07 ))) +
            ((fX12 * float( 2.06207229069832465029e-09 )) +
            (fX14 * float( -9.77507137733812925262e-12 )))
        ) :
        float( 1.00000000000000000000e+00 ) + (
            (fX2 * float( -5.00000000000000000000e-01 )) +
            ((fX4 * float( 4.16666641831398010254e-02 )) +
            (fX6 * float( -1.38888671062886714935e-03 ))) +
            ((fX8 * float( 2.48006890615215525031e-05 )) +
            (fX10 * float( -2.75369927749125054106e-07 ))) +
            ((fX12 * float( 2.06207229069832465029e-09 )) +
            (fX14 * float( -9.77507137733812925262e-12 )))
        );
}
```

I tested this in a VS 2015 x64 release build. YMMV.
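To make the dependency point concrete, here is a cut-down sketch of the same idea using only four plain Taylor terms (not the tuned constants above) and no range reduction, so it is only meaningful for small |x|. The function names are made up for illustration. In the Horner form each multiply-add has to wait for the previous one, while in the second form the powers and the two partial sums can be computed independently.

```cpp
#include <cassert>
#include <cmath>

// Plain Taylor coefficients of sin(x)/x in powers of x^2 (illustrative only).
constexpr float kC0 = 1.0f;
constexpr float kC1 = -1.0f / 6.0f;
constexpr float kC2 = 1.0f / 120.0f;
constexpr float kC3 = -1.0f / 5040.0f;

// Horner form: one serial chain, each step needs the result of the step before.
float SinHorner(float x)
{
    const float x2 = x * x;
    return x * (kC0 + x2 * (kC1 + x2 * (kC2 + x2 * kC3)));
}

// Rearranged form: powers are computed up front and the two partial sums
// do not depend on each other, so the CPU can overlap them.
float SinPairwise(float x)
{
    const float x2 = x * x;
    const float x4 = x2 * x2;
    const float x6 = x2 * x4;
    const float low = kC0 + x2 * kC1;
    const float high = x4 * kC2 + x6 * kC3;
    return x * (low + high);
}

// Quick sanity check: both forms agree closely with each other and with std::sin.
void CheckSinForms()
{
    const float x = 0.5f;
    assert(std::fabs(SinHorner(x) - SinPairwise(x)) < 1e-6f);
    assert(std::fabs(SinPairwise(x) - std::sin(x)) < 1e-6f);
}
```

The two forms round differently, which is where the small accuracy loss comes from. Whether the rearranged form is actually faster depends on the compiler and CPU, so it needs measuring.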

### #5308678 win10: no native support for pre-dx10 games. additional runtime required?

Posted on 30 August 2016 - 09:42 AM

> its my understanding that d3dx versions are all backwardly compatible. so all you need is the version your game uses or any newer version. and in general, you don't need every version used by every game installed, just the newest, which can be used by all

This isn't the case - each D3DX version uses a different DLL. You can actually customize the redistributable for your game to only include the version of D3DX that you actually need. This lets you cut down the download size.

### #5305722 lpCmdLine, open with and spaces

Posted on 14 August 2016 - 05:22 AM

You have one small bug there - you're not freeing the memory correctly.

From the documentation:

> CommandLineToArgvW allocates a block of contiguous memory for pointers to the argument strings, and for the argument strings themselves; the calling application must free the memory used by the argument list when it is no longer needed. To free the memory, use a single call to the LocalFree function.
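As a rough sketch of what that looks like in practice: copy the arguments out, then free the whole block with one LocalFree call. The helper name GetCommandLineArgs is made up for illustration, and the non-Windows branch only exists so the snippet compiles elsewhere (it returns an empty list).

```cpp
#include <cassert>
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW (link with Shell32.lib)
#endif

// Hypothetical helper: returns the process arguments as wide strings.
std::vector<std::wstring> GetCommandLineArgs()
{
    std::vector<std::wstring> args;
#ifdef _WIN32
    int argc = 0;
    LPWSTR* argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv != nullptr)
    {
        assert(argc >= 0);
        // Copy the strings so nothing points into the block after it is freed.
        args.assign(argv, argv + argc);
        // One call frees both the pointer array and the strings.
        // Do not free the individual argv[i] entries.
        LocalFree(argv);
    }
#endif
    return args;
}
```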

### #5305681 lpCmdLine, open with and spaces

Posted on 13 August 2016 - 05:47 PM

The standard Windows function to parse the command line into the more usual argc and argv format is CommandLineToArgvW().

It does handle quoted arguments.

### #5304761 For Loop Max Bug

Posted on 08 August 2016 - 04:40 PM

> However, this means, that when adding points to the volume, I either have to ADD ONE to the maxes when adding points, or add one to the maxes AFTER adding all the points. Adding one to the maxes during the add is a needless extra computation (when you are dealing with thousands of points). But it is possible to FORGET to add one to the maxes after adding the points, leading to subtle bugs, which should be something I am detecting at compile time, rather than relying on remembering to deal with.

I'd recommend writing code primarily for readability and correctness, unless it involves a big sacrifice of performance to do so. You can always try to optimize any bottlenecks found by your profiler later on.

In that particular case there's a good chance that the compiler will work out that it can move the +1 outside of the loop anyway.

It sounds like you might also benefit from some unit tests, so that you can find out about things that get broken at compile time (assuming you run the tests at the end of each successful build).
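As a minimal sketch of both suggestions, assuming a simplified 1D integer volume (the Bounds type and its members are hypothetical, not the code from the thread): store the inclusive max, apply the +1 in exactly one accessor so nobody has to remember it, and pin the behaviour down with a small test.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical 1D bounds. The max is stored inclusive; End() is the only
// place that applies the +1, so loops can use the usual half-open form.
struct Bounds
{
    int min = INT_MAX;
    int max = INT_MIN;  // inclusive

    void Add(int p)
    {
        min = std::min(min, p);
        max = std::max(max, p);
    }

    int End() const { return max + 1; }  // exclusive upper bound
};

// Counts the cells covered by the bounds using a half-open loop.
int CountCells(const Bounds& b)
{
    int n = 0;
    for (int i = b.min; i < b.End(); ++i)
        ++n;
    return n;
}

// Unit test: the cell containing the largest point must be included.
void TestBoundsIncludeMaxPoint()
{
    Bounds b;
    b.Add(2);
    b.Add(5);
    b.Add(3);
    assert(b.min == 2);
    assert(b.End() == 6);
    assert(CountCells(b) == 4);  // cells 2, 3, 4, 5
}
```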

Posted on 30 July 2016 - 02:40 PM

Here's a few possibilities:

- The libraries weren't available when the game engine was created. For example PPL first appeared with Visual Studio 2010.

- The libraries may not support all the platforms that the game engine does.

- They may not support the features required for the game (OpenMP certainly falls short there).

- Cost. As far as I can tell, TBB isn't free for commercial use.

### #5296730 DirectXMath - storing transform matrices

Posted on 15 June 2016 - 03:19 PM

If you use the "vectorcall" calling convention it should be returned from the function in registers.

### #5296537 DirectXMath - storing transform matrices

Posted on 14 June 2016 - 05:15 PM

The simple option is to override global operator new and call _aligned_malloc (or similar) from it. If you do that you can make every heap allocation 16-byte aligned, with only a few lines of code in one place.

Doing that also means you can do things like putting aligned types in a std::vector.

Note that there's several variants of operator new and delete that you'll need to replace - you want to change the non-throwing variants as well.

Also note that if you compile as x64 instead of x86 then you don't need to do any of that. The standard heap allocation alignment on 64-bit Windows is 16 bytes.
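For illustration, a replacement along those lines might look roughly like this. It is a sketch, not production code: the helper names are made up, it skips the std::new_handler loop a fully conforming operator new would have, and the posix_memalign branch is only there so it also builds with non-Microsoft compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#if defined(_MSC_VER)
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif

namespace
{
constexpr std::size_t kAlignment = 16;

void* AlignedAlloc(std::size_t size) noexcept
{
    if (size == 0)
        size = 1;  // operator new must return a unique pointer even for size 0
#if defined(_MSC_VER)
    return _aligned_malloc(size, kAlignment);
#else
    void* p = nullptr;
    return posix_memalign(&p, kAlignment, size) == 0 ? p : nullptr;
#endif
}

void AlignedFree(void* p) noexcept
{
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    std::free(p);
#endif
}
}  // namespace

// Throwing variants.
void* operator new(std::size_t size)
{
    if (void* p = AlignedAlloc(size))
        return p;
    throw std::bad_alloc();
}
void* operator new[](std::size_t size)
{
    if (void* p = AlignedAlloc(size))
        return p;
    throw std::bad_alloc();
}

// Non-throwing variants.
void* operator new(std::size_t size, const std::nothrow_t&) noexcept { return AlignedAlloc(size); }
void* operator new[](std::size_t size, const std::nothrow_t&) noexcept { return AlignedAlloc(size); }

// Every delete variant has to match the allocator used above.
void operator delete(void* p) noexcept { AlignedFree(p); }
void operator delete[](void* p) noexcept { AlignedFree(p); }
void operator delete(void* p, std::size_t) noexcept { AlignedFree(p); }
void operator delete[](void* p, std::size_t) noexcept { AlignedFree(p); }
void operator delete(void* p, const std::nothrow_t&) noexcept { AlignedFree(p); }
void operator delete[](void* p, const std::nothrow_t&) noexcept { AlignedFree(p); }

// Sanity check: heap allocations come back 16-byte aligned.
void CheckAlignedNew()
{
    int* p = new int(7);
    assert(reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0);
    delete p;
}
```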

### #5296197 whats the "Big O" of this algo?

Posted on 12 June 2016 - 05:27 AM

For reference, the sort algorithm that the standard library implements as std::sort() is usually https://en.wikipedia.org/wiki/Introsort which has an O(n log n) average and worst case performance, but it could also be something else.

https://en.wikipedia.org/wiki/Sorting_algorithm has a big table comparing performance and memory use of a variety of sort algorithms.

### #5296030 Trivially trivial?

Posted on 10 June 2016 - 05:32 PM

You could set up your own trait called, say, IsRelocatable<T>. Other people have done it that way: https://github.com/facebook/folly/blob/master/folly/docs/Traits.md

I found some discussions at https://groups.google.com/a/isocpp.org/forum/#!topic/std-discussion/wphImiqfX7Y[1-25] which suggest that some possible implementations of virtual functions aren't compatible with memcpy() but I don't know of any compilers where that's actually true.
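A minimal sketch of such a trait, assuming it defaults to std::is_trivially_copyable and that individual types opt in by specializing it. The Handle type and the Relocate helper are hypothetical, and using memcpy on a type that is not trivially copyable is strictly outside what the standard guarantees, which is the risk the discussion above is about.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>
#include <string>
#include <type_traits>
#include <utility>

// Defaults to the standard trait; types can opt in by specializing.
template <typename T>
struct IsRelocatable : std::is_trivially_copyable<T> {};

// Hypothetical type: the user-provided destructor makes it not trivially
// copyable, but we promise that moving its bytes is safe.
struct Handle
{
    int id;
    ~Handle() {}
};
template <>
struct IsRelocatable<Handle> : std::true_type {};

// Relocatable types: move the bytes and never run the source destructors.
template <typename T>
void RelocateImpl(T* dst, T* src, std::size_t n, std::true_type)
{
    std::memcpy(static_cast<void*>(dst), static_cast<const void*>(src), n * sizeof(T));
}

// Everything else: move-construct into dst, then destroy the source object.
template <typename T>
void RelocateImpl(T* dst, T* src, std::size_t n, std::false_type)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        ::new (static_cast<void*>(dst + i)) T(std::move(src[i]));
        src[i].~T();
    }
}

// dst must point at uninitialized storage for n objects.
template <typename T>
void Relocate(T* dst, T* src, std::size_t n)
{
    RelocateImpl(dst, src, n, std::integral_constant<bool, IsRelocatable<T>::value>());
}

void CheckTraits()
{
    assert(IsRelocatable<int>::value);
    assert(!std::is_trivially_copyable<Handle>::value);
    assert(IsRelocatable<Handle>::value);
    assert(!IsRelocatable<std::string>::value);
}
```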

### #5295019 [SharpDX] DXGI Adapter Video Memory problem

Posted on 04 June 2016 - 05:48 PM

Looking at the documentation https://msdn.microsoft.com/en-us/library/windows/desktop/bb173058%28v=vs.85%29.aspx you can see the size is stored in a SIZE_T type, which on a 32-bit PC will be an unsigned 32-bit integer.

If you treat -1073741824 as an unsigned value, it's actually 3GB.

It's possible you will get a different result if you compile your program as 64-bit.
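The arithmetic is easy to check. This sketch is in C++ rather than C#, and the helper names are made up: it reinterprets the signed 32-bit reading as unsigned and converts it to GiB.

```cpp
#include <cassert>
#include <cstdint>

// Reinterpret a signed 32-bit reading as the unsigned value it really is.
constexpr std::uint32_t AsUnsigned(std::int32_t value)
{
    return static_cast<std::uint32_t>(value);
}

constexpr double ToGiB(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

void CheckVideoMemoryValue()
{
    // 2^32 - 1073741824 = 3221225472 bytes, which is exactly 3 GiB.
    assert(AsUnsigned(-1073741824) == 3221225472u);
    assert(ToGiB(AsUnsigned(-1073741824)) == 3.0);
}
```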

### #5293449 DX9 Doubling Memory Usage in x64

Posted on 25 May 2016 - 05:11 PM

I can guess at one possibility. Some time ago Microsoft optimized the address space usage of D3D9 on Vista (see https://support.microsoft.com/en-gb/kb/940105). It's possible that the optimization was only applied to x86, as you're not going to run out of address space on x64.

Is this extra memory usage actually causing a significant, measurable performance issue? If not I wouldn't worry about it.

If you really want to investigate what's going on, I'd suggest creating the simplest possible test program that shows the memory usage difference, and using a tool like https://technet.microsoft.com/en-us/sysinternals/vmmap.aspx to investigate how memory gets allocated differently.

### #5287244 Downsampling texture size importance

Posted on 16 April 2016 - 07:21 PM

To render to part of a render target, what you usually want to do is to adjust the viewport. This allows you to render anything you want - it will handle the scaling and clipping for you.

The only downside is that when sampling from the render target, you can't clamp or wrap at the edge of the viewport, so it's easy to read outside of the viewport that you wrote to.
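One way to work around that is to clamp the coordinates yourself before sampling. The sketch below shows the arithmetic in C++ with hypothetical types; in practice this would live in the shader. It maps a 0..1 coordinate inside the viewport to a UV in the full render target and clamps it to the centres of the edge texels, assuming bilinear filtering and a viewport given in pixels.

```cpp
#include <algorithm>
#include <cassert>

struct Viewport { float x, y, width, height; };  // in pixels
struct UV { float u, v; };

// Maps a viewport-local coordinate to a render target UV, clamped so that a
// bilinear fetch never blends in texels from outside the viewport.
UV ViewportToTargetUV(UV local, const Viewport& vp, float targetWidth, float targetHeight)
{
    // Half a texel in from each edge of the viewport.
    const float minU = (vp.x + 0.5f) / targetWidth;
    const float maxU = (vp.x + vp.width - 0.5f) / targetWidth;
    const float minV = (vp.y + 0.5f) / targetHeight;
    const float maxV = (vp.y + vp.height - 0.5f) / targetHeight;

    const float u = (vp.x + local.u * vp.width) / targetWidth;
    const float v = (vp.y + local.v * vp.height) / targetHeight;

    return { std::min(std::max(u, minU), maxU), std::min(std::max(v, minV), maxV) };
}

void CheckViewportClamp()
{
    const Viewport vp = { 0.0f, 0.0f, 512.0f, 512.0f };
    // The centre of the viewport is a quarter of the way across a 1024 target.
    const UV mid = ViewportToTargetUV({ 0.5f, 0.5f }, vp, 1024.0f, 1024.0f);
    assert(mid.u == 0.25f && mid.v == 0.25f);
}
```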

### #5285988 D3D alternative for OpenGL gl_BaseInstanceARB

Posted on 09 April 2016 - 05:31 AM

The standard technique to draw multiple copies of the same thing in D3D is called instancing.

There's a decent explanation of how to do that in D3D9 at: https://msdn.microsoft.com/en-us/library/windows/desktop/bb173349%28v=vs.85%29.aspx

You can do the same thing in D3D11, but the API is a bit different. There's some example code at: http://www.rastertek.com/dx11tut37.html

### #5283954 Need help understanding line in code snippet for linked list queue please.

Posted on 28 March 2016 - 05:49 PM

That code appears to be formatted to fit in limited vertical space for printing; that is, it's deliberately making the code less readable so that it fits on one page. It's also using three xors to swap two variables, instead of using std::swap(), which is much more readable (and probably faster too).
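For comparison, here is a small sketch of the two approaches (XorSwap is a made-up name for the three-xor trick). Besides being harder to read, the xor version silently zeroes the value if both arguments refer to the same variable.

```cpp
#include <cassert>
#include <utility>

// The three-xor trick. If a and b are the same object, it ends up as zero.
void XorSwap(int& a, int& b)
{
    a ^= b;
    b ^= a;
    a ^= b;
}

void CheckSwaps()
{
    int a = 3, b = 5;
    XorSwap(a, b);
    assert(a == 5 && b == 3);

    std::swap(a, b);  // says what it does, and works for any type
    assert(a == 3 && b == 5);

    int c = 7;
    XorSwap(c, c);    // aliasing pitfall
    assert(c == 0);
}
```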

The line of code in question would be easier to read if it was split up into two or three separate statements:

```
++count;
int operandCount = (n/2) - 1;
if ( count > operandCount )
    break;
```

I think the reasoning behind the test is that when there's only binary operators available, there will always be a known ratio of operators to operands, and that is testing for it. It's not a very nice way to detect the end of the expression though.

That is:

- With an input of length 3, you must have one operator and two operands.

- Input of length 4 is invalid.

- Input of length 5 will always have two operators.

- etc.
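The pattern in that list can be written down directly. This sketch assumes every token is either an operand or a binary operator, so a valid input has an odd length n with (n - 1) / 2 operators and (n + 1) / 2 operands. The function names are made up for illustration.

```cpp
#include <cassert>

// With only binary operators, a valid expression has an odd number of tokens.
bool IsValidLength(int n)
{
    return n >= 1 && n % 2 == 1;
}

int OperatorCount(int n)
{
    assert(IsValidLength(n));
    return (n - 1) / 2;
}

int OperandCount(int n)
{
    assert(IsValidLength(n));
    return (n + 1) / 2;
}

void CheckCounts()
{
    assert(OperatorCount(3) == 1 && OperandCount(3) == 2);
    assert(!IsValidLength(4));
    assert(OperatorCount(5) == 2 && OperandCount(5) == 3);
}
```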
