Member Since 03 Jul 2006

#5293449 DX9 Doubling Memory Usage in x64

Posted by Adam_42 on 25 May 2016 - 05:11 PM

I can guess at one possibility. Some time ago Microsoft optimized the address space usage of D3D9 on Vista (see https://support.microsoft.com/en-gb/kb/940105 ). It's possible that the optimization was only applied to x86, as you're not going to run out of address space on x64.


Is this extra memory usage actually causing a significant, measurable performance issue? If not I wouldn't worry about it.


If you really want to investigate what's going on, I'd suggest creating the simplest possible test program that shows the memory usage difference, and using a tool like https://technet.microsoft.com/en-us/sysinternals/vmmap.aspx to investigate how memory gets allocated differently.

#5287244 Downsampling texture size importance

Posted by Adam_42 on 16 April 2016 - 07:21 PM

To render to part of a render target, what you usually want to do is to adjust the viewport. This allows you to render anything you want - it will handle the scaling and clipping for you.


The only downside is that when sampling from the render target, you can't clamp or wrap at the edge of the viewport, so it's easy to read outside of the viewport that you wrote to.
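As a rough sketch of one way to work around that, you can clamp the texture coordinates yourself so they stay half a texel inside the viewport. The maths is shown in C++ here for clarity; in practice it would live in the shader. The Rect and UV types and the viewportToTextureUV name are made up for this example and are not part of any graphics API.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper types for illustration only.
struct Rect { float x, y, width, height; };   // viewport rectangle in pixels
struct UV { float u, v; };

// Maps a 0..1 coordinate inside the viewport to a UV in the full render
// target, clamped half a texel inside the viewport so that bilinear
// filtering does not read texels outside the region that was rendered.
UV viewportToTextureUV(UV local, Rect viewport, float texWidth, float texHeight)
{
    assert(texWidth > 0.0f && texHeight > 0.0f);

    const float minU = (viewport.x + 0.5f) / texWidth;
    const float maxU = (viewport.x + viewport.width - 0.5f) / texWidth;
    const float minV = (viewport.y + 0.5f) / texHeight;
    const float maxV = (viewport.y + viewport.height - 0.5f) / texHeight;

    const float u = (viewport.x + local.u * viewport.width) / texWidth;
    const float v = (viewport.y + local.v * viewport.height) / texHeight;

    UV out;
    out.u = std::min(std::max(u, minU), maxU);
    out.v = std::min(std::max(v, minV), maxV);
    return out;
}
```

This only emulates clamp addressing. Emulating wrap at the viewport edge would need extra work in the shader.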

#5285988 D3D alternative for OpenGL gl_BaseInstanceARB

Posted by Adam_42 on 09 April 2016 - 05:31 AM

The standard technique to draw multiple copies of the same thing in D3D is called instancing.


There's a decent explanation of how to do that in D3D9 at: https://msdn.microsoft.com/en-us/library/windows/desktop/bb173349%28v=vs.85%29.aspx


You can do the same thing in D3D11, but the API is a bit different. There's some example code at: http://www.rastertek.com/dx11tut37.html

#5283954 Need help understanding line in code snippet for linked list queue please.

Posted by Adam_42 on 28 March 2016 - 05:49 PM

That code appears to be formatted to fit in limited vertical space for printing; that is, it deliberately sacrifices readability to fit on one page. It also uses three XORs to swap two variables instead of std::swap(), which is much more readable (and probably faster too).
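For comparison, here's a minimal sketch of the XOR trick next to std::swap(). The function names are made up for this example. Besides being harder to read, the XOR version has a trap: if both references name the same variable, it zeroes it.

```cpp
#include <cassert>
#include <utility>

// The XOR trick: three dependent operations, and it destroys the value
// when both references refer to the same object.
void xorSwap(int& a, int& b)
{
    a ^= b;
    b ^= a;
    a ^= b;
}

// The readable alternative.
void plainSwap(int& a, int& b)
{
    std::swap(a, b);
}

void swapDemo()
{
    int a = 3, b = 5;
    xorSwap(a, b);
    assert(a == 5 && b == 3);

    int c = 7;
    xorSwap(c, c);      // aliasing: c ^ c == 0
    assert(c == 0);

    int d = 7;
    plainSwap(d, d);    // std::swap leaves a self-swap unchanged
    assert(d == 7);
}
```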


The line of code in question would be easier to read if it was split up into two or three separate statements:

int operandCount = (n/2) - 1;
if ( count > operandCount )

I think the reasoning behind the test is that when there's only binary operators available, there will always be a known ratio of operators to operands, and that is testing for it. It's not a very nice way to detect the end of the expression though.


That is:

- With an input of length 3, you must have one operator and two operands.

- Input of length 4 is invalid.

- Input of length 5 will always have two operators.

- etc.
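That ratio can be sketched in code as follows. This assumes every operand is a single token and that only binary operators exist. The TokenCounts struct and countTokens name are hypothetical, and treating a lone operand (length 1) as valid is my own assumption.

```cpp
#include <cassert>

// Hypothetical result type for this example.
struct TokenCounts {
    bool valid;
    int operators;
    int operands;
};

// With only binary operators, k operators always need k + 1 operands,
// so a valid expression has an odd length of 2k + 1 tokens.
TokenCounts countTokens(int length)
{
    assert(length >= 0);
    if (length < 1 || length % 2 == 0)
        return TokenCounts{false, 0, 0};

    const int operators = (length - 1) / 2;
    return TokenCounts{true, operators, operators + 1};
}
```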

#5281760 casting double* to float*

Posted by Adam_42 on 17 March 2016 - 04:23 PM

The most significant performance hit for using doubles will probably come from the CPU cache misses and memory bandwidth caused by them taking up twice as much memory.


In addition SSE instructions can handle four floats at a time, but only two doubles at a time. So for code which uses them the performance hit can be significant.
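To put rough numbers on that, here's a small sketch. The 64-byte cache line is an assumption (it's typical for current x86 CPUs but not guaranteed), and the helper names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Assumed sizes for illustration: 64-byte cache lines, 128-bit SSE registers.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kSseRegisterBytes = 16;

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "this example assumes 4-byte floats and 8-byte doubles");

// How many values of type T fit in a block of the given size.
template <typename T>
constexpr std::size_t elementsPer(std::size_t bytes)
{
    return bytes / sizeof(T);
}

// Bytes needed to store `count` values of type T.
template <typename T>
std::size_t arrayBytes(std::size_t count)
{
    assert(count <= static_cast<std::size_t>(-1) / sizeof(T));  // overflow guard
    return count * sizeof(T);
}
```

With those assumptions an SSE register holds 4 floats or 2 doubles, a cache line holds 16 floats or 8 doubles, and an array of a million values takes about 4MB as floats versus 8MB as doubles.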


For basic operations on values in registers, floats aren't significantly faster than doubles. For some more complex operations (like division) doubles will be slower than floats, as they have more precision, but overall you probably won't notice much difference.


For details on specific instructions look at Intel's optimization manual - http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html (for example you can compare the divsd and divss instructions there, to see the timings for float vs double division).

#5278428 How to find the cause of framedrops

Posted by Adam_42 on 27 February 2016 - 06:38 AM

One possible cause of stalls is that drivers tend to upload textures (and other resources like shaders) on first use.


This can cause a noticeable stall if a large number of things get used for the first time on one frame.



From: https://fgiesen.wordpress.com/2011/07/01/a-trip-through-the-graphics-pipeline-2011-part-1/



Incidentally, this is also the reason why you’ll often see a delay the first time you use a new shader or resource; a lot of the creation/compilation work is deferred by the driver and only executed when it’s actually necessary (you wouldn’t believe how much unused crap some apps create!). Graphics programmers know the other side of the story – if you want to make sure something is actually created (as opposed to just having memory reserved), you need to issue a dummy draw call that uses it to “warm it up”. Ugly and annoying, but this has been the case since I first started using 3D hardware in 1999 – meaning, it’s pretty much a fact of life by this point, so get used to it. :)


#5275813 MSVC generating much slower code compared to GCC

Posted by Adam_42 on 15 February 2016 - 03:44 PM

There's one thing Visual Studio can do to trip up performance measurements if you're not aware of it, which has nothing to do with the compiler.


If you run a program by pressing F5 then you will get the Windows debug heap enabled, which is much, much slower than the non-debug one.


The simple workaround is to launch it without the debugger attached by using Control+F5 if you're doing performance testing.
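If you want to see the effect for yourself, something like the following rough micro-benchmark should do. It's only a sketch: the function name is made up, and the numbers will vary a lot between machines and builds. Run it once with F5 and once with Control+F5 and compare.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <vector>

// Times `count` small heap allocations plus their frees, in milliseconds.
double timeSmallAllocationsMs(std::size_t count)
{
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    {
        std::vector<std::unique_ptr<int>> blocks;
        blocks.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            blocks.emplace_back(new int(static_cast<int>(i)));
    }   // every block is freed here, inside the timed region
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling it with a large count (say a million) and printing the result is enough to compare the two launch methods.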

#5275681 Multithreading Nowadays

Posted by Adam_42 on 14 February 2016 - 06:18 PM

I think at least for games, another significant problem with hyperthreading is that many workloads don't scale perfectly with core count. See https://en.wikipedia.org/wiki/Amdahl's_law for one cause of poor scaling.


That is, even if you compare say two core performance to four core performance without any hyperthreading, then the four cores probably won't be exactly double the speed of two cores. It might get say 1.8 times faster instead.


This means that the performance benefit from hyperthreading has to be higher than the overheads from using more threads, if it's going to actually improve performance.
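To make the scaling argument concrete, here's a minimal sketch of Amdahl's law. The function name is made up for this example, and the 94% parallel fraction below is just a number picked to match the 1.8 figure above.

```cpp
#include <cassert>

// Amdahl's law: the speedup on `cores` cores when a fraction `parallel`
// (0..1) of the work scales perfectly and the rest stays serial.
double amdahlSpeedup(double parallel, int cores)
{
    assert(parallel >= 0.0 && parallel <= 1.0 && cores >= 1);
    return 1.0 / ((1.0 - parallel) + parallel / cores);
}
```

With 94% of the work parallel, amdahlSpeedup(0.94, 4) / amdahlSpeedup(0.94, 2) comes out at roughly 1.8, so even a small serial portion is enough to lose a noticeable chunk of the ideal 2x.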

#5275578 Do game developers still have any reason to support Direct3D 10 cards?

Posted by Adam_42 on 13 February 2016 - 03:21 PM

It's probably some combination of:


- The Xbox One and PS4 use D3D11 hardware, so for a cross platform game it makes sense to use the same feature set on all platforms.


- It would add more work to target D3D10 cards (which are missing some handy features that D3D11 has, especially if they are 10.0 instead of 10.1). Compute shader limitations spring to mind as an obvious example.


- D3D10 cards are older and therefore slower; for some games they wouldn't be fast enough even if they were supported.


- According to http://store.steampowered.com/hwsurvey only about 15% of Steam users have a PC that tops out at DX10, while 85% have DX11 or better. Looking at the list of DX10 GPUs people have is also informative.

#5270253 What happens with 32bit overflow?

Posted by Adam_42 on 09 January 2016 - 06:56 AM

It's worth noting that there's some extra Windows specific detail here:


- On a 32-bit version of Windows, you don't get 4GB of address space. You get 2GB. The other 2GB is used by Windows (for cache, drivers, etc).

- On a 64-bit version of Windows, a 32-bit program also gets 2GB of address space by default, but you can link your program with /LARGEADDRESSAWARE and get 4GB.

- If you compile your program as 64-bit then you get lots more address space - up to 128TB (48 bit addresses, with half of the address space used by the OS).


Note that for simplicity I'm ignoring the /3GB and /USERVA boot flags, as it's not generally something you can control on the PC your software will be installed on.
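The arithmetic behind those figures can be sketched like this. The names are made up for this example, and it only models the simple "OS keeps half" split described above.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kGiB = 1ull << 30;
constexpr std::uint64_t kTiB = 1ull << 40;

// A 32-bit program linked with /LARGEADDRESSAWARE on 64-bit Windows
// gets the whole 32-bit range.
constexpr std::uint64_t kLargeAddressAware32 = 1ull << 32;

// User-mode address space in bytes when the OS keeps half of an
// address space that is `addressBits` wide.
std::uint64_t userAddressSpace(unsigned addressBits)
{
    assert(addressBits >= 1 && addressBits < 64);
    return (1ull << addressBits) / 2;
}
```

So userAddressSpace(32) is 2GB, userAddressSpace(48) is 128TB, and kLargeAddressAware32 is 4GB.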

#5269288 Questions About Blur Effect

Posted by Adam_42 on 04 January 2016 - 05:32 PM

One way to do a variable strength blur is to set up a filter kernel (i.e. array of weights) big enough for the maximum blur quantity you need.


You can pass that weights array as a parameter to the shader, and change the blur amount by adjusting the weights, and padding with zeros.


Making some numbers up, you could set up a 5 tap blur, with weights of say { 0.1, 0.2, 0.4, 0.2, 0.1 } and convert it to a 3 tap blur by setting the weights as say {0, 0.25, 0.5, 0.25, 0}


Obviously you'd need to calculate those weights on the CPU and upload them to the shader, and you'd probably want much more than a 5 tap shader.
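Here's a minimal sketch of the CPU side of that. The Gaussian falloff and the choice of sigma are my assumptions (any symmetric set of weights that sums to 1 would do), and the function name is made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds `maxTaps` weights where only the central `activeTaps` are
// non-zero; the rest are zero padding. Both counts must be odd.
std::vector<float> makeBlurWeights(std::size_t maxTaps, std::size_t activeTaps)
{
    assert(maxTaps % 2 == 1 && activeTaps % 2 == 1 && activeTaps <= maxTaps);

    std::vector<float> weights(maxTaps, 0.0f);
    const int centre = static_cast<int>(maxTaps / 2);
    const int radius = static_cast<int>(activeTaps / 2);
    const float sigma = radius > 0 ? radius / 2.0f : 1.0f;  // assumed falloff

    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        const float w = std::exp(-(i * i) / (2.0f * sigma * sigma));
        weights[static_cast<std::size_t>(centre + i)] = w;
        sum += w;
    }

    // Normalise so the blur doesn't brighten or darken the image.
    for (float& w : weights)
        w /= sum;
    return weights;
}
```

makeBlurWeights(5, 3) gives a 3 tap blur padded out to 5 weights, which can be uploaded to the same shader as makeBlurWeights(5, 5).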

#5263363 8x FSAA not working on render targets

Posted by Adam_42 on 23 November 2015 - 08:23 PM

You will probably find that what you also want to do is use ResolveSubresource() to copy the MSAA texture to a normal one to read it back with. If you don't do that you may get some weird looking results as you will be reading back individual samples.


If you're doing that I'd also drop the D3D11_BIND_SHADER_RESOURCE flag from the MSAA version to make sure you don't try to use it directly. Making an MSAA render target a shader resource is also not supported on a D3D_FEATURE_LEVEL_9_* device.

#5262855 Question about type, and displaying the bits of a char

Posted by Adam_42 on 20 November 2015 - 07:31 AM

Just to be clear about the lack of portability I was talking about, try out this bit of code:

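    #include <stdio.h>
    int main()
    {
        // Compile as C++. Bitfield layout is implementation-defined,
        // so these sizes vary between compilers.
        struct test1 {char p:4; unsigned char q:4;};
        struct test2 {int p:4; int q:4; char c;};
        struct test3 {int p:4; unsigned int q:4;};
        struct test4 {int p:4; unsigned char q:4;};
        // sizeof yields size_t, so cast to int to match the %d format.
        printf("%d %d %d %d\n", (int)sizeof(test1), (int)sizeof(test2), (int)sizeof(test3), (int)sizeof(test4));
        return 0;
    }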

VS2010 outputs "1 8 4 8".


Clang and gcc output "1 4 4 4".


Those sorts of differences can cause problems if you don't know about them. At the very least Visual Studio can easily use more memory with the same code, if you're not careful.

#5262747 Question about type, and displaying the bits of a char

Posted by Adam_42 on 19 November 2015 - 09:12 AM

You need to be a bit careful when using bitfields, as there are some significant portability issues. Some of them affect correctness, and others just affect how much space saving you get.


  • For the example in the OP sizeof(character) may vary depending on which compiler you've used (I'd expect either sizeof(unsigned) or sizeof(char)). If it ends up as sizeof(unsigned), then endianness will cause portability problems if you write to one member then read from the other.
  • The order of the bits within a bitfield isn't well defined either, so writing to b.b0 and then reading c isn't portable even if you avoid the size/endianness issue.
  • Various other stuff also isn't defined. For example there's several different numbers this code could output: struct test {int p:4; unsigned int q:4;}; printf("%d\n", sizeof(test));


One thing you should definitely avoid for cross platform portability is serializing anything that contains a bitfield. It's even worse than writing out a whole struct. Use masks and shifts instead if you need to be space efficient.



C99: An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.
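As a sketch of the masks and shifts approach mentioned above, here's one way to pack two 4-bit values into a byte with a layout that you define rather than the compiler. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Packs two 4-bit values into one byte with an explicit layout:
// `low` goes in bits 0-3 and `high` goes in bits 4-7.
std::uint8_t packNibbles(unsigned low, unsigned high)
{
    assert(low <= 0xFu && high <= 0xFu);
    return static_cast<std::uint8_t>((high << 4) | low);
}

// Extracts bits 0-3.
unsigned lowNibble(std::uint8_t packed)
{
    return packed & 0x0Fu;
}

// Extracts bits 4-7.
unsigned highNibble(std::uint8_t packed)
{
    return (packed >> 4) & 0x0Fu;
}
```

Because the bit positions are spelled out, the packed byte means the same thing on every compiler, which makes it safe to serialize.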


#5260396 Runtime BC7 texture compression

Posted by Adam_42 on 03 November 2015 - 04:57 PM

According to https://en.wikipedia.org/wiki/Adaptive_Scalable_Texture_Compression ASTC is available via D3D, but in Windows 10 only. Unfortunately, there's no citation to back that claim up.


It is available on OpenGL as an extension - "KHR_texture_compression_astc_ldr".


I can't find any information on which current PC GPUs support it, if any. I suspect that means none do support it, as that sort of thing tends to get mentioned in reviews. This article says it is supported on some mobile phone GPUs.


Because of the limited support, it probably won't be very useful for PC games for a few years. One of the big advantages of DXT1 and DXT5 is that all PC GPUs support them, so you don't have to handle the awkward case where the hardware can't use your compressed texture data.