Massive allocation perf difference between Win8, Win7 and Linux




Without seeing the test's code, it's hard to draw any conclusion.

All the code is available at http://sourceforge.net/p/cgenericopenaddresshashmap/code/ci/master/tree/

But it's true it was in the other thread, sorry.

Beginning with Windows XP, the low-fragmentation heap (LFH) is available, and beginning with Vista it is the default. To mitigate buffer overrun attacks, it was made "non-deterministic" in Windows 8.

Which means that the memory locations returned by the allocator under Windows 8 follow a considerably different pattern compared to Windows 7 (see e.g. this writeup). Also, on every Windows version, the same program can be made to perform significantly differently by calling one API function (enabling or disabling the LFH, and playing with the serialization flag, etc.).
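In case it's useful, here is a minimal Windows-specific sketch of the kind of API call meant above, using the documented HeapSetInformation / HeapCompatibilityInformation route (this enables the LFH on a heap; disabling it is done differently, e.g. via heap creation flags or a debugger):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    ULONG lfh = 2;  // 2 == opt this heap into the low-fragmentation heap
    BOOL ok = HeapSetInformation(GetProcessHeap(),
                                 HeapCompatibilityInformation,
                                 &lfh, sizeof(lfh));
    std::printf("LFH request %s\n", ok ? "succeeded" : "failed");
    return 0;
}
```

Since Vista the LFH is already the default, so on modern Windows this mostly matters for heaps created with flags that would otherwise prevent it.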

So, to answer your question, yes, the allocator is significantly different. Now why is it 5-20 times faster? We can only guess, but factors of 5 to 20 smell of cache or VM effects (doing one test in a debug build and the other in release is another common reason, but you said you used the very same executable on both machines -- so that's not it). My gut feeling is that your "random deletes" just happen to play better, by coincidence, with the "random addresses" the Windows 8 allocator gave you for consecutive elements, as compared to the strictly linear addresses for consecutive elements that the Windows 7 allocator would give you.

So maybe, on average, this turns into 10-20% fewer cache misses and 1-2% fewer page faults, but since these have such a huge impact compared to everything else, you see them as a 5-20x speedup.

Actually, the issue is that there really is no such thing as the Vista STL (or standard library). The STL ships with the compiler so if you install VS2005 on Vista you get the VS2005 STL and if you install VS2013 you get the VS2013 STL.


Maybe he meant the C standard library? Of course, that's still not the "STL", but I could imagine a case where a particular libc implementation could perform well on one OS and not on another, e.g. if the syscalls it made became slower in that version, or were deprecated in favor of something else.

The problem with that theory is that since at least Windows 2000, Windows has shipped with multiple implementations of the C standard library. Of the ones that are officially documented, there's the version implemented in msvcrt.dll, which is actually the Visual C++ 6 C standard library implementation, and there's the version implemented in the POSIX subsystem/Windows Services for UNIX. They are definitely different implementations, as the way they handle file names is different. However, the chances are pretty good that any program you ran would end up using none of them, in favor of the C standard library implementation of the compiler, which was usually specific to the compiler version. So MSVC 7 had its own C standard library implementation, and so did MSVC 7.1, and so on.

(I understand that, officially, since Windows 8, you can no longer assume that any copy of the C runtime is distributed along with the OS, and you must distribute your own. However, a clean installation of Windows 8 still seems to have that msvcrt.dll file. This may be prep work for a later version of Windows, where they deprecate it but don't actually remove it until later.)

msvcrt.dll version 7.0.10130.0 is still present on Windows 10 build 10130... I don't think MS wants to break gazillions of enterprise systems over a single DLL removal. Starting with MSVC 14, however, they re-organized (again) the VC-CRT implementation: http://blogs.msdn.com/b/vcblog/archive/2014/06/10/the-great-crt-refactoring.aspx

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

msvcrt.dll version 7.0.10130.0 is still present on Windows 10 build 10130... I don't think MS wants to break gazillions of enterprise systems over a single DLL removal
Hahaha... imagine what a support nightmare that would be... 3/4 of all programs would stop working for no apparent reason, with a weird missing-DLL message that no end user would understand.

No Sir, not going to happen.

Can you believe iterating over a map is 170 times slower on gcc/Linux than on VS12/Win8.1? What the heck? (Actually, for this one I suspect an optimizer quirk.)

Did you specifically define NDEBUG? It isn’t defined in release mode by default on all compilers.
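(For context, a minimal illustration, not specific to your project: assert() is compiled out only when NDEBUG is defined, regardless of optimization level, so a benchmark built without it still executes every assertion in the hot paths.)

```cpp
#include <cassert>
#include <cstdio>

int main() {
    assert(2 + 2 == 4);  // disappears entirely only if NDEBUG is defined
#ifdef NDEBUG
    std::puts("NDEBUG defined: asserts are disabled");
#else
    std::puts("NDEBUG not defined: asserts still run");
#endif
    return 0;
}
```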


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Hey Spiro, yep, you're right, I've noticed this as well and that is why I included it in the CMakeLists https://sourceforge.net/p/cgenericopenaddresshashmap/code/ci/master/tree/samples/CMakeLists.txt

New episode in the weirdness: I counted the memory blocks allocated in the Linux version of the test (a sketch of one way to count them follows the numbers below), and here it is:

mem blocks:

std vector: 21
reserved vector: 1
*open address: 44
*reserved openaddr: 2
std unordered: 999788
std map: 999771
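
(For reference, a minimal sketch of one way to tally heap blocks by replacing the global operator new/delete; this is an assumption about the method, not necessarily how the numbers above were produced:)

```cpp
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

static std::atomic<std::size_t> g_blocks{0};

void* operator new(std::size_t size) {
    ++g_blocks;                                // count every heap block requested
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }

int main() {
    const std::size_t before = g_blocks.load();
    std::vector<int> v(1000);                  // typically a single block
    std::printf("blocks allocated: %zu\n", g_blocks.load() - before);
    return 0;
}
```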

This explains a lot about the big difference we are seeing. The number of elements actually inserted in the maps was not under control; using the rand() function was a liability.

This shows that the gcc (glibc) implementation of rand() doesn't return values between 0 and 32767, but values from a much wider range, which in effect makes the tests completely different.

So the Windows tests exercise maps with ~32k distinct elements and lots of failed insertions (already-existing keys), while the Linux test successfully inserts almost everything it tries.
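
A one-liner makes the difference visible: the C standard only guarantees RAND_MAX >= 32767, and MSVC uses exactly 32767 while glibc uses 2147483647, so the same rand()-keyed benchmark produces ~32k distinct keys on Windows but close to a million on Linux.

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    // MSVC prints 32767 here; glibc prints 2147483647.
    std::printf("RAND_MAX on this platform: %d\n", RAND_MAX);
    return 0;
}
```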

So I changed rand calls to:

http://rosettacode.org/wiki/Linear_congruential_generator#C
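
For reference, a minimal sketch of such a portable LCG, using the well-known MSVC-style constants so every platform sees the same key sequence (not necessarily byte-for-byte what went into the repository):

```cpp
#include <cstdint>

static std::uint32_t g_lcg_state = 1;

void lcg_seed(std::uint32_t s) { g_lcg_state = s; }

// Returns values in [0, 32767], like MSVC's rand(), on every platform.
int lcg_rand() {
    g_lcg_state = g_lcg_state * 214013u + 2531011u;
    return static_cast<int>((g_lcg_state >> 16) & 0x7fffu);
}
```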

Now I get: (AMD6800k / g++4.9 / linux)


== 1 million int pushes ==
std vector:             7.99805 ms
reserved vector:        4.16898 ms
*open address:          16.5275 ms
*reserved openaddr:     14.9709 ms
std unordered:          22.5586 ms
std map:                145.302 ms


== 100k random erasures ==
*openaddr:              5040.85 ms
std unordered:          2.38067 ms
std map:                14.1778 ms


== 1M iteration ==
*openaddr:              4.60644 ms
std unordered:          2.55456 ms
std map:                13.0115 ms


== 50k random finds among 1M contenance ==
*openaddr:              0.584305 ms
std unordered:          0.712742 ms
std map:                6.75083 ms

Which all of a sudden looks very similar to the Win8 figures. I guess I'll have to re-run the whole thing everywhere now.

Also, we can see the monstrous impact of the dumb fix I provided for the "anti clumps orphaning", using the "deleted" flag per entry (the 5040.85 ms figure). Without maintenance, the hash map becomes polluted with deleted flags and all find operations degrade to O(N). This is very unlikely to happen in a real-life scenario, but I'll still work on a "garbage collection" technique, like Google did in their dense hash map.
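To illustrate the degradation (a hypothetical structure, not the repository's actual code): erased slots are only flagged, so probe sequences can never stop early again, and a common mitigation is to rehash once tombstones pass a threshold.

```cpp
#include <cstddef>
#include <vector>

enum class Slot { Empty, Occupied, Deleted };

struct Table {
    std::vector<Slot> state = std::vector<Slot>(1024, Slot::Empty);
    std::size_t tombstones = 0;

    void erase_slot(std::size_t i) {
        state[i] = Slot::Deleted;              // tombstone: keeps probe chains intact
        if (++tombstones * 4 > state.size())   // e.g. rehash past 25% tombstones
            rehash();
    }

    void rehash() {
        // Re-insert only Occupied entries into a fresh table, dropping all
        // Deleted markers (details omitted for brevity).
        tombstones = 0;
    }
};
```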

clang++ gives almost equivalent results; all proportions are preserved, only about 5% slower across the tests. So it wasn't really an optimizer or stdlib difference, but rather a benchmark bug caused by rand() implementation differences.

Now that we've brought Linux to the level of Win8, and apparently even faster: will the rand() issue explain the Win7 behavior too? I'll check that when I can. To be continued.

This topic is closed to new replies.
