Massive allocation perf difference between Win8, Win7 and Linux




Without seeing the test's code, it's hard to draw any conclusion.

All the code is available at http://sourceforge.net/p/cgenericopenaddresshashmap/code/ci/master/tree/

But it's true it was in the other thread, sorry.

Beginning with Windows XP, the low-fragmentation heap (LFH) is available, and beginning with Vista it is the default. To mitigate buffer overrun attacks, it was made "non-deterministic" in Windows 8.

Which means that the memory locations returned by the allocator under Windows 8 follow a considerably different pattern compared to Windows 7 (see e.g. this writeup). Also, on every Windows version, the same program can be made to perform significantly differently by calling one API function (enabling or disabling the LFH, and playing with the serialization flag, etc.).
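In case it's useful, here is a minimal Windows-specific sketch of the kind of API call meant above, using the documented HeapSetInformation / HeapCompatibilityInformation route (this enables the LFH on a heap; disabling it is done differently, e.g. via heap creation flags or a debugger):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    ULONG lfh = 2;  // 2 == opt this heap into the low-fragmentation heap
    BOOL ok = HeapSetInformation(GetProcessHeap(),
                                 HeapCompatibilityInformation,
                                 &lfh, sizeof(lfh));
    std::printf("LFH request %s\n", ok ? "succeeded" : "failed");
    return 0;
}
```

Since Vista the LFH is already the default, so on modern Windows this mostly matters for heaps created with flags that would otherwise prevent it.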

So, to answer your question, yes, the allocator is significantly different. Now why is it 5-20 times faster? We can only guess, but factors of 5 to 20 smell of cache or VM effects (doing one test in a debug build and the other in release is another common reason, but you said you used the very same executable on both machines -- so that's not it). My gut feeling is that your "random deletes" just happen to play better, by coincidence, with the "random addresses" the Windows 8 allocator gave you for consecutive elements, as compared to the strictly linear addresses for consecutive elements that the Windows 7 allocator would give you.

So maybe, on average, this turns into 10-20% fewer cache misses and 1-2% fewer page faults, but since these have such a huge impact compared to everything else, you see them as a 5-20x speedup.

Actually, the issue is that there really is no such thing as the Vista STL (or standard library). The STL ships with the compiler so if you install VS2005 on Vista you get the VS2005 STL and if you install VS2013 you get the VS2013 STL.


Maybe he meant the C standard library? Of course, that's still not the "STL", but I could imagine a case where a particular libc implementation could perform well on one OS and not on another, e.g. if the syscalls it made became slower in that version, or were deprecated in favor of something else.

The problem with that theory is that since at least Windows 2000, Windows has shipped with multiple implementations of the C standard library. Of the ones that are officially documented, there's the version implemented in msvcrt.dll, which is actually the Visual C++ 6 C standard library implementation, and there's the version implemented in the POSIX subsystem/Windows Services for UNIX. They are definitely different implementations, as the way they handle file names is different. However, the chances are pretty good that any program you ran would end up using none of them, in favor of the C standard library implementation of the compiler, which was usually specific to the compiler version. So MSVC 7 had its own C standard library implementation, and so did MSVC 7.1, and so on.

(I understand that, officially, since Windows 8, you can no longer assume that any copy of the C runtime is distributed along with the OS, and you must distribute your own. However, a clean installation of Windows 8 still seems to have that msvcrt.dll file. This may be prep work for a later version of Windows, where they deprecate it but don't actually remove it until later.)

msvcrt.dll version 7.0.10130.0 is still present on Windows 10 build 10130... I don't think MS wants to break gazillions of enterprise systems over a single DLL removal. Starting with MSVC 14, however, they re-organized (again) the VC-CRT implementation: http://blogs.msdn.com/b/vcblog/archive/2014/06/10/the-great-crt-refactoring.aspx

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

msvcrt.dll version 7.0.10130.0 is still present on Windows 10 build 10130... I don't think MS wants to break gazillions of enterprise systems over a single DLL removal
Hahaha... imagine what a support nightmare that would be... 3/4 of all programs would stop working for no apparent reason, with a weird missing-DLL message that no end user would understand.

No Sir, not going to happen.

Can you believe iterating over a map is 170 times slower on gcc/Linux than on VS12/Win8.1? What the heck? (Actually, for this one I suspect an optimizer quirk.)

Did you specifically define NDEBUG? It isn’t defined in release mode by default on all compilers.
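(For context, a minimal illustration, not specific to your project: assert() is compiled out only when NDEBUG is defined, regardless of optimization level, so a benchmark built without it still executes every assertion in the hot paths.)

```cpp
#include <cassert>
#include <cstdio>

int main() {
    assert(2 + 2 == 4);  // disappears entirely only if NDEBUG is defined
#ifdef NDEBUG
    std::puts("NDEBUG defined: asserts are disabled");
#else
    std::puts("NDEBUG not defined: asserts still run");
#endif
    return 0;
}
```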


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Hey Spiro, yep, you're right, I've noticed this as well and that is why I included it in the CMakeLists https://sourceforge.net/p/cgenericopenaddresshashmap/code/ci/master/tree/samples/CMakeLists.txt

New episode in the weirdness: I counted the memory blocks allocated in the Linux version of the test (a sketch of one way to count them follows the numbers below), and here it is:

mem blocks:

std vector: 21
reserved vector: 1
*open address: 44
*reserved openaddr: 2
std unordered: 999788
std map: 999771
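
(For reference, a minimal sketch of one way to tally heap blocks by replacing the global operator new/delete; this is an assumption about the method, not necessarily how the numbers above were produced:)

```cpp
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

static std::atomic<std::size_t> g_blocks{0};

void* operator new(std::size_t size) {
    ++g_blocks;                                // count every heap block requested
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }

int main() {
    const std::size_t before = g_blocks.load();
    std::vector<int> v(1000);                  // typically a single block
    std::printf("blocks allocated: %zu\n", g_blocks.load() - before);
    return 0;
}
```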

This explains a lot about the big difference we are seeing. The number of elements actually inserted in the maps was not under control; using the rand() function was a liability.

This shows that the gcc (glibc) implementation of rand() doesn't return values between 0 and 32767, but values from a much wider range, which in effect makes the tests completely different.

So the Windows tests exercise maps with ~32k distinct elements and lots of failed insertions (already-existing keys), while the Linux test successfully inserts almost everything it tries.
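
A one-liner makes the difference visible: the C standard only guarantees RAND_MAX >= 32767, and MSVC uses exactly 32767 while glibc uses 2147483647, so the same rand()-keyed benchmark produces ~32k distinct keys on Windows but close to a million on Linux.

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    // MSVC prints 32767 here; glibc prints 2147483647.
    std::printf("RAND_MAX on this platform: %d\n", RAND_MAX);
    return 0;
}
```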

So I changed rand calls to:

http://rosettacode.org/wiki/Linear_congruential_generator#C
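
For reference, a minimal sketch of such a portable LCG, using the well-known MSVC-style constants so every platform sees the same key sequence (not necessarily byte-for-byte what went into the repository):

```cpp
#include <cstdint>

static std::uint32_t g_lcg_state = 1;

void lcg_seed(std::uint32_t s) { g_lcg_state = s; }

// Returns values in [0, 32767], like MSVC's rand(), on every platform.
int lcg_rand() {
    g_lcg_state = g_lcg_state * 214013u + 2531011u;
    return static_cast<int>((g_lcg_state >> 16) & 0x7fffu);
}
```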

Now I get: (AMD6800k / g++4.9 / linux)


== 1 million int pushes ==
std vector:             7.99805 ms
reserved vector:        4.16898 ms
*open address:          16.5275 ms
*reserved openaddr:     14.9709 ms
std unordered:          22.5586 ms
std map:                145.302 ms


== 100k random erasures ==
*openaddr:              5040.85 ms
std unordered:          2.38067 ms
std map:                14.1778 ms


== 1M iteration ==
*openaddr:              4.60644 ms
std unordered:          2.55456 ms
std map:                13.0115 ms


== 50k random finds among 1M contenance ==
*openaddr:              0.584305 ms
std unordered:          0.712742 ms
std map:                6.75083 ms

Which all of a sudden looks very similar to the Win8 figures. I guess I'll have to re-run the whole thing everywhere now.

Also, we can see the monstrous impact of the dumb fix I provided for the "anti clumps orphaning", using the "deleted" flag per entry (the 5040.85 ms figure). Without maintenance, the hash map becomes polluted with deleted flags and all find operations degrade to O(N). This is very unlikely to happen in a real-life scenario, but I'll still work on a "garbage collection" technique, like Google did in their dense hash map.
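To illustrate the degradation (a hypothetical structure, not the repository's actual code): erased slots are only flagged, so probe sequences can never stop early again, and a common mitigation is to rehash once tombstones pass a threshold.

```cpp
#include <cstddef>
#include <vector>

enum class Slot { Empty, Occupied, Deleted };

struct Table {
    std::vector<Slot> state = std::vector<Slot>(1024, Slot::Empty);
    std::size_t tombstones = 0;

    void erase_slot(std::size_t i) {
        state[i] = Slot::Deleted;              // tombstone: keeps probe chains intact
        if (++tombstones * 4 > state.size())   // e.g. rehash past 25% tombstones
            rehash();
    }

    void rehash() {
        // Re-insert only Occupied entries into a fresh table, dropping all
        // Deleted markers (details omitted for brevity).
        tombstones = 0;
    }
};
```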

clang++ gives almost equivalent results; all proportions are preserved, only about 5% slower across the tests. So it wasn't really an optimizer or stdlib difference, but rather a benchmark bug caused by rand() implementation differences.

Now that we've brought Linux to the level of Win8, and apparently even faster: will the rand() issue explain the Win7 behavior too? I'll check that when I can. To be continued.

This topic is closed to new replies.
