Massive allocation perf difference between Win8, Win7 and Linux


Hello, maybe you've seen my topic about the open-address hash map I provide; this topic is about a mystery in its performance benchmarking.

I've noticed something plain crazy: between machines and OSes, the performance of the same test is radically different.

Here are my raw results:

[image: table of raw benchmark results]

(full source code here)

OK, no graphs; you're all big enough to look at numbers. These results all come from the same code.

Afterwards, I overloaded operator new to count the number of allocations each method performed in the push test.
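The counting hook is roughly this shape (a minimal sketch, not the exact code from my benchmark; the counter name and the forwarding to malloc are just illustrative):

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Global allocation counter (illustrative; the real test may differ).
    static std::size_t g_alloc_count = 0;

    void* operator new(std::size_t size)
    {
        ++g_alloc_count;                  // count every allocation request
        if (void* p = std::malloc(size))  // forward the actual work to malloc
            return p;
        throw std::bad_alloc();
    }

    void operator delete(void* p) noexcept
    {
        std::free(p);
    }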

This is the result:

mem blocks:

std vector:          42
reserved vector:      1
*open address:       26
*reserved oahm:       2
std unordered:    32778
std map:          32769

So my conclusion, purely from these figures, is that Windows 8 must have a very different malloc function in the C runtime. I took the same binary built by the Visual Studio I had on Win7 and ran it on Win8, and I got the same results as the binary built directly by VS on Win8. So it has to be the CRT DLL. Or it's the virtual allocator in the kernel that has become much faster.

What do you think, and is there a scientific way to really know what is going on?

Can you believe iterating over a map is 170 times slower on GCC/Linux than on VS12/Win8.1? The heck? (Actually, for this one I suspect an optimizer joke.)

PS: the 32778 nodes come from the fact that I push using rand().
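(Roughly, the push test has this shape; the iteration count here is arbitrary. On MSVC, RAND_MAX is 32767, so rand() can produce at most 32768 distinct keys, which is why the node-based containers end up a little above 32K allocations.)

    #include <cstdlib>
    #include <unordered_map>

    // Shape of the push test (iteration count is arbitrary). On MSVC, RAND_MAX
    // is 32767, so rand() yields at most 32768 distinct keys; duplicates just
    // overwrite an existing entry and allocate no new node, which is why the
    // node-based containers end up a little above 32K allocations.
    int main()
    {
        std::unordered_map<int, int> m;
        for (int i = 0; i < 100000; ++i)
            m[rand()] = i;   // at most 32768 unique nodes are ever created
    }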

First off - I wouldn't really call any hand-made tests good cases for performance testing. Compilers and OSes are getting smarter and smarter about detecting patterns and simply doing no work at all, or guessing at what you're going to do next and being right. Both are great for the user, but not so great for determining whether A is better than B. As such, I'd recommend using a profiler instead.

As to the actual performance seen here, I can completely believe it. A large part of Windows 8 was a focus on reducing memory usage and running on even older and less capable hardware than Windows 7. Despite all the hate directed at the OS for UI decisions, the underlying OS is actually faster and more efficient.

The performance difference is not likely to come from the standard library DLL, as that is the same across platforms (windows platforms anyway), but in the underlying OS allocator and additional tricks the OS can use to guess what you're doing and make it faster.

As to the whole iteration-on-Linux thing, I suspect that has more to do with GCC. While I've never used it myself, I haven't heard many complimentary things about GCC since Clang took off. Maybe try switching to Clang and re-running your benchmarks?
I hate to ask the really obvious question, but are you sure you had the -O flag set up correctly? GCC is getting long in the tooth and doesn't compare well to the newer compilers, but that is pretty ridiculous. In general though, memory allocators are all very different, which is why many engines/libraries use custom allocators, mostly just to provide consistency.

That's right. I'm aware that Win8 has many improvements at the kernel level, and maybe it's just a cognitive bias, but it seems palpable at the user level, notably better scheduler responsiveness.

But here we're talking about allocating 30k integers, and this is exactly the kind of thing malloc handles, not the OS. malloc reserves blocks from the system in huge chunks, using VirtualAlloc or the NT heap manager, then uses its famous binning/ordered free-list machinery to return the small blocks to the caller, never issuing a system call, or as rarely as possible.
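To illustrate what I mean, here is a bare-bones sketch of that pattern: grab one big chunk up front, then serve fixed-size blocks from a free list with no further system calls. Real malloc implementations do this per size bin with far more bookkeeping; the class name and block size here are made up.

    #include <cstddef>
    #include <cstdlib>

    class FixedPool
    {
        union Node
        {
            Node* next;
            alignas(std::max_align_t) char block[32];
        };

        Node* storage_;    // one big chunk acquired from the system up front
        Node* free_list_;  // singly linked list of free blocks

    public:
        explicit FixedPool(std::size_t count)
            : storage_(static_cast<Node*>(std::malloc(count * sizeof(Node)))),
              free_list_(nullptr)
        {
            if (!storage_)
                return;
            for (std::size_t i = 0; i < count; ++i)  // thread every block onto the list
            {
                storage_[i].next = free_list_;
                free_list_ = &storage_[i];
            }
        }

        ~FixedPool() { std::free(storage_); }

        void* allocate()              // pop the head: no system call involved
        {
            if (!free_list_)
                return nullptr;       // a real allocator would grab another chunk here
            Node* n = free_list_;
            free_list_ = n->next;
            return n;
        }

        void deallocate(void* p)      // push the block back onto the free list
        {
            Node* n = static_cast<Node*>(p);
            n->next = free_list_;
            free_list_ = n;
        }
    };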

Therefore, if there is a difference, it should be in the CRT. That was my reasoning.

Edit: AllEightUp (you posted while I wrote): yes, I'm sure; you can check in the topic I linked the crazy difference between -O0 and -O3.

But here we're talking about allocating 30k integers, and this is exactly the kind of thing malloc handles, not the OS.

Actually, if you dig through the MSVC standard library source code, malloc() is just a wrapper around the OS HeapAlloc() function.
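A rough way to check where the time goes (a sketch, Windows-only; block size and counts are arbitrary) is to time the same workload through malloc() and directly through HeapAlloc() on the process heap. If both show the same Win7/Win8 gap, the difference is below the CRT; if only malloc() does, it's in the CRT front end.

    // A sketch (Windows-only): time the same small-block workload through the
    // CRT's malloc()/free() and directly through HeapAlloc()/HeapFree() on the
    // process heap. Block size and iteration count are arbitrary.
    #define NOMINMAX
    #include <windows.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    template <class F>
    static long long time_ms(F f)
    {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }

    int main()
    {
        const int N = 1000000;
        std::vector<void*> blocks(N);

        long long crt_ms = time_ms([&] {
            for (int i = 0; i < N; ++i) blocks[i] = std::malloc(16);
            for (int i = 0; i < N; ++i) std::free(blocks[i]);
        });

        HANDLE heap = GetProcessHeap();
        long long os_ms = time_ms([&] {
            for (int i = 0; i < N; ++i) blocks[i] = HeapAlloc(heap, 0, 16);
            for (int i = 0; i < N; ++i) HeapFree(heap, 0, blocks[i]);
        });

        std::printf("malloc/free: %lld ms, HeapAlloc/HeapFree: %lld ms\n", crt_ms, os_ms);
    }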

Without seeing the test's code, it's hard to draw any conclusions.

Most benchmarks out there are completely useless.

Why? Because implementations are very different (it's not just the compilers). For example, MS' vector implementation grows the vector by 1.5x every time it reaches capacity, while GCC's default implementation grows it by 2.0x. So in a test where lots of pushes are made to a vector without reserving the space beforehand, GCC will always be the clear winner, because it over-allocates compared to MS' implementation and therefore performs fewer reallocations. That doesn't mean it will perform as well in a real-world case.
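If you want to see your own implementation's policy, something like this prints every capacity jump (the loop count is arbitrary, and the exact growth factors are implementation details that can change between versions):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::vector<int> v;
        std::size_t last_cap = 0;
        for (int i = 0; i < 100000; ++i)
        {
            v.push_back(i);
            if (v.capacity() != last_cap)   // report each reallocation
            {
                std::printf("size %u -> capacity %u\n",
                            (unsigned)v.size(), (unsigned)v.capacity());
                last_cap = v.capacity();
            }
        }
    }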

Likewise, map implementations are optimized for different things (lookup, traversal, erasures, insertions, runtime error correction / safety checks). But even if we compare all that, implementations can be optimized for large structures or short structures.

An implementation optimized for short structures means std::map<int, int> will perform faster, while an implementation optimized for large structures means std::map<MyStruct, AnotherStruct>, where sizeof(MyStruct) and/or sizeof(AnotherStruct) is large (e.g. bigger than 64 bytes), will perform faster.

Last but not least: yes, it's not hard to make an implementation that beats the one provided by your compiler tool suite for your particular need. Those implementations are coded first for correctness and safety. Then, if there's time or if a lot of devs are demanding it, they get optimized.

Until recently, GCC's std::shared_ptr was using a big fat mutex. The latest version uses atomic/interlocked instructions. It's really hard to get the latter correct, so it needs a lot of planning and a lot of testing. Likewise, MS' std::chrono implementation leaves much to be desired. All of this happens because, like everyone else, standard library writers are hit by time constraints and deadlines, and their libraries are generic and have to work for everybody without crashing or exhibiting incorrect behavior.

This goes against the general knowledge that the STL is unbeatable and you can always trust it. Trust it? Yes. Fastest? Well, most game developers sooner or later learn the hard way that this is not always true. There's a reason EA created the EASTL library many years ago.

Well, you can easily look directly at the allocation and see if that's the issue.

Allocation is pretty slow and deletion even slower on many OSes. But I think there is more going on than that. The STL implementation on windows vista is particularly pathetic. I got some dumb flames here in the past showing similar benchmarks.

This is my thread. There are many threads like it, but this one is mine.

The STL implementation on windows vista is particularly pathetic. I got some dumb flames here in the past showing similar benchmarks.

Maybe you got flamed because there is no such thing as the "STL implementation on windows vista"?

The STL implementation on windows vista is particularly pathetic. I got some dumb flames here in the past showing similar benchmarks.

Maybe you got flamed because there is no such thing as the "STL implementation on windows vista"?

Not to pedants anyway.

This is my thread. There are many threads like it, but this one is mine.

It's not an issue of pedantry - the standard library contains features that were absent from the STL and vice-versa.

The terms shouldn't be used interchangeably, and not only doing so, but doubling down when called out on it, says bad things about you.
