Why high level languages are slow

32 comments, last by Matias Goldberg 9 years ago

*ends up skipping most of the thread*

Honestly, the article seems to be pointing out a single problem, and one I'm tired of: everything gets allocated on the heap. Depending on the language, even simple integers can end up there. That alone will bring the CPU to its knees: constant allocation and deallocation, cache locality getting completely destroyed, the GC going crazy having to clean up stuff all the time, etc.

Also, wow, that Haswell chart. I did a quick calculation: a program that's constantly causing cache misses could easily be brought down to the equivalent of sub-100MHz speeds (a miss on every access works out to roughly 13MHz on a 3GHz Haswell, i.e. a couple hundred cycles per access, though I imagine the code being executed at least stays in cache =P). No wonder so many modern programs seem slow despite computers supposedly being so much faster. It's probably a bogus calculation, honestly, but it would explain a lot.


Moreover, the compiler is usually better at asm than us, so it takes our intentions and cleverly optimizes them better than we ever could, using its super-human knowledge of instruction scheduling, dependency chain tracking, register juggling abilities, etc...

Yes.

I once wrote some simple matrix multiplication code in the laziest way possible (a plain double for loop) in order to keep the code as simple as possible, then decided to pass it through GCC with optimizations at maximum to see what it would do. Cue the entire thing being optimized into an unrolled loop full of SIMD instructions (with the two matrices crammed entirely into registers). Then I decided to write the transformation functions by just passing the matrices as-is instead of trying to optimize them by hand (even though I knew most of the calculations would be redundant, due to all the 0s and 1s). GCC completely inlined the matrix multiplication function, then optimized the inlined code taking the constant values into account (basically doing the very thing I had refused to do in the source code).

Now, that was C, but the point is: don't underestimate the compiler. Just make sure the code is reasonably simple so the compiler can recognize the idiom. This is the code, if somebody wonders.
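Roughly the kind of naive loop being described (a hypothetical sketch, not the actual linked code):

    typedef struct { float m[4][4]; } Mat4;

    /* The laziest possible 4x4 multiply: nothing clever, no manual SIMD.
       At -O3 GCC will typically unroll and vectorize this on its own. */
    Mat4 multiply(const Mat4 *a, const Mat4 *b)
    {
        Mat4 r;
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                float sum = 0.0f;
                for (int k = 0; k < 4; k++)
                    sum += a->m[i][k] * b->m[k][j];
                r.m[i][j] = sum;
            }
        }
        return r;
    }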

Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

However, the slow part of vector insertion is moving all the elements after the insertion point, which is just a simple memmove and is still faster than iterating over the elements of a list to find the insertion point - resizes are a little slower (though still O(n)) but infrequent, and never occur if the vector is pre-allocated.

I thought with vectors any movement of the elements would result in the copy or move constructors being called, as well?

Should have mentioned that it's all assuming POD types.
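With a POD element type it looks something like this (a rough sketch, names made up, not benchmarked):

    #include <cstdint>
    #include <vector>

    struct Particle {                       // plain POD, trivially copyable
        float x, y, z;
        std::uint32_t id;
    };

    void insertExample()
    {
        std::vector<Particle> particles;
        particles.reserve(10000);           // pre-allocated: no resizes later

        for (std::uint32_t i = 0; i < 1000; ++i)
            particles.push_back(Particle{0.0f, 0.0f, 0.0f, i});

        // Inserting in the middle shifts the tail, which for a trivially
        // copyable type boils down to a memmove in practice - no per-element
        // copy constructor calls.
        particles.insert(particles.begin() + particles.size() / 2,
                         Particle{1.0f, 2.0f, 3.0f, 9999});
    }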

In the end it is not languages that are fast or slow, it is implementations that are fast or slow.

It all comes down to algorithmic complexity. I've seen really badly written C++ that brings a system to its knees because the developer did not know the difference between a std::hash_map and a std::list.
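To make that concrete, here's a toy sketch of the difference (using std::unordered_map, the standardized equivalent of the old hash_map extension; the names are made up):

    #include <list>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Looking a record up by key in a list is a linear scan: O(n).
    int findScoreList(const std::list<std::pair<std::string, int>> &scores,
                      const std::string &name)
    {
        for (const auto &entry : scores)    // walks the whole list on a miss
            if (entry.first == name)
                return entry.second;
        return -1;
    }

    // In a hash map it's an average O(1) lookup: hash once, probe one bucket.
    int findScoreMap(const std::unordered_map<std::string, int> &scores,
                     const std::string &name)
    {
        auto it = scores.find(name);
        return it != scores.end() ? it->second : -1;
    }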

Similarly, I've seen extremely well written C# and Java applications written for enterprise use, which kick the ass of anything written in C or C++ simply because of how quickly they can be iteratively developed by a team, and time is of course money.

In the end, you need to use the right tool for the job. Sometimes that is an interpreted or bytecode language and other times it might even be assembly language, but the biggest mistake is always selecting the wrong tool for the job because it's the only tool you know how to use. As a wise man once said, "If all you have is a hammer, everything looks like a nail".

But the language informs the implementation - the way the C# language and its associated runtime work forces compromises which, when you write idiomatic code that doesn't fight the language, result in lower performance. These are language design choices directly impacting things.

This is run-time performance, pure and simple, and unless you are going to start waving pro-language flags around, no reasonable person can argue otherwise, because the nature of the language removes control over layout and placement.

You can argue good vs bad development all you like; in the context of this discussion it doesn't matter - and it matters even less when the good-vs-bad argument is always defensive and falls back to the trope of "I've seen bad C++ code and good C# code, which contradicts this, so it must be wrong" - because it is not wrong.

More to the point, the continued refrain of "use the best tool for the job" isn't required either; neither the author nor the people in this thread have argued otherwise, so the constant repetition of this line feels like a defensive 'must not upset anyone' whine more than anything else.

This thread isn't required.
The discussion here isn't productive.
Any honest user of a language would have looked at this for what it is - a comparison in a specific situation - nodded and got on with their lives.

Instead we have two pages of people trying to defend a language from points which were never made and conclusions which were never drawn... to what end? To feel good about using it? To feel like 'all languages are equal'? To avoid upsetting some precious flower who might feel bad because their language of choice has a flaw which can't be bypassed without some extra thinking?

Ugh...

I still contend that if you're going to use a GC language, then you need to play nice with the GC

I'm curious - have you tried doing this on a large scale?

I've spent an awful lot of time over the last couple of years refactoring swathes of Java code (or porting it to C++) to reach the 'zero allocations' bar that one absolutely must hit if one wants a fluid and responsive Android application. I'm not kidding about this - you will not be able to reliably hit 60fps if there are any allocations in your rendering or layout loops. And Java doesn't have struct types, so that means no short-lived compound types at all...
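The usual workaround is to pre-allocate everything and recycle it, something along these lines (a rough C++ sketch of the idea with made-up names; in Java it's the same dance with pooled objects, since there are no value types):

    #include <cstddef>
    #include <vector>

    struct Vec2 { float x, y; };

    // A trivially simple per-frame scratch pool: allocate the slots once up
    // front, hand them out during the frame, and "free" by resetting a counter.
    // No allocations ever happen inside the render/layout loop itself.
    class FramePool {
    public:
        explicit FramePool(std::size_t capacity) : slots(capacity), used(0) {}

        Vec2 *acquire() {                  // called from the hot loop
            return used < slots.size() ? &slots[used++] : nullptr;
        }
        void reset() { used = 0; }         // called once per frame

    private:
        std::vector<Vec2> slots;           // allocated once, outside the loop
        std::size_t used;
    };
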
Sadly, this is 100% true. I just finished having to optimize both my Android rendering and layout loops for this very reason.

"The code you write when you learn a new language is shit.
You either already know that and you are wise, or you don’t realize it for many years and you are an idiot. Either way, your learning code is objectively shit." - L. Spiro

"This is called programming. The art of typing shit into an editor/IDE is not programming, it's basically data entry. The part that makes a programmer a programmer is their problem solving skills." - Serapth

"The 'friend' relationship in c++ is the tightest coupling you can give two objects. Friends can reach out and touch your privates." - frob

Hang on, what is all this "fighting the language" nonsense that keeps getting bandied about?

I won't speak for Java, but C# at least was designed up front for value types, pointers, and blocks of native memory. There are several whole keywords that exist solely for working with unsafe code. In fact, there's essentially no code you can write in straight C that you cannot also write in C# (save for hideous macro tomfoolery). I do this fairly frequently for scenarios where memory layout is key: allocate a block of native memory and carve out some structs (value types) from within that block.

Granted, this kind of coding isn't *as* pretty as writing code that doesn't care at all about memory, but when you do care it's not all that hard to write. If you consider it a fight, then *every line* of C that you write is an equivalent kind of fight. The difference is that C# gives you the option to descend to that level; it turns out that a huge majority of the code that we write doesn't actually get run thousands or millions of times in a tight loop, so you don't much care how well it performs as long as it's fast enough.

If you're really paranoid about the GC kicking in when it's not wanted, tune it down or turn it off altogether for sections of the game that need low latency.

Want some native code generation so that we can get rid of JIT startup costs and get the benefits of having all damn day to optimize the code? Got you covered bro. Want to avoid taking a large dependency on the whole framework and having to ship that to end users? No longer an issue. SIMD? I hardly even know her!

Let's not even get started on multithreading. If it's a fight in C#, then it's the charge of the goddamn Light Brigade in C.

Mike Popoloski | Journal | SlimDX
Honestly the part that annoys me most about trying to code with CPU cache in mind?

I have no guarantees.

I don't know how big the cache is. I don't know how big the cache lines are. Structs that fit in caches on one machine will fail to fit and cripple performance on another. Alignments that set up the addresses all nice and neat on CPU A will all be offset on CPU B and, again, push it over the cache line.

Sure, I might be able to make some assumptions about today's processors (a 64-byte line size seems to be relatively common), but what about that 128 MB monster mentioned earlier in the thread?
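About the best you can do is bake the assumption in explicitly and have the build yell at you when it breaks - a sketch along these lines (the 64 is still just an assumption, not anything the language guarantees; the names are made up):

    #include <cstddef>
    #include <cstdint>

    // Assumed cache line size - a guess, not a guarantee from the language.
    constexpr std::size_t kCacheLine = 64;

    // Align the hot data to the assumed line so two instances don't share one.
    struct alignas(kCacheLine) HotData {
        float position[3];
        float velocity[3];
        std::uint32_t flags;
    };

    // Fail the build if the struct silently outgrows a single line.
    static_assert(sizeof(HotData) <= kCacheLine,
                  "HotData no longer fits in one assumed cache line");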

And people wonder why consoles are popular to develop games for - things like cache sizes and line sizes are fixed for 5+ years ;)

Ironically, a language like C# or Java could actually JIT the struct sizes or split them up differently based on cache sizes of the CPU the program is running on - thereby increasing speed. I am not saying either of them do - but because they make fewer guarantees about memory layout and the like they have the flexibility to do so... a C/C++ compiler cannot do that. So that's another case where a "higher level" language can win out - tailoring your code at execution time to the actual end-user's environment.

Honestly the part that annoys me most about trying to code with CPU cache in mind? I have no guarantees. I don't know how big the cache is. I don't know how big the cache lines are. Structs that fit in caches on one machine will fail to fit and cripple performance on another. [...]

This is not true in general: in x86 land your cache lines are 64 bytes and have been for a long time now; heck, even the PowerPC chips in the 360 and PS3 had 64-byte cache lines. Moving to ARM chips is where things might change, but I bet the ARM architecture has a common cache line size too.

But you actually don't have to care that much about this: as long as you keep the things you want to access close to each other and go through them in a linear fashion, the prefetcher in the CPU will hide most of the latency for you.

You don't have to care about the L4 cache size either - there are 20MB Haswells out there right now, but that's not what you optimize for; it just means the 64-byte-cache-line code you wrote can go faster for longer before you have to hit actual memory.

A JIT, however, doesn't have the information needed to make sure that the split it makes across cache lines matches how the algorithm that operates on that memory actually uses the structure.

And yes, consoles are popular among developers for their fixed hardware, but that matters less than you think now that all the hardware is out-of-order. The CPU will do more micro-optimisation of the instruction stream than you can do by hand in asm.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

This is not true in general in x86 land...


Good thing everything I make is only for an x86-based chip :)

In a less snarky manner - a lot of people are trying to get games to run on ARM chips these days. And AFAIK nothing (technological) is preventing Intel or AMD from changing their cache line size in the future.

Either way - I was mostly expounding on the problem of the C memory model as others have pointed out.

C likes to pretend that all memory is the same, and that we have "infinite" amounts of it (or at least up to the size of a 32 or 64-bit pointer depending on platform - but sometimes not even that depending on address mode).

But modern computers now have RAM, graphics RAM (sometimes shared with regular RAM), and up to three levels of cache. And that's discounting special memory hardware like Sony's SPUs.

Plus now you have complicated prefetchers and branch predictors and other CPU machinery to try to maintain the illusion of fast RAM by guessing what you need before you need it.

So should C/C++ expose more information about the memory model? Should it expose cache information? RAM information? Heck, what about swap space information?

I dunno. Maybe. Maybe the compiler becomes good enough to hide it all and maintain the illusion of monolithic "fast" memory.

(And this is ignoring all the other fun bits like sharing memory between processes)

I know C and C++ have tried at some points to account for strange memory layouts, but they didn't work out very well. (See: Why signed chars have a -127 to 127 range, and why original STL allocators had ways to change pointer and reference types)


I don't know how big the cache is. I don't know how big the cache lines are. Structs that fit in caches on one machine will fail to fit and cripple performance on another. Alignments that set up the addresses all nice and neat on CPU A will all be offset on CPU B and, again, push it over the cache line.

The cache size isn't the most important part; what matters is which memory addresses you access. If you access consecutive addresses it's effectively cache friendly, since RAM is optimized for consecutive accesses and the CPU will fetch the following addresses ahead of time (so the data is already in the cache by the time you need it). If you start accessing memory addresses all over the place, that's when you'll get cache misses, because the RAM latency will force the CPU to wait while the cache is filled.
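A contrived sketch of the difference - same data, same amount of work, only the traversal order changes (function names are made up):

    #include <cstddef>

    // Consecutive addresses: the prefetcher stays ahead of the loop.
    float sumSequential(const float *data, std::size_t n)
    {
        float total = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            total += data[i];
        return total;
    }

    // Jumps around memory: with a large stride, each access can land on a
    // cold cache line and stall waiting for RAM.
    float sumStrided(const float *data, std::size_t n, std::size_t stride)
    {
        float total = 0.0f;
        for (std::size_t s = 0; s < stride; ++s)
            for (std::size_t i = s; i < n; i += stride)
                total += data[i];
        return total;
    }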

Another thing to take into account: if you write to memory, it goes into the cache first (until it gets flushed out), so if you access the same address again quickly it will still be in the cache and you avoid the latency there as well. In some sense the cache can also act sort of like surrogate registers (especially useful when you have more local variables than will fit in registers). If you're using RAM in this way then you won't be seeing cache misses either.

Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

This topic is closed to new replies.
