Simple & Robust OOP vs Performance


That's more memory, isn't it?


Yes, it's likely 12 measly bytes.

A small price to pay for the productivity and quality gains that come from not having a 100-method class.

But there are still optimizations that you should do whenever possible.
For one, count down to 0 in for-loops whenever possible.
In Java, it looks like this:
[source lang="java"]for ( int i = iTotal; --i >= 0; ) {
}[/source]

This is also faster in C/C++ (many ways to compare against 0, and those instructions are smaller and faster), but in Java it is a major help. It reduces the size of the code itself significantly while also providing a faster iteration.
If the order of the loop does not matter, you should always always always do it this way.


This is also very much a premature optimization as well (in C++ at least). Reverse iteration might yield a slightly faster comparison, but is just as likely to cause an increased number of cache misses if the index is used in any way at all (even if the order does not matter semantically).

This is also faster in C/C++ (many ways to compare against 0, and those instructions are smaller and faster), but in Java it is a major help. It reduces the size of the code itself significantly while also providing a faster iteration.
If the order of the loop does not matter, you should always always always do it this way.

Not on my machine using C++.

[source lang="cpp"]#include <iostream>
#include <Windows.h>

using namespace std;

int main()
{
    const int arraySize = 10000;
    int a[arraySize];
    int b[arraySize];

    for(int i = 0; i < arraySize; i++)
    {
        a[i] = 0;
        b[i] = 0;
    }

    LARGE_INTEGER li, start;
    if( !QueryPerformanceFrequency(&li) )
        return 0;
    double freq = double(li.QuadPart)/1000.0;

    // Time the forward loop (count up from 0).
    QueryPerformanceCounter(&start);
    for(int i = 0; i < arraySize; i++)
        a[i] += i;
    QueryPerformanceCounter(&li);
    cout << double(li.QuadPart - start.QuadPart)/freq << endl;

    // Time the reverse loop (count down towards 0).
    QueryPerformanceCounter(&start);
    for(int i = arraySize; i >= 0 ; --i)
        b[i] += i;
    QueryPerformanceCounter(&li);
    cout << double(li.QuadPart - start.QuadPart)/freq << endl;
    return 0;
}[/source]


Multiple tests indicate that the second loop is slower, sometimes as much as 2 times slower.

Edit: code not showing up
Edit2: that wasn't a fair comparison; strangely enough, switching the order of the iterations makes a difference
@Mussi: try pastebin, forum software seems to have issues with cout << currently. Also for your test, make sure you reset data every time, preferably using pseudorandom numbers to prevent result precomputation and/or caching, time multiple times and with different array sizes.
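
Something along these lines is what I have in mind (rough sketch only; the sizes, seed and variable names are made up):
[source lang="cpp"]#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Rough benchmarking sketch: refill the data with pseudorandom values before
// each timed run so nothing can be precomputed or reused, and repeat the
// measurement many times (and with different array sizes).
int main()
{
    const std::size_t arraySize = 10000;   // try several sizes
    const int runs = 100;                  // time multiple times

    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> dist(0, 1000);
    std::vector<int> data(arraySize);

    long long sink = 0;  // accumulate into something observable so the loop isn't optimised away

    for(int run = 0; run < runs; ++run)
    {
        for(std::size_t i = 0; i < arraySize; ++i)   // fresh data every run
            data[i] = dist(rng);

        auto t0 = std::chrono::high_resolution_clock::now();
        for(std::size_t i = data.size(); i-- > 0; )  // reverse loop under test
            sink += data[i];
        auto t1 = std::chrono::high_resolution_clock::now();

        std::cout << std::chrono::duration<double, std::micro>(t1 - t0).count() << " us\n";
    }

    std::cout << sink << std::endl;  // use the result
    return 0;
}[/source]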

Also, if looping backwards is faster on a given platform, and the compiler for this platform can prove that the direction of the loop doesn't matter (or can compensate for it), the compiler probably already does it at some optimization level. This isn't exactly an arcane trick and is just as well-known as, say, xoring a register with itself to zero it.

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”


What do you mean by sensible design?


What I meant is using an idiomatic and clean design first. Most programmers would argue that a class with 100 methods smells pretty bad and that the potential performance gains would not justify keeping such a beast in your codebase. ;)

However, I must admit that I know virtually nothing about Java regarding its allocator and memory layout strategy. But what about this:

Having big objects with lots of independent members usually means you are pulling unused data into the cache, butchering performance.
By splitting up the object into smaller components we can achieve better cache utilisation by grouping the data that gets processed together.
If I understand correctly, that's what DOD (data-oriented design) is all about.

So having many classes can in fact be faster than having big "all in one" classes. Especially if you consider that these days RAM is usually the biggest bottleneck. (Apart from disk access of course)
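
A rough sketch of the idea in C++ (the FatEntity/Transform names are hypothetical, not from any real codebase):
[source lang="cpp"]#include <cstddef>
#include <vector>

// "All in one" object: a loop that only updates positions still drags the
// name/inventory/AI data through the cache alongside them.
struct FatEntity
{
    float x, y, z;
    float vx, vy, vz;
    char  name[64];
    int   inventory[32];
    int   aiState;
};

// Split into components: the data that gets processed together is stored together.
struct Transform { float x, y, z, vx, vy, vz; };

void integrate(std::vector<Transform> &transforms, float dt)
{
    // Touches only the 24 bytes per entity it actually needs, so every cache
    // line that gets fetched is filled with data the loop really uses.
    for(std::size_t i = 0; i < transforms.size(); ++i)
    {
        transforms[i].x += transforms[i].vx * dt;
        transforms[i].y += transforms[i].vy * dt;
        transforms[i].z += transforms[i].vz * dt;
    }
}[/source]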

OTOH this is about mobile development and java, so I have no idea if the bottlenecks are different in that case.

Also, if looping backwards is faster on a given platform, and the compiler for this platform can prove that the direction of the loop doesn't matter (or can compensate for it), the compiler probably already does it at some optimization level. This isn't exactly an arcane trick and is just as well-known as, say, xoring a register with itself to zero it.


It also depends on the hardware under the hood; CPUs can prefetch data into cache in a speculative manner so that when you come to use it it has already been fetched from main memory. If a CPU can't prefetch cache lines backwards very well then a reverse loop could end up being much slower.

(Also register XOR can be slower than simply setting it to zero with an immediate, again depending on the CPU, as it can cause stalls due to register dependencies in the pipeline. I'm pretty sure x86/x64 compilers don't use the XOR trick any more and haven't in some time now for that reason.)

This is also very much a premature optimization as well (in C++ at least). Reverse iteration might yield a slightly faster comparison, but is just as likely to cause an increased number of cache misses if the index is used in any way at all (even if the order does not matter semantically).


Not on my machine using C++.

My extensive benchmarks across a wide variety of platforms and compilers reveal that it is overall the fastest choice.
For iOS, for example, profiling revealed 2 functions to be high on the time-consuming list (Xcode comes with “Instruments” that show you how long every function takes) and I was able to reduce the time spent inside those functions by half by reversing the order of the loop to count down to 0.

My coworkers saw my commit to SVN and we had a long discussion about it, since it was the first time they had seen this kind of loop.
We drew out how the cache was being used etc.
It turns out that the cache misses are exactly equal to those when counting up through an array. You are going to have the same number of cache misses no matter which way you go.
One of them pointed out that some machines have predict-forward cache mechanisms, which is true; you don’t get to exploit those when traversing in reverse, but in practice the smaller instructions and faster comparisons make up for it, especially in inlined performance-critical functions.
When you think about it, predict-forward caching is rarely exploitable with regard to overall memory access. It is mainly for the benefit of the instruction cache, not the data cache.

For arrays spanning fewer than 16 bytes, you don’t get to take advantage of this regardless of the array traversal direction, which means reverse traversal is always a win.

Hence it is not a premature optimization in practice.
The code is smaller, the iterations are faster, and the time spent coding is shortened (typing “for ( u32 I = uTotal; I--; )” is at least twice as fast as typing “for ( u32 I = 0; I < uTotal; ++I )”).


When it comes to Java, on the other hand, there is no other side of the coin. We considered this absolutely necessary whenever possible, for both performance and code size.



Edit2: that wasn't a fair comparison; strangely enough, switching the order of the iterations makes a difference

I take that to mean that your results do not indicate any performance penalty from reverse traversal.
That is besides the fact that your code is flawed: you have an invalid access in your reverse loop, and it also iterates one more time than the forward loop.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


For iOS, for example, profiling revealed 2 functions to be high on the time-consuming list (Xcode comes with “Instruments” that show you how long every function takes) and I was able to reduce the time spent inside those functions by half by reversing the order of the loop to count down to 0.


Yeah, I would pay to see that happen. If you actually got that result, then either something in your loop was early-outing sooner by counting down, or your comparison was doing a calculation on every iteration (e.g. i < something.size()) that skewed the numbers.

Either that or you were using some kind of interpreted or intermediate language that did strange things. Otherwise it is physically impossible to get that result with native machine code. I can't say for Java, but I would really have to see it to believe it would have anything more than an extremely negligible effect.

The difference between iterating up and comparing against a memory address, and iterating down and comparing against zero, is exactly one test instruction on all ARM and x86 platforms with any reasonable compiler, plus an extremely small memory latency cost that is going to be a cache hit on every single iteration except possibly the first, unless your loop is doing so much work that the iteration method couldn't even be part of the equation.

I just did a quick benchmark on 4 different systems with different CPUs (Windows and OSX, AMD and Intel), counting up and counting down by 2 billion iterations and ran the test 10,000 times, and the worst-case difference was 0.3%. This is with the internals of the loop doing nothing more than a simple addition accumulation and rollover.

It is true that it is usually going to be slightly faster to iterate to 0, but the difference is only going to matter in the most extreme cases.
The code that I changed was inside the vector math library and had been inlined into probably thousands of locations in the code, and it included a nested loop from 0 to 2 (3 iterations) that had been unrolled in all of those locations.
My little change cascaded into something your simple test cases would never reveal: smaller unrolled code across thousands of locations led to fewer instruction-cache misses, not to mention multiplying that 0.3% from your results by thousands.
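
To give an idea of the kind of code involved, here is a hypothetical sketch of such an inlined vector-math loop counted down to 0 (not the actual library code):
[source lang="cpp"]// Hypothetical sketch of an inlined vector-math inner loop (made-up names).
struct Vec3 { float v[3]; };

inline float dot(const Vec3 &a, const Vec3 &b)
{
    float sum = 0.0f;
    for(int i = 3; i--; )  // 3 iterations, counted down to 0; trivially unrolled when inlined
        sum += a.v[i] * b.v[i];
    return sum;
}[/source]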

You cited a cache miss on every iteration, which implies you do not know how the cache works; that is backed by the fact that you yourself got a 0.3% increase in performance. You didn’t say that the 0.3% was an increase, but you called it a “worst-case”, and since you are against reverse looping, the worst case for you would be an increase in performance. Your last sentence backs this up as well.


So here is the deal.
Looping down to 0 allows the compiler to emit smaller instructions that take fewer cycles. There are many ways to compare against 0 so the compiler has many more options than it would otherwise and can make the best choices in all situations.
There is never any doubt as to whether a function such as XXX.size() will be called only once and cached, or called on every iteration depending on compiler competency.
Smaller instructions can in practice cascade into much smaller code—which simple test cases can never reveal—and lead to fewer cache misses.
It takes half as long to type.
It is never slower than, but can often be faster than forward looping.
But you don’t like it because…?
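
To illustrate the XXX.size() point, a quick sketch (nothing from any particular codebase; the function names are made up):
[source lang="cpp"]#include <cstddef>
#include <vector>

// Forward: whether objects.size() is evaluated once or on every iteration
// depends on what the compiler can prove about the loop body.
void updateForward(std::vector<float> &objects)
{
    for(std::size_t i = 0; i < objects.size(); ++i)
        objects[i] *= 0.5f;
}

// Reverse: size() is evaluated exactly once by construction, and the loop
// condition is a plain compare against zero.
void updateReverse(std::vector<float> &objects)
{
    for(std::size_t i = objects.size(); i--; )
        objects[i] *= 0.5f;
}[/source]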

You yourself admit it is usually going to be faster. Is that not enough in itself? Why do you have to consider that it matters only in extreme cases?
Yes, it probably does only matter in extreme cases such as the one that showed up in my profiler. But as long as the algorithmic result is the same, how does it hurt you to reverse the loop?
This is the kind of, “I am against premature optimizations at all costs,” mentality that I can’t stand.
I agree that profiling should be used to locate bottlenecks, but at the end of the day I would rather not end up with a list of bottlenecks that wouldn’t even have existed in the first place had I taken the time to do simple things such as these. I prefer my profiling to show me a list of functions that baffle me as to why they are taking so long, rather than a bunch of functions whose solutions are as mundane as the order of a loop.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

On code design:
Rule #1: Keep it simple (you can always add complexity later).

There are two versions of a programmer: the one who writes the code and the one who has to maintain/update the code. Sometimes it's the same programmer, sometimes it's different programmers. Sometimes you'll come back to your code six months later and barely remember what you were doing and why you did it. This is where the power of simplicity and good comments will shine.

Simple algorithms are easier to understand and easier to troubleshoot. When you're inventing the solution to your problem, ask yourself "is this the simplest way to solve the problem?" (sort of like Occam's razor).

Rule #2: Spend more time and effort trying to choose the best algorithms rather than trying to tweak little bits of code for performance gains.
Example: If you have an O(n^2) algorithm and you try to speed up its performance by tightening the inner execution code, your gains will be marginal compared to finding an O(log n) or O(1) algorithm.
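
For instance (a contrived sketch; lookupLinear/lookupSorted are made-up names), switching from a linear scan to a binary search over sorted data changes the complexity class, which no amount of inner-loop tweaking can do:
[source lang="cpp"]#include <algorithm>
#include <cstddef>
#include <vector>

// O(n) per lookup: tightening this loop body only shaves a constant factor.
bool lookupLinear(const std::vector<int> &values, int key)
{
    for(std::size_t i = 0; i < values.size(); ++i)
        if(values[i] == key)
            return true;
    return false;
}

// O(log n) per lookup: requires sorting the data once up front, but each
// query now touches only a handful of elements, no matter how large n gets.
bool lookupSorted(const std::vector<int> &sortedValues, int key)
{
    return std::binary_search(sortedValues.begin(), sortedValues.end(), key);
}[/source]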

Rule #3: (you should follow rules 1 & 2 first) Spend time optimizing your code when you're seeing actual performance problems.
I've spent lots of time trying to solve problems that don't exist. It's generally a waste of time and can lead you down quite a rabbit hole. Don't do it. At the same time, don't take this as permission to start writing sloppy code.

Rule #4: Use empirical evidence to justify your optimization.
That means, of course, that you have to gather the empirical evidence through a profiler, by measuring CPU instruction counts, CPU load, network bandwidth, etc.
Remember debugging 101: Identify, Replicate, Isolate, Solve, Review.

Rule #5: (or maybe rule #0) If you can avoid reinventing the wheel, then do so.
Has someone else done what you're trying to do? Did they use a better technique? Is there available code you can use? Is there a library you could use? A game engine which satisfies your requirements? A software app which already does what you're trying to do? (of course, this is completely ignoring the counter-productive legal BS)

Rule #6: Don't lose sight of your end users and what they care about.
All through university, I pushed myself to write the most technically complicated code. If an assignment asked for a 2D game, I'd go above and beyond and create a 3D game. If we had to find and write up a ray tracer feature, I'd find one that experts said was impossible and then try to do it. Then we'd demo our projects to our fellow students and claim bragging rights. Well, my games/apps were impressive in their technical complexity, but they were buggy and unpolished. Other students spent more time refining and polishing their simpler apps/games. Guess who won? Not me. It turns out that nobody gives a shit about the backend technical implementation and its difficulty. It's ALL about functionality, usability, and presentation. It was a good, albeit hard, lesson learned and worth remembering :)
(so, your end users won't care about how your OOP is ultimately designed, as long as your app is functional, usable and well presented.)

This topic is closed to new replies.
