Will this perform better?

11 comments, last by King Mir 10 years, 6 months ago

Hi

I have a question towards memory locality and speed. Lets say I have two ways of doing a thing:

case A:

    int* data = membervariablepointer;
    for (int i = 0; i < 9000000; i++)
    {
        *(data + i*4) = *(data + i*4) * 2;
    }

case B:

    for (int i = 0; i < 9000000; i++)
    {
        *(membervariablepointer + i*4) = *(membervariablepointer + i*4) * 2;
    }

Will case A be faster than case B? I am thinking that case B performs a very indirect memory access on every iteration: first fetch the pointer value from the membervariablepointer location, then fetch a value possibly 32000+ bytes away from it. Will the compiler optimize it into case A, so that it moves the pointer value into a local and walks through memory smoothly inside the loop? Does case A make sense?


They'll be equally fast, with case A using an additional 32/64-bit variable (on x86/x64) for the data pointer. (Both 'data' and 'membervariablepointer' point to the same memory address, so IMO there's not much sense in using case A.)

Visit my blog, follow me on twitter or check out my bitbucket repositories.

The compiler will, in release mode, convert one case into the other anyway at even the minimum level of optimisation it applies.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

How are you allocating the memory membervariablepointer is pointing to?

Just a note on your code: you are multiplying only every fourth element of your array by 2. Is this what you want? (Because you multiply i by 4 in pointer arithmetic on an int*, each access steps 4 ints, i.e. 16 bytes.)

Compilers are pretty smart these days, so you can expect it to optimize it well. I think your code would look cleaner like this:


    // allocate memory
    int* membervariablepointer = new int[9000000];

    int* begin = membervariablepointer;
    int* end = membervariablepointer + 9000000;
    for( int* i = begin; i < end; ++i )
        *i = *i * 2;

    // clean up
    delete[] membervariablepointer;

But that's also very dirty. You should use std::vector unless you have a very good reason not to:


    // allocate memory
    std::vector<int> membervariable;
    membervariable.resize(9000000);

    for( std::vector<int>::iterator it = membervariable.begin(); it != membervariable.end(); ++it )
        *it *= 2;
"I would try to find halo source code by bungie best fps engine ever created, u see why call of duty loses speed due to its detail." -- GettingNifty

Well, the thing is that at compile time the value of membervariablepointer is unknown, so at every iteration of case B the pointer value is read from the &membervariablepointer location in memory, and then the value at membervariablepointer + offset is read. I wonder whether this discontinuity will invalidate the cache and page on every iteration.

Not sure, but I think you'll save 9 million CPU instructions by changing your i++ to ++i.

i++ has an extra copy made before incrementing.

Not sure, but I think you'll save 9 million CPU instructions by changing your i++ to ++i.
i++ has an extra copy made before incrementing.

In principle you are correct that ++i is never slower than and sometimes faster than i++, however in this case the copy is never accessed and is therefore unlikely to generate any code. The result will be the same either way on nearly all compilers.

If you were really anal you would write it as follows:
for ( int i = 9000000; --i >= 0; ) { }
Most platforms have instructions designed for comparing against 0 that require fewer bytes and cycles, so this will usually be fastest and smallest. But it should still be checked on the target platform, since the PlayStation Vita, for example, doesn't really like this.

For the rest of what has been said here, most is conjecture. You don't know what the compiler will do, and speaking in absolutes about what the compiler will do is always wrong unless it involves impossibilities in the target instruction set. For example, I can say as an absolute that "if ( 3 == 90 )" will never generate code on any compiler for any platform: no platform has instructions for comparing two constants at run-time, so no conditional test can be performed, so no CMP/JMP instruction pair (or the equivalent) can be generated. It is always a compile-time evaluation that resolves to 0, making it a physical impossibility for any code to be generated.

You should benchmark or look at assembly to determine which is faster, and on various platforms.

Hopefully, when you do benchmark/run performance tests, you will find that you should be focusing on more high-level code if you are truly interested in optimization.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Nah, it will optimise that to ++i since the value before the increment isn't being used.

EDIT: Ninja'd

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley
I believe the main obstacle a modern compiler might have over such optimisations is pointer aliasing. Let us take the following example:

    class Example {
    public:
        int count;
        int *pointer;

        void frobnicate() {
            for (int i = 0; i < count; ++i) {
                // Modify *pointer
            }
        }
    };
Here, the reason the compiler might "reload" the value of count every loop iteration is to handle the case where pointer is pointing at count!

While I'm not an expert in the behaviour of optimising C++ compilers, I believe a competent compiler should rule out this case in the example you originally posted and compile both to the same machine code. Compilers use a number of heuristics, such as assuming that you aren't fooling with the type system, to try and avoid generating the kind of machine code which needs to take aliasing into account.

A general rule is difficult to extract, however. As the complexity of the program increases, it is possible the compiler will not be able to deduce that aliasing is "impossible" and generate sub-optimal code. Personally, I recommend writing the program in the most straightforward way possible, and if later profiling indicates a bottleneck addressing these on a case by case basis. The most effective way to solve a bottleneck could involve a different algorithm or data structure, or managing memory to better suit the processor caches, instead of low level changes like your original example.

Even if one loop somehow generated more complicated code, there's a good chance it wouldn't affect performance very much - loops like that with small quantities of work in them are often limited by memory bandwidth and not computation speed.

See http://igoro.com/archive/gallery-of-processor-cache-effects/ for an example.

This topic is closed to new replies.
