Cost of Abstraction

I double-checked that in Visual Studio 2015.

Well, try a no-shit compiler then: https://godbolt.org/g/fzUlOH


mov     edi, .L.str
xor     esi, esi
xor     eax, eax
call    printf

I challenge you to write code that is free of those super-heavyweight abstractions and that performs significantly better than this.

Inline assembly allowed if you like... :)

(EDIT: I just noticed that I'm only printing out indices (identical to what you're doing) when actually I intended to print values from the array, which is somewhat more meaningful for ranges; hence the superfluous array. Just ignore that.

Results in basically the same code either way, only with a load from memory instead of an immediate. The same goes for std::array or even std::vector instead of a C array, by the way; the only addition is a call to operator new and operator delete at the beginning/end of main in the vector case, which is an obvious necessity if you dynamically allocate something.)
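
For reference, here is a minimal sketch of the kind of code behind that link (my reconstruction; the exact source only lives on godbolt, and the names range and values are my own):

#include <stdio.h>

// Hypothetical stand-in for the godbolt snippet: a trivial
// half-open integer range consumed by a range-based for loop.
struct range {
    struct iterator {
        int i;
        int operator *() const { return i; }
        iterator &operator ++() { ++i; return *this; }
        bool operator !=(iterator other) const { return i != other.i; }
    };
    int first, last;
    iterator begin() const { return {first}; }
    iterator end() const { return {last}; }
};

int main()
{
    int values[8] = {0};  // the "superfluous array" mentioned above
    for (int i : range{0, 8}) {
        printf("%d %d\n", i, values[i]);
    }
    return 0;
}

With optimizations on, the whole iterator apparatus folds away and what remains is essentially the printf loop shown above.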

Did you time it? Can you compile both to assembly language and post the difference?

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <limits.h>

#define MAX_ITER 100000000
#define TEST_A

#ifdef TEST_A
namespace util {
class range {
public:
    class iterator {
        friend class range;
    public:
        unsigned long int operator *() const {
            return i_;
        }
        const iterator &operator ++() {
            ++i_;
            return *this;
        }
        iterator operator ++(int) {
            iterator copy(*this);
            ++i_;
            return copy;
        }
        bool operator ==(const iterator &other) const {
            return i_ == other.i_;
        }
        bool operator !=(const iterator &other) const {
            return i_ != other.i_;
        }
    protected:
        iterator(unsigned long int start) : i_(start) {
        }
    private:
        unsigned long int i_;
    };

    iterator begin() const {
        return begin_;
    }
    iterator end() const {
        return end_;
    }

    range(unsigned long int begin, unsigned long int end) : begin_(begin), end_(end) {
    }

private:
    iterator begin_;
    iterator end_;
};
} // namespace util

int main(int argc, char **args)
{
    unsigned long int i;
    unsigned long int var = 0;
    struct timespec time_start = {0,0};
    struct timespec time_end = {0,0};
    printf("USING CLASS, %d iterations:\n", MAX_ITER);
    clock_gettime(CLOCK_MONOTONIC, &time_start);
    auto range = util::range(0, MAX_ITER);
    for (auto i : range) {
        var = var + 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &time_end);
    printf("Time: %.5f seconds\n",
        ((double)time_end.tv_sec + 1.0e-9 * time_end.tv_nsec) -
        ((double)time_start.tv_sec + 1.0e-9 * time_start.tv_nsec));
    return 0;
}
#endif

#ifdef TEST_B
int main(int argc, char **args)
{
    unsigned long int i;
    unsigned long int var = 0;
    struct timespec time_start = {0,0};
    struct timespec time_end = {0,0};
    printf("NO CLASS, %d iterations:\n", MAX_ITER);
    clock_gettime(CLOCK_MONOTONIC, &time_start);
    for (i = 0; i < MAX_ITER; i++) {
        var = var + 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &time_end);
    printf("Time: %.5f seconds\n",
        ((double)time_end.tv_sec + 1.0e-9 * time_end.tv_nsec) -
        ((double)time_start.tv_sec + 1.0e-9 * time_start.tv_nsec));
    return 0;
}
#endif

NO CLASS, 100000 iterations:
Time: 0.00094 seconds

NO CLASS, 1000000 iterations:
Time: 0.00809 seconds

NO CLASS, 10000000 iterations:
Time: 0.05149 seconds

NO CLASS, 100000000 iterations:
Time: 0.44766 seconds

USING CLASS, 100000 iterations:
Time: 0.00251 seconds

USING CLASS, 1000000 iterations:
Time: 0.01749 seconds

USING CLASS, 10000000 iterations:
Time: 0.13273 seconds

USING CLASS, 100000000 iterations:
Time: 1.35591 seconds

g++ 4.8.5 optimizes away the whole thing in both cases, so I get "0.00000 seconds".

Perhaps you should use a more realistic setup.

I was about to say that any decent compiler would notice that all the loop is doing is twiddling a few locals that never get referenced, so it doesn't have to implement the loop at all.
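
One way to get a meaningful measurement (my suggestion, not something posted in the thread) is to give the loop a side effect the optimizer must preserve, for example by making the accumulator volatile; a helper such as benchmark::DoNotOptimize from Google Benchmark serves the same purpose. A sketch of the volatile version:

#include <stdio.h>

int main()
{
    /* volatile accesses may not be elided or coalesced, so the loop
       body has an observable effect and must actually execute */
    volatile unsigned long int var = 0;
    unsigned long int i;
    for (i = 0; i < 100000000UL; i++) {
        var = var + 1;
    }
    printf("%lu\n", (unsigned long int)var);
    return 0;
}

The same trick works inside the range-based loop, so both variants would then time real iterations rather than two back-to-back clock_gettime calls.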

Allow me to translate this code


int main(int argc, char **args)
{
    unsigned long int i;
    unsigned long int var = 0;
    struct timespec time_start = {0,0};
    struct timespec time_end = {0,0};
    printf("USING CLASS, %d iterations:\n", MAX_ITER);
    clock_gettime(CLOCK_MONOTONIC, &time_start);
    auto range = util::range(0, MAX_ITER);
    for (auto i : range) {
        var = var + 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &time_end);
    printf("Time: %.5f seconds\n",
        ((double)time_end.tv_sec + 1.0e-9 * time_end.tv_nsec) -
        ((double)time_start.tv_sec + 1.0e-9 * time_start.tv_nsec));
    return 0;
}

to how an optimizing compiler reads it:


int main()
{
    printf("blah", immediate value);
    clock_gettime();
    // what the hell?
    // you think I'll execute invariant code?
    clock_gettime();
    printf(time2 - time1);
}

Consequently, the optimizing compiler will transform the above into two consecutive calls to clock_gettime followed by printing the difference, which times nothing but the duration of a syscall: https://godbolt.org/g/ACQaqO

In both tests I didn't use optimization, so both actually run the loop. Compiled with g++ 6.3.1.

Test_A and Test_B produce exactly the same assembly. I ran your code through Compiler Explorer with GCC 6.3 (-std=c++1z -O3) and just changed the define.

Test A

.LC0:
        .string "USING CLASS, %d iterations:\n"
.LC2:
        .string "Time: %.5f seconds\n"
main:
        pxor    xmm0, xmm0
        sub     rsp, 40
        mov     esi, 100000000
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        movaps  XMMWORD PTR [rsp], xmm0
        movaps  XMMWORD PTR [rsp+16], xmm0
        call    printf
        mov     rsi, rsp
        mov     edi, 1
        call    clock_gettime
        lea     rsi, [rsp+16]
        mov     edi, 1
        call    clock_gettime
        pxor    xmm0, xmm0
        mov     edi, OFFSET FLAT:.LC2
        movsd   xmm2, QWORD PTR .LC1[rip]
        mov     eax, 1
        pxor    xmm1, xmm1
        cvtsi2sdq       xmm0, QWORD PTR [rsp+24]
        mulsd   xmm0, xmm2
        cvtsi2sdq       xmm1, QWORD PTR [rsp+16]
        addsd   xmm0, xmm1
        pxor    xmm1, xmm1
        cvtsi2sdq       xmm1, QWORD PTR [rsp+8]
        mulsd   xmm1, xmm2
        pxor    xmm2, xmm2
        cvtsi2sdq       xmm2, QWORD PTR [rsp]
        addsd   xmm1, xmm2
        subsd   xmm0, xmm1
        call    printf
        xor     eax, eax
        add     rsp, 40
        ret
.LC1:
        .long   3894859413
        .long   1041313291

Test B

.LC0:
        .string "NO CLASS, %d iterations:\n"
.LC2:
        .string "Time: %.5f seconds\n"
main:
        pxor    xmm0, xmm0
        sub     rsp, 40
        mov     esi, 100000000
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        movaps  XMMWORD PTR [rsp], xmm0
        movaps  XMMWORD PTR [rsp+16], xmm0
        call    printf
        mov     rsi, rsp
        mov     edi, 1
        call    clock_gettime
        lea     rsi, [rsp+16]
        mov     edi, 1
        call    clock_gettime
        pxor    xmm0, xmm0
        mov     edi, OFFSET FLAT:.LC2
        movsd   xmm2, QWORD PTR .LC1[rip]
        mov     eax, 1
        pxor    xmm1, xmm1
        cvtsi2sdq       xmm0, QWORD PTR [rsp+24]
        mulsd   xmm0, xmm2
        cvtsi2sdq       xmm1, QWORD PTR [rsp+16]
        addsd   xmm0, xmm1
        pxor    xmm1, xmm1
        cvtsi2sdq       xmm1, QWORD PTR [rsp+8]
        mulsd   xmm1, xmm2
        pxor    xmm2, xmm2
        cvtsi2sdq       xmm2, QWORD PTR [rsp]
        addsd   xmm1, xmm2
        subsd   xmm0, xmm1
        call    printf
        xor     eax, eax
        add     rsp, 40
        ret
.LC1:
        .long   3894859413
        .long   1041313291

The only difference was the string definition:

        .string "USING CLASS, %d iterations:\n"
        .string "NO CLASS, %d iterations:\n"

"I can't believe I'm defending logic to a turing machine." - Kent Woolworth [Other Space]

Just kinda thinking out loud, but it seems to me that if you've worked yourself into some kind of corner where you have to optimize how many instructions get spit out for a for loop, then something else is already wrong. I can understand that some people are more comfortable working with structures that either make life easier or make code more readable in exchange for some overhead, and I get that a literal for loop is not always the most appropriate choice, but a lot of these sorts of "utility functions" strike me as trying to outsmart the language, or being clever for cleverness's sake, with not much real-world benefit.

Great, you've shaved a few microseconds from your iteration code, but the real problem might be that you're iterating over too many things in the first place: we're optimizing loops instead of optimizing the actual content of the game scene, or whatever else you have. ¯\_(ツ)_/¯

To be fair, I do spend a lot of time in javascript-land, where people invent all kinds of "clever" nonsense: trying to recreate features of languages they're more comfortable with, or trying to turn every tiny function of a system into "modules" with all the overhead, process, and dependencies that go along with it *cough*node*cough*.

I won't claim to have anything really valuable to say about which approach is reaaaaaally "the best" way to go in this specific case, or which is faster or more correct, but I'm usually very leery of code that's trying to be "clever" when there was no good reason not to use a straightforward approach in the first place. Just my 2c.

I know these things emerge from beginners in their early years, trying to abstract things...

I believe the real culprit here is not abstraction but over-generalization for future, currently unknown, extensions. It leads to lots of layers of interfaces, derived classes, etc., all because maybe you need to extend in such and such a direction tomorrow.

(Things get worse if you try to build a framework or library, like a game engine, as you really don't know where people will extend it.)
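
A caricature of what that layering looks like (my own sketch, not from the thread): an interface and virtual dispatch for something with exactly one implementation. Unlike the iterator example above, this kind of abstraction can have a real cost when the compiler can't devirtualize the call and inlining is lost.

// Speculative generality: an interface for a single implementation.
struct IAdder {
    virtual ~IAdder() = default;
    virtual int add(int a, int b) = 0;
};

struct DefaultAdder : IAdder {
    int add(int a, int b) override { return a + b; }
};

// The straightforward version, trivially inlined everywhere:
int add(int a, int b) { return a + b; }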

In both tests I didn't use optimization

Then both tests are pointless, and you're testing the wrong thing.

The entire system was designed around those optimizations. Even before the language was C++, the compiler model was built fundamentally around the idea that these transformations would take place. The ENTIRE REASON this language family allows placing those little blocks of code inline is so they can be elided, removed from the final executable.

That is also the reason this language family doesn't support features like the reflection found in more modern languages: when the code is compiled, the infrastructure vanishes completely. Reflection requires all those little hooks, but in properly compiled code the hooks no longer exist and the abstractions cannot be found. All those little accessors and mutators, the small utility functions, are supposed to vanish completely, and they do vanish completely in all but the most non-optimized debug-build settings. All those mysterious layers of template types your debugger shows inside containers don't exist in the final build; it boils down to direct memory addresses and direct manipulation rather than the abstract versions.

The abstractions are there to make human lives easier. The compiler makes them vanish.

If the abstractions don't make your life easier as a programmer then don't use them. But be aware that they do make other programmers' lives easier and they have no runtime cost, so other programmers are going to use them whenever they make sense.
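
To make that concrete (a sketch of my own, not from the thread): with optimizations enabled, mainstream compilers typically emit identical code for a trivial accessor and for direct member access.

// At -O2, GCC and Clang typically compile both functions below to a
// single load; the accessor inlines away and leaves no trace.
struct Sprite {
    float x_;
    float x() const { return x_; }  // trivial accessor
};

float via_accessor(const Sprite &s) { return s.x(); }
float direct_access(const Sprite &s) { return s.x_; }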
