
tanzanite7


Topics I've Started

Testing a GC implementation

08 March 2014 - 08:32 PM

Just finished implementing a "cooperative" incremental conservative mark-and-sweep garbage collector. Basic test coverage is present and passes, BUT ... i am having trouble coming up with bigger tests and stress tests to better understand its behavior and theoretical limits.

How to generate a mess of classes and modify the relations without accidentally killing off too many branches or having the graph grow without bound? Ideas?

(minor note: in the end i want to record the modifications/additions as a replay for tests - so generating them does not have to be super fast, just reasonable).
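
For concreteness, the kind of generator i currently have in mind - bounded random graph churn. A rough sketch only: it assumes the user-view API below, that GCRoot is assignable and converts to a plain pointer, and that GC-trait objects are created with plain "new"; std::vector and all the counts/rates are arbitrary scaffolding:

struct Node {
    USE_GC(10000);
    GC<Node> next[4]; // a few out-edges per node
};

// One mutation step. Bounding the root set stops unbounded growth; rewiring
// single edges (instead of dropping whole roots) creates garbage in small
// increments rather than killing off entire branches at once.
void stressStep(std::vector<GCRoot<Node>> &roots, Random::MT &rnd) {
    if(roots.size() < 64 || rnd.nextIntMax(2)) {
        roots.push_back(GCRoot<Node>(new Node())); // grow the root set
    } else {
        roots[rnd.nextIntMax(int32u(roots.size()))] = roots.back(); // shrink it
        roots.pop_back();
    }
    // rewire one random edge between two random root subgraphs
    Node *a = roots[rnd.nextIntMax(int32u(roots.size()))];
    Node *b = roots[rnd.nextIntMax(int32u(roots.size()))];
    a->next[rnd.nextIntMax(4)] = b->next[rnd.nextIntMax(4)];
}

Each step serializes to an op tag plus a few indices, which should also cover the replay-recording requirement.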

Some details about the GC implementation (most of it likely fairly irrelevant to the question, but anyway):

User view (slightly relevant to the question):

struct GcObject {
  USE_GC(100); // adds GC trait to this class (not inheritable), 100 = hint of how many objects you expect to have (not a limit in any sense)
  GC<GcObject> var_next; // behaves as a pointer to GcObject
  // GC<NonGcObject> var_error; // compile-time error (all GC pointers, a la GCRoot etc., accept only objects with the GC trait)
  Array<GcObject> var_array;   // wrappers exist for the few containers i use the most: array/vector and hashmap - the wrapper
                               // detects on its own that it holds GC-trait objects and does what it needs to do (most notably,
                               // the array will then hold "GcObject*"). + versions for dynamic and static roots.
};

struct NonGcObject {
  GCRoot<GcObject> var_root; // dynamic root for GC - behaves as a pointer to GcObject
  // GC<GcObject> var_error; // this is runtime error (assert) - so, debug mode only
};

GCStatic<GcObject> var_static; // static root for GC - behaves as a pointer to GcObject
//GC<GcObject> var_error; // assert

// currently there is no way to detect misplaced GCRoot/GCStatic
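
A hypothetical end-to-end usage sketch (MemGC::tickMain is described in the list below; construction via plain "new" and the GCRoot pointer-constructor/assignments are assumptions of mine, not the actual API):

GCStatic<GcObject> head; // static root, keeps the list alive across frames

void frame() {
    GCRoot<GcObject> node(new GcObject()); // stack locals must be dynamic roots - the stack is not scanned
    node->var_next = head;
    head = node;            // published - the "node" root may now go out of scope safely
    MemGC::tickMain(0.002); // spend up to ~2ms on incremental collection
}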

Features/assumptions/intended-usage:
* It is meant for class objects, not data blobs.
* Discarding pointers to already-encountered objects is very cache friendly (32MB of objects is covered by 2-16 pages of GC-internal structures needed for identification/discard) - the objects' memory itself is never referenced at any point.
* Per object type: discovers and records the object structure for faster processing in the future (however, it never assumes it identified the structure correctly - except for locations that are not pointers to GC memory). If it detects that an object cannot have any pointers to GC memory, then it will skip processing all objects of that type in the future.
* Processes in batches per type (for cache locality of the target objects and the GC internals).
* The GC trait adds a memory-pooling scheme to the class - cache locality for the GC and whatever else is using the class.
* Pool: maintains a list of free items ordered by increasing pointer address - to eliminate degradation of locality as much as possible (see the sketch after this list).
* The GC does not touch an object's memory on delete if the object does not have a destructor. Same for the constructor - but that is hardly useful.
* Does not examine the stack and hence does not allow invocation at arbitrary points.
* Garbage collection invocation: MemGC::tickMain(timeslice_in_seconds).
* It is possible to offload the majority of the work to a worker thread - but i doubt it is useful and i will probably never implement it.
* Allocation of GC-trait objects is not thread safe - it was simply never needed, which makes justifying the overhead rather impossible.
* ... and probably more.
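
For illustration, the address-ordered free list mentioned above boils down to something like this (a simplified sketch, not the actual pool code):

struct FreeSlot { FreeSlot *next; };

// freed slots are kept sorted by increasing address, so allocation reuses
// the lowest addresses first and live objects stay densely packed
void poolFree(FreeSlot *&head, void *mem) {
    FreeSlot *slot = (FreeSlot*)mem;
    FreeSlot **at = &head;
    while(*at && *at < slot) at = &(*at)->next; // find the sorted position
    slot->next = *at;
    *at = slot;
}

void *poolAlloc(FreeSlot *&head) {
    FreeSlot *slot = head; // lowest-address free slot, or 0 when exhausted
    if(slot) head = slot->next;
    return slot;
}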


Visual Studio 2013 wariness.

05 March 2014 - 05:10 PM

Have been toying with VS2013 for about a week, and i am a bit worried about its fitness. Please share your experiences so far.

 

In my case, it is a mixed bag.

 

The good:

* more standards stuff implemented (inc. stuff i really missed).

* improved Intellisense considerably (major context dependent cleanup).

* semi-optimized debug build (something the VS devs said they would like to work on ... well, it sure looks like they have worked on it :D ).

* updated support for XP target.

* "overview" on scrollbar.

... the rest i am not yet benefiting from.

 

The bad:

* Highly irritating UI bug - tabs always close themselves on reload (reported, and they can reproduce it - so there is hope for the far-away future).

* Compiler crash - the newly added default template arguments for function templates are a bit touchy (report pending reproduction confirmation - which is certain to follow).

* Intellisense likes to incorrectly mark a lot of code as buggy - and some services like to crash when working with the affected code. Don't get me wrong, i am happy that it is as good as it is - just worried that at some point it craps its pants so completely/persistently that i essentially lose it altogether. I wonder if i should take the time to report Intellisense mistakes (in the sense that they certainly already know a crapton of its problems) - especially as it would take considerable time to generate the reproduction cases.

* Property pages are as buggy as they have always been - but i am not bothered too much, as i have learned to work around them all.

* Registering it requires that Internet Explorer is installed, its security settings lowered, and IE set as the default browser - facepalm. At least that is over. Not sure if i can uninstall IE now or not :/ - just disabled it for the time being.


Conditional default wrapper template parameter.

27 February 2014 - 03:30 AM

Let's say i have:
template<class T, template<class> class Wrap> struct Snafu;
I can use it as "Snafu<int, WrapZ>". Nice.

Now i would like Wrap to have a default value, conditionally:
template<class T> struct Type {
    typedef typename std::enable_if<std::is_funky<T>::value, WrapA<T>>::type type;
    ... etc
};

template<class T, class Wrap = typename Type<T>::type> struct Snafu;
Unfortunately, this degrades usage to "Snafu<int, WrapB<int>>" instead of the previous, preferred "Snafu<int, WrapB>" (assuming, for the sake of this specific example, that i happen to want WrapB instead of whatever it selects on its own).

Have been up for 24 hours ... and can not wrap my head around it. Can i have my cake and eat it too?

edit: Wait a minute. Actually, it seems i can not use typedef for substitution trickery as easily. I need sleep.
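
edit 2: for reference, one way that looks like it should keep "Wrap" a template template parameter: select the default via an alias-template member (a sketch only, untested on VS2013; WrapA/WrapB and the is_integral test stand in for the real wrappers and "is_funky"):

template<class T, bool = std::is_integral<T>::value> struct DefaultWrap {
    template<class U> using type = WrapA<U>; // default for "funky" types
};
template<class T> struct DefaultWrap<T, false> {
    template<class U> using type = WrapB<U>; // default for everything else
};

// Wrap remains a template template parameter, so an explicit override is
// still spelled "Snafu<int, WrapB>" - not "Snafu<int, WrapB<int>>"
template<class T, template<class> class Wrap = DefaultWrap<T>::template type>
struct Snafu;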

inquisitive mind vs cache

18 February 2014 - 08:08 AM

It has been several years since i last checked how cache behavior matches my expectations - so i decided to poke around a bit.

Note: keep in mind that the topic is extremely hardware dependent and not particularly useful for generalization. Also, a specific extreme-corner-case test - the only kind one can reasonably make - tells little about real-world programs; the test is mostly just curiosity.

A few relevant hardware metrics: Ivy Bridge, 6MB L3, 32KB L1 data cache separate from the code cache (i will call it L0 below), 8GB main memory.
OS: Windows 7 64-bit (relevant for TLB behavior), swap file disabled.

Things i wondered:
* What is the cost of reading main memory (inc address translation misses)?
* Random access vs consecutive?
* Consecutive forward vs backwards (and cache line vs page)?
* Consecutive cache line vs 2 cache lines?

Conclusions derived so far:
* Full cache-miss (TLB included) cost - nigh impossible to measure (that kind of miss is extremely rare in normal circumstances and i have failed to confidently measure it so far). Seems to be between 250-800 ticks (an extreme oversimplification! - for starters, the CPU and main memory are separate entities).
* Random access is equal to any ordered access if the consecutive reads skip cache lines - as i suspected.
* Random access is ~1.5-3 times slower than consecutive (the stride must be less than or equal to a cache line, or it is no better than random).
* Consecutive forward vs backward - makes no difference at all.

* Cache makes a huge difference.

Now the kicker: all the results are wrong (to an unknown extent) - the damn compiler and the actual processor are just way too good at their jobs and i have not been able to stop them. ARGH! Which is why i am making this thread, besides just sharing the results.

The test program (not standalone, but whatever else it uses is fairly self explanatory):

        TIMES(15); TIMEE(15); TIMES(15); TIMEE(15); // TIMES - starts a timer, TIMEE - ends a timer, TIME - read the timer value
        GLOGD(L" timer overhead: %I64u", TIME(15));

        // working set
        #define REP     8
        #define TESTS   12
        //#define PAGES   2048
        //#define PAGES   4096
        #define PAGES   16384
        intp trash1 = intp(Mem::getChunks(MB(64)));
        intp trash2 = intp(Mem::getChunks(MB(64)));
        //intp trash3 = intp(VirtualAlloc(0, GB(1), MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        intp at = intp(Mem::getChunks(PAGES * 4096));
        int16u offs[PAGES];       // offsets to simulate random access with minimal overhead
        int64u stats[TESTS][REP];

        // init: build offset array, ensuring all values are in range and unique
        Random::MT rnd(123);
        int16u build[PAGES];
        for(int32u i=0; i<PAGES; i++) build[i] = int16u(i);
        for(int32u i=0; i<PAGES; i++) {
            int32u nr = rnd.nextIntMax(PAGES - i);
            offs[i] = build[nr];
            build[nr] = build[PAGES - 1 - i];
        }

        // macro for trashing all cache levels
        #define TRASH for(intp i=0; i<PAGES * 4096; i+=4) *(int32u*)(at + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash2 + i) += int32u(i);
                      //for(intp i=0; i<GB(1); i+=64) *(int32u*)(trash3 + i) += int32u(i);
        // macro for directional access
        #define CHECK_DIR(step) if(step > 0) for(int32u i=0;       i<PAGES; i++) sum ^= *(int32u*)(at + i * step);\
                                else         for(int32u i=PAGES-1; i<PAGES; i--) sum ^= *(int32u*)(at + i * -(step));
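        // (note: the backward loop above exits via unsigned wraparound - after i reaches 0, "i--" wraps it to a value >= PAGES)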
        // macro for random access
        #define CHECK_RND(size) for(int32u i=0; i<PAGES; i++) sum ^= *(int32u*)(at + intp(offs[i]) * size);

        // init
        TRASH;
        int32u sum = 0;// anti optimization

        // crude attempt to get times for cold and hot single access.
        TRASH; TRASH;
        TIMES(0); sum ^= *(int32u*)at; TIMEE(0);
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        TIMES(1); sum ^= *(int32u*)at; TIMEE(1);
        GLOGD(L"  single access: %I64u %I64u", TIME(0), TIME(1));

        // tests
        for(int32u rep=0; rep<REP; rep++) {
            TRASH;
            TRASH; TIMES(0); CHECK_DIR(  128); TIMEE(0); // +128 (2 cache lines)
            TRASH; TIMES(10); CHECK_DIR(  192); TIMEE(10); // +192 (3 cache lines)
            TRASH; TIMES(1); CHECK_DIR( 4096); TIMEE(1); // +4K
            TRASH; TIMES(2); CHECK_DIR(-4096); TIMEE(2); // -4K
            TRASH; TIMES(3); CHECK_RND( 4096); TIMEE(3); // ?4K
            TRASH; TIMES(4); CHECK_DIR( 64);   TIMEE(4); // +64
            TRASH; TIMES(5); CHECK_DIR(-64);   TIMEE(5); // -64
            TRASH; TIMES(6); CHECK_RND( 64);   TIMEE(6); // ?64
            TRASH; TIMES(7); CHECK_DIR( 4);    TIMEE(7); // +4
            TRASH; TIMES(8); CHECK_DIR(-4);    TIMEE(8); // -4
            TRASH; TIMES(9); CHECK_RND( 4);    TIMEE(9); // ?4
            for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i);
            TIMES(11); for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i); TIMEE(11); // L0, hot cache
            // record results
            for(int32u test=0; test<TESTS; test++) stats[test][rep] = TIME(test);
        }

        // throw away outliers ... well, not really => just throw away half of the results that disagree the most with the rest
        #define CUT (REP/2)
        for(int32u test=0; test<TESTS; test++) {
            for(int32u cut=0; cut<CUT; cut++) {
                // get average at cut point
                int64u sum = 0;
                for(int32u rep=cut; rep<REP; rep++) sum += stats[test][rep];
                int64u avg = sum / (REP - cut);
                // find outlier
                int32u outlier = cut;
                int64u dist = _abs64(stats[test][outlier] - avg);
                for(int32u rep=cut+1; rep<REP; rep++) {
                    int64u distCur = _abs64(stats[test][rep] - avg);
                    if(dist < distCur) {
                        dist = distCur;
                        outlier = rep;
                    }
                }
                // swap out the outlier (ie. swap outlier with value from cut point)
                int64u tmp = stats[test][outlier];
                stats[test][outlier] = stats[test][cut];
                stats[test][cut] = tmp;
            }
        }

        // calculate averages and minimums
        int64u average[TESTS], minimum[TESTS];
        for(int32u test=0; test<TESTS; test++) {
            int64u sum = 0;
            for(int32u rep=CUT; rep<REP; rep++) sum += stats[test][rep];
            average[test] = sum / (REP - CUT);
            int64u min = stats[test][0];
            for(int32u rep=0; rep<REP; rep++) if(min > stats[test][rep]) min = stats[test][rep];
            minimum[test] = min;
        }

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", minimum[1], minimum[2], minimum[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", minimum[4], minimum[5], minimum[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", minimum[7], minimum[8], minimum[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", minimum[0], minimum[10]);
        GLOGD(L" test L0     : %7I64u %c", minimum[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", average[1], average[2], average[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", average[4], average[5], average[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", average[7], average[8], average[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", average[0], average[10]);
        GLOGD(L" test L0     : %7I64u %c", average[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(minimum[1]) / PAGES, float(minimum[2]) / PAGES, float(minimum[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(minimum[4]) / PAGES, float(minimum[5]) / PAGES, float(minimum[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(minimum[7]) / PAGES, float(minimum[8]) / PAGES, float(minimum[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(minimum[0]) / PAGES, float(minimum[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(minimum[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(average[1]) / PAGES, float(average[2]) / PAGES, float(average[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(average[4]) / PAGES, float(average[5]) / PAGES, float(average[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(average[7]) / PAGES, float(average[8]) / PAGES, float(average[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(average[0]) / PAGES, float(average[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(average[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

The actual results this gives:

¤0   0:00:00.2546 {D}:   overhead: 43  cold: 43  hot: 53 // this is fairly bonkers, as the timer code i use is not meant for micro-measurements (just __rdtscp without cpuid, and it accesses memory because the VC2012 intrinsic for it is terribly implemented and not usable for this purpose)
¤0   0:00:00.3180 {D}:   single access: 1691 411 // more bonkers
¤0   0:00:03.3538 {D}: --------------------------------- minimums
¤0   0:00:03.3544 {D}:  test 4K +-? :  730986  752583  782969
¤0   0:00:03.3547 {D}:  test 64 +-? :  292295  284462  462598
¤0   0:00:03.3551 {D}:  test  4 +-? :   20399   31741   53250
¤0   0:00:03.3554 {D}:  test 64*2*3 :  444014  541493
¤0   0:00:03.3557 {D}:  test L0     :   11589
¤0   0:00:03.3559 {D}: --------------------------------- averages
¤0   0:00:03.3562 {D}:  test 4K +-? :  739322  753575  786489
¤0   0:00:03.3564 {D}:  test 64 +-? :  296640  295497  462916
¤0   0:00:03.3567 {D}:  test  4 +-? :   20560   32331   55008
¤0   0:00:03.3569 {D}:  test 64*2*3 :  447766  547238
¤0   0:00:03.3570 {D}:  test L0     :   11641
¤0   0:00:03.3572 {D}: --------------------------------- minimums
¤0   0:00:03.3574 {D}:  test 4K +-? : 44.62 45.93 47.79
¤0   0:00:03.3576 {D}:  test 64 +-? : 17.84 17.36 28.23
¤0   0:00:03.3578 {D}:  test  4 +-? :  1.25  1.94  3.25
¤0   0:00:03.3579 {D}:  test 64*2*3 : 27.10 33.05
¤0   0:00:03.3581 {D}:  test L0     :  0.71
¤0   0:00:03.3583 {D}: --------------------------------- averages
¤0   0:00:03.3584 {D}:  test 4K +-? : 45.12 45.99 48.00 // consecutive +4K, consecutive -4K, and random
¤0   0:00:03.3586 {D}:  test 64 +-? : 18.11 18.04 28.25
¤0   0:00:03.3588 {D}:  test  4 +-? :  1.25  1.97  3.36
¤0   0:00:03.3590 {D}:  test 64*2*3 : 27.33 33.40 // consecutive +128 and +192
¤0   0:00:03.3591 {D}:  test L0     :  0.71 // consecutive +1, hot cache, reading a total of 16KB - one byte at a time.

Note:
* While the single access tests are bonkers - the rest are valid ... sorta, i'll get to that.
* Reading 16384 values in a loop took ~11-12 thousand clock cycles with a hot cache (~0.71 ticks per read).
* Worst cache hit ratio slowed it down by a factor of ~70.
* A TLB miss is just an extra memory miss with Windows 7 / x86-64.

So, why are the results kinda wrong?
* the compiler partially unrolled SOME of the loops (ie. doing 4 reads per iteration).
* processor can (AND ABSOLUTELY LOVES TO), in effect, unroll the loops too and have multiple reads in flight.
* processor can execute multiple instructions at the same clock cycle (which is why the last result is as fast as it is).

The results are not comparable - and i do not know the margin of error. For example: how many reads are in flight in the 4K test?

Glancing at one of those 4K tests: the loop code is 15 bytes total (5 basic instructions)! It fits fully in the decoded-uop cache (and VC likes to pad NOPs in front of hot loops to make the alignment perfect for it [Ivy likes 32B]). Also, the register pressure is, expectedly, very low (in regards to renaming).

Any ideas on how to stop the processor, and to a lesser extent the compiler, from being so good at their job while still getting usable results?
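
edit: one direction i intend to try - pointer chasing: every load's address depends on the previous load's value, so the reads cannot overlap no matter how clever the processor is. A rough sketch reusing the buffers/macros from above (timer slot 12 is unused; the chain must be built after TRASH, since trashing "at" afterwards would corrupt the links):

        // link the pages into a chain following the existing offs[] permutation
        TRASH;
        for(int32u i=0; i<PAGES; i++) {
            int32u next = (i + 1 < PAGES) ? offs[i + 1] : offs[0];
            *(intp*)(at + intp(offs[i]) * 4096) = at + intp(next) * 4096;
        }
        // evict "at" again using only the trash buffers (2 x 64MB >> 6MB L3)
        for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);
        for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash2 + i) += int32u(i);
        // pointer chasing: iteration N+1 cannot start before the load of iteration N completes
        intp p = at + intp(offs[0]) * 4096;
        TIMES(12); for(int32u i=0; i<PAGES; i++) p = *(intp*)p; TIMEE(12);
        sum ^= int32u(p); // consume the result so the loop is not optimized away
        GLOGD(L" chained 4K  : %I64u", TIME(12));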

-------------------------------------------------------------------------------
Now, to get single-access timings i dusted off my x86 Asm class (and added cpuid/rdtsc/rdtscp):
 

        // build test function
        int64u timings[3]; // 0 = plain overhead, 1 = cold access (+overhead), 2 = hot access (+overhead)
        Asm fn; // x64 cdecl, scratch: RAX, RCX, RDX, R8, R9, R10, R11

        fn.nop(16); // VS2012 fails to decode the first 16 bytes - NOP'ing over that area so i can actually see what i am stepping through while debugging
        //fn.int3();
        fn.push(Asm::RBX);                                        //  push RBX
        fn.push(Asm::R15);                                        //  push R15
        // get plain overhead of serialised timer query
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();                                              //  rdtscp
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[0]), Asm::na, 0, Asm::R9);  //  mov [&timings[0]], R9
        // cold access
        fn.nop(2);
        fn.mov(Asm::R15, at);                                     //  mov R15, at
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[1]), Asm::na, 0, Asm::R9);  //  mov [&timings[1]], R9
        // warmup
        fn.mov(Asm::R9, 128);                                     //  mov R9, 128
        int32u loop = fn.getIP();                                 //loop:
        fn.alu(Asm::ADD, Asm::R8, Asm::R15, 0, Asm::na, 0);       //  add R8, [R15]
        fn.dec(Asm::R9);                                          //  dec R9
        fn.j(Asm::NZ, loop);                                      //  jnz loop
        // hot access
        fn.nop(2);
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[2]), Asm::na, 0, Asm::R9);  //  mov [&timings[2]], R9
        fn.pop(Asm::R15);                                         //  pop R15
        fn.pop(Asm::RBX);                                         //  pop RBX
        fn.ret();
        fn.done();

        void *function = VirtualAlloc(0, 65536, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if(!fn.rebase(function)) GLOGF(L"Rebase failed!");
        TRASH; TRASH;
        ((void(*)())function)(); // call function
        GLOGD(L"  overhead: %I64u  cold: %I64u  hot: %I64u", timings[0], timings[1], timings[2]);

Which gives me (the numbers vary from test to test, but are fairly close to each other): "overhead: 39 cold: 36 hot: 64".

x_x

Seriously!? Any idea what is wrong there?

Thought that perhaps telling Windows to nuke the instruction cache - or whatever it was - might help (in case it somehow manages to affect things; it should not in this case), but i can not for the life of me remember the function call (FlushInstructionCache, perhaps?).

 

Note: rdtscp, while serializing, does not prevent OoO execution of the instructions that follow it - including the cache-missing reads i want to test. Hence the cpuid, as it is said to stop that. rdtsc does no serialization at all, but none is needed as i have to use cpuid anyway. Evidently, something is still wrong.


(VS2012) SFINAE solution needing a sanity check (detecting a trait added by a macro)

12 February 2014 - 04:21 PM

Had a really hard time making VC2012 produce the required results. The version that finally did what i wanted (the code is fairly self-explanatory):

template<class T> class has_rc {
    typedef char yes;
    typedef long no;
    template<class S, S> struct sig; // requiring an exact signature match to exclude possibility of inheriting the trait
    template<class C> static yes test(sig<void(*)(C), &C::_rc_class_only>*);
    template<class C> static no test(...);
public:
    enum Boolean { value = sizeof(test<T>(0)) == sizeof(yes) };
};

// Macro for adding the trait. All trait-related code that is irrelevant here has been removed - only the trait marker used by "has_rc" remains.
// The "Class" argument is in my case auto-generated via macros; it is added here as a parameter for clarity.
#define RC(Class) public: static void _rc_class_only(Class) {}

struct TestRCnot {};
TEST(!has_rc<TestRCnot>::value);                 // value = false, no RC trait

struct TestRC { RC(TestRC); };                   // value = true, have RC trait
TEST( has_rc<TestRC>::value);

struct TestRCderived : TestRC { };               // value = false, no RC trait *trait is not inheritable!*
TEST(!has_rc<TestRCderived>::value);

struct TestRC2 : TestRCderived { RC(TestRC2); }; // value = true, have RC trait *re-adding a trait is OK*
TEST( has_rc<TestRC2>::value);
Question: can i simplify this code? Is anything pointlessly complex? I would like "_rc_class_only" to be private, but then it fails to compile (for TestRCderived) even if i friend the "has_rc" class.

Note: can not use "constexpr".
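
edit: the only trimming i see that keeps the exact same technique - drop the yes/no typedefs and the enum (a sketch; behavior should be identical):

template<class T> class has_rc {
    template<class S, S> struct sig; // exact signature match, as before
    template<class C> static char test(sig<void(*)(C), &C::_rc_class_only>*);
    template<class C> static long test(...);
public:
    static const bool value = sizeof(test<T>(0)) == sizeof(char);
};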
