inquisitive mind vs cache



#1 tanzanite7   Members   -  Reputation: 995

Posted 18 February 2014 - 08:08 AM

It has been several years since I last checked how cache behavior matches my expectations - so I decided to poke around a bit.

Note: keep in mind that the topic is extremely hardware dependent and not particularly useful for generalization. Also, a specific extreme corner-case test (the only kind one can reasonably make) tells little about real-world programs - the test is mostly just curiosity.

A few relevant hardware metrics: Ivy Bridge, 6MB L3, 32KB L0 (data cache, separate from the code cache), 8GB main memory.
OS: Windows 7 64-bit (relevant for TLB behavior), swap file disabled.

Things I wondered:
* What is the cost of reading main memory (including address translation misses)?
* Random access vs consecutive?
* Consecutive forward vs backward (and cache line vs page)?
* Consecutive cache line vs 2 cache lines?

Conclusions derived so far:
* Full cache miss (TLB included) cost is nigh impossible to get (that kind of thing is extremely rare in normal circumstances and I have failed to confidently measure it so far). Seems to be between 250-800 ticks (an extreme oversimplification! - for starters, the CPU and main memory are separate entities).
* Random access is equal to any ordered access if the consecutive reads skip cache lines - as I suspected.
* Random access is ~1.5-3 times slower than consecutive (the accesses must be at most a cache line apart, otherwise it is no better than random).
* Consecutive forward vs backward - makes no difference at all.

* Cache makes a huge difference.
Now the kicker: all the results are wrong (to an unknown extent) - the damn compiler and the actual processor are just way too good at their job and I have not been able to stop them. ARGH! Which is why I am making this thread, besides just sharing the results.

The test program (not standalone, but whatever else it uses is fairly self-explanatory):

        TIMES(15); TIMEE(15); TIMES(15); TIMEE(15); // TIMES - starts a timer, TIMEE - ends a timer, TIME - read the timer value
        GLOGD(L" timer overhead: %I64u", TIME(15));

        // working set
        #define REP     8
        #define TESTS   12
        //#define PAGES   2048
        //#define PAGES   4096
        #define PAGES   16384
        intp trash1 = intp(Mem::getChunks(MB(64)));
        intp trash2 = intp(Mem::getChunks(MB(64)));
        //intp trash3 = intp(VirtualAlloc(0, GB(1), MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        intp at = intp(Mem::getChunks(PAGES * 4096));
        int16u offs[PAGES];       // offsets to simulate random access with minimal overhead
        int64u stats[TESTS][REP];

        // init: build offset array, ensuring all values are in range and unique
        Random::MT rnd(123);
        int16u build[PAGES];
        for(int32u i=0; i<PAGES; i++) build[i] = int16u(i);
        for(int32u i=0; i<PAGES; i++) {
            int32u nr = rnd.nextIntMax(PAGES - i);
            offs[i] = build[nr];
            build[nr] = build[PAGES - 1 - i];
        }

        // macro for trashing all cache levels
        #define TRASH for(intp i=0; i<PAGES * 4096; i+=4) *(int32u*)(at + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash2 + i) += int32u(i);
                      //for(intp i=0; i<GB(1); i+=64) *(int32u*)(trash3 + i) += int32u(i);
        // macro for directional access
        #define CHECK_DIR(step) if(step > 0) for(int32u i=0;       i<PAGES; i++) sum ^= *(int32u*)(at + i * step);\
                                else         for(int32u i=PAGES-1; i<PAGES; i--) sum ^= *(int32u*)(at + i * -(step));
        // macro for random access
        #define CHECK_RND(size) for(int32u i=0; i<PAGES; i++) sum ^= *(int32u*)(at + intp(offs[i]) * size);

        // init
        TRASH;
        int32u sum = 0; // anti-optimization

        // crude attempt to get times for cold and hot single access.
        TRASH; TRASH;
        TIMES(0); sum ^= *(int32u*)at; TIMEE(0);
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        TIMES(1); sum ^= *(int32u*)at; TIMEE(1);
        GLOGD(L"  single access: %I64u %I64u", TIME(0), TIME(1));

        // tests
        for(int32u rep=0; rep<REP; rep++) {
            TRASH;
            TRASH; TIMES(0); CHECK_DIR(  128); TIMEE(0); //
            TRASH; TIMES(10); CHECK_DIR(  192); TIMEE(10); //
            TRASH; TIMES(1); CHECK_DIR( 4096); TIMEE(1); // +4K
            TRASH; TIMES(2); CHECK_DIR(-4096); TIMEE(2); // -4K
            TRASH; TIMES(3); CHECK_RND( 4096); TIMEE(3); // ?4K
            TRASH; TIMES(4); CHECK_DIR( 64);   TIMEE(4); // +64
            TRASH; TIMES(5); CHECK_DIR(-64);   TIMEE(5); // -64
            TRASH; TIMES(6); CHECK_RND( 64);   TIMEE(6); // ?64
            TRASH; TIMES(7); CHECK_DIR( 4);    TIMEE(7); // +4
            TRASH; TIMES(8); CHECK_DIR(-4);    TIMEE(8); // -4
            TRASH; TIMES(9); CHECK_RND( 4);    TIMEE(9); // ?4
            for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i);
            TIMES(11); for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i); TIMEE(11); // L0 (hot)
            // record results
            for(int32u test=0; test<TESTS; test++) stats[test][rep] = TIME(test);
        }

        // throw away outliers ... well, not really => just throw away half of the results that disagree the most with the rest
        #define CUT (REP/2)
        for(int32u test=0; test<TESTS; test++) {
            for(int32u cut=0; cut<CUT; cut++) {
                // get average at cut point
                int64u sum = 0;
                for(int32u rep=cut; rep<REP; rep++) sum += stats[test][rep];
                int64u avg = sum / (REP - cut);
                // find outlier
                int32u outlier = cut;
                int64u dist = _abs64(stats[test][outlier] - avg);
                for(int32u rep=cut+1; rep<REP; rep++) {
                    int64u distCur = _abs64(stats[test][rep] - avg);
                    if(dist < distCur) {
                        dist = distCur;
                        outlier = rep;
                    }
                }
                // swap out the outlier (ie. swap outlier with value from cut point)
                int64u tmp = stats[test][outlier];
                stats[test][outlier] = stats[test][cut];
                stats[test][cut] = tmp;
            }
        }

        // calculate averages and minimums
        int64u average[TESTS], minimum[TESTS];
        for(int32u test=0; test<TESTS; test++) {
            int64u sum = 0;
            for(int32u rep=CUT; rep<REP; rep++) sum += stats[test][rep];
            average[test] = sum / (REP - CUT);
            int64u min = stats[test][0];
            for(int32u rep=0; rep<REP; rep++) if(min > stats[test][rep]) min = stats[test][rep];
            minimum[test] = min;
        }

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", minimum[1], minimum[2], minimum[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", minimum[4], minimum[5], minimum[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", minimum[7], minimum[8], minimum[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", minimum[0], minimum[10]);
        GLOGD(L" test L0     : %7I64u %c", minimum[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", average[1], average[2], average[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", average[4], average[5], average[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", average[7], average[8], average[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", average[0], average[10]);
        GLOGD(L" test L0     : %7I64u %c", average[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(minimum[1]) / PAGES, float(minimum[2]) / PAGES, float(minimum[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(minimum[4]) / PAGES, float(minimum[5]) / PAGES, float(minimum[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(minimum[7]) / PAGES, float(minimum[8]) / PAGES, float(minimum[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(minimum[0]) / PAGES, float(minimum[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(minimum[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(average[1]) / PAGES, float(average[2]) / PAGES, float(average[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(average[4]) / PAGES, float(average[5]) / PAGES, float(average[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(average[7]) / PAGES, float(average[8]) / PAGES, float(average[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(average[0]) / PAGES, float(average[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(average[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

The actual results this gives:

¤0   0:00:00.2546 {D}:   overhead: 43  cold: 43  hot: 53 // this is fairly bonkers, as the timer code I use is not meant for micro measurements (just __rdtscp without cpuid; it also accesses memory, because the VC2012 intrinsic for it is terribly implemented and not usable for this purpose)
¤0   0:00:00.3180 {D}:   single access: 1691 411 // more bonkers
¤0   0:00:03.3538 {D}: --------------------------------- minimums
¤0   0:00:03.3544 {D}:  test 4K +-? :  730986  752583  782969
¤0   0:00:03.3547 {D}:  test 64 +-? :  292295  284462  462598
¤0   0:00:03.3551 {D}:  test  4 +-? :   20399   31741   53250
¤0   0:00:03.3554 {D}:  test 64*2*3 :  444014  541493
¤0   0:00:03.3557 {D}:  test L0     :   11589
¤0   0:00:03.3559 {D}: --------------------------------- averages
¤0   0:00:03.3562 {D}:  test 4K +-? :  739322  753575  786489
¤0   0:00:03.3564 {D}:  test 64 +-? :  296640  295497  462916
¤0   0:00:03.3567 {D}:  test  4 +-? :   20560   32331   55008
¤0   0:00:03.3569 {D}:  test 64*2*3 :  447766  547238
¤0   0:00:03.3570 {D}:  test L0     :   11641
¤0   0:00:03.3572 {D}: --------------------------------- minimums
¤0   0:00:03.3574 {D}:  test 4K +-? : 44.62 45.93 47.79
¤0   0:00:03.3576 {D}:  test 64 +-? : 17.84 17.36 28.23
¤0   0:00:03.3578 {D}:  test  4 +-? :  1.25  1.94  3.25
¤0   0:00:03.3579 {D}:  test 64*2*3 : 27.10 33.05
¤0   0:00:03.3581 {D}:  test L0     :  0.71
¤0   0:00:03.3583 {D}: --------------------------------- averages
¤0   0:00:03.3584 {D}:  test 4K +-? : 45.12 45.99 48.00 // consecutive +4K, consecutive -4K, and random
¤0   0:00:03.3586 {D}:  test 64 +-? : 18.11 18.04 28.25
¤0   0:00:03.3588 {D}:  test  4 +-? :  1.25  1.97  3.36
¤0   0:00:03.3590 {D}:  test 64*2*3 : 27.33 33.40 // consecutive +128 and +192
¤0   0:00:03.3591 {D}:  test L0     :  0.71 // consecutive +1, hot cache, reading a total of 16KB - one byte at a time.

Note:
* While the single access tests are bonkers, the rest are valid ... sorta; I'll get to that.
* Reading 16384 values in a loop took ~11-12K clock cycles with a hot cache (under one cycle per read).
* The worst cache hit ratio slowed it down by a factor of ~70.
* A TLB miss is just an extra memory miss with Windows 7 / x86-64.

So, why are the results kinda wrong?
* The compiler partially unrolled SOME of the loops (i.e. doing 4 reads per iteration).
* The processor can (AND ABSOLUTELY LOVES TO), in effect, unroll the loops too and have multiple reads in flight.
* The processor can execute multiple instructions in the same clock cycle (which is why the last result is as fast as it is).

The results are not comparable - and I do not know the margin of error. I.e. how many reads are in flight, for example, in the 4K test?

Glancing at one of those 4K tests: the loop code is 15 bytes total (5 basic instructions)! It more than fits fully in the decoder cache (and VC likes to pad nops in front of hot loops to make the alignment perfect for it [Ivy likes 32B]). Also, the register pressure is, expectedly, very low (in regards to renaming).

Any ideas how to stop the processor, and to a lesser extent the compiler, from being so good at their job and still get usable results?

-------------------------------------------------------------------------------
Now, to get single access timings I dusted off my x86 Asm class (and added cpuid/rdtsc/rdtscp):
 

        // build test function
        int64u timings[3]; // 0 = plain overhead, 1 = cold access (+overhead), 2 = hot access (+overhead)
        Asm fn; // x64 cdecl, scratch: RAX, RCX, RDX, R8, R9, R10, R11

        fn.nop(16); // VS2012 fails to decode the first 16 bytes - NOP'ing over that area so I can actually see what I am stepping through while debugging
        //fn.int3();
        fn.push(Asm::RBX);                                        //  push RBX
        fn.push(Asm::R15);                                        //  push R15
        // get plain overhead of serialised timer query
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();                                              //  rdtscp
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[0]), Asm::na, 0, Asm::R9);  //  mov [&timings[0]], R9
        // cold access
        fn.nop(2);
        fn.mov(Asm::R15, at);                                     //  mov R15, at
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[1]), Asm::na, 0, Asm::R9);  //  mov [&timings[1]], R9
        // warmup
        fn.mov(Asm::R9, 128);                                     //  mov R9, 128
        int32u loop = fn.getIP();                                 //loop:
        fn.alu(Asm::ADD, Asm::R8, Asm::R15, 0, Asm::na, 0);       //  add R8, [R15]
        fn.dec(Asm::R9);                                          //  dec R9
        fn.j(Asm::NZ, loop);                                      //  jnz loop
        // hot access
        fn.nop(2);
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[2]), Asm::na, 0, Asm::R9);  //  mov [&timings[2]], R9
        fn.pop(Asm::R15);                                         //  pop R15
        fn.pop(Asm::RBX);                                         //  pop RBX
        fn.ret();
        fn.done();

        void *function = VirtualAlloc(0, 65536, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if(!fn.rebase(function)) GLOGF(L"Rebase failed!");
        TRASH; TRASH;
        ((void(*)())function)(); // call function
        GLOGD(L"  overhead: %I64u  cold: %I64u  hot: %I64u", timings[0], timings[1], timings[2]);

Which gives me (the numbers vary test by test, but are fairly close to each other): "overhead: 39 cold: 36 hot: 64".

x_x

Seriously!? Any idea what is wrong there?

Thought that perhaps telling Windows to flush the instruction cache might help (in case it somehow manages to affect things - it should not in this case) - but I cannot for the life of me remember the function call (FlushInstructionCache, perhaps?).

 

Note: rdtscp, while serializing, does not prevent OoO execution of instructions that follow it - including the cache-miss reads I want to test. Hence the cpuid, as it is said to stop that. rdtsc does no serialization at all, but none is needed, as I have to use cpuid anyway. Evidently, something is wrong.
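(For reference, the same fencing pattern written with compiler intrinsics instead of hand-assembled code - purely illustrative of the intended instruction order, since as noted above the VC2012 __rdtscp intrinsic was found unusable for the macro timers; a minimal sketch assuming <intrin.h>:)

#include <intrin.h> // __cpuid, __rdtsc, __rdtscp

// Time one read of *p with the cpuid/rdtsc ... rdtscp/cpuid bracket used above.
unsigned __int64 timeSingleRead(volatile unsigned int *p)
{
    int info[4];
    unsigned int aux;
    __cpuid(info, 0);                      // full serialization: nothing before leaks past this point
    unsigned __int64 start = __rdtsc();    // read TSC
    unsigned int value = *p;               // the access under test
    unsigned __int64 end = __rdtscp(&aux); // waits for prior instructions, does NOT fence later ones
    __cpuid(info, 0);                      // keep following instructions out of the timed window
    (void)value;
    return end - start;
}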




#2 SeanMiddleditch   Members   -  Reputation: 1861


Posted 18 February 2014 - 12:05 PM

Full cache miss (TLB included) cost is nigh impossible to get (that kind of thing is extremely rare in normal circumstances and I have failed to confidently measure it so far).


And this is why little micro-benchmarks written without extensive benchmarking experience are so often misleading. Cache misses are hardly "rare in normal circumstances" once you start dealing with a game that has hundreds of megabytes or even gigabytes of resources and complex, fully-featured graphics/physics/AI systems.

A lot of the information you're asking for can be acquired with tools like Intel VTune. Which ain't cheap, unfortunately. You can also try simulators like cachegrind (only available on UNIX-like OSes currently, not Windows). You really should use benchmarks that simulate real workloads rather than trying to generate something synthetic by hand, though.

If you're dead set on this approach, though, you can prevent out-of-order memory operations with memory barriers/fences (mfence, lfence, sfence, etc.) and you can flush a cache using instructions posted at https://stackoverflow.com/questions/1756825/cpu-cache-flush.
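A minimal sketch of that approach, assuming SSE2 (<emmintrin.h>). Flushing just the lines under test is much cheaper than cycling through a 64MB trash buffer, though unlike the TRASH pass it does not evict TLB entries:

#include <emmintrin.h> // _mm_clflush, _mm_mfence, _mm_lfence (SSE2)

// Evict the cache line holding p from every cache level, then fence so
// no later load starts before the flush is ordered.
void flushLine(const void *p)
{
    _mm_clflush(p);  // invalidate the line containing p
    _mm_mfence();    // order the flush against surrounding loads/stores
    _mm_lfence();    // block later loads from issuing early
}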

#3 Ohforf sake   Members   -  Reputation: 1048


Posted 19 February 2014 - 04:44 AM

Since you are interested in latencies (I presume) you should build your benchmark around pointer chasing, since that prevents any form of OoO.

 

Essentially, you craft an array of integers and then do the following:

unsigned index = 0;
for (unsigned iteration = 0; iteration < numIterations; iteration++) {
     index = array[index];
}
// do something with index so the compiler doesn't remove the loop.

If you craft the array such that array[i] = i+n, you get sequential access with stride n. But you can also create all kinds of other semi-random patterns. Note that modern prefetchers can detect the stride of sequential accesses, so you might need something more random to throw them off (see the sketch below).

 

Since the compiler doesn't know anything about the array, it can't optimize anything away. And the only thing the CPU can run in parallel is the increment and comparison from the for loop (which is totally fine). But it has to perform all those loads in sequence in order to arrive at the final index.
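A sketch of how such chase arrays might be built (the helper names are illustrative, not from the thread): a strided chain that wraps around, and a random chain built by shuffling a visiting order so the whole array forms one cycle with no stride to predict.

#include <cstdlib> // rand
#include <vector>

// Strided chain: element i points stride elements ahead, wrapping at the end,
// so chasing it reproduces sequential access with the given stride.
std::vector<unsigned> makeStrideChain(unsigned size, unsigned stride)
{
    std::vector<unsigned> array(size);
    for (unsigned i = 0; i < size; i++)
        array[i] = (i + stride) % size;
    return array;
}

// Random chain: shuffle a visiting order, then link each element to the next.
// The result is one cycle covering every element, defeating stride prediction.
std::vector<unsigned> makeRandomChain(unsigned size)
{
    std::vector<unsigned> order(size);
    for (unsigned i = 0; i < size; i++) order[i] = i;
    for (unsigned i = size - 1; i > 0; i--) {        // Fisher-Yates shuffle
        unsigned j = rand() % (i + 1);
        unsigned tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    std::vector<unsigned> array(size);
    for (unsigned i = 0; i < size; i++)
        array[order[i]] = order[(i + 1) % size];
    return array;
}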

 



#4 tanzanite7   Members   -  Reputation: 995


Posted 19 February 2014 - 08:25 AM

Since you are interested in latencies (I presume) you should build your benchmark around pointer chasing, since that prevents any form of OoO.

Originally I dismissed that approach, as it was not possible with one of the tests (read bytes, 16K, hot cache) and I wanted the tests to be comparable - since that did not work out, I should have returned to pointer chasing. Thanks for the sanity reminder.

Code changes:

        // macro for directional access
        #define CHECK_DIR(step, nr) { if(step > 0) for(int32u i=0;       i<PAGES; i++) ((int32u*)at)[i * step/4] = i*step/4 + step/4;       \
                                      else         for(int32u i=PAGES-1; i<PAGES; i--) ((int32u*)at)[i * -(step)/4] = i*-(step)/4 + step/4; \
                                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);                                    \
                                      int32u index = step < 0 ? (PAGES-1) * -(step)/4 : 0;                                                  \
                                      TIMES(nr); for(int32u i=0; i<PAGES; i++) index = ((int32u*)at)[index]; TIMEE(nr);                     \
                                      sum ^= index; }
        // macro for random access
        #define CHECK_RND(size, nr) { for(int32u i=0; i<PAGES; i++) ((int32u*)at)[i * size/4] = int32u(offs[(i + 1) % PAGES]) * size/4;     \
                                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);                                    \
                                      int32u index = int32u(offs[0]) * size/4;                                                              \
                                      TIMES(nr); for(int32u i=0; i<PAGES; i++) index = ((int32u*)at)[index]; TIMEE(nr);                     \
                                      sum ^= index; }

/.../

        // tests
        for(int32u rep=0; rep<REP; rep++) {
            CHECK_DIR(  128,  0); //
            CHECK_DIR(  192, 10); //
            CHECK_DIR( 4096,  1); // +4K
            CHECK_DIR(-4096,  2); // -4K
            CHECK_RND( 4096,  3); // ?4K
            CHECK_DIR(   64,  4); // +64
            CHECK_DIR(  -64,  5); // -64
            CHECK_RND(   64,  6); // ?64
            CHECK_DIR(    4,  7); // +4
            CHECK_DIR(   -4,  8); // -4
            CHECK_RND(    4,  9); // ?4
            for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i);
            TIMES(11); for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i); TIMEE(11); // L0
            // record results
            for(int32u test=0; test<TESTS; test++) stats[test][rep] = TIME(test);
        }

Just in case, I looked at the generated assembly too: the compiler really likes to partially unroll the loop (it does four chase steps per iteration). Fairly sure the prefetcher tracks just misses and does not use code at all, so I'm fine with it.
 

000000014000D1A0  mov         eax,r9d  
000000014000D1A3  mov         ecx,dword ptr [r8+rax*4]  
000000014000D1A7  mov         eax,dword ptr [r8+rcx*4]  
000000014000D1AB  mov         ecx,dword ptr [r8+rax*4]  
000000014000D1AF  mov         r9d,dword ptr [r8+rcx*4]  
000000014000D1B3  dec         rdx  
000000014000D1B6  jne         <lambda_c013c40faa6d69a31e50a975878d99c9>::operator()+0A30h (014000D1A0h)  

Note that modern prefetchers can detect the stride of sequential accesses, so you might need something more random to throw them off.

Results seem to confirm that:

¤0   0:00:01.2021 {D}: --------------------------------- minimums
¤0   0:00:01.2027 {D}:  test 4K +-? : 3494819 3497453 3603509
¤0   0:00:01.2032 {D}:  test 64 +-? :  302194  307338 2559251
¤0   0:00:01.2036 {D}:  test  4 +-? :   78268   78265  197659
¤0   0:00:01.2040 {D}:  test 64*2*3 :  674367 1161588
¤0   0:00:01.2042 {D}:  test L0     :   11589 
¤0   0:00:01.2045 {D}: --------------------------------- averages
¤0   0:00:01.2047 {D}:  test 4K +-? : 3496837 3528575 3614809
¤0   0:00:01.2049 {D}:  test 64 +-? :  308422  308939 2563171
¤0   0:00:01.2051 {D}:  test  4 +-? :   78364   78308  198548
¤0   0:00:01.2053 {D}:  test 64*2*3 :  677994 1175889
¤0   0:00:01.2055 {D}:  test L0     :   11649 
¤0   0:00:01.2057 {D}: --------------------------------- minimums
¤0   0:00:01.2059 {D}:  test 4K +-? : 213.31 213.47 219.94
¤0   0:00:01.2061 {D}:  test 64 +-? :  18.44  18.76 156.20
¤0   0:00:01.2063 {D}:  test  4 +-? :   4.78   4.78  12.06
¤0   0:00:01.2065 {D}:  test 64*2*3 :  41.16  70.90
¤0   0:00:01.2066 {D}:  test L0     :   0.71 
¤0   0:00:01.2068 {D}: --------------------------------- averages
¤0   0:00:01.2070 {D}:  test 4K +-? : 213.43 215.37 220.63
¤0   0:00:01.2071 {D}:  test 64 +-? :  18.82  18.86 156.44
¤0   0:00:01.2073 {D}:  test  4 +-? :   4.78   4.78  12.12
¤0   0:00:01.2075 {D}:  test 64*2*3 :  41.38  71.77
¤0   0:00:01.2077 {D}:  test L0     :   0.71 

Notables:
* 128 and 192 strides are not equal to random anymore! (The original faulty test made them equal, apart from a slight difference from TLB pressure.)
* Numbers are in a much more believable, and interesting, range.
* TLB faults cannot explain the difference between the 128 and 192 tests - some stride prediction, presumably? But why should it care about the length of the stride?
* A stride of 4096 is exactly equal to random access (the +6 tick difference comes from the bad TLB access pattern - the array is bigger than what fits in the TLB).

Anyone able/willing to run the test on some other CPU? AMD for example?
 

 

Full cache miss (TLB included) cost is nigh impossible to get (that kind of thing is extremely rare in normal circumstances and I have failed to confidently measure it so far).

Cache misses are hardly "rare in normal circumstances" once you start dealing with a game that has hundreds of megabytes or even gigabytes of resources and complex, fully-featured graphics/physics/AI systems.

 

Re-read what I said. As the quote shows, I am talking about a full miss (*). And it IS very, very rare in normal circumstances.

For example, my 4K max-miss tests get an estimated maximum of 1 (**) to 32 (65536/4/512, (***)) full misses over 16384 reads. So, even in my extreme case, only ~0.006%-0.2% of the reads experience a full miss. Sidenote of relevance: on some platforms a TLB miss is handled by an interrupt - because the misses are just so rare that the overhead does not make much of a difference. IIRC, x86 has that mode too, but I think no one uses it.

(*) Full cache miss: cache miss -> TLB miss -> multiple further full cache misses to service the TLB miss.
(**) The +-4K tests cannot have more than one full miss.
(***) Just a rough estimate: 64MB buffer / 4K pages / size of a typical TLB cache. While it is likely possible to craft a special pattern that is worse, it is rather difficult. I think this estimate is way too big for the exactly-one-read-per-page-in-random-order case.
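For concreteness, the estimate spelled out (the 512-entry figure is an assumption, roughly matching a typical second-level TLB):

// 64 MB buffer   = 65536 KB / 4 KB per page = 16384 pages touched
// 16384 pages    / 512 TLB entries          = 32 complete TLB turnovers (rough worst case)
// => between 1 and ~32 full misses over 16384 reads:
//    1/16384 ~ 0.006%   up to   32/16384 ~ 0.2%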





