• Announcements

    • khawk

      Download the Game Design and Indie Game Marketing Freebook   07/19/17

      GameDev.net and CRC Press have teamed up to bring a free ebook of content curated from top titles published by CRC Press. The freebook, Practices of Game Design & Indie Game Marketing, includes chapters from The Art of Game Design: A Book of Lenses, A Practical Guide to Indie Game Marketing, and An Architectural Approach to Level Design. The GameDev.net FreeBook is relevant to game designers, developers, and those interested in learning more about the challenges in game development. We know game development can be a tough discipline and business, so we picked several chapters from CRC Press titles that we thought would be of interest to you, the GameDev.net audience, in your journey to design, develop, and market your next game. The free ebook is available through CRC Press by clicking here. The Curated Books The Art of Game Design: A Book of Lenses, Second Edition, by Jesse Schell Presents 100+ sets of questions, or different lenses, for viewing a game’s design, encompassing diverse fields such as psychology, architecture, music, film, software engineering, theme park design, mathematics, anthropology, and more. Written by one of the world's top game designers, this book describes the deepest and most fundamental principles of game design, demonstrating how tactics used in board, card, and athletic games also work in video games. It provides practical instruction on creating world-class games that will be played again and again. View it here. A Practical Guide to Indie Game Marketing, by Joel Dreskin Marketing is an essential but too frequently overlooked or minimized component of the release plan for indie games. A Practical Guide to Indie Game Marketing provides you with the tools needed to build visibility and sell your indie games. With special focus on those developers with small budgets and limited staff and resources, this book is packed with tangible recommendations and techniques that you can put to use immediately. As a seasoned professional of the indie game arena, author Joel Dreskin gives you insight into practical, real-world experiences of marketing numerous successful games and also provides stories of the failures. View it here. An Architectural Approach to Level Design This is one of the first books to integrate architectural and spatial design theory with the field of level design. The book presents architectural techniques and theories for level designers to use in their own work. It connects architecture and level design in different ways that address the practical elements of how designers construct space and the experiential elements of how and why humans interact with this space. Throughout the text, readers learn skills for spatial layout, evoking emotion through gamespaces, and creating better levels through architectural theory. View it here. Learn more and download the ebook by clicking here. Did you know? GameDev.net and CRC Press also recently teamed up to bring GDNet+ Members up to a 20% discount on all CRC Press books. Learn more about this and other benefits here.
Sign in to follow this  
Followers 0
tanzanite7

inquisitive mind vs cache

3 posts in this topic

It has been several years when i last checked how cache behavior matches how i expect it to behave - so, decided to poke around a bit.

Note: Keeping in mind that the topic is extremely hardware dependent and not particularly useful for generalization. Also, a specific extremely-corner-case test, the only ones one can reasonably make, tells little about real world programs - the test is mostly just curiosity.

A few relevant hardware metrics: Ivy, 6MB L3, 32KB L0 separate from code cache, 8GB main memory.
Os: Windows 7 64bit (relevant for TLB behavior), swap file disabled.

Things i wondered:
* What is the cost of reading main memory (inc address translation misses)?
* Random access vs consecutive?
* Consecutive forward vs backwards (and cache line vs page)?
* Consecutive cache line vs 2 cache lines?

Conclusions derived so far:
* Full cache miss (TLB included) cost - is nigh' impossible to get (that kind of things are extremely rare in normal circumstances and i have failed to confidently measure it so far). Seems to be between 250-800 ticks (extreme oversimplification! - for starters, CPU and main memory are separate entities).
* Random access is equal to any ordered access if the consecutive reads are skipping cache lines. biggrin.png, as i suspected.
* Random access is ~1.5-3 times slower vs consecutive (must be distanced less or equal to cache line to not be equal to random).
* Consecutive forward vs backward - makes no difference at all.

* Cache makes huge difference.

Now the kicker, all the results are wrong (to an unknown extent) - the damn compiler and the actual processor are just way too good at their job and i have not been able to stop them. ARGH! Which is why i am making this thread, besides just sharing the results.

The test program (not standalone, but whatever else it uses is fairly self explanatory):

        TIMES(15); TIMEE(15); TIMES(15); TIMEE(15); // TIMES - starts a timer, TIMEE - ends a timer, TIME - read the timer value
        GLOGD(L" timer overhead: %I64u", TIME(15));

        // working set
        #define REP     8
        #define TESTS   12
        //#define PAGES   2048
        //#define PAGES   4096
        #define PAGES   16384
        intp trash1 = intp(Mem::getChunks(MB(64)));
        intp trash2 = intp(Mem::getChunks(MB(64)));
        //intp trash3 = intp(VirtualAlloc(0, GB(1), MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        intp at = intp(Mem::getChunks(PAGES * 4096));
        int16u offs[PAGES];       // offsets to simulate random access with minimal overhead
        int64u stats[TESTS][REP];

        // init: build offset array, ensuring all values are in range and unique
        Random::MT rnd(123);
        int16u build[PAGES];
        for(int32u i=0; i<PAGES; i++) build[i] = int16u(i);
        for(int32u i=0; i<PAGES; i++) {
            int32u nr = rnd.nextIntMax(PAGES - i);
            offs[i] = build[nr];
            build[nr] = build[PAGES - 1 - i];
        }

        // macro for trashing all cache levels
        #define TRASH for(intp i=0; i<PAGES * 4096; i+=4) *(int32u*)(at + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash2 + i) += int32u(i);
                      //for(intp i=0; i<GB(1); i+=64) *(int32u*)(trash3 + i) += int32u(i);
        // macro fo directional access
        #define CHECK_DIR(step) if(step > 0) for(int32u i=0;       i<PAGES; i++) sum ^= *(int32u*)(at + i * step);\
                                else         for(int32u i=PAGES-1; i<PAGES; i--) sum ^= *(int32u*)(at + i * -(step));
        // macro for random access
        #define CHECK_RND(size) for(int32u i=0; i<PAGES; i++) sum ^= *(int32u*)(at + intp(offs[i]) * size);

        // init
        TRASH;
        int32u sum = 0;// anti optimization

        // crude attempt to get times for cold and hot single access.
        TRASH; TRASH;
        TIMES(0); sum ^= *(int32u*)at; TIMEE(0);
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        TIMES(1); sum ^= *(int32u*)at; TIMEE(1);
        GLOGD(L"  single access: %I64u %I64u", TIME(0), TIME(1));

        // tests
        for(int32u rep=0; rep<REP; rep++) {
            TRASH;
            TRASH; TIMES(0); CHECK_DIR(  128); TIMEE(0); //
            TRASH; TIMES(10); CHECK_DIR(  192); TIMEE(10); //
            TRASH; TIMES(1); CHECK_DIR( 4096); TIMEE(1); // +4K
            TRASH; TIMES(2); CHECK_DIR(-4096); TIMEE(2); // -4K
            TRASH; TIMES(3); CHECK_RND( 4096); TIMEE(3); // ?4K
            TRASH; TIMES(4); CHECK_DIR( 64);   TIMEE(4); // +64
            TRASH; TIMES(5); CHECK_DIR(-64);   TIMEE(5); // -64
            TRASH; TIMES(6); CHECK_RND( 64);   TIMEE(6); // ?64
            TRASH; TIMES(7); CHECK_DIR( 4);    TIMEE(7); // +4
            TRASH; TIMES(8); CHECK_DIR(-4);    TIMEE(8); // +4
            TRASH; TIMES(9); CHECK_RND( 4);    TIMEE(9); // ?4
            for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i);
            TIMES(11); for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i); TIMEE(11); // L1
            // record results
            for(int32u test=0; test<TESTS; test++) stats[test][rep] = TIME(test);
        }

        // throw away outliers ... well, not really => just throw away half of the results that disagree the most with the rest
        #define CUT (REP/2)
        for(int32u test=0; test<TESTS; test++) {
            for(int32u cut=0; cut<CUT; cut++) {
                // get average at cut point
                int64u sum = 0;
                for(int32u rep=cut; rep<REP; rep++) sum += stats[test][rep];
                int64u avg = sum / (REP - cut);
                // find outlier
                int32u outlier = cut;
                int64u dist = _abs64(stats[test][outlier] - avg);
                for(int32u rep=cut+1; rep<REP; rep++) {
                    int64u distCur = _abs64(stats[test][rep] - avg);
                    if(dist < distCur) {
                        dist = distCur;
                        outlier = rep;
                    }
                }
                // swap out the outlier (ie. swap outlier with value from cut point)
                int64u tmp = stats[test][outlier];
                stats[test][outlier] = stats[test][cut];
                stats[test][cut] = tmp;
            }
        }

        // calculate averages and minimums
        int64u average[TESTS], minimum[TESTS];
        for(int32u test=0; test<TESTS; test++) {
            int64u sum = 0;
            for(int32u rep=CUT; rep<REP; rep++) sum += stats[test][rep];
            average[test] = sum / (REP - CUT);
            int64u min = stats[test][0];
            for(int32u rep=0; rep<REP; rep++) if(min > stats[test][rep]) min = stats[test][rep];
            minimum[test] = min;
        }

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", minimum[1], minimum[2], minimum[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", minimum[4], minimum[5], minimum[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", minimum[7], minimum[8], minimum[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", minimum[0], minimum[10]);
        GLOGD(L" test L0     : %7I64u %c", minimum[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", average[1], average[2], average[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", average[4], average[5], average[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", average[7], average[8], average[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", average[0], average[10]);
        GLOGD(L" test L0     : %7I64u %c", average[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(minimum[1]) / PAGES, float(minimum[2]) / PAGES, float(minimum[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(minimum[4]) / PAGES, float(minimum[5]) / PAGES, float(minimum[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(minimum[7]) / PAGES, float(minimum[8]) / PAGES, float(minimum[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(minimum[0]) / PAGES, float(minimum[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(minimum[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(average[1]) / PAGES, float(average[2]) / PAGES, float(average[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(average[4]) / PAGES, float(average[5]) / PAGES, float(average[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(average[7]) / PAGES, float(average[8]) / PAGES, float(average[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(average[0]) / PAGES, float(average[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(average[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

The actual results this gives:

¤0   0:00:00.2546 {D}:   overhead: 43  cold: 43  hot: 53 // this is fairly bokners as the timer code i use is not meant for micro measures (just __rdtscp without cpuid and accesses memory because VC2012 intrinsic for it is terribly implemented and not usable for this purpose)
¤0   0:00:00.3180 {D}:   single access: 1691 411 // more bonkers
¤0   0:00:03.3538 {D}: --------------------------------- minimums
¤0   0:00:03.3544 {D}:  test 4K +-? :  730986  752583  782969
¤0   0:00:03.3547 {D}:  test 64 +-? :  292295  284462  462598
¤0   0:00:03.3551 {D}:  test  4 +-? :   20399   31741   53250
¤0   0:00:03.3554 {D}:  test 64*2*3 :  444014  541493
¤0   0:00:03.3557 {D}:  test L0     :   11589
¤0   0:00:03.3559 {D}: --------------------------------- averages
¤0   0:00:03.3562 {D}:  test 4K +-? :  739322  753575  786489
¤0   0:00:03.3564 {D}:  test 64 +-? :  296640  295497  462916
¤0   0:00:03.3567 {D}:  test  4 +-? :   20560   32331   55008
¤0   0:00:03.3569 {D}:  test 64*2*3 :  447766  547238
¤0   0:00:03.3570 {D}:  test L0     :   11641
¤0   0:00:03.3572 {D}: --------------------------------- minimums
¤0   0:00:03.3574 {D}:  test 4K +-? : 44.62 45.93 47.79
¤0   0:00:03.3576 {D}:  test 64 +-? : 17.84 17.36 28.23
¤0   0:00:03.3578 {D}:  test  4 +-? :  1.25  1.94  3.25
¤0   0:00:03.3579 {D}:  test 64*2*3 : 27.10 33.05
¤0   0:00:03.3581 {D}:  test L0     :  0.71
¤0   0:00:03.3583 {D}: --------------------------------- averages
¤0   0:00:03.3584 {D}:  test 4K +-? : 45.12 45.99 48.00 // consecutive +4K, consecutive -4K, and random
¤0   0:00:03.3586 {D}:  test 64 +-? : 18.11 18.04 28.25
¤0   0:00:03.3588 {D}:  test  4 +-? :  1.25  1.97  3.36
¤0   0:00:03.3590 {D}:  test 64*2*3 : 27.33 33.40 // consecutive +128 and +192
¤0   0:00:03.3591 {D}:  test L0     :  0.71 // consecutive +1, hot cache, reading a total of 16KB - one byte at a time.

Note:
* While the single access tests are bonkers - the rest are valid ... sorta, i'll get to that.
* Reading 16384 values in a loop took ~11-12 clock cycles with hot cache.
* Worst cache hit ratio slowed it down by a factor of ~70.
* TLB miss is just and extra memory miss with Windows 7 / x86-64.

So, why are the results kinda wrong?
* compiler partially unrolled SOME for the loops (ie. doing 4 per cycle).
* processor can (AND ABSOLUTELY LOVES TO), in effect, unroll the loops too and have multiple reads in flight.
* processor can execute multiple instructions at the same clock cycle (which is why the last result is as fast as it is).

The results are not comparable - and i do not know the margin of error. Ie. how many reads are in flight for example in the 4K test?

Glancing at one of thous 4K tests: the loop code is 15 bytes total (5 basic instructions)! It more than fits fully in decoder cache (and VC likes to pad nops in front of hot loops to make the alignment perfect for it [Ivy likes 32B]). Also, the register pressure is, expectedly, very low (in regards to renaming).

Any ideas how to stop the processor and to a lesser extent, the compiler, from being so good at their job and still get usable results?

-------------------------------------------------------------------------------
Now, to get single access timings i dusted out my x86 Asm class (and added cpuid/rdtsc/rdtscp):
 

        // build test function
        int64u timings[3]; // 0 = plain overhead, 1 = cold access (+overhead), 2 = hot access (+overhead)
        Asm fn; // x64 cdecl, scap: RAX, RCX, RDX, R8, R9, R10, R11

        fn.nop(16); // VS2012 fails do decode first 16 bytes - NOP'ing over that area, so i can actually see what i am stepping through while debugging
        //fn.int3();
        fn.push(Asm::RBX);                                        //  push RBX
        fn.push(Asm::R15);                                        //  push R15
        // get plain overhead of serialised timer query
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();                                              //  rdtscp
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[0]), Asm::na, 0, Asm::R9);  //  mov [&timings[0]], R9
        // cold access
        fn.nop(2);
        fn.mov(Asm::R15, at);                                     //  mov R15, at
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[1]), Asm::na, 0, Asm::R9);  //  mov [&timings[1]], R9
        // warmup
        fn.mov(Asm::R9, 128);                                     //  mov R9, 128
        int32u loop = fn.getIP();                                 //loop:
        fn.alu(Asm::ADD, Asm::R8, Asm::R15, 0, Asm::na, 0);       //  add R8, [R15]
        fn.dec(Asm::R9);                                          //  dec R9
        fn.j(Asm::NZ, loop);                                      //  jnz loop
        // hot access
        fn.nop(2);
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[2]), Asm::na, 0, Asm::R9);  //  mov [&timings[2]], R9
        fn.pop(Asm::R15);                                         //  pop R15
        fn.pop(Asm::RBX);                                         //  pop RBX
        fn.ret();
        fn.done();

        void *function = VirtualAlloc(0, 65536, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if(!fn.rebase(function)) GLOGF(L"Rebase failed!");
        TRASH; TRASH;
        ((void(*)())function)(); // call function
        GLOGD(L"  overhead: %I64u  cold: %I64u  hot: %I64u", timings[0], timings[1], timings[2]);

Which gives me (the number vary test by test, but are fairly close to each other): "overhead: 39 cold: 36 hot: 64".

x_x

Seriously!? Any idea what is wrong there?

Though that perhaps telling windows to nuke instruction cache or whatever it was (in case it somehow manages to affect things - it should not in this case) - but i can not for the life of me remeber the function call.

 

Note: rdtscp, while serializing, goes not prevent OoO executing instructions that follow it - including the cache-fail reads i want to test. Hence the cpuid as it is said to stop that. rdtsc does no serialization at all, but none is needed as i need to use cpuid anyway. Evidently, something is wrong.

2

Share this post


Link to post
Share on other sites

Since you are interested in latencies (I presume) you should build your benchmark around pointer chasing, since that prevents any form of OoO.

Originally dismissed that approach as it was not possible with one of the tests (read bytes, 16K, hot cache) and wanted the tests to be comparable - since that did not work out, i should have returned to pointer chasing. Thanks for the sanity reminder biggrin.png.

Code changes:

        // macro for directional access
        #define CHECK_DIR(step, nr) { if(step > 0) for(int32u i=0;       i<PAGES; i++) ((int32u*)at)[i * step/4] = i*step/4 + step/4;       \
                                      else         for(int32u i=PAGES-1; i<PAGES; i--) ((int32u*)at)[i * -(step)/4] = i*-(step)/4 + step/4; \
                                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);                                    \
                                      int32u index = step < 0 ? (PAGES-1) * -(step)/4 : 0;                                                  \
                                      TIMES(nr); for(int32u i=0; i<PAGES; i++) index = ((int32u*)at)[index]; TIMEE(nr);                     \
                                      sum ^= index; }
        // macro for random access
        #define CHECK_RND(size, nr) { for(int32u i=0; i<PAGES; i++) ((int32u*)at)[i * size/4] = int32u(offs[(i + 1) % PAGES]) * size/4;     \
                                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);                                    \
                                      int32u index = int32u(offs[0]) * size/4;                                                              \
                                      TIMES(nr); for(int32u i=0; i<PAGES; i++) index = ((int32u*)at)[index]; TIMEE(nr);                     \
                                      sum ^= index; }

/.../

        // tests
        for(int32u rep=0; rep<REP; rep++) {
            CHECK_DIR(  128,  0); //
            CHECK_DIR(  192, 10); //
            CHECK_DIR( 4096,  1); // +4K
            CHECK_DIR(-4096,  2); // -4K
            CHECK_RND( 4096,  3); // ?4K
            CHECK_DIR(   64,  4); // +64
            CHECK_DIR(  -64,  5); // -64
            CHECK_RND(   64,  6); // ?64
            CHECK_DIR(    4,  7); // +4
            CHECK_DIR(   -4,  8); // +4
            CHECK_RND(    4,  9); // ?4
            for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i);
            TIMES(11); for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i); TIMEE(11); // L0
            // record results
            for(int32u test=0; test<TESTS; test++) stats[test][rep] = TIME(test);
        }

Just in case looked at the generated assembly too: compiler really likes to partially unroll the loop (does all tests 4 at a time). Fairly sure prefether tracks just misses and does not use code at all, so i'm fine with it.
 

000000014000D1A0  mov         eax,r9d  
000000014000D1A3  mov         ecx,dword ptr [r8+rax*4]  
000000014000D1A7  mov         eax,dword ptr [r8+rcx*4]  
000000014000D1AB  mov         ecx,dword ptr [r8+rax*4]  
000000014000D1AF  mov         r9d,dword ptr [r8+rcx*4]  
000000014000D1B3  dec         rdx  
000000014000D1B6  jne         <lambda_c013c40faa6d69a31e50a975878d99c9>::operator()+0A30h (014000D1A0h)  

Note that modern prefetchers can detect the stride of sequentiall accesses, so you might need s.th. more random to throw them off.

Results seem to confirm that. Results:

¤0   0:00:01.2021 {D}: --------------------------------- minimums
¤0   0:00:01.2027 {D}:  test 4K +-? : 3494819 3497453 3603509
¤0   0:00:01.2032 {D}:  test 64 +-? :  302194  307338 2559251
¤0   0:00:01.2036 {D}:  test  4 +-? :   78268   78265  197659
¤0   0:00:01.2040 {D}:  test 64*2*3 :  674367 1161588
¤0   0:00:01.2042 {D}:  test L0     :   11589 
¤0   0:00:01.2045 {D}: --------------------------------- averages
¤0   0:00:01.2047 {D}:  test 4K +-? : 3496837 3528575 3614809
¤0   0:00:01.2049 {D}:  test 64 +-? :  308422  308939 2563171
¤0   0:00:01.2051 {D}:  test  4 +-? :   78364   78308  198548
¤0   0:00:01.2053 {D}:  test 64*2*3 :  677994 1175889
¤0   0:00:01.2055 {D}:  test L0     :   11649 
¤0   0:00:01.2057 {D}: --------------------------------- minimums
¤0   0:00:01.2059 {D}:  test 4K +-? : 213.31 213.47 219.94
¤0   0:00:01.2061 {D}:  test 64 +-? :  18.44  18.76 156.20
¤0   0:00:01.2063 {D}:  test  4 +-? :   4.78   4.78  12.06
¤0   0:00:01.2065 {D}:  test 64*2*3 :  41.16  70.90
¤0   0:00:01.2066 {D}:  test L0     :   0.71 
¤0   0:00:01.2068 {D}: --------------------------------- averages
¤0   0:00:01.2070 {D}:  test 4K +-? : 213.43 215.37 220.63
¤0   0:00:01.2071 {D}:  test 64 +-? :  18.82  18.86 156.44
¤0   0:00:01.2073 {D}:  test  4 +-? :   4.78   4.78  12.12
¤0   0:00:01.2075 {D}:  test 64*2*3 :  41.38  71.77
¤0   0:00:01.2077 {D}:  test L0     :   0.71 

Notables:
* 128 and 192 strides are not equal to random anymore! (original faulty test made them equal [apart from slight difference from TLB pressure])
* Numbers are in a much more believable, and interesting, range.
* TLB faults can not explain the difference between 128 and 192 tests - some stride prediction presumably? But why should it care about the length of stride?
* Stride of 4096 is exactly equal to random access (the +6 tick difference comes from bad TLB access pattern - the array is bigger than what fits in TLB).

Anyone able/willing to run the test on some other CPU? AMD for example?
 

 

Full cache miss (TLB included) cost - is nigh' impossible to get (that kind of things are extremely rare in normal circumstances and i have failed to confidently measure it so far).

Cache misses are hardly "rare in normal circumstances" once you start dealing with a game that has hundreds of megabytes or even gigabytes of resources and complex and fully-featured graphics/physics/AI systems.

 

Re-read what i said. As the quote tells, i am talking about full miss (*). And it IS very-very rare in normal circumstances.

For example my 4K max miss tests gets a, estimated, max of 1 (**) - (65536/4/512 = 32 (***)) full misses over 16374 reads. So, even my extreme case i get only 0.006%-2% of the reads to experience full miss. Sidenote of relevance: on some platforms TLB miss is handled by interrupts - because the misses are just so rare that the overhead does not make much of a difference. IIRC, x86 has that mode too, but i think no-one uses it.

(*) full cache miss: cache miss -> TLB miss -> multiple full cache misses to service TLB miss.
(**) +-4K tests. Cannot have more than one full miss.
(***) just a rough estimate: 64MB buffer / 4K pages / size of typical TLB cache. While it is likely possible to craft a special pattern to be worse - it is rather difficult. I think this estimate is way too big for the exactly-one-read-per-page-in-random-order case.

0

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now
Sign in to follow this  
Followers 0