

#5133413 I don't get c++11.

Posted by tanzanite7 on 21 February 2014 - 06:11 PM

I didn't realize non-capturing lambdas didn't just compile into regular functions.

It was a late addition to the standard for this conversion operator to even exist at all - so late, in fact, that the original release of VC++ 2012 did not support it. Had to use template magic to produce a regular function pointer out of it till the support was added.

#5133339 I don't get c++11.

Posted by tanzanite7 on 21 February 2014 - 12:27 PM

Lambdas can only be defined and used inside functions:

O_o, wut? A lambda is a closure object - saying that it can only be defined/used inside functions makes as much sense as saying one can use/define the value 7 ONLY inside functions. Do not be confused by the fact that the lambda object is called a "closure object" - that too is still just an object. In short, there is no such limitation.

Perhaps you meant that defining a capturing lambda outside a function is conceptually kinda totally bonkers -> a capture list requires function scope.

Another discrepancy i noticed:

A lambda object defines a few operators. The one that seems to have caused some unfortunate/confusing wording here is the conversion operator to the underlying function pointer type (defined only if the lambda does not capture anything - otherwise the operator would not make any sense). IE. a lambda object is NOT just a function, however it may be possible to get a function pointer out of one via implicit/explicit conversion:

void (*snafu)() = []()->void { /*...*/ };  // this is legal through the conversion operator
void (*fubar)() = [&]()->void { /*...*/ }; // ... this is not, as there is no applicable conversion

#5132794 Binary to Integer (Read hex binary file)

Posted by tanzanite7 on 19 February 2014 - 05:15 PM

It's a 16-bit floating point type.

Half-float does not have an 8-bit exponent. The realization that an exponent could explain the related change in the byte pairs, and the one bit crossing the byte border (due to the sign bit), was the smoking gun for me. The exponent encoding works as a fair giveaway too, but i missed it at first (64 = 128 >> 1).

Because the game only let me input integer value, so I think that it's integer !

This has actually been a very common occurrence in my savegame-mucking endeavors. Even with values where having a float makes no sense - ie. values that can logically only be whole numbers (à la unit counts etc).

#5132736 Binary to Integer (Read hex binary file)

Posted by tanzanite7 on 19 February 2014 - 01:47 PM

Your notation is extremely cryptic and you do not explain anything - i would be amazed if anyone were able to guess what you are talking about.


Anyway, just in case, this is how values are stored in your typical little-endian format:


(2 byte integer)

original value: 4660 -(hex)-> 0x1234 -(as bytes)-> 0x34, 0x12

original value: -4660 -(hex)-> 0xEDCC -(as bytes)-> 0xCC, 0xED

Note: 0xEDCC = 0x10000 - 0x1234

(4 byte integer)

original value: -123456789 -(hex)-> 0xF8A432EB -(as bytes)-> 0xEB, 0x32, 0xA4, 0xF8



If i understood your notation correctly - it looks like floating point to me. Treat it as a standard 4-byte float and see if that does the trick.

#5132330 inquisitive mind vs cache

Posted by tanzanite7 on 18 February 2014 - 08:08 AM

It has been several years since i last checked how cache behavior matches how i expect it to behave - so, decided to poke around a bit.

Note: keep in mind that the topic is extremely hardware dependent and not particularly useful for generalization. Also, a specific extreme-corner-case test (the only kind one can reasonably make) tells little about real-world programs - the test is mostly just curiosity.

A few relevant hardware metrics: Ivy, 6MB L3, 32KB L0 separate from code cache, 8GB main memory.
Os: Windows 7 64bit (relevant for TLB behavior), swap file disabled.

Things i wondered:
* What is the cost of reading main memory (inc address translation misses)?
* Random access vs consecutive?
* Consecutive forward vs backwards (and cache line vs page)?
* Consecutive cache line vs 2 cache lines?

Conclusions derived so far:
* Full cache-miss (TLB included) cost - is nigh impossible to get (that kind of thing is extremely rare in normal circumstances and i have failed to confidently measure it so far). Seems to be between 250-800 ticks (an extreme oversimplification! - for starters, the CPU and main memory are separate entities).
* Random access is equal to any ordered access if the consecutive reads skip cache lines - as i suspected.
* Random access is ~1.5-3 times slower than consecutive (the stride must be less than or equal to a cache line, or it is no better than random).
* Consecutive forward vs backward - makes no difference at all.

* Cache makes huge difference.

Now the kicker: all the results are wrong (to an unknown extent) - the damn compiler and the actual processor are just way too good at their jobs and i have not been able to stop them. ARGH! Which is why i am making this thread, besides just sharing the results.

The test program (not standalone, but whatever else it uses is fairly self explanatory):

        TIMES(15); TIMEE(15); TIMES(15); TIMEE(15); // TIMES - starts a timer, TIMEE - ends a timer, TIME - read the timer value
        GLOGD(L" timer overhead: %I64u", TIME(15));

        // working set
        #define REP     8
        #define TESTS   12
        //#define PAGES   2048
        //#define PAGES   4096
        #define PAGES   16384
        intp trash1 = intp(Mem::getChunks(MB(64)));
        intp trash2 = intp(Mem::getChunks(MB(64)));
        //intp trash3 = intp(VirtualAlloc(0, GB(1), MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        intp at = intp(Mem::getChunks(PAGES * 4096));
        int16u offs[PAGES];       // offsets to simulate random access with minimal overhead
        int64u stats[TESTS][REP];

        // init: build offset array, ensuring all values are in range and unique
        Random::MT rnd(123);
        int16u build[PAGES];
        for(int32u i=0; i<PAGES; i++) build[i] = int16u(i);
        for(int32u i=0; i<PAGES; i++) {
            int32u nr = rnd.nextIntMax(PAGES - i);
            offs[i] = build[nr];
            build[nr] = build[PAGES - 1 - i];
        }

        // macro for trashing all cache levels
        #define TRASH for(intp i=0; i<PAGES * 4096; i+=4) *(int32u*)(at + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash2 + i) += int32u(i);
                      //for(intp i=0; i<GB(1); i+=64) *(int32u*)(trash3 + i) += int32u(i);
        // macro for directional access
        #define CHECK_DIR(step) if(step > 0) for(int32u i=0;       i<PAGES; i++) sum ^= *(int32u*)(at + i * step);\
                                else         for(int32u i=PAGES-1; i<PAGES; i--) sum ^= *(int32u*)(at + i * -(step));
        // macro for random access
        #define CHECK_RND(size) for(int32u i=0; i<PAGES; i++) sum ^= *(int32u*)(at + intp(offs[i]) * size);

        // init
        int32u sum = 0;// anti optimization

        // crude attempt to get times for cold and hot single access.
        TRASH; TRASH;
        TIMES(0); sum ^= *(int32u*)at; TIMEE(0);
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        TIMES(1); sum ^= *(int32u*)at; TIMEE(1);
        GLOGD(L"  single access: %I64u %I64u", TIME(0), TIME(1));

        // tests
        for(int32u rep=0; rep<REP; rep++) {
            TRASH; TIMES(0); CHECK_DIR(  128); TIMEE(0); //
            TRASH; TIMES(10); CHECK_DIR(  192); TIMEE(10); //
            TRASH; TIMES(1); CHECK_DIR( 4096); TIMEE(1); // +4K
            TRASH; TIMES(2); CHECK_DIR(-4096); TIMEE(2); // -4K
            TRASH; TIMES(3); CHECK_RND( 4096); TIMEE(3); // ?4K
            TRASH; TIMES(4); CHECK_DIR( 64);   TIMEE(4); // +64
            TRASH; TIMES(5); CHECK_DIR(-64);   TIMEE(5); // -64
            TRASH; TIMES(6); CHECK_RND( 64);   TIMEE(6); // ?64
            TRASH; TIMES(7); CHECK_DIR( 4);    TIMEE(7); // +4
            TRASH; TIMES(8); CHECK_DIR(-4);    TIMEE(8); // +4
            TRASH; TIMES(9); CHECK_RND( 4);    TIMEE(9); // ?4
            for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i);
            TIMES(11); for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i); TIMEE(11); // L1
            // record results
            for(int32u test=0; test<TESTS; test++) stats[test][rep] = TIME(test);
        }

        // throw away outliers ... well, not really => just throw away half of the results that disagree the most with the rest
        #define CUT (REP/2)
        for(int32u test=0; test<TESTS; test++) {
            for(int32u cut=0; cut<CUT; cut++) {
                // get average at cut point
                int64u sum = 0;
                for(int32u rep=cut; rep<REP; rep++) sum += stats[test][rep];
                int64u avg = sum / (REP - cut);
                // find outlier
                int32u outlier = cut;
                int64u dist = _abs64(stats[test][outlier] - avg);
                for(int32u rep=cut+1; rep<REP; rep++) {
                    int64u distCur = _abs64(stats[test][rep] - avg);
                    if(dist < distCur) {
                        dist = distCur;
                        outlier = rep;
                    }
                }
                // swap out the outlier (ie. swap outlier with value from cut point)
                int64u tmp = stats[test][outlier];
                stats[test][outlier] = stats[test][cut];
                stats[test][cut] = tmp;
            }
        }

        // calculate averages and minimums
        int64u average[TESTS], minimum[TESTS];
        for(int32u test=0; test<TESTS; test++) {
            int64u sum = 0;
            for(int32u rep=CUT; rep<REP; rep++) sum += stats[test][rep];
            average[test] = sum / (REP - CUT);
            int64u min = stats[test][0];
            for(int32u rep=0; rep<REP; rep++) if(min > stats[test][rep]) min = stats[test][rep];
            minimum[test] = min;
        }

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", minimum[1], minimum[2], minimum[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", minimum[4], minimum[5], minimum[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", minimum[7], minimum[8], minimum[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", minimum[0], minimum[10]);
        GLOGD(L" test L0     : %7I64u %c", minimum[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", average[1], average[2], average[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", average[4], average[5], average[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", average[7], average[8], average[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", average[0], average[10]);
        GLOGD(L" test L0     : %7I64u %c", average[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(minimum[1]) / PAGES, float(minimum[2]) / PAGES, float(minimum[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(minimum[4]) / PAGES, float(minimum[5]) / PAGES, float(minimum[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(minimum[7]) / PAGES, float(minimum[8]) / PAGES, float(minimum[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(minimum[0]) / PAGES, float(minimum[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(minimum[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(average[1]) / PAGES, float(average[2]) / PAGES, float(average[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(average[4]) / PAGES, float(average[5]) / PAGES, float(average[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(average[7]) / PAGES, float(average[8]) / PAGES, float(average[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(average[0]) / PAGES, float(average[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(average[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

The actual results this gives:

¤0   0:00:00.2546 {D}:   overhead: 43  cold: 43  hot: 53 // this is fairly bonkers, as the timer code i use is not meant for micro measurements (just __rdtscp without cpuid, and it accesses memory because the VC2012 intrinsic for it is terribly implemented and not usable for this purpose)
¤0   0:00:00.3180 {D}:   single access: 1691 411 // more bonkers
¤0   0:00:03.3538 {D}: --------------------------------- minimums
¤0   0:00:03.3544 {D}:  test 4K +-? :  730986  752583  782969
¤0   0:00:03.3547 {D}:  test 64 +-? :  292295  284462  462598
¤0   0:00:03.3551 {D}:  test  4 +-? :   20399   31741   53250
¤0   0:00:03.3554 {D}:  test 64*2*3 :  444014  541493
¤0   0:00:03.3557 {D}:  test L0     :   11589
¤0   0:00:03.3559 {D}: --------------------------------- averages
¤0   0:00:03.3562 {D}:  test 4K +-? :  739322  753575  786489
¤0   0:00:03.3564 {D}:  test 64 +-? :  296640  295497  462916
¤0   0:00:03.3567 {D}:  test  4 +-? :   20560   32331   55008
¤0   0:00:03.3569 {D}:  test 64*2*3 :  447766  547238
¤0   0:00:03.3570 {D}:  test L0     :   11641
¤0   0:00:03.3572 {D}: --------------------------------- minimums
¤0   0:00:03.3574 {D}:  test 4K +-? : 44.62 45.93 47.79
¤0   0:00:03.3576 {D}:  test 64 +-? : 17.84 17.36 28.23
¤0   0:00:03.3578 {D}:  test  4 +-? :  1.25  1.94  3.25
¤0   0:00:03.3579 {D}:  test 64*2*3 : 27.10 33.05
¤0   0:00:03.3581 {D}:  test L0     :  0.71
¤0   0:00:03.3583 {D}: --------------------------------- averages
¤0   0:00:03.3584 {D}:  test 4K +-? : 45.12 45.99 48.00 // consecutive +4K, consecutive -4K, and random
¤0   0:00:03.3586 {D}:  test 64 +-? : 18.11 18.04 28.25
¤0   0:00:03.3588 {D}:  test  4 +-? :  1.25  1.97  3.36
¤0   0:00:03.3590 {D}:  test 64*2*3 : 27.33 33.40 // consecutive +128 and +192
¤0   0:00:03.3591 {D}:  test L0     :  0.71 // consecutive +1, hot cache, reading a total of 16KB - one byte at a time.

* While the single access tests are bonkers - the rest are valid ... sorta, i'll get to that.
* Reading 16384 values in a loop took ~11-12 thousand clock cycles with a hot cache (under one cycle per read).
* The worst cache hit ratio slowed it down by a factor of ~70.
* A TLB miss is just an extra memory access with Windows 7 / x86-64.

So, why are the results kinda wrong?
* the compiler partially unrolled SOME of the loops (ie. doing 4 reads per iteration).
* the processor can (AND ABSOLUTELY LOVES TO), in effect, unroll the loops too and have multiple reads in flight.
* the processor can execute multiple instructions in the same clock cycle (which is why the last result is as fast as it is).

The results are not comparable - and i do not know the margin of error. Ie. how many reads are in flight, for example, in the 4K test?

Glancing at one of those 4K tests: the loop code is 15 bytes total (5 basic instructions)! It more than fits fully in the decoder cache (and VC likes to pad NOPs in front of hot loops to make the alignment perfect for it [Ivy likes 32B]). Also, the register pressure is, expectedly, very low (in regards to renaming).

Any ideas how to stop the processor and to a lesser extent, the compiler, from being so good at their job and still get usable results?

Now, to get single access timings i dusted off my x86 Asm class (and added cpuid/rdtsc/rdtscp):

        // build test function
        int64u timings[3]; // 0 = plain overhead, 1 = cold access (+overhead), 2 = hot access (+overhead)
        Asm fn; // x64 cdecl, scratch: RAX, RCX, RDX, R8, R9, R10, R11

        fn.nop(16); // VS2012 fails to decode the first 16 bytes - NOP'ing over that area so i can actually see what i am stepping through while debugging
        fn.push(Asm::RBX);                                        //  push RBX
        fn.push(Asm::R15);                                        //  push R15
        // get plain overhead of serialised timer query
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();                                              //  rdtscp
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[0]), Asm::na, 0, Asm::R9);  //  mov [&timings[0]], R9
        // cold access
        fn.mov(Asm::R15, at);                                     //  mov R15, at
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[1]), Asm::na, 0, Asm::R9);  //  mov [&timings[1]], R9
        // warmup
        fn.mov(Asm::R9, 128);                                     //  mov R9, 128
        int32u loop = fn.getIP();                                 //loop:
        fn.alu(Asm::ADD, Asm::R8, Asm::R15, 0, Asm::na, 0);       //  add R8, [R15]
        fn.dec(Asm::R9);                                          //  dec R9
        fn.j(Asm::NZ, loop);                                      //  jnz loop
        // hot access
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[2]), Asm::na, 0, Asm::R9);  //  mov [&timings[2]], R9
        fn.pop(Asm::R15);                                         //  pop R15
        fn.pop(Asm::RBX);                                         //  pop RBX

        void *function = VirtualAlloc(0, 65536, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if(!fn.rebase(function)) GLOGF(L"Rebase failed!");
        TRASH; TRASH;
        ((void(*)())function)(); // call function
        GLOGD(L"  overhead: %I64u  cold: %I64u  hot: %I64u", timings[0], timings[1], timings[2]);

Which gives me (the numbers vary test by test, but are fairly close to each other): "overhead: 39 cold: 36 hot: 64".


Seriously!? Any idea what is wrong there?

Thought that perhaps telling Windows to flush the instruction cache (in case it somehow manages to affect things - it should not in this case) might help - but i can not for the life of me remember the function call.


Note: rdtscp, while serializing, does not prevent OoO execution of the instructions that follow it - including the cache-miss reads i want to test. Hence the cpuid, which is said to stop that. rdtsc does no serialization at all, but none is needed there, as i have to use cpuid anyway. Evidently, something is still wrong.

#5122204 runtime cost of stack overflow checking

Posted by tanzanite7 on 08 January 2014 - 12:12 PM

Dealing with the stack is the responsibility of whoever uses it, by the rules of whoever manages it - the OS mostly stays out of the way. In regards to Windows, that means: you can override every aspect of the stack behavior completely - just not as efficiently as the OS could.


Slight corrections to what has been said already:

By default on Windows, a typical stack uses a guard page to trigger stack expansion, with the additional requirement that if a function uses more than one page worth of stack, it must call a special function to ensure the guard page is not skipped (this call is added automatically by the compiler as part of the calling convention / exception handling - in the function prolog: "__alloca_probe").


So, besides the overhead of that function call when more than a page is needed - there is no overhead, as no checks are done.


Stack underflow is not guarded (just not worth it). IF you are lucky, the memory there is inaccessible and hence causes an immediate access-violation crash.



Something i use to check the state of address space (VC++):

void dump() {
    // dump memory map
    MEMORY_BASIC_INFORMATION info; // was missing - VirtualQuery fills this in
    intp at = 0;
    while(true) {
        if(VirtualQuery((void*)at, &info, sizeof(info)) != sizeof(info)) break;
        LOG(L"%c %p - %p %c%c%c%c%c%c ( %7d KB  %s ) %s"
            , info.AllocationBase == info.BaseAddress || info.AllocationBase == 0 ? L'>' : L' '
            , intp(info.BaseAddress)
            , intp(info.BaseAddress) + intp(info.RegionSize) - 1
            , info.Protect == 0 ? L'?' : (info.Protect & 0x10 ? L'!' : L' ')
            , info.Protect & 0xF0 ? L'x' : L'-'
            , info.Protect & 0x66 ? L'r' : L'-'
            , info.Protect & 0xCC ? L'w' : L'-'
            , info.Protect & 0x88 ? L'c' : L'-'
            , info.Protect & 0x100 ? L'G' : L'-'
            , int32u(info.RegionSize >> 10)
            , intp(info.BaseAddress) & 65535 ? L"a:???" : L"a:64K"
            , info.State == MEM_FREE ? L"¤" : (info.State == MEM_COMMIT ? L"USED" : L"RESERVED"));
        at += info.RegionSize;
    }
}
// some quickly made dirty logger of dubious quality
void logging(wchar_t const *format, ...) {
    va_list argptr; va_start(argptr, format);
    wchar_t buf[1024];
    int wrote = vswprintf(&buf[0], 1024, format, argptr);
    va_end(argptr);
    if(wrote < 0) wrote = 0;
    if(wrote) OutputDebugStringW(buf);
}
#define LOG(...) logging(L"\n" __VA_ARGS__);
// intp = unsigned integer fit for pointer (ie. uintptr_t), int32u = unsigned 32 bit integer

which, in my case, gives (some excerpts):

> 00000000 - 0000FFFF  ----- (      64 KB  a:64K ) ¤ // a page granularity worth of a general no-go area
> 00020000 - 0002EFFF  -rw-- (      60 KB  a:64K ) USED // a guarded cyclic buffer of mine - just something the program doing the dump() apparently uses
  0002F000 - 0002FFFF  -r--G (       4 KB  a:??? ) USED
> 00090000 - 00090FFF  -r--- (       4 KB  a:64K ) USED // my program (non fixed image address)
  00091000 - 000A0FFF  xrw-- (      64 KB  a:??? ) USED // ... "x" AND "w" !? ... wut!? What is this!?
  000A1000 - 000B5FFF  xr--- (      84 KB  a:??? ) USED // ... program code
  000B6000 - 000B9FFF  -r--- (      16 KB  a:??? ) USED // ... data
  000BA000 - 00168FFF  -rw-- (     700 KB  a:??? ) USED
  00169000 - 0016BFFF  -r--- (      12 KB  a:??? ) USED
> 00250000 - 00288FFF ?----- (     228 KB  a:64K ) RESERVED // some stack (should not be mine, directly - dunno)
  00289000 - 0028BFFF  -rw-G (      12 KB  a:??? ) USED
  0028C000 - 0028FFFF  -rw-- (      16 KB  a:??? ) USED
> 00290000 - 0031FFFF  ----- (     576 KB  a:64K ) ¤ // memory following stack is not reserved and is free for grab
> 00440000 - 0053BFFF ?----- (    1008 KB  a:64K ) RESERVED // my stack (a 1MB stack is the default setting in VC++)
  0053C000 - 0053CFFF  -rw-G (       4 KB  a:??? ) USED
  0053D000 - 0053FFFF  -rw-- (      12 KB  a:??? ) USED
> 00540000 - 005CFFFF  ----- (     576 KB  a:64K ) ¤ // again - free for grab
> 005D0000 - 005D7FFF  -rw-- (      32 KB  a:64K ) USED // heap perhaps?
  005D8000 - 006CFFFF ?----- (     992 KB  a:??? ) RESERVED
> 007D0000 - 207CFFFF  -rw-- (  524288 KB  a:64K ) USED // half a GB used for my own - non heap
> 207D0000 - 607CFFFF  ----- ( 1048576 KB  a:64K ) ¤ // first big chunk of unused address space
> 607D0000 - 607D0FFF  -r--- (       4 KB  a:64K ) USED // dll/system land begins
  607D1000 - 60958FFF  xr--- (    1568 KB  a:??? ) USED
  60959000 - 6095EFFF  -rw-- (      24 KB  a:??? ) USED
  6095F000 - 60970FFF  -r--- (      72 KB  a:??? ) USED
> 60971000 - 7533FFFF  ----- (  337724 KB  a:??? ) ¤ // a stray block of free address space cut short by some dll/system shenanigans
  dll/system heaven
> 7FFF0000 - FFFAFFFF  ----- ( 2096896 KB  a:64K ) ¤ // some more address space - the benefits of "large address aware" on 64bit windows.


"but i heard something that now pages are bigger than 4k ?"


Bigger pages have long been possible on the x86 line (4MB pages via PSE since the Pentium era) - just that they are rarely ever used on desktop machines.


Note also: on some (non-x86) hardware, 4KB pages are not even an option - for example, 8KB may be the smallest possible.

#5119600 Tiling textures using mod() in shader causes stripes on textures

Posted by tanzanite7 on 27 December 2013 - 07:43 PM

You might have bad mipmap selection as mentioned, however that would generate only 2x2 spots of garbage - that clearly does not match the picture.


Would guess it is the half-texel issue (depending on whether Direct3D or OpenGL is used - (0,0) is either the corner [OGL] or the middle [D3D? - or have they changed that?] of a texel) - however, the error seems to cover more texels than one would expect in such a case (might be just my eyes - the error should be about 1, rarely 2, texels worth).


My guess is that your texture coordinates are just wrong. Take pencil and paper and check that the final coordinates are exactly what they should be.

#5119410 Forcing Code To Work !

Posted by tanzanite7 on 26 December 2013 - 07:41 PM

The browser doesn't let you type those extra characters, but you can either post from a separate file/page or use Javascript to alter the form settings.

In my case, all i have to do is pick "Forms" -> "Remove Maximum Lengths" from the always visible web developer toolbar.

Reminds me of a finished project i was handed down for greenlighting into active use some years ago, when its original dev left. Instead of simply giving a go/no-go, i opted to construct a string ready for copy-paste and just showed them:
* two mouse clicks to remove the form field limitations
* copy-paste the crafted username into login form
* press enter
=> logged into the application as superuser. As an additional bonus - the login crashed the logging subsystem, without invalidating the login itself -> leaving no traces.

Not blaming the original dev too much tho - he was not qualified to do the job and everyone involved knew it from the outset, they just thought "how hard can it be?".

#5118314 How to write fast code without optimizing

Posted by tanzanite7 on 20 December 2013 - 12:54 AM

Edit: Seriously, why the anonymous downvote...?

Beats me.

Never liked the particular reputation system that gamedev employs, and i like the misuse of it even less - so, upvoted for balance.

#5070571 OpenGL Auxiliary

Posted by tanzanite7 on 17 June 2013 - 03:00 PM

There is no way to explicitly disallow them - but there would be no point in doing so anyway, even if the option existed.


Aux buffers are created only when "used" (ie. referenced etc) (*) - gdebugger "uses" them (to show that there is nothing going on). In short: they are not created when you do not run through gdebugger and do not use them in your code.


(*) technically, it is an implementation detail - but you can safely assume every OGL driver that ever existed works that way.

#5062636 GLSL fragment shader, prevent depth writing

Posted by tanzanite7 on 17 May 2013 - 01:03 PM

Early depth testing is very efficient (nothing even remotely as efficient can be done in a shader) - mucking with the depth in the fragment shader will turn the early depth test off (because there is no depth available early to do anything with ... early).


You can not disable depth writes in the fragment shader - you can only discard the fragment completely (the "discard" you mentioned), which omits the depth write also.


Newer OGL versions might provide something that could be of use tho (but i doubt it) - i rarely use anything above OGL3.3.

#5058037 Handling other languages and font.

Posted by tanzanite7 on 30 April 2013 - 05:52 AM

To get minimally passable support you need to cover at least every code point between 0 and 65535 (those that have anything valid defined). To get reasonably good support one should also handle combining characters and code points above 65535.

Generating a texture up front to cover all of that is impossible. You have to generate and prune glyph textures on the fly as needed, and possibly combine them yourself on the CPU side, as the GPU tends to be very inefficient/limiting for hacking text together.

#5053101 "this" keyword question

Posted by tanzanite7 on 14 April 2013 - 04:51 AM

For me it is not really a question of typing more - it is readability. Adding more junk to the code that ultimately does nothing can not be a good thing IMO.

Also, i am strongly against code conventions that arise solely out of tool/user limitations: tools change, but the code is what one has to live with. The delay in invoking IntelliSense is a good example of that (Systems Hungarian notation being the canonical example of shit hitting the fan rather fast). The IDE is there to help you - not to own you and your code.

PS. Why not set the intellisense delay to 0 (as i have done)? It needs a bit of getting used to, but well worth it.

edit: One more thing - my current IDE (VS2012) has quite good coloring options. I know that a variable has an implicit "this" from the variable color alone. No need to notice whether it has "this->" written in front of it, nor whether it starts with "m_" ... i do not have to read anything and i already know!

PS2. Of course, consistency is paramount and i am not saying you should start converting all your code - which is bound to be a massive undertaking and hard to justify. I am just saying that, IMMHO, what you do is bad practice and should be avoided if possible.

#5049734 D3DX Memory leaks

Posted by tanzanite7 on 03 April 2013 - 04:27 PM

#define SAFE_RELEASE(x)           if((x) != NULL){(x)->Release();(x) = NULL;}
#define SAFE_DELETE_ARRAY(x)   if((x) != NULL){delete[] (x);(x) = NULL;}
#define SAFE_DELETE_OBJECT(x)  if((x) != NULL){delete (x);(x) = NULL;}
I don't understand why anybody use them anymore, /.../

Because it is a horrible practice. Just try to forget MS put that crap in its SDK - never recommend it to anyone!

Using macros is fine provided they are sane - those are not, because:
1. macros should be avoided when there is no darn good reason to use them - these just pointlessly obscure what they do.
2. all macros should properly handle parameters with side effects - these evaluate their parameter multiple times.
3. old VC ignored the standard and needed a check for null in case of delete - that cruft is now useless, as deleting a null pointer is a well-defined no-op.

Don't use it. And sure as hell do not recommend anyone else to use it.

#5020428 Is it such possible to create fast games without using C/C++ ?

Posted by tanzanite7 on 11 January 2013 - 01:42 PM

Yes, you can write a game faster without c++. ;)

Did not notice anyone mentioning it, and the rhyming with the last post just begged for it. In short, getting stuff done in a reasonable time-frame is worth considering. The extra time might also allow you to try other - better - algorithms, which is where most of the performance improvements lie.