

tanzanite7

Member Since 20 Nov 2005
Offline Last Active Nov 20 2014 01:47 PM

#5139373 Needing some quick tips on how to organize my code

Posted by tanzanite7 on 15 March 2014 - 09:38 PM

@dejaime, well... I'm extremely (really) picky when it comes to colors. I can't stand writing code in a white background, and I like to just get my hands dirty when I'm learning something. All the IDEs I tried (well, CB, VC and DC, don't know any others), kind of got in the way of it. There's always something that needs to be set or some intricacy that needs to be understood (i.e., project templates, MS's main() arguments), or otherwise something that doesn't work for very specific reasons. I've been away from C++ for years because of this. Also, they clutter my hard drive with project folders (VS is particularly unorganized, it mixes projects from all apps in one folder by default) when all I want at this point is a source file to experiment with. When learning the basics, I need a basic setup to get right down to it and keep me focused and without obstacles.

I would recommend that you reconsider - a proper IDE (which Sublime Text, at least from my cursory examination, does not seem to be) is an invaluable aid, especially if you are relatively new. Even more so when you are not new and your projects grow beyond the trivial.

I would recommend Visual Studio 2013 (the Express edition for desktop, ie. free). Its coloring scheme is highly customizable - it even comes with a "dark" theme as a preset option (or as a starting point for your own customizations).

Syntax coloring options include separation of: global/local/member/static-member variables, namespaces/classes/enums/types, static/non-static member functions, macros etc...

IntelliSense can also automatically pick up most errors and mark them with red squiggles without the need to compile, and its hover tooltips, as you will see below, are quite informative.

An example of my, slightly altered from default, coloring (i prefer white - used to prefer dark when i first started out ~20y ago):

http://postimg.org/image/r0ckj03i7/

edit: Uh, what, why the downvote? That makes no sense.


#5138119 Your one stop shop for the cause of all coding horrors

Posted by tanzanite7 on 11 March 2014 - 09:22 AM

I enjoyed this - thanks for sharing.
 
Also:
// somedev1 -  6/7/02 Adding temporary tracking of Login screen
// somedev2 -  5/22/07 Temporary my ass
The history of software development in a nutshell.


#5135912 Large textures are really slow...

Posted by tanzanite7 on 02 March 2014 - 03:25 PM

Reread OP:

 

"One of my main game features (visually) is the overlay texture I'm using.  This overlay texture is full screen, and covers the entire screen which gives it a nice transparent pattern."

 

And the post following it:
 

"http://shogun3d.net/images2/looptil/04.jpg"

 

Depth test is absolutely useless there - as i mentioned in my original post. I take it you did not read any of it.




#5134284 Changing graphics settings at real time

Posted by tanzanite7 on 24 February 2014 - 06:59 PM

I read and understand your arguments. Saying "it's not meant to be" seems awfully closed-minded, but it's the truth. The graphics API's really didn't want it to be. Take OpenGL for instance:
 

GLFWwindow* window;
window = glfwCreateWindow( 1024, 768, "MyWindow", NULL, NULL);
glfwMakeContextCurrent(window);
You'd have to re-create the OpenGL-context to change the resolution. The same goes for D3D, you lose the device when you change settings.
 
I guess what I'm saying is, I'm just amazed that this hasn't been solved yet.

GLFW is not OGL, and window management and whatnot have nothing to do with OGL either (WGL handles that side).
 
Short story:
* Each window has a device context => encapsulated stuff for GDI
* We want a sensible pixelformat (RGBA8, doublebuffered) for it that has hardware acceleration.
* We get a rendering context for it.
* Windows/GDI is there to composite all that stuff on screen.
 
At no point does anything we care about care what the screen resolution is (*)(**).
Just change the resolution and reposition/resize your window, and add/remove borders for windowed mode if the user wanted that too. No need to recreate the OpenGL context (why would you? Do you want the Windows software GL 1.1 implementation instead?).

 
(*) Windows/GDI will have a bit more work to do internally when your framebuffer is RGBA8 but the screen is in some silly format (like 256-palette). But that is not our concern (it slightly mattered in the ancient days, when using an equally silly framebuffer was a reasonable compromise option).
(**) The OpenGL API specification does not mention screen resolution changes => nothing is allowed to be lost because of a screen resolution change. This fact is even mentioned by some specs, ex: http://www.opengl.org/registry/specs/ARB/vertex_buffer_object.txt "Do buffer objects survive screen resolution changes, etc.?". However, WGL is not OGL, so for example p-buffers (***), if you happen to use them, might be lost.
(***) do not confuse with pixelbuffer, p-buffer -> https://www.opengl.org/wiki/P-buffer

edit: Oh, forgot to comment the last line: it has never been a problem to begin with ... at the OGL side of the fence at least. At D3D side, afaik, things were considerably less rosy.


#5134057 GLSL and GLU Teapot

Posted by tanzanite7 on 24 February 2014 - 03:40 AM

It used to be that Mac only supported OGL 3.2 (there was a fakery way to make the OS believe it has 3.3, not sure whether there has been any official support added) - which iirc does not have attribute layout location in core. The extension of it might be available though, as TheCubu noted.

Glass_Knife, i did not notice you saying how the "Not working" manifests itself. What is the error message given? Are you sure you have OGL 3.3 available in the first place?


#5133851 deinstancing c-strings

Posted by tanzanite7 on 23 February 2014 - 07:59 AM

 

If you restrict yourself to 4 characters, you can do this:

 foo(x,y, 'red' );
 foo(x,y, 'grn' );
 foo(x,y, 'quik' );
That's interesting, I didn't know about these "multi-character literals". Even though they're implementation defined, I wonder if there's some agreeance between compilers as to how they should work? Maybe it just depends if you're compiling for big or little endian...

 

I used those a lot a-bloody-long-time-ago, but abandoned them when more than one compiler entered my life and one of them pointed out that i was sitting in a minefield. Anyway, IIRC, it was fairly consistent on little endian (with a triple-facepalm-worthy, highly objectionable, nonsensical ordering!), if you can stomach the bazillion warning messages it produces - i could not.

It is a dead feature for me (even though 'foo' == 'foo' is always true and even 'foo' != 'bar' was consistently true - even though it does not have to be).

 

edit: actually, i think Borland used to default to the sensible ordering (ie. it changed the default at some point, which was the source of the warnings i remember - i think), but had a compatibility switch to reverse the order - so, that would make it clear that there really is no agreement. Microsoft has the backwards order and no switch, iirc.

 

edit2: http://www.borlandtalk.com/is-there-a-standard-for-multi-char-constants-vt104092.html seems to confirm my vague memory. It changed between Borland's C Builder and Developer Studio. Ie. not only is it inconsistent on little-endian platforms - it is inconsistent between compilers from the same manufacturer and also dependent on compiler settings.

 

That said, it seems Microsoft's ordering has prevailed and one could say that it is now more consistent than ever before :D, whatever da-f* that is worth :/.
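
If all that is wanted is a cheap integer tag, the byte order can be spelled out instead of relying on the implementation-defined value of a multi-character literal. A minimal sketch, assuming C++14 or later and an ASCII-compatible character set; the name fourcc and the "first character in the highest used byte" ordering are choices made for illustration only, nothing mandates them:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: pack 1 to 4 characters into a 32-bit tag with an
// explicitly chosen order (first character ends up in the highest used byte).
// Unlike 'quik', the resulting value does not depend on the compiler.
template <std::size_t N>
constexpr std::uint32_t fourcc(const char (&text)[N]) {
    static_assert(N >= 2 && N <= 5, "expected 1 to 4 characters");
    std::uint32_t value = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {  // N includes the trailing '\0'
        value = (value << 8) | static_cast<unsigned char>(text[i]);
    }
    return value;
}

// Evaluated at compile time; the hex values are the ASCII codes of the letters.
static_assert(fourcc("red") == 0x00726564u, "'r' 'e' 'd'");
static_assert(fourcc("quik") == 0x7175696Bu, "'q' 'u' 'i' 'k'");
static_assert(fourcc("red") != fourcc("grn"), "distinct tags differ");
```

A call would then look like foo(x, y, fourcc("red")), and since the function is constexpr the tags should also be usable as case labels.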




#5133413 I don't get c++11.

Posted by tanzanite7 on 21 February 2014 - 06:11 PM

I didn't realize non-capturing lambdas didn't just compile into regular functions.

It was a late addition to the standard for this conversion operator to exist at all - so late, in fact, that the original VC++ 2012 release did not support it. One had to use template magic to produce a regular function pointer out of it until the support was added.




#5133339 I don't get c++11.

Posted by tanzanite7 on 21 February 2014 - 12:27 PM

Lambdas can only be defined and used inside functions:

O_o, wut? Lambda is a closure object - saying that it can only be defined / used inside functions makes as much sense as saying one can use/define the value 7 ONLY inside functions. Do not get confused by the fact that the lambda object is called a "closure object" - that too is still just an object. In short, there is no such limitation.

Perhaps you meant that defining a capturing Lambda outside functions is conceptually kinda totally bonkers -> capture list requires function scope.


Another discrepancy i noticed:

Lambda object defines a few operators. One that seems to have caused some unfortunate/confusing wording here is the type conversion operator to the underlying function pointer type (only defined if the lambda does not capture anything - as otherwise the operator would not make any sense). IE. a lambda object is NOT just a function, however it might be possible to get a function pointer out of one via implicit/explicit conversion:

void (*snafu)() = [ ]()->void { ... }; // this is legal through conversion operator
void (*snafu)() = [&]()->void { ... }; // ... this is not as there is no applicable conversion.
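
The same distinction as a small compilable sketch (the names Callback, make_doubler, make_adder and call_through_pointer are made up for illustration; std::function is just one possible way to hold a capturing lambda, not the only one):

```cpp
#include <functional>

using Callback = int (*)(int);  // plain function pointer type

// A non-capturing lambda converts implicitly to a function pointer.
inline Callback make_doubler() {
    return [](int x) { return x * 2; };
}

// A capturing lambda has no such conversion; it has to be stored in something
// that can hold the closure object itself, for example std::function.
inline std::function<int(int)> make_adder(int amount) {
    return [amount](int x) { return x + amount; };
}

// The conversion can also be requested explicitly with a cast.
inline int call_through_pointer() {
    auto fp = static_cast<int (*)(int)>([](int x) { return x + 1; });
    return fp(41);
}
```

Changing make_doubler to capture anything (for example [&]) should make it fail to compile, which is the second snafu line above.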


#5132794 Binary to Integer (Read hex binary file)

Posted by tanzanite7 on 19 February 2014 - 05:15 PM

It's a 16-bit floating point type.

Half-float does not have an 8-bit exponent :P. The realization that an exponent could explain the related change in the byte pairs and the one bit crossing the byte border (due to the sign bit) was the smoking gun for me. The exponent encoding works as a fair giveaway too, but i missed it at first (64 = 128 >> 1).
 

Because the game only let me input integer value, so I think that it's integer !

This has actually been a very common occurrence in my savegame-mucking endeavors. Even with values where having a float makes no sense - ie. the value can logically only be a whole number (a la unit count etc).




#5132736 Binary to Integer (Read hex binary file)

Posted by tanzanite7 on 19 February 2014 - 01:47 PM

Your notation is extremely cryptic and you do not explain anything - i would be amazed if anyone would be able to guess what you are talking about.

 

Anyway, just in case, this is how values are stored in the typical little-endian format:

 

(2 byte integer)

original value: 4660 -(hex)-> 0x1234 -(as bytes)-> 0x34, 0x12

original value: -4660 -(hex)-> 0xEDCC -(as bytes)-> 0xCC, 0xED

Note: 0xEDCC = 0x10000 - 0x1234

(4 byte integer)

original value: -123456789 -(hex)-> 0xF8A432EB -(as bytes)-> 0xEB, 0x32, 0xA4, 0xF8

 

edit:

If i understood your notation - looks like floating point to me. Treat it as a standard 4 byte float - see if that does the trick.
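
A minimal sketch of reading such values back from raw file bytes, independent of the byte order of the machine doing the reading. The function names are made up for illustration, and the float variant assumes the usual 32-bit IEEE-754 float:

```cpp
#include <cstdint>
#include <cstring>

// Assemble little-endian bytes into unsigned integers.
inline std::uint16_t read_u16_le(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint32_t read_u32_le(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Signed variants: reinterpret the same bits (two's complement).
inline std::int16_t read_i16_le(const unsigned char* p) {
    const std::uint16_t bits = read_u16_le(p);
    std::int16_t value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

inline std::int32_t read_i32_le(const unsigned char* p) {
    const std::uint32_t bits = read_u32_le(p);
    std::int32_t value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Same 4 bytes treated as a float instead of an integer.
inline float read_f32_le(const unsigned char* p) {
    static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
    const std::uint32_t bits = read_u32_le(p);
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Feeding it the byte sequences listed above (0x34, 0x12 and so on) should give back the original values; if a 4-byte field only makes sense through read_f32_le, it was a float all along.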




#5132330 inquisitive mind vs cache

Posted by tanzanite7 on 18 February 2014 - 08:08 AM

It has been several years since i last checked how cache behavior matches how i expect it to behave - so, i decided to poke around a bit.

Note: keep in mind that the topic is extremely hardware dependent and not particularly useful for generalization. Also, a specific extremely-corner-case test, the only kind one can reasonably make, tells little about real world programs - the test is mostly just curiosity.

A few relevant hardware metrics: Ivy, 6MB L3, 32KB L1 data cache (labelled "L0" in the test output below) separate from code cache, 8GB main memory.
OS: Windows 7 64bit (relevant for TLB behavior), swap file disabled.

Things i wondered:
* What is the cost of reading main memory (inc address translation misses)?
* Random access vs consecutive?
* Consecutive forward vs backwards (and cache line vs page)?
* Consecutive cache line vs 2 cache lines?

Conclusions derived so far:
* Full cache miss (TLB included) cost - is nigh impossible to get (that kind of thing is extremely rare in normal circumstances and i have failed to confidently measure it so far). Seems to be between 250-800 ticks (extreme oversimplification! - for starters, CPU and main memory are separate entities).
* Random access is equal to any ordered access if the consecutive reads are skipping cache lines. :D, as i suspected.
* Random access is ~1.5-3 times slower vs consecutive (consecutive reads must be no further apart than a cache line to not be equal to random).
* Consecutive forward vs backward - makes no difference at all.

* Cache makes huge difference.
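
For anyone wanting to poke at the same questions without my helper classes, here is a portable, much cruder sketch of the kind of comparison behind these conclusions. It is not the test program used for the numbers in this post: the names make_order, touch and timed_touch are made up, it uses std::chrono instead of rdtsc, it does not trash the caches between runs, and reading the order array adds its own memory traffic.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Indices of `count` elements spaced `stride` elements apart, visited either
// front to back or in a shuffled order.
inline std::vector<std::size_t> make_order(std::size_t count, std::size_t stride, bool shuffled) {
    std::vector<std::size_t> order(count);
    for (std::size_t i = 0; i < count; ++i) order[i] = i * stride;
    if (shuffled) {
        std::mt19937 rng(123);  // fixed seed keeps runs comparable
        std::shuffle(order.begin(), order.end(), rng);
    }
    return order;
}

// Read data[] in the given order; XOR-ing into the result keeps the compiler
// from throwing the reads away.
inline std::uint32_t touch(const std::vector<std::uint32_t>& data,
                           const std::vector<std::size_t>& order) {
    std::uint32_t sum = 0;
    for (std::size_t index : order) sum ^= data[index];
    return sum;
}

struct Timed {
    std::uint32_t sum;
    long long nanoseconds;
};

// Time a single pass over the data in the given order.
inline Timed timed_touch(const std::vector<std::uint32_t>& data,
                         const std::vector<std::size_t>& order) {
    const auto start = std::chrono::steady_clock::now();
    const std::uint32_t sum = touch(data, order);
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return Timed{sum, static_cast<long long>(elapsed.count())};
}
```

With 4-byte elements, a stride of 16 puts every read on its own 64-byte cache line (assuming 64-byte lines) and a stride of 1024 puts every read on its own 4K page; comparing timed_touch for the shuffled and unshuffled orders at each stride gives a rough counterpart of the table further down.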

Now the kicker, all the results are wrong (to an unknown extent) - the damn compiler and the actual processor are just way too good at their job and i have not been able to stop them. ARGH! Which is why i am making this thread, besides just sharing the results.

The test program (not standalone, but whatever else it uses is fairly self explanatory):

        TIMES(15); TIMEE(15); TIMES(15); TIMEE(15); // TIMES - starts a timer, TIMEE - ends a timer, TIME - read the timer value
        GLOGD(L" timer overhead: %I64u", TIME(15));

        // working set
        #define REP     8
        #define TESTS   12
        //#define PAGES   2048
        //#define PAGES   4096
        #define PAGES   16384
        intp trash1 = intp(Mem::getChunks(MB(64)));
        intp trash2 = intp(Mem::getChunks(MB(64)));
        //intp trash3 = intp(VirtualAlloc(0, GB(1), MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        intp at = intp(Mem::getChunks(PAGES * 4096));
        int16u offs[PAGES];       // offsets to simulate random access with minimal overhead
        int64u stats[TESTS][REP];

        // init: build offset array, ensuring all values are in range and unique
        Random::MT rnd(123);
        int16u build[PAGES];
        for(int32u i=0; i<PAGES; i++) build[i] = int16u(i);
        for(int32u i=0; i<PAGES; i++) {
            int32u nr = rnd.nextIntMax(PAGES - i);
            offs[i] = build[nr];
            build[nr] = build[PAGES - 1 - i];
        }

        // macro for trashing all cache levels
        #define TRASH for(intp i=0; i<PAGES * 4096; i+=4) *(int32u*)(at + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash1 + i) += int32u(i);\
                      for(intp i=0; i<MB(64); i+=4) *(int32u*)(trash2 + i) += int32u(i);
                      //for(intp i=0; i<GB(1); i+=64) *(int32u*)(trash3 + i) += int32u(i);
        // macro for directional access
        #define CHECK_DIR(step) if(step > 0) for(int32u i=0;       i<PAGES; i++) sum ^= *(int32u*)(at + i * step);\
                                else         for(int32u i=PAGES-1; i<PAGES; i--) sum ^= *(int32u*)(at + i * -(step));
        // macro for random access
        #define CHECK_RND(size) for(int32u i=0; i<PAGES; i++) sum ^= *(int32u*)(at + intp(offs[i]) * size);

        // init
        TRASH;
        int32u sum = 0;// anti optimization

        // crude attempt to get times for cold and hot single access.
        TRASH; TRASH;
        TIMES(0); sum ^= *(int32u*)at; TIMEE(0);
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        sum ^= *(int32u volatile *)at;
        TIMES(1); sum ^= *(int32u*)at; TIMEE(1);
        GLOGD(L"  single access: %I64u %I64u", TIME(0), TIME(1));

        // tests
        for(int32u rep=0; rep<REP; rep++) {
            TRASH;
            TRASH; TIMES(0); CHECK_DIR(  128); TIMEE(0); //
            TRASH; TIMES(10); CHECK_DIR(  192); TIMEE(10); //
            TRASH; TIMES(1); CHECK_DIR( 4096); TIMEE(1); // +4K
            TRASH; TIMES(2); CHECK_DIR(-4096); TIMEE(2); // -4K
            TRASH; TIMES(3); CHECK_RND( 4096); TIMEE(3); // ?4K
            TRASH; TIMES(4); CHECK_DIR( 64);   TIMEE(4); // +64
            TRASH; TIMES(5); CHECK_DIR(-64);   TIMEE(5); // -64
            TRASH; TIMES(6); CHECK_RND( 64);   TIMEE(6); // ?64
            TRASH; TIMES(7); CHECK_DIR( 4);    TIMEE(7); // +4
            TRASH; TIMES(8); CHECK_DIR(-4);    TIMEE(8); // -4
            TRASH; TIMES(9); CHECK_RND( 4);    TIMEE(9); // ?4
            for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i);
            TIMES(11); for(int32u i=0; i<PAGES; i++) sum ^= *(byte*)(at + i); TIMEE(11); // L1
            // record results
            for(int32u test=0; test<TESTS; test++) stats[test][rep] = TIME(test);
        }

        // throw away outliers ... well, not really => just throw away half of the results that disagree the most with the rest
        #define CUT (REP/2)
        for(int32u test=0; test<TESTS; test++) {
            for(int32u cut=0; cut<CUT; cut++) {
                // get average at cut point
                int64u sum = 0;
                for(int32u rep=cut; rep<REP; rep++) sum += stats[test][rep];
                int64u avg = sum / (REP - cut);
                // find outlier
                int32u outlier = cut;
                int64u dist = _abs64(stats[test][outlier] - avg);
                for(int32u rep=cut+1; rep<REP; rep++) {
                    int64u distCur = _abs64(stats[test][rep] - avg);
                    if(dist < distCur) {
                        dist = distCur;
                        outlier = rep;
                    }
                }
                // swap out the outlier (ie. swap outlier with value from cut point)
                int64u tmp = stats[test][outlier];
                stats[test][outlier] = stats[test][cut];
                stats[test][cut] = tmp;
            }
        }

        // calculate averages and minimums
        int64u average[TESTS], minimum[TESTS];
        for(int32u test=0; test<TESTS; test++) {
            int64u sum = 0;
            for(int32u rep=CUT; rep<REP; rep++) sum += stats[test][rep];
            average[test] = sum / (REP - CUT);
            int64u min = stats[test][0];
            for(int32u rep=0; rep<REP; rep++) if(min > stats[test][rep]) min = stats[test][rep];
            minimum[test] = min;
        }

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", minimum[1], minimum[2], minimum[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", minimum[4], minimum[5], minimum[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", minimum[7], minimum[8], minimum[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", minimum[0], minimum[10]);
        GLOGD(L" test L0     : %7I64u %c", minimum[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %7I64u %7I64u %7I64u", average[1], average[2], average[3]);
        GLOGD(L" test 64 +-? : %7I64u %7I64u %7I64u", average[4], average[5], average[6]);
        GLOGD(L" test  4 +-? : %7I64u %7I64u %7I64u", average[7], average[8], average[9]);
        GLOGD(L" test 64*2*3 : %7I64u %7I64u", average[0], average[10]);
        GLOGD(L" test L0     : %7I64u %c", average[11], sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit minimums
        GLOGD(L"--------------------------------- minimums");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(minimum[1]) / PAGES, float(minimum[2]) / PAGES, float(minimum[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(minimum[4]) / PAGES, float(minimum[5]) / PAGES, float(minimum[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(minimum[7]) / PAGES, float(minimum[8]) / PAGES, float(minimum[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(minimum[0]) / PAGES, float(minimum[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(minimum[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

        // vomit averages
        GLOGD(L"--------------------------------- averages");
        GLOGD(L" test 4K +-? : %5.2f %5.2f %5.2f", float(average[1]) / PAGES, float(average[2]) / PAGES, float(average[3]) / PAGES);
        GLOGD(L" test 64 +-? : %5.2f %5.2f %5.2f", float(average[4]) / PAGES, float(average[5]) / PAGES, float(average[6]) / PAGES);
        GLOGD(L" test  4 +-? : %5.2f %5.2f %5.2f", float(average[7]) / PAGES, float(average[8]) / PAGES, float(average[9]) / PAGES);
        GLOGD(L" test 64*2*3 : %5.2f %5.2f", float(average[0]) / PAGES, float(average[10]) / PAGES);
        GLOGD(L" test L0     : %5.2f %c", float(average[11]) / PAGES, sum ? L' ' : L'\0'); // my L0 is 32K

The actual results this gives:

¤0   0:00:00.2546 {D}:   overhead: 43  cold: 43  hot: 53 // this is fairly bonkers as the timer code i use is not meant for micro measures (just __rdtscp without cpuid and accesses memory because VC2012 intrinsic for it is terribly implemented and not usable for this purpose)
¤0   0:00:00.3180 {D}:   single access: 1691 411 // more bonkers
¤0   0:00:03.3538 {D}: --------------------------------- minimums
¤0   0:00:03.3544 {D}:  test 4K +-? :  730986  752583  782969
¤0   0:00:03.3547 {D}:  test 64 +-? :  292295  284462  462598
¤0   0:00:03.3551 {D}:  test  4 +-? :   20399   31741   53250
¤0   0:00:03.3554 {D}:  test 64*2*3 :  444014  541493
¤0   0:00:03.3557 {D}:  test L0     :   11589
¤0   0:00:03.3559 {D}: --------------------------------- averages
¤0   0:00:03.3562 {D}:  test 4K +-? :  739322  753575  786489
¤0   0:00:03.3564 {D}:  test 64 +-? :  296640  295497  462916
¤0   0:00:03.3567 {D}:  test  4 +-? :   20560   32331   55008
¤0   0:00:03.3569 {D}:  test 64*2*3 :  447766  547238
¤0   0:00:03.3570 {D}:  test L0     :   11641
¤0   0:00:03.3572 {D}: --------------------------------- minimums
¤0   0:00:03.3574 {D}:  test 4K +-? : 44.62 45.93 47.79
¤0   0:00:03.3576 {D}:  test 64 +-? : 17.84 17.36 28.23
¤0   0:00:03.3578 {D}:  test  4 +-? :  1.25  1.94  3.25
¤0   0:00:03.3579 {D}:  test 64*2*3 : 27.10 33.05
¤0   0:00:03.3581 {D}:  test L0     :  0.71
¤0   0:00:03.3583 {D}: --------------------------------- averages
¤0   0:00:03.3584 {D}:  test 4K +-? : 45.12 45.99 48.00 // consecutive +4K, consecutive -4K, and random
¤0   0:00:03.3586 {D}:  test 64 +-? : 18.11 18.04 28.25
¤0   0:00:03.3588 {D}:  test  4 +-? :  1.25  1.97  3.36
¤0   0:00:03.3590 {D}:  test 64*2*3 : 27.33 33.40 // consecutive +128 and +192
¤0   0:00:03.3591 {D}:  test L0     :  0.71 // consecutive +1, hot cache, reading a total of 16KB - one byte at a time.

Note:
* While the single access tests are bonkers - the rest are valid ... sorta, i'll get to that.
* Reading 16384 values in a loop took ~11-12 thousand clock cycles in total with hot cache (ie. ~0.71 cycles per read).
* Worst cache hit ratio slowed it down by a factor of ~70.
* TLB miss is just an extra memory miss with Windows 7 / x86-64.

So, why are the results kinda wrong?
* compiler partially unrolled some of the loops (ie. doing 4 reads per loop iteration).
* processor can (AND ABSOLUTELY LOVES TO), in effect, unroll the loops too and have multiple reads in flight.
* processor can execute multiple instructions in the same clock cycle (which is why the last result is as fast as it is).

The results are not comparable - and i do not know the margin of error. Ie. how many reads are in flight for example in the 4K test?
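
To make "reads in flight" concrete: the loops above read from addresses that do not depend on earlier reads, so the processor is free to overlap them. One commonly used way to limit that overlap is to make the address of every read depend on the result of the previous one. The test above does not do this; the following is only a sketch of the idea, and the names make_cycle and chase are made up for illustration:

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a table where next[i] is the slot to visit after slot i, arranged so
// that following it from slot 0 visits every slot once before returning to 0.
// This is Sattolo's variant of the shuffle (always swap with a strictly
// earlier slot), which produces a single cycle.
inline std::vector<std::size_t> make_cycle(std::size_t count, unsigned seed) {
    std::vector<std::size_t> next(count);
    std::iota(next.begin(), next.end(), std::size_t{0});
    if (count < 2) return next;
    std::mt19937 rng(seed);
    for (std::size_t i = count - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Each load needs the value produced by the previous load, so the loads form
// one dependency chain instead of a batch of independent reads.
inline std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t at = 0;
    for (std::size_t i = 0; i < steps; ++i) at = next[at];
    return at;
}
```

Timing chase() over a table much larger than the caches, and dividing by the step count, should land closer to a per-read latency than the numbers above, though hardware prefetching and TLB effects still muddy it.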

Glancing at one of those 4K tests: the loop code is 15 bytes total (5 basic instructions)! It more than fits fully in decoder cache (and VC likes to pad nops in front of hot loops to make the alignment perfect for it [Ivy likes 32B]). Also, the register pressure is, expectedly, very low (in regards to renaming).

Any ideas how to stop the processor and to a lesser extent, the compiler, from being so good at their job and still get usable results?

-------------------------------------------------------------------------------
Now, to get single access timings i dusted off my x86 Asm class (and added cpuid/rdtsc/rdtscp):
 

        // build test function
        int64u timings[3]; // 0 = plain overhead, 1 = cold access (+overhead), 2 = hot access (+overhead)
        Asm fn; // x64 cdecl, scap: RAX, RCX, RDX, R8, R9, R10, R11

        fn.nop(16); // VS2012 fails to decode the first 16 bytes - NOP'ing over that area, so i can actually see what i am stepping through while debugging
        //fn.int3();
        fn.push(Asm::RBX);                                        //  push RBX
        fn.push(Asm::R15);                                        //  push R15
        // get plain overhead of serialised timer query
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();                                              //  rdtscp
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[0]), Asm::na, 0, Asm::R9);  //  mov [&timings[0]], R9
        // cold access
        fn.nop(2);
        fn.mov(Asm::R15, at);                                     //  mov R15, at
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[1]), Asm::na, 0, Asm::R9);  //  mov [&timings[1]], R9
        // warmup
        fn.mov(Asm::R9, 128);                                     //  mov R9, 128
        int32u loop = fn.getIP();                                 //loop:
        fn.alu(Asm::ADD, Asm::R8, Asm::R15, 0, Asm::na, 0);       //  add R8, [R15]
        fn.dec(Asm::R9);                                          //  dec R9
        fn.j(Asm::NZ, loop);                                      //  jnz loop
        // hot access
        fn.nop(2);
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.tick();                                                //  rdtsc
        fn.mov(Asm::R8B, Asm::R15, 0, Asm::na, 0);                //  mov R8B, [R15]
        fn.mov(Asm::R10, Asm::RAX);                               //  mov R10, RAX
        fn.mov(Asm::R11, Asm::RDX);                               //  mov R11, RDX
        fn.tick_s();
        fn.bit(Asm::SHL, Asm::RDX, 32);                           //  shl RDX, 32
        fn.alu(Asm::OR, Asm::RAX, Asm::RDX);                      //  or  RAX, RDX
        fn.mov(Asm::R9, Asm::RAX);                                //  mov R9,  RAX
        fn.alu(Asm::XOR, Asm::RAX, Asm::RAX);                     //  xor rax, rax
        fn.cpuid();                                               //  cpuid
        fn.bit(Asm::SHL, Asm::R11, 32);                           //  shl R11, 32
        fn.alu(Asm::OR, Asm::R10, Asm::R11);                      //  or  R10, R11
        fn.alu(Asm::SUB, Asm::R9, Asm::R10);                      //  sub R9, R10
        fn.mov(Asm::na, intp(&timings[2]), Asm::na, 0, Asm::R9);  //  mov [&timings[2]], R9
        fn.pop(Asm::R15);                                         //  pop R15
        fn.pop(Asm::RBX);                                         //  pop RBX
        fn.ret();
        fn.done();

        void *function = VirtualAlloc(0, 65536, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if(!fn.rebase(function)) GLOGF(L"Rebase failed!");
        TRASH; TRASH;
        ((void(*)())function)(); // call function
        GLOGD(L"  overhead: %I64u  cold: %I64u  hot: %I64u", timings[0], timings[1], timings[2]);

Which gives me (the numbers vary from run to run, but are fairly close to each other): "overhead: 39 cold: 36 hot: 64".

x_x

Seriously!? Any idea what is wrong there?

Thought that perhaps telling windows to nuke the instruction cache or whatever it was might help (in case it somehow manages to affect things - it should not in this case) - but i can not for the life of me remember the function call.

 

Note: rdtscp, while serializing, does not prevent OoO execution of the instructions that follow it - including the cache-fail reads i want to test. Hence the cpuid, as it is said to stop that. rdtsc does no serialization at all, but none is needed as i need to use cpuid anyway. Evidently, something is wrong.




#5122204 runtime cost of stack overflow checking

Posted by tanzanite7 on 08 January 2014 - 12:12 PM

Dealing with the stack is the responsibility of whoever uses it, following the rules of whoever manages it, with the OS largely getting out of the way. In regards to Windows, that means: you can override every aspect of the stack behavior completely - just not as efficiently as the OS could.

 

Slight corrections to what has been said already:

By default on windows a typical stack uses a guard page to trigger stack expansion, with the additional requirement that if a function uses more than one page worth of stack then it must call a special function to ensure the guard page is not skipped (this special function call is added automatically by the compiler as part of calling convention / exception handling - in function prolog: "__alloca_probe").

 

So, besides the overhead of the function call when more than a page is needed - there is no overhead as no checks are done.

 

Stack underflow is not guarded (just not worth it). IF you are lucky then the memory there is inaccessible and hence causes an immediate access violation crash.

 

----------------------

Something i use to check the state of address space (VC++):

#include <windows.h>
#include <stdarg.h>
#include <stdint.h>
#include <wchar.h>

typedef uintptr_t intp;   // unsigned integer fit for pointer
typedef uint32_t int32u;  // unsigned 32 bit integer

// some quickly made dirty Logger of dubious quality
void logging(wchar_t const *format, ...) {
    va_list argptr; va_start(argptr, format);
    wchar_t buf[1024];
    int wrote = vswprintf(&buf[0], 1024, format, argptr);
    if(wrote<0) wrote = 0; // formatting failed or did not fit - log nothing
    va_end(argptr);
    if(wrote) OutputDebugStringW(buf);
}
// defined before dump() so that the macro is visible where it is used
#define LOG(...) logging(L"\n" __VA_ARGS__)

void dump() {
    // dump memory map
    intp at = 0;
    MEMORY_BASIC_INFORMATION info;
    while(true) {
        // VirtualQuery fails once "at" runs past the user mode address space
        if(VirtualQuery((void*)at, &info, sizeof(info)) != sizeof(info)) break;
        // note: in the VC++ wide printf family "%s" expects a wide string
        LOG(L"%c %p - %p %c%c%c%c%c%c ( %7u KB  %s ) %s"
            , info.AllocationBase == info.BaseAddress || info.AllocationBase == 0 ? L'>' : L' '
            , info.BaseAddress
            , (void*)(intp(info.BaseAddress) + intp(info.RegionSize) - 1)
            , info.Protect == 0 ? L'?' : (info.Protect & 0x10 ? L'!' : L' ')
            , info.Protect & 0xF0 ? L'x' : L'-'
            , info.Protect & 0x66 ? L'r' : L'-'
            , info.Protect & 0xCC ? L'w' : L'-'
            , info.Protect & 0x88 ? L'c' : L'-'
            , info.Protect & 0x100 ? L'G' : L'-'
            , int32u(info.RegionSize >> 10)
            , intp(info.BaseAddress) & 65535 ? L"a:???" : L"a:64K"
            , info.State == MEM_FREE ? L"¤" : (info.State == MEM_COMMIT ? L"USED" : L"RESERVED")
        );
        at += info.RegionSize;
    }
}

which, in my case, gives (some excerpts):

> 00000000 - 0000FFFF  ----- (      64 KB  a:64K ) ¤ // a page granularity worth of a general no-go area
  ...
> 00020000 - 0002EFFF  -rw-- (      60 KB  a:64K ) USED // a guarded cyclic buffer of mine - just something that the program i ran dump() in happens to use
  0002F000 - 0002FFFF  -r--G (       4 KB  a:??? ) USED
  ...
> 00090000 - 00090FFF  -r--- (       4 KB  a:64K ) USED // my program (non fixed image address)
  00091000 - 000A0FFF  xrw-- (      64 KB  a:??? ) USED // ... "x" AND "w" !? ... wut!? What is this!?
  000A1000 - 000B5FFF  xr--- (      84 KB  a:??? ) USED // ... program code
  000B6000 - 000B9FFF  -r--- (      16 KB  a:??? ) USED // ... data
  000BA000 - 00168FFF  -rw-- (     700 KB  a:??? ) USED
  00169000 - 0016BFFF  -r--- (      12 KB  a:??? ) USED
  ...
> 00250000 - 00288FFF ?----- (     228 KB  a:64K ) RESERVED // some stack (should not be mine, directly - dunno)
  00289000 - 0028BFFF  -rw-G (      12 KB  a:??? ) USED
  0028C000 - 0028FFFF  -rw-- (      16 KB  a:??? ) USED
> 00290000 - 0031FFFF  ----- (     576 KB  a:64K ) ¤ // memory following stack is not reserved and is free for grab
  ...
> 00440000 - 0053BFFF ?----- (    1008 KB  a:64K ) RESERVED // my stack (1MB stack is the default setting in VC++)
  0053C000 - 0053CFFF  -rw-G (       4 KB  a:??? ) USED
  0053D000 - 0053FFFF  -rw-- (      12 KB  a:??? ) USED
> 00540000 - 005CFFFF  ----- (     576 KB  a:64K ) ¤ // again - free for grab
  ...
> 005D0000 - 005D7FFF  -rw-- (      32 KB  a:64K ) USED // heap perhaps?
  005D8000 - 006CFFFF ?----- (     992 KB  a:??? ) RESERVED
  ...
> 007D0000 - 207CFFFF  -rw-- (  524288 KB  a:64K ) USED // half a GB used for my own - non heap
  ...
> 207D0000 - 607CFFFF  ----- ( 1048576 KB  a:64K ) ¤ // first big chunk of unused address space
> 607D0000 - 607D0FFF  -r--- (       4 KB  a:64K ) USED // dll/system land begins
  607D1000 - 60958FFF  xr--- (    1568 KB  a:??? ) USED
  60959000 - 6095EFFF  -rw-- (      24 KB  a:??? ) USED
  6095F000 - 60970FFF  -r--- (      72 KB  a:??? ) USED
> 60971000 - 7533FFFF  ----- (  337724 KB  a:??? ) ¤ // a stray block of free address space cut short by some dll/system shenanigans
  ...
  dll/system heaven
  ...
> 7FFF0000 - FFFAFFFF  ----- ( 2096896 KB  a:64K ) ¤ // some more address space - the benefits of "large address aware" on 64bit windows.
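
For reading the protection column of the dump: the masks used in dump() group the PAGE_* protection constants by capability. Below is a portable sketch of the same decoding that can be tried without Windows headers. The function name protect_flags is made up for illustration, and the numeric values in the comments are those of the PAGE_* constants from the Windows headers.

```cpp
#include <cstdint>
#include <string>

// Renders a page protection value in the six character style of the dump.
// PAGE_NOACCESS 0x01, PAGE_READONLY 0x02, PAGE_READWRITE 0x04,
// PAGE_WRITECOPY 0x08, PAGE_EXECUTE 0x10, PAGE_EXECUTE_READ 0x20,
// PAGE_EXECUTE_READWRITE 0x40, PAGE_EXECUTE_WRITECOPY 0x80, PAGE_GUARD 0x100
std::string protect_flags(std::uint32_t protect) {
    std::string s;
    // '?' = no protection reported (reserved region), '!' = execute-only page
    s += protect == 0 ? '?' : ((protect & 0x10) ? '!' : ' ');
    s += (protect & 0xF0)  ? 'x' : '-';   // any of the four EXECUTE variants
    s += (protect & 0x66)  ? 'r' : '-';   // readable variants
    s += (protect & 0xCC)  ? 'w' : '-';   // writable and copy-on-write variants
    s += (protect & 0x88)  ? 'c' : '-';   // copy-on-write variants
    s += (protect & 0x100) ? 'G' : '-';   // guard page modifier
    return s;
}
```

For example, PAGE_READWRITE | PAGE_GUARD (0x104) comes out as " -rw-G", which is how the guard pages below the committed part of each stack show up in the excerpts.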

-----------------------------------

"but i heard something that now pages are bigger than 4k ?"

 

Bigger pages have been possible for a long time (since the Pentium on the x86 line) - they are just rarely used on desktop machines.

 

Note also: on some (non-x86) hardware 4KB pages are not even an option - 8KB is the smallest possible page size there, for example.
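
Since the page size is a property of the hardware and OS, it is safer to ask for it than to hard-code 4096. A minimal sketch, assuming either Windows (GetSystemInfo) or a POSIX system (sysconf); the function name page_size is made up for illustration:

```cpp
#include <cstddef>
#if defined(_WIN32)
  #include <windows.h>
#else
  #include <unistd.h>
#endif

// Returns the size in bytes of a normal (small) page on this machine.
std::size_t page_size() {
#if defined(_WIN32)
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwPageSize;
#else
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
#endif
}
```

On Windows the same SYSTEM_INFO structure also carries dwAllocationGranularity, which is the 64K alignment that the "a:64K" column in the dump above checks for.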




#5119600 Tiling textures using mod() in shader causes stripes on textures

Posted by tanzanite7 on 27 December 2013 - 07:43 PM

You might have bad mipmap selection as mentioned, however that would generate only 2x2 spots of garbage - that clearly does not match the picture.

 

Would guess it is the half texel issue (depending on whether Direct3D or OpenGL is used, (0,0) is either the corner [OGL] or the middle [D3D? - or have they changed that?] of a texel) - however, the error seems to cover more texels than one would expect in such a case (might be just my eyes - the error should be about 1, rarely 2, texels worth).

 

My guess is that your texture coordinates are just wrong. Take pencil and paper and check that the final coordinates are exactly what they should be.
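
For the pencil and paper check, a small sketch of the arithmetic under the convention where (0,0) is the corner of the first texel (the OGL case above). The helper names texel_center and nearest_texel are made up for illustration and are not part of any graphics API:

```cpp
#include <cmath>

// Normalised coordinate of the centre of texel i in a row of size texels.
double texel_center(int i, int size) {
    return (i + 0.5) / size;
}

// Texel that a normalised coordinate falls into with nearest sampling,
// clamped to the valid range.
int nearest_texel(double u, int size) {
    int i = static_cast<int>(std::floor(u * size));
    if (i < 0) i = 0;
    if (i > size - 1) i = size - 1;
    return i;
}
```

A coordinate of the form k / size sits exactly on the border between texels k-1 and k, so the slightest rounding error picks the wrong one. Coordinates meant to hit a specific texel should be the texel_center values instead.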




#5119410 Forcing Code To Work !

Posted by tanzanite7 on 26 December 2013 - 07:41 PM

The browser doesn't let you type those extra characters, but you can either post from a separate file/page or use Javascript to alter the form settings.

In my case, all i have to do is pick "Forms" -> "Remove Maximum Lengths" from the always visible web developer toolbar.

Reminds me of a finished project i was handed for greenlighting into active use some years ago, when its original dev left. Instead of simply giving a go/no-go, i opted to construct a string ready for copy-paste and just showed them:
* two mouse clicks to remove the form field limitations
* copy-paste the crafted username into login form
* press enter
=> logged into the application as superuser. As an additional bonus - the login crashed the logging subsystem, without invalidating the login itself -> leaving no traces.

Not blaming the original dev too much tho - he was not qualified to do the job and everyone involved knew it from the outset, they just thought "how hard can it be?".
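
The general lesson is that a length limit on a form field is only a convenience for the user, so the receiving side has to check the input again. A minimal sketch of such a check; the limit kMaxUsernameLength, the allowed character set and the function name are all hypothetical and not taken from the application in the story:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical limit - must match whatever the form pretends to enforce.
constexpr std::size_t kMaxUsernameLength = 32;

// Server-side check: never assume the client respected the form limits.
bool is_acceptable_username(const std::string& name) {
    if (name.empty() || name.size() > kMaxUsernameLength) return false;
    for (unsigned char ch : name) {
        // allow only letters, digits and underscore
        if (!std::isalnum(ch) && ch != '_') return false;
    }
    return true;
}
```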


#5118314 How to write fast code without optimizing

Posted by tanzanite7 on 20 December 2013 - 12:54 AM

Edit: Seriously, why the anonymous downvote...?

Beats me.

Never liked the particular reputation system that gamedev employs, even less i like the misuse of it - so, upvoted for balance.



