runtime cost of stack overflow checking

Started by
7 comments, last by SimonForsman 10 years, 3 months ago

I often wonder but i do not know this thing: when you do a push

data on stack it could be stack overflow - on the old machines it was

not checked so you winding the stack then overwriting of old stack

spaces with new walues - stack corruption

On windows system it seem to me that this is quarded it is you got

an exception, memory acces fault, something like that

Does it cost some runtime or this is free? I suspect that maybe it

is maybe like that that stack acces is not runtime checked but

after the stack pages you got an empty memory page where writting

wil just cause an exception? is this the case? if you got a large

amounts of user stacks in the system are they are guarded at the end

by empty pages?

If this runtime cost free mechanism is working like that i find accidentaly

that this probably could be used for guarding normal arrays boundary

acces checking - signaled exception is better than quiet data corruption..

Is it working like that?

Advertisement
As far as I know, for non-segmented stacks, a stack overflow manifests as an access violation into the next (or previous) memory sector.

For segmented stacks, there's a run-time check that happens when a function is called to see if there's enough room for all local variables and similar on the stack. It's a simple arithmetic calculation that does not access memory.

As far as I know, for non-segmented stacks, a stack overflow manifests as an access violation into the next (or previous) memory sector.

For segmented stacks, there's a run-time check that happens when a function is called to see if there's enough room for all local variables and similar on the stack. It's a simple arithmetic calculation that does not access memory.

could you tell maybe a bit more what do you mean by segmented stacks? does windows use that? (i checked something about that

in google but i do not know it is used or this is just some artificial

contstruct made in user programs)

as to the main question i suspect i answered to myslelf (i suspect that

it is guarded at the end (or att the bottom as stack grows opposite)

by non writable pages - though i am not sure

also yet one unclear cuestion - people say sometimes that every thread has its own stack and they say that by default the stack is usually 1MB size - Checking system info gives me the information of

600 threads running - do this really means 600MB of reserved ram

all for thread stacks ? :-/

It's a compiler option. Gcc calls it split stack. Segmented stack is the clang term. Not sure if VC has a similar feature. MinGW should, however; it's not OS integrated.

A large number of threads is exactly the motivation for split stacks, because it allows smaller stacks for most threads. So an application with 600 threads is more likely to be using split stacks.

But without split stacks, yeah, each thread needs to allocate a lot of memory for the stack. A saving grace is that allocating memory and not using it doesn't have a real performance impact; it only uses up your application's virtual memory pool. Also, the side of a non-split stack is configurable.

Edit: It seems to be poorly documented, but present in clang/llvm.

As far as I know on x86 Windows, guard pages are handled by using the processor's protected mode page fault behavior.

When any execute/read/write operation accesses a page that's guarded:

  • The processor triggers a page fault exception since the page is marked as "not-present" to the processor.
  • Windows sees the page fault and looks up the offending address, sees that it has a guard page at that location, pages the memory in (if it previously was an active page before being changed to a guard page), disables the guard flag and raises a Windows-specific guard page exception.

The guard page exception can then be:

  • Intercepted by a debugger.
  • Used to expand stack space, which allocates the offending page as more stack space and then sets up a new guard page above it.
  • Used to cause a stack overflow exception if the stack cannot expand any more.

Windows only puts guard pages at the top (low address) of its stacks. It does not put guard pages after them (at the high addresses). However, the memory that comes after the stack will not have any allocated pages. So this will happen:

  • You accidentally access memory below the stack.
  • A page fault occurs.
  • Windows looks up that page and sees that nothing exists there, then causes an access violation exception, which crashes your thread.

tl;dr: The processor does this the same way it handles access violations, so from a software point of view there is no additional overhead until the access violation actually occurs.

The reason nobody uses this to automatically do bounds checking on standard data arrays is because the minimum page size is 4K. This means your array would need to be a multiple of 4K in size, and you would need to isolate the array in address space such that no other allocated pages are nearby. It would waste a lot of address space to do this.

The reason nobody uses this to automatically do bounds checking on standard data arrays is because the minimum page size is 4K. This means your array would need to be a multiple of 4K in size, and you would need to isolate the array in address space such that no other allocated pages are nearby. It would waste a lot of address space to do this.

I think it wouldnt be a lot (I am still not sure if the stacks are guarded like that but maybe, and i think it would be okay - bounding checks checks for free) 4k is not so much, you usually got no millions arrays

in the system (especially i thing guarded could be only the bigger ones) i think this would be 'number of guarded arrays' x 4kb (x2 if both ways) ;/ (+ maybe the unused space of last array page till the 4k boundary) especially on x64 you got a lot of adress space

but i heard something that now pages are bigger tham 4k ?

as to wasting logical space : are the previous c static arrays

putted each in the new pages or just tighten with the old ones in their old pages?

Dealing with stack is the responsibility of whoever uses it by the rules of whoever manages the stack to the extent of the OS getting out of the way. In regards to Windows, that means: you can override every aspect of the stack behavior completely - just not as efficiently as the OS could.

Slight corrections to what has been said already:

By default on windows a typical stack uses a guard page to trigger stack expansion with the additional requirement that if a function uses more that one page worth of stack then it must call a special function to ensure the guard page is not skipped (this special function call is added automatically by the compiler as part of calling convention / exception handling - in function prolog: "__alloca_probe").

So, besides the overhead of the function call when more than a page is needed - there is no overhead as no checks are done.

Stack underflow is not guarded (just not worth it). IF you are lucky then the memory there is inaccessible and hence causes an immediate access violation crash.

----------------------

Something i use to check the state of address space (VC++):


void dump() {
    // dump memory map
    intp at = 0;
    MEMORY_BASIC_INFORMATION info;
    while(true) {
        if(VirtualQuery((void*)at, &info, sizeof(info)) != sizeof(info)) break;
        LOG(L"%c %p - %p %c%c%c%c%c%c ( %7d KB  %s ) %s"
            , info.AllocationBase == info.BaseAddress || info.AllocationBase == 0 ? L'>' : L' '
            , intp(info.BaseAddress)
            , intp(info.BaseAddress) + intp(info.RegionSize) - 1
            , info.Protect == 0 ? L'?' : (info.Protect & 0x10 ? L'!' : L' ')
            , info.Protect & 0xF0 ? L'x' : '-'
            , info.Protect & 0x66 ? L'r' : '-'
            , info.Protect & 0xCC ? L'w' : '-'
            , info.Protect & 0x88 ? L'c' : '-'
            , info.Protect & 0x100 ? L'G' : '-'
            , int32u(info.RegionSize >> 10)
            , intp(info.BaseAddress) & 65535 ? L"a:???" : L"a:64K"
            , info.State == MEM_FREE ? L"¤" : (info.State == MEM_COMMIT ? L"USED" : L"RESERVED")
        );
        at += info.RegionSize;
    }
}
 
// some quickly made dirty Logger of dubious quality
void logging(wchar_t const *format, ...) {
    va_list argptr; va_start(argptr, format);
    wchar_t buf[1024];
    int wrote = vswprintf(&buf[0], 1024, format, argptr);
    if(wrote<0) wrote = 0;
    va_end(argptr);
    if(wrote) OutputDebugStringW(buf);
}
#define LOG(...) logging(L"\n" __VA_ARGS__);
// intp = unsigned integer fit for pointer (ie. uintptr_t), int32u = unsigned 32 bit integer

which, in my case, gives (some excerpts):


> 00000000 - 0000FFFF  ----- (      64 KB  a:64K ) ¤ // a page granularity worth of a general no-go area
  ...
> 00020000 - 0002EFFF  -rw-- (      60 KB  a:64K ) USED // a guarded cyclic buffer of mine - just something the program used for the dump() apparently uses
  0002F000 - 0002FFFF  -r--G (       4 KB  a:??? ) USED
  ...
> 00090000 - 00090FFF  -r--- (       4 KB  a:64K ) USED // my program (non fixed image address)
  00091000 - 000A0FFF  xrw-- (      64 KB  a:??? ) USED // ... "x" AND "w" !? ... wut!? What is this!?
  000A1000 - 000B5FFF  xr--- (      84 KB  a:??? ) USED // ... program code
  000B6000 - 000B9FFF  -r--- (      16 KB  a:??? ) USED // ... data
  000BA000 - 00168FFF  -rw-- (     700 KB  a:??? ) USED
  00169000 - 0016BFFF  -r--- (      12 KB  a:??? ) USED
  ...
> 00250000 - 00288FFF ?----- (     228 KB  a:64K ) RESERVED // some stack (should not be mine, directly - dunno)
  00289000 - 0028BFFF  -rw-G (      12 KB  a:??? ) USED
  0028C000 - 0028FFFF  -rw-- (      16 KB  a:??? ) USED
> 00290000 - 0031FFFF  ----- (     576 KB  a:64K ) ¤ // memory following stack is not reserved and is free for grab
  ...
> 00440000 - 0053BFFF ?----- (    1008 KB  a:64K ) RESERVED // my stack (1MB stack is the the default setting in VC++)
  0053C000 - 0053CFFF  -rw-G (       4 KB  a:??? ) USED
  0053D000 - 0053FFFF  -rw-- (      12 KB  a:??? ) USED
> 00540000 - 005CFFFF  ----- (     576 KB  a:64K ) ¤ // again - free for grab
  ...
> 005D0000 - 005D7FFF  -rw-- (      32 KB  a:64K ) USED // heap perhaps?
  005D8000 - 006CFFFF ?----- (     992 KB  a:??? ) RESERVED
  ...
> 007D0000 - 207CFFFF  -rw-- (  524288 KB  a:64K ) USED // half a GB used for my own - non heap
  ...
> 207D0000 - 607CFFFF  ----- ( 1048576 KB  a:64K ) ¤ // first big chunk of unused address space
> 607D0000 - 607D0FFF  -r--- (       4 KB  a:64K ) USED // dll/system land begins
  607D1000 - 60958FFF  xr--- (    1568 KB  a:??? ) USED
  60959000 - 6095EFFF  -rw-- (      24 KB  a:??? ) USED
  6095F000 - 60970FFF  -r--- (      72 KB  a:??? ) USED
> 60971000 - 7533FFFF  ----- (  337724 KB  a:??? ) ¤ // a stray block of free address space cut short by some dll/system shenanigans
  ...
  dll/system heaven
  ...
> 7FFF0000 - FFFAFFFF  ----- ( 2096896 KB  a:64K ) ¤ // some more address space - the benefits of "large address aware" on 64bit windows.

-----------------------------------

"but i heard something that now pages are bigger than 4k ?"

Bigger pages have always been possible (i386+ for the whole x86 line) - just that it is rarely ever used on desktop machines.

Note also: on some hardware (non x86) - 4KB pages are not even an option and 8KB is the smallest possible for example.


Gcc calls it split stack. Segmented stack is the clang term. Not sure if VC has a similar feature. MinGW should, however; it's not OS integrated.

This is a GNU/Linux only feature at the present time. MinGW doesn't support it.

Long explanation of guard pages

This, exactly.

600 threads running - do this really means 600MB of reserved ram

600 threads running means you're doing something seriously wrong. An application will very rarely need more than a dozen threads. Usually you will want approximately one work-intensive thread per CPU core (or fewer, if some cores are "hyperthreaded" cores), and one or two mostly-idle threads for having "other stuff" such as lengthy input/output run asynchronously without blocking the main thread.

But yes, given default stack sizes, 600 threads would mean 600MiB of reserved address space (and 2400kiB [?2.3MiB] of committed RAM). Note that having 600 threads running on the system as a whole (not in one single application) isn't an issue at all since they're in different address spaces.

Large pages are something that works "almost-automatically" under recent versions of Linux if configured right, and with some limitations. Large pages are quite unwieldy under Windows, and come with serious limitations. In either case, it is not something that 99% of all programmers have to worry about. The default works fine.

The only advantage of large pages is that you save a few TLB entries, and unless you're writing something like a relational database of the kind that e.g. Oracle is selling, there is absolutely no measurable difference for you.

Windows can automatically create guard pages after every allocation for any application that uses HeapAlloc(), if you ask it to. See http://msdn.microsoft.com/en-us/library/windows/hardware/ff549561%28v=vs.85%29.aspx

It's generally only used for debugging, but you could easily implement it yourself by using VirtualAlloc() and marking the page before and after the array as PAGE_NO_ACCESS using VirtualProtect(). It will obviously only reliably detect overruns in one direction - either off the start or off the end of the array, unless the array size is an exact multiple of 4k.

In addition to the extra address space it eats up (which isn't an issue on 64-bit), it adds extra overhead to allocation and deletion. You may also see a performance hit from extra cache and TLB misses.


Gcc calls it split stack. Segmented stack is the clang term. Not sure if VC has a similar feature. MinGW should, however; it's not OS integrated.

This is a GNU/Linux only feature at the present time. MinGW doesn't support it.

Long explanation of guard pages

This, exactly.

600 threads running - do this really means 600MB of reserved ram

600 threads running means you're doing something seriously wrong. An application will very rarely need more than a dozen threads. Usually you will want approximately one work-intensive thread per CPU core (or fewer, if some cores are "hyperthreaded" cores), and one or two mostly-idle threads for having "other stuff" such as lengthy input/output run asynchronously without blocking the main thread.

But yes, given default stack sizes, 600 threads would mean 600MiB of reserved address space (and 2400kiB [?2.3MiB] of committed RAM). Note that having 600 threads running on the system as a whole (not in one single application) isn't an issue at all since they're in different address spaces.

Large pages are something that works "almost-automatically" under recent versions of Linux if configured right, and with some limitations. Large pages are quite unwieldy under Windows, and come with serious limitations. In either case, it is not something that 99% of all programmers have to worry about. The default works fine.

The only advantage of large pages is that you save a few TLB entries, and unless you're writing something like a relational database of the kind that e.g. Oracle is selling, there is absolutely no measurable difference for you.

600 threads might be a lot on a x86 desktop PC, but not all computers are 4-8 core PCs these days. (There are several embedded CPU architectures that feature far higher core counts)

[size="1"]I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!

This topic is closed to new replies.

Advertisement