[c++, asm] A slow stack?

Started by
27 comments, last by ttdeath 18 years, 7 months ago
Hi, I've learned that the stack is a special piece of RAM positioned next to the processor, which should make it incredibly fast. At the moment I need fast RAM, but when I test the stack's speed using inline assembly in MinGW, I find that the stack is no quicker than 'normal' memory. Executing the following piece of code takes about 3 seconds when you let ptr point to the stack, and the same amount of time when you let it point to a (normal) array.

for (int t = 0; t < 1000; t++) {
	for (int i = 0; i < 512 * 512; i++) {
		ptr[i] = k;
	}
}
Does anybody know why this happens? Why isn't the stack quicker? Should it be possible (in pure assembly perhaps) to make the code shown above execute faster? Thanks for any help, it's much appreciated. Bas
Stack isn't physically any different from normal memory.

The only way it can possibly be different is that on some CPUs, in some circumstances, it may be addressable with shorter pointers. I've no idea whether this applies to Intel, but if it does, the difference will only come from the size of the code required to access it (smaller code == faster code, usually).

All machines I've ever used have had stack as a normal part of the memory, not separate in any way - therefore it will be exactly the same speed.

It should also be noted that on multitasking OSs, each task has its own stack.

Mark
Quote:Original post by basananas
Hi,
I've learned that the stack is a special piece of RAM, that is positioned next to the processor. It should be incredibly fast because of that.
That's just a rumour, I'm afraid.
Your description fits the description of a cache well though, so maybe you (or someone else) are confusing 'stack' with 'cache'?

Is there some piece of code in particular that you'd like to speed up? There are a number of optimisation gurus here that would be glad to help if you'd like to post your code.[smile]
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Unfortunately, you are incorrect. The stack is just like any other memory, except the allocation and deletion of it is handled for you, instead of you having to explicitly do it. What you might be thinking of is the cache, an amount of memory usually on the CPU itself, in which the CPU generally puts memory it has used recently, along with the surrounding memory.

So yes, stack vs. heap will show no significant difference. But if you could precache the memory you plan to use before you use it, it should go a good sight faster than taking a cache miss on every (or every fourth, or whatever, depending on how much of the surrounding memory the CPU caches with each access) memory read and write.

BTW, in your test code, did you also time how long it takes to allocate and free the array from the heap? That is also part of what makes the stack slightly faster: when you allocate something on the stack, the CPU has a register that always points to the next open space on the stack, so allocating from it (pushing) and freeing from it (popping) is usually much faster than malloc/free. Malloc and free both have to run through the free-memory lists, find the best/first fit, etc. They can't just shove the memory at the next open address, because there might not be enough space there. The stack does exactly that, because it only grows in one direction, and because of its unique nature there are never any 'holes' in stack memory; it is always one continuous block.

Hope this helps;
Free speech for the living, dead men tell no tales. Your laughing finger will never point again... Omerta! Sing for me now!
Thanks all for the help. I probably was confusing the cache with the stack, though I always thought that the stack was part of the cache (that would have been really handy).

Quote:Original post by SirLuthor
So yes, stack vs. heap will show no significant difference. But if you could precache the memory you plan to use before you use it, it should go a good sight faster than taking a cache miss on every (or every fourth, or whatever, depending on how much of the surrounding memory the CPU caches with each access) memory read and write.


That sounds interesting, do you know where I can get more information on the cache and how to use it? I suppose that there are no commands to simply place an array in the cache, but some general information would be very interesting.

Greetings,
Bas
Have you tried using the keyword register? It's part of standard C++ and should be supported by all compilers. I'm not sure if it's actually faster, but it suggests to the compiler to place the variable in a register for super fast access. Correct me if I'm wrong, guys?

for (register int t = 0; t < 1000; t++) {
	for (register int i = 0; i < 512 * 512; i++) {
		ptr[i] = k;
	}
}


~guyaton
The fact that the stack is not a part of the cache is not a problem, because it "becomes" a part of the cache when you use it. Generally speaking, the stack levels farthest from main() always reside in L1 cache. That means the stack always runs as fast as it could.
In your example, you are definitely L1-cache or instruction bound: all the memory you use resides in cache, since you are touching it 1000 times and it is not a large amount of memory.
Act of War - EugenSystems/Atari
Quote:Original post by guyaton
Have you tried using the keyword register? It's part of standard C++ and should be supported by all compilers. I'm not sure if it's actually faster, but it suggests to the compiler to place the variable in a register for super fast access. Correct me if I'm wrong, guys?


Major compilers ignore the register keyword completely.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by guyaton
Have you tried using the keyword register? It's part of standard C++ and should be supported by all compilers. I'm not sure if it's actually faster, but it suggests to the compiler to place the variable in a register for super fast access. Correct me if I'm wrong, guys?


You could have been correct ten or more years ago, when compilers did not optimize code. I cannot imagine a program that calls itself a "C compiler" that would not optimize this (if you turn optimizations on, of course). In fact, since the C language (unlike C++, as far as I know) does not allow taking the address of a register variable, the register keyword is not a benefit but a hazard.
The register keyword didn't make any difference, but it was a nice try ;)

I've read an AMD article about optimizing assembly code and found an interesting piece of text about cache memory (http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf, page 63 and further)

Basically, the text explained that if you make sure the memory you want is in L1 cache, the execution speed will increase. The strange thing is that one would think the processor would already make sure that everything you need is in L1 cache. As Janos said:
Quote:
the fact that the stack is not a part of the cache is not a problem, because it "becomes" a part of the cache when you use it. Generally speaking, the stack levels farthest from main() always reside in L1 cache. That means the stack always runs as fast as it could.


But, according to the AMD document, this is NOT true! For some silly reason, you will get faster code if you make sure that, before you enter the main loop, everything you need is in L1 cache. I tried it myself, and it works! The program code got more than 50% faster!

In C-code, both of the asm-codes further below do the following:
for (int t = 0; t < 20000; t++) {
	for (int i = 0; i < 512 * 512; i++) {
		k = ptr[i];
	}
}


This is the asm code with pre-caching in Level 1 cache
for (int i = 0; i < 20000; i++) {
	asm("pushl %edi");
	//Pre-caching; max 8KB L1 cache
	asm("xorl %edi, %edi");
	asm("movl $8192, %edx");
	asm("movl $1048576, %ebx");
	asm("mainloop:");
		//Pre-cache 8192 bytes
		asm("fetch:");
			asm("movl _xy_z(%edi), %eax");
			asm("addl $64, %edi");
			asm("cmpl %edx, %edi");
		asm("jne fetch");
		asm("subl $8192, %edi"); //now that the mem is cached, go read it out
		asm("loop:");
			asm("movl _xy_z(%edi), %eax");
			asm("movl %eax, _k");
			asm("addl $4, %edi");
			asm("cmpl %edx, %edi");
		asm("jne loop");
		//Next block
		asm("addl $8192, %edx");
		asm("cmpl %edi, %ebx");
	asm("jne mainloop");
	asm("popl %edi");
}

The code above took 20 seconds to execute (tried it a couple of times; stable execution time)

for (int i = 0; i < 20000; i++) { //30 s
	asm("pushl %edi");
	//No Pre-caching
	asm("xorl %edi, %edi");
	asm("movl $8192, %edx");
	asm("movl $1048576, %ebx");
	asm("mainloop:");
		asm("loop:");
			asm("movl _xy_z(%edi), %eax");
			asm("movl %eax, _k");
			asm("addl $4, %edi");
			asm("cmpl %edx, %edi");
		asm("jne loop");
		//Next block
		asm("addl $8192, %edx");
		asm("cmpl %edi, %ebx");
	asm("jne mainloop");
	asm("popl %edi");
}


The code above took 30 seconds to execute. This is without pre-caching.

So if you don't pre-cache, it's slower. Weird, does anybody know why this is so?

Bas

