[c++, asm] A slow stack?

basananas


Hi, I've learned that the stack is a special piece of RAM positioned next to the processor, which should make it incredibly fast. At the moment I need fast RAM, but when I test the stack's speed using inline assembly with MinGW, I find that the stack is no quicker than 'normal' memory. Executing the following piece of code takes about 3 seconds when you let ptr point to the stack, and the same amount of time when you let it point to a (normal) array.
for (int t = 0; t < 1000; t++) {
	for (int i = 0; i < 512 * 512; i++) {
		ptr[i] = k;
	}
}
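For reference, here is the whole test written out as a self-contained sketch (the names and sizes are my own choices; note that a full 512*512 int array is 1 MB, which can overflow the default stack, so the buffer here is smaller):

#include <chrono>
#include <iostream>

volatile int k = 42;

// Times the fill loop over n ints, repeated reps times.
double fill(volatile int* ptr, int n, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int t = 0; t < reps; t++)
        for (int i = 0; i < n; i++)
            ptr[i] = k;               // volatile: the stores can't be elided
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const int n = 64 * 1024;          // 256 KB of ints
    int stack_arr[64 * 1024];         // lives on the stack
    int* heap_arr = new int[n];       // 'normal' memory (heap)

    std::cout << "stack: " << fill(stack_arr, n, 4000) << " s\n";
    std::cout << "heap:  " << fill(heap_arr, n, 4000) << " s\n";
    delete[] heap_arr;
    return 0;
}

Both versions take the same time for me, which is what prompted the question.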
Does anybody know why this happens? Why isn't the stack quicker? Should it be possible (in pure assembly, perhaps) to make the code shown above execute faster? Thanks for any help, it's much appreciated. Bas

Stack isn't physically any different from normal memory.

The only way it can possibly be different is that on some CPUs, in some circumstances, it may be addressable with shorter pointers. I've no idea whether this applies to Intel, but if it does, the difference will only be because of the size of the code required to access it (smaller code == faster code, usually).

All machines I've ever used have had the stack as a normal part of memory, not separate in any way, so it will be exactly the same speed.

It should also be noted that on multitasking OSs, each task has its own stack.

Mark

Quote:
Original post by basananas
Hi,
I've learned that the stack is a special piece of RAM positioned next to the processor, which should make it incredibly fast.
That's just a rumour, I'm afraid.
Your description fits a cache well, though, so maybe you (or someone else) is confusing 'stack' with 'cache'?

Is there some piece of code in particular that you'd like to speed up? There are a number of optimisation gurus here who would be glad to help if you'd like to post your code. [smile]

Unfortunately, you are incorrect. The stack is just like any other memory, except that its allocation and deletion are handled for you instead of you having to do it explicitly. What you might be thinking of is the cache: an amount of memory, usually on the CPU itself, in which the CPU generally puts memory it has used recently, along with the surrounding memory.

So yes, stack vs. heap will show no significant difference. But if you could precache the memory that you plan to use before you use it, then it should go a good sight faster than taking a cache miss on every (or every fourth, or whatever, depending on how much of the surrounding memory the CPU caches with every access) memory read and write.

BTW, in your test code, did you also time how long it takes to allocate and free the array from the heap? Because that is also part of what makes the stack slightly faster: when you are allocating something from the stack, the CPU has a register that always points to the next open space on the stack, so allocating from it (pushing) and freeing from it (popping) is usually much faster than malloc/free. That's because malloc/free both have to run through the free memory lists, find the best/first fit, etc. They can't just shove the memory on the next open address, because there might not be enough space there. But the stack does exactly that, because it only grows in one direction, and because of its unique nature there are never any 'holes' in stack memory; it is always one continuous block.
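If you want to see that allocation-cost difference by itself, a quick micro-benchmark could look something like this (just a sketch; the sizes and iteration counts are arbitrary, and a clever compiler may optimize some of it away):

#include <chrono>
#include <cstdlib>
#include <iostream>

volatile char sink; // keeps the buffers from being optimized away entirely

int main() {
    const int iterations = 1000000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++) {
        char buf[256];                        // stack: one stack-pointer adjustment
        buf[0] = (char)i;
        sink = buf[0];
    }
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++) {
        char* buf = (char*)std::malloc(256);  // heap: free-list bookkeeping
        buf[0] = (char)i;
        sink = buf[0];
        std::free(buf);
    }
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "stack: " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "heap:  " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    return 0;
}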

Hope this helps;

Thanks all for the help. I was probably confusing the cache with the stack, though I always thought the stack was part of the cache (that would have been really handy).

Quote:
Original post by SirLuthor
So yes, stack vs. heap will show no significant difference. But if you could precache the memory that you plan to use before you use it, then it should go a good sight faster than taking a cache miss on every (or every fourth, or whatever, depending on how much of the surrounding memory the CPU caches with every access) memory read and write.


That sounds interesting. Do you know where I can get more information on the cache and how to use it? I suppose there are no instructions that simply place an array in the cache, but some general information would be very interesting.

Greetings,
Bas

Have you tried using the keyword register? It's part of standard C++ and should be supported by all compilers. I'm not sure if it's actually faster, but it suggests to the compiler to place the variable in a register for super fast access. Correct me if I'm wrong, guys?


for (register int t = 0; t < 1000; t++) {
    for (register int i = 0; i < 512 * 512; i++) {
        ptr[i] = k;
    }
}



~guyaton

The fact that the stack is not a part of the cache is not a problem, because it "becomes" a part of the cache when you use it. Generally speaking, the stack frames farthest from main() (the most recently used ones) always reside in L1 cache. That means the stack always runs as fast as it could.
In your example, you are definitely L1-cache or instruction bound: all the memory you use resides in cache, since you are touching it 1000 times and it is not a large amount of memory.

Quote:
Original post by guyaton
Have you tried using the keyword register? It's part of standard C++ and should be supported by all compilers. I'm not sure if it's actually faster, but it suggests to the compiler to place the variable in a register for super fast access. Correct me if I'm wrong, guys?


Major compilers ignore the register keyword completely.

Quote:
Original post by guyaton
Have you tried using the keyword register? It's part of standard C++ and should be supported by all compilers. I'm not sure if it's actually faster, but it suggests to the compiler to place the variable in a register for super fast access. Correct me if I'm wrong, guys?


That could have been correct ten or more years ago, when compilers did not optimize code. I cannot imagine a program that calls itself a 'C compiler' that would not optimize this (if you turn optimizations on, of course). In fact, since the C language (unlike C++, as far as I know) does not allow taking the address of a register variable, the register keyword is not a benefit but a hazard.

The register keyword didn't make any difference, but it was a nice try ;)

I've read an AMD article about optimizing assembly code and found an interesting piece of text about cache memory (http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf, page 63 and onwards).

Basically, the text explains that if you make sure the memory you want is in L1 cache, execution speed will increase. The strange thing is that one would think the processor already makes sure that everything you need is in L1 cache. Like Janos said:
Quote:

The fact that the stack is not a part of the cache is not a problem, because it "becomes" a part of the cache when you use it. Generally speaking, the stack frames farthest from main() (the most recently used ones) always reside in L1 cache. That means the stack always runs as fast as it could.


But, according to the AMD document, this is NOT true! For some silly reason, you get faster code if you make sure that everything you need is in L1 cache before you enter the main loop. I tried it myself, and it works! The code got more than 50% faster!

In C code, both of the asm listings below do the following:

for (int t = 0; t < 20000; t++) {
    for (int i = 0; i < 512 * 512; i++) {
        k = ptr[i];
    }
}


This is the asm code with pre-caching into L1 cache:

for (int i = 0; i < 20000; i++) {
asm("pushl %edi");

//Pre-caching; max 8KB L1 cache
asm("xorl %edi, %edi");
asm("movl $8192, %edx");
asm("movl $1048576, %ebx");

asm("mainloop:");
//Pre-cache 8192 bytes
asm("fetch:");
asm("movl _xy_z(%edi), %eax");
asm("addl $64, %edi");
asm("cmpl %edx, %edi");
asm("jne fetch");

asm("subl $8192, %edi"); //now that the mem is cached, go read it out
asm("loop:");
asm("movl _xy_z(%edi), %eax");
asm("movl %eax, _k");
asm("addl $4, %edi");
asm("cmpl %edx, %edi");
asm("jne loop");

//Next block
asm("addl $8192, %edx");
asm("cmpl %edi, %ebx");
asm("jne mainloop");

asm("popl %edi");
}

The code above took 20 seconds to execute (I tried it a couple of times; the execution time is stable).


for (int i = 0; i < 20000; i++) { //30 s
asm("pushl %edi");
//No Pre-caching
asm("xorl %edi, %edi");
asm("movl $8192, %edx");
asm("movl $1048576, %ebx");

asm("mainloop:");
asm("loop:");
asm("movl _xy_z(%edi), %eax");
asm("movl %eax, _k");
asm("addl $4, %edi");
asm("cmpl %edx, %edi");
asm("jne loop");

//Next block
asm("addl $8192, %edx");
asm("cmpl %edi, %ebx");
asm("jne mainloop");

asm("popl %edi");
}


The code above took 30 seconds to execute. This is without pre-caching.

So if you don't pre-cache, it's slower. Weird; does anybody know why this is so?

Bas

You are confusing a logical, software-level construct with a piece of hardware.
The stack is merely a region of RAM (any kind of RAM) that is pointed to by a dedicated CPU register. You can PUSH and POP values on the stack, and the SP (stack pointer) decreases or increases accordingly.

The cache is a piece of hardware that functions somewhat like RAM and is indeed tied to the CPU core. Its role is to prefetch the code and data around the current point of execution of the current thread in order to make waiting times shorter. As in: the cache is several times faster than regular RAM in access time.

Forcing a cache load is unnecessary these days; good compilers already emit the appropriate cache instructions. If you write ASM the old way (manual pre-cache reads), it is obviously going to run slower than if you had used those instructions. Unless you overdo it, and then, with all that precaching, all hell breaks loose.
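For illustration, this is roughly what an explicit prefetch hint looks like with GCC's __builtin_prefetch (a sketch; the lookahead distance is a guess that you would tune per CPU):

// Sums an array while hinting the cache a few elements ahead.
long sum_with_prefetch(const int* data, int n) {
    long total = 0;
    const int lookahead = 16;                          // elements, not bytes
    for (int i = 0; i < n; i++) {
        if (i + lookahead < n)
            __builtin_prefetch(&data[i + lookahead]);  // a hint; may be ignored
        total += data[i];
    }
    return total;
}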

Quote:
Original post by ttdeath
Forcing a cache load is unnecessary these days; good compilers already emit the appropriate cache instructions. If you write ASM the old way (manual pre-cache reads), it is obviously going to run slower than if you had used those instructions. Unless you overdo it, and then, with all that precaching, all hell breaks loose.


In my previous post I forgot to mention the results of what the compiler (MinGW 3.2.3) came up with.

I compiled this code, then ran it a couple of times.

for (int t = 0; t < 20000; t++) {
    for (int i = 0; i < 512 * 512; i++) {
        k = xy_z[i];
    }
}


Here are the results of 9 separate tests (measuring the time to execute the code):

C++                  37 s
ASM, no pre-caching  30 s
ASM, pre-caching     19 s

C++                  37 s
ASM, no pre-caching  29 s
ASM, pre-caching     19 s

C++                  37 s
ASM, no pre-caching  29 s
ASM, pre-caching     19 s


I think this proves my point. Why doesn't the MinGW compiler come up with proper pre-caching code? And why is the 'normal' asm code still slow; doesn't the processor cache the memory by itself? I've read about cache memory, and I don't understand why the processor doesn't cache it properly on its own.

Are there some compiler/computer gurus who can explain what's the matter?

Bas

Quote:
Original post by basananas
Here are the results of 9 separate tests (measuring the time to execute the code):

C++                  37 s
ASM, no pre-caching  30 s
ASM, pre-caching     19 s

C++                  37 s
ASM, no pre-caching  29 s
ASM, pre-caching     19 s

C++                  37 s
ASM, no pre-caching  29 s
ASM, pre-caching     19 s


I think this proves my point. Why doesn't the MinGW compiler come up with proper pre-caching code? And why is the 'normal' asm code still slow; doesn't the processor cache the memory by itself? I've read about cache memory, and I don't understand why the processor doesn't cache it properly on its own.

Are there some compiler/computer gurus who can explain what's the matter?

Bas


Off hand, I'd say that the difference between the no-pre-caching and the pre-caching code is the time needed to cache the code. I can't speak to why MinGW does what it does or doesn't do.

Was that C++ compiled with optimization turned on?
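With MinGW's g++ that means passing an optimization flag, since no optimization is done by default. Something like this (the file names are placeholders):

g++ -O2 -o memtest memtest.cpp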

Quote:
Original post by LessBread
Off hand, I'd say that the difference between the no-pre-caching and the pre-caching code is the time needed to cache the code.

But the pre-cached code is faster, so I guess it costs some time up front but saves more in the end...

Quote:
I can't speak to why MinGW does what it does or doesn't do.

Was that C++ compiled with optimization turned on?


No, I never used that option. Interestingly, after I turned it on, the C++ was just as fast as the non-pre-cached ASM code (7 s faster than without optimization), but still not quite as fast as the pre-cached code. It remains a mystery to me why that one is so much faster...

The question is: is it fast enough? You might not actually need those extra seconds for your purpose. In any case, you again seem to be confusing what happens where.

Compilers write out code that is later executed by the CPU.
Optimisation occurs at compile time only; precaching occurs at run time only.

Precaching also depends largely on how big the CPU cache is (each CPU has a different cache size). That means only a small amount of the program's data can be accessed in the level 1 cache, only so much in the level 2 cache, and so on. As in: a round hole of width X can only take a round ball of width X. If the round ball (the program) was optimized for a larger hole (a larger cache), then you are in trouble, pushing and pulling the ball in and out of the hole.

Also, as soon as another thread gets hold of the CPU (thousands of times per second), the data in the cache may need to be flushed to allow the currently running thread's data to be accessed.

Round ball, square hole = cache miss. Google this; you'll find it interesting.

Perhaps pre-caching is only turned on when using platform-specific optimizations? I don't know for sure, but the preloading instructions could be AMD-specific.

Quote:
Original post by ttdeath
Compilers write out code that is later executed by the CPU.
Optimisation occurs at compile time only. Precaching occurs at run time only.

OK, but that's just a matter of what you call it, of course. Bottom line: you can optimise a program by adding code that pre-caches the memory at run time.

Quote:
Original post by ttdeath
Precaching also depends largely on how big the CPU cache is (each CPU has a different cache size). That means only a small amount of the program's data can be accessed in the level 1 cache, only so much in the level 2 cache, and so on. As in: a round hole of width X can only take a round ball of width X. If the round ball (the program) was optimized for a larger hole (a larger cache), then you are in trouble, pushing and pulling the ball in and out of the hole.


As far as I know, all modern processors have a level 1 data cache of at least 8 KB, so that's plenty for what I need. I've read this in a document from AMD, but I have the feeling this trick also works on Intel processors.

By the way, if there is a thread switch and the cache is flushed, that's no big problem. Of course it will cost time to refill the cache (which the processor will do by itself), but that would also be the case without pre-caching.

Greetings,
Bas

Quote:
Original post by Ubik
Perhaps pre-caching is turned on (only) when using platform-specific optimizations? I do not know for sure, but the preloading instructions could be AMD-specific.


No preloading instructions are needed, Ubik. This is what AMD says about the technique:


Quote:
By AMD http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
In the final optimization to the memory copy code, the technique called block prefetch is applied. Much as read grouping gave a boost to performance, block prefetch is an extreme extension of this idea. The strategy is to read a large stream of sequential data from main memory into the cache, without any interruptions.

In block prefetch, the MOV instruction is used, rather than the software prefetch instruction. Unlike a prefetch instruction, the MOV instruction cannot be ignored by the CPU. The result is that a series of MOVs will force the memory system to read sequential, back-to-back address blocks, which maximizes memory bandwidth.

And because the processor always loads an entire cache line (e.g. 64 bytes) whenever it accesses main memory, the block prefetch MOV instructions only need to read ONE address per cache line. Reading just one address per cache line is a subtle trick, which is essential in achieving maximum read performance.
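Translated out of assembly, the block prefetch idea comes down to something like this C++ sketch (the 64-byte cache line and 8 KB block sizes are taken from the paper, and it assumes the total size is a multiple of the block size):

volatile int sink;

void read_all(const int* data, int total_ints) {
    const int ints_per_line = 16;     // 64-byte cache line / 4-byte int
    const int ints_per_block = 2048;  // 8192-byte block / 4-byte int
    for (int base = 0; base < total_ints; base += ints_per_block) {
        // Pass 1: touch one int per cache line to pull the block into L1.
        for (int i = base; i < base + ints_per_block; i += ints_per_line)
            sink = data[i];
        // Pass 2: the real reads now hit L1 instead of main memory.
        for (int i = base; i < base + ints_per_block; i++)
            sink = data[i];
    }
}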


Bas

Quote:
Original post by basananas
But the pre-cached code is faster, so I guess it takes some time but it saves more in the end...


I was also getting at the fact that the code becomes cached once it executes. So, if you ran some non-precached code repeatedly, the first execution would be slower than subsequent executions.

ttdeath asks a good question. Is the code fast enough to do what it needs to do? This is the central question of optimization. The general answer is to focus attention on those portions of the code where optimization produces the biggest overall improvements.

Quote:
Original post by basananas
Why doesn't the minGW compiler come up with proper pre-caching code?


Because it's too hard to do, except in the simplest cases. How would you, a human, precache data in the following case (a and b being two cache-sized chunks of data)?

if( Foo( 10 ) ) {
    Frobnicate( a );
} else {
    Frobnicate( b );
}

Would you precache a or b? You can't make a reasonable choice here.

Quote:
Why doesn't the processor cache the memory while compiling the 'normal' asm code properly so that it's still slow? I've read about cache memory, and I don't understand why the processor doesn't cache the memory properly by itself.


First, stalling occurs. Once you start pulling data into the cache, you are limited by the bus frequency, which forces you to wait for the data to become available. By fetching data ahead of time you can avoid stalling, because by the time you start accessing the data, it's already in the cache. But prefetching far enough ahead requires you to predict what data you'll need, which often only becomes obvious once you start reading it.

Second, when you read data in memory, the processor doesn't know if you're going to read the data frequently or not. Consider the following code:

for( x = 0; x < 1000; ++x ) {
    while( Bar( a, x ) ) {
        Foo( a );
    }
    while( Bar( b, x ) ) {
        Foo( b );
    }
}

It's usually a good idea to prefetch "a" on the first run because the cache is empty. However, is it a good idea to prefetch "b"? If the b-loop iterates a lot, then you actually gained time. If it only iterates once, you've kicked a out of the cache for nothing (since you only performed one operation on b anyway), and you need to read it back. A practical example of this behavior: in the middle of many operations on data set "a" by a process, the kernel lets another process update a variable "b" in memory. To the processor, this looks similar to the situation above.

So, basically, the processor and compiler seldom have a clue about what should be cached and what shouldn't. Prefetch operations tell them what is expected to be the optimal behavior, which is why the code runs faster.

Quote:
Original post by LessBread
I was also getting at that the code becomes cached once it executes. So, if you ran some non-precached code repeatedly, the first execution would be slower than subsequent executions.

ttdeath asks a good question. Is the code fast enough to do what it needs to do? This is the central question of optimization. The general answer is to focus attention on those portions of the code where optimization produces the biggest overall improvements.


Yeah, I know. I've become interested in optimization because I've been working on a (sphere-based) 3D engine and it's too slow. Hopefully I can gain some speed with these memory tricks..


ToohrVyk, thanks! That's a very clear answer to those questions. It's nice to hear that learning assembly actually is important for making hotspots faster :)

Greetings,
Bas

Quote:
Original post by basananas
It's nice to hear that learning assembly actually is important for making hotspots faster :)


Actually, no, it's not that important in PC video game development. In PC games the main bottleneck is rendering, which is performed by the 3D accelerator card. Therefore you can fiddle with assembly as much as you want; you will only get minimal improvements, because the time is really spent on rendering, which you can't fiddle with in assembly.

To optimize rendering speed, you have to carefully select what you send to the graphics card, in what order, and with what options. No assembly involved; only rearranging your rendering API calls, sorting your scene, and in general knowing what makes your favourite drivers and cards tick. The kind of difference I'm talking about here is getting 1 additional FPS from two hours of work on the game update code, versus 60 additional FPS from ten minutes spent sorting polygons by material before rendering them.
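To make the sorting example concrete, here is a rough sketch (all the type and function names are made up; the real version depends on your renderer and API):

#include <algorithm>
#include <cstddef>
#include <vector>

struct DrawCall {
    int materialId;  // shader/texture state this call needs
    int meshId;
};

bool byMaterial(const DrawCall& a, const DrawCall& b) {
    return a.materialId < b.materialId;
}

// Sorting by material means each expensive state change happens once per
// material instead of potentially once per draw call.
void renderSorted(std::vector<DrawCall>& calls) {
    std::sort(calls.begin(), calls.end(), byMaterial);
    int current = -1;
    for (std::size_t i = 0; i < calls.size(); i++) {
        if (calls[i].materialId != current) {
            // bindMaterial(calls[i].materialId);  // hypothetical API call
            current = calls[i].materialId;
        }
        // drawMesh(calls[i].meshId);              // hypothetical API call
    }
}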

With the scale of today's hardware, I'd even say it's not useful to optimize at low levels anymore on the (non-gaming) PC platform, unless you're really desperate for speed. Back in the day, optimization meant keeping the game map and textures in cache, and it was hard because the cache was small. Today, for any practical purpose where execution speed is processor- and memory-bound, the scales involved are larger: the data sets are bigger (gigabytes of data are needed to cause real problems) and the resources are bigger (a company work server can keep those gigabytes in RAM; it's cheaper to buy more RAM than to pay for the optimization time). Most of the tweaking today comes from large-scale optimizations, the kind that, in one strike and an hour's worth of coding, makes half your code twice as fast.

The only area where assembly is still useful today is handhelds: these things have no graphics card to render the game on, so it's back to basics and you actually have to render things yourself. Much precaching fun ensues as you try to optimize your pixel-rendering function.

Quote:
Original post by ToohrVyk
It's usually a good idea to prefetch "a" on the first run because the cache is empty. However, is it a good idea to prefetch "b"? If the b-loop iterates a lot, then you actually gained time. If it only iterates once, you've kicked a out of the cache for nothing (since you only performed one operation on b anyway), and you need to read it back. A practical example of this behavior: in the middle of many operations on data set "a" by a process, the kernel lets another process update a variable "b" in memory. To the processor, this looks similar to the situation above.

So, basically, the processor and compiler seldom have a clue about what should be cached and what shouldn't. Prefetch operations tell them what is expected to be the optimal behavior, which is why the code runs faster.


Uhhh... I could be wrong, but aren't you confusing the cache and registers with regard to 'kicking a out of cache'? I don't think there is a PC CPU in use today that has only 4 bytes of cache space... In a simple case like this, I'd just precache both variables. Of course, the speed gain would be negligible for only two variables, but you get the point? Again, I could be wrong, or maybe I misunderstood the whole bit about the cache in my assembly book.

Quote:
Original post by basananas
Yeah I know. I've become interested in optimization as I've been working on a (sphere) 3d engine and it's too slow. Hopefully I can gain some speed with these memory tricks..


It's the standard response. A good reply, however, is that to learn how to optimize you have to start somewhere. Have you checked out Agner Fog's book?

Quote:
Original post by LessBread
Have you checked out Agner Fog's book?


That's a fantastic book, LessBread! I especially like the part about increasing the speed of C++ code; we can all learn from that.

Bas
