[c++, asm] A slow stack?

Started by
27 comments, last by ttdeath 18 years, 7 months ago
Quote:Original post by LessBread
I was also getting at that the code becomes cached once it executes. So, if you ran some non-precached code repeatedly, the first execution would be slower than subsequent executions.

ttdeath asks a good question. Is the code fast enough to do what it needs to do? This is the central question of optimization. The general answer is to focus attention on those portions of the code where optimization produces the biggest overall improvements.


Yeah, I know. I've become interested in optimization as I've been working on a (sphere) 3D engine and it's too slow. Hopefully I can gain some speed with these memory tricks...


ToohrVyk, thanks! That's a very clear explanation of those questions. It's nice to hear that learning assembly actually is important for making hotspots faster :)

Greetings,
Bas
Quote:Original post by basananas
It's nice to hear that learning assembly actually is important for making hotspots faster :)


Actually, no, it's not important in PC video game development. In PC games, the main bottleneck is rendering, and is performed by the 3D accelerator card. Therefore, you can fiddle with assembly as much as you want; you will only get minimal improvements, because the time is really spent on rendering, which you can't fiddle with in assembly.

To optimize rendering speed, you have to carefully select what you send to the graphics card, and in what order, and with what options. No assembly involved - only rearranging your rendering API calls, sorting your scene, and in general knowing what makes your favourite drivers and cards tick. The kind of difference I'm talking about here is getting 1 additional FPS from two hours working on the game update code, and 60 additional FPS after spending 10 minutes of coding on sorting polygons by material before rendering them.

With the scale of today's hardware, I'd even say it's not useful to optimize at low levels anymore on the (non-gaming) PC platform, unless you're really dying for speed. Back in the day, optimization meant keeping the game map and textures in cache, and it was hard because the cache was small. Today, for any practical purpose where execution speed is processor- and memory-bound, the scales involved are larger: the data sets are larger (gigabytes of data are necessary to cause any real problems) and the resources are larger (a company work server can keep those gigabytes in RAM: it's cheaper to buy more RAM than to pay for the optimization time). Most of the tweaking today comes from large-scale optimizations, the kind that, in one strike and an hour's worth of coding, makes half your code twice as fast.

The only area where assembly is still useful today is handhelds: these things have no graphics card to render the game on, so it's back to basics and you actually have to render things yourself. Much precaching fun ensues as you try to optimize your pixel-rendering function.
Quote:Original post by ToohrVyk
It's usually a good idea to prefetch "a" on the first run because the cache is empty. However, is it a good idea to prefetch "b"? If the b-loop iterates a lot, then you actually gained time. If it only iterates once, you've kicked a out of cache for nothing (since you only performed one operation on b anyway), and you need to read it back. A practical example of this behavior: in the middle of many operations on data set "a" by a process, the kernel lets another process update a variable "b" in memory. To the processor, this is similar to the situation above.

So, basically, the processor and compiler seldom have a clue about what should be cached and what shouldn't. Prefetch operations tell them what is expected to be the optimal behavior, which is why the code runs faster.


Uhhh... I could be wrong, but aren't you confusing the cache and registers with regard to 'kicking a out of cache'? I don't think there is a PC CPU in use today that has only 4 bytes of cache space... In a simple case like this, I'd just precache both variables. Of course, the speed gain would be negligible for only 2 variables, but you get the point? Again, I could be wrong, or maybe I misunderstood the whole bit about the cache in my assembly book.
Free speech for the living, dead men tell no tales. Your laughing finger will never point again... Omerta! Sing for me now!
Quote:Original post by basananas
Yeah I know. I've become interested in optimization as I've been working on a (sphere) 3d engine and it's too slow. Hopefully I can gain some speed with these memory tricks..


It's the standard response. A good reply, however, is that to learn how to optimize you have to start somewhere. Have you checked out Agner Fog's book?
"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man
Quote:Original post by LessBread
Have you checked out Agner Fog's book?


That's a fantastic book, LessBread! I especially like the part about increasing the speed of C++ code; we can all learn from that.

Bas
8 KB might not be enough; the cache is not linear and also has to contain the executable code around the point of execution.

Optimisation-wise, I think the C++ code level is more than enough for 99% of things. Only the inner loop of a game engine may need hand optimisation, or, as said, the mobile game world.

Try this:

int t = 20000;
int i = 512 * 512; // 262144
int k;
while (t--)
{
    i = 262144;
    while (i)
    {
        k = ptr[--i]; // pre-decrement, so the first read is ptr[262143] rather than one past the end
        k = ptr[--i];
        k = ptr[--i];
        k = ptr[--i];
        // or 8x the line, or 16, or 2, .. you get the picture
    }
}


Tell me if this works for you.

Oh yeah, and for the Foo a/b thingie. That sort of dilemma can be solved via two-pass compilation, but that is extremely uncommon. Two-pass compilation goes like this: 1. Compile with speed optimisation at global scope, then run the executable with inlined profiling code. This finds out whether branch A has a higher probability of being run than branch B. 2. The compiler gets another pass at the code, with the profiling data as additional input. The second version has much better results, since it takes into account more of the code's peculiarities.

[Edited by - ttdeath on August 31, 2005 5:55:05 AM]
Quote:Actually, no, it's not important in PC video game development. In PC games, the main bottleneck is rendering, and is performed by the 3D accelerator card. Therefore, you can fiddle with assembly as much as you want; you will only get minimal improvements, because the time is really spent on rendering, which you can't fiddle with in assembly.


The bottleneck might be rendering on the PC, but that doesn't prevent some engines from being CPU-bound!
If you know what you do, a bit of hand-optimization can still be profitable.
Quote:Original post by ttdeath
Try this:

*** Source Snippet Removed ***

Tell me if this works for you.


It works, but it's slow... very slow. I think this is caused by reading in the wrong direction (the processor expects you to read in the positive direction). Changing the direction to positive increased the speed somewhat, but it was still not faster than:

for (int t = 0; t < 20000; t++) {
    for (int i = 0; i < 512 * 512; i++) {
        k = xy_z[i];
    }
}
Quote:Original post by basananas
It works, but it's slow... very slow. I think this is caused by reading in the wrong direction (the processor expects you to read in the positive direction). Changing the direction to positive increased the speed somewhat, but it was still not faster than:


Very strange; what I showed you is called loop unrolling, and it's the standard way to take advantage of multiple pipelines on a CPU.

Perhaps your compiler is really old, or bad at this.
About the order, though, you may be right.

