Nicholas Kong

How to write fast code without optimizing


Is fast code good or bad?

 

How to write fast code without optimizing?

 

I know I am not supposed to optimize, so I want to see if I can still write fast code.

 

Edit: Seriously, why the anonymous downvote...?



Just try to access your data in a linear fashion, which is cache friendly; for example, when reading or writing an array.

Also, most optimizations I've done were just a matter of rethinking some algorithm and rewriting it in a more efficient manner.



Fast code comes first; optimization comes with repetition.

After solving the same problem a good number of times, I can end up with code that is both fast and optimized. You can't reach that point overnight.

Some techniques I needed to rewrite from scratch five or ten times before grasping the best way to do them. With some problems, it was only after revisiting them a year later that I got the idea behind an optimization technique.

So unless you have a good deal of time, don't try to optimize your code; just finish it.


Your code should be

* clear
* correct
* fast

... in that order. Some people would place 'correct' first, but I would rather have code that I understand, even if it doesn't cover every conceivable case, as long as it is clear enough that I know immediately what the not-covered cases are.

 

Don't write stupid code that is needlessly slow, but don't obsess about performance until it's clear that the program is not fast enough. In that case, use your profiler to focus your efforts where you should.

What makes code stupid? Unreadable code, you mean?



 

"What makes code stupid? Unreadable code, you mean?"

Stupid pseudocode:

For x = 0 to image_width
  For y = 0 to image_height
    Pixel(x, y) = 0

Better version:

For y = 0 to image_height
  For x = 0 to image_width
    Pixel(x, y) = 0

Do you understand the difference between the two?


 

 

"Do you understand the difference between the two?"

 

I don't know, but Bregma above me seems to know.


[quote name="Bregma" post="5118223" timestamp="1387485849"]
One is very very fast in Fortran. The other is very fast in C.[/quote]
I wasn't aware that Fortran swizzled image data in memory ;-) Perhaps you were thinking of two dimensional arrays? lol
[quote name="warnexus" post="5118229" timestamp="1387486785"]
I don't know but Bregma above me seems to know.[/quote]
The second version accesses pixels that are stored consecutively in memory (across the image) , which means it is cache friendly.

The first version, however, accesses pixels that are above each other in the image, but a long way apart in memory, making it very unfriendly to the cache.

I just did a really quick test, and a C debug version processed an 8192*8192 image 6 times faster using the second version! Which makes the first version 'stupid code'.

Share this post


Link to post
Share on other sites

mark ds: "I wasn't aware that Fortran swizzled image data in memory ;-) Perhaps you were thinking of two-dimensional arrays?"

His point was that you're assuming that your pixels are stored in row-major order in memory (and checking whether warnexus would make the same assumption and then think about the cache-coherency implications).
C uses the row-major convention for arrays and Fortran uses the column-major convention. If Fortran were still more popular than C, then maybe our image formats would store their pixels in column-major order? I thought it was a good joke, anyway.

warnexus: "Is fast code good or bad? How to write fast code without optimizing? I know I am not supposed to optimize, so I want to see if I can still write fast code."

Fast code is good. If code has to become unreadable/unmaintainable/overly-complex in order to become fast, then that's a trade-off, because those things are bad.
 
To write fast code, you need to know as many details about the computer architecture as possible (e.g. how the CPU works), and know exactly what your code is doing (e.g. what low-level instructions your high-level code will translate to). Then, you are able to think about performance implications while writing it... although this is basically optimizing the code mentally as it comes out of your fingers...
 
You're not supposed to be optimizing your code if you don't need to (it's already fast enough for your purposes), if you have better things to spend your time on, or if the optimizations you're going to use have more cons than pros (e.g. if they will leave the code so unreadable that no one will ever be able to understand it again).
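A minimal C++ sketch (illustrative only) of the row-major point above: the rightmost index walks consecutive addresses, which is why it belongs in the inner loop.

#include <cstdio>

int main() {
    int a[2][3] = { {1, 2, 3}, {4, 5, 6} };
    // Row-major: a[0][0], a[0][1], a[0][2], a[1][0], ... sit at consecutive
    // addresses, so varying the rightmost index touches adjacent memory.
    const int* p = &a[0][0];
    for (int i = 0; i < 2 * 3; ++i)
        std::printf("%d ", p[i]);   // prints: 1 2 3 4 5 6
    std::printf("\n");
    return 0;
}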


Regarding mark ds's post, I am getting a bit of dissonance here, like I did with the 'squaring terms' vs 'squaring sides' bit a few weeks ago...

 

 

 

#include <windows.h>
#include <tchar.h>
#include <iostream>

const int elems = 3000;
int grid[elems][elems];

int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER startqpc, stopqpc;

    ::QueryPerformanceFrequency(&startqpc);
    double cpufreq = double(startqpc.QuadPart);

    // Pass 1: the inner index j is the rightmost subscript, so the writes
    // walk consecutive addresses (row-major friendly).
    ::QueryPerformanceCounter(&startqpc);
    for (int i = 0; i < elems; ++i)
        for (int j = 0; j < elems; ++j) {
            grid[i][j] = 0;
        }
    ::QueryPerformanceCounter(&stopqpc);

    std::cout << "Time taken: " << (stopqpc.QuadPart - startqpc.QuadPart) / cpufreq << std::endl;

    // Pass 2: subscripts swapped, so consecutive writes land
    // elems * sizeof(int) bytes apart.
    ::QueryPerformanceCounter(&startqpc);
    for (int i = 0; i < elems; ++i)
        for (int j = 0; j < elems; ++j) {
            grid[j][i] = 0;
        }
    ::QueryPerformanceCounter(&stopqpc);

    std::cout << "Time taken: " << (stopqpc.QuadPart - startqpc.QuadPart) / cpufreq << std::endl;

    std::cin.ignore();

    return 0;
}

Output:

 

Time taken: 0.00935299

Time taken: 0.0843954

 

which would seem to contradict mark ds. 

 

But, I have a smallish hat:collar ratio so I'm hedging. What is correct here?

 

 


Many people have already given you some valuable advice. Cache friendliness is one thing that is reasonably simple yet can make a large difference. Another thing that can make a huge difference is when and how you allocate your memory. That is especially true if you use a garbage-collected language such as C# or Java. In C#, know when to use structs vs. classes, because it can really make a difference when used right (or wrong).
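As a small C++ illustration of the allocation point (the names here are made up; in C# the analogous win comes from structs and pre-sized collections): reusing a preallocated buffer avoids repeated heap traffic.

#include <vector>

// Hypothetical particle type, purely for illustration.
struct Particle { float x, y, vx, vy; };

// Allocation-heavy: the vector grows (and may reallocate) on every call.
std::vector<Particle> spawnNaive(int n) {
    std::vector<Particle> v;
    for (int i = 0; i < n; ++i)
        v.push_back(Particle{0, 0, 0, 0});
    return v;
}

// Friendlier: reserve once and reuse the same buffer across frames,
// so steady-state calls do no heap allocation at all.
void spawnReused(std::vector<Particle>& v, int n) {
    v.clear();      // keeps the existing capacity
    v.reserve(n);   // at most one allocation, then free from here on
    for (int i = 0; i < n; ++i)
        v.push_back(Particle{0, 0, 0, 0});
}

The second form matters most in per-frame code, where a "hidden" allocation every frame can dominate the actual work.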


 

Hodgman: "To write fast code, you need to know as many details about the computer architecture as possible, and know exactly what your code is doing... although this is basically optimizing the code mentally as it comes out of your fingers..."

 

 

 

this has been a recent issue, especially in my fiddling with all the video stuff.

 

it needs to be fast to avoid interfering with the framerate, but naively written image-processing and manipulation code can easily become fairly slow.

 

more so, one may run into other, more subtle issues: for example, you don't want to make multiple passes over an image or largish buffer if it can be avoided.

the result then is a bunch of hairy, scary-complicated code to do things like decode video frames all in a single pass, with multiple decode routes depending on which output format is being targeted (ex: RGBA, BGRA, UYVY, DXTC, ...).

 

so, yeah, it is a tradeoff.

if an entire codebase were written in performance-centric code though, this would just be scary.

 

 

the other side of this, though, is people writing dead-slow code with a "computers keep getting faster, so why care?" attitude.

the code is often simple, but can waste huge amounts of clock cycles doing almost nothing.

 

so, there are several levels of optimization:

1. dead slow, such as invoking an RDBMS or XSLT or similar every time the user clicks something (trivial operations can potentially take seconds);

2. moderately slow, such as casually using dynamic memory allocations and run-time type-checks;

3. moderately fast, namely going more directly "from point A to point B";

4. extra fast, where the code starts "growing hair" in an attempt to get it faster (micro-optimizations start popping up);

5. faster still, where things start getting outgrowths of ASM and/or dynamically generated machine code.

 

most of my code tends to be in groups 2 or 3, with most of my renderer and similar in group 3, a lot of my video stuff in 4, and a lot of my script VM stuff in 4 and 5.

 

usually, it depends on what is going on and how much it can impact performance.

 

 

code in groups 4 and 5 generally evokes a lot of cries of "optimization is evil"; it is sometimes necessary, but probably shouldn't be done as a general development practice.

 

the main obvious difference:

usually group 3 code is fast but small, whereas group 4 code tends to become massively larger in an attempt to squeeze more speed out of the problem.

group 5's most obvious feature is that it usually comes with a mountain of #ifdef's and inline ASM and similar.



Yeah, it's not that optimization is evil; it's that premature optimization is evil. But even then, it's the amount of time spent, and the amount of clarity lost, when prematurely optimizing that determines whether or not it's worthwhile.

 

If you optimize code and in the process the whole thing becomes cleaner and clearer? That's not premature optimization, unless perhaps you took an extra week to finish something that was supposed to take three days, you're on a schedule, and the code isn't some vital pathway that needs to be fast.

 

An example of what could be considered premature optimization that I always end up doing: when comparing lengths, I compare squared lengths if I can get away with it. Sure, the code may not be hit all that often, but it's a rather simple optimization, and one I've done so often that it's very clear what is meant when I see something like "radius*radius < distSqr" in code somewhere.
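A minimal sketch of that idiom in C++ (the names are hypothetical), assuming 2D float vectors and a non-negative radius:

#include <cmath>

struct Vec2 { float x, y; };

// Naive version: the sqrt is comparatively expensive and unnecessary
// when all we need is a comparison.
bool inRadiusSlow(Vec2 a, Vec2 b, float radius) {
    float dx = a.x - b.x, dy = a.y - b.y;
    return std::sqrt(dx * dx + dy * dy) < radius;
}

// Squared-length version: compare against radius*radius instead.
// (Assumes radius >= 0, since squaring discards the sign.)
bool inRadius(Vec2 a, Vec2 b, float radius) {
    float dx = a.x - b.x, dy = a.y - b.y;
    float distSqr = dx * dx + dy * dy;
    return distSqr < radius * radius;
}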

 

Just because you don't want to prematurely over-optimize doesn't mean you should turn your brain off while coding.


 

mark ds: "The second version accesses pixels that are stored consecutively in memory (across the image), which means it is cache friendly. The first version, however, accesses pixels that are above each other in the image but a long way apart in memory, making it very unfriendly to the cache."

 

So the second version would be moving from left to right, row by row, and the first would be going down the image column by column? I've seen the second version somewhere in an open-source project. It is nice to know it is cache friendly.


Yeah, those loops make sense: to go through a 2D array indexed [i][j], you want the rightmost index in the innermost loop. Similarly, pictures tend to be stored [height][width].

Whoa, really? You mean saved like this?

image = new Image(x, y, height, width);
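For clarity, "[height][width]" refers to the memory layout rather than the constructor's argument order. A minimal, purely illustrative sketch (this Image type is made up, not a real API):

#include <vector>

// Row-major storage with y (height) as the outer dimension,
// so the pixels of one row sit next to each other in memory.
struct Image {
    int width, height;
    std::vector<unsigned char> data;                 // height * width bytes
    Image(int w, int h) : width(w), height(h), data(w * h) {}
    unsigned char& pixel(int x, int y) { return data[y * width + x]; }
};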


DvDmanDT: "Another thing that can make a huge difference is when and how you allocate your memory... In C#, know when to use structs vs. classes."

Is how cache friendly the code is also based on what data structures I use? I guess the better question would be: why does the cache take a hit from the code? Is every piece of memory in the computer only available through the cache?



Imagine you have a CD and you want to listen to every single sound wave. Either you can listen to the whole thing from start to end, or you can jump around randomly and listen to one second at a time until you've heard them all. From start to end would take you something like 78 minutes (the length of the CD), while the other way might take you days because of the time it takes to seek/jump. While CPU caches work very, very differently, the effect is pretty much the same. If we continue with the CD analogy, memory looks like a billion tracks (each track represents a 64[1]-byte cache line). It takes some time to "find" a track, but once you're there, you can access it very quickly. This means you should try to process one track at a time, and finish working with each track before moving on. Moving on to the next track is also usually faster than jumping to a random track, due to prefetching and what-not.

 

In C#, an array of structs will place all the data as sequential "tracks", while an array of classes [2] creates one track containing just a list of pointers to the tracks with the actual data, and those tracks might be anywhere, meaning you will probably need to jump around a lot to access that data.

 

Another aspect of cache friendliness comes up when multithreading. NEVER make one thread write to a "track" (cache line) that another thread might be reading/writing/accessing. Apart from the fact that you might get weird/incorrect results if you're not careful, you end up with cache invalidations and god-knows-what kind of slowdowns.

 

[1] Can vary greatly between CPUs, but on my Core i7 it's 64 bytes.

[2] Array of instances of structs vs array of instances of classes.
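A hedged C++ sketch of the multithreading point: two counters sharing one cache line versus counters padded onto separate lines. The 64-byte line size is the same assumption as in [1] above, and the exact slowdown varies by CPU.

#include <atomic>
#include <thread>

// Two counters in the same cache line: concurrent writes from different
// threads keep invalidating each other's copy of the line (false sharing).
struct Shared { std::atomic<long> a{0}; std::atomic<long> b{0}; };

// Each counter padded out to its own 64-byte line.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) ++c.a; });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) ++c.b; });
    t1.join();
    t2.join();
}

int main() {
    Shared s; hammer(s);   // slow: both counters live in one cache line
    Padded p; hammer(p);   // typically several times faster
    return 0;
}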



Your question seems to imply that you want to write code that is fast/efficient the first time around, without having to rewrite it later due to slow performance. I find that this rarely happens in practice, although you can definitely design systems to be more efficient up front, or design them to be easier to optimize later. But this is incredibly difficult to do. It requires a lot of experience in algorithms, computer architecture, the problem domain, and just plain programming to know how to write code that is "fast enough" for your purposes in such a way that you won't have to come back to it later.

 

I've been programming pretty seriously for about two or three years now, in addition to four years of schooling in a reasonably good CS program. I would say I have just enough experience now to sort of see where I can afford to "be lazy" with my programming, since I just know it won't make any perceptible difference to the user, and where I should definitely think more about the problem before embarking on a coding frenzy, since performance might be a concern.

 

It's really difficult (in my opinion) to get a good feel for this other than to just write a lot of code, profile, inspect, and really get to know your code and its relationship to performance. You eventually get to a point where you can just think about a problem and say, "Yeah, that problem size is so small it won't matter," or "That algorithm might be a little slow, but because I run it only once a frame on a small number of items, it's not a big deal." If you really know your algorithms and the problem domains, you will be able to easily identify huge red performance flags, like "Oh wow, I definitely don't want to do this A* search in this loop since the graph is thousands of nodes," and therefore spend more time and energy designing for good performance in those cases instead.


warnexus: "Why does the cache take a hit from the code? Is every piece of memory in the computer only available through the cache?"


For the CPU, fetching data from memory is a relatively expensive operation. So as an optimization, when it needs to fetch something from a specific memory address, it grabs whatever is at some of the neighboring addresses as well and stores it temporarily on the CPU (in the cache). As a result, if the next instruction manipulates data already in the cache, the expensive fetch from memory is avoided and cycles are saved.

Your code affects cache performance through the data structures and algorithms you use. For example, a typical linked list implementation will allocate a node for every item you add to the list. Unless a specialized allocator is used, there are no guarantees that nodes will be at neighboring memory addresses. Most likely, they will be scattered all over the place. Iterating a list of values (not pointers) will, in the worst case, cause one memory fetch (a "cache miss") for each iteration: none of the nodes will be in the cache, because they are scattered all over the memory space. Compare that to an array of values, where the items are sequential in memory, resulting in "cache locality". You'll do a fetch for the first item, and the CPU will pull more data along with it to store in the cache. When you ask for the second item, an expensive memory fetch is avoided because the item is already in the cache (a "cache hit").

Iterating a list or array of pointers will also cause more cache misses if you do any operations on those pointers. For example, in an array of pointers, each pointer is sequential in memory, but the objects they point to may be all over the place, requiring more memory fetches.

The size of your data objects has an impact, too. More small objects fit in the cache than large ones, so the larger your object size, the more cache misses you'll have.

For more info, some keywords you might want to search for are "cache friendly programming", "cache misses" or "cache locality".
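A rough C++ sketch of how one might see the list-versus-array effect firsthand. Numbers vary by machine and compiler settings, and a freshly built list may happen to be laid out mostly contiguously, so real-world, long-lived lists are often even worse than this shows:

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const int N = 10000000;
    std::vector<int> vec(N, 1);                 // contiguous values
    std::list<int> lst(vec.begin(), vec.end()); // one heap node per value

    auto t0 = std::chrono::steady_clock::now();
    long long vsum = std::accumulate(vec.begin(), vec.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();
    long long lsum = std::accumulate(lst.begin(), lst.end(), 0LL);
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "vector sum " << vsum << ": "
              << std::chrono::duration<double>(t1 - t0).count() << "s\n"
              << "list sum   " << lsum << ": "
              << std::chrono::duration<double>(t2 - t1).count() << "s\n";
    return 0;
}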


