Optimisation Success Stories


I was just wondering if anyone wanted to share stories of optimisation where they achieved incredible speed gains, memory reductions, or some other huge measurable benefit. Hopefully we can all learn some lessons from each other's successes (or prior mistakes, if you want to look at it that way).

I just today finished optimising a duplicate-finding function. I had previously used the brute-force O(n²) method of comparing every item (a 16-character string) against every subsequent item. My initial thought was that this would be okay, but it turned out not to scale well up to the maximum size of 6144 entries: it took several minutes on my P4 2.8GHz, and over 20 minutes on a dual P3 450. I have now changed it to use an O(n) hash table with separate chaining. Now, even on a 160MHz Pentium 1, it takes less than 7 seconds!

This is one of many stories I could share, and it just happens to be one where I wrote the original code. Care to share yours, anyone?
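For anyone curious, the hashing approach looks roughly like this. A minimal sketch using std::unordered_set (typical implementations are separate-chaining hash tables under the hood) rather than the original hand-rolled table; names are illustrative:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// One pass, expected O(n), instead of comparing every pair.
std::vector<size_t> findDuplicates(const std::vector<std::string>& entries)
{
    std::unordered_set<std::string> seen;
    seen.reserve(entries.size());             // avoid rehashing mid-scan
    std::vector<size_t> duplicates;
    for (size_t i = 0; i < entries.size(); ++i)
        if (!seen.insert(entries[i]).second)  // insert fails: already present
            duplicates.push_back(i);
    return duplicates;
}
```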

Well, this was a bit more than a year back, when I was working on my first game. Since I was still very new to programming and graphics, I decided that my game would not use all the bells and whistles of newer graphics cards. I set my target machine pretty low:

PII 450MHz processor
TNT with 16MB VRAM

It was a 3D RPG, and everything was going well until I added collision detection. Collision was by far the most difficult part of the game for me, because maths is not my strong point. At that time I was testing the character's position against all the triangles in the scene. All the scenes in the game were very small (<20k polys), yet the game ran at only ~25fps on the target machine. My aim was 60fps, which was no easy feat for a first game with no experience. Looking at Half-Life/CS, I could get up to 80fps at a similar resolution, and they looked far better than my game, so I knew my graphics card was capable of much more. Obviously the limiting factor was my game code, and 90% of the processing was collision, the rest scripts.

I modified my loading sequence to calculate bounding spheres, then started checking in-game objects for their distance from the character. If an object was not within the character's own sphere, its polygons were not checked for collision. While simple, this almost doubled my performance on 700-800MHz PCs.
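In outline, the broad-phase test is just a squared-distance comparison. A minimal sketch with illustrative names (the original engine's structures are unknown):

```cpp
struct Vec3 { float x, y, z; };

struct Object {
    Vec3  centre;   // bounding-sphere centre, computed at load time
    float radius;   // bounding-sphere radius
    // ... triangle data, only touched if the spheres overlap ...
};

// Broad phase: if this fails, skip the per-triangle tests entirely.
bool spheresOverlap(const Object& a, const Object& b)
{
    float dx = a.centre.x - b.centre.x;
    float dy = a.centre.y - b.centre.y;
    float dz = a.centre.z - b.centre.z;
    float r  = a.radius + b.radius;
    // Squared distances: no sqrt needed for a yes/no answer.
    return dx * dx + dy * dy + dz * dz <= r * r;
}
```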

I had to write a file-processing program in MASM for a class. For three days I tried to optimize my program because, even though I thought it was well written, it was taking 36000 steps to do a task, and 30000 was the cutoff for a C! I slept hard the night before class and the answer came to me while I was asleep. I got up, rearranged the code in about 20 minutes, and turned in a program that got the lowest step count ever recorded for that assignment!

What the optimization boiled down to was using the position of the instruction pointer as part of the program's state. It was one of those one-time dirty, dirty hacks - thoroughly non-portable code, really not useful for anything other than the task at hand. I really look forward to using more evolutionary programming, so that a computer could come up with this stuff and my dreams could be a little more restful. :)

I was playing with Aramini's bicubic spline resampling algorithm. My goal was to replace the highly efficient bilinear resampling in our (professional) software. The first implementation of Aramini's algorithm was written very quickly, and a base-4 image (read: 1500x1000) needed something like 60s to be resampled to a 2048x1366 image. Of course, this was not very efficient: the bilinear algorithm took less than 600ms (100x faster) to do the same job. But the resulting image was so good! No aliasing, nearly no blur, and so on. So we (a colleague of mine, my boss, and myself) decided to optimise it.

The first step was to use fixed-point math instead of floating-point math. Result: 15s. A good start.

Then we applied the optimisation explained in Aramini's article (at the time I found the article on the web, this optimisation was not documented, or it was missing from the version we had). Needless to say, using lookup tables to find our coefficient values was a very big win. Result: 1s!
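For the curious, the trick is that the cubic weights depend only on the fractional part of the source coordinate, so they can be tabulated once at a fixed subpixel resolution. A sketch of such a table, using the Catmull-Rom kernel as a stand-in (Aramini's actual kernel and table layout may well differ):

```cpp
#include <cmath>

const int SUBPIXEL_STEPS = 256;            // table granularity
static float weightTable[SUBPIXEL_STEPS][4];

// Catmull-Rom cubic (a = -0.5), defined for |x| in [0, 2).
float cubicWeight(float x)
{
    x = std::fabs(x);
    if (x < 1.0f) return 1.5f * x * x * x - 2.5f * x * x + 1.0f;
    if (x < 2.0f) return -0.5f * x * x * x + 2.5f * x * x - 4.0f * x + 2.0f;
    return 0.0f;
}

// Precompute the 4 tap weights for every quantised fractional position,
// so the inner resampling loop is table lookups and multiply-adds only.
void buildWeightTable()
{
    for (int i = 0; i < SUBPIXEL_STEPS; ++i) {
        float t = i / float(SUBPIXEL_STEPS);   // fractional position in [0,1)
        for (int k = 0; k < 4; ++k)            // taps at offsets -1, 0, 1, 2
            weightTable[i][k] = cubicWeight(t - (k - 1));
    }
}
```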

This was still not sufficient: due to our particular hardware environment, we had to process our photos in less than 750ms.

So we did another pass: we restructured the code to remove all tests from the inner loop. We made one specialised version of the algorithm that runs on the interior of the image, and another (slower) one that processes the boundary. Bingo: a very good 650ms! Fast enough to be used in our application (at the time, it was even faster than the Intel bicubic resampling on our architecture).
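Structurally, the split looks something like this. The sampling kernels below are hypothetical stand-ins for the real bicubic taps; only the loop shape is the point:

```cpp
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<float> data;
    float& at(int x, int y) { return data[y * width + x]; }
};

float sampleFast(const Image& src, int x, int y);    // no bounds checks
float sampleClamped(const Image& src, int x, int y); // clamps at the edges

void resample(const Image& src, Image& dst)
{
    const int b = 2;  // border width: the kernel's maximum reach

    // Hot path: interior pixels, guaranteed in bounds, zero tests.
    for (int y = b; y < dst.height - b; ++y)
        for (int x = b; x < dst.width - b; ++x)
            dst.at(x, y) = sampleFast(src, x, y);

    // Cold path: the thin border strips, with clamping. (Real code would
    // iterate only the four strips rather than re-test every pixel.)
    for (int y = 0; y < dst.height; ++y)
        for (int x = 0; x < dst.width; ++x)
            if (y < b || y >= dst.height - b || x < b || x >= dst.width - b)
                dst.at(x, y) = sampleClamped(src, x, y);
}
```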

Since it was already very late (no life... code junkies :'( too bad), we decided to push it to its limit. We made micro-optimisations here and there, modified tiny things, and so on.

And we won again: this time our bicubic algorithm was even faster than our bilinear one! It took only 450ms to resample the source image.

Now, the good part: the hardware we were using was a PII 400 with 256 MiB of RAM. On a P4, the bicubic resampling takes 300ms to resize a 1968x1312 24bpp image to 2560x1706 :). Better still: the algorithm is O(n²) where n is the final image dimension (meaning: if the destination size is w*h, the algorithm loops only w*h times), so we can use it to resample small images at interactive rates :)

Regards,

(PS: no - the code has been sold to our customer, so I cannot give it to you. :'( )

When building a kd-tree for spatial subdivision, the really expensive part is determining the optimal split-plane position. The problem boils down to: given the AABB of primitive X, count all of the other primitives on each side of (and intersecting) the six planes of X.

I came up with a really sweet piece of SSE that does this computation - all the tests and accumulator increments - in just 22 cycles per primitive for all six planes!
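Not the original 22-cycle kernel, but the shape of the idea is easy to show: with the AABB extents stored structure-of-arrays, four primitives can be classified against a plane at once, and a movemask harvests the results. A hedged sketch:

```cpp
#include <xmmintrin.h>   // SSE intrinsics

// For one candidate plane: count primitives entirely on the left
// (max <= plane) and entirely on the right (min >= plane). minB/maxB are
// the AABB extents along the plane's axis, stored as 16-byte-aligned
// float arrays (SoA layout), with n a multiple of 4.
void countSides(const float* minB, const float* maxB, int n,
                float planePos, int& leftCount, int& rightCount)
{
    static const int pop4[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    const __m128 plane = _mm_set1_ps(planePos);
    int left = 0, right = 0;
    for (int i = 0; i < n; i += 4) {             // 4 primitives per iteration
        __m128 lo = _mm_load_ps(minB + i);
        __m128 hi = _mm_load_ps(maxB + i);
        left  += pop4[_mm_movemask_ps(_mm_cmple_ps(hi, plane))];
        right += pop4[_mm_movemask_ps(_mm_cmpge_ps(lo, plane))];
    }
    leftCount = left;
    rightCount = right;
}
```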

I also wrote a ray-triangle intersection routine that runs branchless in 57 cycles with no precomputed data. The real reward isn't the cycle count so much as seeing CodeAnalyst spit out a pipeline simulation with nothing but green :)

And to PTymN: I'm totally in tune with that 'sleeping' strategy. It's amazing how well it works - think really hard about a problem when you go to bed, and you'll wake up in the morning with the magical solution. The shower is also a good problem-solver, but nothing beats the pillow for incredible creativity.

I made changes to my clouds renderer to reduce the number of samples taken of the cloud function (the function is user-defined, and evaluating it is the most expensive operation for sufficiently realistic clouds).
Result: at least a 10x speed increase. (Even before that optimisation, my clouds were several times faster than what was there before me... actually the old version was unusably slow, in the sense that nobody really used it.)

Another one: efficient software frustum culling for blocks of stars in The Galaxy. I was able to increase the number of stars around the camera by about 5-10 times.
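The usual way to cull a whole block at once is to test its bounding sphere against the six frustum planes. A minimal sketch, assuming inward-pointing plane normals (the actual engine's layout is unknown):

```cpp
struct Plane  { float nx, ny, nz, d; };   // plane: n·p + d = 0
struct Sphere { float cx, cy, cz, radius; };

bool sphereInFrustum(const Sphere& s, const Plane frustum[6])
{
    for (int i = 0; i < 6; ++i) {
        float dist = frustum[i].nx * s.cx + frustum[i].ny * s.cy
                   + frustum[i].nz * s.cz + frustum[i].d;
        if (dist < -s.radius)   // entirely behind this plane
            return false;       // cull the whole block of stars
    }
    return true;                // at least partially visible
}
```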

Back in the day, I was working for a shop that liked to print a certain executive report on their neato-keen 8-color dot-matrix printer. Unfortunately the printout took over an hour to complete! So they had me take a look at the printer-driver logic of their report-generation app to see if I could do anything (maybe convert it to assembly, they said - it was written in QuickBasic).

I looked at the code and didn't see anything obvious to do - it was pretty straightforward (shockingly; the rest of the app was some of the worst spaghetti code you've ever seen). So I got to watching the printer as it did its report. It was pretty interesting: there were four pens (red, blue, green, and black), and the pens could be stored at either end of the current print line. As I watched, it would grab a pen, scroll over to the current print position, plot a pixel or three, head over to the side, drop the current pen, grab a different one, and repeat. It was basically scanning back and forth, zillions of times per line, trying to print each pixel color in the order it was given.

Back to the driver code. I rewrote it to do a simple analysis of the colors and break each line into four jobs - a red job, a green job, a blue job, and a black job. After I was done, the print job was down to around 5 minutes - the people who printed the reports were astounded! They actually thought the printer was broken the first time they fired it up, because they had never heard it print so fast; they were so used to it spending all its time in pointless seeking. And all this without resorting to assembly.
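In essence the rewrite bucketed each line's plot commands by pen, so the printer picked up each pen at most once per line. A rough sketch in C++ (the original was QuickBasic, and the printer interface here is hypothetical):

```cpp
#include <vector>

enum Pen { RED, BLUE, GREEN, BLACK, PEN_COUNT };

struct PlotCmd { int x; Pen pen; };

void printLine(const std::vector<PlotCmd>& cmds)
{
    std::vector<int> jobs[PEN_COUNT];
    for (const PlotCmd& c : cmds)          // sort commands into pen buckets
        jobs[c.pen].push_back(c.x);

    for (int p = 0; p < PEN_COUNT; ++p) {  // one pickup per pen, four max
        if (jobs[p].empty()) continue;
        // printer.grabPen(Pen(p));        // hypothetical printer calls
        // for (int x : jobs[p]) printer.plotAt(x);
        // printer.dropPen();
    }
}
```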

Quote:
Original post by GamerSg
...

I modified my loading sequence to calculate bounding spheres, then started checking in-game objects for their distance from the character. If an object was not within the character's own sphere, its polygons were not checked for collision. While simple, this almost doubled my performance on 700-800MHz PCs.


Yeah, I do the same thing for collision detection in my graphical MUD engine. The extra double per object (the sphere radius) is well worth the performance boost.

Another big optimization was my event scheduler. My engine updates the world 10 times per second, checking for object movement, collisions, healing, etc. Although my object classes were smart enough to know which objects did what, it was still necessary to iterate through the entire object list every tick (1/10 of a second). In a game with 5000 walls and 100 mobiles, that's 5100 iterations per tick, 5000 of which are unnecessary, since walls never do anything on their own.

The event scheduler fixed that problem. Now mobile objects register a tick event with the scheduler, and static objects do not. In the same scenario as above, the scheduler only has to iterate through the 100 mobiles, no matter how many static objects there are. The scheduler also lets me register events like healing, so I don't have to check every tick whether enough time has passed for a mobile to heal: the heal event fires at the appropriate time and reschedules itself if the mobile is not completely healed. Not only is performance improved, this approach is much more scalable as well.
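A minimal sketch of such a scheduler: a priority queue keyed on the tick an event is due, so an update only touches events that actually fire. Illustrative code, not the poster's engine:

```cpp
#include <functional>
#include <queue>
#include <vector>

struct Event {
    long tick;                          // when the event fires
    std::function<void(long)> action;   // receives the current tick
    bool operator>(const Event& o) const { return tick > o.tick; }
};

class Scheduler {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue;
public:
    void schedule(long tick, std::function<void(long)> action) {
        queue.push({tick, std::move(action)});
    }
    void update(long currentTick) {     // called 10 times per second
        while (!queue.empty() && queue.top().tick <= currentTick) {
            Event e = queue.top();
            queue.pop();
            e.action(currentTick);      // may call schedule() to recur
        }
    }
};
```

A recurring event, like healing, simply calls schedule() again from inside its own action until it has nothing left to do.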

One time I achieved a >50% performance increase by modifying GDI code from {DIB -> same DIB, then DIB -> window DC} to {MemDC -> same MemDC, DIB -> same MemDC, same MemDC -> window DC}, where -> represents a single blit (or memcpy).
Basically, the image in the DIB was being scrolled up with new data added at the bottom; a memcpy had been doing the scrolling, whereas in the new version a BitBlt scrolls the MemDC.

I also got a >200% performance increase by changing a single-character read inside a for loop to instead read all the data in one function call before the loop. I had figured the internal buffering (by the file library, the OS, or the hard drive) would make them practically equivalent, but apparently not.

Another time, I obtained a >500% performance increase simply by calling std::string::reserve on a string before a large number of appends to it.
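The reserve win in miniature - without it the string reallocates and copies repeatedly as it grows; with it, every append writes into storage the string already owns:

```cpp
#include <string>
#include <vector>

std::string joinLines(const std::vector<std::string>& lines)
{
    std::string out;
    size_t total = 0;
    for (const std::string& s : lines) total += s.size() + 1;
    out.reserve(total);                // one allocation up front
    for (const std::string& s : lines) {
        out += s;                      // cheap: capacity is already there
        out += '\n';
    }
    return out;
}
```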

I am learning dynamic programming; I did one sample problem on TopCoder and got a huge speed boost. I think that stuff like TopCoder helps you not only with algorithm design, but with how you approach a problem.

Pointers help a lot too. I wrote a neural network without pointers and one with pointers, and the one with pointers ran faster.

I managed to cut vertex normal generation for a 1024x1024 terrain down from 55 minutes to 50 seconds by switching from a std::vector to a std::deque... however, it still takes about 2 minutes in total, as it takes ages to release the memory, even in a release build [sad]

My biggest optimization success story was creating SoftWire, a run-time code generator. It makes it possible to eliminate branches from code, and this in turn has other great performance benefits. It has proven its usefulness in my software rendering project; together with advanced assembly, speedups of 100x are not uncommon.

It's probably not that useful for other projects, but maybe I've inspired someone...

Quote:
Original post by _the_phantom_
I managed to cut vertex normal generation for a 1024x1024 terrain down from 55 minutes to 50 seconds by switching from a std::vector to a std::deque... however, it still takes about 2 minutes in total, as it takes ages to release the memory, even in a release build [sad]


Wow, what hardware are you using? I generate normals for my 4096x4096 terrain in under a minute... so it's either the algorithm or the hardware. That's on my Athlon XP 1600 with 256MB of RAM :P.

'Tis a 2.2GHz Athlon XP with a gig of RAM.
The routine probably isn't optimal (I just picked the first method that seemed easy to do and worked from there), and I know where the memory issues come from (a deque is allocated in the function and then takes forever to deallocate as it drops out of scope...).
It was for a proof-of-concept thing, so 2 minutes was fine really; 55 minutes was just a bit 'out there' [grin]

If you use SDL for graphics, be sure to convert loaded images to the screen format. I did this for a previous SDL project and got 250%+ faster image blitting. Woo!
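In SDL 1.2 that conversion is SDL_DisplayFormat (or SDL_DisplayFormatAlpha for surfaces with an alpha channel), called once at load time, after the video mode has been set:

```cpp
#include "SDL.h"

// Convert a loaded surface to the display's pixel format once, so every
// later blit is a plain copy instead of a per-blit format conversion.
SDL_Surface* loadOptimised(const char* path)
{
    SDL_Surface* loaded = SDL_LoadBMP(path);
    if (!loaded)
        return NULL;
    SDL_Surface* converted = SDL_DisplayFormat(loaded);
    if (converted) {
        SDL_FreeSurface(loaded);   // drop the unconverted original
        return converted;
    }
    return loaded;                 // conversion failed; blits still work
}
```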

For my most recent project, I did a one-time "explosion" particle emitter. I had been computing the next step for each particle from its angle and speed, which meant a lot of heavy math functions every frame. So instead I compute the direction once, at the beginning of the particle's life, and just apply the resulting x/y offset every frame. I got about 40% better performance just by precomputing a few simple things.
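The precomputation amounts to storing a velocity vector instead of an angle, so the per-frame update has no trig at all. A minimal sketch with illustrative names:

```cpp
#include <cmath>

struct Particle {
    float x, y;      // position
    float vx, vy;    // velocity, fixed at spawn time
};

Particle spawn(float x, float y, float angle, float speed)
{
    Particle p;
    p.x = x;  p.y = y;
    p.vx = std::cos(angle) * speed;   // the only trig this particle ever does
    p.vy = std::sin(angle) * speed;
    return p;
}

void update(Particle& p, float dt)
{
    p.x += p.vx * dt;                 // cheap per-frame step
    p.y += p.vy * dt;
}
```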

Two optimisations:

1. This one is sort of on topic with leiavoia's. I was recently doing a project [2d oldskool sidescroller] with SDL and getting framerates of ~30-40 on an incomplete world. I converted everything to incredibly inefficient OpenGL, and immediately I was getting ~750fps. That's 25ms per frame going down to ~1.5ms per frame.

2. I was making a baccarat simulation for a statistics assignment and decided to run a one-billion-game simulation. The code was all pretty straightforward C, and the billion games would have taken approximately 8 hours on a 2GHz machine. Two days before it was due, I realised I wasn't going to be able to collect enough data [i.e. variation in winnings across different numbers of decks and the like], so I figured I had to optimise something. I was shuffling usually more than 8 decks together at the beginning of each hand, and the routine was incredibly slow. I changed it so that the computer played through the whole shoe and then reshuffled, and those 5 minutes of edits made the entire billion hands run in well under 15 minutes. Go figure. I'm sure that now I could rewrite the shuffling algorithm to get it under 5...
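Roughly, the fix is the classic 'shoe': shuffle once, deal straight through, and reshuffle only when the shoe runs out. A sketch (modern C++ shown; the original was plain C):

```cpp
#include <algorithm>
#include <random>
#include <vector>

class Shoe {
    std::vector<int> cards;     // card codes 0..51, repeated per deck
    size_t next = 0;
    std::mt19937 rng{std::random_device{}()};
public:
    explicit Shoe(int decks) {
        for (int d = 0; d < decks; ++d)
            for (int c = 0; c < 52; ++c)
                cards.push_back(c);
        reshuffle();
    }
    void reshuffle() {
        std::shuffle(cards.begin(), cards.end(), rng);
        next = 0;
    }
    int deal() {                // amortised O(1) per card
        if (next == cards.size()) reshuffle();
        return cards[next++];
    }
};
```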

CJM

I worked on a program for encoding 125kHz proximity tags, where you used to be able to encode only about 7 cards per minute. After a bit of profiling it seemed that the hardware was taking ages, so I tried tweaking a few parameters and found one in particular that made it attempt to encode the card 16 times, at different power levels, even though full power is all that's needed - an easy speedup. Then I found that the comms had a 2-second timeout on each byte! I changed this to a 2-second timeout for the entire message. Now you can encode 60 cards per minute, which is faster than any of the other card types. The hardware now waits for the user instead of the other way around.

I once wrote a program to sort out headers and includes in C/C++ files to make the precompiled-header feature work properly. A full rebuild had been taking over 30 minutes; after sorting the headers and includes, it came down to a couple of minutes.

On a PS2 project, the level-loading code was taking an age when loading from CD. I wrote some file routines to save all the data out into a single file, in the order it was read from the original multiple files, then wrote a set of file routines to read from that single chunk instead. This isn't quite the same as PAK/WAD files: once the file has been opened, no more seeks are made until all the data is read (which is what CDs are very good at - reading big sequential files). Load times became virtually instantaneous, as opposed to a couple of minutes.
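The idea, sketched: one big file with an offset/size index, laid out in load order so reads never seek backwards. Format and names here are illustrative, not the original PS2 code:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct PackEntry { long offset; long size; };

class PackFile {
    std::FILE* file = nullptr;
    std::map<std::string, PackEntry> index;  // read from the pack's header
public:
    bool open(const char* path) {
        file = std::fopen(path, "rb");
        // ... read entry count, then (name, offset, size) records ...
        return file != nullptr;
    }
    // Assets are stored in load order, so a full level load is one long
    // forward read with no seeking between files.
    std::vector<char> read(const std::string& name) {
        const PackEntry& e = index.at(name);
        std::vector<char> buf(e.size);
        std::fseek(file, e.offset, SEEK_SET);   // no-op when already there
        std::fread(buf.data(), 1, buf.size(), file);
        return buf;
    }
};
```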

Skizz

I heard a neighbour yell, "My f*cking damn polyfill code (for some kind of screensaver) runs like a snail!" It was on a Mac, a 68040 or so. I didn't know that nasty beast, but I had many years of good practice on the Atari and Amiga behind me.

So I took a look. I noticed it was incredibly slow - abnormally so at first glance: say 5-10 times too slow for what it was supposed to do. I had a look at the code, and within a few seconds I saw some straightforward C routines on his screen - and ASM files!!! I took the inner loop, typed out an unrolled version (I could still remember my old 68000 experience by heart), moved constants out of the loop, and removed a few unnecessary stack accesses. Done in a few minutes. He ran it again: a 10x speedup. OK. I asked him if he had understood the modifications. He said, "No, not at all. I just copied and pasted some sample code and modified a couple of things." While I was at it, I also removed some graphical bugs by partially rewriting the outer loops (errors in a Bresenham-like algorithm).

Lesson 1: Close to a deadline, a boss (or programming coach) should not let newbies write asm code (training is fine when time is available - it depends). He had spent two weeks on it and had produced something worse than what he could have written in C.

Lesson 2: The most meaningful optimisations are very often in the human management.

Quote:
Original post by C0D1F1ED
My biggest optimization success story was creating SoftWire, a run-time code generator. It makes it possible to eliminate branches from code, and this in turn has other great performance benefits. It has proven its usefulness in my software rendering project; together with advanced assembly, speedups of 100x are not uncommon.

It's probably not that useful for other projects, but maybe I've inspired someone...


I tried to push my coworkers to use it to create mutating code in a protection system. Since they knew little about assembly programming, they eventually decided to move to a more "basic" protection scheme involving HASP dongles. Too bad - it was great fun to play with.

There are tons of possible uses for this library :)

At work one day, I profiled an exporter plugin written by a colleague and found that the CPU was spending most of its time in a class managing a huge linked list of objects (millions of elements).

I recoded the class to use an array instead - around 15 minutes of work and about a hundred modified lines in the whole program. So easy, but nobody had cared to look before.

Result: the export time of our "test" model went from 20 minutes to around 15 seconds.
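The change itself is almost embarrassingly small - the point is that a vector is one contiguous, cache-friendly block, whereas a list of millions of elements is a pointer chase with a heap allocation per node:

```cpp
#include <vector>

struct Element { float data[4]; };       // whatever the exporter stores

// Before: std::list<Element> elements;  // one allocation + chase per node
std::vector<Element> elements;           // one contiguous block
```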

Y.

While not something that got fixed, it was an interesting find nonetheless... We were using an in-house VRML library to load data exported from 3ds Max, and it was taking several hours to load a model. Eventually I ran a profile and saw that something like 99.9% of the time was spent calling realloc: the arrays were resized for every single float read. The lesson: preallocate, or grow geometrically the way vectors do (doubling capacity each time).
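Geometric growth in miniature - doubling the capacity turns one realloc per element into O(log n) reallocs in total, with amortised O(1) appends (error handling elided for brevity):

```cpp
#include <cstdlib>

struct FloatArray {
    float* data = nullptr;
    size_t size = 0, capacity = 0;

    void push(float v) {
        if (size == capacity) {
            capacity = capacity ? capacity * 2 : 64;   // grow geometrically
            data = static_cast<float*>(
                std::realloc(data, capacity * sizeof(float)));
        }
        data[size++] = v;
    }
};
```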

For my part, I made a particle system (arbitrary forces/emitters/etc., physically correct, though no collision); the catch is that it was memory-managed (garbage-collected), in C++ of course. Anyway, I wanted the managed particle system to be only a few percent slower than the unmanaged one, but when I first implemented it, it was MUCH slower: 5000 particles gave me 15fps, whereas unmanaged I could get 35fps.

The first thing I did was rewrite my garbage-collection routine. Originally it looped through all the objects in the list three times: first to mark those that were still valid, second to find and delete the unmarked objects, and third to reset all the deletion flags. I reduced that to one loop. I used an integer as the flag, moved all my objects to a deletion list (simply assigning the list to another head pointer), then looped through the objects, moved the valid ones back to the live list, deleted whatever remained, and incremented the integer flag so I didn't have to reset it. That brought my particle system from 15fps to 30fps.

There was still a 5fps difference. The problem now was simply the number of objects, so instead of storing each particle as a separate object I stored them all in one big array. Now my framerate was 35fps, just as fast as unmanaged. Note that the garbage collector runs between every frame.
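The incrementing-flag trick deserves a sketch: store the epoch in which an object was last marked and bump a global counter once per collection, so "unmarking" everything is O(1). Illustrative code, not the original system:

```cpp
#include <vector>

struct Object {
    unsigned markedEpoch = 0;   // epoch in which this was last marked
    // ... particle data ...
};

class Collector {
    unsigned epoch = 0;
public:
    void beginCollection()               { ++epoch; }  // "unmarks" all, O(1)
    void mark(Object& o) const           { o.markedEpoch = epoch; }
    bool isMarked(const Object& o) const { return o.markedEpoch == epoch; }

    // Single sweep: keep marked objects, delete the rest. Swap-erase keeps
    // the pass O(n) with no shifting, and no reset pass is ever needed.
    void sweep(std::vector<Object*>& objects) {
        for (size_t i = 0; i < objects.size(); ) {
            if (isMarked(*objects[i])) { ++i; continue; }
            delete objects[i];
            objects[i] = objects.back();
            objects.pop_back();
        }
    }
};
```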

Implementing a dirty-rects system for my software-rendered games. Very often this optimisation means that no CPU time at all is spent on drawing, and the games run even on my 233MHz Pentium MMX development machine.

And the best thing is that this optimisation has allowed me to be a lazy bum about almost everything else optimisation-related.
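A dirty-rects system in miniature - anything that changes marks its screen area dirty, and the present step redraws only those areas, so a still frame costs (almost) nothing. Illustrative names:

```cpp
#include <vector>

struct Rect { int x, y, w, h; };

class DirtyRects {
    std::vector<Rect> rects;
public:
    void markDirty(const Rect& r) { rects.push_back(r); }

    template <class RedrawFn>
    void present(RedrawFn redraw) {
        for (const Rect& r : rects)
            redraw(r);          // re-render and blit just this region
        rects.clear();          // nothing dirty until marked again
    }
};
```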

