Let's build a resource for optimisations?

Started by
16 comments, last by Fruny 18 years, 2 months ago
What if that inner loop is unrolled, thereby bloating even more with many copies of the inlined optimization? Seems like it could go either way couldn't it?
Quote:Original post by Jan Wassenberg
Are we *discussing* micro-optimizations, as in "do you like them?"? Or were we going to give a list of known ones to help out others?
Well, I hoped the former, but it seems to be going the other way. So either everyone's decided they are of no use, or nobody knows any.

I thought we'd at least see that one about casting by using knowledge about the floating-point format...
For "A faster Square Root than sqrt()" you could use
float fast_sqrt(float Arg)
{
    __asm {
        fld DWORD PTR _Arg$[esp-4]
        fsqrt
        ret 0
    }
}
Alternately, you could just use 'sqrtf', which is what I used to have MSVS 2003 generate the above code (with default optimization settings in 'release' mode).
"Walk not the trodden path, for it has borne its burden." -John, Flying Monk
Quote:So either everyone's decided they are of no use, or nobody knows any.

Did you see this stuff in post#7?
- square root (integer, exact and inexact float)
- sin/cos
- float -> int (C truncate semantics)
- thread with link to mini-article on that topic [fast memcpy/memset replacements]

Quote:For "A faster Square Root than sqrt()" you could use [FSQRT]

Did you see the sqrt link above? Given that the FSQRT instruction takes 27 clocks in default double precision mode, it doesn't seem very fast.

Quote:What if that inner loop is unrolled, thereby bloating even more with many copies of the inlined optimization? Seems like it could go either way couldn't it?

If the loop body is large, then unrolling it would be a poor decision. Unrolling is no longer the surefire win it was in the past: jump overhead is now very small, and icache pressure is a real factor.

Quote:I thought we'd at least see that one about casting by using knowledge about the floating-point format...

You mean the 2**52 * 1.5 trick? See (1) and (2).
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Quote:Original post by Jan Wassenberg
[...]Did you see the sqrt link above? Given that the FSQRT instruction takes 27 clocks in default double precision mode, it doesn't seem very fast.[...]
Did you use basic integer arithmatic to notice that 27 clocks is less than 2.7 * 10^-8 seconds on an (estimated) average machine, which means you could do it ~37M times a second? That's on a 1.0 gigahertz machine that doesn't do any of the things modern processors do to perform parallel and pseudoparallel computation on a single CPU.

Memory access will be a far worse bottleneck than things like sqrt in most modern programs. If you're calling sqrt so often that it's a problem, you're going through a LOT of data. In that case, you'd do far better to get lower-latency RAM, a multi-processor system (presuming each CPU has independent access to the memory and they're not competing for bandwidth), or upgrade to a processor with a larger cache.
Quote:Did you use basic integer arithmatic to notice that 27 clocks is less than 2.7 * 10^-8 seconds on an (estimated) average machine. That's on a 1.0 gigahertz machine [..]

Ooh, I'm not familiar with "arithmatic".
I understand you're trying to make a point here, but come on...

Quote: which means you could do it ~37M times a second?

Looks about right. However, that is absolutely meaningless and the wrong way of thinking to boot. If your program's response time is low enough, further optimization means you can process more data in the same time interval!

Quote:Memory access will be a far worse bottleneck than things like sqrt in most modern programs.
Ooh, "modern programs"?
But why should we care? Has it not been stated that sqrt optimizations are presented here *for those who need them*?

Quote:If you're calling sqrt so often that it's a problem, you're going through a LOT of data. In that case, you'd do far better to get lower-latency RAM, a multi-processor system (presuming each CPU has independent access to the memory and they're not competing for bandwidth), or upgrade to a processor with a larger cache.

LOT of data: yep.
Are you going to give me and all users a new CPU, just so I can be lazy?


I no longer wish to waste my time here entertaining the silly and off-topic discussion of whether optimization is needed or not. If you aren't writing time-critical software or are satisfied with mediocrity, fine by me.
Quote:Original post by Jan Wassenberg
What people don't realize is that threads such as this one PRESUPPOSE the need to make use of any of these techniques. Can't we *for once* talk about this without someone saying "premature optimization = boo"?


I never said that. I said that in some cases it was unlikely that you would find a practical, better solution than hardware functions. At any stage of development. Nowhere in this discussion have I said that it is wrong to optimize in such a way. I simply said that in many cases, such trivial optimizations are difficult to perform, and gave some examples of the major difficulties that a programmer would encounter. I even admitted that you could gain a little.

Quote:Sure, increased i-cache use is a disadvantage of replacing CISC-style instructions, but this is perhaps much exaggerated. With 64KiB icache, assuming 20 instructions @ 3bytes a pop, we are using 0.1% of cache. (before 386 with no icache at all and especially on 8088 with really bad bus, you indeed had to watch out for this)
But what are you on about with "set-associative", and why do you say "nearly killed"?


I don't want to get down into the details, but instruction caches usually store predecoded micro-ops which take up more space. And no modern cpu uses a fully associative cache - it's too slow - so they use set-associative caches which have minor tradeoffs in speed and wasted space. Like I said, we could spend hours discussing modern cpu architecture and performance. Computer architecture books are thick for a reason.

Quote:
I no longer wish to waste my time here entertaining the silly and off-topic discussion of whether optimization is needed or not. If you aren't writing time-critical software or are satisfied with mediocrity, fine by me.


I'm not going to waste more time either. I posted what I thought was a useful comment on trying to optimize trivial operations - and I didn't even say that such optimizations were worthless, only that they were sometimes very difficult to do and unlikely to bring much benefit.

Sorry again to the OP, if I come across a useful trick I will post it here.
If you want fast math, just use AMD's or Intel's math libraries, instead of making your own. They're available for download on their developer web pages.
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian W. Kernighan

This topic is closed to new replies.
