C++: Can this be optimized?

16 comments, last by cache_hit 14 years, 7 months ago
@cache_hit: I've implemented your method (Method 4), replacing the first loop of Method 1 with your suggestion. I couldn't figure out how to apply it to the second loop, as the value used to check whether a bit is set is the same value being XOR-ed, so it changes as the algorithm runs.

Here are the results. I'm re-running all of the Win32 data since I'm on a different machine now. Original machine was an Athlon 64 desktop, current machine is a Core 2 Duo laptop.

Win32
=====
DEBUG
-----
Correctness Test: Same = 65536, Different = 0

Performance Test (10000000 iterations):
Method 1: Output (ignore) = 99, Duration = 1.87773 seconds.
Method 2: Output (ignore) = 165, Duration = 1.66733 seconds.
Method 3: Output (ignore) = 137, Duration = 0.538485 seconds.
Method 4: Output (ignore) = 239, Duration = 1.56202 seconds.
Method 5: Output (ignore) = 224, Duration = 0.423876 seconds.

RELEASE
-------
Correctness Test: Same = 65536, Different = 0

Performance Test (10000000 iterations):
Method 1: Output (ignore) = 200, Duration = 1.28909 seconds.
Method 2: Output (ignore) = 69, Duration = 0.772024 seconds.
Method 3: Output (ignore) = 64, Duration = 0.0904632 seconds.
Method 4: Output (ignore) = 155, Duration = 0.798899 seconds.
Method 5: Output (ignore) = 15, Duration = 0.03105 seconds.


Embedded (release build)
========================
Method 1: Mult() cycles = 8,134,008, total system run-time = 23.157ms
Method 2: Mult() cycles = 3,912,233, total system run-time = 16.124ms
Method 3: Mult() cycles = 5,823,237, total system run-time = 19.307ms
Method 4: Mult() cycles = 10,931,047, total system run-time = 27.564ms
Method 5: Mult() cycles = 4,599,964, total system run-time = 31.605ms




Under Win32, your method is on par with Method 2, winning under Debug but losing slightly under Release. Method 3 is still the clear winner under Win32.

Under the embedded environment, your method seems to tank. The assembly output shows absolutely no pipelining happening during the Duff's Device switch/case fall-through code. It even throws in a couple NOPs per case. Chalk it up to a bad compiler, I guess. Method 2 still gives a great improvement over my original implementation, though.

My approach is going to be keeping all 4 methods available. A simple #define MULT_METHOD based off the build environment (WIN32_SUPPORT, EMBEDDED_SUPPORT, etc.) to select which multiplication method to use. Win32 will use MULT_METHOD 3, embedded will use MULT_METHOD 2, etc. Each new platform the code moves to will have to check which method is best and #define the appropriate MULT_METHOD.




*EDIT: Added the 64k LUT as Method 5. Up to 3x faster than the next best under Win32 Release, but you do take the 64kB table hit. It doesn't fare as well under the embedded environment. Its cycle count is 2nd best, but the overall time is the worst, most likely due to cache problems causing the cycle counts of the other methods to increase.

[Edited by - Mantear on September 10, 2009 1:26:22 PM]
Quote:Original post by Mantear
Under Win32, your method is on par with Method 2, winning under Debug but losing slightly under Release. Method 3 is still the clear winner under Win32.


Why not use a 64k table under Win32? Memory is not an issue, but cache might be.
Updated previous post to include Method 5: 64k LUT.
Quote:
Under Win32, your method is on par with Method 2, winning under Debug but losing slightly under Release. Method 3 is still the clear winner under Win32.

Under the embedded environment, your method seems to tank. The assembly output shows absolutely no pipelining happening during the Duff's Device switch/case fall-through code. It even throws in a couple NOPs per case. Chalk it up to a bad compiler, I guess. Method 2 still gives a great improvement over my original implementation, though.

My approach is going to be keeping all 4 methods available. A simple #define MULT_METHOD based off the build environment (WIN32_SUPPORT, EMBEDDED_SUPPORT, etc.) to select which multiplication method to use. Win32 will use MULT_METHOD 3, embedded will use MULT_METHOD 2, etc. Each new platform the code moves to will have to check which method is best and #define the appropriate MULT_METHOD.


Interesting results. Just goes to show the importance of profiling (as well as the fun of optimization).

I'm a bit amazed that the 64K look-up table is the slowest on the embedded build. That's weird; it should just be one instruction to fetch the result from the look-up table.
Also, I just thought of another optimization. Since you said the problem was that no pipelining was happening in the implementation I proposed, it's possible to force the compiler to pipeline it.

Instead of embedding the indices at the *beginning* of the array, embed them starting from position i = 8 - length. In the example I gave (0,1,4,7), there are four 1-bits, so i = 8 - 4 = 4, and you begin embedding the list 0,1,4,7 at index 4. So the indices array for the number 147 looks like {-1,-1,-1,-1,0,1,4,7}. Then your switch statement becomes:

switch (length)
{
case 8:   temp ^= (inputB << indices[0]);
case 7:   temp ^= (inputB << indices[1]);
case 6:   temp ^= (inputB << indices[2]);
case 5:   temp ^= (inputB << indices[3]);
case 4:   temp ^= (inputB << indices[4]);
case 3:   temp ^= (inputB << indices[5]);
case 2:   temp ^= (inputB << indices[6]);
case 1:   temp ^= (inputB << indices[7]);
}


This should get pipelined more effectively.
Quote:Original post by Mantear
Updated previous post to include Method 5: 64k LUT.


What kind of embedded system is this? Looks like just a small L1 cache, if that at all.
It's a TI DSP, C6416, 600MHz. It has 1MB of internal memory, which can be divided several ways, and 32MB of external SDRAM. 256kB of the internal memory is set as L1 cache; several other chunks are allocated for the DSP BIOS (the 'operating system' for TI DSPs), PCI interface buffers, etc., with the remaining 650kB left over for heap + stack. All instruction code is stored in SDRAM and is the only thing that gets cached, so it's pretty important to keep the most-used execution paths loaded in the cache. Heap + stack always runs at cache speeds in the remaining 650kB space.

@cache_hit: I may give your new suggestion a try, but I doubt it'll work. It was generating zero pipelining during the switch/case fall-through, and actually threw in 2 NOPs per case. Pipelining is explicitly shown in the generated assembly output, so it's easy to tell. Constructs like Duff's Device aren't usually useful for signal processing, so I wouldn't be surprised if the compiler doesn't handle it well.
Yeah, I guess that makes sense: it can't modify the same register with 2 different instructions at the same time. It's still doable, since XOR is associative and commutative, but it would require you to use either 2 or 4 different registers, alternating which variable you're XOR-ing into in each case label, then combining them at the end. I doubt it's worth the trouble, though; the other methods are already pretty fast.

This topic is closed to new replies.
