How much performance improvement does SSE provide?

Started by
10 comments, last by thatguyfromthething 8 years, 9 months ago

I'm new to SSE.

My take is that the basic theory is to pack four instructions together and issue it in one go. And it is compatible with multi-threading.

The question is how much performance gain we usually get after optimizing a program with SSE, especially for a math library.

I'm not expecting a 4x boost, but I do hope for a 2x improvement. Is that possible?

(Let's assume that the program is heavily CPU-bound on math computation.)

My simple experiment with SSE shows the cost drops by more than 40% compared to the non-SSE version, close to twice the performance as before.

Is it normal?

Any tip is welcome.


My take is that the basic theory is to pack four instructions together and issue it in one go.

A single instruction that works on 2 or 4 inputs simultaneously.

And it is compatible with multi-threading.

All single instructions are atomic with regards to in/out reads/writes.

The question is how much performance gain do we usually have after optimizing your program with SSE, especially for the math library.

This depends on many factors, especially on how many SIMD instructions you can string together so as to avoid loading and flushing data, and on the instruction set (which I assume to be x86 here since you mention SSE). Some SIMD instructions perform 2 operations (multiply-add, for example) or can otherwise replace multiple single-data instructions algorithmically, resulting in even greater performance.

I'm not expecting a 4x boost, but I do hope for a 2x improvement. Is that possible?

Very generally speaking, 3x is reasonable.

My simple experiment with SSE shows the cost drops by more than 40% compared to the non-SSE version, close to twice the performance as before.

Is it normal?

It largely depends on the math you are testing. Short functions with only 1 or 2 SIMD instructions and only a single load/flush pair will likely perform about the same as without SIMD. Between loading the data and flushing it, the more of the necessary operations you can replace with SIMD instructions, the more speed-up you get.

It is normal for a variety of math functions to fall along this spectrum from 30% faster to 70% faster. Also, you haven't been clear on exactly what drops by 40% and how you measured it. Based on your report I see no reason for alarm yet.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Here's a study in optimization that starts from a 4x4 matrix multiplied by a single 4-element vector taking 100 cycles of scalar code. It ends up with an optimized vector version taking something like 20 cycles (amortized for bulk transforms, and depending on how far you take the implementation), and an application-specific version that assumes input vectors with initial w=1 (a vertex, i.e. a point in space) taking just 17. You could do the same for initial w=0 (a geometric vector/ray, i.e. a direction in space) with similar or possibly better results.

The single-vector (non-batch) transform went down to 43 cycles on a first pass; it could probably be driven lower by applying some of the other techniques, but the author of the study didn't go down that path. I'd guess you could get it down to 35-40 cycles or so, give or take, using only SSE; memory alignment would be the biggest win here, I think, and maybe instruction pairing, though I'm not sure how much opportunity there is for that with a single vector.

Newer CPUs or other vector ISAs might have a dot-product instruction, which would be even better. Also, AVX instructions can process 8 32-bit floats instead of 4, so that could be twice as fast again (at least on hardware with full 256-bit processing width that isn't simply double-pumping 128-bit units, though double-pumped hardware might still see a small gain from reduced instruction-cache pressure or better pipelining).

throw table_exception("(? ???)? ? ???");

It's important to organize your data in a SIMD-friendly way (e.g. SoA) to get the maximum benefit from vectorization. For unrolled loops, you can expect 3-3.5x speedup. Note that your code can only be as fast as your biggest bottleneck (scalar code), so it is almost impossible to get a full 4x speedup using just SSE.

Here is a recent presentation on some best practices for writing SIMD code.

Some numbers from https://software.intel.com/en-us/articles/easy-simd-through-wrappers:
x86 integer 379.389s  1.0x
SSE4        108.108s  3.5x
SSE4 x2      75.659s  4.8x
AVX2         51.490s  7.4x
AVX2 x2      36.014s 10.5x
SSE adds extra registers (on top of the usual general-purpose ones) and has special instructions (e.g. min/max, SAD, dot product), etc.
When you optimize really well, you can get way beyond a 4x speed-up. On the other hand, on your first try you will likely get a slowdown, but don't get demotivated by it; on your next try you'll probably make it 2x faster already ;)

If you use SoA format and have long enough instruction chains, it isn't hard to get 4x for SSE and 8x for AVX.

You can also use AVX/SSE instructions (FMA, rsqrt, rcp) which will outperform standard C++ in specialized cases, and easily go over 8x.

I have a bunch of template math functions which can be invoked either with AVX types or standard C++ floats. The difference is about 10x.

Yeah, anywhere from 0.5x (i.e. you make your code twice as slow) to 60x faster... I'd say that 3-3.5x would be a good ballpark to aim for in many cases.

Simply SIMDifying a math library isn't the best use. Most of the time we're only doing 3D math, so 25% of your register lanes are wasted.

The best gains come from SoA data, where e.g. you load a register with 4 'x' values, a register with 4 'y' values, a register with 4 'z' values, etc...
You can then operate on 4 3D vectors at once, with 100% register utilisation.

Most of the time when trying to write this kind of SSE code, I use the ISPC language instead of trying to do it myself.

thanks, all.

Those materials are really helpful, especially the matrix-vector multiplication, quite suitable for beginners on SSE.

BTW, is there any tool that can tell me what the most expensive call is?

That's called a "profiler". A good free one is AMD CodeXL.


http://developer.amd.com/tools-and-sdks/opencl-zone/codexl/

Sounds great.

However, since I'm working on SSE, VTune should be better for my case. I guess CodeXL won't support SSE profiling.

Since when does Intel charge for every tool they provide? Both GPA and VTune are non-free.

How do developers optimize on their platform if the tools aren't free?

This topic is closed to new replies.
