Optimized SLERP

Started by
13 comments, last by bzt 4 years, 5 months ago

Hi,

I was looking for a way to effectively interpolate between skeletons represented by list of bones with position vector + orientation quaternion pairs. To my surprise, I couldn't find any good solution, not to my taste that is. I can't use NLERP because I need a general solution, and NLERP is only good for small distance interpolations. SLERP would be perfect, but I couldn't find a decent implementation, only slow ones, so I spent a lot of time looking at game engines and quaternion libraries and reading academic articles on the topic. Finally, I decided to collect my findings and write my own optimized version of SLERP, which focuses on performance over precision, and which I now would like to share with you.

Optimized cross-platform SLERP

I've started from the well known formula, as implemented by most C++ libraries, and used all the tricks I could find on the net. Then I took it to the next level, and made some more adjustments on my own speeding up a little bit more. Finally I ended up with two solutions:

1. one that is cross-platfrom, ANSI C solution that is easily embeddable in C++ and compatible with any C / C++ quaternion class (provided their data can be seen as 4 floats), very simple, very minimal. Use it freely if you want, MIT licensed

2. a SIMD version, which I had to leave half-ready, because my compiler and Intel disagree on what intrinsics should be provided for SSE, moreover the compiler does not work the way its documentation say it should :-( Shit happens. I would only recommend this for experimental purposes, also MIT licensed

Both versions are simpler and much faster than any implementations I could find, designed to be called in a loop several times. The only dependency they have is lib math (acosf, sinf). The SIMD version is half-ready, but even in this form it's almost 1.5 as fast as the other one, especially if you call it in a loop with memory prefetch on the quaternions.

If I made any mistake, miscalculated or mistyped something, let me know. If anybody has an idea how to overcome that intrinsics blockade I have, that would be appreciated very much, because I can clearly see the path how to make the code several times faster, only if I could get rid of the library calls. I also plan to create an ARM NEON port once I have that.

Cheers,

bzt

Advertisement

If anybody interested, I've the ultimate solution for SLERPing skeletons ?

It is not a worked out solution, more like a things-to-do and how-to list, but I came up with an idea how to interpolate skeletons on the GPU by hacking OpenGL and using a single glDrawArray() call. It is much simpler than it sounds.

Cheers,

bzt

Nice work!

I read most of the text in the GitLab page, and I have some comments:

  •  The first piece of code is missing a parenthesis (just try to compile it and see).
  • C++ really isn't slow, if you know how to use it. Watch this if you need to be convinced.
  • Your problem with intrinsics might be that you are not passing the right options to inform the compiler of what instruction set it can use.
  • Instead of optimizing trigonometric functions, you could use polynomials to approximate them, which would probably be much faster, especially if you don't care about a 1% error. I should have some time over the weekend to find decent polynomial approximations to the trigonometric computations in your code, and I'll post them here in the next couple of days.
2 hours ago, alvaro said:

[...]

  • Your problem with intrinsics might be that you are not passing the right options to inform the compiler of what instruction set it can use.

[...]

No, it's not that. Intel has some library that pretends that those instructions exist and uses whatever approximations they came up with. I think you understood the situation already, and I just didn't read your text carefully enough. Sorry about that.


void slerp(float *qa, float *qb, float t, float *ret) {
  float d = qa[0] * qb[0] + qa[1] * qb[1] + qa[2] * qb[2] + qa[3] * qb[3];
  float x = 0.8059f * (1.0f - d);
  float x2 = x * t * (1.0f - t);

  float a = 1.0f - t + x2;
  float b = t + x2;

  if (d < 0.0) b = -b;

  ret[0] = qa[0] * a + qb[0] * b;
  ret[1] = qa[1] * a + qb[1] * b;
  ret[2] = qa[2] * a + qb[2] * b;
  ret[3] = qa[3] * a + qb[3] * b;
}

This is a first-order correction to LERP. I intend to write code that is more precise than this, but I wonder if this is good enough for use in video games directly. Do you have an animation blending setup where you can just drop this and see if it looks OK?

 

Just now, alvaro said:

Nice work!

I read most of the text in the GitLab page, and I have some comments:

  •  The first piece of code is missing a parenthesis (just try to compile it and see).
  • C++ really isn't slow, if you know how to use it. Watch this if you need to be convinced.
  • Your problem with intrinsics might be that you are not passing the right options to inform the compiler of what instruction set it can use.
  • Instead of optimizing trigonometric functions, you could use polynomials to approximate them, which would probably be much faster, especially if you don't care about a 1% error. I should have some time over the weekend to find decent polynomial approximations to the trigonometric computations in your code, and I'll post them here in the next couple of days.

Ok, I'll have to turn JS off, because this editor drives me nuts. I can't properly place my comment into quotes where I want them.

"Nice work!"
Thank you very much! It could have typing errors, I was burning the candle on both ends pretty badly ?

"C++ really isn't slow"
Yes it is. Trust me, allocating objects on every slerp call is bad (this is OOP related-issue, not C++ specific). Also I can see the Assembly output, which is full of indirect register based memory accesses, or with virtuals even worse, register based branches like "call *%rax". Not significant either of them alone, but it builds up. But that's nothing compared to the lazy use of classes and objects. It takes a very skilled programmer with many years of practice to create effective code in C++, speaking from experience. An average programmer's code tends to eat up all available memory pretty quickly, and it's copying memory from here to there over and over again, just because the class' creator or the library's user was unsure of a buffer's ownership.

"Instead of optimizing trigonometric functions, you could use polynomials"
Yes, that was my thought too. I'm really big fan of the k_sin() implementation of the Sun Solaris kernel, seems like to be particularly good candidate to be vectorized, and it's accurate enough. I'm waiting for library suggestions. If I could find any proper OpenSource SSE implementation of acos/sin, I'd like to use that (it has been almost 30 years since that kernel was written, surely somebody tried to implement it using SSE, I just haven't found it yet). I definitely don't want to reinvent the wheel here ?

"This is a first-order correction to LERP."
Thank you very much, but this has different characteristics than SLERP. To properly approximate SLERP with NLERP, I'd suggest to take a look at zeux.io's article (linked from my repo's README) it is the best I've seen.

"Do you have an animation blending setup where you can just drop this and see if it looks OK?"
Unfortunately not yet. I'm just printing out bunch of numbers with printf, that's one reason why I've asked if someone can spot a mistake let me know. With a visualizer I could see it at once ?

Cheers,
bzt

UPDATE: It looks like I've found it. The single header implementation of _mm_sin_ps in the cephes library has zlib license, uses SSE intrinsics only, and it seems that it's using the same polynomial as k_sin (I know, it is pervert that the link is a Java library source on the Apple's website, but that was the first result in the search engine for Solaris kernel :-)) Now I only need _mm_acos_ps.

Cheers,
bzt

On 10/18/2019 at 11:50 AM, bzt said:

"This is a first-order correction to LERP."
Thank you very much, but this has different characteristics than SLERP. To properly approximate SLERP with NLERP, I'd suggest to take a look at zeux.io's article (linked from my repo's README) it is the best I've seen.

I'm not sure what you mean by "different characteristics". zeux.io's article is indeed very good, but I am going for something slightly different here: I want to compute your `a' and `b' directly, without having to normalize the resulting quaternion, because hopefully I'll get something with length close to 1 automatically.

I think your code doesn't handle d<0 correctly: When your code says


        c = acosf(d);

it should probably say


        c = acosf(fabsf(d));

I'll post code with a better approximation tonight.

 

One more thing: When d>0.999, it's OK to fall back to lerp; but when d<-0.999 your code returns unreasonable results.

 

Hi,

Thanks for your interest!

1 hour ago, alvaro said:

I'm not sure what you mean by "different characteristics".

I meant that a first-order correction of LERP does not travel on the unit sphere's surface, and most importantly not with constant velocity. SLERP does. That's why I'm trying to make it faster instead of finding a substitute. This is well explained in https://web.mit.edu/2.998/www/QuaternionReport1.pdf (although there's a lot of stuff in that pdf, look around section 6 where a few algorithms are compared and visualized as well. You'll see on the right side graph what makes SLERP special.)

1 hour ago, alvaro said:

it should probably say



        c = acosf(fabsf(d));

I'm going to check again, however my intention with changing the sign of the result by "if(d < 0) b = -b;" was to take care of this.
 

Just now, alvaro said:

When d>0.999, it's OK to fall back to lerp; but when d<-0.999 your code returns unreasonable results.

What makes you think that? Think it again. If d < -0.999 then |d| > 0.999. Because the condition is |d| < 0.999, that means it calculates SLERP if d > -0.999 and d < 0.999. It is either SLERP or LERP, there's no third option (or unreasonable results) possible. If it's not SLERP, then the code fallbacks to the default values of a and b, which are initialized for LERP. However there's a mistake in the GLSL version, where I've explicitly checked the interval, there I should have used && instead of ||. Yeah, it was late at night.

Cheers,
bzt

 

This topic is closed to new replies.

Advertisement