
Fast Perlin noise

Ysaneya


I'm going to present here a "tweak" to the Perlin noise function that multiplies its speed by a factor of 7.

The original Perlin noise function that is used is Perlin's Improved noise function.

Of course, a x7 speed increase means that we're going to lose some of the nice properties / accuracy of the original noise, but everything in life is a trade-off, isn't it? Whether you want to use the fast version instead of the original one is up to you, and depends on whether you need performance over quality.

On a Pentium 4 @ 3 GHz, Improved noise as implemented directly in C++ from the code linked above takes 7270 milliseconds to generate 1024x1024 samples of fBm noise (the basis function being Improved noise) with 12 octaves. In other words, for each sample the Improved noise function is called 12 times. I'm using the 3D version of the algorithm. I ran the test 8 times and averaged the results to get something relevant.
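For reference, the fBm being benchmarked is the usual sum of octaves, with frequency doubling and amplitude halving at each step. A minimal sketch, with a toy stand-in basis (the real benchmark calls Improved noise, and the 0.5 gain per octave is my assumption, not stated in the article):

```cpp
#include <cmath>

// Toy stand-in basis: any noise3(x, y, z) returning values in
// roughly [-1, +1] fits here. The article uses 3D Improved noise.
float noise3(float x, float y, float z)
{
    return std::sin(x * 12.9898f + y * 78.233f + z * 37.719f);
}

// fBm: sum 'octaves' copies of the basis, doubling the frequency and
// halving the amplitude each step. The article benchmarks octaves = 12.
float fbm3(float x, float y, float z, int octaves)
{
    float sum = 0.0f, amplitude = 1.0f, frequency = 1.0f;
    for (int i = 0; i < octaves; ++i)
    {
        sum += amplitude * noise3(x * frequency, y * frequency, z * frequency);
        frequency *= 2.0f;
        amplitude *= 0.5f;
    }
    return sum;
}
```

So the 7270 ms figure amortizes to roughly 0.58 microseconds per basis call (7270 ms / (1024 * 1024 * 12)).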

The first thing that can be improved in the Improved noise is the float-to-int casts:


const TInt xtr = floorf(xyz.x);
const TInt ytr = floorf(xyz.y);
const TInt ztr = floorf(xyz.z);


Those are really, really... REALLY bad. Instead, you can use a bit of assembly:


__forceinline TInt __stdcall MFloatToInt(const TFloat x)
{
    TInt t;
    __asm fld x
    __asm fistp t
    return t;
}


and replace the floor calls by:

const TInt xtr = MFloatToInt(xyz.x - 0.5f);
const TInt ytr = MFloatToInt(xyz.y - 0.5f);
const TInt ztr = MFloatToInt(xyz.z - 0.5f);
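The inline assembly above is MSVC/x86-specific. For other compilers, a hedged portable equivalent (my sketch, not the article's code) is std::lrintf, which also rounds using the current FPU rounding mode, just like fistp; subtracting 0.5f first turns round-to-nearest into an approximate floor, exactly as in the snippet above:

```cpp
#include <cmath>

typedef int   TInt;
typedef float TFloat;

// Portable stand-in for the fld/fistp pair: lrintf rounds using the
// current rounding mode (round-to-nearest-even by default).
inline TInt MFloatToInt(const TFloat x)
{
    return (TInt)std::lrintf(x);
}

// Approximate floor, as in the article: round-to-nearest(x - 0.5)
// equals floor(x) except when x - 0.5 lands exactly on a .5 tie.
inline TInt FastFloor(const TFloat x)
{
    return MFloatToInt(x - 0.5f);
}
```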


The same performance test ( 1024x1024 @ 12 octaves ) now takes 3324 milliseconds, a 118% improvement.

The next trick is an idea from a co-worker, Inigo Quilez, who's heavily involved in the 64 KB demo-coding scene. His idea is simple: instead of computing a gradient, just replace it with a lookup table.

The original code looked like this:

return(_lerp(w, _lerp(v, _lerp(u, _grad(ms_p[8][AA], x, y, z),
                                  _grad(ms_p[8][BA], x - 1, y, z)),
                         _lerp(u, _grad(ms_p[8][AB], x, y - 1, z),
                                  _grad(ms_p[8][BB], x - 1, y - 1, z))),
                _lerp(v, _lerp(u, _grad(ms_p[8][AA + 1], x, y, z - 1),
                                  _grad(ms_p[8][BA + 1], x - 1, y, z - 1)),
                         _lerp(u, _grad(ms_p[8][AB + 1], x, y - 1, z - 1),
                                  _grad(ms_p[8][BB + 1], x - 1, y - 1, z - 1)))));


The "fast" version looks like this:

return(_lerp(w, _lerp(v, _lerp(u, ms_grad4[AA],
                                  ms_grad4[BA]),
                         _lerp(u, ms_grad4[AB],
                                  ms_grad4[BB])),
                _lerp(v, _lerp(u, ms_grad4[AA + 1],
                                  ms_grad4[BA + 1]),
                         _lerp(u, ms_grad4[AB + 1],
                                  ms_grad4[BB + 1]))));


Here, "ms_grad4" is a 512-entry lookup table that contains purely random float values in the [-0.7, +0.7] range.

It is initialized like this:

static TFloat ms_grad4[512];

TFloat kkf[256];
for (TInt i = 0; i < 256; i++)
    kkf[i] = -1.0f + 2.0f * ((TFloat)i / 255.0f);

for (TInt i = 0; i < 512; i++)
    ms_grad4[i] = kkf[ms_p2[i]] * 0.7f;


At this point you're maybe wondering: why 0.7? This is not clear to us yet; we suspect it has something to do with the average value you can get from a normal Perlin noise basis. We originally tried a range of [-1, +1], but we found that the fBm saturated to black or white too often, hurting quality a lot.
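A quick back-of-the-envelope check of why the range matters: with a gain of 0.5 per octave (my assumption; the article doesn't state its exact fBm parameters), a 12-octave fBm built on a [-1, +1] basis can in principle reach nearly +/-2, while a [-0.7, +0.7] basis stays under +/-1.4, leaving headroom before the output clips to black or white:

```cpp
// Maximum possible |fBm| value: the sum of per-octave amplitudes times
// the basis range, for gain = 0.5 (assumed parameter).
float maxAmplitude(float basisRange, int octaves)
{
    float sum = 0.0f, amplitude = 1.0f;
    for (int i = 0; i < octaves; ++i)
    {
        sum += amplitude * basisRange;
        amplitude *= 0.5f;
    }
    return sum;
}
```

For example, maxAmplitude(1.0f, 12) is just under 2.0, while maxAmplitude(0.7f, 12) is just under 1.4.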

This "fast" version of noise runs in 1027 milliseconds, a 223% improvement over the previous tweak, or a 607% improvement over the original Improved noise version.

As for quality, well, everything has a price. There are no visible artifacts in the fast noise, but the features don't seem to be as regular and well distributed as in the improved noise. The following image was generated with the same frequency/scaling values. The features appear different because the lookup tables hold different values in the two versions, but that's expected.


http://www.fl-tw.com/Infinity/Media/Misc/fastnoise.jpg


14 Comments



Hmm, have you tried comparing noise generated on the CPU to noise generated on the GPU? A quick test of my (simplex) GPU Noise/Starfield Generator experiment on my old Radeon 9800 Pro shows that it takes less than 200 ms to generate 1024x1024 12-octave noise, including a readback of the final result and some tone-mapping. Now, my timing method is probably "less than optimal", initialization costs are higher, it takes some GPU memory, it doesn't work on anything without PS 2.0, and with my card I'm more or less limited to 16-bit FP precision, but if you need to generate lots of noise values it seems like a good alternative. Or has anyone else tested this and got completely different results?

Nice journal post anyway, the LUT trick here is new to me :)

It is a possibility that I'm keeping in mind. What I'm afraid of is slowing down the rendering by 200 ms * 6. The *6 is due to the fact that the places where I make the most intensive use of noise are the planetary global texture generation and the spacebox background texture generation, and those are cubes with 6 faces. Of course, they could be spread over many frames, but you'd still see some serious slowdowns / stuttering.

My idea was to use multi-threading instead. It would take more time to calculate, but it wouldn't affect the rendering speed.

Interestingly enough, the original Java code runs even faster than your optimized C++ code on my 1.8 GHz Intel Centrino Duo, which should be a slower CPU than yours.

I got 500 ms when calculating 1024x1024 noise values on JRE 1.6.

Used this as basis:

http://mrl.nyu.edu/~perlin/noise/

Then added some simple benchmarking code:

public static double a;

private static int bench()
{
    a = 0;
    long start = System.currentTimeMillis();
    for (int y = 0; y < 1024; y++)
        for (int x = 0; x < 1024; x++)
            a += noise(x, y, 1);
    return (int)(System.currentTimeMillis() - start);
}

public static void main(String args[])
{
    System.out.println("Starting");
    for (int i = 0; i < 10; i++)
        System.out.println((i + 1) + ". " + bench());
}

The static a variable is used to ensure the compiler does not just omit the call to noise.
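The same guard matters when benchmarking the C++ versions: an optimizer that sees an unused accumulator may delete the whole loop. A minimal sketch of one common way to keep it alive (the noise function here is a placeholder, not the article's code):

```cpp
// volatile: every store to g_sink is an observable side effect, so the
// compiler cannot elide the loop that feeds it.
volatile float g_sink;

// Placeholder basis function standing in for the real noise call.
float noise(float x, float y, float z)
{
    return x * 0.1f + y * 0.2f + z * 0.3f;
}

float benchmark()
{
    float a = 0.0f;
    for (int y = 0; y < 1024; ++y)
        for (int x = 0; x < 1024; ++x)
            a += noise((float)x, (float)y, 1.0f);
    g_sink = a; // keeps the accumulation from being dead-code eliminated
    return a;
}
```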

Does this mean that Java is 14 times faster than C++? :)

Share this comment


Link to comment
Oops, just read the fine print: 12 octaves... :D

That surely makes the C++ version faster, as my benchmark used just one octave. Still, I would be very interested in knowing how Java stacks up to C++ in this case, with so many people claiming Java is as fast as C++ these days... :)

Share this comment


Link to comment
Those optimisations are really nice. I implemented them, however I got some interesting results with the assembler instruction optimisation.

fistp relies on the rounding mode the FPU is currently in, the default being round-to-nearest, instead of truncating towards zero. In this case you can get negative fractional parts, which however gives some nice results.
Maybe a procedural texture method of generating "modern metal with cracks" is born ;)

This article describes it way better than i can [stereopsis on fistp]
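The tie case the fistp trick trips over can be shown without any assembly: with the default round-to-nearest-even mode, round(x - 0.5) matches floor(x) everywhere except when x - 0.5 lands exactly on a .5 tie, i.e. at whole numbers, where half of them round down one too far. A sketch using std::nearbyint to emulate fistp's rounding (my illustration, not the poster's code):

```cpp
#include <cmath>

// Emulates the article's MFloatToInt(x - 0.5f) floor replacement:
// nearbyint rounds using the current rounding mode, which defaults to
// round-to-nearest-even, the same behaviour fistp relies on.
int fistpFloor(float x)
{
    return (int)std::nearbyint(x - 0.5f);
}
```

For instance, fistpFloor(1.3f) and fistpFloor(-1.3f) agree with floorf, but fistpFloor(3.0f) computes nearbyint(2.5f), which rounds to the even value 2 rather than floor's 3.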

Here are the results:
Fast fBm noise 1
Fast fBm noise 2
Fast fBm noise 3 - this one is my favorite

h.

I hope you are able to see new posts on old journal entries, or this may go unanswered. I wanted to ask a little about your fast implementation here. It looks like what you have done is basically taken the "Perlin" out of Perlin noise. What I mean is, your resulting noise is no longer gradient noise but ends up being very similar to plain old quintic-interpolated value noise. All of the noise variation seems to revert back to the boxy, axis-aligned look that non-gradient noise exhibits. This characteristic can clearly be seen in the fBm screenshots that you provided; your implementation displays the axis-aligned clustering of detail. You have artificially limited your amplitude at each level to [-0.7, 0.7], which masks some of the clustering, but I can reproduce a very similar image to your fast Perlin noise by limiting my amplitude on quintic-interpolated fBm.

That being said, maybe you can help me understand if I have implemented what you've done incorrectly. Ken's fast noise uses a gradient function taking the hash (from p[]), x, y, and z coordinates to calculate a gradient.

inline float grad(int hash, float x, float y, float z)
{
    // Convert LO 4 bits of hash code into 12 gradient directions
    hash &= 15;
    float u = (hash < 8) ? x : y;
    float v = (hash < 4) ? y : (hash == 12 || hash == 14) ? x : z;
    return (((hash & 1) == 0) ? u : -u) + (((hash & 2) == 0) ? v : -v);
}

which you have replaced with only a single lookup into your ms_grad4 table with the hash value. Now, how do you account for the other inputs in this lookup table? From the way your init step looks, they are not accounted for, and thus no interpolation of gradients actually occurs, only an interpolation of noise values, which is exactly the scheme Perlin noise was designed to improve upon.
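To make the commenter's distinction concrete: Ken's _grad projects the corner-to-sample offset onto a pseudo-random gradient, so its contribution varies as the sample moves across the cell, while a pure table lookup is constant per corner, which is the definition of value noise. A minimal side-by-side sketch (the stub table values here are invented for illustration):

```cpp
// Ken Perlin's gradient contribution: depends on both the hash and
// the (x, y, z) offset from the lattice corner.
inline float grad(int hash, float x, float y, float z)
{
    hash &= 15;
    float u = (hash < 8) ? x : y;
    float v = (hash < 4) ? y : (hash == 12 || hash == 14) ? x : z;
    return (((hash & 1) == 0) ? u : -u) + (((hash & 2) == 0) ? v : -v);
}

// The "fast" replacement: a per-corner constant, i.e. value noise.
static const float ms_grad4_stub[4] = { 0.21f, -0.53f, 0.66f, -0.07f };
inline float fastGrad(int hash, float /*x*/, float /*y*/, float /*z*/)
{
    return ms_grad4_stub[hash & 3]; // offset inputs are ignored entirely
}
```

For example, grad(5, 0.2f, 0.3f, 0.4f) and grad(5, 0.8f, 0.3f, 0.4f) differ, but fastGrad returns the same value for both: the offset dependence, and with it the gradient character, is gone.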

I noticed in your init step you used ms_p2, which is not the same as the ms_p used in your original code. Are these two tables actually different? Is this where you do some extra processing that was not shown in the init step? In my implementation I simply used the same p table that I've been using for Perlin noise in that location. Can you check the raw output of just 1 octave of your noise without the fBm for comparison? Like I said, I am getting axis-aligned noise.

You mentioned SSE2 noise in a later article, and that you've replaced all memory lookups with calculations. Does this mean in your latest implementation that this quality issue has basically gone away?

Thanks, Jeff

I was changing my Perlin noise code over to SSE2 today, and while I didn't notice too much speed improvement initially (perhaps x2) when doing four calculations at a time, I got a great shot in the arm when I modified it to use the registers in 16-bit mode, allowing eight calculations at a time.

It seems that 16 bits should provide enough detail for noise at any given octave; the final results are fed back in 128-bit registers as 32-bit floats (doubled up, as I'm doing 8 calculations at a time).

Returning to browse the intrinsics now to see if I can speed up the initial input register packing!

Slightly OT but there's a good paper on rounding asm here:
http://ldesoras.free.fr/doc/articles/rounding_en.pdf

Actually, the reason I started looking at doing 8 rather than 4 noise calculations at once was that on my SSE2 PC I can't find a simple side-by-side __m128i x 4-int multiply (equivalent to _mm_mullo_epi32 in SSE4). This meant that in the inner part of the noise function I was having to drop out of the SSE2 pipeline to convert to types I could use for the integer multiply. Fixed by changing over to eight calcs at once, but in my 2D function I only need four results to lerp between, so it ends up being slower than it could be because the return path supports 8 return values. This I would like to fix...

So what's the fastest way to multiply four integers in one __m128i register with four integers in another?

btw: Currently my 2D function runs minutely slower than my 3D function! The only place it differs is the drop to 32-bit ints for the inner noise function, which I'm assuming is down to some memory shenanigans during the conversion to and from the 128-bit value (hidden using the union approach). I expected this to slow things down, but am still a little shocked at the extent!
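For what it's worth, one well-known SSE2-only answer to the 4x32-bit multiply question is to do two even/odd _mm_mul_epu32 multiplies and shuffle the low halves back together; this yields the same low 32 bits per lane as SSE4.1's _mm_mullo_epi32. A sketch (not tested against the poster's noise code):

```cpp
#include <emmintrin.h> // SSE2

// Low 32 bits of a 4x32-bit multiply, using SSE2 only.
// _mm_mul_epu32 multiplies lanes 0 and 2; shifting both inputs right by
// 4 bytes exposes lanes 1 and 3. Two shuffles gather the low 32 bits of
// each 64-bit product, and unpacklo interleaves them back into order.
// The low 32 bits are correct for signed inputs too.
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                                       // products of lanes 0, 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); // products of lanes 1, 3
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}
```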

The solution to my 2D Perlin optimisation is simply to use the 64-bit MMX functions!

_mm_mullo_pi16 - multiplies four 16-bit values in m1 by four 16-bit values in m2 and produces the low 16 bits of the four results
_mm_cvtpi16_ps - converts the four 16-bit signed integer values in a to four single-precision floating-point values
_mm_set_pi16 - sets the four signed 16-bit integer values

I'll stop abusing this thread now.

