• entries
232
1462
• views
958348

# Fast Perlin noise

4985 views

I'm here going to present a "tweak" to the Perlin noise function that multiplies its speed by a factor of 7.

The original Perlin noise function that is used is Perlin's Improved noise function.

Of course, a speed increase of x7 means that we're going to loose some nice properties / accuracy of the original noise, but everything is a trade-off in life, isn't it ? Whether you want to use the fast version instead of the original version is up to you and depends on if you need performance over quality or not..

On a Pentium 4 @ 3 Ghz, Improved noise as implemented directly in C++ from the code linked above takes 7270 milliseconds to generate 1024x1024 samples of fBm noise ( the basis function being Improved noise ) with 12 octaves. In other words, for each sample, the Improved noise function is called 12 times. I'm using the 3D version of the algorithm. I ran the test 8 times and averaged the results to get something relevant.

The first thing that can be improved in the Improved noise are the casts from float-to-int:

const TInt xtr = floorf(xyz.x);const TInt ytr = floorf(xyz.y);const TInt ztr = floorf(xyz.z);

Those are really, really.. REALLY bad. Instead you can use a bit of assembly:

__forceinline TInt __stdcall MFloatToInt(const TFloat x){	TInt t;	__asm fld x  	__asm fistp t	return t;}

and replace the floor calls by:
const TInt xtr = MFloatToInt(xyz.x - 0.5f);const TInt ytr = MFloatToInt(xyz.y - 0.5f);const TInt ztr = MFloatToInt(xyz.z - 0.5f);

The same performance test ( 1024x1024 @ 12 octaves ) now takes 3324 milliseconds, a 118% improvement.

The next trick is an idea of a co-worker, Inigo Quilez, who's heavily involved in the 64 KB-demo coding scene. His idea is simple: instead of computing a gradiant, just replace it with a lookup table.

The original code looked like this:
	return(_lerp(w, _lerp(v, _lerp(u, _grad(ms_p[8][AA], x, y, z),                                      _grad(ms_p[8][BA], x - 1, y, z)),                             _lerp(u, _grad(ms_p[8][AB], x, y - 1, z),                                      _grad(ms_p[8][BB], x - 1, y - 1, z))),                    _lerp(v, _lerp(u, _grad(ms_p[8][AA + 1], x, y, z - 1),                                      _grad(ms_p[8][BA + 1], x - 1, y, z - 1)),                             _lerp(u, _grad(ms_p[8][AB + 1], x, y - 1, z - 1),                                      _grad(ms_p[8][BB + 1], x - 1, y - 1, z - 1)))));

The "fast" version looks like this:
	return(_lerp(w, _lerp(v, _lerp(u, ms_grad4[AA],                                      ms_grad4[BA]),                             _lerp(u, ms_grad4[AB],                                      ms_grad4[BB])),                    _lerp(v, _lerp(u, ms_grad4[AA + 1],                                      ms_grad4[BA + 1]),                             _lerp(u, ms_grad4[AB + 1],                                      ms_grad4[BB + 1]))));

Here, "ms_grad4" is a 512-entry lookup table that contains absolutely random float values in the [-0.7; +0.7] range.

It is initialized like this:
static TFloat ms_grad4[512];TFloat kkf[256];for (TInt i = 0; i < 256; i++)	kkf = -1.0f + 2.0f * ((TFloat)i / 255.0f);for (TInt i = 0; i < 256; i++){	ms_grad4 = kkf[ms_p2] * 0.7f;}

At this point you're maybe wondering why 0.7 ? This is not clear to us yet; we're suspecting it has something to do with the average value you can get from a normal Perlin noise basis. We tried a range of [ -1; +1 ] originally, but we found that the fBm was saturating to black or white too often, hurting quality by a lot.

This "fast" version of noise runs at 1027 milliseconds, a 223% improvement compared to the previous tweak, or a 607% improvement compared to the original Improved noise version.

As for quality, well, everything has a price. There's no visible artifacts in the fast noise, but the features don't seem to be as well regular and well distributed than the improved noise. The following image has been generated with the same frequency/scaling values. The features appear different since the lookup table holds different values in the two versions, but that's normal.

http://www.fl-tw.com/Infinity/Media/Misc/fastnoise.jpg

Very clever, thanks for sharing - might come in handy one day!

Jack

I get a 404 on that link :(

Can you put the file back up? I checked your site and it is no longer there. I'm really curious what it looks like :)

Sorry, i accidentely moved the file. It should be back up now.

Nice. The most interesting thing I've seen on gamedev for months. [smile]

Hmm, have you tried comparing noise generated on the CPU to noise generated on the GPU ? A quick test on my (simplex) GPU Noise/Starfield Generator experiment on my old Radeon 9800 pro shows that it takes me less than 200 ms to generate 1024x1024 12 octave noise, including a readback of the final result and some tone-mapping. Now my timing method is probably "less than optimal", initialization costs are higher, it takes some GPU memory, doesnt work on anything withoit PS 2.0 and with my card I'm more or less limited to 16-bit FP precision, but if you need to generate lots of noise values it seems like a good alternative. Or has anyone else tested this and got completely different results ?

Nice journal post anyway, the LUT trick here is new to me :)

It is a possibility that i'm keeping in mind. What i'm afraid of is slowing down the rendering by 200 ms * 6. The *6 is due to the fact that the place where i make the most intensive usage of noise are the planetary global texture generation and the spacebox background texture generation, and those are cubes of 6 faces. Of course, they could be spread over many frames, but still.. you'd see some serious slow downs / stuttering..

My idea was to use multi-threading instead. It would take more time to calculate, but it wouldn't affect the rendering speed..

Interestingly enough, the original java code runs even faster than your optimized C++ code on my 1.8Ghz Intel Centrino duo, which should be a slower CPU than yours.

I got 500 ms when calculating 1024x1024 noise values on JRE 1.6.

Used this as basis:

http://mrl.nyu.edu/~perlin/noise/

Then added some simple benchmarking code:
    public static double a;
private static int bench()
{
a=0;
long start=System.currentTimeMillis();
for (int y=0;y<1024;y++)
for (int x=0;x<1024;x++)
a+=noise(x,y,1);
return (int)(System.currentTimeMillis()-start);
}

public static void main(String args[])
{
System.out.println("Starting");
for (int i=0;i<10;i++)
System.out.println((i+1)+". "+bench());
}


The static a variable is used to ensure the compiler does not just omit the call to noise.

Does this mean that Java 14 times faster than C++? :)

Oops, just read the fine print: 12 octaves... :D

That surely makes the C++ version faster, as my benchmark just used one octave. Still,I would be very interested in knowing how Java stacks up to C++ in this case, with so many people claiming Java is as fast as C++ these days... :)

Those optimisations are really nice. I implemented them, however i got some interesting results with the assembler instruction optimisation.

fistp relies on the rounding state the fpu is currently in, with default being rounding to the nearest integer, instead of clamping towards zero. In this case you can get negative fractional parts, which however gives some nice results.
Maybe a procedural texture method of generating "modern metal with cracks" is born ;)

This article describes it way better than i can [stereopsis on fistp]

Here are the results:
Fast fBm noise 1
Fast fBm noise 2
Fast fBm noise 3 - this one is my favorite

h.

That being said, maybe you can help me understand if I have implemented what you've done incorrectly. Ken's fast noise uses a gradient function taking the hash (from p[]), x, y, and z coordinates to calculate a gradient.

inline float grad(int hash, float x, float y, float z)
{
// Convert LO 4 bits of hash code into 12 gradient directions
hash &= 15;
float u = (hash < 8) ? x : y;
float v = (hash < 4) ? y : (hash == 12 || hash == 14) ? x : z;
return (((hash & 1) == 0) ? u : -u) + (((hash & 2) == 0) ? v : -v);
}

which you have replaced with only a single lookup into your ms_grad4 table with the hash value. Now, how do you account for the other inputs in this lookup table? From the way your init step looks, they are not accounted for and thus no interpolation of gradients actually occurs, only an interpolation of noise values, which is exactly what Perlin noise builds upon to begin with.

I noticed in your init step you used ms_p2 which is not the same as ms_p used in your original code. Are these two tables actually different? Is this where you do some extra processing that was not shown in the init step. In my implementation I simply used the same p table that I've been using for Perlin noise in that location. Can you check the raw output of just 1 octave of perlin noise without the fBm for comparison, like I said I am getting axis-aligned noise.

You mentioned SSE2 noise in a later article, and that you've replaced all memory lookups with calculations. Does this mean in your latest implementation that this quality issue has basically gone away?

Thanks, Jeff

I was changing my perlin noise code over to SSE2 today and while I didn't notice too much speed improvement intially (perhaps X2) when doing four calculations at a time I did get a great shot in the arm when I modified it to use the registers in 16 bit mode allowing allowing eight calculations at a time.

It seems that 16 bits should provide enough detail for noise at any given octave, the final results are fed back in 128x32 bit float registers (doubled up as I'm doing 8 calculations at a time).

Returning to browse the intrinsics now to see if I can speed up the initial input register packing!

Slightly OT but there's a good paper on rounding asm here:
http://ldesoras.free.fr/doc/articles/rounding_en.pdf

Actually the reason I started looking at doing 8 rather than 4 noise calculations at once was because on my sse2 pc I can't find a simple side by side __m128i X 4 int multiply (equiv to _mm_mullo_epi32 in sse4). This meant that in the inner part of the noise function I was having to drop out of the sse2 pipeline to convert to types I could use to do the integar multiply. Fixed by changing over to eight calcs at once but in my 2d function I only need four results to lerp between so it ends up being slower than it could be because the return path supports 8 return values. This I would like to fix....

So what's the fastest way to multiply four integars in one __m128i register with four integars in another?

btw: Currently my 2d function runs minutely slower than my 3d function! The only place that it differs is the drop to 32 bit ints for the inner noise function which I'm assuming is down to some memory shenanigans during the conversion to and from the 128 bit value (hidden using the union approach). This I expected to slow things down but am still a little shocked at the extent!

Solution to my 2d perlin optimization is simply to use the 64 bit MMX functions!

_mm_mullo_pi16 - Multiplies four 16-bit values in m1 by four 16-bit values in m2 and produces the low 16 bits of the four results
_mm_cvtpi16_ps - Converts the four 16-bit signed integer values in a to four single-precision, floating-point values
_mm_set_pi16 - Sets the four signed 16-bit integer values.

I'll stop abusing this thread now.

## Create an account

Register a new account