SSE enhanced fractal noise library.


Started by bluntman June 16, 2008 10:50 AM

20 comments, last by okroms 12 years, 7 months ago

255

Author

June 16, 2008 10:50 AM

I have spent the last couple of days implementing SSE enhanced versions of my fractal noise library using the MSVC intrinsics. I have found a speed up of >30x for my noise function (based on the 'fast noise' method using the ms_grad4 lookup table), and a >50x speed up for fractal noise functions using this noise method. The library is composable in that you can plug different noise methods and 'spectral composition' methods together. At the moment 'ridged fractals' are the only one implemented using SSE; FBM style is also present but uses a straight C++ implementation. I am wondering if this code is of interest to anyone else? I have not seen any freely available SSE enhanced noise algorithms online, and would be willing to share this one with no strings attached.

Planet rendering.

RealMarkP

216

June 16, 2008 11:21 AM

Quote:Original post by bluntman
I am wondering if this code is of interest to anyone else?

YES!!!! Gimmie!

EDIT: In general, someone, somewhere, will be interested in your work. I'm interested in it because I would like to see how you handled the SSE side of things. Mainly for learning purposes.

------------Anything prior to 9am should be illegal.

Ysaneya

1,391

June 16, 2008 01:17 PM

Sorry, but I'm sceptical.

SSE can give you, in theory, a performance increase of x4. With the setup and instruction limitations, you can hope for an average increase of x2.

If you really are getting x50, then either you used a different type of noise or algorithm, or the SSE version is broken and returns wrong results.

Needless to say, I'm very interested in seeing the code, but make sure to include the "old" noise too, to make fair comparisons :)

Y.

AndyPandyV2

298

June 16, 2008 01:25 PM

I'm interested, but yah... SSE doesn't give 30x speedups... are you comparing it to an unoptimized version of noise without the lookup table?

Moomin

332

June 16, 2008 01:30 PM

Quote:Original post by Ysaneya
SSE can give you, in theory, a performance increase of x4. With the setup and instruction limitations, you can hope for an average increase of x2.

If you are using an iteration-based noise function, you should be able to get closer to x4 than x2, but yeah, nowhere near x30.
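To make the x4 ceiling discussed here concrete, below is a minimal sketch (not taken from bluntman's library) of the same arithmetic written once per float and once per four floats. The function names `scale_bias_scalar` and `scale_bias_sse` are made up for illustration, and the SSE loop assumes the element count is a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*_ps)
#include <cstddef>

// Scalar reference: one multiply and one add per loop iteration.
void scale_bias_scalar(const float* in, float* out, std::size_t n,
                       float scale, float bias)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale + bias;
}

// SSE version: each iteration handles four floats with one multiply and
// one add instruction. Assumption: n is a multiple of 4.
void scale_bias_sse(const float* in, float* out, std::size_t n,
                    float scale, float bias)
{
    const __m128 s = _mm_set1_ps(scale);  // broadcast scale to all 4 lanes
    const __m128 b = _mm_set1_ps(bias);   // broadcast bias to all 4 lanes
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);      // unaligned load of 4 floats
        v = _mm_add_ps(_mm_mul_ps(v, s), b);  // 4 multiply-adds at once
        _mm_storeu_ps(out + i, v);            // unaligned store of 4 floats
    }
}
```

The arithmetic instruction count drops by a factor of four, but loads, stores and loop overhead do not all shrink by the same factor, which is one reason measured gains usually land below x4.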

bluntman

255

Author

June 16, 2008 01:38 PM

Yes, I am skeptical as well; that's the other reason I want someone to look at my code! I just had to change a bit of it, and now it registers a 60x speed up for the noise function and a 90x speed up for the fractal version!
Here is the link [edit- this is dead now, use the one in later post]
http://billw.atwebpages.com/Fractal_Noise.rar.
Please any comments, negative or positive, or improvements are welcome.
My figures are for compiling under MSVC 2005 Pro SP1, with SSE2 instruction set enabled, timing over a sample of 1,000,000 using SDL_GetTicks for the timing.
I haven't so far looked at what asm is being generated, even if I did I probably couldn't make heads or tails of it.
This is my first (seemingly) successful foray into SIMD, with only the MSDN docs for reference, so there are probably some weirdnesses in the code.
I left out any dependencies in the version I have uploaded above, so add your own timing code...
Also it's obviously a work in progress; I have tried to add a few comments, but the example should make the usage obvious.

/edit
Here is the actual test code I was running which gave me the figures I quoted:

std::vector<float> xs(1024*1024), ys(1024*1024), zs(1024*1024);
math::RidgedMultifractalProvider<> fractalProvider(0.9f, 1.0f);
math::SpatialNoiseGenerator<> fractal(&ntest, &fractalProvider, 2.0f, 16.0f, 0.75f);
for(int x=0, offs=0; x<1024; ++x)
{
	for(int y=0; y<1024; ++y, ++offs)
	{
		xs[offs] = ((float)rand()/(float)RAND_MAX)-0.5f;
		ys[offs] = ((float)rand()/(float)RAND_MAX)-0.5f;
		zs[offs] = ((float)rand()/(float)RAND_MAX)-0.5f;
	}
}
SDL_Init(SDL_INIT_EVERYTHING);
t = SDL_GetTicks();
std::vector<float> results;
fractal(results, xs, ys, zs, -3.0f);
ssetime = SDL_GetTicks() - t;
t = SDL_GetTicks();
for(int x=0, offs=0; x<1024; ++x)
{
	for(int y=0; y<1024; ++y, ++offs)
	{
		results[offs] = fractal(xs[offs], ys[offs], zs[offs]);
	}
}
nonssetime = SDL_GetTicks() - t;
std::cout << "Fractal Test: SSE Time: " << ssetime << ", non SSE Time: " << nonssetime << std::endl;

Also I am running this on an E6850 Core 2 Duo if that makes any difference..
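For anyone adding their own timing code, here is one possible harness, offered as a sketch rather than as what was used for the figures above. SDL_GetTicks reports whole milliseconds, so this version uses `std::chrono::steady_clock` instead (C++11, so newer than the MSVC 2005 toolchain mentioned in this post) and keeps the fastest of several runs to reduce noise from other processes. The helper name `best_time_ms` is hypothetical.

```cpp
#include <chrono>

// Hypothetical helper: calls `work` `repeats` times and returns the
// fastest single run, in milliseconds.
template <typename Work>
double best_time_ms(Work&& work, int repeats = 5)
{
    using clock = std::chrono::steady_clock;
    double best = -1.0;  // negative means "no measurement yet"
    for (int i = 0; i < repeats; ++i) {
        const auto start = clock::now();
        work();
        const std::chrono::duration<double, std::milli> elapsed =
            clock::now() - start;
        if (best < 0.0 || elapsed.count() < best)
            best = elapsed.count();
    }
    return best;
}
```

Usage would look like `double sse_ms = best_time_ms([&] { fractal(results, xs, ys, zs, -3.0f); });`, with the scalar loop wrapped in a second lambda so both paths are measured the same way.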

[Edited by - bluntman on June 17, 2008 8:38:21 PM]

Planet rendering.

bluntman

255

Author

June 16, 2008 01:52 PM

Well I just found out I was actually compiling with enhanced instruction set set to SSE, NOT SSE2.
When I switch to SSE2, my times for the SSE version stay the same but my times for the non-SSE version drop, bringing the actual ratio to:
Almost exactly 3 times speed up for the standard noise function and almost exactly 4 times speed up for the fractal noise function.
Any explanations as to why that might be? The figures make more sense but the reason doesn't, at least not to me.

Planet rendering.

Moomin

332

June 16, 2008 02:40 PM

Your code contains quite a few signed/unsigned type mismatches that you should fix. It also has quite a few implicit type conversions, though those are less of an issue.

The function "packFloats" seems to be a custom version of _mm_set_ps which you should be using instead.
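As a sketch of that substitution: the real signature of packFloats is not shown in this thread, so the wrapper `pack_floats` below is a hypothetical stand-in that takes four floats in memory order. The main thing to watch is argument order: `_mm_set_ps` takes the highest lane first, while `_mm_setr_ps` takes the lowest lane first. The `lane` helper is also made up, purely to read a lane back for checking.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical stand-in for a hand-written pack function: x0 ends up in
// the lowest lane, x3 in the highest.
inline __m128 pack_floats(float x0, float x1, float x2, float x3)
{
    // Equivalent to _mm_set_ps(x3, x2, x1, x0).
    return _mm_setr_ps(x0, x1, x2, x3);
}

// Illustrative helper: reads lane i (0 = lowest) by storing the vector
// to an aligned temporary array.
inline float lane(__m128 v, int i)
{
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);
    return tmp[i];
}
```

With these definitions, `lane(pack_floats(1, 2, 3, 4), 0)` is 1, whereas `lane(_mm_set_ps(1, 2, 3, 4), 0)` is 4.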

I haven't looked through all the SSE code but it seems slightly bloated. You seem to be often using operations on 2x32bit values instead of 4x32. _mm_store_ps might be a good idea for getting the bits of the floats (combined with a union.)
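One way the `_mm_store_ps` suggestion might look is sketched below. It stores all four lanes in one instruction and then copies the raw bits out; `std::memcpy` is used here in place of the union mentioned above, since it has well-defined behaviour in standard C++. The names `Lanes` and `float_bits` are assumptions for illustration, not part of the posted library.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdint>
#include <cstring>

// Illustrative container for the raw IEEE-754 bit patterns of 4 lanes.
struct Lanes {
    std::uint32_t bits[4];
};

// Stores the vector with a single aligned store, then reinterprets the
// four floats as 32-bit integers.
inline Lanes float_bits(__m128 v)
{
    alignas(16) float tmp[4];  // _mm_store_ps needs 16-byte alignment
    _mm_store_ps(tmp, v);
    Lanes out;
    std::memcpy(out.bits, tmp, sizeof tmp);  // bit-exact copy, no conversion
    return out;
}
```

For example, a lane holding 1.0f comes back as 0x3F800000 and a lane holding -2.0f as 0xC0000000.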

I'd be inclined to replace your flooring function with something like the following, although you would have to test for performance benefit/deficit.

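// Sign bit of each 32-bit lane.
const _MM_ALIGN16 unsigned int sign_mask[4] = {0x80000000,0x80000000,0x80000000,0x80000000};
// 2^23: floats of this magnitude have no fractional bits.
const _MM_ALIGN16 float two_pow_23[4] = {8388608.0f, 8388608.0f, 8388608.0f, 8388608.0f};
const _MM_ALIGN16 float one[4] = {1.0f,1.0f,1.0f,1.0f};

// Round to nearest integer by adding and subtracting 2^23 with the sign of x.
// Only valid for |x| < 2^23.
inline __m128 _mm_round_ps(__m128 x)
{
  __m128 t=_mm_or_ps( _mm_and_ps(_mm_load_ps((const float*)sign_mask),x),
                      _mm_load_ps(two_pow_23) );
  return _mm_sub_ps(_mm_add_ps(x,t),t);
}

// Floor: round, then subtract 1 in the lanes where rounding went up.
inline __m128 _mm_floor_ps(__m128 x)
{
  __m128 t=_mm_round_ps(x);
  return _mm_sub_ps(t,_mm_and_ps(_mm_cmplt_ps(x,t),_mm_load_ps(one)));
}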
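Before swapping in a flooring routine like this, it is worth checking it against `std::floor` as well as timing it. The sketch below restates the same add-and-subtract-2^23 technique under different, made-up names (`round_ps_sketch`, `floor_ps_sketch`), because SSE4.1 headers declare their own `_mm_round_ps` and `_mm_floor_ps`. It assumes |x| < 2^23, the default round-to-nearest mode, and no fast-math style compiler flags.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cmath>

// Rounds each lane to the nearest integer: adding 2^23 (with x's sign)
// pushes the fraction out of the mantissa, subtracting brings x back.
inline __m128 round_ps_sketch(__m128 x)
{
    const __m128 sign  = _mm_and_ps(x, _mm_set1_ps(-0.0f));  // sign bits only
    const __m128 magic = _mm_or_ps(sign, _mm_set1_ps(8388608.0f));
    return _mm_sub_ps(_mm_add_ps(x, magic), magic);
}

// Floor: where rounding produced a value above x, step down by 1.
inline __m128 floor_ps_sketch(__m128 x)
{
    const __m128 r = round_ps_sketch(x);
    // _mm_cmplt_ps yields an all-ones mask per lane; AND with 1.0f turns
    // that into 1.0f or 0.0f.
    const __m128 step = _mm_and_ps(_mm_cmplt_ps(x, r), _mm_set1_ps(1.0f));
    return _mm_sub_ps(r, step);
}

// Illustrative check: true if all four lanes agree with std::floor.
inline bool floor_matches_std(float a, float b, float c, float d)
{
    alignas(16) float in[4] = {a, b, c, d};
    alignas(16) float out[4];
    _mm_store_ps(out, floor_ps_sketch(_mm_load_ps(in)));
    for (int i = 0; i < 4; ++i)
        if (out[i] != std::floor(in[i]))
            return false;
    return true;
}
```

Negative inputs and exact halves such as -1.5f and 2.5f are the cases most likely to expose a mistake, so they are good values to include in any such check.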

With the code you posted in your link, I found with all speed optimizations turned on your SSE code ran 13.76% slower than the non-sse version.

AndyPandyV2

298

June 16, 2008 02:42 PM

I tested your SSE/non-SSE versions (VS2008 release, full optimizations) and got the following..

None SSE Noise: 1659313 microseconds.
SSE Noise: 796888 microseconds.

That was for your 1024*1024 grid. Not sure how many octaves this was? I didn't look too closely, just called fractal(SSE_results, xs, ys, zs); Is this more than 1 octave, bluntman?

AndyPandyV2

298

June 16, 2008 02:48 PM

BTW bluntman, I wrote the results of your SSE version and the non-SSE version out to an image file and they look the same, and it does look like noise, so I believe you're producing correct results.
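A visual comparison like this can be backed up with a numeric one. The helper below is a hypothetical sketch (the name `max_abs_diff` is not from the library) that reports the largest per-sample difference between the two result buffers. Small non-zero values would not be surprising, since the two code paths may round intermediate results differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest absolute difference between corresponding
// samples of two result buffers. Compares only the common prefix if the
// sizes differ.
inline float max_abs_diff(const std::vector<float>& a,
                          const std::vector<float>& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

Calling it on the SSE and non-SSE result vectors gives a single number that can be compared against a tolerance chosen for the application.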
