# SSE enhanced fractal noise library.



Quote:
 Original post by AndyPandyV2
 I tested your SSE/non-SSE version (VS2008 release, full optimizations) and got the following:
 Non-SSE Noise: 1659313 microseconds.
 SSE Noise: 796888 microseconds.

What PC spec was that with? Sorry, I read milliseconds, not microseconds.
I also compiled with VS2008 release and full optimizations, and ran it on an Intel Core 2 Duo @ 2.4GHz, using the code you posted with the 1024x1024 iteration, with timing added (there was none in the link you provided) and with the difference calculation moved into its own loop. I got:
SSE ticks = 3718
Normal ticks = 2782
Difference = 0.936221

The difference appears to be just an accumulated sum over all samples, so the average difference per iteration would be a small ~8.93 x 10^-7.
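For reference, that per-sample figure is simply the accumulated difference divided by the 1024x1024 samples in the test:

$$
\frac{0.936221}{1024 \times 1024} = \frac{0.936221}{1048576} \approx 8.93 \times 10^{-7}
$$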

##### Share on other sites
Andy:
Thanks for testing it, and yes the test settings are set to use 16 octaves.
So when I put some SDL timing into the test code I provided in the Fractal_Noise project I get the following results using VS2005:
```cpp
#define SDL_TIMING

#include "stdafx.h"
#include "perlin.hpp"
#include <cmath>    // std::abs(float)
#include <cstdlib>  // rand, RAND_MAX
#include <iostream>
#include <vector>

#ifdef SDL_TIMING
#include <SDL.h>
#pragma comment(lib, "SDL.lib")
#endif

#define SAMPLE_COUNT (1024*1024)

int _tmain(int argc, _TCHAR* argv[])
{
    // the simple noise generator
    noise::Noise noiseGen;
    // the ridged multifractal spectral composer
    noise::RidgedMultifractalProvider<> fractalProvider(0.9f, 1.0f);
    // the spatial noise generator, which brings together the noise generator
    // and the spectral composer and applies the parameters
    noise::SpatialNoiseGenerator<> fractal(&noiseGen, &fractalProvider, 2.0f, 16.0f, 0.75f);

    // generate some random data
    std::vector<float> xs(SAMPLE_COUNT), ys(SAMPLE_COUNT), zs(SAMPLE_COUNT);
    for(int x=0; x<SAMPLE_COUNT; ++x)
    {
        xs[x] = ((float)rand()/(float)RAND_MAX)-0.5f;
        ys[x] = ((float)rand()/(float)RAND_MAX)-0.5f;
        zs[x] = ((float)rand()/(float)RAND_MAX)-0.5f;
    }

    // call the SSE version
    std::vector<float> SSE_results;
#ifdef SDL_TIMING
    SDL_Init(SDL_INIT_TIMER);
    Uint32 t = SDL_GetTicks();
#endif
    fractal(SSE_results, xs, ys, zs);
#ifdef SDL_TIMING
    Uint32 ssetime = SDL_GetTicks() - t;
    std::cout << "SSE time: " << ssetime << std::endl;
#endif

    // call the non-SSE version and difference the results for an accuracy comparison
    std::vector<float> non_SSE_results(SAMPLE_COUNT);
    float diff = 0.0f;
#ifdef SDL_TIMING
    t = SDL_GetTicks();
#endif
    for(int x=0; x<SAMPLE_COUNT; ++x)
    {
        non_SSE_results[x] = fractal(xs[x], ys[x], zs[x]);
        diff += std::abs(non_SSE_results[x] - SSE_results[x]);
    }
#ifdef SDL_TIMING
    Uint32 nonssetime = SDL_GetTicks() - t;
    std::cout << "Non-SSE time: " << nonssetime << std::endl;
#endif
    return 0;
}
```

Using the absolute default Release settings:
SSE time: 350
Non-SSE time: 1071
All default release settings, but with Enhanced Instruction set switched to SSE2:
SSE time: 400
Non-SSE time: 1550

Moomin:
Thanks for the advice! Could you please tell me which compiler version and which settings you are using to compile? (Send me the project file if you can, so I can try it myself!) I can't find any combination of settings that makes the SSE version actually slower, except running in debug mode!
I will try implementing your version of the floor function and post back my results. I was under the impression that the _mm_set functions were for literals only, but I will try using them and see what happens!
/edit
Just saw you posted the compiler version etc., thanks a lot. I will download the EE and try it out!

##### Share on other sites
Well, I get pretty much identical results when compiling under VS2008 as I did under VS2005 (both the SSE and non-SSE versions are about 2-5% faster).
Moomin, any chance I could get your project file to make sure I am using the exact same settings? Any ideas why we are getting such different results?

##### Share on other sites
Recompiled this morning and the SSE version seems to be running much faster (x3.5). Sorry, I must have missed something last night. The SSE version can still be optimized, though.

##### Share on other sites
That's good to know!
I tried using the ps_set function, but as I suspected it is designed for constant immediate values; if I try to use variables it flips back to FP mode and breaks, and it also slows things down a lot.
The floor function, on the other hand, is good: I get a few percent speed increase, I don't need to change the rounding mode, and the results are more accurate (the difference from the non-SSE version is 1/6 of what it was previously). Thanks a lot!
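For anyone reading along, here is a minimal sketch of one common way to write a packed floor that never touches the rounding mode: truncate, convert back, then subtract one in the lanes where truncation rounded up. This is not necessarily the version Moomin posted. The name `floor_ps_sketch` is made up for illustration, and the trick only holds while every input fits in a signed 32-bit integer.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Floor of four packed floats without changing the MXCSR rounding mode.
// Assumption: every lane fits in a signed 32-bit integer.
static inline __m128 floor_ps_sketch(__m128 x)
{
    // Truncate toward zero, then convert the integers back to floats.
    const __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
    // Lanes where truncation rounded up (negative non-integers) compare
    // true, which sets all bits of that lane.
    const __m128 rounded_up = _mm_cmpgt_ps(t, x);
    // Mask 1.0f into exactly those lanes and subtract it.
    return _mm_sub_ps(t, _mm_and_ps(rounded_up, _mm_set1_ps(1.0f)));
}
```

Loading `{1.5f, -1.5f, -2.0f, 0.0f}` with `_mm_loadu_ps`, passing it through the function, and storing with `_mm_storeu_ps` should give the same values as `std::floor` on each element.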
So far, for my setup, the performance seems to level off at about 3.15 times faster than the standard version as the sample size becomes extremely large (20 million+). Even at 10,000 samples it still seems to hover around 3 times faster, but SDL_GetTicks isn't really accurate enough for a sample that small.
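One way around the millisecond resolution for small sample counts is to repeat the work many times and average over a finer-grained clock. The sketch below assumes a C++11 compiler with `std::chrono`, which the VS2005/VS2008 toolchains used in this thread do not have; `average_microseconds` is a hypothetical helper, not part of the library.

```cpp
#include <chrono>

// Runs `work` `repeats` times and returns the average duration of a
// single run in microseconds, measured with a monotonic clock.
template <typename Work>
double average_microseconds(Work work, int repeats)
{
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    for (int i = 0; i < repeats; ++i)
        work();
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

Something like `average_microseconds([&] { fractal(SSE_results, xs, ys, zs); }, 100)` would then time the batched call, provided the results are actually used afterwards so the optimizer cannot discard the work.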
I have uploaded the new version here:
http://billw.atwebpages.com/Fractal_Noise17_06_08.rar
I also quickly implemented the FBM spectral composer, which currently shows a more modest performance gain of ~2x.
Moomin, would you care to expand on:
Quote:
 You seem to be often using operations on 2x32bit values instead of 4x32. _mm_store_ps might be a good idea for getting the bits of the floats (combined with a union.)

I cannot see anywhere where I am performing operations on 2x32bit values. I was previously, in my floor function, but I have replaced that with the one you provided. I'm not sure what you mean about using a union to allow me to get the bits back using the aligned version of store?

##### Share on other sites
Quote:
 Original post by bluntman
 I tried using the ps_set function, but as I suspected it is designed for constant immediate values; if I try to use variables it flips back to FP mode and breaks, and it also slows things down a lot.

Hmm, this sounds odd to me. The Nebula Device uses _mm_set_ps, and so does the built-in F32vec4 class from fvec.h:
Quote:
 /* initialize 4 SP FPs with 4 floats */
 F32vec4(float f3, float f2, float f1, float f0) { vec= _mm_set_ps(f3,f2,f1,f0); }

Neither MSDN nor the Intel intrinsics reference mentions any need for the values to be constants. Are you sure something else isn't breaking it instead?
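As a quick illustration that `_mm_set_ps` accepts ordinary runtime variables, here is a small sketch. `pack4` and `add4` are hypothetical helpers written for this example. It says nothing about whether set is faster or slower than load plus shuffles; that depends on the compiler and the surrounding code.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Packs four runtime floats into one register. _mm_set_ps takes its
// arguments from the highest lane down to the lowest, so f0 lands in lane 0.
static inline __m128 pack4(float f3, float f2, float f1, float f0)
{
    return _mm_set_ps(f3, f2, f1, f0);
}

// Adds two 4-float arrays by packing their (runtime) elements with
// _mm_set_ps, then stores the four sums to out[0..3].
static inline void add4(const float a[4], const float b[4], float out[4])
{
    const __m128 va = pack4(a[3], a[2], a[1], a[0]);
    const __m128 vb = pack4(b[3], b[2], b[1], b[0]);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Note the reversed argument order: getting it backwards silently permutes the lanes, which can look like "mangled registers" further down the pipeline.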

Quote:
 I cannot see anywhere where I am performing operations on 2x32bit values. I was previously, in my floor function, but I have replaced that with the one you provided. I'm not sure what you mean about using a union to allow me to get the bits back using the aligned version of store?

Yeah, I was referring to the floor function and also to the store int function, which at first glance I thought meant "store bits", going by the comments.

##### Share on other sites
http://msdn.microsoft.com/en-us/library/0hey67c0.aspx
When I switch to using the set function rather than the load and shuffles, it kills the performance and mangles the registers for the next operation.
Thanks a lot for the help, guys. Anybody can feel free to use and abuse this code. I will release a more polished version when I have got some way toward completing the project it is a part of, along with the rest of the project itself. For now I thought some people might like to make use of this as it is.

##### Share on other sites
I am curious whether any of you have benchmarked noise generation on a high-end GPU versus a high-end CPU.

##### Share on other sites
I'm guessing the SDL tick count is in milliseconds? If so, you're getting nearly 2x the speed I'm getting for both your SSE and C++ versions (I'm using a 2.9 GHz Pentium D, guess it sucks).
Why in the world does your C++ version run slower when you enable SSE2? I think you should turn on all optimizations to get a realistic SSE vs. C++ comparison, since that is how the final version of your code would be compiled anyway...

Odd about the SSE set functions. In my version I found that using the SSE set functions was faster than the load functions, not slower, even when what I was setting came from a variable.

One optimization you could consider if you want it to run even faster: a lot of the calculations you're performing are the same on each call to Noise(). If your inner loop runs along the Z values, for instance, you could precompute the X/Y terms for that entire row (the U/V values don't change, nor do some other things). This will only work, of course, if your noise sampling isn't rotated...
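A rough sketch of that hoisting idea, using a toy value noise rather than the library's gradient noise. Everything here (`AxisTerm`, `lattice_value`, `noise_row`, the hash constants) is a hypothetical stand-in; the point is only that the per-axis floor and fade weight for X and Y are computed once per row instead of once per sample.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-axis terms of a lattice lookup: integer cell and smoothed weight.
struct AxisTerm { int cell; float fade; };

inline AxisTerm axis_term(float v)
{
    const float f = std::floor(v);
    const float t = v - f;
    // Quintic fade curve 6t^5 - 15t^4 + 10t^3.
    const AxisTerm term = { int(f), t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f) };
    return term;
}

// Hypothetical stand-in for the permutation-table lookup: any
// deterministic hash of the cell coordinates will do for the sketch.
inline float lattice_value(int x, int y, int z)
{
    unsigned h = unsigned(x) * 73856093u ^ unsigned(y) * 19349663u ^ unsigned(z) * 83492791u;
    h = (h ^ (h >> 13)) * 1274126177u;
    return float(h & 0xFFFFu) / 65535.0f;
}

inline float mix(float a, float b, float t) { return a + t * (b - a); }

// Trilinear blend of the eight cell corners from precomputed axis terms.
inline float noise_from_terms(const AxisTerm& X, const AxisTerm& Y, const AxisTerm& Z)
{
    const int x = X.cell, y = Y.cell, z = Z.cell;
    const float x00 = mix(lattice_value(x, y,     z),     lattice_value(x + 1, y,     z),     X.fade);
    const float x10 = mix(lattice_value(x, y + 1, z),     lattice_value(x + 1, y + 1, z),     X.fade);
    const float x01 = mix(lattice_value(x, y,     z + 1), lattice_value(x + 1, y,     z + 1), X.fade);
    const float x11 = mix(lattice_value(x, y + 1, z + 1), lattice_value(x + 1, y + 1, z + 1), X.fade);
    return mix(mix(x00, x10, Y.fade), mix(x01, x11, Y.fade), Z.fade);
}

// Baseline: recomputes all three axis terms for every sample.
inline float noise_per_sample(float x, float y, float z)
{
    return noise_from_terms(axis_term(x), axis_term(y), axis_term(z));
}

// Row version: X and Y terms are hoisted out of the loop over z.
inline void noise_row(float x, float y, const std::vector<float>& zs, std::vector<float>& out)
{
    const AxisTerm X = axis_term(x);
    const AxisTerm Y = axis_term(y);
    out.resize(zs.size());
    for (std::size_t i = 0; i < zs.size(); ++i)
        out[i] = noise_from_terms(X, Y, axis_term(zs[i]));
}
```

A real implementation could go further and cache the four X/Y corner hashes per row as well, since only the Z part of the lookup changes inside the loop.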

Btw, I've found that (return x>0 ? int(x) : int(x) - 1;) allowed my C++ noise to run about 15% faster than the lrfloorf(x) function you're using. I put it in your code and it had a similar effect.
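One caveat with that expression, sketched below: it is off by one for zero and for negative whole numbers (for example, it maps -2.0 to -3). Whether that matters depends on how the fractional part is used afterwards. The function names are made up for this example, and the second variant is just one possible fix, with no claim about its speed.

```cpp
// The truncating floor quoted above, wrapped in a function.
inline int fast_floor_quoted(float x)
{
    return x > 0 ? int(x) : int(x) - 1;
}

// Variant that also returns the exact floor for zero and for negative
// whole numbers, at the cost of an int-to-float conversion.
inline int fast_floor_exact(float x)
{
    const int i = int(x);               // truncates toward zero
    return (x < float(i)) ? i - 1 : i;  // step down only if truncation rounded up
}
```

Both versions assume the input fits in an `int`; outside that range the conversion is undefined.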

##### Share on other sites
Rumble: I tested that a bit on my computer. The GPU version was much easier to write and easily ran faster. My GPU isn't even that good (Nvidia 6800 vs. 2.9 GHz Pentium D), but it is about 2x as fast as my optimized SSE noise, though if both cores were engaged my CPU might be able to keep up.

I was trying to generate volumes of 32^3 with 12 octaves. On the GPU it ran at around 220 FPS, generating an entire volume each frame; on the CPU it takes around 7.5 milliseconds per volume using SSE. Gonna stick with the CPU version since I need to run marching cubes on it anyway...

[Edited by - AndyPandyV2 on June 17, 2008 4:54:55 PM]
