Jump to content
  • Advertisement
Sign in to follow this  
supagu

Fast memset

This topic is 3325 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

i have an array of floats (4bytes) and i want to set this array to 1.f for each entry as fast as possible. I tried memset, but that only accepts a char sized value to set At the moment I have this:
// clear depth buffer
	{
		int count = (int)floorf(float(mWidth * mHeight) / 4.f);
		__m128* depthBuf = (__m128*)mDepthBuffer;
		__m128 clearDepth = _mm_set_ps1(1.f);
		int i = 0;
		for (int c = 0; c < count; ++c, i += 4)
		{
			*depthBuf = clearDepth;
			++depthBuf;
		}
		for (; i < (mWidth * mHeight); ++i)
			mDepthBuffer = 1.f;
	}
but its 2 slow.

Share this post


Link to post
Share on other sites
Advertisement
Assuming a decent SC++L implementation and with optimisations turned up, you probably won't get faster than std::fill_n without sacrificing portability.

Share this post


Link to post
Share on other sites
You could try streaming, (assuming mDepthBuffer is 16byte aligned)


__m128* depthBuf = (__m128*)mDepthBuffer;
const __m128 clearDepth = _mm_set_ps1(1.f);

size_t i = 0;
const size_t size = mWidth * mHeight;
for(size_t index = 0;i + 4 <= size; i+=4,index++)
_mm_stream_ps(depthBuf[index],clearDepth);
for(;i < size; i++)
mDepthBuffer = 1.f;

Share this post


Link to post
Share on other sites
Quote:
Original post by supagu
this requires your data to be in an stl container, i think, which it is not


1) The term "stl" has not been applicable for some time. This is the standard C++ library now.

2) Standard library algorithms are templated on iterators. Pointers are iterators for arrays. :)

Share this post


Link to post
Share on other sites
Oh looky, an optimization problem. First one needs a way to time things.
struct Timer {
Timer() {
reset();
}

double elapsed_ms() const
{
LARGE_INTEGER current_ticks;
QueryPerformanceCounter(&current_ticks);

LONGLONG delta = current_ticks.QuadPart - start_ticks;
return (1.0e3 * delta * frequency);
}

void reset() {
LARGE_INTEGER f;
QueryPerformanceFrequency(&f);
frequency = 1.0 / f.QuadPart;
LARGE_INTEGER st;
QueryPerformanceCounter(&st);
start_ticks = st.QuadPart;
}
double frequency;
long long start_ticks;
};



Use whatever works.

Next, we define a few different test cases:
void fillone_0(float * buf, size_t N) {
for (size_t i = 0; i < N; i++) {
buf = 1.0f;
}
}

void fillone_1(float * buf, size_t N) {
const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
for (size_t i = 0; i < N/4; i++) {
_mm_store_ps(&(buf[4*i]), value);
}
}


void fillone_2(float * buf, size_t N) {
const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
for (size_t i = 0; i < N/32; i++) {
_mm_store_ps(&(buf[32*i]), value);
_mm_store_ps(&(buf[32*i+4]), value);
_mm_store_ps(&(buf[32*i+8]), value);
_mm_store_ps(&(buf[32*i+12]), value);
_mm_store_ps(&(buf[32*i+16]), value);
_mm_store_ps(&(buf[32*i+20]), value);
_mm_store_ps(&(buf[32*i+24]), value);
_mm_store_ps(&(buf[32*i+28]), value);
}
}

void fillone_3(float * buf, size_t N) {
const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
for (size_t i = 0; i < N/4; i++) {
_mm_stream_ps(&(buf[4*i]), value);
}
}

void fillone_4(float * buf, size_t N) {
const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
for (size_t i = 0; i < N/32; i++) {
_mm_stream_ps(&(buf[32*i]), value);
_mm_stream_ps(&(buf[32*i+4]), value);
_mm_stream_ps(&(buf[32*i+8]), value);
_mm_stream_ps(&(buf[32*i+12]), value);
_mm_stream_ps(&(buf[32*i+16]), value);
_mm_stream_ps(&(buf[32*i+20]), value);
_mm_stream_ps(&(buf[32*i+24]), value);
_mm_stream_ps(&(buf[32*i+28]), value);
}
}




And the results from test run are:
int main(int argc, char **argv)
{
const size_t N = 16384*2;
const size_t M = 100000;

void * storage = _aligned_malloc(sizeof(float) * N, 16);
if (storage == NULL) {
printf("Failed to allocated memory.");
return 1;
}
float * buf = (float*)storage;

printf("Testing %d repetitions over %d floats\n", M, N);

Timer timer;
for (int x = 0; x < M; x++) fillone_0(buf, N);
printf("Baseline: %4.4f\n", timer.elapsed_ms());

timer.reset();
for (int x = 0; x < M; x++) fillone_1(buf, N);
printf("_mm_store_ps: %4.4f\n", timer.elapsed_ms());

timer.reset();
for (int x = 0; x < M; x++) fillone_2(buf, N);
printf("_mm_store_ps unrolled: %4.4f\n", timer.elapsed_ms());

timer.reset();
for (int x = 0; x < M; x++) fillone_3(buf, N);
printf("_mm_stream_ps: %4.4f\n", timer.elapsed_ms());

timer.reset();
for (int x = 0; x < M; x++) fillone_4(buf, N);
printf("_mm_stream_ps unrolled: %4.4f\n", timer.elapsed_ms());

for (int i = 0; i < 20; i++) printf("%g ", buf);
printf("\n");
}





Quote:
Testing 100000 repetitions over 2048 floats
Baseline: 45.3244
_mm_store_ps: 45.4975
_mm_store_ps unrolled: 45.2105
_mm_stream_ps: 143.6576
_mm_stream_ps unrolled: 144.0868

It would seem that there aren't any gains to be had here. 8k however means we're completely inside caches.
Quote:
Testing 10000 repetitions over 65536 floats
Baseline: 445.9909
_mm_store_ps: 388.0277
_mm_store_ps unrolled: 372.8233
_mm_stream_ps: 456.1334
_mm_stream_ps unrolled: 461.8451

Modest gain. stream_ps is basically the same as naive for loop, store_ps takes the lead, unrolled version is also slightly faster.

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

Interesting surprise, stream_ps takes the lead by far.


So the conclusion would be, surprise surprise: profile, profile, profile, and when done, profile some more. It may also be necessary to play with various prefetches and other issues, my point is merely that it's far from trivial to give a generic answer over what the fastest way is.

Share this post


Link to post
Share on other sites
Quote:
Original post by Antheus

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

Interesting surprise, stream_ps takes the lead by far.


Not that surprising, the streaming stores mean that data doesn't have to be read into the cache first before being stored into memory. Since this is the only test of the 3 that probably won't fit in the cache it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between each call, it might win in the smaller size tests.
Also its always nice to have a quick check of the code-gen for intrinsic code to make sure the compiler hasn't had a fit.

Share this post


Link to post
Share on other sites
Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?

Nobody seems to have mentioned duff's device either.

Share this post


Link to post
Share on other sites
If this is a major bottleneck then perhaps you should be looking into ways of not having to clear the buffer in the first place, instead of messing with finicky cache-control features. For instance, there's always the old trick of switching between positive and negative Z coordinates on alternate frames.

Quote:
Original post by Moomin
Not that surprising, the streaming stores mean that data doesn't have to be read into the cache first before being stored into memory. Since this is the only test of the 3 that probably won't fit in the cache it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between each call, it might win in the smaller size tests.
Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!