# Fast memset


## Recommended Posts

I have an array of floats (4 bytes each) and I want to set every entry to 1.f as fast as possible. I tried memset, but that only accepts a char-sized value. At the moment I have this:
```cpp
// clear depth buffer
{
    int count = (mWidth * mHeight) / 4; // integer division floors
    __m128* depthBuf = (__m128*)mDepthBuffer; // assumes 16-byte alignment
    __m128 clearDepth = _mm_set_ps1(1.f);
    int i = 0;
    for (int c = 0; c < count; ++c, i += 4)
    {
        *depthBuf = clearDepth;
        ++depthBuf;
    }
    // handle the remaining 0-3 floats
    for (; i < (mWidth * mHeight); ++i)
        mDepthBuffer[i] = 1.f;
}
```

but it's too slow.

##### Share on other sites
Assuming a decent SC++L implementation and with optimisations turned up, you probably won't get faster than std::fill_n without sacrificing portability.

##### Share on other sites
This requires your data to be in an STL container, I think, which it is not.

##### Share on other sites
No, std::fill_n() can be used on arrays.

##### Share on other sites
You could try streaming (assuming mDepthBuffer is 16-byte aligned):

```cpp
__m128* depthBuf = (__m128*)mDepthBuffer;
const __m128 clearDepth = _mm_set_ps1(1.f);
size_t i = 0;
const size_t size = mWidth * mHeight;
for (size_t index = 0; i + 4 <= size; i += 4, index++)
    _mm_stream_ps((float*)(depthBuf + index), clearDepth);
for (; i < size; i++)
    mDepthBuffer[i] = 1.f;
```

##### Share on other sites
Quote:
Original post by supagu
This requires your data to be in an STL container, I think, which it is not.

1) The term "stl" has not been applicable for some time. This is the standard C++ library now.

2) Standard library algorithms are templated on iterators. Pointers are iterators for arrays. :)

##### Share on other sites
Oh looky, an optimization problem. First one needs a way to time things.
```cpp
struct Timer {
    Timer() {
        reset();
    }
    double elapsed_ms() const {
        LARGE_INTEGER current_ticks;
        QueryPerformanceCounter(&current_ticks);
        LONGLONG delta = current_ticks.QuadPart - start_ticks;
        return (1.0e3 * delta * frequency);
    }
    void reset() {
        LARGE_INTEGER f;
        QueryPerformanceFrequency(&f);
        frequency = 1.0 / f.QuadPart;
        LARGE_INTEGER st;
        QueryPerformanceCounter(&st);
        start_ticks = st.QuadPart;
    }
    double frequency;
    long long start_ticks;
};
```
Use whatever works.

Next, we define a few different test cases:
```cpp
void fillone_0(float * buf, size_t N) {
    for (size_t i = 0; i < N; i++) {
        buf[i] = 1.0f;
    }
}
void fillone_1(float * buf, size_t N) {
    const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    for (size_t i = 0; i < N/4; i++) {
        _mm_store_ps(&(buf[4*i]), value);
    }
}
void fillone_2(float * buf, size_t N) {
    const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    for (size_t i = 0; i < N/32; i++) {
        _mm_store_ps(&(buf[32*i]), value);
        _mm_store_ps(&(buf[32*i+4]), value);
        _mm_store_ps(&(buf[32*i+8]), value);
        _mm_store_ps(&(buf[32*i+12]), value);
        _mm_store_ps(&(buf[32*i+16]), value);
        _mm_store_ps(&(buf[32*i+20]), value);
        _mm_store_ps(&(buf[32*i+24]), value);
        _mm_store_ps(&(buf[32*i+28]), value);
    }
}
void fillone_3(float * buf, size_t N) {
    const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    for (size_t i = 0; i < N/4; i++) {
        _mm_stream_ps(&(buf[4*i]), value);
    }
}
void fillone_4(float * buf, size_t N) {
    const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    for (size_t i = 0; i < N/32; i++) {
        _mm_stream_ps(&(buf[32*i]), value);
        _mm_stream_ps(&(buf[32*i+4]), value);
        _mm_stream_ps(&(buf[32*i+8]), value);
        _mm_stream_ps(&(buf[32*i+12]), value);
        _mm_stream_ps(&(buf[32*i+16]), value);
        _mm_stream_ps(&(buf[32*i+20]), value);
        _mm_stream_ps(&(buf[32*i+24]), value);
        _mm_stream_ps(&(buf[32*i+28]), value);
    }
}
```

And the results from a test run are:
```cpp
int main(int argc, char **argv)
{
    const size_t N = 16384*2;
    const size_t M = 100000;
    void * storage = _aligned_malloc(sizeof(float) * N, 16);
    if (storage == NULL) {
        printf("Failed to allocate memory.");
        return 1;
    }
    float * buf = (float*)storage;
    printf("Testing %u repetitions over %u floats\n", (unsigned)M, (unsigned)N);
    Timer timer;
    for (int x = 0; x < M; x++) fillone_0(buf, N);
    printf("Baseline: %4.4f\n", timer.elapsed_ms());
    timer.reset();
    for (int x = 0; x < M; x++) fillone_1(buf, N);
    printf("_mm_store_ps: %4.4f\n", timer.elapsed_ms());
    timer.reset();
    for (int x = 0; x < M; x++) fillone_2(buf, N);
    printf("_mm_store_ps unrolled: %4.4f\n", timer.elapsed_ms());
    timer.reset();
    for (int x = 0; x < M; x++) fillone_3(buf, N);
    printf("_mm_stream_ps: %4.4f\n", timer.elapsed_ms());
    timer.reset();
    for (int x = 0; x < M; x++) fillone_4(buf, N);
    printf("_mm_stream_ps unrolled: %4.4f\n", timer.elapsed_ms());
    for (int i = 0; i < 20; i++) printf("%g ", buf[i]);
    printf("\n");
    _aligned_free(storage);
    return 0;
}
```

Quote:
Testing 100000 repetitions over 2048 floats
Baseline: 45.3244
_mm_store_ps: 45.4975
_mm_store_ps unrolled: 45.2105
_mm_stream_ps: 143.6576
_mm_stream_ps unrolled: 144.0868

It would seem that there aren't any gains to be had here. 2048 floats is only 8 KB, however, which means we're completely inside the caches.
Quote:
Testing 10000 repetitions over 65536 floats
Baseline: 445.9909
_mm_store_ps: 388.0277
_mm_store_ps unrolled: 372.8233
_mm_stream_ps: 456.1334
_mm_stream_ps unrolled: 461.8451

A modest gain. stream_ps is basically the same as the naive for loop, store_ps takes the lead, and the unrolled version is slightly faster still.

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

An interesting surprise: stream_ps takes the lead by far.

So the conclusion would be, surprise surprise: profile, profile, profile, and when done, profile some more. It may also be necessary to play with various prefetches and other issues; my point is merely that it's far from trivial to give a generic answer about what the fastest way is.

##### Share on other sites
Quote:
Original post by Antheus

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

An interesting surprise: stream_ps takes the lead by far.

Not that surprising: streaming stores mean the data doesn't have to be read into the cache before being stored to memory. Since this is the only test of the three that probably doesn't fit in the cache, it wins. In a more realistic scenario, where the array being cleared might be evicted from the cache between calls, it might win in the smaller-size tests too.
Also, it's always nice to have a quick check of the code-gen for intrinsic code, to make sure the compiler hasn't had a fit.

##### Share on other sites
Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?

Nobody seems to have mentioned Duff's device either.

##### Share on other sites
If this is a major bottleneck then perhaps you should be looking into ways of not having to clear the buffer in the first place, instead of messing with finicky cache-control features. For instance, there's always the old trick of switching between positive and negative Z coordinates on alternate frames.

Quote:
Original post by Moomin
Not that surprising, the streaming stores mean that data doesn't have to be read into the cache first before being stored into memory. Since this is the only test of the 3 that probably won't fit in the cache it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between each call, it might win in the smaller size tests.
Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.
