# Fast memset


## Recommended Posts

I have an array of floats (4 bytes each) and I want to set every entry to 1.f as fast as possible. I tried memset, but that only accepts a char-sized fill value. At the moment I have this:
```cpp
// clear depth buffer
{
	int count = (int)floorf(float(mWidth * mHeight) / 4.f);
	__m128* depthBuf = (__m128*)mDepthBuffer;
	__m128 clearDepth = _mm_set_ps1(1.f);
	int i = 0;
	for (int c = 0; c < count; ++c, i += 4)
	{
		*depthBuf = clearDepth;
		++depthBuf;
	}
	for (; i < (mWidth * mHeight); ++i)
		mDepthBuffer[i] = 1.f;
}
```

but it's too slow.

##### Share on other sites
Assuming a decent SC++L implementation and with optimisations turned up, you probably won't get faster than std::fill_n without sacrificing portability.

##### Share on other sites
This requires your data to be in an STL container, I think, which it is not.

##### Share on other sites
No, std::fill_n() can be used on arrays.

##### Share on other sites
You could try streaming (assuming mDepthBuffer is 16-byte aligned):

```cpp
__m128* depthBuf = (__m128*)mDepthBuffer;
const __m128 clearDepth = _mm_set_ps1(1.f);
size_t i = 0;
const size_t size = mWidth * mHeight;
for (size_t index = 0; i + 4 <= size; i += 4, index++)
	_mm_stream_ps((float*)&depthBuf[index], clearDepth);
for (; i < size; i++)
	mDepthBuffer[i] = 1.f;
```

##### Share on other sites
Quote:
 Original post by supagu
 this requires your data to be in an stl container, i think, which it is not

1) The term "stl" has not been applicable for some time. This is the standard C++ library now.

2) Standard library algorithms are templated on iterators. Pointers are iterators for arrays. :)
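A minimal sketch of what that looks like on a raw float array (the `clear_depth` wrapper is just an illustrative name, not code from this thread):

```cpp
#include <algorithm>  // std::fill_n
#include <cstddef>

// std::fill_n takes any output iterator, and a raw pointer qualifies,
// so no container is needed.
inline void clear_depth(float* buf, std::size_t count) {
    std::fill_n(buf, count, 1.0f);
}
```

A decent compiler recognizes this pattern and emits at least the same code it would for the hand-written loop.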

##### Share on other sites
Oh looky, an optimization problem. First one needs a way to time things.
```cpp
struct Timer {
	Timer() {
		reset();
	}
	double elapsed_ms() const {
		LARGE_INTEGER current_ticks;
		QueryPerformanceCounter(&current_ticks);
		LONGLONG delta = current_ticks.QuadPart - start_ticks;
		return (1.0e3 * delta * frequency);
	}
	void reset() {
		LARGE_INTEGER f;
		QueryPerformanceFrequency(&f);
		frequency = 1.0 / f.QuadPart;
		LARGE_INTEGER st;
		QueryPerformanceCounter(&st);
		start_ticks = st.QuadPart;
	}
	double frequency;
	long long start_ticks;
};
```
Use whatever works.

Next, we define a few different test cases:
```cpp
void fillone_0(float * buf, size_t N) {
	for (size_t i = 0; i < N; i++) {
		buf[i] = 1.0f;
	}
}
void fillone_1(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/4; i++) {
		_mm_store_ps(&(buf[4*i]), value);
	}
}
void fillone_2(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/32; i++) {
		_mm_store_ps(&(buf[32*i]), value);
		_mm_store_ps(&(buf[32*i+4]), value);
		_mm_store_ps(&(buf[32*i+8]), value);
		_mm_store_ps(&(buf[32*i+12]), value);
		_mm_store_ps(&(buf[32*i+16]), value);
		_mm_store_ps(&(buf[32*i+20]), value);
		_mm_store_ps(&(buf[32*i+24]), value);
		_mm_store_ps(&(buf[32*i+28]), value);
	}
}
void fillone_3(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/4; i++) {
		_mm_stream_ps(&(buf[4*i]), value);
	}
}
void fillone_4(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/32; i++) {
		_mm_stream_ps(&(buf[32*i]), value);
		_mm_stream_ps(&(buf[32*i+4]), value);
		_mm_stream_ps(&(buf[32*i+8]), value);
		_mm_stream_ps(&(buf[32*i+12]), value);
		_mm_stream_ps(&(buf[32*i+16]), value);
		_mm_stream_ps(&(buf[32*i+20]), value);
		_mm_stream_ps(&(buf[32*i+24]), value);
		_mm_stream_ps(&(buf[32*i+28]), value);
	}
}
```

And the results from a test run are:

```cpp
int main(int argc, char **argv)
{
	const size_t N = 16384*2;
	const size_t M = 100000;
	void * storage = _aligned_malloc(sizeof(float) * N, 16);
	if (storage == NULL) {
		printf("Failed to allocate memory.");
		return 1;
	}
	float * buf = (float*)storage;
	printf("Testing %u repetitions over %u floats\n", (unsigned)M, (unsigned)N);
	Timer timer;
	for (int x = 0; x < M; x++) fillone_0(buf, N);
	printf("Baseline: %4.4f\n", timer.elapsed_ms());
	timer.reset();
	for (int x = 0; x < M; x++) fillone_1(buf, N);
	printf("_mm_store_ps: %4.4f\n", timer.elapsed_ms());
	timer.reset();
	for (int x = 0; x < M; x++) fillone_2(buf, N);
	printf("_mm_store_ps unrolled: %4.4f\n", timer.elapsed_ms());
	timer.reset();
	for (int x = 0; x < M; x++) fillone_3(buf, N);
	printf("_mm_stream_ps: %4.4f\n", timer.elapsed_ms());
	timer.reset();
	for (int x = 0; x < M; x++) fillone_4(buf, N);
	printf("_mm_stream_ps unrolled: %4.4f\n", timer.elapsed_ms());
	for (int i = 0; i < 20; i++) printf("%g ", buf[i]);
	printf("\n");
}
```

Quote:
 Testing 100000 repetitions over 2048 floats
 Baseline: 45.3244
 _mm_store_ps: 45.4975
 _mm_store_ps unrolled: 45.2105
 _mm_stream_ps: 143.6576
 _mm_stream_ps unrolled: 144.0868

It would seem that there aren't any gains to be had here. 8 KB, however, means we're completely inside the caches.
Quote:
 Testing 10000 repetitions over 65536 floats
 Baseline: 445.9909
 _mm_store_ps: 388.0277
 _mm_store_ps unrolled: 372.8233
 _mm_stream_ps: 456.1334
 _mm_stream_ps unrolled: 461.8451

A modest gain. stream_ps is basically the same as the naive for loop, store_ps takes the lead, and the unrolled version is slightly faster still.

Quote:
 Testing 100 repetitions over 4194304 floats
 Baseline: 944.1375
 _mm_store_ps: 857.2247
 _mm_store_ps unrolled: 841.2786
 _mm_stream_ps: 289.9918
 _mm_stream_ps unrolled: 293.2771

An interesting surprise: stream_ps takes the lead by far.

So the conclusion would be, surprise surprise: profile, profile, profile, and when done, profile some more. It may also be necessary to play with various prefetches and other issues; my point is merely that it's far from trivial to give a generic answer as to what the fastest way is.

##### Share on other sites
Quote:
Original post by Antheus

Quote:
 Testing 100 repetitions over 4194304 floats
 Baseline: 944.1375
 _mm_store_ps: 857.2247
 _mm_store_ps unrolled: 841.2786
 _mm_stream_ps: 289.9918
 _mm_stream_ps unrolled: 293.2771

Interesting surprise, stream_ps takes the lead by far.

Not that surprising: streaming stores mean that the data doesn't have to be read into the cache before being stored to memory. Since this is the only test of the three that probably won't fit in the cache, it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between calls, it might win in the smaller-size tests too.
Also, it's always nice to have a quick check of the code-gen for intrinsic code, to make sure the compiler hasn't had a fit.

##### Share on other sites
Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?

Nobody seems to have mentioned Duff's device either.
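For reference, a rough sketch of what Duff's device would look like for this fill (`fill_duff` is an illustrative name, not code from any post above): the switch jumps into the middle of an unrolled do/while, so the N % 8 leftover elements need no separate cleanup loop.

```cpp
#include <cstddef>

// Hypothetical Duff's device for the float fill. The case labels sit
// inside the do/while body; the switch enters the loop partway through
// to handle the remainder, then the loop runs full 8-element passes.
void fill_duff(float* p, std::size_t n) {
    if (n == 0) return;
    std::size_t iters = (n + 7) / 8;  // total passes through the body
    switch (n % 8) {
    case 0: do { *p++ = 1.0f;
    case 7:      *p++ = 1.0f;
    case 6:      *p++ = 1.0f;
    case 5:      *p++ = 1.0f;
    case 4:      *p++ = 1.0f;
    case 3:      *p++ = 1.0f;
    case 2:      *p++ = 1.0f;
    case 1:      *p++ = 1.0f;
            } while (--iters > 0);
    }
}
```

Compilers are generally good at unrolling a plain loop themselves, so the main thing this buys over the SSE versions above is handling an arbitrary, unaligned tail.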

##### Share on other sites
If this is a major bottleneck then perhaps you should be looking into ways of not having to clear the buffer in the first place, instead of messing with finicky cache-control features. For instance, there's always the old trick of switching between positive and negative Z coordinates on alternate frames.
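A rough sketch of that trick (all names illustrative, and assuming an inverse depth buffer where larger values are closer, as mentioned later in the thread): even frames store depths positive and test with "greater", odd frames store them negated and test with "less", so leftovers from the previous frame can never win and the buffer never needs clearing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical alternate-sign depth buffer: w in (0, 1], larger = closer,
// 0 = infinitely far. Instead of refilling the buffer each frame, the sign
// convention flips, so stale values automatically fail the depth test.
class FlipDepthBuffer {
public:
    explicit FlipDepthBuffer(std::size_t n) : z_(n, 0.0f) {}

    // O(1) per-frame "clear": just flip the sign convention.
    void new_frame() { positive_ = !positive_; }

    // Returns true (and records w) if the fragment at i is the closest so far.
    bool test_and_set(std::size_t i, float w) {
        if (positive_) {
            // Stale entries are <= 0, so the first valid w always wins.
            if (w > z_[i]) { z_[i] = w; return true; }
        } else {
            // Stale entries are >= 0, and -w is always below them.
            if (-w < z_[i]) { z_[i] = -w; return true; }
        }
        return false;
    }

private:
    std::vector<float> z_;
    bool positive_ = false;  // first new_frame() starts a positive frame
};
```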

Quote:
 Original post by Moomin
 Not that surprising, the streaming stores mean that data doesn't have to be read into the cache first before being stored into memory. Since this is the only test of the 3 that probably won't fit in the cache it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between each call, it might win in the smaller size tests.
Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.

##### Share on other sites
Quote:
 Nobody seems to have mentioned duff's device either.

Let's assume that in the interest of optimization our buffers will be aligned and properly sized.

Quote:
 Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?

I use a semi-automated profiling library which does a few other things as well, so here are the results from that.

rdtsc is used for timing. Each test is run 50 times, and the results are assumed to be normally distributed. Results are displayed as cycles per array element: how many CPU cycles it takes to set one element.
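A sketch of that kind of measurement (not the actual profiling library; `cycles_per_element` is an illustrative name), assuming an x86 TSC:

```cpp
#include <cstddef>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc on GCC/Clang; MSVC uses <intrin.h>

// Time one fill with the time-stamp counter and divide by the element
// count to get cycles per element, as described above.
template <typename Fill>
double cycles_per_element(Fill fill, float* buf, std::size_t n) {
    const std::uint64_t start = __rdtsc();
    fill(buf, n);
    const std::uint64_t end = __rdtsc();
    return static_cast<double>(end - start) / static_cast<double>(n);
}
```

A real harness would repeat the measurement many times and aggregate, as the 50-run methodology above does.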

Naive fill seems to be extremely poor for tiny array sizes, but does really well overall.

If the data is expected to be in cache, then mm_store wins. As expected, its performance steps in line with the cache hierarchy: the first step is at 16384 elements (64 KB L1), the second at 131072 elements (512 KB L2).

For data that is known to be larger than the L2 cache, mm_stream is the clear winner. However, mm_stream is surprisingly slow in the 128 B to 8 KB range (32 to 2048 elements). It's also highly inconsistent there, with a frequent 10-20% standard deviation in running time (I ran the test several times; it's not a profiling fluke).

And yes, this is a synthetic comparison of optimal performance. It does, however, give an adequate impression of relative performance. Some numbers of note: for data in the L1 cache it takes about 0.5 CPU cycles to set one value; for data in RAM it takes either 1.5 or 4.5 cycles.

The data: (number of elements, time per element in cycles, standard deviation of the measurements as %). Once measurements start taking longer, context switches and other OS interference start affecting them, which is why I included the deviation as well.
Time per element in cycles, standard deviation (%) in parentheses:

| Elements | baseline | _mm_store_ps | store unrolled | _mm_stream_ps | stream unrolled | std::fill |
|---------:|---------:|-------------:|---------------:|--------------:|----------------:|----------:|
| 2 | 10.505 (0.022) | 2.000 (0.000) | 2.501 (0.000) | 2.000 (0.000) | 2.501 (0.000) | 4.001 (0.001) |
| 4 | 5.503 (0.009) | 1.084 (0.000) | 1.250 (0.000) | 1.250 (0.000) | 1.250 (0.000) | 3.501 (0.000) |
| 8 | 3.002 (0.012) | 0.875 (0.000) | 0.625 (0.000) | 0.876 (0.087) | 0.625 (0.000) | 3.252 (0.000) |
| 16 | 1.813 (0.008) | 0.750 (0.000) | 0.313 (0.000) | 0.858 (0.279) | 0.313 (0.000) | 3.938 (0.000) |
| 32 | 1.157 (0.004) | 0.594 (0.001) | 0.500 (0.000) | 0.677 (0.958) | 0.612 (0.484) | 3.469 (0.000) |
| 64 | 0.828 (0.006) | 0.844 (0.003) | 0.500 (0.000) | 5.031 (8.625) | 4.999 (8.865) | 3.235 (0.000) |
| 128 | 0.664 (0.000) | 0.664 (0.000) | 0.500 (0.000) | 2.286 (11.217) | 2.550 (10.969) | 3.118 (0.000) |
| 256 | 0.582 (0.008) | 0.582 (0.000) | 0.500 (0.000) | 1.816 (0.286) | 1.908 (2.987) | 3.059 (0.000) |
| 512 | 0.541 (0.000) | 0.543 (0.000) | 0.530 (0.000) | 1.639 (0.306) | 2.040 (8.448) | 3.030 (0.000) |
| 1024 | 0.521 (0.000) | 0.523 (0.000) | 0.515 (0.000) | 2.967 (14.630) | 2.216 (9.182) | 3.015 (0.000) |
| 2048 | 0.510 (0.000) | 0.511 (0.000) | 0.508 (0.000) | 1.600 (0.515) | 1.899 (21.804) | 3.008 (0.000) |
| 4096 | 0.505 (0.000) | 0.505 (0.000) | 0.504 (0.000) | 1.575 (0.548) | 1.575 (0.534) | 3.004 (0.000) |
| 8192 | 0.503 (0.000) | 0.503 (0.000) | 0.502 (0.000) | 1.571 (0.877) | 1.565 (0.784) | 3.002 (0.001) |
| 16384 | 0.526 (0.115) | 0.509 (0.013) | 0.512 (0.173) | 1.579 (0.412) | 1.579 (0.367) | 3.003 (0.000) |
| 32768 | 1.563 (0.022) | 1.359 (0.490) | 1.325 (0.044) | 1.570 (0.302) | 1.568 (0.416) | 3.005 (0.000) |
| 65536 | 1.561 (0.017) | 1.360 (0.577) | 1.323 (0.038) | 1.565 (0.289) | 2.155 (27.657) | 3.004 (0.018) |
| 131072 | 1.560 (0.048) | 1.358 (0.555) | 1.322 (0.033) | 1.568 (0.243) | 1.601 (5.425) | 3.004 (0.003) |
| 262144 | 4.431 (3.553) | 4.829 (4.703) | 4.096 (0.758) | 1.569 (0.157) | 1.619 (10.732) | 5.325 (3.692) |
| 524288 | 4.439 (3.082) | 4.358 (2.268) | 4.176 (0.456) | 1.663 (9.518) | 1.614 (6.433) | 4.973 (2.300) |
| 1048576 | 4.654 (1.408) | 4.456 (8.545) | 4.200 (1.881) | 1.572 (0.122) | 1.573 (0.648) | 5.033 (1.240) |
| 2097152 | 4.591 (2.344) | 4.457 (1.717) | 4.244 (1.149) | 1.581 (2.247) | 1.581 (1.913) | 5.124 (1.691) |
| 4194304 | 4.629 (1.708) | 4.361 (1.079) | 4.198 (1.334) | 1.592 (2.377) | 1.787 (23.310) | 5.122 (1.703) |
| 8388608 | 4.692 (1.209) | 4.300 (1.254) | 4.283 (1.216) | 1.593 (2.594) | 1.606 (3.141) | 5.040 (1.079) |

##### Share on other sites
Thanks, that looks great, very informative!

implicit: Damn, I hadn't heard of that "old" trick. Excuse me while I try it out in my software renderer. [smile] I've been using std::fill for my Z-buffer.

Edit: Worked like a charm, particularly since I'm already using an Inverse-Z-buffer (Zero means infinitely far away).
I also learned how to use member function pointers in the process, which is something I've never had a reason to try using.

[Edited by - iMalc on April 19, 2009 3:44:40 AM]

##### Share on other sites
Quote:
 Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.

Several different things are going on here.
1) Yes, streaming store instructions avoid the initial RFO (read-for-ownership, which is more expensive than a mere read due to cache-coherency traffic) of the cache line.
2) The Intel Compiler sometimes transforms assignments and even _mm_store intrinsics into streaming stores behind your back.

Doing this on the CPU side would be more complicated. You'd first have to detect sequential stores to the same cache line, which requires some new kind of buffer. The second problem is whether to allocate one of the few WC (write-combining) buffers immediately or wait until a second sequential store is detected, in which case you have to find the prior store's buffer. Third, if the CPU guesses wrong and a cache line isn't actually filled, it has to flush the WC buffers back to cache/memory (the current mechanism involves partial bus writes, which is really inefficient).
This isn't likely to ever happen since the compiler/programmer can simply indicate that they want a WC buffer and intend to fill it, which is the definition of the streaming store instruction.
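A minimal sketch of that "indicate intent" pattern (`fill_stream` is an illustrative name; this assumes a 16-byte-aligned buffer whose length is a multiple of 4):

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_set_ps1, _mm_stream_ps, _mm_sfence

// Fill via non-temporal stores: each _mm_stream_ps goes through a WC
// buffer and bypasses the cache entirely, avoiding the RFO traffic
// discussed above. Requires buf 16-byte aligned, n a multiple of 4.
void fill_stream(float* buf, std::size_t n) {
    const __m128 one = _mm_set_ps1(1.0f);
    for (std::size_t i = 0; i < n; i += 4)
        _mm_stream_ps(buf + i, one);
    _mm_sfence();  // ordering point: make the streamed stores globally visible
}
```

Note the _mm_sfence at the end; weakly-ordered streaming stores should be fenced before other code relies on the data.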

##### Share on other sites
Antheus, I tried your timing code and added std::fill and std::fill_n; fill_n was significantly faster, almost exactly the same as the baseline for several different sizes. I think std::fill has some bounds checking or something in VC++. Did you turn all such things off, if your compiler has them, or try fill_n in your later tests?
It seems strange that it should be so much slower than the baseline.

##### Share on other sites
Quote:
 Original post by Erik Rufelt
 Antheus, I tried your timing code and added std::fill and std::fill_n, where fill_n was significantly faster, almost exactly the same as the baseline for several different sizes. I think that std::fill has some bounds checking or something in VC++.. did you turn all such things off, if it's present in your compiler, or try fill_n in your later tests?
 It seems strange it should be so much slower than the baseline.

Ah, good old SECURE_SCL, perhaps?
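If that is the culprit, the usual workaround on VC++ 2005/2008 is to define the macro to 0 before any standard header is included (or project-wide); a sketch:

```cpp
// VC++ 2005/2008 checked iterators: _SECURE_SCL defaults to 1 even in
// release builds, adding per-element verification to algorithms such as
// std::fill. Defining it to 0 before any standard header disables this.
#define _SECURE_SCL 0
#include <algorithm>
```

On other compilers the macro is simply ignored, so the snippet is harmless but also has no effect.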

##### Share on other sites
The graph for the baseline is highly suggestive of a constant overhead of about 20 cycles. A branch misprediction perhaps? :)