Fast memset


I have an array of floats (4 bytes each) and I want to set every entry to 1.f as fast as possible. I tried memset, but that only accepts a char-sized value. At the moment I have this:

// clear depth buffer
	{
		int count = (mWidth * mHeight) / 4;   // whole __m128 blocks
		__m128* depthBuf = (__m128*)mDepthBuffer;
		__m128 clearDepth = _mm_set_ps1(1.f);
		int i = 0;
		for (int c = 0; c < count; ++c, i += 4)
		{
			*depthBuf = clearDepth;
			++depthBuf;
		}
		for (; i < (mWidth * mHeight); ++i)   // remaining 0-3 floats
			mDepthBuffer[i] = 1.f;
	}

but it's too slow.

That requires your data to be in an STL container, I think, which it is not.

You could try streaming (assuming mDepthBuffer is 16-byte aligned):


float* depthBuf = mDepthBuffer;
const __m128 clearDepth = _mm_set_ps1(1.f);

const size_t size = mWidth * mHeight;
size_t i = 0;
for (; i + 4 <= size; i += 4)
	_mm_stream_ps(depthBuf + i, clearDepth);
for (; i < size; i++)
	mDepthBuffer[i] = 1.f;

Quote:
Original post by supagu
this requires your data to be in an stl container, i think, which it is not


1) The term "stl" has not been applicable for some time. This is the standard C++ library now.

2) Standard library algorithms are templated on iterators. Pointers are iterators for arrays. :)

Oh looky, an optimization problem. First, one needs a way to time things.

#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency

struct Timer {
	Timer() {
		reset();
	}

	double elapsed_ms() const {
		LARGE_INTEGER current_ticks;
		QueryPerformanceCounter(&current_ticks);

		LONGLONG delta = current_ticks.QuadPart - start_ticks;
		return (1.0e3 * delta * frequency);   // ticks -> milliseconds
	}

	void reset() {
		LARGE_INTEGER f;
		QueryPerformanceFrequency(&f);
		frequency = 1.0 / f.QuadPart;

		LARGE_INTEGER st;
		QueryPerformanceCounter(&st);
		start_ticks = st.QuadPart;
	}

	double frequency;      // seconds per tick
	long long start_ticks;
};



Use whatever works.

Next, we define a few different test cases:
void fillone_0(float * buf, size_t N) {
	for (size_t i = 0; i < N; i++) {
		buf[i] = 1.0f;
	}
}

void fillone_1(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/4; i++) {
		_mm_store_ps(&buf[4*i], value);
	}
}

void fillone_2(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/32; i++) {
		_mm_store_ps(&buf[32*i],    value);
		_mm_store_ps(&buf[32*i+4],  value);
		_mm_store_ps(&buf[32*i+8],  value);
		_mm_store_ps(&buf[32*i+12], value);
		_mm_store_ps(&buf[32*i+16], value);
		_mm_store_ps(&buf[32*i+20], value);
		_mm_store_ps(&buf[32*i+24], value);
		_mm_store_ps(&buf[32*i+28], value);
	}
}

void fillone_3(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/4; i++) {
		_mm_stream_ps(&buf[4*i], value);
	}
}

void fillone_4(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/32; i++) {
		_mm_stream_ps(&buf[32*i],    value);
		_mm_stream_ps(&buf[32*i+4],  value);
		_mm_stream_ps(&buf[32*i+8],  value);
		_mm_stream_ps(&buf[32*i+12], value);
		_mm_stream_ps(&buf[32*i+16], value);
		_mm_stream_ps(&buf[32*i+20], value);
		_mm_stream_ps(&buf[32*i+24], value);
		_mm_stream_ps(&buf[32*i+28], value);
	}
}




And the test harness:

#include <cstdio>
#include <xmmintrin.h>
#include <malloc.h>   // _aligned_malloc / _aligned_free

int main(int argc, char **argv)
{
	const size_t N = 16384*2;
	const size_t M = 100000;

	void * storage = _aligned_malloc(sizeof(float) * N, 16);
	if (storage == NULL) {
		printf("Failed to allocate memory.");
		return 1;
	}
	float * buf = (float*)storage;

	printf("Testing %u repetitions over %u floats\n", (unsigned)M, (unsigned)N);

	Timer timer;
	for (size_t x = 0; x < M; x++) fillone_0(buf, N);
	printf("Baseline: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (size_t x = 0; x < M; x++) fillone_1(buf, N);
	printf("_mm_store_ps: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (size_t x = 0; x < M; x++) fillone_2(buf, N);
	printf("_mm_store_ps unrolled: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (size_t x = 0; x < M; x++) fillone_3(buf, N);
	printf("_mm_stream_ps: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (size_t x = 0; x < M; x++) fillone_4(buf, N);
	printf("_mm_stream_ps unrolled: %4.4f\n", timer.elapsed_ms());

	// print a few values so the compiler can't optimize the fills away
	for (int i = 0; i < 20; i++) printf("%g ", buf[i]);
	printf("\n");

	_aligned_free(storage);
	return 0;
}

The results:





Quote:
Testing 100000 repetitions over 2048 floats
Baseline: 45.3244
_mm_store_ps: 45.4975
_mm_store_ps unrolled: 45.2105
_mm_stream_ps: 143.6576
_mm_stream_ps unrolled: 144.0868

It would seem that there aren't any gains to be had here. 8 KB, however, means we're completely inside the caches.
Quote:
Testing 10000 repetitions over 65536 floats
Baseline: 445.9909
_mm_store_ps: 388.0277
_mm_store_ps unrolled: 372.8233
_mm_stream_ps: 456.1334
_mm_stream_ps unrolled: 461.8451

Modest gain. stream_ps is basically the same as the naive for loop, store_ps takes the lead, and the unrolled version is slightly faster still.

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

Interesting surprise, stream_ps takes the lead by far.


So the conclusion would be, surprise surprise: profile, profile, profile, and when done, profile some more. It may also be necessary to play with various prefetches and other issues; my point is merely that it's far from trivial to give a generic answer as to what the fastest way is.

Quote:
Original post by Antheus

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

Interesting surprise, stream_ps takes the lead by far.


Not that surprising: streaming stores mean the data doesn't have to be read into the cache before being stored to memory. Since this is the only test of the three that probably doesn't fit in the cache, it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between calls, it might win at the smaller sizes too.
Also, it's always nice to have a quick check of the code-gen for intrinsic-heavy code, to make sure the compiler hasn't had a fit.

Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?

Nobody seems to have mentioned Duff's device either.

If this is a major bottleneck then perhaps you should be looking into ways of not having to clear the buffer in the first place, instead of messing with finicky cache-control features. For instance, there's always the old trick of switching between positive and negative Z coordinates on alternate frames.

Quote:
Original post by Moomin
Not that surprising, the streaming stores mean that data doesn't have to be read into the cache first before being stored into memory. Since this is the only test of the 3 that probably won't fit in the cache it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between each call, it might win in the smaller size tests.
Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.

Quote:
Nobody seems to have mentioned duff's device either.


Let's assume that in the interest of optimization our buffers will be aligned and properly sized.

Quote:
Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?


I use a semi-automated profiling library that does a few other things as well, so here are the results from that.

rdtsc is used for timing. Each test is run 50 times, and the results are treated as normally distributed. Results are displayed as cycles per array element, i.e. how many CPU cycles it takes to set one element.



std::fill is slow, but steady.
The naive fill is extremely poor for tiny arrays, but does really well overall.

If the data is expected to be in cache, then mm_store wins. As expected, its performance steps down with each cache level: the first step is at 16k elements (64k L1), the second at 131072 elements (512k L2).

For data known to be larger than the L2 cache, mm_stream is the clear winner. However, mm_stream is surprisingly slow in the 128 B to 8 KB range (32 to 2048 elements). It's also highly inconsistent there, with frequent 10-20% standard deviation in running time (I ran the test several times; it's not a profiling fluke).


And yes, this is a synthetic comparison of optimal performance. It does, however, give an adequate impression of relative performance. Some numbers of note: for data in L1 cache it takes about 0.5 CPU cycles to set a value; for data in RAM it takes either 1.5 or 4.5 cycles.


The data: (number of elements, cycles per element, standard deviation of the measurements as a %). When measurements start taking longer, context switches and other OS interference start affecting them; that is why I included the deviation as well.
baseline
2, 10.505, 0.022
4, 5.503, 0.009
8, 3.002, 0.012
16, 1.813, 0.008
32, 1.157, 0.004
64, 0.828, 0.006
128, 0.664, 0.000
256, 0.582, 0.008
512, 0.541, 0.000
1024, 0.521, 0.000
2048, 0.510, 0.000
4096, 0.505, 0.000
8192, 0.503, 0.000
16384, 0.526, 0.115
32768, 1.563, 0.022
65536, 1.561, 0.017
131072, 1.560, 0.048
262144, 4.431, 3.553
524288, 4.439, 3.082
1048576, 4.654, 1.408
2097152, 4.591, 2.344
4194304, 4.629, 1.708
8388608, 4.692, 1.209
_mm_store_ps
2, 2.000, 0.000
4, 1.084, 0.000
8, 0.875, 0.000
16, 0.750, 0.000
32, 0.594, 0.001
64, 0.844, 0.003
128, 0.664, 0.000
256, 0.582, 0.000
512, 0.543, 0.000
1024, 0.523, 0.000
2048, 0.511, 0.000
4096, 0.505, 0.000
8192, 0.503, 0.000
16384, 0.509, 0.013
32768, 1.359, 0.490
65536, 1.360, 0.577
131072, 1.358, 0.555
262144, 4.829, 4.703
524288, 4.358, 2.268
1048576, 4.456, 8.545
2097152, 4.457, 1.717
4194304, 4.361, 1.079
8388608, 4.300, 1.254
_mm_store_ps unrolled
2, 2.501, 0.000
4, 1.250, 0.000
8, 0.625, 0.000
16, 0.313, 0.000
32, 0.500, 0.000
64, 0.500, 0.000
128, 0.500, 0.000
256, 0.500, 0.000
512, 0.530, 0.000
1024, 0.515, 0.000
2048, 0.508, 0.000
4096, 0.504, 0.000
8192, 0.502, 0.000
16384, 0.512, 0.173
32768, 1.325, 0.044
65536, 1.323, 0.038
131072, 1.322, 0.033
262144, 4.096, 0.758
524288, 4.176, 0.456
1048576, 4.200, 1.881
2097152, 4.244, 1.149
4194304, 4.198, 1.334
8388608, 4.283, 1.216
_mm_stream_ps
2, 2.000, 0.000
4, 1.250, 0.000
8, 0.876, 0.087
16, 0.858, 0.279
32, 0.677, 0.958
64, 5.031, 8.625
128, 2.286, 11.217
256, 1.816, 0.286
512, 1.639, 0.306
1024, 2.967, 14.630
2048, 1.600, 0.515
4096, 1.575, 0.548
8192, 1.571, 0.877
16384, 1.579, 0.412
32768, 1.570, 0.302
65536, 1.565, 0.289
131072, 1.568, 0.243
262144, 1.569, 0.157
524288, 1.663, 9.518
1048576, 1.572, 0.122
2097152, 1.581, 2.247
4194304, 1.592, 2.377
8388608, 1.593, 2.594
_mm_stream_ps unrolled
2, 2.501, 0.000
4, 1.250, 0.000
8, 0.625, 0.000
16, 0.313, 0.000
32, 0.612, 0.484
64, 4.999, 8.865
128, 2.550, 10.969
256, 1.908, 2.987
512, 2.040, 8.448
1024, 2.216, 9.182
2048, 1.899, 21.804
4096, 1.575, 0.534
8192, 1.565, 0.784
16384, 1.579, 0.367
32768, 1.568, 0.416
65536, 2.155, 27.657
131072, 1.601, 5.425
262144, 1.619, 10.732
524288, 1.614, 6.433
1048576, 1.573, 0.648
2097152, 1.581, 1.913
4194304, 1.787, 23.310
8388608, 1.606, 3.141
std::fill
2, 4.001, 0.001
4, 3.501, 0.000
8, 3.252, 0.000
16, 3.938, 0.000
32, 3.469, 0.000
64, 3.235, 0.000
128, 3.118, 0.000
256, 3.059, 0.000
512, 3.030, 0.000
1024, 3.015, 0.000
2048, 3.008, 0.000
4096, 3.004, 0.000
8192, 3.002, 0.001
16384, 3.003, 0.000
32768, 3.005, 0.000
65536, 3.004, 0.018
131072, 3.004, 0.003
262144, 5.325, 3.692
524288, 4.973, 2.300
1048576, 5.033, 1.240
2097152, 5.124, 1.691
4194304, 5.122, 1.703
8388608, 5.040, 1.079

Thanks, that looks great, very informative!
Which compiler was this with?

implicit: Damn, I hadn't heard of that "old" trick. Excuse me while I try it out in my software renderer. [smile] I've been using std::fill for my Z-buffer.

Edit: Worked like a charm, particularly since I'm already using an inverse Z-buffer (zero means infinitely far away).
I also learned how to use member function pointers in the process, which I'd never had a reason to use before.

[Edited by - iMalc on April 19, 2009 3:44:40 AM]

Quote:
Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.

Several different things are going on here.
1) Yes, streaming store instructions avoid the initial RFO (read-for-ownership, which is more expensive than a mere read due to cache coherency traffic) of the cache line.
2) The Intel Compiler sometimes transforms assignments and even _mm_store intrinsics into streaming stores behind your back.

Doing this on the CPU side would be more complicated. You'd first have to detect sequential stores to the same cache line, requiring some new kind of buffer. The second problem is whether to allocate one of the few WC buffers immediately or wait until the second sequential store is detected, in which case you have to find the prior store buffer. Third, if the CPU guesses wrong and a cache line isn't actually filled, it has to flush WC buffers back to cache memory (the current mechanism involves partial bus writes, which is really inefficient).
This isn't likely to ever happen since the compiler/programmer can simply indicate that they want a WC buffer and intend to fill it, which is the definition of the streaming store instruction.

Antheus, I tried your timing code and added std::fill and std::fill_n. fill_n was significantly faster, almost exactly the same as the baseline for several different sizes. I think that std::fill has some bounds checking or something in VC++. Did you turn all such things off, if present in your compiler, or try fill_n in your later tests?
It seems strange that it should be so much slower than the baseline.

Quote:
Original post by Erik Rufelt
Antheus, I tried your timing code and added std::fill and std::fill_n, where fill_n was significantly faster, almost exactly the same as the baseline for several different sizes. I think that std::fill has some bounds checking or something in VC++.. did you turn all such things off, if it's present in your compiler, or try fill_n in your later tests?
It seems strange it should be so much slower than the baseline.


Ah, good old _SECURE_SCL, perhaps?

The graph for the baseline is highly suggestive of a constant overhead of about 20 cycles. A branch misprediction perhaps? :)
