Home » Community » Forums » General Programming » Fast memset
Intel sponsors gamedev.net search:
Control Panel Register Bookmarks Who's Online Active Topics Stats FAQ Search

Add Forum to Favorites |  Send Topic To a Friend | View Forum FAQ | Track this topic


 Last Thread Next Thread 
 Fast memset
Post New Topic  Post Reply 
i have an array of floats (4bytes) and i want to set this array to 1.f for each entry as fast as possible.

I tried memset, but that only accepts a char sized value to set

At the moment I have this:

// clear depth buffer
	{
		int count = (int)floorf(float(mWidth * mHeight) / 4.f);
		__m128* depthBuf = (__m128*)mDepthBuffer;
		__m128 clearDepth = _mm_set_ps1(1.f);
		int i = 0;
		for (int c = 0; c < count; ++c, i += 4)
		{
			*depthBuf = clearDepth;
			++depthBuf;
		}
		for (; i < (mWidth * mHeight); ++i)
			mDepthBuffer[i] = 1.f;
	}


but its 2 slow.

 User Rating: 1032   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Assuming a decent SC++L implementation and with optimisations turned up, you probably won't get faster than std::fill_n without sacrificing portability.


. David Gill :: Facebook :: twitter .

 User Rating: 1713   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

this requires your data to be in an stl container, i think, which it is not

 User Rating: 1032   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

No, std::fill_n() can be used on arrays.

 User Rating: 2071   |  Rate This User  Send Private MessageView ProfileView JournalView GD Showcase Entries Report this Post to a Moderator | Link

You could try streaming, (assuming mDepthBuffer is 16byte aligned)

__m128* depthBuf = (__m128*)mDepthBuffer;
const __m128 clearDepth = _mm_set_ps1(1.f);

size_t i = 0;
const size_t size = mWidth * mHeight;
for(size_t index = 0;i + 4 <= size; i+=4,index++)
    _mm_stream_ps(depthBuf[index],clearDepth);
for(;i < size; i++)
    mDepthBuffer[i] = 1.f;


 User Rating: 1154   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Original post by supagu
this requires your data to be in an stl container, i think, which it is not


1) The term "stl" has not been applicable for some time. This is the standard C++ library now.

2) Standard library algorithms are templated on iterators. Pointers are iterators for arrays. :)



As a general rule, if you post in For Beginners and your code contains the word 'char', you have a bug. std::string roxors teh big one one one one.
"OMG! I'm so happy! I have "1 Friends"!!!" -- coldacid
"Basically whenever you invoke the dread ellipses construct you leave the happy world of type safety." -- SiCrane
"I mean, if you had sex for every time O'Reilly used the word Patriotism you'd be almost as awesome as Chuck Norris." -- tthibault

<triforce101> uh im not a noob i finished the game

 User Rating: 1992   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Oh looky, an optimization problem. First one needs a way to time things.
struct Timer {
	Timer() {
		reset();
	}

	double elapsed_ms() const
	{
		LARGE_INTEGER current_ticks;
		QueryPerformanceCounter(&current_ticks);

		LONGLONG delta = current_ticks.QuadPart - start_ticks;
		return (1.0e3 * delta * frequency);
	}

	void reset() {
		LARGE_INTEGER f;
		QueryPerformanceFrequency(&f);
		frequency = 1.0 / f.QuadPart;
		LARGE_INTEGER st;
		QueryPerformanceCounter(&st);
		start_ticks = st.QuadPart;
	}
	double frequency;
	long long start_ticks;
};



Use whatever works.

Next, we define a few different test cases:
void fillone_0(float * buf, size_t N) {
	for (size_t i = 0; i < N; i++) {
		buf[i] = 1.0f;
	}
}

void fillone_1(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/4; i++) {
		_mm_store_ps(&(buf[4*i]), value);
	}
}


void fillone_2(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/32; i++) {
		_mm_store_ps(&(buf[32*i]), value);
		_mm_store_ps(&(buf[32*i+4]), value);
		_mm_store_ps(&(buf[32*i+8]), value);
		_mm_store_ps(&(buf[32*i+12]), value);
		_mm_store_ps(&(buf[32*i+16]), value);
		_mm_store_ps(&(buf[32*i+20]), value);
		_mm_store_ps(&(buf[32*i+24]), value);
		_mm_store_ps(&(buf[32*i+28]), value);
	}
}

void fillone_3(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/4; i++) {
		_mm_stream_ps(&(buf[4*i]), value);
	}
}

void fillone_4(float * buf, size_t N) {
	const __m128 value = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
	for (size_t i = 0; i < N/32; i++) {
		_mm_stream_ps(&(buf[32*i]), value);
		_mm_stream_ps(&(buf[32*i+4]), value);
		_mm_stream_ps(&(buf[32*i+8]), value);
		_mm_stream_ps(&(buf[32*i+12]), value);
		_mm_stream_ps(&(buf[32*i+16]), value);
		_mm_stream_ps(&(buf[32*i+20]), value);
		_mm_stream_ps(&(buf[32*i+24]), value);
		_mm_stream_ps(&(buf[32*i+28]), value);
	}
}




And the results from test run are:
int main(int argc, char **argv)
{
	const size_t N = 16384*2;
	const size_t M = 100000;

	void * storage = _aligned_malloc(sizeof(float) * N, 16);
	if (storage == NULL) {
		printf("Failed to allocated memory.");
		return 1;
	}
	float * buf = (float*)storage;

	printf("Testing %d repetitions over %d floats\n", M, N);

	Timer timer;
	for (int x = 0; x < M; x++) fillone_0(buf, N);
	printf("Baseline: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (int x = 0; x < M; x++) fillone_1(buf, N);
	printf("_mm_store_ps: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (int x = 0; x < M; x++) fillone_2(buf, N);
	printf("_mm_store_ps unrolled: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (int x = 0; x < M; x++) fillone_3(buf, N);
	printf("_mm_stream_ps: %4.4f\n", timer.elapsed_ms());

	timer.reset();
	for (int x = 0; x < M; x++) fillone_4(buf, N);
	printf("_mm_stream_ps unrolled: %4.4f\n", timer.elapsed_ms());

	for (int i = 0; i < 20; i++) printf("%g ", buf[i]);
	printf("\n");
}





Quote:
Testing 100000 repetitions over 2048 floats
Baseline: 45.3244
_mm_store_ps: 45.4975
_mm_store_ps unrolled: 45.2105
_mm_stream_ps: 143.6576
_mm_stream_ps unrolled: 144.0868

It would seem that there aren't any gains to be had here. 8k however means we're completely inside caches.
Quote:
Testing 10000 repetitions over 65536 floats
Baseline: 445.9909
_mm_store_ps: 388.0277
_mm_store_ps unrolled: 372.8233
_mm_stream_ps: 456.1334
_mm_stream_ps unrolled: 461.8451

Modest gain. stream_ps is basically the same as naive for loop, store_ps takes the lead, unrolled version is also slightly faster.

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

Interesting surprise, stream_ps takes the lead by far.


So the conclusion would be, surprise surprise: profile, profile, profile, and when done, profile some more. It may also be necessary to play with various prefetches and other issues, my point is merely that it's far from trivial to give a generic answer over what the fastest way is.

 User Rating: 1892   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Original post by Antheus

Quote:
Testing 100 repetitions over 4194304 floats
Baseline: 944.1375
_mm_store_ps: 857.2247
_mm_store_ps unrolled: 841.2786
_mm_stream_ps: 289.9918
_mm_stream_ps unrolled: 293.2771

Interesting surprise, stream_ps takes the lead by far.


Not that surprising, the streaming stores mean that data doesn't have to be read into the cache first before being stored into memory. Since this is the only test of the 3 that probably won't fit in the cache it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between each call, it might win in the smaller size tests.
Also its always nice to have a quick check of the code-gen for intrinsic code to make sure the compiler hasn't had a fit.


 User Rating: 1154   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?

Nobody seems to have mentioned duff's device either.

-------------------------
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

 User Rating: 1737   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

If this is a major bottleneck then perhaps you should be looking into ways of not having to clear the buffer in the first place, instead of messing with finicky cache-control features. For instance, there's always the old trick of switching between positive and negative Z coordinates on alternate frames.

Quote:
Original post by Moomin
Not that surprising, the streaming stores mean that data doesn't have to be read into the cache first before being stored into memory. Since this is the only test of the 3 that probably won't fit in the cache it wins. In a more realistic scenario, where the array being cleared might be displaced from the cache between each call, it might win in the smaller size tests.
Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.

 User Rating: 1255   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Nobody seems to have mentioned duff's device either.


Let's assume that in the interest of optimization our buffers will be aligned and properly sized.

Quote:
Those benchmarks don't include std::fill or std::fill_n. Perhaps you could modify it to include them?


I use a semi-automated profiling library which does a few other things as well, so here's the results from that.

rdtsc is used for timing. Each test is ran 50 times, results are treated to be have normal distribution. Results are displayed as Cycles per Array Element. Or, how many CPU cycles does it take to set one element.



std::fill is slow, but steady.
Naive fill seems to be extremely poor for tiny array sizes, but does really well overall.

If data is expected to be in cache, then mm_store wins. As expected, its performance is directly proportional with cache. First step at 16k elements (64k L1), second at 131072 elements (512k L2).

For data that is known to be larger than L2 cache, mm_stream is clear winner. However, mm_stream is surprisingly slow in the 128-8kb range (32-2048 elements). It's also highly inconsistent, with frequent 10-20% standard deviation in running time (I ran test several times, it's not a profiling fluke).


And yes, this is a synthetic comparison of optimal performance. It does however give an adequate impression of relative performance. Some numbers of notice are that for data in L1 cache, it takes about 0.5 CPU cycle to set the value, for data in RAM it takes either 1.5 or 4.5 cycles.


The data: (number of elements, time per element, standard deviation of the measurements as %). When measurements start taking longer, context switches and other OS interference starts affecting the measurements, this is why I included that as well.
baseline
       2,   10.505, 0.022
       4,    5.503, 0.009
       8,    3.002, 0.012
      16,    1.813, 0.008
      32,    1.157, 0.004
      64,    0.828, 0.006
     128,    0.664, 0.000
     256,    0.582, 0.008
     512,    0.541, 0.000
    1024,    0.521, 0.000
    2048,    0.510, 0.000
    4096,    0.505, 0.000
    8192,    0.503, 0.000
   16384,    0.526, 0.115
   32768,    1.563, 0.022
   65536,    1.561, 0.017
  131072,    1.560, 0.048
  262144,    4.431, 3.553
  524288,    4.439, 3.082
 1048576,    4.654, 1.408
 2097152,    4.591, 2.344
 4194304,    4.629, 1.708
 8388608,    4.692, 1.209
_mm_store_ps
       2,    2.000, 0.000
       4,    1.084, 0.000
       8,    0.875, 0.000
      16,    0.750, 0.000
      32,    0.594, 0.001
      64,    0.844, 0.003
     128,    0.664, 0.000
     256,    0.582, 0.000
     512,    0.543, 0.000
    1024,    0.523, 0.000
    2048,    0.511, 0.000
    4096,    0.505, 0.000
    8192,    0.503, 0.000
   16384,    0.509, 0.013
   32768,    1.359, 0.490
   65536,    1.360, 0.577
  131072,    1.358, 0.555
  262144,    4.829, 4.703
  524288,    4.358, 2.268
 1048576,    4.456, 8.545
 2097152,    4.457, 1.717
 4194304,    4.361, 1.079
 8388608,    4.300, 1.254
_mm_store_ps unrolled
       2,    2.501, 0.000
       4,    1.250, 0.000
       8,    0.625, 0.000
      16,    0.313, 0.000
      32,    0.500, 0.000
      64,    0.500, 0.000
     128,    0.500, 0.000
     256,    0.500, 0.000
     512,    0.530, 0.000
    1024,    0.515, 0.000
    2048,    0.508, 0.000
    4096,    0.504, 0.000
    8192,    0.502, 0.000
   16384,    0.512, 0.173
   32768,    1.325, 0.044
   65536,    1.323, 0.038
  131072,    1.322, 0.033
  262144,    4.096, 0.758
  524288,    4.176, 0.456
 1048576,    4.200, 1.881
 2097152,    4.244, 1.149
 4194304,    4.198, 1.334
 8388608,    4.283, 1.216
_mm_stream_ps
       2,    2.000, 0.000
       4,    1.250, 0.000
       8,    0.876, 0.087
      16,    0.858, 0.279
      32,    0.677, 0.958
      64,    5.031, 8.625
     128,    2.286, 11.217
     256,    1.816, 0.286
     512,    1.639, 0.306
    1024,    2.967, 14.630
    2048,    1.600, 0.515
    4096,    1.575, 0.548
    8192,    1.571, 0.877
   16384,    1.579, 0.412
   32768,    1.570, 0.302
   65536,    1.565, 0.289
  131072,    1.568, 0.243
  262144,    1.569, 0.157
  524288,    1.663, 9.518
 1048576,    1.572, 0.122
 2097152,    1.581, 2.247
 4194304,    1.592, 2.377
 8388608,    1.593, 2.594
_mm_stream_ps unrolled
       2,    2.501, 0.000
       4,    1.250, 0.000
       8,    0.625, 0.000
      16,    0.313, 0.000
      32,    0.612, 0.484
      64,    4.999, 8.865
     128,    2.550, 10.969
     256,    1.908, 2.987
     512,    2.040, 8.448
    1024,    2.216, 9.182
    2048,    1.899, 21.804
    4096,    1.575, 0.534
    8192,    1.565, 0.784
   16384,    1.579, 0.367
   32768,    1.568, 0.416
   65536,    2.155, 27.657
  131072,    1.601, 5.425
  262144,    1.619, 10.732
  524288,    1.614, 6.433
 1048576,    1.573, 0.648
 2097152,    1.581, 1.913
 4194304,    1.787, 23.310
 8388608,    1.606, 3.141
std::fill
       2,    4.001, 0.001
       4,    3.501, 0.000
       8,    3.252, 0.000
      16,    3.938, 0.000
      32,    3.469, 0.000
      64,    3.235, 0.000
     128,    3.118, 0.000
     256,    3.059, 0.000
     512,    3.030, 0.000
    1024,    3.015, 0.000
    2048,    3.008, 0.000
    4096,    3.004, 0.000
    8192,    3.002, 0.001
   16384,    3.003, 0.000
   32768,    3.005, 0.000
   65536,    3.004, 0.018
  131072,    3.004, 0.003
  262144,    5.325, 3.692
  524288,    4.973, 2.300
 1048576,    5.033, 1.240
 2097152,    5.124, 1.691
 4194304,    5.122, 1.703
 8388608,    5.040, 1.079


 User Rating: 1892   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Thanks, that looks great, very informative!
Which compiler was this with?

implicit: Damn I hadn't heard of that "old" trick. Excuse me while I try it out in my software renderer. I've been using std::fill for my Z-Buffer.

Edit: Worked like a charm, particularly since I'm already using an Inverse-Z-buffer (Zero means infinitely far away).
I also learned how to use member function pointers in the process, which is something I've never had a reason to try using.

[Edited by - iMalc on April 19, 2009 3:44:40 AM]

-------------------------
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

 User Rating: 1737   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Aren't "modern" CPUs supposed to detect streaming writes? I seem to recall reading somewhere in Intel's optimization guide that filling entire cache lines in strictly increasing order ought to avoid the unnecessary read.

Several different things are going on here.
1) Yes, streaming store instructions avoid the initial RFO (note: more expensive than a mere read due to cache coherency traffic) of the cache line.
2) The Intel Compiler sometimes transforms assignments and even _mm_store intrinsics into streaming stores behind your back.

Doing this on the CPU side would be more complicated. You'd first have to detect sequential stores to the same cache line, requiring some new kind of buffer. The second problem is whether to allocate one of the few WC buffers immediately or wait until the second sequential store is detected, in which case you have to find the prior store buffer. Third, if the CPU guesses wrong and a cache line isn't actually filled, it has to flush WC buffers back to cache memory (the current mechanism involves partial bus writes, which is really inefficient).
This isn't likely to ever happen since the compiler/programmer can simply indicate that they want a WC buffer and intend to fill it, which is the definition of the streaming store instruction.

 User Rating: 1573   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Antheus, I tried your timing code and added std::fill and std::fill_n, where fill_n was significantly faster, almost exactly the same as the baseline for several different sizes. I think that std::fill has some bounds checking or something in VC++.. did you turn all such things off, if it's present in your compiler, or try fill_n in your later tests?
It seems strange it should be so much slower than the baseline.

 User Rating: 1649   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Original post by Erik Rufelt
Antheus, I tried your timing code and added std::fill and std::fill_n, where fill_n was significantly faster, almost exactly the same as the baseline for several different sizes. I think that std::fill has some bounds checking or something in VC++.. did you turn all such things off, if it's present in your compiler, or try fill_n in your later tests?
It seems strange it should be so much slower than the baseline.


Ah, good old SECURE_SCL, perhaps?

 User Rating: 1769   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

The graph for the baseline is highly suggestive of a constant overhead of about 20 cycles. A branch misprediction perhaps? :)

 User Rating: 1992   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

All times are ET (US)

Post Reply
 Last Thread Next Thread 
Forum Rules:
You may not post new threads
You may post replies
You may not edit your posts
You may not use HTML in your posts
Jump To:
Administrative Options: