faster than memset?

Started by
8 comments, last by Ului 20 years ago
Anyone know of a routine that is faster than memset, or even better for that matter? ZeroMemory, as far as I know, uses the same thing. Not that I'd be calling this a million times, but it's always good to know you have the fastest thing out there.

Dun mes wit me!
Memset is pretty fast. Fast enough to use everywhere, and if you profile and determine that it's a problem somewhere, you can try to replace it with something else.
Never mind, I worked it out. I coded a faster one for what I need to do; as long as the size is divisible by 64 it's fine.




mov eax,dword ptr [nSize]; // byte count (must be a multiple of 64)
mov edx, dword ptr [buffer]; // pointer to beginning of array
myloop: // start of loop
sub eax,64; // decrease counter by 64 bytes

mov dword ptr [edx+eax],0;
mov dword ptr [edx+eax+4],0;
mov dword ptr [edx+eax+8],0;
mov dword ptr [edx+eax+12],0;
mov dword ptr [edx+eax+16],0;
mov dword ptr [edx+eax+20],0;
mov dword ptr [edx+eax+24],0;
mov dword ptr [edx+eax+28],0;
mov dword ptr [edx+eax+32],0;
mov dword ptr [edx+eax+36],0;
mov dword ptr [edx+eax+40],0;
mov dword ptr [edx+eax+44],0;
mov dword ptr [edx+eax+48],0;
mov dword ptr [edx+eax+52],0;
mov dword ptr [edx+eax+56],0;
mov dword ptr [edx+eax+60],0;

cmp eax,0; // loop until counter reaches zero
jne myloop;


Is there any way to optimize this further?

Dun mes wit me!
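For readers who don't follow the assembly, here is a hypothetical C sketch of the same idea: an unrolled fill that writes sixteen 32-bit zeros (64 bytes) per iteration. The function name is mine, not from the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Unrolled zero fill, as in the assembly above.
   `n` must be a multiple of 64 and `buf` 4-byte aligned. */
static void zero_by_64(void *buf, size_t n)
{
    uint32_t *p = (uint32_t *)buf;
    assert(n % 64 == 0);
    for (size_t i = 0; i < n / 4; i += 16) {
        /* sixteen 4-byte stores = 64 bytes per iteration */
        p[i + 0] = 0;  p[i + 1] = 0;  p[i + 2] = 0;  p[i + 3] = 0;
        p[i + 4] = 0;  p[i + 5] = 0;  p[i + 6] = 0;  p[i + 7] = 0;
        p[i + 8] = 0;  p[i + 9] = 0;  p[i + 10] = 0; p[i + 11] = 0;
        p[i + 12] = 0; p[i + 13] = 0; p[i + 14] = 0; p[i + 15] = 0;
    }
}
```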
Yes. First, you must decide whether you want to load the area you are filling into the memory caches. Typically, when you are filling memory you do NOT also want to fill the caches, but what you are doing there most certainly fills the caches.

For when you desire/need cache pollution and the destination is 32-bit aligned and 512 bytes or less:

sub eax, eax           ; eax = 0 (the value to store)
mov edi, StartAddress  ; edi = destination pointer
mov ecx, BytesToFill/4 ; ecx = number of dwords to store
rep stosd              ; store eax at [edi], ecx times, advancing edi


In all other cases, use MMX instructions. The AMD optimisation guide (available online) has a nice memory-fill routine for MMX that is also near-optimal on Intel machines.

- Rockoon
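The cache-bypassing fills in the AMD guide use non-temporal stores. A rough sketch of the same idea using SSE2 intrinsics (this is my own illustration, not the guide's code; the function name is made up):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Zero `n` bytes with non-temporal (streaming) stores, which write
   to memory without pulling the destination into the data cache.
   `dst` must be 16-byte aligned and `n` a multiple of 64. */
static void stream_zero(void *dst, size_t n)
{
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    assert(((uintptr_t)dst % 16) == 0 && n % 64 == 0);
    for (size_t i = 0; i < n / 16; i += 4) {
        _mm_stream_si128(p + i,     zero);
        _mm_stream_si128(p + i + 1, zero);
        _mm_stream_si128(p + i + 2, zero);
        _mm_stream_si128(p + i + 3, zero);
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```

Streaming stores only pay off for fills much larger than the cache; for small buffers, ordinary stores (or plain memset) are usually better.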
Sweet. So could you say what each command does in what you said? I haven't learnt all of ASM yet.
I can't find the Microsoft MASM documentation.


Dun mes wit me!
I tried compiling code that calls memset with g++ on Intel.

If you call memset() without optimisation enabled, it calls memset from the C library.

If you enable optimisation, it plants code similar to Anonymous Poster's - that's to say, it uses rep stosl.

So just call memset, if you enable optimisation it should be as fast as any. I fail to see how it can be any faster than a rep stosl instruction.

Mark
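To make the advice concrete, a minimal sketch of simply deferring to memset (the wrapper name `clear_buffer` is invented for illustration):

```c
#include <string.h>

/* Just call memset and let the compiler decide. With optimisation
   enabled, gcc/g++ typically inline small fixed-size calls as
   rep stos or wide stores; without it, the libc routine is called. */
static void clear_buffer(void *buf, size_t n)
{
    memset(buf, 0, n);
}
```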
Thanks. As far as cache pollution goes, what is that?
How does it work? Was I filling the caches as well as memory, and which part?
This way I can learn and avoid possible disaster.
Also, is cache pollution a bad thing?

Dun mes wit me!
When we first start thinking about it, unrolling loops so we don't have the extra increment, compare, and jump every iteration sounds like it'd be faster (fewer ops have to be done), but in most cases this doesn't work.

The CPU has two caches, a data cache and a program cache. As our program is run, the CPU loads chunks of it into the program cache and then runs the instructions from there. We get a speed increase from this because reading from cache is much faster than reading from RAM.

When we unroll a loop we may lower the number of ops the CPU has to perform, but we also greatly increase the size of the program code. With increased program code we can end up with a much higher number of cache misses, which is when the code needed isn't in cache.

Every time there is a cache miss the CPU has to flush the cache and refill it from RAM, which we know is quite slow, and while it's waiting for the cache to be refilled the CPU has no choice but to sit and do nothing. To make matters worse, the increase in program size means there is more code that must be loaded into cache every time we have a cache miss.

We know that we are going to have cache misses. A couple dozen cache misses per frame isn't a big deal, but it isn't hard to imagine how bad it would be if we caused a cache miss for every pixel we plot, every polygon we render, or anything else we do thousands of times per frame. Typically games are made of a couple of small pieces of code that are run thousands of times in a row each frame, and with a little care we can get them to fit in the cache, giving us some great performance.

Now back to unrolling loops... The only time we tend to gain anything from unrolling a loop is when we have a small loop (only a few lines of code and few iterations) that gets run a large number of times (such as a loop that does some simple op for each vertex in a tri and gets run for every tri in the scene). Large loops, or loops with a large number of iterations (such as copying memory byte by byte), almost always cause more penalties through cache misses and code size than they could ever hope to gain by unrolling.


And a bit off topic... Optimizing by unrolling loops is one of the last things you want to do. If you find you are spending a lot of time in a particular loop, switching to a better algorithm, cleaning up the code inside the loop, and minimizing the number of times the loop is called will have a much greater effect.


Drakonite

Shoot Pixels Not People
Maybe a better optimization would be to not call memset so much. The quickest operation is the one you don't perform.
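One common way to "not perform" the clear at all is lazy clearing with a frame stamp: instead of zeroing a large buffer every frame, tag each cell with the frame it was last written in and treat stale cells as empty. A hypothetical sketch (all names invented here):

```c
#include <stddef.h>
#include <stdint.h>

#define CELLS 1024

static uint32_t value[CELLS];   /* the payload */
static uint32_t stamp[CELLS];   /* frame each cell was last written */
static uint32_t frame = 1;      /* current frame; starts above the 0 stamps */

/* "Clears" the whole buffer in O(1): every cell's stamp is now stale. */
static void begin_frame(void) { frame++; }

static void cell_set(size_t i, uint32_t v)
{
    value[i] = v;
    stamp[i] = frame;
}

/* A cell not written this frame reads as zero. */
static uint32_t cell_get(size_t i)
{
    return stamp[i] == frame ? value[i] : 0;
}
```

The trade-off is one extra stamp per cell and a compare per read, in exchange for never touching the whole buffer at once.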
Very good, cheers! Clap clap clap.

!o)

This topic is closed to new replies.
